A simple, concurrent webcrawler written in Go
I have been playing with the Go programming language on and off for about a year. I started learning Go because I was running into lots of issues distributing my Python code's dependencies to production machines with specific security constraints. Go solves this problem by letting you compile a single binary that can easily be copied to all of your systems, and you can cross compile it easily for different platforms. In addition, Go has a really great and simple way of dealing with concurrency. Not to mention it offers true parallelism, unlike my beloved Python with its GIL, which is fine for plenty of use cases, but sometimes you need real concurrency. Here is some code I wrote for a simple concurrent webcrawler.
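For example, cross compiling is just a matter of setting the GOOS and GOARCH environment variables when you build (shown here for a couple of common targets; adjust the values to your platforms):

GOOS=linux GOARCH=amd64 go build    # Linux binary, built from any host
GOOS=darwin GOARCH=amd64 go build   # macOS binary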
Here is the code for the command-line utility, fetcher. Notice that it imports another package, crawler.
package main

import (
	"flag"
	"fmt"
	"strings"
	"time"

	"github.com/jasonriedel/fetcher/crawler"
)

var (
	sites = flag.String("url", "", "Name of sites to crawl, comma delimited")
)

func main() {
	flag.Parse()
	start := time.Now()
	count := 0
	if *sites != "" {
		ch := make(chan string)
		if strings.Contains(*sites, ",") {
			u := strings.Split(*sites, ",")
			for _, cu := range u {
				count++
				go crawler.Crawl(cu, ch) // start goroutine
			}
			for range u {
				fmt.Println(<-ch) // gather results
			}
		} else {
			count++
			go crawler.Crawl(*sites, ch) // start goroutine
			fmt.Println(<-ch)
		}
	} else {
		fmt.Println("Please specify urls")
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("Total time: %.2fs - %d site(s)\n", secs, count)
}
I am not going to go over the basics in this code, because they should be fairly self-explanatory. What is important here is how we implement concurrency. Once the program validates that you passed in a string (which is hopefully a URL; there is no input validation yet!), it starts by creating a channel via
ch := make(chan string)
After initializing the channel, we split the sites passed in via the command line -url flag on commas, in case there is more than one site to crawl. Then we loop through each site and kick off a goroutine like so:
go crawler.Crawl(cu, ch) // start goroutine
At this point, our goroutine is executing code from the imported crawler package mentioned above, calling the Crawl function. Let's take a look at it now.
package crawler

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

func Crawl(url string, ch chan<- string) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		ch <- fmt.Sprint(err) // send error to channel ch
		return
	}
	nbytes, err := io.Copy(ioutil.Discard, resp.Body)
	resp.Body.Close() // don't leak resources
	if err != nil {
		ch <- fmt.Sprintf("While reading %s: %v", url, err)
		return
	}
	secs := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %7d %s", secs, nbytes, url)
}
This is pretty straightforward. We start a timer, take the passed-in URL, and issue an http.Get. If that does not error, the response body is copied to ioutil.Discard, and the number of bytes read is captured in nbytes, which is ultimately sent to the channel at the bottom of the function.
Once crawler.Crawl returns its result to fetcher, fetcher loops through the URLs, printing what comes back on the channel. This is very important: if you place the channel receive inside the same loop that launches the goroutines, you change the behavior of your application to run in a serial, slower fashion, because it will block waiting for each goroutine's output before launching the next one. Putting the loop for channel output outside the loop that launches the goroutines lets them all be launched one right after another, with the output gathered after they have all been launched. This yields a highly performant result.
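To make the difference concrete, here is a minimal, self-contained sketch; the fetch function is a hypothetical stand-in for crawler.Crawl that simply sleeps to simulate network time:

package main

import (
	"fmt"
	"time"
)

// fetch is a hypothetical stand-in for crawler.Crawl: it sleeps for a
// second to simulate network time, then reports on the channel.
func fetch(url string, ch chan<- string) {
	time.Sleep(1 * time.Second)
	ch <- "done: " + url
}

func main() {
	urls := []string{"http://a.example", "http://b.example", "http://c.example"}
	ch := make(chan string)

	// Serial by accident (~3s total): receiving inside the launch loop
	// blocks until each goroutine finishes before the next is started.
	//
	// for _, u := range urls {
	// 	go fetch(u, ch)
	// 	fmt.Println(<-ch)
	// }

	// Concurrent (~1s total): launch every goroutine first, then gather.
	for _, u := range urls {
		go fetch(u, ch)
	}
	for range urls {
		fmt.Println(<-ch)
	}
}

With three simulated one-second fetches, the commented-out serial arrangement takes about three seconds, while the concurrent arrangement takes about one. Here is an example of the real program once it has been compiled.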
➜  fetcher ./fetcher -url http://www.google.com,http://www.tuxlabs.com,http://www.imdb.com,http://www.southwest.com,http://www.udemy.com,http://www.microsoft.com,http://www.github.com,http://www.yahoo.com,http://www.linkedin.com,http://www.facebook.com,http://www.twitter.com,http://www.apple.com,http://www.betterment.com,http://www.cox.net,http://www.centurylink.net,http://www.att.com,http://www.toyota.com,http://www.netflix.com,http://www.etrade.com,http://www.thestreet.com
0.10s  195819 http://www.toyota.com
0.10s   33200 http://www.apple.com
0.14s   10383 http://www.google.com
0.37s   57338 http://www.facebook.com
0.39s   84816 http://www.microsoft.com
0.47s  207124 http://www.att.com
0.53s  294608 http://www.thestreet.com
0.65s  264782 http://www.twitter.com
0.66s  428256 http://www.southwest.com
0.74s   99983 http://www.betterment.com
0.80s   41372 http://www.linkedin.com
0.82s  520502 http://www.yahoo.com
0.87s  150688 http://www.etrade.com
0.89s   51826 http://www.udemy.com
1.13s   71862 http://www.tuxlabs.com
1.16s   25509 http://www.github.com
1.30s  311818 http://www.centurylink.net
1.33s  169775 http://www.imdb.com
1.33s   87346 http://www.cox.net
1.75s  247502 http://www.netflix.com
Total time: 1.75s - 20 site(s)
20 sites in 1.75 seconds; that is not too shabby. The remainder of the fetcher code runs a single goroutine if only one site is passed, prints an error message if no URL is given on the command line, and finally outputs the total time it took to run for all sites. The goroutine is not strictly necessary in the single-URL case; however, it doesn't hurt, and I like the consistency of how the code reads this way.
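As noted earlier, there is no input validation yet. If you wanted to add some, here is a minimal sketch of one possible approach (the isValidURL helper is hypothetical, not part of the repository code); it uses net/url to reject anything that is not an absolute URL before a goroutine is ever launched:

package main

import (
	"fmt"
	"net/url"
)

// isValidURL is a hypothetical helper: it accepts only absolute URLs
// (scheme and host present) and rejects everything else.
func isValidURL(s string) bool {
	u, err := url.ParseRequestURI(s)
	return err == nil && u.Scheme != "" && u.Host != ""
}

func main() {
	fmt.Println(isValidURL("http://www.tuxlabs.com")) // true
	fmt.Println(isValidURL("not-a-url"))              // false
}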
Hopefully you enjoyed this brief show of the Go programming language. If you decide to get into Go, I cannot recommend this book enough: https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440. The book has a bit of a cult following, since one of its authors is Brian Kernighan (https://en.wikipedia.org/wiki/Brian_Kernighan), who co-authored what many consider to be the best book on C ever written (I own it; it's really good too). I bought other Go books before this one, and I have to say: don't waste your money, buy this one and it is all you will need.
The GitHub code for the examples above can be found here: https://github.com/jasonriedel/fetcher
Godspeed, happy learning.