TuxLabs LLC

All things DevOps

A simple, concurrent webcrawler written in Go

Published / by tuxninja / Leave a Comment

I have been playing with the Go programming language on an off for about a year. I started learning Go, because I was running into lots of issues distributing my Python codes dependencies to production machines, with specific security constraints. Go solves this problem by allowing you to compile a single binary that can be easily copied to all of your systems and you can cross compile it easily for different platforms. In addition, Go has a really great & simple way of dealing with concurrency, not to mention it really is concurrent unlike my beloved Python (GIL), which is great for plenty of use cases, but sometimes you need real concurrency. Here is some code I wrote for a simple concurrent webcrawler.

Here is the code for the command line utility fetcher. Notice it imports another package, crawler.

I am not going go over the basics in this code, because that should be fairly self explanatory. What is important here is how we are implementing concurrency. Once the scripts validates you passed a string in (that is hopefully a URL – No input validation yet!) it starts by creating a channel via

After we initialized the channel, we need to split the sites passed in from the command line -url flag via comma in case there is more than 1 site to crawl. Then we loop through each site and kick off a go routine like so.

At this point, our go routine is executing code from the imported crawler package mentioned above. Calling the method Crawl. Let’s take a look at it now…

This is pretty straight forward. We start a timer, take the passed in URL and do an http.get …if that doesn’t error, the response Body is copied into nbytes, which is ultimately returned to the channel at the bottom of the function.

Once the code returns from crawler.Crawl to fetcher… it loops through each URL for channel output. This is very important. If you try placing a print inside of the same loop as your go routine you are going to change the behavior of your application to work in a serial/slower fashion because after each go routine it will wait for output. Putting the loop for channel output outside of the loop that launches the go routine enables them to all be launched one right after another, and then output is gathered after they have all been launched. This creates a very highly performant outcome. Here is an example of this script once it has been compiled.

20 sites in 1.75s seconds..that is not too shabby. The remainder of the fetcher code runs a go routine if only one site is passed..then returns an error if message if a url is not passed in on the command line, and finally outputs the time it took total to run for all sites. The go routine is not necessary in the case of running a single url, however, it doesn’t hurt and I like the consistency of how the code reads this way.

Hopefully you enjoyed this brief show of the Go programming language. If you decide to get into Go, I cannot recommend this book enough : https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440 . This book has a bit of a cult following due to one of the authors being https://en.wikipedia.org/wiki/Brian_Kernighan who co-authored what consider to be the best book on C ever written (I own it, it’s really good too). I bought other Go books before this one, and I have to say don’t waste your money, buy this one and it is all you will need.

The github code for the examples above can be found here : https://github.com/jasonriedel/fetcher

Godspeed, happy learning.

Leave a Reply