June 16, 2016

A simple, concurrent webcrawler written in Go

Go, Programming / tuxninja

I have been playing with the Go programming language on and off for about a year. I started learning Go because I kept running into problems distributing my Python code’s dependencies to production machines with specific security constraints. Go solves this problem by compiling to a single binary that can be copied to all of your systems, and it cross-compiles easily for different platforms. In addition, Go has a really great and simple way of dealing with concurrency, and its goroutines can actually run in parallel across CPU cores, unlike threads in my beloved Python, which are limited by the GIL. Python is great for plenty of use cases, but sometimes you need real parallelism. Here is some code I wrote for a simple concurrent webcrawler.

Here is the code for the command line utility fetcher. Notice it imports another package, crawler.

package main

import (
	"flag"
	"fmt"
	"strings"
	"time"

	"github.com/jasonriedel/fetcher/crawler"
)

var (
	sites = flag.String("url", "", "Comma-delimited list of sites to crawl")
)

func main() {
	flag.Parse()
	start := time.Now()
	count := 0
	if *sites != "" {
		ch := make(chan string)
		if strings.Contains(*sites, ",") {
			u := strings.Split(*sites, ",")
			for _, cu := range u {
				count++
				go crawler.Crawl(cu, ch) // start goroutine
			}
			for range u {
				fmt.Println(<-ch)
			}
		} else {
			count++
			go crawler.Crawl(*sites, ch) // start goroutine
			fmt.Println(<-ch)
		}
	} else {
		fmt.Println("Please specify urls")
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("Total time: %.2fs - %d site(s)\n", secs, count)
}

I am not going to go over the basics in this code, because they should be fairly self-explanatory. What is important here is how we are implementing concurrency. Once the program has checked that you passed in a non-empty string (hopefully a URL; there is no input validation yet!), it creates a channel via

ch := make(chan string)
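
To see what an unbuffered channel does on its own, here is a minimal sketch that is separate from fetcher. The function name greet is made up for illustration; the point is only that a send on the channel waits until something receives from it.

```go
package main

import "fmt"

// greet is a hypothetical worker (not part of fetcher): it sends one
// message on ch and then exits.
func greet(name string, ch chan<- string) {
	ch <- "hello, " + name // blocks until main receives
}

func main() {
	ch := make(chan string) // unbuffered: each send waits for a matching receive
	go greet("gopher", ch)  // runs concurrently with main
	fmt.Println(<-ch)       // blocks until greet sends; prints "hello, gopher"
}
```

The chan<- string parameter type means greet may only send on the channel, which is the same restriction the Crawl function uses.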

With the channel initialized, we split the value of the -url command line flag on commas in case there is more than one site to crawl. Then we loop through each site and kick off a goroutine like so.

go crawler.Crawl(cu, ch) // start goroutine

At this point, each goroutine is executing the Crawl function from the imported crawler package mentioned above. Let’s take a look at it now…

package crawler

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

func Crawl(url string, ch chan<- string) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		ch <- fmt.Sprint(err) // send to channel ch
		return
	}

	nbytes, err := io.Copy(ioutil.Discard, resp.Body)
	resp.Body.Close() // don't leak resources
	if err != nil {
		ch <- fmt.Sprintf("While reading %s: %v", url, err)
		return
	}
	secs := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %7d %s", secs, nbytes, url)
}

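This is pretty straightforward. We start a timer, take the passed-in URL and do an http.Get. If that doesn’t error, the response body is copied to ioutil.Discard, which throws the data away, and io.Copy returns the number of bytes it copied as nbytes. That byte count, along with the elapsed time and the URL, is sent on the channel at the bottom of the function.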

Back in fetcher, the main function loops once per URL and receives each result from the channel. Where that loop sits is very important. If you receive from the channel and print inside the same loop that launches the goroutines, your application will run serially and more slowly, because after launching each goroutine it waits for that goroutine’s output before starting the next. Putting the receive loop after the launch loop lets all of the goroutines start one right after another, and the output is gathered as each one finishes. The total run time is then roughly that of the slowest site rather than the sum of all of them. Here is an example of the program once it has been compiled.

➜  fetcher ./fetcher -url http://www.google.com,http://www.tuxlabs.com,http://www.imdb.com,http://www.southwest.com,http://www.udemy.com,http://www.microsoft.com,http://www.github.com,http://www.yahoo.com,http://www.linkedin.com,http://www.facebook.com,http://www.twitter.com,http://www.apple.com,http://www.betterment.com,http://www.cox.net,http://www.centurylink.net,http://www.att.com,http://www.toyota.com,http://www.netflix.com,http://www.etrade.com,http://www.thestreet.com
0.10s  195819 http://www.toyota.com
0.10s   33200 http://www.apple.com
0.14s   10383 http://www.google.com
0.37s   57338 http://www.facebook.com
0.39s   84816 http://www.microsoft.com
0.47s  207124 http://www.att.com
0.53s  294608 http://www.thestreet.com
0.65s  264782 http://www.twitter.com
0.66s  428256 http://www.southwest.com
0.74s   99983 http://www.betterment.com
0.80s   41372 http://www.linkedin.com
0.82s  520502 http://www.yahoo.com
0.87s  150688 http://www.etrade.com
0.89s   51826 http://www.udemy.com
1.13s   71862 http://www.tuxlabs.com
1.16s   25509 http://www.github.com
1.30s  311818 http://www.centurylink.net
1.33s  169775 http://www.imdb.com
1.33s   87346 http://www.cox.net
1.75s  247502 http://www.netflix.com
Total time: 1.75s - 20 site(s)

20 sites in 1.75 seconds; that is not too shabby. The remainder of the fetcher code runs a single goroutine if only one site is passed, prints an error message if no URL is passed in on the command line, and finally outputs the total time it took to run for all sites. The goroutine is not necessary in the case of a single URL; however, it doesn’t hurt and I like the consistency of how the code reads this way.
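
One possible simplification, which is not what the repository does: strings.Split returns a one-element slice when the separator is absent, so a single loop could cover both the one-site and many-site cases. The helper name splitSites below is hypothetical, and trimming whitespace and dropping empty entries are extra assumptions on my part.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSites is a hypothetical helper (not in fetcher). It splits the
// -url value on commas, trims whitespace, and drops empty entries.
func splitSites(s string) []string {
	var out []string
	for _, u := range strings.Split(s, ",") {
		if u = strings.TrimSpace(u); u != "" {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	fmt.Println(len(splitSites("http://a.example")))                  // 1
	fmt.Println(len(splitSites("http://a.example,http://b.example"))) // 2
	fmt.Println(len(splitSites("")))                                  // 0
}
```

With a helper like this, main could range over the returned slice once to launch goroutines and once to receive, and treat an empty slice as the "Please specify urls" case.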

Hopefully you enjoyed this brief show of the Go programming language. If you decide to get into Go, I cannot recommend this book enough: https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440 . It has a bit of a cult following because one of its authors is Brian Kernighan (https://en.wikipedia.org/wiki/Brian_Kernighan), who co-authored what many consider to be the best book on C ever written (I own it, it’s really good too). I bought other Go books before this one, and I have to say don’t waste your money; buy this one and it is all you will need.

The GitHub code for the examples above can be found here: https://github.com/jasonriedel/fetcher

Godspeed, happy learning.
