{"id":374,"date":"2016-06-16T05:14:31","date_gmt":"2016-06-16T05:14:31","guid":{"rendered":"http:\/\/tuxlabs.com\/?p=374"},"modified":"2016-06-16T05:21:10","modified_gmt":"2016-06-16T05:21:10","slug":"a-simple-concurrent-webcrawler-written-in-go","status":"publish","type":"post","link":"https:\/\/tuxlabs.com\/?p=374","title":{"rendered":"A simple, concurrent webcrawler written in Go"},"content":{"rendered":"<p>I have been playing with the Go programming language on and off for about a year. I started learning Go because I was running into lots of issues distributing my Python code&#8217;s dependencies to production machines with specific security constraints. Go solves this problem by letting you compile a single binary that can easily be copied to all of your systems, and you can cross-compile it easily for different platforms. In addition, Go has a really great &amp; simple way of dealing with concurrency. Not to mention, Go really is concurrent, unlike my beloved Python (GIL), which is great for plenty of use cases, but sometimes you need real concurrency. Here is some code I wrote for a simple concurrent webcrawler.<\/p>\n<p>Here is the code for the command line utility fetcher. 
Notice it imports another package, crawler.<\/p>\n<pre class=\"lang:default decode:true\" title=\"fetcher\">package main\r\n\r\nimport (\r\n\t\"flag\"\r\n\t\"fmt\"\r\n\t\"strings\"\r\n\t\"github.com\/jasonriedel\/fetcher\/crawler\"\r\n\t\"time\"\r\n)\r\n\r\nvar (\r\n\tsites = flag.String(\"url\", \"\", \"Name of sites to crawl, comma delimited\")\r\n)\r\n\r\nfunc main() {\r\n\tflag.Parse()\r\n\tstart := time.Now()\r\n\tcount := 0\r\n\tif *sites != \"\" {\r\n\t\tch := make(chan string)\r\n\t\tif strings.Contains(*sites, \",\") {\r\n\t\t\tu := strings.Split(*sites, \",\")\r\n\t\t\tfor _, cu := range u {\r\n\t\t\t\tcount++\r\n\t\t\t\tgo crawler.Crawl(cu, ch) \/\/ start goroutine\r\n\t\t\t}\r\n\t\t\tfor range u {\r\n\t\t\t\tfmt.Println(&lt;-ch)\r\n\t\t\t}\r\n\t\t} else {\r\n\t\t\tcount++\r\n\t\t\tgo crawler.Crawl(*sites, ch) \/\/ start goroutine\r\n\t\t\tfmt.Println(&lt;-ch)\r\n\t\t}\r\n\t} else {\r\n\t\tfmt.Println(\"Please specify urls\")\r\n\t}\r\n\tsecs := time.Since(start).Seconds()\r\n\tfmt.Printf(\"Total time: %.2fs - %d site(s)\", secs, count)\r\n}\r\n<\/pre>\n<p>I am not going to go over the basics in this code, because they should be fairly self-explanatory. What is important here is how we are implementing concurrency. Once the script validates that you passed in a string (hopefully a URL &#8211; no input validation yet!), it starts by creating a channel via<\/p>\n<pre class=\"lang:default decode:true\">ch := make(chan string)\r\n<\/pre>\n<p>After we initialize the channel, we split the sites passed in via the command line -url flag on commas, in case there is more than one site to crawl. Then we loop through each site and kick off a goroutine like so.<\/p>\n<pre class=\"lang:default decode:true \">go crawler.Crawl(cu, ch) \/\/ start goroutine<\/pre>\n<p>At this point, our goroutine is executing code from the imported crawler package mentioned above, calling the method Crawl. 
Let&#8217;s take a look at it now&#8230;<\/p>\n<pre class=\"lang:default decode:true \">package crawler\r\n\r\nimport (\r\n\t\"fmt\"\r\n\t\"time\"\r\n\t\"io\/ioutil\"\r\n\t\"io\"\r\n\t\"net\/http\"\r\n)\r\n\r\nfunc Crawl(url string, ch chan&lt;- string) {\r\n\tstart := time.Now()\r\n\tresp, err := http.Get(url)\r\n\tif err != nil {\r\n\t\tch &lt;- fmt.Sprint(err) \/\/ send to channel ch\r\n\t\treturn\r\n\t}\r\n\r\n\tnbytes, err := io.Copy(ioutil.Discard, resp.Body)\r\n\tresp.Body.Close() \/\/ don't leak resources\r\n\tif err != nil {\r\n\t\tch &lt;- fmt.Sprintf(\"While reading %s: %v\", url, err)\r\n\t\treturn\r\n\t}\r\n\tsecs := time.Since(start).Seconds()\r\n\tch &lt;- fmt.Sprintf(\"%.2fs %7d %s\", secs, nbytes, url)\r\n}<\/pre>\n<p>This is pretty straightforward. We start a timer, take the passed-in URL, and do an http.Get. If that doesn&#8217;t error, the response body is read and discarded by io.Copy, and the byte count lands in nbytes, which is ultimately sent to the channel at the bottom of the function.<\/p>\n<p>Once the code returns from crawler.Crawl to fetcher, it loops through each URL for channel output. This is very important. If you place the print inside the same loop that launches the goroutines, you change the behavior of your application to run in a serial, slower fashion, because after launching each goroutine it will wait for that goroutine&#8217;s output. Putting the loop for channel output outside of the loop that launches the goroutines lets them all be launched one right after another, with output gathered after they have all been launched. This creates a highly performant outcome. 
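To see the fan-out\/fan-in pattern in isolation, here is a minimal, self-contained sketch. The simulateCrawl function is a hypothetical stand-in for crawler.Crawl that fakes network latency with a sleep instead of a real HTTP request, so the timing behavior is easy to observe:

```go
package main

import (
	"fmt"
	"time"
)

// simulateCrawl is a stand-in for crawler.Crawl: it pretends to fetch a
// URL (50ms of fake latency) and reports on the channel when done.
func simulateCrawl(url string, ch chan<- string) {
	time.Sleep(50 * time.Millisecond)
	ch <- "done: " + url
}

// collectAll fans out one goroutine per URL, then fans in one result per
// goroutine. The receive loop runs only after every goroutine is launched.
func collectAll(urls []string) []string {
	ch := make(chan string)
	for _, u := range urls {
		go simulateCrawl(u, ch) // fan out: launch everything first
	}
	results := make([]string, 0, len(urls))
	for range urls {
		results = append(results, <-ch) // fan in: gather afterwards
	}
	return results
}

func main() {
	start := time.Now()
	urls := []string{"http://a.example", "http://b.example", "http://c.example"}
	for _, r := range collectAll(urls) {
		fmt.Println(r)
	}
	// The three 50ms "crawls" overlap, so elapsed time is roughly 0.05s,
	// not the roughly 0.15s a serial version would take.
	fmt.Printf("elapsed: %.2fs\n", time.Since(start).Seconds())
}
```

If you move the `<-ch` receive inside the launch loop, the elapsed time roughly triples, which is exactly the serial behavior described above.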
Here is an example of this script once it has been compiled.<\/p>\n<pre class=\"lang:default decode:true\">\u279c  fetcher .\/fetcher -url http:\/\/www.google.com,http:\/\/www.tuxlabs.com,http:\/\/www.imdb.com,http:\/\/www.southwest.com,http:\/\/www.udemy.com,http:\/\/www.microsoft.com,http:\/\/www.github.com,http:\/\/www.yahoo.com,http:\/\/www.linkedin.com,http:\/\/www.facebook.com,http:\/\/www.twitter.com,http:\/\/www.apple.com,http:\/\/www.betterment.com,http:\/\/www.cox.net,http:\/\/www.centurylink.net,http:\/\/www.att.com,http:\/\/www.toyota.com,http:\/\/www.netflix.com,http:\/\/www.etrade.com,http:\/\/www.thestreet.com\r\n0.10s  195819 http:\/\/www.toyota.com\r\n0.10s   33200 http:\/\/www.apple.com\r\n0.14s   10383 http:\/\/www.google.com\r\n0.37s   57338 http:\/\/www.facebook.com\r\n0.39s   84816 http:\/\/www.microsoft.com\r\n0.47s  207124 http:\/\/www.att.com\r\n0.53s  294608 http:\/\/www.thestreet.com\r\n0.65s  264782 http:\/\/www.twitter.com\r\n0.66s  428256 http:\/\/www.southwest.com\r\n0.74s   99983 http:\/\/www.betterment.com\r\n0.80s   41372 http:\/\/www.linkedin.com\r\n0.82s  520502 http:\/\/www.yahoo.com\r\n0.87s  150688 http:\/\/www.etrade.com\r\n0.89s   51826 http:\/\/www.udemy.com\r\n1.13s   71862 http:\/\/www.tuxlabs.com\r\n1.16s   25509 http:\/\/www.github.com\r\n1.30s  311818 http:\/\/www.centurylink.net\r\n1.33s  169775 http:\/\/www.imdb.com\r\n1.33s   87346 http:\/\/www.cox.net\r\n1.75s  247502 http:\/\/www.netflix.com\r\nTotal time: 1.75s - 20 site(s)%<\/pre>\n<p>20 sites in 1.75 seconds&#8230; that is not too shabby. The remainder of the fetcher code runs a goroutine if only one site is passed, prints an error message if no URL is passed in on the command line, and finally outputs the total time it took to run for all sites. 
The goroutine is not necessary in the case of running a single URL; however, it doesn&#8217;t hurt, and I like the consistency of how the code reads this way.<\/p>\n<p>Hopefully you enjoyed this brief showcase of the Go programming language. If you decide to get into Go, I cannot recommend this book enough: <a href=\"https:\/\/www.amazon.com\/Programming-Language-Addison-Wesley-Professional-Computing\/dp\/0134190440\">https:\/\/www.amazon.com\/Programming-Language-Addison-Wesley-Professional-Computing\/dp\/0134190440<\/a>. This book has a bit of a cult following, due to one of its authors being <a href=\"https:\/\/en.wikipedia.org\/wiki\/Brian_Kernighan\">Brian Kernighan<\/a>, who co-authored what many consider to be the best book on C ever written (I own it, and it&#8217;s really good too). I bought other Go books before this one, and I have to say: don&#8217;t waste your money &#8211; buy this one and it is all you will need.<\/p>\n<p>The github code for the examples above can be found here: <a href=\"https:\/\/github.com\/jasonriedel\/fetcher\">https:\/\/github.com\/jasonriedel\/fetcher<\/a><\/p>\n<p>Godspeed, happy learning.<\/p>\n","protected":false},"excerpt":{"rendered":"<a href=\"https:\/\/tuxlabs.com\/?p=374\" rel=\"bookmark\" title=\"Permalink to A simple, concurrent webcrawler written in Go\"><p>I have been playing with the Go programming language on and off for about a year. I started learning Go because I was running into lots of issues distributing my Python code&#8217;s dependencies to production machines with specific security constraints. 
Go solves this problem by allowing you to compile a single binary that can be [&hellip;]<\/p>\n<\/a>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[142,8],"tags":[],"class_list":{"0":"post-374","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-go","7":"category-programming","8":"h-entry","9":"hentry"},"_links":{"self":[{"href":"https:\/\/tuxlabs.com\/index.php?rest_route=\/wp\/v2\/posts\/374","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tuxlabs.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tuxlabs.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tuxlabs.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tuxlabs.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=374"}],"version-history":[{"count":3,"href":"https:\/\/tuxlabs.com\/index.php?rest_route=\/wp\/v2\/posts\/374\/revisions"}],"predecessor-version":[{"id":378,"href":"https:\/\/tuxlabs.com\/index.php?rest_route=\/wp\/v2\/posts\/374\/revisions\/378"}],"wp:attachment":[{"href":"https:\/\/tuxlabs.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=374"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tuxlabs.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=374"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tuxlabs.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=374"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}