A simple, concurrent webcrawler written in Go

I have been playing with the Go programming language on an off for about a year. I started learning Go, because I was running into lots of issues distributing my Python codes dependencies to production machines, with specific security constraints. Go solves this problem by allowing you to compile a single binary that can be easily copied to all of your systems and you can cross compile it easily for different platforms. In addition, Go has a really great & simple way of dealing with concurrency, not to mention it really is concurrent unlike my beloved Python (GIL), which is great for plenty of use cases, but sometimes you need real concurrency. Here is some code I wrote for a simple concurrent webcrawler.

Here is the code for the command line utility fetcher. Notice it imports another package, crawler.

package main

import (
	"flag"
	"fmt"
	"strings"
	"github.com/jasonriedel/fetcher/crawler"
	"time"
)

var (
	sites = flag.String("url", "", "Name of sites to crawl comma delimitted")

)
func main() {
	flag.Parse()
	start := time.Now()
	count := 0
	if *sites != "" {
		ch := make(chan string)
		if strings.Contains(*sites, ",") {
			u := strings.Split(*sites, ",")
			for _, cu := range u {
				count++
				go crawler.Crawl(cu, ch) // start goroutine
			}
			for range u {
				fmt.Println(<-ch)
			}
		} else {
			count++
			go crawler.Crawl(*sites, ch) // start goroutine
			fmt.Println(<-ch)
		}
	} else {
		fmt.Println("Please specify urls")
	}
	secs := time.Since(start).Seconds()
	fmt.Printf("Total time: %.2fs - %d site(s)", secs, count)
}

I am not going go over the basics in this code, because that should be fairly self explanatory. What is important here is how we are implementing concurrency. Once the scripts validates you passed a string in (that is hopefully a URL – No input validation yet!) it starts by creating a channel via

ch := make(chan string)

After we initialized the channel, we need to split the sites passed in from the command line -url flag via comma in case there is more than 1 site to crawl. Then we loop through each site and kick off a go routine like so.

go crawler.Crawl(cu, ch) // start goroutine

At this point, our go routine is executing code from the imported crawler package mentioned above. Calling the method Crawl. Let’s take a look at it now…

package crawler

import (
	"fmt"
	"time"
	"io/ioutil"
	"io"
	"net/http"
)

func Crawl(url string, ch chan<- string) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		ch <- fmt.Sprint(err) // send to channel ch
		return
	}

	nbytes, err := io.Copy(ioutil.Discard, resp.Body)
	resp.Body.Close() // dont leak resources
	if err != nil {
		ch <- fmt.Sprintf("While reading %s: %v", url, err)
		return
	}
	secs := time.Since(start).Seconds()
	ch <- fmt.Sprintf("%.2fs %7d %s", secs, nbytes, url)
}

This is pretty straight forward. We start a timer, take the passed in URL and do an http.get …if that doesn’t error, the response Body is copied into nbytes, which is ultimately returned to the channel at the bottom of the function.

Once the code returns from crawler.Crawl to fetcher… it loops through each URL for channel output. This is very important. If you try placing a print inside of the same loop as your go routine you are going to change the behavior of your application to work in a serial/slower fashion because after each go routine it will wait for output. Putting the loop for channel output outside of the loop that launches the go routine enables them to all be launched one right after another, and then output is gathered after they have all been launched. This creates a very highly performant outcome. Here is an example of this script once it has been compiled.

➜  fetcher ./fetcher -url http://www.google.com,http://www.tuxlabs.com,http://www.imdb.com,http://www.southwest.com,http://www.udemy.com,http://www.microsoft.com,http://www.github.com,http://www.yahoo.com,http://www.linkedin.com,http://www.facebook.com,http://www.twitter.com,http://www.apple.com,http://www.betterment.com,http://www.cox.net,http://www.centurylink.net,http://www.att.com,http://www.toyota.com,http://www.netflix.com,http://www.etrade.com,http://www.thestreet.com
0.10s  195819 http://www.toyota.com
0.10s   33200 http://www.apple.com
0.14s   10383 http://www.google.com
0.37s   57338 http://www.facebook.com
0.39s   84816 http://www.microsoft.com
0.47s  207124 http://www.att.com
0.53s  294608 http://www.thestreet.com
0.65s  264782 http://www.twitter.com
0.66s  428256 http://www.southwest.com
0.74s   99983 http://www.betterment.com
0.80s   41372 http://www.linkedin.com
0.82s  520502 http://www.yahoo.com
0.87s  150688 http://www.etrade.com
0.89s   51826 http://www.udemy.com
1.13s   71862 http://www.tuxlabs.com
1.16s   25509 http://www.github.com
1.30s  311818 http://www.centurylink.net
1.33s  169775 http://www.imdb.com
1.33s   87346 http://www.cox.net
1.75s  247502 http://www.netflix.com
Total time: 1.75s - 20 site(s)%

20 sites in 1.75s seconds..that is not too shabby. The remainder of the fetcher code runs a go routine if only one site is passed..then returns an error if message if a url is not passed in on the command line, and finally outputs the time it took total to run for all sites. The go routine is not necessary in the case of running a single url, however, it doesn’t hurt and I like the consistency of how the code reads this way.

Hopefully you enjoyed this brief show of the Go programming language. If you decide to get into Go, I cannot recommend this book enough : https://www.amazon.com/Programming-Language-Addison-Wesley-Professional-Computing/dp/0134190440 . This book has a bit of a cult following due to one of the authors being https://en.wikipedia.org/wiki/Brian_Kernighan who co-authored what consider to be the best book on C ever written (I own it, it’s really good too). I bought other Go books before this one, and I have to say don’t waste your money, buy this one and it is all you will need.

The github code for the examples above can be found here : https://github.com/jasonriedel/fetcher

Godspeed, happy learning.

How to use Boto to Audit your AWS EC2 instance security groups

Boto is a Software Development Kit for accessing the AWS API’s using Python.

https://github.com/boto/boto3

Recently, I needed to determine how many of my EC2 instances were spawned in a public subnet, that also had security groups with wide open access on any port via any protocol to the instances. Because I have an IGW (Internet Gateway) in my VPC’s and properly setup routing tables, instances launched in the public subnets with wide open security groups (allowing ingress traffic from any source) is a really bad thing 🙂

Here is the code I wrote to identify these naughty instances. It will require slight modifications in your own environment, to match your public subnet IP Ranges, EC2 Tags, and Account names.

#!/usr/bin/env python
__author__ = 'Jason Riedel'
__description__ = 'Find EC2 instances provisioned in a public subnet that have security groups allowing ingress traffic from any source.'
__date__ = 'June 5th 2016'
__version__ = '1.0'

import boto3

def find_public_addresses(ec2):
    public_instances = {}
    instance_public_ips = {}
    instance_private_ips = {}
    instance_ident = {}
    instances = ec2.instances.filter(Filters=[{'Name': 'instance-state-name', 'Values': ['running'] }])

    # Ranges that you define as public subnets in AWS go here.
    public_subnet_ranges = ['10.128.0', '192.168.0', '172.16.0']

    for instance in instances:
        # I only care if the private address falls into a public subnet range
        # because if it doesnt Internet ingress cant reach it directly anyway even with a public IP
        if any(cidr in instance.private_ip_address for cidr in public_subnet_ranges):
            owner_tag = "None"
            instance_name = "None"
            if instance.tags:
                for i in range(len(instance.tags)):
                    #comment OwnerEmail tag out if you do not tag your instances with it.
                    if instance.tags[i]['Key'] == "OwnerEmail":
                        owner_tag = instance.tags[i]['Value']
                    if instance.tags[i]['Key'] == "Name":
                        instance_name = instance.tags[i]['Value']
            instance_ident[instance.id] = "Name: %s\n\tKeypair: %s\n\tOwner: %s" % (instance_name, instance.key_name, owner_tag)
            if instance.public_ip_address is not None:
                values=[]
                for i in range(len(instance.security_groups)):
                    values.append(instance.security_groups[i]['GroupId'])
                public_instances[instance.id] = values
                instance_public_ips[instance.id] = instance.public_ip_address
                instance_private_ips[instance.id] = instance.private_ip_address

    return (public_instances, instance_public_ips,instance_private_ips, instance_ident)

def inspect_security_group(ec2, sg_id):
    sg = ec2.SecurityGroup(sg_id)

    open_cidrs = []
    for i in range(len(sg.ip_permissions)):
        to_port = ''
        ip_proto = ''
        if 'ToPort' in sg.ip_permissions[i]:
            to_port = sg.ip_permissions[i]['ToPort']
        if 'IpProtocol' in sg.ip_permissions[i]:
            ip_proto = sg.ip_permissions[i]['IpProtocol']
            if '-1' in ip_proto:
                ip_proto = 'All'
        for j in range(len(sg.ip_permissions[i]['IpRanges'])):
            cidr_string = "%s %s %s" % (sg.ip_permissions[i]['IpRanges'][j]['CidrIp'], ip_proto, to_port)

            if sg.ip_permissions[i]['IpRanges'][j]['CidrIp'] == '0.0.0.0/0':
                #preventing an instance being flagged for only ICMP being open
                if ip_proto != 'icmp':
                    open_cidrs.append(cidr_string)

    return open_cidrs


if __name__ == "__main__":
    #loading profiles from ~/.aws/config & credentials
    profile_names = ['de', 'pe', 'pde', 'ppe']
    #Translates profile name to a more friendly name i.e. Account Name
    translator = {'de': 'Platform Dev', 'pe': 'Platform Prod', 'pde': 'Products Dev', 'ppe': 'Products Prod'}
    for profile_name in profile_names:
        session = boto3.Session(profile_name=profile_name)
        ec2 = session.resource('ec2')

        (public_instances, instance_public_ips, instance_private_ips, instance_ident) = find_public_addresses(ec2)

        for instance in public_instances:
            for sg_id in public_instances[instance]:
                open_cidrs = inspect_security_group(ec2, sg_id)
                if open_cidrs: #only print if there are open cidrs
                    print "=================================="
                    print " %s, %s" % (instance, translator[profile_name])
                    print "=================================="
                    print "\tprivate ip: %s\n\tpublic ip: %s\n\t%s" % (instance_private_ips[instance], instance_public_ips[instance], instance_ident[instance])
                    print "\t=========================="
                    print "\t open ingress rules"
                    print "\t=========================="
                    for cidr in open_cidrs:
                        print "\t\t" + cidr

To run this you also need to setup your .aws/config and .aws/credentials file.

http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-config-files

Email me tuxninja [at] tuxlabs.com if you have any issues.
Boto is awesome 🙂 Note so is the AWS CLI, but requires some shell scripting as opposed to Python to do cool stuff.

The github for this code here https://github.com/jasonriedel/AWS/blob/master/sg-audit.py

Enjoy !