CS 4700 / CS 5700 - Network Fundamentals

Project 2: Web Crawler

This project is due at 11:59pm on October 3, 2014.

Description

This assignment is intended to familiarize you with the HTTP protocol. HTTP is (arguably) the most important application level protocol on the Internet today: the Web runs on HTTP, and increasingly other applications use HTTP as well (including Bittorrent, streaming video, Facebook and Twitter's social APIs, etc.).

Your goal in this assignment is to implement a web crawler that gathers data from a fake social networking website that we have set up for you. The site is available here: Fakebook.

What is a Web Crawler?

A web crawler (sometimes known as a robot, a spider, or a screen scraper) is a piece of software that automatically gathers and traverses documents on the web. For example, lets say you have a crawler and you tell it to start at www.wikipedia.com. The software will first download the Wikipedia homepage, then it will parse the HTML and locate all hyperlinks (i.e. anchor tags) embedded in the page. The crawler then downloads all the HTML pages specified by the URLs on the homepage, and parses them looking for more hyperlinks. This process continues until all of the pages on Wikipedia are downloaded and parsed.

Web crawlers are a fundamental component of today's web. For example, Googlebot is Google's web crawler. Googlebot is constantly scouring the web, downloading pages in search of new and updated content. All of this data forms the backbone of Google's search engine infrastructure.

Fakebook

We have set up a fake social network for this project called Fakebook. Fakebook is a very simple website that consists of the following pages: In order to browse Fakebook, you must first login with a username and password. We will email each student to give them a unique username and password.

WARNING: DO NOT TEST YOUR CRAWLERS ON PUBLIC WEBSITES

Many web server administrators view crawlers as a nuisance, and they get very mad if they see strange crawlers traversing their sites. Only test your crawler against Fakebook, do not test it against any other websites.

High Level Requirements

Your goal is to collect 5 secret flags that have been hidden somewhere on the Fakebook website. The flags are unique for each student, and the pages that contain the flags will be different for each student. Since you have no idea what pages the secret flags will appear on, your only option is to write a web crawler that will traverse Fakebook and locate your flags.

Your web crawler must execute on the command line using the following syntax:

./webcrawler [username] [password]

username and password are used by your crawler to log-in to Fakebook. You may assume that the root page for Fakebook is available at http://cs5700sp15.ccs.neu.edu/. You may also assume that the log-in form for Fakebook is available at http://cs5700sp15.ccs.neu.edu/accounts/login/?next=/fakebook/.

Your web crawler should print exactly fives lines of output: the five secret flags discovered during the crawl of Fakebook. If your program encounters an unrecoverable error, it may print an error message before terminating.

Secret flags may be hidden on any page on Fakebook, and their relative location on each page may be different. Each secret flag is a 64 character long sequences of random alphanumerics. All secret flags will appear in the following format (which makes them easy to identify):

<h2 class='secret_flag' style="color:red">FLAG: 64-characters-of-random-alphanumerics</h2>

There are a few key things that all web crawlers must do in order function:

In order to build a successful web crawler, you will need to handle several different aspects of the HTTP protocol:

In addition to crawling Fakebook, your web crawler must be able to correctly handle HTTP status codes. Obviously, you need to handle 200, since that means everything is okay. Your code must also handle:

I highly recommend the HTTP Made Really Easy tutorial as a starting place for students to learn about the HTTP protocol. Furthermore, the developer tools built-in to the Chrome browser, as well as the Firebug extension for Firefox, are both excellent tools for inspecting and understanding HTTP requests.

Logging in to Fakebook

In order to write code that can successfully log-in to Fakebook, you will need to reverse engineer the HTML form on the log-in page. Students should carefully inspect the form's code, since it may not be as simple as it initially appears.

Language

You can write your code in whatever language you choose, as long as your code compiles and runs on unmodified CCIS Linux machines on the command line. Do not use libraries that are not installed by default on the CCIS Linux machines. Similarly, your code must compile and run on the command line. You may use IDEs (e.g. Eclipse) during development, but do not turn in your IDE project without a Makefile. Make sure you code has no dependencies on your IDE.

Legal Libraries and Modules

Students may use any available libraries to create socket connections, parse URLs, and parse HTML. However, all HTTP request code must be written by the student, from scratch. Your code must build all HTTP messages, parse HTTP responses, and manage all cookies.

For example, if you were to write your crawler in Python, the following modules would all be allowed: socket, parseurl, html, html.parse, and xml. However, the following modules would not be allowed: urllib, urllib2, httplib, and cookielib.

Similarly, if you were to write your crawler in Java, it would not be legal to use java.net.CookieHandler, java.net.CookieManager, java.net.HttpCookie, java.net.HttpUrlConnection, java.net.URLConnection, URL.openConnection(), URL.openStream(), or URL.getContent().

If students have any questions about the legality of any libraries please post them to Piazza. It is much safer to ask ahead of time, rather than turn in code that uses a questionable library and receive points off for the assignment after the fact.

Submitting Your Project

Before turning in your project, you and your partner(s) must register your group. To register yourself in a group, execute the following script:
$ /course/cs5700sp15/bin/register project2 [team name]
This will either report back success or will give you an error message. If you have trouble registering, please contact the course staff. You and your partner(s) must all run this script with the same [team name]. This is how we know you are part of the same group.

To turn-in your project, you should submit your (thoroughly documented) code along with three other files:

Your README, Makefile, secret_flags file, source code, etc. should all be placed in a directory. You submit your project by running the turn-in script as follows:
$ /course/cs5700sp15/bin/turnin project2 [project directory]
[project directory] is the name of the directory with your submission. The script will print out every file that you are submitting, so make sure that it prints out all of the files you wish to submit! The turn-in script will not accept submissions that are missing a README, a Makefile, or a secret_flags file. Only one group member needs to submit your project. Your group may submit as many times as you wish; only the last submission will be graded, and the time of the last submission will determine whether your assignment is late.

Grading

This project is worth 8 points. You will receive full credit if 1) your code compiles, runs, and produces the expected output, 2) you have not used any illegal libraries, and 3) you successfully submit the secret flags of all group members. All student code will be scanned by plagarism detection software to ensure that students are not copying code from the Internet or each other.