--- date: "2015-02-16T11:47:00-05:00" draft: true title: "Homework 3 (Due 3/15)" menu: main: parent: "Homework" --- # Objective In this assignment, you will work with a team to create a vertical search engine using elasticsearch. Please read these instructions carefully: although you are working with teammates, you will be graded individually for most of the assignment. You will write a web crawler, and crawl Internet documents to construct a document collection focused on a particular topic. Your crawler must conform strictly to a particular politeness policy. Once the documents are crawled, you will pool them together. Form a team of three students with your classmates. Your team will be assigned a single query with few associated seed URLs. You will each crawl web pages starting from a different seed URL. When you have each collected your individual documents, you will pool them together, index them and implement search. Although you are working in a team, you are each responsible for developing your own crawlers individually, and for crawling from your own seeds for your team's assigned topic. # Obtaining a topic Form a team of three students with your classmates. If you have trouble finding teammates, please contact the TAs right away to be placed in a team. Once your team has been formed, have each team member create a file in Dropbox named team_X_Yabcd.txt (using your first initial and the last name). This file should contain the names team members. The TAs will update this file with a topic and three seed URLs. Each individual on your team will crawl using three seed URLs: one of the URLs provided to the team, and at least two additional seed URLs you devise on your own. In total, the members of your team will crawl from at least nine seed URLs. # Crawling Documents Each individual is responsible for writing their own crawler, and crawling from their own seed URLs. Set up Elastic Search with your teammates to have the same cluster name and the same index name. Your crawler will manage a *frontier* of URLs to be crawled. The frontier will initially contain just your seed URLs. URLs will be added to the frontier as you crawl, by finding the links on the web pages you crawl. 1. You should crawl at least 15,000 documents in total, including your seed URLs. This will take several hours, so think carefully about how to adequately test your program without running it to completion in each debugging cycle. 2. You should choose the next URL to crawl from your frontier using a best-first strategy. See Frontier Management, below. 3. Your crawler must strictly conform to the politeness policy detailed in the section below. You will be consuming resources owned by the web sites you crawl, and many of them are actively looking for misbehaving crawlers to permanently block. Please be considerate of the resources you consume. 4. You should only crawl HTML documents. It is up to you to devise a way to ensure this. However, do not reject documents simply because their URLs don't end in .html or .htm. 5. You should find all outgoing links on the pages you crawl, canonicalize them, and add them to your frontier if they are new. See the Document Processing and URL Canonicalization sections below for a discussion. 6. For each page you crawl, you should store the following filed with ElasticSearch : an id, the URL, the HTTP headers, the page contents cleaned (with term positions), the raw html, and a list of all in-links (known) and out-links for the page. 
Once your crawl is done, you should get together with your teammates and figure out how to merge the indexes. With proper ids, Elasticsearch will do the merging itself, but you still have to manage the link graph.

## Politeness Policy

Your crawler must strictly observe this politeness policy at all times, including during development and testing. Violating these policies can harm the web sites you crawl, and can cause the web site administrators to block the IP address from which you are crawling.

1. Make no more than one HTTP request per second from any given domain. You may crawl multiple pages from different domains at the same time, but be prepared to convince the TAs that your crawler obeys this rule. The simplest approach is to make one request at a time and have your program sleep between requests. The one exception to this rule is that if you make a HEAD request for a URL, you may then make a GET request for the same URL without waiting.
2. Before you crawl the first page from a given domain, fetch its robots.txt file and make sure your crawler strictly obeys the file. You should use a third-party library to parse the file and tell you which URLs are OK to crawl.

## Frontier Management

The _frontier_ is the data structure you use to store pages you need to crawl. For each page, the frontier should store the canonicalized page URL and the in-link count to the page from other pages you have already crawled. When selecting the next page to crawl, you should choose the next page in the following order:

1. Seed URLs should always be crawled first.
2. Prefer pages with higher in-link counts.
3. If multiple pages have maximal in-link counts, choose the one which has been in the queue the longest.

If the next page in the frontier is at a domain you have recently crawled a page from and you do not wish to wait, then you should crawl the next page from a different domain instead. A minimal sketch of this selection strategy, together with the politeness delay, appears after the Document Processing section below.

## URL Canonicalization

Many URLs can refer to the same web resource. In order to ensure that you crawl 15,000 distinct web pages, you should apply the following canonicalization rules to all URLs you encounter.

1. Convert the scheme and host to lower case: `HTTP://www.Example.com/SomeFile.html` → `http://www.example.com/SomeFile.html`
2. Remove port 80 from HTTP URLs, and port 443 from HTTPS URLs: `http://www.example.com:80` → `http://www.example.com`
3. Make relative URLs absolute: if you crawl `http://www.example.com/a/b.html` and find the URL `../c.html`, it should canonicalize to `http://www.example.com/c.html`.
4. Remove the fragment, which begins with `#`: `http://www.example.com/a.html#anything` → `http://www.example.com/a.html`
5. Remove duplicate slashes: `http://www.example.com//a.html` → `http://www.example.com/a.html`

You may add additional canonicalization rules to improve performance, if you wish to do so. A sketch implementing these rules appears after the Document Processing section below.

## Document Processing

Once you have downloaded a web page, you will need to parse it to update the frontier and save its contents. You should parse it using a third-party library. We suggest jsoup for Java, and Beautiful Soup for Python. You will need to do the following:

1. Extract all links in `<a>` tags. Canonicalize the URL, add it to the frontier if it has not been crawled (or increment the in-link count if the URL is already in the frontier), and record it as an out-link in the link graph file.
2. Extract the document text, stripped of all HTML formatting, JavaScript, CSS, and so on. Write the document text to a file in the same format as the AP89 corpus, as described below. Use the canonical URL as the DOCNO. If the page has a `<title>` tag, store its contents in a `<HEAD>` element in the file. This will allow you to use your existing indexing code from HW1 to index these documents.
3. Store the entire HTTP response separately, as described below.
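
As referenced in the Frontier Management section, here is a minimal, single-threaded sketch of the frontier selection order and the politeness rules. It is an illustration under assumptions, not a required design: the class and function names are hypothetical, and it uses the standard-library `urllib.robotparser` for brevity where the assignment suggests a third-party robots.txt parser.

```python
# A minimal, single-threaded sketch of frontier selection and politeness,
# assuming Python 3. All names here are illustrative, not required.
import time
import urllib.robotparser
from urllib.parse import urlsplit


class Frontier:
    """Canonical URLs waiting to be crawled, with in-link counts."""

    def __init__(self, seeds):
        self.counter = 0
        # url -> [is_seed, inlink_count, insertion_order]
        self.entries = {url: [True, 0, self._next()] for url in seeds}

    def _next(self):
        self.counter += 1
        return self.counter

    def add(self, url):
        """Add a new URL, or bump the in-link count of a waiting one."""
        if url in self.entries:
            self.entries[url][1] += 1
        else:
            self.entries[url] = [False, 1, self._next()]

    def pop_next(self):
        """Seeds first, then highest in-link count, then oldest entry.
        A linear scan is simple but slow; a priority queue with lazy
        updates is a reasonable optimization."""
        if not self.entries:
            return None
        best = min(
            self.entries,
            key=lambda u: (not self.entries[u][0],   # seeds first
                           -self.entries[u][1],      # more in-links first
                           self.entries[u][2]),      # oldest first
        )
        self.entries.pop(best)
        return best


last_hit = {}   # domain -> time of the last request
robots = {}     # domain -> parsed robots.txt


def allowed(url, agent="hw3-crawler"):
    """Fetch and cache robots.txt for the domain, then check the URL."""
    domain = urlsplit(url).netloc
    if domain not in robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("http://%s/robots.txt" % domain)
        try:
            rp.read()
        except OSError:
            pass  # robots.txt unreachable; decide how your crawler handles this
        robots[domain] = rp
    return robots[domain].can_fetch(agent, url)


def polite_delay(url, delay=1.0):
    """Sleep so that requests to the same domain are at least `delay` apart."""
    domain = urlsplit(url).netloc
    wait = last_hit.get(domain, 0) + delay - time.time()
    if wait > 0:
        time.sleep(wait)
    last_hit[domain] = time.time()
```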
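
As referenced in the URL Canonicalization section, here is a minimal sketch of the canonicalization rules together with the link and text extraction step, assuming Python 3 and Beautiful Soup (`pip install beautifulsoup4`). The helper names are hypothetical, and this is only one acceptable way to implement the rules.

```python
# A minimal sketch of URL canonicalization and document processing,
# assuming Python 3 and Beautiful Soup. Helper names are illustrative.
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

from bs4 import BeautifulSoup


def canonicalize(url, base=None):
    """Apply the canonicalization rules listed in this assignment."""
    if base:
        url = urljoin(base, url)                    # rule 3: make absolute
    scheme, netloc, path, query, _ = urlsplit(url)  # rule 4: drop #fragment
    scheme = scheme.lower()                         # rule 1: lowercase scheme
    netloc = netloc.lower()                         # rule 1: lowercase host
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]                        # rule 2: drop default port
    elif scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]
    path = re.sub(r"/{2,}", "/", path)              # rule 5: collapse //
    return urlunsplit((scheme, netloc, path, query, ""))


def extract_links(html, page_url):
    """Return the canonical URLs of all <a href=...> links on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [canonicalize(a["href"], base=page_url)
            for a in soup.find_all("a", href=True)]


def extract_text_and_title(html):
    """Strip scripts and styles; return (visible text, title or '')."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else ""
    return soup.get_text(separator=" ", strip=True), title
```

Test the canonicalizer against the five examples above (and a few URLs from your own seeds) before starting a long crawl; a subtle bug here silently shrinks or bloats your collection.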
<!--
## Crawler Output

For each page you crawl, you should write the following output.

**Indexable Document Contents**

Produce a single file named `crawl_name.trec`, where “name” is your first initial and last name (e.g. “Tom Cruise” would use `crawl_tcruise.trec`), which contains the documents you have crawled in a TREC-style file. Each document should look something like this:

```
<DOC>
<DOCNO>http://www.example.com/something.html</DOCNO>
<HEAD>The page title</HEAD>
<TEXT>The body text from the document</TEXT>
</DOC>
```

This file will be used later to index the documents.

**Raw Document Contents**

You should also store the document in the WARC format named `crawl_name.warc`, following the same convention for your name. This format preserves the HTTP headers and raw HTTP content. The file should begin with the line:

```
WARC/0.18
```

Each document record consists of a header, a blank line, and the entire raw HTTP response (including HTTP headers). It looks like this:

```
WARC-Type: response
WARC-Target-URI: http://www.example.com/something.html
WARC-Date: 2015-03-69T11:04:55-0700
Content-Length: 11253

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.1.6
Last-Modified: Wed, 07 Jan 2009 10:29:05 GMT
Date: Sat, 07 Feb 2009 14:46:08 GMT
Connection: close
Content-Length: 11020

Page contents...
```

Please set the WARC-Target-URI, WARC-Date, and Content-Length fields in the WARC header appropriately.
-->

## Merging individual crawls

Ideally, the crawling process would send any stored data directly to the team index as you crawl, merging as you go. However, keeping everyone's ES servers connected for the whole crawl is too much of a headache for students, so we allow individual crawls that are merged in ES afterwards.

If you crawl individually and merge at the end, you have to simulate a realistic environment: merge the indexes (or the crawled data) into one ES index. The merge should happen as independent agents: everyone updates the index independently while the ES servers are connected, not in a master-slave or server-client manner. This is team work.

# Link Graph

You should also write a link graph reporting all out-links from each URL you crawl, and all the in-links you have encountered (obviously there will be in-links on the web that you don't discover). This will be used in a future assignment to calculate PageRank for your collection.

- Option 1: We prefer that you store the canonical links as two fields, "inlinks" and "outlinks", in Elasticsearch for each document. You will have to manage these fields appropriately, such that when you are done, your team has correct links for all documents crawled.
- Option 2: Maintain a separate links file (you can do this even if you also do Option 1). Each line of this file contains a tab-separated list of canonical URLs. The first URL is a document you crawled, and the remaining URLs are out-links from the document. When all team members are finished with their crawls, you should merge your link graphs. Only submit one file, containing this merged graph. During the merge process, reduce any URL which was not crawled to just a domain.
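
For Option 2, a minimal sketch of the merge step is shown below. It assumes each team member produced a tab-separated links file in the format described above; the file names and helper names are placeholders. For Option 1, the same merged graph can then be inverted to fill in each document's "inlinks" field, for example with the Elasticsearch update API.

```python
# A minimal sketch of merging per-member link files (Option 2), assuming each
# line has the form: <crawled URL> \t <out-link> \t <out-link> ...
# File names and the domain-reduction helper are placeholders.
from urllib.parse import urlsplit


def load_links(path):
    """Read one member's link file into {crawled_url: set(outlinks)}."""
    graph = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts and parts[0]:
                graph.setdefault(parts[0], set()).update(parts[1:])
    return graph


def reduce_to_domain(url):
    """Collapse an uncrawled URL to just its domain, per the merge rule."""
    pieces = urlsplit(url)
    return "%s://%s/" % (pieces.scheme, pieces.netloc)


def merge(paths, out_path):
    merged = {}
    for path in paths:
        for url, outs in load_links(path).items():
            merged.setdefault(url, set()).update(outs)
    crawled = set(merged)  # URLs that appear as a source were crawled
    with open(out_path, "w", encoding="utf-8") as out:
        for url in sorted(merged):
            outs = {o if o in crawled else reduce_to_domain(o)
                    for o in merged[url]}
            out.write("\t".join([url] + sorted(outs)) + "\n")


# Example with hypothetical file names:
# merge(["links_alice.txt", "links_bob.txt", "links_carol.txt"],
#       "links_merged.txt")
```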
# Vertical Search

Once all team members are finished with their crawls, and all documents are in one Elasticsearch index (using the canonical URL as the document ID for de-duplication), run vertical search.

Create a simple HTML page which runs queries against your Elasticsearch index. You may either write your own interface, use an existing tool such as [Calaca](https://github.com/romansanchez/Calaca) or [FacetView](https://github.com/okfn/facetview), or modify [this one](../../4_eval_userstudy/matt_es_client.zip) used by one of our grad students.

Your search engine should allow users to enter text queries, and display Elasticsearch results for those queries from your index. The result list should contain, at minimum, the URL of each result page. A minimal query sketch you can use to sanity-check results from the command line appears at the end of this page.

Make sure you run several queries on your group's topic, and think about the result quality. During your demo, you will be asked to explain how your seeds and crawls affected the search results.

# Extra Credit

These extra problems are provided for students who wish to dig deeper into this project. Extra credit is meant to be significantly harder and more open-ended than the standard problems. We strongly recommend completing all of the above before attempting any of these problems. Points will be awarded based on the difficulty of the solution you attempt and how far you get. You will receive no credit unless your solution is "at least half right," as determined by the graders.

## EC1: Crawl more documents

Expand your team crawl to 100,000 documents.

## EC2: Frontier Management

Experiment with different techniques for selecting URLs to crawl from your frontier. See the Coverage slides for the Seattle section for some suggestions. Does the selection policy appear to impact the quality of pages crawled?

## EC3: Speed Improvements

*Without violating the politeness policy,* find ways to optimize your crawler. How fast can you get it to run? Do your optimizations change the set of pages you crawl?

## EC4: Search Interface Improvements

Improve meaningfully on your search engine interface. This may include one or more of the following (or your own ideas):

- Instead of just showing URLs, show text snippets containing the query terms from the document.
- Change the visual layout or user interface to make the search engine easier to use, or to make it easier to find what you're looking for.
- Add domain-specific search operators, or other custom search operators.

<!--
### Deliverables

1. Your group should submit a compressed file containing the TREC and WARC formatted files you crawled and the merged link graph from your individual crawls.
2. You should each submit the code for your own crawler.
-->

### Rubric

<dl class="dl-horizontal">
  <dt>10 points</dt><dd>You strictly follow the politeness policy</dd>
  <dt>10 points</dt><dd>You chose reasonable seeds, and understand the impact of seeds on the crawl</dd>
  <dt>20 points</dt><dd>You crawl pages in the correct order</dd>
  <dt>10 points</dt><dd>You correctly canonicalize URLs</dd>
  <dt>10 points</dt><dd>You correctly index with ES</dd>
  <dt>10 points</dt><dd>You merge crawled pages with your teammates</dd>
  <dt>20 points</dt><dd>Your group's vertical search engine works correctly</dd>
  <dt>10 points</dt><dd>You can explain the quality of your search results</dd>
</dl>
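
As referenced in the Vertical Search section, here is a minimal command-line sketch for sanity-checking results in the merged index before wiring up the HTML page. It assumes the Python `elasticsearch` client and the same hypothetical index and field names used in the indexing sketch earlier on this page; substitute your team's actual names.

```python
# A minimal sketch of querying the merged index from the command line,
# assuming the same placeholder index/field names as the indexing sketch.
import sys

from elasticsearch import Elasticsearch

INDEX = "hw3_team_crawl"   # placeholder -- use your team's index name


def search(query, size=10):
    es = Elasticsearch()
    body = {"query": {"match": {"text": query}}, "size": size}
    res = es.search(index=INDEX, body=body)
    for hit in res["hits"]["hits"]:
        # The canonical URL doubles as the document id.
        print("%8.3f  %s" % (hit["_score"], hit["_id"]))


if __name__ == "__main__":
    search(" ".join(sys.argv[1:]) or "example query")
```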