---
date: "2015-02-16T11:47:00-05:00"
draft: true
title: "Homework 5"
menu:
  main:
    parent: "Homework"
---

# Objective

In this assignment, you will represent documents as numerical features and apply machine learning to obtain ranked retrieval lists. The data is the AP89 collection we used for HW1.

# Data

Restrict the data to documents present in the QREL. That is, for each of the 25 queries, only consider documents that have a qrel assessment. You should have about 14193 documents; some docIDs will appear multiple times (once for each query they are assessed for), while most documents will not appear at all. Split the queries randomly into 20 "training" queries and 5 "testing" queries.

# Document-Query Features

The general plan is to build a query-document static feature matrix:

    qid-docid  f1  f2  f3 ... fd  label
    qid-docid  f1  f2  f3 ... fd  label
    ...
    qid-docid  f1  f2  f3 ... fd  label

You can rearrange this matrix into any format required by the training procedure of your learning algorithm.

For each query-document pair, extract the IR features. These are your HW1 and HW2 models, such as BM25, Language Models, Cosine, and Proximity. The matrix cells are the feature values, i.e., the scores given by the IR functions. The label is the qrel relevance value. (A minimal end-to-end sketch appears in the appendix at the end of this page.)

# Train a learning algorithm

Using the static matrix for the "training" queries, train a learner to compute a model relating the labels to the features. You can use a learning library such as [SciPy/NumPy](http://www.scipy.org), [C4.5](https://github.com/scottjulian/C4.5), [Weka](http://www.cs.waikato.ac.nz/ml/weka/), [LibLinear](http://www.csie.ntu.edu.tw/~cjlin/liblinear/), or [SVM Light](http://svmlight.joachims.org). The easiest models are linear regression and decision trees.

# Test the model

For each of the 5 testing queries:

1. Run the model to obtain scores.
2. Treat the scores as coming from an IR function, and rank the documents.
3. Format the results as in HW1.
4. Run trec_eval and report the evaluation as "testing_performance".

# Test the model on training data

Same as for testing, but on the 20 training queries. Run the learned model against the training matrix, compute predictions/scores, rank the documents, and run trec_eval. Report the result as "training_performance".

# Extra Credit

These extra problems are provided for students who wish to dig deeper into this project. Extra credit is meant to be significantly harder and more open-ended than the standard problems. We strongly recommend completing all of the above before attempting any of these problems. Points will be awarded based on the difficulty of the solution you attempt and how far you get. You will receive no credit unless your solution is "at least half right," as determined by the graders.

## EC1: Document static features

For each document in the collection, extract "static" (query-independent) features, such as document length, timestamp, and PageRank. Add these to the feature matrix and rerun the learning algorithm(s).

## EC2: Test on your crawled data

Extract the same features (as in the training matrix) for your evaluated documents (about 200 per query). Run the learned model to obtain predictions, rank the documents accordingly, and evaluate using the qrel produced in HW4 for your crawl.

## EC3: Advanced Learning Algorithms

Run more advanced learning algorithms, such as SVMs or neural networks.

## EC4: Ranking Algorithms

Run learning algorithms with a ranking objective, such as SVM-Rank, RankBoost, or LambdaMART.

## EC5: Topic Models

On the entire AP89 collection of 85K documents, run LDA to obtain topic models. (A minimal sketch also appears in the appendix.)

### Rubric
- 10 points: a proper static matrix setup with feature values and labels
- 20 points: feature values from IR functions, at least 5 columns
- 30 points: successfully training a learning algorithm
- 15 points: evaluation on the training queries
- 15 points: evaluation on the testing queries
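# Appendix: Example Sketches

The sketch below is one possible way to wire the pipeline together; it is not required code, and none of the file or variable names are prescribed by the assignment. It assumes the IR feature scores have already been computed and dumped to a hypothetical tab-separated file `features.tsv` with one row per query-document pair (`qid`, `docid`, the feature scores, and the qrel label last), and it uses a NumPy least-squares fit as the linear regression.

```python
"""Minimal learning-to-rank sketch: linear regression via NumPy least squares.

Assumes a hypothetical file `features.tsv` with tab-separated lines:
    qid  docid  f1  f2  ...  fd  label
"""
import random
from collections import defaultdict

import numpy as np


def load_matrix(path):
    """Read the feature matrix into parallel lists: (qid, docid) pairs, features, labels."""
    ids, feats, labels = [], [], []
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            qid, docid = parts[0], parts[1]
            *features, label = map(float, parts[2:])
            ids.append((qid, docid))
            feats.append(features)
            labels.append(label)
    return ids, np.array(feats), np.array(labels)


def write_run(path, ids, scores, tag="lr_model"):
    """Write scores in the trec_eval run format: qid Q0 docid rank score tag."""
    by_query = defaultdict(list)
    for (qid, docid), score in zip(ids, scores):
        by_query[qid].append((docid, score))
    with open(path, "w") as out:
        for qid, docs in by_query.items():
            docs.sort(key=lambda d: d[1], reverse=True)  # rank by descending score
            for rank, (docid, score) in enumerate(docs, start=1):
                out.write(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}\n")


ids, X, y = load_matrix("features.tsv")

# Split by query, not by row: 20 "training" queries and 5 "testing" queries.
queries = sorted({qid for qid, _ in ids})
random.seed(0)
random.shuffle(queries)
train_q, test_q = set(queries[:20]), set(queries[20:])
train_rows = [i for i, (qid, _) in enumerate(ids) if qid in train_q]
test_rows = [i for i, (qid, _) in enumerate(ids) if qid in test_q]

# Ordinary least squares: append a bias column and solve for the weight vector.
A_train = np.hstack([X[train_rows], np.ones((len(train_rows), 1))])
w, *_ = np.linalg.lstsq(A_train, y[train_rows], rcond=None)

# Score both splits with the learned weights and write trec_eval run files.
for name, rows in (("training", train_rows), ("testing", test_rows)):
    A = np.hstack([X[rows], np.ones((len(rows), 1))])
    write_run(f"{name}_run.txt", [ids[i] for i in rows], A @ w)
```

Each run file can then be scored with `trec_eval <qrel-file> testing_run.txt` (and likewise for the training run) to obtain "testing_performance" and "training_performance".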
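For EC5, a minimal topic-model sketch is shown below, assuming the gensim library (one possible choice; the assignment does not prescribe a tool) and assuming the AP89 documents have already been tokenized into lists of terms.

```python
# Minimal LDA sketch with gensim (an assumed library choice for EC5).
# `docs` should hold all ~85K tokenized AP89 documents; the two lists below
# are only placeholders so the snippet runs end to end.
from gensim import corpora, models

docs = [["oil", "prices", "rose"], ["senate", "vote", "bill"]]

dictionary = corpora.Dictionary(docs)            # term <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=5)

# Inspect a few topics and the topic mixture of the first document; the
# per-document mixtures could also be added as columns in the feature matrix.
print(lda.print_topics(num_topics=5, num_words=10))
print(lda.get_document_topics(corpus[0]))
```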