Shahzad K. Rajput, Virgil Pavlu, Peter B. Golbus, and Javed A. Aslam. A Nugget-based Test Collection Construction Paradigm. Submitted to the 20th ACM CIKM, 2011, Glasgow, UK.
The problem of building test collections is central to the development of information retrieval systems such as search engines. The primary use of test collections is the evaluation of IR systems. The widely employed "Cranfield paradigm" dictates that the information relevant to a topic be encoded at the level of documents, therefore requiring effectively complete document relevance assessments. As this is no longer practical for modern corpora, numerous problems arise, including scalability, reusability, and applicability.
We propose a new method for relevance assessment based on relevant information, not relevant documents. Once the relevant information is collected, any document can be assessed for relevance, and any retrieved list of documents can be assessed for performance. Starting with a few relevant "nuggets" of information manually extracted from existing TREC corpora, we implement and test a method that finds and correctly assesses the vast majority of relevant documents found by TREC assessors, as well as up to four times more additional relevant documents. We then show how these inferred relevance assessments can be used to perform IR system evaluation. Our main contribution is a methodology for producing test collections that are highly accurate, more complete, scalable, reusable, and can be generated with similar amounts of effort as existing methods, with great potential for future applications.
The revised paper contains results on ClueWeb09, a web-based corpus with a large and diverse set of documents. We also validate that a small number of nuggets can cover such a large and diverse collection.
We extract nuggets from a sample (the training set) and then, using those nuggets, infer the relevance of documents outside that sample (the test set). The test set need not be a fixed set of documents: new documents may be added and older documents may be modified. Our method therefore handles dynamic collections by design.
Our methodology works under the assumption that a large number of documents may contain the nugget(s). By design, the matching algorithm matches a nugget to a document even when the nugget does not appear in the document verbatim. This allows us to infer a large number of non-duplicate documents as relevant. This assumption has been validated by a user study, and the argument has been included in the paper.
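As an illustration of this non-verbatim matching, the minimal Python sketch below scores a document by the fraction of a nugget's terms it contains and infers the document as relevant when any nugget scores above a threshold. The function names and the 0.5 threshold are hypothetical; this is not the matching algorithm used in the paper.

    import re

    def tokenize(text):
        # Lowercase and split on non-alphanumeric characters.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def match_score(nugget, document):
        # Fraction of the nugget's terms that also occur in the document.
        # A high score does not require the nugget to appear verbatim.
        nugget_terms = tokenize(nugget)
        if not nugget_terms:
            return 0.0
        return len(nugget_terms & tokenize(document)) / len(nugget_terms)

    def infer_relevant(nuggets, document, threshold=0.5):
        # A document is inferred relevant if any nugget matches strongly enough.
        # Because the nugget set is fixed, newly added or modified documents
        # can be assessed at any time, which is how dynamic collections are handled.
        return any(match_score(n, document) >= threshold for n in nuggets)

    nuggets = ["tropical storm caused flooding in coastal areas"]
    doc = "Coastal areas were flooded after the tropical storm made landfall."
    print(infer_relevant(nuggets, doc))   # True, with no verbatim match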
Assuming TREC assessors spent one minute per document, creating the entire TREC-8 qrel took about 36 man-weeks. SampleAdHoc, which is about 11% the size of the entire TREC-8 qrel, by proportion required about 4 man-weeks for binary relevance assessments. For the relevant documents found in the sample, we spent an additional 2.1 man-weeks extracting nuggets; thus the total human effort required for our method on SampleAdHoc is about 6.2 man-weeks.
Under the same assumption, TREC spent about 11 man-weeks creating the entire ClueWeb09 qrel. SampleWeb, which is about 38% the size of the entire ClueWeb09 qrel, required about 4 man-weeks. Nugget extraction from the relevant documents in the sample took another 1.6 man-weeks, for a total human effort on SampleWeb of about 5.6 man-weeks.
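For concreteness, the man-week arithmetic above assumes one minute per document and a 40-hour work week; the short sketch below reproduces it, with the document count back-derived from the stated 36 man-weeks rather than taken from the TREC-8 data.

    MINUTES_PER_WEEK = 40 * 60    # assuming a 40-hour work week

    def man_weeks(num_docs, minutes_per_doc=1.0):
        # Man-weeks of assessor effort to judge num_docs documents.
        return num_docs * minutes_per_doc / MINUTES_PER_WEEK

    full_qrel = man_weeks(86_400)         # ~36 man-weeks for the full qrel
    sample_judging = 0.11 * full_qrel     # ~4 man-weeks for the 11% sample
    print(full_qrel, round(sample_judging, 1))   # 36.0 4.0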
Argument added to the paper.
Plot added to the paper.
The top systems are underestimated mainly for one reason: some of the unique relevant documents that these systems brought into the pool are inferred not relevant because of a missing aspect. The argument has been added to the paper.
The comparison has been added to the paper.
The majority of the typos in the paper have been fixed.
"It is addressing an important an interesting topic, but I just found it very hard to understand at the correct level of detail in important places."
"It is a straightforward idea with good ultimate performance, but the paper would be stronger with more analysis and discussion of some issues."
"Surprisingly good results from a simple method, and a paper which would be likely to stimulate lots of discussion. More discussion and analysis would strengthen this submission."
Complete reviews can be seen here.
This paper proposes a new paradigm for assessing and encoding relevant information in Information Retrieval, with applications to search engine training and evaluation. Shahzad was the lead author on this work: he contributed a majority of the ideas, he conducted all of the experiments, and he contributed a majority of the writing. The experiments, in particular, involved the creation of a user study, the implementation of a user interface for this study, and an extensive analysis of the results obtained, all conducted entirely by Shahzad.
This paper represents quality publishable work, in my opinion. The paper was submitted to SIGIR'11, where it was not accepted; however, SIGIR is the premier venue for IR research and one of the most difficult conferences in which to be published. (SIGIR has a historical average acceptance rate of 18%.) The paper will be resubmitted to CIKM'11, and I have no doubt that it will be published at some point.
Given the above, I believe that this paper demonstrates Shahzad's research potential, in terms of ideas, execution, and writing.
-- Prof. Javed A. Aslam