------------- Review from Reviewer 1 -------------

Relevance to SIGIR (1-5, accept threshold=3): 4
Originality of work (1-5, accept threshold=3): 3
Quality of work (1-5, accept threshold=3): 2
Quality of presentation (1-5, accept threshold=3): 2
Impact of Ideas or Results (1-5, accept threshold=3): 2
Adequacy of Citations (1-5, accept threshold=3): 3
Recommendation (1-6): 2
Confidence in review (1-4): 3

-- Comments to the author(s):

Summarizing the reviews, the paper covers too much ground and is hard to follow, and is possibly not very novel.

-- Summary:

Summarizing the reviews, the paper covers too much ground and is hard to follow, and is possibly not very novel.

---------- End of Review from Reviewer 1 ------------

----------- Review from Reviewer 2 -------------

Relevance to SIGIR (1-5, accept threshold=3): 5
Originality of work (1-5, accept threshold=3): 4
Quality of work (1-5, accept threshold=3): 3
Quality of presentation (1-5, accept threshold=3): 2
Impact of Ideas or Results (1-5, accept threshold=3): 4
Adequacy of Citations (1-5, accept threshold=3): 3
Recommendation (1-6): 3
Confidence in review (1-4): 3

-- Comments to the author(s):

This paper describes the use of short, stop-worded strings as the unit of information for IR relevance assessment, and therefore for IR evaluation. It addresses an important and interesting topic, but I just found it very hard to understand at the correct level of detail in important places. This made it very difficult to review - I spent very much longer on this paper than I do on a typical SIGIR paper, but was nonetheless still left uncertain about important details and therefore couldn't fully interpret and assess the results. It is possible that the root of the problem is that the authors are simply trying to cover too much ground for a short conference paper. Perhaps a journal paper with a much longer methodology section is what is needed.

The authors should have referenced some related work in XML retrieval, e.g. Gövert, N., Fuhr, N., Lalmas, M., and Kazai, G. Evaluating the effectiveness of content-oriented XML retrieval methods. Information Retrieval 9(6):699-722, Springer Netherlands, December 2006. http://dx.doi.org/10.1007/s10791-006-9008-2 - or a more recent paper by Geva and Trotman.

The authors are sometimes rather naive in their discussion of ambiguity - e.g. "Barack Obama" is not just the name of the current US president, but also of his father, a distinguished Kenyan, amongst others.

There appear to be some strange typeface changes - e.g. "assesses" in Section 1, para 3. Google result counts are suspect, especially for large numbers, and should be reported as such. The authors need to ensure Figure 1 renders well in monochrome.

In the Introduction the authors need to make clear the nature of their nuggets - that they are text strings, rather than, e.g., formal ontological constructs. (If this is not the case, it reinforces the earlier point about the difficulty of understanding the paper.)

Nugget extraction is an inherently error-prone process. Nugget variation generation is also an error-prone process. Both of these errors will produce confounding effects in the experimental results. The authors should clearly acknowledge this.
Section 2, para 2 (headed "Matching text"): insert "The" before "shingling". I feel your use of the term "semantic matching" is misleading; so far as I can determine, this is text matching over stop-worded phrases.

I find Section 2.4 very hard to understand with any confidence - and a detailed understanding is needed to interpret the results in the later sections. I suspect part of the problem is a failure to distinguish "real" human relevance scores from putative automatically calculated relevance scores. This compounds a more general failure to carefully and clearly distinguish automatic and manual processing.

Further, I feel there are some problems from linguistics which will be masked by the authors' nugget extraction processes - consider "rocks from the moon" versus "rocks which look like the moon" (artificial examples, I acknowledge). The authors' nugget extraction process, so far as I can see, will precisely conflate these two queries, whereas the real potential for nugget processing lies in distinguishing them (see the sketch below).
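To make the conflation point concrete, here is a minimal sketch of stop-worded phrase matching. The stopword list, the content-word-overlap rule, and the match threshold are assumptions made purely for illustration; they are not the authors' actual procedure, which the paper does not fully specify.

# Minimal illustration of the conflation concern: once stopwords are removed
# and phrases are compared by content-word overlap, two semantically different
# phrases can collapse into the same "nugget". The stopword list and overlap
# rule below are assumptions for this example only.

STOPWORDS = {"the", "a", "an", "from", "which", "like", "of", "to", "in"}

def content_words(phrase):
    return {w for w in phrase.lower().split() if w not in STOPWORDS}

def overlap(a, b):
    """Jaccard overlap of content-word sets."""
    wa, wb = content_words(a), content_words(b)
    return len(wa & wb) / len(wa | wb)

q1 = "rocks from the moon"
q2 = "rocks which look like the moon"

print(sorted(content_words(q1)))   # ['moon', 'rocks']
print(sorted(content_words(q2)))   # ['look', 'moon', 'rocks']
print(round(overlap(q1, q2), 2))   # 0.67 -- above a typical match threshold,
                                   # so the two phrases would be treated as
                                   # the same nugget

Under these (assumed) settings the two phrases are indistinguishable, which is exactly the distinction a nugget-based approach ought to be able to make.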
In Table 3 I had expected from the description that the total of the count column would be 200, but it is higher, and I don't understand why. Can a document be in more than one of the categories 3 to 12? This needs more explanation in the text.

Section 3, para 2: I think "alike" should be "like". At this point I decided I didn't understand the process of producing "inferred nuggets" as used to report the results. I suspect that in Section 2 I got confused between things you did in the past but which didn't work, things you plan to do in the future, and the system set-up used to report the results. Please distinguish these three carefully.

Section 3, last paragraph: I think "widely" should be "wildly".

Section 4 is in principle interesting, but is too brief to add value to the paper in its current form.

The references are littered with typographical and other errors, e.g. "development in information retrieval" should be capitalised; "Garcia-molina" should be "Garcia-Molina"; "trec" should be "TREC"; references 22, 24 and 29 are incomplete; "svms" should be "SVMs".

-- Summary:

Some interesting ideas, but the paper is not well enough written for SIGIR.

---------- End of Review from Reviewer 2 -----------

------------ Review from Reviewer 3 -------------

Relevance to SIGIR (1-5, accept threshold=3): 5
Originality of work (1-5, accept threshold=3): 2
Quality of work (1-5, accept threshold=3): 2
Quality of presentation (1-5, accept threshold=3): 3
Impact of Ideas or Results (1-5, accept threshold=3): 2
Adequacy of Citations (1-5, accept threshold=3): 2
Recommendation (1-6): 2
Confidence in review (1-4): 4

-- Comments to the author(s):

This paper proposes an IR evaluation method based on nugget extraction from sampled documents. The authors' overall arguments are, however, not very convincing. More specifically, I believe there are several problems in the paper which make me think that it is not yet ready for publication.

MAJOR PROBLEMS:

1. The authors argue as if nugget-based IR evaluation is completely new. They do briefly acknowledge nugget-based evaluation for QA and alpha-nDCG, but quickly dismiss them (Section 2.2, last para). They also fail to discuss NRBP, nugget-based evaluation in summarization, and the following SIGIR 2004 paper: Einat Amitay, David Carmel, Ronny Lempel, and Aya Soffer. Scaling IR-system evaluation using term relevance sets. SIGIR 2004. Instance recall from the TREC interactive track is also relevant. The paper (for example the last para of Section 5) suggests that the authors are not aware of these existing studies.

2. The second problem, I think, is more serious, and it has to do with the authors' basic arguments and methodology. The authors argue (Sections 1 and 2) that document-relevance-based evaluation cannot handle the Barack Obama query with 65 million hits. Probably true. However, the paper does NOT show that nugget-based evaluation can handle this situation. Are all of the 65 million pages just redundant copies of his biography page? I don't think so. I don't see any evidence in this paper that a small number of nuggets can cover a large and diverse set of documents. (I agree that a small number of nuggets can cover a large and *redundant* set of documents.) The authors further argue that document-relevance-based evaluation cannot handle dynamic collections (Section 1). Again, is nugget-based evaluation any different? As new documents are added, nuggets may become incomplete, even if they are initially complete. (I do agree that retaining nuggets rather than the actual relevant documents may be useful for reusability, though.)

Now, what the authors have done boils down to: (1) construct nuggets from a sample of relevant documents; (2) obtain "pseudorelevant" documents by matching the nuggets against unknown documents; and (3) evaluate with the expanded qrels using a traditional IR metric (see the sketch below). This does not really make sense to me, because the expanded set is largely a collection of redundant documents. If the authors already have nuggets, why reward retrieval of redundant documents rather than evaluate novelty etc.? Why not use the nuggets with alpha-nDCG, for example?
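As a reading aid, here is a minimal sketch of that three-step pipeline as described above. extract_nuggets() and matches() are placeholders for the authors' (unspecified) procedures, and the toy data at the end are invented purely for illustration; none of this is the authors' actual implementation.

# Sketch of the pipeline summarised above: (1) extract nuggets from a sample
# of judged-relevant documents, (2) find "pseudorelevant" documents by
# matching nuggets against the rest of the collection, and (3) score a run
# against the expanded qrels with a traditional metric (average precision
# here). extract_nuggets() and matches() are placeholders.

def expand_qrels(sampled_relevant, all_docs, extract_nuggets, matches):
    """sampled_relevant and all_docs both map doc_id -> document text."""
    nuggets = set()
    for text in sampled_relevant.values():                # step 1
        nuggets |= extract_nuggets(text)
    pseudorelevant = {doc_id for doc_id, text in all_docs.items()
                      if any(matches(n, text) for n in nuggets)}   # step 2
    return set(sampled_relevant) | pseudorelevant          # expanded qrels

def average_precision(ranked_doc_ids, qrels):               # step 3
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in qrels:
            hits += 1
            total += hits / rank
    return total / max(len(qrels), 1)

# Toy demo with deliberately simple placeholder procedures.
def toy_extract(text):
    return {tuple(text.lower().split()[:3])}   # first three words as a "nugget"

def toy_match(nugget, text):
    return " ".join(nugget) in text.lower()

sampled = {"d1": "moon rocks were returned by the Apollo missions"}
collection = {"d1": "moon rocks were returned by the Apollo missions",
              "d2": "another page noting that moon rocks were collected on Apollo flights",
              "d3": "an unrelated page about cheese"}

qrels = expand_qrels(sampled, collection, toy_extract, toy_match)
print(sorted(qrels))                                  # ['d1', 'd2']: d2 repeats the nugget text
print(average_precision(["d2", "d1", "d3"], qrels))   # 1.0

As the toy run suggests, what the expansion rewards is repetition of the nugget text - which is exactly the redundancy concern raised above.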
Moreover, since the authors are claiming that nugget-based evaluation is more cost-effective than document-relevance-based evaluation, I think there should be more discussion of the cost-effectiveness. Four man-weeks of nugget extraction for 50 queries (Section 2.3): okay, but what would have happened if the same amount of manpower had been used for traditional relevance assessments? Moreover, the proposed methods underestimate the TREC top performers in Figures 4 and 5. (Yes, the inferred docs alleviate the effect, but the methods are still underestimating.) Is this really acceptable?

MINOR PROBLEMS:

- The paper needs to be reorganized. Section 1.2 discusses previous work, but so does 2.2. Section 3 has a long "Section 3.0" that really should be 3.1.
- Figure 6: shouldn't you compare "nugget MAP (excluding 10 systems)" with "TREC MAP"? The stability shown in Figure 6 (right) is not so surprising, as you are removing docs and not removing nuggets directly.

OTHER COMMENTS:

- Table 3 is very interesting. It is useful to know the limitations (as well as the advantages) of nugget-based evaluation.
- Regarding the underestimation of top performers (Figures 4 and 5), efforts on ranking systems without relevance assessments may be worth looking into, e.g. Ian Soboroff, Charles Nicholas, and Patrick Cahan. Ranking retrieval systems without relevance judgments. SIGIR 2001.
- The learning-to-rank experiments in Section 4 are more convincing than the rest of the paper. It is great if pseudorels can replace the true rels.

-- Summary:

This paper proposes an IR evaluation method based on nugget extraction from sampled documents. The authors' overall arguments are, however, not convincing. There are interesting elements though, such as the learning-to-rank results.

---------- End of Review from Reviewer 3 ------------

----------- Review from Reviewer 4 -------------

Relevance to SIGIR (1-5, accept threshold=3): 5
Originality of work (1-5, accept threshold=3): 3
Quality of work (1-5, accept threshold=3): 4
Quality of presentation (1-5, accept threshold=3): 4
Impact of Ideas or Results (1-5, accept threshold=3): 4
Adequacy of Citations (1-5, accept threshold=3): 3
Recommendation (1-6): 4
Confidence in review (1-4): 3

-- Comments to the author(s):

This submission details experiments with a simple "nugget"-based method of inferring relevance judgements. It is a straightforward idea with good ultimate performance, but the paper would be stronger with more analysis and discussion of some issues.

The idea is not as new as is claimed: comparisons are fair with facets; with QA in TREC; and with the INEX judging interface (which marked spans). See also, for example, alpha-nDCG and work on test collection reusability. More comparison and discussion of this existing work, to point out similarities and differences, would help readers understand what's novel here. The three problems in s1.1 have also been considered in the past:
- on scalability: inferred measures help here. Judging ClueWeb with volunteer labour suggests they're doing at least ok.
- Soboroff has written on reusability of dynamic test collections.
(Also, in s1.1: graded relevance predates web search by decades! The Cranfield experiments used it, for example.)

*Effectively* complete is the key word for past efforts, and it's important here too. In phase 1, might we still need to look at lots of documents to be sure we have enough of the "nuggets"? In phase 2, it's noted that the set of "relevant" documents is significantly larger than otherwise collected; this may or may not be a good thing! Some discussion here would be interesting. On a similar note, the judging times presented in s2.3 are hard to interpret as they stand; this is a bit over one document per minute, but how fast are other methods? Comparing this to the time taken to judge a full pool isn't really fair, since there are established alternatives.

The tests described in s2 are appropriate but the results are hard to interpret. The "steep initial slope" of fig 3 is a result of the scale being different on each axis! It seems to show 1/3rd of the marked documents as non-relevant up to about 300 total, then as much as 4/9ths by 450 total. But it's not clear: is this ok? Certainly it's not obvious from this alone that it's "good enough" (and relevance feedback is very variable, so it's an odd baseline). The failure analysis is nice to see, though; and the analysis in s3 suggests comparison with past studies of inter-assessor agreement (which may provide an upper bound).

Overall, the results are surprisingly good -- even remarkably good. Could this be because nuggets tend to be repeated a lot? Some words on the nugget distribution would be interesting. This might also tell us something about the rate of discovery of new nuggets, which would speak to the scalability of the method.

A few small notes:
- the caption for Table 1 is incomplete.
- the labels for Figs 4 and 5 are too small to read comfortably.
- Figs 4 and 5 report rho, which is ok but is not "linear correlation"; it's the degree of fit to a monotone, which can be non-linear (see the illustration after this list).
- Fig 6 has neither red nor blue on most printers or photocopiers.
- there are a few typos (e.g. "Trec") and missing details (e.g. [3]) in the reference list. A close proofreading would help.
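To illustrate the rho point in the list above: Spearman's rho is a rank correlation, so it measures agreement with any monotone relationship rather than a linear one. A small self-contained example follows; the numbers are invented purely for illustration and have nothing to do with the paper's data.

# Spearman's rho is computed on ranks, so it rewards any monotone
# relationship, not a linear one. Below, y is a monotone but strongly
# non-linear function of x: Spearman's rho is exactly 1.0 while the
# Pearson (linear) correlation is clearly below 1. Invented data.

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))   # no tied values in this example

x = [1, 2, 3, 4, 5, 6]
y = [v ** 5 for v in x]                  # monotone but far from linear

print(round(spearman(x, y), 3))          # 1.0
print(round(pearson(x, y), 3))           # ~0.86

Describing rho as "linear correlation" in the captions therefore overstates what Figures 4 and 5 show.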
"Trec") and missing details (e.g. [3]) in the reference list. A close proofreading would help. -- Summary: Surprisingly good results from a simple method, and a paper which would be likely to stimulate lots of discussion. More discussion and analysis would strengthen this submission. ---------- End of Review from Reviewer 4 ----------