This will be a closed-book, half-hour quiz.
Question 1. This is about the Boolean Model, Sec. 2.5.2.
Consider the four terms, in order: park, mountain, trails, difficult. Assume that the query, in disjunctive normal form (DNF), is the following, where "OR" is the logical disjunction operator:
Q = (1,0,1,0) OR (0,1,1,0) OR (1,1,1,0)
You'll be asked to write an English language description of this, which could be a straightforward translation from DNF: "Search for a document containing park and trails, but neither mountain nor difficult. Or, search for a document containing mountain and trails, but neither park nor difficult. Or, search for a document containing park and mountain and trails, but not difficult."
Another way of saying this is that no retrieved document may contain difficult, every retrieved document must contain trails, and every retrieved document must contain park or mountain or both. This latter description is not in DNF, but it is easier to understand.
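The equivalence of the two descriptions can be checked by brute force. The following is a minimal sketch, not part of the quiz; the names DNF_QUERY, matches_dnf and matches_simplified are made up for illustration. It tries all 16 possible term vectors and compares the DNF query with the simpler description.

```python
from itertools import product

# Term order follows the quiz: (park, mountain, trails, difficult).
# Each tuple is one conjunctive component of the DNF query Q.
DNF_QUERY = {(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 1, 0)}


def matches_dnf(vector):
    """True if the binary term vector equals one of the conjunctive components."""
    return tuple(vector) in DNF_QUERY


def matches_simplified(vector):
    """True if trails is present, difficult is absent, and park or mountain is present."""
    park, mountain, trails, difficult = vector
    return bool(trails and not difficult and (park or mountain))


# Compare the two descriptions on all 2**4 = 16 possible term vectors.
all_vectors = list(product((0, 1), repeat=4))
equivalent = all(matches_dnf(v) == matches_simplified(v) for v in all_vectors)
print(equivalent)  # prints True
```

Because the two functions agree on every possible vector, the simpler description retrieves exactly the same documents as the DNF query.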
Now consider the result of applying the query to the following two (tiny) documents. Which of the two are retrieved, if either? Explain briefly how you arrived at your conclusions.
Document 1: "Loon park contains a lovely lake and is near Mystery mountain. It's not difficult to get to from the city."
Document 2: "The Mystery mountain area has many easy trails, but no difficult ones."
Answer: Neither will be retrieved, because they both contain difficult. (Document 1 also lacks trails.) Oddly, the second one contains difficult only in a negated form, but essentially no retrieval systems can take negation in English into account. The intent of the query was probably to find a park or mountain without difficult trails. But finding just what you want is not easy! Experimenting with Google shows that even when +difficulty is included, phrases such as "Difficulty level: Easy" appear. Not easy!
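The retrieval decision can be sketched in a few lines. This is only an illustration under simple assumptions: words are matched exactly after lowercasing, with no stemming, and the names TERMS, term_vector and retrieved are made up here.

```python
import re

# Term order follows the quiz: (park, mountain, trails, difficult).
TERMS = ("park", "mountain", "trails", "difficult")
DNF_QUERY = {(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 1, 0)}


def term_vector(text):
    """Binary vector: 1 if the index term occurs as a whole word in the text, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return tuple(int(term in words) for term in TERMS)


def retrieved(text):
    """A document is retrieved only if its vector equals one DNF component."""
    return term_vector(text) in DNF_QUERY


doc1 = ("Loon park contains a lovely lake and is near Mystery mountain. "
        "It's not difficult to get to from the city.")
doc2 = "The Mystery mountain area has many easy trails, but no difficult ones."

for name, doc in (("Document 1", doc1), ("Document 2", doc2)):
    print(name, term_vector(doc), retrieved(doc))
# Document 1 (1, 1, 0, 1) False
# Document 2 (0, 1, 1, 1) False
```

Both vectors have a 1 in the last position (difficult), so neither matches any component of Q. The Boolean model sees only whether a word is present, which is why the negation in the English text has no effect.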
Question 2. This is about the Vector Model, Sec. 2.5.3.
I will NOT give you equation 2.1 or 2.3. You have to remember them. If you understand them and practice doing computations with them, you should easily be able to remember them.
Assume you index the terms "Mars", "landed" and "rover" in the following document:
Document = "After a successful landing on Mars, the Mars rover Opportunity landed on a Mars plain in Meridiani section of Mars. The ship landed at an excellent landing spot."
Assume that the total collection contains 64 documents, of which 16 contain "Mars", 4 contain "landed" and 8 contain "rover". Using these, compute the three weight vector components for the document. Ignore the stop words: the, a, an, of, on, in and at. Use lg = log2.
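Answer: The highest frequency word is Mars, with 4 occurrences. The absolute frequencies of the others are landed (2) and rover (1). Assuming each weight is the normalized term frequency (the term's frequency divided by 4, the frequency of the most frequent word) times lg(64/n), where n is the number of documents containing the term, this gives tf-idf factors of: Mars, (4/4) × lg(64/16) = 1 × 2 = 2; landed, (2/4) × lg(64/4) = 0.5 × 4 = 2; rover, (1/4) × lg(64/8) = 0.25 × 3 = 0.75.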
Note that it is just a coincidence that a keyword, Mars, has the highest absolute frequency in the document.
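The arithmetic for this answer can be sketched as follows. This is a hedged illustration, assuming the weight is the normalized term frequency (frequency divided by the frequency of the most frequent non-stop word in the document) times lg(N/n), and assuming exact word matching with no stemming, so "landing" is not counted as "landed". The variable names are made up for the example.

```python
import re
from collections import Counter
from math import log2

STOP_WORDS = {"the", "a", "an", "of", "on", "in", "at"}
N = 64  # total number of documents in the collection
DOC_FREQ = {"mars": 16, "landed": 4, "rover": 8}  # documents containing each term

document = ("After a successful landing on Mars, the Mars rover Opportunity "
            "landed on a Mars plain in Meridiani section of Mars. "
            "The ship landed at an excellent landing spot.")

# Lowercase, split into words, and drop the stop words.
words = [w for w in re.findall(r"[a-z]+", document.lower()) if w not in STOP_WORDS]
freq = Counter(words)

# Normalize by the most frequent word in the document, keyword or not.
max_freq = max(freq.values())

# weight = (freq / max_freq) * lg(N / n), the assumed tf-idf form
weights = {term: (freq[term] / max_freq) * log2(N / n) for term, n in DOC_FREQ.items()}
print(weights)  # {'mars': 2.0, 'landed': 2.0, 'rover': 0.75}
```

The maximum is taken over every non-stop word, so if some unindexed word had occurred more often than Mars, all three weights would shrink accordingly.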