(1) Document Corpus and Query
DOCUMENTS:
d1: for english language model retrieval have a relevance model while vector space model retrieval dont
d2: R-precision measure is relevant to average precision measure
d3: most of the retrieval models are language model and vector space model
d4: english is the most efficient language
d5: retrieval efficiency is measured by average precision and relevance
==============================================================================
TERM by DOCUMENT TABLE (ignoring stopwords)

             d1   d2   d3   d4   d5 | term in corpus
english       1    0    0    1    0 |   2
language      1    0    1    1    0 |   3
model         3    0    3    0    0 |   6
retrieval     2    0    1    0    1 |   4
relevance     1    1    0    0    1 |   3
vector        1    0    1    0    0 |   2
space         1    0    1    0    0 |   2
R             0    1    0    0    0 |   1
most          0    0    1    1    0 |   2
efficient     0    0    0    1    1 |   2
measure       0    2    0    0    1 |   3
average       0    1    0    0    1 |   2
precision     0    2    0    0    1 |   3
DOC LENGTH   10    7    8    4    6 |  35
==============================================================================
T = 35   (total tokens in the corpus)
D = 5    (number of documents)
U = 13   (unique terms)
avg_doc_length = 35/5 = 7
==============================================================================
QUERY:
"efficient retrieval model efficient"
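The bookkeeping in section (1) can be verified mechanically. The sketch below (plain Python, not part of the original exercise; the per-document term counts are transcribed from the table above) recomputes the document lengths and the corpus statistics T, D, U, and the average document length:

```python
# Per-document term counts, transcribed from the term-by-document table above.
docs = {
    "d1": {"english": 1, "language": 1, "model": 3, "retrieval": 2,
           "relevance": 1, "vector": 1, "space": 1},
    "d2": {"R": 1, "precision": 2, "measure": 2, "relevance": 1, "average": 1},
    "d3": {"most": 1, "model": 3, "retrieval": 1, "language": 1,
           "vector": 1, "space": 1},
    "d4": {"english": 1, "most": 1, "efficient": 1, "language": 1},
    "d5": {"retrieval": 1, "efficient": 1, "measure": 1, "average": 1,
           "precision": 1, "relevance": 1},
}

doc_len = {d: sum(tf.values()) for d, tf in docs.items()}  # tokens per document
T = sum(doc_len.values())                 # total tokens in the corpus
D = len(docs)                             # number of documents
U = len(set().union(*docs.values()))      # unique (non-stopword) terms
avg_len = T / D

print(doc_len)           # {'d1': 10, 'd2': 7, 'd3': 8, 'd4': 4, 'd5': 6}
print(T, D, U, avg_len)  # 35 5 13 7.0
```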
(2) Vector Space Model

RAW TF
BINARY QUERY WEIGHTS (i.e. a term either occurs in the query or not)
Dot Product Similarity

             d1   d2   d3   d4   d5   QUERY
model         3    0    3    0    0     1
retrieval     2    0    1    0    1     1
efficient     0    0    0    1    1     1
score         5    0    4    1    2
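The raw-TF dot products are easy to reproduce. A minimal sketch (only the three query terms matter, so the document vectors are restricted to them):

```python
# Raw-TF document vectors restricted to the query terms, and binary query
# weights (each distinct query term gets weight 1, however often it occurs).
doc_tf = {
    "d1": {"model": 3, "retrieval": 2},
    "d2": {},
    "d3": {"model": 3, "retrieval": 1},
    "d4": {"efficient": 1},
    "d5": {"retrieval": 1, "efficient": 1},
}
query = ["efficient", "retrieval", "model", "efficient"]
q_weights = {t: 1 for t in query}  # binary: the duplicate collapses to weight 1

scores = {d: sum(tf.get(t, 0) * w for t, w in q_weights.items())
          for d, tf in doc_tf.items()}
print(scores)  # {'d1': 5, 'd2': 0, 'd3': 4, 'd4': 1, 'd5': 2}
```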
==============================================================================
ROBERTSON's TF = TF/(TF+k), k=1
BINARY QUERY WEIGHTS
Dot Product Similarity

             d1    d2    d3    d4    d5   QUERY
model        3/4    0   3/4     0     0     1
retrieval    2/3    0   1/2     0   1/2     1
efficient     0     0    0    1/2   1/2     1
score      17/12    0   5/4   1/2    1
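Because Robertson's TF produces small fractions, it is worth computing them exactly. A sketch using exact rational arithmetic (same restricted document vectors as before):

```python
from fractions import Fraction

# Raw TF of the query terms in each document, as in the table above.
doc_tf = {
    "d1": {"model": 3, "retrieval": 2},
    "d2": {},
    "d3": {"model": 3, "retrieval": 1},
    "d4": {"efficient": 1},
    "d5": {"retrieval": 1, "efficient": 1},
}
k = 1

def robertson_tf(tf):
    """Robertson's saturating TF: TF / (TF + k)."""
    return Fraction(tf, tf + k)

q_terms = ["model", "retrieval", "efficient"]  # binary query weights
scores = {d: sum(robertson_tf(tf[t]) for t in q_terms if t in tf)
          for d, tf in doc_tf.items()}
print(scores["d1"])  # 17/12  (= 3/4 + 2/3)
```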
==============================================================================
OKAPI TF = TF/[TF + k + c*(doclen/avglen)], k=0.5, c=1.5
BINARY QUERY WEIGHTS
Dot Product Similarity

             d1    d2    d3    d4    d5   QUERY
model       0.53    0   0.57    0     0     1
retrieval   0.43    0   0.31    0   0.36    1
efficient     0     0    0    0.42  0.36    1
score       0.96    0   0.88  0.42  0.72
==============================================================================
OKAPI TF = TF/[TF + k + c*(doclen/avglen)], k=0.5, c=1.5
TF QUERY WEIGHTS ("efficient" occurs twice in the query)
Dot Product Similarity

             d1    d2    d3    d4    d5   QUERY
model       0.53    0   0.57    0     0     1
retrieval   0.43    0   0.31    0   0.36    1
efficient     0     0    0    0.42  0.36    2
score       0.96    0   0.88  0.84  1.08
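Both Okapi variants (binary vs. TF query weights) differ only in the weight vector. A sketch covering the two (the scores match the hand-rounded tables to within rounding):

```python
# Okapi-style TF normalization: TF / (TF + k + c*doclen/avglen), k=0.5, c=1.5.
doc_tf = {
    "d1": {"model": 3, "retrieval": 2},
    "d2": {},
    "d3": {"model": 3, "retrieval": 1},
    "d4": {"efficient": 1},
    "d5": {"retrieval": 1, "efficient": 1},
}
doc_len = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
avg_len, k, c = 7, 0.5, 1.5

def okapi_tf(tf, dl):
    return tf / (tf + k + c * dl / avg_len)

def scores(q_weights):
    return {d: sum(okapi_tf(tf[t], doc_len[d]) * w
                   for t, w in q_weights.items() if t in tf)
            for d, tf in doc_tf.items()}

binary  = {"model": 1, "retrieval": 1, "efficient": 1}
with_tf = {"model": 1, "retrieval": 1, "efficient": 2}  # "efficient" twice
print(scores(binary))   # d1 ranks first (about 0.96)
print(scores(with_tf))  # the doubled "efficient" lifts d4 and d5; d5 now wins
```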
==============================================================================
IDF WEIGHTS = log(N/n_t), logs base 2

english     log(5/2) = 1.32
language    log(5/3) = 0.73
model       log(5/2) = 1.32
retrieval   log(5/3) = 0.73
relevance   log(5/3) = 0.73
vector      log(5/2) = 1.32
space       log(5/2) = 1.32
R           log(5/1) = 2.32
most        log(5/2) = 1.32
efficient   log(5/2) = 1.32
measure     log(5/2) = 1.32
average     log(5/2) = 1.32
precision   log(5/2) = 1.32
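The IDF weights follow directly from the document frequencies in the term-by-document table. A sketch (note the logs must be base 2 for log(5/2) to come out as 1.32; log2(5/3) ≈ 0.737 is truncated to 0.73 above):

```python
import math

# Document frequency n_t (number of documents containing each term), read off
# the term-by-document table; N = 5 documents.
df = {"english": 2, "language": 3, "model": 2, "retrieval": 3, "relevance": 3,
      "vector": 2, "space": 2, "R": 1, "most": 2, "efficient": 2,
      "measure": 2, "average": 2, "precision": 2}
N = 5

idf = {t: math.log2(N / n) for t, n in df.items()}
print(round(idf["model"], 2), round(idf["R"], 2))  # 1.32 2.32
print(round(idf["retrieval"], 3))                  # 0.737
```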
==============================================================================
OKAPI TF*IDF
TF QUERY WEIGHTS
Dot Product Similarity

             d1          d2    d3          d4          d5          QUERY
model       0.53*1.32     0   0.57*1.32     0           0            1
retrieval   0.43*0.73     0   0.31*0.73     0         0.36*0.73      1
efficient     0           0     0         0.42*1.32   0.36*1.32      2
score       1.01          0   0.98        1.11        1.21
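The final vector-space ranking combines the two previous steps. A sketch (Okapi TF times base-2 IDF with TF query weights; exact floats differ from the hand-rounded sums by a percent or so, but the ranking is the same):

```python
import math

# Okapi TF * IDF with TF query weights.
doc_tf = {
    "d1": {"model": 3, "retrieval": 2},
    "d2": {},
    "d3": {"model": 3, "retrieval": 1},
    "d4": {"efficient": 1},
    "d5": {"retrieval": 1, "efficient": 1},
}
doc_len = {"d1": 10, "d2": 7, "d3": 8, "d4": 4, "d5": 6}
df = {"model": 2, "retrieval": 3, "efficient": 2}
N, avg_len, k, c = 5, 7, 0.5, 1.5
q_weights = {"model": 1, "retrieval": 1, "efficient": 2}

def okapi_tf(tf, dl):
    return tf / (tf + k + c * dl / avg_len)

scores = {d: sum(okapi_tf(tf[t], doc_len[d]) * math.log2(N / df[t]) * w
                 for t, w in q_weights.items() if t in tf)
          for d, tf in doc_tf.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d5', 'd4', 'd1', 'd3', 'd2']
```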
(3) Language Models and Smoothing
- LANG MODEL, MAX-LIKELIHOOD ESTIMATE (document model)

             d1     d2     d3     d4     d5
english     1/10     0      0    1/4      0
language    1/10     0     1/8   1/4      0
model       3/10     0     3/8     0      0
retrieval   2/10     0     1/8     0     1/6
relevance   1/10    1/7     0      0     1/6
vector      1/10     0     1/8     0      0
space       1/10     0     1/8     0      0
R             0     1/7     0      0      0
most          0      0     1/8   1/4      0
efficient     0      0      0    1/4     1/6
measure       0     2/7     0      0     1/6
average       0     1/7     0      0     1/6
precision     0     2/7     0      0     1/6

QUERY-LIKELIHOOD    0      0      0      0      0
(every document is missing at least one query term, so each product is zero)
==============================================================================
- LANG MODEL, MAX-LIKELIHOOD + LAPLACE ESTIMATE (document model)
  P(t|d) = (tf + 1)/(doclen + U)

             d1     d2     d3     d4     d5
english     2/23   1/20   1/21   2/17   1/19
language    2/23   1/20   2/21   2/17   1/19
model       4/23   1/20   4/21   1/17   1/19
retrieval   3/23   1/20   2/21   1/17   2/19
relevance   2/23   2/20   1/21   1/17   2/19
vector      2/23   1/20   2/21   1/17   1/19
space       2/23   1/20   2/21   1/17   1/19
R           1/23   2/20   1/21   1/17   1/19
most        1/23   1/20   2/21   2/17   1/19
efficient   1/23   1/20   1/21   2/17   2/19
measure     1/23   3/20   1/21   1/17   2/19
average     1/23   2/20   1/21   1/17   2/19
precision   1/23   3/20   1/21   1/17   2/19

QUERY-LIKELIHOOD    12/23^4   1/20^4   8/21^4   4/17^4   8/19^4
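The Laplace-smoothed query likelihoods can be computed exactly (note that "efficient" contributes two factors, since it occurs twice in the query). A sketch with exact fractions:

```python
from fractions import Fraction

# Full per-document term counts (term-by-document table); vocabulary size U = 13.
docs = {
    "d1": {"english": 1, "language": 1, "model": 3, "retrieval": 2,
           "relevance": 1, "vector": 1, "space": 1},
    "d2": {"R": 1, "precision": 2, "measure": 2, "relevance": 1, "average": 1},
    "d3": {"most": 1, "model": 3, "retrieval": 1, "language": 1,
           "vector": 1, "space": 1},
    "d4": {"english": 1, "most": 1, "efficient": 1, "language": 1},
    "d5": {"retrieval": 1, "efficient": 1, "measure": 1, "average": 1,
           "precision": 1, "relevance": 1},
}
U = 13
query = ["efficient", "retrieval", "model", "efficient"]

def p_laplace(t, tf):
    """P(t|d) = (tf + 1) / (doclen + U)."""
    return Fraction(tf.get(t, 0) + 1, sum(tf.values()) + U)

ql = {}
for d, tf in docs.items():
    p = Fraction(1)
    for t in query:          # "efficient" contributes twice
        p *= p_laplace(t, tf)
    ql[d] = p

print(ql["d1"] == Fraction(12, 23**4))  # True
```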
==============================================================================
- LANG MODEL, MAX-LIKELIHOOD ESTIMATE + JELINEK-MERCER SMOOTHING (document model)
NOTE: Two different ways to get the background probabilities:
  A. corpus ML estimate      B. average of the document ML probabilities

             A. corpus ML    B. average doc ML
english         2/35         (1/10 + 1/4)/5       = 0.07
language        3/35         (1/10 + 1/8 + 1/4)/5 = 0.09
model           6/35         (3/10 + 3/8)/5       = 0.13
retrieval       4/35         (2/10 + 1/8 + 1/6)/5 = 0.10
relevance       3/35         (1/10 + 1/7 + 1/6)/5 = 0.08
vector          2/35         (1/10 + 1/8)/5       = 0.05
space           2/35         (1/10 + 1/8)/5       = 0.05
R               1/35         (1/7)/5              = 0.03
most            2/35         (1/8 + 1/4)/5        = 0.08
efficient       2/35         (1/4 + 1/6)/5        = 0.08
measure         3/35         (2/7 + 1/6)/5        = 0.09
average         2/35         (1/7 + 1/6)/5        = 0.06
precision       3/35         (2/7 + 1/6)/5        = 0.09

USE L = lambda = 0.8 with background B:  P(t|d) = (1-L)*P_ML(t|d) + L*P_bg(t)

d1:
english     .2*1/10 + .8*0.07 = 0.08
language    .2*1/10 + .8*0.09 = 0.09
model       .2*3/10 + .8*0.13 = 0.16
retrieval   .2*2/10 + .8*0.10 = 0.12
relevance   .2*1/10 + .8*0.08 = 0.08
vector      .2*1/10 + .8*0.05 = 0.06
space       .2*1/10 + .8*0.05 = 0.06
R           .8*0.03 = 0.02
most        .8*0.08 = 0.06
efficient   .8*0.08 = 0.06
measure     .8*0.09 = 0.07
average     .8*0.06 = 0.05
precision   .8*0.09 = 0.07

QUERY-LIKELIHOOD for d1: 0.16 * 0.12 * 0.06^2
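A sketch of the Jelinek-Mercer computation, using the B-style background (average of the per-document ML estimates) and lambda = 0.8 on the background, as above. With unrounded background probabilities the smoothed values come out slightly different from the hand-rounded table (e.g. 0.168 rather than 0.16 for "model" in d1):

```python
# Jelinek-Mercer smoothing: P(t|d) = (1-L)*P_ml(t|d) + L*P_bg(t), L = 0.8,
# with the background taken as the average of the document ML models (method B).
docs = {
    "d1": {"english": 1, "language": 1, "model": 3, "retrieval": 2,
           "relevance": 1, "vector": 1, "space": 1},
    "d2": {"R": 1, "precision": 2, "measure": 2, "relevance": 1, "average": 1},
    "d3": {"most": 1, "model": 3, "retrieval": 1, "language": 1,
           "vector": 1, "space": 1},
    "d4": {"english": 1, "most": 1, "efficient": 1, "language": 1},
    "d5": {"retrieval": 1, "efficient": 1, "measure": 1, "average": 1,
           "precision": 1, "relevance": 1},
}
L = 0.8
query = ["efficient", "retrieval", "model", "efficient"]

def p_ml(t, tf):
    """Maximum-likelihood document model: tf / doclen."""
    return tf.get(t, 0) / sum(tf.values())

def p_bg(t):
    """Background model: average of the per-document ML probabilities."""
    return sum(p_ml(t, tf) for tf in docs.values()) / len(docs)

def p_jm(t, tf):
    return (1 - L) * p_ml(t, tf) + L * p_bg(t)

ql_d1 = 1.0
for t in query:
    ql_d1 *= p_jm(t, docs["d1"])    # smoothing keeps every factor nonzero
print(round(p_jm("model", docs["d1"]), 3))      # 0.168
print(round(p_jm("efficient", docs["d1"]), 3))  # 0.067
```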
==============================================================================
- LANG MODEL, MAX-LIKELIHOOD ESTIMATE + WITTEN-BELL SMOOTHING (document model)
Background probabilities as in the Jelinek-Mercer table (method B, average of the
document ML probabilities): english 0.07, language 0.09, model 0.13,
retrieval 0.10, relevance 0.08, vector 0.05, space 0.05, R 0.03, most 0.08,
efficient 0.08, measure 0.09, average 0.06, precision 0.09.

d1: N = 10 (tokens), V = 7 (distinct terms)
P(t|d) = N/(N+V) * P_ML(t|d) + V/(N+V) * P_bg(t)

english     10/17*1/10 + 7/17*0.07 = 0.09
language    10/17*1/10 + 7/17*0.09 = 0.10
model       10/17*3/10 + 7/17*0.13 = 0.23
retrieval   10/17*2/10 + 7/17*0.10 = 0.16
relevance   10/17*1/10 + 7/17*0.08 = 0.09
vector      10/17*1/10 + 7/17*0.05 = 0.08
space       10/17*1/10 + 7/17*0.05 = 0.08
R           7/17*0.03 = 0.01
most        7/17*0.08 = 0.03
efficient   7/17*0.08 = 0.03
measure     7/17*0.09 = 0.04
average     7/17*0.06 = 0.02
precision   7/17*0.09 = 0.04

QUERY-LIKELIHOOD for d1: 0.23 * 0.16 * 0.03^2
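The Witten-Bell computation differs from Jelinek-Mercer only in that the mixing weight is set per document from N (tokens) and V (distinct terms) instead of a fixed lambda. A sketch under the same method-B background assumption:

```python
# Witten-Bell smoothing: P(t|d) = N/(N+V)*P_ml(t|d) + V/(N+V)*P_bg(t),
# where N = tokens in the document and V = distinct terms in it (d1: N=10, V=7).
docs = {
    "d1": {"english": 1, "language": 1, "model": 3, "retrieval": 2,
           "relevance": 1, "vector": 1, "space": 1},
    "d2": {"R": 1, "precision": 2, "measure": 2, "relevance": 1, "average": 1},
    "d3": {"most": 1, "model": 3, "retrieval": 1, "language": 1,
           "vector": 1, "space": 1},
    "d4": {"english": 1, "most": 1, "efficient": 1, "language": 1},
    "d5": {"retrieval": 1, "efficient": 1, "measure": 1, "average": 1,
           "precision": 1, "relevance": 1},
}
query = ["efficient", "retrieval", "model", "efficient"]

def p_ml(t, tf):
    return tf.get(t, 0) / sum(tf.values())

def p_bg(t):
    """Method-B background: average of the per-document ML probabilities."""
    return sum(p_ml(t, tf) for tf in docs.values()) / len(docs)

def p_wb(t, tf):
    n, v = sum(tf.values()), len(tf)
    return n / (n + v) * p_ml(t, tf) + v / (n + v) * p_bg(t)

ql_d1 = 1.0
for t in query:
    ql_d1 *= p_wb(t, docs["d1"])
print(round(p_wb("model", docs["d1"]), 2),
      round(p_wb("retrieval", docs["d1"]), 2),
      round(p_wb("efficient", docs["d1"]), 2))  # 0.23 0.16 0.03
```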
==============================================================================
- LANG MODEL, MAX-LIKELIHOOD + LAPLACE ESTIMATE (query model)
  P(t|Q) = (qtf + 1)/(querylen + U) = (qtf + 1)/(4 + 13)

QUERY MODEL
english     1/17
language    1/17
model       2/17
retrieval   2/17
relevance   1/17
vector      1/17
space       1/17
R           1/17
most        1/17
efficient   3/17
measure     1/17
average     1/17
precision   1/17

RAW TF per document (from the term-by-document table):
             d1   d2   d3   d4   d5
english       1    0    0    1    0
language      1    0    1    1    0
model         3    0    3    0    0
retrieval     2    0    1    0    1
relevance     1    1    0    0    1
vector        1    0    1    0    0
space         1    0    1    0    0
R             0    1    0    0    0
most          0    0    1    1    0
efficient     0    0    0    1    1
measure       0    2    0    0    1
average       0    1    0    0    1
precision     0    2    0    0    1

DOC-LIKELIHOOD for d1 (one factor of P(t|Q) per document token):
  1/17 * 1/17 * (2/17)^3 * (2/17)^2 * 1/17 * 1/17 * 1/17
  (english * language * model^3 * retrieval^2 * relevance * vector * space)

PROBLEM: normalize by document length, otherwise long documents get very low scores.
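Flipping the direction — scoring documents under a smoothed query model — can be sketched exactly with fractions (the d1 likelihood reduces to 2^5/17^10, since five tokens contribute 1/17 and model^3 * retrieval^2 contribute (2/17)^5):

```python
from fractions import Fraction

# Laplace-smoothed QUERY model: P(t|Q) = (qtf + 1)/(querylen + U), U = 13.
query = ["efficient", "retrieval", "model", "efficient"]
U = 13
vocab = ["english", "language", "model", "retrieval", "relevance", "vector",
         "space", "R", "most", "efficient", "measure", "average", "precision"]
p_q = {t: Fraction(query.count(t) + 1, len(query) + U) for t in vocab}

# Likelihood of d1's tokens under the query model.
d1 = {"english": 1, "language": 1, "model": 3, "retrieval": 2,
      "relevance": 1, "vector": 1, "space": 1}
lik = Fraction(1)
for t, n in d1.items():
    lik *= p_q[t] ** n      # one factor of P(t|Q) per token of d1
print(lik == Fraction(2**5, 17**10))  # True
print(sum(p_q.values()))              # 1 -- the query model is a proper distribution
```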
==============================================================================
- KL-DIVERGENCE model comparison, LAPLACE SMOOTHING (both models)

             d1     d2     d3     d4     d5  |  QUERY
english     2/23   1/20   1/21   2/17   1/19 |  1/17
language    2/23   1/20   2/21   2/17   1/19 |  1/17
model       4/23   1/20   4/21   1/17   1/19 |  2/17
retrieval   3/23   1/20   2/21   1/17   2/19 |  2/17
relevance   2/23   2/20   1/21   1/17   2/19 |  1/17
vector      2/23   1/20   2/21   1/17   1/19 |  1/17
space       2/23   1/20   2/21   1/17   1/19 |  1/17
R           1/23   2/20   1/21   1/17   1/19 |  1/17
most        1/23   1/20   2/21   2/17   1/19 |  1/17
efficient   1/23   1/20   1/21   2/17   2/19 |  3/17
measure     1/23   3/20   1/21   1/17   2/19 |  1/17
average     1/23   2/20   1/21   1/17   2/19 |  1/17
precision   1/23   3/20   1/21   1/17   2/19 |  1/17

FOR d1: distance = -1/17*log(2/23) - 1/17*log(2/23) - 2/17*log(4/23) - 2/17*log(3/23) - ...
FOR d2: distance = -1/17*log(1/20) - 1/17*log(1/20) - 2/17*log(1/20) - 2/17*log(1/20) - ...
FOR d3: distance = -1/17*log(1/21) - 1/17*log(2/21) - 2/17*log(4/21) - 2/17*log(2/21) - ...
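The full sums above run over all 13 vocabulary terms, which is tedious by hand but immediate in code. A sketch (cross-entropy form; the query-entropy term of full KL divergence is the same for every document, so it does not affect the ranking):

```python
import math

# Term counts per document (term-by-document table) and the query.
docs = {
    "d1": {"english": 1, "language": 1, "model": 3, "retrieval": 2,
           "relevance": 1, "vector": 1, "space": 1},
    "d2": {"R": 1, "precision": 2, "measure": 2, "relevance": 1, "average": 1},
    "d3": {"most": 1, "model": 3, "retrieval": 1, "language": 1,
           "vector": 1, "space": 1},
    "d4": {"english": 1, "most": 1, "efficient": 1, "language": 1},
    "d5": {"retrieval": 1, "efficient": 1, "measure": 1, "average": 1,
           "precision": 1, "relevance": 1},
}
U = 13
query = ["efficient", "retrieval", "model", "efficient"]
vocab = sorted(set().union(*docs.values()))

# Laplace-smoothed query model: (qtf + 1) / (querylen + U).
p_q = {t: (query.count(t) + 1) / (len(query) + U) for t in vocab}

def distance(tf):
    """Cross-entropy -sum_t P(t|Q) log2 P(t|d), Laplace-smoothed doc model."""
    dl = sum(tf.values())
    return -sum(p_q[t] * math.log2((tf.get(t, 0) + 1) / (dl + U)) for t in vocab)

dist = {d: distance(tf) for d, tf in docs.items()}
ranking = sorted(dist, key=dist.get)  # smaller distance = better match
print(ranking)  # d5 is the best match, d2 the worst
```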