IR
What's Search?
Search doc model
doc = web page. 90% of users don't look beyond the 1st page; the #1 result gets ~33% of clicks.
Pages are already downloaded. goal: pick the most relevant ones.
challenges:
- result relevance
- processing speed
- scaling to many docs
Boolean retrieval
- whether a doc has a specific word. Term-doc incidence table: terms (aka words) vs. documents (aka the entire web).
simply include or exclude a doc from results
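A minimal sketch of the incidence idea, assuming a made-up three-doc collection. The table is stored sparsely as term -> set of doc ids; the helper names `boolean_and` and `boolean_not` are hypothetical, not from any library.

```python
from collections import defaultdict

# Toy collection; doc ids and texts are invented for illustration.
docs = {
    "d1": "cats chase mice",
    "d2": "dogs chase cats",
    "d3": "mice eat cheese",
}

# Term-doc incidence, stored sparsely: term -> set of doc ids containing it.
incidence = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        incidence[term].add(doc_id)


def boolean_and(*terms):
    """Docs that contain every one of the given terms."""
    sets = [incidence.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_not(term):
    """Docs that do not contain the term."""
    return set(docs) - incidence.get(term, set())


print(sorted(boolean_and("cats", "chase")))
print(sorted(boolean_and("chase") & boolean_not("dogs")))
```

A doc is either in the result set or not; there is no ranking among the docs that match.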
Vector Space Model.
each doc -> vector of values, one component for each term. a doc is a point in the space. -> this is a SPARSE space (most components are 0). position i represents term i.
docs that are close together are close in meaning. measure the distance between two docs.
- treat query as a short doc.
- sort docs by increasing distance to the query doc.
- easy to compute, both query and doc are vectors.
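The steps above can be sketched as follows, assuming raw term counts as the vector values and Euclidean distance as the distance measure (both are illustrative choices; the notes do not fix them at this point). The docs and the `to_vector`, `distance`, `rank` names are made up.

```python
import math

# Toy collection, invented for illustration.
docs = {
    "d1": "apple banana apple",
    "d2": "banana cherry",
    "d3": "cherry cherry durian",
}

# Fixed term order: position i of every vector stands for vocab[i].
vocab = sorted({t for text in docs.values() for t in text.split()})


def to_vector(text):
    """Raw term-count vector over the vocabulary."""
    words = text.split()
    return [words.count(term) for term in vocab]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank(query):
    """Treat the query as a short doc; sort docs by increasing distance to it."""
    q = to_vector(query)
    return sorted(docs, key=lambda d: distance(to_vector(docs[d]), q))


print(rank("apple banana"))
```

A real system would keep only the non-zero components, since almost every entry of these vectors is 0.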
tf-idf
- Term frequency: times mentioned in a doc.
- Document freq: how many docs in the collection a word appears in. (whether it's a rare word)
High value for rare word: IDF = N / n_k, where N = total # docs in collection C and n_k = # docs in C that contain term T_k.
use log(N / n_k) -> still monotonically increasing, so order is preserved.
TF * IDF
- TF: high for a word that is common within one document.
- IDF: high for a word that is rare in the collection.
tf-idf normalization: divide each doc's weights by the length of its weight vector, to avoid giving long docs more weight.
\[w_{i k}=\frac{t f_{i k} \log \left(N / n_{k}\right)}{\sqrt{\sum_{j=1}^{l}\left(t f_{i j}\right)^{2}\left[\log \left(N / n_{j}\right)\right]^{2}}}\]
similarity: cosine of the angle between the query vector and the doc vector.
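A small sketch of the normalized weight and cosine ranking, assuming whitespace tokenization and a made-up three-doc collection. The names `tfidf`, `cosine`, and `rank` are hypothetical. Query terms that appear in no doc are simply dropped, which is an assumption of this sketch.

```python
import math
from collections import Counter

# Toy collection, invented for illustration.
docs = {
    "d1": "apple apple banana",
    "d2": "banana cherry",
    "d3": "cherry cherry cherry durian",
}

N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
# n_k: number of docs that contain term k.
df = Counter(t for words in tokenized.values() for t in set(words))


def tfidf(words):
    """Length-normalized tf-idf weights as a sparse dict: term -> weight."""
    tf = Counter(words)
    raw = {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def cosine(u, v):
    """Both vectors are already unit length, so cosine is just the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank(query):
    """Sort docs by decreasing cosine similarity to the query."""
    q = tfidf(query.split())
    return sorted(docs, key=lambda d: cosine(q, tfidf(tokenized[d])), reverse=True)


print(rank("cherry durian"))
```

Because every vector is divided by its own length, d3 does not win just by being the longest doc; it wins because its direction is closest to the query's.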
Assessing Quality
- precision/recall curve (returned or not)
- Kendall's Tau (total order right or not)
- Mean Reciprocal Rank (top hit)
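For the last measure, a sketch under the usual definition: for each query take 1 / (rank of the first relevant result), then average over queries. Scoring a query with no relevant result as 0 is a convention assumed here; the data and function names are made up.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant doc, or 0.0 if none was returned."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked result list, set of relevant doc ids), one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Three hypothetical queries against the same ranked list.
runs = [
    (["d1", "d2", "d3"], {"d1"}),  # first relevant hit at rank 1
    (["d1", "d2", "d3"], {"d2"}),  # first relevant hit at rank 2
    (["d1", "d2", "d3"], {"d9"}),  # no relevant hit returned
]
print(mean_reciprocal_rank(runs))
```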
True/false: whether the returned / not-returned decision matches the doc's actual relevance.
Positive/negative: returned or not.
\[\begin{array}{|l|l|l|}\hline & \text{Relevant} & \text{Not relevant} \\ \hline \text{Retrieved} & \text{TP} & \text{FP} \\ \hline \text{Not retrieved} & \text{FN} & \text{TN} \\ \hline\end{array}\]
Precision and recall: whether it's there or not. (tradeoff)
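From the table, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below computes both for the top-k results of a made-up ranked list, to show the tradeoff as k grows; `precision_recall` is a hypothetical helper, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # returned and relevant
    fp = len(retrieved - relevant)   # returned but not relevant
    fn = len(relevant - retrieved)   # relevant but not returned
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth and ranked output for one query.
relevant = {"d1", "d2", "d3", "d4"}
ranked = ["d1", "d7", "d2", "d8", "d3", "d4"]

for k in (1, 3, 6):
    p, r = precision_recall(ranked[:k], relevant)
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

Returning more results can only keep or raise recall, but here it lowers precision from 1.0 at k=1 to 2/3 at k=6. Plotting these pairs over all k gives the precision/recall curve.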
Kendall's Tau
- order of all results. whether they're in the right order.
1: perfect agreement. -1: perfect disagreement (reversed order).
0: the two orderings are unrelated.
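A sketch of the tie-free form, tau = (concordant pairs - discordant pairs) / (total pairs), assuming both rankings contain the same items, at least two of them, with no ties. The function name and the example rankings are made up.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau for two orderings of the same items (best first, no ties)."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings put x and y in the same relative order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


ideal = ["a", "b", "c", "d"]
print(kendall_tau(ideal, ["a", "b", "c", "d"]))  # same order
print(kendall_tau(ideal, ["d", "c", "b", "a"]))  # reversed order
print(kendall_tau(ideal, ["b", "d", "a", "c"]))  # half the pairs agree
```

The third ranking agrees with the ideal on three of the six pairs and disagrees on the other three, so the counts cancel.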