Term
What is a lexicon? |
Definition
- Data structure listing all the terms that appear in the document collection
- Lexicon construction carried out after term selection
- Usually hash table for fast lookup
|
|
|
Term
What is an inverted index? |
|
Definition
- Data structure that relates a word in the lexicon to a document in which it appears (a posting)
- May also include:
- Occurrence count within document
- Pointer to location of word within document
[image]
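A minimal Python sketch of both structures, assuming documents arrive as already normalised lists of terms: the dict acts as the hash-table lexicon, and each posting records the document, an occurrence count and the word positions.

    from collections import defaultdict

    def build_index(docs):
        # docs: dict mapping document id -> list of normalised terms
        # index: term -> {doc_id: (count, [positions])}
        index = defaultdict(dict)
        for doc_id, terms in docs.items():
            for pos, term in enumerate(terms):
                count, positions = index[term].get(doc_id, (0, []))
                index[term][doc_id] = (count + 1, positions + [pos])
        return index

    docs = {"d1": ["three", "quark", "master", "mark"],
            "d2": ["strange", "history", "quark", "cheese"]}
    index = build_index(docs)
    # index["quark"] -> {"d1": (1, [1]), "d2": (1, [2])}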
|
|
|
Term
Give examples of searching for single words |
|
Definition
- Example 1: If postings consist only of document identifiers, no ranking of results is possible
- Lookup query term in lexicon
- Use inverted index to get address of posting list
- Return the full posting list
- Example 2: If postings also contain term count, we can do some primitive ranking of results
- Lookup query term in lexicon
- Use inverted index to get address of posting list
- Create empty priority queue with maximum length R
- For each document/count pair in posting list
- If priority queue has fewer than R elements, add (doc, count) pair to queue
- If count is larger than lowest entry in queue, delete lowest entry and add the new pair
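A sketch of Example 2 in Python, using heapq as the bounded priority queue; posting_list is assumed to be a list of (document id, count) pairs and R the number of results to keep.

    import heapq

    def top_r(posting_list, R):
        heap = []  # min-heap of (count, doc_id); the lowest count sits at heap[0]
        for doc_id, count in posting_list:
            if len(heap) < R:
                heapq.heappush(heap, (count, doc_id))       # queue not yet full
            elif count > heap[0][0]:                        # beats the lowest entry
                heapq.heapreplace(heap, (count, doc_id))    # drop lowest, add new pair
        return sorted(heap, reverse=True)                   # best documents first

    print(top_r([("d1", 3), ("d2", 7), ("d3", 1), ("d4", 5)], R=2))
    # [(7, 'd2'), (5, 'd4')]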
|
|
|
Term
What are the characteristics of the boolean model? |
|
Definition
- Lookup each query term in lexicon and inverted index
- Apply boolean set operators to posting lists for query terms to identify set of relevant documents
- AND - set intersection
- OR - set union
- NOT - set complement
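A sketch of the three operators in Python, assuming posting lists are stored as sets of document ids and all_docs is the full collection (needed for NOT).

    def AND(index, *terms):
        postings = [index.get(t, set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    def OR(index, *terms):
        return set().union(*(index.get(t, set()) for t in terms))

    def NOT(index, term, all_docs):
        return all_docs - index.get(term, set())   # complement against the collection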
|
|
|
Term
Give an example of the boolean model |
|
Definition
- Document collection:
- d1="Three quarks for Master Mark"
- d2="The strange history of quark cheese"
- d3="Strange quark plasmas"
- d4="Strange Quark XPress algorithm"
- Lexicon and inverted index:
- three → {d1}
- quark → {d1, d2, d3, d4}
- master → {d1}
- mark → {d1}
- strange → {d2, d3, d4}
- history → {d2}
- cheese → {d2}
- plasma → {d3}
- xpress → {d4}
- algorithm → {d4}
- Query: "strange" AND "quark" AND NOT "cheese"
- Result set:
- {d2, d3, d4} ∩ {d1, d2, d3, d4} ∩ {d1, d3, d4} = {d3, d4}
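The result can be checked with the set operations from the boolean model card (posting sets copied from the lexicon above):

    strange = {"d2", "d3", "d4"}
    quark = {"d1", "d2", "d3", "d4"}
    cheese = {"d2"}
    all_docs = {"d1", "d2", "d3", "d4"}
    print(strange & quark & (all_docs - cheese))   # {'d3', 'd4'}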
|
|
|
Term
What are the principles of document length normalisation? |
|
Definition
- In a large collection, document sizes vary widely
- Large documents are more likely to be considered relevant
- Adjust answer set ranking - divide rank by document length
- Size in bytes
- Number of words
- Vector norm
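A short sketch of the adjustment, assuming scores holds raw document scores and doc_lengths holds whichever length measure is chosen (bytes, word count or vector norm).

    def normalise(scores, doc_lengths):
        return {doc: score / doc_lengths[doc] for doc, score in scores.items()}

    scores = {"d1": 12.0, "d2": 12.0}
    doc_lengths = {"d1": 300, "d2": 40}      # e.g. number of words
    print(normalise(scores, doc_lengths))    # d2 now ranks above the much longer d1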
|
|
|
Term
What are some characteristics of positional indexes? |
|
Definition
- If postings include term locations, we can also support a term proximity operator
- term1 /k term2
- effectively an extended AND operator
- term1 occurs in a document within k words of term2
- When calculating intersection of posting lists, examine term locations
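A sketch of the proximity check in Python, assuming a positional index that maps each term to {document id: sorted list of word positions}; this is the simple quadratic comparison rather than the linear merge.

    def within_k(pos_index, term1, term2, k):
        hits = set()
        # intersect the posting lists, then compare positions within each document
        for doc in pos_index.get(term1, {}).keys() & pos_index.get(term2, {}).keys():
            positions1 = pos_index[term1][doc]
            positions2 = pos_index[term2][doc]
            if any(abs(p1 - p2) <= k for p1 in positions1 for p2 in positions2):
                hits.add(doc)
        return hits

    pos_index = {"strange": {"d3": [0]}, "plasma": {"d3": [2]}}
    print(within_k(pos_index, "strange", "plasma", k=2))   # {'d3'}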
|
|
|
Term
How can the boolean model handle phrase queries? |
Definition
- Can extend boolean model to handle phrase queries by indexing biwords (pairs of consecutive terms)
- Consider a document containing the phrase "electronics and computer science"
- After removing stop words and normalising, index the biwords "electronic computer" and "computer science"
- When querying on phrases longer than two terms, use AND to join constituent biwords
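A sketch of biword indexing in Python; the tokens are assumed to have had stop words removed and been normalised already, and the document ids in biword_index are made up for illustration.

    def biwords(tokens):
        # pairs of consecutive terms, e.g. ["electronic", "computer", "science"]
        # -> ["electronic computer", "computer science"]
        return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    def phrase_query(biword_index, tokens):
        # AND together the posting sets of the phrase's constituent biwords
        postings = [biword_index.get(bw, set()) for bw in biwords(tokens)]
        return set.intersection(*postings) if postings else set()

    biword_index = {"electronic computer": {"d5"}, "computer science": {"d5", "d6"}}
    print(phrase_query(biword_index, ["electronic", "computer", "science"]))   # {'d5'}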
|
|
|
Term
What are the disadvantages of the boolean model? |
|
Definition
- A document either matches a query or not
- Need better criteria for ranking hits (similarity is not binary)
- Partial matches
- Term weighting
|
|
|
Term
What are the principles of the Vector model? |
|
Definition
- Documents and queries are represented as vectors of term-based features (occurrence of terms in the collection)
- dj = (t1,j, t2,j, ..., tn,j)
- qk = (t1,k, t2,k, ..., tn,k)
- Features may be binary
- ti,j = 1 if term ti is present in document dj
- ti,j = 0 otherwise
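A sketch of binary feature vectors over a fixed vocabulary; with these vectors the dot product simply counts the terms a query and a document share.

    vocabulary = ["strange", "quark", "cheese", "plasma"]   # terms t1..tn

    def binary_vector(terms, vocabulary):
        terms = set(terms)
        return [1 if t in terms else 0 for t in vocabulary]   # ti = 1 if present, else 0

    d3 = binary_vector(["strange", "quark", "plasma"], vocabulary)
    q = binary_vector(["strange", "cheese"], vocabulary)
    common = sum(di * qi for di, qi in zip(d3, q))   # number of terms in common
    print(d3, q, common)   # [1, 1, 0, 1] [1, 0, 1, 0] 1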
|
|
|
Term
How is it possible to assess similarity between query and document as number of terms in common? |
|
Definition
- With binary feature vectors, the dot product of the query and document vectors counts the number of terms they have in common, so it can be used directly as a similarity score
|
|
Term
What are the principles of the weighted vector model? |
|
Definition
- Not all terms are equally interesting
- 'the' vs 'system' vs 'Berners-Lee'
- Replace binary features with weights
[image]
View documents and queries as vectors in multidimensional space |
|
|
Term
How can similarity be determined in weighted vector models? |
|
Definition
Using the dot product (normalised by the vector lengths, this gives the cosine of the angle between the vectors in multidimensional space)
[image] |
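A sketch of the dot product and cosine similarity in Python, assuming documents and queries are dicts mapping terms to weights; dividing by the vector norms gives the cosine of the angle between the two vectors.

    import math

    def dot(u, v):
        return sum(u[t] * v[t] for t in u.keys() & v.keys())

    def cosine(u, v):
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot(u, v) / (norm_u * norm_v) if norm_u and norm_v else 0.0

    doc = {"strange": 0.5, "quark": 1.2, "plasma": 0.8}
    query = {"strange": 1.0, "plasma": 1.0}
    print(dot(doc, query), cosine(doc, query))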
|
|
Term
What are some bases for term weighting? |
|
Definition
- Need basis for assigning weights to terms in documents
- How common is the term in the document?
- Within document measure
- Term frequency
- How common is the term in document collection?
- Collection frequency
- Inverse document frequency
|
|
|
Term
What are the formulas for Term Frequency (TF) and Inverse Document Frequency (IDF)? |
|
Definition
- tfi,j = freqi,j / maxk freqk,j (frequency of term i in document j, normalised by the most frequent term in the document)
- idfi = log(N / ni), where N is the number of documents in the collection and ni is the number of documents containing term i
|
|
Term
Give an example of TF-IDF |
|
Definition
- Consider the following document collection
[image]
[image] |
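A sketch of TF-IDF in Python using the common textbook forms (tf normalised by the most frequent term in the document, idf = log(N / ni)); the toy collection below is made up rather than taken from the missing images.

    import math
    from collections import Counter

    docs = {"d1": ["three", "quark", "master", "mark"],
            "d2": ["strange", "history", "quark", "cheese"],
            "d3": ["strange", "quark", "plasma"]}
    N = len(docs)

    def tf_idf(term, doc_terms):
        counts = Counter(doc_terms)
        tf = counts[term] / max(counts.values())                   # within-document frequency
        n_i = sum(1 for terms in docs.values() if term in terms)   # documents containing term
        idf = math.log(N / n_i) if n_i else 0.0                    # rare terms weigh more
        return tf * idf

    print(tf_idf("quark", docs["d3"]))    # 0.0, the term appears in every document
    print(tf_idf("plasma", docs["d3"]))   # log(3) ≈ 1.10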
|
|
Term
What are the principles of the probabilistic model? |
|
Definition
Calculate similarity between query and document as the odds that the document is relevant to the query:
[image]
P(relevant|q) and P(~relevant|q) are the same for all documents in the collection |
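One common way of writing this similarity (a sketch of the usual odds-ratio form; the exact formula in the missing image may differ):

    sim(dj, q) = P(relevant | dj, q) / P(~relevant | dj, q)
               = [P(dj | relevant, q) × P(relevant | q)] / [P(dj | ~relevant, q) × P(~relevant | q)]   (Bayes' rule)
               ∝ P(dj | relevant, q) / P(dj | ~relevant, q)

The priors P(relevant|q) and P(~relevant|q) can be dropped in the last step because they are the same for every document.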
|
|
Term
What are the principles of query refinement? |
|
Definition
- Typical queries very short, ambiguous
- Add more terms to disambiguate the query and improve results
- Often difficult for users to express their information needs as a query
- Easier to say which results are/aren't relevant than to write a better query
|
|
|