Term
What is the definition of Information Retreival? |
|
Definition
- An information need is a topic about which a user desires to know more
- A query is what a user conveys to the computer in an attempt to communicate their information need
- A document is relevant if it is one that the user perceives as containing the information of value with respect to their personal information need
- In other words IR is the taks of finding relevant documents in a collection
- Best known modern examples of IR are Web Search Engines
|
|
|
Term
What is the information retreival problem? |
|
Definition
The primary goal of an information retreival system is to retreive all the documents tha are relevant to a user query while retreiving as few non-relevant documents as possible |
|
|
Term
Draw a diagram for a high level system architecture |
|
Definition
|
|
Term
What are some key concepts in IR Systems? |
|
Definition
- Document collection
- Document granularity: paragraph, page or multi-page text
- Query
- Expressed in a query language
- List of words [AI book], a phrase ["AI Book"], contain boolean operators [AI AND Book] and non-boolean operators [AI NEAR book]
- Answer set
- Set of documents judged to be relevant to the query
- Ranked in decreasing order of relevance
|
|
|
Term
What are some key points on IR system implementations? |
|
Definition
- User expectations of information retrieval systems are exceedingly demanding
- Need to index billion document collections
- Need to return top results these collctions in a fraction of a second
- Design of appropriate document representations and data structures is critical
|
|
|
Term
What are some characteristics of Information Retreival Models? |
|
Definition
- Information retreival systems are distinguished by how they represent documents to queries
- Several common models: Boolean (Set-theoretic) Algebraic (Vector spaces), probabilistic
- An information retreival model consists of
- A set D of representations of the documents in the collection
- A set Q of representation of user information needs (queries)
- A ranking function R(di,qi) that associates a real number with a query representation qi in Q and a document representation di in D
|
|
|
Term
What are the principles of implementing Term Selection? |
|
Definition
Information retreival systems either:
- Index all words contained in documents (full-text indexing)
- Selectively extract index terms from the documents that are to be indexes
[image] |
|
|
Term
What are the principles of implementing Tokenization? |
|
Definition
- Identify distinct words
- Separate on whitespace, turn each document into list of tokens
- Discard punctuation (but consider tokenisation of O'Neill, and hyphens)
- Fold case, remove diacritics
- Tokenisation is language-specific
- Identify language first (using character n-gram model, Markov model)
- Languages with compound words (e.g. German, Finnish) may require special treatment (compound splitting)
- Languages which do not leave spaces between words (e.g. Chinese, Japanese) may require special treatment (word segmentation)
|
|
|
Term
Draw Zipf's Law on Word significance |
|
Definition
|
|
Term
What are the concepts of Stop Word Removal? |
|
Definition
- Extremely common words have little use for discriminating between documents
- e.g. the, od, and , a, but, etc
- Construct a stop list
- Short terms by collection frequency
- Identify high frequency stop words (possibly hand filtering)
- Use stop list discard terms during indexing
|
|
|
Term
|
Definition
Most meaning is carried by nouns
Identify adjacent nouns to index terms (special case of biword indexing) |
|
|
Term
What are the concepts of stemming? |
|
Definition
- Terms in user query might not exactly match document terms
- Query contains 'computer', document contains 'computer'
- Terms have wide variety of sintactic variations
- Affixes added to word stems, that are common to all inflected forms
- For english, typically suffixes
- e.g. connect is the stem of: connected, connecting, connects, etc.
- Use stemming algorithm:
- To remove affixes from terms before indexing
- When generating query terms from query
|
|
|
Term
What are some stemming algorithms? |
|
Definition
- Lookup table
- Contains mappings from inflexted forms to uninflected stems
- Suffix-stripping
- Rules for identifying and removing suffixes
- e.g. '-sses' --> '-ss'
- Porter Stemming Algorithm
- Lemmatisation
- Identify part of speech (noun, verb, adverb)
- Apply POS-specific normalisation rules to yield lemmas
|
|
|
Term
What are some disadvantages of Stemming? |
|
Definition
- Yields small (~2%) increase in recall for English
- More important in other language
- Can harm precision
- 'Stock' versus 'stocking'
- Difficult in agglomerative languages (German, Finnish)
- Difficult in languages with many exceptions (Spanish)
|
|
|
Term
What are some characteristics of manual indexing? |
|
Definition
- Identification of indexing terms by a (subject expert)
- Indexing terms typically taken from a controlled vocabulary
- May be a flat list of terms, or hierarchical thesaurus
- May be structured terms
- (e.g. names or structures of chemical compounds)
|
|
|
Term
What are some characteristics on result presentation? |
|
Definition
- Probability ranking principle
- List oredered by probabilities of relevance
- Good for speed
- Doesn't consider utility
- If two copies of most relevant document, second has equal relevance but zero utility once you've seen the first
- Many IR systems eliminate results that are too similar
|
|
|