Shared Flashcard Set

Details

Title

COMP3017 - Lecture 9 (A) - Information Retreival

Description

COMP3017 - Lecture 9 (A) - Information Retreival

Total Cards

Subject

Computer Science

Level

Undergraduate 2

Created

05/04/2014

Click here to study/print these flashcards.

Create your own flash cards! Sign up here.

Additional Computer Science Flashcards

Cards Return to Set Details

Term

What is the definition of Information Retreival?

Definition

An information need is a topic about which a user desires to know more

A query is what a user conveys to the computer in an attempt to communicate their information need

A document is relevant if it is one that the user perceives as containing the information of value with respect to their personal information need

In other words IR is the taks of finding relevant documents in a collection

Best known modern examples of IR are Web Search Engines

Term

What is the information retreival problem?

Definition

The primary goal of an information retreival system is to retreive all the documents tha are relevant to a user query while retreiving as few non-relevant documents as possible

Term

Draw a diagram for a high level system architecture

Definition

[image]

Term

What are some key concepts in IR Systems?

Definition

Document collection

Document granularity: paragraph, page or multi-page text

Query

Expressed in a query language

List of words [AI book], a phrase ["AI Book"], contain boolean operators [AI AND Book] and non-boolean operators [AI NEAR book]

Answer set

Set of documents judged to be relevant to the query

Ranked in decreasing order of relevance

Term

What are some key points on IR system implementations?

Definition

User expectations of information retrieval systems are exceedingly demanding

Need to index billion document collections

Need to return top results these collctions in a fraction of a second

Design of appropriate document representations and data structures is critical

Term

What are some characteristics of Information Retreival Models?

Definition

Information retreival systems are distinguished by how they represent documents to queries

Several common models: Boolean (Set-theoretic) Algebraic (Vector spaces), probabilistic

An information retreival model consists of

A set D of representations of the documents in the collection

A set Q of representation of user information needs (queries)

A ranking function R(di,qi) that associates a real number with a query representation qi in Q and a document representation di in D

Term

What are the principles of implementing Term Selection?

Definition

Information retreival systems either:

Index all words contained in documents (full-text indexing)

Selectively extract index terms from the documents that are to be indexes

[image]

Term

What are the principles of implementing Tokenization?

Definition

Identify distinct words

Separate on whitespace, turn each document into list of tokens

Discard punctuation (but consider tokenisation of O'Neill, and hyphens)

Fold case, remove diacritics

Tokenisation is language-specific

Identify language first (using character n-gram model, Markov model)

Languages with compound words (e.g. German, Finnish) may require special treatment (compound splitting)

Languages which do not leave spaces between words (e.g. Chinese, Japanese) may require special treatment (word segmentation)

Term

Draw Zipf's Law on Word significance

Definition

[image]

[image]

Term

What are the concepts of Stop Word Removal?

Definition

Extremely common words have little use for discriminating between documents

e.g. the, od, and , a, but, etc

Construct a stop list

Short terms by collection frequency

Identify high frequency stop words (possibly hand filtering)

Use stop list discard terms during indexing

Term

What is noun grouping?

Definition

Most meaning is carried by nouns

Identify adjacent nouns to index terms (special case of biword indexing)

Term

What are the concepts of stemming?

Definition

Terms in user query might not exactly match document terms

Query contains 'computer', document contains 'computer'

Terms have wide variety of sintactic variations

Affixes added to word stems, that are common to all inflected forms

For english, typically suffixes

e.g. connect is the stem of: connected, connecting, connects, etc.

Use stemming algorithm:

To remove affixes from terms before indexing

When generating query terms from query

Term

What are some stemming algorithms?

Definition

Lookup table

Contains mappings from inflexted forms to uninflected stems

Suffix-stripping

Rules for identifying and removing suffixes

e.g. '-sses' --> '-ss'

Porter Stemming Algorithm

Lemmatisation

Identify part of speech (noun, verb, adverb)

Apply POS-specific normalisation rules to yield lemmas

Term

What are some disadvantages of Stemming?

Definition

Yields small (~2%) increase in recall for English

More important in other language

Can harm precision

'Stock' versus 'stocking'

Difficult in agglomerative languages (German, Finnish)

Difficult in languages with many exceptions (Spanish)

Can use dictionary

Term

What are some characteristics of manual indexing?

Definition

Identification of indexing terms by a (subject expert)

Indexing terms typically taken from a controlled vocabulary

May be a flat list of terms, or hierarchical thesaurus

May be structured terms

(e.g. names or structures of chemical compounds)

Term

What are some characteristics on result presentation?

Definition

Probability ranking principle

List oredered by probabilities of relevance

Good for speed

Doesn't consider utility

If two copies of most relevant document, second has equal relevance but zero utility once you've seen the first

Many IR systems eliminate results that are too similar

Flashcard Machine - create, study and share online flash cards

Shared Flashcard Set

Details

Additional Computer Science Flashcards

Cards Return to Set Details

My Flashcards

Flashcard Library

Browse

About

Help

Mobile