Term
|
Definition
Partition the Indexes (Sharding)
Add Tiers of Indexes
Cache Queries (Check First)
Cache Index Terms (Check Second)
For Construction Use MapReduce
|
|
|
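A minimal Python sketch of two of the ideas above: a "check first" query cache in front of partitioned (sharded) indexes. The shard interface, cache size, LRU eviction policy, and merge-by-score step are illustrative assumptions, not details from the card.

```python
# Minimal sketch of a "check first" query cache in front of sharded indexes.
from collections import OrderedDict

class ShardedSearch:
    def __init__(self, shards, cache_size=1000):
        self.shards = shards              # each shard exposes .search(query) -> [(doc_id, score), ...]
        self.query_cache = OrderedDict()  # query string -> merged top-k results ("check first")
        self.cache_size = cache_size

    def search(self, query, k=10):
        if query in self.query_cache:                 # cache hit: skip the shards entirely
            self.query_cache.move_to_end(query)
            return self.query_cache[query]
        # cache miss: fan out to every index partition and merge by score
        hits = [hit for shard in self.shards for hit in shard.search(query)]
        results = sorted(hits, key=lambda h: h[1], reverse=True)[:k]
        self.query_cache[query] = results
        if len(self.query_cache) > self.cache_size:   # evict least recently used entry
            self.query_cache.popitem(last=False)
        return results
```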
Term
Dual-Encoder Models
(Representation based models) |
|
Definition
Separately encode the query and document (the encoding functions can be the same or different),
then match the results (typically with a simple match function); see the sketch below |
|
|
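A hedged sketch of the dual-encoder pattern: two (possibly shared) encoders and a simple match function. The word-averaging "encoder", toy vocabulary, and dot-product match are stand-ins for real trained models.

```python
# Toy dual-encoder: encode query and document separately, then apply a simple
# match function.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(64) for w in
         ["cheap", "flights", "to", "boston", "airfare", "deals"]}

def encode(text):
    # stand-in encoder: average of word vectors, L2-normalized
    vecs = [vocab[w] for w in text.lower().split() if w in vocab]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def match(query_vec, doc_vec):
    return float(query_vec @ doc_vec)   # simple match function: dot product

score = match(encode("cheap flights to boston"), encode("airfare deals boston"))
```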
Term
Deep Structured Semantic Models (DSSM) |
|
Definition
Type of Dual-Encoder model
Uses a vocabulary of 500k terms and ignores all others.
Maps each word to a vector (hashing), but doesn't use word2vec or anything like it.
Instead it breaks words into letter trigrams (e.g. best -> #best# -> #be, bes, est, st#), as sketched below. This hashing is robust to spelling differences, but not to conceptual differences.
[image]
Better than BM25, worse than good LTR |
|
|
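A small sketch of the letter-trigram hashing step described above, using the card's own example. The count-vector helper and trigram vocabulary are illustrative.

```python
# Letter-trigram hashing: wrap the word in '#' boundary markers and take every
# 3-character window.
def letter_trigrams(word):
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("best"))   # ['#be', 'bes', 'est', 'st#']

def trigram_vector(word, trigram_vocab):
    # sparse count vector over the trigram vocabulary; words with similar
    # spellings share trigrams, synonyms generally do not
    vec = [0] * len(trigram_vocab)
    for tg in letter_trigrams(word):
        if tg in trigram_vocab:
            vec[trigram_vocab[tg]] += 1
    return vec
```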
Term
Deep Relevance Matching Model (DRMM) |
|
Definition
Interaction-based neural model (meaning: first get local matches between pieces of text, e.g. cosine similarity of word embeddings, then learn patterns over those interactions).
Convert all words to word2vec embeddings,
compare every query word to every doc word,
use histogram pooling to get a constant number of inputs (e.g. the log of the number of document words with a match score between .8 and .9 is one input feature),
pass that through a feed-forward neural net to get a score (sketched below).
[image]
Too expensive for initial retrieval. Comparable to good LTR models.
|
|
|
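A sketch of the DRMM interaction and histogram-pooling steps above. The per-word random vectors stand in for a word2vec lookup, and the bin edges are an assumption (0.1-wide bins so that, e.g., matches between .8 and .9 land in one bin).

```python
# DRMM-style interaction features: cosine similarity of every query word against
# every document word, pooled into a fixed-size histogram of log counts per
# query word, ready to feed a small feed-forward net.
import numpy as np

def emb(word, dim=50, _cache={}):
    # placeholder for a word2vec lookup: a fixed random vector per word
    if word not in _cache:
        seed = abs(hash(word)) % (2**32)
        _cache[word] = np.random.default_rng(seed).standard_normal(dim)
    return _cache[word]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matching_histograms(query_words, doc_words, bins=np.linspace(-1.0, 1.0, 21)):
    doc_vecs = [emb(w) for w in doc_words]
    features = []
    for q in query_words:
        sims = [cosine(emb(q), dv) for dv in doc_vecs]
        counts, _ = np.histogram(sims, bins=bins)   # e.g. one bin counts matches in [.8, .9)
        features.append(np.log1p(counts))           # log counts become the FFN inputs
    return np.stack(features)                       # shape: (num_query_words, num_bins)

hist = matching_histograms(["presidential", "election"], ["the", "vote", "was", "close"])
```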
Term
|
Definition
The idea is to fine-tune BERT to produce importance scores for each word and then use those scores rather than term frequency. Use the max score for each word.
This improves Indri or BM25 and is a preprocessing step, so it can be done offline. |
|
|
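A sketch of the offline preprocessing step described above: replace term frequency with a learned importance score per word, keeping the max score when a word is scored more than once. `score_terms` is a hypothetical stand-in for the fine-tuned BERT scorer.

```python
# Offline term-weighting sketch: store a learned importance score per word in
# place of term frequency.
from collections import defaultdict

def score_terms(doc_text):
    # placeholder: a real system would run fine-tuned BERT and return
    # (token, importance) pairs for each occurrence
    return [(tok.lower(), 0.5) for tok in doc_text.split()]

def term_weights(doc_text):
    weights = defaultdict(float)
    for term, score in score_terms(doc_text):
        weights[term] = max(weights[term], score)   # keep the max score per word
    return dict(weights)                            # stored in the index instead of tf

# preprocessing step: computed once per document at index time, fully offline
index_weights = {doc_id: term_weights(text) for doc_id, text in
                 [("d1", "the solar eclipse was visible at noon")]}
```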
Term
|
Definition
The idea is to automatically generate questions that a document could answer and append them to the end of the document, then use traditional techniques. We are augmenting the documents (sketch below).
This is a lexicon-based approach that enables document expansion.
This improves BM25 by 15% (and the T5 variant, which has a better transformer, gets 25%) |
|
|
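A sketch of the document-expansion idea above: generate questions the document could answer and append them before indexing with a traditional method such as BM25. `generate_questions` is a hypothetical stand-in for the trained question-generation model (the T5 variant mentioned would slot in here).

```python
# Document expansion sketch: append generated questions to the document text,
# then index the augmented text with a traditional method.
def generate_questions(doc_text, n=3):
    # placeholder: a real system samples n questions from a seq2seq model
    return ["what year did the panama canal open?"] * n

def expand_document(doc_text):
    questions = generate_questions(doc_text)
    return doc_text + " " + " ".join(questions)   # augmented text goes into the BM25 index

expanded = expand_document("The Panama Canal opened in 1914 after a decade of construction.")
```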
Term
|
Definition
Contextualized Inverted Lists
The idea is that words don't convey meaning on their own, but BERT can be trained to produce contextualized embeddings which do a better job. Let's use those instead (sketched below).
[image] |
|
|
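A sketch of contextualized inverted lists as described above: postings hold a contextualized vector for each token occurrence, and a query token scores only documents that contain the same token. The random "BERT" vectors and the scoring rule (best occurrence per term, summed over query terms) are assumptions about the general pattern, not details from the card.

```python
# Contextualized-inverted-list sketch: exact match on the word, dot product on
# the context vectors.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def contextual_vectors(text, dim=32):
    # placeholder for BERT: one vector per token occurrence
    return [(tok.lower(), rng.standard_normal(dim)) for tok in text.split()]

postings = defaultdict(list)   # term -> [(doc_id, context_vector), ...]
for doc_id, text in [("d1", "the bank of the river"), ("d2", "the bank approved the loan")]:
    for term, vec in contextual_vectors(text):
        postings[term].append((doc_id, vec))

def score(query):
    totals = defaultdict(float)
    for term, qvec in contextual_vectors(query):
        best = defaultdict(float)
        for doc_id, dvec in postings.get(term, []):
            best[doc_id] = max(best[doc_id], float(qvec @ dvec))   # best occurrence in each doc
        for doc_id, s in best.items():
            totals[doc_id] += s                                    # sum contributions over query terms
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```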
Term
|
Definition
The idea is to learn a representative Bag of Words for the text rather than using the words themselves. This is done by projecting the output of a BERT model into a vocabulary-sized vector (sketched below).
[image] |
|
|
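A sketch of the projection idea above: map an encoder output into a vocabulary-sized vector and keep the positive entries as a learned bag of words. The random projection matrix, tiny vocabulary, and thresholding are illustrative stand-ins for the trained BERT head.

```python
# Learned bag-of-words sketch: project an encoder output into a vocabulary-sized
# vector and keep the positive entries as term weights.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cheap", "flight", "travel", "airfare", "boston", "hotel"]
hidden_dim = 16
projection = rng.standard_normal((hidden_dim, len(vocab)))   # learned in a real system

def encode(text):
    return rng.standard_normal(hidden_dim)   # placeholder for a pooled BERT output

def learned_bag_of_words(text):
    logits = encode(text) @ projection        # vocabulary-sized vector
    weights = np.maximum(logits, 0.0)         # keep only positive entries
    return {term: float(w) for term, w in zip(vocab, weights) if w > 0}

print(learned_bag_of_words("cheap flights to boston"))
```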
Term
|
Definition
Hard Negative Mining
[image] |
|
|
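The card names hard negative mining without detail; the sketch below shows one common form, which is an assumption rather than the card's own description: for each training query, take highly ranked but non-relevant documents from an existing retriever and use them as negatives when training the ranker.

```python
# Hypothetical hard-negative-mining loop: top-ranked but non-relevant documents
# from an existing retriever become negatives for contrastive training.
def mine_hard_negatives(query, relevant_ids, retrieve, k=100, n_neg=4):
    # `retrieve` is a hypothetical function returning ranked doc ids
    # (e.g. BM25 or the current model); high-ranking non-relevant docs are "hard"
    ranked = retrieve(query, k)
    negatives = [doc_id for doc_id in ranked if doc_id not in relevant_ids]
    return negatives[:n_neg]

def training_triples(labeled_queries, retrieve):
    # yields (query, positive_doc_id, hard_negative_doc_id) triples
    for query, relevant_ids in labeled_queries:
        for neg in mine_hard_negatives(query, set(relevant_ids), retrieve):
            yield (query, relevant_ids[0], neg)
```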
Term
|
Definition
Use an LLM to generate a fictional document that would be a good response to the query, then use that document for query augmentation (sketch below). |
|
|
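A sketch of the idea above: ask an LLM for a fictional document that would answer the query, then use that text to augment the query before retrieval. `generate_hypothetical_doc` is a hypothetical stand-in for the LLM call, and appending its text to the query is just one simple way to use it.

```python
# Query augmentation with a generated document.
def generate_hypothetical_doc(query):
    # placeholder: a real system would prompt an LLM, e.g.
    # "Write a short passage that answers: {query}"
    return "A short made-up passage that plausibly answers the query."

def augmented_query(query):
    return query + " " + generate_hypothetical_doc(query)   # send this to the retriever

q = augmented_query("what causes tides?")
```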
Term
|
Definition
LLMs can help with this (sketch below).
These are questions that require answering subquestions before they can be answered.
Example: Who was the fourth president's wife?
Need to know who the fourth president was.
Need to know who he was married to. |
|
|
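A sketch of how an LLM could help with the multi-hop case above: decompose the question into subquestions, answer them in order, and feed earlier answers into later steps. `ask_llm` is a hypothetical stand-in for an LLM (possibly retrieval-backed) and the prompts are illustrative.

```python
# Question decomposition with an LLM: get subquestions, answer them in order,
# then answer the original question using the collected facts.
def ask_llm(prompt):
    return "..."   # placeholder for a real LLM call

def answer_multihop(question):
    sub_qs = ask_llm(f"List the subquestions needed to answer: {question}").split("\n")
    facts = []
    for sub_q in sub_qs:   # e.g. "Who was the fourth president?" then "Who was he married to?"
        facts.append(ask_llm(f"Known facts: {facts}. Answer: {sub_q}"))
    return ask_llm(f"Known facts: {facts}. Answer the original question: {question}")
```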
Term
|
Definition
Retrieval Augmented Generation
Given a query, have an LLM (or similar system) do research with document retrieval and then answer (sketch below). |
|
|
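A sketch of retrieval-augmented generation as described above: retrieve documents for the query, then have an LLM answer with those documents in its context. `retrieve` and `ask_llm` are hypothetical stand-ins for a search system and an LLM.

```python
# Retrieve-then-generate sketch: fetch documents for the query and let the LLM
# answer from them.
def retrieve(query, k=3):
    return ["doc text 1", "doc text 2", "doc text 3"][:k]   # placeholder retriever

def ask_llm(prompt):
    return "..."                                            # placeholder LLM

def rag_answer(query):
    context = "\n\n".join(retrieve(query))
    return ask_llm(f"Using only these documents:\n{context}\n\nAnswer: {query}")
```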