Term
|
Definition
Partition the Indexes (Sharding)
Add Tiers of Indexes
Cache Queries (Check First)
Cache Index Terms (Check Second)
For Construction Use MapReduce
|
|
|
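A minimal Python sketch of two of the ideas above: a "check first" query cache in front of partitioned (sharded) indexes. The shard interface, cache size, LRU eviction policy, and merge-by-score step are illustrative assumptions, not details from the card.

```python
# Minimal sketch of a "check first" query cache in front of sharded indexes.
from collections import OrderedDict

class ShardedSearch:
    def __init__(self, shards, cache_size=1000):
        self.shards = shards              # each shard exposes .search(query) -> [(doc_id, score), ...]
        self.query_cache = OrderedDict()  # query string -> merged top-k results ("check first")
        self.cache_size = cache_size

    def search(self, query, k=10):
        if query in self.query_cache:                 # cache hit: skip the shards entirely
            self.query_cache.move_to_end(query)
            return self.query_cache[query]
        # cache miss: fan out to every index partition and merge by score
        hits = [hit for shard in self.shards for hit in shard.search(query)]
        results = sorted(hits, key=lambda h: h[1], reverse=True)[:k]
        self.query_cache[query] = results
        if len(self.query_cache) > self.cache_size:   # evict least recently used entry
            self.query_cache.popitem(last=False)
        return results
```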
Term
Dual-Encoder Models
(Representation based models) |
|
Definition
Separately encode the query and document (the encoding functions can be the same or different),
then match the results (typically with a simple match function); see the sketch below |
|
|
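A hedged sketch of the dual-encoder pattern: two (possibly shared) encoders and a simple match function. The word-averaging "encoder", toy vocabulary, and dot-product match are stand-ins for real trained models.

```python
# Toy dual-encoder: encode query and document separately, then apply a simple
# match function.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(64) for w in
         ["cheap", "flights", "to", "boston", "airfare", "deals"]}

def encode(text):
    # stand-in encoder: average of word vectors, L2-normalized
    vecs = [vocab[w] for w in text.lower().split() if w in vocab]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def match(query_vec, doc_vec):
    return float(query_vec @ doc_vec)   # simple match function: dot product

score = match(encode("cheap flights to boston"), encode("airfare deals boston"))
```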
Term
Deep Structured Semantic Models (DSSM) |
|
Definition
Type of Dual-Encoder model
Uses a vocabulary of 500k terms and ignores all others.
Maps each word to a vector (hashing), but doesn't use word2vec or anything like it.
Instead it breaks words into letter trigrams (e.g. best -> #best# -> #be, bes, est, st#), as sketched below. This hashing is robust to spelling differences, but not to conceptual differences.
[image]
Better than BM25, worse than good LTR |
|
|
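A small sketch of the letter-trigram hashing step described above, using the card's own example. The count-vector helper and trigram vocabulary are illustrative.

```python
# Letter-trigram hashing: wrap the word in '#' boundary markers and take every
# 3-character window.
def letter_trigrams(word):
    padded = f"#{word}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("best"))   # ['#be', 'bes', 'est', 'st#']

def trigram_vector(word, trigram_vocab):
    # sparse count vector over the trigram vocabulary; words with similar
    # spellings share trigrams, synonyms generally do not
    vec = [0] * len(trigram_vocab)
    for tg in letter_trigrams(word):
        if tg in trigram_vocab:
            vec[trigram_vocab[tg]] += 1
    return vec
```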
Term
Deep Relevance Matching Model (DRMM) |
|
Definition
Interaction-based neural model (meaning: first get local matches between pieces of text, e.g. cosine similarity of word embeddings, then learn patterns over those interactions).
Convert all words to word2vec embeddings,
compare every query word to every doc word,
use histogram pooling to get a constant number of inputs (e.g. the log of the number of document words with a match score between .8 and .9 is one input feature),
pass that through a feed-forward neural net to get a score (sketched below).
[image]
Too expensive for initial retrieval. Comparable to good LTR models.
|
|
|
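A sketch of the DRMM interaction and histogram-pooling steps above. The per-word random vectors stand in for a word2vec lookup, and the bin edges are an assumption (0.1-wide bins so that, e.g., matches between .8 and .9 land in one bin).

```python
# DRMM-style interaction features: cosine similarity of every query word against
# every document word, pooled into a fixed-size histogram of log counts per
# query word, ready to feed a small feed-forward net.
import numpy as np

def emb(word, dim=50, _cache={}):
    # placeholder for a word2vec lookup: a fixed random vector per word
    if word not in _cache:
        seed = abs(hash(word)) % (2**32)
        _cache[word] = np.random.default_rng(seed).standard_normal(dim)
    return _cache[word]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def matching_histograms(query_words, doc_words, bins=np.linspace(-1.0, 1.0, 21)):
    doc_vecs = [emb(w) for w in doc_words]
    features = []
    for q in query_words:
        sims = [cosine(emb(q), dv) for dv in doc_vecs]
        counts, _ = np.histogram(sims, bins=bins)   # e.g. one bin counts matches in [.8, .9)
        features.append(np.log1p(counts))           # log counts become the FFN inputs
    return np.stack(features)                       # shape: (num_query_words, num_bins)

hist = matching_histograms(["presidential", "election"], ["the", "vote", "was", "close"])
```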
Term
|
Definition
The idea is to fine-tune BERT to produce importance scores for each word and then use those scores rather than term frequency. Use the max score for each word.
This improves Indri or BM25 and is a preprocessing step, so it can be done offline. |
|
|
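A sketch of the offline preprocessing step described above: replace term frequency with a learned importance score per word, keeping the max score when a word is scored more than once. `score_terms` is a hypothetical stand-in for the fine-tuned BERT scorer.

```python
# Offline term-weighting sketch: store a learned importance score per word in
# place of term frequency.
from collections import defaultdict

def score_terms(doc_text):
    # placeholder: a real system would run fine-tuned BERT and return
    # (token, importance) pairs for each occurrence
    return [(tok.lower(), 0.5) for tok in doc_text.split()]

def term_weights(doc_text):
    weights = defaultdict(float)
    for term, score in score_terms(doc_text):
        weights[term] = max(weights[term], score)   # keep the max score per word
    return dict(weights)                            # stored in the index instead of tf

# preprocessing step: computed once per document at index time, fully offline
index_weights = {doc_id: term_weights(text) for doc_id, text in
                 [("d1", "the solar eclipse was visible at noon")]}
```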
Term
|
Definition
The idea is to automatically generate questions that a document could answer and append them to the end of the document, then use traditional techniques. We are augmenting the documents (sketch below).
This is a lexicon-based approach that enables document expansion.
This improves BM25 by 15% (and the T5 variant, which has a better transformer, gets 25%) |
|
|
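A sketch of the document-expansion idea above: generate questions the document could answer and append them before indexing with a traditional method such as BM25. `generate_questions` is a hypothetical stand-in for the trained question-generation model (the T5 variant mentioned would slot in here).

```python
# Document expansion sketch: append generated questions to the document text,
# then index the augmented text with a traditional method.
def generate_questions(doc_text, n=3):
    # placeholder: a real system samples n questions from a seq2seq model
    return ["what year did the panama canal open?"] * n

def expand_document(doc_text):
    questions = generate_questions(doc_text)
    return doc_text + " " + " ".join(questions)   # augmented text goes into the BM25 index

expanded = expand_document("The Panama Canal opened in 1914 after a decade of construction.")
```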
Term
|
Definition
Contextualized Inverted Lists
The idea is that words don't convey meaning on their own, but BERT can be trained to produce contextualized embeddings which do a better job. Let's use those instead (sketched below).
[image] |
|
|
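A sketch of contextualized inverted lists as described above: postings hold a contextualized vector for each token occurrence, and a query token scores only documents that contain the same token. The random "BERT" vectors and the scoring rule (best occurrence per term, summed over query terms) are assumptions about the general pattern, not details from the card.

```python
# Contextualized-inverted-list sketch: exact match on the word, dot product on
# the context vectors.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def contextual_vectors(text, dim=32):
    # placeholder for BERT: one vector per token occurrence
    return [(tok.lower(), rng.standard_normal(dim)) for tok in text.split()]

postings = defaultdict(list)   # term -> [(doc_id, context_vector), ...]
for doc_id, text in [("d1", "the bank of the river"), ("d2", "the bank approved the loan")]:
    for term, vec in contextual_vectors(text):
        postings[term].append((doc_id, vec))

def score(query):
    totals = defaultdict(float)
    for term, qvec in contextual_vectors(query):
        best = defaultdict(float)
        for doc_id, dvec in postings.get(term, []):
            best[doc_id] = max(best[doc_id], float(qvec @ dvec))   # best occurrence in each doc
        for doc_id, s in best.items():
            totals[doc_id] += s                                    # sum contributions over query terms
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```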
Term
|
Definition
The idea is to learn a representative Bag of Words for the text rather than using the words themselves. This is done by projecting the output of a BERT model into a vocabulary-sized vector (sketched below).
[image] |
|
|
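A sketch of the projection idea above: map an encoder output into a vocabulary-sized vector and keep the positive entries as a learned bag of words. The random projection matrix, tiny vocabulary, and thresholding are illustrative stand-ins for the trained BERT head.

```python
# Learned bag-of-words sketch: project an encoder output into a vocabulary-sized
# vector and keep the positive entries as term weights.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cheap", "flight", "travel", "airfare", "boston", "hotel"]
hidden_dim = 16
projection = rng.standard_normal((hidden_dim, len(vocab)))   # learned in a real system

def encode(text):
    return rng.standard_normal(hidden_dim)   # placeholder for a pooled BERT output

def learned_bag_of_words(text):
    logits = encode(text) @ projection        # vocabulary-sized vector
    weights = np.maximum(logits, 0.0)         # keep only positive entries
    return {term: float(w) for term, w in zip(vocab, weights) if w > 0}

print(learned_bag_of_words("cheap flights to boston"))
```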
Term
|
Definition
Hard Negative Mining
[image] |
|
|
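The card names hard negative mining without detail; the sketch below shows one common form, which is an assumption rather than the card's own description: for each training query, take highly ranked but non-relevant documents from an existing retriever and use them as negatives when training the ranker.

```python
# Hypothetical hard-negative-mining loop: top-ranked but non-relevant documents
# from an existing retriever become negatives for contrastive training.
def mine_hard_negatives(query, relevant_ids, retrieve, k=100, n_neg=4):
    # `retrieve` is a hypothetical function returning ranked doc ids
    # (e.g. BM25 or the current model); high-ranking non-relevant docs are "hard"
    ranked = retrieve(query, k)
    negatives = [doc_id for doc_id in ranked if doc_id not in relevant_ids]
    return negatives[:n_neg]

def training_triples(labeled_queries, retrieve):
    # yields (query, positive_doc_id, hard_negative_doc_id) triples
    for query, relevant_ids in labeled_queries:
        for neg in mine_hard_negatives(query, set(relevant_ids), retrieve):
            yield (query, relevant_ids[0], neg)
```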
Term
|
Definition
Use an LLM to generate a fictional document that would be a good response to the query, then use that document for query augmentation (sketch below). |
|
|
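A sketch of the idea above: ask an LLM for a fictional document that would answer the query, then use that text to augment the query before retrieval. `generate_hypothetical_doc` is a hypothetical stand-in for the LLM call, and appending its text to the query is just one simple way to use it.

```python
# Query augmentation with a generated document.
def generate_hypothetical_doc(query):
    # placeholder: a real system would prompt an LLM, e.g.
    # "Write a short passage that answers: {query}"
    return "A short made-up passage that plausibly answers the query."

def augmented_query(query):
    return query + " " + generate_hypothetical_doc(query)   # send this to the retriever

q = augmented_query("what causes tides?")
```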
Term
|
Definition
LLMs can help with this (sketch below).
These are questions that require answering subquestions before they can be answered.
Example: Who was the fourth president's wife?
Need to know who the fourth president was.
Need to know who he was married to. |
|
|
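A sketch of how an LLM could help with the multi-hop case above: decompose the question into subquestions, answer them in order, and feed earlier answers into later steps. `ask_llm` is a hypothetical stand-in for an LLM (possibly retrieval-backed) and the prompts are illustrative.

```python
# Question decomposition with an LLM: get subquestions, answer them in order,
# then answer the original question using the collected facts.
def ask_llm(prompt):
    return "..."   # placeholder for a real LLM call

def answer_multihop(question):
    sub_qs = ask_llm(f"List the subquestions needed to answer: {question}").split("\n")
    facts = []
    for sub_q in sub_qs:   # e.g. "Who was the fourth president?" then "Who was he married to?"
        facts.append(ask_llm(f"Known facts: {facts}. Answer: {sub_q}"))
    return ask_llm(f"Known facts: {facts}. Answer the original question: {question}")
```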
Term
|
Definition
Retrieval Augmented Generation
Given a query, have an LLM (or similar system) do research with document retrieval and then answer (sketch below). |
|
|
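A sketch of retrieval-augmented generation as described above: retrieve documents for the query, then have an LLM answer with those documents in its context. `retrieve` and `ask_llm` are hypothetical stand-ins for a search system and an LLM.

```python
# Retrieve-then-generate sketch: fetch documents for the query and let the LLM
# answer from them.
def retrieve(query, k=3):
    return ["doc text 1", "doc text 2", "doc text 3"][:k]   # placeholder retriever

def ask_llm(prompt):
    return "..."                                            # placeholder LLM

def rag_answer(query):
    context = "\n\n".join(retrieve(query))
    return ask_llm(f"Using only these documents:\n{context}\n\nAnswer: {query}")
```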