Search Engine Design

BigFanTheory · 14 min read · May 12, 2021

This post walks through the design of a search engine in three steps, plus the evaluation and feedback of the retrieval system:

Step 1: Build a text retrieval system based on the vector space model: create the index and rank the retrieved results by term frequency with TF-IDF weighting and document length normalization.

Step 2: Retrieve documents for the input query (language modeling).

Step 3: Rank the search results with link analysis for web search.

Let's consider the simplest system, where each document is represented as a binary vector under a bag-of-words model. The query is also parsed into a binary vector, and the dimension of the vectors equals the vocabulary size.

Now, after the query is parsed, step 2 is to obtain the similarity between the query and each document.

As for step 3, once all the similarity scores are obtained, we can rank the results.

Here the scoring function f(q, d) is literally the number of distinct query words matched in document d.
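As a minimal sketch of this matching function (the function and sample strings below are made up for illustration):

```python
# Minimal sketch of bit-vector matching: f(q, d) counts the distinct
# query words that also appear in the document.
def f(query: str, doc: str) -> int:
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Hypothetical example: 3 of the 4 query words match this document.
print(f("news about presidential campaign",
        "the presidential campaign in the news today"))  # -> 3
```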

In reality the search ranking system is much more complicated than this.

1. Step 1: Build the index of the search engine

1.1 TF-IDF weighting factor

Instead of a binary vector, we can use a vector where each element represents the number of times the word appears in the document; this amounts to multiplying the binary vector by a Term Frequency (TF) weighting factor. In addition, if a word shows up in many documents, it is less discriminative and therefore less important.

We account for this factor with an Inverse Document Frequency (IDF) term, defined as log((M+1)/k), where M is the total number of documents and k is the number of documents containing the word.
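A minimal sketch of this weighting (the helper below and its argument names are mine, not from the post):

```python
import math
from collections import Counter

def tfidf_vector(doc_words, df, M):
    """TF-IDF weight of each term in one document:
    weight(w) = c(w, d) * log((M + 1) / k), where c(w, d) is the raw
    term frequency, k = df[w] is the number of documents containing w,
    and M is the total number of documents."""
    tf = Counter(doc_words)
    return {w: c * math.log((M + 1) / df[w]) for w, c in tf.items()}
```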

After applying the TF-IDF weighting factor, we find that doc d5 has the highest score.

1.2 BM25 TF transformation

d5 has the highest score merely because it contains the word "campaign" many times; it should not outrank d4, however. Previously, the term frequency was taken literally as the raw count of a word in the document. Now we apply a TF transformation, called the BM25 transformation, to reduce the impact of high-frequency words. Here k is a hyperparameter that in practice can take values such as 5 or 6.

The BM25 transformation has an upper bound (the transformed value never exceeds k + 1), which makes it robust and effective.
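In the usual notation (mine, following the standard BM25 literature), the transformed term frequency is:

```latex
TF_{BM25}(w, d) = \frac{(k + 1)\, c(w, d)}{c(w, d) + k}
```

As the raw count c(w, d) grows, this value approaches but never exceeds k + 1; with k = 0 it degenerates to the 0/1 bit vector, and a large k approaches the raw count.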

1.3 Document Length Normalization

Consider a document d6 with 5000 words and d4 with only 100 words. A matched word in d4 should clearly carry a higher weight than the same word in d6.

To solve this problem, pivoted length normalization is introduced.

The state-of-the-art pivoted length normalization methods date back to 1994 and 1996; these length normalizations are shown to be most effective when combined with the BM25 transformation.
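Putting the pieces together, one standard way to write the full ranking function with the BM25 TF transformation, IDF weighting, and pivoted length normalization (notation mine; |d| is the document length, avdl the average document length, and b ∈ [0, 1] controls the normalization strength) is:

```latex
f(q, d) = \sum_{w \in q \cap d} c(w, q)\; \frac{(k + 1)\, c(w, d)}{c(w, d) + k\left(1 - b + b\,\frac{|d|}{avdl}\right)}\; \log\frac{M + 1}{df(w)}
```

Documents longer than average are penalized (the denominator grows), while shorter ones are rewarded.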

1.4 Implementation of the text retrieval system

To answer queries quickly online, the text retrieval system should first perform (1) tokenization, which maps words with similar meanings to the same indexing term. This includes stemming, where all inflectional forms of a word are mapped to the same root form; for example, "computer", "computation", and "computing" are all mapped to "compute". Tokenization of some languages, such as Chinese, is challenging because of word segmentation.

After tokenization comes (2) indexing, i.e., converting documents into data structures that enable fast search. To this end, as much computation as possible should be precomputed offline.

It is worth mentioning that the inverted index is the dominant method for supporting the basic search algorithms; other indices, such as the document index, are needed for feedback. An example of an inverted index is shown below:
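The original figure is not reproduced here; as a stand-in, a toy inverted index (contents invented for illustration) maps each term to a postings list of (docID, term frequency) pairs:

```python
# Toy inverted index: term -> [(docID, term frequency), ...]
inverted_index = {
    "campaign":     [(2, 1), (3, 1), (4, 1), (5, 4)],
    "news":         [(1, 1), (2, 1), (3, 1), (4, 1)],
    "presidential": [(1, 1), (2, 1), (4, 2), (5, 1)],
}
```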

The inverted index is efficient because multi-term queries can be handled by aggregating the per-term weights. The dictionary of the inverted index is of modest size and should provide fast random access, so it is preferably kept in memory; suitable data structures are the hash table, B-tree, and trie. The postings lists are huge; they usually allow sequential access and can stay on disk. Each posting can contain the docID, term frequency, term positions, and so on. Compressing the postings is desirable.

Constructing the inverted index in memory is difficult because the index is huge while memory is limited. A sort-based method solves this in four steps: a. collect local (termID, docID, freq) tuples; b. sort the local tuples; c. merge runs pair-wise; d. output the inverted file. A simplified sketch follows.
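This sketch of steps a–d assumes everything fits in memory (a real system writes sorted runs to disk and merges them pair-wise):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {docID: [token, ...]} -> {term: [(docID, freq), ...]}"""
    tuples = []
    for doc_id, words in docs.items():        # a. collect local tuples
        counts = defaultdict(int)
        for w in words:
            counts[w] += 1
        tuples.extend((term, doc_id, freq) for term, freq in counts.items())
    tuples.sort()                              # b. sort by (term, docID)
    index = defaultdict(list)
    for term, doc_id, freq in tuples:          # c./d. merge into postings
        index[term].append((doc_id, freq))
    return dict(index)
```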

Inverted index compression uses variable-length encoding. For term frequencies, smaller codes are assigned to more frequent values; for docIDs, d-gap compression is used, which works because postings are accessed sequentially.
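A sketch of both ideas (function names are mine): docIDs in a sorted postings list are stored as gaps, and each number is then written with a variable-length byte code so that small numbers take fewer bytes:

```python
def to_dgaps(doc_ids):
    """Replace sorted docIDs with the first ID followed by the gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def varint_encode(n):
    """Variable-length byte code: 7 payload bits per byte; the high
    bit marks the final byte of a number."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

print(to_dgaps([824, 829, 215406]))                       # [824, 5, 214577]
print(len(varint_encode(5)), len(varint_encode(214577)))  # 1 3
```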

There are also stable, language-independent patterns in how people use natural language: a few words occur very frequently and most words appear rarely. The top 4 most frequent words account for roughly 10–15% of all word occurrences, and the top 50 account for 35–40%. The most frequent words in one corpus may be rare in another. This is captured by Zipf's Law, which states that a word's frequency is inversely proportional to its rank: rank × frequency ≈ constant.

2. Step 2: Answer the query based on the index

The query is answered using a general scoring function of the form f(q, d) = fa(h(g(t1, d, q), …, g(tk, d, q)), fd(d), fq(q)), where t1, …, tk are the query terms:

(a) fd(d) and fq(q) are precomputed.

(b) A score accumulator is maintained for each document d to compute h incrementally.

(c) For each query term ti, first fetch its inverted list {(d1, f1), …, (dn, fn)}; then, for each entry (dj, fj), compute g(ti, dj, q) and update the score accumulator of doc dj to incrementally compute h.

(d) Adjust the scores with fa and sort.

For example, a ranking system based on the sum of term frequencies is sketched below. To further improve efficiency, caching (e.g., of query results and inverted lists) and keeping only the most promising accumulators help. To scale up to the web, a distributed system is needed.
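A sketch of such a TF-sum scorer over the toy inverted index above (names are mine):

```python
from collections import defaultdict

def score_query(query_terms, inverted_index, top_k=10):
    """Accumulator-based scoring with g(t, d, q) = c(t, d), so
    f(q, d) is just the sum of term frequencies over matched terms."""
    acc = defaultdict(int)                      # (b) score accumulators
    for term in query_terms:                    # (c) walk inverted lists
        for doc_id, freq in inverted_index.get(term, []):
            acc[doc_id] += freq
    # (d) sort by score, highest first
    return sorted(acc.items(), key=lambda kv: -kv[1])[:top_k]

print(score_query(["presidential", "campaign"], inverted_index))
```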

To summarize the implementation: precompute as much of the inverted index as possible and compress it. For fast search, exploit the inverted index to accumulate scores only for documents matching a query term, and exploit Zipf's Law (most terms occur in few documents, so most inverted lists are short) to avoid touching the many documents that do not match any query term. The inverted index also supports a wide range of ranking algorithms. With parallel processing, a distributed file system, and caching, the system can be scaled.

2.2 Language Model

A statistical language model is a probability distribution over sequences of words: for a given sequence, it assigns a probability to the whole sequence. If the current (nth) word depends only on the previous n−1 words, we get the N-gram assumption; the unigram model is also called the bag-of-words model. Language models are used in information retrieval in the query likelihood model: a separate language model Md is associated with each document in the collection, and documents are ranked by the probability of the query Q given the document's language model, P(Q|Md). Commonly, the unigram language model is used.

The unigram LM is usually estimated by counting each word's frequency in the document and dividing by the document length (the maximum likelihood estimate): p(w|d) = c(w, d) / |d|. This is not necessarily the best estimate: any word that does not appear in the document gets probability 0.

Smoothed LMs for topic representation avoid this problem by assigning a small nonzero probability to unseen or rare words.

Now suppose we run an association analysis that returns the words most related to the word "computer". Under the normalized topic LM, we find that words such as "software" and "program" are highly associated with it. This information can be used to modify the query and improve the evaluation score.

In summary, an LM represents a probability distribution over text sequences, and a unigram LM is simply a word distribution. We can use LMs for topic representation and for finding word associations.

2.3 Query likelihood model

For a quick demonstration, we used the term frequency for fast retrieval. If a language model is used, the scoring function of query q against document d, f(q, d), is the log probability of the whole query sequence: f(q, d) = log p(q|d) = Σi log p(wi|d).

It is also worth mentioning that the document language model is smoothed: words that do not appear in the document are assigned a small nonzero probability.

With the smoothed p(w|d), f(q, d) can be rewritten in a form that exhibits TF-IDF weighting and length normalization, as shown below. The smoothed p(w|d) for an unseen word is assumed to be proportional to p(w|C), where C is the whole corpus.
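Writing the smoothed model as p(w|d) = p_seen(w|d) for words seen in d and α_d · p(w|C) otherwise, the query likelihood score decomposes (this is the standard smoothing decomposition; notation mine) into:

```latex
f(q, d) = \sum_{w \in q \cap d} c(w, q) \log \frac{p_{seen}(w \mid d)}{\alpha_d\, p(w \mid C)} \;+\; |q| \log \alpha_d
```

The sum over matched words behaves like TF-IDF weighting (a small background probability p(w|C) plays the role of a high IDF), and the document-dependent coefficient α_d effectively performs length normalization.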

3. Evaluation of the text retrieval system

3.1 F score

Assume two different algorithms retrieve different documents and rank them differently, as follows. We use precision and recall to evaluate their performance over the 10 retrieved documents.

Precision is the percentage of retrieved results that are relevant, and recall is the percentage of all relevant documents in the pool that were retrieved. Obviously, we cannot evaluate the system by precision or recall alone; their combined metric is the F score. The user can trade off precision against recall depending on the search task.
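Formally (standard definitions):

```latex
P = \frac{|\,relevant \cap retrieved\,|}{|\,retrieved\,|}, \qquad
R = \frac{|\,relevant \cap retrieved\,|}{|\,relevant\,|}, \qquad
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}
```

β = 1 gives the harmonic mean F1 = 2PR/(P + R); β > 1 favors recall and β < 1 favors precision.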

3.2 Mean Average Precision

The F score tells us the overall performance once all documents have been retrieved; as long as the set of retrieved documents does not change, their order does not matter. To evaluate the system more precisely, we can look at how precision and recall behave every time a new document is retrieved, so that the rank of each retrieved document also matters. This leads to the concept of Mean Average Precision (MAP), shown below.

The average precision averages the precision at every cutoff where a new relevant document is retrieved (non-relevant documents contribute zero); the normalizer is the total number of relevant documents. It is therefore sensitive to the ranking of the documents. Mean average precision, the mean of this quantity over queries, is the standard measure for comparing two ranking systems.
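A minimal sketch of average precision for a single query (names mine):

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k at which a relevant
    document appears, normalized by the total number of relevant
    documents (here assumed to all be in relevant_ids)."""
    hits, ap = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            ap += hits / k
    return ap / len(relevant_ids)

# MAP is then the mean of average_precision over all test queries.
```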

3.3 Multi-level judgements

The previous metrics assume all relevant documents are equally important. What if documents differ in their degree of relevance? Then we need to resort to multi-level judgements.

Multi-level judgement assigns each document a relevance level from 1 to r (r must be larger than 2 to be useful). The main idea is to measure the total utility of the top k documents, discounting the utility of lower-ranked documents; finally, normalization is applied to ensure comparability across queries.
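The standard instance of this idea is normalized discounted cumulative gain (nDCG); a sketch with one common discount (log2 of the rank plus one) follows:

```python
import math

def ndcg_at_k(gains, ideal_gains, k):
    """gains[i] is the graded relevance (1..r) of the document at
    rank i+1; ideal_gains is the same list sorted descending.
    Dividing by the ideal DCG makes queries comparable."""
    def dcg(g):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(g[:k]))
    return dcg(gains) / dcg(ideal_gains)

print(ndcg_at_k([3, 1, 2], [3, 2, 1], k=3))  # < 1: ranks 2 and 3 swapped
```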

4. Feedback of the system

4.1 Feedback of the text retrieval system

In the previous section we discussed evaluation metrics; feedback uses relevance judgements to produce an updated query and, in turn, updated results.

If we rely on users' clickthrough data, no extra effort is required from the user; however, such implicit feedback is noisy and not always reliable.

4.2 Feedback in the vector space model: Rocchio feedback

In the previous section, we said that a TR system can learn from examples to improve retrieval accuracy. Examples are marked positive or negative according to whether they are relevant. The general method is to update the query; common ways are to add new terms and/or adjust the weights of old terms. This is Rocchio feedback, whose intuition is as follows:

The basic idea is to move the query vector closer to the centroid of the relevant documents and away from the centroid of the non-relevant ones. The formula is shown below:
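The original figure is not reproduced here, but the standard Rocchio update is:

```latex
\vec{q}_{new} = \alpha\, \vec{q} \;+\; \frac{\beta}{|D_r|} \sum_{\vec{d}_i \in D_r} \vec{d}_i \;-\; \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j
```

where D_r and D_n are the sets of relevant and non-relevant feedback documents, and α, β, γ weight the original query against the positive and negative centroids.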

In practice, the negative examples are not very important. Truncating the vectors, i.e., keeping only a small number of words with large weights in the centroid vector, usually makes the algorithm much more efficient. To avoid overfitting, it is suggested to keep a relatively high weight on the original query terms. This algorithm is generally robust and effective.

5. Ranking the search results for web search

Web search is one of the most important applications of text retrieval. The challenges are scalability, efficiency, and the varying quality of information, and the ranking should be combined with link analysis.

The crawler is an essential component of a web search application; the crawled pages are cached and form the corpus. Indexing documents at this scale requires storage across multiple machines, a distributed file system, and the MapReduce paradigm. Google's contributions here are: a. the Google File System (GFS), a distributed file system; b. MapReduce, a software framework for parallel computation. Hadoop is the open-source implementation of MapReduce (from Apache, not Google).

MapReduce minimizes the programmer's effort for simple parallel processing tasks: it hides many low-level details (networking, storage), has built-in fault tolerance, and performs automatic load balancing. A MapReduce formulation of inverted index construction is sketched below:
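The figure from the original post is not reproduced; as a stand-in, here is a sketch of the map and reduce functions (function names are mine):

```python
def map_fn(doc_id, text):
    """Map: for each document, emit (term, (docID, term frequency))."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    for term, freq in counts.items():
        yield term, (doc_id, freq)

def reduce_fn(term, postings):
    """Reduce: the framework groups all postings of one term together;
    sort them by docID to form that term's inverted list."""
    yield term, sorted(postings)
```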

The standard information retrieval models are not sufficient for the web, mainly because information needs differ, documents carry additional information, and information quality varies widely. The major extensions are exploiting links to improve scoring and exploiting clickthrough data as massive implicit feedback; in general, we rely on machine learning to combine all kinds of features.

Links are like citations in the literature: pages that are cited more often can be expected to be more useful. PageRank is essentially citation counting, and it improves on simple counting. The basic idea of the PR algorithm is to solve the update p ← (αE + (1 − α)M)ᵀ p iteratively until convergence, where M is the link transition matrix, E is the uniform matrix with all entries 1/N, and 1 − α is the probability of following a link (α is the probability of jumping to a random page).
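A sketch of the power iteration under these assumptions (each row of M sums to 1; dangling pages are ignored here):

```python
import numpy as np

def pagerank(M, alpha=0.15, tol=1e-9):
    """M[i, j]: probability of following a link from page i to page j
    (each row sums to 1); alpha: probability of a random jump."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)
    while True:
        p_next = alpha / n + (1 - alpha) * (M.T @ p)
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
```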

The computation can be quite efficient since M is usually sparse. Normalization does not affect the ranking, which leads to some variants of the formula. Pages with zero out-links are a problem, since the p(di) then no longer sum to 1; one fix is to set a page-specific damping factor α = 1 for such pages. The PR algorithm has many extensions, such as topic-specific PageRank, and other applications such as social network analysis.

Besides PageRank, HITS (Hypertext-Induced Topic Search) can also be applied for link analysis. The intuition behind HITS is that pages that are widely cited are good authorities, while pages that cite many other pages are good hubs. The key idea is mutual reinforcement: good authorities are cited by good hubs, and good hubs point to good authorities, iterated to a fixed point. HITS is also used in general graph/network analysis.
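A sketch of the HITS iteration on an adjacency matrix (my formulation of the reinforcement described above):

```python
import numpy as np

def hits(L, iters=50):
    """L[i, j] = 1 if page i links to page j. Returns (authority, hub)
    scores; normalization keeps the values bounded without changing
    the ranking."""
    h = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h                   # authorities: cited by good hubs
        a /= np.linalg.norm(a)
        h = L @ a                     # hubs: point to good authorities
        h /= np.linalg.norm(h)
    return a, h
```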

After obtaining the retrieval score from the TR system and the PageRank score, we usually combine them using machine learning methods such as logistic regression.

6. Search Engine System Design

6.1 Scenario

A search engine is an information retrieval system designed to help find information stored on a distributed storage system. Search engines can be categorized into full-text, vertical, and meta search engines, and by media type into text, image, or video search. Here we discuss full-text search first.

6.2 Necessary: constraints/hypotheses

a. Network load

Assume 0.1 billion daily active users.

concurrent users ≈ daily users / 5 = 20 million

peak users = concurrent users × 3 = 60 million

max peak users in 3 months = peak users × 2 = 120 million

b. QPS

daily requests = daily users × 2 (growth in 3 months) × 30 requests per user = 6 billion

queries per second = daily requests / 24 / 3600 ≈ 70,000

c. Memory estimation

memory per query = 0.1KB (account ID, query history, location, time, etc.)

memory per user = 0.1KB × 30 queries = 3KB

max daily memory = daily users × growth rate in 3 months × memory per user = 0.1 billion × 2 × 3KB = 600GB

d. Traffic estimation

size per page = 100KB

requests per user = 30/day

max peak traffic = max peak users in 3 months × 100KB × 30 / 24 / 3600 ≈ 4GB/s

6.3 Application

a. Crawler → crawls web content; can be simplified to graph search (DFS/BFS)

b. Indexer → parses text into an inverted index (→ storage); memory-based and disk-based inverted-index algorithms

c. Search → answers the user's query and ranks the results; TF-IDF and PageRank

6.4 Kilobit: data storage

a. Crawler → saves the page content locally and stores it on GFS

b. Indexer → saves the inverted index, using BigTable for efficiency

c. Search → saves historical query data, using MySQL

6.5 Evolve

From the perspective of users:

a. Better: tighter constraints (e.g., sustaining rapid growth after entering China's market)

b. Broader: new use cases (e.g., filtering information for particular countries)

c. Deeper: more detail (document search can go deeper into news search, scholar search, trending-topic search, etc.)

From the perspective of servers:

Performance (decrease the response time from 0.5s to 0.1s)

Scalability (from 1 billion to 7 billion users)

Robustness (tolerate server failures and recover fast)
