Recommendation System

BigFanTheory
May 4, 2021
1. What is a Recommendation System

A recommendation system provides product suggestions or information to customers on an e-commerce website. It helps the customer decide which product to purchase and acts like a salesperson. Personalized recommendation provides the products and information that a specific user will be interested in, based on the user's past behavior and individual interests.

With the rapid growth of e-commerce, the number of products and the number of product categories are increasing quickly, and it is becoming more time-consuming for customers to find useful information. Having to browse an excessive amount of information and products drives away customers who are overloaded by it.

Personalized recommendation is designed to solve this problem. It is an intelligent business platform built upon mining massive data, which aims to provide each user with personalized information useful for their own decisions. Common recommendation systems in the market include: "guess you like" on Amazon, movie recommendation on Netflix, things to do nearby, and so on.

A recommendation system is more of an engineering project. To make precise recommendations, the algorithm itself is not enough. It is a comprehensive system that usually involves identifying customers' intent, sentiment analysis, behavioral analysis, and so on.

2. System Architecture

In this section, a few recommendation architectures are introduced. They are not independent of each other; in practice, they should be combined. The whole system can be roughly separated into online training and offline training modules.

2.1 Offline Training

The offline training recommendation system is one of the most common. "Offline" means training with data generated a week or a few weeks back. The model is usually updated every hour or every few hours. It tracks mid- and long-term customer interests.

As shown in the figure below, a typical offline recommendation system has data collection, offline training, online storage, online prediction, and A/B testing modules. Data collection and offline training form the supervised-learning part; online prediction and A/B testing form the prediction part. Besides the model, there is online storage, which holds both the model and the feature information needed for online prediction. The modules can be divided into one training data flow and one prediction data flow. The data collected for training the model is also stored in the online storage. On the prediction side, the web front end sends the request for A/B testing and retrieves the prediction results from the online prediction module.

(1) Data Collection: collects logs from the web; usually includes collection, verification, cleaning, and transformation steps. The module collects data, converts it into a format that can be used directly for training, and saves it in offline storage.

(2) Offline Training: can be further separated into offline storage and offline computation. In practice, the offline training data are massive and require a distributed file system for storage. Typical offline computations include data sampling, feature engineering, model training, and similarity calculation.

(3) Online Storage: has a stricter requirement on response speed. If a user opens the app and it takes a while to respond, the user experience suffers badly. Generally speaking, the recommendation system should respond within a couple of milliseconds after the user sends a request. To meet such a high demand on speed, online storage is designed to hold the online model and feature data.

(4) Online Prediction: makes predictions for each new user request. This module does the following: a. obtain user features; b. retrieve the model; c. sort the results.

In practice, the number of products is massive; if we scored every product, it would take too long to answer the user's request. A common approach is to split recommendation into two steps: filtering and sorting. Filtering picks out a few hundred products from a pool of millions; sorting then scores those remaining candidates. To make the recommendations more diverse, a third step can re-sort and filter after the sorting step, as sketched below.
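As a minimal sketch of this two-step flow (all names and the embedding-retrieval setup are illustrative assumptions, not from the article): filtering narrows the catalog with a cheap score, and the expensive ranker only sees the surviving candidates.

```python
import numpy as np

def recommend(user_vec, item_vecs, ranker, n_candidates=500, top_n=10):
    """Two-stage recommendation: cheap filtering, then expensive sorting.

    user_vec:  (d,) user embedding; item_vecs: (n_items, d) item embeddings.
    ranker:    any callable mapping (user_vec, item_vec) -> score in [0, 1].
    """
    # Stage 1 (filtering): cheap dot-product retrieval over the full catalog.
    coarse_scores = item_vecs @ user_vec
    candidates = np.argsort(-coarse_scores)[:n_candidates]

    # Stage 2 (sorting): the expensive model scores only the few hundred candidates.
    fine_scores = np.array([ranker(user_vec, item_vecs[i]) for i in candidates])
    return candidates[np.argsort(-fine_scores)[:top_n]]
```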

(5) A/B Testing: is a necessary module for any web-based application. It helps developers evaluate how an algorithm affects user behavior. Besides the metrics that can be computed offline, A/B testing is usually performed to test the effectiveness of a new algorithm before it is put online.

2.2 Online Training

For the sake of business, we want to use a user's feedback on the previous product (like or dislike, click or not) in the recommendation of the next product. This requires online training.

An online training recommendation system can be used for advertisement or e-commerce, where the data is usually high-dimensional and an instant response is required. Compared to the offline system, the online training system no longer separates training (offline) from prediction (online). The system learns every time the user provides feedback and adjusts its strategy instantly. On one hand, the online system requires the samples, features, and model to be updated instantly so that it can provide recommendations matching the user's interest more quickly. On the other hand, the online system does not need to save all the user data for training, which reduces storage cost. This makes the system flexible enough to take in a bigger model and/or a larger amount of data.

(1) Data Collection: in the offline system, the data is collected into a distributed file system and waits for offline computation tasks. In the online case, data collection, including filtering, de-duplication, and sampling, needs to be done instantly.

(2) Instant Features: this module mainly does feature engineering and organization, preparing inputs for the streaming computing module.

(3) Streaming Computing: this module updates the model based on the instant feature input. It supports sparse storage of the model. Model training can be initialized from a model trained offline.

(4) Model Upload: the model is usually saved on the training server and then uploaded by this module to online storage.

3. Feature Engineering

To train the model, the system needs to collect user behavior and construct feature vectors. Each feature vector consists of features and their weights. To obtain the feature vectors from user behavior, the following factors need to be considered (a small sketch combining several of them follows this list):

(1) The type of user behavior: when browsing the website, a user may exhibit the following behaviors towards a product: browsing the product, clicking on the link, saving the product, rating it, purchasing it, commenting on it, sharing it, putting a different label on it, or searching a different keyword. All these behaviors affect the weights of the feature vector, and different behaviors have different impacts. It is hard to make a common rule for the weights of all behaviors; the general idea is that a behavior that costs the user more gets a higher weight.

(2) The time the behavior happens: recent behavior has a higher impact than behavior that happened a long time ago. For example, if a user purchased a product recently, the corresponding feature gets a higher weight.

(3) The frequency of the behavior: certain behaviors can happen multiple times. For example, a user can listen to a song multiple times, or watch different episodes of a TV series. The frequency is therefore also reflected in the feature weights.

(4) The popularity of the product: if a user happens to browse a very popular product, this might not reflect the user's interest; it might just be that the link to this product is everywhere. On the contrary, if a user's behavior is towards a less popular product, it reflects the user's personal interest more and deserves a higher weight in feature engineering.

(5) De-noising: it can happen that a user repeats a certain behavior many times just because of some promotion, or that some type of data is missing. Noisy data should be removed and missing data needs to be filled.

(6) Sample Balance: samples where the user acted on a product are positive samples. For a balanced data set, negative samples where the user did not act on a product should also be sampled.

(7) Feature combination: the relevance of the features to the results should be considered. For example, for ranking hotels, the hotel's revenue, its price, and the customer's affordability are features with high impact. The customer's age and location are less important, and the customer id is totally irrelevant. After selecting the relevant features, each feature should be put in a numerical format. Common useful features include: counting features such as browsing times and purchase times; rate features such as click rate and conversion rate; and statistical features such as mean price, standard deviation, quantiles, skewness, and kurtosis.
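As a toy illustration of factors (1), (2), and (4) together, the sketch below combines a per-behavior base weight, an exponential time decay, and a popularity penalty into one feature weight. All constants and names are illustrative assumptions, not values from the article.

```python
import math
import time

# Assumed per-behavior base weights: costlier actions weigh more (illustrative).
BEHAVIOR_WEIGHT = {"browse": 1.0, "click": 2.0, "save": 3.0,
                   "comment": 4.0, "share": 4.0, "purchase": 5.0}

def feature_weight(behavior, event_ts, item_popularity, now=None, half_life_days=30.0):
    """Weight = behavior type x time decay x popularity penalty."""
    now = now or time.time()
    age_days = (now - event_ts) / 86400.0
    time_decay = 0.5 ** (age_days / half_life_days)              # recent events count more
    popularity_penalty = 1.0 / math.log(2.0 + item_popularity)   # popular items tell us less
    return BEHAVIOR_WEIGHT.get(behavior, 1.0) * time_decay * popularity_penalty
```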

4. Collaborative Filtering

Collaborative filtering dates back to 1992, when Xerox aimed to build a personalized mail filtering system. Collaborative filtering trains on user-item interaction data; in this way, the system can meet different users' needs more accurately. It has two variants, item-CF and user-CF, both of which return a top-N recommendation list.

4.1 Item collaborative filtering

This method became popular after Amazon published its patent in 2001. Compared to user collaborative filtering, it has a lower online computation burden in systems where the number of users is much larger than the number of products and the products are updated less frequently. The basic assumption of item-CF is that a user is likely to like items that are very similar to what he or she liked in the past. The key idea is therefore to find the items that are similar to the items the user already liked. The ways to obtain the similarity between different items are:

4.1.1 Jaccard Similarity

In the Jaccard similarity below, N(i) denotes the set of users who purchased item i. The numerator is the number of users who purchased both item i and item j; the denominator is the total number of users who purchased either i or j:

w_ij = |N(i) ∩ N(j)| / |N(i) ∪ N(j)|

The ratio between the two gives a weight that tells how similar item i and item j are.
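A minimal sketch of the Jaccard computation, assuming we already have the set of purchasers per item:

```python
def jaccard_similarity(users_i, users_j):
    """users_i, users_j: sets of user ids who purchased item i / item j."""
    both = len(users_i & users_j)    # purchased both items
    either = len(users_i | users_j)  # purchased at least one of the two
    return both / either if either else 0.0

# Example: 2 of the 4 distinct purchasers bought both items -> similarity 0.5.
print(jaccard_similarity({1, 2, 3}, {2, 3, 7}))  # 0.5
```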

4.1.2 Cosine similarity

The method above counts the number of users who purchased both items and obtains their relative similarity. However, there are cases where a user purchased an item and ended up not liking it. If the user-item data are ratings, we can calculate the cosine similarity between items as follows:

sim(j, k) = Σ_i (N_ij · N_ik) / ( sqrt(Σ_i N_ij²) · sqrt(Σ_i N_ik²) )

N_ik represents the rating from user i to item k. If a rating is missing, 0 is usually used in its place.

For a simple recommendation system, products can be recommended directly based on this similarity score. For a more complex (rating) system, the predicted rating for each candidate item is the sum of the user's ratings on all rated items, weighted by their similarity to the candidate item, and the final recommendation is made based on this predicted rating, as in the sketch below.
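Below is a minimal numpy sketch of both steps: item-item cosine similarity on a zero-filled rating matrix, and a similarity-weighted rating prediction. The function names and the small epsilon guard are my own choices.

```python
import numpy as np

def item_cosine_similarity(R):
    """R: (n_users, n_items) rating matrix, 0 for missing ratings.
    Returns an (n_items, n_items) cosine-similarity matrix between item columns."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0                 # avoid division by zero for unrated items
    return (R.T @ R) / np.outer(norms, norms)

def predict_rating(R, sim, user, item):
    """Similarity-weighted average of the user's existing ratings."""
    rated = np.nonzero(R[user])[0]          # items the user already rated
    w = sim[item, rated]
    return (w @ R[user, rated]) / (np.abs(w).sum() + 1e-9)
```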

4.1.3 Co-Occurrence Matrix based recommendation

The goal of the co-occurrence matrix is to count how many times two items appear together for the same user in the historical data. Below is an example of movie recommendation based on historical rating data. After parsing the user rating log, a matrix storing the user-movie rating pairs can be obtained. Those user-movie pairs are translated into the co-occurrence matrix as follows: the diagonal element is the total number of users who rated that movie. Mathematically, the co-occurrence matrix M = Rᵀ·R, where R is the rating-indicator matrix (a binary element represents rated or not).

Going from the parsed data log to the co-occurrence matrix usually takes 2 map-reduce jobs, plus another pass to normalize each row of the co-occurrence matrix. The normalized co-occurrence matrix is then multiplied with the user rating vector. It is worth mentioning that, to keep the final score vector on the same scale as the existing ratings, the default rating of an unrated movie should be the user's average rating.

In the final result vector, we remove the elements that already had a rating and recommend the rest based on the element values, as in the sketch below.
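A compact numpy sketch of the whole recipe (binary co-occurrence, row normalization, average-filled rating vector, and removal of already-rated items); it condenses the two map-reduce jobs into in-memory matrix operations purely for illustration.

```python
import numpy as np

def cooccurrence_recommend(R, ratings_u, top_n=3):
    """R: (n_users, n_items) binary matrix (1 = user rated the item).
    ratings_u: (n_items,) the target user's ratings, 0 for unrated items."""
    M = R.T @ R                                    # co-occurrence counts; diagonal = #raters
    M = M / (M.sum(axis=1, keepdims=True) + 1e-9)  # row-normalize each item's counts
    # Fill unrated slots with the user's average rating before multiplying,
    # so scores stay on the same scale as the existing ratings.
    avg = ratings_u[ratings_u > 0].mean()
    filled = np.where(ratings_u > 0, ratings_u, avg)
    scores = M @ filled
    scores[ratings_u > 0] = -np.inf                # drop items the user already rated
    return np.argsort(-scores)[:top_n]
```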

The disadvantage of item-CF is that it ignores differences between users, so it is less accurate than user-CF.

4.2 User Collaborative Filtering

The basic assumption of user-CF is that similar users have similar product preferences. This algorithm is more suitable for scenarios where there are more products than users and the products are updated frequently, such as a news app. It is worth mentioning that it is usually more computationally efficient to calculate the similarity using an inverted index, as shown below, since each user rates only a few items and the rating matrix is usually sparse.

For the target user (row vector), we obtain its similarity with other users using measures such as Pearson correlation or cosine similarity.

For top-N recommendation, we find the K users with the highest similarity, and among all the products favored by those K users, we find the products the target user has not purchased yet and recommend them based on their scores; a minimal sketch follows.
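A minimal sketch of user-CF with an inverted index, assuming binary interaction data (a set of item ids per user); on binary vectors, cosine similarity reduces to the co-count divided by the geometric mean of the two list lengths.

```python
from collections import defaultdict

def user_cf_topn(user_items, target, k=5, top_n=10):
    """user_items: dict user -> set of item ids; target: the target user id."""
    # Inverted index item -> users, so we only compare users who share an item.
    item_users = defaultdict(set)
    for u, items in user_items.items():
        for i in items:
            item_users[i].add(u)

    overlap = defaultdict(int)                   # co-rated item counts per neighbor
    for i in user_items[target]:
        for u in item_users[i]:
            if u != target:
                overlap[u] += 1

    # Cosine similarity on binary vectors: |I_u ∩ I_v| / sqrt(|I_u| * |I_v|)
    sims = {u: c / ((len(user_items[target]) * len(user_items[u])) ** 0.5)
            for u, c in overlap.items()}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:k]

    scores = defaultdict(float)
    for u in neighbors:
        for i in user_items[u] - user_items[target]:   # unseen items only
            scores[i] += sims[u]
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```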

The advantage of user-CF is that we can skip mining the item-item matrix and the users' item preferences. The disadvantage shows up when the number of users scales.

5. Matrix Factorization

Previously we talked about item- and user-based collaborative filtering. The disadvantage of both methods is that the data are sparse and it is difficult to account for instant data at big-data scale. These memory-based methods do not rely on a pre-trained model; they take in the whole data matrix and respond directly. Model-based CF methods, by contrast, train a model on historical data first and then make predictions with that trained model; they include Latent Semantic Indexing, Bayesian Networks, and so on. The challenge of these methods is how to update the model based on the user's most recent feedback.

In the Netflix Prize challenge (2006-2009), the matrix factorization method showed its effectiveness by reducing the RMSE by 10%. The Netflix data set has about 480,000 users and 17,770 movies with only about one hundred million ratings, so the matrix is rather sparse. A natural approach to reduce the computational burden is matrix factorization.

After obtaining the user factor matrix P and the item factor matrix Q, we can compute the rating as R_ij = P_i · Q_j, the inner product of user i's factor vector and item j's factor vector. The decomposed matrices P and Q are only used to reconstruct the rating scores.

The common ways to do matrix factorization are:

5.1 Non-negative Matrix Factorization (NMF, introduced in Andrew Ng's Machine Learning course on Coursera)

The basic idea is to find the elements of the user factor matrix P and the item factor matrix Q iteratively through gradient descent. The loss function is the regularized L2 norm of the difference between the predictions and the known elements of the rating matrix, i.e. L = Σ over observed (i,j) of (R_ij − P_i·Q_j)² + λ(‖P‖² + ‖Q‖²). A minimal sketch follows.
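A minimal SGD sketch of this factorization; the loop-based update and all hyperparameters are illustrative choices, not values from the article.

```python
import numpy as np

def factorize(R, k=10, lr=0.01, reg=0.02, epochs=100):
    """R: (n_users, n_items) with 0 for missing ratings. Returns P, Q with R ~ P @ Q.T."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    users, items = np.nonzero(R)                     # train only on observed ratings
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])   # gradient step with L2 penalty
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q
```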

This method, however, is not the MF method shown to be most effective in the Netflix contest.

5.2 SVD

5.3 Probabilistic Matrix Factorization (PMF)

The difference between PMF and SVD is that PMF assumes the scores in the rating matrix are only the means of Gaussian distributions, each with a standard deviation sigma. PMF performs better than SVD, especially when there are many infrequent users.

Other model-based methods include the self-representation model, the Restricted Boltzmann Machine (RBM), clustering-based collaborative filtering, classifiers for CF, etc.

6. Learning to Rank

Traditional recommendation is based on collaborative filtering, finding top-N recommendations through methods such as matrix factorization. This step returns on the order of a few hundred or thousand candidates. In reality, the recommender needs to combine other factors before making the final recommendation. This procedure, learning to rank, takes the user, item, and context as input, outputs a score from 0 to 1, and picks the top N. It is also called CTR prediction. The common methods include:

6.1 Logistic Regression

The simplest method for prediction. The input concatenates the user vector, the item vector of one of the candidates from collaborative filtering, and the context vector. If the user had a positive interaction with the item, the output is 1, counted as a positive sample. If the user skipped the item, or did not act on a popular item, it counts as a negative sample. After collecting the inputs and outputs, a logistic regression model can be trained, as sketched below.
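A minimal sklearn sketch, with synthetic data standing in for the concatenated [user | item | context] vectors; the feature width and labels are arbitrary placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row concatenates [user_vec | item_vec | context_vec];
# label 1 = clicked/acted (positive), 0 = skipped or ignored a popular item (negative).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
y = rng.integers(0, 2, size=1000)

model = LogisticRegression(max_iter=1000).fit(X, y)
ctr_scores = model.predict_proba(X[:5])[:, 1]   # predicted click probability in [0, 1]
```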

6.2 GBDT

A gradient boosted decision tree can also be used for ranking. Different from LR, GBDT uses many small decision trees, where each new tree fits the residual error of the previous trees to further reduce the loss.

GBDT can be combined with LR. The LR only improves the result of GBDT a little (~5% during online training), but it can prevent GBDT from overfitting. In addition, LR does not require manually converting new features during online training, which improves preprocessing efficiency.
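One standard way to combine the two is the leaf-encoding scheme popularized by Facebook's ads team (the article does not specify which variant it means, so treat this as an assumption): the index of the leaf each tree routes a sample to becomes a learned categorical feature, which is one-hot encoded and fed to LR.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))                 # hypothetical dense features
y = rng.integers(0, 2, size=1000)               # hypothetical click labels

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
leaves = gbdt.apply(X).reshape(len(X), -1)      # leaf index per tree = a learned feature
enc = OneHotEncoder()
lr = LogisticRegression(max_iter=1000).fit(enc.fit_transform(leaves), y)
```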

GBDT does not support high-dimensional sparse vectors, and if we use LR on the high-dimensional vector with pairwise feature combinations, the computational burden grows quadratically. The GBDT+FM model shown below, in which an FM replaces the LR mentioned in the previous paragraph, solves this problem better.

The GBDT+FM model shows a 4~6% improvement over the GBDT+LR model. The FM scoring function is sketched below.
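For reference, a minimal sketch of the FM scoring function itself, using the standard identity that computes the pairwise term in O(k·n) instead of O(n²); the parameter names w0, w, V are my own.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Factorization Machine: w0 + <w, x> + sum over i<j of <V_i, V_j> x_i x_j.

    x: (n,) feature vector; w0: bias; w: (n,) linear weights; V: (n, k) latent factors.
    The pairwise term uses 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    linear = w0 + w @ x
    s = V.T @ x                                          # (k,) factor-wise sums
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise
```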

6.3 GBDT+FM+DNN

The GBDT+FM model does not fully support deep features such as an embedding layer. A DNN can learn from the embedding layer and improve the model accuracy. The details are shown in the figure below. The model shows an average 4% improvement over the GBDT+FM model. The Adam optimizer has shown the best training performance.

Some advances in the industry are:

a. YouTube published its DNN-based learning-to-rank system in 2016.

b. Google published the Wide & Deep model in 2016, which has a similar structure to the previous model, except that a cross-feature LR is used in place of the FM.

c. YouTube's recommender uses only deep learning and is separated into two parts, candidate generation and ranking. It has shown an improvement over the previous model where FM was used.

7. Cold start and Content Based Filtering

Cold start means there is too little historical data to make a personalized recommendation. This is a common problem for recommenders. On one hand, a new product has zero user-interaction data; on the other hand, a new user has zero history of how he or she acts on products. Without additional data from outside apps, it is difficult to make a prediction.

7.1 User based cold start

The problem can be approached with the following methods:

(a) Create a user profile based on his or her account information.

(b) Use the user's phone IMEI number as part of the account information.

(c) Create a survey page asking the user directly about his or her interests.

7.2 Item based cold start

(a) Use the content and classification information to create an item profile.

(b) Use labels from experts.

7.3 System cold start

For a new website (with only item information, no users and no user behavior), how do we recommend when the website first goes online? Content-based recommendation, described next, is the usual starting point.

7.4 Content Based Recommendation

Since collaborative filtering methods all suffer from cold start, it is worth mentioning content-based recommendation, which relies on the item profile and the similarity between new items and the items the user used to like. Below is an image showing how the data is parsed for collaborative filtering; in the case of CB recommendation, only the item attributes and user profiles are used. It is very useful in the cold-start case.

7.5 Demographic based Recommendation

Demographic-based recommendation is rarely used alone. It is based on the user profile, which contains gender, age, active periods, locations, and so on. The basic assumption is that a user will like products similar to those liked by users who share a similar profile.

The advantage of this method is that the user profile barely changes, so the similarity calculation between users can be done offline. The major problem is the credibility of the results, since users with very similar profiles can have totally different preferences.

8. A/B testing and evaluation of recommendation results

A/B testing is a very common way to test methods online. It separates users into several groups, applies a different algorithm to each group, and compares the CTR across groups.

The evaluation metrics for offline systems include accuracy, coverage, diversity, and novelty. For online A/B testing, the common metrics are CTR, the time users spend, and advertisement profit. It is worth mentioning that short-term metrics should be combined with long-term metrics; the PM should consider all the aspects. A minimal sketch of hash-based user bucketing follows.
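A minimal sketch of deterministic user bucketing for A/B tests; the experiment name and bucket count are illustrative assumptions.

```python
import hashlib

def ab_bucket(user_id, experiment="ranker_v2", n_buckets=2):
    """Deterministically assign a user to an experiment group by hashing,
    so the same user always sees the same algorithm."""
    h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(h, 16) % n_buckets

# Each bucket runs one algorithm; compare CTR = clicks / impressions across buckets.
```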

