Collaborative Filtering Recommender Systems in Practice (with Complete Code)

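[Overview] This article implements a simple recommender system in Python, covering both user-based and item-based approaches, with code built on the sklearn toolkit. Beyond the code, it explains the theory behind the two approaches, User-Based Collaborative Filtering and Item-Based Collaborative Filtering, describes several common similarity measures and the situations each one suits, and implements an evaluation of the recommender. It closes by comparing the strengths and weaknesses of the two approaches and noting that hybrid techniques may perform better.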

Recommender Systems Based on Collaborative Filtering

In this article, I implement a simple recommender system in Python.

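Consider the user-item ratings matrix M below, in which 6 users (rows) have rated 6 items (columns). Ratings are integers from 1 to 10, and 0 means that no rating exists. (Note that Python uses zero-based row and column indices, but for user input both user_id and item_id take values from 1 to 6.) Suppose we must find out whether user 3 would like item 4. User 3 is then our target user, or active user, and item 4 is the target item.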

The code is as follows:

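import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# M is the user-item ratings matrix; ratings are integers from 1-10, 0 = not rated
M = np.asarray([[3, 7, 4, 9, 9, 7],
                [7, 0, 5, 3, 8, 8],
                [7, 5, 5, 0, 8, 4],
                [5, 6, 8, 5, 9, 8],
                [5, 8, 8, 8, 10, 9],
                [7, 7, 0, 4, 7, 8]])
M = pd.DataFrame(M)

# k and metric are module-level defaults that the user can change later
k = 4  # neighbours returned per query, including the query row itself
metric = 'cosine'  # can be changed to 'correlation' for Pearson correlation similarities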
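To make the indexing convention concrete, here is a minimal sketch that looks up ratings with 1-based IDs. It uses plain nested lists instead of the DataFrame, and the helper names get_rating and unrated_pairs are hypothetical, introduced only for illustration.

```python
# Ratings matrix from the article as plain nested lists (0 = not rated).
M_LIST = [
    [3, 7, 4, 9, 9, 7],
    [7, 0, 5, 3, 8, 8],
    [7, 5, 5, 0, 8, 4],
    [5, 6, 8, 5, 9, 8],
    [5, 8, 8, 8, 10, 9],
    [7, 7, 0, 4, 7, 8],
]


def get_rating(matrix, user_id, item_id):
    """Return the rating for 1-based user_id and item_id (hypothetical helper)."""
    return matrix[user_id - 1][item_id - 1]


def unrated_pairs(matrix):
    """List every (user_id, item_id) pair, 1-based, whose rating is 0."""
    return [(u + 1, i + 1)
            for u, row in enumerate(matrix)
            for i, rating in enumerate(row)
            if rating == 0]


print(get_rating(M_LIST, 3, 4))   # user 3 has not rated item 4
print(unrated_pairs(M_LIST))
```

The pair (user 3, item 4) is one of the three unrated cells, which is why it serves as the prediction target throughout the article.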

1

User-Based Collaborative Filtering

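First, we must predict the rating user 3 would give item 4. In a user-based recommender, we find the 3 users most similar to user 3 and use their ratings to predict user 3's rating for item 4.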

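Commonly used similarity measures include cosine, Pearson, and Euclidean. We use cosine similarity here. For two users a and u with ratings r_{a,i} and r_{u,i} on item i, it is defined as:

$$w(a,u) = \frac{\sum_i r_{a,i}\, r_{u,i}}{\sqrt{\sum_i r_{a,i}^2}\,\sqrt{\sum_i r_{u,i}^2}}$$

The Pearson correlation is defined as:

$$w(a,u) = \frac{\sum_i (r_{a,i}-\bar r_a)(r_{u,i}-\bar r_u)}{\sqrt{\sum_i (r_{a,i}-\bar r_a)^2}\,\sqrt{\sum_i (r_{u,i}-\bar r_u)^2}}$$

where \bar r_a and \bar r_u are the mean ratings of users a and u.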
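The two measures can be sketched in a few lines of plain Python. This is an illustrative sketch, not the article's implementation: the function names are hypothetical, the vectors are assumed to be non-zero and of equal length, and unrated entries (0) are treated as ordinary values, as they are when the raw matrix is passed to a nearest-neighbor search.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_similarity(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])


# Rows for user 3 and user 4 of the article's matrix M.
user_3 = [7, 5, 5, 0, 8, 4]
user_4 = [5, 6, 8, 5, 9, 8]

print(cosine_similarity(user_3, user_4))
print(pearson_similarity(user_3, user_4))
```

Written this way, the only difference between the two measures is the mean-centring step, which is what makes Pearson insensitive to a user's overall rating level.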

In sklearn, the NearestNeighbors class can be used to search for the k nearest neighbors under a variety of similarity measures.

The code is as follows:

# This function finds the k most similar users given the user_id and the ratings matrix M.
# Note that the similarities are the same as those obtained using pairwise_distances.
def findksimilarusers(user_id, ratings, metric=metric, k=k):
    model_knn = NearestNeighbors(metric=metric, algorithm='brute')
    model_knn.fit(ratings)

    # user_id is 1-based; the DataFrame rows are 0-based
    distances, indices = model_knn.kneighbors(
        ratings.iloc[user_id - 1, :].values.reshape(1, -1), n_neighbors=k)
    similarities = 1 - distances.flatten()
    print('{0} most similar users for User {1}:\n'.format(k - 1, user_id))
    for i in range(len(indices.flatten())):
        if indices.flatten()[i] + 1 == user_id:
            continue  # skip the target user itself
        print('{0}: User {1}, with similarity of {2}'.format(
            i, indices.flatten()[i] + 1, similarities.flatten()[i]))

    return similarities, indices

# This function predicts the rating for the given user-item pair with the user-based approach
def predict_userbased(user_id, item_id, ratings, metric=metric, k=k):
    # similar users under the chosen metric
    similarities, indices = findksimilarusers(user_id, ratings, metric, k)
    # user_id - 1 adjusts for zero-based indexing
    mean_rating = ratings.loc[user_id - 1, :].mean()
    # the target user is among the neighbours with similarity 1, so subtract it
    sum_wt = np.sum(similarities) - 1
    wtd_sum = 0

    for i in range(len(indices.flatten())):
        if indices.flatten()[i] + 1 == user_id:
            continue
        # deviation of the neighbour's rating from that neighbour's own mean
        ratings_diff = (ratings.iloc[indices.flatten()[i], item_id - 1]
                        - np.mean(ratings.iloc[indices.flatten()[i], :]))
        wtd_sum += ratings_diff * similarities[i]

    prediction = int(round(mean_rating + (wtd_sum / sum_wt)))
    print('\nPredicted rating for user {0} -> item {1}: {2}'.format(
        user_id, item_id, prediction))

    return prediction

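The findksimilarusers function uses this method to return the similarities and indices of the k nearest neighbors of the target user. The predict_userbased function then predicts user 3's rating for item 4 with the user-based approach. The prediction is the weighted average of the neighbors' deviations from their own mean ratings, added to the target user's mean rating. Working with deviations adjusts for user bias, which arises because some users tend to rate every item high or every item low.

$$p(a,i) = \bar r_a + \frac{\sum_{u \in K} (r_{u,i}-\bar r_u)\, w(a,u)}{\sum_{u \in K} w(a,u)}$$

where p(a,i) is the predicted rating of target user a for item i, r_{u,i} is user u's rating of item i, \bar r_a and \bar r_u are the mean ratings of users a and u, w(a,u) is the similarity between users a and u, and K is the set of the K users most similar to the target user.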
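A small worked example may help to see the arithmetic of this prediction. The sketch below is illustrative only: the function name and the neighbor values are made up and are not taken from the article's matrix.

```python
def predict_mean_centered(target_mean, neighbours):
    """Mean-centred weighted prediction.

    neighbours is a list of (similarity, neighbour_rating, neighbour_mean)
    tuples; all names and values here are hypothetical.
    """
    weighted_deviation = sum(w * (r - m) for w, r, m in neighbours)
    total_weight = sum(w for w, _, _ in neighbours)
    return target_mean + weighted_deviation / total_weight


# Made-up neighbours: (similarity, rating of the target item, own mean rating)
neighbours = [
    (0.9, 8, 7.0),   # rated the item 1 point above their own mean
    (0.8, 5, 6.0),   # rated the item 1 point below their own mean
    (0.5, 9, 8.0),   # rated the item 1 point above their own mean
]

prediction = predict_mean_centered(5.0, neighbours)
print(round(prediction, 3))
```

The weighted deviation is 0.9 - 0.8 + 0.5 = 0.6 and the total weight is 2.2, so the prediction lands slightly above the target user's own mean of 5.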

2

Item-Based Collaborative Filtering

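In this approach, the similarity between a pair of items is computed with the cosine similarity measure. The rating of target user a for target item i can then be predicted with a simple weighted average:

$$p(a,i) = \frac{\sum_{j \in K} r_{a,j}\, w(i,j)}{\sum_{j \in K} w(i,j)}$$

where K is the set of the K items most similar to item i, w(i,j) is the similarity between items i and j, and r_{a,j} is user a's rating of item j.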

# This function finds the k most similar items given the item_id and the ratings matrix M
def findksimilaritems(item_id, ratings, metric=metric, k=k):
    ratings = ratings.T  # items become rows, so the neighbours are items
    model_knn = NearestNeighbors(metric=metric, algorithm='brute')
    model_knn.fit(ratings)

    distances, indices = model_knn.kneighbors(
        ratings.iloc[item_id - 1, :].values.reshape(1, -1), n_neighbors=k)
    similarities = 1 - distances.flatten()
    print('{0} most similar items for item {1}:\n'.format(k - 1, item_id))
    for i in range(len(indices.flatten())):
        if indices.flatten()[i] + 1 == item_id:
            continue  # skip the target item itself
        print('{0}: Item {1} :, with similarity of {2}'.format(
            i, indices.flatten()[i] + 1, similarities.flatten()[i]))

    return similarities, indices

# This function predicts the rating for the specified user-item combination
# based on the item-based approach
def predict_itembased(user_id, item_id, ratings, metric=metric, k=k):
    # similar items under the chosen metric (metric and k are now passed through)
    similarities, indices = findksimilaritems(item_id, ratings, metric, k)
    # the target item is among the neighbours with similarity 1, so subtract it
    sum_wt = np.sum(similarities) - 1
    wtd_sum = 0

    for i in range(len(indices.flatten())):
        if indices.flatten()[i] + 1 == item_id:
            continue
        # the user's rating of the neighbouring item, weighted by similarity
        wtd_sum += ratings.iloc[user_id - 1, indices.flatten()[i]] * similarities[i]

    prediction = int(round(wtd_sum / sum_wt))
    print('\nPredicted rating for user {0} -> item {1}: {2}'.format(
        user_id, item_id, prediction))

    return prediction

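In the code above, the findksimilaritems function uses the nearest-neighbor method with cosine similarity to find the k items most similar to item i. The predict_itembased function then predicts the rating user 3 would give item 4 using the item-based CF approach (the formula above).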
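The weighted average behind the item-based prediction can be checked by hand on a few numbers. The following sketch uses a hypothetical helper name and made-up similarities and ratings; it is not the article's predict_itembased.

```python
def predict_weighted_average(neighbours):
    """Similarity-weighted average of the user's ratings of similar items.

    neighbours is a list of (similarity, user_rating_of_that_item) tuples;
    the values used below are made up for illustration.
    """
    weighted_sum = sum(w * r for w, r in neighbours)
    total_weight = sum(w for w, _ in neighbours)
    return weighted_sum / total_weight


# Three items similar to the target item, with the target user's ratings of them.
neighbours = [(0.9, 8), (0.6, 5), (0.5, 4)]

item_prediction = predict_weighted_average(neighbours)
print(round(item_prediction, 2))
```

Here the weighted sum is 7.2 + 3.0 + 2.0 = 12.2 and the weights add up to 2.0, giving a prediction of about 6.1. Unlike the user-based formula, no mean is subtracted, so the result stays on the user's own rating scale.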

3

Adjusted Cosine Similarity

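With the item-based approach, the cosine similarity measure does not account for differences in how individual users rate. Adjusted cosine similarity offsets this drawback by subtracting the respective user's mean rating from each co-rated pair, and is defined as:

$$w(i,j) = \frac{\sum_{u \in U} (r_{u,i}-\bar r_u)(r_{u,j}-\bar r_u)}{\sqrt{\sum_{u \in U} (r_{u,i}-\bar r_u)^2}\,\sqrt{\sum_{u \in U} (r_{u,j}-\bar r_u)^2}}$$

where U is the set of users who rated both items i and j, r_{u,i} is user u's rating of item i, and \bar r_u is the mean rating of user u.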
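As a rough illustration of adjusted cosine similarity, the sketch below computes it for one pair of items on a tiny made-up matrix. The function name is hypothetical, and one choice is an assumption of this sketch: each user's mean is taken over rated items only, whereas the article's computeAdjCosSim averages over the whole row, zeros included.

```python
import math


def adjusted_cosine(ratings, i, j):
    """Adjusted cosine similarity between item columns i and j (0-based).

    Assumption: a user's mean is computed over rated (non-zero) entries only.
    """
    num = den_i = den_j = 0.0
    for row in ratings:
        if row[i] == 0 or row[j] == 0:
            continue  # only users who rated both items contribute
        rated = [r for r in row if r != 0]
        mean_u = sum(rated) / len(rated)
        dev_i = row[i] - mean_u
        dev_j = row[j] - mean_u
        num += dev_i * dev_j
        den_i += dev_i ** 2
        den_j += dev_j ** 2
    den = math.sqrt(den_i) * math.sqrt(den_j)
    return num / den if den else 0.0


# Made-up 3-user, 3-item matrix (0 = not rated).
toy = [
    [4, 2, 0],
    [5, 3, 4],
    [1, 5, 3],
]

print(adjusted_cosine(toy, 0, 1))
```

In this toy matrix every user who likes item 1 more than their average likes item 2 less than their average, so the adjusted similarity comes out at about -1 even though all raw ratings are positive.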

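To implement adjusted cosine similarity in Python, I define a simple function named computeAdjCosSim, which takes the ratings matrix and returns the adjusted cosine similarity matrix. The functions findksimilaritems_adjcos and predict_itembased_adjcos use the adjusted cosine similarity to find the k most similar items and to compute the predicted rating.

The recommendItem function prompts the user to select a recommendation approach: user-based (cosine), user-based (correlation), item-based (cosine), or item-based (adjusted cosine). Based on the selected approach and similarity measure, it predicts the rating for the specified user and item and indicates whether the item can be recommended. If the item has not yet been rated by the user and the predicted rating is 6 or higher, the item is recommended; if the predicted rating is below 6, it is not.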

The code is as follows:

# This function computes the adjusted cosine similarity matrix for items
def computeAdjCosSim(M):
    n_items = M.shape[1]
    sim_matrix = np.zeros((n_items, n_items))
    M_u = M.mean(axis=1)  # per-user mean rating (unrated zeros are included)

    for i in range(n_items):
        for j in range(n_items):
            if i == j:
                sim_matrix[i][j] = 1
            elif i < j:
                sum_num = sum_den1 = sum_den2 = 0
                for u in M.index:
                    # only users who rated both items contribute
                    if M.loc[u, i] != 0 and M.loc[u, j] != 0:
                        dev_i = M.loc[u, i] - M_u[u]
                        dev_j = M.loc[u, j] - M_u[u]
                        sum_num += dev_i * dev_j
                        sum_den1 += dev_i ** 2
                        sum_den2 += dev_j ** 2

                den = (sum_den1 ** 0.5) * (sum_den2 ** 0.5)
                sim_matrix[i][j] = sum_num / den if den != 0 else 0
            else:
                # the matrix is symmetric; reuse the value computed for (j, i)
                sim_matrix[i][j] = sim_matrix[j][i]

    return pd.DataFrame(sim_matrix)

# This function finds the k most similar items given the item_id and the ratings matrix M
def findksimilaritems_adjcos(item_id, ratings, k=k):
    sim_matrix = computeAdjCosSim(ratings)
    # the top-k entries include the item itself, whose similarity is 1
    top_k = sim_matrix[item_id - 1].sort_values(ascending=False)[:k]
    similarities = top_k.values
    indices = top_k.index

    print('{0} most similar items for item {1}:\n'.format(k - 1, item_id))
    for i in range(len(indices)):
        if indices[i] + 1 == item_id:
            continue  # skip the target item itself
        print('{0}: Item {1} :, with similarity of {2}'.format(
            i, indices[i] + 1, similarities[i]))

    return similarities, indices

# This function predicts the rating for the specified user-item combination
# with the adjusted cosine item-based approach.
# As adjusted cosine similarities range from -1 to +1, the predicted rating
# can sometimes fall below the minimum or above the maximum value.
# Hack to deal with this: the rating is set to the minimum (1) if the prediction
# is below it, and to the maximum (10) if the prediction is above it.
def predict_itembased_adjcos(user_id, item_id, ratings):
    # similar items based on adjusted cosine similarity
    similarities, indices = findksimilaritems_adjcos(item_id, ratings)
    # the target item is among the neighbours with similarity 1, so subtract it
    sum_wt = np.sum(similarities) - 1
    wtd_sum = 0

    for i in range(len(indices)):
        if indices[i] + 1 == item_id:
            continue
        wtd_sum += ratings.iloc[user_id - 1, indices[i]] * similarities[i]

    prediction = int(round(wtd_sum / sum_wt))
    if prediction < 1:
        prediction = 1
    elif prediction > 10:
        prediction = 10
    print('\nPredicted rating for user {0} -> item {1}: {2}'.format(
        user_id, item_id, prediction))

    return prediction

# Imports assumed for running the widget in a Jupyter notebook
import ipywidgets as widgets
from IPython.display import display, clear_output

# This function uses the functions above to recommend items for the selected approach.
# A recommendation is made if the predicted rating for an item is greater than or
# equal to 6 and the item has not been rated already.
def recommendItem(user_id, item_id, ratings):
    if type(user_id) is not int or user_id < 1 or user_id > 6:
        print('Userid does not exist. Enter numbers from 1-6')
    else:
        ids = ['User-based CF (cosine)', 'User-based CF (correlation)',
               'Item-based CF (cosine)', 'Item-based CF (adjusted cosine)']

        approach = widgets.Dropdown(options=ids, value=ids[0],
                                    description='Select Approach',
                                    layout=widgets.Layout(width='500px'))

        def on_change(change):
            clear_output(wait=True)
            if change['type'] == 'change' and change['name'] == 'value':
                if approach.value == 'User-based CF (cosine)':
                    prediction = predict_userbased(user_id, item_id, ratings, 'cosine')
                elif approach.value == 'User-based CF (correlation)':
                    prediction = predict_userbased(user_id, item_id, ratings, 'correlation')
                elif approach.value == 'Item-based CF (cosine)':
                    prediction = predict_itembased(user_id, item_id, ratings)
                else:
                    prediction = predict_itembased_adjcos(user_id, item_id, ratings)

                # rows are users and columns are items, both zero-based
                if ratings.iloc[user_id - 1, item_id - 1] != 0:
                    print('Item already rated')
                elif prediction >= 6:
                    print('\nItem recommended')
                else:
                    print('Item not recommended')

        approach.observe(on_change)
        display(approach)
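The decision rule at the end of recommendItem can be isolated from the widget code and tested on its own. The sketch below is a hypothetical refactoring, not part of the original article; the function name and the returned strings are assumptions.

```python
def recommendation_status(existing_rating, predicted_rating, threshold=6):
    """Apply the recommendation rule to a single user-item pair.

    existing_rating is the stored rating (0 = not rated yet);
    threshold defaults to the cut-off of 6 used in the article.
    """
    if existing_rating != 0:
        return 'already rated'
    if predicted_rating >= threshold:
        return 'recommended'
    return 'not recommended'


print(recommendation_status(0, 7))
print(recommendation_status(0, 5))
print(recommendation_status(8, 9))
```

Keeping the rule in a pure function like this makes the threshold easy to change and lets it be checked without a notebook or a dropdown.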

4

Choosing a Similarity Measure

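When choosing a similarity measure, the following guidelines can help:

• Use Pearson similarity when your data is affected by user bias or by users' different rating scales.

• Use cosine similarity if the data is sparse (many ratings are undefined).

• Use Euclidean distance if your data is not sparse and the magnitude of the attribute values matters.

• Use adjusted cosine similarity with the item-based approach to adjust for user bias.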
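The first guideline can be illustrated with two made-up users who rank items identically but use different parts of the rating scale. This is a minimal sketch with hypothetical names and data, comparing how the three measures react to a constant offset.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Same taste, different scale: the generous user adds 4 points to everything.
strict_user = [1, 2, 3, 4, 5]
generous_user = [r + 4 for r in strict_user]

print('pearson  :', pearson(strict_user, generous_user))
print('cosine   :', cosine(strict_user, generous_user))
print('euclidean:', euclidean(strict_user, generous_user))
```

Pearson treats the two users as perfectly similar, cosine similarity drops below 1, and the Euclidean distance is large because it reflects the magnitude of the offset rather than the shared ordering.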

5

Evaluating the Recommender System

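There are many metrics for evaluating recommender systems, but the most commonly used is RMSE (root mean squared error). The evaluateRS function uses sklearn's mean_squared_error function to compute the RMSE between the predicted and actual ratings and displays the RMSE value for the selected approach. (To keep the explanation simple, a small dataset is used, so it has not been split into training and test sets; cross-validation is not considered in this article either.)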

The code is as follows:

import io
from contextlib import redirect_stdout
from math import sqrt

import ipywidgets as widgets
from IPython.display import display, clear_output
from sklearn.metrics import mean_squared_error

# suppress_stdout is not shown in the article; this is an assumed minimal version
def suppress_stdout():
    return redirect_stdout(io.StringIO())

# This is the final function to evaluate the performance of the selected
# recommendation approach; the metric used here is RMSE.
# suppress_stdout silences the print output of every function called inside it,
# so that only the RMSE value is printed.
def evaluateRS(ratings):
    ids = ['User-based CF (cosine)', 'User-based CF (correlation)',
           'Item-based CF (cosine)', 'Item-based CF (adjusted cosine)']
    approach = widgets.Dropdown(options=ids, value=ids[0],
                                description='Select Approach',
                                layout=widgets.Layout(width='500px'))
    n_users, n_items = ratings.shape
    prediction = np.zeros((n_users, n_items))  # rows are users, columns are items

    def on_change(change):
        clear_output(wait=True)
        if change['type'] != 'change' or change['name'] != 'value':
            return
        with suppress_stdout():
            for i in range(n_users):
                for j in range(n_items):
                    if approach.value == 'User-based CF (cosine)':
                        prediction[i, j] = predict_userbased(i + 1, j + 1, ratings, 'cosine')
                    elif approach.value == 'User-based CF (correlation)':
                        prediction[i, j] = predict_userbased(i + 1, j + 1, ratings, 'correlation')
                    elif approach.value == 'Item-based CF (cosine)':
                        prediction[i, j] = predict_itembased(i + 1, j + 1, ratings)
                    else:
                        prediction[i, j] = predict_itembased_adjcos(i + 1, j + 1, ratings)

        MSE = mean_squared_error(ratings, prediction)
        RMSE = round(sqrt(MSE), 3)
        print('RMSE using {0} approach is: {1}'.format(approach.value, RMSE))

    approach.observe(on_change)
    display(approach)
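To show what the RMSE computation amounts to, here is a small sketch in plain Python on made-up numbers. The function name and the skip_unrated option are assumptions of this sketch: evaluateRS compares every cell, including the zeros that stand for missing ratings, while skipping them is a common alternative.

```python
import math


def rmse(actual, predicted, skip_unrated=True):
    """Root mean squared error between two rating matrices (nested lists).

    With skip_unrated=True, cells whose actual rating is 0 are ignored.
    """
    squared_errors = []
    for actual_row, predicted_row in zip(actual, predicted):
        for a, p in zip(actual_row, predicted_row):
            if skip_unrated and a == 0:
                continue
            squared_errors.append((a - p) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up 2x2 example (0 = not rated).
actual = [[5, 0], [3, 4]]
predicted = [[4, 2], [3, 6]]

print(rmse(actual, predicted))                      # unrated cell ignored
print(rmse(actual, predicted, skip_unrated=False))  # every cell counted
```

On this example the two variants differ noticeably (about 1.29 versus 1.5), which is worth keeping in mind when comparing RMSE values across approaches.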

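In applications with a large user base, user-based approaches face scalability problems, because their complexity grows linearly with the number of users. Item-based approaches address these scalability problems by recommending items based on item similarity. Hybrid techniques take advantage of the strengths of these approaches and combine them in several ways to achieve better performance.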
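One simple way to combine the two approaches is a weighted blend of their predictions. The article does not specify a hybrid method, so the sketch below is only one possible illustration; the function name and the weight alpha are hypothetical.

```python
def hybrid_prediction(user_based, item_based, alpha=0.5):
    """Blend a user-based and an item-based predicted rating.

    alpha is a hypothetical mixing weight in [0, 1]:
    1.0 uses only the user-based prediction, 0.0 only the item-based one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError('alpha must be between 0 and 1')
    return alpha * user_based + (1 - alpha) * item_based


print(hybrid_prediction(8, 6))          # equal weighting
print(hybrid_prediction(8, 6, 0.25))    # leans towards the item-based prediction
```

In practice the weight would be tuned on held-out ratings, for example by choosing the value that minimizes RMSE.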

Source: https://mp.weixin.qq.com/s/4gxAL6uEm79QT8JkocH2XQ