2012-01-13 114 views
1

我正在测试根据Simon Funk的算法构建的推荐系统。 (由Timely Dev。http://www.timelydevelopment.com/demos/NetflixPrize.aspx编写)Howto使用增量式SVD推荐系统创建推荐

问题是,所有增量式SVD算法都试图预测user_id和movie_id的评级。但是在真实系统中,这应该为活动用户生成一个新项目列表。 我看到一些人在增量奇异值分解后使用了kNN,但是如果我不会错过任何东西,那么如果我在通过增量奇异值分解创建模型后使用kNN,则会失去所有的性能增益。

任何人都有增量SVD/Simon Funk方法的经验,并告诉我如何生成新的推荐项目列表?

回答

0

我认为这是一个很大的问题,因为我认为有很多推荐方法可以称为“增量式SVD”。为了回答您的具体问题:kNN运行在投影项目空间上,而不是原始空间上,因此速度应该很快。

+0

其实我试着提的是西蒙·芬克的算法,并指定相关的C++源代码,以减少空间。如果您熟悉该算法,Funk将使用特征而不是所有矩阵。我最初的想法是,kNN可以用于用户* k(其中k是特征号,例如2),但这仍然可能会花费一些成本,请考虑数百万用户。 – tackleberry 2012-01-13 08:41:41

+0

我想说的是,让我们说在矩阵分解后有1M * 50k矩阵,为活动用户建立相似性和推荐新项目将花费:在1M用户中寻找2个特征的余弦相似度,然后在k top中找到新项目类似的用户。这就是我的想法,并想知道是否有更好/有效的方法来做到这一点。 – tackleberry 2012-01-13 08:55:11

+0

是的,这就是我在我的回答中所描述的:您在缩小的(投影的)空间中操作。您正在基于用户特征矩阵描述基于用户的推荐器,这是有效的。 – 2012-01-13 10:38:42

1

的方式产生推荐电影:

  1. 拿还没有被用户的特征向量的观察
  2. 乘他们的特征向量的电影列表。
  3. 按结果降序排序并拍摄顶级电影。

理论:假装只有两个维度(喜剧和戏剧)。如果我喜欢喜剧,但是讨厌电视剧,我的特色矢量是[1.0, 0.0]。如果您将我与以下电影相比较:

Comedy: [1.0, 0.0] x [1.0, 0.0] = 1 
Dramedy: [0.5, 0.5] x [1.0, 0.0] = 0.5 
Drama: [0.0, 1.0] x [1.0, 0,0] = 0 
0

假设您有n个用户和m个项目。增量式SVD后,你有k个训练特征。为了获得给定用户的新项目,将1xk用户特征向量和kxm项目特征矩阵相乘。您最终得到该用户的每个项目的m个评分。然后,只需对它们进行排序,删除他们已经看到的,并显示一些新的。

1

这是一个基于Yelp Netflix代码的简单Python代码。如果你安装Numba,它将以C速度运行。

data_loader.py

import os 
import numpy as np 
from scipy import sparse 

class DataLoader: 
    def __init__(self): 
     pass 

    @staticmethod 
    def create_review_matrix(file_path): 
     data = np.array([[int(tok) for tok in line.split('\t')[:3]] 
         for line in open(file_path)]) 

     ij = data[:, :2] 
     ij -= 1 
     values = data[:, 2] 
     review_matrix = sparse.csc_matrix((values, ij.T)).astype(float) 
     return review_matrix 

movielens_file_path = '%s/Downloads/ml-100k/u1.base' % os.environ['HOME'] 

my_reviews = DataLoader.create_review_matrix(movielens_file_path) 

user_reviews = my_reviews[8] 
user_reviews = user_reviews.toarray().ravel() 
user_rated_movies, = np.where(user_reviews > 0) 
user_ratings = user_reviews[user_rated_movies] 

movie_reviews = my_reviews[:, 201] 
movie_reviews = movie_reviews.toarray().ravel() 
movie_rated_users, = np.where(movie_reviews > 0) 
movie_ratings = movie_reviews[movie_rated_users] 

user_pseudo_average_ratings = {} 
user_pseudo_average_ratings[8] = np.mean(user_ratings) 
user_pseudo_average_ratings[9] = np.mean(user_ratings) 
user_pseudo_average_ratings[10] = np.mean(user_ratings) 
users, movies = my_reviews.nonzero() 

users_matrix = np.empty((3, 3)) 
users_matrix[:] = 0.1 

movies_matrix = np.empty((3, 3)) 
movies_matrix[:] = 0.1 

result = users_matrix[0] * movies_matrix[0] 
otro = movies_matrix[:, 2] 
otro[2] = 8 

funk.py

# Requires Movielens 100k data 
import numpy as np, time, sys 
from data_loader import DataLoader 
from numba import jit 
import os 

def get_user_ratings(user_id, review_matrix): 
    """ 
    Returns a numpy array with the ratings that user_id has made 

    :rtype : numpy array 
    :param user_id: the id of the user 
    :return: a numpy array with the ratings that user_id has made 
    """ 
    user_reviews = review_matrix[user_id] 
    user_reviews = user_reviews.toarray().ravel() 
    user_rated_movies, = np.where(user_reviews > 0) 
    user_ratings = user_reviews[user_rated_movies] 
    return user_ratings 

def get_movie_ratings(movie_id, review_matrix): 
    """ 
    Returns a numpy array with the ratings that movie_id has received 

    :rtype : numpy array 
    :param movie_id: the id of the movie 
    :return: a numpy array with the ratings that movie_id has received 
    """ 
    movie_reviews = review_matrix[:, movie_id] 
    movie_reviews = movie_reviews.toarray().ravel() 
    movie_rated_users, = np.where(movie_reviews > 0) 
    movie_ratings = movie_reviews[movie_rated_users] 
    return movie_ratings 

def create_user_feature_matrix(review_matrix, NUM_FEATURES, FEATURE_INIT_VALUE): 
    """ 
    Creates a user feature matrix of size NUM_FEATURES X NUM_USERS 
    with all cells initialized to FEATURE_INIT_VALUE 

    :rtype : numpy matrix 
    :return: a matrix of size NUM_FEATURES X NUM_USERS 
    with all cells initialized to FEATURE_INIT_VALUE 
    """ 
    num_users = review_matrix.shape[0] 
    user_feature_matrix = np.empty((NUM_FEATURES, num_users)) 
    user_feature_matrix[:] = FEATURE_INIT_VALUE 
    return user_feature_matrix 

def create_movie_feature_matrix(review_matrix, NUM_FEATURES, FEATURE_INIT_VALUE): 
    """ 
    Creates a user feature matrix of size NUM_FEATURES X NUM_MOVIES 
    with all cells initialized to FEATURE_INIT_VALUE 

    :rtype : numpy matrix 
    :return: a matrix of size NUM_FEATURES X NUM_MOVIES 
    with all cells initialized to FEATURE_INIT_VALUE 
    """ 
    num_movies = review_matrix.shape[1] 
    movie_feature_matrix = np.empty((NUM_FEATURES, num_movies)) 
    movie_feature_matrix[:] = FEATURE_INIT_VALUE 
    return movie_feature_matrix 

@jit(nopython=True) 
def predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix): 
    """ 
    Makes a prediction of the rating that user_id will give to movie_id if 
    he/she sees it 

    :rtype : float 
    :param user_id: the id of the user 
    :param movie_id: the id of the movie 
    :return: a float in the range [1, 5] with the predicted rating for 
    movie_id by user_id 
    """ 
    rating = 1. 
    for f in range(user_feature_matrix.shape[0]): 
     rating += user_feature_matrix[f, user_id] * movie_feature_matrix[f, movie_id] 

    # We trim the ratings in case they go above or below the stars range 
    if rating > 5: rating = 5 
    elif rating < 1: rating = 1 
    return rating 

@jit(nopython=True) 
def sgd_inner(feature, A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES): 
    K = 0.015 
    LEARNING_RATE = 0.001 
    squared_error = 0 
    for k in range(len(A_data)): 
     user_id = A_row[k] 
     movie_id = A_col[k] 
     rating = A_data[k] 
     p = predict_rating(user_id, movie_id, user_feature_matrix, movie_feature_matrix) 
     err = rating - p 

     squared_error += err ** 2 

     user_feature_value = user_feature_matrix[feature, user_id] 
     movie_feature_value = movie_feature_matrix[feature, movie_id] 
     #for j in range(NUM_FEATURES): 
     user_feature_matrix[feature, user_id] += \ 
      LEARNING_RATE * (err * movie_feature_value - K * user_feature_value) 
     movie_feature_matrix[feature, movie_id] += \ 
      LEARNING_RATE * (err * user_feature_value - K * movie_feature_value) 

    return squared_error 

def calculate_features(A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES): 
    """ 
    Iterates through all the ratings in search for the best features that 
    minimize the error between the predictions and the real ratings. 
    This is the main function in Simon Funk SVD algorithm 

    :rtype : void 
    """ 
    MIN_IMPROVEMENT = 0.0001 
    MIN_ITERATIONS = 100 
    rmse = 0 
    last_rmse = 0 
    print len(A_data) 
    num_ratings = len(A_data) 
    for feature in xrange(NUM_FEATURES): 
     iter = 0 
     while (iter < MIN_ITERATIONS) or (rmse < last_rmse - MIN_IMPROVEMENT): 
      last_rmse = rmse 
      squared_error = sgd_inner(feature, A_row, A_col, A_data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES) 
      rmse = (squared_error/num_ratings) ** 0.5 
      iter += 1 
     print ('Squared error = %f' % squared_error) 
     print ('RMSE = %f' % rmse) 
     print ('Feature = %d' % feature) 
    return last_rmse 


LAMBDA = 0.02 
FEATURE_INIT_VALUE = 0.1 
NUM_FEATURES = 20 

movielens_file_path = '%s/Downloads/ml-100k/u1.base' % os.environ['HOME'] 

A = DataLoader.create_review_matrix(movielens_file_path) 
from scipy.io import mmread, mmwrite 
mmwrite('./data/A', A) 

user_feature_matrix = create_user_feature_matrix(A, NUM_FEATURES, FEATURE_INIT_VALUE) 
movie_feature_matrix = create_movie_feature_matrix(A, NUM_FEATURES, FEATURE_INIT_VALUE) 

users, movies = A.nonzero() 
A = A.tocoo() 

rmse = calculate_features(A.row, A.col, A.data, user_feature_matrix, movie_feature_matrix, NUM_FEATURES) 
print 'rmse', rmse