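Welcome to Surprise’s documentation!¶
Surprise is an easy-to-use Python scikit for recommender systems.
If you are new to Surprise, we recommend starting with the Getting Started guide, which contains a series of tutorials on using Surprise. You can also check out the usage examples in the FAQ. For installation instructions, please refer to the project page.
We greatly appreciate any kind of feedback or criticism (software design, documentation, improvement ideas, spelling mistakes, etc.). Feel free to contribute code or describe what you need (see the GitHub page)!
Getting Started¶
Basic usage¶
Automatic cross-validation¶
Surprise has a set of built-in algorithms and datasets for you to use. In its simplest form, it only takes a few lines of code to run a cross-validation procedure: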
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (it is downloaded if not available locally).
data = Dataset.load_builtin('ml-100k')
# Use the well-known SVD algorithm.
algo = SVD()
# Run 5-fold cross-validation and print the results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
The result should be as follows (actual values may vary due to randomization):
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).
            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06
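If the movielens-100k dataset has not been downloaded yet, the load_builtin() method will download it and save it in the .surprise_data folder in your home directory (you can also choose to save it somewhere else).
The example uses the well-known SVD algorithm, but many other algorithms are available. See Using prediction algorithms for more details.
The cross_validate() function runs a cross-validation procedure according to the cv argument, and computes some accuracy measures. We are here using a classical 5-fold cross-validation, but fancier iterators can also be used (see here).
Train-test split and the fit() method¶
If you do not want to run a full cross-validation procedure, you can use train_test_split() to sample a trainset and a testset of given sizes, and use the accuracy metric of your choice. You will need the fit() method, which trains the algorithm on the trainset, and the test() method, which returns the predictions made from the testset: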
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (it is downloaded if not available locally).
data = Dataset.load_builtin('ml-100k')
# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)
# Use the well-known SVD algorithm.
algo = SVD()
# Train the algorithm on the trainset, and predict ratings for the testset.
algo.fit(trainset)
predictions = algo.test(testset)
# Then compute the RMSE.
accuracy.rmse(predictions)
The result should be:
RMSE: 0.9411
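For reference, the RMSE reported here is the square root of the mean squared difference between true and estimated ratings, and MAE is the mean absolute difference. The following is a minimal plain-Python sketch of those two definitions; the function names and the sample pairs are made up for illustration, and this is not Surprise's own implementation (in practice, use accuracy.rmse() and accuracy.mae()).

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 4.0), (1.0, 3.0)]
print(rmse(pairs))  # square root of (1 + 0 + 1 + 4) / 4
print(mae(pairs))   # (1 + 0 + 1 + 2) / 4
```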
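Note that you can train and test an algorithm with the following one-liner:
predictions = algo.fit(trainset).test(testset)
In some cases, your trainset and testset are already defined by some files. Please refer to this section to handle such cases.
Train on a whole trainset and the predict() method¶
Obviously, we could also simply fit our algorithm to the whole dataset, rather than running cross-validation. This can be done with the build_full_trainset() method, which builds a trainset object: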
from surprise import KNNBasic
from surprise import Dataset
# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')
# Retrieve the trainset.
trainset = data.build_full_trainset()
# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)
We can now predict ratings by directly calling the predict() method. Let’s say you are interested in user 196 and item 302 (make sure they are in the dataset!), and you know that the true rating \(r_{ui} = 4\):
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!
# Get a prediction for a specific user and item.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
The result should be:
user: 196 item: 302 r_ui = 4.00 est = 4.06 {'actual_k': 40, 'was_impossible': False}
So far we have only used built-in datasets, but you can of course use your own. The next section explains how.
Use a custom dataset¶
Surprise has a set of built-in datasets, but you can of course use a custom dataset.
A rating dataset can be loaded either from a file (e.g. a csv file) or from a pandas DataFrame. Either way, you will need to define a Reader object so that Surprise can parse the file or the DataFrame.
To load a dataset from a file (e.g. a csv file), you will need the load_from_file() method:
import os
from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
# Path to the dataset file.
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# As we are loading a custom dataset, we need to define a Reader.
# In the movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')
data = Dataset.load_from_file(file_path, reader=reader)
# We can now use this dataset as we please, e.g. calling cross_validate.
cross_validate(BaselineOnly(), data, verbose=True)
For more details about readers and how to use them, see the Reader class documentation.
Note
As you already know from the previous section, the Movielens-100k dataset is built-in, so a much quicker way to load it is data = Dataset.load_builtin('ml-100k'). We ignore this here.
To load a dataset from a pandas DataFrame, you will need the load_from_df() method. You will also need a Reader object, but only the rating_scale parameter must be specified. The DataFrame must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings. Each row thus corresponds to a given rating. The columns of your DataFrame need not be stored in this order, as you can reorder them when making the call.
import pandas as pd
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
# Create the DataFrame. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)
# A Reader is still needed, but only the rating_scale parameter is required.
reader = Reader(rating_scale=(1, 5))
# The columns must correspond to user id, item id and rating (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
# We can now use this dataset as we please, e.g. calling cross_validate.
cross_validate(NormalPredictor(), data, cv=2)
The original DataFrame looks like this:
   itemID  rating    userID
0       1       3         9
1       1       2        32
2       1       4         2
3       2       3        45
4       2       1  user_foo
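Use cross-validation iterators¶
For cross-validation, we can simply use the cross_validate() function. For finer control, we can also instantiate a cross-validation iterator, and make predictions over each split using the split() method of the iterator and the test() method of the algorithm. Here is an example where we use a classical K-fold cross-validation procedure with K = 3: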
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold
# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')
# Define a cross-validation iterator.
kf = KFold(n_splits=3)
algo = SVD()
for trainset, testset in kf.split(data):
    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print the Root Mean Squared Error.
    accuracy.rmse(predictions, verbose=True)
The result should be:
RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478
Other cross-validation iterators can be used, such as LeaveOneOut or ShuffleSplit. See all the available iterators here. The design of Surprise’s cross-validation tools is heavily inspired by the excellent scikit-learn API.
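To make the idea of a K-fold iterator concrete, here is a minimal plain-Python sketch that splits rating indices into K disjoint test folds, each paired with the remaining indices as training data. The function name and the strided assignment of indices to folds are assumptions made for illustration; Surprise's KFold may partition the ratings differently.

```python
import random


def k_fold_indices(n_ratings, n_splits=3, seed=0):
    """Yield (train_indices, test_indices) pairs for a shuffled K-fold split."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Every n_splits-th shuffled index, starting at `fold`, is a test index.
        test = indices[fold::n_splits]
        test_set = set(test)
        train = [idx for idx in indices if idx not in test_set]
        yield train, test


for train, test in k_fold_indices(10, n_splits=3):
    print(len(train), len(test))
```

Each rating index ends up in exactly one test fold, which is what lets the per-fold accuracy values be averaged afterwards.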
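Another special case is when the folds are already predefined by some files. For instance, the movielens-100K dataset already provides 5 train files and 5 test files (u1.base, u1.test ... u5.base, u5.test). Surprise can handle this case with a surprise.model_selection.split.PredefinedKFold object: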
import os
from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold
# Path to the folder containing the fold files.
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')
# This time, we use the built-in reader.
reader = Reader('ml-100k')
# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = SVD()
for trainset, testset in pkf.split(data):
    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)
    # Compute and print the Root Mean Squared Error.
    accuracy.rmse(predictions, verbose=True)
Of course, you can also load a single file for training and a single file for testing. However, the folds_files parameter still needs to be a list.
Tune algorithm parameters with GridSearchCV¶
The cross_validate() function reports accuracy measures over a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, use the GridSearchCV class. Given a dict of parameters, this class exhaustively tries all the combinations of parameters and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn’s GridSearchCV.
Here is an example where we try different values for the parameters n_epochs, lr_all and reg_all of the SVD algorithm.
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV
# Use the movielens-100K dataset.
data = Dataset.load_builtin('ml-100k')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
# Best RMSE score.
print(gs.best_score['rmse'])
# Combination of parameters that gave the best RMSE score.
print(gs.best_params['rmse'])
Result:
0.961300130118
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
We are here evaluating the average RMSE and MAE over a 3-fold cross-validation procedure, but any cross-validation iterator can be used.
Once fit() has been called, the best_estimator attribute gives us an algorithm instance with the optimal set of parameters, which can be used however we please:
# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
Note
Dictionary parameters such as bsl_options and sim_options require special treatment. See the usage example below:
param_grid = {'k': [10, 20],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }
Naturally, both can be combined, for example for the KNNBaseline algorithm:
param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg': [1, 2]},
              'k': [2, 3],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }
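One plausible reading of such a nested grid is that each dictionary parameter is first expanded into the list of all its own combinations, and then combined with the other parameters. The sketch below illustrates that reading in plain Python; expand_grid is a hypothetical helper written for this example and is not GridSearchCV's actual code.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of a grid whose values are lists or nested dicts of lists."""
    names = list(grid)
    value_lists = []
    for name in names:
        value = grid[name]
        # A nested dict (e.g. sim_options) is first expanded into a list of dicts.
        value_lists.append(expand_grid(value) if isinstance(value, dict) else value)
    return [dict(zip(names, values)) for values in product(*value_lists)]


param_grid = {'k': [10, 20],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}}

combinations = expand_grid(param_grid)
print(len(combinations))  # 2 values of k times 4 sim_options dicts
print(combinations[0])
```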
For further analysis, the cv_results attribute has all the needed information and can be imported into a pandas dataframe:
results_df = pd.DataFrame.from_dict(gs.cv_results)
In our example, the cv_results attribute looks like this (floats are formatted):
'split0_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split1_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split2_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'mean_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'std_test_rmse': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_rmse': [7 8 3 5 4 6 1 2]
'split0_test_mae': [0.81, 0.82, 0.78, 0.79, 0.79, 0.8, 0.77, 0.79]
'split1_test_mae': [0.8, 0.81, 0.78, 0.79, 0.78, 0.79, 0.77, 0.78]
'split2_test_mae': [0.81, 0.81, 0.78, 0.79, 0.78, 0.8, 0.77, 0.78]
'mean_test_mae': [0.81, 0.81, 0.78, 0.79, 0.79, 0.8, 0.77, 0.78]
'std_test_mae': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_mae': [7 8 2 5 4 6 1 3]
'mean_fit_time': [1.53, 1.52, 1.53, 1.53, 3.04, 3.05, 3.06, 3.02]
'std_fit_time': [0.03, 0.04, 0.0, 0.01, 0.04, 0.01, 0.06, 0.01]
'mean_test_time': [0.46, 0.45, 0.44, 0.44, 0.47, 0.49, 0.46, 0.34]
'std_test_time': [0.0, 0.01, 0.01, 0.0, 0.03, 0.06, 0.01, 0.08]
'params': [{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
'param_n_epochs': [5, 5, 5, 5, 10, 10, 10, 10]
'param_lr_all': [0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.01, 0.01]
'param_reg_all': [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]
As you can see, each list has as many entries as there are parameter combinations. It corresponds to the following table:
split0_test_rmse | split1_test_rmse | split2_test_rmse | mean_test_rmse | std_test_rmse | rank_test_rmse | split0_test_mae | split1_test_mae | split2_test_mae | mean_test_mae | std_test_mae | rank_test_mae | mean_fit_time | std_fit_time | mean_test_time | std_test_time | params | param_n_epochs | param_lr_all | param_reg_all |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.99775 | 0.997744 | 0.996378 | 0.997291 | 0.000645508 | 7 | 0.807862 | 0.804626 | 0.805282 | 0.805923 | 0.00139657 | 7 | 1.53341 | 0.0305216 | 0.455831 | 0.000922113 | {‘n_epochs’: 5, ‘lr_all’: 0.002, ‘reg_all’: 0.4} | 5 | 0.002 | 0.4 |
1.00381 | 1.00304 | 1.00257 | 1.00314 | 0.000508358 | 8 | 0.816559 | 0.812905 | 0.813772 | 0.814412 | 0.00155866 | 8 | 1.5199 | 0.0367117 | 0.451068 | 0.00938646 | {‘n_epochs’: 5, ‘lr_all’: 0.002, ‘reg_all’: 0.6} | 5 | 0.002 | 0.6 |
0.973524 | 0.973595 | 0.972495 | 0.973205 | 0.000502609 | 3 | 0.783361 | 0.780242 | 0.78067 | 0.781424 | 0.00138049 | 2 | 1.53449 | 0.00496203 | 0.441558 | 0.00529696 | {‘n_epochs’: 5, ‘lr_all’: 0.005, ‘reg_all’: 0.4} | 5 | 0.005 | 0.4 |
0.98229 | 0.982059 | 0.981486 | 0.981945 | 0.000338056 | 5 | 0.794481 | 0.790781 | 0.79186 | 0.792374 | 0.00155377 | 5 | 1.52739 | 0.00859185 | 0.44463 | 0.000888907 | {‘n_epochs’: 5, ‘lr_all’: 0.005, ‘reg_all’: 0.6} | 5 | 0.005 | 0.6 |
0.978034 | 0.978407 | 0.976919 | 0.977787 | 0.000632049 | 4 | 0.787643 | 0.784723 | 0.784957 | 0.785774 | 0.00132486 | 4 | 3.03572 | 0.0431101 | 0.466606 | 0.0254965 | {‘n_epochs’: 10, ‘lr_all’: 0.002, ‘reg_all’: 0.4} | 10 | 0.002 | 0.4 |
0.986263 | 0.985817 | 0.985004 | 0.985695 | 0.000520899 | 6 | 0.798218 | 0.794457 | 0.795373 | 0.796016 | 0.00160135 | 6 | 3.0544 | 0.00636185 | 0.488357 | 0.0576194 | {‘n_epochs’: 10, ‘lr_all’: 0.002, ‘reg_all’: 0.6} | 10 | 0.002 | 0.6 |
0.963751 | 0.963463 | 0.962676 | 0.963297 | 0.000454661 | 1 | 0.774036 | 0.770548 | 0.771588 | 0.772057 | 0.00146201 | 1 | 3.0636 | 0.0597982 | 0.456484 | 0.00510321 | {‘n_epochs’: 10, ‘lr_all’: 0.005, ‘reg_all’: 0.4} | 10 | 0.005 | 0.4 |
0.973605 | 0.972868 | 0.972765 | 0.973079 | 0.000374222 | 2 | 0.78607 | 0.781918 | 0.783537 | 0.783842 | 0.00170855 | 3 | 3.01907 | 0.011834 | 0.338839 | 0.075346 | {‘n_epochs’: 10, ‘lr_all’: 0.005, ‘reg_all’: 0.6} | 10 | 0.005 | 0.6 |
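The eight rows of this table are the Cartesian product of the value lists in param_grid (2 × 2 × 2). As a small illustration, the sketch below enumerates them with itertools.product; it only shows how the combinations can be listed and is not taken from Surprise's source.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# One dict per point of the Cartesian product of the value lists.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print(len(combinations))
for params in combinations:
    print(params)
```

The combinations come out in the same order as the 'params' entry of cv_results shown above.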
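Command line usage¶
Surprise can also be used from the command line, for example:
surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3
See detailed usage with -h:
surprise -h
Using prediction algorithms¶
Surprise provides a number of built-in algorithms. All algorithms derive from the AlgoBase base class, where some key methods are implemented (e.g. predict, fit and test). The list and details of the available prediction algorithms can be found in the prediction_algorithms package documentation.
Every algorithm is part of the global Surprise namespace, so you only need to import its name from the Surprise package, for example:
from surprise import KNNBasic
algo = KNNBasic()
Some of these algorithms may use baseline estimates, and some may use a similarity measure. We review here how to configure the way baselines and similarities are computed.
Baseline estimates configuration¶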
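Note
This section only applies to algorithms (or similarity measures) that try to minimize the following regularized squared error (or equivalent):
\(\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2\right).\)
For algorithms using baselines in another objective function (e.g. the SVD algorithm), the baseline configuration is done differently and is specific to each algorithm. Please refer to their own documentation.
First of all, if you do not want to configure the way baselines are computed, you don’t have to: the default parameters will do just fine. If you do want to, this section is for you.
You may want to read section 2.1 of [Kor10] to get a good idea of what baseline estimates are.
Baselines can be estimated in two different ways:
- Using Stochastic Gradient Descent (SGD).
- Using Alternating Least Squares (ALS).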
You can configure the way baselines are computed using the bsl_options parameter passed at the creation of an algorithm. This parameter is a dictionary for which the key 'method' indicates the method to use. Accepted values are 'als' (default) and 'sgd'. Depending on its value, other options may be set. For ALS:
- 'reg_i': The regularization parameter for items. Corresponding to \(\lambda_2\) in [Kor10]. Default is 10.
- 'reg_u': The regularization parameter for users. Corresponding to \(\lambda_3\) in [Kor10]. Default is 15.
- 'n_epochs': The number of iterations of the ALS procedure. Default is 10. Note that in [Kor10], what is described is a single iteration ALS process.
And for SGD:
- 'reg': The regularization parameter of the cost function that is optimized, corresponding to \(\lambda_1\) and then \(\lambda_5\) in [Kor10]. Default is 0.02.
- 'learning_rate': The learning rate of SGD, corresponding to \(\gamma\) in [Kor10]. Default is 0.005.
- 'n_epochs': The number of iterations of the SGD procedure. Default is 20.
Note
For both procedures (ALS and SGD), user and item biases (\(b_u\) and \(b_i\)) are initialized to zero.
Usage examples:
from surprise import BaselineOnly
print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)
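To make the role of the ALS options more concrete, here is a minimal plain-Python sketch of an alternating least squares pass for the biases \(b_u\) and \(b_i\): item biases are solved with user biases held fixed, then the reverse, for n_epochs rounds. The function name and the toy ratings are made up for illustration, and the update order is an assumption; this is a sketch of the general technique, not Surprise's optimized implementation.

```python
from collections import defaultdict


def als_baselines(ratings, reg_u=15, reg_i=10, n_epochs=10):
    """Estimate user and item biases by alternating least squares.

    `ratings` is a list of (user, item, rating) tuples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    # Biases are initialized to zero.
    bu = {u: 0.0 for u in by_user}
    bi = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Hold user biases fixed and solve for each item bias.
        for i, user_ratings in by_item.items():
            residual = sum(r - mu - bu[u] for u, r in user_ratings)
            bi[i] = residual / (reg_i + len(user_ratings))
        # Hold item biases fixed and solve for each user bias.
        for u, item_ratings in by_user.items():
            residual = sum(r - mu - bi[i] for i, r in item_ratings)
            bu[u] = residual / (reg_u + len(item_ratings))
    return mu, bu, bi


# Hypothetical toy data: user 'a' rates generously, user 'b' harshly.
ratings = [('a', 'x', 5), ('a', 'y', 4), ('b', 'x', 2), ('b', 'y', 1)]
mu, bu, bi = als_baselines(ratings, n_epochs=1)
print(mu, bu, bi)
```

Larger reg_u and reg_i shrink the biases toward zero, which matters most for users and items with few ratings.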
Note that some similarity measures may use baselines, such as the pearson_baseline similarity. Configuration works just the same, whether the baselines are used in the actual prediction \(\hat{r}_{ui}\) or not:
bsl_options = {'method': 'als',
               'n_epochs': 20,
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)
This leads us to similarity measure configuration, which we will review right now.
Similarity measure configuration¶
Many algorithms use a similarity measure to estimate a rating. They are configured in much the same way as baseline estimates: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:
- 'name': The name of the similarity to use, as defined in the similarities module. Default is 'MSD'.
- 'user_based': Whether similarities are computed between users or between items. This has a huge impact on the performance of a prediction algorithm. Default is True.
- 'min_support': The minimum number of common items (when 'user_based' is 'True') or minimum number of common users (when 'user_based' is 'False') for the similarity not to be zero. Simply put, if \(|I_{uv}| < \text{min_support}\) then \(\text{sim}(u, v) = 0\). The same goes for items.
- 'shrinkage': Shrinkage parameter to apply (only relevant for the pearson_baseline similarity). Default is 100.
Usage examples:
sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }
algo = KNNBasic(sim_options=sim_options)
sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)
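To illustrate how min_support interacts with a similarity measure, here is a minimal plain-Python sketch of a mean-squared-difference (MSD) similarity between two users. The function name and the toy ratings are hypothetical, and the code only mirrors the definition given in the text (zero similarity below min_support); it is not the code of the similarities module.

```python
def msd_similarity(ratings_u, ratings_v, min_support=1):
    """MSD-based similarity between two users.

    `ratings_u` and `ratings_v` map item ids to ratings.
    """
    common = set(ratings_u) & set(ratings_v)
    # Too few common items: the similarity is defined to be zero.
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    # The +1 avoids a division by zero when the two users agree on everything.
    return 1.0 / (msd + 1.0)


# Hypothetical ratings: the two users share items 'x' and 'y'.
u = {'x': 4, 'y': 2, 'z': 5}
v = {'x': 5, 'y': 2, 'w': 1}
print(msd_similarity(u, v))                 # msd = (1 + 0) / 2 = 0.5
print(msd_similarity(u, v, min_support=3))  # only 2 common items
```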
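See also
The similarities module.
How to build your own prediction algorithm¶
This section describes how to build a custom prediction algorithm using Surprise.
The basics¶
Want to get your hands dirty? Cool.
Creating your own prediction algorithm is pretty simple: an algorithm is nothing but a class derived from AlgoBase that has an estimate method. This is the method that is called by the predict() method. It takes in an inner user id, an inner item id (see this note), and returns the estimated rating \(\hat{r}_{ui}\):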
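from surprise import AlgoBase
from surprise import Dataset
from surprise.model_selection import cross_validate
class MyOwnAlgorithm(AlgoBase):
    def __init__(self):
        # Always call base method before doing anything.
        AlgoBase.__init__(self)
    def estimate(self, u, i):
        return 3
data = Dataset.load_builtin('ml-100k')
algo = MyOwnAlgorithm()
cross_validate(algo, data, verbose=True)
This algorithm is about as dumb as it gets: it predicts a rating of 3, regardless of the user and the item.
If you want to store additional information about the prediction, you can also return a dictionary with the details:
    def estimate(self, u, i):
        details = {'info1' : 'That was',
                   'info2' : 'easy stuff :)'}
        return 3, details
This dictionary will be stored in the prediction as the details field and can be used for later analysis.
The fit method¶
Now, let’s make a slightly cleverer algorithm that predicts the average of all the ratings of the trainset. As this is a constant value that does not depend on the current user or item, we would rather compute it once and for all. This can be done by defining the fit method: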
import numpy as np
class MyOwnAlgorithm(AlgoBase):
    def __init__(self):
        # Always call base method before doing anything.
        AlgoBase.__init__(self)
    def fit(self, trainset):
        # Here again: call base method before doing anything.
        AlgoBase.fit(self, trainset)
        # Compute the average rating. We might as well use the
        # trainset.global_mean attribute ;)
        self.the_mean = np.mean([r for (_, _, r) in
                                 self.trainset.all_ratings()])
        return self
    def estimate(self, u, i):
        return self.the_mean
The fit method is called e.g. by the cross_validate function at each fold of a cross-validation process (but you can also call it yourself). Whatever you plan to do, you should call the base class fit() method first.
Note that the fit() method returns self. This allows the use of expressions like algo.fit(trainset).test(testset).
The trainset attribute¶
Once the base class fit() method has returned, all the info you need about the current training set (rating values, etc.) is stored in the self.trainset attribute. This is a Trainset object that has many attributes and methods of interest for prediction.
To illustrate its usage, let’s make an algorithm that predicts an average between the mean of all ratings, the mean rating of the user and the mean rating for the item:
    def estimate(self, u, i):
        sum_means = self.trainset.global_mean
        div = 1
        if self.trainset.knows_user(u):
            sum_means += np.mean([r for (_, r) in self.trainset.ur[u]])
            div += 1
        if self.trainset.knows_item(i):
            sum_means += np.mean([r for (_, r) in self.trainset.ir[i]])
            div += 1
        return sum_means / div
Note that it would have been a better idea to compute all the user means in the fit method, thus avoiding the same computations multiple times.
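As a rough illustration of that remark, the sketch below precomputes every user mean once from a plain dictionary that has the same shape as trainset.ur (inner user id mapped to a list of (inner item id, rating) pairs). The helper name and the toy data are made up; in a real algorithm one could call such a helper from fit, after the base class fit() has returned, and store the result on self (for example in a hypothetical self.user_means attribute) for estimate to look up.

```python
def compute_user_means(ur):
    """Map each inner user id to the mean of that user's ratings.

    `ur` has the same shape as trainset.ur:
    {inner_uid: [(inner_iid, rating), ...]}.
    """
    return {u: sum(r for _, r in ratings) / len(ratings)
            for u, ratings in ur.items() if ratings}


# Hypothetical stand-in for trainset.ur.
ur = {0: [(0, 4.0), (1, 2.0)], 1: [(1, 5.0)]}
user_means = compute_user_means(ur)
print(user_means)
```

The same approach applies to item means, using a dictionary shaped like trainset.ir.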
When the prediction is impossible¶
It’s up to your algorithm to decide if it can or cannot yield a prediction. If the prediction is impossible, then you can raise the PredictionImpossible exception. You’ll need to import it first:
from surprise import PredictionImpossible
This exception will be caught by the predict() method, and the estimation \(\hat{r}_{ui}\) will be set to the global mean of all ratings \(\mu\).
Using similarities and baselines¶
Should your algorithm use a similarity measure or baseline estimates, you’ll need to accept bsl_options and sim_options as parameters to the __init__ method, and pass them along to the base class. See how to use these parameters in the Using prediction algorithms section.
Methods compute_baselines() and compute_similarities() can be called in the fit method (or anywhere else).
from surprise import AlgoBase
from surprise import PredictionImpossible
class MyOwnAlgorithm(AlgoBase):
    def __init__(self, sim_options={}, bsl_options={}):
        AlgoBase.__init__(self, sim_options=sim_options,
                          bsl_options=bsl_options)
    def fit(self, trainset):
        AlgoBase.fit(self, trainset)
        # Compute baselines and similarities
        self.bu, self.bi = self.compute_baselines()
        self.sim = self.compute_similarities()
        return self
    def estimate(self, u, i):
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')
        # Compute similarities between u and v, where v describes all other
        # users that have also rated item i.
        neighbors = [(v, self.sim[u, v]) for (v, r) in self.trainset.ir[i]]
        # Sort these neighbors by similarity
        neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)
        print('The 3 nearest neighbors of user', str(u), 'are:')
        for v, sim_uv in neighbors[:3]:
            print('user {0:} with sim {1:1.2f}'.format(v, sim_uv))
        # ... Aaaaand return the baseline estimate anyway ;)
        bsl = self.trainset.global_mean + self.bu[u] + self.bi[i]
        return bsl
Feel free to explore the prediction_algorithms package source to get an idea of what can be done.
Notation standards, References¶
In the documentation, you will find the following notation:
- \(R\) : the set of all ratings.
- \(R_{train}\), \(R_{test}\) and \(\hat{R}\) denote the training set, the test set, and the set of predicted ratings.
- \(U\) : the set of all users. \(u\) and \(v\) denote users.
- \(I\) : the set of all items. \(i\) and \(j\) denote items.
- \(U_i\) : the set of all users that have rated item \(i\).
- \(U_{ij}\) : the set of all users that have rated both items \(i\) and \(j\).
- \(I_u\) : the set of all items rated by user \(u\).
- \(I_{uv}\) : the set of all items rated by both users \(u\) and \(v\).
- \(r_{ui}\) : the true rating of user \(u\) for item \(i\).
- \(\hat{r}_{ui}\) : the estimated rating of user \(u\) for item \(i\).
- \(b_{ui}\) : the baseline rating of user \(u\) for item \(i\).
- \(\mu\) : the mean of all ratings.
- \(\mu_u\) : the mean of all ratings given by user \(u\).
- \(\mu_i\) : the mean of all ratings given to item \(i\).
- \(\sigma_u\) : the standard deviation of all ratings given by user \(u\).
- \(\sigma_i\) : the standard deviation of all ratings given to item \(i\).
- \(N_i^k(u)\) : the \(k\) nearest neighbors of user \(u\) that have rated item \(i\). This set is computed using a similarity metric.
- \(N_u^k(i)\) : the \(k\) nearest neighbors of item \(i\) that are rated by user \(u\). This set is computed using a similarity metric.
References
Here are the papers used as references in the documentation. Links to pdf files were added when possible. A simple Google search should lead you easily to the missing ones :)
[GM05] | Thomas George and Srujana Merugu. A scalable collaborative filtering framework based on co-clustering. 2005. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf. |
[Kor08] | Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. 2008. URL: http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf. |
[Kor10] | Yehuda Koren. Factor in the neighbors: scalable and accurate collaborative filtering. 2010. URL: http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf. |
[KBV09] | Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. 2009. |
[LS01] | Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. 2001. URL: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf. |
[LM07] | Daniel Lemire and Anna Maclachlan. Slope one predictors for online rating-based collaborative filtering. 2007. URL: http://arxiv.org/abs/cs/0702144. |
[LZXZ14] | Xin Luo, Mengchu Zhou, Yunni Xia, and Qinsheng Zhu. An efficient non-negative matrix factorization-based approach to collaborative filtering for recommender systems. 2014. |
[RRSK10] | Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. Recommender Systems Handbook. 1st edition, 2010. |
[SM08] | Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. 2008. URL: http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf. |
[ZWFM96] | Sheng Zhang, Weihong Wang, James Ford, and Fillia Makedon. Learning from incomplete ratings using non-negative matrix factorization. 1996. URL: http://www.siam.org/meetings/sdm06/proceedings/059zhangs2.pdf. |
FAQ¶
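This section answers some frequently asked questions and shows some use cases that are not covered in the user guide.
How to get the top-N recommendations for each user¶
Here is an example where we retrieve the 10 items with the highest rating prediction for every user in the MovieLens-100k dataset. We first train an SVD algorithm on the whole dataset, then predict the ratings of all the (user, item) pairs of the testset, i.e. the pairs that are not in the training set. We then retrieve the top-10 predictions for each user.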
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
def get_top_n(predictions, n=10):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user.
            Default is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of n tuples:
        [(raw item id 1, rating estimation 1), ..., (raw item id n, rating estimation n)]
    '''
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n
# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=10)
# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
How to compute precision@k and recall@k¶
Here is an example where we compute Precision@k and Recall@k for each user:
\(\text{Precision@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Recommended items} \} | }\)
\(\text{Recall@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Relevant items} \} | }\)
An item is considered relevant if its true rating \(r_{ui}\) is greater than a given threshold. An item is considered recommended if its estimated rating \(\hat{r}_{ui}\) is greater than the threshold, and if it is among the k highest estimated ratings.
from collections import defaultdict
from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold
def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])
        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1
        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1
    return precisions, recalls
data = Dataset.load_builtin('ml-100k')
kf = KFold(n_splits=5)
algo = SVD()
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)
    # Precision and recall can then be averaged over all users
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))
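As a small worked example of these two definitions, the sketch below computes precision@3 and recall@3 by hand for a single user, using the same counting logic as above. The five (estimated, true) rating pairs are made up for illustration.

```python
# (estimated rating, true rating) pairs for one hypothetical user.
user_ratings = [(4.8, 5.0), (4.2, 3.0), (3.9, 4.0), (3.1, 4.5), (2.0, 1.0)]
k, threshold = 3, 3.5

# Keep the k items with the highest estimated rating.
user_ratings.sort(key=lambda pair: pair[0], reverse=True)
top_k = user_ratings[:k]

# Relevant items: true rating at or above the threshold (3 of them here).
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
# Recommended items: in the top k with an estimate at or above the threshold (3).
n_rec_k = sum(est >= threshold for est, _ in top_k)
# Items that are both recommended and relevant (2: the first and third).
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold
                      for est, true_r in top_k)

precision = n_rel_and_rec_k / n_rec_k
recall = n_rel_and_rec_k / n_rel
print(precision, recall)
```

Both values come out to 2/3 here: two of the three recommended items are relevant, and two of the three relevant items were recommended.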
How to get the k nearest neighbors of a user (or item)¶
You can use the get_neighbors() method of the algorithm object. This is only relevant for algorithms that use a similarity measure, such as the k-NN algorithms.
Here is an example where we retrieve the 10 nearest neighbors of the movie Toy Story from the MovieLens-100k dataset. The output is:
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
There’s a lot of boilerplate because of the conversions between movie names and their raw/inner ids (see this note), but it all boils down to the use of get_neighbors():
import io # needed because of weird encoding of u.item file
from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir
def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """
    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    return rid_to_name, name_to_rid
# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)
# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()
# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)
# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)
print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)
Naturally, the same can be done for users with minor modifications.
How to serialize an algorithm¶
Prediction algorithms can be serialized and loaded back using the dump() and load() functions. Here is a small example where the SVD algorithm is trained on a dataset and serialized. It is then reloaded and can be used again for making predictions:
import os
from surprise import SVD
from surprise import Dataset
from surprise import dump
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())
# Dump algorithm and reload it.
file_name = os.path.expanduser('~/dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)
# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')
Algorithms can be serialized along with their predictions, so that they can be further analyzed or compared with other algorithms, using pandas dataframes. Some examples are given in the two following notebooks:
What are raw and inner ids¶
Users and items have a raw id and an inner id. Some methods use or return a raw id (e.g. the predict() method), while others use or return an inner id.
Raw ids are ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note though that if the ratings were read from a file, which is the standard scenario, they are represented as strings. This is important to know if you're using e.g. predict() or other methods that accept raw ids as parameters.
On trainset creation, each raw id is mapped to a unique integer called inner id, which is a lot more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done using the to_inner_uid(), to_inner_iid(), to_raw_uid(), and to_raw_iid() methods of the trainset.
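The raw-to-inner mapping described above can be pictured with a small sketch. It is a toy illustration in plain Python and not Surprise's internal code; the names raw_ratings, raw2inner_uid and inner2raw_uid are made up for the example.

```python
# Illustrative only: a toy raw-id <-> inner-id mapping for users.
raw_ratings = [
    ("alice", "item_9", 4.0),
    ("bob", "item_9", 3.0),
    ("alice", "item_2", 5.0),
]

raw2inner_uid = {}  # raw user id -> inner user id
for raw_uid, _, _ in raw_ratings:
    # Each raw id seen for the first time gets the next unused integer.
    raw2inner_uid.setdefault(raw_uid, len(raw2inner_uid))

# The reverse mapping converts an inner id back to its raw id.
inner2raw_uid = {inner: raw for raw, inner in raw2inner_uid.items()}

print(raw2inner_uid)
print(inner2raw_uid[1])
```

Note that the raw ids here are strings, as they would be when ratings are read from a file.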
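Can I use my own dataset with Surprise, and can it be a pandas dataframe?¶
Yes, and yes. See the user guide.
How to tune an algorithm's parameters¶
You can tune the parameters of an algorithm with the GridSearchCV class, as described here. After the tuning, you may want to have an unbiased estimate of your algorithm's performance.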
How to get accuracy measures on the training set¶
You can use the build_testset() method of the Trainset object to build a testset that can then be used with the test() method:
from surprise import Dataset
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import KFold
data = Dataset.load_builtin('ml-100k')
algo = SVD()
trainset = data.build_full_trainset()
algo.fit(trainset)
testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE should be low as we are biased
accuracy.rmse(predictions, verbose=True) # ~ 0.68 (which is low)
Check out the example file for more usage examples.
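For reference, RMSE is the square root of the mean squared difference between true and estimated ratings. The sketch below computes it in plain Python under that standard definition; the rmse helper and the sample pairs are hypothetical and are not the library's accuracy.rmse implementation.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up (true, estimated) ratings, for illustration only.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 5.0), (3.0, 4.0)]
value = rmse(pairs)
print(value)
```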
How to save some data for unbiased accuracy estimation¶
If your goal is to tune the parameters of an algorithm, you may want to set aside some data to get an unbiased estimate of its performance. For instance, you may want to split your data into two sets A and B. A is used for parameter tuning using grid search, and B is used for unbiased estimation. This can be done as follows:
import random
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import GridSearchCV
# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
raw_ratings = data.raw_ratings
# shuffle ratings if you want
random.shuffle(raw_ratings)
# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]
data.raw_ratings = A_raw_ratings # data is now the set A
# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
algo = grid_search.best_estimator['rmse']
# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)
# Compute biased accuracy on A
predictions = algo.test(trainset.build_testset())
print('Biased accuracy on A,', end=' ')
accuracy.rmse(predictions)
# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings) # testset is now the set B
predictions = algo.test(testset)
print('Unbiased accuracy on B,', end=' ')
accuracy.rmse(predictions)
How to have reproducible experiments¶
Some algorithms randomly initialize their parameters (sometimes with numpy), and the cross-validation folds are also randomly generated. If you need to reproduce your experiments multiple times, you just have to set the seed of the RNG at the beginning of your program:
import random
import numpy as np
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)  # numpy is imported as np above
Where are datasets stored and how to change it?¶
By default, datasets downloaded by Surprise will be saved in the '~/.surprise_data' directory. This is also where dump files will be stored. You can change the default directory by setting the 'SURPRISE_DATA_FOLDER' environment variable.
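prediction_algorithms package¶
The prediction_algorithms package includes the prediction algorithms available for recommendation.
The available prediction algorithms are:
random_pred.NormalPredictor |
Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
baseline_only.BaselineOnly |
Algorithm predicting the baseline estimate for given user and item. |
knns.KNNBasic |
A basic collaborative filtering algorithm. |
knns.KNNWithMeans |
A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
knns.KNNWithZScore |
A basic collaborative filtering algorithm, taking into account the z-score normalization of each user. |
knns.KNNBaseline |
A basic collaborative filtering algorithm taking into account a baseline rating. |
matrix_factorization.SVD |
The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
matrix_factorization.SVDpp |
The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
matrix_factorization.NMF |
A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
slope_one.SlopeOne |
A simple yet accurate collaborative filtering algorithm. |
co_clustering.CoClustering |
A collaborative filtering algorithm based on co-clustering. |
You may want to check the notation standards before diving into the formulas.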
The algorithm base class¶
The surprise.prediction_algorithms.algo_base module defines the base class AlgoBase from which every single prediction algorithm has to inherit.
-
class surprise.prediction_algorithms.algo_base.AlgoBase(**kwargs)¶
Abstract class defining the basic behavior of a prediction algorithm.
Keyword Arguments: bsl_options (dict, optional) – If the algorithm needs to compute a baseline estimate, the bsl_options parameter is used to configure how it is computed. See Baselines estimates configuration for usage.
-
compute_baselines()¶
Compute users and items baselines.
The way baselines are computed depends on the bsl_options parameter passed at the creation of the algorithm (see Baselines estimates configuration).
This method is only relevant for algorithms using Pearson baseline similarity or the BaselineOnly algorithm.
Returns: A tuple (bu, bi), which are users and items baselines.
-
compute_similarities()¶
Build the similarity matrix.
The way the similarity matrix is computed depends on the sim_options parameter passed at the creation of the algorithm (see Similarity measure configuration).
This method is only relevant for algorithms using a similarity measure, such as the k-NN algorithms.
Returns: The similarity matrix.
-
fit(trainset)¶
Train an algorithm on a given training set.
This method is called by every derived class as the first basic step for training an algorithm. It basically just initializes some internal structures and sets the self.trainset attribute.
Parameters: trainset (Trainset) – A training set, as returned by the folds method.
Returns: self
-
get_neighbors(iid, k)¶
Return the k nearest neighbors of iid, which is the inner id of a user or an item, depending on the user_based field of sim_options (see Similarity measure configuration).
As the similarities are computed on the basis of a similarity measure, this method is only relevant for algorithms using a similarity measure, such as the k-NN algorithms.
For a usage example, see the FAQ.
Parameters:
- iid (int) – The (inner) id of the user (or item) for which we want the nearest neighbors. See this note.
- k (int) – The number of neighbors to retrieve.
Returns: The list of the k (inner) ids of the closest users (or items) to iid.
-
predict(uid, iid, r_ui=None, clip=True, verbose=False)¶
Compute the rating prediction for given user and item.
The predict method converts raw ids to inner ids and then calls the estimate method which is defined in every derived class. If the prediction is impossible (e.g. because the user and/or the item is unknown), the prediction is set to the global mean of all ratings.
Parameters:
- uid – (Raw) id of the user. See this note.
- iid – (Raw) id of the item. See this note.
- r_ui (float) – The true rating \(r_{ui}\). Optional, default is None.
- clip (bool) – Whether to clip the estimation into the rating scale. For example, if \(\hat{r}_{ui}\) is \(5.5\) while the rating scale is \([1, 5]\), then \(\hat{r}_{ui}\) is set to \(5\). Same goes if \(\hat{r}_{ui} < 1\). Default is True.
- verbose (bool) – Whether to print details of the prediction. Default is False.
Returns: A Prediction object containing:
- The (raw) user id uid.
- The (raw) item id iid.
- The true rating r_ui (\(r_{ui}\)).
- The estimated rating (\(\hat{r}_{ui}\)).
- Some additional details about the prediction that might be useful for later analysis.
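The fallback and clipping behavior described for predict can be sketched as follows. This is a simplified illustration under stated assumptions and not the library's code: the local PredictionImpossible class is only a stand-in for the real exception, and predict_sketch and toy_estimate are hypothetical names.

```python
class PredictionImpossible(Exception):
    """Stand-in for the exception of the same name (illustrative only)."""


def predict_sketch(estimate, uid, iid, global_mean, rating_scale=(1, 5), clip=True):
    """Call `estimate`, fall back to the global mean, then clip if requested."""
    try:
        est = estimate(uid, iid)
        was_impossible = False
    except PredictionImpossible:
        # Impossible prediction: default to the global mean of all ratings.
        est = global_mean
        was_impossible = True
    if clip:
        lower, upper = rating_scale
        est = min(upper, max(lower, est))
    return est, was_impossible


def toy_estimate(uid, iid):
    # Hypothetical estimator: knows a single user and overshoots the scale.
    if uid != "alice":
        raise PredictionImpossible("unknown user")
    return 5.5


print(predict_sketch(toy_estimate, "alice", "item_1", global_mean=3.5))
print(predict_sketch(toy_estimate, "bob", "item_1", global_mean=3.5))
```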
-
test(testset, verbose=False)¶
Test the algorithm on given testset, i.e. estimate all the ratings in the given testset.
Parameters:
- testset – A test set, as returned by a cross-validation iterator or by the build_testset() method.
- verbose (bool) – Whether to print details for each prediction. Default is False.
Returns: A list of Prediction objects that contains all the estimated ratings.
-
The predictions module¶
The surprise.prediction_algorithms.predictions module defines the Prediction named tuple and the PredictionImpossible exception.
-
class
surprise.prediction_algorithms.predictions.
Prediction
¶ A named tuple for storing the results of a prediction.
It’s wrapped in a class, but only for documentation and printing purposes.
Parameters:
-
exception
surprise.prediction_algorithms.predictions.
PredictionImpossible
¶ Exception raised when a prediction is impossible.
When raised, the estimation \(\hat{r}_{ui}\) is set to the global mean of all ratings \(\mu\).
Basic algorithms¶
These are basic algorithms that do not do much work but that are still useful for comparing accuracies.
-
class
surprise.prediction_algorithms.random_pred.
NormalPredictor
¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
The prediction \(\hat{r}_{ui}\) is generated from a normal distribution \(\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)\) where \(\hat{\mu}\) and \(\hat{\sigma}\) are estimated from the training data using Maximum Likelihood Estimation:
\[\begin{split}\hat{\mu} &= \frac{1}{|R_{train}|} \sum_{r_{ui} \in R_{train}} r_{ui}\\\\ \hat{\sigma} &= \sqrt{\sum_{r_{ui} \in R_{train}} \frac{(r_{ui} - \hat{\mu})^2}{|R_{train}|}}\end{split}\]
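As a rough illustration of the two estimators above, the sketch below computes \(\hat{\mu}\) and \(\hat{\sigma}\) from a made-up list of training ratings and draws one prediction from the resulting normal distribution. It uses only the standard library and is not the library's implementation; train_ratings is hypothetical data.

```python
import math
import random

# Made-up training ratings, for illustration only.
train_ratings = [4.0, 3.0, 5.0, 2.0, 4.0, 3.0]

n = len(train_ratings)
mu_hat = sum(train_ratings) / n
# Maximum likelihood estimate: divide by n, not by n - 1.
sigma_hat = math.sqrt(sum((r - mu_hat) ** 2 for r in train_ratings) / n)

# A seeded generator keeps the draw reproducible.
rng = random.Random(0)
prediction = rng.gauss(mu_hat, sigma_hat)

print(mu_hat, sigma_hat, prediction)
```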
-
class
surprise.prediction_algorithms.baseline_only.
BaselineOnly
(bsl_options={})¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
Algorithm predicting the baseline estimate for given user and item.
\(\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i\)
If user \(u\) is unknown, then the bias \(b_u\) is assumed to be zero. The same applies for item \(i\) with \(b_i\).
See section 2.1 of [Kor10] for details.
Parameters: bsl_options (dict) – A dictionary of options for the baseline estimates computation. See Baselines estimates configuration for accepted options.
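The baseline formula and its handling of unknown users and items can be sketched in a few lines. The function and the bias dictionaries below are hypothetical, and the bias values are made up rather than learned.

```python
def baseline_estimate(mu, user_bias, item_bias, u, i):
    """Return mu + b_u + b_i, treating an unknown user or item bias as zero."""
    return mu + user_bias.get(u, 0.0) + item_bias.get(i, 0.0)


# Made-up global mean and biases, for illustration only.
mu = 3.5
user_bias = {"alice": 0.4}
item_bias = {"matrix": -0.2}

print(baseline_estimate(mu, user_bias, item_bias, "alice", "matrix"))
print(baseline_estimate(mu, user_bias, item_bias, "bob", "matrix"))  # unknown user
```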
k-NN inspired algorithms¶
These are algorithms that are directly derived from a basic nearest neighbors approach.
Note
For each of these algorithms, the actual number of neighbors that are aggregated to compute an estimation is necessarily less than or equal to \(k\). First, there might just not exist enough neighbors and second, the sets \(N_i^k(u)\) and \(N_u^k(i)\) only include neighbors for which the similarity measure is positive. It would make no sense to aggregate ratings from users (or items) that are negatively correlated. For a given prediction, the actual number of neighbors can be retrieved in the 'actual_k' field of the details dictionary of the prediction.
You may want to read the User Guide on how to configure the sim_options parameter.
-
class
surprise.prediction_algorithms.knns.
KNNBasic
(k=40, min_k=1, sim_options={}, **kwargs)¶ Bases:
surprise.prediction_algorithms.knns.SymmetricAlgo
A basic collaborative filtering algorithm.
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot r_{vi}} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]
or
\[\hat{r}_{ui} = \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot r_{uj}} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]
depending on the user_based field of the sim_options parameter.
Parameters:
- k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
- min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
- sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options.
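The user-based formula, together with the k and min_k rules, can be sketched as a similarity-weighted average. This is an illustrative sketch and not the library's implementation; knn_basic_predict and the neighbor data are hypothetical.

```python
def knn_basic_predict(neighbors, k=40, min_k=1, global_mean=3.5):
    """User-based KNNBasic-style estimate.

    `neighbors` is a list of (similarity, rating) pairs, one per user v
    who rated item i. Returns (estimate, actual_k).
    """
    # Keep the k most similar neighbors, then drop non-positive similarities.
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    used = [(sim, r) for sim, r in top_k if sim > 0]
    if len(used) < min_k or not used:
        # Not enough neighbors: fall back to the global mean.
        return global_mean, len(used)
    numerator = sum(sim * r for sim, r in used)
    denominator = sum(sim for sim, _ in used)
    return numerator / denominator, len(used)


# Made-up (similarity, rating) pairs; the negative similarity is ignored.
neighbors = [(0.9, 4.0), (0.1, 2.0), (-0.5, 1.0)]
print(knn_basic_predict(neighbors))
print(knn_basic_predict(neighbors, min_k=3))
```

The second value returned plays the role of the 'actual_k' detail: it can be smaller than k when few neighbors have a positive similarity.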
-
class
surprise.prediction_algorithms.knns.
KNNWithMeans
(k=40, min_k=1, sim_options={}, **kwargs)¶ Bases:
surprise.prediction_algorithms.knns.SymmetricAlgo
A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \mu_u + \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v)} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]
or
\[\hat{r}_{ui} = \mu_i + \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j)} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]
depending on the user_based field of the sim_options parameter.
Parameters:
- k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
- min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero (so the prediction ends up being equivalent to the mean \(\mu_u\) or \(\mu_i\)). Default is 1.
- sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options.
-
class
surprise.prediction_algorithms.knns.
KNNWithZScore
(k=40, min_k=1, sim_options={}, **kwargs)¶ Bases:
surprise.prediction_algorithms.knns.SymmetricAlgo
A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \mu_u + \sigma_u \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]
or
\[\hat{r}_{ui} = \mu_i + \sigma_i \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j) / \sigma_j} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]
depending on the user_based field of the sim_options parameter.
If \(\sigma\) is 0, then the overall sigma is used in that case.
Parameters:
- k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
- min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero (so the prediction ends up being equivalent to the mean \(\mu_u\) or \(\mu_i\)). Default is 1.
- sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options.
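The user-based z-score formula can be sketched in the same spirit. The function below is a hypothetical illustration, not the library's code; it assumes every \(\sigma_v\) passed in is non-zero, i.e. that the caller has already substituted the overall sigma where needed.

```python
def knn_zscore_predict(mu_u, sigma_u, neighbors, k=40, min_k=1):
    """User-based z-score estimate.

    `neighbors` is a list of (similarity, r_vi, mu_v, sigma_v) tuples, one
    per user v who rated item i. Every sigma_v is assumed to be non-zero.
    """
    top_k = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    used = [n for n in top_k if n[0] > 0]
    if len(used) < min_k or not used:
        # Neighbor aggregation set to zero: the estimate is the user mean.
        return mu_u
    numerator = sum(sim * (r - mu_v) / sigma_v for sim, r, mu_v, sigma_v in used)
    denominator = sum(sim for sim, _, _, _ in used)
    return mu_u + sigma_u * numerator / denominator


# Made-up neighbor statistics, for illustration only.
neighbors = [(0.5, 5.0, 4.0, 0.5), (0.5, 2.0, 3.0, 1.0)]
print(knn_zscore_predict(3.0, 1.0, neighbors))
```

Setting every sigma to 1 reduces this sketch to the mean-centered formula of KNNWithMeans.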
-
class
surprise.prediction_algorithms.knns.
KNNBaseline
(k=40, min_k=1, sim_options={}, bsl_options={})¶ Bases:
surprise.prediction_algorithms.knns.SymmetricAlgo
A basic collaborative filtering algorithm taking into account a baseline rating.
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = b_{ui} + \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]
or
\[\hat{r}_{ui} = b_{ui} + \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - b_{uj})} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]
depending on the user_based field of the sim_options parameter. For the best predictions, use the pearson_baseline similarity measure.
This algorithm corresponds to formula (3), section 2.2 of [Kor10].
Parameters:
- k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
- min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero (so the prediction ends up being equivalent to the baseline). Default is 1.
- sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options. It is recommended to use the pearson_baseline similarity measure.
- bsl_options (dict) – A dictionary of options for the baseline estimates computation. See Baselines estimates configuration for accepted options.
Matrix Factorization-based algorithms¶
-
class
surprise.prediction_algorithms.matrix_factorization.
SVD
¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization [SM08] (see note below).
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u\]
If user \(u\) is unknown, then the bias \(b_u\) and the factors \(p_u\) are assumed to be zero. The same applies for item \(i\) with \(b_i\) and \(q_i\).
For details, see equation (5) from [KBV09]. See also [RRSK10], section 5.3.1.
To estimate all the unknowns, we minimize the following regularized squared error:
\[\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 + \lambda\left(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2\right)\]
The minimization is performed by a very straightforward stochastic gradient descent:
\[\begin{split}b_u &\leftarrow b_u &+ \gamma (e_{ui} - \lambda b_u)\\ b_i &\leftarrow b_i &+ \gamma (e_{ui} - \lambda b_i)\\ p_u &\leftarrow p_u &+ \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i &\leftarrow q_i &+ \gamma (e_{ui} \cdot p_u - \lambda q_i)\end{split}\]
where \(e_{ui} = r_{ui} - \hat{r}_{ui}\). These steps are performed over all the ratings of the trainset and repeated n_epochs times. Baselines are initialized to 0. User and item factors are randomly initialized according to a normal distribution, which can be tuned using the init_mean and init_std_dev parameters.
You also have control over the learning rate \(\gamma\) and the regularization term \(\lambda\). Both can be different for each kind of parameter (see below). By default, learning rates are set to 0.005 and regularization terms are set to 0.02.
Note
You can choose to use an unbiased version of this algorithm, simply predicting:
\[\hat{r}_{ui} = q_i^Tp_u\]
This is equivalent to Probabilistic Matrix Factorization ([SM08], section 2) and can be achieved by setting the biased parameter to False.
Parameters:
- n_factors – The number of factors. Default is 100.
- n_epochs – The number of iterations of the SGD procedure. Default is 20.
- biased (bool) – Whether to use baselines (or biases). See note above. Default is True.
- init_mean – The mean of the normal distribution for factor vectors initialization. Default is 0.
- init_std_dev – The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
- lr_all – The learning rate for all parameters. Default is 0.005.
- reg_all – The regularization term for all parameters. Default is 0.02.
- lr_bu – The learning rate for \(b_u\). Takes precedence over lr_all if set. Default is None.
- lr_bi – The learning rate for \(b_i\). Takes precedence over lr_all if set. Default is None.
- lr_pu – The learning rate for \(p_u\). Takes precedence over lr_all if set. Default is None.
- lr_qi – The learning rate for \(q_i\). Takes precedence over lr_all if set. Default is None.
- reg_bu – The regularization term for \(b_u\). Takes precedence over reg_all if set. Default is None.
- reg_bi – The regularization term for \(b_i\). Takes precedence over reg_all if set. Default is None.
- reg_pu – The regularization term for \(p_u\). Takes precedence over reg_all if set. Default is None.
- reg_qi – The regularization term for \(q_i\). Takes precedence over reg_all if set. Default is None.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- verbose – If True, prints the current epoch. Default is False.
-
pu¶ numpy array of size (n_users, n_factors) – The user factors (only exists if fit() has been called)
-
qi¶ numpy array of size (n_items, n_factors) – The item factors (only exists if fit() has been called)
-
bu¶ numpy array of size (n_users) – The user biases (only exists if fit() has been called)
-
bi¶ numpy array of size (n_items) – The item biases (only exists if fit() has been called)
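The SGD updates given for SVD can be sketched in plain Python as below. This is a teaching sketch under simplifying assumptions (a single learning rate and regularization term, list-based factors, made-up ratings); train_svd, svd_estimate and squared_error are hypothetical helpers, and the library's own implementation differs.

```python
import random


def train_svd(ratings, n_factors=2, n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Fit biases and factors by SGD. `ratings` holds (u, i, r_ui) inner ids."""
    rng = random.Random(seed)
    n_users = 1 + max(u for u, _, _ in ratings)
    n_items = 1 + max(i for _, i, _ in ratings)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = [0.0] * n_users  # baselines are initialized to 0
    bi = [0.0] * n_items
    # Factors are drawn from a normal distribution (mean 0, std dev 0.1).
    pu = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    qi = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(p * q for p, q in zip(pu[u], qi[i]))
            err = r - (mu + bu[u] + bi[i] + dot)  # e_ui
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = pu[u][f], qi[i][f]  # use the values before the update
                pu[u][f] += lr * (err * qif - reg * puf)
                qi[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, pu, qi


def svd_estimate(model, u, i):
    mu, bu, bi, pu, qi = model
    return mu + bu[u] + bi[i] + sum(p * q for p, q in zip(pu[u], qi[i]))


def squared_error(model, ratings):
    return sum((r - svd_estimate(model, u, i)) ** 2 for u, i, r in ratings)


# Made-up ratings with inner ids, for illustration only.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 1.0), (2, 1, 2.0)]
untrained = train_svd(ratings, n_epochs=0)
trained = train_svd(ratings, n_epochs=50, lr=0.05)
print(squared_error(untrained, ratings), squared_error(trained, ratings))
```

The larger learning rate in the usage lines is chosen only so that the tiny example converges visibly in a few epochs.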
-
class
surprise.prediction_algorithms.matrix_factorization.
SVDpp
¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \mu + b_u + b_i + q_i^T\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u}y_j\right)\]
where the \(y_j\) terms are a new set of item factors that capture implicit ratings. Here, an implicit rating describes the fact that a user \(u\) rated an item \(j\), regardless of the rating value.
If user \(u\) is unknown, then the bias \(b_u\) and the factors \(p_u\) are assumed to be zero. The same applies for item \(i\) with \(b_i\), \(q_i\) and \(y_i\).
For details, see section 4 of [Kor08]. See also [RRSK10], section 5.3.1.
Just as for SVD, the parameters are learned using SGD on the regularized squared error objective.
Baselines are initialized to 0. User and item factors are randomly initialized according to a normal distribution, which can be tuned using the init_mean and init_std_dev parameters.
You have control over the learning rate \(\gamma\) and the regularization term \(\lambda\). Both can be different for each kind of parameter (see below). By default, learning rates are set to 0.007 and regularization terms are set to 0.02.
Parameters:
- n_factors – The number of factors. Default is 20.
- n_epochs – The number of iterations of the SGD procedure. Default is 20.
- init_mean – The mean of the normal distribution for factor vectors initialization. Default is 0.
- init_std_dev – The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
- lr_all – The learning rate for all parameters. Default is 0.007.
- reg_all – The regularization term for all parameters. Default is 0.02.
- lr_bu – The learning rate for \(b_u\). Takes precedence over lr_all if set. Default is None.
- lr_bi – The learning rate for \(b_i\). Takes precedence over lr_all if set. Default is None.
- lr_pu – The learning rate for \(p_u\). Takes precedence over lr_all if set. Default is None.
- lr_qi – The learning rate for \(q_i\). Takes precedence over lr_all if set. Default is None.
- lr_yj – The learning rate for \(y_j\). Takes precedence over lr_all if set. Default is None.
- reg_bu – The regularization term for \(b_u\). Takes precedence over reg_all if set. Default is None.
- reg_bi – The regularization term for \(b_i\). Takes precedence over reg_all if set. Default is None.
- reg_pu – The regularization term for \(p_u\). Takes precedence over reg_all if set. Default is None.
- reg_qi – The regularization term for \(q_i\). Takes precedence over reg_all if set. Default is None.
- reg_yj – The regularization term for \(y_j\). Takes precedence over reg_all if set. Default is None.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- verbose – If True, prints the current epoch. Default is False.
-
pu¶ numpy array of size (n_users, n_factors) – The user factors (only exists if fit() has been called)
-
qi¶ numpy array of size (n_items, n_factors) – The item factors (only exists if fit() has been called)
-
yj¶ numpy array of size (n_items, n_factors) – The (implicit) item factors (only exists if fit() has been called)
-
bu¶ numpy array of size (n_users) – The user biases (only exists if fit() has been called)
-
bi¶ numpy array of size (n_items) – The item biases (only exists if fit() has been called)
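The SVD++ prediction formula can be evaluated directly once the parameters are known. The sketch below does only that, with made-up parameter values; svdpp_estimate is a hypothetical helper and no training is shown.

```python
import math


def svdpp_estimate(mu, b_u, b_i, q_i, p_u, implicit_factors):
    """Evaluate mu + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) * sum_j y_j).

    `implicit_factors` is the list of y_j vectors for the items j in I_u.
    """
    n_factors = len(q_i)
    implicit_part = [0.0] * n_factors
    if implicit_factors:
        norm = 1.0 / math.sqrt(len(implicit_factors))
        for y_j in implicit_factors:
            for f in range(n_factors):
                implicit_part[f] += norm * y_j[f]
    dot = sum(q * (p + y) for q, p, y in zip(q_i, p_u, implicit_part))
    return mu + b_u + b_i + dot


# Made-up parameters: the user rated four items, so |I_u| = 4.
y_factors = [[0.2, 0.0]] * 4
estimate = svdpp_estimate(3.0, 0.1, -0.2, [1.0, 0.0], [0.5, 0.5], y_factors)
print(estimate)
```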
-
class
surprise.prediction_algorithms.matrix_factorization.
NMF
¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
A collaborative filtering algorithm based on Non-negative Matrix Factorization.
This algorithm is very similar to SVD. The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = q_i^Tp_u,\]
where user and item factors are kept positive. Our implementation follows that suggested in [LZXZ14], which is equivalent to [ZWFM96] in its non-regularized form. Both are direct applications of NMF for dense matrices [LS01].
The optimization procedure is a (regularized) stochastic gradient descent with a specific choice of step size that ensures non-negativity of factors, provided that their initial values are also positive.
At each step of the SGD procedure, the factors \(f\) of user \(u\) and item \(i\) are updated as follows:
\[\begin{split}p_{uf} &\leftarrow p_{uf} &\cdot \frac{\sum_{i \in I_u} q_{if} \cdot r_{ui}}{\sum_{i \in I_u} q_{if} \cdot \hat{r}_{ui} + \lambda_u |I_u| p_{uf}}\\ q_{if} &\leftarrow q_{if} &\cdot \frac{\sum_{u \in U_i} p_{uf} \cdot r_{ui}}{\sum_{u \in U_i} p_{uf} \cdot \hat{r}_{ui} + \lambda_i |U_i| q_{if}}\\\end{split}\]
where \(\lambda_u\) and \(\lambda_i\) are regularization parameters.
This algorithm is highly dependent on initial values. User and item factors are uniformly initialized between init_low and init_high. Change them at your own risk!
A biased version is available by setting the biased parameter to True. In this case, the prediction is set as
\[\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u,\]
still ensuring positive factors. Baselines are optimized in the same way as in the SVD algorithm. While yielding better accuracy, the biased version seems highly prone to overfitting so you may want to reduce the number of factors (or increase regularization).
Parameters:
- n_factors – The number of factors. Default is 15.
- n_epochs – The number of iterations of the SGD procedure. Default is 50.
- biased (bool) – Whether to use baselines (or biases). Default is False.
- reg_pu – The regularization term for users \(\lambda_u\). Default is 0.06.
- reg_qi – The regularization term for items \(\lambda_i\). Default is 0.06.
- reg_bu – The regularization term for \(b_u\). Only relevant for biased version. Default is 0.02.
- reg_bi – The regularization term for \(b_i\). Only relevant for biased version. Default is 0.02.
- lr_bu – The learning rate for \(b_u\). Only relevant for biased version. Default is 0.005.
- lr_bi – The learning rate for \(b_i\). Only relevant for biased version. Default is 0.005.
- init_low – Lower bound for random initialization of factors. Must be greater than 0 to ensure non-negative factors. Default is 0.
- init_high – Higher bound for random initialization of factors. Default is 1.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- verbose – If True, prints the current epoch. Default is False.
-
pu¶ numpy array of size (n_users, n_factors) – The user factors (only exists if fit() has been called)
-
qi¶ numpy array of size (n_items, n_factors) – The item factors (only exists if fit() has been called)
-
bu¶ numpy array of size (n_users) – The user biases (only exists if fit() has been called)
-
bi¶ numpy array of size (n_items) – The item biases (only exists if fit() has been called)
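The multiplicative updates for the unbiased version can be sketched as follows. This is a simplified illustration and not the library's implementation: nmf_epoch is a hypothetical helper that refreshes the estimates after each factor update, and the ratings and initial factors are made up. The point it shows is that positive factors stay non-negative because each update multiplies by a non-negative ratio.

```python
def nmf_epoch(ratings, pu, qi, reg_pu=0.06, reg_qi=0.06):
    """Apply one round of multiplicative updates in place.

    ratings: list of (u, i, r_ui) with integer inner ids and r_ui > 0.
    pu, qi:  lists of factor lists with non-negative entries.
    """
    n_factors = len(pu[0])

    def estimate(u, i):
        return sum(p * q for p, q in zip(pu[u], qi[i]))

    for u in range(len(pu)):
        rated = [(i, r) for v, i, r in ratings if v == u]  # items in I_u
        for f in range(n_factors):
            num = sum(qi[i][f] * r for i, r in rated)
            den = sum(qi[i][f] * estimate(u, i) for i, _ in rated)
            den += reg_pu * len(rated) * pu[u][f]
            if den > 0:
                pu[u][f] *= num / den

    for i in range(len(qi)):
        raters = [(u, r) for u, j, r in ratings if j == i]  # users in U_i
        for f in range(n_factors):
            num = sum(pu[u][f] * r for u, r in raters)
            den = sum(pu[u][f] * estimate(u, i) for u, _ in raters)
            den += reg_qi * len(raters) * qi[i][f]
            if den > 0:
                qi[i][f] *= num / den


# Made-up ratings and positive initial factors, for illustration only.
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 1, 1.0)]
pu = [[0.5, 0.2], [0.3, 0.9]]
qi = [[0.4, 0.6], [0.8, 0.1]]
for _ in range(10):
    nmf_epoch(ratings, pu, qi)
print(pu, qi)
```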
Slope One¶
-
class
surprise.prediction_algorithms.slope_one.
SlopeOne
¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
A simple yet accurate collaborative filtering algorithm.
This is a straightforward implementation of the SlopeOne algorithm [LM07].
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \mu_u + \frac{1}{ |R_i(u)|} \sum\limits_{j \in R_i(u)} \text{dev}(i, j),\]
where \(R_i(u)\) is the set of relevant items, i.e. the set of items \(j\) rated by \(u\) that also have at least one common user with \(i\). \(\text{dev}(i, j)\) is defined as the average difference between the ratings of \(i\) and those of \(j\), taken over the set \(U_{ij}\) of users who rated both items:
\[\text{dev}(i, j) = \frac{1}{ |U_{ij}|}\sum\limits_{u \in U_{ij}} \left(r_{ui} - r_{uj}\right)\]
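The two SlopeOne formulas can be sketched directly on a small dictionary of ratings. This is an unoptimized illustration, not the library's implementation; slope_one_predict and the ratings are hypothetical, and the fallback to \(\mu_u\) when no relevant item exists is an assumption of the sketch.

```python
def slope_one_predict(ratings, u, i):
    """SlopeOne estimate. `ratings` maps user -> {item: rating}."""
    user_ratings = ratings[u]
    mu_u = sum(user_ratings.values()) / len(user_ratings)
    devs = []
    for j in user_ratings:
        # U_ij: users who rated both i and j.
        common = [v for v in ratings if i in ratings[v] and j in ratings[v]]
        if common:  # j is a relevant item
            dev_ij = sum(ratings[v][i] - ratings[v][j] for v in common) / len(common)
            devs.append(dev_ij)
    if not devs:
        # Assumption of this sketch: fall back to the user's mean rating.
        return mu_u
    return mu_u + sum(devs) / len(devs)


# Made-up ratings, for illustration only.
ratings = {
    "a": {"x": 4.0, "y": 3.0},
    "b": {"x": 3.0, "y": 2.0},
    "c": {"y": 3.0},
    "d": {"w": 2.0},
}
print(slope_one_predict(ratings, "c", "x"))
print(slope_one_predict(ratings, "d", "x"))
```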
Co-clustering¶
-
class
surprise.prediction_algorithms.co_clustering.
CoClustering
¶ Bases:
surprise.prediction_algorithms.algo_base.AlgoBase
A collaborative filtering algorithm based on co-clustering.
This is a straightforward implementation of [GM05].
Basically, users and items are assigned some clusters \(C_u\), \(C_i\), and some co-clusters \(C_{ui}\).
The prediction \(\hat{r}_{ui}\) is set as:
\[\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i}),\]
where \(\overline{C_{ui}}\) is the average rating of co-cluster \(C_{ui}\), \(\overline{C_u}\) is the average rating of \(u\)'s cluster, and \(\overline{C_i}\) is the average rating of \(i\)'s cluster. If the user is unknown, the prediction is \(\hat{r}_{ui} = \mu_i\). If the item is unknown, the prediction is \(\hat{r}_{ui} = \mu_u\). If both the user and the item are unknown, the prediction is \(\hat{r}_{ui} = \mu\).
Clusters are assigned using a straightforward optimization method, much like k-means.
Parameters:
- n_cltr_u (int) – Number of user clusters. Default is 3.
- n_cltr_i (int) – Number of item clusters. Default is 3.
- n_epochs (int) – Number of iterations of the optimization loop. Default is 20.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- verbose (bool) – If True, the current epoch will be printed. Default is False.
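The prediction rule, including the three fallback cases, can be sketched once cluster assignments and averages are available. The model dictionary below is a hypothetical layout with made-up values, not the library's internal representation, and the cluster assignment step itself is not shown.

```python
def co_clustering_estimate(model, u, i):
    """Evaluate the co-clustering prediction for user u and item i."""
    known_user = u in model["user_cluster"]
    known_item = i in model["item_cluster"]
    if known_user and known_item:
        cu = model["user_cluster"][u]
        ci = model["item_cluster"][i]
        return (
            model["avg_cocluster"][cu][ci]
            + (model["user_mean"][u] - model["avg_user_cluster"][cu])
            + (model["item_mean"][i] - model["avg_item_cluster"][ci])
        )
    if known_item:  # unknown user
        return model["item_mean"][i]
    if known_user:  # unknown item
        return model["user_mean"][u]
    return model["mu"]  # both unknown


# Made-up model: one user cluster, two item clusters.
model = {
    "mu": 3.0,
    "user_cluster": {"alice": 0},
    "item_cluster": {"matrix": 1},
    "user_mean": {"alice": 4.0},
    "item_mean": {"matrix": 3.5},
    "avg_user_cluster": [3.8],
    "avg_item_cluster": [3.0, 3.2],
    "avg_cocluster": [[3.0, 3.6]],
}
print(co_clustering_estimate(model, "alice", "matrix"))
```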
The model_selection package¶
Surprise provides various tools to run cross-validation procedures and search for the best parameters for a prediction algorithm. The tools presented here are all heavily inspired by the excellent scikit learn library.
Cross validation iterators¶
The model_selection.split module contains various cross-validation iterators. Design and tools are inspired by the mighty scikit learn.
The available iterators are:
KFold |
A basic cross-validation iterator. |
RepeatedKFold |
Repeated KFold cross validator. |
ShuffleSplit |
A basic cross-validation iterator with random trainsets and testsets. |
LeaveOneOut |
Cross-validation iterator where each user has exactly one rating in the testset. |
PredefinedKFold |
A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method. |
This module also contains a function for splitting datasets into trainset and testset:
train_test_split |
Split a dataset into trainset and testset. |
-
class
surprise.model_selection.split.
KFold
(n_splits=5, random_state=None, shuffle=True)¶ A basic cross-validation iterator.
Each fold is used once as a testset while the k - 1 remaining folds are used for training.
See an example in the User Guide.
Parameters:
- n_splits (int) – The number of folds.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
- shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.
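The k-fold idea can be sketched on rating indices alone. The generator below is a plain-Python illustration, not the library's KFold class; k_fold_indices is a hypothetical helper, and the fold sizes are chosen so that they differ by at most one.

```python
import random


def k_fold_indices(n_ratings, n_splits=5, shuffle=True, seed=None):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= n_splits <= n_ratings:
        raise ValueError("n_splits must be between 2 and the number of ratings")
    indices = list(range(n_ratings))
    if shuffle:
        # Shuffle a copy of the indices, leaving the data itself untouched.
        random.Random(seed).shuffle(indices)
    start = 0
    for fold in range(n_splits):
        # The first n_ratings % n_splits folds receive one extra rating.
        size = n_ratings // n_splits + (1 if fold < n_ratings % n_splits else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, n_splits=3, seed=0))
for train, test in folds:
    print(len(train), len(test))
```

Each rating index lands in exactly one test fold, so every fold is used once as a testset.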
-
class
surprise.model_selection.split.
LeaveOneOut
(n_splits=5, random_state=None)¶ Cross-validation iterator where each user has exactly one rating in the testset.
Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
See an example in the User Guide.
Parameters:
- n_splits (int) – The number of folds.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
- shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.
class surprise.model_selection.split.PredefinedKFold
A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.
See an example in the User Guide.
class surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)
Repeated KFold cross validator.
Repeats KFold n times with different randomization in each repetition.
See an example in the User Guide.
Parameters:
- n_splits (int) – The number of folds.
- n_repeats (int) – The number of repetitions.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
- shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.
class surprise.model_selection.split.ShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None, shuffle=True)
A basic cross-validation iterator with random trainsets and testsets.
Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
See an example in the User Guide.
Parameters:
- n_splits (int) – The number of folds.
- test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, it represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is .2.
- train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, it represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
- shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Setting this to False defeats the purpose of this iterator, but it’s useful for the implementation of train_test_split(). Default is True.
surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)
Split a dataset into trainset and testset.
See an example in the User Guide.
Note: this function cannot be used as a cross-validation iterator.
Parameters:
- data (Dataset) – The dataset to split into trainset and testset.
- test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, it represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is .2.
- train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, it represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
- random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
- shuffle (bool) – Whether to shuffle the ratings in the data parameter. Shuffling is not done in-place. Default is True.
Cross validation¶
surprise.model_selection.validation.cross_validate(algo, data, measures=['rmse', 'mae'], cv=None, return_train_measures=False, n_jobs=-1, pre_dispatch='2*n_jobs', verbose=False)
Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.
See an example in the User Guide.
Parameters:
- algo (AlgoBase) – The algorithm to evaluate.
- data (Dataset) – The dataset on which to evaluate the algorithm.
- measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
- cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
- return_train_measures (bool) – Whether to compute performance measures on the trainsets. Default is False.
- n_jobs (int) – The maximum number of folds evaluated in parallel.
  - If -1, all CPUs are used.
  - If 1 is given, no parallel computing code is used at all, which is useful for debugging.
  - For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.
  Default is -1.
- pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
  - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
  - An int, giving the exact number of total jobs that are spawned.
  - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
  Default is '2*n_jobs'.
- verbose (int) – If True, accuracy measures for each split are printed, as well as train and test times. Averages and standard deviations over all splits are also reported. Default is False: nothing is printed.
Returns: A dict with the following keys:
- 'test_*' where * corresponds to a lower-case accuracy measure, e.g. 'test_rmse': numpy array with accuracy values for each testset.
- 'train_*' where * corresponds to a lower-case accuracy measure, e.g. 'train_rmse': numpy array with accuracy values for each trainset. Only available if return_train_measures is True.
- 'fit_time': numpy array with the training time in seconds for each split.
- 'test_time': numpy array with the testing time in seconds for each split.
Return type: dict
Parameter search¶
class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=-1, pre_dispatch='2*n_jobs', joblib_verbose=0)
The GridSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.
See an example in the User Guide.
Parameters:
- algo_class (AlgoBase) – The class of the algorithm to evaluate.
- param_grid (dict) – Dictionary with algorithm parameters as keys and lists of values as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment, see this note.
- measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
- cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
- refit (bool or str) – If True, refit the algorithm on the whole dataset using the set of parameters that gave the best average performance for the first measure of measures. Other measures can be used by passing a string (corresponding to the measure name). Then, you can use the test() and predict() methods. refit can only be used if the data parameter given to fit() hasn’t been loaded with load_from_folds(). Default is False.
- return_train_measures (bool) – Whether to compute performance measures on the trainsets. If True, the cv_results attribute will also contain measures for trainsets. Default is False.
- n_jobs (int) – The maximum number of parallel training procedures.
  - If -1, all CPUs are used.
  - If 1 is given, no parallel computing code is used at all, which is useful for debugging.
  - For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.
  Default is -1.
- pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
  - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
  - An int, giving the exact number of total jobs that are spawned.
  - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
  Default is '2*n_jobs'.
- joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
best_estimator
dict of AlgoBase – Using an accuracy measure as key, get the algorithm that gave the best accuracy results for the chosen measure, averaged over all splits.
best_score
dict of floats – Using an accuracy measure as key, get the best average score achieved for that measure.
best_params
dict of dicts – Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure (on average).
best_index
dict of ints – Using an accuracy measure as key, get the index that can be used with cv_results that achieved the highest accuracy for that measure (on average).
cv_results
dict of arrays – A dict that contains accuracy measures over all splits, as well as train and test time for each parameter combination. Can be imported into a pandas DataFrame (see example).
fit(data)
Runs the fit() method of the algorithm for all parameter combinations, over different splits given by the cv parameter.
Parameters: data (Dataset) – The dataset on which to evaluate the algorithm, in parallel.
predict(*args)
Call predict() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.predict().
Only available if refit is not False.
test(testset, verbose=False)
Call test() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.test().
Only available if refit is not False.
similarities module¶
The similarities module includes tools to compute similarity metrics between users or items. You may need to refer to the Notation standards, References page. See also the Similarity measure configuration section of the User Guide.
Available similarity measures:
cosine | Compute the cosine similarity between all pairs of users (or items).
msd | Compute the Mean Squared Difference similarity between all pairs of users (or items).
pearson | Compute the Pearson correlation coefficient between all pairs of users (or items).
pearson_baseline | Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.
surprise.similarities.cosine()
Compute the cosine similarity between all pairs of users (or items).
Only common users (or items) are taken into account. The cosine similarity is defined as:
\[\text{cosine_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}} r_{ui} \cdot r_{vi}} {\sqrt{\sum\limits_{i \in I_{uv}} r_{ui}^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} r_{vi}^2} }\]
or
\[\text{cosine_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}} r_{ui} \cdot r_{uj}} {\sqrt{\sum\limits_{u \in U_{ij}} r_{ui}^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} r_{uj}^2} }\]
depending on the user_based field of sim_options (see Similarity measure configuration).
For details on cosine similarity, see on Wikipedia.
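To make the user-based formula concrete, here is a plain-Python sketch that evaluates it for one pair of users. It is not Surprise's implementation: the function name, the dict-based rating layout, and the choice of returning 0 when there are no common items are assumptions of this example.

```python
from math import sqrt


def cosine_sim(ratings_u, ratings_v):
    """Cosine similarity between two users, restricted to their common items.

    Each argument is a hypothetical dict mapping item id -> rating.
    """
    common = ratings_u.keys() & ratings_v.keys()  # I_uv
    if not common:
        return 0.0  # assumption of this sketch: no common items means 0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


u = {"x": 4, "y": 3, "z": 5}
v = {"x": 4, "y": 3, "w": 1}
sim = cosine_sim(u, v)  # only "x" and "y" are common to both users
```

Because the two users agree exactly on their common items, the similarity is 1 here even though their other ratings differ. This is a direct consequence of ignoring non-common items.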
surprise.similarities.msd()
Compute the Mean Squared Difference similarity between all pairs of users (or items).
Only common users (or items) are taken into account. The Mean Squared Difference is defined as:
\[\text{msd}(u, v) = \frac{1}{|I_{uv}|} \cdot \sum\limits_{i \in I_{uv}} (r_{ui} - r_{vi})^2\]
or
\[\text{msd}(i, j) = \frac{1}{|U_{ij}|} \cdot \sum\limits_{u \in U_{ij}} (r_{ui} - r_{uj})^2\]
depending on the user_based field of sim_options (see Similarity measure configuration).
The MSD-similarity is then defined as:
\[\begin{split}\text{msd_sim}(u, v) &= \frac{1}{\text{msd}(u, v) + 1}\\ \text{msd_sim}(i, j) &= \frac{1}{\text{msd}(i, j) + 1}\end{split}\]
The \(+ 1\) term is just here to avoid dividing by zero.
For details on MSD, see third definition on Wikipedia.
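The two-step definition (MSD first, then the similarity) can be traced with a plain-Python sketch for one pair of users. The function name, the dict layout, and the value returned when there are no common items are assumptions of this example, not Surprise's code.

```python
def msd_sim(ratings_u, ratings_v):
    """MSD-similarity between two users, restricted to their common items.

    Each argument is a hypothetical dict mapping item id -> rating.
    """
    common = ratings_u.keys() & ratings_v.keys()  # I_uv
    if not common:
        return 0.0  # assumption of this sketch: no common items means 0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1 / (msd + 1)  # the + 1 avoids dividing by zero when msd == 0


u = {"x": 4, "y": 2}
v = {"x": 2, "y": 2, "z": 5}
sim = msd_sim(u, v)  # msd = ((4 - 2)**2 + (2 - 2)**2) / 2 = 2, so sim = 1/3
```

Identical ratings on the common items give msd = 0 and therefore the maximum similarity of 1.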
surprise.similarities.pearson()
Compute the Pearson correlation coefficient between all pairs of users (or items).
Only common users (or items) are taken into account. The Pearson correlation coefficient can be seen as a mean-centered cosine similarity, and is defined as:
\[\text{pearson_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}} (r_{ui} - \mu_u) \cdot (r_{vi} - \mu_{v})} {\sqrt{\sum\limits_{i \in I_{uv}} (r_{ui} - \mu_u)^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} (r_{vi} - \mu_{v})^2} }\]
or
\[\text{pearson_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}} (r_{ui} - \mu_i) \cdot (r_{uj} - \mu_{j})} {\sqrt{\sum\limits_{u \in U_{ij}} (r_{ui} - \mu_i)^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} (r_{uj} - \mu_{j})^2} }\]
depending on the user_based field of sim_options (see Similarity measure configuration).
Note: if there are no common users or items, similarity will be 0 (and not -1).
For details on Pearson coefficient, see Wikipedia.
surprise.similarities.pearson_baseline()
Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.
The shrinkage parameter helps to avoid overfitting when only few ratings are available (see Similarity measure configuration).
The Pearson-baseline correlation coefficient is defined as:
\[\text{pearson_baseline_sim}(u, v) = \hat{\rho}_{uv} = \frac{ \sum\limits_{i \in I_{uv}} (r_{ui} - b_{ui}) \cdot (r_{vi} - b_{vi})} {\sqrt{\sum\limits_{i \in I_{uv}} (r_{ui} - b_{ui})^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} (r_{vi} - b_{vi})^2}}\]
or
\[\text{pearson_baseline_sim}(i, j) = \hat{\rho}_{ij} = \frac{ \sum\limits_{u \in U_{ij}} (r_{ui} - b_{ui}) \cdot (r_{uj} - b_{uj})} {\sqrt{\sum\limits_{u \in U_{ij}} (r_{ui} - b_{ui})^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} (r_{uj} - b_{uj})^2}}\]
The shrunk Pearson-baseline correlation coefficient is then defined as:
\[ \begin{align}\begin{aligned}\text{pearson_baseline_shrunk_sim}(u, v) &= \frac{|I_{uv}| - 1} {|I_{uv}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{uv}\\\text{pearson_baseline_shrunk_sim}(i, j) &= \frac{|U_{ij}| - 1} {|U_{ij}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{ij}\end{aligned}\end{align} \]
Obviously, a shrinkage parameter of 0 amounts to no shrinkage at all.
Note: here again, if there are no common users/items, similarity will be 0 (and not -1).
Motivations for such a similarity measure can be found in the Recommender Systems Handbook, section 5.4.1.
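The effect of the shrinkage factor is easiest to see numerically. The sketch below only evaluates the shrinkage formula given above for an already computed correlation; the helper name is hypothetical and the default value of 100 is an assumption of this example.

```python
def shrunk(rho, n_common, shrinkage=100):
    """Apply the shrinkage factor to a correlation computed on n_common ratings.

    rho is an already computed Pearson-baseline coefficient; the default
    shrinkage of 100 is an assumption of this sketch.
    """
    return (n_common - 1) / (n_common - 1 + shrinkage) * rho


few = shrunk(0.9, n_common=3)     # 2 / 102 * 0.9, roughly 0.018
many = shrunk(0.9, n_common=201)  # 200 / 300 * 0.9 = 0.6
unshrunk = shrunk(0.9, n_common=3, shrinkage=0)  # factor is 1: no shrinkage
```

The same raw correlation of 0.9 is almost entirely discounted when it rests on 3 common ratings, and much less so when it rests on 201.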
accuracy module¶
The surprise.accuracy module provides tools for computing accuracy metrics on a set of predictions.
Available accuracy metrics:
rmse | Compute RMSE (Root Mean Squared Error).
mae | Compute MAE (Mean Absolute Error).
fcp | Compute FCP (Fraction of Concordant Pairs).
surprise.accuracy.fcp(predictions, verbose=True)
Compute FCP (Fraction of Concordant Pairs).
Computed as described in the paper Collaborative Filtering on Ordinal User Feedback by Koren and Sill, section 5.2.
Parameters:
- predictions (list of Prediction) – A list of predictions, as returned by the test() method.
- verbose – If True, will print computed value. Default is True.
Returns: The Fraction of Concordant Pairs.
Raises: ValueError – When predictions is empty.
surprise.accuracy.mae(predictions, verbose=True)
Compute MAE (Mean Absolute Error).
\[\text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}|r_{ui} - \hat{r}_{ui}|\]
Parameters:
- predictions (list of Prediction) – A list of predictions, as returned by the test() method.
- verbose – If True, will print computed value. Default is True.
Returns: The Mean Absolute Error of predictions.
Raises: ValueError – When predictions is empty.
surprise.accuracy.rmse(predictions, verbose=True)
Compute RMSE (Root Mean Squared Error).
\[\text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}(r_{ui} - \hat{r}_{ui})^2}.\]
Parameters:
- predictions (list of Prediction) – A list of predictions, as returned by the test() method.
- verbose – If True, will print computed value. Default is True.
Returns: The Root Mean Squared Error of predictions.
Raises: ValueError – When predictions is empty.
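dataset module¶
The dataset module defines the Dataset class and other subclasses which are used for managing datasets.
Users may use both built-in and user-defined datasets (see the Getting Started page for examples). Right now, three built-in datasets are available:
- The movielens-100k dataset.
- The movielens-1m dataset.
- The Jester dataset 2.
Built-in datasets can all be loaded (or downloaded if you haven’t already) using the Dataset.load_builtin() method.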
Summary:
Dataset.load_builtin | Load a built-in dataset.
Dataset.load_from_file | Load a dataset from a (custom) file.
Dataset.load_from_folds | Load a dataset where folds (for cross-validation) are predefined by some files.
Dataset.folds | Generator function to iterate over the folds of the Dataset.
DatasetAutoFolds.split | Split the dataset into folds for future cross-validation.
class surprise.dataset.Dataset(reader)
Base class for loading datasets.
Note that you should never instantiate the Dataset class directly (same goes for its derived classes), but instead use one of the available methods for loading datasets.
folds()
Generator function to iterate over the folds of the Dataset.
Warning
Deprecated since version 1.05. Use cross-validation iterators instead. This method will be removed in later versions.
Yields: tuple – Trainset and testset of current fold.
classmethod load_builtin(name='ml-100k')
Load a built-in dataset.
If the dataset has not already been loaded, it will be downloaded and saved. You will have to split your dataset using the split method. See an example in the User Guide.
Parameters: name (string) – The name of the built-in dataset to load. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is ‘ml-100k’.
Returns: A Dataset object.
Raises: ValueError – If the name parameter is incorrect.
classmethod load_from_df(df, reader)
Load a dataset from a pandas dataframe.
Use this if you want to use a custom dataset that is stored in a pandas dataframe. See the User Guide for an example.
Parameters:
- df (Dataframe) – The dataframe containing the ratings. It must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
- reader (Reader) – A reader to read the file. Only the rating_scale field needs to be specified.
classmethod load_from_file(file_path, reader)
Load a dataset from a (custom) file.
Use this if you want to use a custom dataset and all of the ratings are stored in one file. You will have to split your dataset using the split method. See an example in the User Guide.
Parameters:
- file_path (string) – The path to the file containing ratings.
- reader (Reader) – A reader to read the file.
classmethod load_from_folds(folds_files, reader)
Load a dataset where folds (for cross-validation) are predefined by some files.
The purpose of this method is to cover a common use case where a dataset is already split into predefined folds, such as the movielens-100k dataset which defines files u1.base, u1.test, u2.base, u2.test, etc… It can also be used when you don’t want to perform cross-validation but still want to specify your training and testing data (which comes down to 1-fold cross-validation anyway). See an example in the User Guide.
Parameters:
- folds_files (iterable of tuples) – The list of the folds. A fold is a tuple of the form (path_to_train_file, path_to_test_file).
- reader (Reader) – A reader to read the files.
class surprise.dataset.DatasetAutoFolds(ratings_file=None, reader=None, df=None)
A derived class from Dataset for which folds (for cross-validation) are not predefined. (Or for when there are no folds at all).
build_full_trainset()
Do not split the dataset into folds and just return a trainset as is, built from the whole dataset.
User can then query for predictions, as shown in the User Guide.
Returns: The Trainset.
split(n_folds=5, shuffle=True)
Split the dataset into folds for future cross-validation.
Warning
Deprecated since version 1.05. Use cross-validation iterators instead. This method will be removed in later versions.
If you forget to call split(), the dataset will be automatically shuffled and split for 5-fold cross-validation.
You can obtain repeatable splits over all your experiments by seeding the RNG:
import random
random.seed(my_seed)  # call this before you call split!
Parameters:
- n_folds (int) – The number of folds.
- shuffle (bool) – Whether to shuffle ratings before splitting. If False, folds will always be the same each time the experiment is run. Default is True.
Trainset class¶
class surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)
A trainset contains all useful data that constitutes a training set.
It is used by the fit() method of every prediction algorithm. You should not try to build such an object on your own but rather use the Dataset.folds() method or the DatasetAutoFolds.build_full_trainset() method.
Trainsets are different from Datasets. You can think of a Dataset as the raw data, and Trainsets as higher-level data where useful methods are defined. Also, a Dataset may be comprised of multiple Trainsets (e.g. when doing cross validation).
ur
defaultdict of list – The users ratings. This is a dictionary containing lists of tuples of the form (item_inner_id, rating). The keys are user inner ids.
ir
defaultdict of list – The items ratings. This is a dictionary containing lists of tuples of the form (user_inner_id, rating). The keys are item inner ids.
n_users
Total number of users \(|U|\).
n_items
Total number of items \(|I|\).
n_ratings
Total number of ratings \(|R_{train}|\).
rating_scale
tuple – The minimum and maximal rating of the rating scale.
global_mean
The mean of all ratings \(\mu\).
all_items()
Generator function to iterate over all items.
Yields: Inner id of items.
all_ratings()
Generator function to iterate over all ratings.
Yields: A tuple (uid, iid, rating) where ids are inner ids (see this note).
all_users()
Generator function to iterate over all users.
Yields: Inner id of users.
build_anti_testset(fill=None)
Return a list of ratings that can be used as a testset in the test() method.
The ratings are all the ratings that are not in the trainset, i.e. all the ratings \(r_{ui}\) where the user \(u\) is known, the item \(i\) is known, but the rating \(r_{ui}\) is not in the trainset. As \(r_{ui}\) is unknown, it is either replaced by the fill value or assumed to be equal to the mean of all ratings global_mean.
Parameters: fill (float) – The value to fill unknown ratings. If None, the global mean of all ratings global_mean will be used.
Returns: A list of tuples (uid, iid, fill) where ids are raw ids.
build_testset()
Return a list of ratings that can be used as a testset in the test() method.
The ratings are all the ratings that are in the trainset, i.e. all the ratings returned by the all_ratings() generator. This is useful in cases where you want to test your algorithm on the trainset.
global_mean
Return the mean of all ratings.
It’s only computed once.
knows_item(iid)
Indicate if the item is part of the trainset.
An item is part of the trainset if the item was rated at least once.
Parameters: iid (int) – The (inner) item id. See this note.
Returns: True if item is part of the trainset, else False.
knows_user(uid)
Indicate if the user is part of the trainset.
A user is part of the trainset if the user has at least one rating.
Parameters: uid (int) – The (inner) user id. See this note.
Returns: True if user is part of the trainset, else False.
to_inner_iid(riid)
Convert an item raw id to an inner id.
See this note.
Parameters: riid (str) – The item raw id.
Returns: The item inner id.
Return type: int
Raises: ValueError – When item is not part of the trainset.
to_inner_uid(ruid)
Convert a user raw id to an inner id.
See this note.
Parameters: ruid (str) – The user raw id.
Returns: The user inner id.
Return type: int
Raises: ValueError – When user is not part of the trainset.
-
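Reader class¶
class surprise.reader.Reader(name=None, line_format='user item rating', sep=None, rating_scale=(1, 5), skip_lines=0)
The Reader class is used to parse a file containing ratings.
Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure:
user ; item ; rating ; [timestamp]
where the order of the fields and the separator (here ';') may be arbitrarily defined (see below). Brackets indicate that the timestamp field is optional.
For each built-in dataset, Surprise also provides predefined readers, which are useful if you want to use a custom dataset that has the same format as a built-in one (see the name parameter).
Parameters:
- name (string, optional) – If specified, a Reader for the built-in dataset of that name is returned and any other parameter is ignored. Accepted values are 'ml-100k', 'ml-1m', and 'jester'. Default is None.
- line_format (string) – The field names, in the order in which they appear on a line. Note that line_format itself is space-separated (use the sep parameter to set the separator used in the file). Default is 'user item rating'.
- sep (char) – The separator between fields. Example: ';'.
- rating_scale (tuple, optional) – The rating scale used for every rating. Default is (1, 5).
- skip_lines (int, optional) – Number of lines to skip at the beginning of the file. Default is 0.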
evaluate module¶
The evaluate module defines the evaluate() function and the GridSearch class.
class surprise.evaluate.GridSearch(algo_class, param_grid, measures=['rmse', 'mae'], n_jobs=-1, pre_dispatch='2*n_jobs', seed=None, verbose=1, joblib_verbose=0)
Warning
Deprecated since version 1.05. Use GridSearchCV instead. This class will be removed in later versions.
The GridSearch class is used to evaluate the performance of an algorithm on various combinations of parameters, and extract the best combination. It is analogous to GridSearchCV from scikit-learn.
See User Guide for usage.
Parameters:
- algo_class (AlgoBase) – The class object of the algorithm to evaluate.
- param_grid (dict) – Dictionary with algorithm parameters as keys and lists of values as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment, see this note.
- measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
- n_jobs (int) – The maximum number of algorithm trainings run in parallel.
  - If -1, all CPUs are used.
  - If 1 is given, no parallel computing code is used at all, which is useful for debugging.
  - For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.
  Default is -1.
- pre_dispatch (int or string) – Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
  - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
  - An int, giving the exact number of total jobs that are spawned.
  - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
  Default is '2*n_jobs'.
- seed (int) – The value to use as seed for RNG. It will determine how splits are defined. If None, the current time since epoch is used. Default is None.
- verbose (bool) – Level of verbosity. If False, nothing is printed. If True, the mean values of each measure are printed for each parameter combination. Default is True.
- joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
cv_results
dict of arrays – A dict that contains all parameters and accuracy information for each combination. Can be imported into a pandas DataFrame.
best_estimator
dict of AlgoBase – Using an accuracy measure as key, get the estimator that gave the best accuracy results for the chosen measure.
best_score
dict of floats – Using an accuracy measure as key, get the best score achieved for that measure.
best_params
dict of dicts – Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure.
best_index
dict of ints – Using an accuracy measure as key, get the index that can be used with cv_results that achieved the highest accuracy for that measure.
surprise.evaluate.evaluate(algo, data, measures=['rmse', 'mae'], with_dump=False, dump_dir=None, verbose=1)
Warning
Deprecated since version 1.05. Use cross_validate instead. This function will be removed in later versions.
Evaluate the performance of the algorithm on given data.
Depending on the nature of the data parameter, it may or may not perform cross validation.
Parameters:
- algo (AlgoBase) – The algorithm to evaluate.
- data (Dataset) – The dataset on which to evaluate the algorithm.
- measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
- with_dump (bool) – If True, the predictions and the algorithm will be dumped for later further analysis at each fold (see FAQ). The file names will be set as: '<date>-<algorithm name>-<fold number>'. Default is False.
- dump_dir (str) – The directory where to dump to files. Default is '~/.surprise_data/dumps/', or the folder specified by the 'SURPRISE_DATA_FOLDER' environment variable (see FAQ).
- verbose (int) – Level of verbosity. If 0, nothing is printed. If 1 (default), accuracy measures for each fold are printed, with a final summary. If 2, every prediction is printed.
Returns: A dictionary containing measures as keys and lists as values. Each list contains one entry per fold.
dump module¶
The dump module defines the dump() and load() functions.
surprise.dump.dump(file_name, predictions=None, algo=None, verbose=0)
A basic wrapper around Pickle to serialize a list of predictions and/or an algorithm on drive.
What is dumped is a dictionary with keys 'predictions' and 'algo'.
Parameters:
- file_name (str) – The name (with full path) specifying where to dump the predictions.
- predictions (list of Prediction) – The predictions to dump.
- algo (Algorithm, optional) – The algorithm to dump.
- verbose (int) – Level of verbosity. If 1, then a message indicates that the dumping went successfully. Default is 0.
surprise.dump.load(file_name)
A basic wrapper around Pickle to deserialize a list of predictions and/or an algorithm that were dumped on drive using dump().
Parameters: file_name (str) – The path of the file from which the algorithm is to be loaded.
Returns: A tuple (predictions, algo) where predictions is a list of Prediction objects and algo is an Algorithm object. Depending on what was dumped, some of these may be None.