Welcome to Surprise’s documentation!¶

Surprise is an easy-to-use Python scikit for recommender systems.

If you’re new to Surprise, we invite you to take a look at the Getting Started guide, where you’ll find a series of tutorials illustrating all you can do with Surprise. You can also check out the FAQ for many use-case examples. For installation guidelines, please refer to the project page.

Any kind of feedback/criticism would be greatly appreciated (software design, documentation, improvement ideas, spelling mistakes, etc…). Please feel free to contribute and send pull requests (see GitHub page)!

Getting Started¶

Basic usage¶

Automatic cross-validation¶

Surprise has a set of built-in algorithms and datasets for you to play with. In its simplest form, it only takes a few lines of code to run a cross-validation procedure:

From file examples/basic_usage.py¶

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate


# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

The result should be as follows (actual values may vary due to randomization):

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06

The load_builtin() method will offer to download the movielens-100k dataset if it has not already been downloaded, and it will save it in the .surprise_data folder in your home directory (you can also choose to save it somewhere else).

Here we use the well-known SVD algorithm, but many other algorithms are available. See Using prediction algorithms for more details.

The cross_validate() function runs a cross-validation procedure according to the cv argument, and computes some accuracy measures. Here we use a classical 5-fold cross-validation, but fancier iterators can be used (see Use cross-validation iterators).
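To make the cv argument more concrete, here is a minimal pure-Python sketch of what a K-fold split does to a list of ratings. It is only an illustration, not Surprise's implementation: the helper name k_fold_splits and the toy ratings are made up for this example.

```python
import random


def k_fold_splits(ratings, n_splits=5, seed=0):
    """Yield (train, test) lists; each rating lands in exactly one test fold.

    Hypothetical helper written for this illustration only.
    """
    indices = list(range(len(ratings)))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # Every n_splits-th shuffled index goes to the current test fold.
        test_idx = set(indices[fold::n_splits])
        train = [ratings[i] for i in indices if i not in test_idx]
        test = [ratings[i] for i in indices if i in test_idx]
        yield train, test


# Toy (user, item, rating) triples, made up for the example.
toy_ratings = [("u%d" % (n % 4), "i%d" % (n % 3), 1 + n % 5) for n in range(10)]
folds = list(k_fold_splits(toy_ratings, n_splits=5))
for train, test in folds:
    print(len(train), "training ratings,", len(test), "test ratings")
```

Each rating is used for testing exactly once and for training in all the other folds, which is why the reported measures are averaged over the folds.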

Train-test split and the fit() method¶

If you don’t want to run a full cross-validation procedure, you can use the train_test_split() function to sample a trainset and a testset with given sizes, and use the accuracy metric of your choosing. You’ll need to use the fit() method, which will train the algorithm on the trainset, and the test() method, which will return the predictions made from the testset:

From file examples/train_test_split.py¶

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Result:

RMSE: 0.9411
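For reference, the two accuracy measures used so far (RMSE and MAE) can be sketched in a few lines of plain Python. The function names and the toy (true rating, estimate) pairs below are assumptions made for the illustration; in practice you would rely on the accuracy module as shown above.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Made-up (true rating, estimated rating) pairs.
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
print("RMSE: {:.4f}".format(rmse(toy_pairs)))  # RMSE: 0.7500
print("MAE:  {:.4f}".format(mae(toy_pairs)))   # MAE:  0.6250
```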

Note that you can train and test an algorithm with the following one-liner:

predictions = algo.fit(trainset).test(testset)

In some cases, your trainset and testset are already defined by some files. See Use cross-validation iterators to handle such cases.

Train on a whole trainset and the predict() method¶

Obviously, we could also simply fit our algorithm to the whole dataset, rather than running cross-validation. This can be done by using the build_full_trainset() method which will build a trainset object:

From file examples/predict_ratings.py¶

from surprise import KNNBasic
from surprise import Dataset

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

We can now predict ratings by directly calling the predict() method. Let’s say you’re interested in user 196 and item 302 (make sure they’re in the trainset!), and you know that the true rating \(r_{ui} = 4\):

From file examples/predict_ratings.py¶

uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

The result should be:

user: 196        item: 302        r_ui = 4.00   est = 4.06   {'actual_k': 40, 'was_impossible': False}

Note

The predict() method uses raw ids (as opposed to the inner ids that Surprise uses internally). As the dataset we have used has been read from a file, the raw ids are strings (even if they represent numbers).
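The difference between raw and inner ids can be pictured with a small sketch. It assumes that inner ids are consecutive integers handed out in order of first appearance; the helper build_id_maps is hypothetical and only illustrates the idea, it is not part of Surprise.

```python
def build_id_maps(raw_ids):
    """Map each distinct raw id to a consecutive integer (an "inner" id)."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            raw_to_inner[raw] = len(raw_to_inner)
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Raw ids read from a file are strings, even when they look like numbers.
raw_user_ids = ["196", "186", "22", "196", "244"]
raw_to_inner, inner_to_raw = build_id_maps(raw_user_ids)

print(raw_to_inner["196"])   # inner id of the raw user "196"
print(196 in raw_to_inner)   # False: the integer 196 is a different key
```

This is why passing the integer 196 instead of the string '196' to predict() would refer to a user that the trainset does not know.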

We have so far used a built-in dataset, but you can of course use your own. This is explained in the next section.

Use a custom dataset¶

Surprise has a set of built-in datasets, but you can of course use a custom dataset. Loading a rating dataset can be done either from a file (e.g. a csv file), or from a pandas dataframe. Either way, you will need to define a Reader object for Surprise to be able to parse the file or the dataframe.

To load a dataset from a file (e.g. a csv file), you will need the load_from_file() method:

From file examples/load_custom_dataset.py¶

import os

from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# path to dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(BaselineOnly(), data, verbose=True)

For more details about readers and how to use them, see the Reader class documentation.
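To give an intuition of what line_format and sep describe, here is a rough sketch of how one line of such a file could be split into fields. The function parse_line is a hypothetical stand-in written for this example; it does not reproduce the Reader class.

```python
def parse_line(line, line_format="user item rating timestamp", sep="\t"):
    """Split one line according to line_format and return (user, item, rating)."""
    fields = line_format.split()
    values = line.rstrip("\n").split(sep)
    record = dict(zip(fields, values))
    # User and item ids stay strings (raw ids); only the rating is numeric.
    return record["user"], record["item"], float(record["rating"])


# A line laid out as 'user item rating timestamp', separated by tabs.
sample = "196\t242\t3\t881250949\n"
print(parse_line(sample))  # ('196', '242', 3.0)

# The same idea with another field order and separator.
print(parse_line("242,196,3", line_format="item user rating", sep=","))
```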

Note

As you already know from the previous section, the Movielens-100k dataset is built-in so a much quicker way to load the dataset is to do data = Dataset.load_builtin('ml-100k'). We will of course ignore this here.

To load a dataset from a pandas dataframe, you will need the load_from_df() method. You will also need a Reader object, but only the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings in this order. Each row thus corresponds to a given rating. This is not restrictive as you can reorder the columns of your dataframe easily.

From file examples/load_from_dataframe.py¶

import pandas as pd

from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate


# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(NormalPredictor(), data, cv=2)

The dataframe initially looks like this:

itemID  rating    userID
     1       3         9
     1       2        32
     1       4         2
     2       3        45
     2       1  user_foo

Use cross-validation iterators¶

For cross-validation, we can use the cross_validate() function that does all the hard work for us. But for better control, we can also instantiate a cross-validation iterator, and make predictions over each split using the split() method of the iterator, and the test() method of the algorithm. Here is an example where we use a classical K-fold cross-validation procedure with 3 splits:

From file examples/use_cross_validation_iterators.py¶

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Result could be, e.g.:

RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478

Other cross-validation iterators can be used, like LeaveOneOut or ShuffleSplit; all the available iterators are listed in the model_selection package documentation. The design of Surprise’s cross-validation tools is heavily inspired by the excellent scikit-learn API.

A special case of cross-validation is when the folds are already predefined by some files. For instance, the movielens-100K dataset already provides 5 train and test files (u1.base, u1.test … u5.base, u5.test). Surprise can handle this case by using a surprise.model_selection.split.PredefinedKFold object:

From file examples/load_custom_dataset_predefined_folds.py¶

import os

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold

# path to dataset folder
files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

# This time, we'll use the built-in reader.
reader = Reader('ml-100k')

# folds_files is a list of tuples containing file paths:
# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()

algo = SVD()

for trainset, testset in pkf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Of course, nothing prevents you from only loading a single file for training and a single file for testing. However, the folds_files parameter still needs to be a list.

Tune algorithm parameters with GridSearchCV¶

The cross_validate() function reports accuracy metrics over a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the GridSearchCV class comes to the rescue. Given a dict of parameters, this class exhaustively tries all the combinations of parameters and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn’s GridSearchCV.

Here is an example where we try different values for parameters n_epochs, lr_all and reg_all of the SVD algorithm.

From file examples/grid_search_usage.py¶

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100K
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Result:

0.961300130118
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}

Here we evaluate the average RMSE and MAE over a 3-fold cross-validation procedure, but any cross-validation iterator can be used.
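To see what "exhaustively tries all the combinations" means for the grid above, the combinations can be enumerated by hand with itertools.product. This is only a sketch of the idea, not a description of how GridSearchCV is implemented.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# One dict per combination: 2 * 2 * 2 = 8 candidates in total.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print(len(combinations))
for params in combinations:
    print(params)
```

Each of these 8 candidates is evaluated on every split, so the cost of a grid search grows multiplicatively with the number of values per parameter.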

Once fit() has been called, the best_estimator attribute gives us an algorithm instance with the optimal set of parameters, which can be used however we please:

From file examples/grid_search_usage.py¶

# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())

Note

Dictionary parameters such as bsl_options and sim_options require particular treatment. See usage example below:

param_grid = {'k': [10, 20],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }

Naturally, both can be combined, for example for the KNNBaseline algorithm:

param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg': [1, 2]},
              'k': [2, 3],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }
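One way to picture the "particular treatment" of dictionary parameters is that the nested dict of lists is first expanded into a list of candidate dicts, which then behaves like any other list of values. The sketch below illustrates that reading under this assumption; the helper expand is hypothetical.

```python
from itertools import product


def expand(grid):
    """Expand a dict of lists into a list of dicts, one per combination."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


param_grid = {'k': [10, 20],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }

# Nested dictionaries become lists of candidate dicts (here 2 * 2 * 1 = 4).
flat_grid = {name: expand(values) if isinstance(values, dict) else values
             for name, values in param_grid.items()}

combinations = expand(flat_grid)
print(len(combinations))  # 2 values of k * 4 sim_options candidates
print(combinations[0])
```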

For further analysis, the cv_results attribute has all the needed information and can be imported in a pandas dataframe:

From file examples/grid_search_usage.py¶

import pandas as pd

results_df = pd.DataFrame.from_dict(gs.cv_results)

In our example, the cv_results attribute looks like this (floats are formatted):

'split0_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split1_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split2_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'mean_test_rmse':   [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'std_test_rmse':    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_rmse':   [7 8 3 5 4 6 1 2]
'split0_test_mae':  [0.81, 0.82, 0.78, 0.79, 0.79, 0.8, 0.77, 0.79]
'split1_test_mae':  [0.8, 0.81, 0.78, 0.79, 0.78, 0.79, 0.77, 0.78]
'split2_test_mae':  [0.81, 0.81, 0.78, 0.79, 0.78, 0.8, 0.77, 0.78]
'mean_test_mae':    [0.81, 0.81, 0.78, 0.79, 0.79, 0.8, 0.77, 0.78]
'std_test_mae':     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_mae':    [7 8 2 5 4 6 1 3]
'mean_fit_time':    [1.53, 1.52, 1.53, 1.53, 3.04, 3.05, 3.06, 3.02]
'std_fit_time':     [0.03, 0.04, 0.0, 0.01, 0.04, 0.01, 0.06, 0.01]
'mean_test_time':   [0.46, 0.45, 0.44, 0.44, 0.47, 0.49, 0.46, 0.34]
'std_test_time':    [0.0, 0.01, 0.01, 0.0, 0.03, 0.06, 0.01, 0.08]
'params':           [{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
'param_n_epochs':   [5, 5, 5, 5, 10, 10, 10, 10]
'param_lr_all':     [0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.01, 0.01]
'param_reg_all':    [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]

As you can see, each list has as many entries as there are parameter combinations. It corresponds to the following table:

split0_test_rmse	split1_test_rmse	split2_test_rmse	mean_test_rmse	std_test_rmse	rank_test_rmse	split0_test_mae	split1_test_mae	split2_test_mae	mean_test_mae	std_test_mae	rank_test_mae	mean_fit_time	std_fit_time	mean_test_time	std_test_time	params	param_n_epochs	param_lr_all	param_reg_all
0.99775	0.997744	0.996378	0.997291	0.000645508	7	0.807862	0.804626	0.805282	0.805923	0.00139657	7	1.53341	0.0305216	0.455831	0.000922113	{‘n_epochs’: 5, ‘lr_all’: 0.002, ‘reg_all’: 0.4}	5	0.002	0.4
1.00381	1.00304	1.00257	1.00314	0.000508358	8	0.816559	0.812905	0.813772	0.814412	0.00155866	8	1.5199	0.0367117	0.451068	0.00938646	{‘n_epochs’: 5, ‘lr_all’: 0.002, ‘reg_all’: 0.6}	5	0.002	0.6
0.973524	0.973595	0.972495	0.973205	0.000502609	3	0.783361	0.780242	0.78067	0.781424	0.00138049	2	1.53449	0.00496203	0.441558	0.00529696	{‘n_epochs’: 5, ‘lr_all’: 0.005, ‘reg_all’: 0.4}	5	0.005	0.4
0.98229	0.982059	0.981486	0.981945	0.000338056	5	0.794481	0.790781	0.79186	0.792374	0.00155377	5	1.52739	0.00859185	0.44463	0.000888907	{‘n_epochs’: 5, ‘lr_all’: 0.005, ‘reg_all’: 0.6}	5	0.005	0.6
0.978034	0.978407	0.976919	0.977787	0.000632049	4	0.787643	0.784723	0.784957	0.785774	0.00132486	4	3.03572	0.0431101	0.466606	0.0254965	{‘n_epochs’: 10, ‘lr_all’: 0.002, ‘reg_all’: 0.4}	10	0.002	0.4
0.986263	0.985817	0.985004	0.985695	0.000520899	6	0.798218	0.794457	0.795373	0.796016	0.00160135	6	3.0544	0.00636185	0.488357	0.0576194	{‘n_epochs’: 10, ‘lr_all’: 0.002, ‘reg_all’: 0.6}	10	0.002	0.6
0.963751	0.963463	0.962676	0.963297	0.000454661	1	0.774036	0.770548	0.771588	0.772057	0.00146201	1	3.0636	0.0597982	0.456484	0.00510321	{‘n_epochs’: 10, ‘lr_all’: 0.005, ‘reg_all’: 0.4}	10	0.005	0.4
0.973605	0.972868	0.972765	0.973079	0.000374222	2	0.78607	0.781918	0.783537	0.783842	0.00170855	3	3.01907	0.011834	0.338839	0.075346	{‘n_epochs’: 10, ‘lr_all’: 0.005, ‘reg_all’: 0.6}	10	0.005	0.6

Command line usage¶

Surprise can also be used from the command line, for example:

surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3

See detailed usage by running:

surprise -h

Using prediction algorithms¶

Surprise provides a bunch of built-in algorithms. All algorithms derive from the AlgoBase base class, which implements some key methods (e.g. predict, fit and test). The list and details of the available prediction algorithms can be found in the prediction_algorithms package documentation.

Every algorithm is part of the global Surprise namespace, so you only need to import their names from the Surprise package, for example:

from surprise import KNNBasic
algo = KNNBasic()

Some of these algorithms may use baseline estimates, some may use a similarity measure. Here we review how to configure the way baselines and similarities are computed.

Baseline estimates configuration¶

Note

This section only applies to algorithms (or similarity measures) that try to minimize the following regularized squared error (or equivalent):

\[\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2 \right).\]

For algorithms using baselines in another objective function (e.g. the SVD algorithm), the baseline configuration is done differently and is specific to each algorithm. Please refer to their own documentation.

First of all, if you do not want to configure the way baselines are computed, you don’t have to: the default parameters will do just fine. If you do want to, well… this is for you.

You may want to read section 2.1 of [Kor10] to get a good idea of what baseline estimates are.

Baselines can be estimated in two different ways:

Using Stochastic Gradient Descent (SGD).
Using Alternating Least Squares (ALS).

You can configure the way baselines are computed using the bsl_options parameter passed at the creation of an algorithm. This parameter is a dictionary for which the key 'method' indicates the method to use. Accepted values are 'als' (default) and 'sgd'. Depending on its value, other options may be set. For ALS:

'reg_i': The regularization parameter for items. Corresponding to \(\lambda_2\) in [Kor10]. Default is 10.
'reg_u': The regularization parameter for users. Corresponding to \(\lambda_3\) in [Kor10]. Default is 15.
'n_epochs': The number of iterations of the ALS procedure. Default is 10. Note that in [Kor10], what is described is a single iteration ALS process.

And for SGD:

'reg': The regularization parameter of the cost function that is optimized, corresponding to \(\lambda_1\) and then \(\lambda_5\) in [Kor10]. Default is 0.02.
'learning_rate': The learning rate of SGD, corresponding to \(\gamma\) in [Kor10]. Default is 0.005.
'n_epochs': The number of iterations of the SGD procedure. Default is 20.

Note

For both procedures (ALS and SGD), user and item biases (\(b_u\) and \(b_i\)) are initialized to zero.
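As a rough idea of what the ALS options control, here is a minimal pure-Python sketch of alternating updates for \(b_i\) and \(b_u\), using the defaults listed above and starting from zero biases. It follows the usual closed-form updates (a regularized average of residuals) and is an illustration under that assumption, not Surprise's optimized code; the function name and toy ratings are made up.

```python
def als_baselines(ratings, reg_u=15, reg_i=10, n_epochs=10):
    """ratings: list of (user, item, rating). Returns (mu, bu, bi)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u, _, _ in ratings}  # user biases start at zero
    bi = {i: 0.0 for _, i, _ in ratings}  # item biases start at zero

    for _ in range(n_epochs):
        # Update item biases while user biases are held fixed...
        for i in bi:
            devs = [r - mu - bu[u] for u, j, r in ratings if j == i]
            bi[i] = sum(devs) / (reg_i + len(devs))
        # ...then user biases while item biases are held fixed.
        for u in bu:
            devs = [r - mu - bi[i] for v, i, r in ratings if v == u]
            bu[u] = sum(devs) / (reg_u + len(devs))
    return mu, bu, bi


toy = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4), ("b", "y", 2)]
mu, bu, bi = als_baselines(toy)
print(mu, bu, bi)
```

Larger values of reg_u and reg_i shrink the biases towards zero, which matters most for users and items with few ratings.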

Usage examples:

From file examples/baselines_conf.py¶

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)

From file examples/baselines_conf.py¶

print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)
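The role of 'reg', 'learning_rate' and 'n_epochs' in the SGD procedure can be sketched in the same spirit. The update rule below is the standard gradient step for the regularized squared error, written in plain Python as an illustration only; the function names and toy data are assumptions, not Surprise's implementation.

```python
def sgd_baselines(ratings, reg=0.02, learning_rate=0.005, n_epochs=20):
    """ratings: list of (user, item, rating). Returns (mu, bu, bi)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u, _, _ in ratings}
    bi = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i])
            # Move each bias along the error, with a pull back towards zero.
            bu[u] += learning_rate * (err - reg * bu[u])
            bi[i] += learning_rate * (err - reg * bi[i])
    return mu, bu, bi


def squared_error(ratings, mu, bu, bi):
    return sum((r - (mu + bu[u] + bi[i])) ** 2 for u, i, r in ratings)


toy = [("a", "x", 5), ("a", "y", 3), ("b", "x", 4), ("b", "y", 2)]
mu, bu, bi = sgd_baselines(toy)
zero_u = {u: 0.0 for u in bu}
zero_i = {i: 0.0 for i in bi}
print("before:", squared_error(toy, mu, zero_u, zero_i))
print("after: ", squared_error(toy, mu, bu, bi))
```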

Note that some similarity measures may use baselines, such as the pearson_baseline similarity. Configuration works just the same, whether the baselines are used in the actual prediction \(\hat{r}_{ui}\) or not:

From file examples/baselines_conf.py¶

bsl_options = {'method': 'als',
               'n_epochs': 20,
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

This leads us to similarity measure configuration, which we will review right now.

Similarity measure configuration¶

Many algorithms use a similarity measure to estimate a rating. They are configured in much the same way as baseline estimates: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:

'name': The name of the similarity to use, as defined in the similarities module. Default is 'MSD'.
'user_based': Whether similarities will be computed between users or between items. This has a huge impact on the performance of a prediction algorithm. Default is True.
'min_support': The minimum number of common items (when 'user_based' is 'True') or minimum number of common users (when 'user_based' is 'False') for the similarity not to be zero. Simply put, if \(|I_{uv}| < \text{min_support}\) then \(\text{sim}(u, v) = 0\). The same goes for items.
'shrinkage': Shrinkage parameter to apply (only relevant for pearson_baseline similarity). Default is 100.
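To illustrate the effect of 'min_support', here is a small sketch of a mean-squared-difference similarity between two users, where the similarity is taken as 1 / (msd + 1) and forced to zero when too few items are shared. The function and the toy ratings are assumptions made for the example, not the code of the similarities module.

```python
def msd_similarity(ratings_u, ratings_v, min_support=1):
    """ratings_u, ratings_v: dicts mapping item id -> rating."""
    common = set(ratings_u) & set(ratings_v)
    # Too few common items (or none at all): similarity is set to zero.
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


user_u = {"a": 4, "b": 2, "c": 5}
user_v = {"a": 5, "b": 2, "d": 1}

print(msd_similarity(user_u, user_v))                 # two common items
print(msd_similarity(user_u, user_v, min_support=3))  # 0.0
```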

Usage examples:

From file examples/similarity_conf.py¶

sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBasic(sim_options=sim_options)

From file examples/similarity_conf.py¶

sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)

How to build your own prediction algorithm¶

This page describes how to build a custom prediction algorithm using Surprise.

The basics¶

Want to get your hands dirty? Cool.

Creating your own prediction algorithm is pretty simple: an algorithm is nothing but a class derived from AlgoBase that has an estimate method. This is the method that is called by the predict() method. It takes in an inner user id, an inner item id (see this note), and returns the estimated rating \(\hat{r}_{ui}\):

From file examples/building_custom_algorithms/most_basic_algorithm.py¶

from surprise import AlgoBase
from surprise import Dataset
from surprise.model_selection import cross_validate

class MyOwnAlgorithm(AlgoBase):

    def __init__(self):

        # Always call base method before doing anything.
        AlgoBase.__init__(self)

    def estimate(self, u, i):

        return 3

data = Dataset.load_builtin('ml-100k')
algo = MyOwnAlgorithm()

cross_validate(algo, data, verbose=True)

This algorithm is the dumbest we could have thought of: it just predicts a rating of 3, regardless of users and items.

If you want to store additional information about the prediction, you can also return a dictionary with given details:

def estimate(self, u, i):

    details = {'info1' : 'That was',
               'info2' : 'easy stuff :)'}
    return 3, details

This dictionary will be stored in the prediction as the details field and can be used for later analysis.

The `fit` method¶

Now, let’s make a slightly cleverer algorithm that predicts the average of all the ratings of the trainset. As this is a constant value that does not depend on current user or item, we would rather compute it once and for all. This can be done by defining the fit method:

From file examples/building_custom_algorithms/most_basic_algorithm2.py¶

import numpy as np

from surprise import AlgoBase


class MyOwnAlgorithm(AlgoBase):

    def __init__(self):

        # Always call base method before doing anything.
        AlgoBase.__init__(self)

    def fit(self, trainset):

        # Here again: call base method before doing anything.
        AlgoBase.fit(self, trainset)

        # Compute the average rating. We might as well use the
        # trainset.global_mean attribute ;)
        self.the_mean = np.mean([r for (_, _, r) in
                                 self.trainset.all_ratings()])

        return self

    def estimate(self, u, i):

        return self.the_mean

The fit method is called, e.g., by the cross_validate function at each fold of a cross-validation process (but you can also call it yourself). Before doing anything, you should call the base class fit() method.

Note that the fit() method returns self. This allows you to use expressions like algo.fit(trainset).test(testset).

The `trainset` attribute¶

Once the base class fit() method has returned, all the info you need about the current training set (rating values, etc…) is stored in the self.trainset attribute. This is a Trainset object that has many attributes and methods of interest for prediction.

To illustrate its usage, let’s make an algorithm that predicts an average between the mean of all ratings, the mean rating of the user and the mean rating for the item:

From file examples/building_custom_algorithms/mean_rating_user_item.py¶

    def estimate(self, u, i):

        sum_means = self.trainset.global_mean
        div = 1

        if self.trainset.knows_user(u):
            sum_means += np.mean([r for (_, r) in self.trainset.ur[u]])
            div += 1
        if self.trainset.knows_item(i):
            sum_means += np.mean([r for (_, r) in self.trainset.ir[i]])
            div += 1

        return sum_means / div

Note that it would have been a better idea to compute all the user means in the fit method, thus avoiding the same computations multiple times.
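Such a precomputation could look like the sketch below, where a plain dict stands in for trainset.ur (inner user id mapped to a list of (inner item id, rating) pairs). The helper mean_ratings and the toy data are hypothetical; inside a real algorithm the resulting dict would typically be stored on self in fit() and looked up in estimate().

```python
def mean_ratings(ratings_by_id):
    """Map each id to the mean of its ratings, skipping empty lists."""
    return {key: sum(r for _, r in pairs) / len(pairs)
            for key, pairs in ratings_by_id.items() if pairs}


# Stand-in for trainset.ur: inner user id -> [(inner item id, rating), ...]
toy_ur = {0: [(0, 4.0), (1, 2.0)], 1: [(0, 5.0)]}

user_means = mean_ratings(toy_ur)  # computed once, e.g. in fit()
print(user_means[0], user_means[1])
```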

When the prediction is impossible¶

It’s up to your algorithm to decide if it can or cannot yield a prediction. If the prediction is impossible, then you can raise the PredictionImpossible exception. You’ll need to import it first:

from surprise import PredictionImpossible

This exception will be caught by the predict() method, and the estimation \(\hat{r}_{ui}\) will be set to the global mean of all ratings \(\mu\).
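The fallback described above can be pictured with the following self-contained sketch. The exception class defined here is a local stand-in with the same name, and the predict and estimate functions are simplified, hypothetical versions written only to show the control flow.

```python
class PredictionImpossible(Exception):
    """Local stand-in for the exception of the same name in Surprise."""


def estimate(known_means, u):
    # Refuse to predict for a user we know nothing about.
    if u not in known_means:
        raise PredictionImpossible("User is unknown.")
    return known_means[u]


def predict(known_means, global_mean, u):
    details = {}
    try:
        est = estimate(known_means, u)
        details["was_impossible"] = False
    except PredictionImpossible as exc:
        # Fall back to the global mean and record what happened.
        est = global_mean
        details["was_impossible"] = True
        details["reason"] = str(exc)
    return est, details


means = {"196": 3.6}
print(predict(means, 3.5, "196"))
print(predict(means, 3.5, "unknown_user"))
```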

Using similarities and baselines¶

Should your algorithm use a similarity measure or baseline estimates, you’ll need to accept bsl_options and sim_options as parameters to the __init__ method, and pass them along to the Base class. See how to use these parameters in the Using prediction algorithms section.

Methods compute_baselines() and compute_similarities() can be called in the fit method (or anywhere else).

From file examples/building_custom_algorithms/.with_baselines_or_sim.py¶

class MyOwnAlgorithm(AlgoBase):

    def __init__(self, sim_options={}, bsl_options={}):

        AlgoBase.__init__(self, sim_options=sim_options,
                          bsl_options=bsl_options)

    def fit(self, trainset):

        AlgoBase.fit(self, trainset)

        # Compute baselines and similarities
        self.bu, self.bi = self.compute_baselines()
        self.sim = self.compute_similarities()

        return self

    def estimate(self, u, i):

        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')

        # Compute similarities between u and v, where v describes all other
        # users that have also rated item i.
        neighbors = [(v, self.sim[u, v]) for (v, r) in self.trainset.ir[i]]
        # Sort these neighbors by similarity
        neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)

        print('The 3 nearest neighbors of user', str(u), 'are:')
        for v, sim_uv in neighbors[:3]:
            print('user {0:} with sim {1:1.2f}'.format(v, sim_uv))

        # ... Aaaaand return the baseline estimate anyway ;)
        bsl = self.trainset.global_mean + self.bu[u] + self.bi[i]
        return bsl

Feel free to explore the prediction_algorithms package source to get an idea of what can be done.

Notation standards, References¶

In the documentation, you will find the following notation:

\(R\) : the set of all ratings.
\(R_{train}\), \(R_{test}\) and \(\hat{R}\) denote the training set, the test set, and the set of predicted ratings.
\(U\) : the set of all users. \(u\) and \(v\) denote users.
\(I\) : the set of all items. \(i\) and \(j\) denote items.
\(U_i\) : the set of all users that have rated item \(i\).
\(U_{ij}\) : the set of all users that have rated both items \(i\) and \(j\).
\(I_u\) : the set of all items rated by user \(u\).
\(I_{uv}\) : the set of all items rated by both users \(u\) and \(v\).
\(r_{ui}\) : the true rating of user \(u\) for item \(i\).
\(\hat{r}_{ui}\) : the estimated rating of user \(u\) for item \(i\).
\(b_{ui}\) : the baseline rating of user \(u\) for item \(i\).
\(\mu\) : the mean of all ratings.
\(\mu_u\) : the mean of all ratings given by user \(u\).
\(\mu_i\) : the mean of all ratings given to item \(i\).
\(\sigma_u\) : the standard deviation of all ratings given by user \(u\).
\(\sigma_i\) : the standard deviation of all ratings given to item \(i\).
\(N_i^k(u)\) : the \(k\) nearest neighbors of user \(u\) that have rated item \(i\). This set is computed using a similarity metric.
\(N_u^k(i)\) : the \(k\) nearest neighbors of item \(i\) that are rated by user \(u\). This set is computed using a similarity metric.

References

Here are the papers used as references in the documentation. Links to pdf files were added when possible. A simple Google search should lead you easily to the missing ones :)

[GM05]

Thomas George and Srujana Merugu. A scalable collaborative filtering framework based on co-clustering. 2005. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf.

[Kor08]

Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. 2008. URL: http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf.

[Kor10]

Yehuda Koren. Factor in the neighbors: scalable and accurate collaborative filtering. 2010. URL: http://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf.

[KBV09]

Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. 2009.

[LS01]

Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. 2001. URL: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf.

[LM07]

Daniel Lemire and Anna Maclachlan. Slope one predictors for online rating-based collaborative filtering. 2007. URL: http://arxiv.org/abs/cs/0702144.

[LZXZ14]

Xin Luo, Mengchu Zhou, Yunni Xia, and Qinsheng Zhu. An efficient non-negative matrix factorization-based approach to collaborative filtering for recommender systems. 2014.

[RRSK10]

Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor. Recommender Systems Handbook. 1st edition, 2010.

[SM08]

Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. 2008. URL: http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf.

[ZWFM96]

Sheng Zhang, Weihong Wang, James Ford, and Fillia Makedon. Learning from incomplete ratings using non-negative matrix factorization. 1996. URL: http://www.siam.org/meetings/sdm06/proceedings/059zhangs2.pdf.

FAQ¶

You will find here the Frequently Asked Questions, as well as some other use-case examples that are not part of the User Guide.

How to get the top-N recommendations for each user¶

Here is an example where we retrieve the top-10 items with the highest rating predictions for each user in the MovieLens-100k dataset. We first train an SVD algorithm on the whole dataset, and then predict all the ratings for the (user, item) pairs that are not in the training set. We then retrieve the top-10 predictions for each user.

From file examples/top_n_recommendations.py¶

from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendations for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
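
To see what the grouping-and-ranking step produces without downloading any data, here is a minimal standalone sketch. It assumes predictions are plain 5-tuples laid out like the ones unpacked above; the function name `top_n_per_user` and the toy ids are hypothetical, and `heapq.nlargest` is swapped in for the sort-then-slice used in the example file.

```python
import heapq
from collections import defaultdict


def top_n_per_user(predictions, n=10):
    """Group (uid, iid, true_r, est, details) tuples by user and keep the
    n items with the highest estimated rating for each user."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    # heapq.nlargest returns the n largest entries in descending order.
    return {uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
            for uid, items in by_user.items()}


# Hand-made predictions (hypothetical ids and estimates).
toy_predictions = [
    ('alice', 'm1', None, 4.2, {}),
    ('alice', 'm2', None, 3.1, {}),
    ('alice', 'm3', None, 4.8, {}),
    ('bob', 'm1', None, 2.5, {}),
    ('bob', 'm3', None, 3.9, {}),
]
toy_top = top_n_per_user(toy_predictions, n=2)
print(toy_top['alice'])  # [('m3', 4.8), ('m1', 4.2)]
print(toy_top['bob'])    # [('m3', 3.9), ('m1', 2.5)]
```

When n is small compared with the number of candidate items, a heap-based selection avoids sorting every user's full list, but both approaches return the same ranking.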

How to compute precision@k and recall@k¶

Here is an example where we compute Precision@k and Recall@k for each user:

\(\text{Precision@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Recommended items} \} | }\) \(\text{Recall@k} = \frac{ | \{ \text{Recommended items that are relevant} \} | }{ | \{ \text{Relevant items} \} | }\)

An item is considered relevant if its true rating \(r_{ui}\) is greater than or equal to a given threshold. An item is considered recommended if its estimated rating \(\hat{r}_{ui}\) is greater than or equal to the threshold, and if it is among the k highest estimated ratings.
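
As a worked illustration of these two definitions, the sketch below computes both metrics for a single hypothetical user from hand-made (estimate, true rating) pairs. The function name and the data are assumptions made for illustration; an undefined ratio (empty denominator) is set to 1 here, which is a convention rather than part of the definitions.

```python
def precision_recall_for_user(user_ratings, k=10, threshold=3.5):
    """user_ratings: list of (estimated rating, true rating) pairs."""
    ranked = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]

    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold
                          for est, true_r in top_k)

    # Convention assumed here: an undefined ratio is reported as 1.
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 1
    recall = n_rel_and_rec_k / n_rel if n_rel else 1
    return precision, recall


# (estimate, true rating) pairs for one hypothetical user.
ratings = [(4.5, 5), (4.0, 3), (3.6, 4), (3.0, 4), (2.0, 1), (1.5, 4)]
precision, recall = precision_recall_for_user(ratings, k=3, threshold=3.5)
print(precision)  # 2 of the 3 recommended items are relevant
print(recall)     # 2 of the 4 relevant items are recommended
```

With k=3, the three recommended items have true ratings 5, 3 and 4, so two of them are relevant; four items are relevant overall, which gives a precision of 2/3 and a recall of 1/2.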

From file examples/precision_recall_at_k.py¶

from collections import defaultdict

from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls


data = Dataset.load_builtin('ml-100k')
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))

How to get the k nearest neighbors of a user (or item)¶

You can use the get_neighbors() method of the algorithm object. This is only relevant for algorithms that use a similarity measure, such as the k-NN algorithms.

Here is an example where we retrieve the 10 nearest neighbors of the movie Toy Story from the MovieLens-100k dataset. The output is:

The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)

There’s a lot of boilerplate because of the conversions between movie names and their raw/inner ids (see this note), but it all boils down to the use of get_neighbors():

From file examples/k_nearest_neighbors.py¶

import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)

Naturally, the same can be done for users with minor modifications.

How to serialize an algorithm¶

Prediction algorithms can be serialized and loaded back using the dump() and load() functions. Here is a small example where the SVD algorithm is trained on a dataset and serialized. It is then reloaded and can be used again for making predictions:

From file examples/serialize_algorithm.py¶

import os

from surprise import SVD
from surprise import Dataset
from surprise import dump


data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser('~/dump_file')
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')

Algorithms can be serialized along with their predictions, so that they can be further analyzed or compared with other algorithms, using pandas dataframes. Some examples are given in the two following notebooks:

Dumping and analysis of the KNNBasic algorithm.

Comparison of two algorithms.

How to build my own prediction algorithm¶

There’s a whole guide here.

What are raw and inner ids¶

Users and items have a raw id and an inner id. Some methods will use/return a raw id (e.g. the predict() method), while others will use/return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note though that if the ratings were read from a file, which is the standard scenario, they are represented as strings. This is important to know if you’re using e.g. predict() or other methods that accept raw ids as parameters.

On trainset creation, each raw id is mapped to a unique integer called inner id, which is a lot more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done using the to_inner_uid(), to_inner_iid(), to_raw_uid(), and to_raw_iid() methods of the trainset.
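
The idea of the mapping can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the library's implementation: the `IdMapper` class and its method names are hypothetical, and it assumes inner ids are handed out consecutively in order of first appearance.

```python
class IdMapper:
    """Hypothetical helper: assigns consecutive integer inner ids to raw ids
    in order of first appearance."""

    def __init__(self):
        self._raw_to_inner = {}
        self._inner_to_raw = []

    def to_inner(self, raw_id):
        # Register the raw id the first time it is seen.
        if raw_id not in self._raw_to_inner:
            self._raw_to_inner[raw_id] = len(self._inner_to_raw)
            self._inner_to_raw.append(raw_id)
        return self._raw_to_inner[raw_id]

    def to_raw(self, inner_id):
        return self._inner_to_raw[inner_id]


users = IdMapper()
# Raw ids read from a file are strings, so '196' and 196 are different keys.
inner_ids = [users.to_inner(raw) for raw in ['196', '186', '196', '22']]
print(inner_ids)            # [0, 1, 0, 2]
print(users.to_raw(2))      # 22
print(users.to_inner(196))  # 3, because the int 196 is not the string '196'
```

The last call shows why the string/number distinction matters: passing a number where the raw id is a string refers to a different (here, brand-new) id.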

Can I use my own dataset with Surprise, and can it be a pandas dataframe¶

Yes, and yes. See the user guide.

How to tune an algorithm's parameters¶

You can tune the parameters of an algorithm with the GridSearchCV class as described here. After the tuning, you may want to have an unbiased estimate of your algorithm's performance.

How to get accuracy measures on the training set¶

You can use the build_testset() method of the Trainset object to build a testset that can be then used with the test() method:

From file examples/evaluate_on_trainset.py¶

from surprise import Dataset
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import KFold


data = Dataset.load_builtin('ml-100k')

algo = SVD()

trainset = data.build_full_trainset()
algo.fit(trainset)

testset = trainset.build_testset()
predictions = algo.test(testset)
# RMSE should be low as we are biased
accuracy.rmse(predictions, verbose=True)  # ~ 0.68 (which is low)

Check out the example file for more usage examples.

How to save some data for unbiased accuracy estimation¶

If your goal is to tune the parameters of an algorithm, you may want to set aside a bit of data to get an unbiased estimate of its performance. For instance, you may want to split your data into two sets A and B. A is used for parameter tuning using grid search, and B is used for unbiased estimation. This can be done as follows:

From file examples/split_data_for_unbiased_estimation.py¶

import random

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import GridSearchCV


# Load the full dataset.
data = Dataset.load_builtin('ml-100k')
raw_ratings = data.raw_ratings

# shuffle ratings if you want
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = A_raw_ratings  # data is now the set A

# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)

algo = grid_search.best_estimator['rmse']

# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)

# Compute biased accuracy on A
predictions = algo.test(trainset.build_testset())
print('Biased accuracy on A,', end='   ')
accuracy.rmse(predictions)

# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
predictions = algo.test(testset)
print('Unbiased accuracy on B,', end=' ')
accuracy.rmse(predictions)

How to have reproducible experiments¶

Some algorithms randomly initialize their parameters (sometimes with numpy), and the cross-validation folds are also randomly generated. If you need to reproduce your experiments multiple times, you just have to set the seed of the RNG at the beginning of your program:

import random
import numpy as np

my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)

Where are datasets stored and how to change it?¶

By default, datasets downloaded by Surprise will be saved in the '~/.surprise_data' directory. This is also where dump files will be stored. You can change the default directory by setting the 'SURPRISE_DATA_FOLDER' environment variable.
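
A minimal sketch of setting the variable from Python rather than from the shell. The directory used here is a hypothetical location, and setting the variable before importing the library is a precaution based on the assumption that the folder may be looked up at import time.

```python
import os
import tempfile

# Hypothetical location; any writable directory works.
custom_dir = os.path.join(tempfile.gettempdir(), 'surprise_data_demo')
os.makedirs(custom_dir, exist_ok=True)

# Set the variable before importing surprise (assumption: the folder may be
# resolved when the library is imported).
os.environ['SURPRISE_DATA_FOLDER'] = custom_dir
print(os.environ['SURPRISE_DATA_FOLDER'])
```

Setting the variable in the shell profile instead makes the choice apply to every program run from that shell.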

prediction_algorithms package¶

The prediction_algorithms package includes the prediction algorithms available for recommendation.

The available prediction algorithms are:

`random_pred.NormalPredictor`	Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
`baseline_only.BaselineOnly`	Algorithm predicting the baseline estimate for given user and item.
`knns.KNNBasic`	A basic collaborative filtering algorithm.
`knns.KNNWithMeans`	A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
`knns.KNNWithZScore`	A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
`knns.KNNBaseline`	A basic collaborative filtering algorithm taking into account a baseline rating.
`matrix_factorization.SVD`	The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
`matrix_factorization.SVDpp`	The SVD++ algorithm, an extension of `SVD` taking into account implicit ratings.
`matrix_factorization.NMF`	A collaborative filtering algorithm based on Non-negative Matrix Factorization.
`slope_one.SlopeOne`	A simple yet accurate collaborative filtering algorithm.
`co_clustering.CoClustering`	A collaborative filtering algorithm based on co-clustering.

You may want to check the notation standards before diving into the formulas.

The algorithm base class¶

The surprise.prediction_algorithms.algo_base module defines the base class AlgoBase from which every single prediction algorithm has to inherit.

class surprise.prediction_algorithms.algo_base.AlgoBase(**kwargs)¶

Abstract class defining the basic behavior of a prediction algorithm.

Keyword Arguments:
	bsl_options (dict, optional) – If the algorithm needs to compute a baseline estimate, the `bsl_options` parameter is used to configure how it is computed. See Baselines estimates configuration for usage.

compute_baselines()¶

Compute users and items baselines.

The way baselines are computed depends on the bsl_options parameter passed at the creation of the algorithm (see Baselines estimates configuration).

This method is only relevant for algorithms using the Pearson baseline similarity or the BaselineOnly algorithm.

Returns:	A tuple `(bu, bi)`, which are users and items baselines.

compute_similarities()¶

Build the similarity matrix.

The way the similarity matrix is computed depends on the sim_options parameter passed at the creation of the algorithm (see Similarity measure configuration).

This method is only relevant for algorithms using a similarity measure, such as the k-NN algorithms.

Returns:	The similarity matrix.

fit(trainset)¶

Train an algorithm on a given training set.

This method is called by every derived class as the first basic step for training an algorithm. It basically just initializes some internal structures and sets the self.trainset attribute.

Parameters:	trainset (`Trainset`) – A training set, as returned by the `folds` method.
Returns:	self

get_neighbors(iid, k)¶

Return the k nearest neighbors of iid, which is the inner id of a user or an item, depending on the user_based field of sim_options (see Similarity measure configuration).

As the similarities are computed on the basis of a similarity measure, this method is only relevant for algorithms using a similarity measure, such as the k-NN algorithms.

For a usage example, see the FAQ.

Parameters:	iid (int) – The (inner) id of the user (or item) for which we want the nearest neighbors. See this note. k (int) – The number of neighbors to retrieve.
Returns:	The list of the `k` (inner) ids of the closest users (or items) to `iid`.

predict(uid, iid, r_ui=None, clip=True, verbose=False)¶

Compute the rating prediction for given user and item.

The predict method converts raw ids to inner ids and then calls the estimate method which is defined in every derived class. If the prediction is impossible (e.g. because the user and/or the item is unknown), the prediction is set to the global mean of all ratings.

Parameters:

uid – (Raw) id of the user. See this note.
iid – (Raw) id of the item. See this note.
r_ui (float) – The true rating \(r_{ui}\). Optional, default is None.
clip (bool) – Whether to clip the estimation into the rating scale. For example, if \(\hat{r}_{ui}\) is \(5.5\) while the rating scale is \([1, 5]\), then \(\hat{r}_{ui}\) is set to \(5\). Same goes if \(\hat{r}_{ui} < 1\). Default is True.
verbose (bool) – Whether to print details of the prediction. Default is False.

Returns:

A Prediction object containing:

The (raw) user id uid.
The (raw) item id iid.
The true rating r_ui (\(r_{ui}\)).
The estimated rating (\(\hat{r}_{ui}\)).
Some additional details about the prediction that might be useful for later analysis.
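
The effect of the clip parameter can be sketched with a small helper. The function name `clip_estimate` is hypothetical and only mirrors the behavior described above for a rating scale given by its lower and upper bounds.

```python
def clip_estimate(est, lower_bound, upper_bound):
    """Force an estimate into the [lower_bound, upper_bound] rating scale."""
    return min(upper_bound, max(lower_bound, est))


print(clip_estimate(5.5, 1, 5))  # 5
print(clip_estimate(0.2, 1, 5))  # 1
print(clip_estimate(3.7, 1, 5))  # 3.7
```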

test(testset, verbose=False)¶

Test the algorithm on given testset, i.e. estimate all the ratings in the given testset.

Parameters:	testset – A test set, as returned by a cross-validation iterator or by the `build_testset()` method. verbose (bool) – Whether to print details for each prediction. Default is False.
Returns:	A list of `Prediction` objects that contains all the estimated ratings.

train(trainset)¶: Deprecated method: use fit() instead.

The predictions module¶

The surprise.prediction_algorithms.predictions module defines the Prediction named tuple and the PredictionImpossible exception.

class surprise.prediction_algorithms.predictions.Prediction¶

A named tuple for storing the results of a prediction.

It’s wrapped in a class, but only for documentation and printing purposes.

Parameters:	uid – The (raw) user id. See this note. iid – The (raw) item id. See this note. r_ui (float) – The true rating \(r_{ui}\). est (float) – The estimated rating \(\hat{r}_{ui}\). details (dict) – Stores additional details about the prediction that might be useful for later analysis.

exception surprise.prediction_algorithms.predictions.PredictionImpossible¶

Exception raised when a prediction is impossible.

When raised, the estimation \(\hat{r}_{ui}\) is set to the global mean of all ratings \(\mu\).
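
The fallback pattern can be sketched as follows. Everything in this block is hypothetical: the exception class is a stand-in defined locally, and the toy estimator simply returns a per-item mean. It only illustrates the control flow of "try to estimate, otherwise use the global mean".

```python
class PredictionImpossibleSketch(Exception):
    """Hypothetical stand-in for an exception raised when no estimate exists."""


def estimate_from_item_mean(item_means, iid):
    """Toy estimator: return the item's mean rating, or raise if unknown."""
    if iid not in item_means:
        raise PredictionImpossibleSketch('Unknown item: {}'.format(iid))
    return item_means[iid]


def predict_with_fallback(item_means, global_mean, iid):
    """Return (estimate, was_impossible), defaulting to the global mean."""
    try:
        return estimate_from_item_mean(item_means, iid), False
    except PredictionImpossibleSketch:
        return global_mean, True


item_means = {'m1': 4.1, 'm2': 2.7}
print(predict_with_fallback(item_means, 3.5, 'm1'))  # (4.1, False)
print(predict_with_fallback(item_means, 3.5, 'm9'))  # (3.5, True)
```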

Basic algorithms¶

These are basic algorithms that do not do much work but that are still useful for comparing accuracies.

class surprise.prediction_algorithms.random_pred.NormalPredictor¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

The prediction \(\hat{r}_{ui}\) is generated from a normal distribution \(\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)\) where \(\hat{\mu}\) and \(\hat{\sigma}\) are estimated from the training data using Maximum Likelihood Estimation:

\[\begin{split}\hat{\mu} &= \frac{1}{|R_{train}|} \sum_{r_{ui} \in R_{train}} r_{ui}\\\\ \hat{\sigma} &= \sqrt{\sum_{r_{ui} \in R_{train}} \frac{(r_{ui} - \hat{\mu})^2}{|R_{train}|}}\end{split}\]
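
The two estimators above translate directly into plain Python. This is a sketch under the assumption that the training ratings are available as a flat list; the function name `fit_normal` is hypothetical, and the sample is drawn with the standard library's generator rather than with whatever RNG the library uses.

```python
import math
import random


def fit_normal(ratings):
    """Maximum likelihood estimates of the mean and standard deviation."""
    n = len(ratings)
    mu = sum(ratings) / n
    # MLE divides by n (not n - 1).
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ratings) / n)
    return mu, sigma


train_ratings = [1, 2, 3, 4, 5]
mu_hat, sigma_hat = fit_normal(train_ratings)
print(mu_hat)     # 3.0
print(sigma_hat)  # square root of 2

# Draw one random "prediction" from N(mu_hat, sigma_hat^2).
rng = random.Random(0)
sample = rng.gauss(mu_hat, sigma_hat)
print(sample)
```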

class surprise.prediction_algorithms.baseline_only.BaselineOnly(bsl_options={})¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

Algorithm predicting the baseline estimate for given user and item.

\(\hat{r}_{ui} = b_{ui} = \mu + b_u + b_i\)

If user \(u\) is unknown, then the bias \(b_u\) is assumed to be zero. The same applies for item \(i\) with \(b_i\).

See section 2.1 of [Kor10] for details.

Parameters:	bsl_options (dict) – A dictionary of options for the baseline estimates computation. See Baselines estimates configuration for accepted options.
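
A minimal sketch of the prediction rule, assuming the biases have already been learned and are stored in plain dictionaries keyed by id (the function name and the toy values are hypothetical). Unknown users or items contribute a zero bias, as described above.

```python
def baseline_estimate(mu, user_bias, item_bias, uid, iid):
    """Return mu + b_u + b_i, using 0 for unknown users or items."""
    return mu + user_bias.get(uid, 0.0) + item_bias.get(iid, 0.0)


mu = 3.5
user_bias = {'u1': 0.3}
item_bias = {'i1': -0.5}

print(baseline_estimate(mu, user_bias, item_bias, 'u1', 'i1'))  # about 3.3
print(baseline_estimate(mu, user_bias, item_bias, 'u9', 'i1'))  # 3.0
print(baseline_estimate(mu, user_bias, item_bias, 'u9', 'i9'))  # 3.5
```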

k-NN inspired algorithms¶

These are algorithms that are directly derived from a basic nearest neighbors approach.

Note

For each of these algorithms, the actual number of neighbors that are aggregated to compute an estimation is necessarily less than or equal to \(k\). First, there might just not exist enough neighbors and second, the sets \(N_i^k(u)\) and \(N_u^k(i)\) only include neighbors for which the similarity measure is positive. It would make no sense to aggregate ratings from users (or items) that are negatively correlated. For a given prediction, the actual number of neighbors can be retrieved in the 'actual_k' field of the details dictionary of the prediction.

You may want to read the User Guide on how to configure the sim_options parameter.

class surprise.prediction_algorithms.knns.KNNBasic(k=40, min_k=1, sim_options={}, **kwargs)¶

Bases: surprise.prediction_algorithms.knns.SymmetricAlgo

A basic collaborative filtering algorithm.

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot r_{vi}} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]

or

\[\hat{r}_{ui} = \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot r_{uj}} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]

depending on the user_based field of the sim_options parameter.

Parameters:

k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options.
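
The aggregation rule and the roles of k and min_k can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the candidates are given as (similarity, rating) pairs for neighbors that rated the item, the function name is hypothetical, and the global-mean fallback is returned directly.

```python
import heapq


def knn_basic_estimate(neighbors, k=40, min_k=1, global_mean=0.0):
    """neighbors: (similarity, rating) pairs of candidates that rated the item.

    Returns (estimate, actual_k).
    """
    k_nearest = heapq.nlargest(k, neighbors, key=lambda pair: pair[0])

    sum_sim = sum_ratings = 0.0
    actual_k = 0
    for sim, rating in k_nearest:
        if sim > 0:  # only positively similar neighbors are aggregated
            sum_sim += sim
            sum_ratings += sim * rating
            actual_k += 1

    if actual_k < min_k or sum_sim == 0:
        return global_mean, actual_k
    return sum_ratings / sum_sim, actual_k


candidates = [(0.9, 4), (0.5, 2), (-0.3, 5), (0.1, 3)]
est, actual_k = knn_basic_estimate(candidates, k=4, global_mean=3.5)
print(est, actual_k)  # about 3.27 with 3 neighbors (the negative one is skipped)
print(knn_basic_estimate([(-0.2, 5)], k=4, global_mean=3.5))  # (3.5, 0)
```

Here the weighted sum is 0.9·4 + 0.5·2 + 0.1·3 = 4.9 and the similarities sum to 1.5, so the estimate is 4.9 / 1.5.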

class surprise.prediction_algorithms.knns.KNNWithMeans(k=40, min_k=1, sim_options={}, **kwargs)¶

Bases: surprise.prediction_algorithms.knns.SymmetricAlgo

A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \mu_u + \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v)} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]

or

\[\hat{r}_{ui} = \mu_i + \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j)} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]

depending on the user_based field of the sim_options parameter.

Parameters:

k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero (so the prediction ends up being equivalent to the mean \(\mu_u\) or \(\mu_i\)). Default is 1.
sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options.

class surprise.prediction_algorithms.knns.KNNWithZScore(k=40, min_k=1, sim_options={}, **kwargs)¶

Bases: surprise.prediction_algorithms.knns.SymmetricAlgo

A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \mu_u + \sigma_u \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]

or

\[\hat{r}_{ui} = \mu_i + \sigma_i \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j) / \sigma_j} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]

depending on the user_based field of the sim_options parameter.

If \(\sigma\) is 0, then the overall sigma is used in that case.

Parameters:

k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero (so the prediction ends up being equivalent to the mean \(\mu_u\) or \(\mu_i\)). Default is 1.
sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options.
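
A minimal sketch of the user-based z-score formula, assuming each neighbor is described by a (similarity, rating, mean, standard deviation) tuple. The function name is hypothetical, and applying the overall-sigma fallback to both the target user and the neighbors is an assumption of this sketch.

```python
def knn_zscore_estimate(mu_u, sigma_u, neighbors, overall_sigma, min_k=1):
    """neighbors: (sim, rating, mu_v, sigma_v) tuples of the k nearest users."""
    sum_sim = weighted_z = 0.0
    actual_k = 0
    for sim, rating, mu_v, sigma_v in neighbors:
        if sim > 0:
            # Fall back to the overall sigma when a neighbor's sigma is 0.
            sigma_v = sigma_v if sigma_v != 0 else overall_sigma
            weighted_z += sim * (rating - mu_v) / sigma_v
            sum_sim += sim
            actual_k += 1

    # Not enough neighbors: the aggregation term is zero.
    if actual_k < min_k or sum_sim == 0:
        return mu_u

    sigma_u = sigma_u if sigma_u != 0 else overall_sigma
    return mu_u + sigma_u * weighted_z / sum_sim


neighbors = [(0.5, 5, 4.0, 0.5), (0.5, 2, 3.0, 1.0)]
print(knn_zscore_estimate(3.0, 1.0, neighbors, overall_sigma=1.1))  # 3.5
print(knn_zscore_estimate(3.0, 1.0, [], overall_sigma=1.1))         # 3.0
```

In the first call the neighbors' z-scores are (5 - 4) / 0.5 = 2 and (2 - 3) / 1 = -1; their similarity-weighted average is 0.5, which is scaled by the user's sigma and added to the user's mean.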

class surprise.prediction_algorithms.knns.KNNBaseline(k=40, min_k=1, sim_options={}, bsl_options={})¶

Bases: surprise.prediction_algorithms.knns.SymmetricAlgo

A basic collaborative filtering algorithm taking into account a baseline rating.

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = b_{ui} + \frac{ \sum\limits_{v \in N^k_i(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})} {\sum\limits_{v \in N^k_i(u)} \text{sim}(u, v)}\]

or

\[\hat{r}_{ui} = b_{ui} + \frac{ \sum\limits_{j \in N^k_u(i)} \text{sim}(i, j) \cdot (r_{uj} - b_{uj})} {\sum\limits_{j \in N^k_u(i)} \text{sim}(i, j)}\]

depending on the user_based field of the sim_options parameter. For the best predictions, use the pearson_baseline similarity measure.

This algorithm corresponds to formula (3), section 2.2 of [Kor10].

Parameters:

k (int) – The (max) number of neighbors to take into account for aggregation (see this note). Default is 40.
min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the neighbor aggregation is set to zero (so the prediction ends up being equivalent to the baseline). Default is 1.
sim_options (dict) – A dictionary of options for the similarity measure. See Similarity measure configuration for accepted options. It is recommended to use the pearson_baseline similarity measure.
bsl_options (dict) – A dictionary of options for the baseline estimates computation. See Baselines estimates configuration for accepted options.

Matrix Factorization-based algorithms¶

class surprise.prediction_algorithms.matrix_factorization.SVD¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization [SM08] (see note below).

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u\]

If user \(u\) is unknown, then the bias \(b_u\) and the factors \(p_u\) are assumed to be zero. The same applies for item \(i\) with \(b_i\) and \(q_i\).

For details, see equation (5) from [KBV09]. See also [RRSK10], section 5.3.1.

To estimate all the unknowns, we minimize the following regularized squared error:

\[\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui} \right)^2 + \lambda\left(b_i^2 + b_u^2 + ||q_i||^2 + ||p_u||^2\right)\]

The minimization is performed by a very straightforward stochastic gradient descent:

\[\begin{split}b_u &\leftarrow b_u &+ \gamma (e_{ui} - \lambda b_u)\\ b_i &\leftarrow b_i &+ \gamma (e_{ui} - \lambda b_i)\\ p_u &\leftarrow p_u &+ \gamma (e_{ui} \cdot q_i - \lambda p_u)\\ q_i &\leftarrow q_i &+ \gamma (e_{ui} \cdot p_u - \lambda q_i)\end{split}\]

where \(e_{ui} = r_{ui} - \hat{r}_{ui}\). These steps are performed over all the ratings of the trainset and repeated n_epochs times. Baselines are initialized to 0. User and item factors are randomly initialized according to a normal distribution, which can be tuned using the init_mean and init_std_dev parameters.

You also have control over the learning rate \(\gamma\) and the regularization term \(\lambda\). Both can be different for each kind of parameter (see below). By default, learning rates are set to 0.005 and regularization terms are set to 0.02.
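
The update rules above can be sketched for a single rating in plain Python. This is an illustration only: the function name is hypothetical, factors are plain lists, and one learning rate and one regularization term are shared by all parameters (the defaults quoted above).

```python
def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """Apply one SGD update for a single rating and return the new values."""
    dot = sum(pf * qf for pf, qf in zip(p_u, q_i))
    err = r_ui - (mu + b_u + b_i + dot)  # e_ui

    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the values from before the step.
    new_p_u = [pf + lr * (err * qf - reg * pf) for pf, qf in zip(p_u, q_i)]
    new_q_i = [qf + lr * (err * pf - reg * qf) for pf, qf in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err


mu = 3.5
b_u, b_i, p_u, q_i, err = sgd_step(4.0, mu, 0.0, 0.0, [0.1, 0.0], [0.1, 0.1])
print(err)  # about 0.49
print(b_u)  # about 0.00245

# The error on the same rating shrinks after the update.
new_err = 4.0 - (mu + b_u + b_i + sum(pf * qf for pf, qf in zip(p_u, q_i)))
print(new_err < err)  # True
```

A full epoch would repeat this step for every rating of the trainset, and training would repeat the epoch n_epochs times.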

Note

You can choose to use an unbiased version of this algorithm, simply predicting:

\[\hat{r}_{ui} = q_i^Tp_u\]

This is equivalent to Probabilistic Matrix Factorization ([SM08], section 2) and can be achieved by setting the biased parameter to False.

Parameters:

n_factors – The number of factors. Default is 100.
n_epochs – The number of iterations of the SGD procedure. Default is 20.
biased (bool) – Whether to use baselines (or biases). See note above. Default is True.
init_mean – The mean of the normal distribution for factor vectors initialization. Default is 0.
init_std_dev – The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
lr_all – The learning rate for all parameters. Default is 0.005.
reg_all – The regularization term for all parameters. Default is 0.02.
lr_bu – The learning rate for \(b_u\). Takes precedence over lr_all if set. Default is None.
lr_bi – The learning rate for \(b_i\). Takes precedence over lr_all if set. Default is None.
lr_pu – The learning rate for \(p_u\). Takes precedence over lr_all if set. Default is None.
lr_qi – The learning rate for \(q_i\). Takes precedence over lr_all if set. Default is None.
reg_bu – The regularization term for \(b_u\). Takes precedence over reg_all if set. Default is None.
reg_bi – The regularization term for \(b_i\). Takes precedence over reg_all if set. Default is None.
reg_pu – The regularization term for \(p_u\). Takes precedence over reg_all if set. Default is None.
reg_qi – The regularization term for \(q_i\). Takes precedence over reg_all if set. Default is None.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
verbose – If True, prints the current epoch. Default is False.

pu¶: numpy array of size (n_users, n_factors) – The user factors (only exists if fit() has been called)

qi¶: numpy array of size (n_items, n_factors) – The item factors (only exists if fit() has been called)

bu¶: numpy array of size (n_users) – The user biases (only exists if fit() has been called)

bi¶: numpy array of size (n_items) – The item biases (only exists if fit() has been called)

class surprise.prediction_algorithms.matrix_factorization.SVDpp¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \mu + b_u + b_i + q_i^T\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u}y_j\right)\]

Where the \(y_j\) terms are a new set of item factors that capture implicit ratings. Here, an implicit rating describes the fact that a user \(u\) rated an item \(j\), regardless of the rating value.

If user \(u\) is unknown, then the bias \(b_u\) and the factors \(p_u\) are assumed to be zero. The same applies for item \(i\) with \(b_i\), \(q_i\) and \(y_i\).
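
The prediction formula can be sketched in plain Python as follows. The function name is hypothetical, factors are plain lists, and `y_rated` is assumed to hold the implicit factor vectors \(y_j\) of the items in \(I_u\); an empty list leaves the implicit term at zero.

```python
import math


def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated):
    """y_rated: implicit factor vectors y_j for the items rated by the user."""
    n_factors = len(q_i)
    implicit = [0.0] * n_factors
    if y_rated:
        norm = 1 / math.sqrt(len(y_rated))  # |I_u|^(-1/2)
        for y_j in y_rated:
            for f in range(n_factors):
                implicit[f] += norm * y_j[f]

    user_vec = [p + imp for p, imp in zip(p_u, implicit)]
    return mu + b_u + b_i + sum(q * u for q, u in zip(q_i, user_vec))


y_rated = [[1.0, 1.0] for _ in range(4)]  # the user rated four items
print(svdpp_predict(3.0, 0.5, -0.5, [1.0, 0.0], [0.5, 0.5], y_rated))  # 5.5
print(svdpp_predict(3.0, 0.5, -0.5, [1.0, 0.0], [0.5, 0.5], []))       # 3.5
```

With four rated items the normalization is 1/2, so the implicit term is [2.0, 2.0], the combined user vector is [3.0, 2.0], and its dot product with \(q_i\) is 2.5.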

For details, see section 4 of [Kor08]. See also [RRSK10], section 5.3.1.

Just as for SVD, the parameters are learned using SGD on the regularized squared error objective.

Baselines are initialized to 0. User and item factors are randomly initialized according to a normal distribution, which can be tuned using the init_mean and init_std_dev parameters.

You have control over the learning rate \(\gamma\) and the regularization term \(\lambda\). Both can be different for each kind of parameter (see below). By default, learning rates are set to 0.007 and regularization terms are set to 0.02.

Parameters:

n_factors – The number of factors. Default is 20.
n_epochs – The number of iterations of the SGD procedure. Default is 20.
init_mean – The mean of the normal distribution for factor vectors initialization. Default is 0.
init_std_dev – The standard deviation of the normal distribution for factor vectors initialization. Default is 0.1.
lr_all – The learning rate for all parameters. Default is 0.007.
reg_all – The regularization term for all parameters. Default is 0.02.
lr_bu – The learning rate for \(b_u\). Takes precedence over lr_all if set. Default is None.
lr_bi – The learning rate for \(b_i\). Takes precedence over lr_all if set. Default is None.
lr_pu – The learning rate for \(p_u\). Takes precedence over lr_all if set. Default is None.
lr_qi – The learning rate for \(q_i\). Takes precedence over lr_all if set. Default is None.
lr_yj – The learning rate for \(y_j\). Takes precedence over lr_all if set. Default is None.
reg_bu – The regularization term for \(b_u\). Takes precedence over reg_all if set. Default is None.
reg_bi – The regularization term for \(b_i\). Takes precedence over reg_all if set. Default is None.
reg_pu – The regularization term for \(p_u\). Takes precedence over reg_all if set. Default is None.
reg_qi – The regularization term for \(q_i\). Takes precedence over reg_all if set. Default is None.
reg_yj – The regularization term for \(y_j\). Takes precedence over reg_all if set. Default is None.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
verbose – If True, prints the current epoch. Default is False.

pu¶: numpy array of size (n_users, n_factors) – The user factors (only exists if fit() has been called)

qi¶: numpy array of size (n_items, n_factors) – The item factors (only exists if fit() has been called)

yj¶: numpy array of size (n_items, n_factors) – The (implicit) item factors (only exists if fit() has been called)

bu¶: numpy array of size (n_users) – The user biases (only exists if fit() has been called)

bi¶: numpy array of size (n_items) – The item biases (only exists if fit() has been called)

class surprise.prediction_algorithms.matrix_factorization.NMF¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

A collaborative filtering algorithm based on Non-negative Matrix Factorization.

This algorithm is very similar to SVD. The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = q_i^Tp_u,\]

where user and item factors are kept positive. Our implementation follows that suggested in [LZXZ14], which is equivalent to [ZWFM96] in its non-regularized form. Both are direct applications of NMF for dense matrices [LS01].

The optimization procedure is a (regularized) stochastic gradient descent with a specific choice of step size that ensures non-negativity of factors, provided that their initial values are also positive.

At each step of the SGD procedure, the factors \(f\) of user \(u\) and item \(i\) are updated as follows:

\[\begin{split}p_{uf} &\leftarrow p_{uf} \cdot \frac{\sum_{i \in I_u} q_{if} \cdot r_{ui}}{\sum_{i \in I_u} q_{if} \cdot \hat{r}_{ui} + \lambda_u |I_u| p_{uf}}\\ q_{if} &\leftarrow q_{if} \cdot \frac{\sum_{u \in U_i} p_{uf} \cdot r_{ui}}{\sum_{u \in U_i} p_{uf} \cdot \hat{r}_{ui} + \lambda_i |U_i| q_{if}}\\\end{split}\]

where \(\lambda_u\) and \(\lambda_i\) are regularization parameters.
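To make the multiplicative form of the update concrete, here is a minimal sketch of the user-side rule for a single factor. The function name `update_user_factor` and the numbers are hypothetical; the item-side rule is symmetric. Because every term is non-negative, a positive factor stays positive after the update.

```python
def update_user_factor(p_uf, q_f, ratings, estimates, reg_pu):
    """One multiplicative update of p_uf over the items rated by user u.

    q_f, ratings and estimates are parallel lists over the items in I_u:
    q_f[k] is q_if, ratings[k] is r_ui, estimates[k] is the current estimate.
    """
    numerator = sum(q * r for q, r in zip(q_f, ratings))
    denominator = sum(q * e for q, e in zip(q_f, estimates))
    # Regularization term: lambda_u * |I_u| * p_uf.
    denominator += reg_pu * len(ratings) * p_uf
    return p_uf * numerator / denominator


# Hypothetical numbers for a user who rated two items.
new_p_uf = update_user_factor(
    p_uf=0.5,
    q_f=[1.0, 2.0],
    ratings=[4.0, 3.0],
    estimates=[2.0, 3.5],
    reg_pu=0.06,
)
```

Here the numerator is 10 and the denominator is 9.06, so the factor grows slightly, reflecting that the current estimates are on average too low for this user.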

This algorithm is highly dependent on initial values. User and item factors are uniformly initialized between init_low and init_high. Change them at your own risk!

A biased version is available by setting the biased parameter to True. In this case, the prediction is set as

\[\hat{r}_{ui} = \mu + b_u + b_i + q_i^Tp_u,\]

still ensuring positive factors. Baselines are optimized in the same way as in the SVD algorithm. While yielding better accuracy, the biased version seems highly prone to overfitting so you may want to reduce the number of factors (or increase regularization).

Parameters:

n_factors – The number of factors. Default is 15.
n_epochs – The number of iterations of the SGD procedure. Default is 50.
biased (bool) – Whether to use baselines (or biases). Default is False.
reg_pu – The regularization term for users \(\lambda_u\). Default is 0.06.
reg_qi – The regularization term for items \(\lambda_i\). Default is 0.06.
reg_bu – The regularization term for \(b_u\). Only relevant for biased version. Default is 0.02.
reg_bi – The regularization term for \(b_i\). Only relevant for biased version. Default is 0.02.
lr_bu – The learning rate for \(b_u\). Only relevant for biased version. Default is 0.005.
lr_bi – The learning rate for \(b_i\). Only relevant for biased version. Default is 0.005.
init_low – Lower bound for random initialization of factors. Must be greater than or equal to 0 to ensure non-negative factors. Default is 0.
init_high – Upper bound for random initialization of factors. Default is 1.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
verbose – If True, prints the current epoch. Default is False.

pu¶: numpy array of size (n_users, n_factors) – The user factors (only exists if fit() has been called)

qi¶: numpy array of size (n_items, n_factors) – The item factors (only exists if fit() has been called)

bu¶: numpy array of size (n_users) – The user biases (only exists if fit() has been called)

bi¶: numpy array of size (n_items) – The item biases (only exists if fit() has been called)

Slope One¶

class surprise.prediction_algorithms.slope_one.SlopeOne¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

A simple yet accurate collaborative filtering algorithm.

This is a straightforward implementation of the SlopeOne algorithm [LM07].

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \mu_u + \frac{1}{ |R_i(u)|} \sum\limits_{j \in R_i(u)} \text{dev}(i, j),\]

where \(R_i(u)\) is the set of relevant items, i.e. the set of items \(j\) rated by \(u\) that also have at least one common user with \(i\). \(\text{dev}(i, j)\) is defined as the average difference between the ratings of \(i\) and those of \(j\):

\[\text{dev}(i, j) = \frac{1}{ |U_{ij}|}\sum\limits_{u \in U_{ij}} \left(r_{ui} - r_{uj}\right)\]
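The two formulas above fit in a short pure-Python sketch. It is an illustration only, not the library's code: the function name `slope_one_predict`, the nested-dict data layout, and the fallback to \(\mu_u\) when no relevant item exists are all assumptions, and the result is not clipped to the rating scale.

```python
def slope_one_predict(ratings, u, i):
    """ratings maps user -> {item: rating}. Returns the SlopeOne estimate."""

    def dev(i, j):
        # Users who rated both i and j.
        common = [v for v in ratings if i in ratings[v] and j in ratings[v]]
        if not common:
            return None
        return sum(ratings[v][i] - ratings[v][j] for v in common) / len(common)

    mu_u = sum(ratings[u].values()) / len(ratings[u])
    # Relevant items: rated by u and sharing at least one user with i.
    devs = [d for d in (dev(i, j) for j in ratings[u]) if d is not None]
    if not devs:
        return mu_u  # assumption: fall back to the user's mean rating
    return mu_u + sum(devs) / len(devs)


# Tiny hypothetical dataset.
ratings = {
    "alice": {"A": 5.0, "B": 3.0, "C": 2.0},
    "bob": {"A": 3.0, "B": 4.0},
    "carol": {"B": 2.0, "C": 5.0},
}
estimate = slope_one_predict(ratings, "carol", "A")
```

For this data, dev(A, B) is 0.5 and dev(A, C) is 3.0, so the estimate is carol's mean of 3.5 plus 1.75.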

Co-clustering¶

class surprise.prediction_algorithms.co_clustering.CoClustering¶

Bases: surprise.prediction_algorithms.algo_base.AlgoBase

A collaborative filtering algorithm based on co-clustering.

This is a straightforward implementation of [GM05].

Basically, users and items are assigned some clusters \(C_u\), \(C_i\), and some co-clusters \(C_{ui}\).

The prediction \(\hat{r}_{ui}\) is set as:

\[\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i}),\]

where \(\overline{C_{ui}}\) is the average rating of co-cluster \(C_{ui}\), \(\overline{C_u}\) is the average rating of \(u\)’s cluster, and \(\overline{C_i}\) is the average rating of \(i\)’s cluster. If the user is unknown, the prediction is \(\hat{r}_{ui} = \mu_i\). If the item is unknown, the prediction is \(\hat{r}_{ui} = \mu_u\). If both the user and the item are unknown, the prediction is \(\hat{r}_{ui} = \mu\).
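The prediction rule and its three fallbacks can be written as a single function. This is a sketch under stated assumptions: the name `co_clustering_predict` is hypothetical, the cluster averages are passed in as plain numbers, and an unknown user or item is signalled by passing `None` for its mean.

```python
def co_clustering_predict(mu, mu_u=None, mu_i=None,
                          avg_cu=None, avg_ci=None, avg_cui=None):
    """Return the estimate; mu_u / mu_i is None when the user / item is unknown."""
    if mu_u is None and mu_i is None:
        return mu      # both unknown: global mean
    if mu_u is None:
        return mu_i    # unknown user: item mean
    if mu_i is None:
        return mu_u    # unknown item: user mean
    return avg_cui + (mu_u - avg_cu) + (mu_i - avg_ci)


# Hypothetical averages.
full = co_clustering_predict(
    mu=3.5, mu_u=4.0, mu_i=3.0, avg_cu=3.6, avg_ci=3.2, avg_cui=3.8
)
no_user = co_clustering_predict(mu=3.5, mu_i=3.0)
no_item = co_clustering_predict(mu=3.5, mu_u=4.0)
neither = co_clustering_predict(mu=3.5)
```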

Clusters are assigned using a straightforward optimization method, much like k-means.

Parameters:

n_cltr_u (int) – Number of user clusters. Default is 3.
n_cltr_i (int) – Number of item clusters. Default is 3.
n_epochs (int) – Number of iterations of the optimization loop. Default is 20.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
verbose (bool) – If True, the current epoch will be printed. Default is False.

The model_selection package¶

Surprise provides various tools to run cross-validation procedures and search for the best parameters for a prediction algorithm. The tools presented here are all heavily inspired by the excellent scikit-learn library.

Cross validation iterators¶

The model_selection.split module contains various cross-validation iterators. Design and tools are inspired by the mighty scikit-learn.

The available iterators are:

`KFold`	A basic cross-validation iterator.
`RepeatedKFold`	Repeated `KFold` cross validator.
`ShuffleSplit`	A basic cross-validation iterator with random trainsets and testsets.
`LeaveOneOut`	Cross-validation iterator where each user has exactly one rating in the testset.
`PredefinedKFold`	A cross-validation iterator to use when a dataset has been loaded with the `load_from_folds` method.

This module also contains a function for splitting datasets into trainset and testset:

`train_test_split`	Split a dataset into trainset and testset.

class surprise.model_selection.split.KFold(n_splits=5, random_state=None, shuffle=True)¶

A basic cross-validation iterator.

Each fold is used once as a testset while the k - 1 remaining folds are used for training.

See an example in the User Guide.
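The fold rotation described above (each fold serves once as the testset, the other k - 1 folds as the trainset) can be sketched on a plain list. This is not how the library builds its trainsets: the generator name `k_fold` is hypothetical, no shuffling is done, and integers stand in for ratings.

```python
def k_fold(ratings, n_splits):
    """Yield (trainset, testset) pairs; each fold is the testset exactly once."""
    n = len(ratings)
    start = 0
    for fold in range(n_splits):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        stop = start + n // n_splits + (1 if fold < n % n_splits else 0)
        yield ratings[:start] + ratings[stop:], ratings[start:stop]
        start = stop


ratings = list(range(10))  # stand-ins for (user, item, rating) tuples
folds = list(k_fold(ratings, n_splits=3))
```

With 10 ratings and 3 splits, the testsets have sizes 4, 3 and 3, and every rating appears in exactly one testset.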

Parameters:

n_splits (int) – The number of folds.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Default is True.

split(data)¶

Generator function to iterate over trainsets and testsets.

Parameters:	data (`Dataset`) – The data containing ratings that will be divided into trainsets and testsets.
Yields:	tuple of (trainset, testset)

class surprise.model_selection.split.LeaveOneOut(n_splits=5, random_state=None)¶

Cross-validation iterator where each user has exactly one rating in the testset.

Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

See an example in the User Guide.

Parameters:

n_splits (int) – The number of folds.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.

split(data)¶

Generator function to iterate over trainsets and testsets.

Parameters:	data (`Dataset`) – The data containing ratings that will be divided into trainsets and testsets.
Yields:	tuple of (trainset, testset)

class surprise.model_selection.split.PredefinedKFold¶

A cross-validation iterator to use when a dataset has been loaded with the load_from_folds method.

See an example in the User Guide.

split(data)¶

Generator function to iterate over trainsets and testsets.

Parameters:	data (`Dataset`) – The data containing ratings that will be divided into trainsets and testsets.
Yields:	tuple of (trainset, testset)

class surprise.model_selection.split.RepeatedKFold(n_splits=5, n_repeats=10, random_state=None)¶

Repeated KFold cross validator.

Repeats KFold n_repeats times with different randomization in each repetition.

See an example in the User Guide.

Parameters:

n_splits (int) – The number of folds.
n_repeats (int) – The number of repetitions.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.

split(data)¶

Generator function to iterate over trainsets and testsets.

Parameters:	data (`Dataset`) – The data containing ratings that will be divided into trainsets and testsets.
Yields:	tuple of (trainset, testset)

class surprise.model_selection.split.ShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None, shuffle=True)¶

A basic cross-validation iterator with random trainsets and testsets.

Contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

See an example in the User Guide.

Parameters:

n_splits (int) – The number of folds.
test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is .2.
train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings in the data parameter of the split() method. Shuffling is not done in-place. Setting this to False defeats the purpose of this iterator, but it’s useful for the implementation of train_test_split(). Default is True.

split(data)¶

Generator function to iterate over trainsets and testsets.

Parameters:	data (`Dataset`) – The data containing ratings that will be divided into trainsets and testsets.
Yields:	tuple of (trainset, testset)

surprise.model_selection.split.train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)¶

Split a dataset into trainset and testset.

See an example in the User Guide.

Note: this function cannot be used as a cross-validation iterator.

Parameters:

data (Dataset) – The dataset to split into trainset and testset.
test_size (float or int or None) – If float, it represents the proportion of ratings to include in the testset. If int, represents the absolute number of ratings in the testset. If None, the value is set to the complement of the trainset size. Default is .2.
train_size (float or int or None) – If float, it represents the proportion of ratings to include in the trainset. If int, represents the absolute number of ratings in the trainset. If None, the value is set to the complement of the testset size. Default is None.
random_state (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for determining the folds. If int, random_state will be used as a seed for a new RNG. This is useful to get the same splits over multiple calls to split(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. random_state is only used if shuffle is True. Default is None.
shuffle (bool) – Whether to shuffle the ratings in the data parameter. Shuffling is not done in-place. Default is True.

Cross validation¶

surprise.model_selection.validation.cross_validate(algo, data, measures=['rmse', 'mae'], cv=None, return_train_measures=False, n_jobs=-1, pre_dispatch='2*n_jobs', verbose=False)¶

Run a cross validation procedure for a given algorithm, reporting accuracy measures and computation times.

See an example in the User Guide.

Parameters:

algo (AlgoBase) – The algorithm to evaluate.
data (Dataset) – The dataset on which to evaluate the algorithm.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
return_train_measures (bool) – Whether to compute performance measures on the trainsets. Default is False.
n_jobs (int) –
The maximum number of folds evaluated in parallel.
- If -1, all CPUs are used.
- If 1 is given, no parallel computing code is used at all, which is useful for debugging.
- For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.
Default is -1.
pre_dispatch (int or string) –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
- An int, giving the exact number of total jobs that are spawned.
- A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
Default is '2*n_jobs'.
verbose (int) – If True, accuracy measures for each split are printed, as well as train and test times. Averages and standard deviations over all splits are also reported. Default is False: nothing is printed.

Returns:

A dict with the following keys:

'test_*' where * corresponds to a lower-case accuracy measure, e.g. 'test_rmse': numpy array with accuracy values for each testset.

'train_*' where * corresponds to a lower-case accuracy measure, e.g. 'train_rmse': numpy array with accuracy values for each trainset. Only available if return_train_measures is True.

'fit_time': numpy array with the training time in seconds for each split.

'test_time': numpy array with the testing time in seconds for each split.

Return type:

dict
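As an illustration of how the returned dict might be summarised, the sketch below uses a hand-written stand-in filled with the per-fold values printed in the Getting Started example (plain lists instead of numpy arrays, so it runs without Surprise). The variable names are hypothetical.

```python
from statistics import mean, pstdev

# Stand-in for the dict returned by cross_validate(); the real values are
# numpy arrays, one entry per split.
results = {
    "test_rmse": [0.9311, 0.9370, 0.9320, 0.9317, 0.9391],
    "test_mae": [0.7350, 0.7375, 0.7341, 0.7342, 0.7375],
    "fit_time": [6.53, 7.11, 7.23, 7.15, 3.99],
    "test_time": [0.26, 0.26, 0.25, 0.15, 0.13],
}

# Mean and population standard deviation over the splits, per key.
summary = {
    key: (round(mean(values), 4), round(pstdev(values), 4))
    for key, values in results.items()
}
```

For these values the summary reproduces the Mean and Std columns of that example's table for RMSE and MAE (0.9342 and 0.0032, 0.7357 and 0.0015).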

Parameter search¶

class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=['rmse', 'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=-1, pre_dispatch='2*n_jobs', joblib_verbose=0)¶

The GridSearchCV class computes accuracy metrics for an algorithm on various combinations of parameters, over a cross-validation procedure. This is useful for finding the best set of parameters for a prediction algorithm. It is analogous to GridSearchCV from scikit-learn.

See an example in the User Guide.

Parameters:

algo_class (AlgoBase) – The class of the algorithm to evaluate.
param_grid (dict) – Dictionary with algorithm parameters as keys and lists of values as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment, see this note.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
cv (cross-validation iterator, int or None) – Determines how the data parameter will be split (i.e. how trainsets and testsets will be defined). If an int is passed, KFold is used with the appropriate n_splits parameter. If None, KFold is used with n_splits=5.
refit (bool or str) – If True, refit the algorithm on the whole dataset using the set of parameters that gave the best average performance for the first measure of measures. Other measures can be used by passing a string (corresponding to the measure name). Then, you can use the test() and predict() methods. refit can only be used if the data parameter given to fit() hasn’t been loaded with load_from_folds(). Default is False.
return_train_measures (bool) – Whether to compute performance measures on the trainsets. If True, the cv_results attribute will also contain measures for trainsets. Default is False.
n_jobs (int) –
The maximum number of parallel training procedures.
- If -1, all CPUs are used.
- If 1 is given, no parallel computing code is used at all, which is useful for debugging.
- For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.
Default is -1.
pre_dispatch (int or string) –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
- An int, giving the exact number of total jobs that are spawned.
- A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
Default is '2*n_jobs'.
joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
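To see what "all combinations" of a param_grid means in practice, the sketch below expands a grid into the list of parameter dicts that would each be evaluated. It says nothing about how GridSearchCV does this internally: the helper `expand_grid` is hypothetical, the grid values are illustrative, and the special handling of dict parameters such as sim_options is not covered.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of values in a {name: [values]} dict."""
    names = list(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]


# Illustrative grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_epochs": [5, 10],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.4, 0.6],
}
combinations = expand_grid(param_grid)
```

The number of fits grows multiplicatively: each combination is trained once per cross-validation split, so this grid with 5 folds would mean 40 training runs.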

best_estimator¶: dict of AlgoBase – Using an accuracy measure as key, get the algorithm that gave the best accuracy results for the chosen measure, averaged over all splits.

best_score¶: dict of floats – Using an accuracy measure as key, get the best average score achieved for that measure.

best_params¶: dict of dicts – Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure (on average).

best_index¶: dict of ints – Using an accuracy measure as key, get the index that can be used with cv_results that achieved the highest accuracy for that measure (on average).

cv_results¶: dict of arrays – A dict that contains accuracy measures over all splits, as well as train and test time for each parameter combination. Can be imported into a pandas DataFrame (see example).

fit(data)¶

Runs the fit() method of the algorithm for all parameter combinations, over different splits given by the cv parameter.

Parameters:	data (`Dataset`) – The dataset on which to evaluate the algorithm, in parallel.

predict(*args)¶

Call predict() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.predict().

Only available if refit is not False.

test(testset, verbose=False)¶

Call test() on the estimator with the best found parameters (according to the refit parameter). See AlgoBase.test().

Only available if refit is not False.

similarities module¶

The similarities module includes tools to compute similarity metrics between users or items. You may need to refer to the Notation standards, References page. See also the Similarity measure configuration section of the User Guide.

Available similarity measures:

`cosine`	Compute the cosine similarity between all pairs of users (or items).
`msd`	Compute the Mean Squared Difference similarity between all pairs of users (or items).
`pearson`	Compute the Pearson correlation coefficient between all pairs of users (or items).
`pearson_baseline`	Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.

surprise.similarities.cosine()¶

Compute the cosine similarity between all pairs of users (or items).

Only common users (or items) are taken into account. The cosine similarity is defined as:

\[\text{cosine_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}} r_{ui} \cdot r_{vi}} {\sqrt{\sum\limits_{i \in I_{uv}} r_{ui}^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} r_{vi}^2} }\]

or

\[\text{cosine_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}} r_{ui} \cdot r_{uj}} {\sqrt{\sum\limits_{u \in U_{ij}} r_{ui}^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} r_{uj}^2} }\]

depending on the user_based field of sim_options (see Similarity measure configuration).

For details on cosine similarity, see Wikipedia.
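The user-based formula can be evaluated for a single pair of users as follows. This is a sketch only: the function name `cosine_sim`, the dict-based data layout, and returning 0 when there are no common items are assumptions, and the library computes the similarities for all pairs at once rather than one pair at a time.

```python
import math


def cosine_sim(ratings_u, ratings_v):
    """ratings_* map item -> rating. Only common items are used."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no overlap is treated as zero similarity
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


# Hypothetical ratings: the users share items A and B only.
u = {"A": 3.0, "B": 4.0, "C": 1.0}
v = {"A": 4.0, "B": 3.0, "D": 5.0}
sim = cosine_sim(u, v)
```

Items C and D are ignored because only one of the two users rated them; over A and B the dot product is 24 and both norms are 5.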

surprise.similarities.msd()¶

Compute the Mean Squared Difference similarity between all pairs of users (or items).

Only common users (or items) are taken into account. The Mean Squared Difference is defined as:

\[\text{msd}(u, v) = \frac{1}{|I_{uv}|} \cdot \sum\limits_{i \in I_{uv}} (r_{ui} - r_{vi})^2\]

or

\[\text{msd}(i, j) = \frac{1}{|U_{ij}|} \cdot \sum\limits_{u \in U_{ij}} (r_{ui} - r_{uj})^2\]

depending on the user_based field of sim_options (see Similarity measure configuration).

The MSD-similarity is then defined as:

\[\begin{split}\text{msd_sim}(u, v) &= \frac{1}{\text{msd}(u, v) + 1}\\ \text{msd_sim}(i, j) &= \frac{1}{\text{msd}(i, j) + 1}\end{split}\]

The \(+ 1\) term is just here to avoid dividing by zero.

For details on MSD, see third definition on Wikipedia.
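A minimal sketch of the MSD similarity for one pair of users, following the two formulas above. The function name `msd_sim` and the choice to return 0 when there are no common items are assumptions.

```python
def msd_sim(ratings_u, ratings_v):
    """ratings_* map item -> rating. Returns 1 / (msd + 1) over common items."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0  # assumption: no overlap is treated as zero similarity
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


# Hypothetical ratings: the users share items A and B only.
u = {"A": 3.0, "B": 4.0, "C": 1.0}
v = {"A": 4.0, "B": 3.0, "D": 5.0}
sim = msd_sim(u, v)
identical = msd_sim(u, u)
```

Identical ratings give an MSD of 0 and hence the maximum similarity of 1, which is the case the \(+ 1\) term protects against.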

surprise.similarities.pearson()¶

Compute the Pearson correlation coefficient between all pairs of users (or items).

Only common users (or items) are taken into account. The Pearson correlation coefficient can be seen as a mean-centered cosine similarity, and is defined as:

\[\text{pearson_sim}(u, v) = \frac{ \sum\limits_{i \in I_{uv}} (r_{ui} - \mu_u) \cdot (r_{vi} - \mu_{v})} {\sqrt{\sum\limits_{i \in I_{uv}} (r_{ui} - \mu_u)^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} (r_{vi} - \mu_{v})^2} }\]

or

\[\text{pearson_sim}(i, j) = \frac{ \sum\limits_{u \in U_{ij}} (r_{ui} - \mu_i) \cdot (r_{uj} - \mu_{j})} {\sqrt{\sum\limits_{u \in U_{ij}} (r_{ui} - \mu_i)^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} (r_{uj} - \mu_{j})^2} }\]

depending on the user_based field of sim_options (see Similarity measure configuration).

Note: if there are no common users or items, similarity will be 0 (and not -1).

For details on Pearson coefficient, see Wikipedia.

surprise.similarities.pearson_baseline()¶

Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.

The shrinkage parameter helps to avoid overfitting when only few ratings are available (see Similarity measure configuration).

The Pearson-baseline correlation coefficient is defined as:

\[\text{pearson_baseline_sim}(u, v) = \hat{\rho}_{uv} = \frac{ \sum\limits_{i \in I_{uv}} (r_{ui} - b_{ui}) \cdot (r_{vi} - b_{vi})} {\sqrt{\sum\limits_{i \in I_{uv}} (r_{ui} - b_{ui})^2} \cdot \sqrt{\sum\limits_{i \in I_{uv}} (r_{vi} - b_{vi})^2}}\]

or

\[\text{pearson_baseline_sim}(i, j) = \hat{\rho}_{ij} = \frac{ \sum\limits_{u \in U_{ij}} (r_{ui} - b_{ui}) \cdot (r_{uj} - b_{uj})} {\sqrt{\sum\limits_{u \in U_{ij}} (r_{ui} - b_{ui})^2} \cdot \sqrt{\sum\limits_{u \in U_{ij}} (r_{uj} - b_{uj})^2}}\]

The shrunk Pearson-baseline correlation coefficient is then defined as:

\[\begin{split}\text{pearson_baseline_shrunk_sim}(u, v) &= \frac{|I_{uv}| - 1} {|I_{uv}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{uv}\\ \text{pearson_baseline_shrunk_sim}(i, j) &= \frac{|U_{ij}| - 1} {|U_{ij}| - 1 + \text{shrinkage}} \cdot \hat{\rho}_{ij}\end{split}\]

Obviously, a shrinkage parameter of 0 amounts to no shrinkage at all.
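The effect of the shrinkage factor is easiest to see numerically. In the sketch below, the function name `shrunk_similarity`, the zero-denominator guard, and all values are hypothetical; the correlation \(\hat{\rho}\) is taken as given rather than computed.

```python
def shrunk_similarity(rho, n_common, shrinkage):
    """Scale rho by (n_common - 1) / (n_common - 1 + shrinkage)."""
    denominator = n_common - 1 + shrinkage
    if denominator == 0:
        return 0.0  # assumption: guard for n_common = 1 with no shrinkage
    return (n_common - 1) / denominator * rho


# Same correlation, different amounts of supporting evidence.
few = shrunk_similarity(rho=0.8, n_common=3, shrinkage=100)
many = shrunk_similarity(rho=0.8, n_common=201, shrinkage=100)
unshrunk = shrunk_similarity(rho=0.8, n_common=3, shrinkage=0)
```

With only 3 common ratings the similarity is pulled almost to zero, while with 201 common ratings two thirds of the correlation is retained.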

Note: here again, if there are no common users/items, similarity will be 0 (and not -1).

Motivations for such a similarity measure can be found in the Recommender Systems Handbook, section 5.4.1.

accuracy module¶

The surprise.accuracy module provides tools for computing accuracy metrics on a set of predictions.

Available accuracy metrics:

`rmse`	Compute RMSE (Root Mean Squared Error).
`mae`	Compute MAE (Mean Absolute Error).
`fcp`	Compute FCP (Fraction of Concordant Pairs).

surprise.accuracy.fcp(predictions, verbose=True)¶

Compute FCP (Fraction of Concordant Pairs).

Computed as described in the paper Collaborative Filtering on Ordinal User Feedback by Koren and Sill, section 5.2.

Parameters:	predictions (`list` of `Prediction`) – A list of predictions, as returned by the `test()` method.
	verbose – If True, will print computed value. Default is `True`.
Returns:	The Fraction of Concordant Pairs.
Raises:	`ValueError` – When `predictions` is empty.

surprise.accuracy.mae(predictions, verbose=True)¶

Compute MAE (Mean Absolute Error).

\[\text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}|r_{ui} - \hat{r}_{ui}|\]

Parameters:	predictions (`list` of `Prediction`) – A list of predictions, as returned by the `test()` method.
	verbose – If True, will print computed value. Default is `True`.
Returns:	The Mean Absolute Error of predictions.
Raises:	`ValueError` – When `predictions` is empty.

surprise.accuracy.rmse(predictions, verbose=True)¶

Compute RMSE (Root Mean Squared Error).

\[\text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}}(r_{ui} - \hat{r}_{ui})^2}.\]

Parameters:	predictions (`list` of `Prediction`) – A list of predictions, as returned by the `test()` method.
	verbose – If True, will print computed value. Default is `True`.
Returns:	The Root Mean Squared Error of predictions.
Raises:	`ValueError` – When `predictions` is empty.
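The MAE and RMSE formulas can be checked on a handful of values with plain Python. This sketch does not use the module's functions: the names `mean_absolute_error` and `root_mean_squared_error` are hypothetical, and predictions are represented as simple (true rating, estimate) tuples rather than `Prediction` objects.

```python
import math


def mean_absolute_error(pairs):
    """pairs is a list of (true_rating, estimate) tuples."""
    if not pairs:
        raise ValueError("Prediction list is empty.")
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


def root_mean_squared_error(pairs):
    """pairs is a list of (true_rating, estimate) tuples."""
    if not pairs:
        raise ValueError("Prediction list is empty.")
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


# Hypothetical predictions with errors 0.5, -0.5, 1.0 and -1.0.
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
mae_value = mean_absolute_error(pairs)
rmse_value = root_mean_squared_error(pairs)
```

RMSE is never smaller than MAE on the same predictions because squaring gives more weight to the larger errors.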

dataset module¶

The dataset module defines the Dataset class and other subclasses which are used for managing datasets.

Users may use both built-in and user-defined datasets (see the Getting Started page for examples). Right now, three built-in datasets are available:

The movielens-100k dataset.
The movielens-1m dataset.
The Jester dataset 2.

Built-in datasets can all be loaded (or downloaded if you haven’t already) using the Dataset.load_builtin() method. Summary:

`Dataset.load_builtin`	Load a built-in dataset.
`Dataset.load_from_file`	Load a dataset from a (custom) file.
`Dataset.load_from_folds`	Load a dataset where folds (for cross-validation) are predefined by some files.
`Dataset.folds`	Generator function to iterate over the folds of the Dataset.
`DatasetAutoFolds.split`	Split the dataset into folds for future cross-validation.

class surprise.dataset.Dataset(reader)¶

Base class for loading datasets.

Note that you should never instantiate the Dataset class directly (same goes for its derived classes), but instead use one of the three available methods for loading datasets.

folds()¶

Generator function to iterate over the folds of the Dataset.

Warning

Deprecated since version 1.05. Use cross-validation iterators instead. This method will be removed in later versions.

Yields:	tuple – `Trainset` and testset of current fold.

classmethod load_builtin(name='ml-100k')¶

Load a built-in dataset.

If the dataset has not already been loaded, it will be downloaded and saved. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:	name (`string`) – The name of the built-in dataset to load. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is ‘ml-100k’.
Returns:	A `Dataset` object.
Raises:	`ValueError` – If the `name` parameter is incorrect.

classmethod load_from_df(df, reader)¶

Load a dataset from a pandas dataframe.

Use this if you want to use a custom dataset that is stored in a pandas dataframe. See the User Guide for an example.

Parameters:	df (DataFrame) – The dataframe containing the ratings. It must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
	reader (`Reader`) – A reader to read the file. Only the `rating_scale` field needs to be specified.

classmethod load_from_file(file_path, reader)¶

Load a dataset from a (custom) file.

Use this if you want to use a custom dataset and all of the ratings are stored in one file. You will have to split your dataset using the split method. See an example in the User Guide.

Parameters:	file_path (`string`) – The path to the file containing ratings.
	reader (`Reader`) – A reader to read the file.

classmethod load_from_folds(folds_files, reader)¶

Load a dataset where folds (for cross-validation) are predefined by some files.

The purpose of this method is to cover a common use case where a dataset is already split into predefined folds, such as the movielens-100k dataset which defines files u1.base, u1.test, u2.base, u2.test, etc… It can also be used when you don’t want to perform cross-validation but still want to specify your training and testing data (which comes down to 1-fold cross-validation anyway). See an example in the User Guide.

Parameters:	folds_files (`iterable` of `tuples`) – The list of the folds. A fold is a tuple of the form `(path_to_train_file, path_to_test_file)`. reader (`Reader`) – A reader to read the files.

class surprise.dataset.DatasetAutoFolds(ratings_file=None, reader=None, df=None)¶

A derived class from Dataset for which folds (for cross-validation) are not predefined. (Or for when there are no folds at all).

build_full_trainset()¶

Do not split the dataset into folds and just return a trainset as is, built from the whole dataset.

User can then query for predictions, as shown in the User Guide.

Returns:	The `Trainset`.

split(n_folds=5, shuffle=True)¶

Split the dataset into folds for future cross-validation.

Warning

Deprecated since version 1.05. Use cross-validation iterators instead. This method will be removed in later versions.

If you forget to call split(), the dataset will be automatically shuffled and split for 5-fold cross-validation.

You can obtain repeatable splits over all your experiments by seeding the RNG:

import random
random.seed(my_seed)  # call this before you call split!

Parameters:	n_folds (`int`) – The number of folds. shuffle (`bool`) – Whether to shuffle ratings before splitting. If `False`, folds will always be the same each time the experiment is run. Default is `True`.
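
The effect of seeding on shuffled folds can be illustrated with a minimal, pure-Python sketch. It is not Surprise's implementation: `make_folds` and the toy ratings are hypothetical names used only for illustration, and the way ratings are assigned to folds is an assumption of the sketch.

```python
import random


def make_folds(ratings, n_folds=5, shuffle=True):
    """Split ratings into n_folds (train, test) pairs (illustrative only)."""
    ratings = list(ratings)
    if shuffle:
        random.shuffle(ratings)  # uses the global RNG, hence the seed matters
    folds = []
    for k in range(n_folds):
        # Every n_folds-th rating, starting at index k, goes to the test part.
        test = ratings[k::n_folds]
        train = [r for i, r in enumerate(ratings) if i % n_folds != k]
        folds.append((train, test))
    return folds


# Toy (user, item, rating) triples.
toy_ratings = [("u%d" % i, "i%d" % (i % 3), float(i % 5 + 1)) for i in range(10)]

random.seed(0)  # seed before splitting
first = make_folds(toy_ratings, n_folds=5)
random.seed(0)  # same seed, same shuffle
second = make_folds(toy_ratings, n_folds=5)

print(first == second)  # the two runs produce identical folds
```

With `shuffle=False` the seed has no influence, since the ratings are split in their original order.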

Trainset class¶

class surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)¶

A trainset contains all useful data that constitutes a training set.

It is used by the fit() method of every prediction algorithm. You should not try to build such an object on your own but rather use the Dataset.folds() method or the DatasetAutoFolds.build_full_trainset() method.

Trainsets are different from Datasets. You can think of a Dataset as the raw data, and of Trainsets as higher-level data where useful methods are defined. Also, a Dataset may comprise multiple Trainsets (e.g. when doing cross-validation).

ur¶: defaultdict of list – The users ratings. This is a dictionary containing lists of tuples of the form (item_inner_id, rating). The keys are user inner ids.

ir¶: defaultdict of list – The items ratings. This is a dictionary containing lists of tuples of the form (user_inner_id, rating). The keys are item inner ids.

n_users¶: Total number of users \(|U|\).

n_items¶: Total number of items \(|I|\).

n_ratings¶: Total number of ratings \(|R_{train}|\).

rating_scale¶: tuple – The minimum and maximum ratings of the rating scale.

global_mean¶: The mean of all ratings \(\mu\).
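
To make the shape of these attributes concrete, here is a small sketch that builds `ur`, `ir` and the related counts from a few toy ratings. It is an illustration only, not how Surprise builds a Trainset: the toy data is invented, and assigning inner ids in order of first appearance is an assumption of the sketch.

```python
from collections import defaultdict

# Toy raw ratings: (raw user id, raw item id, rating).
raw_ratings = [
    ("alice", "matrix", 5.0),
    ("alice", "titanic", 2.0),
    ("bob", "matrix", 4.0),
]

raw2inner_id_users, raw2inner_id_items = {}, {}
ur, ir = defaultdict(list), defaultdict(list)

for ruid, riid, r in raw_ratings:
    # Inner ids are assigned in order of first appearance (sketch assumption).
    uid = raw2inner_id_users.setdefault(ruid, len(raw2inner_id_users))
    iid = raw2inner_id_items.setdefault(riid, len(raw2inner_id_items))
    ur[uid].append((iid, r))  # keyed by user inner id
    ir[iid].append((uid, r))  # keyed by item inner id

n_users, n_items = len(ur), len(ir)
n_ratings = len(raw_ratings)
global_mean = sum(r for _, _, r in raw_ratings) / n_ratings

print(dict(ur))  # {0: [(0, 5.0), (1, 2.0)], 1: [(0, 4.0)]}
print(dict(ir))  # {0: [(0, 5.0), (1, 4.0)], 1: [(0, 2.0)]}
```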

all_items()¶

Generator function to iterate over all items.

Yields:	Inner id of items.

all_ratings()¶

Generator function to iterate over all ratings.

Yields:	A tuple `(uid, iid, rating)` where ids are inner ids (see this note).

all_users()¶

Generator function to iterate over all users.

Yields:	Inner id of users.

build_anti_testset(fill=None)¶

Return a list of ratings that can be used as a testset in the test() method.

The ratings are all the ratings that are not in the trainset, i.e. all the ratings \(r_{ui}\) where the user \(u\) is known, the item \(i\) is known, but the rating \(r_{ui}\) is not in the trainset. As \(r_{ui}\) is unknown, it is either replaced by the fill value or assumed to be equal to the mean of all ratings global_mean.

Parameters:	fill (float) – The value to fill unknown ratings. If `None` the global mean of all ratings `global_mean` will be used.
Returns:	A list of tuples `(uid, iid, fill)` where ids are raw ids.
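
The definition above can be sketched in a few lines of plain Python. This is an illustration rather than the library's code: `anti_testset_sketch` is a hypothetical helper, and it keeps inner ids for brevity, whereas the method documented above returns raw ids.

```python
def anti_testset_sketch(ur, n_items, global_mean, fill=None):
    """List (uid, iid, fill) for every known user/item pair that has no rating."""
    fill = global_mean if fill is None else float(fill)
    result = []
    for uid, user_ratings in ur.items():
        rated = {iid for iid, _ in user_ratings}  # items this user already rated
        result += [(uid, iid, fill) for iid in range(n_items) if iid not in rated]
    return result


# User 0 rated items 0 and 1; user 1 rated only item 0.
ur = {0: [(0, 5.0), (1, 2.0)], 1: [(0, 4.0)]}
pairs = anti_testset_sketch(ur, n_items=2, global_mean=4.0)
print(pairs)  # [(1, 1, 4.0)]
```

Note that the size of the result grows with the number of users times the number of items, so it can become very large on real datasets.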

build_testset()¶

Return a list of ratings that can be used as a testset in the test() method.

The ratings are all the ratings that are in the trainset, i.e. all the ratings returned by the all_ratings() generator. This is useful in cases where you want to test your algorithm on the trainset.

global_mean

Return the mean of all ratings.

It’s only computed once.

knows_item(iid)¶

Indicate if the item is part of the trainset.

An item is part of the trainset if the item was rated at least once.

Parameters:	iid (int) – The (inner) item id. See this note.
Returns:	`True` if item is part of the trainset, else `False`.

knows_user(uid)¶

Indicate if the user is part of the trainset.

A user is part of the trainset if the user has at least one rating.

Parameters:	uid (int) – The (inner) user id. See this note.
Returns:	`True` if user is part of the trainset, else `False`.

to_inner_iid(riid)¶

Convert an item raw id to an inner id.

See this note.

Parameters:	riid (str) – The item raw id.
Returns:	The item inner id.
Return type:	int
Raises:	`ValueError` – When item is not part of the trainset.

to_inner_uid(ruid)¶

Convert a user raw id to an inner id.

See this note.

Parameters:	ruid (str) – The user raw id.
Returns:	The user inner id.
Return type:	int
Raises:	`ValueError` – When user is not part of the trainset.

to_raw_iid(iiid)¶

Convert an item inner id to a raw id.

See this note.

Parameters:	iiid (int) – The item inner id.
Returns:	The item raw id.
Return type:	str
Raises:	`ValueError` – When `iiid` is not an inner id.

to_raw_uid(iuid)¶

Convert a user inner id to a raw id.

See this note.

Parameters:	iuid (int) – The user inner id.
Returns:	The user raw id.
Return type:	str
Raises:	`ValueError` – When `iuid` is not an inner id.
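
The four conversion methods behave like lookups in a pair of dictionaries. The sketch below mimics that behaviour for users only; the `_sketch` functions and the toy mapping are hypothetical and are not part of Surprise.

```python
# Toy mapping from raw user ids to inner ids, and its inverse.
raw2inner_id_users = {"alice": 0, "bob": 1}
inner2raw_id_users = {inner: raw for raw, inner in raw2inner_id_users.items()}


def to_inner_uid_sketch(ruid):
    """Raw id -> inner id, raising ValueError for unknown users."""
    try:
        return raw2inner_id_users[ruid]
    except KeyError:
        raise ValueError("User %s is not part of the trainset." % ruid) from None


def to_raw_uid_sketch(iuid):
    """Inner id -> raw id, raising ValueError for invalid inner ids."""
    try:
        return inner2raw_id_users[iuid]
    except KeyError:
        raise ValueError("%s is not a valid inner id." % iuid) from None


print(to_inner_uid_sketch("bob"))  # 1
print(to_raw_uid_sketch(0))        # alice

try:
    to_inner_uid_sketch("carol")   # unknown user
except ValueError as err:
    print(err)
```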

Reader class¶

class surprise.reader.Reader(name=None, line_format=u'user item rating', sep=None, rating_scale=(1, 5), skip_lines=0)¶

The Reader class is used to parse a file containing ratings.

Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure:

user ; item ; rating ; [timestamp]

where the order of the fields and the separator (here ‘;’) may be arbitrarily defined (see below). Brackets indicate that the timestamp field is optional.

For each built-in dataset, Surprise also provides predefined readers which are useful if you want to use a custom dataset that has the same format as a built-in one (see the name parameter).

Parameters:

name (string, optional) – If specified, a Reader for one of the built-in datasets is returned and any other parameter is ignored. Accepted values are ‘ml-100k’, ‘ml-1m’, and ‘jester’. Default is None.
line_format (string) – The field names, in the order in which they appear on a line. Note that line_format itself is always space-separated; use the sep parameter to set the separator used in the file. Default is 'user item rating'.
sep (char) – The separator between fields. Example: ';'.
rating_scale (tuple, optional) – The rating scale used for every rating. Default is (1, 5).
skip_lines (int, optional) – Number of lines to skip at the beginning of the file. Default is 0.
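
The interplay between line_format and sep can be shown with a small stand-alone parser. It is only a sketch of the idea: `parse_line` is a hypothetical helper, not the Reader's actual parsing code.

```python
def parse_line(line, line_format="user item rating", sep=None):
    """Parse one ratings line into (user, item, rating, timestamp)."""
    fields = line_format.split()  # line_format itself is space-separated
    values = [v.strip() for v in line.split(sep)]  # sep=None splits on whitespace
    record = dict(zip(fields, values))
    # The timestamp is optional, so it may be missing from the record.
    return (record["user"], record["item"], float(record["rating"]),
            record.get("timestamp"))


print(parse_line("196;242;3;881250949", "user item rating timestamp", sep=";"))
# ('196', '242', 3.0, '881250949')
print(parse_line("242,196,4.5", "item user rating", sep=","))
# ('196', '242', 4.5, None)
```

The second call shows that the field order in the file is free, as long as line_format lists the fields in that same order.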

evaluate module¶

The evaluate module defines the evaluate() function and the GridSearch class.

class surprise.evaluate.GridSearch(algo_class, param_grid, measures=[u'rmse', u'mae'], n_jobs=-1, pre_dispatch=u'2*n_jobs', seed=None, verbose=1, joblib_verbose=0)¶

Warning

Deprecated since version 1.05. Use GridSearchCV instead. This class will be removed in later versions.

The GridSearch class is used to evaluate the performance of an algorithm on various combinations of parameters and to extract the best combination. It is analogous to GridSearchCV from scikit-learn.

See User Guide for usage.

Parameters:

algo_class (AlgoBase) – The class object of the algorithm to evaluate.
param_grid (dict) – Dictionary with algorithm parameters as keys and lists of values as values. All combinations will be evaluated with the desired algorithm. Dict parameters such as sim_options require special treatment, see this note.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
n_jobs (int) –
The maximum number of algorithms trained in parallel.
- If -1, all CPUs are used.
- If 1 is given, no parallel computing code is used at all, which is useful for debugging.
- For n_jobs below -1, (n_cpus + n_jobs + 1) are used. For example, with n_jobs = -2 all CPUs but one are used.
Default is -1.
pre_dispatch (int or string) –
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
- An int, giving the exact number of total jobs that are spawned.
- A string, giving an expression as a function of n_jobs, as in '2*n_jobs'.
Default is '2*n_jobs'.
seed (int) – The value to use as seed for RNG. It will determine how splits are defined. If None, the current time since epoch is used. Default is None.
verbose (bool) – Level of verbosity. If False, nothing is printed. If True, the mean values of each measure are printed for each parameter combination. Default is True.
joblib_verbose (int) – Controls the verbosity of joblib: the higher, the more messages.
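
Two of the parameters above lend themselves to a short illustration: how a param_grid expands into parameter combinations, and how n_jobs maps to a number of workers. The sketch below uses only the standard library. The grid values are arbitrary examples, `effective_workers` is a hypothetical helper, and clamping the result to at least 1 is an assumption of the sketch.

```python
from itertools import product

# Example grid: two parameters with two candidate values each.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}

keys = list(param_grid)
combinations = [
    dict(zip(keys, values))
    for values in product(*(param_grid[k] for k in keys))
]
print(len(combinations))  # 4
print(combinations[0])    # {'n_epochs': 5, 'lr_all': 0.002}


def effective_workers(n_jobs, n_cpus):
    """Number of workers implied by n_jobs, following the rule described above."""
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        return max(n_cpus + n_jobs + 1, 1)
    return n_jobs


print(effective_workers(-1, n_cpus=8))  # 8
print(effective_workers(-2, n_cpus=8))  # 7
```

The number of combinations is the product of the lengths of the value lists, so large grids quickly become expensive to evaluate.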

cv_results¶: dict of arrays – A dict that contains all parameters and accuracy information for each combination. Can be imported into a pandas DataFrame.

best_estimator¶: dict of AlgoBase – Using an accuracy measure as key, get the estimator that gave the best accuracy results for the chosen measure.

best_score¶: dict of floats – Using an accuracy measure as key, get the best score achieved for that measure.

best_params¶: dict of dicts – Using an accuracy measure as key, get the parameters combination that gave the best accuracy results for the chosen measure.

best_index¶: dict of ints – Using an accuracy measure as key, get the index that can be used with cv_results that achieved the highest accuracy for that measure.

evaluate(data)¶

Runs the grid search on dataset.

Class instance attributes can be accessed after the evaluate is done.

Parameters:	data (`Dataset`) – The dataset on which to evaluate the algorithm.

surprise.evaluate.evaluate(algo, data, measures=[u'rmse', u'mae'], with_dump=False, dump_dir=None, verbose=1)¶

Warning

Deprecated since version 1.05. Use cross_validate instead. This function will be removed in later versions.

Evaluate the performance of the algorithm on given data.

Depending on the nature of the data parameter, it may or may not perform cross validation.

Parameters:

algo (AlgoBase) – The algorithm to evaluate.
data (Dataset) – The dataset on which to evaluate the algorithm.
measures (list of string) – The performance measures to compute. Allowed names are function names as defined in the accuracy module. Default is ['rmse', 'mae'].
with_dump (bool) – If True, the predictions and the algorithm will be dumped at each fold for later analysis (see FAQ). The file names will be set as: '<date>-<algorithm name>-<fold number>'. Default is False.
dump_dir (str) – The directory in which to dump the files. Default is '~/.surprise_data/dumps/', or the folder specified by the 'SURPRISE_DATA_FOLDER' environment variable (see FAQ).
verbose (int) – Level of verbosity. If 0, nothing is printed. If 1 (default), accuracy measures for each fold are printed, with a final summary. If 2, every prediction is printed.

Returns:

A dictionary containing measures as keys and lists as values. Each list contains one entry per fold.
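
As a sketch of how such a return value might be post-processed, the snippet below averages each measure over the folds. The numbers in `perf` are made up for illustration and are not actual results.

```python
from statistics import mean

# Hypothetical per-fold results, shaped like the return value described above.
perf = {"rmse": [0.94, 0.93, 0.95], "mae": [0.74, 0.73, 0.75]}

# One mean value per measure, computed over the folds.
summary = {measure: mean(values) for measure, values in perf.items()}
for measure, value in summary.items():
    print("%s: %.4f" % (measure, value))
```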

dump module¶

The dump module defines the dump() function.

surprise.dump.dump(file_name, predictions=None, algo=None, verbose=0)¶

A basic wrapper around Pickle to serialize a list of predictions and/or an algorithm to disk.

What is dumped is a dictionary with keys 'predictions' and 'algo'.

Parameters:	file_name (str) – The name (with full path) specifying where to dump the predictions. predictions (list of `Prediction`) – The predictions to dump. algo (`Algorithm`, optional) – The algorithm to dump. verbose (int) – Level of verbosity. If `1`, then a message indicates that the dumping went successfully. Default is `0`.

surprise.dump.load(file_name)¶

A basic wrapper around Pickle to deserialize a list of predictions and/or an algorithm that were dumped to disk using dump().

Parameters:	file_name (str) – The path of the file from which the algorithm is to be loaded.
Returns:	A tuple `(predictions, algo)` where `predictions` is a list of `Prediction` objects and `algo` is an `Algorithm` object. Depending on what was dumped, some of these may be `None`.
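
The round trip described above can be sketched with the standard pickle module. `dump_sketch` and `load_sketch` are hypothetical stand-ins written for this example, not the functions of the dump module, and the toy prediction tuple is invented.

```python
import os
import pickle
import tempfile


def dump_sketch(file_name, predictions=None, algo=None):
    """Pickle a dict with keys 'predictions' and 'algo' (illustrative only)."""
    with open(file_name, "wb") as f:
        pickle.dump({"predictions": predictions, "algo": algo}, f)


def load_sketch(file_name):
    """Return the (predictions, algo) tuple stored by dump_sketch."""
    with open(file_name, "rb") as f:
        obj = pickle.load(f)
    return obj["predictions"], obj["algo"]


path = os.path.join(tempfile.mkdtemp(), "dump_file")
dump_sketch(path, predictions=[("u1", "i1", 4.0, 3.8)])
predictions, algo = load_sketch(path)
print(predictions)  # [('u1', 'i1', 4.0, 3.8)]
print(algo)         # None, because no algorithm was dumped
```

As with any pickle-based format, files should only be loaded if they come from a trusted source.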
