The Million Playlist Dataset README

The Million Playlist Dataset contains 1,000,000 playlists created by users on the Spotify platform. It can be used by researchers interested in exploring how to improve the music listening experience.

What's in the Million Playlist Dataset

The MPD contains a million user-generated playlists. These playlists were created during the period of January 2010 through October 2017. Each playlist in the MPD contains a playlist title, the track list (including track metadata) editing information (last edit time, number of playlist edits) and other miscellaneous information about the playlist. See the Detailed Description section for more details.

License

Usage of the Million Playlist Dataset is subject to these license terms.

Citing the Million Playlist Dataset

Citation information for the dataset can be found at http://recsys-challenge.spotify.com/dataset.

Getting the dataset

The dataset is available at http://recsys-challenge.spotify.com/dataset.

Verifying your dataset

You can validate the dataset by checking the md5 hashes of the data. From the top level directory of the MPD:

% md5sum -c md5sums

This should print out OK for each of the 1,000 slice files in the dataset.

You can also compute a number of statistics for the dataset as follows:

% python src/stats.py data

The output of this program should match what is in 'stats.txt'. Depending on how fast your computer is, stats.py can take 30 minutes or more to run.

Detailed description

The Million Playlist Dataset consists of 1,000 slice files. These files have the naming convention of:

mpd.slice.STARTING_PLAYLIST_ID_-_ENDING_PLAYLIST_ID.json

For example, the first 1,000 playlists in the MPD are in a file called mpd.slice.0-999.json and the last 1,000 playlists are in a file called mpd.slice.999000-999999.json.

Each slice file is a JSON dictionary with two fields: info and playlists.

info Field

The info field is a dictionary that contains general information about the particular slice:

playlists field

This is an array that typically contains 1,000 playlists. Each playlist is a dictionary that contains the following fields:

Here's an example of a typical playlist entry:

    {
        "name": "musical",
        "collaborative": "false",
        "pid": 5,
        "modified_at": 1493424000,
        "num_albums": 7,
        "num_tracks": 12,
        "num_followers": 1,
        "num_edits": 2,
        "duration_ms": 2657366,
        "num_artists": 6,
        "tracks": [
            {
                "pos": 0,
                "artist_name": "Degiheugi",
                "track_uri": "spotify:track:7vqa3sDmtEaVJ2gcvxtRID",
                "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                "track_name": "Finalement",
                "album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
                "duration_ms": 166264,
                "album_name": "Dancing Chords and Fireflies"
            },
            {
                "pos": 1,
                "artist_name": "Degiheugi",
                "track_uri": "spotify:track:23EOmJivOZ88WJPUbIPjh6",
                "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                "track_name": "Betty",
                "album_uri": "spotify:album:3lUSlvjUoHNA8IkNTqURqd",
                "duration_ms": 235534,
                "album_name": "Endless Smile"
            },
            {
                "pos": 2,
                "artist_name": "Degiheugi",
                "track_uri": "spotify:track:1vaffTCJxkyqeJY7zF9a55",
                "artist_uri": "spotify:artist:3V2paBXEoZIAhfZRJmo2jL",
                "track_name": "Some Beat in My Head",
                "album_uri": "spotify:album:2KrRMJ9z7Xjoz1Az4O6UML",
                "duration_ms": 268050,
                "album_name": "Dancing Chords and Fireflies"
            },
            // 8 tracks omitted
            {
                "pos": 11,
                "artist_name": "Mo' Horizons",
                "track_uri": "spotify:track:7iwx00eBzeSSSy6xfESyWN",
                "artist_uri": "spotify:artist:3tuX54dqgS8LsGUvNzgrpP",
                "track_name": "Fever 99\u00b0",
                "album_uri": "spotify:album:2Fg1t2tyOSGWkVYHlFfXVf",
                "duration_ms": 364320,
                "album_name": "Come Touch The Sun"
            }
        ],

    }

Tools

There are some tools that you can use with the dataset.

stats.py

This python program will iterate through the entire MPD and display summary information about the contents of the MPD.

Usage:

% python src/stats.py data

show.py

This python program will show playlists given their ID.

Show playlist with PID 10:

% python src/show.py 10

Show the first 500 and the last 500 playlists:

% python src/show.py 0-500 999500-1000000

Show the raw json for 3 playlists:

% python src/show.py --raw 10 20 30

Show 1000 playlists without the track details:

% python src/show.py --compact 300-1300 

Show all playlists:

% python src/show.py --compact 0-1000000

How was the dataset built

The Million Playlist Dataset is created by sampling playlists from the billions of playlists that Spotify users have created over the years. Playlists that meet the following criteria are selected at random:

Additionally, some playlists have been modified as follows:

Playlists are sampled randomly, for the most part, but with some dithering to disguise the true distribution of playlists within Spotify. Paper tracks may be added to some playlists to help us identify improper/unlicensed use of the dataset.

Overall demographics of users contributing to the MPD

Gender

Age

Country

Who built the dataset

The million playlist dataset was built by the following researchers @ Spotify: