Table of Contents

Project Statement and Goals
Motivation and Background
Data Description
EDA
Data Cleaning
Metrics
Model Training
Interpreting the Model
Model Testing and Results
Conclusion and What’s Next
Literature Review

Data Description

Million Playlist Dataset:

This dataset contains 5.4 GB of data and was created in 2017. It contains the following information for each playlist and track within:

Playlist specific features:

Song specific features:

To get an idea of some broad features of the dataset, we plot some distributions. Visualizing the number of tracks in the playlists and the number of followers across the playlists was useful in thinking about how some modeling ideas may work.

Histogram of playlist length

from length_hist import length
#length is a dictionary that holds the playlist length distributions

histogram_data = []
for key in length:
    histogram_data += [key] * length[key]

fig, ax = plt.subplots(1,1, figsize = (20,12))
ax.hist(histogram_data, bins = 40, color = 'springgreen', alpha = .8)


ax.set_xlabel("Songs in Playlist", fontsize = 25, labelpad = 20)
ax.set_ylabel("Frequency", fontsize = 25, labelpad = 20)
ax.set_title("Distribution of Playlist Length", fontsize = 30, pad = 50)
ax.set_xlim(0,260)
ax.set_xticks(np.round(np.linspace(0, 250, 11)))

fig1 Playlist length approximately follows a power law distribution. The playlists in the data set were selected by Spotify to have between 5 and 250 tracks. Many playlists naturally have about 25-50 tracks with 66 being the average track count.

Histogram of playlist follower count

#followers is a dictionary with playlist follower histogram data
from followers import followers

followers_hist = []
for key in followers:
    followers_hist += [key] * followers[key]

fig, ax = plt.subplots(1,1, figsize = (20, 12))
ax.hist(followers_hist, bins = 1000, color = 'springgreen', alpha = .8)

titles = ["Linear X Axis", "Logorithmic X Axis"]
ax.set_xlim(5,3000)
ax.set_xlabel("Follower Count", fontsize = 25, labelpad = 20)
ax.set_ylabel("Frequency (log scale)", fontsize = 25, labelpad = 20)
ax.set_title("Playlist Follower Count Distribution (" + titles[0] + ")", fontsize = 30, pad = 50)
ax.set_yscale('symlog', linthreshy = 10)

fig2 Follower count also roughly follows a power law distribution. There are a handful of playlists with over 50,000 followers, but the large majority of playlists have only a few followers. We expect to see much more variance between playlists than would be the case if all sampled playlists had many followers.

Popular artists in MPD

top_artists = []
with open("/Users/johndaciuk/Desktop/Spotify Stats/top_artists.txt", 'r') as f:
    for line in f:
        info_dict = {}
        info = line.split()
        frequency = info[0]
        artist = ""
        for i in range(1, len(info)):
            if i == len(info) - 1:
                artist += str(info[i])
            else:
                artist += str(info[i]) + " "
        frequency = int(frequency)
        info_dict["frequency"] = frequency
        info_dict["artist"] = artist
        top_artists.append(info_dict)
top = pd.DataFrame(top_artists)
top.head(30)

fig, ax = plt.subplots(1,1, figsize = (25,8))
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
ax.bar(top["artist"].iloc[0:30], top["frequency"].iloc[0:30], color = "mediumspringgreen")

ax.set_ylabel("Frequency in MPD tracks", fontsize = 22, labelpad = 30)
ax.set_title("Top 30 Artists", fontsize = 25, pad = 50)

fig3 The most popular artist in our dataset is Drake.

Top playlist titles in MPD

top_titles = []
with open("/Users/johndaciuk/Desktop/Spotify Stats/top_playlist_titles.txt", 'r') as f:
    for line in f:
        info_dict = {}
        info = line.split()
        frequency, title = info[0], info[1]
        frequency = int(frequency)
        info_dict["frequency"] = frequency
        info_dict["playlist title"] = title
        top_titles.append(info_dict)
top = pd.DataFrame(top_titles)

fig, ax = plt.subplots(1,1, figsize = (25,6))
for tick in ax.get_xticklabels():
    tick.set_rotation(90)
ax.bar(top["playlist title"].iloc[0:50], top["frequency"].iloc[0:50], color = "mediumspringgreen")

ax.set_xlabel("Titles", fontsize = 20)
ax.set_ylabel("Frequency in MPD", fontsize = 20)
ax.set_title("Top 50 Playlist Titles", fontsize = 25, pad = 50)

fig4 The most popular playlist titles are country, chill and rap. Many playlist titles convey useful information about genre or mood. We will eagerly use playlist name information when training our model.


                  Summary Stats                                             Most popular songs in MPD

Only 1.9% of the playlists have descriptions, but that still leaves almost 19,000 playlists with descriptions that will likely be helpful. On average, each song appears in about 30 playlists, and, across all playlists, each unique artists has about 8 unique songs.