How can I speed up Spotipy API calls for millions of records?



I'm trying to get 4.5 years of audio-feature data for the Spotify Top 200 charts. That's 68 countries plus the global chart, so roughly 20 million records in total. I'm querying a SQLite database that holds all of this data. This is preparation for a data-analysis project, and I've currently limited my scope to the third Friday of every month, because the fastest I can pull a full day of audio features for the charts is 15.8 minutes. That works out to 18.5 days of straight processing to get all 1,701 days.

Does anyone see a way I could make this faster? I'm currently calling the spotipy audio_features() function once per track id. The function accepts up to 100 ids at a time, but I'm not sure whether that would be faster.
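For what it's worth, spotipy's audio_features() accepts a list of up to 100 track IDs per call, so batching should cut the number of HTTP round trips by up to 100x. A minimal sketch of the chunking, with the track_ids list and the spotify client assumed from the script below:

```python
def chunks(seq, size=100):
    """Yield successive size-length slices of seq (audio_features caps out at 100 IDs)."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Hypothetical usage with the spotify client from the script below:
# for batch in chunks(track_ids):
#     features = spotify.audio_features(batch)  # one request for up to 100 tracks
```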

Here's an example entry before processing:

column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781)

After processing:

column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend', 'streams', 'track_id', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
('You Were Right', 179, '2017-01-20', 'RÜFÜS DU SOL', 'https://open.spotify.com/track/77lqbary6vt1DSc1MBN6sx', 'Australia', 'top200', 'NEW_ENTRY', 14781, '77lqbary6vt1DSc1MBN6sx', 0.708, 0.793, 5, -5.426, 0, 0.0342, 0.0136, 0.00221, 0.118, 0.734, 122.006, 239418, 4)

Full script:

import sqlite3
import os
import spotipy
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
from requests.exceptions import ReadTimeout
from datetime import datetime


def date_range_checker(cursor, start_date, end_date):
    """Gets the third Friday of each month and checks that the date exists in the database."""
    # Build the range for that year. The data runs until 2021.
    date_range = pd.date_range(start_date, end_date, freq='WOM-3FRI')
    cursor.execute("""SELECT DISTINCT Date(date) FROM charts""")
    sql_date_fetch = cursor.fetchall()
    sql_dates = [r[0] for r in sql_date_fetch]
    validated_dates = []
    for date in date_range:
        if str(date)[0:-9] in sql_dates:
            validated_dates.append(str(date)[0:-9])
    return validated_dates


def main():
    """Connects to the database. For each date in validated_dates, queries all the records with
    that date, then splits the track ID out of the Spotify link into a new list of tuples.
    For each tuple in that list, calls the Spotify API with the track ID. Finally builds a
    numpy array from the whole list so the csv converter can be used."""
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Main Function - start time: {start_time}')
    print("working on it...")
    dbname = 'charts.db'
    if os.path.exists(dbname):
        db = sqlite3.connect(dbname, isolation_level=None)
        cursor = db.cursor()
        # Selects the 3rd Friday of each month because one full day takes about 15.8 minutes.
        # That's 14.2 hours total for one Friday a month across all 4.5 years, versus
        # 18.6 full days of processing for every single one of the 1701 days.
        # Fridays are the preferred release day in the industry. Cite this later.
        validated_dates = date_range_checker(cursor, '2017-02-01', '2017-12-31')  # change year here
        column_names = ['title', 'rank', 'date', 'artist', 'url', 'region', 'chart', 'trend',
                        'streams', 'track_id', 'danceability', 'energy', 'key', 'loudness',
                        'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
                        'valence', 'tempo', 'duration_ms', 'time_signature']

        for date_chosen in validated_dates:
            cursor.execute("""SELECT * FROM charts WHERE Date("date") = ?""", (date_chosen,))
            db_result = cursor.fetchall()
            data_with_track_ids = []
            final_data = []
            # Splits the ID from the Spotify link.
            for entry in db_result:
                track_id = entry[4].split('/')[-1]
                entry += (track_id,)
                data_with_track_ids.append(entry)
            print("I've got all the track IDs. Will start calls to Spotify API now.")
            # Calls Spotify with the newly extracted track_id.
            for entry in data_with_track_ids:
                track_id = entry[-1]
                try:
                    audio_features = spotify.audio_features(track_id)
                except ReadTimeout:
                    print('Spotify timed out... trying again...')
                    audio_features = spotify.audio_features(track_id)
                entry += (audio_features[0]['danceability'], audio_features[0]['energy'],
                          audio_features[0]['key'], audio_features[0]['loudness'],
                          audio_features[0]['mode'], audio_features[0]['speechiness'],
                          audio_features[0]['acousticness'], audio_features[0]['instrumentalness'],
                          audio_features[0]['liveness'], audio_features[0]['valence'],
                          audio_features[0]['tempo'], audio_features[0]['duration_ms'],
                          audio_features[0]['time_signature'])
                final_data.append(entry)
            np_data = np.array(final_data)
            my_dataframe = pd.DataFrame(np_data, columns=column_names)
            my_dataframe.to_csv(f'spotify_csv_data/spotify_top_200 {date_chosen}.csv')
            now_end = datetime.now()
            end_time = now_end.strftime("%H:%M:%S")
            print(f'Main Function - Start time: {start_time}. End time: {end_time}.')
            print(f'The date {date_chosen} took {now_end - now_start} to run.')

        db.close()


if __name__ == "__main__":
    now_start = datetime.now()
    start_time = now_start.strftime("%H:%M:%S")
    print(f'Script - start time: {start_time}')

    os.environ['SPOTIPY_CLIENT_ID'] = 'ENTER YOUR CLIENT_ID'
    os.environ['SPOTIPY_CLIENT_SECRET'] = 'ENTER YOUR CLIENT_SECRET'
    # Allows for retries; seems to be enough that it doesn't crash. Leave these set.
    spotify = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials(),
                              requests_timeout=10, retries=10)

    main()
    now_end = datetime.now()
    end_time = now_end.strftime("%H:%M:%S")
    print(f'Script - Start time: {start_time}. End time: {end_time}.')
    print(f'This script took {now_end - now_start} to run.')

A few ideas for improving performance:

  1. Use parallel processing

Since you're using Python, the code as written runs single-threaded.

With Python's multiprocessing library you could, for example, run 4 instances of the same code, each with an equal share of the start/end dates. That could speed up your data processing by roughly 4x. You just need to write the data in a way that avoids overlap.

Note: if you get rate-limited by the Spotify API (and you most likely will), you can use a different API key for each instance (create multiple accounts or borrow a friend's key).
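A sketch of the date-splitting, assuming a hypothetical process_dates() worker that wraps your existing per-date loop:

```python
def split_dates(dates, n_workers):
    """Deal the validated dates round-robin into n_workers disjoint lists,
    one per process, so no two workers ever write the same date's CSV."""
    return [dates[i::n_workers] for i in range(n_workers)]

# Hypothetical usage; process_dates(dates) would run the existing per-date loop:
# from multiprocessing import Pool
# with Pool(4) as pool:
#     pool.map(process_dates, split_dates(validated_dates, 4))
```

Because each worker gets a disjoint set of dates and each date writes its own CSV file, the processes never write to the same file.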

  2. SQL query optimization

It's worth looking at your queries to see whether anything is off there. I'm not personally familiar with SQL, so this is just an idea to explore.

  3. Profile your program to learn more

See "How do I profile a Python script?"
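For example, a minimal cProfile/pstats run, with a stand-in work() function in place of your main():

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for main(); profile your real entry point the same way.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Print the ten most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(10)
report = out.getvalue()
print(report)
```

The report will tell you whether the time goes to the API calls, the SQL, or the dataframe building.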

  4. Use some kind of caching technique to avoid redundant API calls and to avoid inserting duplicate data. (See a potential solution using ids_seen in the last code block below.)


# Splits ID from Spotify link.
for entry in db_result:
    track_id = entry[4].split('/')[-1]
    entry += (track_id,)
    data_with_track_ids.append(entry)
In this code, what type is entry? How large is db_result?

One more thing worth mentioning about this code of yours:


# Calls to spotify with the new extracted track_id
for entry in data_with_track_ids:
    track_id = entry[-1]
    try:
        audio_features = spotify.audio_features(track_id)
    except ReadTimeout:
        print('Spotify timed out... trying again...')
        audio_features = spotify.audio_features(track_id)
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'],
              audio_features[0]['key'], audio_features[0]['loudness'],
              audio_features[0]['mode'], audio_features[0]['speechiness'],
              audio_features[0]['acousticness'], audio_features[0]['instrumentalness'],
              audio_features[0]['liveness'], audio_features[0]['valence'],
              audio_features[0]['tempo'], audio_features[0]['duration_ms'],
              audio_features[0]['time_signature'])

    final_data.append(entry)

In the try-except block you're making a request for every entry in data_with_track_ids. How many elements are in that data structure? If you hammer the API like that, expect to be throttled by Spotify's servers and to see timeouts.

You should add a short wait after a timeout to reduce the chance of being rate-limited or getting your IP banned. Oh wait, it looks like retries are set up and handled automatically in the spotipy source code when you initialize the spotify variable.

Edit

Here is one way to avoid making redundant requests, using Python's set data structure as your "cache":

from spotipy.exceptions import SpotifyException  # needed for the except clause below

# Calls to spotify with the new extracted track_id
ids_seen = set()
for entry in data_with_track_ids:
    track_id = entry[-1]
    if track_id not in ids_seen:
        try:
            # Retries are already built in; they're defined on the spotify variable in __main__.
            audio_features = spotify.audio_features(track_id)
        except SpotifyException as se:
            print('Spotify timed out... Maximum retries exceeded... moving on to next track_id...')
            print("TRACK ID IS: {}".format(track_id))
            print("Error details: {}".format(se))
            ids_seen.add(track_id)
            continue
        # On success, add the track id to ids_seen.
        ids_seen.add(track_id)
    else:
        print("We have seen this ID before... ID = {}".format(track_id))
        continue  # skips the remaining instructions and starts the next iteration
    entry += (audio_features[0]['danceability'], audio_features[0]['energy'],
              audio_features[0]['key'], audio_features[0]['loudness'],
              audio_features[0]['mode'], audio_features[0]['speechiness'],
              audio_features[0]['acousticness'], audio_features[0]['instrumentalness'],
              audio_features[0]['liveness'], audio_features[0]['valence'],
              audio_features[0]['tempo'], audio_features[0]['duration_ms'],
              audio_features[0]['time_signature'])

    final_data.append(entry)

If you're limited to, say, 1000 requests per day, then just sleep the program for 24 hours, or stop the program (saving the current iteration and data context) and run it again once more requests are allowed. See https://developer.spotify.com/documentation/web-api/guides/rate-limits/
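A minimal sketch of the stop-and-resume idea, with an invented progress.json checkpoint file (ISO-format dates compare correctly as strings):

```python
import json
import os

def save_progress(date_chosen, state_file='progress.json'):
    """Record the last completed date so a stopped run can resume later."""
    with open(state_file, 'w') as f:
        json.dump({'last_done': date_chosen}, f)

def remaining_dates(validated_dates, state_file='progress.json'):
    """Return only the dates that still need processing."""
    if not os.path.exists(state_file):
        return validated_dates
    with open(state_file) as f:
        last_done = json.load(f)['last_done']
    return [d for d in validated_dates if d > last_done]
```

Call save_progress(date_chosen) at the end of each date's loop iteration, and iterate over remaining_dates(validated_dates) instead of validated_dates when the script starts.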

Profile, profile, profile. But the bottleneck is likely to be Spotify's API. You can parallelise to speed up fetching, but they won't thank you for it, and you'll probably find yourself rate-limited if you do it too much. So profile and see what is taking the time, but be prepared to cut down your data set.

Ask yourself what you can do to speed the algorithm up:

  • Can you fetch just the top N hits?
  • Do you really need all that data?
  • Is any of the data duplicated?

Even if no data is duplicated, create a local cache indexed by track_id and store every response in it. Rather than requesting from the Spotify endpoint, look the track up in the cache (store the data in another sqlite database, or another table in the same one). If nothing comes back, fetch the data, save it to the cache, and return it.

  • If you are doing redundant lookups, it will be faster.
  • Even if you aren't, if you change something and need to run the whole lot again, you can re-run it very quickly (at least relative to the current speed).

So cache, profile, and look at your algorithm.

You're calling spotify.audio_features(track_id) for every track, even if you've already fetched its data. Each Friday's results should only introduce a handful of new songs, but you're re-fetching the information for all 200. Don't do that. Make another database table for song information. After fetching a track_id's information, write it to that table. Before fetching a track_id's information, check whether it's already stored there. Then you'll only make the API calls that are actually necessary, instead of 200 * num_weeks * num_countries.
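A sketch of that check-before-fetch filter, assuming a hypothetical track_features table:

```python
import sqlite3

db = sqlite3.connect(':memory:')  # in practice: your charts.db
db.execute("CREATE TABLE track_features (track_id TEXT PRIMARY KEY)")  # plus feature columns

def new_track_ids(db, track_ids):
    """Return only the IDs not already stored in track_features,
    preserving order and dropping same-day duplicates."""
    stored = {row[0] for row in db.execute("SELECT track_id FROM track_features")}
    fresh, seen = [], set()
    for tid in track_ids:
        if tid not in stored and tid not in seen:
            fresh.append(tid)
            seen.add(tid)
    return fresh
```

Fetch audio features only for the IDs this returns, write the results into track_features, then join the chart rows against that table when building each day's CSV.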
