在DataFrame中添加zenodo的Spotify数据



我想将来自https://doi.org/10.5281/zenodo.4778562的charts.zip中的所有数据添加到一个DataFrame中。数据由每年一个文件组成,其中包含多个csv。我编写了以下代码:

header = 0
dfs = []
for file in glob.glob('Charts/*/201?/*.csv'):
region = file.split('/')[1]
dates = re.findall('d{4}-d{2}-d{2}', file.split('/')[-1])
weekly_chart = pd.read_csv(file, header=header, sep='t')
weekly_chart['week_start'] = datetime.strptime(dates[0], '%Y-%m-%d')
weekly_chart['week_end'] = datetime.strptime(dates[1], '%Y-%m-%d')
weekly_chart['region'] = region
dfs.append(weekly_chart)
all_charts = pd.concat(dfs)

但是,当我运行它,python返回:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_12886/3473678833.py in <module>
9     weekly_chart['region'] = region
10     dfs.append(weekly_chart)
---> 11 all_charts = pd.concat(dfs)
~/Downloads/enter/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309                     stacklevel=stacklevel,
310                 )
--> 311             return func(*args, **kwargs)
312 
313         return wrapper
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
344     ValueError: Indexes have overlapping values: ['a']
345     """
--> 346     op = _Concatenator(
347         objs,
348         axis=axis,
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
401 
402         if len(objs) == 0:
--> 403             raise ValueError("No objects to concatenate")
404 
405         if keys is None:
ValueError: No objects to concatenate

我该如何修复它?

我认为glob.glob可能只是把事情复杂化了…这对我来说很合适。

# Gives you a list of EVERY file in the Charts directory 
# and sub directories that is a CSV
file_list = []
for path, subdirs, files in os.walk("Charts"):
file_list.extend([os.path.join(path, x) for x in files if x.endswith('.csv')])
dfs = []
for file in file_list:
region = file.split('/')[1]
dates = re.findall('d{4}-d{2}-d{2}', file.split('/')[-1])
df = pd.read_csv(file, sep='t')
df['week_start'] = dates[0]
df['week_end'] = dates[1]
df['region'] = region
dfs.append(df)
all_charts = pd.concat(dfs, ignore_index=True)
print(all_charts)

输出:

position                 song_id                                          song_name           artist   streams  ...  peak_position  position_status  week_start    week_end  region
0              1  7wGoVu4Dady5GV0Sv4UIsx                                           rockstar      Post Malone  17532665  ...              1                0  2017-10-20  2017-10-27      us
1              2  75ZvA4QfFiZvzhj2xkaWAh                                       I Fall Apart      Post Malone   8350785  ...              2                0  2017-10-20  2017-10-27      us
2              3  2fQrGHiQOvpL9UgPvtYy6G                                       Bank Account        21 Savage   7589124  ...              3                1  2017-10-20  2017-10-27      us
3              4  43ZyHQITOjhciSUUNPVRHc                                         Gucci Gang         Lil Pump   7584237  ...              4                1  2017-10-20  2017-10-27      us
4              5  5tz69p7tJuGPeMGwNTxYuV                                     1-800-273-8255            Logic   7527770  ...              1               -2  2017-10-20  2017-10-27      us
...          ...                     ...                                                ...              ...       ...  ...            ...              ...         ...         ...     ...
273595       196  6kex4EBAj0WHXDKZMEJaaF         Swalla (feat. Nicki Minaj & Ty Dolla $ign)     Jason Derulo   3747830  ...              8               -5  2018-03-02  2018-03-09  global
273596       197  0CokSRCu5hZgPxcZBaEzVE                       Glorious (feat. Skylar Grey)       Macklemore   3725286  ...             14               -8  2018-03-02  2018-03-09  global
273597       198  7oK9VyNzrYvRFo7nQEYkWN                                     Mr. Brightside      The Killers   3717326  ...            148               -3  2018-03-02  2018-03-09  global
273598       199  7EUfNvyCVxQV3oN5ScA2Lb                                         Next To Me  Imagine Dragons   3681739  ...            122              -77  2018-03-02  2018-03-09  global
273599       200  6u0EAxf1OJTLS7CvInuNd7  Vai malandra (feat. Tropkillaz & DJ Yuri Martins)           Anitta   3676542  ...             30              -23  2018-03-02  2018-03-09  global

如果您真的希望日期是日期,您可以在最后的两列上运行此命令。

all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])

就我个人而言,我也会这样做:

all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
all_charts['week_end'] = pd.to_datetime(all_charts['week_end'])
all_charts['region'] = all_charts['region'].astype('category')
all_charts['artist'] = all_charts['artist'].astype('category')
all_charts['song_name'] = all_charts['song_name'].astype('category')
all_charts['song_id'] = all_charts['song_id'].astype('category')
all_charts.set_index(['region', 'week_start', 'week_end', 'position'], inplace=True)
all_charts.position_status = pd.to_numeric(all_charts.position_status, errors='coerce')
print(df.head(10))

给:

song_id           song_name          artist   streams  last_week_position  weeks_on_chart  peak_position  position_status
region week_start week_end   position
us     2017-10-20 2017-10-27 1         7wGoVu4Dady5GV0Sv4UIsx            rockstar     Post Malone  17532665                 1.0               3              1              0.0
2         75ZvA4QfFiZvzhj2xkaWAh        I Fall Apart     Post Malone   8350785                 2.0               6              2              0.0
3         2fQrGHiQOvpL9UgPvtYy6G        Bank Account       21 Savage   7589124                 4.0               5              3              1.0
4         43ZyHQITOjhciSUUNPVRHc          Gucci Gang        Lil Pump   7584237                 5.0               3              4              1.0
5         5tz69p7tJuGPeMGwNTxYuV      1-800-273-8255           Logic   7527770                 3.0              26              1             -2.0
6         5Gd19NupVe5X8bAqxf9Iaz            Gorgeous    Taylor Swift   6940802                 NaN               1              6              NaN
7         0ofbQMrRDsUaVKq2mGLEAb              Havana  Camila Cabello   6623184                10.0              12              7              3.0
8         2771LMNxwf62FTAdpJMQfM        Bodak Yellow         Cardi B   6472727                 6.0              14              3             -2.0
9         5Z3GHaZ6ec9bsiI5BenrbY  Young Dumb & Broke          Khalid   5982108                 9.0              29              6              0.0
10        7GX5flRQZVHRAGd6B4TmDO       XO Tour Llif3    Lil Uzi Vert   5822583                 8.0               9              2             -2.0

最新更新