How do I write a for loop to load pickle files?



I am trying to use a for loop to automatically load 12 pickle files with similar names.

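I have AirBnB data for 3 different cities (Jersey City, New York City, and Rio de Janeiro), and each city has 4 types of file (listings, calendar, locale, and reviews). That makes 12 files in total, with very similar names (city_fileType.pkl).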

jc_listings.pkl, jc_calendar.pkl, jc_locale.pkl, jc_reviews.pkl      # Jersey City dataset
nyc_listings.pkl, nyc_calendar.pkl, nyc_locale.pkl, nyc_reviews.pkl  # New York City dataset
rio_listings.pkl, rio_calendar.pkl, rio_locale.pkl, rio_reviews.pkl  # Rio de Janeiro dataset

I am trying to load these files automatically.

When I load a single file like this, it works fine:

path_data = '../Data/' # local path
jc_listings = pd.read_pickle(path_data+'jc_listings.pkl')
jc_listings.info()

But when I try to automate this, it does not work as expected. Here is what I am trying:

# load data
path_data = '../Data/'
# list of all data names
city_data = ['jc_listings', 'jc_calendar', 'jc_locale', 'jc_reviews',
             'nyc_listings', 'nyc_calendar', 'nyc_locale', 'nyc_reviews',
             'rio_listings', 'rio_calendar', 'rio_locale', 'rio_reviews']
# loop to load all the data with respective name
for city in city_data:
    data_name = city
    print(data_name)  # just to inspect and troubleshoot
    city = pd.read_pickle(path_data + data_name + '.pkl')
    print(type(city))  # just to inspect and troubleshoot

This runs without errors, and the printed output looks fine. However, when I try

rio_reviews.info()

I get the following error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [37], line 3
1 # inspecting the data
----> 3 rio_reviews.info()
NameError: name 'rio_reviews' is not defined

I would suggest a different approach:

import pandas as pd
from pathlib import Path

data = Path('../Data')
cities = ['jc', 'nyc', 'rio']
files = ['listings', 'calendar', 'locale', 'reviews']

dfs = {}
for city in cities:
    dfs[city] = {}  # create the inner dict first; otherwise dfs[city][file] raises KeyError
    for file in files:
        dfs[city][file] = pd.read_pickle(data / f'{city}_{file}.pkl')

This gives you a dictionary dfs from which you can access each city's data, like so:

dfs['jc']['listings'].info()
dfs['rio']['reviews'].info()

…and so on for the other cities and file types.
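
Once everything sits in a nested dictionary, you can also loop over it instead of naming each entry by hand. The following is only a minimal sketch: it fills dfs with placeholder strings rather than real DataFrames so that it runs without the data files, and the visited list is a hypothetical name used purely for illustration.

```python
# Placeholder structure shaped like dfs; real values would be DataFrames.
cities = ['jc', 'nyc', 'rio']
files = ['listings', 'calendar', 'locale', 'reviews']
dfs = {city: {file: f'{city}_{file}' for file in files} for city in cities}

# Visit every (city, file) pair without spelling out the twelve names.
visited = []
for city, tables in dfs.items():
    for file, table in tables.items():
        visited.append((city, file))
        # With real DataFrames you could call table.info() here.
```

Dictionaries keep insertion order, so the pairs come out in the same order the files were loaded.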

We can simplify the code further with itertools.product:

import pandas as pd
from pathlib import Path
from itertools import product

data = Path('../Data')
cities = ['jc', 'nyc', 'rio']
files = ['listings', 'calendar', 'locale', 'reviews']

dfs = {}
for city, file in product(cities, files):
    # setdefault creates the inner dict the first time a city is seen
    dfs.setdefault(city, {})[file] = pd.read_pickle(data / f'{city}_{file}.pkl')
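
If you want to reuse this, the loop could be wrapped in a small function. This is a sketch under a few assumptions, not part of the original answer: load_city_data, _stdlib_loader, and the loader parameter are hypothetical names, and the default loader uses the standard library's pickle module so the example runs without pandas. It also records files that are missing instead of stopping at the first one.

```python
import pickle
from itertools import product
from pathlib import Path


def _stdlib_loader(path):
    """Hypothetical fallback loader based on the standard pickle module."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)


def load_city_data(data_dir, cities, files, loader=_stdlib_loader):
    """Return (dfs, missing): a nested dict of loaded objects and a list of absent paths."""
    data_dir = Path(data_dir)
    dfs = {}
    missing = []
    for city, file in product(cities, files):
        path = data_dir / f'{city}_{file}.pkl'
        if not path.exists():
            missing.append(path)  # remember the gap and keep going
            continue
        dfs.setdefault(city, {})[file] = loader(path)
    return dfs, missing
```

With pandas you would presumably call it as dfs, missing = load_city_data('../Data', cities, files, loader=pd.read_pickle) and then check missing before relying on any particular entry.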

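It looks like every loaded DataFrame is assigned to the variable city, which is overwritten on each pass through the loop, and no variable named rio_reviews is ever defined. That is why you get this error.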
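
To make the point concrete, here is a small sketch that uses placeholder strings in place of pd.read_pickle so it runs without the data files. The names last_value and frames are hypothetical. Assigning to the loop variable only rebinds that one name, whereas storing each result under its string key keeps all of them.

```python
city_data = ['jc_listings', 'nyc_listings', 'rio_reviews']

# Pattern from the question: the loop variable is rebound on every pass,
# so no variable called jc_listings or rio_reviews is ever created.
for city in city_data:
    city = f'loaded:{city}'  # stand-in for pd.read_pickle(...)

last_value = city  # only the object from the final iteration survives

# Alternative: keep each result in a dictionary keyed by its name.
frames = {}
for name in city_data:
    frames[name] = f'loaded:{name}'  # stand-in for pd.read_pickle(...)
```

With real data, frames['rio_reviews'].info() would then take the place of rio_reviews.info().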
