Efficient way to restructure a dictionary of DataFrames

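I have a dictionary filled with several DataFrames. I am looking for an efficient way to change its key structure, but the solution I found is rather slow once more or larger DataFrames are involved. That is why I would like to ask whether anyone knows a more convenient, efficient, or faster approach than mine. To start, I created this example to show my starting point: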

import pandas as pd
import numpy as np

# team names used as dictionary keys
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}

# fill the dictionary with random entries, one DataFrame per team
for t1 in teams:
    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30),
                                  'Goals': pd.Series(np.random.randint(0, 5, size=30)),
                                  'Chances': pd.Series(np.random.randint(0, 15, size=30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size=30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size=30))})
    # use the dates as the (unnamed) index
    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None

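I now have a dictionary in which each key is a team, so there is one DataFrame per team holding its match performance over time. I would prefer to restructure this dictionary so that the keys are dates rather than teams. That is, I want one DataFrame per date, filled with each team's performance on that date. I managed this with the following code, which works but becomes very slow once I add more teams and performance metrics: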

# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index=teams, columns=perf)
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

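Because I am using a nested loop, restructuring the dictionary is slow. Does anyone know how I could improve the second piece of code? I am not necessarily looking only for a solution, but also for the logic or ideas behind doing this better.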

Thanks in advance, any help would be much appreciated.

Creating Pandas DataFrames is (surprisingly) very slow, and so is direct indexing.

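Copying a DataFrame, on the other hand, is surprisingly fast. You can therefore build one empty reference DataFrame and copy it as many times as needed. Here is the code: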

# assumes `teams` and `dic_teams` from the question
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()

# empty template, built once and copied for every date
zygote = pd.DataFrame(index=teams, columns=perf)
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()
    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

This is about 2 times faster than the reference (the code in the question) on my machine.
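To check the speed-up on your own machine, here is a minimal timing sketch. It rebuilds a small data set like the one in the question and times both variants with `timeit`. The function names `build_reference` and `build_with_copy` are hypothetical, chosen only for this illustration, and the measured ratio will depend on your pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

teams = ["Arsenal", "Chelsea", "Manchester United"]
perf = ["Goals", "Chances", "Fouls", "Offside"]
dates = pd.date_range("20180101", periods=30)

# Random integer data, one DataFrame per team (same shape as in the question)
rng = np.random.default_rng(0)
dic_teams = {
    t: pd.DataFrame(rng.integers(0, 20, size=(len(dates), len(perf))),
                    index=dates, columns=perf)
    for t in teams
}

def build_reference():
    # Original approach: construct a new empty DataFrame for every date
    out = {}
    for d in dates:
        out[d] = pd.DataFrame(index=teams, columns=perf)
        for t in teams:
            out[d].loc[t] = dic_teams[t].loc[d]
    return out

def build_with_copy():
    # Variant: construct the empty DataFrame once and copy it per date
    template = pd.DataFrame(index=teams, columns=perf)
    out = {}
    for d in dates:
        out[d] = template.copy()
        for t in teams:
            out[d].loc[t] = dic_teams[t].loc[d]
    return out

t_ref = timeit.timeit(build_reference, number=5)
t_copy = timeit.timeit(build_with_copy, number=5)
print(f"reference: {t_ref:.3f}s, copy: {t_copy:.3f}s")
```

Both functions should produce the same dictionary; only the way the empty per-date frame is obtained differs.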

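Working around the slow direct indexing of DataFrames is trickier, but it can be done with numpy. We can convert the DataFrames into a single 3D numpy array, let numpy perform the transposition, and finally convert the slices back into DataFrames. Note that this approach assumes that all values are integers and that the input DataFrames are well-formed.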

Here is the final implementation:

# assumes `teams` and `dic_teams` from the question
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# Create a numpy array from the Pandas DataFrames.
# Assumes every DataFrame has the same `dates` index and `perf` columns, in the same order.
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId, tName in enumerate(teams):
    full[tId, :, :] = dic_teams[tName].to_numpy()

# New structure where key = date, created from the numpy array
for dId, dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:, dId, pId] for pId, pName in enumerate(perf)}, index=teams)

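This implementation is 6.4 times faster than the reference on my machine. Note that about 75% of the time is, sadly, spent in the pd.DataFrame calls. So if you want even faster code, stick with a plain 3D numpy array!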
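As a rough illustration of that last suggestion, here is a minimal sketch of keeping everything in one 3D numpy array instead of building a DataFrame per date. The names `by_date`, `date_pos`, `team_pos`, `perf_pos` and `lookup` are hypothetical helpers introduced here, not part of pandas or numpy. As above, it assumes that all DataFrames share the same index and columns in the same order and that all values are integers.

```python
import numpy as np
import pandas as pd

teams = ["Arsenal", "Chelsea", "Manchester United"]
perf = ["Goals", "Chances", "Fouls", "Offside"]
dates = pd.date_range("20180101", periods=30)

# Random integer data, one DataFrame per team (same shape as in the question)
rng = np.random.default_rng(0)
dic_teams = {
    t: pd.DataFrame(rng.integers(0, 20, size=(len(dates), len(perf))),
                    index=dates, columns=perf)
    for t in teams
}

# Stack into shape (team, date, perf), then swap the first two axes
# to get shape (date, team, perf)
full = np.stack([dic_teams[t].to_numpy() for t in teams])
by_date = full.transpose(1, 0, 2)

# Plain dictionaries map labels to positions along each axis
date_pos = {d: i for i, d in enumerate(dates)}
team_pos = {t: i for i, t in enumerate(teams)}
perf_pos = {p: i for i, p in enumerate(perf)}

def lookup(date, team, stat):
    """Return one value from the 3D array by label."""
    return by_date[date_pos[date], team_pos[team], perf_pos[stat]]

# All teams' statistics for one date, as a (team, perf) 2D array
day = by_date[date_pos[pd.Timestamp("2018-01-05")]]
print(day.shape)
print(lookup(pd.Timestamp("2018-01-05"), "Chelsea", "Fouls"))
```

The trade-off is that you lose label-based access and have to carry the position dictionaries around yourself, but no DataFrame is constructed in the restructuring step.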
