拆分熊猫数据帧列



我有一个带有"Innings"列的df,其中包含以下数据:

Innings
[{'InningID': 297503, 'GameID': 47547, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, {'InningID': 297504, 'GameID': 47547, 'InningNumber': 2, 'AwayTeamRuns': 2, 'HomeTeamRuns': 2}, {'InningID': 297505, 'GameID': 47547, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297506, 'GameID': 47547, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297507, 'GameID': 47547, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297508, 'GameID': 47547, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297509, 'GameID': 47547, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297510, 'GameID': 47547, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297512, 'GameID': 47547, 'InningNumber': 9, 'AwayTeamRuns': 1, 'HomeTeamRuns': None}]
[{'InningID': 297511, 'GameID': 47546, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297513, 'GameID': 47546, 'InningNumber': 2, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297514, 'GameID': 47546, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297515, 'GameID': 47546, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297516, 'GameID': 47546, 'InningNumber': 5, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297517, 'GameID': 47546, 'InningNumber': 6, 'AwayTeamRuns': 1, 'HomeTeamRuns': 3}, {'InningID': 297518, 'GameID': 47546, 'InningNumber': 7, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297519, 'GameID': 47546, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297520, 'GameID': 47546, 'InningNumber': 9, 'AwayTeamRuns': 1, 'HomeTeamRuns': 2}]
[{'InningID': 297521, 'GameID': 50022, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297522, 'GameID': 50022, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297523, 'GameID': 50022, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297524, 'GameID': 50022, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297525, 'GameID': 50022, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297526, 'GameID': 50022, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297527, 'GameID': 50022, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297528, 'GameID': 50022, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297529, 'GameID': 50022, 'InningNumber': 9, 'AwayTeamRuns': 3, 'HomeTeamRuns': 1}]
[{'InningID': 297530, 'GameID': 47556, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297532, 'GameID': 47556, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297534, 'GameID': 47556, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297536, 'GameID': 47556, 'InningNumber': 4, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297542, 'GameID': 47556, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297545, 'GameID': 47556, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297549, 'GameID': 47556, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297554, 'GameID': 47556, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297558, 'GameID': 47556, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]
[{'InningID': 297531, 'GameID': 47557, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297533, 'GameID': 47557, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297535, 'GameID': 47557, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297537, 'GameID': 47557, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297543, 'GameID': 47557, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297546, 'GameID': 47557, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297551, 'GameID': 47557, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 6}, {'InningID': 297560, 'GameID': 47557, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297566, 'GameID': 47557, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]
[{'InningID': 297539, 'GameID': 47549, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297540, 'GameID': 47549, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297544, 'GameID': 47549, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297548, 'GameID': 47549, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297553, 'GameID': 47549, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 5}, {'InningID': 297557, 'GameID': 47549, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297561, 'GameID': 47549, 'InningNumber': 7, 'AwayTeamRuns': 3, 'HomeTeamRuns': 0}, {'InningID': 297573, 'GameID': 47549, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297579, 'GameID': 47549, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]
[{'InningID': 297538, 'GameID': 47558, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297541, 'GameID': 47558, 'InningNumber': 2, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297547, 'GameID': 47558, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297552, 'GameID': 47558, 'InningNumber': 4, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297556, 'GameID': 47558, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 5}, {'InningID': 297562, 'GameID': 47558, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297568, 'GameID': 47558, 'InningNumber': 7, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297577, 'GameID': 47558, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297584, 'GameID': 47558, 'InningNumber': 9, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}]
[{'InningID': 297550, 'GameID': 47548, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297555, 'GameID': 47548, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297559, 'GameID': 47548, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297567, 'GameID': 47548, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297569, 'GameID': 47548, 'InningNumber': 5, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297578, 'GameID': 47548, 'InningNumber': 6, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297583, 'GameID': 47548, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297587, 'GameID': 47548, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297591, 'GameID': 47548, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297594, 'GameID': 47548, 'InningNumber': 10, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297597, 'GameID': 47548, 'InningNumber': 11, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}]
[]
[{'InningID': 297564, 'GameID': 47553, 'InningNumber': 1, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297572, 'GameID': 47553, 'InningNumber': 2, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297574, 'GameID': 47553, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297580, 'GameID': 47553, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297586, 'GameID': 47553, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297590, 'GameID': 47553, 'InningNumber': 6, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297596, 'GameID': 47553, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297599, 'GameID': 47553, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297603, 'GameID': 47553, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}]
[{'InningID': 297563, 'GameID': 47555, 'InningNumber': 1, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297570, 'GameID': 47555, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297575, 'GameID': 47555, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 5}, {'InningID': 297582, 'GameID': 47555, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, {'InningID': 297588, 'GameID': 47555, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, {'InningID': 297593, 'GameID': 47555, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297595, 'GameID': 47555, 'InningNumber': 7, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297598, 'GameID': 47555, 'InningNumber': 8, 'AwayTeamRuns': 1, 'HomeTeamRuns': 2}, {'InningID': 297601, 'GameID': 47555, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]

我需要将局数列分成许多(数字可以一行不同...... 我已经尝试了爆炸函数,但不起作用,或者因为字段的随机数而拆分。

最好的方法是创建新的列,例如:

InningID1 : xxx, InningNumber1: x, AwayTeamsRuns1: x, HomeTeamRuns1: x, InningID2 : xxx, InningNumber2: x, AwayTeamsRuns2: x, HomeTeamRuns2: x etc...

我不关心 GameId,因为我已经在 df 的同一行中有这些信息。

有什么建议吗?

非常感谢

杰弗里

你可以试试这个,先清理数据,将'替换为",将None转换为"None",然后遍历DataFrame,将json字符串转换为字典,使用json_normalize将字典转换为DataFrame,最后concat它们:

df['Innings'] = df['Innings'].str.replace("'", '"')
df['Innings'] = df['Innings'].str.replace("None", '"None"')
dfs = []
for i in range(0, df.shape[0]):
j_str = df.loc[i]['Innings']
x = json.loads(j_str)
dfs.append(pd.json_normalize(x))
df = pd.concat(dfs).reset_index(drop=['index'])
print(df)

输出:

InningID  GameID  InningNumber  AwayTeamRuns HomeTeamRuns
0     297503   47547             1             0            3
1     297504   47547             2             2            2
2     297505   47547             3             0            2
3     297506   47547             4             0            0
4     297507   47547             5             0            0
..       ...     ...           ...           ...          ...
87    297588   47555             5             0            3
88    297593   47555             6             0            0
89    297595   47555             7             1            0
90    297598   47555             8             1            2
91    297601   47555             9             0         None

这看起来是一个很好的机会来使用DictVectorizer

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
vectorizer = DictVectorizer()
dicts = df.explode("Innings")["Innings"]
data = vectorizer.fit_transform(dicts).todense()
innings_df = pd.DataFrame(data, columns=vectorizer.feature_names_)

缺少的键将自动填充零,请确保这是您想要的行为。

现在,如果您愿意,可以将innings_df与原始数据帧(df(连接/联接

最新更新