Exploding a list of many dictionaries in a Pandas DataFrame column

>I have a dataset that looks like this (in a DataFrame):

**_id** **paper_title**   **references**                                                                  **full_text**
1         XYZ              [{'abc':'something','def':'something'},{'def':'something'},...many others]       something
2         XYZ              [{'abc':'something','def':'something'},{'def':'something'},...many others]       something
3         XYZ              [{'abc':'something'},{'def':'something'},...many others]                         something

Expected:

**_id** **paper_title**   **abc**    **def**      **full_text**
1         XYZ               something  something    something
                            something  something
                            .
                            .
                            (all the dicts in the list for this _id)
2         XYZ               something  something    something
                            something  something
                            .
                            .
                            (all the dicts in the list for this _id)

I tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the lists and dictionaries into DataFrame columns, but it did not help because it does not split the dictionaries.

The first row of my DataFrame: df.head(1)

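Assuming the original DataFrame has a column holding lists of dictionaries, each with a single key:value pair under a key named "reference":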

print(df)                                                                                                                                
id paper_title                                         references       full_text
0   1         xyz  [{'reference': 'description1'}, {'reference': ...       some text
1   2         xyz  [{'reference': 'descriptiona'}, {'reference': ...       more text
2   3         xyz  [{'reference': 'descriptioni'}, {'reference': ...  even more text

Then you can use concat to pull the references out into their own DataFrame while keeping the original row index:

df1 = pd.concat([pd.DataFrame(i) for i in df['references']], keys = df.index).reset_index(level=1,drop=True)
print(df1)                                                                                                                               
reference
0    description1
0    description2
0    description3
1    descriptiona
1    descriptionb
1    descriptionc
2    descriptioni
2   descriptionii
2  descriptioniii
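
To try the concat step in isolation, here is a minimal, self-contained sketch. The sample values are made up for illustration and only mimic the shape of the frame printed above; `parts` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame printed above.
df = pd.DataFrame({
    "id": [1, 2],
    "paper_title": ["xyz", "xyz"],
    "references": [
        [{"reference": "description1"}, {"reference": "description2"}],
        [{"reference": "descriptiona"}],
    ],
    "full_text": ["some text", "more text"],
})

# Build one small frame per row; keys=df.index records which
# original row each small frame came from (outer index level).
parts = [pd.DataFrame(refs) for refs in df["references"]]
df1 = pd.concat(parts, keys=df.index)

# Drop the inner 0..n-1 level so only the original row label remains.
df1 = df1.reset_index(level=1, drop=True)
print(df1)
```

The repeated index labels are intentional: they are what lets the result be matched back to the original rows afterwards.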

Then use DataFrame.join to join the new columns back onto the remaining columns by index:

df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)                                                                                                                                
id paper_title       full_text       reference
0   1         xyz       some text    description1
1   1         xyz       some text    description2
2   1         xyz       some text    description3
3   2         xyz       more text    descriptiona
4   2         xyz       more text    descriptionb
5   2         xyz       more text    descriptionc
6   3         xyz  even more text    descriptioni
7   3         xyz  even more text   descriptionii
8   3         xyz  even more text  descriptioniii
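
The same two steps also appear to cover the case in the question, where the dictionaries in a list do not all share the same keys. The sketch below uses made-up values ('a1', 'd1', and so on) purely for illustration; keys that a dictionary lacks end up as NaN.

```python
import pandas as pd

# Hypothetical data: dictionaries within a list have different keys.
df = pd.DataFrame({
    "_id": [1, 2],
    "paper_title": ["XYZ", "XYZ"],
    "references": [
        [{"abc": "a1", "def": "d1"}, {"def": "d2"}],
        [{"abc": "a3"}],
    ],
    "full_text": ["text one", "text two"],
})

# Step 1: one frame per row, stacked, labelled with the original row index.
refs = pd.concat(
    [pd.DataFrame(r) for r in df["references"]], keys=df.index
).reset_index(level=1, drop=True)

# Step 2: join aligns on the index, so each original row is
# repeated once for every dictionary it contained.
out = df.drop(columns="references").join(refs).reset_index(drop=True)
print(out)
```

Each dictionary becomes one row, and each distinct key becomes one column.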

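After reading a lot of the pandas documentation, I found that explode followed by apply(pd.Series) was the simplest way to get what I was looking for in the question.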

Here is the code:

df = df.explode('references')

# Explodes each list into one row per element of that column (column name taken from the question's table)

df = df['references'].apply(pd.Series).merge(df, left_index=True, right_index=True, how='outer')

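# Expands each dictionary into its own columns, then merges the result with the original DataFrame on the index (an outer join, like A ∪ B in set theory)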

Side note: when merging, check for unique values in the columns, because many columns will contain duplicated values.
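
One likely source of those duplicates: explode keeps the original row label on every new row, so an index-based merge can pair each exploded row with every other row that shares its label. A possible variant, offered as a sketch rather than the code above, resets the index after exploding and swaps in pd.json_normalize in place of apply(pd.Series). It assumes pandas 1.0 or later and that every list is non-empty; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical data shaped like the frame in the question.
df = pd.DataFrame({
    "_id": [1, 2],
    "paper_title": ["XYZ", "XYZ"],
    "references": [
        [{"abc": "a1", "def": "d1"}, {"def": "d2"}],
        [{"abc": "a3"}],
    ],
    "full_text": ["text one", "text two"],
})

# One row per dictionary; reset the index so every row has a unique label.
exploded = df.explode("references").reset_index(drop=True)

# Turn the dictionaries into columns; keys a dictionary lacks become NaN.
expanded = pd.json_normalize(exploded["references"].tolist())

# Both frames now share the same unique 0..n-1 index,
# so join pairs the rows one to one.
out = exploded.drop(columns="references").join(expanded)
print(out)
```

With a unique index on both sides, the row count stays equal to the total number of dictionaries.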

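I hope this helps anyone working with a DataFrame or Series in which a column holds lists of multiple dictionaries, and who wants to split those lists into new columns with the values as rows.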
