我正在分析ProPublica在这里发布的Facebook政治广告。
这就是我的意思。我有一整列目标,我想分析,但它的格式对我这种技能水平的人来说是非常难以访问的。
这只来自一个单元格:[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
另一个:[{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]
我需要做的是将每个";目标";项成为列标签,每个列标签对应";段";成为该列中的一个可能值。
或者,解决方案是创建一个函数来调用每行中的每个字典键来计数频率吗?
- 列为
dicts
的lists
。- 可以使用
pandas.explode()
将list
中的每个dict
移动到单独的列中 - 通过使用
pandas.json_normalize()
、.join()
将dicts
的列转换回df
,将其转换为数据帧,其中键是列标题,值是观测值
- 可以使用
- 使用
.drop()
删除不需要的列 - 如果该列包含字符串dict列表(例如
"[{key: value}]"
(,请参阅将Pandas列中的dictionary/list拆分为Separate Columns中的此解决方案,并使用:df.col2 = df.col2.apply(literal_eval)
和from ast import literal_eval
import pandas as pd
# create sample dataframe
df = pd.DataFrame({'col1': ['x', 'y'], 'col2': [[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}], [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]]})
# display(df)
col1 col2
0 x [{'target': 'NAge', 'segment': '21 and older'}, {'target': 'MinAge', 'segment': '21'}, {'target': 'Retargeting', 'segment': 'people who may be similar to their customers'}, {'target': 'Region', 'segment': 'the United States'}]
1 y [{'target': 'NAge', 'segment': '18 and older'}, {'target': 'Location Type', 'segment': 'HOME'}, {'target': 'Interest', 'segment': 'Hispanic culture'}, {'target': 'Interest', 'segment': 'Republican Party (United States)'}, {'target': 'Location Granularity', 'segment': 'country'}, {'target': 'Country', 'segment': 'the United States'}, {'target': 'MinAge', 'segment': 18}]
# use explode to give each dict in a list a separate row
df = df.explode('col2', ignore_index=True)
# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
df = df.join(pd.json_normalize(df.col2)).drop(columns=['col2'])
display(df)
col1 target segment
0 x NAge 21 and older
1 x MinAge 21
2 x Retargeting people who may be similar to their customers
3 x Region the United States
4 y NAge 18 and older
5 y Location Type HOME
6 y Interest Hispanic culture
7 y Interest Republican Party (United States)
8 y Location Granularity country
9 y Country the United States
10 y MinAge 18
获取count
- 如果目标是为每个
'target'
和相关的'segment'
获取count
counts = df.groupby(['target', 'segment']).count()
已更新
- 此更新是为完整文件实现的
import pandas as pd
from ast import literal_eval
# load the file
df = pd.read_csv('en-US.csv')
# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')
# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')
# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)
# use explode to give each dict in a list a separate row
df = df.explode('targets', ignore_index=True)
# fillna with {} is required for json_normalize
df.targets = df.targets.fillna({i: {} for i in df.index})
# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
normalized = pd.json_normalize(df.targets)
# get the counts
counts = normalized.groupby(['target', 'segment']).segment.count().reset_index(name='counts')