我管理一个PostgreSQL数据库,并正在为用户开发一个访问数据库子集的工具。数据库有很多列,此外,我们还使用大量的hstore键来存储数据库中特定行的附加信息。下面的基本示例
A B C hstore
"foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"
"bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar" 2 8
"baz" 3 1 "Fruit"=>"apple", "Name"=>"David"
数据通常导出到CSV文件中,如下所示:
COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;
我在Python的Pandas数据帧中读到了这样的内容:
import pandas as pd
df = pd.read_csv('/filepath/file.csv')
然后我访问数据的一个子集。这个子集在大多数行中应该有一组公共的hstore键,但不一定在所有行中都有。
我想为每个hstore键创建一个单独的列。如果一行不存在键,则单元格应为空,或者用NULL或NAN值填充,以最简单的方式为准。最有效的方法是什么
您可以使用.str.extractall()
从列hstore
中提取键和值,然后使用.pivot()
将键转换为列标签。通过.groupby()
和.agg()
聚合原始数据帧中每一行的条目。使用.replace()
为空条目设置NaN
。最后,使用.join()
:将结果数据帧连接回原始数据帧
df.join(df['hstore'].str.extractall(r'"(.+?)"=>"(.+?)"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
结果:
A B C hstore Country Fruit Name Pet
0 "foo" 1 4 "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" Norway apple NaN dog
1 "bar" 4 6 "Pet"=>"cat", "Country"=>"Suriname" Suriname NaN NaN cat
2 "foobar" 2 8 None NaN NaN NaN NaN
3 "baz" 3 1 "Fruit"=>"apple", "Name"=>"David" NaN apple David NaN
如果您想获得一个新的数据帧进行提取,而不是连接回原始数据帧,您可以删除.join()
步骤并执行.reindex()
,如下所示:
df_out = (df['hstore'].str.extractall(r'"(.+?)"=>"(.+?)"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
df_out = df_out.reindex(df.index)
结果:
print(df_out)
Country Fruit Name Pet
0 Norway apple NaN dog
1 Suriname NaN NaN cat
2 NaN NaN NaN NaN
3 NaN apple David NaN
如果您不介意按照这里的建议安装一个额外的库(sqlalchemy(,您可以执行以下操作:
import pandas as pd
import sqlalchemy.dialects.postgresql as postgresql
hstore_to_dict = postgresql.HSTORE().result_processor(None, None)
df = pd.read_csv("/filepath/file.csv")
df["hstore"] = df["hstore"].fillna("")
hstore_dict = df["hstore"].map(hstore_to_dict)
hstore_df = pd.json_normalize(hstore_dict)
# optionally merge the expanded hstore dataframe with the original dataframe
df = pd.concat([df, hstore_df], axis=1)
df.drop("hstore", axis=1, inplace=True)
pd.json_normalize
将字典列表转换为数据帧,如下所述。
我假设您的输入CSV文件如下所示:
A,B,C,hstore
bar,4,6,"""Pet""=>""cat"", ""Number""=>""5"", ""Country""=>""Suriname"""
foobar,2,8,
baz,3,1,"""Name""=>""David"", ""Fruit""=>""apple"""
foo,1,4,"""Pet""=>""dog"", ""Fruit""=>""apple"", ""Country""=>""Norway"""
运行代码后,输出将是以下数据帧:
A B C Pet Number Country Name Fruit
0 bar 4 6 cat 5 Suriname NaN NaN
1 foobar 2 8 NaN NaN NaN NaN NaN
2 baz 3 1 NaN NaN NaN David apple
3 foo 1 4 dog NaN Norway NaN apple