访问Python中的PostgreSQL存储键和值,并为每个键创建新的数据帧列



我管理一个PostgreSQL数据库,并正在为用户开发一个访问数据库子集的工具。数据库有很多列,此外,我们还使用大量的hstore键来存储数据库中特定行的附加信息。下面的基本示例

A          B        C        hstore   
"foo"      1        4        "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway" 
"bar"      4        6        "Pet"=>"cat", "Country"=>"Suriname", "Number"=>"5"
"foobar"   2        8
"baz"      3        1        "Fruit"=>"apple", "Name"=>"David"

数据通常导出到CSV文件中,如下所示:

COPY tableName TO '/filepath/file.csv' DELIMITER ',' CSV HEADER;

我在Python的Pandas数据帧中读到了这样的内容:

import pandas as pd
df = pd.read_csv('/filepath/file.csv')

然后我访问数据的一个子集。这个子集在大多数行中应该有一组公共的hstore键,但不一定在所有行中都有。

我想为每个hstore键创建一个单独的列。如果一行不存在键,则单元格应为空,或者用NULL或NAN值填充,以最简单的方式为准。最有效的方法是什么

您可以使用.str.extractall()从列hstore中提取键和值,然后使用.pivot()将键转换为列标签。通过.groupby().agg()聚合原始数据帧中每一行的条目。使用.replace()为空条目设置NaN。最后,使用.join():将结果数据帧连接回原始数据帧

df.join(df['hstore'].str.extractall(r'"(.+?)"=>"(.+?)"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)

结果:

A  B  C                                               hstore   Country  Fruit   Name  Pet
0     "foo"  1  4  "Fruit"=>"apple", "Pet"=>"dog", "Country"=>"Norway"    Norway  apple    NaN  dog
1     "bar"  4  6                  "Pet"=>"cat", "Country"=>"Suriname"  Suriname    NaN    NaN  cat
2  "foobar"  2  8                                                 None       NaN    NaN    NaN  NaN
3     "baz"  3  1                    "Fruit"=>"apple", "Name"=>"David"       NaN  apple  David  NaN

如果您想获得一个新的数据帧进行提取,而不是连接回原始数据帧,您可以删除.join()步骤并执行.reindex(),如下所示:

df_out = (df['hstore'].str.extractall(r'"(.+?)"=>"(.+?)"')
.reset_index()
.pivot(index=['level_0', 'match'], columns=0, values=1)
.groupby(level=0)
.agg(lambda x: ''.join(x.dropna()))
.replace('', np.nan)
)
df_out = df_out.reindex(df.index)

结果:

print(df_out)

Country  Fruit   Name  Pet
0    Norway  apple    NaN  dog
1  Suriname    NaN    NaN  cat
2       NaN    NaN    NaN  NaN
3       NaN  apple  David  NaN

如果您不介意按照这里的建议安装一个额外的库(sqlalchemy(,您可以执行以下操作:

import pandas as pd
import sqlalchemy.dialects.postgresql as postgresql
hstore_to_dict = postgresql.HSTORE().result_processor(None, None)
df = pd.read_csv("/filepath/file.csv")
df["hstore"] = df["hstore"].fillna("")
hstore_dict = df["hstore"].map(hstore_to_dict)
hstore_df = pd.json_normalize(hstore_dict)
# optionally merge the expanded hstore dataframe with the original dataframe 
df = pd.concat([df, hstore_df], axis=1)
df.drop("hstore", axis=1, inplace=True)

pd.json_normalize将字典列表转换为数据帧,如下所述。

我假设您的输入CSV文件如下所示:

A,B,C,hstore
bar,4,6,"""Pet""=>""cat"", ""Number""=>""5"", ""Country""=>""Suriname"""
foobar,2,8,
baz,3,1,"""Name""=>""David"", ""Fruit""=>""apple"""
foo,1,4,"""Pet""=>""dog"", ""Fruit""=>""apple"", ""Country""=>""Norway"""

运行代码后,输出将是以下数据帧:

A  B  C  Pet Number   Country   Name  Fruit
0     bar  4  6  cat      5  Suriname    NaN    NaN
1  foobar  2  8  NaN    NaN       NaN    NaN    NaN
2     baz  3  1  NaN    NaN       NaN  David  apple
3     foo  1  4  dog    NaN    Norway    NaN  apple

最新更新