How to create new columns in a pandas DataFrame and fill them with values taken from cells



I have a DataFrame in the following format:

    geo_locations   feature                                             mau_audience
0   Aabenraa        Alcholic Drinks,Android users,Architecture,Art...   3.380211241711606,3.230448921378274,3.0,3.5910...
1   Aalborg         Alcholic Drinks,Android users,Architecture,Art...   4.113943352306837,3.8920946026904804,3.7853298...
2   Assens          Alcholic Drinks,Android users,Architecture,Art...   3.041392685158225,3.0,3.0,3.255272505103306,3....
3   Billund         Alcholic Drinks,Android users,Architecture,Art...   3.0,3.0,3.0,3.1760912590556813,3.1461280356782...
4   Bornholm        Alcholic Drinks,Android users,Architecture,Art...   3.0,3.0,3.0,3.113943352306837,3.04139268515822...
5   Esbjerg         Alcholic Drinks,Android users,Architecture,Art...   3.792391689498254,3.5563025007672873,3.5314789...

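Working with this format is tedious, because I have to look up a feature's position within the cell to get the mau_audience value for that feature at a given location. The following format would be much easier to work with: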

    geo_locations   Alcoholic Drinks     Android Users        Architecture
0   Aabenraa        3.380211241711606    3.230448921378274    3.0
1   Aalborg         4.113943352306837    3.8920946026904804   3.7853298
2   Assens          3.041392685158225    3.0                  3.0
3   Billund         3.0                  3.0                  3.0
4   Bornholm        3.0                  3.0                  3.0
5   Esbjerg         3.792391689498254    3.5563025007672873   3.5314789

How can I reshape this DataFrame? Thanks.

This assumes that every comma-separated value in feature has a corresponding value in mau_audience.

Create the DataFrame

import pandas as pd

data = pd.DataFrame([
['Aabenraa','Alcholic Drinks,Android users,Architecture', '3.380211241711606,3.230448921378274,3.0'],
['Aalborg','Alcholic Drinks,Android users,Architecture','4.113943352306837,3.8920946026904804,3.7853298']],
columns = ['geo_locations','feature','mau_audience'])

Column names

columns = data.loc[0,'feature'].split(',')

The names are taken from the first record, assuming every row lists the same features in the same order.
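Since the column names depend on that assumption, it may be worth checking it before relying on it. The following is a minimal sketch, not part of the original answer; the names `same_features`, `n_features`, `n_values` and `counts_match` are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    [
        ["Aabenraa", "Alcholic Drinks,Android users,Architecture",
         "3.380211241711606,3.230448921378274,3.0"],
        ["Aalborg", "Alcholic Drinks,Android users,Architecture",
         "4.113943352306837,3.8920946026904804,3.7853298"],
    ],
    columns=["geo_locations", "feature", "mau_audience"],
)

# True only if every row carries exactly the same feature string
# (same names, same order).
same_features = data["feature"].nunique() == 1

# Count comma-separated items per row in each column and compare them.
n_features = data["feature"].str.count(",") + 1
n_values = data["mau_audience"].str.count(",") + 1
counts_match = bool((n_features == n_values).all())

print(same_features, counts_match)
```

If either check fails, taking the column names from the first row would mislabel some values, and a per-row approach is probably safer.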

Split mau_audience and convert the data type

new = data['mau_audience'].str.split(',', expand=True)
new.columns = columns
new = new.astype('float')

Merge the new DataFrame on the index

data = data[['geo_locations']].merge(new,left_index=True, right_index=True)

Only the geo_locations column is kept from the original DataFrame.

Result

  geo_locations  Alcholic Drinks  Android users  Architecture
0      Aabenraa         3.380211       3.230449       3.00000
1       Aalborg         4.113943       3.892095       3.78533
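If the rows do not all list the same features in the same order, one possible alternative is to pair each feature with its value per row and then pivot. This is a sketch under assumptions, not the answerer's method: the second row below is deliberately altered to be hypothetical (reordered and shorter), the names `long` and `wide` are illustrative, and exploding several columns at once requires pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical data: the second row lists fewer features, in a different order.
data = pd.DataFrame(
    [
        ["Aabenraa", "Alcholic Drinks,Android users,Architecture",
         "3.380211241711606,3.230448921378274,3.0"],
        ["Aalborg", "Android users,Alcholic Drinks",
         "3.8920946026904804,4.113943352306837"],
    ],
    columns=["geo_locations", "feature", "mau_audience"],
)

# Turn both string columns into lists, then explode them together so that
# each output row holds one (location, feature, value) triple.
long = data.assign(
    feature=data["feature"].str.split(","),
    mau_audience=data["mau_audience"].str.split(","),
).explode(["feature", "mau_audience"])
long["mau_audience"] = long["mau_audience"].astype(float)

# One row per location, one column per feature; missing pairs become NaN.
wide = long.pivot(
    index="geo_locations", columns="feature", values="mau_audience"
).reset_index()
wide.columns.name = None

print(wide)
```

This relies on each row having as many values as features, and on no feature appearing twice for the same location, since `pivot` raises an error on duplicate index/column pairs.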
  1. The data you show is similar to the following example:
import pandas as pd
df=pd.DataFrame({"A":["a,b,c","a,b,d"],"B":["1,2,3","3,4,5"],"C":['a','b']})
  2. Convert the string in each cell of the relevant columns ("A" and "B" in my case) into a list
df.A=df.A.apply(lambda x : x.split(','))
df.B=df.B.apply(lambda x : x.split(','))
  3. Convert these lists of strings into dictionaries, to be concatenated with the DataFrame afterwards
final_df=df[['C']]
temp_df = df.apply(lambda row: dict(zip(row["A"], row["B"])), axis=1)

temp_df will look like this:

0    {'a': '1', 'b': '2', 'c': '3'}
1    {'a': '3', 'b': '4', 'd': '5'}
  4. Convert the above into a DataFrame, so that the dictionary keys become its columns
temp_df_1=temp_df.apply(pd.Series)
final_df=pd.concat([final_df, temp_df_1], axis=1)

final_df

   C  a  b    c    d
0  a  1  2    3  NaN
1  b  3  4  NaN    5
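Note that the values in the result above are still strings. A possible variant, sketched here as an assumption rather than as the answerer's code, builds the dictionaries in a list comprehension, passes them straight to the `pd.DataFrame` constructor, and converts each column with `pd.to_numeric`. The names `records` and `values` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a,b,c", "a,b,d"], "B": ["1,2,3", "3,4,5"], "C": ["a", "b"]})

# One dict per row, mapping each key from "A" to the matching value from "B".
records = [
    dict(zip(keys.split(","), vals.split(",")))
    for keys, vals in zip(df["A"], df["B"])
]

# Keys become columns; keys missing from a row become NaN.
# pd.to_numeric is applied column by column to convert the strings to numbers.
values = pd.DataFrame(records, index=df.index).apply(pd.to_numeric)

final_df = pd.concat([df[["C"]], values], axis=1)
print(final_df)
```

Building the frame from a list of dictionaries avoids calling `pd.Series` once per row, which tends to be slow on larger frames.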
