I have a DataFrame in the following format:
geo_locations feature mau_audience
0 Aabenraa Alcholic Drinks,Android users,Architecture,Art... 3.380211241711606,3.230448921378274,3.0,3.5910...
1 Aalborg Alcholic Drinks,Android users,Architecture,Art... 4.113943352306837,3.8920946026904804,3.7853298...
2 Assens Alcholic Drinks,Android users,Architecture,Art... 3.041392685158225,3.0,3.0,3.255272505103306,3....
3 Billund Alcholic Drinks,Android users,Architecture,Art... 3.0,3.0,3.0,3.1760912590556813,3.1461280356782...
4 Bornholm Alcholic Drinks,Android users,Architecture,Art... 3.0,3.0,3.0,3.113943352306837,3.04139268515822...
5 Esbjerg Alcholic Drinks,Android users,Architecture,Art... 3.792391689498254,3.5563025007672873,3.5314789...
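This format is awkward to work with, because I have to look up a feature's position within the cell to get the mau_audience value for that feature at a given location. The following format would be much easier to use: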
geo_locations Alcoholic Drinks Android Users Architecture
0 Aabenraa 3.380211241711606 3.230448921378274 3.0
1 Aalborg 4.113943352306837 3.8920946026904804 3.7853298
2 Assens 3.041392685158225 3.0 3.0
3 Billund 3.0 3.0 3.0
4 Bornholm 3.0 3.0 3.0
5 Esbjerg 3.792391689498254 3.5563025007672873 3.5314789
How can I reshape this DataFrame? Thanks.
This assumes that every comma-separated value in feature has a corresponding value in mau_audience.
Create the DataFrame
import pandas as pd
data = pd.DataFrame([
['Aabenraa','Alcholic Drinks,Android users,Architecture', '3.380211241711606,3.230448921378274,3.0'],
['Aalborg','Alcholic Drinks,Android users,Architecture','4.113943352306837,3.8920946026904804,3.7853298']],
columns = ['geo_locations','feature','mau_audience'])
Column names
columns = data.loc[0,'feature'].split(',')
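The names are taken from the first record of feature, on the assumption that every row lists the same features in the same order.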
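Both assumptions can be checked before relying on them. This is only a minimal sketch on the same two-row sample; the names `same_features` and `counts_match` are mine, not part of the answer.

```python
import pandas as pd

data = pd.DataFrame(
    [
        ["Aabenraa", "Alcholic Drinks,Android users,Architecture",
         "3.380211241711606,3.230448921378274,3.0"],
        ["Aalborg", "Alcholic Drinks,Android users,Architecture",
         "4.113943352306837,3.8920946026904804,3.7853298"],
    ],
    columns=["geo_locations", "feature", "mau_audience"],
)

# True when every row lists exactly the same features in the same order
# (hypothetical helper name, for illustration only).
same_features = data["feature"].nunique() == 1

# True when each row has as many values as it has features,
# judged by counting the comma separators in both columns.
counts_match = (
    data["feature"].str.count(",") == data["mau_audience"].str.count(",")
).all()

print(same_features, counts_match)
```

If either check fails, taking the column names from the first row would mislabel some values.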
Split mau_audience and convert the data type
new = data['mau_audience'].str.split(',', expand=True)
new.columns = columns
new = new.astype('float')
Merge the new DataFrame on the index
data = data[['geo_locations']].merge(new,left_index=True, right_index=True)
Only the geo_locations column is kept from the original DataFrame.
Result
geo_locations Alcholic Drinks Android users Architecture
0 Aabenraa 3.380211 3.230449 3.00000
1 Aalborg 4.113943 3.892095 3.78533
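If the features might not appear in the same order in every row, one alternative is to pair each feature with its own value before widening. The sketch below is my variation rather than the approach above: it assumes pandas 1.3 or later (for multi-column `explode`), and the second sample row is deliberately reordered to show the effect.

```python
import pandas as pd

data = pd.DataFrame(
    [
        ["Aabenraa", "Alcholic Drinks,Android users,Architecture",
         "3.380211241711606,3.230448921378274,3.0"],
        # Features listed in a different order (made-up ordering for illustration).
        ["Aalborg", "Android users,Alcholic Drinks,Architecture",
         "3.8920946026904804,4.113943352306837,3.7853298"],
    ],
    columns=["geo_locations", "feature", "mau_audience"],
)

# Split both columns into lists, then explode them together so each
# feature stays paired with its own value (one row per pair).
long = data.assign(
    feature=data["feature"].str.split(","),
    mau_audience=data["mau_audience"].str.split(","),
).explode(["feature", "mau_audience"])
long["mau_audience"] = long["mau_audience"].astype(float)

# Pivot back to one row per location and one column per feature.
wide = long.pivot(
    index="geo_locations", columns="feature", values="mau_audience"
).reset_index()
wide.columns.name = None

print(wide)
```

Because the pivot matches values by feature name, row order within each cell no longer matters. It does require each location to list a given feature at most once.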
- The data you show is similar to the following example:
import pandas as pd
df=pd.DataFrame({"A":["a,b,c","a,b,d"],"B":["1,2,3","3,4,5"],"C":['a','b']})
- Convert the string in each cell of the relevant columns ("A" and "B" in my case) into a list
df.A=df.A.apply(lambda x : x.split(','))
df.B=df.B.apply(lambda x : x.split(','))
- Convert these lists of strings into dictionaries, then concatenate the DataFrames
final_df=df[['C']]
temp_df = df.apply(lambda row: dict(zip(row["A"], row["B"])), axis=1)
temp_df will have this format:
0 {'a': '1', 'b': '2', 'c': '3'}
1 {'a': '3', 'b': '4', 'd': '5'}
- Convert the above into a DataFrame, so that the dictionary keys become its columns
temp_df_1=temp_df.apply(pd.Series)
final_df=pd.concat([final_df, temp_df_1], axis=1)
final_df
C a b c d
0 a 1 2 3 NaN
1 b 3 4 NaN 5
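Note that the values in the result above are still strings. As a rough sketch of the same idea, the dictionaries can also be built in a list comprehension and passed straight to `pd.DataFrame`, with a numeric conversion afterwards; the names `records` and `wide` are mine, and the conversion step is an addition not present in the steps above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a,b,c", "a,b,d"], "B": ["1,2,3", "3,4,5"], "C": ["a", "b"]})

# One dict per row, mapping each key in "A" to the matching value in "B".
records = [
    dict(zip(keys.split(","), values.split(",")))
    for keys, values in zip(df["A"], df["B"])
]

# Keys missing from a row become NaN; to_numeric turns the strings into numbers.
wide = pd.DataFrame(records, index=df.index).apply(pd.to_numeric)

final_df = pd.concat([df[["C"]], wide], axis=1)
print(final_df)
```

This avoids `apply(pd.Series)`, which tends to be slow on large frames, while producing the same column layout.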