我有以下数据帧,包括一些车辆和所述车辆的组件:
df
>>>> Vehicle Component
0 Ford Air conditioner
1 Ford airbag
2 Ford engine with 150 H/P
3 Toyota airbag
4 Toyota 1-year concierge assistance
5 Toyota ABS breaks
6 Chrysler ABS breaks
7 Chrysler airbag
8 Chrysler air conditioner
9 Chrysler engine with 250 H/P
我想使用以下格式创建第二个数据帧,即一个伪数据帧,其中我为每个车辆组件现有组合添加一个 1,否则添加一个 0。
second_df
>>>> Vehicle Air conditioner airbag engine with 150 H/P engine with 250 H/P ABS breaks 1-year concierge assistance
0 Ford 1 1 1 0 0 0
1 Toyota 0 1 0 0 1 1
2 Chrysler 1 1 0 1 1 0
我使用下面发布的解决方案实现了这一点,但它效率很低。感谢您的帮助。
您可以使用
df.crosstab
创建频率表:
import pandas as pd
df = pd.DataFrame(
{'Component': ['Air conditioner', 'Air conditioner', 'airbag', 'engine with 150 H/P', 'airbag',
'1-year concierge assistance', 'ABS breaks', 'ABS breaks', 'airbag',
'air conditioner', 'engine with 250 H/P'],
'Vehicle': ['Ford', 'Ford', 'Ford', 'Ford', 'Toyota', 'Toyota', 'Toyota',
'Chrysler', 'Chrysler', 'Chrysler', 'Chrysler']})
result = pd.crosstab(index=[df['Vehicle']], columns=[df['Component']]).clip(upper=1)
print(result)
收益 率
Component 1-year concierge assistance ABS breaks Air conditioner
Vehicle
Chrysler 0 1 0
Ford 0 0 1
Toyota 1 1 0
Component air conditioner airbag engine with 150 H/P engine with 250 H/P
Vehicle
Chrysler 1 1 0 1
Ford 0 1 1 0
Toyota 0 1 0 0
由于频率表可能包含大于 1 的值(如果df
包含重复行),因此clip(upper=1)
用于将这些值减少回 1。
这是我构建的代码,效率非常低,因为它使用嵌套循环。如果有人发布更优雅的实现,将不胜感激。
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
data['vehicle'] = data['vehicle'].apply(str)
vhs = np.unique(data['vehicle'])
vhs = [x for x in vhs if str(x) != 'nan']
data['Component'] = data['Component'].apply(str)
components = np.unique(data['Component'])
components = [x for x in components if str(x) != 'nan']
componentes = ['vehicle'] + components
my_df = pd.DataFrame(columns=componentes)
vhs = np.array(vhs)
my_df['vehicles'] = vhs
my_df = my_df.fillna(0)
for vh in vhs:
sub_data = data[data['vehicle']==vh]
compies = np.unique(sub_data['Component'])
for comp in compies:
my_df[comp][my_df['vehicles']==vh] = 1
my_df.to_csv('my_vhs.csv',index=False)