Python 中伪数据透视表的矢量化实现



我有以下数据帧,包括一些车辆和所述车辆的组件:

df
>>>>        Vehicle       Component
      0     Ford          Air conditioner
      1     Ford          airbag
      2     Ford          engine with 150 H/P
      3     Toyota        airbag
      4     Toyota        1-year concierge assistance
      5     Toyota        ABS breaks
      6     Chrysler      ABS breaks
      7     Chrysler      airbag
      8     Chrysler      air conditioner
      9     Chrysler      engine with 250 H/P

我想使用以下格式创建第二个数据帧,即一个伪数据帧,其中我为每个车辆组件现有组合添加一个 1,否则添加一个 0。

second_df
>>>>       Vehicle      Air conditioner   airbag     engine with 150 H/P    engine with 250 H/P  ABS breaks  1-year concierge assistance  
     0     Ford         1                  1         1                      0                    0           0        
     1     Toyota       0                  1         0                      0                    1           1
     2     Chrysler     1                  1         0                      1                    1           0

我使用下面发布的解决方案实现了这一点,但它效率很低。感谢您的帮助。

您可以使用

df.crosstab创建频率表:

import pandas as pd
df = pd.DataFrame(
    {'Component': ['Air conditioner', 'Air conditioner', 'airbag', 'engine with 150 H/P', 'airbag',
                   '1-year concierge assistance', 'ABS breaks', 'ABS breaks', 'airbag', 
                   'air conditioner', 'engine with 250 H/P'], 
     'Vehicle': ['Ford', 'Ford', 'Ford', 'Ford', 'Toyota', 'Toyota', 'Toyota',
                 'Chrysler', 'Chrysler', 'Chrysler', 'Chrysler']})
result = pd.crosstab(index=[df['Vehicle']], columns=[df['Component']]).clip(upper=1)
print(result)

收益 率

Component  1-year concierge assistance  ABS breaks  Air conditioner  
Vehicle                                                               
Chrysler                             0           1                0   
Ford                                 0           0                1   
Toyota                               1           1                0   
Component  air conditioner  airbag  engine with 150 H/P  engine with 250 H/P  
Vehicle                                                                       
Chrysler                 1       1                    0                    1  
Ford                     0       1                    1                    0  
Toyota                   0       1                    0                    0  

由于频率表可能包含大于 1 的值(如果df包含重复行),因此clip(upper=1)用于将这些值减少回 1。

这是我构建的代码,效率非常低,因为它使用嵌套循环。如果有人发布更优雅的实现,将不胜感激。

import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
data['vehicle'] = data['vehicle'].apply(str)
vhs = np.unique(data['vehicle'])
vhs = [x for x in vhs if str(x) != 'nan']

data['Component'] = data['Component'].apply(str)
components = np.unique(data['Component'])
components = [x for x in components if str(x) != 'nan']
componentes = ['vehicle'] + components
my_df = pd.DataFrame(columns=componentes)
vhs = np.array(vhs)
my_df['vehicles'] = vhs
my_df = my_df.fillna(0)
for vh in vhs:
    sub_data = data[data['vehicle']==vh]
    compies = np.unique(sub_data['Component'])
    for comp in compies:
        my_df[comp][my_df['vehicles']==vh] = 1
my_df.to_csv('my_vhs.csv',index=False)

最新更新