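I want to create a seaborn boxplot with 4 variables ("Draughts", "Heating it sufficiently is too expensive", "Heating system in inadequate", "Poor building fabric"), with temperature on the y axis. The problem is that many respondents selected more than one option for this question. How can I split the options so that each one sits on its own row while still keeping all of the data? Here is some of the data: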
CausesCold
Draughts 15.0
Draughts 19.0
Heating it sufficiently is too expensive 0.0
Draughts 10.0
Draughts 15.0
Draughts 20.0
Heating it sufficiently is too expensive,Heatin... 5.0
Heating it sufficiently is too expensive,Heatin... 18.0
Heating system in inadequate,Draughts 15.0
Heating system in inadequate,Poor building fabric 15.0
Heating it sufficiently is too expensive,Heatin... 21.0
Heating system in inadequate 21.0
Heating system in inadequate 21.0
Heating it sufficiently is too expensive 10.0
Draughts 0.0
Heating it sufficiently is too expensive,Poor b... 18.0
Heating system in inadequate 18.0
Poor building fabric,Draughts 19.0
Heating system in inadequate,Poor building fabr... 19.0
Heating system in inadequate 18.0
Heating system in inadequate 17.0
Heating it sufficiently is too expensive,Poor b... 18.0
Heating it sufficiently is too expensive,Heatin... 15.0
Heating it sufficiently is too expensive,Heatin... 15.0
Heating it sufficiently is too expensive,Poor b... 20.0
Heating it sufficiently is too expensive 17.0
Heating it sufficiently is too expensive 17.0
Heating system in inadequate 0.0
Heating it sufficiently is too expensive 10.0
Heating it sufficiently is too expensive,Heatin... 0.0
I would like it to look like this:
CurrentThermostatTemp
CausesCold
Poor building fabric 20.0
Poor building fabric 17.0
Poor building fabric 20.0
Poor building fabric 19.0
Poor building fabric 20.0
Poor building fabric 17.0
Poor building fabric 18.0
Poor building fabric 22.0
Poor building fabric 25.0
Poor building fabric 20.0
Poor building fabric 15.0
Poor building fabric 19.0
Poor building fabric 20.0
Poor building fabric 20.0
Poor building fabric 20.0
Poor building fabric 21.0
Poor building fabric 19.0
Poor building fabric 20.0
Poor building fabric 18.0
Poor building fabric 20.0
Poor building fabric 17.0
Poor building fabric 25.0
Poor building fabric 18.0
Poor building fabric 20.0
Poor building fabric 16.0
Poor building fabric 15.0
Poor building fabric 21.0
Poor building fabric 25.0
Poor building fabric 23.0
Poor building fabric 30.0
... ...
Draughts 20.0
Draughts 20.0
Draughts 17.0
Draughts 16.0
Draughts 25.0
Draughts 21.0
Draughts 21.0
Draughts 18.0
Draughts 20.0
Draughts 20.0
Draughts 18.0
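I'm not clear on how the data here is formatted. Are the thermostat readings already in their own column?
In any case, you will probably want to use pandas.Series.str.split, something like:
temp = data['CausesCold'].str.split(',', n=1, expand=True)
This creates a new DataFrame with two columns labelled with the integers 0 and 1. Note that n=1 splits only at the first comma, so a row with three or more options keeps everything after the first comma together in column 1.
Assuming the thermostat values already sit in a separate column, I would then copy them into this "temp" DataFrame. For example:
temp['thermostat'] = data['thermostat']
Your temp df would then look like this: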
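To make the split step concrete, here is a minimal sketch on a tiny hand-made frame. The three rows are modelled on the question's sample, and the 'thermostat' column name is an assumption carried over from this answer, not something confirmed by the question.

```python
import pandas as pd

# Hypothetical sample modelled on the question's data;
# 'thermostat' is an assumed column name.
data = pd.DataFrame({
    "CausesCold": [
        "Draughts",
        "Heating system in inadequate,Draughts",
        "Heating system in inadequate,Poor building fabric",
    ],
    "thermostat": [15.0, 15.0, 15.0],
})

# n=1 splits at the first comma only, so at most two columns come back.
# expand=True returns a DataFrame whose columns are the integers 0 and 1.
temp = data["CausesCold"].str.split(",", n=1, expand=True)
print(temp)
```

Rows with a single reason get a missing value in column 1, which is what the later dropna step relies on.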
|*********************************|
|0         |1         |thermostat |
|Reason 1  |Reason 2  |Number     |
|Reason 1  |Reason 2  |Number     |
|Reason 1  |null      |Number     |
|*********************************|
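You want columns 0 and 1 stacked on top of each other, each with its corresponding thermostat value.
Split the df, renaming each reason column to a common name so the two pieces line up (the labels produced by the split are the integers 0 and 1, not the strings '0' and '1'):
df = temp[[0, 'thermostat']].rename(columns={0: 'CausesCold'})
df1 = temp[[1, 'thermostat']].rename(columns={1: 'CausesCold'})
and then concatenate them. Some people will have given only one answer (i.e. column 1 is null for them), so drop those rows first. DataFrame.append was removed in pandas 2.0, so use pd.concat instead:
df = pd.concat([df, df1.dropna(subset=['CausesCold'])], ignore_index=True)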
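Putting the split, attach and stack steps together, a runnable sketch might look like the following. The sample rows and the 'thermostat' column name are assumptions for illustration, and pd.concat stands in for DataFrame.append, which no longer exists in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample; 'thermostat' is an assumed column name.
data = pd.DataFrame({
    "CausesCold": [
        "Draughts",
        "Heating system in inadequate,Draughts",
        "Poor building fabric,Draughts",
    ],
    "thermostat": [15.0, 15.0, 19.0],
})

# Split into at most two reason columns (integer labels 0 and 1)
temp = data["CausesCold"].str.split(",", n=1, expand=True)
temp["thermostat"] = data["thermostat"]

# First reason of every respondent
first = temp[[0, "thermostat"]].rename(columns={0: "CausesCold"})
# Second reason, only for respondents who gave one
second = (
    temp[[1, "thermostat"]]
    .dropna(subset=[1])
    .rename(columns={1: "CausesCold"})
)

# Stack the two pieces into one long frame: one row per (reason, reading)
stacked = pd.concat([first, second], ignore_index=True)
print(stacked)
```

A long frame of this shape can then be handed to seaborn, for example seaborn.boxplot(data=stacked, x="CausesCold", y="thermostat"), which draws one box per reason.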
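If you are in the unfortunate position of having a raw data source where the reasons and the thermostat value are both in the same single string, my first step would probably be a regex extract of any number in that string, defining it as a new column called "thermostat" or something similar.
In any case, this should get you headed in the right direction. It's not necessarily the most efficient way to get there, but it will get you there.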
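For that single-string case, a rough sketch of the regex step could look like this. The column name 'raw' and the assumption that the reading is a trailing number at the end of each string are mine, so adjust the pattern to whatever the real source looks like.

```python
import pandas as pd

# Hypothetical raw column where the reasons and the reading share one string
raw = pd.DataFrame({"raw": [
    "Draughts 15.0",
    "Heating system in inadequate,Draughts 15.0",
    "Heating it sufficiently is too expensive 0.0",
]})

# Capture a trailing number (with optional decimal part) as the reading
raw["thermostat"] = (
    raw["raw"].str.extract(r"(\d+(?:\.\d+)?)\s*$", expand=False).astype(float)
)
# Treat whatever precedes that number as the reasons text
raw["CausesCold"] = raw["raw"].str.replace(
    r"\s*\d+(?:\.\d+)?\s*$", "", regex=True
)
print(raw[["CausesCold", "thermostat"]])
```

Anchoring the pattern at the end of the string avoids picking up digits that happen to appear inside a reason.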
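As a possible shortcut, and a different technique from the split-and-stack approach described above, DataFrame.explode can turn a column of lists into one row per item, which also covers respondents who picked three or more options. This is only a sketch: the sample rows (including the three-option one) and the 'thermostat' column name are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the second row has three options on purpose
data = pd.DataFrame({
    "CausesCold": [
        "Draughts",
        "Heating system in inadequate,Poor building fabric,Draughts",
    ],
    "thermostat": [20.0, 19.0],
})

# Split every comma-separated string into a list, then emit one row per item.
# The thermostat reading is repeated for each reason of the same respondent.
long_df = (
    data.assign(CausesCold=data["CausesCold"].str.split(","))
        .explode("CausesCold", ignore_index=True)
)
# Remove stray spaces around the reasons, in case the source has "a, b"
long_df["CausesCold"] = long_df["CausesCold"].str.strip()
print(long_df)
```

Keep in mind that a respondent with several reasons contributes their reading to several boxes, so the boxes are not independent samples.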