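I'm trying to automate my code and make it cleaner. I want it to read a CSV, group by X (currently a variable named "Class"), and then remove every value that is more than 3 std from the mean.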
import pandas as pd
import numpy as np
my_path = "data_291018.csv"
data_loc = pd.read_csv(my_path)
df = pd.DataFrame(data_loc)
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
# .copy() avoids pandas' SettingWithCopyWarning when assigning to the subsets below
class_8 = df[df["Class"] == 8].copy()
class_11 = df[df["Class"] == 11].copy()
heads = df.columns[4:].values
for i in heads:
    class_8[i] = class_8[i].apply(lambda x: x if abs(x-class_8[i].mean()) < 3*class_8[i].std() else np.nan)
    class_11[i] = class_11[i].apply(lambda x: x if abs(x-class_11[i].mean()) < 3*class_11[i].std() else np.nan)
both = pd.concat([class_8, class_11])
both.to_csv("data.csv", sep=',')
Instead of running on two separate DataFrames, I tried adding
new_df = df.copy()
class_df = df.groupby("Class")
and running
for i in heads:
    new_df[i] = new_df[i].apply(lambda x: x if abs(x-class_df[i].mean()) < 3*class_df[i].std() else np.nan)
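It failed with:
raise ValueError("Can only compare identically-labeled Series objects")
ValueError: ('Can only compare identically-labeled Series objects', u'occurred at index SubjNum')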
Can you help me? At a later stage I would like to group by more than one variable.
Thanks a lot!
The DF looks like this:
SubjNum Class Genderm1f2 LRLevel exp1 exp2 exp3 exp4 exp5
8001 8 1 1 88 2 15 19 92
8002 8 2 1 85 59 19 20 97
8003 8 2 1 84 52 12 18 91
8004 11 2 1 85 44 17 20 92
8005 11 2 1 81 35 400 18 93
8006 11 1 1 190 56 20 17 97
I want to remove cells that are more than 3 std from the mean, based on Class/Gender etc. The desired result looks like this:
SubjNum Class Genderm1f2 LRLevel exp1 exp2 exp3 exp4 exp5
8001 8 1 1 88 . 15 19 92
8002 8 2 1 85 59 19 20 97
8003 8 2 1 84 52 12 18 91
8004 11 2 1 85 44 17 20 92
8005 11 2 1 81 35 . 18 93
8006 11 1 1 . 56 20 17 97
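As far as I understand the question, I am just posting my observations here so you can see whether they are relevant to what you are looking for; a complete answer from the experts is still awaited:
Mock DataFrame from the example: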
>>> df
SubjNum Class Genderm1f2 LRLevel exp1 exp2 exp3 exp4 exp5
0 8001 8 1 1 88 2 15 19 92
1 8002 8 2 1 85 59 19 20 97
2 8003 8 2 1 84 52 12 18 91
3 8004 11 2 1 85 44 17 20 92
4 8005 11 2 1 81 35 400 18 93
5 8006 11 1 1 190 56 20 17 97
Mean, grouped by these two columns:
>>> df.groupby(['Class', 'Genderm1f2']).mean()
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
Class Genderm1f2
8 1 8001.0 1.0 88.0 2.0 15.0 19.0 92.0
2 8002.5 1.0 84.5 55.5 15.5 19.0 94.0
11 1 8006.0 1.0 190.0 56.0 20.0 17.0 97.0
2 8004.5 1.0 83.0 39.5 208.5 19.0 92.5
Standard deviation, grouped by these two columns:
>>> df.groupby(['Class', 'Genderm1f2']).std()
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
Class Genderm1f2
8 1 NaN NaN NaN NaN NaN NaN NaN
2 0.707107 0.0 0.707107 4.949747 4.949747 1.414214 4.242641
11 1 NaN NaN NaN NaN NaN NaN NaN
2 0.707107 0.0 2.828427 6.363961 270.821897 1.414214 0.707107
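The NaN rows in the std output come from groups that contain a single row: the default sample standard deviation (ddof=1) is undefined for one observation. A small sketch to check this, assuming a reduced copy of the sample data from the question (only the exp1 column is kept here for brevity):

```python
import pandas as pd

# Reduced copy of the question's sample data (exp1 only, for brevity)
df = pd.DataFrame({
    "SubjNum": [8001, 8002, 8003, 8004, 8005, 8006],
    "Class": [8, 8, 8, 11, 11, 11],
    "Genderm1f2": [1, 2, 2, 2, 2, 1],
    "exp1": [88, 85, 84, 85, 81, 190],
})

groups = df.groupby(["Class", "Genderm1f2"])

# Number of rows in each (Class, Genderm1f2) group
sizes = groups.size()

# Sample standard deviation per group; NaN where the group has one row
exp1_std = groups["exp1"].std()

print(sizes)
print(exp1_std)
```

Any "3 std from the mean" rule has to decide what to do with such single-row groups, since there is no std to compare against.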
The same, in one step, by aggregating with 'mean' & 'std':
>>> df.groupby(['Class', 'Genderm1f2']).agg(['mean','std'])
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
mean std mean std mean std mean std mean std mean std mean std
Class Genderm1f2
8 1 8001.0 NaN 1 NaN 88.0 NaN 2.0 NaN 15.0 NaN 19 NaN 92.0 NaN
2 8002.5 0.707107 1 0.0 84.5 0.707107 55.5 4.949747 15.5 4.949747 19 1.414214 94.0 4.242641
11 1 8006.0 NaN 1 NaN 190.0 NaN 56.0 NaN 20.0 NaN 17 NaN 97.0 NaN
2 8004.5 0.707107 1 0.0 83.0 2.828427 39.5 6.363961 208.5 270.821897 19 1.414214 92.5 0.707107
The same aggregation with 'mean' & 'std', showing which of the aggregated values are greater than 3:
>>> df.groupby(['Class', 'Genderm1f2']).agg(['mean','std']) > 3
SubjNum LRLevel exp1 exp2 exp3 exp4 exp5
mean std mean std mean std mean std mean std mean std mean std
Class Genderm1f2
8 1 True False False False True False False False True False True False True False
2 True False False False True False True True True True True False True True
11 1 True False False False True False True False True False True False True False
2 True False False False True False True True True True True False True False
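Note that the `> 3` comparison above only tells you which aggregated means and stds are themselves larger than 3; it does not yet say which cells lie more than 3 std from their group mean. One possible way to get there is `groupby(...).transform`, which returns the group mean and std aligned row by row with the original frame. The following is only a sketch under assumptions: `mask_outliers`, its parameter names and the `demo` data are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd


def mask_outliers(df, group_cols, value_cols, n_std=3.0):
    """Return a copy of df in which every value lying more than n_std
    group standard deviations from its group mean is replaced by NaN."""
    out = df.copy()
    grouped = df.groupby(group_cols)[value_cols]
    # transform() broadcasts each group's statistic back onto its rows,
    # so means/stds share the index and columns of df[value_cols]
    means = grouped.transform("mean")
    stds = grouped.transform("std")
    # Comparisons against a NaN std (single-row groups) are False,
    # so those rows are kept rather than blanked
    too_far = (df[value_cols] - means).abs() > n_std * stds
    out[value_cols] = df[value_cols].mask(too_far)
    return out


# Hypothetical demo data: one group with a clear outlier, one without
demo = pd.DataFrame({
    "Class": [8] * 21 + [11] * 3,
    "exp1": [10] * 20 + [1000] + [1, 2, 300],
})
cleaned = mask_outliers(demo, "Class", ["exp1"])
print(cleaned[cleaned["exp1"].isna()])
```

For the frame in the question the call would be something like `mask_outliers(df, ["Class", "Genderm1f2"], list(df.columns[4:]))`, which also covers grouping by more than one variable. Be aware that with the sample std (ddof=1) a value in a group of n rows can be at most (n-1)/sqrt(n) standard deviations from the group mean, so a group needs at least 11 rows before any cell can exceed a threshold of 3. With the six-row sample, nothing would be removed.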