Simplifying conditional arrays from a text file with pandas (Python)



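I am trying to access data from a text file and apply normality tests, confidence intervals, ANOVA tests and so on.

Is there a simpler way to create conditional arrays from my data using pandas, without manually typing out 36 lines of code as I have done below?

Later I will need to access the different flavours within these packets, so I would have to repeat this pattern about 7 times.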

import numpy as np
import pandas as pd
import scipy.stats as st

revels_data = pd.read_csv("revels2.txt")
rd = revels_data
# packet sums
total_1 = (rd.loc[rd["Packet number"] == 1, "Contents"].sum())
total_2 = (rd.loc[rd["Packet number"] == 2, "Contents"].sum())
total_3 = (rd.loc[rd["Packet number"] == 3, "Contents"].sum())
total_4 = (rd.loc[rd["Packet number"] == 4, "Contents"].sum())
total_5 = (rd.loc[rd["Packet number"] == 5, "Contents"].sum())
total_6 = (rd.loc[rd["Packet number"] == 6, "Contents"].sum())
total_7 = (rd.loc[rd["Packet number"] == 7, "Contents"].sum())
total_8 = (rd.loc[rd["Packet number"] == 8, "Contents"].sum())
total_9 = (rd.loc[rd["Packet number"] == 9, "Contents"].sum())
total_10 = (rd.loc[rd["Packet number"] == 10, "Contents"].sum())
total_11 = (rd.loc[rd["Packet number"] == 11, "Contents"].sum())
total_12 = (rd.loc[rd["Packet number"] == 12, "Contents"].sum())
total_13 = (rd.loc[rd["Packet number"] == 13, "Contents"].sum())
total_14 = (rd.loc[rd["Packet number"] == 14, "Contents"].sum())
total_15 = (rd.loc[rd["Packet number"] == 15, "Contents"].sum())
total_16 = (rd.loc[rd["Packet number"] == 16, "Contents"].sum())
total_17 = (rd.loc[rd["Packet number"] == 17, "Contents"].sum())
total_18 = (rd.loc[rd["Packet number"] == 18, "Contents"].sum())
total_19 = (rd.loc[rd["Packet number"] == 19, "Contents"].sum())
total_20 = (rd.loc[rd["Packet number"] == 20, "Contents"].sum())
total_21 = (rd.loc[rd["Packet number"] == 21, "Contents"].sum())
total_22 = (rd.loc[rd["Packet number"] == 22, "Contents"].sum())
total_23 = (rd.loc[rd["Packet number"] == 23, "Contents"].sum())
total_24 = (rd.loc[rd["Packet number"] == 24, "Contents"].sum())
total_25 = (rd.loc[rd["Packet number"] == 25, "Contents"].sum())
total_26 = (rd.loc[rd["Packet number"] == 26, "Contents"].sum())
total_27 = (rd.loc[rd["Packet number"] == 27, "Contents"].sum())
total_28 = (rd.loc[rd["Packet number"] == 28, "Contents"].sum())
total_29 = (rd.loc[rd["Packet number"] == 29, "Contents"].sum())
total_30 = (rd.loc[rd["Packet number"] == 30, "Contents"].sum())
total_31 = (rd.loc[rd["Packet number"] == 31, "Contents"].sum())
total_32 = (rd.loc[rd["Packet number"] == 32, "Contents"].sum())
total_33 = (rd.loc[rd["Packet number"] == 33, "Contents"].sum())
total_34 = (rd.loc[rd["Packet number"] == 34, "Contents"].sum())
total_35 = (rd.loc[rd["Packet number"] == 35, "Contents"].sum())
total_36 = (rd.loc[rd["Packet number"] == 36, "Contents"].sum())
# create total array
a = np.array([total_1, total_2, total_3, total_4, total_5, total_6, total_7,
total_8, total_9, total_10, total_11, total_12, total_13, total_14, total_15,
total_16, total_17, total_18, total_19, total_20, total_21, total_22, total_23,
total_24, total_25, total_26, total_27, total_28, total_29, total_30, total_31,
total_32, total_33, total_34, total_35, total_36])
# mean confidence interval
print(st.t.interval(0.95, len(a)-1, loc=np.mean(a), scale=st.sem(a)))

Thanks!

Edit:

The dataset looks like:

Packet number,Flavour,Contents
1,orange,4
2,orange,3
3,orange,2
4,orange,4
5,orange,3
...
36,orange,3
1,toffee,4
2,toffee,3
...
1,chocolate,5
...

etc.

Desired data:

For each flavour type, I need an array/list of the contents for analysis, i.e.

orange:

4
3
2
4
...

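That way I can run various tests on these newly created arrays.

IIUC, you can do the following.

If you have only 36 distinct values in the Packet number column (from 1 to 36):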

a = rd.groupby('Packet number')['Contents'].sum()

If you have more and want to filter them first:

a = rd[rd['Packet number'].between(1, 36)].groupby('Packet number')['Contents'].sum()
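To connect this back to the confidence interval in the question, here is a minimal sketch. The inline CSV is invented sample data in the same layout as revels2.txt (not the real file), and the names csv_text and totals are only illustrative:

```python
import io

import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical sample in the same layout as revels2.txt (values invented for illustration)
csv_text = """Packet number,Flavour,Contents
1,orange,4
2,orange,3
3,orange,2
1,toffee,4
2,toffee,3
3,toffee,5
"""
rd = pd.read_csv(io.StringIO(csv_text))

# One total per packet; this replaces the hand-written total_1 ... total_36 variables
totals = rd.groupby("Packet number")["Contents"].sum()

# Convert to a NumPy array, ordered by packet number
a = totals.to_numpy()

# 95% confidence interval for the mean packet total, as in the question
ci = st.t.interval(0.95, len(a) - 1, loc=np.mean(a), scale=st.sem(a))
print(a)
print(ci)
```

With the real file you would replace io.StringIO(csv_text) with "revels2.txt"; the rest should stay the same as long as the column names match.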

UPDATE:

Source DF

In [233]: df
Out[233]:
   Packet number    Flavour  Contents
0              1     orange         4
1              2     orange         3
2              3     orange         2
3              4     orange         4
4              5     orange         3
5             36     orange         3
6              1     toffee         4
7              2     toffee         3
8              1  chocolate         5

Simple boolean indexing

In [234]: df.loc[df.Flavour == 'orange', 'Contents']
Out[234]:
0    4
1    3
2    2
3    4
4    3
5    3
Name: Contents, dtype: int64

... plus sum

In [235]: df.loc[df.Flavour == 'orange', 'Contents'].sum()
Out[235]: 19

Filter, groupby, aggregate

In [237]: df.loc[df.Flavour.isin(['orange','toffee'])].groupby('Flavour')['Contents'].sum()
Out[237]:
Flavour
orange    19
toffee     7
Name: Contents, dtype: int64
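If what you want is one array per flavour (the "Desired data" in the question) rather than a sum, iterating over the groupby is one way to get there. This is only a sketch: the data is invented, the arrays dictionary is a name chosen for illustration, and the ANOVA call assumes every flavour has more than one observation:

```python
import io

import pandas as pd
import scipy.stats as st

# Hypothetical sample data (values invented for illustration)
csv_text = """Packet number,Flavour,Contents
1,orange,4
2,orange,3
3,orange,2
1,toffee,4
2,toffee,3
3,toffee,5
1,chocolate,5
2,chocolate,6
3,chocolate,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Map each flavour to a NumPy array of its Contents values
arrays = {
    flavour: contents.to_numpy()
    for flavour, contents in df.groupby("Flavour")["Contents"]
}
print(arrays["orange"])

# One-way ANOVA across all flavours, passing one array per flavour
f_stat, p_value = st.f_oneway(*arrays.values())
print(f_stat, p_value)
```

Each value in arrays can then be fed to whichever test you need, so the seven flavours do not each need their own block of code.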
