How to run a Shapiro-Wilk test using Python pandas?



I need to run the Shapiro-Wilk test on a DataFrame.

About the Shapiro-Wilk test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

DataFrame 1:

Stationid
10
11
12
13
14
15
16
17

DataFrame 2:

Stationid  Maintanance
10           55
15           38
21          100
10           56
22          101
15           39
10           56

I need the Shapiro-Wilk test for each station ID in DataFrame 2 that also appears in DataFrame 1.

Expected output

Stationid   W           P 
10  0.515        55.666667
15  0.555        38.500000

Note: the W and P values in the table are placeholders, not the correct values.

First filter with isin, then use GroupBy.apply and convert the output to a Series to create the new columns:

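import pandas as pd
from scipy.stats import shapiro

# check that the column is numeric
print (df2['Maintanance'].dtypes)
int64

# keep only the stations that also appear in df1
df3 = df2[df2['Stationid'].isin(df1['Stationid'])]
# caution: in the run that produced the output below, x is the whole group
# with the Stationid column included, so W and P are computed over both
# columns flattened together; use shapiro(x['Maintanance']) to test only the
# maintenance values (each group then needs at least 3 rows)
df = (df3.groupby('Stationid')
         .apply(lambda x: pd.Series(shapiro(x), index=['W','P']))
         .reset_index())
print (df)
   Stationid         W         P
0         10  0.689908  0.004831
1         15  0.747003  0.036196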
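
The W values printed above appear to include the Stationid column: when the whole group is passed, shapiro treats the IDs and the maintenance values as one sample. A minimal sketch of the difference for station 10, using plain NumPy arrays (the variable names are illustrative and not from the original answer):

```python
import numpy as np
from scipy.stats import shapiro

# Station 10 rows from DataFrame 2, as (Stationid, Maintanance) pairs
group = np.array([[10, 55], [10, 56], [10, 56]])

# What the test sees when the whole group is passed: IDs and values mixed
w_flat, p_flat = shapiro(group.ravel())

# What the question asks for: the Maintanance values only
w_col, p_col = shapiro(group[:, 1])

print(f"both columns flattened: W={w_flat:.6f}")
print(f"Maintanance only:       W={w_col:.6f}")
```

For three values of which two are equal, W works out to 0.75, which is the value reported for abc15 in the edit further down.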

EDIT:

data = ['abc15','acv1','acv2','acv3','acv4','abc18','acv5','acv6'] 
df1 = pd.DataFrame(data,columns=['Stationid']) 
print (df1)
Stationid
0     abc15
1      acv1
2      acv2
3      acv3
4      acv4
5     abc18
6      acv5
7      acv6
data1=[['abc15',55],['abc18',38],['ark',100],['abc15',56],['ark',101],['abc19',39],['abc15',56]] 
df2=pd.DataFrame(data1,columns=['Stationid','Maintanance']) 
print(df2) 
Stationid  Maintanance
0     abc15           55
1     abc18           38
2       ark          100
3     abc15           56
4       ark          101
5     abc19           39
6     abc15           56

The problem is that shapiro does not work when a group has fewer than 3 values, so a check for groups with length > 2 was added:

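import numpy as np
import pandas as pd
from scipy.stats import shapiro

df3 = df2[df2['Stationid'].isin(df1['Stationid'])]
print (df3)
  Stationid  Maintanance
0     abc15           55
1     abc18           38   <- group with length 1 (abc18)
3     abc15           56
6     abc15           56

# select only the numeric column so the string IDs are not passed to shapiro;
# groups with fewer than 3 rows get NaN instead of a test result
df = (df3.groupby('Stationid')[['Maintanance']]
         .apply(lambda x: pd.Series(shapiro(x), index=['W','P']) if len(x) > 2
                else pd.Series([np.nan, np.nan], index=['W','P']))
         .reset_index())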
print (df)
Stationid     W         P
0     abc15  0.75 -0.000001
1     abc18   NaN       NaN
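
The same guard can be written as a small named function, which keeps the per-group logic readable. This is only a sketch under assumptions: `safe_shapiro` and `min_size` are hypothetical names introduced here, and the result table is built with a plain loop over the groups rather than with GroupBy.apply.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

def safe_shapiro(values, min_size=3):
    """Return (W, P) for a sample, or (nan, nan) if it is too small to test."""
    values = np.asarray(values, dtype=float)
    if len(values) < min_size:
        return np.nan, np.nan
    w, p = shapiro(values)
    return w, p

# Same data as in the edit above
df1 = pd.DataFrame({'Stationid': ['abc15', 'acv1', 'acv2', 'acv3',
                                  'acv4', 'abc18', 'acv5', 'acv6']})
df2 = pd.DataFrame({'Stationid': ['abc15', 'abc18', 'ark', 'abc15',
                                  'ark', 'abc19', 'abc15'],
                    'Maintanance': [55, 38, 100, 56, 101, 39, 56]})

# Keep only the stations that also appear in df1
df3 = df2[df2['Stationid'].isin(df1['Stationid'])]

# One result row per station; only the Maintanance values reach the test
rows = []
for station, values in df3.groupby('Stationid')['Maintanance']:
    w, p = safe_shapiro(values)
    rows.append({'Stationid': station, 'W': w, 'P': p})

result = pd.DataFrame(rows)
print(result)
```

Iterating over the grouped column avoids depending on how GroupBy.apply assembles its return values.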

Or filter out those groups first:

from scipy.stats import shapiro
df3 = df2[df2['Stationid'].isin(df1['Stationid'])]
print (df3)
Stationid  Maintanance
0     abc15           55
1     abc18           38
3     abc15           56
6     abc15           56
df3 = df3[df3.groupby('Stationid')['Stationid'].transform('size') > 2]
print (df3)
Stationid  Maintanance
0     abc15           55
3     abc15           56
6     abc15           56
df = (df3.groupby('Stationid')[['Maintanance']]
.apply(lambda x: pd.Series(shapiro(x), index=['W','P']))
.reset_index())
print (df)
Stationid     W         P
0     abc15  0.75 -0.000001
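
An alternative to the transform('size') mask is GroupBy.filter, which keeps only the rows of groups that pass a predicate. A minimal sketch on the same filtered data (this swaps in a different pandas idiom than the answer uses; the name `big_enough` is illustrative):

```python
import pandas as pd

# df3 as printed above, after the isin filter
df3 = pd.DataFrame({'Stationid': ['abc15', 'abc18', 'abc15', 'abc15'],
                    'Maintanance': [55, 38, 56, 56]},
                   index=[0, 1, 3, 6])

# Keep only the rows whose station has at least 3 observations
big_enough = df3.groupby('Stationid').filter(lambda g: len(g) > 2)
print(big_enough)
```

On large frames the transform('size') mask is typically faster, since filter calls the Python function once per group.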

There must be a cleaner way, but this gets the job done:

import pandas as pd
from scipy import stats

df1 = pd.DataFrame({'Stationid': [10, 11, 12, 13, 14, 15, 16, 17]})
df2 = pd.DataFrame({'Stationid': [10, 15, 21, 10, 22, 15, 10],
'Maintanance': [55, 38, 100, 56, 101, 39, 56]})
df2['Maintanance'] = df2['Maintanance'].astype(int)
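# note: apply receives each merged group with Stationid included, so the
# values printed below are computed over both columns, not Maintanance alone
df = (df1.merge(df2, on='Stationid', how='inner')
         .groupby('Stationid')
         .apply(stats.shapiro)
         .reset_index()
         .rename(columns={0: 'shapiro'}))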
df = df.join(df['shapiro'].apply(lambda val: pd.Series(val, index=['W', 'P'])))
df[['Stationid', 'W', 'P']]
#   Stationid         W         P
#0         10  0.689908  0.004831
#1         15  0.747003  0.036196
