How to run the Shapiro-Wilk test using Python pandas?

I need to run the Shapiro-Wilk test on a DataFrame.

About Shapiro-Wilk: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html

DataFrame 1:

Stationid
10
11
12
13
14
15
16
17

DataFrame 2:

Stationid  Maintanance
10           55
15           38
21          100
10           56
22          101
15           39
10           56

I need the Shapiro-Wilk result for each Stationid in DataFrame 2 that also appears in DataFrame 1.

Expected output:

Stationid   W           P 
10  0.515        55.666667
15  0.555        38.500000

Note: the W and P values given in the table are not the correct values.

First filter with isin, then use GroupBy.apply and convert the output to a Series to create the new columns:

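import pandas as pd
from scipy.stats import shapiro
# check that the column is numeric
print(df2['Maintanance'].dtypes)
int64
# keep only the rows of df2 whose Stationid appears in df1
df3 = df2[df2['Stationid'].isin(df1['Stationid'])]
# run shapiro per group and expand the (W, P) result into two columns
# caution: x is the whole group here; station 15 has only two Maintanance
# values, so the result printed for it suggests Stationid values were included
df = (df3.groupby('Stationid')
         .apply(lambda x: pd.Series(shapiro(x), index=['W','P']))
         .reset_index())
print(df)
   Stationid         W         P
0         10  0.689908  0.004831
1         15  0.747003  0.036196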
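
A minimal sketch of the same filter-then-test idea with the tested column made explicit. It assumes the sample frames from the question; the loop, the `rows` list and the `result` name are illustrative choices, not part of the answer above. Only `Maintanance` is passed to `shapiro`, and stations with fewer than three readings are skipped because `scipy.stats.shapiro` needs at least three observations.

```python
import pandas as pd
from scipy.stats import shapiro

# Sample data copied from the question
df1 = pd.DataFrame({"Stationid": [10, 11, 12, 13, 14, 15, 16, 17]})
df2 = pd.DataFrame({
    "Stationid": [10, 15, 21, 10, 22, 15, 10],
    "Maintanance": [55, 38, 100, 56, 101, 39, 56],
})

# Keep only rows of df2 whose Stationid appears in df1
df3 = df2[df2["Stationid"].isin(df1["Stationid"])]

rows = []
# Iterating over the selected column yields (station, Series of readings)
for station, values in df3.groupby("Stationid")["Maintanance"]:
    if len(values) < 3:
        continue  # too few observations for shapiro
    w, p = shapiro(values.to_numpy(dtype=float))
    rows.append({"Stationid": station, "W": w, "P": p})

result = pd.DataFrame(rows, columns=["Stationid", "W", "P"])
print(result)
```

With this data only station 10 has enough readings, so `result` has a single row.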

Edit:

data = ['abc15','acv1','acv2','acv3','acv4','abc18','acv5','acv6'] 
df1 = pd.DataFrame(data,columns=['Stationid']) 
print (df1)
Stationid
0     abc15
1      acv1
2      acv2
3      acv3
4      acv4
5     abc18
6      acv5
7      acv6
data1=[['abc15',55],['abc18',38],['ark',100],['abc15',56],['ark',101],['abc19',39],['abc15',56]] 
df2=pd.DataFrame(data1,columns=['Stationid','Maintanance']) 
print(df2) 
Stationid  Maintanance
0     abc15           55
1     abc18           38
2       ark          100
3     abc15           56
4       ark          101
5     abc19           39
6     abc15           56

The problem is that shapiro does not work if there are fewer than 3 values, so a filter for groups with length > 2 was added:

import numpy as np
from scipy.stats import shapiro
df3 = df2[df2['Stationid'].isin(df1['Stationid'])]
print(df3)
  Stationid  Maintanance
0     abc15           55
1     abc18           38   <- group with length 1 (abc18)
3     abc15           56
6     abc15           56
# select the numeric column so only Maintanance is tested;
# return NaN for groups with fewer than 3 values
df = (df3.groupby('Stationid')[['Maintanance']]
         .apply(lambda x: pd.Series(shapiro(x), index=['W','P']) if len(x) > 2
                else pd.Series([np.nan, np.nan], index=['W','P']))
         .reset_index())
print(df)
  Stationid     W         P
0     abc15  0.75 -0.000001
1     abc18   NaN       NaN
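
The size guard above can also be pulled into a small named helper, which may be easier to read than an inline conditional lambda. This is only a sketch: `shapiro_or_nan`, `MIN_SIZE` and `records` are hypothetical names introduced here, and the frame is the filtered `df3` from the example rebuilt by hand.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

MIN_SIZE = 3  # shapiro needs at least three observations


def shapiro_or_nan(values):
    """Return W and P for one group, or NaN for both if the group is too small."""
    arr = np.asarray(values, dtype=float)
    if arr.size < MIN_SIZE:
        return {"W": np.nan, "P": np.nan}
    w, p = shapiro(arr)
    return {"W": float(w), "P": float(p)}


# Filtered frame from the example above, rebuilt by hand
df3 = pd.DataFrame({
    "Stationid": ["abc15", "abc18", "abc15", "abc15"],
    "Maintanance": [55, 38, 56, 56],
})

# One record per station; groupby sorts the station keys by default
records = [
    {"Stationid": station, **shapiro_or_nan(group)}
    for station, group in df3.groupby("Stationid")["Maintanance"]
]
result = pd.DataFrame(records, columns=["Stationid", "W", "P"])
print(result)
```

Stations with too few readings stay in the output with NaN, which keeps the result aligned with the list of stations that matched.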

Or filter those groups out:

from scipy.stats import shapiro
df3 = df2[df2['Stationid'].isin(df1['Stationid'])]
print (df3)
Stationid  Maintanance
0     abc15           55
1     abc18           38
3     abc15           56
6     abc15           56
df3 = df3[df3.groupby('Stationid')['Stationid'].transform('size') > 2]
print (df3)
Stationid  Maintanance
0     abc15           55
3     abc15           56
6     abc15           56
df = (df3.groupby('Stationid')[['Maintanance']]
         .apply(lambda x: pd.Series(shapiro(x), index=['W','P']))
         .reset_index())
print (df)
Stationid     W         P
0     abc15  0.75 -0.000001
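
To see what the `transform('size')` filter does, the sketch below prints the per-row group size and compares the result with `GroupBy.filter`, which is another way to drop small groups. The frame is the filtered `df3` from the example rebuilt by hand; the names `sizes`, `kept` and `same` are illustrative.

```python
import pandas as pd

# Filtered frame from the example above, keeping its original index
df3 = pd.DataFrame(
    {"Stationid": ["abc15", "abc18", "abc15", "abc15"],
     "Maintanance": [55, 38, 56, 56]},
    index=[0, 1, 3, 6],
)

# transform('size') broadcasts each group's row count back onto its rows
sizes = df3.groupby("Stationid")["Stationid"].transform("size")
print(sizes.tolist())

# Boolean mask: keep rows that belong to groups with more than 2 rows
kept = df3[sizes > 2]

# GroupBy.filter keeps or drops whole groups based on a predicate
same = df3.groupby("Stationid").filter(lambda g: len(g) > 2)

print(kept)
print(same)
```

Both approaches keep the three abc15 rows and drop abc18.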

There must be a cleaner way, but this gets the job done:

import pandas as pd
from scipy import stats

df1 = pd.DataFrame({'Stationid': [10, 11, 12, 13, 14, 15, 16, 17]})
df2 = pd.DataFrame({'Stationid': [10, 15, 21, 10, 22, 15, 10],
                    'Maintanance': [55, 38, 100, 56, 101, 39, 56]})
df2['Maintanance'] = df2['Maintanance'].astype(int)
df = (df1.merge(df2, on='Stationid', how='inner')
         .groupby('Stationid')
         .apply(stats.shapiro)
         .reset_index()
         .rename(columns={0: 'shapiro'}))
df = df.join(df['shapiro'].apply(lambda val: pd.Series(val, index=['W', 'P'])))
df[['Stationid', 'W', 'P']]
#   Stationid         W         P
#0         10  0.689908  0.004831
#1         15  0.747003  0.036196
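
This answer selects the matching stations with an inner merge where the earlier answer uses isin. As a rough check under the question's sample data, the sketch below compares the two; the names `merged`, `filtered`, `a` and `b` are illustrative. The two selections agree here because `df1` lists each Stationid once; with duplicate ids in `df1`, the merge would repeat rows while isin would not.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"Stationid": [10, 11, 12, 13, 14, 15, 16, 17]})
df2 = pd.DataFrame({
    "Stationid": [10, 15, 21, 10, 22, 15, 10],
    "Maintanance": [55, 38, 100, 56, 101, 39, 56],
})

# Inner merge: rows of df2 whose Stationid also exists in df1
merged = df1.merge(df2, on="Stationid", how="inner")

# Boolean mask: the same selection, keeping df2's row order and index
filtered = df2[df2["Stationid"].isin(df1["Stationid"])]

# Row order may differ between the two, so sort before comparing
keys = ["Stationid", "Maintanance"]
a = merged.sort_values(keys).reset_index(drop=True)
b = filtered.sort_values(keys).reset_index(drop=True)
print(a.equals(b))
```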
