How to use a loop to count the number of NaN values



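My CSV file contains many stations, and I don't know how to use a loop to count the NaN values for each station. Here is what I have so far, counting them one station at a time. Can someone help me? Thanks in advance.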

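# Select the rows for a single station, then keep only the columns of interest
station1 = train_df[train_df['station'] == 28079004]
station1 = station1[['date', 'O_3']]
# Total rows minus non-null entries gives the NaN count per column
count_nan = len(station1) - station1.count()
print(count_nan)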

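I think you need to create an index from the station column with set_index, select the columns you want to check for missing values, and finally count them with sum for each station: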

import numpy as np
import pandas as pd

train_df = pd.DataFrame({'B': [4, 5, 4, 5, 5, 4],
                         'C': [7, 8, 9, 4, 2, 3],
                         'date': pd.date_range('2015-01-01', periods=6),
                         'O_3': [np.nan, 3, np.nan, 9, 2, np.nan],
                         'station': [28079004] * 2 + [28079005] * 4})
print(train_df)
   B  C       date  O_3   station
0  4  7 2015-01-01  NaN  28079004
1  5  8 2015-01-02  3.0  28079004
2  4  9 2015-01-03  NaN  28079005
3  5  4 2015-01-04  9.0  28079005
4  5  2 2015-01-05  2.0  28079005
5  4  3 2015-01-06  NaN  28079005
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df = train_df.set_index('station')[['date', 'O_3']].isnull().groupby(level=0).sum().astype(int)
print(df)
          date  O_3
station            
28079004     0    1
28079005     0    2

Another solution:

df = train_df[['date', 'O_3']].isnull().groupby(train_df['station']).sum().astype(int)
print(df)
          date  O_3
station            
28079004     0    1
28079005     0    2
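
Since the question asks specifically about a loop, here is a minimal sketch of that approach for comparison. It uses toy data shaped like the example above rather than the real CSV, and the names nan_counts and station_id are only illustrative. The grouped versions are usually preferable because they avoid filtering the frame once per station.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the example above (not the asker's real CSV).
train_df = pd.DataFrame({
    'date': pd.date_range('2015-01-01', periods=6),
    'O_3': [np.nan, 3, np.nan, 9, 2, np.nan],
    'station': [28079004] * 2 + [28079005] * 4,
})

# Explicit loop: filter the frame once per distinct station id
# and count the missing O_3 values in that subset.
nan_counts = {}
for station_id in train_df['station'].unique():
    rows = train_df[train_df['station'] == station_id]
    nan_counts[int(station_id)] = int(rows['O_3'].isna().sum())

print(nan_counts)  # {28079004: 1, 28079005: 2}
```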

jez has already answered, and that answer is probably the better one here, but this is what it looks like with groupby:

import pandas as pd
import numpy as np
np.random.seed(444)
n = 10
train_df = pd.DataFrame({
    'station': np.random.choice(np.arange(28079004, 28079008), size=n),
    'date': pd.date_range('2018-01-01', periods=n),
    'O_3': np.random.choice([np.nan, 1], size=n)
})
print(train_df)
s = train_df.groupby('station')['O_3'].apply(lambda x: x.isna().sum())
print(s)

This prints:

    station       date  O_3
0  28079007 2018-01-01  NaN
1  28079004 2018-01-02  1.0
2  28079007 2018-01-03  NaN
3  28079004 2018-01-04  NaN
4  28079007 2018-01-05  NaN
5  28079004 2018-01-06  1.0
6  28079007 2018-01-07  NaN
7  28079004 2018-01-08  NaN
8  28079006 2018-01-09  NaN
9  28079007 2018-01-10  1.0

And the output:

station
28079004    2
28079006    1
28079007    4
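
The same per-station count can also be obtained without a lambda, by reusing the "length minus count" idea from the question on a grouped column. This is only a sketch on a small hand-written frame (the values are made up for illustration); it relies on size() counting every row in a group while count() skips NaN.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are invented for this example.
train_df = pd.DataFrame({
    'station': [28079004, 28079004, 28079005, 28079005, 28079005],
    'O_3': [np.nan, 3.0, np.nan, np.nan, 2.0],
})

grouped = train_df.groupby('station')['O_3']
# size() counts all rows per group; count() counts only non-NaN rows,
# so the difference is the number of NaN values per station.
nan_per_station = grouped.size() - grouped.count()

print(nan_per_station.to_dict())  # {28079004: 1, 28079005: 2}
```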
