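I'm trying to drop columns that contain more than 3 (or, in general, k) consecutive NaNs. I'm new to pandas, so any help is appreciated.
The data looks like this: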
200 2000 7632
123 NaN 1232
98 NaN 12324
4231 NaN 673
87 76 1000
You can do something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['col1'] = [np.nan, 1, 2, np.nan, 3, np.nan, np.nan]
df['col2'] = [np.nan, np.nan, np.nan, np.nan, 1, 2, 3]
df['col3'] = [1, 2, 3, 4, np.nan, np.nan, np.nan]
print(df)
   col1  col2  col3
0   NaN   NaN   1.0
1   1.0   NaN   2.0
2   2.0   NaN   3.0
3   NaN   NaN   4.0
4   3.0   1.0   NaN
5   NaN   2.0   NaN
6   NaN   3.0   NaN
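# notna().cumsum() is constant along each run of NaNs, so it labels the runs.
# Keeping the label only on NaN rows makes value_counts() return run lengths.
nan_run_id = df.notna().cumsum().where(df.isna())
df_filtered = df.loc[:, (nan_run_id.apply(lambda x: x.value_counts()).fillna(0) < 3).all()]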
print(df_filtered)
   col1
0   NaN
1   1.0
2   2.0
3   NaN
4   3.0
5   NaN
6   NaN
Note: this drops every column that has 3 or more consecutive NaNs. To drop only columns with 4 or more, replace the 3 with 4.
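To make the threshold a parameter instead of a hard-coded 3, the same run-labelling idea can be wrapped in a small helper. This is only a sketch: `max_nan_run` and `drop_cols_with_nan_run` are hypothetical names chosen for illustration, not pandas functions.

```python
import numpy as np
import pandas as pd


def max_nan_run(s: pd.Series) -> int:
    """Length of the longest run of consecutive NaNs in s (hypothetical helper)."""
    isna = s.isna()
    # notna().cumsum() stays constant across a run of NaNs, so it labels runs.
    run_id = s.notna().cumsum()
    # Sum the NaN flags inside each run; non-NaN rows contribute 0.
    runs = isna.groupby(run_id).sum()
    return int(runs.max()) if len(runs) else 0


def drop_cols_with_nan_run(df: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    """Drop columns containing k or more consecutive NaNs (hypothetical helper)."""
    longest = df.apply(max_nan_run)
    return df.loc[:, longest < k]


df = pd.DataFrame({
    "col1": [np.nan, 1, 2, np.nan, 3, np.nan, np.nan],
    "col2": [np.nan, np.nan, np.nan, np.nan, 1, 2, 3],
    "col3": [1, 2, 3, 4, np.nan, np.nan, np.nan],
})
print(drop_cols_with_nan_run(df, k=3).columns.tolist())  # ['col1']
print(drop_cols_with_nan_run(df, k=4).columns.tolist())  # ['col1', 'col3']
```

With k=4, col3 survives because its longest run is exactly 3, while col2 (a run of 4) is still dropped.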
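Maybe not the most efficient solution, but it is easy to implement with more-itertools: for each column, try to locate the first window of 3 NaNs, and if one is found, add that column to the list of columns to drop.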
import pandas as pd
import more_itertools as mit

df = pd.DataFrame({'col1': [1,2,3,4], 'col2': [None,None,5,None], 'col3': [6,None,None,None]})
to_drop = []
for c in df:
    try:
        # locate() yields the start index of each window of 3 values that are all NaN
        next(mit.locate(df[c].isna(), lambda *x: all(x), 3))
        to_drop.append(c)
    except StopIteration:
        # no such window in this column, so keep it
        pass
df = df.drop(columns=to_drop)
print(df)
Result:
   col1  col2
0     1   NaN
1     2   NaN
2     3   5.0
3     4   NaN
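If adding a third-party dependency is not an option, a similar per-column check can be sketched with `itertools.groupby` from the standard library. Note that this swaps in a different technique from the `locate` call above, and `has_nan_run` is a hypothetical helper name.

```python
from itertools import groupby

import pandas as pd


def has_nan_run(s: pd.Series, k: int = 3) -> bool:
    """True if s contains at least k consecutive NaNs (hypothetical helper)."""
    # groupby yields one (flag, run) pair per stretch of equal isna() flags.
    return any(
        is_nan and sum(1 for _ in run) >= k
        for is_nan, run in groupby(s.isna())
    )


df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [None, None, 5, None],
    "col3": [6, None, None, None],
})
to_drop = [c for c in df.columns if has_nan_run(df[c], k=3)]
print(to_drop)  # ['col3']
```

The resulting list can be passed to `df.drop(columns=to_drop)` in the same way as before.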
You can use this simple example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1,2,3,4], 'col2': [None,None,None,5], 'col3': [6, None, None, 5]})
df
Out:
   col1  col2  col3
0     1   NaN   6.0
1     2   NaN   NaN
2     3   NaN   NaN
3     4   5.0   5.0
Edit
To drop columns with consecutive NaNs:
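bad_cols = []
for col in list(df):
    # slide a window of 3 rows down the column
    for i in range(df.shape[0] - 2):
        w = df[col].iloc[i:i + 3]
        if w.isna().all():
            # 3 consecutive NaNs found: mark the column and stop scanning it
            bad_cols.append(col)
            break
df.drop(bad_cols, axis=1, inplace=True)
df
Out: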
   col1  col3
0     1   6.0
1     2   NaN
2     3   NaN
3     4   5.0
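The explicit loop over 3-row windows can also be written with a rolling window, which avoids the nested Python loops. A minimal sketch, assuming a window length `k` (a name introduced here for illustration):

```python
import pandas as pd

k = 3  # number of consecutive NaNs that triggers a drop (illustrative name)

df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [None, None, None, 5],
    "col3": [6, None, None, 5],
})

# Count the NaNs in every window of k consecutive rows, per column.
nan_counts = df.isna().astype(int).rolling(window=k).sum()
# A column is bad if any window consists entirely of NaNs.
bad_cols = nan_counts.eq(k).any()
result = df.loc[:, ~bad_cols]
print(result.columns.tolist())  # ['col1', 'col3']
```

Unlike `drop(..., inplace=True)`, this returns a new DataFrame and leaves `df` unchanged.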