Select the row with the most non-NA columns from each group of n consecutive rows

I have a df with a time index and several numeric columns, some of which contain missing values. For example:

timeindex   ColA    ColB    ColC
00:02:00      454    436    4334
00:04:00             653
00:06:00      3423   4354 
00:08:00      3432
00:10:00      2343
00:12:00     32432          23423

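I want to create a subset of the dataframe such that, for each group of 3 consecutive rows, it picks the row with the fewest missing values. So for the df above, the subset df would look like: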

timeindex   ColA    ColB    ColC
00:02:00      454    436    4334
00:12:00     32432          23423

Can you tell me how to do this?

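Use df.filter to select the columns, mark the non-null values with notnull, sum along axis 1 to count them per row, and finally use groupby.idxmax to pick the row with the highest count in each group of 3: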

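import numpy as np

# count non-null values per row, then take the best row in each block of 3
idx = (df.assign(count=df.filter(like="Col").notnull().sum(axis=1))
         .groupby(np.arange(len(df)) // 3)["count"].idxmax())
print(df.loc[idx])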
timeindex   ColA ColB   ColC
0  00:02:00    454  436   4334
5  00:12:00  32432       23423
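
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It assumes the blanks are real NaN values and that the times sit in an ordinary column, as in the output above. The helper name best_row_per_block and its parameters are made up for this illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data from the question; missing cells are assumed to be NaN
df = pd.DataFrame({
    "timeindex": ["00:02:00", "00:04:00", "00:06:00",
                  "00:08:00", "00:10:00", "00:12:00"],
    "ColA": [454, np.nan, 3423, 3432, 2343, 32432],
    "ColB": [436, 653, 4354, np.nan, np.nan, np.nan],
    "ColC": [4334, np.nan, np.nan, np.nan, np.nan, 23423],
})


def best_row_per_block(frame, n=3, like="Col"):
    """Hypothetical helper: for each block of n consecutive rows, keep the
    first row with the most non-null values in the matching columns."""
    counts = frame.filter(like=like).notnull().sum(axis=1)
    blocks = np.arange(len(frame)) // n  # 0,0,0,1,1,1,... block labels
    idx = counts.groupby(blocks).idxmax()  # index label of the best row per block
    return frame.loc[idx]


result = best_row_per_block(df)
print(result)
```

When several rows in a block tie, idxmax returns the first of them, so the earliest row wins.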

# split the dataframe into groups of 3
df_dict = {n: df.iloc[n:n+3, :]
           for n in range(0, len(df), 3)}
# find the index label of the row with the fewest nulls in each group
mask = []
for g in df_dict.values():
    mask.append(g.isnull().sum(axis=1).idxmin())
# keep only those rows (idxmin returns labels, so use .loc rather than .iloc)
df.loc[mask]

If the missing values are empty strings rather than NaN, replace this line:

mask.append((g.isnull().sum(axis=1)).idxmin())

with this line:

mask.append((g.eq('').sum(axis=1)).idxmin())
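
If you are not sure whether the blanks are NaN or empty strings, one option is to count both as missing. The sketch below is only an illustration: the small frame is invented for the example, and it assumes a blank is exactly '' with no surrounding whitespace.

```python
import numpy as np
import pandas as pd

# Made-up frame mixing empty strings and NaN as "missing" markers
df = pd.DataFrame({
    "ColA": ["454", "", "3423", "3432", "2343", "32432"],
    "ColB": ["436", "653", "4354", "", np.nan, ""],
    "ColC": ["4334", "", np.nan, "", "", "23423"],
})

# A cell counts as missing if it is NaN or an empty string
missing = (df.isnull() | df.eq("")).sum(axis=1)

# Index label of the row with the fewest missing cells in each block of 3
keep = missing.groupby(np.arange(len(df)) // 3).idxmin()

subset = df.loc[keep]
print(subset)
```

This avoids having to choose between isnull() and eq('') up front, at the cost of scanning the frame twice.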
