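I'm doing data preprocessing and handling missing values. I want to set a threshold on columns: for any single column, if its count of values is less than 50, drop that column.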
import numpy as np
import pandas as pd
from pandas import DataFrame
df = pd.read_csv('cbc_updated_1.csv')
Then I get the count of values for each column:
a = df.count(axis = 0)
print(a)
This gives the column names along with their counts:
IP ABN(RBC)RET Abn Scattergram 46
IP ABN(RBC)Reticulocytosis 23
IP ABN(PLT)Thrombocytosis 47
IP ABN(PLT)PLT Abn Scattergram 0
IP SUS(WBC)Blasts? 57
IP SUS(WBC)Abn Lympho? 10
IP SUS(WBC)Left Shift? 190
IP SUS(WBC)Atypical Lympho? 126
IP SUS(RBC)RBC Agglutination? 0
IP SUS(RBC)Turbidity/HGB Interf? 9
IP SUS(RBC)Iron Deficiency? 27
IP SUS(RBC)HGB Defect? 3
IP SUS(RBC)Fragments? 168
IP SUS(PLT)PLT Clumps? 73
dtype: int64
Next I want to loop over the result above to check my threshold condition, but I can't get it to work. I tried the code below:
for i in a:
    if i < 50:
        print(i)
As a result I only get the values, not the column names. I need both.
46
23
47
0
10
0
9
27
3
How can I get this?
Try this:
>>> a[a < 50]
IP ABN(RBC)RET Abn Scattergram 46
IP ABN(RBC)Reticulocytosis 23
IP ABN(PLT)Thrombocytosis 47
IP ABN(PLT)PLT Abn Scattergram 0
IP SUS(WBC)Abn Lympho? 10
IP SUS(RBC)RBC Agglutination? 0
IP SUS(RBC)Turbidity/HGB Interf? 9
IP SUS(RBC)Iron Deficiency? 27
IP SUS(RBC)HGB Defect? 3
dtype: int64
>>>
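Since the stated goal is to drop the columns that fall below the threshold, the same boolean mask can feed `DataFrame.drop` directly. The sketch below uses a small made-up DataFrame in place of `cbc_updated_1.csv`; the column names `mostly_filled`, `sparse`, and `empty`, and the variable names `threshold`, `to_drop`, `df_clean`, and `df_clean_alt`, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV in the question (60 rows, 3 columns)
df = pd.DataFrame({
    "mostly_filled": [1.0] * 60,
    "sparse": [1.0] * 10 + [np.nan] * 50,
    "empty": [np.nan] * 60,
})

threshold = 50
a = df.count(axis=0)                # non-null count per column
to_drop = a[a < threshold].index    # labels of columns below the threshold
df_clean = df.drop(columns=to_drop)

# Alternative: keep only columns with at least `threshold` non-null values
df_clean_alt = df.dropna(axis=1, thresh=threshold)

print(list(to_drop))
print(list(df_clean.columns))
```

Both approaches should leave the same columns here; `dropna(axis=1, thresh=...)` just skips the intermediate count Series.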
If you want to loop:
for x in a[a < 50].reset_index().to_numpy().tolist():
    print(*x)
IP ABN(RBC)RET Abn Scattergram 46
IP ABN(RBC)Reticulocytosis 23
IP ABN(PLT)Thrombocytosis 47
IP ABN(PLT)PLT Abn Scattergram 0
IP SUS(WBC)Abn Lympho? 10
IP SUS(RBC)RBC Agglutination? 0
IP SUS(RBC)Turbidity/HGB Interf? 9
IP SUS(RBC)Iron Deficiency? 27
IP SUS(RBC)HGB Defect? 3
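As a possibly simpler variant of the loop, `Series.items()` yields `(label, value)` pairs, so the name and the count arrive together without converting to a list first. This is a minimal sketch; the Series below is a made-up stand-in for `a = df.count(axis=0)`, and `col_a`, `col_b`, `col_c`, `threshold`, and `below` are illustrative names, not from the original data.

```python
import pandas as pd

# Small hypothetical Series standing in for `a = df.count(axis=0)`
a = pd.Series({"col_a": 46, "col_b": 57, "col_c": 0})

threshold = 50
below = []
# Iterating a Series directly yields only values; .items() yields (label, value)
for name, count in a[a < threshold].items():
    print(name, count)
    below.append((name, count))
```

The original attempt printed only values because iterating over a Series walks its values, not its index.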