My DataFrame looks like this:
ID VALUE1 VALUE2 VALUE3
1 NaN [ab,c] Good
1 google [ab,c] Good
2 NaN [ab,c1] NaN
2 First [ab,c1] Good1
2 First [ab,c1]
3 NaN [ab,c] Good
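The requirement: ID is the key. I have 3 rows with ID 2, so I need to merge the rows that share an ID into a single row in which every column has a valid value (not null and not blank).
My expected output is: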
ID VALUE1 VALUE2 VALUE3
1 google [ab,c] Good
2 First [ab,c1] Good1
3 NaN [ab,c] Good
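Is there a pandas function that does this, or should I split the data into two or more DataFrames and merge them based on NaN/blank values? Thanks for your help.
Micheal G has a more elegant solution above. Here is my more time-consuming and amateurish approach: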
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [1, 1, 2, 2, 2, 3],
                   "V1": [np.nan, 'google', np.nan, 'First', 'First', np.nan],
                   "V2": [['ab', 'c'], ['ab', 'c'], ['ab', 'c1'], ['ab', 'c1'], ['ab', 'c1'], ['ab', 'c']],
                   "V3": ['Good', 'Good', np.nan, np.nan, 'Good1', 'Good']
                   })
uniq = df.ID.unique()  # Get the unique values in ID
df = df.set_index(['ID'])  # We want, for each ID, the row with the fewest NaNs.
# Indexing by ID makes the lookups below faster and easier.
rows = []  # DataFrame.append was removed in pandas 2.0, so collect the rows and build the frame once
for i in uniq:  # Loop once per unique value in column ID
    temp = df.loc[i]
    if isinstance(temp, pd.Series):  # Only one row has this ID: keep that row as is
        rows.append(temp)
    else:
        # Number of non-NaN values in each row of this ID, as a list
        non_nan_counts = temp.count(axis=1).tolist()
        # Let's break this down:
        # max(non_nan_counts) is the highest non-NaN count
        # non_nan_counts.index(...) is the position of the first row with that count
        # temp.iloc[...] fetches that row, which is then added to rows
        rows.append(temp.iloc[non_nan_counts.index(max(non_nan_counts))])
newDf = pd.DataFrame(rows)
print(newDf)
which should return:
V1 V2 V3
1 google [ab, c] Good
2 First [ab, c1] Good1
3 NaN [ab, c] Good
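The same "keep the most complete row per ID" idea can probably be written without the explicit loop. The following is only a sketch under the assumption that the index labels are unique (as with the default RangeIndex); the names `non_nan_counts`, `best_rows` and `result` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3],
    "V1": [np.nan, "google", np.nan, "First", "First", np.nan],
    "V2": [["ab", "c"], ["ab", "c"], ["ab", "c1"], ["ab", "c1"], ["ab", "c1"], ["ab", "c"]],
    "V3": ["Good", "Good", np.nan, np.nan, "Good1", "Good"],
})

# Number of non-NaN cells in each row.
non_nan_counts = df.notna().sum(axis=1)

# For each ID, the index label of the first row with the highest count.
best_rows = non_nan_counts.groupby(df["ID"]).idxmax()

# Assumption: index labels are unique, so each label selects exactly one row.
result = df.loc[best_rows].set_index("ID")
print(result)
```

Like the loop, this keeps one existing row per ID; it does not combine values taken from different rows.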
Note that I capitalized Google.
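import pandas as pd
import numpy as np

data = {'ID': [1, 1, 2, 2, 2, 3],
        'VALUE1': ['NaN', 'Google', 'NaN', 'First', 'First', 'NaN'],
        'VALUE2': ['abc', 'abc', 'abc1', 'abc1', 'abc1', 'abc'],
        'VALUE3': ['Good', 'Good', 'NaN', 'Good1', '0', 'Good']}
df = pd.DataFrame(data)
# Turn the 'NaN' strings into real NaN, then fill them with the placeholder 'zero'
# (np.nan rather than np.NaN: the np.NaN alias was removed in NumPy 2.0)
df_ = df.replace('NaN', np.nan).fillna('zero')
# Capitalized strings sort before the lowercase 'zero', so rows with a real VALUE1 come first
df2 = df_.sort_values(['VALUE1', 'ID'])
mask = df2.ID.duplicated()  # True for every row after the first one seen for each ID
print(df2[~mask].sort_index())  # Keep the first row per ID, back in the original row order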
Output:
ID VALUE1 VALUE2 VALUE3
1 1 Google abc Good
3 2 First abc1 Good1
5 3 zero abc Good
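Finally, note that the tilde character (~) in the mask is essential: it inverts the mask, so the first row for each ID is kept and the duplicates are dropped.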
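Both answers keep one existing row per ID. If each column should instead take its first valid value within the ID group, which is how the question words the requirement, a `groupby` followed by `first()` may be enough. This is a minimal sketch, not a definitive solution: it assumes that blank cells are empty or whitespace-only strings and, for simplicity, that VALUE2 holds plain strings. The names `cleaned` and `merged` are illustrative.

```python
import numpy as np
import pandas as pd

# The question's data; VALUE2 is stored as plain strings here (an assumption).
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3],
    "VALUE1": [np.nan, "google", np.nan, "First", "First", np.nan],
    "VALUE2": ["[ab,c]", "[ab,c]", "[ab,c1]", "[ab,c1]", "[ab,c1]", "[ab,c]"],
    "VALUE3": ["Good", "Good", np.nan, "Good1", " ", "Good"],
})

# Treat empty or whitespace-only strings as missing values.
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

# GroupBy.first() takes, per column, the first non-null value in each group.
merged = cleaned.groupby("ID", as_index=False).first()
print(merged)
```

A column that has no valid value anywhere in a group, such as VALUE1 for ID 3, stays missing in the result.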