假设两个数据帧:
import pandas as pd
import numpy as np
d1 = {}
d2 = {}
np.random.seed(5)
for col in list("ABCDEF"):
d1[col] = np.random.randn(12)
d2[col+'2'] = np.random.random_integers(0,100, 12)
t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")
dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)
我想按dat2中的行对dat1的行进行排序,并从dat1中提取有序数据的子集。下面是一个示例,其中每行的前5个值是从dat1中提取的。例如,使用:
A B C D E F
2015-01-31 0.441227 -0.817548 -0.723062 -0.205149 0.230843 -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419 -0.232108 -0.042166 0.42985
A2 B2 C2 D2 E2 F2
2015-01-31 47 47 82 66 64 40
2015-02-28 30 16 60 57 77 74
我会得到:
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 A D C F E
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419 0.429850 -0.042166
这是我的解决方案。最大的问题是,该代码不处理dat1或dat2中的NA值,这是一个需要解决的巨大问题。
def sortByAnthr(X,Y):
return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])])
def r_selectr(dat2,dat1, n):
ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1).iloc[:,-n:]
ordr_cols.columns = list(range(0,n)) #assign column names
ordr_r = ordr_cols.apply(lambda x: dat1.ix[x.name,x.values].tolist(),axis=1)
return([ordr_cols, ordr_r])
ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
ordr_cols.iloc[:2,:]
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 A D C F E
ordr_r.iloc[:2,:]
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 -0.330870 -0.232108 -0.042419 0.429850 -0.042166
例如,对于NA,以上内容无法正确排序:
dat1.iloc[[1,2],[1,3,5]]=np.nan
dat2.iloc[[1,4],[2,4,5]]=np.nan
这是我的解决方案。现在,它通过将每行的dat1和dat2中的非NA值的索引相交来处理NA。然而,这在应用程序中引入了一个问题,即应用程序需要为每一行提供相同大小的输出。填充无法/未排序项目的函数是fillVacuum。
def fillVacuum(toFill,MatchLengthOf):
if len(toFill)<len(MatchLengthOf):
[toFill.insert(i, np.nan) for i in range(len(MatchLengthOf)-len(toFill))]
return()
def sortByAnthr(X,Y,Xindex):
#intersect non-na column indexes between two datasets
idx = np.intersect1d(X.notnull().nonzero()[0],Y.notnull().nonzero()[0])
#order the subset of X.index by Y
ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y[idx]), key=lambda pair: pair[1])]
#due to molding that'll happen later in apply, it is necessary to fill removed indexes
fillVacuum(ordrX, Xindex)
return(ordrX)
def OrderRow(row,df):
ordrd_row = df.ix[row.dropna().name,row.dropna().values].tolist()
fillVacuum(ordrd_row, row)
return(ordrd_row)
def r_selectr(dat2,dat1, n):
ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1).iloc[:,-n:]
ordr_cols.columns = list(range(0,n)) #assign interpretable column names
ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1)
return([ordr_cols, ordr_r])
ordr_cols,ordr_r = r_selectr(dat2,dat1,5)
这些函数产生以下结果:
dat1.iloc[:2,:]
A B C D E F
2015-01-31 0.441227 -0.817548 -0.723062 -0.205149 0.230843 -0.253954
2015-02-28 NaN NaN -0.042419 -0.232108 NaN 0.429850
dat2.iloc[:2,:]
A2 B2 C2 D2 E2 F2
2015-01-31 47 47 82 66 64 40
2015-02-28 NaN 16 60 57 77 NaN
ordr_cols.iloc[:2,:]
0 1 2 3 4
2015-01-31 A B E D C
2015-02-28 NaN NaN NaN D C
ordr_r.iloc[:2,:]
0 1 2 3 4
2015-01-31 0.441227 -0.817548 0.230843 -0.205149 -0.723062
2015-02-28 NaN NaN NaN -0.232108 -0.042419