我有一个数据框,我试图在三个类别中使用马氏距离获得最接近的匹配,如:
from io import StringIO
from sklearn import metrics
import pandas as pd
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
d = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
metric='mahalanobis')
print(df)
print(d)
其中pid
列是唯一标识符。
我需要做的是接受pairwise_distances
调用返回的ndarray
并更新原始数据帧,因此每行都有其最接近的N个匹配列表(因此pid
0可能有一个有序列表,按距离为2,1,5,3,4(或实际是什么),但我完全困惑这是如何在python中完成的。
from io import StringIO
from sklearn import metrics
stringdata = StringIO(u"""pid,ratio1,pct1,rsp
0,2.9,26.7,95.073615
1,11.6,29.6,96.963660
2,0.7,37.9,97.750412
3,2.7,27.9,102.750412
4,1.2,19.9,93.750412
5,0.2,22.1,96.750412
""")
stats = ['ratio1','pct1','rsp']
df = pd.read_csv(stringdata)
dist = metrics.pairwise.pairwise_distances(df[stats].as_matrix(),
metric='mahalanobis')
dist = pd.DataFrame(dist)
ranks = np.argsort(dist, axis=1)
df["rankcol"] = ranks.apply(lambda row: ','.join(map(str, row)), axis=1)
df