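I want to find unique pairs for a test group, meaning that each individual in the control group should be selected only once. I have gender, age, and education to match on. I split the data into groups by gender and education, since both are binary categories. Within each group, I then want to find the best match on age for a given test individual, hence the KNN approach with 1 nearest neighbor. The dummy data I am using can be found here.
Here is the initialization and the split: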
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
TestGroup = pd.read_csv('KNN_DummyData1.csv', names = ['Gender', 'Age', 'Education'])
ControlGroup = pd.read_csv('KNN_DummyData2.csv', names = ['Gender', 'Age', 'Education'])
#### Split TestGroup and ControlGroup into males and females, high and low education
# Chain the steps so each result is a new DataFrame rather than an in-place edit of a slice
Males_highEd = (
    TestGroup.loc[(TestGroup['Gender'] == 1) & (TestGroup['Education'] == 1)]
    .drop(columns=['Gender', 'Education'])
    .reset_index(drop=True)
)
Males_Ctrl_highEd = (
    ControlGroup.loc[(ControlGroup['Gender'] == 1) & (ControlGroup['Education'] == 1)]
    .drop(columns=['Gender', 'Education'])
    .reset_index(drop=True)
)
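This part is the actual pairing. I fit the model on the control group and fill an empty DataFrame with values from the control group. After matching a control, I try to remove it from the original DataFrame (Males_Ctrl_highEd):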
Matched_Males_Ctrl_highEd = pd.DataFrame().reindex_like(Males_highEd)
nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(Males_Ctrl_highEd)
for i in range(len(Males_highEd)):
    distances, indices = nbrs.kneighbors(Males_highEd[i:i+1])
    Matched_Males_Ctrl_highEd.loc[0].iat[i] = Males_Ctrl_highEd.loc[indices[0]]
    print(f"{i} controls of {len(Males_highEd)} tests found")
    Males_Ctrl_highEd = Males_Ctrl_highEd.drop(labels=indices[0], axis=0)
Currently I get the following error on line 6:
ValueError: setting an array element with a sequence.
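I have tried various ways of assigning a control to the matched control group, but I cannot seem to copy a row from the original DataFrame into the empty one.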
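One possible way around the error, as a minimal sketch: `kneighbors` returns positions in the array the model was fitted on, so the matched row can be fetched with `.iloc` and collected in a list instead of being written into a single cell of an empty DataFrame. The helper name `match_unique` and the toy ages are made up for illustration, and the model is refitted after every removal, which is simple but slow.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def match_unique(test, ctrl):
    """Greedy 1-NN matching; each control row is used at most once.

    Hypothetical helper: `test` and `ctrl` are DataFrames with the same
    numeric columns, and `ctrl` has at least as many rows as `test`.
    """
    remaining = ctrl.reset_index(drop=True)
    matched = []
    for i in range(len(test)):
        # Refit on the controls that are still available.
        nbrs = NearestNeighbors(n_neighbors=1).fit(remaining.to_numpy())
        _, indices = nbrs.kneighbors(test.iloc[[i]].to_numpy())
        pos = int(indices[0, 0])  # a position in `remaining`, not a label
        matched.append(remaining.iloc[pos])
        # Remove the chosen control so it cannot be picked again.
        remaining = remaining.drop(remaining.index[pos]).reset_index(drop=True)
    return pd.DataFrame(matched).reset_index(drop=True)


# Toy data, invented for this example.
test = pd.DataFrame({"Age": [30, 31, 45]})
ctrl = pd.DataFrame({"Age": [29, 30, 44, 60]})
matched = match_unique(test, ctrl)
print(matched["Age"].tolist())
```

Row `i` of the result is the control matched to row `i` of the test frame, so the two can be placed side by side with `pd.concat(..., axis=1)` if needed.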
If it is of any help, I have a working implementation in MATLAB (but I need it in Python as well):
ControlGroup = Data;
Idx = NaN(length(Data),1);
for i=1:length(Data)
    Idx(i,1) = knnsearch(Data2,Data(i,:),'distance','seuclidean');
    ControlGroup(i,:) = Data2(Idx(i),:);
    Data2(Idx(i),:) = [];
end
If you have any ideas or comments on a different implementation that does the same thing, I am all ears.
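One different implementation, offered as a sketch rather than a drop-in replacement: treat the pairing as an assignment problem and let SciPy's `linear_sum_assignment` pick distinct controls that minimise the total age difference. Note that this swaps the greedy nearest-neighbour search for an optimal assignment, so the individual pairs can differ from what the KNN loop would give. The toy ages are invented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy data, invented for this example.
test_age = np.array([30, 32, 45])
ctrl_age = np.array([29, 31, 44, 60])

# cost[i, j] = absolute age difference between test i and control j
cost = np.abs(test_age[:, None] - ctrl_age[None, :])

# Each test row is assigned exactly one control column; no column is reused.
test_idx, ctrl_idx = linear_sum_assignment(cost)
total = int(cost[test_idx, ctrl_idx].sum())
print(ctrl_idx, total)
```

The cost matrix has one entry per test/control pair, so memory grows with the product of the two group sizes; splitting by gender and education first keeps each matrix small.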
I ended up using only age in the KNN matching (and matching the binary features manually), with the following solution:
# Ask for enough neighbors to cover the most common age in the test group
neededNeighbors = max(TestGroup["Age"].value_counts()) + 1
nn = NearestNeighbors(n_neighbors=neededNeighbors, algorithm="ball_tree", metric="euclidean").fit(ControlGroup["Age"].to_numpy().reshape(-1, 1))
TestGroup.sort_values(by="Age", inplace=True)
distances, indices = nn.kneighbors(TestGroup["Age"].to_numpy().reshape(-1, 1))
min_age = min(TestGroup["Age"])
max_age = max(TestGroup["Age"])
ages = list(range(min_age, max_age + 1))
# One row of candidate control indices per age (assumes every age in the range occurs in TestGroup)
idx = pd.DataFrame(np.unique(indices, axis=0), index=ages)
# Per-age counter of how many controls have been handed out so far
cntr = pd.DataFrame(index=ages, columns=["cntrs"])
cntr["cntrs"] = 0
matchedControlGroup = pd.DataFrame().reindex_like(TestGroup)
matchedID = pd.DataFrame(np.full_like(np.arange(len(matchedControlGroup)), np.nan, dtype=np.double))
for i in range(len(TestGroup)):
    if TestGroup["Age"].loc[i] in cntr.index:
        x = TestGroup["Age"].loc[i]
        # Take the next unused candidate for this age
        matchedControlGroup.loc[i] = ControlGroup.loc[idx.loc[x, cntr.loc[x, "cntrs"]]]
        cntr.loc[x, "cntrs"] += 1  # advance the counter for this age
        matchedID.loc[i] = TestGroup["ID"].loc[i]  # requires an "ID" column in TestGroup
matchedID["ID_Match"] = matchedID
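This way I can look up how many individuals are needed in each age group and step through each age group to get the next-best match for each individual. It means that the first individual in each age group gets the better match and, depending on the number of available controls, the same control may be picked more than once.
I also wrote an implementation in which that does not happen. However, I could not find a way to avoid refitting the KNN model every time a match was found, which made that implementation very slow.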
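For a single numeric feature such as age, one way to avoid the refit might be to skip the KNN model entirely: keep a boolean mask of the controls already used and take the `argmin` of the absolute age difference over the rest. This is only a sketch under that assumption; the function name `greedy_unique_match` and the toy ages are hypothetical.

```python
import numpy as np


def greedy_unique_match(test_age, ctrl_age):
    """Return, for each test age, the position of a distinct control."""
    test_age = np.asarray(test_age, dtype=float)
    ctrl_age = np.asarray(ctrl_age, dtype=float)
    if len(ctrl_age) < len(test_age):
        raise ValueError("need at least as many controls as tests")
    used = np.zeros(len(ctrl_age), dtype=bool)
    match = np.empty(len(test_age), dtype=int)
    for i, age in enumerate(test_age):
        dist = np.abs(ctrl_age - age)
        dist[used] = np.inf  # controls already taken can no longer win
        j = int(np.argmin(dist))
        used[j] = True
        match[i] = j
    return match


# Toy data, invented for this example.
match = greedy_unique_match([30, 31, 45], [29, 30, 44, 60])
print(match)
```

The returned positions can be passed to `ControlGroup.iloc[...]` to pull the matched rows. Each step is one vectorised pass over the controls, so nothing is refitted, but as in the loop-based versions the earlier test individuals still get first pick.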