This is the DataFrame that needs to be updated:
ID Doctor_Name ... Doctor_Office_Zipcode website
0 1 Wendy Mayer ... 10019 []
1 2 Michael S. Tseng ... 10017 []
This is the second DataFrame, from which the matching values should be taken:
name website
0 Dr. Wendy Mayer DDS ['mayerdentalcare.com']
1 Dr. Usha Gorti DDS ['westenddentalnyc.com']
2 Dr. Alexander J. Antonakos ['newyorkcosmeticdental.com', 'zocdoc.com']
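Problem: I need to update the website column in the first DataFrame wherever there is a match in the second DataFrame. The second DataFrame may contain several matching rows; I need to take the value from the first match and write it into the original DataFrame.
With the code below, I get NaN everywhere.
I want something like this: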
ID Doctor_Name ... Doctor_Office_Zipcode website
0 1 Wendy Mayer ... 10019 ['mayerdentalcare.com']
1 2 Michael S. Tseng ... 10017 ['newyorktseng.com', 'zocy.com']
Here is my code:
raw_data = pd.read_csv(orig)
web_data = pd.read_csv(website)
raw_data['website'] = np.empty((len(raw_data), 0)).tolist()
for x in raw_data['Doctor_Name']:
    print('now searching for {}', x)
    values = web_data[web_data['name'].str.contains(x, case=False, na=False, regex=False)]
    if values is not None and values.size > 0:
        y = values['website'].iloc[:1]
        raw_data.loc[raw_data['Doctor_Name'] == x, 'website'] = values.head(1)
You can use str.extract to pull the matching Doctor_Name out of the name column of df2:
df2['Doctor_Name'] = df2['name'].str.extract('('+'|'.join(df1['Doctor_Name'])+')')
print(df2)
name website Doctor_Name
0 Dr. Wendy Mayer DDS ['mayerdentalcare.com'] Wendy Mayer
1 Dr. Usha Gorti DDS ['westenddentalnyc.com'] NaN
2 Dr. Alexander J. Antonakos ['newyorkcosmeticdental.com', 'zocdoc.com'] NaN
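One caveat with joining the raw names into a pattern: characters such as the `.` in `Michael S. Tseng` are regex metacharacters, so the pattern can match more than intended. A minimal sketch of an escaped variant follows, assuming the same `Doctor_Name` and `name` columns as above. The small frames are stand-ins for the real data, and the `Dr. Michael SX Tseng` row with its `example.invalid` site is hypothetical, added only to show the difference.

```python
import re
import pandas as pd

# Cut-down stand-ins for df1 and df2 (illustrative data only)
df1 = pd.DataFrame({"ID": [1, 2],
                    "Doctor_Name": ["Wendy Mayer", "Michael S. Tseng"]})
df2 = pd.DataFrame({
    "name": ["Dr. Wendy Mayer DDS", "Dr. Michael SX Tseng", "Dr. Usha Gorti DDS"],
    "website": [["mayerdentalcare.com"], ["example.invalid"], ["westenddentalnyc.com"]],
})

# Unescaped: the '.' in "S." matches any character, so "SX" is accepted
raw_pattern = "(" + "|".join(df1["Doctor_Name"]) + ")"
unescaped = df2["name"].str.extract(raw_pattern, expand=False)

# Escaped: every name is treated as literal text
safe_pattern = "(" + "|".join(re.escape(n) for n in df1["Doctor_Name"]) + ")"
df2["Doctor_Name"] = df2["name"].str.extract(safe_pattern, expand=False)

print(unescaped.tolist())
print(df2["Doctor_Name"].tolist())
```

With `expand=False` a single capture group yields a Series rather than a one-column DataFrame, which makes the column assignment a little more explicit.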
Then merge on the matched names, keeping only the first match when there are several:
# De-duplicate df2 first so the merged rows stay one-to-one with df1
df1['website'] = df1.merge(df2.drop_duplicates('Doctor_Name'), on='Doctor_Name', how='left')['website_y'].to_numpy()
print(df1)
ID Doctor_Name Doctor_Office_Zipcode website
0 1 Wendy Mayer 10019 ['mayerdentalcare.com']
1 2 Michael S. Tseng 10017 NaN
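As for the NaN in the original loop: `values.head(1)` is a whole DataFrame, not a single value, and `.loc` most likely tries to align it by label with the target cells instead of storing the website. If keeping a loop is preferred, one option is to collect the first matching `website` for each doctor and assign the column once at the end. This is a sketch under assumptions: the inline frames stand in for the two CSV files, and unmatched doctors get an empty list, as the original initialisation suggests.

```python
import pandas as pd

# Inline stand-ins for the two CSV files (illustrative data only)
raw_data = pd.DataFrame({
    "ID": [1, 2],
    "Doctor_Name": ["Wendy Mayer", "Michael S. Tseng"],
})
web_data = pd.DataFrame({
    "name": ["Dr. Wendy Mayer DDS", "Dr. Usha Gorti DDS",
             "Dr. Alexander J. Antonakos"],
    "website": [["mayerdentalcare.com"], ["westenddentalnyc.com"],
                ["newyorkcosmeticdental.com", "zocdoc.com"]],
})

websites = []
for doctor in raw_data["Doctor_Name"]:
    # Literal, case-insensitive substring search, as in the original code
    matches = web_data[web_data["name"].str.contains(
        doctor, case=False, na=False, regex=False)]
    # Keep the website of the first matching row, or an empty list if none
    websites.append(matches["website"].iloc[0] if not matches.empty else [])

# Assign the whole column at once instead of writing cell by cell
raw_data["website"] = websites
print(raw_data)
```

Building the list first avoids label alignment entirely, since each collected value lands in the row it was computed for.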