如果字符串 df2.domain 出现在 df.url 中,则合并 Pandas 数据帧



我有两个数据帧:df

ID   url
111   vk.com/audio/12353546
222   twitter.com/lenad
333   avito.ru/phones
333   facebook.ru/chats

和另一个DF2

domain   Maincategory   Subcategory
vk.com    Entertainment   Social Network
twitter.com    Entertainment   Social Network
facebook.com   Entertainment   Social Network
avito.com     Online shop      Buys
vk.com/audio    Entertainment   Social Network Music

我需要执行以下操作:如果 df2.domain.values.tolist() 中的某个字符串出现在 df.url 中,则添加来自df2df 列(MaincategorySubcategory )并得到:

ID   url    Maincategory    Subcategory
111   vk.com/audio/12353546    Entertainment   Social Network Music
222   twitter.com/lenad    Entertainment   Social Network
333   avito.ru/phones    Online shop      Buys
333   facebook.ru/chats    Entertainment   Social Network

我有一些方法可以做到这一点,但这需要太多时间。

mapping = dict(df2.set_index('domain')['Maincategory'])
def map_to_substring(x):
    for key in mapping.keys():
        if key in x:
            return mapping[key]
    return 'None'
df['Maincategory'] = df.url.apply(lambda x: map_to_substring(x))
mapping1 = dict(df2.set_index('domain')['Subcategory'])
def map_to_substring1(x):
    for key in mapping1.keys():
        if key in x:
            return mapping1[key]
    return 'None'
df['Subcategory'] = df.url.apply(lambda x: map_to_substring1(x))

我怎样才能改善这一点?

我的理解是你想从 df 获取 id 并将其放在 df2 上?

有很多方法可以做到这一点。 您可以合并 DF:

pd.merge(df2,df, on='url')

但是,如果您在"URL"上有重复的值,则可能会导致问题

您还可以映射

df2['id'] = df2['url'].map(lambda x: df.loc[df['url']==x, 'ID'].iloc[0])

希望有帮助,和平

import pandas as pd
df1 = pd.DataFrame({'ID': ['111', '222', '333', '333'],'url':['vk.com/audio/12353546','twitter.com/lenad','avito.ru/phones','facebook.ru/chats']})
print "----original df1----"
print df1
df2 = pd.DataFrame({
                'Maincaregory':['Entertainment','Entertainment','Entertainment','Online shop','Entertainment'],
                'Subcategory':['Social Network','Social Network','Social Network','Buys','Social Network Music'],
               'domain':  ['vk.com','twitter.com','facebook.com','avito.com','vk.com/audio']})
print "n----original df2----"                    
print df2
row = df1.shape[0]
dname =[]
for x in range(row):
    name = df1.iloc[x]['url'].split(".")
    if name[0] == 'vk':
        cat = df1.iloc[x]['url'].split("/")
        dname.append(cat[0]+"/"+cat[1])
    else:
        dname.append(name[0]+".com")
df1['domain']=dname
print "n----for merge df1----"
print df1
df3 = pd.merge(df1,df2, how='inner',on=['domain'])
df3= df3.drop('domain',1)
print "n----what you want----"
print df3

结果:

----original df1----
    ID                    url
0  111  vk.com/audio/12353546
1  222      twitter.com/lenad
2  333        avito.ru/phones
3  333      facebook.ru/chats
----original df2----
    Maincaregory           Subcategory        domain
0  Entertainment        Social Network        vk.com
1  Entertainment        Social Network   twitter.com
2  Entertainment        Social Network  facebook.com
3    Online shop                  Buys     avito.com
4  Entertainment  Social Network Music  vk.com/audio
----for merge df1----
    ID                    url        domain
0  111  vk.com/audio/12353546  vk.com/audio
1  222      twitter.com/lenad   twitter.com
2  333        avito.ru/phones     avito.com
3  333      facebook.ru/chats  facebook.com
----what you want----
    ID                    url   Maincaregory           Subcategory
0  111  vk.com/audio/12353546  Entertainment  Social Network Music
1  222      twitter.com/lenad  Entertainment        Social Network
2  333        avito.ru/phones    Online shop                  Buys
3  333      facebook.ru/chats  Entertainment        Social Network

最新更新