我有两个数据帧:df
ID url
111 vk.com/audio/12353546
222 twitter.com/lenad
333 avito.ru/phones
333 facebook.ru/chats
和另一个DF2
domain Maincategory Subcategory
vk.com Entertainment Social Network
twitter.com Entertainment Social Network
facebook.com Entertainment Social Network
avito.com Online shop Buys
vk.com/audio Entertainment Social Network Music
我需要执行以下操作:如果 df2.domain.values.tolist()
中的某个字符串出现在 df.url
中,则添加来自df2
的 df
列(Maincategory
和 Subcategory
)并得到:
ID url Maincategory Subcategory
111 vk.com/audio/12353546 Entertainment Social Network Music
222 twitter.com/lenad Entertainment Social Network
333 avito.ru/phones Online shop Buys
333 facebook.ru/chats Entertainment Social Network
我有一些方法可以做到这一点,但这需要太多时间。
mapping = dict(df2.set_index('domain')['Maincategory'])
def map_to_substring(x):
for key in mapping.keys():
if key in x:
return mapping[key]
return 'None'
df['Maincategory'] = df.url.apply(lambda x: map_to_substring(x))
mapping1 = dict(df2.set_index('domain')['Subcategory'])
def map_to_substring1(x):
for key in mapping1.keys():
if key in x:
return mapping1[key]
return 'None'
df['Subcategory'] = df.url.apply(lambda x: map_to_substring1(x))
我怎样才能改善这一点?
我的理解是你想从 df 获取 id 并将其放在 df2 上?
有很多方法可以做到这一点。 您可以合并 DF:
pd.merge(df2,df, on='url')
但是,如果您在"URL"上有重复的值,则可能会导致问题
您还可以映射
df2['id'] = df2['url'].map(lambda x: df.loc[df['url']==x, 'ID'].iloc[0])
希望有帮助,和平
import pandas as pd
df1 = pd.DataFrame({'ID': ['111', '222', '333', '333'],'url':['vk.com/audio/12353546','twitter.com/lenad','avito.ru/phones','facebook.ru/chats']})
print "----original df1----"
print df1
df2 = pd.DataFrame({
'Maincaregory':['Entertainment','Entertainment','Entertainment','Online shop','Entertainment'],
'Subcategory':['Social Network','Social Network','Social Network','Buys','Social Network Music'],
'domain': ['vk.com','twitter.com','facebook.com','avito.com','vk.com/audio']})
print "n----original df2----"
print df2
row = df1.shape[0]
dname =[]
for x in range(row):
name = df1.iloc[x]['url'].split(".")
if name[0] == 'vk':
cat = df1.iloc[x]['url'].split("/")
dname.append(cat[0]+"/"+cat[1])
else:
dname.append(name[0]+".com")
df1['domain']=dname
print "n----for merge df1----"
print df1
df3 = pd.merge(df1,df2, how='inner',on=['domain'])
df3= df3.drop('domain',1)
print "n----what you want----"
print df3
结果:
----original df1----
ID url
0 111 vk.com/audio/12353546
1 222 twitter.com/lenad
2 333 avito.ru/phones
3 333 facebook.ru/chats
----original df2----
Maincaregory Subcategory domain
0 Entertainment Social Network vk.com
1 Entertainment Social Network twitter.com
2 Entertainment Social Network facebook.com
3 Online shop Buys avito.com
4 Entertainment Social Network Music vk.com/audio
----for merge df1----
ID url domain
0 111 vk.com/audio/12353546 vk.com/audio
1 222 twitter.com/lenad twitter.com
2 333 avito.ru/phones avito.com
3 333 facebook.ru/chats facebook.com
----what you want----
ID url Maincaregory Subcategory
0 111 vk.com/audio/12353546 Entertainment Social Network Music
1 222 twitter.com/lenad Entertainment Social Network
2 333 avito.ru/phones Online shop Buys
3 333 facebook.ru/chats Entertainment Social Network