First, let me show you my code:
import pandas as pd
df = pd.read_pickle('domain_list.pkl')
cc_base = pd.read_pickle('cc_base.pkl')
Here is what a single record looks like in each DataFrame:
print(cc_base.iloc[100])
Domain kinresto.com
Phone 12395550108
Alternative phone 2 13195550115.0
Alternative phone 3 12525550126.0
Alternative phone 4 13075550133.0
Alternative phone 5 NaN
print(cc_base.iloc[102])
Domain msg.com
Phone 13075550133.0
Alternative phone 2 13195550115.0
Alternative phone 3 12395550108.0
Alternative phone 4 NaN
Alternative phone 5 NaN
And a row from the second DataFrame:
print(df.iloc[44556])
Phone 12395550108
counts 2
Domain_list ["['msg.com'", " 'kinresto.com']"]
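For each domain in Domain_list, I want to check whether the phone number in df['Phone'] is that domain's main number, i.e. the value in cc_base['Phone'].
A row in the resulting DataFrame should look like this: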
Phone 12395550108
counts 2
Domain_list ["['msg.com'", " 'kinresto.com']"]
Main_phone_for_domain ["kinresto.com"]
Alternative_for ["msg.com"]
I know how ugly the domain list is. It can be cleaned with a chain of replacements like:
.replace("'", "").replace(']', '').replace('[', '').replace('"', '').replace(' ', '')
The longest Domain_list has 3,000 items.
First, to make sure the Domain_list values are comparable, let's strip the extraneous characters:
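import re
def clean_list(domains):
    # Strip brackets, quotes, and spaces from every fragment.
    # The brackets must be escaped inside the character class.
    return [re.sub(r"[\[\]'\" ]", '', dom) for dom in domains]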
For the sample row given, here is the effect of the cleaning (contents shown before and after):
>>> df
Phone counts Domain_list
0 12395550108 2 [['msg.com', 'kinresto.com']]
>>> df.Domain_list = df.Domain_list.apply(clean_list)
>>> df
Phone counts Domain_list
0 12395550108 2 [msg.com, kinresto.com]
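As a quick sanity check, the regex-based cleaning can be compared against the chained .replace calls quoted in the question. This is only a sketch: the helper names clean_with_regex and clean_with_replace are illustrative, and the input is the sample Domain_list value shown above.

```python
import re

def clean_with_regex(fragment):
    # One character class covering brackets, both quote styles, and spaces
    return re.sub(r"[\[\]'\" ]", "", fragment)

def clean_with_replace(fragment):
    # The chained replacements quoted in the question
    return (fragment.replace("'", "").replace("]", "").replace("[", "")
            .replace('"', "").replace(" ", ""))

# Sample value copied from the question's Domain_list column
raw = ["['msg.com'", " 'kinresto.com']"]
regex_result = [clean_with_regex(f) for f in raw]
replace_result = [clean_with_replace(f) for f in raw]

print(regex_result)                    # ['msg.com', 'kinresto.com']
print(regex_result == replace_result)  # True
```

Both approaches agree on this sample; the regex version simply does the work in one pass per string.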
To find the main domain for a phone number, I used the following function and passed it to apply:
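def find_main_domain(phone):
    # Domains in cc_base whose main Phone equals this number
    matches = cc_base.loc[cc_base.Phone == phone, 'Domain']
    # Return a scalar (or None when nothing matches) so apply yields a plain column
    return matches.iloc[0] if not matches.empty else None
df['Main_phone_for_domain'] = df.Phone.apply(find_main_domain)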
At this stage, here are the contents of df:
>>> df
Phone counts Domain_list Main_phone_for_domain
0 12395550108 2 [msg.com, kinresto.com] kinresto.com
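Filtering cc_base once per row can become slow on a large frame. One possible alternative, sketched here on toy data, is to build a Phone-to-Domain lookup once and use Series.map. The frames below are stand-ins built from the sample values, the second phone number is a made-up value with no match, and keeping only the first domain per phone is an assumption.

```python
import pandas as pd

# Toy stand-ins for the question's frames (values copied from the samples)
cc_base = pd.DataFrame({
    "Domain": ["kinresto.com", "msg.com"],
    "Phone": [12395550108, 13075550133],
})
# 19995550100 is a hypothetical number that has no entry in cc_base
df = pd.DataFrame({"Phone": [12395550108, 19995550100]})

# Build the lookup once; assumption: keep the first domain seen per phone
phone_to_domain = cc_base.drop_duplicates("Phone").set_index("Phone")["Domain"]

# map() looks each phone up in the index instead of scanning cc_base per row
df["Main_phone_for_domain"] = df["Phone"].map(phone_to_domain)
print(df)
```

Phones without a match end up as NaN, so they are easy to spot afterwards.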
For the last column, we simply include all the other domains:
df['Alternative_for'] = df.apply(lambda row: [x for x in row.Domain_list if x != row.Main_phone_for_domain], axis=1)
Here is the final result:
>>> df
Phone counts Domain_list Main_phone_for_domain Alternative_for
0 12395550108 2 [msg.com, kinresto.com] kinresto.com [msg.com]
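The question asks for both new columns as lists, and a phone could in principle be the main number for more than one domain in the same list. A minimal sketch of that variant, assuming Domain_list has already been cleaned, checks each (domain, phone) pair against a set built from cc_base. The name main_pairs and the toy frames are illustrative only.

```python
import pandas as pd

# Toy stand-ins built from the sample rows in the question
cc_base = pd.DataFrame({
    "Domain": ["kinresto.com", "msg.com"],
    "Phone": [12395550108, 13075550133],
})
df = pd.DataFrame({
    "Phone": [12395550108],
    "counts": [2],
    "Domain_list": [["msg.com", "kinresto.com"]],
})

# Every (domain, main phone) pair, stored in a set for constant-time lookups
main_pairs = set(zip(cc_base["Domain"], cc_base["Phone"]))

main_col, alt_col = [], []
for phone, domains in zip(df["Phone"], df["Domain_list"]):
    # Split the row's domains by whether this phone is their main number
    main_col.append([d for d in domains if (d, phone) in main_pairs])
    alt_col.append([d for d in domains if (d, phone) not in main_pairs])

# Wrap in a Series so each cell holds one list
df["Main_phone_for_domain"] = pd.Series(main_col, index=df.index)
df["Alternative_for"] = pd.Series(alt_col, index=df.index)
print(df)
```

Because each membership test is a set lookup, a 3,000-item Domain_list costs one pass over the list rather than a scan of cc_base per domain.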