Pandas: comparing a single value against a list



First, let me show you my code:

import pandas as pd

df = pd.read_pickle('domain_list.pkl')
cc_base = pd.read_pickle('cc_base.pkl')

Here is what a single record looks like in each dataset:

print(cc_base.iloc[100])
Domain                  kinresto.com
Phone                    12395550108
Alternative phone 2    13195550115.0
Alternative phone 3    12525550126.0
Alternative phone 4    13075550133.0
Alternative phone 5              NaN
print(cc_base.iloc[102])
Domain                       msg.com
Phone                  13075550133.0
Alternative phone 2    13195550115.0
Alternative phone 3    12395550108.0
Alternative phone 4              NaN
Alternative phone 5              NaN

And a row from the second dataset:

print(df.iloc[44556])
Phone                            12395550108
counts                                     2
Domain_list    ["['msg.com'", " 'kinresto.com']"]

For each row, I want to check which of the domains in Domain_list has df['Phone'] as its main number, i.e. as the value of cc_base['Phone'].

A row in the resulting dataframe should look like this:

Phone                                             12395550108
counts                                                      2
Domain_list                ["['msg.com'", " 'kinresto.com']"]
Main_phone_for_domain                        ["kinresto.com"]
Alternative_for                                   ["msg.com"]

I know how ugly the domain list is. It can be cleaned up with:

.replace("'", "").replace(']', '').replace('[', '').replace('"', '').replace(' ', '')

The longest Domain_list has 3,000 items.

First, to make sure the Domain_list values are comparable, let's strip the extraneous characters:

import re
def clean_list(domains):
    # Strip brackets, single quotes and spaces left over from the stringified list
    return [re.sub(r"[\[\]' ]", '', dom) for dom in domains]

For the given example row, here is the effect of the cleanup (showing the contents before and after):

>>> df
         Phone  counts                     Domain_list
0  12395550108       2  [['msg.com',  'kinresto.com']]
>>> df.Domain_list = df.Domain_list.apply(clean_list)
>>> df
         Phone  counts              Domain_list
0  12395550108       2  [msg.com, kinresto.com]
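
As a quick standalone check of the cleanup step, here is a minimal sketch that runs on the two raw strings from the example row. The brackets are escaped inside the character class, because an unescaped `]` would close the class early. The names `DEBRIS` and `strip_list_debris` are hypothetical and only used for illustration.

```python
import re

# Characters left over from the stringified list: brackets, single quotes, spaces.
# The brackets are escaped so they are literal members of the character class.
DEBRIS = re.compile(r"[\[\]' ]")

def strip_list_debris(domains):
    """Return the items with list-literal debris removed (hypothetical helper)."""
    return [DEBRIS.sub("", dom) for dom in domains]

# The two raw items from the example row in the question
raw = ["['msg.com'", " 'kinresto.com']"]
cleaned = strip_list_debris(raw)
print(cleaned)
```

Compiling the pattern once may also help a little when the longest list has thousands of items.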

To find the main domain for a phone number, I use the following function and pass it to apply:

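def find_main_domain(phone):
    # Domains whose primary Phone equals this number
    main_domain = cc_base.loc[cc_base.Phone == phone, 'Domain']
    # Return a scalar (the first match) so it can be compared later; None if no match
    return main_domain.iloc[0] if not main_domain.empty else None
df['Main_phone_for_domain'] = df.Phone.apply(find_main_domain)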

At this stage, here is the content of df:

>>> df
         Phone  counts              Domain_list Main_phone_for_domain
0  12395550108       2  [msg.com, kinresto.com]          kinresto.com
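
Filtering cc_base once per phone rescans the whole table for every row. A possible alternative, sketched below under the assumption that each phone is the main number of at most one domain (if not, the first domain is kept), is to build a Phone-to-Domain lookup once and use `Series.map`. The tiny frames are stand-ins built from the sample rows; the second phone number and the name `phone_to_domain` are made up for illustration.

```python
import pandas as pd

# Stand-in data based on the sample rows (only the columns needed here)
cc_base = pd.DataFrame({
    "Domain": ["kinresto.com", "msg.com"],
    "Phone": [12395550108, 13075550133],
})
df = pd.DataFrame({
    "Phone": [12395550108, 15555550100],  # second number is made up: no match
    "counts": [2, 1],
})

# Build the Phone -> Domain lookup once; keep the first domain if a phone repeats
phone_to_domain = cc_base.drop_duplicates("Phone").set_index("Phone")["Domain"]

# map() leaves NaN where a phone is not any domain's main number
df["Main_phone_for_domain"] = df["Phone"].map(phone_to_domain)
print(df)
```

This assumes both Phone columns hold the same type; if one side is stored as float or string, it would need converting first.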

For the last column, we simply include all the other domains:

df['Alternative_for'] = df.apply(lambda row: [x for x in row.Domain_list if x != row.Main_phone_for_domain], axis=1)

Here is the final content:

>>> df
         Phone  counts              Domain_list Main_phone_for_domain Alternative_for
0  12395550108       2  [msg.com, kinresto.com]          kinresto.com       [msg.com]
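
The question asks for both result columns as lists. One way to get that without row-wise apply, offered here only as a sketch, is to explode Domain_list, join each domain's main number from cc_base, and regroup. It assumes Domain_list is already cleaned, the Phone values are of comparable types, and the index of df is unique. The names `long`, `primary`, `row`, `Primary` and `is_main` are hypothetical.

```python
import pandas as pd

# Stand-in data based on the sample rows, with Domain_list already cleaned
cc_base = pd.DataFrame({
    "Domain": ["kinresto.com", "msg.com"],
    "Phone": [12395550108, 13075550133],
})
df = pd.DataFrame({
    "Phone": [12395550108],
    "counts": [2],
    "Domain_list": [["msg.com", "kinresto.com"]],
})

# One row per (phone, domain) pair; keep the original row label in "row"
long = df[["Phone", "Domain_list"]].explode("Domain_list").reset_index()
long = long.rename(columns={"index": "row", "Domain_list": "Domain"})

# Attach each domain's main number from cc_base
primary = cc_base[["Domain", "Phone"]].rename(columns={"Phone": "Primary"})
long = long.merge(primary, on="Domain", how="left")
long["is_main"] = long["Phone"] == long["Primary"]

# Regroup into lists, split by whether the phone is the domain's main number
main = long[long["is_main"]].groupby("row")["Domain"].agg(list)
alt = long[~long["is_main"]].groupby("row")["Domain"].agg(list)

# Assignment aligns on the row label; rows with no match get NaN
df["Main_phone_for_domain"] = main
df["Alternative_for"] = alt
print(df)
```

Unlike the scalar lookup, this keeps every domain for which the phone is the main number, at the cost of building an intermediate table with one row per listed domain.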
