查看整个熊猫数据框中的电子邮件和电话号码



我有一个持续增长的大型数据集,大约有52列。我正试图想出一种方法,定期检查和标记电子邮件和电话号码等数据,以便在下游手动处理。

我可以找到电子邮件和号码,但我还没有成功地将其动态应用于整个数据帧。我试图将"finder"放在函数中,然后应用于整个数据帧。我只是也不确定从那里去哪里。

我想要的结果只是一个项目列表,说明他们在任何一列中是否有电话或电子邮件。(不过,最后我可能想列出哪些列存在问题。(

谢谢你的帮助,让我朝着正确的方向前进!

import pandas as pd, phonenumbers, re
#create small dummy data set
df = pd.DataFrame({
'ID':[1,2,5,25,26],
'Lineage':['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball', '888-555-5556 Ryan Parkes rp@abc.io'],
'ShortDesc':['618-552-2255','Gino Mcneill gm@github.com','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
})
#find phone numbers in one column
for row in df['ShortDesc']:  
for match in phonenumbers.PhoneNumberMatcher(row, "US"):
print (phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164))
#find emails in one column
for row in df['ShortDesc']:  
for match in re.findall('S+@S+', row): 
print(match)
#create function for finding phone numbers
def phone_cleaner(x):
for match in phonenumbers.PhoneNumberMatcher(x,"US"):
return phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164)
#apply finding function to whole df
df2 = df
df2['ShortDesctel'] = df2['ShortDesc'].apply(phone_cleaner)
df2

#pseudo code:
#for each row
#for each column
#does email exist
#does phone number exist
#if yes to email
#if yes to phone number
#return ID
#Desired Result:
#ID, Email, Phone
#1          x
#2   x      x
#5   x
#25            ----Edit after the fact, this row wouldn't be displayed
#26  x      x

我们可以创建新的数据帧,并使用regex查找电子邮件和电话,仅适用于"object"类型为的列

(pd.DataFrame({'ID': df['ID'],
'Email': df.select_dtypes(object)
.applymap(lambda x: bool(re.findall('S+@S+', str(x))))
.any(axis=1),
'Phone': df.select_dtypes(object)
.applymap(lambda x: bool(re.findall('d+-d+-d+', str(x))))
.any(axis=1)}
)
.replace({True: 'x', False: ''})
) 

输出:

ID  Email   Phone
0    1              x
1    2      x       x
2    5      x   
3   25      
4   26      x       x

如果你想使用"phonesclear"函数,你可以把它放在那里,而不是我的解决方案中建议的regex模式。

也许。。。

import pandas as pd
import re
import numpy as np
# regex patterns for email and phone numbers (could be simpler if more basic matching is required)
email_regex_pattern = r'(?:[a-z0-9!#$%&''*+/=?^_`{|}~-]+(?:.[a-z0-9!#$%&''*+/=?^_`{|}~-]+)*|"(?:[x01-x08x0bx0cx0e-x1fx21x23-x5bx5d-x7f]|\[x01-x09x0bx0cx0e-x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])).){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[x01-x08x0bx0cx0e-x1fx21-x5ax53-x7f]|\[x01-x09x0bx0cx0e-x7f])+)])'
phone_number_regex_pattern = r'(?:(?:+?1s*(?:[.-]s*)?)?(?:(s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])s*)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))s*(?:[.-]s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})s*(?:[.-]s*)?([0-9]{4})(?:s*(?:#|x.?|ext.?|extension)s*(d+))?'
#create small dummy data set
df = pd.DataFrame({
'ID':[1,2,5,25,26],
'Lineage':['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball', '888-555-5556 Ryan Parkes rp@abc.io'],
'ShortDesc':['618-552-2255','Gino Mcneill gm@github.com','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
})

mask = df[['Lineage', 'ShortDesc', 'LongDesc']].apply(lambda x: x.str.contains(email_regex_pattern,regex=True)).any(axis=1)
print('Rows with emails:n')
print(df[mask])

输出:

Rows with emails:
ID  Lineage ShortDesc   LongDesc
1   2   square  Gino Mcneill gm@github.com  for help with product 56789, call 618-578-0055...
2   5   please Gino Mcneill gm@yahoo.com        maybe six five today for ever
4   26  888-555-5556 Ryan Parkes rp@abc.io  if you have trouble you should call 816-455-55...   over the hills and through the woods

然后。。。

mask = df[['Lineage', 'ShortDesc', 'LongDesc']].apply(lambda x: x.str.contains(phone_number_regex_pattern,regex=True)).any(axis=1)
print('nnRows with phone numbers:n')
print(df[mask])

输出:

Rows with phone numbers:
ID  Lineage ShortDesc   LongDesc
0   1   apple   618-552-2255    Eesha Hinton
1   2   square  Gino Mcneill gm@github.com  for help with product 56789, call 618-578-0055...
4   26  888-555-5556 Ryan Parkes rp@abc.io  if you have trouble you should call 816-455-55...   over the hills and through the woods

进一步测试需要更好的测试数据帧。

最新更新