我有一个持续增长的大型数据集,大约有52列。我正试图想出一种方法,定期检查和标记电子邮件和电话号码等数据,以便在下游手动处理。
我可以找到电子邮件和号码,但我还没有成功地将其动态应用于整个数据帧。我试图将"finder"放在函数中,然后应用于整个数据帧。我只是也不确定从那里去哪里。
我想要的结果只是一个项目列表,说明他们在任何一列中是否有电话或电子邮件。(不过,最后我可能想列出哪些列存在问题。(
谢谢你的帮助,让我朝着正确的方向前进!
import pandas as pd, phonenumbers, re
#create small dummy data set
df = pd.DataFrame({
'ID':[1,2,5,25,26],
'Lineage':['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball', '888-555-5556 Ryan Parkes rp@abc.io'],
'ShortDesc':['618-552-2255','Gino Mcneill gm@github.com','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
})
#find phone numbers in one column
for row in df['ShortDesc']:
for match in phonenumbers.PhoneNumberMatcher(row, "US"):
print (phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164))
#find emails in one column
for row in df['ShortDesc']:
for match in re.findall('S+@S+', row):
print(match)
#create function for finding phone numbers
def phone_cleaner(x):
for match in phonenumbers.PhoneNumberMatcher(x,"US"):
return phonenumbers.format_number(match.number, phonenumbers.PhoneNumberFormat.E164)
#apply finding function to whole df
df2 = df
df2['ShortDesctel'] = df2['ShortDesc'].apply(phone_cleaner)
df2
#pseudo code:
#for each row
#for each column
#does email exist
#does phone number exist
#if yes to email
#if yes to phone number
#return ID
#Desired Result:
#ID, Email, Phone
#1 x
#2 x x
#5 x
#25 ----Edit after the fact, this row wouldn't be displayed
#26 x x
我们可以创建新的数据帧,并使用regex查找电子邮件和电话,仅适用于"object"类型为的列
(pd.DataFrame({'ID': df['ID'],
'Email': df.select_dtypes(object)
.applymap(lambda x: bool(re.findall('S+@S+', str(x))))
.any(axis=1),
'Phone': df.select_dtypes(object)
.applymap(lambda x: bool(re.findall('d+-d+-d+', str(x))))
.any(axis=1)}
)
.replace({True: 'x', False: ''})
)
输出:
ID Email Phone
0 1 x
1 2 x x
2 5 x
3 25
4 26 x x
如果你想使用"phonesclear"函数,你可以把它放在那里,而不是我的解决方案中建议的regex模式。
也许。。。
import pandas as pd
import re
import numpy as np
# regex patterns for email and phone numbers (could be simpler if more basic matching is required)
email_regex_pattern = r'(?:[a-z0-9!#$%&''*+/=?^_`{|}~-]+(?:.[a-z0-9!#$%&''*+/=?^_`{|}~-]+)*|"(?:[x01-x08x0bx0cx0e-x1fx21x23-x5bx5d-x7f]|\[x01-x09x0bx0cx0e-x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])).){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[x01-x08x0bx0cx0e-x1fx21-x5ax53-x7f]|\[x01-x09x0bx0cx0e-x7f])+)])'
phone_number_regex_pattern = r'(?:(?:+?1s*(?:[.-]s*)?)?(?:(s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])s*)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))s*(?:[.-]s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})s*(?:[.-]s*)?([0-9]{4})(?:s*(?:#|x.?|ext.?|extension)s*(d+))?'
#create small dummy data set
df = pd.DataFrame({
'ID':[1,2,5,25,26],
'Lineage':['apple', 'square', 'please Gino Mcneill gm@yahoo.com', 'ball', '888-555-5556 Ryan Parkes rp@abc.io'],
'ShortDesc':['618-552-2255','Gino Mcneill gm@github.com','',' please call now','if you have trouble you should call 816-455-5599 ASAP' ],
'LongDesc':['Eesha Hinton', 'for help with product 56789, call 618-578-0055 immediately, or email Gino Mcneill gm@yahoo.com', 'maybe six five today for ever','more random text that could be really long and annoying','over the hills and through the woods']
})
mask = df[['Lineage', 'ShortDesc', 'LongDesc']].apply(lambda x: x.str.contains(email_regex_pattern,regex=True)).any(axis=1)
print('Rows with emails:n')
print(df[mask])
输出:
Rows with emails:
ID Lineage ShortDesc LongDesc
1 2 square Gino Mcneill gm@github.com for help with product 56789, call 618-578-0055...
2 5 please Gino Mcneill gm@yahoo.com maybe six five today for ever
4 26 888-555-5556 Ryan Parkes rp@abc.io if you have trouble you should call 816-455-55... over the hills and through the woods
然后。。。
mask = df[['Lineage', 'ShortDesc', 'LongDesc']].apply(lambda x: x.str.contains(phone_number_regex_pattern,regex=True)).any(axis=1)
print('nnRows with phone numbers:n')
print(df[mask])
输出:
Rows with phone numbers:
ID Lineage ShortDesc LongDesc
0 1 apple 618-552-2255 Eesha Hinton
1 2 square Gino Mcneill gm@github.com for help with product 56789, call 618-578-0055...
4 26 888-555-5556 Ryan Parkes rp@abc.io if you have trouble you should call 816-455-55... over the hills and through the woods
进一步测试需要更好的测试数据帧。