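I am scraping the public contact emails of channels, found by keyword, from their channel IDs. Some channel IDs repeat, so their emails repeat too, and I scrape a large number of channel IDs at once. Before writing the emails line by line to my text file, I need to check for duplicates and skip any email that already exists in the file.
I would also be grateful if you could show me how to remove whitespace. My current code works sometimes and fails at other times; somehow it writes blank lines and lines containing spaces.
My code, which writes all emails line by line: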
with open("scraped_emails.txt", 'a') as f:
    for email in cleanEmail:
        f.write(email.replace(" ", "") + '\n')
You can add an if statement that checks whether the email you want to append is already in the file, like this:
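cleanEmail = ['a@b.com', ' glennbz@veriznon.net ', 'x@yy.ul']
# 'r+' opens the file for reading and writing; the file must already exist
with open("scraped_emails.txt", 'r+') as f:
    # compare against whole lines rather than substrings of the file contents
    emails = set(line.strip() for line in f.read().splitlines())
    for email in cleanEmail:
        email = email.strip()
        # skip empty strings and anything already written
        if email and email not in emails:
            f.write(email + '\n')
            emails.add(email)  # also catches repeats inside cleanEmail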
Note that I added the strip() method, which solves the whitespace problem by removing leading and trailing whitespace.
# Output
a@b.com
glennbz@veriznon.net
x@yy.ul
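The same check can be wrapped in a small function. This is only a sketch: the name `append_new_emails` is hypothetical, and it assumes that a missing file means nothing has been scraped yet and that an existing file ends with a newline. It also compares addresses exactly, so `A@b.com` and `a@b.com` would count as different.

```python
from pathlib import Path


def append_new_emails(path, candidates):
    """Append emails from candidates that are not yet in the file; return those added."""
    file = Path(path)
    # Assumption: a missing file simply means no emails were saved so far.
    existing = set()
    if file.exists():
        existing = {line.strip() for line in file.read_text().splitlines() if line.strip()}

    added = []
    for raw in candidates:
        email = raw.strip()  # removes spaces, tabs and newlines at both ends
        if email and email not in existing:
            existing.add(email)  # remember it so repeats in this batch are skipped too
            added.append(email)

    if added:
        # Assumption: the existing file, if any, ends with a newline.
        with file.open("a") as f:
            f.writelines(email + "\n" for email in added)
    return added
```

Calling `append_new_emails("scraped_emails.txt", cleanEmail)` after each scraping batch would then write only addresses that are not in the file yet.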
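If I understand correctly, you want to clean up your file scraped_emails.txt by removing duplicates and stripping whitespace from the emails? I would do it in two steps:
- Parse all emails from scraped_emails.txt, strip the whitespace, and store them in a set (which keeps only unique values)
- Overwrite the existing file with the cleaned values. If you are unsure about this, write to a different file first and check the result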
clean_emails = set()
file_name = "scraped_emails.txt"

# initial reading of emails
print(f"Reading {file_name} to clean emails ..")
initial_line_counter = 0
with open(file_name, "r") as f_in:
    for line in f_in:
        # remember input lines, just for statistics
        initial_line_counter += 1
        # strips the trailing newline and any surrounding whitespace
        cleaned_email = line.strip()
        # you mentioned empty lines - this prevents adding empty strings to your set
        if cleaned_email:
            clean_emails.add(cleaned_email)

# opening the file with mode="w" overwrites the existing file
# note: a set is unordered, so the original line order is not preserved
with open(file_name, "w") as f_out:
    for email in clean_emails:
        f_out.write(f"{email}\n")

print(f"Reduced {initial_line_counter} to {len(clean_emails)} cleaned email addresses")
You can test this with a scraped_emails.txt that contains the following:
some_mail1@yahoo.com
some_mail2@yahoo.com
some_mail3@yahoo.com
some_mail4@yahoo.com
some_mail5@yahoo.com
some_mail6@yahoo.com
some_mail@y7ahoo.com
some_mail8@yahoo.com
some_mail9@yahoo.com
some_mail9@yahoo.com
some_mail9@yahoo.com
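If the original line order matters, a variation on the two-step cleanup could look like the sketch below. It is not the code from the answer: `dedupe_file` and the `.tmp` suffix are hypothetical names, and it relies on `dict` preserving insertion order (Python 3.7+). It writes the cleaned lines to a scratch file first and only then replaces the original, in the spirit of "write to a different file first".

```python
import os


def dedupe_file(path):
    """Rewrite path with stripped, unique, non-empty lines in first-seen order."""
    with open(path, "r") as f_in:
        stripped = (line.strip() for line in f_in)
        # dict keys are unique and keep insertion order (Python 3.7+)
        unique = list(dict.fromkeys(s for s in stripped if s))

    tmp_path = path + ".tmp"  # hypothetical scratch file name
    with open(tmp_path, "w") as f_out:
        f_out.writelines(email + "\n" for email in unique)
    # swap in the cleaned file only after it has been fully written
    os.replace(tmp_path, path)
    return unique
```

For the eleven-line sample above, this would leave nine addresses, because the three `some_mail9@yahoo.com` lines collapse into one.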