I have a text file containing some emails. They all start as follows:
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
My problem is to get the unique email IDs, so I tried -
fhand = open("mbox-short.txt")
emails=[]
for line in fhand:
    if line.startswith("From:"):
        l=line.lstrip("From:").rstrip()
        emails.append(l)
unique = []
for email in emails:
    if email not in unique:
        unique.append(email)
        print(email)
print("\nTotal Unique Contacts=",len(unique))
Output -
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
wagnermr@iupui.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
ray@media.berkeley.edu
Total Unique Contacts= 11
This is the correct answer, but -
When I add an extra space inside lstrip("From: "), since the actual email starts after "From: ", this is what I get -
fhand = open("mbox-short.txt")
emails=[]
for line in fhand:
    if line.startswith("From:"):
        l=line.lstrip("From: ").rstrip()
        emails.append(l)
unique = []
for email in emails:
    if email not in unique:
        unique.append(email)
        print(email)
print("\nTotal Unique Contacts=",len(unique))
Output -
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
jlowe@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
wagnermr@iupui.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
ay@media.berkeley.edu
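Total Unique Contacts= 11
So the emails that start with r are affected: their first letter has disappeared from the output, while the other emails are not affected at all. Please help me understand why this happens. Thanks.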
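Let's start by looking at the documentation for lstrip:
str.lstrip([chars])
Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed.
This means that the following part of your code:
line.lstrip("From: ")
does not remove the prefix "From: ". It removes every leading occurrence of F, r, o, m, : and the space character, starting from the left, until it reaches a character that is not in that set. An address that begins with r therefore loses its first letter as well. Some examples: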
>>> "From: rrabc@example.com".lstrip("From: ")
'abc@example.com'
>>> "From: morF@example.com".lstrip("From: ")
'@example.com'
>>> " mmmrrroooFFF: x@example.com".lstrip("From: ")
'x@example.com'
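To remove the literal prefix instead of a set of characters, one option is to slice it off once startswith has confirmed it is there. The following is only a minimal sketch: strip_from_prefix and PREFIX are hypothetical names that do not appear in the question.

```python
PREFIX = "From:"  # hypothetical constant, not in the original code

def strip_from_prefix(line):
    """Return the address on a 'From:' line, or None for any other line.

    strip_from_prefix is a hypothetical helper used for illustration only.
    """
    if not line.startswith(PREFIX):
        return None
    # Slice off exactly len(PREFIX) characters, then trim the
    # surrounding whitespace (the space after the colon and the newline).
    return line[len(PREFIX):].strip()

print(strip_from_prefix("From: rjlowe@iupui.edu\n"))        # rjlowe@iupui.edu
print(strip_from_prefix("From: ray@media.berkeley.edu\n"))  # ray@media.berkeley.edu
```

On Python 3.9 or later, line.removeprefix("From:").strip() should give the same result for lines that start with the prefix.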
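Use str.split() and then take the element at index 1, which is the address that follows "From:". You can also use a set to collect all the unique emails.
Ex:
emails = set()
with open("mbox-short.txt") as fhand:
    for line in fhand:
        if line.startswith("From:"):
            # split() breaks on whitespace: ["From:", "<address>"]
            emails.add(line.strip().split()[1])
print(emails)
print("\nTotal Unique Contacts=",len(emails))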
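A set does not keep the addresses in the order they first appear, whereas the output in the question does. If that order matters, one possible variation is to collect the addresses as dictionary keys, since a dict preserves insertion order on Python 3.7+. This is a sketch under assumptions: unique_senders is a hypothetical helper, and the sample list stands in for the lines of the file.

```python
def unique_senders(lines):
    """Return sender addresses in first-seen order (hypothetical helper)."""
    seen = {}  # dict keys are unique and keep insertion order
    for line in lines:
        if line.startswith("From:"):
            parts = line.split()
            if len(parts) > 1:  # skip a bare "From:" line with no address
                seen[parts[1]] = None
    return list(seen)

# Sample lines taken from the top of the question's data.
sample = [
    "From: stephen.marquard@uct.ac.za\n",
    "From: louis@media.berkeley.edu\n",
    "From: zqian@umich.edu\n",
    "From: rjlowe@iupui.edu\n",
    "From: zqian@umich.edu\n",
    "From: rjlowe@iupui.edu\n",
]

senders = unique_senders(sample)
for address in senders:
    print(address)
print("\nTotal Unique Contacts=", len(senders))
```

For this sample the four distinct addresses print in the order they first occur, followed by a count of 4.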
You can use re to get these emails:
import re
with open("mbox-short.txt", 'r') as f:
    # Find every address-like token in the file, then deduplicate with a set
    emails = list(set(re.findall(r'[\w.]+@[\w.]+', f.read())))
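A pattern of this kind matches an address anywhere in the text, not only on the From: lines. If only the senders are wanted, the pattern can be anchored to the start of each line with re.MULTILINE. The snippet below is a rough sketch: the inline text stands in for the file contents, and the To: line with someone@example.org is an invented example.

```python
import re

# Inline stand-in for f.read(); the "To:" line is a made-up example.
text = (
    "From: rjlowe@iupui.edu\n"
    "To: someone@example.org\n"
    "From: ray@media.berkeley.edu\n"
    "From: rjlowe@iupui.edu\n"
)

# ^ matches at the start of every line because of re.MULTILINE;
# the group captures the address that follows "From:".
pattern = re.compile(r'^From:[ \t]*([\w.]+@[\w.]+)', re.MULTILINE)

emails = sorted(set(pattern.findall(text)))
print(emails)  # ['ray@media.berkeley.edu', 'rjlowe@iupui.edu']
```

Because the regular expression consumes the prefix as a literal string, addresses that begin with r, o, m or F keep their first letter.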