Improving a regex to capture complete emails from a Google search



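To practice, and to help my sister get emails from doctors for her baby, I put together this email harvester. It runs a search, cleans up the URLs it gets back, collects them in a list, and parses each page for emails in two different ways.

The code was taken from different places, so if you correct me, please explain your improvement clearly, as I am already working at the limit of my knowledge.

The question is how to get the emails better (and improve the code, if possible). I'll post the code and the exact output below:

My program's code: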

import random, re, time
import requests, bs4
# from selenium import webdriver   # only needed for the commented-out browser code below
def google_this():                #Googles and gets the first few links
    search_terms = ['Fiat','Lambrusco']
    added_terms = 'email contact? @'
    #This searches for certain keywords in Google and parses results with BS
    for el in search_terms:
        webpage = 'http://google.com/search?q=' + str(el) + str(added_terms)
        print('Searching for the terms...', el,added_terms)
        headers = {'User-agent':'Mozilla/5.0'}
        res = requests.get(webpage, headers=headers)
        #res.raise_for_status()
        statusCode = res.status_code
        if statusCode == 200:
            soup = bs4.BeautifulSoup(res.text,'lxml')
            serp_res_rawlink = soup.select('.r a')
            dicti = []                  #This gets the href links
            for link in serp_res_rawlink:
                url = link.get('href')
                if 'pdf' not in url:
                    dicti.append(url)
            dicti_url = []              #This cleans the "url?q=" from link
            for el in dicti:
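                if '/url?q=' in el:
                    # str.strip() removes a set of characters, not a prefix,
                    # so cut the '/url?q=' prefix off with split() instead
                    result = el.split('/url?q=', 1)[1]
                    dicti_url.append(result)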
            #print(dicti_url)
            dicti_pretty_links = []     #This cleans the gibberish at end of url
            for el in dicti_url[0:4]:
                pretty_url = el.partition('&')[0]
                dicti_pretty_links.append(pretty_url)
            print(dicti_pretty_links)
            for el in dicti_pretty_links:   #This converts page in BS soup
                # browser = webdriver.Firefox()
                # browser.get(el)
                # print('I have been in the element below and closed the window')
                # print(el)
                # time.sleep(1)
                # browser.close()
                webpage = (el)
                headers = {'User-agent':'Mozilla/5.0'}
                res = requests.get(webpage, headers=headers)
                #res.raise_for_status()
                statusCode = res.status_code
                if statusCode == 200:
                    soup = bs4.BeautifulSoup(res.text,'lxml')
                    #This is the first way to search for an email in soup
                    emailRegex = re.compile(r'([a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+])', re.VERBOSE)
                    mo = emailRegex.findall(res.text)
                    #mo = emailRegex.findall(soup.prettify())
                    print('THIS BELOW IS REGEX')
                    print(mo)
                    #This is the second way to search for an email in soup:
                    mailtos = soup.select('a[href^=mailto]')
                    for el in mailtos:
                        print('THIS BELOW IS MAILTOS')
                        print(el.text)
    time.sleep(random.uniform(0.5,1))
google_this()

This is the output of exactly the code above. As you can see, it does seem to find some emails, but they are cut off right after the "@" sign:

C:\Users\SK\AppData\Local\Programs\Python\Python35-32\python.exe C:/Users/SK/PycharmProjects/untitled/another_temperase.py
Searching for the terms... Fiat email contact? @
['http://www.fcagroup.com/en-US/footer/Pages/contacts.aspx', 'http://www.fiat.co.uk/header-contacts', 'http://www.fiatusa.com/webselfservice/fiat/', 'https://twitter.com/nic_fincher81/status/672505531689394176']
THIS BELOW IS REGEX
['investor.relations@f', 'investor.relations@f', 'sustainability@f', 'sustainability@f', 'mediarelations@f', 'mediarelations@f']
THIS BELOW IS MAILTOS
investor.relations@fcagroup.com
THIS BELOW IS MAILTOS
sustainability@fcagroup.com
THIS BELOW IS MAILTOS
mediarelations@fcagroup.com
THIS BELOW IS REGEX
[]
THIS BELOW IS REGEX
[]
THIS BELOW IS REGEX
['nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y', 'nic_fincher81@y']
Searching for the terms... Lambrusco email contact? @
['http://www.labattagliola.it/%3Flang%3Den']
Process finished with exit code 0

I would recommend a more restrictive version that still captures the whole email:

([a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+) 
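As a rough sketch of how this pattern could be applied to a page's text, the snippet below wraps it in a small helper that also drops the repeated hits seen in the question's output. The name `extract_emails` and the sample HTML fragment are made up for illustration; the address itself comes from the output above. It assumes Python 3.7 or later, where `dict` preserves insertion order.

```python
import re

# The more restrictive pattern recommended above: exactly one "@".
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+')

def extract_emails(text):
    """Hypothetical helper: return matches without duplicates, in first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+) and drops repeats.
    return list(dict.fromkeys(EMAIL_RE.findall(text)))

# Made-up HTML fragment; the same address appears in the href and in the link text.
sample = ('<a href="mailto:investor.relations@fcagroup.com">'
          'investor.relations@fcagroup.com</a>')
emails = extract_emails(sample)
print(emails)
```

Note that the character class contains no hyphen, so an address on a hyphenated domain would still be cut at the hyphen; adding `-` to the class after the "@" is one possible adjustment.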

The reason nothing is captured beyond the first letter after the @ is that the regex is missing a + at the end:

([a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+]+) 

In the original, the part [a-zA-Z0-9_.+] without a quantifier simply says to match exactly one of the characters a-zA-Z0-9._+
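The effect of the missing quantifier can be checked side by side. This is a minimal sketch: the sample sentence is invented, while the two addresses are the ones from the question's output.

```python
import re

text = 'Contact: investor.relations@fcagroup.com or sustainability@fcagroup.com'

# Pattern from the question: the class after "@" has no quantifier,
# so it matches exactly one character.
original = re.compile(r'([a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+])')
# Same pattern with the trailing "+" added.
fixed = re.compile(r'([a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+]+)')

truncated = original.findall(text)
complete = fixed.findall(text)
print(truncated)  # ['investor.relations@f', 'sustainability@f']
print(complete)   # ['investor.relations@fcagroup.com', 'sustainability@fcagroup.com']
```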

I would also be careful with @+, which says to capture one or more "@" symbols.

So a string like the following would be matched as a potentially valid email:

......@@@@@@......
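To illustrate the point about @+, the sketch below runs both variants against an invented string of dots and repeated "@" signs, and against one real address from the question's output.

```python
import re

loose = re.compile(r'[a-zA-Z0-9_.+]+@+[a-zA-Z0-9_.+]+')   # "@+": one or more "@"
strict = re.compile(r'[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+')   # exactly one "@"

bogus = '......@@@@@@......'          # invented test string, not a real address
real = 'mediarelations@fcagroup.com'  # address from the question's output

loose_hits = loose.findall(bogus)
strict_hits = strict.findall(bogus)
print(loose_hits)            # ['......@@@@@@......']
print(strict_hits)           # []
print(strict.findall(real))  # ['mediarelations@fcagroup.com']
```

Even the stricter pattern still accepts a string made only of dots around a single "@", so it works as a filter for candidates rather than as a validator.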
