Can't exclude unwanted file extensions when scraping email addresses with a regex

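I've written a script in Python that uses a regular expression to fetch email addresses from certain websites. I'm using Selenium because a few of the sites are dynamic. The script works fine as long as the page contains nothing that merely looks like an email address, such as himalayan-institute-logo@2x.png.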

How can I exclude matches that end in .png or .jpg when fetching email addresses?

The regex pattern I'm using:

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+

The script I'm trying:

import re

from selenium import webdriver

URLS = (
    'https://www.himalayaninstitute.org/about/',
    'http://www.innovaprint.com.sg/',
    'http://www.cityscape.com.sg/?page_id=37',
    'http://www.yogaville.org',
)


def get_email(driver, link):
    # Load the page and print the first email-like match, if there is one.
    driver.get(link)
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', driver.page_source)
    if email:
        print(link, email[0])
    else:
        print(link)


if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    # The older `chrome_options=` keyword is deprecated in favor of `options=`.
    driver = webdriver.Chrome(options=chromeOptions)
    for url in URLS:
        get_email(driver, url)
    driver.quit()

The output I get:

https://www.himalayaninstitute.org/about/ himalayan-institute-logo@2x.png
http://www.innovaprint.com.sg/ info@innovacoms.com
http://www.cityscape.com.sg/?page_id=37 info@cityscape.com.sg
http://www.yogaville.org Yantra-@500.png

The last part, [a-zA-Z0-9-.]+, is a broad match that does not take the position of the dots into account. For example, it would also match .....
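To make that concrete, here is a small sketch that runs the question's pattern against a few strings. The input user@example..... is a made-up test string, not something taken from the scraped pages, and the pattern is written with the dot after the domain label escaped as \.

```python
import re

# Pattern from the question, with the dot after the domain label escaped.
ORIGINAL = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# The trailing character class on its own accepts a run of dots.
tail_only = re.fullmatch(r'[a-zA-Z0-9-.]+', '.....')

# So a "domain" ending in nothing but dots is accepted (hypothetical input),
# and so is an image file name that happens to contain an @.
dots_match = ORIGINAL.fullmatch('user@example.....')
image_match = ORIGINAL.fullmatch('himalayan-institute-logo@2x.png')

print(bool(tail_only), bool(dots_match), bool(image_match))
```

All three checks succeed, which is why the image file names end up in the output.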

One option is to keep the first part of the pattern, [a-zA-Z0-9_.+-]+@, to match everything up to and including the @ sign.

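Then use a negative lookahead to assert that what follows on the right is not .png or .jpg, and match a pattern in which each dot sits between runs of at least one character that is not a dot.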

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*(?!\.(?:png|jpg))\.[a-zA-Z0-9]+
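To see what the lookahead does in isolation, the sketch below tests only the tail of the pattern (the lookahead plus the final dot and label) against a few suffixes. The names TAIL and results are just for illustration.

```python
import re

# Only the tail of the pattern: negative lookahead, then a dot and a label.
TAIL = re.compile(r'(?!\.(?:png|jpg))\.[a-zA-Z0-9]+')

# re.match anchors at the start of the string, which is where the
# lookahead is evaluated.
results = {
    suffix: bool(TAIL.match(suffix))
    for suffix in ('.png', '.jpg', '.com', '.sg')
}

for suffix, matched in results.items():
    print(suffix, matched)
```

Note that the lookahead only checks the characters right after the current position, so a suffix that merely starts with .png or .jpg is rejected at that position as well.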

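Explanation

[a-zA-Z0-9_.+-]+@ matches 1 or more of the allowed characters, followed by @
[a-zA-Z0-9]+ matches 1 or more of the characters listed in the character class
(?: opens a non-capturing group
- \.[a-zA-Z0-9]+ matches a dot followed by 1 or more of the characters listed in the character class
)* closes the non-capturing group and repeats it 0 or more times
(?! opens a negative lookahead, asserting that what follows is not
- \.(?:png|jpg) which matches .png or .jpg
)\.[a-zA-Z0-9]+ closes the lookahead, then matches a dot followed by 1 or more of the characters listed in the character class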
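As a quick check, here is a minimal sketch that runs the pattern over a sample string without Selenium. The sample markup is made up; it only reuses the four strings from the question's output.

```python
import re

# Pattern from the answer, with the literal dots written as \.
EMAIL_RE = re.compile(
    r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*'
    r'(?!\.(?:png|jpg))\.[a-zA-Z0-9]+'
)

# Hypothetical page fragment built from the strings in the question's output.
sample = (
    '<img src="himalayan-institute-logo@2x.png"> '
    '<a href="mailto:info@innovacoms.com">mail</a> '
    'info@cityscape.com.sg <img src="Yantra-@500.png">'
)

# The pattern has no capturing groups, so findall returns whole matches.
emails = EMAIL_RE.findall(sample)
print(emails)  # ['info@innovacoms.com', 'info@cityscape.com.sg']
```

In the question's script this would mean swapping the pattern passed to re.findall in get_email. Only .png and .jpg are excluded here; other extensions such as .gif would need to be added to the alternation.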

