Trying to scrape email addresses from a website



I am trying to scrape this website:

www.united-church.ca/search/locater/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=

I managed to scrape it with Scrapy, but I cannot extract the email addresses. Can anyone help?

Here is my code so far:

# -*- coding: utf-8 -*-
import scrapy
from ..items import ChurchItem


class ChurchSpiderSpider(scrapy.Spider):
    name = 'church_spider'
    page_number = 1
    start_urls = ['https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=']

    def parse(self, response):
        # Each search result sits in its own ".icon-ministry" container.
        container = response.css(".icon-ministry")
        for t in container:
            # Create a fresh item per result so yielded items do not share state.
            items = ChurchItem()
            church_name = t.css(".field-name-locator-ministry-title a::text").extract()
            church_phone = t.css(".field-name-field-phone::text").extract()
            church_address = t.css(".thoroughfare::text").extract()
            church_email = t.css(".field-name-field-mu-email span::text").extract()
            items["church_name"] = church_name
            items["church_phone"] = church_phone
            items["church_address"] = church_address
            items["church_email"] = church_email
            yield items
        # next_page = 'https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=&page=' + str(ChurchSpiderSpider.page_number)
        # if ChurchSpiderSpider.page_number <= 110:
        #     ChurchSpiderSpider.page_number += 1
        #     yield response.follow(next_page, callback=self.parse)

I found a partial solution, but it is still incomplete. The output is now:

{'church_address': ['7763 Highway 21'],
 'church_email': ['herbklaehn', ' [at] ', 'gmail.com'],
 'church_name': ['Allenford United Church'],
 'church_phone': ['519-35-6232']}

How do I replace [at] with @ and join the email address into a single string?

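Using Beautiful Soup

A simple way to get the emails is to find each div with class='field-name-field-mu-email' and then replace the obfuscated ' [at] ' in its text with '@'.

For example: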

import requests
from bs4 import BeautifulSoup

url = 'https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll='
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Each email sits in a div with this class; its span text reads "user [at] domain".
for div in soup.find_all('div', attrs={'class': 'field-name-field-mu-email'}):
    print(div.find('span').text.replace(' [at] ', '@'))
Out[1]:
alpcharge@sasktel.net
guc-eug@bellnet.ca
pioneerpastoralcharge@gmail.com
acmeunitedchurch@gmail.com
cmcphers@lakeheadu.ca
mbm@kos.net
tommaclaren@gmail.com
agassizunited@shaw.ca
buchurch@xplornet.com
dmitchell008@yahoo.ca
karen.charlie62@gmail.com
trinityucbdn@westman.wave.ca
gepc.ucc.mail@gmail.com
monacampbell181@gmail.com
herbklaehn@gmail.com
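
The same replacement can also be applied inside the Scrapy spider from the question, where the selector returns the email as a list of text fragments. The following is only a minimal sketch, assuming the fragments always look like the ones in the question's output; `join_email` is a hypothetical helper name, not part of Scrapy:

```python
import re

# Matches the obfuscated separator "[at]" together with any surrounding whitespace.
_AT_PATTERN = re.compile(r"\s*\[at\]\s*", re.IGNORECASE)


def join_email(fragments):
    """Join the text fragments of one email and restore the '@' sign.

    `fragments` is assumed to be a list of strings such as
    ['herbklaehn', ' [at] ', 'gmail.com'], as in the question's output.
    """
    joined = "".join(fragments)
    return _AT_PATTERN.sub("@", joined).strip()


# Fragments copied from the question's output
print(join_email(['herbklaehn', ' [at] ', 'gmail.com']))  # prints herbklaehn@gmail.com
```

In the spider, this would replace the plain assignment, for example items["church_email"] = join_email(t.css(".field-name-field-mu-email span::text").extract()). A regular expression is used instead of a fixed ' [at] ' string so that small differences in whitespace do not break the replacement.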

You can also try Selenium for the scraping. I tried the code below and it returned the complete email addresses.

from selenium import webdriver
from bs4 import BeautifulSoup

# Selenium 4.6+ locates a matching chromedriver automatically.
driver = webdriver.Chrome()
driver.get("https://www.united-church.ca/search/locator/all?keyw=&mission_units_ucc_ministry_type_advanced=10&locll=")
content = driver.page_source
driver.quit()
soup = BeautifulSoup(content, 'html.parser')
# In the rendered page source, each email is a link with class "spamspan".
for all_emails in soup.find_all('a', class_="spamspan"):
    print(all_emails.text)

Result:

alpcharge@sasktel.net
guc-eug@bellnet.ca
pioneerpastoralcharge@gmail.com
acmeunitedchurch@gmail.com
cmcphers@lakeheadu.ca
mbm@kos.net
tommaclaren@gmail.com
agassizunited@shaw.ca
buchurch@xplornet.com
dmitchell008@yahoo.ca
karen.charlie62@gmail.com
trinityucbdn@westman.wave.ca
gepc.ucc.mail@gmail.com
monacampbell181@gmail.com
herbklaehn@gmail.com
