I am trying to do paginated web scraping with BeautifulSoup, so I use a webdriver to move through the pages. However, I am not sure whether there is any other way to use the webdriver to get content from a dynamic web page and fit it into my code. Below is the full code where I tried to use the webdriver, but the webdriver does not work. The site I want to scrape is [linked here][1]:
for i in range(1, MAX_PAGE_NUM + 1):
    page_num = (MAX_PAGE_DIG - len(str(i))) * "0" + str(i)
    raw = requests.get('').text
    driver.get(raw)
    raw = raw.replace("</br>", "")
    soup = BeautifulSoup(raw, 'html.parser')
    name = soup.find_all('div', {'class': 'cbp-vm-companytext'})
    phone = [re.findall('>.*?<', d.find('span')['data-content'])[0][1:][:-1] for d in soup.find_all('div', {'class': 'cbp-vm-cta'})]
    addresses = [x.text.strip().split("\r\n")[-1].strip() for x in soup.find_all("div", class_='cbp-vm-address')]
    print(addresses)
    print(name)
    num_page_items = len(addresses)
    with open('results.csv', 'a') as f:
        for i in range(num_page_items):
            f.write(name[i].text + "," + phone[i] + "," + addresses[i] + "," + "\n")
I have clearly added the webdriver to the code incorrectly. What should I fix to make the webdriver work?
If you use Selenium to read the page, then you can also use Selenium to search for the elements on the page.

Some entries have no companytext, so if you collect the companytext values separately and the address/phone values separately, you may create wrong pairs: (second name, first phone, first address), (third name, second phone, second address), and so on. It is better to find the element holding each name/phone/address group and then search inside that element for the name, phone and address. If the name is not found, you either have to use an empty name or search that group for a different element holding the name. I found that some entries display an image with a logo instead of a name, and their name is in <img alt="...">.

Writing the CSV data to a file with the standard write() is not a good idea, because an address may contain many , characters and would then be split into many columns. Use the csv module instead; it wraps the address in " " so it stays in a single column.
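The difference can be sketched with a small stdlib-only example (the business data here is made up):

```python
import csv
import io

# A hypothetical row where the address contains commas.
row = ["Boutique A", "03-1234567", "No. 1, Jalan Contoh, 40000 Shah Alam"]

# Plain write(): the commas inside the address create extra columns.
naive = ",".join(row)
print(len(naive.split(",")))  # 5 fields instead of 3

# csv.writer: the address field is quoted, so it survives as one column.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())  # Boutique A,03-1234567,"No. 1, Jalan Contoh, 40000 Shah Alam"

# Reading it back gives the original 3 columns.
buf.seek(0)
print(next(csv.reader(buf)) == row)  # True
```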
from selenium import webdriver
import csv

MAX_PAGE_NUM = 5

#driver = webdriver.Chrome()
driver = webdriver.Firefox()

with open('results.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Business Name", "Phone Number", "Address"])

    for page_num in range(1, MAX_PAGE_NUM+1):
        #page_num = '{:03}'.format(page_num)
        url = 'https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen={}'.format(page_num)
        driver.get(url)

        for item in driver.find_elements_by_xpath('//div[@id="content_listView"]//li'):
            try:
                name = item.find_element_by_xpath('.//div[@class="cbp-vm-companytext"]').text
            except Exception as ex:
                #print('ex:', ex)
                name = item.find_element_by_xpath('.//a[@class="cbp-vm-image"]/img').get_attribute('alt')
            phone = item.find_element_by_xpath('.//div[@class="cbp-vm-cta"]//span[@data-original-title="Phone"]').get_attribute('data-content')
            phone = phone[:-4].split(">")[-1]
            address = item.find_element_by_xpath('.//div[@class="cbp-vm-address"]').text
            address = address.split('\n')[-1]
            print(name, '|', phone, '|', address)
            csv_writer.writerow([name, phone, address])
BTW: you don't have to convert the page number to three digits - i.e. 001 - it also works with plain 1. But if you want to convert it, use string formatting:

page_num = '{:03}'.format(i)
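A quick check of that format spec:

```python
# '{:03}' zero-pads an integer to at least three digits.
for i in (1, 23, 456):
    print('{:03}'.format(i))
# 001
# 023
# 456
```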
It can also be done without Selenium, using only requests and BeautifulSoup.

If you have to get the HTML from Selenium, then you have driver.page_source - but driver.get() needs a url, and then you don't need requests.
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
EDIT: I could get it with requests and BeautifulSoup, without Selenium, only when I used "lxml" instead of "html.parser". There seem to be some errors in the HTML, and "html.parser" has problems parsing it correctly.
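As an illustration, the two parsers repair broken markup differently, so a selector that fails with one may work with the other. This uses a made-up invalid fragment, not the actual page, and assumes lxml is installed:

```python
from bs4 import BeautifulSoup

fragment = "<a></p>"  # invalid: stray closing tag

# lxml normalizes the document and adds <html><body> wrappers;
# html.parser keeps the fragment bare and just drops the stray </p>.
print(BeautifulSoup(fragment, "lxml"))
print(BeautifulSoup(fragment, "html.parser"))
```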
import requests
from bs4 import BeautifulSoup as BS
import csv
#import webbrowser

MAX_PAGE_NUM = 5

#headers = {
#    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
#}

with open('results.csv', 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Business Name", "Phone Number", "Address"])

    for page_num in range(1, MAX_PAGE_NUM+1):
        #page_num = '{:03}'.format(page_num)
        url = 'https://www.yellowpages.my/listing/results.php?keyword=boutique&where=selangor&screen={}'.format(page_num)
        response = requests.get(url)  #, headers=headers)
        soup = BS(response.text, 'lxml')
        #soup = BS(response.text, 'html.parser')

        #with open('temp.html', 'w') as fh:
        #    fh.write(response.text)
        #webbrowser.open('temp.html')

        #all_items = soup.find('div', {'id': 'content_listView'}).find_all('li')
        #print('len:', len(all_items))
        #for item in all_items:
        for item in soup.find('div', {'id': 'content_listView'}).find_all('li'):
            try:
                name = item.find('div', {'class': 'cbp-vm-companytext'}).text
            except Exception as ex:
                #print('ex:', ex)
                name = item.find('a', {'class': 'cbp-vm-image'}).find('img')['alt']
            phone = item.find('div', {'class': 'cbp-vm-cta'}).find('span', {'data-original-title': 'Phone'})['data-content']
            phone = phone[:-4].split(">")[-1].strip()
            address = item.find('div', {'class': 'cbp-vm-address'}).text
            address = address.split('\n')[-1].strip()
            print(name, '|', phone, '|', address)
            csv_writer.writerow([name, phone, address])