Scraping paginated results with Python BeautifulSoup



I am able to get the first & last page numbers, but I can only extract page 1's data into the CSV. I need to extract the data from all 10 pages into the CSV. Where am I going wrong in the code?

Import the installed modules

import requests
from bs4 import BeautifulSoup
import csv

To get the data from the web page we will use the requests get() method

url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
page = requests.get(url)

Check the HTTP response status code

print(page.status_code)

Now that I have collected the data from the web page, let's see what we got

print(page.text)

By using BeautifulSoup's prettify() method, the above data can be viewed in a nicer format. For this we will create a bs4 object and use the prettify method

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])

Find all DIVs that contain company information

product_name_list = soup.findAll("div",{"class":"CompanyInfo"})

Extract the first and last page numbers

paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
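Why index 1 and len-2? In this kind of pagination bar the first and last `<a>` elements are usually the previous/next arrows, so the first and last numeric page links sit one position in from each end. A minimal, self-contained sketch against hypothetical markup (the HTML fragment below is an assumption modeled on the selectors used above, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup mirroring the selectors in the question.
# The first and last <a> tags are assumed to be prev/next arrows, which is
# why the numeric pages sit at index 1 and len(paging) - 2.
html = """
<div class="pg-full-width me-pagination">
  <ul class="pagination">
    <li><a>&laquo;</a></li>
    <li><a>1</a></li>
    <li><a>2</a></li>
    <li><a>10</a></li>
    <li><a>&raquo;</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
paging = soup.find("div", {"class": "pg-full-width me-pagination"}) \
             .find("ul", {"class": "pagination"}).find_all("a")
start_page = paging[1].text               # first numeric page link
last_page = paging[len(paging) - 2].text  # last numeric page link
print(start_page, last_page)
```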

Now loop through these elements

for element in product_name_list:

Get one block of the "div",{"class":"CompanyInfo"} tag and find/store the name, address and phone

    name = element.find('h2').text
    address = element.find('address').text.strip()
    phone = element.find("ul",{"class":"submenu"}).text.strip()

Write the name, address and phone to the csv

    writer.writerow([name, address, phone])

Now it will go to the next "div",{"class":"CompanyInfo"} tag and repeat

outfile.close()
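The per-listing extraction above can be checked in isolation against a minimal HTML fragment. The tag structure below (h2 for the name, an address tag, ul.submenu for the phone) is an assumption taken from the selectors in the question, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical listing block imitating the structure the question's
# selectors expect: h2 -> name, address -> address, ul.submenu -> phone.
html = """
<div class="CompanyInfo">
  <h2>Sample Gym</h2>
  <address> 12 Main Road, Lahore </address>
  <ul class="submenu"><li>042-1234567</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for element in soup.find_all("div", {"class": "CompanyInfo"}):
    name = element.find("h2").text
    address = element.find("address").text.strip()
    phone = element.find("ul", {"class": "submenu"}).text.strip()
    print([name, address, phone])
```

If this prints the expected row but the real run still only yields page 1, the extraction is fine and the problem is the missing page loop, which the answer below addresses.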

You just need more loops. You now need to loop through each page URL: see below.

import requests
from bs4 import BeautifulSoup
import csv
root_url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore"
html = requests.get(root_url)
soup = BeautifulSoup(html.text, 'html.parser')
paging = soup.find("div",{"class":"pg-full-width me-pagination"}).find("ul",{"class":"pagination"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text

outfile = open('gymlookup.csv','w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Address", "Phone"])

pages = list(range(1,int(last_page)+1))
for page in pages:
    url = 'https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=%s' %(page)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    #print(soup.prettify())
    print ('Processing page: %s' %(page))
    product_name_list = soup.findAll("div",{"class":"CompanyInfo"})
    for element in product_name_list:
        name = element.find('h2').text
        address = element.find('address').text.strip()
        phone = element.find("ul",{"class":"submenu"}).text.strip()
        writer.writerow([name, address, phone])
outfile.close()
print ('Done')  

You should use the page attribute: https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page=2

Sample code for 10 pages:

url = "https://www.lookup.pk/dynamic/search.aspx?searchtype=kl&k=gym&l=lahore&page={}"
for page_num in range(1, 11):  # range(1, 10) would stop at page 9
    page = requests.get(url.format(page_num))
    # further processing
