Get the web links for all items in a table, then paginate



I can grab all the links on a particular web page, but I'm running into trouble with the pagination part. Here is what I'm doing:

import requests, bs4, re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
r = requests.get(start_url)
soup = BeautifulSoup(r.text,'html.parser')
a_tags = soup.find_all('a')
print(a_tags)
links = [urljoin(start_url, a['href']) for a in a_tags]
print(links)

As a toy example, I'm using the following site:

start_url = 'https://www.opencodez.com/page/1'

I can get all of the links this way. However, I'm trying to automate it further by moving on to the next page, doing the same thing there, and writing all of the links out to a csv file.

I tried the following, but got no output at all:

start_url = 'https://www.opencodez.com/'
with open('names.csv', mode='w') as csv_file:
    fieldnames = ['Name']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

article_link = []

def scraping(webpage, page_number):
    next_page = webpage + str(page_number)
    r = requests.get(str(next_page))
    soup = BeautifulSoup(r.text, 'html.parser')
    a_tags = soup.find_all('a')
    print(a_tags)
    links = [urljoin(start_url, a['href']) for a in a_tags]
    print(links)
    for x in range(len(soup)):
        article_link.append(links)
    if page_number < 16:
        page_number = page_number + 1
        scraping(webpage, page_number)

scraping('https://www.opencodez.com/page/', 1)

# creating the data frame and populating its data into the csv file
data = {'Name': article_link}
df = DataFrame(data, columns=['Article_Link'])
df.to_csv(r'C:\Users\xxxxx\names.csv')

Can you help me figure out where I'm going wrong? I don't mind whether I end up with the links in the console output or printed to the csv file.

Your code has problems all over the place, but this works for me:

import requests, bs4, re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = 'https://www.opencodez.com/'
r = requests.get(start_url)                          # first page scraping
soup = BeautifulSoup(r.text, 'html.parser')
a_tags = soup.find_all('a')
article_link = []
links = [urljoin(start_url, a['href']) for a in a_tags]
article_link.append(links)

for page in range(2, 19):                            # for every page after 1
    links = []                                       # resetting lists on every page just in case
    a_tags = []
    url = 'https://www.opencodez.com/page/' + str(page)
    r = requests.get(url)                            # request the current page, not start_url
    soup = BeautifulSoup(r.text, 'html.parser')
    a_tags = soup.find_all('a')
    links = [urljoin(start_url, a['href']) for a in a_tags]
    article_link.append(links)

print(article_link)

I basically just changed how things get appended to the article_link list. That variable ends up as a list of length 18, and each list inside article_link is itself a list of 136 links.
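If you also want the csv output the question asked for, here is a minimal sketch using the standard csv module (it assumes the nested article_link list built above; the Link column name is just my choice):

import csv

with open('names.csv', mode='w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Link'])               # header row
    for page_links in article_link:         # one inner list per scraped page
        for link in page_links:
            writer.writerow([link])         # one link per row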
