Problem with web scraping Google using Python and Beautiful Soup



I am writing code that should open some of the subpages found in a set of search results.

import bs4
import requests
url = 'https://www.google.com/search?q=python'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
list_sites = soup.select('a[href]')
print(len(list_sites))

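For example, I want to search Google for 'python' and then open the first few result links, but I have a problem with the select function. What selector should I pass to it to find the links to the result pages, such as: Polish Python Coders Group - News, Welcome to Python.org, …? I tried a[href], a, and the h3 class, but none of them work…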

Is this what you need?

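from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml  # not used directly; the 'lxml' parser below needs it installed

def print_extracted_data_from_url(url):
    # Send a browser-like User-Agent so Google returns the regular result markup.
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # Each organic result sits in a div with class 'tF2Cxc';
    # this class name comes from Google's markup and may change.
    for container in soup.find_all('div', class_='tF2Cxc'):
        head_link = container.a['href']
        print(head_link)

    # Return the "next page" anchor, or None if there is no further page.
    return soup.select_one('a#pnnext')

next_page_node = print_extracted_data_from_url('https://www.google.com/search?hl=en-US&q=python')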
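
The function above returns the a#pnnext node, and the href of such a link is usually relative to the site root. The following is a minimal sketch of how that href could be turned into an absolute URL for the next request, using only urllib.parse. The helper name build_next_page_url and the example href are hypothetical and only illustrate the idea; the real href that Google returns may look different.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.google.com/search?hl=en-US&q=python"


def build_next_page_url(base_url, next_href):
    """Resolve a (possibly relative) href against the page it was found on.

    Returns None when there is no href, i.e. when there is no next page.
    """
    if not next_href:
        return None
    return urljoin(base_url, next_href)


# Hypothetical href, assumed to resemble what the "next page" link contains.
example_href = "/search?q=python&hl=en-US&start=10"
next_url = build_next_page_url(BASE_URL, example_href)
print(next_url)
```

With the code above, the href would come from next_page_node['href'] when next_page_node is not None, and the resulting URL could be passed back to print_extracted_data_from_url to process the following page.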
