Python: Scraping the main URL and title of each website from Google results



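I am trying to scrape a given number of results from a Google search, but so far I have run into two problems. The first is that I do not know how to combine the URL and the title in a single loop so that they are displayed together in this format: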

(Title)
(Website URL)
(---------)
(Title)
(Website URL)
(---------)

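I managed to get this format, but the loop runs many times over instead of showing only the top 10 results. I believe this has to do with how I structured the loops to work together, but I do not know how to avoid it.

The other problem is that I want the main URL and title of each website in the search results. I do get the correct titles, but I seem to get many links from the same website rather than just the main URL. For example, if I search for "data science", the second or third title shown is from Coursera while the link is from Wikipedia. I only want the main URL, so that the title matches the website URL. How can I get it?

Any input would be much appreciated.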

import requests
from bs4 import BeautifulSoup
import re
query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
soup_title = BeautifulSoup(requests_results.text,"html.parser")
links = soup_link.find_all("a")
heading_object=soup_title.find_all( 'h3' )
for link in links:
    for info in heading_object:
        get_title = info.getText()
        link_href = link.get('href')
        if "url?q=" in link_href and not "webcache" in link_href:
            print(get_title)
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            print("------")

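The length of links does not seem to match the heading_object list. I think it would be better to filter more narrowly than just "a".

Editing your solution, you can loop through the links like this: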

import requests
from bs4 import BeautifulSoup
query = "data science"
search = query.replace(' ', '+')
results = 10
url = f"https://www.google.com/search?q={search}&num={results}"
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
links = soup_link.find_all("a")
for link in links:
    # Some <a> tags have no href; treat those as an empty string
    link_href = link.get('href') or ''
    if "url?q=" in link_href and "webcache" not in link_href:
        # Only real results wrap an <h3> title inside the link
        title = link.find_all('h3')
        if len(title) > 0:
            print(link_href.split("?q=")[1].split("&sa=U")[0])
            print(title[0].getText())
            print("------")

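Instead of keeping two lists for titles and links, we can get the title directly from the link. We do this by running another find_all('h3') inside the link object. Some links match the url?q= format but are not part of the actual results you want to show, for example the expandable accordions used for related searches, so we need to filter those out too. We can do that by checking whether they contain an "h3", which is why we have len(title) > 0.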
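As a complement to the string splitting above, here is a minimal sketch of pulling the destination out of a redirect-style href with the standard library, and then reducing it to the main URL (scheme plus host) that the question asks for. The helper names target_url and main_url and the sample href are made up for illustration; the real hrefs Google returns may carry different parameters.

```python
from urllib.parse import urlparse, parse_qs


def target_url(href):
    """Return the destination of a '/url?q=...' redirect link, or None."""
    query = parse_qs(urlparse(href).query)
    values = query.get("q")
    return values[0] if values else None


def main_url(url):
    """Reduce a full URL to scheme + host, e.g. 'https://en.wikipedia.org'."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical href in the same shape as the ones filtered above
href = "/url?q=https://en.wikipedia.org/wiki/Data_science&sa=U&ved=abc"
full = target_url(href)
print(full)
print(main_url(full))
```

Parsing the query string avoids depending on &sa=U always being the next parameter, which the split-based version assumes.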

Try passing the requests params as a dict, which is more readable, for example:

params = {
    "q": "fus ro dah",
    "hl": "en",
    "gl": "us",
    "num": "100"
}
requests.get('https://www.google.com/search', params=params)
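To see what the dict turns into, the sketch below builds the same query string with the standard library's urlencode. This is only an illustration of the encoding step (spaces become +), which is why the manual query.replace(' ', '+') is no longer needed; requests performs the equivalent encoding itself when given params.

```python
from urllib.parse import urlencode

params = {
    "q": "fus ro dah",
    "hl": "en",
    "gl": "us",
    "num": "100",
}

# Encode the dict into a query string; spaces in values become '+'
query_string = urlencode(params)
print(f"https://www.google.com/search?{query_string}")
```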

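Make sure you send request headers and pass a user-agent so that the request looks like a real user visit. Otherwise Google will eventually block your requests, because the default requests user-agent is python-requests. Check what your user-agent is.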

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
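One way to check what would actually be sent, without contacting Google at all, is to prepare the request through a Session and inspect it. This is a rough sketch; the shortened browser string is a placeholder, not a recommended value.

```python
import requests

# What requests sends when no User-Agent is supplied, e.g. "python-requests/2.x.y"
default_agent = requests.utils.default_user_agent()

# Placeholder browser-like value for illustration only
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Build the request without sending it, so the final headers can be inspected
session = requests.Session()
prepared = session.prepare_request(
    requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": "data science"},
        headers=headers,
    )
)

print(default_agent)
print(prepared.headers["User-Agent"])
print(prepared.url)
```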

You do not need to create multiple soups (BeautifulSoup() objects); create just one and call it whenever you need it. See the CSS selectors reference.

soup = BeautifulSoup(html.text, 'YOUR PARSER OF CHOICE') # try to use 'lxml', it's one of the fastest
# call it
soup.select()
soup.find_all()
soup.a.parent
soup.p.next_element
for i in soup.select('css_selector'):
    some_variable = i.select_one('css_selector')
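To make the pattern concrete, here is a small sketch that runs against a hard-coded HTML string rather than a live Google page. The markup and the .result class are invented for the example and are not Google's real structure; the point is only that a single soup, one loop over result containers, and select_one inside each container keep every title paired with its own link.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class name is a placeholder, not one of Google's
html = """
<div class="result"><a href="https://example.com/a"><h3>First title</h3></a></div>
<div class="result"><a href="https://example.com/b"><h3>Second title</h3></a></div>
<div class="result"><a href="https://example.com/c"></a></div>
"""

soup = BeautifulSoup(html, "html.parser")  # one soup object, reused below

pairs = []
for result in soup.select(".result"):
    title = result.select_one("h3")
    link = result.select_one("a")
    # Skip containers that lack a title or a link
    if title is None or link is None:
        continue
    pairs.append((title.get_text(), link["href"]))

for title, href in pairs:
    print(title)
    print(href)
    print("------")
```

Because the title and link are looked up inside the same container, there is no second list to keep in sync and no nested loop that repeats results.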

Code and full example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    'q': 'data science',
    'hl': 'en',
    'num': '100'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    try:
        snippet = result.select_one('#rso .lyLwlc').text
    except AttributeError:
        # select_one() returned None: this result has no snippet
        snippet = None
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
    print('---------------')
'''
Data Science Specialization - Coursera
https://www.coursera.org/specializations/jhu-data-science
https://www.coursera.org › ... › Data Analysis
Offered by Johns Hopkins University. Launch Your Career in Data Science. A ten-course introduction to data science, developed and taught by .
---------------
'''

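Alternatively, you can do the same thing with the Google Organic Results API from SerpApi. It is a paid API with a free plan.

The main difference is that you only need to iterate over structured JSON and pick out the data you want, rather than working out how to select certain elements and extract data from them, how to get around blocks from Google if they appear, or how to deal with JavaScript-driven sites such as Google Maps.

Code to integrate: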

from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"), # serpapi API key
    "engine": "google",              # search engine
    "q": "data science",             # search query
    "hl": "en"                       # language of the search
}
search = GoogleSearch(params)      # where data extraction happens
results = search.get_dict()        # JSON -> Python dictionary
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")
    print('---------------')
'''
Data science - Wikipedia
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org › wiki › Data_science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured ...
---------------
'''
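If you would rather not rely on every result carrying every key, a slightly more defensive loop might look like the sketch below. The sample dictionary is invented and only mirrors the key names used in the code above (organic_results, title, link, snippet); it is not real API output, and the summarize helper is hypothetical.

```python
# Invented data that mimics the shape of the dictionary used above
results = {
    "organic_results": [
        {"title": "Example A", "link": "https://example.com/a", "snippet": "Text A"},
        {"title": "Example B", "link": "https://example.org/b"},  # no snippet key
        {"title": "Example C", "link": "https://example.net/c", "snippet": "Text C"},
    ]
}


def summarize(results, limit=10):
    """Return up to `limit` (title, link, snippet) tuples; missing keys become None."""
    rows = []
    for item in results.get("organic_results", [])[:limit]:
        rows.append((item.get("title"), item.get("link"), item.get("snippet")))
    return rows


for title, link, snippet in summarize(results, limit=2):
    print(f"{title}\n{link}\n{snippet}")
    print("---------------")
```

Slicing with limit also gives a simple way to keep only the first N results, which is what the question was after.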

Disclaimer: I work for SerpApi.
