Losing information when using Beautiful Soup



I'm following the book "Automate the Boring Stuff with Python", working on the exercise called "Project: 'I'm Feeling Lucky' Google Search".

But the CSS selector returns nothing:

import requests, sys, webbrowser, bs4, pyperclip

# Use the command-line arguments as the search term, or fall back to the clipboard.
if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + address)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')  # each result link sits inside an element with class "r"
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

I've tested the same code in the IDLE shell. It seems that

linkElems = soup.select('.r')

returns nothing.

After I inspected the value Beautiful Soup returns from

soup = bs4.BeautifulSoup(res.text, "html.parser")

I found that every class='r' and class='rc' had disappeared for no apparent reason, even though they are present in the raw HTML file.
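
A minimal way to see what HTML requests actually downloaded (a sketch; the file name google.html is just an example):

import requests

res = requests.get('http://google.com/search?q=linux')
res.raise_for_status()

# Does the class the selector needs exist anywhere in the downloaded response?
print('class="r"' in res.text)

# Save the response to disk so it can be compared with the browser's "view source".
with open('google.html', 'w', encoding='utf-8') as f:
    f.write(res.text)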

Please tell me why this happens and how to avoid this kind of problem.

To get the HTML version that defines class r, you need to set a User-Agent in the request headers:

import requests
from bs4 import BeautifulSoup

address = 'linux'
# Pretend to be a regular browser; Google serves different, stripped-down HTML to unknown clients.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
res = requests.get('http://google.com/search?q=' + address, headers=headers)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)

Prints:

Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
...and so on.

The reason Google blocks your request is that the default requests user-agent is python-requests. Google checks the user-agent, sees that the request comes from a script rather than a browser, blocks it, and serves completely different HTML with different elements and selectors. Even with a browser user-agent set, you may occasionally still receive different HTML with different selectors.
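
To see which user-agent your script actually sends (a small sketch; httpbin.org simply echoes the request headers back):

import requests

# requests exposes its default identification string directly.
print(requests.utils.default_user_agent())   # e.g. python-requests/2.28.1

# httpbin.org/user-agent echoes back whatever user-agent the server received.
print(requests.get('https://httpbin.org/user-agent').json())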

Read more about the user-agent and HTTP request headers.

Pass the user-agent into the request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)

Try using the lxml parser; it's faster.
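
A rough way to measure the difference yourself (a sketch assuming lxml is installed; absolute timings vary by page size and machine):

import timeit
import requests
from bs4 import BeautifulSoup

# Fetch one page once, then time how long each parser takes on the same HTML.
html = requests.get('https://en.wikipedia.org/wiki/Linux').text

for parser in ('html.parser', 'lxml'):
    seconds = timeit.timeit(lambda: BeautifulSoup(html, parser), number=20)
    print(f'{parser}: {seconds:.2f} s for 20 parses')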

Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# .tF2Cxc is the container for one organic result; .yuRUbf wraps its title link.
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-----
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''

Alternatively, you can achieve the same thing with SerpApi's Google Organic Results API. It's a paid API with a free plan.

The difference is that you only have to extract the data you want from a JSON string, instead of figuring out how to scrape Google, maintain the parser, or bypass its blocks.

Code to integrate:


import os
from serpapi import GoogleSearch  # pip install google-search-results

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''

Disclaimer: I work for SerpApi.
