I am trying to do some web scraping with the following code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
page = requests.get('https://www.google.com/search?q=phagwara+weather')
soup = BeautifulSoup(page.content, 'html-parser')
day = soup.find(id='wob_wc')
print(day.find_all('span'))
but I keep getting the following error:
  File "C:\Users\myname\Desktop\webscraping.py", line 6, in <module>
    soup = BeautifulSoup(page.content, 'html-parser')
  File "C:\Users\myname\AppData\Local\Programs\Python\Python38-32\lib\site-packages\bs4\__init__.py", line 225, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html-parser. Do you need to install a parser library?
I have installed lxml and html5lib, but the problem persists.
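You need to change 'html-parser' to 'html.parser', i.e. soup = BeautifulSoup(page.content, 'html.parser')
It also helps to name the tag explicitly, so use soup.find("div", id="wob_wc") rather than soup.find(id="wob_wc"). The second form works as well, but it matches an element of any tag with that id.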
The parser name is html.parser, not html-parser. The difference is the dot.
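To make the difference concrete, here is a minimal sketch that runs against a small inline HTML string (the markup is made up for illustration, so no request to Google is needed). It assumes only that bs4 is installed; html.parser ships with Python itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up markup that only mimics the id used in the question.
html = "<div id='wob_wc'><span>87</span></div>"

try:
    BeautifulSoup(html, "html-parser")  # misspelled: hyphen instead of dot
except FeatureNotFound as exc:
    print(f"Rejected: {exc}")

# 'html.parser' is the built-in parser, so nothing extra has to be installed.
soup = BeautifulSoup(html, "html.parser")
print(soup.find(id="wob_wc").span.text)
```

Installing lxml or html5lib does not help here because neither of them registers a builder under the name html-parser.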
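Also, Google will usually return a 200 response by default even when it has blocked you, so the status code alone does not tell you whether you were blocked. You usually have to check r.content.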
I have included headers, and now it works.
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
r = requests.get(
    "https://www.google.com/search?q=phagwara+weather", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.find("div", id="wob_wc"))
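Since a 200 status does not prove that you got the real page, one way to check the content is a small helper like the sketch below. The function name looks_blocked and the marker strings are my own assumptions for illustration, not anything Google documents, so adjust them to whatever you actually see in r.text.

```python
# Hypothetical marker strings: an assumption, not a documented Google contract.
BLOCK_MARKERS = ("unusual traffic", "captcha")


def looks_blocked(status_code: int, html: str, required_id: str = "wob_wc") -> bool:
    """Return True if the response probably does not contain the weather widget."""
    if status_code != 200:
        return True
    lowered = html.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    # The widget id must be present for a later find(id=...) call to succeed.
    return f'id="{required_id}"' not in html


# Made-up pages for demonstration; in real use pass r.status_code and r.text.
ok_page = '<div id="wob_wc"><span id="wob_tm">87</span></div>'
blocked_page = "<html><body>Our systems have detected unusual traffic</body></html>"
print(looks_blocked(200, ok_page))       # False
print(looks_blocked(200, blocked_page))  # True
```

The check is deliberately crude: it only tells you whether it is worth parsing the response at all.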
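Actually, you don't need to iterate over the whole "div #wob_wc" container. The current location, weather, date, temperature, precipitation, humidity and wind are each a single element that is not repeated anywhere else, so you can grab them directly with select() or find().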
If you do want to iterate over something, the temperature forecast is a good candidate, for example:
for forecast in soup.select('.wob_df'):
    high_temp = forecast.select_one('.vk_gy .wob_t:nth-child(1)').text
    low_temp = forecast.select_one('.QrNVmd .wob_t:nth-child(1)').text
    print(f'High: {high_temp}, Low: {low_temp}')
'''
High: 67, Low: 55
High: 65, Low: 56
High: 68, Low: 55
'''
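One caveat with selectors like these: select_one() returns None when nothing matches, so calling .text on the result raises AttributeError as soon as Google renames a class. A small guard such as the sketch below avoids that. The helper name text_or_default and the sample markup are assumptions made for illustration; the sample only mimics the ids used in this thread.

```python
from bs4 import BeautifulSoup

# Static sample markup (an assumption), so the example runs without a request.
SAMPLE_HTML = """
<div id="wob_wc">
  <span id="wob_tm">87</span>
  <span id="wob_hm">70%</span>
</div>
"""


def text_or_default(soup, selector, default=None):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
temperature = text_or_default(soup, "#wob_tm")
wind = text_or_default(soup, "#wob_ws", default="n/a")  # this id is absent from the sample
print(temperature, wind)
```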
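Have a look at the SelectorGadget Chrome extension, which lets you grab CSS selectors by clicking on the desired element in your browser. See also the CSS selectors reference.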
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "phagwara weather",
    "hl": "en",
    "gl": "us"
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Weather condition: {weather_condition}\n'
      f'Temperature: {temperature}°F\n'
      f'Precipitation: {precipitation}\n'
      f'Humidity: {humidity}\n'
      f'Wind speed: {wind}\n'
      f'Current time: {current_time}\n')
for forecast in soup.select('.wob_df'):
    day = forecast.select_one('.QrNVmd').text
    weather = forecast.select_one('img.uW5pk')['alt']
    high_temp = forecast.select_one('.vk_gy .wob_t:nth-child(1)').text
    low_temp = forecast.select_one('.QrNVmd .wob_t:nth-child(1)').text
    print(f'Day: {day}\nWeather: {weather}\nHigh: {high_temp}, Low: {low_temp}\n')
---------
'''
Weather condition: Partly cloudy
Temperature: 87°F
Precipitation: 5%
Humidity: 70%
Wind speed: 4 mph
Current time: Tuesday 4:00 PM
Forecast temperature:
Day: Tue
Weather: Partly cloudy
High: 90, Low: 76
...
'''
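Alternatively, you can use the Google Direct Answer Box API from SerpApi to do the same thing. It is a paid API with a free plan.
The main difference in your case is that you only need to iterate over data that has already been extracted, rather than building everything from scratch or figuring out how to get around blocks from Google.
Code to integrate: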
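import json
import os
from serpapi import GoogleSearch  # provided by the google-search-results package

params = {
    "engine": "google",
    "q": "phagwara weather",
    "api_key": os.getenv("API_KEY"),
    "hl": "en",
    "gl": "us",
}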
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
forecast = results['answer_box']['forecast']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
print(json.dumps(forecast, indent=2))
---------
'''
Phagwara, Punjab, India
Tuesday 4:00 PM
Partly cloudy
87°F
5%
70%
4 mph
[
  {
    "day": "Tuesday",
    "weather": "Partly cloudy",
    "temperature": {
      "high": "90",
      "low": "76"
    },
    "thumbnail": "https://ssl.gstatic.com/onebox/weather/48/partly_cloudy.png"
  }
  ...
]
'''
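To illustrate what iterating over the already-extracted data looks like, here is a small sketch that works on a forecast list shaped like the output above. The entry is copied from that output, and the summarize helper is a hypothetical name of mine, not part of the SerpApi library.

```python
# Sample data copied from the forecast output shown above;
# a real response would contain one entry per day.
forecast = [
    {
        "day": "Tuesday",
        "weather": "Partly cloudy",
        "temperature": {"high": "90", "low": "76"},
        "thumbnail": "https://ssl.gstatic.com/onebox/weather/48/partly_cloudy.png",
    }
]


def summarize(entries):
    """Build one line per day; temperatures arrive as strings, so convert before arithmetic."""
    lines = []
    for entry in entries:
        high = int(entry["temperature"]["high"])
        low = int(entry["temperature"]["low"])
        lines.append(
            f'{entry["day"]}: {entry["weather"]}, high {high}, low {low}, range {high - low}'
        )
    return lines


for line in summarize(forecast):
    print(line)
```

No parsing or selectors are involved at this point, only plain dictionary access.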
Disclaimer: I work for SerpApi.