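Trying to scrape text from a table on a website

I'm new to this, but I've been trying to scrape data from a website (https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA) and I keep coming up empty. I've tried BeautifulSoup and Scrapy, but I can't get the text out.

Ultimately, I want to put the row for every wine in the table into a DataFrame/.csv (from all pages), but at the moment I can't even get the name of the first wine producer.

If you inspect the page, all the details are in tags that have no id or class.

My BeautifulSoup attempt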

URL = 'https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36 Edg/106.0.1370.52"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
soup2 = soup.prettify()
producer = soup2.find_all('td').get_text()
print(producer)

Which threw the error:

producer = soup2.find_all('td').get_text()
AttributeError: 'str' object has no attribute 'find_all'

My Scrapy attempt

winedf = pd.DataFrame()
class WineSpider(scrapy.Spider):
    name = 'wine_spider'
    def start_requests(self):
        dwwa_url = "https://awards.decanter.com/DWWA/2022/search/wines?competitionType=DWWA"
        yield scrapy.Request(url=dwwa_url, callback=self.parse_front)
    def parse_front(self, response):
        table = response.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table')
        page_links = table.xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/div[2]/div[1]/ul/li[3]/a(@class, "dwwa-page-link") @href')
        links_to_follow = page_links.extract()
        for url in links_to_follow:
            yield response.follow(url=url, callback=self.parse_pages)
    def parse_pages(self, response):
        wine_name = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/td[1]/text()').get()
        wine_name_ext = wine_name.extract().strip()
        winedf.append(wine_name_ext)
        medal = Selector(response=response).xpath('//*[@id="root"]/div/div[2]/div[4]/div[2]/table/tbody/tr[1]/td[4]/text()').get()
        medal_ext = medal.extract().strip()
        winedf.append(medal_ext)

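It produces an empty df.

Any help would be greatly appreciated.

Thanks!

When loading a site you want to scrape, always use the network monitor to check what it loads. In this case, you can see that it loads the data dynamically from an API. This means you can skip the scraping entirely and load the data from the API straight into pandas: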

import pandas as pd
df = pd.read_json('https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA')

That gives all 14858 items:

	producer	name	country	region
0	Yealands Estate Wines	Babydoll Sauvignon Blanc	New Zealand
1	Yealands Estate Wines	Reserve Pinot Gris	DWWA 2022	7	Marlborough	N/A	White	B	2022
2	Yealands Estate Wines	Babydoll Pinot Gris	706479		Marlborough	N/A	White	A	2022
3	Yealands Estate Wines		N/A	White
4	Yealands Estate Wines	Reserve Sauvignon Blanc	Marlborough	Awatere Valley	White	Still - Dry (residual sugar below 5 g/L)
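Since the goal in the question is a .csv containing every wine, the API approach above can be extended by one step. The following is only a minimal sketch: the helper names `load_wines` and `save_wines` and the output file name are made up for illustration, and the sample records use assumed field names ("producer", "name") that may not match the real API columns. The demonstration at the bottom runs offline on two invented records rather than on real results.

```python
import io

import pandas as pd

# Endpoint quoted in the answer above.
API_URL = ("https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search"
           "?competitionType=DWWA")


def load_wines(source=API_URL):
    """Read a JSON array of wine records into a DataFrame.

    `source` may be a URL, a file path, or a file-like object.
    """
    return pd.read_json(source)


def save_wines(df, target="dwwa_2022_wines.csv"):
    """Write the DataFrame to CSV without the numeric index column."""
    df.to_csv(target, index=False)


# Offline demonstration with two made-up records; the field names
# "producer" and "name" are assumptions, not confirmed API columns.
sample = io.StringIO(
    '[{"producer": "Producer A", "name": "Wine X"},'
    ' {"producer": "Producer B", "name": "Wine Y"}]'
)
demo_df = load_wines(sample)
buffer = io.StringIO()
save_wines(demo_df, buffer)
print(buffer.getvalue())
```

With network access, calling `save_wines(load_wines())` should fetch the full result set in one request and write it to dwwa_2022_wines.csv, so no page-by-page crawling is needed.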

Try:

import pandas as pd
url = "https://decanterresultsapi.decanter.com/api/DWWA/2022/wines/search?competitionType=DWWA"
df = pd.read_json(url)
# print last items in df:
print(df.tail().to_markdown())

Prints:

	producer	name	country	region
14853	Telavi Wine Cellar	718257	DWWA 2022	Georgia	Kakhetti	Kindzmarauli	2021	Still - Medium (residual sugar between 19 and 44 g/L)
14854	Štrigova	Muškat žuti	DWWA 2022	87	Croatia	Continental	Zagorje-Međimurje	2021
14855	Kopjar	Muškat žuti	Continental	Zagorje-Međimurje	2021
14856	Cleebron-Güglingen	Blanc De Noir Fein & Fruchtig	719836	White
14857	Winnice Czajkowski	Toma 8 Grand Prix	719891	6	N/A	N/A	2021	White	Still - Medium (residual sugar between 19 and 44 g/L)
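As a side note on the BeautifulSoup error in the question: prettify() returns a plain string, so calling find_all on its result raises the AttributeError shown there. In addition, find_all returns a list-like ResultSet, so get_text has to be called on each element rather than on the result as a whole. A minimal sketch, using a made-up HTML snippet in place of the Decanter page (whose rows are loaded from the API, so they presumably would not appear in the HTML that requests receives):

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML standing in for a page that really contains table cells.
html = "<table><tr><td>Producer A</td><td>Wine X</td></tr></table>"

# Keep working with the BeautifulSoup object; prettify() is only for display.
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, so extract the text cell by cell.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]
print(cells)
```

This only fixes the AttributeError. For this particular site, reading the API directly as shown above remains the simpler route.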
