How do I get data from a website using Beautiful Soup and requests?



I am a beginner at web scraping and I need help with this problem. The site is allrecipes.com, a website where you can find recipes based on a search, in this case "pie":

Link to the HTML file: 'view-source:https://www.allrecipes.com/search/results/?wt=pie&sort=re' (right click > View Page Source)

I want to create a program that takes an input, searches for it on Allrecipes, and returns a list of tuples for the top five recipes, with data such as the time it takes to make, the serving yield, the ingredients, and so on. This is my program so far:

import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('what recipe would you like to search')
    url = 'http://www.allrecipes.com/search/results/?wt=' + str(inp) + '&sort=re'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    # fill in code for finding top 3 or five links

    for i in range(3):
        a = requests.get(links[i])
        soupa = BeautifulSoup(a.text, 'html.parser')
        # fill in code to find name, ingredients, time, and serving size with data from soupa

        names = []
        time = []
        servings = []
        ratings = []
        ingredients = []


searchdata()

Yes, I know my code is very messy, but what should I put in the two fill-in areas? Thanks
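For the first fill-in area, the general pattern is to select the anchor tags inside each result card and read their `href` attributes. A minimal, offline sketch of that pattern (the HTML below is a simplified stand-in I made up; the real site's class names and nesting differ):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a search-results page; the real site's
# markup uses different class names and nesting.
html = """
<section id="results">
  <article><h3><a href="https://example.com/recipe/1"><span>Apple Pie</span></a></h3></article>
  <article><h3><a href="https://example.com/recipe/2"><span>Pecan Pie</span></a></h3></article>
  <article><h3><a href="https://example.com/recipe/3"><span>Key Lime Pie</span></a></h3></article>
</section>
"""

soup = BeautifulSoup(html, 'html.parser')
anchors = soup.select('section#results article h3 a')[:3]  # keep the top 3 results
links = [a['href'] for a in anchors]   # hrefs to request individually
names = [a.span.text for a in anchors]
print(links)
print(names)
```

Once you have `links`, the loop in your second fill-in area would request each one and parse the recipe page the same way.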

After searching for a recipe you have to get each recipe's link and then request each of those links again, because the information you are looking for is not available on the search page. Without OOP this wouldn't look clean, so here is a class I wrote that does what you want.

import requests
from time import sleep
from bs4 import BeautifulSoup


class Scraper:
    links = []
    names = []

    def get_url(self, url):
        url = requests.get(url)
        self.soup = BeautifulSoup(url.content, 'html.parser')

    def print_info(self, name):
        self.get_url(f'https://www.allrecipes.com/search/results/?wt={name}&sort=re')
        if self.soup.find('span', class_='subtext').text.strip()[0] == '0':
            print(f'No recipes found for {name}')
            return
        results = self.soup.find('section', id='fixedGridSection')
        articles = results.find_all('article')
        texts = []
        for article in articles:
            txt = article.find('h3', class_='fixed-recipe-card__h3')
            if txt:
                if len(texts) < 5:
                    texts.append(txt)
                else:
                    break
        self.links = [txt.a['href'] for txt in texts]
        self.names = [txt.a.span.text for txt in texts]
        self.get_data()

    def get_data(self):
        for i, link in enumerate(self.links):
            self.get_url(link)
            print('-' * 4 + self.names[i] + '-' * 4)
            info_names = [div.text.strip() for div in self.soup.find_all(
                'div', class_='recipe-meta-item-header')]
            ingredient_spans = self.soup.find_all('span', class_='ingredients-item-name')
            ingredients = [span.text.strip() for span in ingredient_spans]
            for i, div in enumerate(self.soup.find_all('div', class_='recipe-meta-item-body')):
                print(info_names[i].capitalize(), div.text.strip())
            print()
            print('Ingredients'.center(len(ingredients[0]), ' '))
            print('\n'.join(ingredients))
            print()
            print('*' * 50, end='\n\n')


chrome = Scraper()
chrome.print_info(input('What recipe would you like to search: '))
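One detail worth hedging: interpolating raw user input straight into the URL breaks for multi-word searches like "pecan pie". The standard library's `urllib.parse.quote_plus` encodes the query safely (a small sketch, independent of the class above; `search_url` is a helper name I made up):

```python
from urllib.parse import quote_plus

def search_url(term):
    # Build the allrecipes search URL with the query percent-encoded,
    # so spaces and special characters survive the trip.
    return f'https://www.allrecipes.com/search/results/?wt={quote_plus(term)}&sort=re'

print(search_url('pecan pie'))
# → https://www.allrecipes.com/search/results/?wt=pecan+pie&sort=re
```

You could call `search_url(name)` inside `print_info` instead of formatting the string inline.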
