Basic Python BeautifulSoup web scraping of TripAdvisor reviews and data cleaning



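I'm new to programming and to StackOverflow. I just need to do some basic web scraping of a TripAdvisor page, pull out a few useful pieces of information, and display them nicely. I'm trying to isolate each cafe's title, its number of reviews, and the rating itself. I think I may need to convert the result to text and use a regular expression or something similar? I really don't know. An example of what I mean:

Output:

Coffee Cafe, 4 of 5 bubbles, 201 reviews.

Something like that. My code is below; any help would be great and I'd be very grateful. Cheers.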

import urllib.request

from bs4 import BeautifulSoup

def get_HTML(url):
    response = urllib.request.urlopen(url)
    html = response.read()
    return html

Tripadvisor_reviews_HTML = get_HTML(
    'https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html')

def get_review_count(HTML):
    soup = BeautifulSoup(HTML, "lxml")
    for element in soup(attrs={'class': 'reviewCount'}):
        print(element)

get_review_count(Tripadvisor_reviews_HTML)

def get_review_score(HTML):
    soup = BeautifulSoup(HTML, "lxml")
    for four_point_five_score in soup(attrs={'alt': '4.5 of 5 bubbles'}):
        print(four_point_five_score)

get_review_score(Tripadvisor_reviews_HTML)

def get_cafe_name(HTML):
    soup = BeautifulSoup(HTML, "lxml")
    for name in soup(attrs={'class': "property_title"}):
        print(name)

get_cafe_name(Tripadvisor_reviews_HTML)

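You forgot to use .text on the element in each print statement, so the whole tag is printed instead of just its text. That said, try the following approach, which gets all three fields from that site in a single pass.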

from bs4 import BeautifulSoup
import urllib.request

URL = "https://www.tripadvisor.com.au/Restaurants-g255068-c8-Brisbane_Brisbane_Region_Queensland.html"

def get_info(link):
    response = urllib.request.urlopen(link)
    soup = BeautifulSoup(response.read(), "lxml")
    for items in soup.find_all(class_="shortSellDetails"):
        name = items.find(class_="property_title").get_text(strip=True)
        bubble = items.find(class_="ui_bubble_rating").get("alt")
        review = items.find(class_="reviewCount").get_text(strip=True)
        print(name, bubble, review)

if __name__ == '__main__':
    get_info(URL)
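
To make the .text point concrete, here is a small offline sketch. The HTML snippet is made up for illustration (it only borrows the class names used above and is not TripAdvisor's real markup), and it assumes the built-in "html.parser" so it runs without lxml or a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used above.
SAMPLE_HTML = """
<div class="shortSellDetails">
  <a class="property_title" href="/cafe"> Coffee Cafe </a>
  <span class="ui_bubble_rating" alt="4 of 5 bubbles"></span>
  <a class="reviewCount">201 reviews</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tag = soup.find(class_="property_title")

# Converting the Tag to a string gives the whole element, markup included.
# This is what print(tag) shows.
as_markup = str(title_tag)

# get_text(strip=True) returns only the text, with surrounding whitespace trimmed.
as_text = title_tag.get_text(strip=True)

# The rating is stored in an attribute, so it is read with .get(), not .text.
rating = soup.find(class_="ui_bubble_rating").get("alt")

print(as_markup)
print(as_text)
print(rating)
```

The first print shows the full tag, which is what the code in the question was producing; the other two show the cleaned values.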

The results you get should look something like this:

Double Shot New Farm 4.5 of 5 bubbles 218 reviews
Goodness Gracious Cafe 4.5 of 5 bubbles 150 reviews
New Farm Deli & Cafe 4.5 of 5 bubbles 273 reviews
Coffee Anthology 4.5 of 5 bubbles 116 reviews
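
If you want the comma-separated sentence from the question rather than space-separated fields, one possible sketch is below. The helper name format_listing is hypothetical, and the regular expressions assume the strings look like the sample output above ("4.5 of 5 bubbles", "218 reviews"); the real page may use other wording.

```python
import re

def format_listing(name, bubble, review):
    """Build 'Name, X of 5 bubbles, N reviews.' from the scraped strings."""
    # Rating text is assumed to look like '4.5 of 5 bubbles'.
    rating = re.search(r"(\d+(?:\.\d+)?) of (\d+) bubbles", bubble or "")
    # Review text is assumed to look like '218 reviews' or '1,204 reviews'.
    count = re.search(r"(\d[\d,]*)\s+review", review or "")

    if rating:
        rating_part = f"{rating.group(1)} of {rating.group(2)} bubbles"
    else:
        rating_part = "no rating"

    if count:
        # Drop thousands separators before converting to an integer.
        count_part = f"{int(count.group(1).replace(',', ''))} reviews"
    else:
        count_part = "no reviews"

    return f"{name}, {rating_part}, {count_part}."

print(format_listing("Double Shot New Farm", "4.5 of 5 bubbles", "218 reviews"))
# prints: Double Shot New Farm, 4.5 of 5 bubbles, 218 reviews.
```

Passing None for a missing field falls back to a placeholder instead of raising, which helps when a listing has no rating yet.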
