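I'm fairly new to web scraping, so apologies if the answer to my question is obvious. I've built a web scraper that goes through the reviews of a Steam game (Civilization VI) and collects information such as the time spent in the game, whether the reviewer recommends it, how many products they own, and so on.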
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

url = "https://steamcommunity.com/app/289070/reviews/?browsefilter=toprated&snr=1_5_100010_"

review_dict = {
    "found_helpful": [],
    "title": [],  # recommended or not
    "hours": [],
    "prods_in_account": [],
    "words_in_review": []
}


def data_scrapper():
    """
    Gets the reviews from the Steam page.
    """
    response = requests.get(url)
    soup = bs(response.content, "html.parser")
    # Each review sits in its own card <div>
    card_div = soup.findAll("div", attrs={"class": "apphub_Card modalContentLink interactable"})
    for cards in card_div:
        found_helpful = cards.find("div", attrs={"class": "found_helpful"})
        vote_header = cards.find("div", attrs={"class": "vote_header"})
        hours = cards.find("div", attrs={"class": "hours"})
        products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"})
        words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"})
        review_dict["found_helpful"].append(found_helpful)
        review_dict["title"].append(vote_header)
        review_dict["hours"].append(hours)
        review_dict["prods_in_account"].append(products)
        review_dict["words_in_review"].append(len(words_in_review))


data_scrapper()
review_df = pd.DataFrame.from_dict(review_dict)
review_df.to_csv("review.csv", sep=",")
My problem is that when I run my code I expect an organized CSV file, but I get this instead:
,found_helpful,title,hours,prods_in_account,words_in_review
0,"<div class=""found_helpful"">
3,398 people found this review helpful<br/>159 people found this review funny <div class=""review_award_aggregated tooltip"" data-tooltip-class=""review_reward_tooltip"" data-tooltip-html='<div class=""review_award_ctn_hover""> <div class=""review_award"" data-reaction=""6"" data-reactioncount=""5"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/6.png?v=5""/>
<span class=""review_award_count "">5</span>
</div>
<div class=""review_award"" data-reaction=""3"" data-reactioncount=""3"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/3.png?v=5""/>
<span class=""review_award_count "">3</span>
</div>
<div class=""review_award"" data-reaction=""5"" data-reactioncount=""2"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/5.png?v=5""/>
<span class=""review_award_count "">2</span>
</div>
<div class=""review_award"" data-reaction=""1"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/1.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
<div class=""review_award"" data-reaction=""9"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/9.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
<div class=""review_award"" data-reaction=""18"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/18.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
<div class=""review_award"" data-reaction=""19"" data-reactioncount=""1"">
<img class=""review_award_icon tooltip"" src=""https://store.akamai.steamstatic.com/public/images/loyalty/reactions/still/19.png?v=5""/>
<span class=""review_award_count hidden"">1</span>
</div>
</div>'><img class=""reward_btn_icon"" src=""https://community.akamai.steamstatic.com/public/shared/images//award_icon_blue.svg""/>14</div>
</div>","<div class=""vote_header"">
<div class=""reviewInfo"">
<div class=""thumb"">
<img height=""44"" src=""https://community.akamai.steamstatic.com/public/shared/images/userreviews/icon_thumbsDown.png?v=1"" width=""44""/>
</div>
<div class=""title"">Not Recommended</div>
<div class=""hours"">8,028.3 hrs on record</div>
</div>
<div style=""clear: left""></div>
</div>","<div class=""hours"">8,028.3 hrs on record</div>","<div class=""apphub_CardContentMoreLink ellipsis"">167 products in account</div>",38
I've tried modifying the function that extracts and appends my data, but I still get this odd file. Any clue what I'm doing wrong?
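The raw HTML ends up in the CSV because `find` returns a BeautifulSoup `Tag` object, and turning a `Tag` into a string yields its full markup. Here is a minimal sketch of the difference. The HTML fragment is made up for illustration (shaped like part of one review card) and is not fetched from the live Steam page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like part of one review card; not fetched from Steam.
html = """
<div class="vote_header">
  <div class="title">Not Recommended</div>
  <div class="hours">8,028.3 hrs on record</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
hours_tag = soup.find("div", attrs={"class": "hours"})

# str() of a Tag is its markup, which is what was written to the CSV.
as_markup = str(hours_tag)

# get_text() returns only the text inside the tag.
as_text = hours_tag.get_text()

# A separator plus strip=True joins the stripped text pieces of a tag
# that has several children, so no stray newlines are left in the value.
header_tag = soup.find("div", attrs={"class": "vote_header"})
header_text = header_tag.get_text(" ", strip=True)

print(as_markup)
print(as_text)
print(header_text)
```

The same applies to `len(words_in_review)` in the question: on a `Tag`, `len()` counts the tag's child nodes, not characters or words.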
Make the following changes to your existing code:
for cards in card_div:
    # .get_text() returns the text inside each tag instead of the Tag object itself
    found_helpful = cards.find("div", attrs={"class": "found_helpful"}).get_text()
    vote_header = cards.find("div", attrs={"class": "vote_header"}).get_text()
    hours = cards.find("div", attrs={"class": "hours"}).get_text()
    products = cards.find("div", attrs={"class": "apphub_CardContentMoreLink ellipsis"}).get_text()
    words_in_review = cards.find("div", attrs={"class": "apphub_CardTextContent"}).get_text()
    review_dict["found_helpful"].append(found_helpful)
    review_dict["title"].append(vote_header)
    review_dict["hours"].append(hours)
    review_dict["prods_in_account"].append(products)
    # Note: len() of a string counts characters, not words
    review_dict["words_in_review"].append(len(words_in_review))

review_df = pd.DataFrame.from_dict(review_dict)
# Strip leading and trailing whitespace from every string column
cols = review_df.select_dtypes(['object']).columns
review_df[cols] = review_df[cols].apply(lambda x: x.str.strip())
Output:
   found_helpful                                      title                                   hours                  prods_in_account         words_in_review
0  1,266 people found this review helpful20 peopl...  Recommended\n456.9 hrs on record        456.9 hrs on record    536 products in account  770
1  1,127 people found this review helpful14 peopl...  Recommended\n92.1 hrs on record         92.1 hrs on record     135 products in account  574
2  853 people found this review helpful49 people ...  Recommended\n1,360.8 hrs on record      1,360.8 hrs on record  18 products in account   181
3  1,832 people found this review helpful18 peopl...  Recommended\n520.5 hrs on record        520.5 hrs on record    281 products in account  7114
4  3,370 people found this review helpful40 peopl...  Not Recommended\n415.7 hrs on record    415.7 hrs on record    102 products in account  853
5  5,724 people found this review helpful172 peop...  Not Recommended\n256.7 hrs on record    256.7 hrs on record    180 products in account  2072
6  393 people found this review helpful10 people ...  Recommended\n22.8 hrs on record         22.8 hrs on record     85 products in account   278
7  3,229 people found this review helpful62 peopl...  Not Recommended\n58.6 hrs on record     58.6 hrs on record     264 products in account  894
8  1,373 people found this review helpful22 peopl...  Not Recommended\n195.3 hrs on record    195.3 hrs on record    75 products in account   556
9  3,398 people found this review helpful159 peop...  Not Recommended\n8,028.8 hrs on record  8,028.8 hrs on record  167 products in account  8007
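The columns in this output are still free-form strings. If numeric columns are wanted, one possible follow-up is to pull the numbers out with regular expressions before building the DataFrame. The sketch below uses only the standard library. The helper names `first_number`, `word_count`, and `is_recommended` are hypothetical, and the patterns assume the English phrasing shown in the output.

```python
import re

# Matches a number with optional thousands separators and decimals, e.g. "8,028.8".
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def word_count(text):
    """Count whitespace-separated words (len(text) would count characters)."""
    return len(text.split())


def is_recommended(title):
    """True when the stripped title text starts with 'Recommended'."""
    return title.strip().startswith("Recommended")


# Sample strings copied from the last row of the output.
hours = first_number("8,028.8 hrs on record")
products = first_number("167 products in account")
helpful = first_number("3,398 people found this review helpful159 peop...")
recommended = is_recommended("Not Recommended\n8,028.8 hrs on record")

print(hours, products, helpful, recommended)
```

These helpers could be applied to each string inside the loop before appending, so that the CSV holds numbers and a boolean instead of sentences.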