Web scraping a child class, or cleaning up the returned HTML



I'm trying to get the MLB game lines and over/under totals from rotowire. I've tried two approaches, and while I can get close, I'm not quite sure what I need to do next. With the first approach it looks like I need to scrape the child span with class "composite hide". The other approach I took returns a bunch of newlines and other extra characters, even though I tried to grab only the text and strip it.

from bs4 import BeautifulSoup
import requests
url = 'https://www.rotowire.com/baseball/daily-lineups.php'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# First approach
oddsData = soup.find_all('div', {'class': 'lineup__odds-item'})
print(oddsData)
# Second approach
gameOdds = [g.text.strip() for g in oddsData]
print(gameOdds)

The first approach returns the following. I only want CLE -165 and 7.0 Runs.

[<div class="lineup__odds-item">
<b>LINE</b> 
<span class="composite hide">CLE -165</span>
<span class="fanduel">–</span>
<span class="draftkings hide">–</span>
<span class="betmgm hide">–</span>
<span class="pointsbet hide">–</span>
</div>, <div class="lineup__odds-item">
<b>O/U</b> 
<span class="composite hide">7.0 Runs</span>
<span class="fanduel">–</span>
<span class="draftkings hide">–</span>
<span class="betmgm hide">–</span>
<span class="pointsbet hide">–</span>
</div>]
The second approach returns the following.

['LINE\xa0\r\n                                                CLE -165\n–\n–\n–\n–', 'O/U\xa0\r\n                                                7.0 Runs\n–\n–\n–\n–']
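Incidentally, the label, the non-breaking spaces, and the placeholder dashes in each div are what produce that noise; selecting only the child span with class `composite` sidesteps them entirely. A minimal sketch against the fragment shown above:

```python
from bs4 import BeautifulSoup

# The two odds items from the question, trimmed to what matters here.
html = """
<div class="lineup__odds-item">
<b>LINE</b>
<span class="composite hide">CLE -165</span>
<span class="fanduel">–</span>
</div>
<div class="lineup__odds-item">
<b>O/U</b>
<span class="composite hide">7.0 Runs</span>
<span class="fanduel">–</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# "span.composite" matches any span whose class list includes "composite",
# so the "fanduel" placeholder dashes are never selected.
values = [s.get_text(strip=True)
          for s in soup.select(".lineup__odds-item span.composite")]
print(values)  # ['CLE -165', '7.0 Runs']
```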

To get the LINE and O/U content for each game, you can use the following example:

import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for i in soup.select(".lineup__bottom"):
    # Walk backwards in the document for the team names preceding each odds block.
    visit = i.find_previous(class_="lineup__mteam is-visit").get_text(strip=True)
    home = i.find_previous(class_="lineup__mteam is-home").get_text(strip=True)
    # The first <span> sibling after each <b> label is the composite value.
    line = i.select_one('.lineup__odds-item b:contains("LINE") ~ span').text
    ou = i.select_one('.lineup__odds-item b:contains("O/U") ~ span').text
    print("{:<25} {:<25} LINE: {:<10} O/U: {:<10}".format(visit, home, line, ou))

Prints:

White Sox(33-22)          Indians(30-24)            LINE: CLE -165   O/U: 7.0 Runs  
Twins(22-32)              Orioles(18-37)            LINE: MIN -170   O/U: 9.0 Runs  
Rays(35-21)               Yankees(30-25)            LINE: TB -135    O/U: 7.5 Runs  
Marlins(24-29)            Blue Jays(28-25)          LINE: TOR -160   O/U: 8.5 Runs  
Phillies(26-29)           Reds(24-29)               LINE: –          O/U: 7.5 Runs  
Nationals(22-29)          Braves(25-27)             LINE: ATL -140   O/U: 8.5 Runs  
Tigers(23-32)             Brewers(29-26)            LINE: MIL -180   O/U: 7.5 Runs  
Padres(34-22)             Cubs(31-23)               LINE: CHC -130   O/U: 8.5 Runs  
Pirates(20-34)            Royals(27-26)             LINE: KC -170    O/U: 8.5 Runs  
Red Sox(32-22)            Astros(30-24)             LINE: HOU -165   O/U: 9.0 Runs  
Rangers(22-34)            Rockies(21-34)            LINE: –          O/U: 9.5 Runs  
Mets(26-21)               Diamondbacks(20-36)       LINE: NYM -120   O/U: 9.0 Runs  
Angels(25-30)             Giants(34-21)             LINE: SF -160    O/U: 7.0 Runs  
Athletics(32-25)          Mariners(28-28)           LINE: OAK -160   O/U: 7.5 Runs  
Cardinals(31-24)          Dodgers(32-23)            LINE: LAD -200   O/U: 8.5 Runs  
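If you'd rather collect structured records than print rows, the same selectors can fill a list of dicts. A sketch against a minimal stand-in for the page markup (assumed shape, not fetched live); note that newer soupsieve releases spell the `:contains` pseudo-class as `:-soup-contains`:

```python
from bs4 import BeautifulSoup

# Stand-in markup mirroring the rotowire lineup structure (an assumption
# for illustration, not the actual page source).
html = """
<div class="lineup">
  <div class="lineup__mteam is-visit">White Sox(33-22)</div>
  <div class="lineup__mteam is-home">Indians(30-24)</div>
  <div class="lineup__bottom">
    <div class="lineup__odds-item"><b>LINE</b><span class="composite hide">CLE -165</span></div>
    <div class="lineup__odds-item"><b>O/U</b><span class="composite hide">7.0 Runs</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
games = []
for bottom in soup.select(".lineup__bottom"):
    games.append({
        "visit": bottom.find_previous(class_="lineup__mteam is-visit").get_text(strip=True),
        "home": bottom.find_previous(class_="lineup__mteam is-home").get_text(strip=True),
        # :-soup-contains is the current soupsieve name for :contains.
        "line": bottom.select_one('.lineup__odds-item b:-soup-contains("LINE") ~ span').get_text(strip=True),
        "ou": bottom.select_one('.lineup__odds-item b:-soup-contains("O/U") ~ span').get_text(strip=True),
    })
print(games)
```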

As I understand it, you want to clean the data with a general-purpose routine along these lines:

def clean_page(html, pretty_print=False):
    """
    >>> junk = "some random HTML<P> for you to try to parse</p>"
    >>> clean_page(junk)
    '<div><p>some random HTML</p><p> for you to try to parse</p></div>'
    >>> print(clean_page(junk, pretty_print=True))
    <div>
    <p>some random HTML</p>
    <p> for you to try to parse</p>
    </div>
    """
    from lxml.html import fromstring, tostring

    # lxml parses the fragment, closes the stray <P>, wraps it in a <div>;
    # decode() turns the bytes tostring() returns into a str.
    return tostring(fromstring(html), pretty_print=pretty_print).decode()

If you need it, here's a detailed article: https://schoolofdata.org/handbook/recipes/cleaning-data-scraped-from-the-web/
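If it's only the text noise from your second approach that needs fixing, note the junk is a non-breaking space (`\xa0`) plus `\r\n` runs and the dash placeholders; a sketch collapsing them with a stdlib regex:

```python
import re

# One string from the "second approach" output; the noise is a non-breaking
# space (\xa0), a \r\n run, and the placeholder dashes for the other books.
raw = 'LINE\xa0\r\n                CLE -165\n–\n–\n–\n–'

# Collapse every whitespace run (\s matches \xa0 on str in Python 3) to a
# single space, then drop the dash-only placeholder pieces.
parts = [p.strip() for p in re.sub(r'\s+', ' ', raw).split('–') if p.strip()]
print(parts)  # ['LINE CLE -165']
```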

Let me know if this is what you were after!
