I'm trying to get MLB game lines and over/unders from rotowire. I've tried two approaches, and while I can get close, I'm not quite sure what to do next. With the first approach it looks like I need to scrape the child elements with the class "composite hide". The other approach I tried returns a bunch of newlines and other extra characters, even though I tried to get just the text and strip it.
from bs4 import BeautifulSoup
import requests
url = 'https://www.rotowire.com/baseball/daily-lineups.php'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# First approach
oddsData = soup.find_all('div', {'class': 'lineup__odds-item'})
print(oddsData)
# Second approach
gameOdds = [g.text.strip() for g in oddsData]
print(gameOdds)
The first approach returns the following. I just want CLE -165 and 7.0 Runs.
[<div class="lineup__odds-item">
<b>LINE</b>
<span class="composite hide">CLE -165</span>
<span class="fanduel">–</span>
<span class="draftkings hide">–</span>
<span class="betmgm hide">–</span>
<span class="pointsbet hide">–</span>
</div>, <div class="lineup__odds-item">
<b>O/U</b>
<span class="composite hide">7.0 Runs</span>
<span class="fanduel">–</span>
<span class="draftkings hide">–</span>
<span class="betmgm hide">–</span>
<span class="pointsbet hide">–</span>
</div>]
The second approach returns the following.
['LINE\xa0\r\n CLE -165\n–\n–\n–\n–', 'O/U\xa0\r\n 7.0 Runs\n–\n–\n–\n–']
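(For reference, those stray characters are non-breaking spaces (`\xa0`) and CRLF line breaks whose backslashes were lost when pasting. A minimal sketch of collapsing that whitespace on a hard-coded copy of one of the strings, independent of the page itself:)

```python
import re

# Hard-coded copy of the first scraped string, so this runs offline.
raw = "LINE\xa0\r\n CLE -165\n–\n–\n–\n–"

# \s matches \xa0, \r, and \n in Python 3's Unicode mode, so one
# substitution collapses every run of whitespace into a single space.
clean = re.sub(r"\s+", " ", raw).strip()
print(clean)  # LINE CLE -165 – – – –
```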
To get the LINE and O/U values, you can use the following example:
import requests
from bs4 import BeautifulSoup

url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for i in soup.select(".lineup__bottom"):
    visit = i.find_previous(class_="lineup__mteam is-visit").get_text(strip=True)
    home = i.find_previous(class_="lineup__mteam is-home").get_text(strip=True)
    line = i.select_one('.lineup__odds-item b:contains("LINE") ~ span').text
    ou = i.select_one('.lineup__odds-item b:contains("O/U") ~ span').text
    print("{:<25} {:<25} LINE: {:<10} O/U: {:<10}".format(visit, home, line, ou))
Prints:
White Sox(33-22) Indians(30-24) LINE: CLE -165 O/U: 7.0 Runs
Twins(22-32) Orioles(18-37) LINE: MIN -170 O/U: 9.0 Runs
Rays(35-21) Yankees(30-25) LINE: TB -135 O/U: 7.5 Runs
Marlins(24-29) Blue Jays(28-25) LINE: TOR -160 O/U: 8.5 Runs
Phillies(26-29) Reds(24-29) LINE: – O/U: 7.5 Runs
Nationals(22-29) Braves(25-27) LINE: ATL -140 O/U: 8.5 Runs
Tigers(23-32) Brewers(29-26) LINE: MIL -180 O/U: 7.5 Runs
Padres(34-22) Cubs(31-23) LINE: CHC -130 O/U: 8.5 Runs
Pirates(20-34) Royals(27-26) LINE: KC -170 O/U: 8.5 Runs
Red Sox(32-22) Astros(30-24) LINE: HOU -165 O/U: 9.0 Runs
Rangers(22-34) Rockies(21-34) LINE: – O/U: 9.5 Runs
Mets(26-21) Diamondbacks(20-36) LINE: NYM -120 O/U: 9.0 Runs
Angels(25-30) Giants(34-21) LINE: SF -160 O/U: 7.0 Runs
Athletics(32-25) Mariners(28-28) LINE: OAK -160 O/U: 7.5 Runs
Cardinals(31-24) Dodgers(32-23) LINE: LAD -200 O/U: 8.5 Runs
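Note: in newer versions of soupsieve (the CSS selector engine bundled with BeautifulSoup 4.7+), `:contains()` is deprecated in favor of `:-soup-contains()`. A minimal sketch of the same sibling-selector technique against a static snippet of the markup above, so it runs without a network request:

```python
from bs4 import BeautifulSoup

# Static snippet copied from the page structure shown in the question.
html = """
<div class="lineup__odds-item">
  <b>LINE</b>
  <span class="composite hide">CLE -165</span>
  <span class="fanduel">–</span>
</div>
<div class="lineup__odds-item">
  <b>O/U</b>
  <span class="composite hide">7.0 Runs</span>
  <span class="fanduel">–</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains() replaces the deprecated :contains() pseudo-class;
# "~ span" selects the first following sibling <span> of the matched <b>.
line = soup.select_one('.lineup__odds-item b:-soup-contains("LINE") ~ span').text
ou = soup.select_one('.lineup__odds-item b:-soup-contains("O/U") ~ span').text
print(line, ou)  # CLE -165 7.0 Runs
```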
As I understand it, you want to clean the scraped data. The general approach would look like this:
from lxml.html import fromstring, tostring

def clean_page(html, pretty_print=False):
    """
    >>> junk = "some random HTML<P> for you to try to parse</p>"
    >>> clean_page(junk)
    '<div><p>some random HTML</p><p> for you to try to parse</p></div>'
    >>> print(clean_page(junk, pretty_print=True))
    <div>
    <p>some random HTML</p>
    <p> for you to try to parse</p>
    </div>
    """
    # encoding="unicode" makes tostring return str rather than bytes
    return tostring(fromstring(html), pretty_print=pretty_print, encoding="unicode")
If you need it, here's a detailed article on the topic: https://schoolofdata.org/handbook/recipes/cleaning-data-scraped-from-the-web/
Let me know if this is what you were after!