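I want to make a web scraper for this website: https://www.ncaagamesim.com/college-basketball-predictions.asp
It has a table with the information I want. For each row, I want to take the odds number and then, depending on the team, subtract it from or add it to the average margin number in the prediction column. Then I want to store that number somewhere along with one of the team names.
This seems like a very simple web scraper, but I have no experience with this and was hoping for some advice. Many tutorials use Python and Beautiful Soup, so I figured I would use that, but I am not sure how to store the information in a spreadsheet. Thanks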
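So you're right, you'd use Beautiful Soup to pull out the data. Then just put it into a pandas DataFrame and write it out to a CSV file, which any spreadsheet program can open.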
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.ncaagamesim.com/college-basketball-predictions.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')
# Get column names
headers = table.find_all('th')
cols = [ x.text for x in headers ]
# Get all rows in table body
table_rows = table.find_all('tr')
rows = []
# Grab the text of each td, and put into a rows list
for each in table_rows[1:]:
    odd_avail = True
    data = each.find_all('td')
    time = data[0].text.strip()
    try:
        # The matchup and the odds text are separated by a non-breaking space
        matchup, odds = data[1].text.strip().split('\xa0')
        odd_margin = float(odds.split('by')[-1].strip())
    except ValueError:
        # No odds are listed for this game
        matchup = data[1].text.strip()
        odd_margin = '-'
        odd_avail = False
    odd_team_win = data[1].find_all('img')[-1]['title']
    sim_team_win = data[2].find('img')['title']
    # The simulation margin is the last decimal number in the prediction cell
    sim_margin = float(re.findall(r"\d+\.\d+", data[2].text)[-1])
    if odd_avail:
        if odd_team_win == sim_team_win:
            diff = sim_margin - odd_margin
        else:
            # The simulation picks the other team, so its margin counts against the odds favorite
            diff = -1*odd_margin - sim_margin
    else:
        diff = '-'
    row = {cols[0]:time, 'Matchup':matchup, 'Odds Winner':odd_team_win, 'Odds':odd_margin, 'Simulation Winner':sim_team_win, 'Simulation Margin':sim_margin, 'Diff':diff}
    rows.append(row)
df = pd.DataFrame(rows)
df.to_csv('odds.csv', index=False)
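The two fragile steps in the loop are splitting the matchup cell on the non-breaking space and pulling the margin out with a regex. Here is a minimal sketch for trying them in isolation, without hitting the site. The helper names `parse_odds_cell` and `parse_sim_margin` and the sample strings are made up for illustration, and the real cell text on the page may be laid out differently.

```python
import re


def parse_odds_cell(text):
    """Split 'matchup<nbsp>favorite by N' into (matchup, margin).

    Returns (full text, None) when the cell carries no odds.
    """
    parts = text.strip().split("\xa0")  # "\xa0" is a non-breaking space
    if len(parts) != 2:
        return text.strip(), None
    matchup, odds = parts
    try:
        return matchup, float(odds.split("by")[-1].strip())
    except ValueError:
        return text.strip(), None


def parse_sim_margin(text):
    """Return the last decimal number found in the prediction cell text."""
    return float(re.findall(r"\d+\.\d+", text)[-1])


# Hypothetical sample strings, not copied from the site
print(parse_odds_cell("Buffalo @ Western Michigan\xa0Buffalo by 9.5"))
print(parse_odds_cell("St. Katherine @ UNLV"))
print(parse_sim_margin("Buffalo by an average of 7.3"))
```

Both escapes matter: without the backslashes, `'xa0'` and `"d+.d+"` look for literal letters instead of the non-breaking space and the digits.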
Output of the scraper:
print(df.to_string())
       Time                                   Matchup       Odds Winner  Odds  Simulation Winner  Simulation Margin   Diff
 0     2 PM                Buffalo @ Western Michigan           Buffalo   9.5            Buffalo                7.3   -2.2
 1     3 PM                 Akron @ Northern Illinois             Akron     9              Akron                6.5   -2.5
 2  4:30 PM             Kent State @ Central Michigan        Kent State     6         Kent State                8.8    2.8
 3     5 PM                      St. Katherine @ UNLV              UNLV     -               UNLV               37.0      -
 4  5:30 PM  Alabama State @ Mississippi Valley State     Alabama State   6.5      Alabama State                5.9   -0.6
 5     7 PM              Wisconsin (5) @ Michigan (4)          Michigan   3.5          Wisconsin                1.2   -4.7
 6     7 PM       Eastern Illinois @ SIU Edwardsville  Eastern Illinois     6   Eastern Illinois                7.4    1.4
 7     7 PM                       Butler @ St. John's        St. John's     2         St. John's                7.5    5.5
 8     7 PM                 Saint Joseph's @ Davidson          Davidson  12.5           Davidson               14.8    2.3
 9     7 PM                        Ole Miss @ Florida           Florida   3.5            Florida                8.5      5
10     7 PM                Ball State @ Bowling Green     Bowling Green   7.5      Bowling Green                2.7   -4.8
11     7 PM                       Miami (Ohio) @ Ohio              Ohio   8.5               Ohio                8.0   -0.5
12     7 PM                 Eastern Michigan @ Toledo            Toledo    11             Toledo               10.6   -0.4
13     7 PM                    Miami @ Boston College             Miami     3     Boston College                4.9   -7.9
14     7 PM                      Duke @ Virginia Tech              Duke   1.5      Virginia Tech                8.2   -9.7
15  7:30 PM                            TCU @ Oklahoma          Oklahoma     8           Oklahoma                8.3    0.3
16     8 PM              Kansas (22) @ Oklahoma State            Kansas   3.5             Kansas                0.7   -2.8
17  8:30 PM            Alcorn State @ Grambling State   Grambling State   7.5       Alcorn State                2.8  -10.3
18     9 PM                            Syracuse @ UNC               UNC   3.5                UNC                2.6   -0.9
19     9 PM                    Providence @ Marquette         Marquette     3          Marquette                9.0      6
20     9 PM                        Alabama @ Kentucky          Kentucky     2            Alabama                4.0     -6
21     9 PM                   UC Riverside @ USC (12)               USC  14.5                USC               14.3   -0.2
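To read the Diff column: it is the simulation margin minus the odds margin, measured from the odds favorite's side, so a negative number means the simulation likes the favorite less than the odds do. Below is a small sketch of that rule as a standalone function, checked against a few rows from the table. The name `margin_diff` and the rounding to one decimal are additions for illustration, not part of the script.

```python
def margin_diff(odds_winner, odds_margin, sim_winner, sim_margin):
    """Simulation margin minus odds margin, from the odds favorite's side."""
    if odds_winner == sim_winner:
        # Both sources pick the same team: compare the two margins directly
        return round(sim_margin - odds_margin, 1)
    # The simulation picks the other team, so its margin counts
    # against the odds favorite
    return round(-sim_margin - odds_margin, 1)


print(margin_diff("Buffalo", 9.5, "Buffalo", 7.3))     # -2.2
print(margin_diff("Michigan", 3.5, "Wisconsin", 1.2))  # -4.7
print(margin_diff("St. John's", 2, "St. John's", 7.5)) # 5.5
```

Rounding avoids floating-point noise such as `-2.2000000000000002` when the value is stored or compared.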