How to make a simple web scraper and export the information to a spreadsheet



I want to make a web scraper for this website: https://www.ncaagamesim.com/college-basketball-predictions.asp

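It has a table with the information I want. For each row, I want to get the odds number, and then either subtract it from or add it to the average margin number in the predictions column, depending on the team. Then I want to store that number somewhere along with one of the team names.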

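This seems like a pretty simple web scraping program, but I have no experience with this and would appreciate some advice. Many tutorials use Python and Beautiful Soup, so I figured I would use that, but I'm not sure how to store the information in a spreadsheet. Thanks.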

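So you're right, you'd use Beautiful Soup to extract the data. Then just put it into a pandas DataFrame, and from there you can write it out to a file that a spreadsheet program can open (a CSV file here):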

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.ncaagamesim.com/college-basketball-predictions.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

# Get column names
headers = table.find_all('th')
cols = [x.text for x in headers]

# Get all rows in the table
table_rows = table.find_all('tr')
rows = []

# Grab the text of each td and put it into the rows list (skip the header row)
for each in table_rows[1:]:
    odd_avail = True
    data = each.find_all('td')
    time = data[0].text.strip()

    # The matchup and the odds are separated by a non-breaking space;
    # when no odds are listed, the split or the float conversion fails
    try:
        matchup, odds = data[1].text.strip().split('\xa0')
        odd_margin = float(odds.split('by')[-1].strip())
    except ValueError:
        matchup = data[1].text.strip()
        odd_margin = '-'
        odd_avail = False
    odd_team_win = data[1].find_all('img')[-1]['title']

    sim_team_win = data[2].find('img')['title']
    sim_margin = float(re.findall(r"\d+\.\d+", data[2].text)[-1])

    if odd_avail:
        if odd_team_win == sim_team_win:
            # Both pick the same winner: compare the two margins directly
            diff = sim_margin - odd_margin
        else:
            # Different winners: the simulation margin counts against the odds favorite
            diff = -1 * odd_margin - sim_margin
    else:
        diff = '-'

    row = {cols[0]: time, 'Matchup': matchup, 'Odds Winner': odd_team_win,
           'Odds': odd_margin, 'Simulation Winner': sim_team_win,
           'Simulation Margin': sim_margin, 'Diff': diff}
    rows.append(row)

df = pd.DataFrame(rows)
df.to_csv('odds.csv', index=False)
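
The final to_csv call is the step that produces the spreadsheet file. If pandas is not available, a similar file could be written with the csv module from the standard library instead. The following is only a sketch of that alternative, not part of the code above: the rows_to_csv helper is a hypothetical name, and the two sample rows are copied from the output of this answer rather than scraped.

```python
import csv
import io

# Sample rows shaped like the dictionaries the scraper collects
# (values copied from the first two rows of the answer's output).
rows = [
    {"Matchup": "Buffalo @ Western Michigan", "Odds": 9.5, "Simulation Margin": 7.3},
    {"Matchup": "Akron @ Northern Illinois", "Odds": 9.0, "Simulation Margin": 6.5},
]


def rows_to_csv(rows, fileobj):
    """Write a list of same-keyed dicts to fileobj as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained; for a real file,
# pass open("odds.csv", "w", newline="") instead.
buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Spreadsheet programs such as Excel or LibreOffice Calc open a CSV file directly, so no extra conversion step should be needed in either approach.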

Output:

print(df.to_string())
       Time                                     Matchup       Odds Winner  Odds Simulation Winner  Simulation Margin  Diff
0      2 PM                Buffalo  @ Western Michigan            Buffalo   9.5           Buffalo                7.3  -2.2
1      3 PM                 Akron  @ Northern Illinois              Akron     9             Akron                6.5  -2.5
2   4:30 PM             Kent State  @ Central Michigan         Kent State     6        Kent State                8.8   2.8
3      5 PM                       St. Katherine  @ UNLV              UNLV     -              UNLV               37.0     -
4   5:30 PM  Alabama State  @ Mississippi Valley State      Alabama State   6.5     Alabama State                5.9  -0.6
5      7 PM                Wisconsin (5) @ Michigan (4)          Michigan   3.5         Wisconsin                1.2  -4.7
6      7 PM       Eastern Illinois  @ SIU Edwardsville   Eastern Illinois     6  Eastern Illinois                7.4   1.4
7      7 PM                       Butler  @ St. John's         St. John's     2        St. John's                7.5   5.5
8      7 PM                 Saint Joseph's  @ Davidson           Davidson  12.5          Davidson               14.8   2.3
9      7 PM                        Ole Miss  @ Florida            Florida   3.5           Florida                8.5     5
10     7 PM                Ball State  @ Bowling Green      Bowling Green   7.5     Bowling Green                2.7  -4.8
11     7 PM                       Miami (Ohio)  @ Ohio               Ohio   8.5              Ohio                8.0  -0.5
12     7 PM                 Eastern Michigan  @ Toledo             Toledo    11            Toledo               10.6  -0.4
13     7 PM                    Miami  @ Boston College              Miami     3    Boston College                4.9  -7.9
14     7 PM                      Duke  @ Virginia Tech               Duke   1.5     Virginia Tech                8.2  -9.7
15  7:30 PM                            TCU  @ Oklahoma           Oklahoma     8          Oklahoma                8.3   0.3
16     8 PM               Kansas (22) @ Oklahoma State             Kansas   3.5            Kansas                0.7  -2.8
17  8:30 PM            Alcorn State  @ Grambling State    Grambling State   7.5      Alcorn State                2.8 -10.3
18     9 PM                            Syracuse  @ UNC                UNC   3.5               UNC                2.6  -0.9
19     9 PM                    Providence  @ Marquette          Marquette     3         Marquette                9.0     6
20     9 PM                        Alabama  @ Kentucky           Kentucky     2           Alabama                4.0    -6
21     9 PM                    UC Riverside  @ USC (12)               USC  14.5               USC               14.3  -0.2
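
The Diff column follows a simple sign rule, which may be easier to see in isolation. Below is a small sketch of it, checked against two rows from the table above (Buffalo, where the odds and the simulation agree, and Wisconsin @ Michigan, where they do not). The function name spread_diff is hypothetical, and the rounding to one decimal is an addition for display that the code above does not do.

```python
def spread_diff(odds_winner, odds_margin, sim_winner, sim_margin):
    """Compare the simulated margin with the odds margin, from the odds favorite's side.

    A positive result means the simulation expects the odds favorite to win
    by more than the odds suggest; a negative result means by less (or to lose).
    """
    if odds_winner == sim_winner:
        # Same predicted winner: subtract the odds margin from the simulated margin.
        diff = sim_margin - odds_margin
    else:
        # Different predicted winners: the simulated margin counts against the favorite.
        diff = -odds_margin - sim_margin
    # Rounding hides floating-point noise such as -2.2000000000000002.
    return round(diff, 1)


same_pick = spread_diff("Buffalo", 9.5, "Buffalo", 7.3)
split_pick = spread_diff("Michigan", 3.5, "Wisconsin", 1.2)
print(same_pick, split_pick)  # -2.2 -4.7
```

Both values match the Diff entries for those rows in the table.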
