Unable to scrape a fantasy table using Python



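I am trying to scrape fantasy player data from the following site: http://www.fplstatistics.co.uk/. The table appears when I open the site in a browser, but it is not present when I scrape the site.

I have tried the following: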

import requests as rq
from bs4 import BeautifulSoup
fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup

The table is missing. In its place is the following:

<div>
The 'Player Data' is out of date.
<br/> <br/>
You need to refresh the web page.
<br/> <br/>
Press F5 or hit <i class="fa fa-refresh"></i>
</div>

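This message is displayed on the site whenever the table is being updated.

I then looked in the developer tools to see whether I could find the URL that the table data is retrieved from, but had no luck, probably because I do not know how to read the developer tools well.

I then tried refreshing the page with Selenium, as the message above suggests: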

from selenium import webdriver
import time
chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup

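The output is the same as before. The table is displayed on the site but does not appear in the output.

Any help would be appreciated. I have looked at similar questions on Stack Overflow but have not been able to find a solution.

Why not go straight to the source the data is pulled from? The only thing you still need to work out is the column names, but this fetches all the data in a single request, without using Selenium: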

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# The 'iselRow' value that the AJAX endpoint expects is embedded in an inline script.
scripts = soup.find_all('script')
for script in scripts:
    if '"iselRow"' in script.text:
        # The closing parenthesis is escaped so the pattern matches the literal '});}'.
        iselRowVal = re.search(r'"value":(.+?)}\);}', script.text).group(1).strip()

# Request the table data from the endpoint the page itself calls.
url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'
payload = {
    'iselRow': iselRowVal,
    '_': ''}

jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])

Output:

print (df.head(5).to_string())
0               1        2  3  4    5    6      7  8    9      10     11     12  13  14              15                                                16
0            Mustafi  Arsenal  D  A  0.3  5.2  £5.2m  0  ---    110  -95.6  -95.6  -1  -1         Mustafi  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
1           Bellerín  Arsenal  D  I  0.3  5.4  £5.4m  0  ---  54024    2.6    2.6  -2  -2        Bellerin  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
2          Kolasinac  Arsenal  D  I  0.6  5.2  £5.2m  0  ---   5464  -13.9  -13.9  -2  -2       Kolasinac  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
3     Maitland-Niles  Arsenal  D  A  2.6  4.6  £4.6m  0  ---  11924  -39.0  -39.0  -2  -2  Maitland-Niles  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
4           Sokratis  Arsenal  D  S  1.5  4.9  £4.9m  0  ---  19709  -29.4  -29.4  -2  -2        Sokratis  Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) 
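The frame above carries only positional column labels. Here is a minimal sketch of attaching names to them, assuming the positions line up with the headers the site's table displays (Name, Club, Pos, Status, %Owned, Price, Chgs, Unlocks, Delta, Target). The mapping is inferred by eye from the sample output and is not confirmed by the site; `ASSUMED_COLUMNS` and `label_row` are hypothetical names used only for illustration:

```python
# Hypothetical mapping from column position to header text. It is inferred by
# eye from the sample output and the site's visible headers, not from any
# documentation, so treat every entry as an assumption.
ASSUMED_COLUMNS = {
    1: "Name", 2: "Club", 3: "Pos", 4: "Status", 5: "%Owned",
    7: "Price", 8: "Chgs", 9: "Unlocks", 10: "Delta", 11: "Target",
}


def label_row(row, names=ASSUMED_COLUMNS):
    """Return a dict for one positional row; unmapped positions keep their index as key."""
    return {names.get(index, index): value for index, value in enumerate(row)}


# The first row of the sample output, typed as strings for simplicity.
# Column 0 prints as blank there, so it is assumed to be an empty string.
sample_row = ["", "Mustafi", "Arsenal", "D", "A", "0.3", "5.2", "£5.2m", "0",
              "---", "110", "-95.6", "-95.6", "-1", "-1", "Mustafi",
              "Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H) "]
labelled = label_row(sample_row)
print(labelled["Name"], labelled["Club"], labelled["Price"])
```

If the mapping holds, the same dictionary could be passed to `df.rename(columns=ASSUMED_COLUMNS)` to label the DataFrame directly.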

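By asking for driver.page_source, you cancel out any benefit you get from Selenium: the page source does not contain the table you want. The table is updated dynamically by JavaScript after the page loads. You need to retrieve the table with methods on the driver rather than with BeautifulSoup. For example: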

>>> from selenium import webdriver
>>> from selenium.webdriver.common.by import By
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element(By.ID, 'myDataTable')
>>> print('\n'.join(x.text for x in table.find_elements(By.TAG_NAME, 'tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3
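Because each row's text comes back joined with spaces, a club such as Crystal Palace cannot be split back into cells reliably. One possible workaround is sketched below. It assumes the rendered markup is fetched from the element with `table.get_attribute('outerHTML')` and that the cells are plain `td`/`th` elements. The `TableRows` class, the `table_rows` helper, and the `sample_html` fragment are made up for illustration:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Parse an HTML table fragment into a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up fragment shaped like the rendered table; values come from the output above.
sample_html = (
    "<table id='myDataTable'>"
    "<tr><th>Name</th><th>Club</th><th>Pos</th></tr>"
    "<tr><td>Kelly</td><td>Crystal Palace</td><td>D</td></tr>"
    "</table>"
)
print(table_rows(sample_html))
```

With a live driver, the call would look like `table_rows(table.get_attribute('outerHTML'))`, which keeps multi-word cells intact.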
