我正在尝试从以下站点抓取幻想玩家数据:http://www.fplstatistics.co.uk/。该表在打开网站时出现,但在我抓取网站时不可见。
我尝试了以下方法:
import requests as rq
from bs4 import BeautifulSoup
fplStatsPage = rq.get('http://www.fplstatistics.co.uk')
fplStatsPageSoup = BeautifulSoup(fplStatsPage.text, 'html.parser')
fplStatsPageSoup
桌子不见了。代替表应该在哪里的是:
<div>
The 'Player Data' is out of date.
<br/> <br/>
You need to refresh the web page.
<br/> <br/>
Press F5 or hit <i class="fa fa-refresh"></i>
</div>
每当更新表时,此消息都会显示在站点上。
然后,我查看了开发人员工具,看看是否可以找到从中检索表数据的URL,但是我没有运气。可能是因为我不知道如何很好地阅读开发人员工具。
然后我尝试刷新页面,如上面的消息使用Selenium:
from selenium import webdriver
import time
chromeDriverPath = '/Users/SplitShiftKing/Downloads/Software/chromedriver'
driver = webdriver.Chrome(chromeDriverPath)
driver.get('http://www.fplstatistics.co.uk')
driver.refresh()
#To give site enough time to refresh
time.sleep(15)
html = driver.page_source
fplStatsPageSoup = BeautifulSoup(html, 'html.parser')
fplStatsPageSoup
输出与以前相同。该表显示在站点上,但不显示在输出中。
如能提供协助,将不胜感激。我已经查看了有关溢出的类似问题,但一直无法找到解决方案。
为什么不直接转到提取该数据的源。您唯一需要弄清楚的是列名,但这可以在一个请求中获取所有数据,而无需使用硒:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
s = requests.Session()
url = 'http://www.fplstatistics.co.uk/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Mobile Safari/537.36'}
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
if '"iselRow"' in script.text:
iselRowVal = re.search('"value":(.+?)});}', script.text).group(1).strip()
url = 'http://www.fplstatistics.co.uk/Home/AjaxPricesFHandler'
payload = {
'iselRow': iselRowVal,
'_': ''}
jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData['aaData'])
输出:
print (df.head(5).to_string())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 Mustafi Arsenal D A 0.3 5.2 £5.2m 0 --- 110 -95.6 -95.6 -1 -1 Mustafi Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
1 Bellerín Arsenal D I 0.3 5.4 £5.4m 0 --- 54024 2.6 2.6 -2 -2 Bellerin Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
2 Kolasinac Arsenal D I 0.6 5.2 £5.2m 0 --- 5464 -13.9 -13.9 -2 -2 Kolasinac Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
3 Maitland-Niles Arsenal D A 2.6 4.6 £4.6m 0 --- 11924 -39.0 -39.0 -2 -2 Maitland-Niles Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
4 Sokratis Arsenal D S 1.5 4.9 £4.9m 0 --- 19709 -29.4 -29.4 -2 -2 Sokratis Everton(A) Bournemouth(A) Chelsea(H) Man Utd(H)
通过请求driver.page_source
,您将取消从Selenium获得的任何好处:页面源代码不包含您想要的表。该表在页面加载后通过 Javascript 动态更新。您需要在driver
上使用方法检索表,而不是使用BeautifulSoup
。 例如:
>>> from selenium import webdriver
>>> d = webdriver.Chrome()
>>> d.get('http://www.fplstatistics.co.uk')
>>> table = d.find_element_by_id('myDataTable')
>>> print('n'.join(x.text for x in table.find_elements_by_tag_name('tr')))
Name
Club
Pos
Status
%Owned
Price
Chgs
Unlocks
Delta
Target
Kelly Crystal Palace D A 30.7 £4.3m 0 --- 0
101.0
Rico Bournemouth D A 14.6 £4.3m 0 --- 0
100.9
Baldock Sheffield Utd D A 7.1 £4.8m 0 --- 88 99.8
Rashford Man Utd F A 26.4 £9.0m 0 --- 794 98.6
Son Spurs M A 21.6 £10.0m 0 --- 833 98.5
Henderson Sheffield Utd G A 7.8 £4.7m 0 --- 860 98.4
Grealish Aston Villa M A 8.9 £6.1m 0 --- 1088 98.0
Kane Spurs F A 19.3 £10.9m 0 --- 3961 92.9
Reid West Ham D A 4.6 £3.9m 0 --- 4029 92.7
Richarlison Everton M A 7.7 £7.8m 0 --- 5405 90.3