Scrapy: Unable to locate a table or scrape the data in it



For a group project, I am trying to scrape the salary table at https://www.basketball-reference.com/players/a/allenra02.html.

I have tried multiple CSS and XPath selectors, such as

#all_salaries > tbody > tr:nth-child(1)
#all_salaries > tbody
#all_salaries > tbody > tr:nth-child(1) > td.right
#all_salaries
//*[@id="all_salaries"]/tbody/tr[1]/td[3]
//*[@id="all_salaries"]/tbody
//*[@id="all_salaries"]

The code is as follows:

def start_requests(self):
    start_urls = ['https://www.basketball-reference.com/players/a/allenra02.html']
    for url in start_urls:
        # The callback has to name a method that exists on the spider.
        yield scrapy.Request(url=url, callback=self.parse_player)

def parse_player(self, response):
    # Prints an empty list for this table.
    print(response.css('#all_salaries > tbody'))

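I tried printing the result, but it always comes back as an empty list. Every other table on the page works fine; only this one fails.

Edit:

My final solution looks like this: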

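import re
import scrapy

# Capture everything between the comment delimiters; DOTALL lets "." span newlines.
regex = re.compile(r'<!--(.*)-->', re.DOTALL)
salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').get()

if salaries:
    # Strip the comment markers, then parse the inner HTML as its own document.
    salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').re(regex)[0]
    salaries_sel = scrapy.Selector(text=salaries, type="html")
    all_salaries = salaries_sel.css('#all_salaries > tbody > tr').extract()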

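You can use BeautifulSoup to pull out the HTML comments and then parse the table with pandas. I chose to extract only the salary table, but you could get every table that sits inside a comment this way.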

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

url = "https://www.basketball-reference.com/players/a/allenra02.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect every HTML comment node on the page.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            # read_html raises ValueError when no table matches the attrs filter.
            tables.append(pd.read_html(StringIO(str(each)), attrs={'id': 'all_salaries'})[0])
            break
        except ValueError:
            continue

print(tables[0].to_string())

Output:

     Season                 Team   Lg        Salary
0   1996-97      Milwaukee Bucks  NBA    $1,785,000
1   1997-98      Milwaukee Bucks  NBA    $2,052,360
2   1998-99      Milwaukee Bucks  NBA    $2,320,000
3   1999-00      Milwaukee Bucks  NBA    $9,000,000
4   2000-01      Milwaukee Bucks  NBA   $10,130,000
5   2001-02      Milwaukee Bucks  NBA   $11,250,000
6   2002-03      Milwaukee Bucks  NBA   $12,375,000
7   2003-04  Seattle SuperSonics  NBA   $13,500,000
8   2004-05  Seattle SuperSonics  NBA   $14,625,000
9   2005-06  Seattle SuperSonics  NBA   $13,223,140
10  2006-07  Seattle SuperSonics  NBA   $14,611,570
11  2007-08       Boston Celtics  NBA   $16,000,000
12  2008-09       Boston Celtics  NBA   $18,388,430
13  2009-10       Boston Celtics  NBA   $18,776,860
14  2010-11       Boston Celtics  NBA   $10,000,000
15  2011-12       Boston Celtics  NBA   $10,000,000
16  2012-13           Miami Heat  NBA    $3,090,000
17  2013-14           Miami Heat  NBA    $3,229,050
18   Career  (may be incomplete)  NaN  $184,356,410
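
The answer above notes that every table hidden in a comment can be collected the same way. A minimal sketch of that idea follows. It swaps BeautifulSoup and pandas for the standard library's html.parser so that it runs without third-party packages, and it works on a small inline string rather than the live page. SAMPLE_HTML, the "other_table" id, and the class and function names are all made up for illustration; the real page's markup is more involved.

```python
from html.parser import HTMLParser

# Hypothetical, cut-down page source: two tables, each inside an HTML comment.
SAMPLE_HTML = """
<div id="all_all_salaries"><!--
<table id="all_salaries"><tbody>
<tr><th>1996-97</th><td>Milwaukee Bucks</td><td>NBA</td><td>$1,785,000</td></tr>
<tr><th>1997-98</th><td>Milwaukee Bucks</td><td>NBA</td><td>$2,052,360</td></tr>
</tbody></table>
--></div>
<div id="all_other_table"><!--
<table id="other_table"><tbody>
<tr><th>key</th><td>value</td></tr>
</tbody></table>
--></div>
"""


class CommentCollector(HTMLParser):
    """Stores the body of every HTML comment it encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TableCollector(HTMLParser):
    """Collects rows of cell text for every <table>, keyed by the table's id."""

    def __init__(self):
        super().__init__()
        self.tables = {}
        self._rows = None  # rows of the table currently being read
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = self.tables.setdefault(dict(attrs).get("id"), [])
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._rows = None


def tables_in_comments(html):
    """Return {table id: list of rows} for tables found inside HTML comments."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()

    tables = TableCollector()
    for comment in collector.comments:
        if "<table" in comment:
            tables.feed(comment)
    tables.close()
    return tables.tables


tables = tables_in_comments(SAMPLE_HTML)
for row in tables["all_salaries"]:
    print(row)
```

Keying the result by table id makes it easy to pick out a single table afterwards, which mirrors the attrs filter passed to read_html in the answer.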

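This happens because the table is actually commented out in the raw page source and only added to the page later by JavaScript. Scrapy does not run JavaScript, so the selectors never see the table. For how to get at the comment's content, see: Scrapy: Extract commented (hidden) content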
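
To make the cause concrete, here is a small sketch that uses only Python's standard library. It is an illustration, not the page's real markup: SAMPLE_HTML, the "visible_table" id, and the TableIdScanner name are invented. It shows that a parser looking at the raw source finds the ordinary table as an element, while the salary table only turns up once the comment body is parsed on its own.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the raw page source.
SAMPLE_HTML = """
<table id="visible_table"><tr><td>visible</td></tr></table>
<div id="all_all_salaries">
<!--
<table id="all_salaries"><tbody><tr><td>$1,785,000</td></tr></tbody></table>
-->
</div>
"""


class TableIdScanner(HTMLParser):
    """Records the ids of real <table> elements and the bodies of comments."""

    def __init__(self):
        super().__init__()
        self.table_ids = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))

    def handle_comment(self, data):
        # Markup inside a comment is delivered as plain text, not as elements.
        self.comments.append(data)


def scan(html):
    scanner = TableIdScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


live = scan(SAMPLE_HTML)
print(live.table_ids)    # ['visible_table']

hidden = scan(live.comments[0])
print(hidden.table_ids)  # ['all_salaries']
```

This is the same two-step pattern the accepted edit uses in Scrapy: select the comment node first, then parse its text as a separate HTML document.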
