For a group project, I am trying to scrape the salary table on https://www.basketball-reference.com/players/a/allenra02.html.
I have tried several CSS and XPath selectors, such as:
#all_salaries > tbody > tr:nth-child(1)
#all_salaries > tbody
#all_salaries > tbody > tr:nth-child(1) > td.right
#all_salaries
//*[@id="all_salaries"]/tbody/tr[1]/td[3]
//*[@id="all_salaries"]/tbody
//*[@id="all_salaries"]
My code is as follows:
def start_requests(self):
    start_urls = ['https://www.basketball-reference.com/players/a/allenra02.html']
    for url in start_urls:
        # The callback must name the method defined below
        yield scrapy.Request(url=url, callback=self.parse_player)

def parse_player(self, response):
    # This prints an empty list for the salary table
    print(response.css('#all_salaries > tbody'))
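I tried printing the result, but it keeps returning an empty list. Every other table works fine except this one.
Edit:
My final solution looks like this: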
regex = re.compile(r'<!--(.*)-->', re.DOTALL)
salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').get()
if salaries:
    salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').re(regex)[0]
    salaries_sel = scrapy.Selector(text=salaries, type="html")
    all_salaries = salaries_sel.css('#all_salaries > tbody > tr').extract()
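You can extract the comments with BeautifulSoup and then parse the table with pandas. I chose to pull out only the salary table, but you could get every table hidden in the comments this way.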
import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = "https://www.basketball-reference.com/players/a/allenra02.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect every HTML comment node on the page
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            # Parse only the table whose id is "all_salaries"
            tables.append(pd.read_html(StringIO(str(each)), attrs={'id': 'all_salaries'})[0])
            break
        except ValueError:
            # read_html raises ValueError when no matching table is found
            continue

print(tables[0].to_string())
Output:
     Season                 Team   Lg        Salary
 0  1996-97      Milwaukee Bucks  NBA    $1,785,000
 1  1997-98      Milwaukee Bucks  NBA    $2,052,360
 2  1998-99      Milwaukee Bucks  NBA    $2,320,000
 3  1999-00      Milwaukee Bucks  NBA    $9,000,000
 4  2000-01      Milwaukee Bucks  NBA   $10,130,000
 5  2001-02      Milwaukee Bucks  NBA   $11,250,000
 6  2002-03      Milwaukee Bucks  NBA   $12,375,000
 7  2003-04  Seattle SuperSonics  NBA   $13,500,000
 8  2004-05  Seattle SuperSonics  NBA   $14,625,000
 9  2005-06  Seattle SuperSonics  NBA   $13,223,140
10  2006-07  Seattle SuperSonics  NBA   $14,611,570
11  2007-08       Boston Celtics  NBA   $16,000,000
12  2008-09       Boston Celtics  NBA   $18,388,430
13  2009-10       Boston Celtics  NBA   $18,776,860
14  2010-11       Boston Celtics  NBA   $10,000,000
15  2011-12       Boston Celtics  NBA   $10,000,000
16  2012-13           Miami Heat  NBA    $3,090,000
17  2013-14           Miami Heat  NBA    $3,229,050
18   Career  (may be incomplete)  NaN  $184,356,410
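This happens because the table is actually commented out in the raw page source and is only added to the page later by JavaScript. Scrapy does not run JavaScript, so your selectors never see it. For how to get at the commented content, see: Scrapy: Extract commented (hidden) content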
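To see the effect without hitting the site, here is a minimal offline sketch. It uses only the standard library's html.parser in place of Scrapy or BeautifulSoup, and the PAGE string is a made-up, heavily trimmed stand-in for the real markup (the two rows are copied from the output above). TableScanner and scan are hypothetical helper names, not part of any library. The first pass finds no table elements at all, because the parser treats the whole table as one comment; feeding that comment's text back into a second parser exposes the rows.

```python
from html.parser import HTMLParser

# Assumed toy markup: the wrapper div is real HTML, but the table
# itself sits inside an HTML comment, as on the real page.
PAGE = """
<div id="all_all_salaries">
<!--
<table id="all_salaries">
  <tbody>
    <tr><th>1996-97</th><td>Milwaukee Bucks</td><td>NBA</td><td>$1,785,000</td></tr>
    <tr><th>1997-98</th><td>Milwaukee Bucks</td><td>NBA</td><td>$2,052,360</td></tr>
  </tbody>
</table>
-->
</div>
"""


class TableScanner(HTMLParser):
    """Hypothetical helper: records table ids seen as real elements,
    the raw text of every comment, and the cell text of each row."""

    def __init__(self):
        super().__init__()
        self.table_ids = []
        self.comments = []
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))
        elif tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())

    def handle_comment(self, data):
        self.comments.append(data)


def scan(html):
    """Parse an HTML string and return the populated scanner."""
    scanner = TableScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


# First pass: the table is invisible because it is inside a comment.
first_pass = scan(PAGE)
print(first_pass.table_ids)

# Second pass: parse the comment text itself as HTML.
hidden = next(c for c in first_pass.comments if "<table" in c)
second_pass = scan(hidden)
print(second_pass.table_ids)
print(second_pass.rows)
```

The same two-step idea is what both solutions above do: locate the comment node, then hand its text to a fresh parser (scrapy.Selector or pd.read_html).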