Scrapy shell view(response) not showing all HTML



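I'm just getting into Scrapy, so forgive the basic question, but why doesn't view(response) in the Scrapy shell show all of the HTML from the file it scraped?

I set up a spider to scrape a page (Barry Bonds' page on Baseball Reference), using the same code as in the tutorial and changing only the spider's name and the name of the file it saves to.

After scraping the page, I opened the saved HTML in Safari (on a Mac), and the whole page showed up.

Then, back in the terminal, I ran the following commands: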
scrapy shell fileLocationOnComputer
view(response)

It opens Safari, but most of the page is missing.

Here are two screenshots that illustrate my problem.

Thanks for any help you can provide!

Scrapy shell view(response) not showing all HTML 
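  • As far as we know, Scrapy does not render JavaScript.

  • Because Scrapy does not render JavaScript, view(response) in the Scrapy shell cannot show any part of the HTML that JavaScript loads dynamically. For that reason alone, it does not show all of the HTML.

  • Safari, Chrome, or any other browser displays the full HTML DOM, dynamic and static alike. To see the difference between dynamic and static HTML in any browser, disable JavaScript and reload the URL in question: the dynamic HTML will no longer appear. That is why view(response) in Scrapy does not show all of the HTML.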
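Before settling on an explanation, it can help to check where a given table actually sits in the downloaded HTML. The following is a minimal sketch using only the Python standard library. The sample HTML, the class name TableLocator, and the function locate_table are all made up for illustration and are not part of Scrapy or of the real page. It reports whether a table id appears as ordinary markup, only inside an HTML comment, or not at all (which would be consistent with content that a script builds later).

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page: one ordinary table and one
# table that exists only inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<table id="batting_standard"><tr><td>1986</td></tr></table>
<div id="all_batting_postseason">
<!--
<table id="batting_postseason"><tr><td>1990</td></tr></table>
-->
</div>
</body></html>
"""


class TableLocator(HTMLParser):
    """Collect ids of real <table> tags and the raw text of every comment."""

    def __init__(self):
        super().__init__()
        self.live_ids = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Called only for real markup, never for text inside a comment.
        if tag == "table":
            self.live_ids.append(dict(attrs).get("id"))

    def handle_comment(self, data):
        # data is the text between the comment's opening and closing markers.
        self.comments.append(data)


def locate_table(html, table_id):
    """Return 'markup', 'comment', or 'absent' for the given table id."""
    parser = TableLocator()
    parser.feed(html)
    parser.close()
    if table_id in parser.live_ids:
        return "markup"
    if any(f'id="{table_id}"' in text for text in parser.comments):
        return "comment"
    return "absent"


for name in ("batting_standard", "batting_postseason", "no_such_table"):
    print(name, "->", locate_table(SAMPLE_HTML, name))
```

In the Scrapy shell, the same check could be run on response.text instead of the sample string.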

Pull the static table with pandas:

import pandas as pd
df = pd.read_html('https://www.baseball-reference.com/players/b/bondsba01.shtml')[0]
print(df)

Output:

Year            Age             Tm  ...  IBB     Pos          Awards
0            1985             20        PIT-min  ...    0     NaN      PRW · CARL
1            1986             21        PIT-min  ...    0     NaN       HAW · PCL
2            1986             21            PIT  ...    2    *8/H           RoY-6
3            1987             22            PIT  ...    3  *78H/9             NaN
4            1988             23            PIT  ...   14   *7H/8             NaN
5            1989             24            PIT  ...   22    *7/H             NaN
6            1990             25            PIT  ...   15   *7/H8  AS,MVP-1,GG,SS
7            1991             26            PIT  ...   25   *7/H8     MVP-2,GG,SS
8            1992             27            PIT  ...   32    *7/H  AS,MVP-1,GG,SS
9            1993             28            SFG  ...   43    *7/H  AS,MVP-1,GG,SS
10           1994             29            SFG  ...   18    *7/H  AS,MVP-4,GG,SS
11           1995             30            SFG  ...   22    *7/H       AS,MVP-12
12           1996             31            SFG  ...   30   *7/H8  AS,MVP-5,GG,SS
13           1997             32            SFG  ...   34      *7  AS,MVP-5,GG,SS
14           1998             33            SFG  ...   29    *7/H     AS,MVP-8,GG
15           1999             34            SFG  ...    9    7/DH          MVP-24
16           2000             35            SFG  ...   22    *7/H     AS,MVP-2,SS
17           2001             36            SFG  ...   35   *7/DH     AS,MVP-1,SS
18           2002             37            SFG  ...   68   *7/DH     AS,MVP-1,SS
19           2003             38            SFG  ...   61   *7/DH     AS,MVP-1,SS
20           2004             39            SFG  ...  120   *7/HD     AS,MVP-1,SS
21           2005             40            SFG  ...    3     7/H             NaN
22           2006             41            SFG  ...   38   *7H/D             NaN
23           2007             42            SFG  ...   43   *7H/D              AS
24         22 Yrs         22 Yrs         22 Yrs  ...  688     NaN             NaN
25  162 Game Avg.  162 Game Avg.  162 Game Avg.  ...   37     NaN             NaN
26            NaN            NaN            NaN  ...  IBB     Pos          Awards
27   SFG (15 yrs)   SFG (15 yrs)   SFG (15 yrs)  ...  575     NaN             NaN
28    PIT (7 yrs)    PIT (7 yrs)    PIT (7 yrs)  ...  113     NaN             NaN
[29 rows x 30 columns]


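These tables aren't really dynamic. They are actually right there in the HTML, inside comments.

There are two ways to handle this:

  1. Use BeautifulSoup to pull out the Comments, then parse them
  2. Simply remove the comment tags

The code below takes the second approach and gets all of the tables. Then just pull the one you want by a specific attribute or by its index position in df_list.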

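from io import StringIO
import pandas as pd
import requests
response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment markers so the commented-out tables become ordinary markup
html = response.text.replace("<!--","").replace("-->","")
# Wrap in StringIO: recent pandas versions deprecate passing a literal HTML string
df_list = pd.read_html(StringIO(html))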
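The first approach in the list above (pulling out the Comments with BeautifulSoup) is not shown in this answer, so here is a rough sketch of how it might look. It assumes the bs4 package is installed and runs on a small made-up HTML sample rather than the real page. The function name commented_table_html is hypothetical.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page source.
SAMPLE_HTML = """
<html><body>
<table id="batting_standard"><tr><td>1986</td></tr></table>
<div id="all_batting_postseason">
<!--
<table id="batting_postseason"><tr><td>1990</td></tr></table>
-->
</div>
</body></html>
"""


def commented_table_html(html):
    """Map table id -> table HTML for every table found inside a comment."""
    soup = BeautifulSoup(html, "html.parser")
    fragments = {}
    # Comment is a NavigableString subclass, so filter text nodes by type.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        # The comment body is plain text to the outer parser; parse it again.
        inner = BeautifulSoup(str(comment), "html.parser")
        for table in inner.find_all("table"):
            fragments[table.get("id")] = str(table)
    return fragments


tables = commented_table_html(SAMPLE_HTML)
print(sorted(tables))
```

Each returned fragment is an ordinary table string, so it could then be handed to pd.read_html (wrapped in StringIO) in the same way as the html variable in the code above. Compared with removing every comment marker, this touches only the comments that actually contain tables.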

To pick out one specific table:

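from io import StringIO
import pandas as pd
import requests
response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment markers so the commented-out tables become ordinary markup
html = response.text.replace("<!--","").replace("-->","")
# Select the postseason batting table by its id attribute
df = pd.read_html(StringIO(html), attrs={'id':'batting_postseason'})[0]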

Output:

print(df)
Year               Age                Tm  ...   IBB   WPA    cWPA
0               1990                25               PIT  ...   0.0 -0.13   -0.2%
1               1991                26               PIT  ...   0.0 -0.68  -14.7%
2               1992                27               PIT  ...   1.0  0.08    1.0%
3                NaN               NaN               NaN  ...   NaN   NaN     NaN
4               1997                32               SFG  ...   0.0  0.31    3.3%
5                NaN               NaN               NaN  ...   NaN   NaN     NaN
6               2000                35               SFG  ...   1.0 -0.10   -1.6%
7                NaN               NaN               NaN  ...   NaN   NaN     NaN
8               2002                37               SFG  ...   3.0  0.05    2.6%
9               2002                37               SFG  ...   3.0  0.59    9.0%
10              2002                37               SFG  ...   7.0  0.56   22.9%
11              2003                38               SFG  ...   6.0  0.50    5.7%
12  7 Yrs (9 Series)  7 Yrs (9 Series)  7 Yrs (9 Series)  ...  21.0  1.18   27.9%
13            4 NLDS            4 NLDS            4 NLDS  ...  10.0  0.76    9.9%
14            4 NLCS            4 NLCS            4 NLCS  ...   4.0 -0.14   -5.0%
15              1 WS              1 WS              1 WS  ...   7.0  0.56   22.9%
[16 rows x 32 columns]
