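I'm just getting started with Scrapy, so forgive the basic question, but why doesn't view(response) in the Scrapy shell show all of the HTML from the file it scraped?
I set up a spider to scrape a page (Barry Bonds's page on Baseball Reference), using the same code as in the tutorial and changing only the spider's name and the name of the file it saves to.
After scraping the page, I opened the saved HTML in Safari (on a Mac), and the whole page showed up.
Then, back in the terminal, I ran the following commands:
scrapy shell fileLocationOnComputer
view(response)
This opens Safari, but most of the page is missing.
Here are two screenshots that illustrate my problem.
Thanks for any help you can offer!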
Scrapy shell view(response) not showing all HTML
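As far as we know, Scrapy cannot render JavaScript.
Because Scrapy cannot render JavaScript, view(response) in the Scrapy shell cannot see the parts of the HTML that are loaded dynamically by JavaScript, and that is the only reason it does not show all of the HTML. Safari, Chrome, or any other browser displays the full HTML DOM, whether it is dynamic or static. You can see the difference between dynamic and static HTML in any browser: disable JavaScript and refresh the corresponding URL, and the dynamic HTML will no longer appear. That is why Scrapy's view(response) does not show all of the HTML.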
Pulling the static table with pandas:
import pandas as pd
# read_html returns a list of DataFrames; [0] is the first table on the page
df = pd.read_html('https://www.baseball-reference.com/players/b/bondsba01.shtml')[0]
print(df)
Output:
Year Age Tm ... IBB Pos Awards
0 1985 20 PIT-min ... 0 NaN PRW · CARL
1 1986 21 PIT-min ... 0 NaN HAW · PCL
2 1986 21 PIT ... 2 *8/H RoY-6
3 1987 22 PIT ... 3 *78H/9 NaN
4 1988 23 PIT ... 14 *7H/8 NaN
5 1989 24 PIT ... 22 *7/H NaN
6 1990 25 PIT ... 15 *7/H8 AS,MVP-1,GG,SS
7 1991 26 PIT ... 25 *7/H8 MVP-2,GG,SS
8 1992 27 PIT ... 32 *7/H AS,MVP-1,GG,SS
9 1993 28 SFG ... 43 *7/H AS,MVP-1,GG,SS
10 1994 29 SFG ... 18 *7/H AS,MVP-4,GG,SS
11 1995 30 SFG ... 22 *7/H AS,MVP-12
12 1996 31 SFG ... 30 *7/H8 AS,MVP-5,GG,SS
13 1997 32 SFG ... 34 *7 AS,MVP-5,GG,SS
14 1998 33 SFG ... 29 *7/H AS,MVP-8,GG
15 1999 34 SFG ... 9 7/DH MVP-24
16 2000 35 SFG ... 22 *7/H AS,MVP-2,SS
17 2001 36 SFG ... 35 *7/DH AS,MVP-1,SS
18 2002 37 SFG ... 68 *7/DH AS,MVP-1,SS
19 2003 38 SFG ... 61 *7/DH AS,MVP-1,SS
20 2004 39 SFG ... 120 *7/HD AS,MVP-1,SS
21 2005 40 SFG ... 3 7/H NaN
22 2006 41 SFG ... 38 *7H/D NaN
23 2007 42 SFG ... 43 *7H/D AS
24 22 Yrs 22 Yrs 22 Yrs ... 688 NaN NaN
25 162 Game Avg. 162 Game Avg. 162 Game Avg. ... 37 NaN NaN
26 NaN NaN NaN ... IBB Pos Awards
27 SFG (15 yrs) SFG (15 yrs) SFG (15 yrs) ... 575 NaN NaN
28 PIT (7 yrs) PIT (7 yrs) PIT (7 yrs) ... 113 NaN NaN
[29 rows x 30 columns]
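These tables are not actually dynamic. They are simply sitting inside HTML comments.
There are two ways to handle this:
- Use BeautifulSoup to pull out the Comments, then parse them
- Simply remove the comment tags
Removing the comment tags gets you all of the tables. Then just pull out the one you want, either by a specific attribute or by its index position in df_list.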
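from io import StringIO
import pandas as pd
import requests
response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment markers so the tables hidden in comments become regular HTML
html = response.text.replace("<!--","").replace("-->","")
# Wrap the string in StringIO: newer pandas versions deprecate passing literal HTML
df_list = pd.read_html(StringIO(html))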
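To see why stripping the comment markers makes a difference, here is a small sketch that uses only the standard library. The SAMPLE string is a made-up stand-in for the real page, and TableCounter and count_tables are hypothetical helper names, not part of pandas or Scrapy. An HTML parser skips anything inside a comment, so a commented-out table is only counted once the markers are removed.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags; text inside HTML comments is never parsed as tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1


def count_tables(html):
    """Return how many <table> elements a parser can see in the given markup."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Made-up miniature page: one visible table, one hidden inside a comment
SAMPLE = "<table id='a'></table><!-- <table id='b'></table> -->"

before = count_tables(SAMPLE)
after = count_tables(SAMPLE.replace("<!--", "").replace("-->", ""))
print(before, after)
```

The same effect is what lets pd.read_html find the extra tables once the markers are gone.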
To pick a specific table:
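from io import StringIO
import pandas as pd
import requests
response = requests.get('https://www.baseball-reference.com/players/b/bondsba01.shtml')
# Strip the comment markers so the tables hidden in comments become regular HTML
html = response.text.replace("<!--","").replace("-->","")
# attrs narrows the match to the table whose id is 'batting_postseason'
df = pd.read_html(StringIO(html), attrs={'id':'batting_postseason'})[0]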
Output:
print(df)
Year Age Tm ... IBB WPA cWPA
0 1990 25 PIT ... 0.0 -0.13 -0.2%
1 1991 26 PIT ... 0.0 -0.68 -14.7%
2 1992 27 PIT ... 1.0 0.08 1.0%
3 NaN NaN NaN ... NaN NaN NaN
4 1997 32 SFG ... 0.0 0.31 3.3%
5 NaN NaN NaN ... NaN NaN NaN
6 2000 35 SFG ... 1.0 -0.10 -1.6%
7 NaN NaN NaN ... NaN NaN NaN
8 2002 37 SFG ... 3.0 0.05 2.6%
9 2002 37 SFG ... 3.0 0.59 9.0%
10 2002 37 SFG ... 7.0 0.56 22.9%
11 2003 38 SFG ... 6.0 0.50 5.7%
12 7 Yrs (9 Series) 7 Yrs (9 Series) 7 Yrs (9 Series) ... 21.0 1.18 27.9%
13 4 NLDS 4 NLDS 4 NLDS ... 10.0 0.76 9.9%
14 4 NLCS 4 NLCS 4 NLCS ... 4.0 -0.14 -5.0%
15 1 WS 1 WS 1 WS ... 7.0 0.56 22.9%
[16 rows x 32 columns]
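For the first approach listed above (pulling the Comments out with BeautifulSoup), here is a minimal sketch. It runs against a made-up SAMPLE_HTML string rather than the live page, and tables_in_comments is a hypothetical helper name. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup, Comment

# Made-up miniature page: one visible table and one hidden inside a comment
SAMPLE_HTML = """
<html><body>
<table id="batting_standard"><tr><th>Year</th></tr><tr><td>1986</td></tr></table>
<!--
<table id="batting_postseason"><tr><th>Year</th></tr><tr><td>1990</td></tr></table>
-->
</body></html>
"""


def tables_in_comments(html):
    """Return the id of every <table> that sits inside an HTML comment."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a string subclass, so filter the text nodes by type
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    found = []
    for comment in comments:
        # Parse the comment body as its own small document
        inner = BeautifulSoup(comment, "html.parser")
        for table in inner.find_all("table"):
            found.append(table.get("id"))
    return found


hidden_ids = tables_in_comments(SAMPLE_HTML)
print(hidden_ids)
```

On the real page, each comment string that contains a table could then be wrapped in io.StringIO and handed to pd.read_html, which keeps the rest of the markup untouched instead of stripping every comment marker in the document.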