Python requests is not extracting all elements

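I am trying to extract the TR data from the following page: http://www.datasheetcatalog.com/catalog/p1342320.shtml

I am using requests and BeautifulSoup. However, I am not getting all of the rows (only 12 rows in the second table instead of 22). Does anyone have an explanation for this, given that the rows are there when I print response.content?

Here is the code I am using: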

from bs4 import BeautifulSoup
import requests

session = requests.Session()
url = 'http://www.datasheetcatalog.com/catalog/p1342320.shtml'
response = session.get(url)
soup = BeautifulSoup(response.content, "lxml")
trs = soup.find_all('table')[8].find_all('tr')
print(len(trs))

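After inspecting the HTML of the page in detail, I found that BeautifulSoup stops after hitting the HTML comments. The solution was therefore to change the parser from "lxml" to "html5lib" (the html5lib package must be installed):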

soup = BeautifulSoup(response.content,"html5lib")
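Since the row count depends on how each parser recovers from broken markup, it can help to compare the parsers side by side. The following is only a minimal sketch: SAMPLE_HTML and count_rows_per_parser are hypothetical names invented for illustration, and the sample markup merely imitates a malformed comment rather than reproducing the real page. Parsers that are not installed are skipped, and the counts may differ between parsers and library versions.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample (not the real page): table rows separated by
# malformed comments that end in "--!>" instead of "-->".
SAMPLE_HTML = """
<table>
<tr><td>row 1</td></tr>
<!-- 1 --!>
<tr><td>row 2</td></tr>
<!-- 2 --!>
<tr><td>row 3</td></tr>
</table>
"""


def count_rows_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <tr> elements found} for each usable parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("tr"))
    return counts


if __name__ == "__main__":
    for name, n in count_rows_per_parser(SAMPLE_HTML).items():
        print(f"{name}: {n} rows")
```

Running the same comparison on response.content from the real page should show which parsers lose rows there.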

The HTML is invalid, which breaks BeautifulSoup here. To fix it, clean up the markup before parsing:

....
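import re  # needed for the substitutions below

# Repair the broken "<table <" opening
html_doc = response.text.replace('<table <', '<')
# Remove malformed numbered comments that end in "--!>"
html_doc = re.sub(r'<!--\s+\d+\s+--!>', '', html_doc)
# Strip opening and closing <font> tags
html_doc = re.sub(r'</?font.*?>', '', html_doc)
soup = BeautifulSoup(html_doc, "html.parser")
trs = soup.find_all('table')[8].find_all('tr')
print(len(trs))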

Note: using lxml returns 7 instead of 22.
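To see what the three cleanup steps do in isolation, here is a small sketch that applies them to an invented fragment using only the standard library. BROKEN and clean_html are hypothetical names for illustration, and the fragment only imitates the kinds of defects described above; it is not taken from the real page.

```python
import re

# Hypothetical fragment (not from the real page) with a doubled "<table <",
# a <font> wrapper, and a malformed comment ending in "--!>".
BROKEN = (
    '<table <table border="1">'
    '<tr><td><font color="red">A</font></td></tr>'
    '<!-- 12 --!>'
    '<tr><td>B</td></tr></table>'
)


def clean_html(html):
    """Apply the same three substitutions as the answer, in the same order."""
    html = html.replace('<table <', '<')            # repair the broken table opening
    html = re.sub(r'<!--\s+\d+\s+--!>', '', html)   # drop malformed numbered comments
    html = re.sub(r'</?font.*?>', '', html)         # strip <font ...> and </font>
    return html


if __name__ == "__main__":
    print(clean_html(BROKEN))
    # prints: <table border="1"><tr><td>A</td></tr><tr><td>B</td></tr></table>
```

The cleaned string can then be passed to BeautifulSoup as in the answer.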
