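I'm using BeautifulSoup 4 with Python 2.7. I would like to extract certain elements from a website (the quantities, see the example below). For some reason, the lxml parser doesn't let me extract all of the elements I want from the page: it only prints the first three. I'm trying the html5lib parser to see whether it can extract all of them.
The page contains multiple items, each with a price and a quantity. The part of the markup that holds the information for each item looks like this: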
<td class="size-price last first" colspan="4">
<span>453 grams </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>
Let's consider the following three cases:
CASE 1 - DATA:
#! /usr/bin/python
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
<span>453 grams </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>"""
soup = BeautifulSoup(data)
print soup.td.span.text
Prints:
453 grams
CASE 2 - LXML:
#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "lxml")
print soup.find('td', {'class': 'size-price'}).span.text
Prints:
453 grams
CASE 3 - HTML5LIB:
#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "html5lib")
print soup.find('td', {'class': 'size-price'}).span.text
I get the following error:
Traceback (most recent call last):
File "C:\Users\Dom\Python-Code\src\Testing-Code.py", line 6, in <module>
print soup.find('td', {'class': 'size-price'}).span.text
AttributeError: 'NoneType' object has no attribute 'span'
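How can I adapt my code to extract the information I want with the html5lib parser? If I simply print soup in the console after parsing with html5lib, I can see all the information I want, so I assumed it would let me get at it. That is not the case with the lxml parser, so I'm also curious why lxml doesn't seem to extract all of the quantities when I use: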
print [td.span.text for td in soup.find_all('td', {'class': 'size-price'})]
from lxml import etree
html = 'your html'
tree = etree.HTML(html)
# match cells whose class attribute is exactly "size-price last first"
tds = tree.xpath('.//td[@class="size-price last first"]')
for td in tds:
    price = td.xpath('.//span[@class="price"]')[0].text
    strike = td.xpath('.//span[@class="strike"]')[0].text
    spans = td.xpath('.//span')
    # i.text is None for a span with no leading text, so guard before testing it
    quantity = [i.text for i in spans if i.text and 'grams' in i.text][0].strip(' ')
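For comparison, roughly the same extraction can be sketched with BeautifulSoup itself. This is only a sketch under a few assumptions: the sample cell is wrapped in a `<table>` so that stricter parsers keep the `<td>`, the built-in "html.parser" is used so nothing beyond bs4 is required, and `extract_rows` is a hypothetical helper name, not something from the question. It skips cells where an expected `<span>` is missing instead of failing with AttributeError on None.

```python
from bs4 import BeautifulSoup

# Sample markup; the <table><tr> wrapper is an assumption added for this sketch.
html = """
<table><tr>
<td class="size-price last first" colspan="4">
<span>453 grams </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>
</tr></table>"""


def extract_rows(markup, parser="html.parser"):
    """Return a list of (quantity, strike, price) tuples, one per matching cell."""
    soup = BeautifulSoup(markup, parser)
    rows = []
    # class_ matches one CSS class even when the attribute lists several
    for td in soup.find_all("td", class_="size-price"):
        quantity = td.find("span")                  # first <span> in the cell
        strike = td.find("span", class_="strike")
        price = td.find("span", class_="price")
        if quantity is None or strike is None or price is None:
            continue  # skip incomplete cells rather than fail on None
        rows.append((quantity.get_text(strip=True),
                     strike.get_text(strip=True),
                     price.get_text(strip=True)))
    return rows


rows = extract_rows(html)
print(rows)
```

Passing "lxml" or "html5lib" as the `parser` argument should make it easy to compare how many cells each parser finds on the real page.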
Try the following:
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
<span>453 grams </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly so results don't depend on what is installed
text = soup.get_text(strip=True)
print text
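Note that get_text(strip=True) joins the stripped pieces with no separator, so the quantity and both prices end up in a single string. If they need to stay separate, one option is the `stripped_strings` generator. A small sketch, assuming the built-in "html.parser" and the same sample markup:

```python
from bs4 import BeautifulSoup

data = """
<td class="size-price last first" colspan="4">
<span>453 grams </span>
<span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
</span>
</td>"""

soup = BeautifulSoup(data, "html.parser")

# get_text(strip=True) concatenates the stripped strings with no separator
joined = soup.get_text(strip=True)

# stripped_strings yields each non-empty text node, already stripped
parts = list(soup.stripped_strings)

print(joined)
print(parts)
```

Here `parts` keeps the quantity, the struck-through price and the current price as separate items, which may be easier to work with than splitting the joined string afterwards.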