This is my HTML tree:
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
<a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
<a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
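From this HTML I need to extract the lines before each <br> tag:
line 1: Get the IndianOil Citibank Card. Apply Now!
line 2: Get 10X Rewards On Shopping - Save Over 5% On Fuel
How should this be done in Python?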
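I think you are just asking for the line before each <br/>.
The following code does that for the sample you provided: it strips out the <b> and <a> tags, then prints the .tail of every element that has a <br/> as a following sibling.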
from lxml import etree
doc = etree.HTML("""
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
<a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
<a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>""")
etree.strip_tags(doc,'a','b')
for element in doc.xpath('//*[following-sibling::*[name()="br"]]'):
    print(repr(element.tail.strip()))
This yields:
'Get the IndianOil Citibank Card. Apply Now!'
'Get 10X Rewards On Shopping -\nSave Over 5% On Fuel'
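To see why the <b> tags have to be stripped before reading .tail, it may help to look at how the ElementTree data model (which lxml follows) splits mixed text. The sketch below is for Python 3 and uses the standard library's xml.etree.ElementTree rather than lxml. FRAGMENT is a shortened, well-formed stand-in for the markup in the question, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the question's markup (illustrative assumption).
FRAGMENT = "<li><h3>Title</h3>Get the <b>Card</b>. Apply Now!<br />next</li>"

li = ET.fromstring(FRAGMENT)
h3, b, br = list(li)

# Text that follows an element's closing tag is stored on that element as .tail,
# so the sentence after </h3> is split across h3.tail, b.text and b.tail.
parts = {"h3.tail": h3.tail, "b.text": b.text, "b.tail": b.tail, "br.tail": br.tail}
for name, value in parts.items():
    print(name, repr(value))

# Joining the pieces by hand gives the text that stripping <b> would merge into h3.tail.
line1 = h3.tail + b.text + b.tail
print(repr(line1))
```

Without stripping, h3.tail holds only the text up to the <b> tag, which is why the answer calls strip_tags first.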
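As with all parsing of HTML, you need to make some assumptions about how the HTML is formatted. If we can assume that a line is everything before a <br> tag, back to the nearest block-level tag or the previous <br>, then we can do the following: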
from BeautifulSoup import BeautifulSoup
doc = """
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
<a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
<a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""
soup = BeautifulSoup(doc)
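Now that we have parsed the HTML, we define a list of tags that we do not want to treat as part of a line. There are other block-level tags, but this list is enough for this HTML.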
block_tags = ["div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "br"]
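We loop through each <br> tag and step back through its previous siblings until there are none left or we hit a block-level tag. On each step we add the node to the front of the line. NavigableStrings have no name attribute, but we still want to include them, hence the two-part test in the while loop.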
for node in soup.findAll("br"):
    line = ""
    sibling = node.previousSibling
    while sibling is not None and (not hasattr(sibling, "name") or sibling.name not in block_tags):
        line = unicode(sibling) + line
        sibling = sibling.previousSibling
    print(line)
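The same rule (a line is the text between a <br> and the nearest preceding block-level tag or <br>) can also be sketched in Python 3 with only the standard library's html.parser, for readers who do not have BeautifulSoup installed. This is a different technique from the sibling walk above: it streams through the tags once and keeps a text buffer. The names BrLineParser and lines_before_br are made up for this example, and the block-tag set is the same assumption about this HTML as in the list above.

```python
from html.parser import HTMLParser

# Tags that end a line; same assumption as block_tags above (<br> is handled separately).
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6"}

SAMPLE = """
<li class="taf"><h3><a href="26eOfferCode%3DGSONESTP-----------" id="pa1">
Citibank <b>Credit Card</b> - Save over 5% on fuel | Citibank.co.in</a>
</h3>Get the IndianOil Citibank <b>Card</b>. Apply Now!
<br />
<a href="e%253DGOOGLE ------">Get 10X Rewards On Shopping</a> -
<a href="S%2526eOfferCode%253DGSCCSLEX ------">Save Over 5% On Fuel</a>
<br />
<cite>www.citibank.co.in/<b>CreditCards</b></cite>
</li>
"""


class BrLineParser(HTMLParser):
    """Collects the text that precedes each <br> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # Emit the buffered text with runs of whitespace collapsed to one space.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.lines.append(text)
            self._buffer = []
        elif tag in BLOCK_TAGS:
            self._buffer = []

    def handle_endtag(self, tag):
        # Text inside a block-level element does not belong to the following line.
        if tag in BLOCK_TAGS:
            self._buffer = []

    def handle_data(self, data):
        self._buffer.append(data)


def lines_before_br(html):
    parser = BrLineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


for line in lines_before_br(SAMPLE):
    print(line)
```

Unlike the loop above, this version drops the markup and keeps only the text, so inline tags such as <a> and <b> do not appear in the extracted lines.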
A solution that does not rely on the <br> tags:
import lxml.html
html = "..."
tree = lxml.html.fromstring(html)
line1 = ''.join(tree.xpath('//li[@class="taf"]/text() | b/text()')[:3]).strip()
line2 = ' - '.join(tree.xpath('//li[@class="taf"]//a[not(@id)]/text()'))
I don't know whether you want to use lxml or Beautiful Soup, but here is an example for lxml using XPath:
import lxml
from lxml import etree
import urllib2
response = urllib2.urlopen('your url here')
html = response.read()
imdb = etree.HTML(html)
titles = imdb.xpath('/html/body/li/a/text()')  # XPath for the "line 2" data (found with Firebug)
The XPath I used is for the given HTML snippet. It may need to change for the original page.
You can also try cssselect in lxml.
import lxml.html
import urllib
data = urllib.urlopen('your url').read()
doc = lxml.html.fromstring(data)
elements = doc.cssselect('your csspath here')  # CSS path (found with the Firebug extension)
for element in elements:
    print(element.text_content())