This is the HTML:
<div class="body">
<p>this is the<br />
text that i want to<br />
.<br />
.<br />
get from html file<br />
.<br />
.</p>
<div class="sender">someone</div>
</div>
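I only want the text inside the <p> tag, without the <br/> tags. I also need the periods between the lines.
I'm using lxml, and this is my code:
jokes = tree.xpath("//div[contains(@class,'body')]/p/text()")
It returns each line as a separate item in the list, but I need all of the <p> tag's text as a single item in the list.
Is there a way to add the whole p tag, without the br tags, to the list as one item?
Something like this: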
this is the
text that i want to
.
.
get from html file
.
.
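When I save the list to a file with the following code:
with open(r'c:\f.txt','w') as f:  # raw string, so '\f' is not read as a form feed
    for l in jokes:
        f.write(l+'**************')
this is what I see in the file: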
this is the************
text that i want to************
.************
.************
get from html file************
.************
.************
Depending on the scope of your scraping this may be overkill, but try BeautifulSoup:
from bs4 import BeautifulSoup
HTML = """<div class="body">
<p>this is the<br />
text that i want to<br />
.<br />
.<br />
get from html file<br />
.<br />
.</p>
<div class="sender">someone</div>
</div>
"""
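soup = BeautifulSoup(HTML, "html.parser")
# get_text() returns all of the text inside <p> as one string, with the <br /> tags dropped
print(soup.p.get_text())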
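If you would rather stay with lxml, one option is to select the <p> elements themselves instead of their text() nodes and join the text of each one, so that every paragraph becomes a single list item. This is only a sketch: the question does not show how tree is built, so parsing with lxml.html.fromstring is an assumption here.

```python
from lxml import html

HTML = """<div class="body">
<p>this is the<br />
text that i want to<br />
.<br />
.<br />
get from html file<br />
.<br />
.</p>
<div class="sender">someone</div>
</div>"""

# Assumption: the original post does not show how `tree` was created.
tree = html.fromstring(HTML)

# Select the <p> elements, then join every text node inside each one.
# itertext() walks the paragraph's own text plus the text that follows each <br />.
paragraphs = tree.xpath("//div[contains(@class,'body')]/p")
jokes = ["".join(p.itertext()) for p in paragraphs]

for joke in jokes:
    print(joke)
    print('**************')
```

With this, the separator is written once per paragraph rather than once per line. Elements parsed by lxml.html also have a text_content() method, which should give the same string for this markup.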
@Pete is right, BeautifulSoup will help here. For what it's worth, you can also strip the tags with the following function:
def stripTags(in_text):
    # convert in_text to a mutable object (e.g. list)
    s_list = list(in_text)
    i = 0
    while i < len(s_list):
        # iterate until a left-angle bracket is found
        if s_list[i] == '<':
            # pop everything from the left-angle bracket until the right-angle bracket
            # (the length check avoids an IndexError on an unclosed tag)
            while i < len(s_list) and s_list[i] != '>':
                s_list.pop(i)
            # pops the right-angle bracket, too
            if i < len(s_list):
                s_list.pop(i)
        else:
            i = i + 1
    # convert the list back into text
    join_char = ''
    return join_char.join(s_list)
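A variation on the same idea, sketched with the standard library's html.parser instead of scanning characters by hand. This assumes Python 3, and the names TextExtractor and strip_tags are made up for this example, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def strip_tags(in_text):
    parser = TextExtractor()
    parser.feed(in_text)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(strip_tags("<p>this is the<br />\ntext that i want to</p>"))
```

Here the parser decides where tags start and end, so the function itself only has to collect the text.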