How can I use regular expressions, or a toolkit such as BeautifulSoup or lxml, to parse a sentence like this:
input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
into this:
Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
I can't use re.findall("<person>(.*?)</person>", input), because the tag names vary.
See how easy this is with BeautifulSoup:
from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
soup = BeautifulSoup(data, 'html.parser')
# Iterating over the soup yields its top-level nodes: text strings and tags.
for item in soup:
    print(item)
Prints:
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>
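UPD (split the non-tag items on whitespace and print each part on a new line):
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: print each whitespace-separated word on its own line.
        for part in item.split():
            print(part)
    else:
        # Tag node: print the whole element unchanged.
        print(item)
Prints: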
Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
Hope that helps.
Try this regular expression:
>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>
>>>
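Note that the substitution above keeps "drove to" on one line, whereas the question asks for each untagged word on its own line. A minimal sketch of a regex-only variant that does both follows. It assumes the tags are simple, not nested, and carry no attributes; the names TOKEN and tokenize are made up for illustration. The backreference \1 makes the closing tag match whatever the opening tag was, so the tag name does not have to be known in advance.

```python
import re

# Assumption: tags are simple, un-nested, and have no attributes.
# First alternative: an opening tag, its text, and the matching closing tag.
# Second alternative: a run of characters that are neither whitespace nor "<".
TOKEN = re.compile(r"<(\w+)>.*?</\1>|[^<\s]+")

def tokenize(text):
    """Return tagged spans whole and untagged text split into words."""
    return [m.group(0) for m in TOKEN.finditer(text)]

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
for token in tokenize(data):
    print(token)
```

For anything beyond input this simple (attributes, nesting, a stray "<"), a real parser such as BeautifulSoup is likely the safer choice.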