How to parse a sentence into tokens using regular expressions or a toolkit



How can I use regular expressions, or a toolkit such as BeautifulSoup or lxml, to parse a sentence like this:
input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

into this:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

I can't use re.findall("<person>(.*?)</person>", input), because the tag names vary.

See how easy this is with BeautifulSoup:

from bs4 import BeautifulSoup
data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
soup = BeautifulSoup(data, 'html.parser')
# Iterating over the soup yields its top-level nodes: text strings and tags
for item in soup:
    print(item)

Prints:

Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

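UPD (split the non-tag items on whitespace and print each part on its own line):

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: print each whitespace-separated word on its own line
        for part in item.split():
            print(part)
    else:
        # Tag node: print the whole element unchanged
        print(item)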

Prints:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
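
If installing BeautifulSoup is not an option, roughly the same loop can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name SentenceTokenizer and the function tokenize_stdlib are made up for illustration, and it assumes flat, non-nested tags without attributes, as in the example sentence.

```python
from html.parser import HTMLParser


class SentenceTokenizer(HTMLParser):
    """Hypothetical helper: collects words and whole tagged elements."""

    def __init__(self):
        super().__init__()
        self.tokens = []       # resulting tokens, in document order
        self._open_tag = None  # name of the tag we are currently inside
        self._inner = []       # text seen inside the current tag

    def handle_starttag(self, tag, attrs):
        # Assumption: tags are not nested and carry no attributes
        self._open_tag = tag
        self._inner = []

    def handle_data(self, data):
        if self._open_tag is not None:
            # Text inside a tag is kept together with its tag
            self._inner.append(data)
        else:
            # Text outside tags is split into words
            self.tokens.extend(data.split())

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            inner = "".join(self._inner)
            self.tokens.append("<{0}>{1}</{0}>".format(tag, inner))
            self._open_tag = None


def tokenize_stdlib(text):
    parser = SentenceTokenizer()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return parser.tokens


data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
for token in tokenize_stdlib(data):
    print(token)
```

Note that html.parser lowercases tag names, so the rebuilt elements may differ in case from the input if the original tags were written in uppercase.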

Hope that helps.

Try this regular expression:

>>> import re
>>> input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
>>> print(re.sub(r"<[^>]*?[^/]\s*>[^<]*?</.*?>", r"\n\g<0>\n", input))
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>
>>> 
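
The substitution above keeps "drove to" on a single line. To also split the plain text into separate words, as the question asks, one possible variation is to match tokens directly instead of inserting newlines. The following is a sketch, not a general parser: the names TOKEN_RE and tokenize_regex are made up for illustration, and it assumes tags are not nested, have no attributes, and contain no line breaks.

```python
import re

# Either a complete <tag>...</tag> element, where the \1 backreference
# requires the closing tag to repeat the opening tag's name, or a run of
# plain text containing neither whitespace nor "<".
TOKEN_RE = re.compile(r"<(\w+)>.*?</\1>|[^\s<]+")


def tokenize_regex(text):
    """Return tagged elements whole and plain text split into words."""
    # finditer + group(0) yields the full match; findall would return
    # only the captured tag name because the pattern contains a group.
    return [match.group(0) for match in TOKEN_RE.finditer(text)]


sentence = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
for token in tokenize_regex(sentence):
    print(token)
```

Because the tag name is captured rather than hard-coded, the same pattern works for person, location, or any other tag name made of word characters.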

