nltk: what does re mean in the following code?

What does re refer to here?

    import re

    def clean_html(html):
        """
        Remove HTML markup from the given string.
        :param html: the HTML string to be cleaned
        :type html: str
        :rtype: str
        """
        # First we remove inline JavaScript/CSS. The \1 backreference makes the
        # closing tag match whichever of script/style opened the block:
        cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
        # Then we remove html comments. This has to be done before removing regular
        # tags since comments can contain '>' characters.
        cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
        # Next we can remove the remaining tags:
        cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
        # Finally, we deal with whitespace
        cleaned = re.sub(r"&nbsp;", " ", cleaned)
        cleaned = re.sub(r"  ", " ", cleaned)
        cleaned = re.sub(r"  ", " ", cleaned)
        return cleaned.strip()
        # Unreachable as pasted, because it follows the return above:
        raise NotImplementedError("To remove HTML markup, use BeautifulSoup's get_text() function")

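re is a module in the Python standard library that provides regular expression matching operations similar to those found in Perl. You work with regular expressions by calling its functions as re.{function_name}. In the code above, re.sub(pattern, repl, string) returns a copy of string in which every match of pattern is replaced by repl. See: https://docs.python.org/3.7/library/re.html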
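
To make this concrete, here is a minimal sketch of re.sub applied to a small input, using the same three tag-stripping patterns as the question. The sample strings and variable names are made up for illustration and are not part of NLTK.

```python
import re

# re.sub(pattern, repl, string): replace every match of pattern with repl.
masked = re.sub(r"\d+", "#", "room 12, floor 3")

# Hypothetical sample input, chosen only to exercise each pattern.
html = "<p>Hello</p><SCRIPT>var x = 1;</SCRIPT><!-- note -->world"

# (?i) ignores case and (?s) lets "." match newlines. \1 must repeat
# whatever group 1 matched, so <script> only pairs with </script>.
no_scripts = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html)

# Drop HTML comments, plus one trailing newline if present.
no_comments = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", no_scripts)

# Replace each remaining tag with a space, then trim the ends.
text = re.sub(r"(?s)<.*?>", " ", no_comments).strip()

print(masked)
print(text)
```

The non-greedy .*? matters in each pattern: a greedy .* would run from the first "<" to the last ">" in the input and delete the text in between.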
