How do I scrape the text inside the div with id="bodyContent" of any Wikipedia article? I am using Python with BeautifulSoup and nltk.

page=nltk.clean_html(soup.findAll('div',id="bodyContent"))

When I try to run this code, it shows:

Traceback (most recent call last):
  File "C:\Python27\wiki3.py", line 36, in <module>
    page=nltk.clean_html(soup.findAll('div',id="bodyContent"))
  File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\util.py", line 340, in clean_html
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
AttributeError: 'ResultSet' object has no attribute 'strip'

You are passing clean_html an iterable of BeautifulSoup objects (that is what findAll returns) rather than a string (which is what clean_html expects).
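The mismatch can be reproduced without BeautifulSoup or nltk. This is only a minimal sketch: `ResultSet` here is a stand-in list subclass (BeautifulSoup's real ResultSet also subclasses list), and `clean` is a hypothetical function that, like clean_html, calls .strip() on its argument.

```python
class ResultSet(list):
    """Stand-in for BeautifulSoup's ResultSet, which subclasses list."""


def clean(html):
    # Hypothetical stand-in for nltk.clean_html: it expects a string
    # and calls .strip() on it.
    return html.strip()


divs = ResultSet(["  <div>one</div> ", " <div>two</div>  "])

# Passing the whole result set fails, because a list has no .strip().
try:
    clean(divs)
    error = None
except AttributeError as exc:
    error = str(exc)

# Passing each element as a string works.
cleaned = [clean(str(d)) for d in divs]

print(error)
print(cleaned)
```

The fix is the same in the real code: iterate over the result and hand clean_html one string at a time.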

Assuming you want a list of strings, one per div, each of them cleaned, do this:

page = [nltk.clean_html(str(d)) for d in soup.findAll('div',id="bodyContent")]

or

# str(d) is still needed here: clean_html expects a string, not a Tag
page = map(lambda d: nltk.clean_html(str(d)), soup.findAll('div',id="bodyContent"))

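import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import nltk
import re
import codecs

article = "Maratha Empire"
article = urllib.quote(article)

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()

soup = BeautifulSoup(data)

# join every text node inside the bodyContent div into one string
for node in soup.findAll('div', id="bodyContent"):
    page = ''.join(node.findAll(text=True))

f = codecs.open("wikiscrap2", "w", "utf-8-sig")
f.write(page)
f.close()

At least with this code I was able to retrieve the content of a Wikipedia page from the bodyContent div.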
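For anyone on Python 3, where urllib2 and the old BeautifulSoup 3 import are not available and NLTK 3 no longer implements clean_html, here is a rough sketch of the same idea using only the standard library's html.parser. This is a swapped-in technique, not what the code above does. `DivTextExtractor`, `body_text`, and the `sample` markup are hypothetical names and data made up for illustration, and the download step is left out so the example runs offline.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0      # <div> nesting depth; 0 means outside the target div
        self.skip = 0       # nesting depth of <script>/<style> inside the target
        self.done = False   # True once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1
            elif tag in ("script", "style"):
                self.skip += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # Keep text only while inside the target div and outside script/style.
        if self.depth and not self.skip:
            self.chunks.append(data)


def body_text(html, target_id="bodyContent"):
    """Return the whitespace-normalised text of the div with the given id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Made-up markup standing in for a downloaded article page.
sample = """
<html><body>
<div id="header">Navigation</div>
<div id="bodyContent">
  <p>First <b>bold</b> paragraph.</p>
  <div class="note">Nested <i>div</i> text.</div>
  <script>var ignored = 1;</script>
</div>
<div id="footer">Footer</div>
</body></html>
"""

print(body_text(sample))
```

Tracking the div nesting depth is what lets the parser tell the closing tag of bodyContent apart from the closing tags of divs nested inside it. For real pages, a proper HTML library is likely more robust than this hand-rolled parser.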
