I'm parsing specific information from a website out to a file. Right now my program looks at a webpage, finds the right HTML tags, and parses out the right content. Now I'd like to filter those "results" further.
For example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I'm parsing out the content located in
When I run this parser, it removes the numbers, symbols, commas, and slashes (i.e. /), but keeps all the text. When I run it on the site, I get results like:
cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika
Now I'd like to process this further by removing stop words such as "cup", "cloves", "minced", "tablespoon", etc. How would I do that? The code is written in Python and I'm not very good at it; I'm just using this parser to get information I could type in manually, but I'd rather not.
Any detailed help on how to do this would be much appreciated! My code is below; how would I do this?
Code:

    import urllib2
    import BeautifulSoup

    def main():
        url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip('123456789.,/ ') for s in ingreds.findAll('li')]
        fname = 'PorkRecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__ == "__main__":
        main()
    import urllib2
    import BeautifulSoup
    import string

    badwords = set([
        'cup', 'cups',
        'clove', 'cloves',
        'tsp', 'teaspoon', 'teaspoons',
        'tbsp', 'tablespoon', 'tablespoons',
        'minced'
    ])

    def cleanIngred(s):
        # remove leading and trailing whitespace
        s = s.strip()
        # remove leading/trailing numbers and punctuation
        s = s.strip(string.digits + string.punctuation)
        # remove unwanted words
        return ' '.join(word for word in s.split() if word not in badwords)

    def main():
        url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
        fname = 'PorkRecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__ == "__main__":
        main()
The output is now:
olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste
I don't know why it leaves the comma in there - s.strip(string.punctuation) should have taken care of that.
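The reason is that str.strip() only removes characters from the two ends of the whole string, never from the middle. In "cloves garlic, minced" the comma sits between words, so once "minced" is filtered out as a bad word, the comma is left stuck to "garlic". A minimal sketch of a fix is to strip digits and punctuation from each word before filtering (the function name clean_ingred and the trimmed-down badwords set here are just illustrative):

```python
import string

# illustrative stop-word set, mirroring the one in the code above
badwords = {'cup', 'cups', 'clove', 'cloves', 'tsp', 'teaspoon', 'teaspoons',
            'tbsp', 'tablespoon', 'tablespoons', 'minced'}

def clean_ingred(s):
    # strip digits/punctuation from every word, not just the ends of the line
    words = (w.strip(string.digits + string.punctuation) for w in s.split())
    # drop empty leftovers (e.g. a bare "1/4") and the unwanted words
    return ' '.join(w for w in words if w and w.lower() not in badwords)

print(clean_ingred("2 cloves garlic, minced"))  # -> garlic
print(clean_ingred("1/4 cup olive oil"))        # -> olive oil
```

Because the stripping now happens per word, quantities like "1/4" collapse to an empty string and are discarded, and a trailing comma on "garlic," is removed before the bad-word check.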