Python HTML parsing with Beautiful Soup and filtering stop words



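I am parsing specific information from a website into a file. The program I have right now looks at a web page, finds the right HTML tag, and parses out the right content. Now I want to filter these "results" further.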

For example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

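I am parsing out the ingredients located in the <div class="ingredients"> tag. The parser does that job nicely, but I want to process the results further.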

When I run the parser, it strips the numbers, symbols, commas, and slashes (/) but leaves all the text. When I run it on the website, I get results like this:

cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika

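Now I want to process this further by removing stop words such as "cup", "cloves", "minced", "tablespoon", and so on. How would I do that? The code is written in Python, which I am not very good at; I am only using this parser to collect information that I could enter by hand but would rather not.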

Any detailed help on how to do this would be appreciated! My code is below.

Code:

import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    # find the ingredients block, then strip quantities from each list item
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip('123456789.,/ ') for s in ingreds.findAll('li')]

    # write one ingredient per line
    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()
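
As a side note on the stripping step above: str.strip takes a set of characters, not a pattern, and removes them only from the two ends of the string. The character set in the question leaves out '0', so a quantity such as "10" would be only partly removed. The sketch below illustrates this; the sample strings are made up for illustration and are not taken from the recipe page.

```python
import string

# Hypothetical raw list-item texts; the real page content may differ.
samples = ["1/4 cup olive oil", "10 ounces tomato sauce"]

# The character set from the question's code: note that '0' is missing.
question_chars = '123456789.,/ '
# A fuller set built from the standard library constants.
full_chars = string.digits + string.punctuation + string.whitespace

for raw in samples:
    # Show what each character set leaves behind for the same input.
    print(repr(raw.strip(question_chars)), repr(raw.strip(full_chars)))
```

Building the set from string.digits and string.punctuation avoids having to list every character by hand.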

One approach is to check each word against a set of unwanted words:

import urllib2
import BeautifulSoup
import string

# words to drop from every ingredient line
badwords = set([
    'cup','cups',
    'clove','cloves',
    'tsp','teaspoon','teaspoons',
    'tbsp','tablespoon','tablespoons',
    'minced'
])

def cleanIngred(s):
    # remove leading and trailing whitespace
    s = s.strip()
    # strip digits and punctuation from both ends of the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)

    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]

    fname = 'PorkRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__ == "__main__":
    main()

Result:

olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste

I'm not sure why it left the comma; s.strip(string.punctuation) should have taken care of that.
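
A likely explanation: str.strip only removes characters from the two ends of the whole string. In a line such as "garlic, minced" the comma sits in the middle, so it is untouched, and once "minced" is filtered out it ends up at the end of "garlic,". Below is a minimal sketch of one possible fix, which strips each word separately before the stop-word check. The function name clean_ingredient and the sample strings are hypothetical, and the lower-casing step is an added assumption so that "Cup" and "cup" are treated the same.

```python
import string

# Same stop words as in the code above.
BADWORDS = {
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

STRIP_CHARS = string.digits + string.punctuation

def clean_ingredient(text):
    """Strip digits/punctuation from each word, then drop stop words."""
    words = []
    for word in text.split():
        # Strip per word, so a comma in the middle of the line is removed too.
        word = word.strip(STRIP_CHARS)
        # Skip words that were only digits/punctuation, and stop words.
        if word and word.lower() not in BADWORDS:
            words.append(word)
    return ' '.join(words)

# Hypothetical sample lines, not taken from the recipe page.
print(clean_ingredient("2 cloves garlic, minced"))
print(clean_ingredient("1/4 cup olive oil"))
```

Because only the ends of each word are stripped, punctuation inside a word (a hyphen, for instance) is left alone.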
