I'm parsing specific information from a website out to a file. Right now my program looks at a webpage, finds the right HTML tags, and parses out the right content. Now I'd like to filter those "results" further.
For example: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I'm parsing out the content located in
When I run this parser, it removes the numbers, symbols, commas, and slashes (i.e. /), but keeps all the text. When I run it on the site, I get results like:
cup olive oil
cup chicken broth
cloves garlic minced
tablespoon paprika
Now I'd like to process this further by removing stop words such as "cup", "cloves", "minced", "tablespoon", etc. How would I do that? The code is written in Python and I'm not very good at it; I'm just using this parser to get information I could type in manually, but I'd rather not.
Any detailed help on how to do this would be much appreciated! My code is below; how would I do this?
Code:

    import urllib2
    import BeautifulSoup

    def main():
        url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip('123456789.,/ ') for s in ingreds.findAll('li')]
        fname = 'PorkRecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__ == "__main__":
        main()
    import urllib2
    import BeautifulSoup
    import string

    badwords = set([
        'cup', 'cups',
        'clove', 'cloves',
        'tsp', 'teaspoon', 'teaspoons',
        'tbsp', 'tablespoon', 'tablespoons',
        'minced'
    ])

    def cleanIngred(s):
        # remove leading and trailing whitespace
        s = s.strip()
        # remove leading/trailing numbers and punctuation
        s = s.strip(string.digits + string.punctuation)
        # remove unwanted words
        return ' '.join(word for word in s.split() if word not in badwords)

    def main():
        url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')]
        fname = 'PorkRecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))

    if __name__ == "__main__":
        main()
The output is now:
olive oil
chicken broth
garlic,
paprika
garlic powder
poultry seasoning
dried oregano
dried basil
thick cut boneless pork chops
salt and pepper to taste
I don't know why it leaves the comma in there - s.strip(string.punctuation) should have taken care of that.
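The reason is that str.strip() only removes characters from the two ends of the whole string, never from the middle. In "cloves garlic, minced" the comma sits between words, so once "minced" is filtered out as a bad word, the comma is left stuck to "garlic". A minimal sketch of a fix is to strip digits and punctuation from each word before filtering (the function name clean_ingred and the trimmed-down badwords set here are just illustrative):

```python
import string

# illustrative stop-word set, mirroring the one in the code above
badwords = {'cup', 'cups', 'clove', 'cloves', 'tsp', 'teaspoon', 'teaspoons',
            'tbsp', 'tablespoon', 'tablespoons', 'minced'}

def clean_ingred(s):
    # strip digits/punctuation from every word, not just the ends of the line
    words = (w.strip(string.digits + string.punctuation) for w in s.split())
    # drop empty leftovers (e.g. a bare "1/4") and the unwanted words
    return ' '.join(w for w in words if w and w.lower() not in badwords)

print(clean_ingred("2 cloves garlic, minced"))  # -> garlic
print(clean_ingred("1/4 cup olive oil"))        # -> olive oil
```

Because the stripping now happens per word, quantities like "1/4" collapse to an empty string and are discarded, and a trailing comma on "garlic," is removed before the bad-word check.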