New York Times summary extractor (Python 2)



I am trying to access New York Times article summaries using the NewsWire API and Python 2.7. Here is the code:

from urllib2 import urlopen
import urllib2
from json import loads
import codecs
import time
import newspaper
# Assumed source of XMLSyntaxError; the original code omits this import.
from lxml.etree import XMLSyntaxError

posts = list()      # abstracts returned by the API
articles = list()   # full article texts fetched with newspaper
i = 30
keys = dict()       # running index -> article URL
count = 0
offset = 0
while offset < 40000:
    if len(posts) >= 30000:
        break
    if 700 < offset < 800:
        offset = offset + 100
    #for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset=" + str(offset) + "&api-key=ACCESSKEY"
        data = loads(urlopen(url).read())
        print str(len(posts)) + "  offset=" + str(offset)
        # Dump everything collected so far before processing the next page.
        if posts and articles and keys:
            outfile = open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile = open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile = open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + "    " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if ('url' in item) and ('abstract' in item):
                url = item["url"]
                abst = item["abstract"]
                # Skip URLs that have already been processed.
                if url not in keys.values():
                    keys[count] = url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n', ' ').replace("Advertisement Continue reading the main story", '')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count = count + 1
                    res = abst # url + "    " + abst
                    # print res.encode("utf-8")
                    posts.append(res) # Here is the appending statement.
                    if len(posts) >= 30000:
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    except urllib2.URLError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    offset = offset + 19
print str(len(posts))
print str(len(keys))

For the most part I get good summaries. But sometimes I get some odd sentences as part of the summary. Here are examples:

Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.

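They are treated as the summary of some article. Please help me extract a proper summary of the article from New York Times news. I thought of using the title when this happens, but the titles are odd too.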

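So, I went through the summary results.

Repeated statements such as Corrections appearing in print on Monday, August 28, 2017., in which only the date differs, can be removed.

The simplest way is to check whether the statement is present in the variable itself. For example,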

# declare at the top
# create a list that consists of repetitive statements. I found 'quotation of the day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"] 

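Then, in place of the existing appending statement (a bare generator expression is always truthy, so the test goes through any()),

if not any(statement in res for statement in REMOVE_STATEMENTS):
    posts.append(res)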
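
As a self-contained illustration of this check, the sketch below wraps it in a helper and runs it on the two sample abstracts from the question. The helper name `is_unwanted` and the third sample sentence are made up for the example, and the apostrophe in the second sample is written in plain ASCII.

```python
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]


def is_unwanted(res, statements=REMOVE_STATEMENTS):
    """Return True if res contains any of the repetitive statements."""
    return any(statement in res for statement in statements)


samples = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "Here's what you need to know to start your day.",
    "A hypothetical abstract about a city council vote.",  # invented sample
]

# Keep only the abstracts that do not match a known repetitive statement.
posts = [res for res in samples if not is_unwanted(res)]
print(posts)
```

The second sample survives the filter because nothing in the list matches it yet; it would have to be added to REMOVE_STATEMENTS to be dropped.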

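As for the remaining unwanted statements, there is no way to tell them apart unless you search res for keywords to ignore, or they are repeated. If you find any, just add them to the list I created.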
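
One possible way to spot repeated statements is to blank out the parts that vary (digits, weekday and month names) and count what is left. The sketch below is only an assumption about how that could be done, not something taken from the NewsWire API; `template_of`, `repeated_templates`, and the `min_count` threshold are hypothetical names, and every sample abstract except the first is invented.

```python
import re
from collections import Counter

# Words that vary between otherwise identical statements.
_DATE_WORDS = re.compile(
    r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|"
    r"January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\b"
)


def template_of(abstract):
    """Replace date words and digit runs with placeholders."""
    text = _DATE_WORDS.sub("<DATE>", abstract)
    text = re.sub(r"\d+", "<N>", text)
    return " ".join(text.split())  # collapse whitespace


def repeated_templates(abstracts, min_count=2):
    """Return templates that occur at least min_count times, most frequent first."""
    counts = Counter(template_of(a) for a in abstracts)
    return [t for t, n in counts.most_common() if n >= min_count]


abstracts = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "Corrections appearing in print on Tuesday, August 29, 2017.",  # invented
    "A hypothetical abstract about a city council vote.",  # invented
]
print(repeated_templates(abstracts))
```

Any template this returns can be cut down to its fixed prefix and added to REMOVE_STATEMENTS. The word list is deliberately crude: a capitalized "May" used as an ordinary word would also be replaced, so the results are candidates to review by hand rather than a final list.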
