I'm using Python's Newspaper module.
The tutorial describes how to pool the building of different newspapers so that they are generated simultaneously (see "Multi-threading article downloads" at the link above).
Is there a way to extract articles directly from a list of URLs? In other words, can I feed multiple URLs into the setup below and have it download and parse them all concurrently?
from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])
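For reference, the plain sequential version would just loop the snippet above over a list, which is what I'd like to parallelise (a minimal sketch; the extra list entries are placeholders):
from newspaper import Article

urls = [
    'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml',
    # ... more article URLs
]

texts = []
for url in urls:
    a = Article(url, language='zh')  # Chinese
    a.download()  # fetch the HTML, one URL at a time
    a.parse()     # extract the article body
    texts.append(a.text)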
I was able to achieve this by creating a Source for each article URL. (Disclaimer: not a Python developer.)
import newspaper
urls = [
'http://www.baltimorenews.net/index.php/sid/234363921',
'http://www.baltimorenews.net/index.php/sid/234323971',
'http://www.atlantanews.net/index.php/sid/234323891',
'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',
]
class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(StubSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=url)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()

for s in sources:
    print s.articles[0].html
I know this question is old, but it was one of the first links that came up when I googled how to do multithreaded newspaper downloads. While Kyle's answer is very helpful, it is not complete and I think it has a few typos...
import newspaper
urls = [
'http://www.baltimorenews.net/index.php/sid/234363921',
'http://www.baltimorenews.net/index.php/sid/234323971',
'http://www.atlantanews.net/index.php/sid/234323891',
'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',
]
class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()
I changed StubSource to SingleSource and changed one of the urls to articleURL. Of course this only downloads the web pages; you still need to parse them to get the text.
multi = []
i = 0
for s in sources:
    i += 1
    try:
        (s.articles[0]).parse()
        txt = (s.articles[0]).text
        multi.append(txt)
    except:
        pass
On my sample of 100 urls, this took half the time compared with working through each url sequentially. (Edit: after increasing the sample size to 2000, the reduction was about a quarter.)
(Edit: got the whole thing working with multithreading!) I used this very good explanation for my implementation. With a sample size of 100 urls, using 4 threads takes a comparable time to the code above, but increasing the thread count to 10 cuts that roughly in half again. Larger sample sizes need more threads to give a comparable improvement.
import newspaper
from newspaper import Article
from multiprocessing.dummy import Pool as ThreadPool

def getTxt(url):
    article = Article(url)
    article.download()
    try:
        article.parse()
        txt = article.text
        return txt
    except:
        return ""

pool = ThreadPool(10)

# open the urls in their own threads
# and return the results
results = pool.map(getTxt, urls)

# close the pool and wait for the work to finish
pool.close()
pool.join()
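Pool.map returns the texts in the same order as the input list, so they can be matched back to their urls afterwards, for example:
for url, txt in zip(urls, results):
    print(url, len(txt))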
Building on Joseph Valls's answer. I am assuming the original poster wanted to use multithreading to extract a bunch of data and store it somewhere properly. After much trying, I think I have found a solution; it may not be the most efficient, but it works. I tried to make it better, but I think the newspaper3k plugin may be slightly buggy. However, this does extract the desired elements into a DataFrame.
import newspaper
from newspaper import Article
from newspaper import Source
from newspaper import news_pool
import pandas as pd

gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc_paper = newspaper.build("https://www.bbc.com/news", memoize_articles=False)

papers = [gamespot_paper, bbc_paper]
news_pool.set(papers, threads_per_source=4)
news_pool.join()

# Create our final dataframe
df_articles = pd.DataFrame()

# Create a download limit per source
limit = 100

for source in papers:
    # temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source = []

    count = 0

    for article_extract in source.articles:
        article_extract.parse()

        if count > limit:
            break

        # appending the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        # update the count
        count += 1

    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    # append to the final DataFrame
    df_articles = df_articles.append(df_temp, ignore_index=True)
    print('source extracted')
Please suggest any improvements!
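One improvement worth noting: DataFrame.append was deprecated and removed in pandas 2.0, so on recent pandas the per-source frames can be collected in a list and concatenated once at the end. A minimal sketch of just that part, reusing papers and limit from the code above:
frames = []
for source in papers:
    titles, texts, source_urls = [], [], []
    for article_extract in source.articles[:limit]:
        article_extract.parse()
        titles.append(article_extract.title)
        texts.append(article_extract.text)
        source_urls.append(article_extract.source_url)
    frames.append(pd.DataFrame({'Title': titles, 'Text': texts, 'Source': source_urls}))

# a single concat replaces the repeated DataFrame.append calls
df_articles = pd.concat(frames, ignore_index=True)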
I'm not familiar with the Newspaper module, but the following code uses a list of URLs and should be equivalent to the one provided in the linked page:
import newspaper
from newspaper import news_pool
urls = ['http://slate.com','http://techcrunch.com','http://espn.com']
papers = [newspaper.build(i) for i in urls]
news_pool.set(papers, threads_per_source=2)
news_pool.join()
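As the earlier answers note, news_pool.join() only downloads the article HTML; each article still has to be parsed before its text is available. A minimal sketch of reading the results back out:
for paper in papers:
    for article in paper.articles:
        article.parse()
        print(article.title)
        print(article.text[:150])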