Scraping Google News with pygooglenews



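I am trying to scrape Google News with pygooglenews. I want to collect more than 100 articles in one run (Google caps each query at 100 results) by changing the target date in a for loop. Below is what I have so far, but I keep getting this error: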

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-4ada7169ebe7> in <module>
----> 1 df = pd.DataFrame(get_news('Banana'))
2 writer = pd.ExcelWriter('My Result.xlsx', engine='xlsxwriter')
3 df.to_excel(writer, sheet_name='Results', index=False)
4 writer.save()
<ipython-input-79-c5266f97934d> in get_titles(search)
9 
10     for date in date_list[:-1]:
---> 11         search = gn.search(search, from_=date, to_=date_list[date_list.index(date)])
12         newsitem = search['entries']
13 
~\AppData\Roaming\Python\Python37\site-packages\pygooglenews\__init__.py in search(self, query, helper, when, from_, to_, proxies, scraping_bee)
140         if from_ and not when:
141             from_ = self.__from_to_helper(validate=from_)
--> 142             query += ' after:' + from_
143 
144         if to_ and not when:
TypeError: unsupported operand type(s) for +=: 'dict' and 'str'

import pandas as pd
from pygooglenews import GoogleNews
import datetime

gn = GoogleNews()

def get_news(search):
    stories = []
    start_date = datetime.date(2021,3,1)
    end_date = datetime.date(2021,3,5)
    delta = datetime.timedelta(days=1)
    date_list = pd.date_range(start_date, end_date).tolist()

    for date in date_list[:-1]:
        search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
        newsitem = search['entries']
        for item in newsitem:
            story = {
                'title':item.title,
                'link':item.link,
                'published':item.published
            }
            stories.append(story)
    return stories

df = pd.DataFrame(get_news('Banana'))

Thanks in advance.

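It looks like you are correctly passing a string into get_news(), which then passes it as the first argument (search) to gn.search().

However, you then reassign search to the result of gn.search():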

search = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
# ^^^^^^
# gets overwritten with the result of gn.search()

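On the next iteration, this reassigned search is passed back into gn.search(), which expects a string.

If you look at the pygooglenews source, gn.search() returns a dict, which explains the error: the library tries to append ' after:' + from_ to the query, and a str cannot be added to a dict.

To fix this, just use a different variable for the result, for example: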
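The failure can be reproduced without any network access. The sketch below uses a hypothetical stand-in, fake_search(), which only mimics the two relevant behaviours of gn.search() visible in the traceback (it appends to the query string and returns a dict). It is not the library's real implementation.

```python
def fake_search(query):
    # Hypothetical stand-in for gn.search(): like the library code shown in
    # the traceback, it appends a date filter to the query and returns a dict.
    query += ' after:2021-03-01'
    return {'entries': [], 'query': query}


search = 'Banana'
search = fake_search(search)       # first call works; search is now a dict

message = None
try:
    search = fake_search(search)   # second call receives the dict, not a str
except TypeError as exc:
    message = str(exc)
    print(message)
```

The second call fails for the same reason as in the question: the name that held the query string now holds the previous result.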
result = gn.search(search, from_=date.strftime('%Y-%m-%d'), to_=(date+delta).strftime('%Y-%m-%d'))
newsitem = result['entries']

As far as I know, pygooglenews is limited to 100 articles per query, so you have to write a loop that scrapes each day separately.
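Putting both points together, here is a minimal sketch of a day-by-day loop that keeps the query string and the result in separate variables. The helper daily_windows(), the search_fn parameter, and the de-duplication by link are assumptions added for illustration; they are not part of pygooglenews. Passing the search function in as an argument lets the loop be tried with a stub before hitting Google.

```python
import datetime


def daily_windows(start, end):
    """Yield (from_, to_) date strings, one pair per day in [start, end)."""
    one_day = datetime.timedelta(days=1)
    current = start
    while current < end:
        yield (current.strftime('%Y-%m-%d'),
               (current + one_day).strftime('%Y-%m-%d'))
        current += one_day


def get_news(query, search_fn, start, end):
    """Collect stories one day at a time.

    search_fn is assumed to behave like GoogleNews().search: it accepts
    (query, from_=..., to_=...) and returns a dict with an 'entries' list.
    """
    stories = []
    seen_links = set()  # assumption: adjacent windows may return repeats
    for from_, to_ in daily_windows(start, end):
        # Store the result under its own name so query stays a string.
        result = search_fn(query, from_=from_, to_=to_)
        for item in result['entries']:
            if item.link in seen_links:
                continue
            seen_links.add(item.link)
            stories.append({
                'title': item.title,
                'link': item.link,
                'published': item.published,
            })
    return stories
```

With the real library this would be called as get_news('Banana', gn.search, datetime.date(2021, 3, 1), datetime.date(2021, 3, 5)), and the returned list can be passed to pd.DataFrame() as in the question.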
