目标:我正在尝试下载基于关键字的新闻文章,以进行情绪分析。
这段代码几个月前还在工作,但现在它返回了一个null值。我尝试修复该问题,但links=soup.select(".r a")
返回空值。
import pandas as pd
import requests
from bs4 import BeautifulSoup
import string
import nltk
from urllib.request import urlopen
import sys
import webbrowser
import newspaper
import time
from newspaper import Article
Company_name1 =[]
Article_number1=[]
Article_Title1=[]
Article_Authors1=[]
Article_pub_date1=[]
Article_Text1=[]
Article_Summary1=[]
Article_Keywords1=[]
Final_dataframe=[]
class Newspapr_pd:
def __init__(self,term):
self.term=term
self.subjectivity=0
self.sentiment=0
self.url='https://www.google.com/search?q={0}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'.format(self.term)
def NewsArticlerun_pd(self):
response=requests.get(self.url)
response.raise_for_status()
#print(response.text)
soup=bs4.BeautifulSoup(response.text,'html.parser')
links=soup.select(".r a")
numOpen = min(5, len(links))
Article_number=0
for i in range(numOpen):
response_links = webbrower.open("https://www.google.com" + links[i].get("href"))
#For different language newspaper refer above table
article = Article(response_links, language="en") # en for English
Article_number+=1
print('*************************************************************************************')
Article_number1.append(Article_number)
Company_name1.append(self.term)
#To download the article
try:
article.download()
#To parse the article
article.parse()
#To perform natural language processing ie..nlp
article.nlp()
#To extract title
Article_Title1.append(article.title)
#To extract text
Article_Text1.append(article.text)
#To extract Author name
Article_Authors1.append(article.authors)
#To extract article published date
Article_pub_date1.append(article.publish_date)
#To extract summary
Article_Summary1.append(article.summary)
#To extract keywords
Article_Keywords1.append(article.keywords)
except:
print('Error in loading page')
continue
for art_num,com_name,title,text,auth,pub_dt,summaries,keywds in zip(Article_number1,Company_name1,Article_Title1,Article_Text1,Article_Authors1,Article_pub_date1,Article_Summary1,Article_Keywords1):
Final_dataframe.append({'Article_link_num':art_num, 'Company_name':com_name,'Article_Title':title,'Article_Text':text,'Article_Author':auth,
'Article_Published_date':pub_dt,'Article_Summary':summaries,'Article_Keywords':keywds})
list_of_companies=['Amazon','Jetairways','nirav modi']
for i in list_of_companies:
comp = str('"'+ i + '"')
a=Newspapr_pd(comp)
a.NewsArticlerun_pd()
Final_new_dataframe=pd.DataFrame(Final_dataframe)
Final_new_dataframe.tail()
这是一个非常复杂的问题,因为Google News不断更改它们的类名。此外,谷歌将在文章URL中添加各种前缀,并添加一些隐藏的广告或社交媒体标签。
下面的答案只针对从谷歌新闻中抓取的文章。需要更多的测试来确定它是如何处理大量关键词和谷歌新闻改变页面结构的。
Newspaper3k
提取更为复杂,因为每一篇文章都可能具有不同的结构。我建议查看我的Newspaper3k使用概述文档,了解如何设计代码的这一部分的详细信息。
附言:我现在正在写一个新的新闻刮刀,因为Newspaper3k的开发已经死了。我不确定我的代码的发布日期。
import requests
import re as regex
from bs4 import BeautifulSoup
def get_google_news_article(search_string):
articles = []
url = f'https://www.google.com/search?q={search_string}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'
response = requests.get(url)
raw_html = BeautifulSoup(response.text, "lxml")
main_tag = raw_html.find('div', {'id': 'main'})
for div_tag in main_tag.find_all('div', {'class': regex.compile('xpd')}):
for a_tag in div_tag.find_all('a', href=True):
if not a_tag.get('href').startswith('/search?'):
none_articles = bool(regex.search('amazon.com|facebook.com|twitter.com|youtube.com|wikipedia.org', a_tag['href']))
if none_articles is False:
if a_tag.get('href').startswith('/url?q='):
find_article = regex.search('(.*)(&sa=)', a_tag.get('href'))
article = find_article.group(1).replace('/url?q=', '')
if article.startswith('https://'):
articles.append(article)
return articles
list_of_companies = ['amazon', 'jet airways', 'nirav modi']
for company_name in list_of_companies:
print(company_name)
search_results = get_google_news_article(company_name)
for item in sorted(set(search_results)):
print(item)
print('n')
这是上面代码的输出:
amazon
https://9to5mac.com/2021/11/15/amazon-releases-native-prime-video-app-for-macos-with-purchase-support-and-more/
https://wtvbam.com/2021/11/15/india-police-to-question-amazon-executives-in-probe-over-marijuana-smuggling/
https://www.cnet.com/home/smart-home/all-the-new-amazon-features-for-your-smart-home-alexa-disney-echo/
https://www.cnet.com/tech/amazon-unveils-black-friday-deals-starting-on-nov-25/
https://www.crossroadstoday.com/i/amazons-best-black-friday-deals-for-2021-2/
https://www.reuters.com/technology/ibm-amazon-partner-extend-reach-data-tools-oil-companies-2021-11-15/
https://www.theverge.com/2021/11/15/22783275/amazon-basics-smart-switches-price-release-date-specs
https://www.tomsguide.com/news/amazon-echo-motion-detection
https://www.usatoday.com/story/money/shopping/2021/11/15/amazon-black-friday-2021-deals-online/8623710002/
https://www.winknews.com/2021/11/15/new-amazon-sortation-center-began-operations-monday-could-bring-faster-deliveries/
jet airways
https://economictimes.indiatimes.com/markets/expert-view/first-time-in-two-decades-new-airlines-are-starting-instead-of-closing-down-jyotiraditya-scindia/articleshow/87660724.cms
https://menafn.com/1103125331/Jet-Airways-to-resume-operations-in-Q1-2022
https://simpleflying.com/jet-airways-100-aircraft-5-years/
https://simpleflying.com/jet-airways-q3-loss/
https://www.business-standard.com/article/companies/defunct-carrier-jet-airways-posts-rs-306-cr-loss-in-september-quarter-121110901693_1.html
https://www.business-standard.com/article/markets/stocks-to-watch-ril-aurobindo-bhel-m-m-jet-airways-idfc-powergrid-121110900189_1.html
https://www.financialexpress.com/market/nykaa-hdfc-zee-media-jet-airways-power-grid-berger-paints-petronet-lng-stocks-in-focus/2366063/
https://www.moneycontrol.com/news/business/earnings/jet-airways-standalone-september-2021-net-sales-at-rs-41-02-crore-up-313-51-y-o-y-7702891.html
https://www.spokesman.com/stories/2021/nov/11/boeing-set-to-dent-airbus-india-dominance-with-737/
https://www.timesnownews.com/business-economy/industry/article/times-now-summit-2021-jet-airways-will-make-a-comeback-into-indian-skies-akasa-to-take-off-next-year-says-jyotiraditya-scindia/831090
nirav modi
https://m.republicworld.com/india-news/general-news/piyush-goyal-says-few-rotten-eggs-destroyed-credibility-of-countrys-ca-sector.html
https://www.bulletnews.net/akkad-bakkad-rafu-chakkar-review-the-story-of-robbing-people-by-making-fake-banks/
https://www.daijiworld.com/news/newsDisplay%3FnewsID%3D893048
https://www.devdiscourse.com/article/law-order/1805317-hc-seeks-centres-stand-on-bankers-challenge-to-dismissal-from-service
https://www.geo.tv/latest/381560-arif-naqvis-extradition-case-to-be-heard-after-nirav-modi-case-ruling
https://www.hindustantimes.com/india-news/cbiand-ed-appointments-that-triggered-controversies-101636954580012.html
https://www.law360.com/articles/1439470/suicide-test-ruling-delays-abraaj-founder-s-extradition-case
https://www.moneycontrol.com/news/trends/current-affairs-trends/nirav-modi-extradition-case-outcome-of-appeal-to-also-affect-pakistani-origin-global-financier-facing-16-charges-of-fraud-and-money-laundering-7717231.html
https://www.thehansindia.com/hans/opinion/news-analysis/uniform-law-needed-for-free-exit-of-rich-businessmen-714566
https://www.thenews.com.pk/print/908374-uk-judge-delays-arif-naqvi-s-extradition-to-us