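I'm having trouble printing the text from this page because BeautifulSoup isn't picking up the span class or the section class tag. I want to pull the text from Motley Fool and then parse it by sentence.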
https://www.fool.com/earnings/call-transcripts/2019/04/26/exxon-mobil-corp-xom-q1-2019-earnings-conference-c.aspx
So far, sentence parsing works whenever the text does get pulled in, but Beautiful Soup only pulls the text in occasionally.
import re
from html.parser import HTMLParser

import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

# Note: dataframe_url, tokenizer and textlist are not defined in this snippet.
def news():
    # the target we want to open
    url = dataframe_url
    # open with GET method
    resp = requests.get(url)
    # HTTP response 200 means OK status
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.text, "html.parser")
        # l = soup.find("span", attrs={'class': "article-content"})
        l = soup.find("section", attrs={'class': "usmf-new article-body"})
        # print('\n-----\n'.join(tokenizer.tokenize(l.text)))
        textlist.extend(tokenizer.tokenize(l.text))
    else:
        print("Error")
To capture the transcript, you could try something like this and adapt it to your needs:
import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    response = s.get('https://www.fool.com/earnings/call-transcripts/2019/04/26/exxon-mobil-corp-xom-q1-2019-earnings-conference-c.aspx')

soup = bs(response.content, 'lxml')
heads = soup.find_all('h2')
selections = ['Prepared Remarks:', 'Questions and Answers:']

for selection in selections:
    for head in heads:
        if head.text == selection:
            # walk every element that follows the matching heading
            for elem in head.find_all_next():
                if elem.name != 'script':
                    print(elem.text)
                # stop once the closing "Duration" line is reached
                if 'Duration' in elem.text:
                    break
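If you would rather collect the text than print it, a variation along these lines might be a starting point. It is only a sketch: `extract_sections` and `SAMPLE_HTML` are hypothetical names, the sample markup is made up for illustration, and it assumes the paragraphs are `<p>` siblings of the `<h2>` headings, which may not match the real page exactly.

```python
from bs4 import BeautifulSoup


def extract_sections(html, headings=("Prepared Remarks:", "Questions and Answers:")):
    """Return {heading: [paragraph text, ...]} for each matching <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for head in soup.find_all("h2"):
        title = head.get_text(strip=True)
        if title not in headings:
            continue
        paragraphs = []
        # Walk the tags that follow the heading until the next <h2>.
        for sibling in head.find_next_siblings():
            if sibling.name == "h2":
                break
            # Keep only <p> tags, so <script> and other tags are skipped.
            if sibling.name == "p":
                text = sibling.get_text(" ", strip=True)
                if text:
                    paragraphs.append(text)
        sections[title] = paragraphs
    return sections


# Made-up markup that only imitates the assumed structure of the page.
SAMPLE_HTML = """
<section class="article-body">
  <h2>Prepared Remarks:</h2>
  <p><strong>Operator</strong></p>
  <p>Good day, everyone. Welcome to the call.</p>
  <script>var x = 1;</script>
  <h2>Questions and Answers:</h2>
  <p>Thank you. Our first question is next.</p>
  <h2>Call Participants:</h2>
  <p>Someone</p>
</section>
"""

sections = extract_sections(SAMPLE_HTML)
for title, paragraphs in sections.items():
    print(title, paragraphs)
```

Each list of paragraphs could then be joined and passed to whatever sentence tokenizer you already use. Returning an empty result instead of calling `.text` on a missing element also avoids an `AttributeError` when a lookup finds nothing.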
Let me know if that's close enough.