BeautifulSoup not picking up text from span class or section class tags



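I'm having trouble printing the text from this page, because BeautifulSoup is not picking up the span class or section class tags. I want to pull the text from Motley Fool and then parse it by sentence.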

https://www.fool.com/earnings/call-transcripts/2019/04/26/exxon-mobil-corp-xom-q1-2019-earnings-conference-c.aspx

So far, the sentence parsing works whenever the text does get pulled in, but Beautiful Soup only pulls in the text occasionally.

import requests
from bs4 import BeautifulSoup
from textblob import TextBlob
from html.parser import HTMLParser
import re

# dataframe_url, tokenizer and textlist are not shown in this snippet
def news():
    # the target we want to open
    url = dataframe_url
    # open with GET method
    resp = requests.get(url)
    # HTTP response 200 means OK status
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.text, "html.parser")
        #l = soup.find("span", attrs={'class': "article-content"})
        l = soup.find("section", attrs={'class': "usmf-new article-body"})
        #print('\n-----\n'.join(tokenizer.tokenize(l.text)))
        textlist.extend(tokenizer.tokenize(l.text))
    else:
        print("Error")

To capture the transcript, you could try something like this, and modify it to suit your needs:

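import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.fool.com/earnings/call-transcripts/2019/04/26/exxon-mobil-corp-xom-q1-2019-earnings-conference-c.aspx'
with requests.Session() as s:
    response = s.get(url)
# the 'lxml' parser requires the lxml package to be installed
soup = bs(response.content, 'lxml')
heads = soup.find_all('h2')
selections = ['Prepared Remarks:', 'Questions and Answers:']
for selection in selections:
    for head in heads:
        if head.text == selection:
            # walk every element that follows the matching heading, in document order
            for elem in head.find_all_next():
                # skip inline scripts, print the text of everything else
                if elem.name != 'script':
                    print(elem.text)
                # the transcript ends with a "Duration" line, so stop there
                if 'Duration' in elem.text:
                    break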
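
If the printed output repeats itself (the walk above visits nested tags as well as their parents, and the first section runs on into the second), a variant that only collects the `<p>` siblings under each heading may be easier to post-process. This is just a sketch: `SAMPLE_HTML` is a made-up stand-in for the page, `transcript_paragraphs` is a hypothetical helper name, and treating a paragraph that starts with "Duration" as the end marker is an assumption based on the loop above. The real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the transcript page; the real markup may differ.
SAMPLE_HTML = """
<section class="usmf-new article-body">
  <h2>Prepared Remarks:</h2>
  <p><strong>Operator</strong></p>
  <p>Good day, everyone. Welcome to the call.</p>
  <h2>Questions and Answers:</h2>
  <p>Thank you. We will now take questions.</p>
  <p>Duration: 60 minutes</p>
  <h2>Call Participants:</h2>
  <p>Not part of the transcript body.</p>
</section>
"""


def transcript_paragraphs(html, headings=("Prepared Remarks:", "Questions and Answers:")):
    """Return the text of each <p> that follows one of the given <h2> headings."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for head in soup.find_all("h2"):
        if head.get_text(strip=True) not in headings:
            continue
        # Walk siblings only, so nested tags are not visited a second time.
        for sibling in head.find_next_siblings():
            if sibling.name == "h2":
                break  # the next section starts here
            if sibling.name == "p":
                text = sibling.get_text(" ", strip=True)
                if text.startswith("Duration"):
                    break  # assumed end-of-transcript marker
                paragraphs.append(text)
    return paragraphs


paragraphs = transcript_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

To run it against the live page, pass `response.content` from the request above instead of `SAMPLE_HTML`.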

Let me know if that's close enough.
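
For the "parse by sentence" part, the `tokenizer` from the question isn't shown, so here is a rough stand-in using only the standard library. `split_sentences` is a hypothetical helper, and the rule it uses (split after `.`, `!` or `?` followed by whitespace) is deliberately naive: it will break on abbreviations such as "Corp." and is only meant to show where the sentence step would plug in.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings (e.g. when the input is empty).
    return [part for part in parts if part]


sample = "Good day, everyone. Welcome to the call! Any questions?"
sentences = split_sentences(sample)
print(sentences)
```

A trained tokenizer, such as the one you are already using, should handle real transcripts better than this regex.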
