Regular expression to grab the string between HTML tags in Python



I am trying to extract the price from https://finance.yahoo.com/quote/GOOG?ltr=1, specifically from this element:

<title>GOOG 989.68 1.85 0.19% : Alphabet Inc. - Yahoo Finance</title>

But my output does not contain the price 989.68. Instead, I get this:

['GOOG : Summary for Alphabet Inc. - Yahoo Finance']

Here is my code:

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")
# read() returns bytes; decode them instead of calling str() on the bytes object
htmltext = htmlfile.read().decode("utf-8", errors="replace")
pattern = re.compile('<title>(.*?)</title>')
price = pattern.findall(htmltext)
print(price)

I did not see any stock information in <title></title>, but I was able to get it working with BeautifulSoup:

import requests
from bs4 import BeautifulSoup
page = requests.get('https://finance.yahoo.com/quote/GOOG?ltr=1')
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.select_one('div#quote-header-info')
print(container.find('h1').text)
for ele in container.find_all('span'):
    print(ele.text)

Its output is:

GOOG - Alphabet Inc.
NasdaqGS - NasdaqGS Delayed Price. Currency in USD
989.68
+1.85 (+0.19%)
At close:  4:00PM EDT

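I strongly recommend not using data-reactid to locate your elements, because it can change, and most likely will change once a new version of the site is released. It is an internal ID used by the React framework. Also, in some browsers React does not even expose the react-id as an attribute, but through .innerHTML.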
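
One way to avoid depending on data-reactid is to anchor on the container's id, as the code above does with div#quote-header-info. Below is a minimal standard-library sketch of that idea. The markup in SAMPLE_HTML, the data-reactid values inside it, and the class name HeaderSpanCollector are illustrative assumptions, not the real page or an existing API:

```python
from html.parser import HTMLParser

# Hypothetical, simplified fragment modeled on the output shown above.
# The real page is much larger and its structure may differ.
SAMPLE_HTML = """
<div id="quote-header-info">
  <h1 data-reactid="7">GOOG - Alphabet Inc.</h1>
  <span data-reactid="9">NasdaqGS - NasdaqGS Delayed Price. Currency in USD</span>
  <span data-reactid="14">989.68</span>
  <span data-reactid="16">+1.85 (+0.19%)</span>
</div>
<div id="other"><span data-reactid="14">unrelated</span></div>
"""


class HeaderSpanCollector(HTMLParser):
    """Collect the text of each <span> inside the <div> with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0   # 0 means "outside the container"
        self.in_span = False
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if self.div_depth == 0:
            # Enter the container only when its id matches.
            if tag == "div" and dict(attrs).get("id") == self.container_id:
                self.div_depth = 1
        elif tag == "div":
            self.div_depth += 1
        elif tag == "span":
            self.in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if self.div_depth == 0:
            return
        if tag == "div":
            self.div_depth -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


collector = HeaderSpanCollector("quote-header-info")
collector.feed(SAMPLE_HTML)
print(collector.spans[1])  # second span inside the container, by position
```

The span in the unrelated div carries the same data-reactid but is ignored, because the lookup goes by container id and position. Position can also change when the site is redesigned, so this is a trade-off and not a guarantee.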


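The price is not actually contained in the title. Go to the page source and see for yourself. It is much simpler if you just use Beautiful Soup instead of re: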

import requests
from bs4 import BeautifulSoup
url = 'https://finance.yahoo.com/quote/GOOG'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# Use this to look at the source code
# print(soup.prettify())
# Here is the exact class of the span containing the price;
# not sure if it will be the same every time
for span in soup.find_all('span', attrs={'class': 'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'}):
    price = span.text
    break
print(price)
# Output: 989.68
# Here is a more generic attribute for the span. Its value can change as well,
# but that is a simpler change. The price is contained in the first such span,
# so the break makes sure you get the correct one
for span in soup.find_all('span', attrs={'data-reactid': '14'}):
    price = span.text
    break
print(price)
# Output: 989.68

You can also do it this way to get the desired output without using regular expressions:

import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://finance.yahoo.com/quote/GOOG?ltr=1').text, 'lxml')
for item in soup.select("div#quote-header-info"):
    title = item.select("h1")[0].text
    price = [elem.text for elem in item.select("span")[1:3]]
    print("Name: {}\nClosing Status: {}".format(title, ' '.join(price)))

Result:

Name: GOOG - Alphabet Inc.
Closing Status: 989.68 +1.85 (+0.19%)

The desired items can be obtained with regular expressions. Here is the code.

import urllib.request
import re
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")
# read() returns bytes in Python 3, so decode before matching
htmltext = htmlfile.read().decode("utf-8", errors="replace")
# for the title
pattern = re.compile('<title>(.*?)</title>')
title = pattern.findall(htmltext)
print('title:', title[0])
# regularMarketPrice
pattern = re.compile('"regularMarketPrice":{"raw":(.*?),')
regularMarketPrice = pattern.findall(htmltext)
print('regularMarketPrice:', regularMarketPrice[0])
# regularMarketChange
pattern = re.compile('"regularMarketChange":{"raw":(.*?),')
regularMarketChange = pattern.findall(htmltext)
print('regularMarketChange:', regularMarketChange[0])
# regularMarketChangePercent
pattern = re.compile('"regularMarketChangePercent":{"raw":(.*?),')
regularMarketChangePercent = pattern.findall(htmltext)
print('regularMarketChangePercent:', regularMarketChangePercent[0])  # x100 to get percent
# for close time
pattern = re.compile('<span data-reactid="21">At close:(.*?)</span>')
at_close = pattern.findall(htmltext)
print('At close:', at_close[0])

Output:

title: GOOG : Summary for Alphabet Inc. - Yahoo Finance
regularMarketPrice: 989.68
regularMarketChange: 1.8499756
regularMarketChangePercent: 0.0018727671
At close:   4:00PM EDT
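
The values captured this way are strings, and the percent field is a fraction that still needs the "x100" mentioned in the comment. The sketch below shows one possible way to turn them into numbers. It runs against a hypothetical fragment, not the live page: SAMPLE, its "fmt" keys, and the helper name extract_raw are assumptions made for illustration.

```python
import re

# Hypothetical fragment modeled on the fields matched above. The "fmt"
# keys are an assumption about how the embedded data might look.
SAMPLE = (
    '"regularMarketPrice":{"raw":989.68,"fmt":"989.68"},'
    '"regularMarketChange":{"raw":1.8499756,"fmt":"1.85"},'
    '"regularMarketChangePercent":{"raw":0.0018727671,"fmt":"0.19%"}'
)


def extract_raw(text, field):
    """Return the "raw" number stored under `field` as a float, or None."""
    pattern = re.compile(r'"%s":\{"raw":(-?\d+(?:\.\d+)?)' % re.escape(field))
    match = pattern.search(text)
    return float(match.group(1)) if match else None


price = extract_raw(SAMPLE, "regularMarketPrice")
change = extract_raw(SAMPLE, "regularMarketChange")
# The page stores the percentage as a fraction, so multiply by 100.
percent = extract_raw(SAMPLE, "regularMarketChangePercent") * 100

summary = "{:.2f} {:+.2f} ({:+.2f}%)".format(price, change, percent)
print(summary)
```

Matching only digits in the capture group, instead of (.*?), makes the helper return None when the field is missing or not a plain decimal number, which is easier to check than an empty list.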

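I have looked through the HTML source of the page you mentioned. As you said, the price is loaded into the title with the help of JavaScript. If you inspect the HTML source, you can see the script before the title tag. Whenever you make a request to a website from a script, the site returns the HTML code as its response. A Python script does not execute JavaScript, so the price is never loaded into the title. I recommend using the requests library to make the request, because it offers more advanced features (see the Requests documentation). Like the others, I would suggest using BeautifulSoup to parse the HTML; it is easy to understand (see the Beautiful Soup documentation). Use the lxml parser. So, if you follow these points in your script, the code should be: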

import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/GOOG?ltr=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
# The data-reactid value is specific to the page version and may change
price = soup.find("span", {"data-reactid": "35"}).text
print(price)

This should return the price as expected.
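
To illustrate why the title comes back without the price, here is a small sketch that applies the question's regular expression to a hypothetical, heavily shortened stand-in for the server response. RAW_HTML and the function name static_title are assumptions for illustration; only the title text is taken from the output quoted in the question.

```python
import re

# Hypothetical stand-in for the raw response: the <title> is static, and
# the script would rewrite it only when run inside a browser.
RAW_HTML = """<html><head>
<script>/* a browser would run this and update document.title */</script>
<title>GOOG : Summary for Alphabet Inc. - Yahoo Finance</title>
</head><body></body></html>"""


def static_title(html):
    """Return the <title> text exactly as the server sent it, or None."""
    match = re.search(r"<title>(.*?)</title>", html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


title = static_title(RAW_HTML)
print(title)
print("989.68" in title)  # False: nothing has executed the script
```

Neither urllib nor requests runs the script, so whatever the regular expression finds is limited to the text the server actually sent.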
