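Python 3 web scraping: loop over table rows returns only one row

I am trying to extract a table from HTML data and store it in a new DataFrame. I need all of the "td" values, but when I iterate, the loop returns only the first row instead of all rows. Below is my code; the resulting DataFrame contains a single row (0  2008  $15).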
!pip install yfinance
!pip install pandas
!pip install requests
!pip install bs4
!pip install plotly
import yfinance as yf
import pandas as pd
import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots
def make_graph(stock_data, revenue_data, stock):
    # Two stacked subplots: share price on top, revenue below
    fig = make_subplots(rows=2, cols=1, shared_xaxes=True, subplot_titles=("Historical Share Price", "Historical Revenue"), vertical_spacing = .3)
    stock_data_specific = stock_data[stock_data.Date <= '2021-06-14']
    revenue_data_specific = revenue_data[revenue_data.Date <= '2021-04-30']
    fig.add_trace(go.Scatter(x=pd.to_datetime(stock_data_specific.Date, infer_datetime_format=True), y=stock_data_specific.Close.astype("float"), name="Share Price"), row=1, col=1)
    fig.add_trace(go.Scatter(x=pd.to_datetime(revenue_data_specific.Date, infer_datetime_format=True), y=revenue_data_specific.Revenue.astype("float"), name="Revenue"), row=2, col=1)
    fig.update_xaxes(title_text="Date", row=1, col=1)
    fig.update_xaxes(title_text="Date", row=2, col=1)
    fig.update_yaxes(title_text="Price ($US)", row=1, col=1)
    fig.update_yaxes(title_text="Revenue ($US Millions)", row=2, col=1)
    fig.update_layout(showlegend=False,
                      height=900,
                      title=stock,
                      xaxis_rangeslider_visible=True)
    fig.show()
tsla = yf.Ticker("TSLA")
tsla
tesla_data = tsla.history(period="max")
tesla_data
tesla_data.reset_index(inplace=True)
tesla_data.head()
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html.parser')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue
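What happens?
The scraping itself works fine, but you append the data outside the loop, so you only ever get the result of the last iteration.
How to fix?
Fix your indentation and move the append into the loop: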
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue
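Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on current pandas versions. Below is a minimal sketch of the same loop that collects the rows in a plain list and builds the DataFrame once at the end. The inline `sample_html` is a made-up stand-in for the downloaded page (an assumption, not the real macrotrends markup) so that the snippet runs offline.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for requests.get(url).text (not the real page)
sample_html = """
<table>
  <thead><tr><th>Year</th><th>Revenue</th></tr></thead>
  <tbody>
    <tr><td>2020</td><td>$31,536</td></tr>
    <tr><td>2019</td><td>$24,578</td></tr>
    <tr><td>2018</td><td>$21,461</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for row in soup.find("tbody").find_all("tr"):
    col = row.find_all("td")
    # The append happens inside the loop, once per table row
    rows.append({"Date": col[0].text, "Revenue": col[1].text})

# Build the DataFrame once, after the loop has collected every row
tesla_revenue = pd.DataFrame(rows, columns=["Date", "Revenue"])
print(tesla_revenue)
```

Building the frame once from a list also avoids copying the whole DataFrame on every iteration.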
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue"
html_data = requests.get(url).text
soup = BeautifulSoup(html_data, 'html.parser')
tesla_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    date = col[0].text
    revenue = col[1].text
    tesla_revenue = tesla_revenue.append({"Date":date, "Revenue":revenue}, ignore_index=True)
tesla_revenue
Find the main table using the appropriate tag and class:
import requests
from bs4 import BeautifulSoup
res=requests.get("https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue")
soup=BeautifulSoup(res.text,"html.parser")
table=soup.find("table",class_="historical_data_table table")
main_data=table.find_all("tr")
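Now append the data to a list, building a list of lists that holds the row data for the DataFrame:
main_lst=[]
# main_data[0] is the header row, so skip it
for i in main_data[1:]:
    lst=[data.get_text(strip=True) for data in i.find_all("td")]
    main_lst.append(lst)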
Now use that data to build and display the DataFrame:
import pandas as pd
df=pd.DataFrame(columns=["Date","Price"],data=main_lst)
df
Output:
Date Price
0 2020 $31,536
1 2019 $24,578
2 2018 $21,461
3 2017 $11,759
...
Or as a one-liner using pandas:
df=pd.read_html("https://www.macrotrends.net/stocks/charts/TSLA/tesla/revenue")
print(len(df))
print(df[0])
Output:
6
Date Price
0 2020 $31,536
1 2019 $24,578
2 2018 $21,461
3 2017 $11,759
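The scraped values are strings such as `$31,536`, while `make_graph` in the question calls `.astype("float")` on a `Revenue` column, which raises an error on strings in that form. Here is a minimal sketch of one way to clean the column first. The sample frame and its empty last row are assumptions for illustration, and the column is named `Revenue` to match the question rather than `Price` as in the frames above.

```python
import pandas as pd

# Hypothetical sample mimicking the scraped table; the last row has no revenue
df = pd.DataFrame({
    "Date": ["2020", "2019", "2018", "2009"],
    "Revenue": ["$31,536", "$24,578", "$21,461", ""],
})

# Strip the dollar sign and the thousands separators
cleaned = df["Revenue"].str.replace(r"[$,]", "", regex=True)

# Convert to numbers; empty strings become NaN and those rows are dropped
df = df.assign(Revenue=pd.to_numeric(cleaned, errors="coerce")).dropna(subset=["Revenue"])
print(df)
```

After this step the `Revenue` column is numeric, so it can be plotted directly.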