I am trying to scrape data from a website.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.mohfw.gov.in/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'html.parser')
# print(soup)
cases_div = soup.find('div', id='cases')  # renamed to avoid shadowing the built-in id()
table_body = cases_div.find('tbody')
table_rows = table_body.find_all('tr')
sl_no = []
States = []
Cases = []
Recovered = []
Deaths = []
I tried to loop over the table rows and append them to the empty lists above, but I get an error.
for tr in table_rows:
    td = tr.find_all('td')
    sl_no.append(td[0].text)
    States.append(td[1].text)
    Cases.append(td[2].text)
    Recovered.append(td[3].text)
    Deaths.append(td[-1].text)
headers = ['sl_no','States','Cases','Recovered','Deaths']
df = pd.DataFrame(list(zip(sl_no,States,Cases,Recovered,Deaths)),columns=headers)
df1 = df.drop(index=27)
This is my error:
States.append(td[1].text)
IndexError: list index out of range
You can test the length of the td list. The problem is that the last row has length 1, so selecting the second value of the list with td[1] raises the error:
for tr in table_rows:
    td = tr.find_all('td')
    print(len(td))
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
4
1
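The varying lengths above can be reproduced offline with a small, made-up HTML snippet that mimics the page's structure (a summary row spanning several columns yields fewer <td> elements):

```python
from bs4 import BeautifulSoup

# Hypothetical rows: two normal data rows, then a summary row whose single
# <td> spans the remaining columns, so find_all('td') returns fewer cells.
html = """
<tbody>
  <tr><td>1</td><td>Kerala</td><td>202</td><td>19</td><td>1</td></tr>
  <tr><td>2</td><td>Delhi</td><td>87</td><td>6</td><td>2</td></tr>
  <tr><td colspan="4">Total number of confirmed cases</td></tr>
</tbody>
"""
rows = BeautifulSoup(html, 'html.parser').find_all('tr')
lengths = [len(tr.find_all('td')) for tr in rows]
print(lengths)  # [5, 5, 1]
```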
So your solution should be changed to keep only the rows whose td list has length 5:
for tr in table_rows:
    td = tr.find_all('td')
    if len(td) == 5:
        sl_no.append(td[0].text)
        States.append(td[1].text)
        Cases.append(td[2].text)
        Recovered.append(td[3].text)
        Deaths.append(td[-1].text)
headers = ['sl_no','States','Cases','Recovered','Deaths']
df = pd.DataFrame(list(zip(sl_no,States,Cases,Recovered,Deaths)),columns=headers)
print (df)
sl_no States Cases Recovered Deaths
0 1 Andhra Pradesh 23 1 0
1 2 Andaman and Nicobar Islands 9 0 0
2 3 Bihar 15 0 1
3 4 Chandigarh 8 0 0
4 5 Chhattisgarh 7 0 0
5 6 Delhi 87 6 2
6 7 Goa 5 0 0
7 8 Gujarat 69 1 6
8 9 Haryana 36 18 0
9 10 Himachal Pradesh 3 0 1
10 11 Jammu and Kashmir 48 2 2
11 12 Karnataka 83 5 3
12 13 Kerala 202 19 1
13 14 Ladakh 13 3 0
14 15 Madhya Pradesh 47 0 3
15 16 Maharashtra 198 25 8
16 17 Manipur 1 0 0
17 18 Mizoram 1 0 0
18 19 Odisha 3 0 0
19 20 Puducherry 1 0 0
20 21 Punjab 38 1 1
21 22 Rajasthan 59 3 0
22 23 Tamil Nadu 67 4 1
23 24 Telengana 71 1 1
24 25 Uttarakhand 7 2 0
25 26 Uttar Pradesh 82 11 0
26 27 West Bengal 22 0 2
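One caveat with this result: every value appended via .text is a string, so the numeric-looking columns cannot be summed or compared directly. A small sketch of the conversion, using a made-up two-row sample shaped like the DataFrame above:

```python
import pandas as pd

# Hypothetical sample mimicking the scraped DataFrame; note the values
# are strings, exactly as .text returns them.
df = pd.DataFrame({'States': ['Kerala', 'Delhi'],
                   'Cases': ['202', '87'],
                   'Recovered': ['19', '6'],
                   'Deaths': ['1', '2']})
for col in ['Cases', 'Recovered', 'Deaths']:
    df[col] = pd.to_numeric(df[col])  # str -> int64
print(df['Cases'].sum())  # 289
```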
I think you can simplify the code with read_html:
url = "https://www.mohfw.gov.in/"
df = pd.read_html(url)[-1]
Then remove the last 2 rows:
df = df.iloc[:-2]
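read_html also accepts HTML directly, which is handy for trying the approach without hitting the live site. A sketch with a made-up table (wrapped in StringIO, since newer pandas versions deprecate passing a raw HTML string):

```python
from io import StringIO

import pandas as pd

# Hypothetical table mimicking the page's header and first two rows.
html = """
<table>
  <tr><th>S. No.</th><th>Name of State / UT</th><th>Total Confirmed cases *</th></tr>
  <tr><td>1</td><td>Kerala</td><td>202</td></tr>
  <tr><td>2</td><td>Delhi</td><td>87</td></tr>
</table>
"""
df = pd.read_html(StringIO(html))[0]  # returns a list of DataFrames
print(df.shape)  # (2, 3)
```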
print (df)
S. No. Name of State / UT Total Confirmed cases *
0 1 Andhra Pradesh 23
1 2 Andaman and Nicobar Islands 9
2 3 Bihar 15
3 4 Chandigarh 8
4 5 Chhattisgarh 7
5 6 Delhi 87
6 7 Goa 5
7 8 Gujarat 69
8 9 Haryana 36
9 10 Himachal Pradesh 3
10 11 Jammu and Kashmir 48
11 12 Karnataka 83
12 13 Kerala 202
13 14 Ladakh 13
14 15 Madhya Pradesh 47
15 16 Maharashtra 198
16 17 Manipur 1
17 18 Mizoram 1
18 19 Odisha 3
19 20 Puducherry 1
20 21 Punjab 38
21 22 Rajasthan 59
22 23 Tamil Nadu 67
23 24 Telengana 71
24 25 Uttarakhand 7
25 26 Uttar Pradesh 82
26 27 West Bengal 22
Cured/Discharged/Migrated Death
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
5 6 2
6 0 0
7 1 6
8 18 0
9 0 1
10 2 2
11 5 3
12 19 1
13 3 0
14 0 3
15 25 8
16 0 0
17 0 0
18 0 0
19 0 0
20 1 1
21 3 0
22 4 1
23 1 1
24 2 0
25 11 0
26 0 2
One of the <tr> elements does not seem to contain all the <td> elements you think it should.
From a quick look at the data itself, the last <tr> appears to hold some kind of summary across all the states. In that case, you should probably cut off the last <tr> in your for loop:
for tr in table_rows[:-1]:
Or wrap it in a try/except:
for tr in table_rows:
    try:
        td = tr.find_all('td')
        sl_no.append(td[0].text)
        States.append(td[1].text)
        Cases.append(td[2].text)
        Recovered.append(td[3].text)
        Deaths.append(td[-1].text)
    except IndexError:
        # Pass or handle the exception as you wish.
        pass
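An equivalent, arguably tidier variant is to collect the cell texts in one pass and let a length check replace the try/except entirely. A self-contained sketch, again using a hypothetical snippet in place of the live page's rows:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical rows mimicking the page's table; the summary row is short.
html = """
<tbody>
  <tr><td>1</td><td>Kerala</td><td>202</td><td>19</td><td>1</td></tr>
  <tr><td>2</td><td>Delhi</td><td>87</td><td>6</td><td>2</td></tr>
  <tr><td>Total</td></tr>
</tbody>
"""
table_rows = BeautifulSoup(html, 'html.parser').find_all('tr')

headers = ['sl_no', 'States', 'Cases', 'Recovered', 'Deaths']
data = [[td.text for td in tr.find_all('td')] for tr in table_rows]
data = [row for row in data if len(row) == 5]  # drop incomplete rows
df = pd.DataFrame(data, columns=headers)
print(df['States'].tolist())  # ['Kerala', 'Delhi']
```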