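I'm trying to build a DataFrame from the following page: https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly
If you look at the column headers, Water has a superscript "a" that is a hyperlink, and Protein has a "b", so my DataFrame column headers end up as "Watera" and "Proteinb".
I could go through and fix them one by one, but is there a way to ignore subscripts, superscripts, or hyperlinks programmatically?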
You can remove the <sup> tags with the help of BeautifulSoup before handing the HTML to pandas, for example:
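from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# remove every <sup> element, including the footnote links nested inside it
for sup in soup.select('sup'):
    sup.extract()

# wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(soup)))[0]
print(df)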
Prints:
Life StageGroup Total Water(L/d) ... α-Linolenic Acid(g/d) Protein(g/d)
0 Infants NaN ... NaN NaN
1 0–6 mo 0.7* ... 0.5* 9.1*
2 6–12 mo 0.8* ... 0.5* 11.0
3 Children NaN ... NaN NaN
4 1–3 y 1.3* ... 0.7* 13
5 4–8 y 1.7* ... 0.9* 19
6 Males NaN ... NaN NaN
7 9–13 y 2.4* ... 1.2* 34
8 14–18 y 3.3* ... 1.6* 52
9 19–30 y 3.7* ... 1.6* 56
10 31–50 y 3.7* ... 1.6* 56
11 51–70 y 3.7* ... 1.6* 56
12 > 70 y 3.7* ... 1.6* 56
13 Females NaN ... NaN NaN
14 9–13 y 2.1* ... 1.0* 34
15 14–18 y 2.3* ... 1.1* 46
16 19–30 y 2.7* ... 1.1* 46
17 31–50 y 2.7* ... 1.1* 46
18 51–70 y 2.7* ... 1.1* 46
19 > 70 y 2.7* ... 1.1* 46
20 Pregnancy NaN ... NaN NaN
21 14–18 y 3.0* ... 1.4* 71
22 19–30 y 3.0* ... 1.4* 71
23 31–50 y 3.0* ... 1.4* 71
24 Lactation NaN ... NaN NaN
25 14–18 3.8* ... 1.3* 71
26 19–30 y 3.8* ... 1.3* 71
27 31–50 y 3.8* ... 1.3* 71
[28 rows x 8 columns]
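The question also asks about subscripts and hyperlinks, which the snippet above does not touch. A minimal sketch of one way to extend it follows, assuming you want to drop <sup> and <sub> elements entirely but keep the text of any remaining links. The `clean_table_html` helper and the sample markup are hypothetical and only loosely modelled on the page in the question, so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on the header cells in the question.
SAMPLE_HTML = """
<table>
  <tr>
    <th>Total Water<sup><a href="#fn-a">a</a></sup> (L/d)</th>
    <th>Protein<sup>b</sup> (g/d)</th>
    <th><a href="#fat">Fat</a> (g/d)</th>
  </tr>
  <tr><td>0.7*</td><td>9.1*</td><td>31*</td></tr>
</table>
"""


def clean_table_html(html, drop=("sup", "sub"), unwrap=("a",)):
    """Return HTML with `drop` tags removed and `unwrap` tags replaced by their contents.

    Hypothetical helper for illustration; not part of BeautifulSoup or pandas.
    """
    soup = BeautifulSoup(html, "html.parser")
    if drop:
        # decompose() removes the tag together with everything nested inside it
        for tag in soup.find_all(list(drop)):
            tag.decompose()
    if unwrap:
        # unwrap() removes only the tag itself and keeps its text in place
        for tag in soup.find_all(list(unwrap)):
            tag.unwrap()
    return str(soup)


cleaned = clean_table_html(SAMPLE_HTML)

# Re-parse the cleaned markup and read the header cells back to check the result.
headers = [th.get_text(strip=True)
           for th in BeautifulSoup(cleaned, "html.parser").select("th")]
print(headers)  # ['Total Water (L/d)', 'Protein (g/d)', 'Fat (g/d)']
```

The cleaned string can then be passed to pd.read_html in the same way as in the answer above. The order matters here: footnote links inside <sup> are removed along with the superscript, while links elsewhere only lose their <a> wrapper.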