pandas read_html() ignoring superscripts and subscripts



I am trying to build a DataFrame from the following page: https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly

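If you look at the column headers, Water carries a superscript "a" that is a hyperlink, and Protein carries a superscript "b", so my DataFrame column headers end up as "Watera" and "Proteinb".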

I could go through and edit them one by one, but is there a way to ignore subscripts, superscripts, or hyperlinks programmatically?

You can remove the <sup> tags with the help of BeautifulSoup before handing the HTML to pandas, for example:

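from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.ncbi.nlm.nih.gov/books/NBK56068/table/summarytables.t4/?report=objectonly'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# remove every <sup> element (and the footnote links inside it) from the tree
for sup in soup.select('sup'):
    sup.extract()

# wrap the HTML in StringIO; recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(soup)))[0]
print(df)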

Prints:

Life StageGroup Total Water(L/d)  ... α-Linolenic Acid(g/d) Protein(g/d)
0          Infants              NaN  ...                   NaN          NaN
1           0–6 mo             0.7*  ...                  0.5*         9.1*
2          6–12 mo             0.8*  ...                  0.5*         11.0
3         Children              NaN  ...                   NaN          NaN
4            1–3 y             1.3*  ...                  0.7*           13
5            4–8 y             1.7*  ...                  0.9*           19
6            Males              NaN  ...                   NaN          NaN
7           9–13 y             2.4*  ...                  1.2*           34
8          14–18 y             3.3*  ...                  1.6*           52
9          19–30 y             3.7*  ...                  1.6*           56
10         31–50 y             3.7*  ...                  1.6*           56
11         51–70 y             3.7*  ...                  1.6*           56
12          > 70 y             3.7*  ...                  1.6*           56
13         Females              NaN  ...                   NaN          NaN
14          9–13 y             2.1*  ...                  1.0*           34
15         14–18 y             2.3*  ...                  1.1*           46
16         19–30 y             2.7*  ...                  1.1*           46
17         31–50 y             2.7*  ...                  1.1*           46
18         51–70 y             2.7*  ...                  1.1*           46
19          > 70 y             2.7*  ...                  1.1*           46
20       Pregnancy              NaN  ...                   NaN          NaN
21         14–18 y             3.0*  ...                  1.4*           71
22         19–30 y             3.0*  ...                  1.4*           71
23         31–50 y             3.0*  ...                  1.4*           71
24       Lactation              NaN  ...                   NaN          NaN
25           14–18             3.8*  ...                  1.3*           71
26         19–30 y             3.8*  ...                  1.3*           71
27         31–50 y             3.8*  ...                  1.3*           71
[28 rows x 8 columns]
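
The same idea can be tried offline on a small inline table, which makes it easier to check what the cleanup does before pointing it at the real page. This is only a sketch: the sample markup and the helper name strip_tags are made up for illustration and are not part of pandas or BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modeled on the headers described in the question.
SAMPLE_HTML = """
<table>
  <tr>
    <th>Total Water<sup><a href="#fn-a">a</a></sup> (L/d)</th>
    <th>Protein<sup>b</sup> (g/d)</th>
  </tr>
  <tr><td>0.7*</td><td>9.1*</td></tr>
</table>
"""


def strip_tags(html, names=("sup",)):
    """Return html with every element whose tag name is in names removed.

    strip_tags is an illustrative helper, not a library function.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names; extract() detaches each match,
    # together with anything nested inside it (such as the footnote link)
    for tag in soup.find_all(list(names)):
        tag.extract()
    return str(soup)


cleaned = strip_tags(SAMPLE_HTML)

# Read the header text back to confirm the footnote markers are gone
headers = [th.get_text() for th in BeautifulSoup(cleaned, "html.parser").find_all("th")]
print(headers)
```

The cleaned string can then be passed to pd.read_html() as in the answer above. Passing names=("sup", "sub") would drop subscripts as well, but note that this deletes their text entirely, which may not be what you want for headers where the subscript is part of the name.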
