Python: How to remove footnotes when loading data, and how to select the first number when a cell contains a pair



I am new to Python and am looking for help.

import requests
import bs4 as bs

resp = requests.get("https://en.wikipedia.org/wiki/World_War_II_casualties")
soup = bs.BeautifulSoup(resp.text, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})
deaths = []
for row in table.find_all('tr')[1:]:
    # Text of the sixth cell (index 5) in each row
    death = row.find_all('td')[5].text.strip()
    deaths.append(death)

The output is:

['30,000',
'40,400',
'',
'88,000',
'2,000',
'21,500',
'252,600',
'43,600',
'15,000,000[35]to 20,000,000[35]',
'100',
'340,000 to 355,000',
'6,000',
'3,000,000to 4,000,000',
'1,100',
'83,000',
'100,000[49]',
'85,000 to 95,000',
'600,000',
'1,000,000to 2,200,000',
'6,900,000 to 7,400,000',
...
'557,000',
'5,900,000[115] to 6,000,000[116]',
'40,000to 70,000',
'500,000[39]',
'36,000–50,000',
'11,900',
'10,000',
'20,000,000[141] to 27,000,000[142][143][144][145][146]',
'',
'2,100',
'100',
'7,600',
'200',
'450,900',
'419,400',
'1,027,000[160] to 1,700,000[159]',
'',
'70,000,000to 85,000,000']

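I want to plot a chart, but the [] footnotes completely ruin it, and many of the values carry footnotes. When a cell contains a pair of numbers, is it also possible to select only the first one? I would really appreciate it if someone could show me how. Thank you.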

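You can call find_next() with the text=True parameter on each cell to get its first piece of text (the part before any footnote marker), then split/strip the result accordingly.

For example: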

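import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/World_War_II_casualties'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Rows of the first table that contain data cells, skipping the first one
for tr in soup.table.select('tr:has(td)')[1:]:
    tds = tr.select('td')
    # Skip rows whose first cell has no bold name
    if not tds[0].b:
        continue
    name = tds[0].b.get_text(strip=True, separator=' ')
    # First text node inside the cell, i.e. the text before any footnote link
    casualties = tds[5].find_next(text=True).strip()
    # Keep only the first number of a range such as '36,000–50,000'
    first = casualties.split('–')[0].split()[0] if casualties else ''
    print('{:<30} {}'.format(name, first))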

Prints:

Albania                        30,000
Australia                      40,400
Austria                        
Belgium                        88,000
Brazil                         2,000
Bulgaria                       21,500
Burma                          252,600
Canada                         43,600
China                          15,000,000
Cuba                           100
Czechoslovakia                 340,000
Denmark                        6,000
Dutch East Indies              3,000,000
Egypt                          1,100
Estonia                        83,000
Ethiopia                       100,000
Finland                        85,000
France                         600,000
French Indochina               1,000,000
Germany                        6,900,000
Greece                         507,000
Guam                           1,000
Hungary                        464,000
Iceland                        200
India                          2,200,000
Iran                           200
Iraq                           700
Ireland                        100
Italy                          492,400
Japan                          2,500,000
Korea                          483,000
Latvia                         250,000
Lithuania                      370,000
Luxembourg                     5,000
Malaya & Singapore             100,000
Malta                          1,500
Mexico                         100
Mongolia                       300
Nauru                          500
Nepal                          
Netherlands                    210,000
Newfoundland                   1,200
New Zealand                    11,700
Norway                         10,200
Papua and New Guinea           15,000
Philippines                    557,000
Poland                         5,900,000
Portuguese Timor               40,000
Romania                        500,000
Ruanda-Urundi                  36,000
South Africa                   11,900
South Pacific Mandate          10,000
Soviet Union                   20,000,000
Spain                          
Sweden                         2,100
Switzerland                    100
Thailand                       7,600
Turkey                         200
United Kingdom                 450,900
United States                  419,400
Yugoslavia                     1,027,000
Approx. totals                 70,000,000
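
If the strings have already been collected, as in the deaths list from the question, a regular expression is another way to clean them. The following is only a minimal sketch: the helper name first_number is made up for illustration, and it assumes footnote markers always look like [35] and that numbers use commas as thousands separators. It returns an int, which is easier to plot than a string.

```python
import re

def first_number(cell):
    """Return the first number in a scraped cell as an int, or None if there is none."""
    # Drop footnote markers such as [35] or [142] (assumed to be digits in brackets).
    cleaned = re.sub(r'\[\d+\]', '', cell)
    # Find the first run of digits and commas, e.g. '15,000,000'.
    match = re.search(r'\d[\d,]*', cleaned)
    if match is None:
        return None
    # Remove the thousands separators before converting.
    return int(match.group().replace(',', ''))

samples = ['30,000', '', '15,000,000[35]to 20,000,000[35]',
           '36,000–50,000', '100,000[49]']
print([first_number(s) for s in samples])
# → [30000, None, 15000000, 36000, 100000]
```

Empty cells come back as None, so they can be filtered out or treated as missing values before plotting.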
