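I am trying to remove all newlines and tabs and print only the text from a list variable named tables. The tables were obtained by scraping a page on the World Health Organization website.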
In [43]: tables[0] = tables[0].text.strip().replace('\n','').replace('\t','')
In [44]: tables[0]
Out[44]: u'A Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan'
This worked fine until I tried to iterate over the tables, at which point the following happened.
In [45]: for i in tables:
    ...:     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
    ...:     print(tables[i])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-45-7630bb467dfd> in <module>()
      1 for i in tables:
----> 2     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
      3     print(tables[i])
TypeError: list indices must be integers, not unicode
Here is another attempt of mine that also failed:
In [47]: for i in range(len(tables)):
    ...:     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
    ...:     print(tables[i])
    ...:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-47-84306fc0c373> in <module>()
      1 for i in range(len(tables)):
----> 2     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
      3     print(tables[i])
AttributeError: 'unicode' object has no attribute 'text'
As a humble beginner, I am asking for your help, folks!
Here is my solution:
In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: url = 'https://www.who.int/countries/en/'
In [4]: content = requests.get(url).content
In [5]: soup = BeautifulSoup(content,'html5lib')
In [6]: divs = soup.findAll('div', attrs={'class':'largebox'})
In [7]: countries = []
In [8]: for div in divs:
   ...:     li = div.findAll('li')
   ...:     for l in li:
   ...:         print(l.a.text)
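This is a rather rigid solution, but it works. I determined that each letter of the alphabet has its own div with class = largebox, and that each country is an HTML li item whose name sits inside an anchor (a) tag. Iterating over all the divs and their list items therefore gives the following result.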
Afghanistan
Albania
Algeria
Andorra
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia (Plurinational State of)
Bosnia and Herzegovina
Botswana
Brazil
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cabo Verde
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo
Cook Islands
Costa Rica
Côte d'Ivoire
Croatia
Cuba
Cyprus
Czechia
Democratic People's Republic of Korea
Democratic Republic of the Congo
Denmark
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Eswatini
Ethiopia
Fiji
Finland
France
Gabon
Gambia
Georgia
Germany
Ghana
Greece
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Honduras
Hungary
Iceland
India
Indonesia
Iran (Islamic Republic of)
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Kuwait
Kyrgyzstan
Lao People's Democratic Republic
Latvia
Lebanon
Lesotho
Liberia
Libya
Lithuania
Luxembourg
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall Islands
Mauritania
Mauritius
Mexico
Micronesia (Federated States of)
Monaco
Mongolia
Montenegro
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands
New Zealand
Nicaragua
Niger
Nigeria
Niue
North Macedonia
Norway
Oman
Pakistan
Palau
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Poland
Portugal
Qatar
Republic of Korea
Republic of Moldova
Romania
Russian Federation
Rwanda
Saint Kitts and Nevis
Saint Lucia
Saint Vincent and the Grenadines
Samoa
San Marino
Sao Tome and Principe
Saudi Arabia
Senegal
Serbia
Seychelles
Sierra Leone
Singapore
Slovakia
Slovenia
Solomon Islands
Somalia
South Africa
South Sudan
Spain
Sri Lanka
Sudan
Suriname
Sweden
Switzerland
Syrian Arab Republic
Tajikistan
Thailand
Timor-Leste
Togo
Tonga
Trinidad and Tobago
Tunisia
Turkey
Turkmenistan
Tuvalu
Uganda
Ukraine
United Arab Emirates
United Kingdom
United Republic of Tanzania
United States of America
Uruguay
Uzbekistan
Vanuatu
Venezuela (Bolivarian Republic of)
Viet Nam
Yemen
Zambia
Zimbabwe
Update: a better version can be found here.
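The last code block you showed should work. The error you got indicates that your tables object (I'm guessing it's a list?) contains data of inconsistent types. Some entries may be objects with a text attribute, but at least one entry is a Unicode string. Perhaps that is because you manually modified tables[0] with the code from your first code block? If so, rebuilding the data (for example by scraping again, or by reprocessing some earlier intermediate result) should fix the error. Alternatively, if only index 0 was modified and you don't mind a one-off fix, just change the range to start from index 1 (for i in range(1, len(tables))).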
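If rebuilding the data is inconvenient, another option is a helper that accepts both kinds of entry. The following is only a minimal sketch: it assumes Python 3 (where the string type is str rather than unicode), the name clean_entry is made up for illustration, and SimpleNamespace objects stand in for the scraped elements so the example runs without any network access.

```python
from types import SimpleNamespace


def clean_entry(entry):
    """Return cleaned text whether entry is a plain string or has a .text attribute."""
    # Entries that were already processed are plain strings;
    # unprocessed entries are tag-like objects exposing .text.
    text = entry if isinstance(entry, str) else entry.text
    return text.strip().replace('\n', '').replace('\t', '')


# Stand-in data (illustration only): index 0 was already cleaned by hand,
# index 1 is still a tag-like object.
tables = [
    'A Afghanistan Albania',
    SimpleNamespace(text='\n\tB Bahamas Bahrain\n'),
]

cleaned = [clean_entry(entry) for entry in tables]
print(cleaned)
```

With a helper like this, running the loop a second time over a partly processed list no longer raises AttributeError.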
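The middle code block doesn't work because you are iterating over the items in the list but then trying to use them as indices. You could use enumerate to get the index as you go, and use that index for the in-place assignment: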
for i, value in enumerate(tables):
    tables[i] = some_processing(value) # details of the processing omitted for brevity
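To make the difference concrete, here is a small runnable sketch using plain strings instead of scraped elements. The sample data and the body of some_processing are assumptions chosen for illustration; your real processing would call .text on each element first.

```python
def some_processing(value):
    # Hypothetical stand-in for the real processing: trim the ends
    # and drop any remaining newlines and tabs.
    return value.strip().replace('\n', '').replace('\t', '')


tables = ['  Afghanistan\n', '\tAlbania\t', 'Al\ngeria']

# Iterating directly yields the items themselves, so using one
# as a list index raises TypeError.
try:
    for item in tables:
        tables[item]
except TypeError as exc:
    print('TypeError:', exc)

# enumerate yields (index, item) pairs, so in-place assignment works.
for i, value in enumerate(tables):
    tables[i] = some_processing(value)

print(tables)
```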
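Another approach, which is often better, is to create a new list with the processed values rather than trying to modify the original list. This also avoids problems like the first one I mentioned above: the original list stays unchanged, so if you find that you didn't get the processing right the first time, you can simply run it again. A list comprehension is usually a good way to build a new list from an existing one. I would use: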
processed_tables = [process(value) for value in tables]
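As a rough illustration of why keeping the original list helps, the sketch below processes the same raw data twice. The sample strings, the name raw_tables, and the body of process are assumptions made for the example.

```python
def process(value):
    # Hypothetical processing: trim the ends, then delete newlines and tabs.
    return value.strip().replace('\n', '').replace('\t', '')


raw_tables = ['\nA\tAfghanistan\n', '\nB\tBahamas\n']

# First attempt: deleting the tab glues the letter onto the country name.
processed_tables = [process(value) for value in raw_tables]

# raw_tables is untouched, so a different cleanup can be tried right away,
# here collapsing every run of whitespace into a single space.
respaced_tables = [' '.join(value.split()) for value in raw_tables]

print(processed_tables)
print(respaced_tables)
```

Had the first attempt overwritten raw_tables in place, the second cleanup would have had nothing left to work from.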
Or, if you don't need to keep the processed values at all and only want to print them, you can use a simple loop and a local variable:
for value in tables:
    processed_value = process(value)
    print(processed_value)
You could even skip the local variable assignment and do everything in one line, using a loop body like print(process(value)).