How should we handle the error "list indices must be integers, not unicode"?



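I am trying to remove all newlines and tabs and print only the text from a list variable named tables. The tables were obtained by scraping a page on the World Health Organization website.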

In [43]: tables[0] = tables[0].text.strip().replace('\n','').replace('\t','')
In [44]: tables[0]
Out[44]: u'A    Afghanistan    Albania    Algeria    Andorra    Angola    Antigua and Barbuda    Argentina    Armenia    Australia    Austria    Azerbaijan'

This worked fine until I tried to iterate over the tables, at which point the following happened.

In [45]: for i in tables:
    ...:     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
    ...:     print(tables[i])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-7630bb467dfd> in <module>()
      1 for i in tables:
----> 2     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
      3     print(tables[i])
TypeError: list indices must be integers, not unicode

Here is my other failed attempt:

In [47]: for i in range(len(tables)):
    ...:     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
    ...:     print(tables[i])
    ...:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-47-84306fc0c373> in <module>()
      1 for i in range(len(tables)):
----> 2     tables[i] = tables[i].text.strip().replace('\n','').replace('\t','')
      3     print(tables[i])
AttributeError: 'unicode' object has no attribute 'text'

As a humble beginner, I'm asking for your help, folks!

Here is my solution:

In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: url = 'https://www.who.int/countries/en/'
In [4]: content = requests.get(url).content
In [5]: soup = BeautifulSoup(content, 'html5lib')
In [6]: divs = soup.findAll('div', attrs={'class': 'largebox'})
In [7]: countries = []
In [8]: for div in divs:
   ...:     li = div.findAll('li')
   ...:     for l in li:
   ...:         print(l.a.text)

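It is a somewhat rigid solution, but it works. I determined that each letter of the alphabet has its own div with class="largebox", and that each country is an HTML li item wrapped in an anchor (a) tag. Iterating over all the divs and their list items therefore gave the following result.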

Afghanistan 
Albania 
Algeria 
Andorra 
Angola 
Antigua and Barbuda 
Argentina 
Armenia 
Australia 
Austria 
Azerbaijan 
Bahamas 
Bahrain 
Bangladesh 
Barbados 
Belarus 
Belgium 
Belize 
Benin 
Bhutan 
Bolivia (Plurinational State of) 
Bosnia and Herzegovina 
Botswana 
Brazil 
Brunei Darussalam 
Bulgaria 
Burkina Faso 
Burundi 
Cabo Verde 
Cambodia 
Cameroon 
Canada 
Central African Republic 
Chad 
Chile 
China 
Colombia 
Comoros 
Congo 
Cook Islands 
Costa Rica 
Côte d'Ivoire 
Croatia 
Cuba 
Cyprus 
Czechia 
Democratic People's Republic of Korea 
Democratic Republic of the Congo 
Denmark 
Djibouti 
Dominica 
Dominican Republic 
Ecuador 
Egypt 
El Salvador 
Equatorial Guinea 
Eritrea 
Estonia 
Eswatini 
Ethiopia 
Fiji 
Finland 
France 
Gabon 
Gambia 
Georgia 
Germany 
Ghana 
Greece 
Grenada 
Guatemala 
Guinea 
Guinea-Bissau 
Guyana 
Haiti 
Honduras 
Hungary 
Iceland 
India 
Indonesia 
Iran (Islamic Republic of) 
Iraq 
Ireland 
Israel 
Italy 
Jamaica 
Japan 
Jordan 
Kazakhstan 
Kenya 
Kiribati 
Kuwait 
Kyrgyzstan 
Lao People's Democratic Republic 
Latvia 
Lebanon 
Lesotho 
Liberia 
Libya 
Lithuania 
Luxembourg 
Madagascar 
Malawi 
Malaysia 
Maldives 
Mali 
Malta 
Marshall Islands 
Mauritania 
Mauritius 
Mexico 
Micronesia (Federated States of) 
Monaco 
Mongolia 
Montenegro 
Morocco 
Mozambique 
Myanmar 
Namibia 
Nauru 
Nepal 
Netherlands 
New Zealand 
Nicaragua 
Niger 
Nigeria 
Niue 
North Macedonia 
Norway 
Oman 
Pakistan 
Palau 
Panama 
Papua New Guinea 
Paraguay 
Peru 
Philippines 
Poland 
Portugal 
Qatar 
Republic of Korea 
Republic of Moldova 
Romania 
Russian Federation 
Rwanda 
Saint Kitts and Nevis 
Saint Lucia 
Saint Vincent and the Grenadines 
Samoa 
San Marino 
Sao Tome and Principe 
Saudi Arabia 
Senegal 
Serbia  
Seychelles 
Sierra Leone 
Singapore 
Slovakia 
Slovenia 
Solomon Islands 
Somalia 
South Africa 
South Sudan 
Spain 
Sri Lanka 
Sudan 
Suriname 
Sweden 
Switzerland 
Syrian Arab Republic 
Tajikistan 
Thailand 
Timor-Leste 
Togo 
Tonga 
Trinidad and Tobago 
Tunisia 
Turkey 
Turkmenistan 
Tuvalu 
Uganda 
Ukraine 
United Arab Emirates 
United Kingdom 
United Republic of Tanzania 
United States of America 
Uruguay 
Uzbekistan 
Vanuatu 
Venezuela (Bolivarian Republic of) 
Viet Nam 
Yemen 
Zambia 
Zimbabwe
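To make the page structure described above easier to see, here is a minimal sketch that walks the same div.largebox > li > a nesting and collects the names into a list instead of printing them. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without a network connection or third-party packages. The SAMPLE_HTML markup, the smallbox div, and the CountryParser class are assumptions made up for illustration; they are not taken from the real WHO page.

```python
from html.parser import HTMLParser

# Hypothetical sample markup that mimics the structure described above.
SAMPLE_HTML = """
<div class="largebox"><ul>
  <li><a href="#">Afghanistan</a></li>
  <li><a href="#">Albania</a></li>
</ul></div>
<div class="smallbox"><ul>
  <li><a href="#">About</a></li>
</ul></div>
<div class="largebox"><ul>
  <li><a href="#">Bahamas</a></li>
</ul></div>
"""


class CountryParser(HTMLParser):
    """Collects the text of <a> tags inside <li> items of div.largebox."""

    def __init__(self):
        super().__init__()
        self.countries = []
        self._div_stack = []  # one bool per open <div>: is it a largebox?
        self._in_li = False
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self._div_stack.append("largebox" in classes)
        elif tag == "li":
            self._in_li = True
        elif tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            self._div_stack.pop()
        elif tag == "li":
            self._in_li = False
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        # Keep only non-empty link text found inside a largebox div.
        if any(self._div_stack) and self._in_li and self._in_a and data.strip():
            self.countries.append(data.strip())


parser = CountryParser()
parser.feed(SAMPLE_HTML)
countries = parser.countries
print(countries)
```

With BeautifulSoup, the equivalent change to the session above would probably be to call countries.append(l.a.text) inside the inner loop, so the names stay available after the loop finishes.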

Update: a better version can be found here.

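The last code block in your question (the one using range(len(tables))) should work. The error you get suggests that your tables object (a list, I'm guessing?) contains data of inconsistent types. Some entries may be objects with a text attribute, but at least one entry is a Unicode string. Perhaps that's because you manually modified tables[0] with the code in your first block? If so, rebuilding the data (for example by re-scraping, or by reprocessing some earlier intermediate result) should fix the error. Alternatively, if only index 0 was modified and you don't mind a one-off fix, just change the range to start at index 1 (for i in range(1, len(tables))).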
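The mixed-type situation can be reproduced without scraping anything. The sketch below is only an illustration: FakeTag is a hypothetical stand-in for whatever element objects the scraper returned, and clean is a made-up helper that tolerates both kinds of entry by falling back to the value itself when there is no text attribute.

```python
class FakeTag:
    """Hypothetical stand-in for a scraped element exposing a .text attribute."""

    def __init__(self, text):
        self.text = text


def clean(value):
    # Use .text when the item has one; otherwise assume it is already a string.
    text = getattr(value, "text", value)
    return text.strip().replace("\n", "").replace("\t", "")


tables = [FakeTag("\n\tA    Afghanistan\n"), FakeTag("\n\tB    Bahamas\t\n")]

# Mimic the manual step from the question: index 0 becomes a plain string.
tables[0] = clean(tables[0])

# tables[0].text would now raise AttributeError, but clean() handles both types.
cleaned = [clean(item) for item in tables]
print(cleaned)
```

This kind of fallback hides the inconsistency rather than removing it, so rebuilding the list from the original data is still the cleaner fix.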

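The middle code block doesn't work because you are iterating over the items in the list but trying to use them as indices. You can use enumerate to get the index as you go, and use that index for in-place assignment: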

for i, value in enumerate(tables):
    tables[i] = some_processing(value)  # details of the processing omitted for brevity
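A small runnable version of the same idea, using plain strings so that it needs no scraping. The rows list and the strip_whitespace helper are assumptions standing in for the real data and for some_processing. The first loop reproduces the mistake from the question (on Python 3 the message names str rather than unicode), and the second applies the enumerate fix.

```python
def strip_whitespace(text):
    # Hypothetical processing step: drop surrounding whitespace, newlines and tabs.
    return text.strip().replace("\n", "").replace("\t", "")


rows = ["\n\tAlbania\n", "\tAlgeria\t", "Andorra\n"]

# Iterating yields the items themselves, so using them as indices fails.
error = None
try:
    for item in rows:
        rows[item]
except TypeError as exc:
    error = str(exc)
print(error)

# enumerate supplies the integer index alongside each item.
for i, value in enumerate(rows):
    rows[i] = strip_whitespace(value)
print(rows)
```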

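Another, often better, approach is to build a new list from the processed values rather than modifying the original list. This also avoids the first problem I mentioned above: because the original list stays unchanged, you can reprocess it if you find you got something wrong the first time you built the new list. A list comprehension is usually a good way to build a new list from an existing one. I would use: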

processed_tables = [process(value) for value in tables]
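As a concrete sketch of why leaving the original untouched helps, the example below runs the comprehension twice over the same input. The raw_tables data and both process variants are hypothetical; the point is only that a first attempt that turns out to be wrong can be redone, because nothing was overwritten.

```python
def process(value):
    # First attempt (hypothetical): forgets to remove tabs inside the text.
    return value.strip().replace("\n", "")


def process_fixed(value):
    # Corrected attempt: removes both newlines and tabs.
    return value.strip().replace("\n", "").replace("\t", "")


raw_tables = ["\n\tAngola\t(AGO)\n", "\tArgentina\t(ARG)\n"]
original_copy = list(raw_tables)

processed_tables = [process(value) for value in raw_tables]

# raw_tables is unchanged, so the data can simply be processed again.
processed_tables = [process_fixed(value) for value in raw_tables]
print(processed_tables)
```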

Or, if you don't need to keep the processed values at all and only want to print them, you can use a simple loop with a local variable:

for value in tables:
    processed_value = process(value)
    print(processed_value)

You could even skip the local variable assignment and do it all in one line, with a loop body like print(process(value)).
