'Forbidden' error when web scraping with pandas



How can I get around this 'Forbidden' error when scraping data from the web?

import pandas as pd

table_Populations = pd.read_html("https://www.worldometers.info/world-population/population-by-country/", match="Countries in the world by population (2022)")
df_Populations = pd.DataFrame(table_Populations[0])
# Rename 'Country (or dependency)' to 'Country'
df_Populations.rename(columns={'Country (or dependency)': 'Country'}, inplace=True)

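You need to send a User-Agent header to avoid the 403 Forbidden status. pd.read_html fetches the URL without a browser-like User-Agent, which this site appears to reject, so download the page with requests and pass the HTML to pandas.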

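import io
import requests
import pandas as pd
url = 'https://www.worldometers.info/world-population/population-by-country/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers)
response.raise_for_status()  # raise an exception if the server still answers 403
df = pd.read_html(io.StringIO(response.text))[0]  # StringIO: recent pandas deprecates raw HTML strings
print(df)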

Output:

# Country (or dependency)  Population (2020)  ... Med. Age  Urban Pop %  World Share
0      1                Honduras            9904607  ...       24         57 %       0.13 %
1      2    United Arab Emirates            9890402  ...       33         86 %       0.13 %
2      3                Djibouti             988000  ...       27         79 %       0.01 %
3      4        Saint Barthelemy               9877  ...     N.A.          0 %       0.00 %
4      5              Seychelles              98347  ...       34         56 %       0.00 %
..   ...                     ...                ...  ...      ...          ...          ...
230  231                  Jordan           10203134  ...       24         91 %       0.13 %
231  232                Portugal           10196709  ...       46         66 %       0.13 %
232  233              Azerbaijan           10139177  ...       32         56 %       0.13 %   
233  234                  Sweden           10099265  ...       41         88 %       0.13 %   
234  235                   India                  0  ...       28         N.A.       0.00 %   
[235 rows x 12 columns]
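
The snippet above drops the `match` filter and the column rename from the question. Below is a minimal sketch of how they might be restored. It assumes the target table still contains the 'Country (or dependency)' header shown in the output. The names `fetch_population_table` and `rename_country_column` are hypothetical helpers, not part of pandas. Since `match` is interpreted as a regular expression, the literal parentheses are escaped with `re.escape`.

```python
import io
import re

import pandas as pd
import requests

URL = "https://www.worldometers.info/world-population/population-by-country/"
# Browser-like User-Agent copied from the answer above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
    )
}


def rename_country_column(df):
    """Return a copy with 'Country (or dependency)' renamed to 'Country'."""
    return df.rename(columns={"Country (or dependency)": "Country"})


def fetch_population_table(url=URL, headers=HEADERS):
    """Download the page with a User-Agent header and parse the matching table."""
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on a 403
    # match is a regex, so escape the parentheses in the header text.
    pattern = re.escape("Country (or dependency)")
    tables = pd.read_html(io.StringIO(response.text), match=pattern)
    return rename_country_column(tables[0])
```

Calling `fetch_population_table()` performs the network request; `rename_country_column` can be tried on any DataFrame without one.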

If it still fails, your IP may have been blocked after repeated attempts. In that case the code should work through a proxy or VPN.
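
Routing the request through a proxy can be sketched with the `proxies` argument of `requests.get`. This is only an illustration under assumptions: any proxy address you pass in is your own, none is provided here, and `build_proxies` and `fetch_html_via_proxy` are hypothetical helper names.

```python
import requests

# Browser-like User-Agent copied from the answer above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
    )
}


def build_proxies(proxy_url):
    """Map both URL schemes to the same proxy, in the dict shape requests expects."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_html_via_proxy(url, proxy_url, headers=HEADERS, get=requests.get):
    """Fetch url through proxy_url and return the response body as text.

    `get` is injectable so the function can be exercised without a network.
    """
    response = get(
        url,
        headers=headers,
        proxies=build_proxies(proxy_url),
        timeout=30,
    )
    response.raise_for_status()  # surface a 403 instead of parsing an error page
    return response.text
```

The returned HTML can be wrapped in `io.StringIO` and passed to `pd.read_html`. A system-wide VPN needs no code change at all.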
