Wikipedia scraping - need assistance to structure it

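I am trying to scrape this Wikipedia page.

I am running into a few problems and would appreciate your help:

Some rows have more than one name or link, and I want all of them assigned to the correct country. Is there any way to do that?

I want to skip the "Name (native)" column. How can I do that?

If I do scrape the "Name (native)" column, I get some gibberish. Is there any way to encode it properly?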

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
# First table on the page with class "wikitable"
table = soup.find('table', class_='wikitable').tbody
rows = table.find_all('tr')
# Encode each cell to UTF-8 bytes, then strip non-breaking spaces and newlines
columns = [col.text.encode('utf-8').replace(b'\xc2\xa0', b'').replace(b'\n', b'') for col in rows[1].find_all('td')]
print(columns)

You can use the pandas function read_html and take the second DataFrame from the list of DataFrames it returns:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
# read_html returns a list of DataFrames; the gazette table is the second one
df = pd.read_html(url)[1]
print(df.head())
       Country/region                                              Name  
0              Albania       Official Gazette of the Republic of Albania   
1              Algeria                                  Official Gazette   
2              Andorra  Official Bulletin of the Principality of Andorra   
3  Antigua and Barbuda              Antigua and Barbuda Official Gazette   
4            Argentina     Official Gazette of the Republic of Argentina   
                                 Name (native)                    Website  
0  Fletorja Zyrtare E Republikës Së Shqipërisë                 qbz.gov.al  
1                   Journal Officiel d'Algérie              joradp.dz/HAR  
2     Butlletí Oficial del Principat d'Andorra                www.bopa.ad  
3         Antigua and Barbuda Official Gazette    www.legalaffairs.gov.ag  
4    Boletín Oficial de la República Argentina  www.boletinoficial.gob.ar
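
To skip the "Name (native)" column, one option is to drop it after reading the table. This is a minimal sketch, assuming the column labels match the output above. The small frame is built by hand from the first two rows shown, as a stand-in for the result of read_html, so the snippet runs without a network call:

```python
import pandas as pd

# Hand-built stand-in for pd.read_html(url)[1], copied from the first rows above
df = pd.DataFrame({
    'Country/region': ['Albania', 'Algeria'],
    'Name': ['Official Gazette of the Republic of Albania', 'Official Gazette'],
    'Name (native)': ['Fletorja Zyrtare E Republikës Së Shqipërisë',
                      "Journal Officiel d'Algérie"],
    'Website': ['qbz.gov.al', 'joradp.dz/HAR'],
})

# drop() returns a new DataFrame without the named column; df itself is unchanged
df_no_native = df.drop(columns=['Name (native)'])
print(df_no_native.columns.tolist())
```

Dropping after the read keeps the parsing step unchanged, which is convenient if the column is needed again later.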

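If you check the output, row 26 is problematic, because the Wiki page itself contains bad data there.

One solution is to set the value by row label and column name: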

import numpy as np
df.loc[26, 'Name (native)'] = np.nan
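
For rows that carry several names or links, pandas flattens each cell into a single string, so walking the table with BeautifulSoup may be easier. The following is a rough sketch, not a tested scraper for the live page: the HTML fragment, the country "Examplia", and the helper name `rows_to_records` are made up for illustration, and it assumes each country occupies a single `<tr>` with four cells (no `rowspan`). In Python 3 the text returned by BeautifulSoup is already a Unicode `str`, so no `.encode()` call is needed, which should also avoid the escaped-byte output.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the wikitable; not copied from the page
SAMPLE_HTML = """
<table class="wikitable">
<tr><th>Country/region</th><th>Name</th><th>Name (native)</th><th>Website</th></tr>
<tr>
  <td>Examplia</td>
  <td>Official&#160;Gazette<br>Official Bulletin</td>
  <td>Gazeta Oficiala</td>
  <td><a href="http://gazette.example">gazette.example</a>
      <a href="http://bulletin.example">bulletin.example</a></td>
</tr>
</table>
"""

def rows_to_records(html):
    """Return one dict per data row, keeping every name and link with its country."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='wikitable')
    records = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) < 4:  # header row (only <th> cells) or a malformed row
            continue
        country, name, _native, website = cells[:4]  # the native-name cell is skipped
        records.append({
            'country': country.get_text(strip=True),
            # stripped_strings yields each text chunk, so <br>-separated names split apart
            'names': [s.replace('\xa0', ' ') for s in name.stripped_strings],
            'links': [a['href'] for a in website.find_all('a', href=True)],
        })
    return records

records = rows_to_records(SAMPLE_HTML)
print(records)
```

From here, each record could be expanded into one row per link (for example with `DataFrame.explode`) if a flat table is wanted.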
