How to scrape a Wikipedia table with Python



I want to extract a table from https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia, but my code returns no data. How can I get it?

Code:

import requests
from bs4 import BeautifulSoup as bs
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(url).text
soup = bs(html, 'html.parser')
ta=soup.find_all('table',class_="wikitable sortable jquery-tablesorter")
print(ta)

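If I need to pull a table and I see <table> tags, I always try pandas' .read_html() first. It does the iteration for you. In most cases you get exactly what you need, or at worst you only need some minor manipulation of the DataFrame. In this case it gives you the full table nicely: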

import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
table = pd.read_html(url)[1]

Output:

print (table.to_string())
                                   0                   1                                  2                  3        4                                                  5
0                               Name            Industry                             Sector       Headquarters  Founded                                              Notes
1                  Airfast Indonesia   Consumer services                           Airlines          Tangerang     1971                                    Private airline
2                       Angkasa Pura         Industrials            Transportation services            Jakarta     1962                               State-owned airports
3                Astra International       Conglomerates                                  -            Jakarta     1957    Automotive, financials, industrials, technology
4                  Bank Central Asia          Financials                              Banks            Jakarta     1957                                               Bank
5                       Bank Danamon          Financials                              Banks            Jakarta     1956                                               Bank
6                       Bank Mandiri          Financials                              Banks            Jakarta     1998                                               Bank
7              Bank Negara Indonesia          Financials                              Banks            Jakarta     1946                                               Bank
8              Bank Rakyat Indonesia          Financials                              Banks            Jakarta     1895                                 Micro-finance bank
9                     Bumi Resources     Basic materials                     General mining            Jakarta     1973                                             Mining
10                            Djarum      Consumer goods                            Tobacco  Kudus and Jakarta     1951                                            Tobacco
11   Dragon Computer & Communication          Technology                  Computer hardware            Jakarta     1980                                  Computer hardware
12             Elex Media Komputindo   Consumer services                         Publishing            Jakarta     1985                                          Publisher
13                            Femina   Consumer services                              Media            Jakarta     1972                                    Weekly magazine
14                  Garuda Indonesia   Consumer services                   Travel & leisure          Tangerang     1949                                State-owned airline
15                      Gudang Garam      Consumer goods                            Tobacco             Kediri     1958                                            Tobacco
16                      Gunung Agung   Consumer services                Specialty retailers            Jakarta     1953                                         Bookstores
17       Indocement Tunggal Prakarsa         Industrials      Building materials & fixtures            Jakarta     1985         Cement, part of HeidelbergCement (Germany)
18                          Indofood      Consumer goods                      Food products            Jakarta     1968                                    Food production
19              Indonesian Aerospace         Industrials                          Aerospace            Bandung     1976                        State-owned aircraft design
20    Indonesian Bureau of Logistics      Consumer goods                      Food products            Jakarta     1967                                  Food distribution
21                           Indosat  Telecommunications      Fixed line telecommunications            Jakarta     1967                         Telecommunications network
22               Infomedia Nusantara   Consumer services                         Publishing            Jakarta     1975                                Directory publisher
23      Jalur Nugraha Ekakurir (JNE)         Industrials                  Delivery services            Jakarta     1990                                  Express logistics
24                       Kalbe Farma         Health care                    Pharmaceuticals            Jakarta     1966                                    Pharmaceuticals
25              Kereta Api Indonesia         Industrials                          Railroads            Bandung     1945                                State-owned railway
26                       Kimia Farma         Health care                    Pharmaceuticals            Jakarta     1971                                 State-owned pharma
27             Kompas Gramedia Group   Consumer services                     Media agencies            Jakarta     1965                                      Media holding
28                    Krakatau Steel     Basic materials                       Iron & steel            Cilegon     1970                                  State-owned steel
29                          Lion Air   Consumer services                           Airlines            Jakarta     2000                                   Low-cost airline
30                       Lippo Group          Financials  Real estate holding & development            Jakarta     1950                                        Development
31                          Matahari   Consumer services                Broadline retailers          Tangerang     1982                                  Department stores
32                       MedcoEnergi           Oil & gas           Exploration & production            Jakarta     1980                                Energy, oil and gas
33             Media Nusantara Citra   Consumer services       Broadcasting & entertainment            Jakarta     1997                                              Media
34                   Panin Sekuritas          Financials                Investment services            Jakarta     1989                                             Broker
35                         Pegadaian          Financials                   Consumer finance            Jakarta     1901                     State-owned financial services
36                             Pelni         Industrials              Marine transportation            Jakarta     1952                                           Shipping
37                     Pos Indonesia         Industrials                  Delivery services            Bandung     1995                         State-owned postal service
38                         Pertamina           Oil & gas               Integrated oil & gas            Jakarta     1957                    State-owned oil and natural gas
39             Perusahaan Gas Negara           Oil & gas           Exploration & production            Jakarta     1965                                                Gas
40             Perusahaan Gas Negara           Utilities                   Gas distribution            Jakarta     1965             State-owned natural gas transportation
41         Perusahaan Listrik Negara           Utilities           Conventional electricity            Jakarta     1945                State-owned electrical distribution
42  Phillip Securities Indonesia, PT          Financials                Investment services            Jakarta     1989                                 Financial services
43                            Pindad         Industrials                            Defense            Bandung     1808                                State-owned defense
44                PT Lapindo Brantas           Oil & gas           Exploration & production            Jakarta     1996                                        Oil and gas
45   PT Metro Supermarket Realty Tbk   Consumer services       Food retailers & wholesalers            Jakarta     1955                                       Supermarkets
46                       Salim Group       Conglomerates                                  -            Jakarta     1972            Industrials, financials, consumer goods
47                         Sampoerna      Consumer goods                            Tobacco           Surabaya     1913                                            Tobacco
48                   Semen Indonesia         Industrials      Building materials & fixtures             Gresik     1957                                             Cement
49                          Susi Air   Consumer services                           Airlines        Pangandaran     2004                                    Charter airline
50                  Telkom Indonesia  Telecommunications      Fixed line telecommunications            Bandung     1856                         Telecommunication services
51                         Telkomsel  Telecommunications          Mobile telecommunications            Jakarta     1995           Mobile network, part of Telkom Indonesia
52                        Trans Corp       Conglomerates                                  -            Jakarta     2006  Media, consumer services, real estate, part of...
53                Unilever Indonesia      Consumer goods                  Personal products            Jakarta     1933  Personal care products, part of Unilever (Neth...
54                   United Tractors         Industrials       Commercial vehicles & trucks            Jakarta     1972                                    Heavy equipment
55                           Waskita         Industrials                 Heavy construction            Jakarta     1961                           State-owned construction
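Picking the table by position ([1]) depends on the page layout staying the same. As a rough sketch of an alternative, read_html() also accepts match= to keep only tables containing some text, and header=0 to use the first row as column names. The HTML below is a small placeholder standing in for the downloaded page (the "Example Corp" row is made up; the other rows are copied from the output above), so the snippet runs without network access:

```python
from io import StringIO

import pandas as pd

# Placeholder HTML standing in for the downloaded page: two tables,
# only the second of which has a "Headquarters" column.
HTML = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Example Corp</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Name</th><th>Headquarters</th><th>Founded</th></tr>
  <tr><td>Airfast Indonesia</td><td>Tangerang</td><td>1971</td></tr>
  <tr><td>Angkasa Pura</td><td>Jakarta</td><td>1962</td></tr>
</table>
"""

# match= keeps only tables whose text matches the given string or regex;
# header=0 uses the first row as the column names.
tables = pd.read_html(StringIO(HTML), match="Headquarters", header=0)
table = tables[0]
print(table.to_string())
```

Against the live page, the same call would take the URL instead of the StringIO object. Whether "Headquarters" is still a good string to match on depends on the current state of the article.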
import requests
from bs4 import BeautifulSoup as bs
URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(URL).text
soup = bs(html, 'html.parser')
ta=soup.find_all('table',{'class':'wikitable'})
print(ta)

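You can search for the table by class name the old way. It still seems to work.

Fixes

  1. Use URL instead of url in your code (line 4)
  2. Use the class wikitable
  3. Made some small optimizations to your code

So: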

import requests
from bs4 import BeautifulSoup
page = requests.get("https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia")
soup = BeautifulSoup(page.content, 'html.parser')
ta = soup.find_all('table',class_="wikitable")
print(ta)

Output

[<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Image
</th>
<th>Name
</th>
<th>2016 Revenues (USD $M)
</th>
<th>Employees
</th>
<th>Notes
.
.
.
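The output above shows the table arriving with class="wikitable sortable" only. The jquery-tablesorter class in the question's selector is most likely added by JavaScript in the browser, so it is not in the HTML that requests downloads. A minimal sketch of how BeautifulSoup treats the different class arguments, using a one-line placeholder table instead of the real page:

```python
from bs4 import BeautifulSoup

# Placeholder markup mimicking what the server sends: the table carries
# "wikitable sortable" but not "jquery-tablesorter".
HTML = '<table class="wikitable sortable"><tr><th>Name</th></tr></table>'
soup = BeautifulSoup(HTML, "html.parser")

# A multi-word class_ string is compared against the whole class attribute,
# so the extra class name makes this search come back empty.
exact_miss = soup.find_all("table", class_="wikitable sortable jquery-tablesorter")

# A single class name matches any table that has that class among others.
by_one_class = soup.find_all("table", class_="wikitable")

# A CSS selector requires both classes, in any order.
by_selector = soup.select("table.wikitable.sortable")

print(len(exact_miss), len(by_one_class), len(by_selector))  # 0 1 1
```

Searching on the single class wikitable is the more forgiving choice, since it keeps working if the table gains or loses other classes.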

Maybe this is not exactly what you want, but you can try it.

import requests
from bs4 import BeautifulSoup as bs
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(url).text
soup = bs(html, 'html.parser')
for data in soup.find_all('table', {"class":"wikitable"}):
    for td in data.find_all('td'):
        for link in td.find_all('a'):
            print (link.text)
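The loop above prints only the text of links, so cells without a link are skipped. If whole rows are wanted, one possible variation is to collect the text of every cell per row. This is only a sketch: the HTML is a small placeholder (the href values are assumptions, the cell values are taken from the table earlier in this thread) so that it runs offline.

```python
from bs4 import BeautifulSoup

# Placeholder table; on the real page, `HTML` would be requests.get(url).text.
HTML = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Headquarters</th><th>Founded</th></tr>
  <tr><td><a href="/wiki/Garuda_Indonesia">Garuda Indonesia</a></td><td>Tangerang</td><td>1949</td></tr>
  <tr><td><a href="/wiki/Lion_Air">Lion Air</a></td><td>Jakarta</td><td>2000</td></tr>
</table>
"""
soup = BeautifulSoup(HTML, "html.parser")

rows = []
for table in soup.find_all("table", class_="wikitable"):
    for tr in table.find_all("tr"):
        # Take header (th) and data (td) cells alike, stripped of whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

for row in rows:
    print(row)
```

The first entry of rows is the header row, which makes it easy to hand the result to csv.writer or to a DataFrame later.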

Try the following:

import requests
from bs4 import BeautifulSoup as bs
URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(URL).text
soup = bs(html, 'html.parser')
ta=soup.find("table",{"class":"wikitable sortable"})
print(ta)

To get all the tables:

ta=soup.find_all("table",{"class":"wikitable sortable"})

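If you want to parse the table data, you can do that with pandas. A pandas DataFrame is also very convenient if you want to manipulate the data afterwards.

To read the table: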
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
table = pd.read_html(url,header=0)
print(table[1])
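As an illustration of the kind of manipulation a DataFrame makes easy, here is a small sketch. It builds the frame by hand from four rows copied from the table earlier in this thread rather than downloading the page, and the column names are assumed to match what read_html(url, header=0) returns:

```python
import pandas as pd

# Four rows copied from the table shown earlier; in practice `table`
# would be the DataFrame returned by pd.read_html(url, header=0)[1].
table = pd.DataFrame(
    [
        ("Bank Mandiri", "Jakarta", 1998),
        ("Kereta Api Indonesia", "Bandung", 1945),
        ("Pindad", "Bandung", 1808),
        ("Telkom Indonesia", "Bandung", 1856),
    ],
    columns=["Name", "Headquarters", "Founded"],
)

# Filter by headquarters, then order by founding year.
bandung = table[table["Headquarters"] == "Bandung"].sort_values("Founded")
print(bandung["Name"].tolist())

# Count companies per city.
per_city = table["Headquarters"].value_counts()
print(per_city)
```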
