Web scraping with Selenium does not capture all of the data



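I am trying to scrape bond information from a web page. With Selenium I can get the first few rows of the table that holds the data I need, but some rows and columns are not scraped, and I cannot work out why.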

The page in question is the one that lists the bond information.

Code:

import pandas as pd
from selenium.webdriver.common.by import By

# `driver` is assumed to be an already-initialised WebDriver with the bond page loaded.
rows = driver.find_elements(By.TAG_NAME, 'sgx-table-row')
combined = []
for row in rows:
    # Collect the ticker links, the text cells and the numeric cells, in that order.
    tickers = row.find_elements(By.TAG_NAME, 'a')
    names = row.find_elements(By.TAG_NAME, 'sgx-table-cell-text')
    prices = row.find_elements(By.TAG_NAME, 'sgx-table-cell-number')
    # Keep only the cells that actually contain text.
    combined.append([cell.text for cell in tickers + names + prices if cell.text])

df = pd.DataFrame(combined)
print(df)

Output:

0   N518100E 230201  CMHS   99.000      99  0.827   98.173     ﹣     ﹣     0   
1   N519100A 240201  LSHS   97.000      97  0.945   96.055     ﹣     ﹣     0   
2   N520100A 251101  QGES        ﹣       ﹣  0.111        ﹣     ﹣     ﹣     0   
3   N521100V 261101  IRRS        ﹣       ﹣      0        ﹣     ﹣     ﹣     0   
4   NA12100N 420401  PH1S  110.000     110  0.842  109.158     ﹣     ﹣     0   
5   NA16100H 460301  BJGS  108.000     108  1.069  106.931     ﹣     ﹣     0   
6   NA20100F 500301  ZL8S  108.000     108  0.729  107.271     ﹣     ﹣     0   
7   NA21200W 511001  ZFGS   87.000      87      0       87     ﹣     ﹣     0   
8   NX13100H 230701  R1MS  101.500   101.5  0.157  101.343     ﹣     ﹣     0   
9   NX15100Z 250601  AFUS   99.701  99.701  0.331    99.37     ﹣     ﹣     0   
10  NX16100F 260601  BJHS  102.000     102  0.296  101.704     ﹣     ﹣     0   
11  NX18100A 280501  CMGS   90.000      90  0.585   89.415     ﹣     ﹣     0   
12  NX21100N 310701  RXYS        ﹣       ﹣  0.093        ﹣     ﹣     ﹣     0   
13  NY07100X 220901  7PMS  101.380  101.38  1.214  100.166     ﹣     ﹣     0   
14             None  None     None    None   None     None  None  None  None   
15             None  None     None    None   None     None  None  None  None   
16             None  None     None    None   None     None  None  None  None   
17             None  None     None    None   None     None  None  None  None   
18             None  None     None    None   None     None  None  None  None   
19             None  None     None    None   None     None  None  None  None   
20             None  None     None    None   None     None  None  None  None   
21             None  None     None    None   None     None  None  None  None   
22             None  None  

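As the output shows, past a certain point find_elements yields no data and the rows come back as None, even though the HTML for those rows has the same format (same class names and tags).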

The page is loaded dynamically: it calls several APIs and receives JSON data from them. Does the following help?

import requests
import pandas as pd
r = requests.get('https://api.sgx.com/securities/v1.1/bonds?params=nc%2Cadjusted-vwap%2Cbond_accrued_interest%2Cbond_clean_price%2Cbond_dirty_price%2Cbond_date%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
df = pd.DataFrame(r.json()['data']['prices'])
df

This returns a DataFrame with 38 rows × 34 columns:

      pv  bond_dirty_price     lt    fn     trading_time    dp         type    du    bv   dpc  ...  p_       p  bond_accrued_interest  change_vs_pc      s    nc   cx    vl        v      bond_date
0  1.014             101.4  1.014  None  20220722_090753  None  retailbonds  None  45.0  None  ...   X   0.000                  0.501          None  1.019  RMRB  0.0  45.0  45630.0  1658419200000
1  0.998              99.7  0.997  None  20220722_090824  None  retailbonds  None  22.0  None  ...   X  -0.100                  0.380          None  1.000  5A1B  0.0  41.0  40874.0  1658419200000
2  0.964              96.3  0.963  None  20220722_090824  None  retailbonds  None   7.0  None  ...   X  -0.104                  1.068          None  0.966  6AZB  0.0  82.0  79028.0  1658419200000
3  1.013             101.3  1.013  None  20220722_090824  None  retailbonds  None  20.0  None  ...   X   0.000                  0.678          None  1.015  V7AB  0.0  80.0  81040.0  1658419200000
4  1.011             101.3  1.013  None  20220722_090825  None  retailbonds  None  22.0  None  ...   X   0.198                  0.983          None  1.013  V7BB  0.0   9.0   9117.0  1658419200000
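
The raw frame is wide, and two of its columns look like encoded timestamps. Below is a minimal sketch of how the result might be tidied. It assumes, based only on the sample values shown, that bond_date is a Unix timestamp in milliseconds and that trading_time follows a YYYYMMDD_HHMMSS pattern. The sample rows are copied from the output above so the snippet runs without a network call.

```python
import pandas as pd

# Two sample rows copied from the output above; only a handful of the 34 columns are kept.
sample = [
    {"nc": "RMRB", "lt": 1.014, "vl": 45.0, "v": 45630.0,
     "trading_time": "20220722_090753", "bond_date": 1658419200000},
    {"nc": "5A1B", "lt": 0.997, "vl": 41.0, "v": 40874.0,
     "trading_time": "20220722_090824", "bond_date": 1658419200000},
]
df = pd.DataFrame(sample)

# Assumption: bond_date is a Unix timestamp in milliseconds (UTC).
df["bond_date"] = pd.to_datetime(df["bond_date"], unit="ms", utc=True)

# Assumption: trading_time is formatted as YYYYMMDD_HHMMSS.
df["trading_time"] = pd.to_datetime(df["trading_time"], format="%Y%m%d_%H%M%S")

print(df)
```

The same two conversions should apply unchanged to the full frame built from r.json()['data']['prices'], provided those assumptions about the column formats hold.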

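
The long query string in the request URL is a single params value: a comma-separated list of field names, with each comma percent-encoded as %2C. The sketch below shows how such a URL could be assembled from a Python list and decoded back again, using only the standard library. The helper names build_bonds_url and fields_from_url are hypothetical, and whether the API accepts a shorter field list than the one in the original URL is an assumption that has not been verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.sgx.com/securities/v1.1/bonds"


def build_bonds_url(fields):
    """Join the field names with commas and encode them as the `params` query value."""
    return f"{BASE_URL}?{urlencode({'params': ','.join(fields)})}"


def fields_from_url(url):
    """Recover the list of field names from a URL of the same shape."""
    query = parse_qs(urlsplit(url).query)
    return query["params"][0].split(",")


# Hypothetical subset of the fields that appear in the original URL.
url = build_bonds_url(["nc", "lt", "vl", "v", "bond_date"])
print(url)
print(fields_from_url(url))
```

The resulting string can be passed to requests.get in place of the hard-coded URL, which makes it easier to see which fields are being requested.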