Python web scraping: IndexError with requests and lxml on "https://www.boi.org.il/currency.xml?curr=01"



The following code scrapes Wikipedia (https://en.wikipedia.org/wiki/Web_scraping) successfully, but when I try it on "https://www.boi.org.il/currency.xml?curr=01" I get this error:

---> 20 print(tree[0].text_content()) 

IndexError: list index out of range

My code is:

import requests 
from lxml import html 
# url to scrape data from 
link = 'https://www.boi.org.il/currency.xml?curr=01'
# path to particular element 
path = '/CURRENCIES/LAST_UPDATE'
response = requests.get(link) 
byte_string = response.content 
# get filtered source code 
source_code = html.fromstring(byte_string) 
# jump to preferred html element 
tree = source_code.xpath(path) 
# print texts in first element in list 
print(tree[0].text_content()) 

I want to scrape LAST_UPDATE and the RATE items.

Thanks!

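You aren't calling the URL directly; for some reason you call a different URL first. All you need to do is request the URL directly and then parse the response. You also don't need to convert the bytes to text yourself, because you can pass .text to BS4.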

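import requests
from bs4 import BeautifulSoup

# Fetch the feed and parse it with BeautifulSoup's XML parser (requires lxml)
r = requests.get('https://www.boi.org.il/currency.xml?/CURRENCIES/LAST_UPDATE')
soup = BeautifulSoup(r.text, 'xml')
# Print the update date, then every exchange rate in the feed
for update in soup.find_all('LAST_UPDATE'):
    print(update.text)
for rate in soup.find_all('RATE'):
    print(rate.text)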

Output:

2019-12-06
3.463
4.5463
3.1887
3.8434
2.3706
2.6287
0.5144
0.3791
0.2368
0.3653
3.5052
4.8841
0.0229
0.2145
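As for why the original lxml code can come back with an empty list: a plausible explanation, assuming the server really returned the XML shown in this thread, is that lxml's HTML parser lowercases tag names and wraps the content in an html document, so the absolute path '/CURRENCIES/LAST_UPDATE' matches nothing. A minimal sketch comparing the two lxml parsers on an inline copy of the feed (SAMPLE is a stand-in for response.content, not a live request):

```python
from lxml import etree, html

# Inline stand-in for response.content (an assumption; no network call is made)
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CURRENCIES>
<LAST_UPDATE>2019-12-06</LAST_UPDATE>
<CURRENCY>
<NAME>Dollar</NAME>
<UNIT>1</UNIT>
<CURRENCYCODE>USD</CURRENCYCODE>
<COUNTRY>USA</COUNTRY>
<RATE>3.463</RATE>
<CHANGE>-0.115</CHANGE>
</CURRENCY>
</CURRENCIES>"""

path = '/CURRENCIES/LAST_UPDATE'

# HTML parser: tag names are lowercased and the document root is <html>,
# so the case-sensitive absolute path finds no elements
html_matches = html.fromstring(SAMPLE).xpath(path)

# XML parser: tag names and the document root are preserved
xml_matches = etree.fromstring(SAMPLE).xpath(path)

print(len(html_matches))
print(xml_matches[0].text)
```

If that is what happened, swapping html.fromstring for etree.fromstring should let the original XPath stay unchanged.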

Here is a working solution that uses no external libraries (only the Python standard library):

import xml.etree.ElementTree as ET
import http.client

connection = http.client.HTTPSConnection("www.boi.org.il")
connection.request("GET", "/currency.xml?curr=01")
response = connection.getresponse()

# Parse the body only if the request succeeded
if response.status == 200:
    root = ET.fromstring(response.read())
    print('LAST_UPDATE: {}'.format(root.find('.//LAST_UPDATE').text))
    print('RATE: {}'.format(root.find('.//RATE').text))

The XML:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CURRENCIES>
<LAST_UPDATE>2019-12-06</LAST_UPDATE>
<CURRENCY>
<NAME>Dollar</NAME>
<UNIT>1</UNIT>
<CURRENCYCODE>USD</CURRENCYCODE>
<COUNTRY>USA</COUNTRY>
<RATE>3.463</RATE>
<CHANGE>-0.115</CHANGE>
</CURRENCY>
</CURRENCIES>

Output:

LAST_UPDATE: 2019-12-06
RATE: 3.463
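Since root.find returns only the first match, a feed with several CURRENCY entries would need a loop to collect every rate. A rough sketch of that, run against an inline copy of the XML above rather than a live request; parse_rates and SAMPLE_XML are names made up for this illustration:

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML shown above, standing in for response.read()
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CURRENCIES>
<LAST_UPDATE>2019-12-06</LAST_UPDATE>
<CURRENCY>
<NAME>Dollar</NAME>
<UNIT>1</UNIT>
<CURRENCYCODE>USD</CURRENCYCODE>
<COUNTRY>USA</COUNTRY>
<RATE>3.463</RATE>
<CHANGE>-0.115</CHANGE>
</CURRENCY>
</CURRENCIES>"""


def parse_rates(xml_bytes):
    """Return (last_update, {currency_code: rate}) for a CURRENCIES document."""
    root = ET.fromstring(xml_bytes)
    last_update = root.findtext('LAST_UPDATE')
    rates = {}
    # Walk every CURRENCY child instead of stopping at the first one
    for currency in root.findall('CURRENCY'):
        code = currency.findtext('CURRENCYCODE')
        rates[code] = float(currency.findtext('RATE'))
    return last_update, rates


last_update, rates = parse_rates(SAMPLE_XML)
print(last_update)
print(rates)
```

Keying the dictionary by CURRENCYCODE is just one convenient choice; NAME would work equally well if codes are not needed.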
