Python web scraping: IndexError with requests and lxml on the site "https://www.boi.org.il/currency.xml?curr=01"

I can scrape Wikipedia (https://en.wikipedia.org/wiki/Web_scraping) successfully with the code below, but when I try it on "https://www.boi.org.il/currency.xml?curr=01" I get this error:

---> 20 print(tree[0].text_content())

IndexError: list index out of range

My code is:

import requests 
from lxml import html 
# url to scrape data from 
link = 'https://www.boi.org.il/currency.xml?curr=01'
# path to particular element 
path = '/CURRENCIES/LAST_UPDATE'
response = requests.get(link) 
byte_string = response.content 
# get filtered source code 
source_code = html.fromstring(byte_string) 
# jump to preferred html element 
tree = source_code.xpath(path) 
# print texts in first element in list 
print(tree[0].text_content())

I want to scrape the LAST_UPDATE and RATE elements.

Thanks!

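You are not calling the URL directly: you call a different URL first, and I don't know why. All you need is to request the URL directly and then parse it. You also don't need to convert the bytes to text yourself, because BeautifulSoup gives you each element's .text.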

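import requests
from bs4 import BeautifulSoup

# Request the feed URL from the question directly
r = requests.get('https://www.boi.org.il/currency.xml?curr=01')
# The 'xml' parser keeps tag names case-sensitive (it needs lxml installed)
soup = BeautifulSoup(r.text, 'xml')
# Print every LAST_UPDATE element, then every RATE element
for update in soup.find_all('LAST_UPDATE'):
    print(update.text)
for rate in soup.find_all('RATE'):
    print(rate.text)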
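
A possible explanation for the original IndexError (an assumption, not something confirmed in this thread): an HTML parser such as lxml.html typically lowercases tag names, so a case-sensitive path like /CURRENCIES/LAST_UPDATE can match nothing, and indexing the empty result with [0] then raises. The sketch below uses only the standard library to illustrate that case sensitivity and a guard against empty results. The inline SAMPLE string and the helper name first_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Inline sample modelled on the feed, so the sketch runs without network access.
SAMPLE = b"<CURRENCIES><LAST_UPDATE>2019-12-06</LAST_UPDATE></CURRENCIES>"


def first_text(root, path, default=None):
    """Return the text of the first element matching path, or default if none match."""
    matches = root.findall(path)
    return matches[0].text if matches else default


root = ET.fromstring(SAMPLE)
# Exact-case tag name: one match is found.
print(first_text(root, "LAST_UPDATE"))
# Wrong-case tag name: no match, so the default is returned instead of raising IndexError.
print(first_text(root, "last_update"))
```

Checking for an empty result before indexing turns a confusing IndexError into an explicit "nothing matched" value that the caller can handle.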


Here is a working solution that uses no external libraries (only the Python standard library):

import xml.etree.ElementTree as ET
import http.client

# Open an HTTPS connection and request the feed
connection = http.client.HTTPSConnection("www.boi.org.il")
connection.request("GET", "/currency.xml?curr=01")
response = connection.getresponse()

# Parse the body only if the request succeeded
if response.status == 200:
    root = ET.fromstring(response.read())
    print('LAST_UPDATE: {}'.format(root.find('.//LAST_UPDATE').text))
    print('RATE: {}'.format(root.find('.//RATE').text))

The .xml response:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CURRENCIES>
<LAST_UPDATE>2019-12-06</LAST_UPDATE>
<CURRENCY>
<NAME>Dollar</NAME>
<UNIT>1</UNIT>
<CURRENCYCODE>USD</CURRENCYCODE>
<COUNTRY>USA</COUNTRY>
<RATE>3.463</RATE>
<CHANGE>-0.115</CHANGE>
</CURRENCY>
</CURRENCIES>

Output:

LAST_UPDATE: 2019-12-06
RATE: 3.463
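
If more than the rate is needed, the same standard-library approach could be extended to collect every field of each CURRENCY element. This is a minimal sketch that parses the sample XML shown above from an inline string rather than from the live feed. The function name parse_currencies and the SAMPLE constant are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Copy of the sample document shown above, kept inline so the sketch runs offline.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CURRENCIES>
<LAST_UPDATE>2019-12-06</LAST_UPDATE>
<CURRENCY>
<NAME>Dollar</NAME>
<UNIT>1</UNIT>
<CURRENCYCODE>USD</CURRENCYCODE>
<COUNTRY>USA</COUNTRY>
<RATE>3.463</RATE>
<CHANGE>-0.115</CHANGE>
</CURRENCY>
</CURRENCIES>"""


def parse_currencies(xml_bytes):
    """Return (last_update, list of dicts), one dict per CURRENCY element."""
    root = ET.fromstring(xml_bytes)
    last_update = root.findtext("LAST_UPDATE")
    # Map each child tag (NAME, RATE, ...) to its text for every CURRENCY.
    currencies = [
        {child.tag: child.text for child in currency}
        for currency in root.iter("CURRENCY")
    ]
    return last_update, currencies


last_update, currencies = parse_currencies(SAMPLE)
print("LAST_UPDATE:", last_update)
for currency in currencies:
    # All values are strings in the XML; convert RATE explicitly when a number is needed.
    print(currency["CURRENCYCODE"], float(currency["RATE"]))
```

To use it on the live feed, response.read() from the code above could be passed in place of SAMPLE.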
