I can scrape Wikipedia (https://en.wikipedia.org/wiki/Web_scraping) successfully with the code below, but when I try it on "https://www.boi.org.il/currency.xml?curr=01" I get this error:
---> 20 print(tree[0].text_content())
IndexError: list index out of range
My code is:
import requests
from lxml import html
# url to scrape data from
link = 'https://www.boi.org.il/currency.xml?curr=01'
# path to particular element
path = '/CURRENCIES/LAST_UPDATE'
response = requests.get(link)
byte_string = response.content
# get filtered source code
source_code = html.fromstring(byte_string)
# jump to preferred html element
tree = source_code.xpath(path)
# print texts in first element in list
print(tree[0].text_content())
I want to scrape the LAST_UPDATE and RATE items.
Thanks!
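You are not requesting the URL directly; for some reason you request a different URL first. All you need to do is request the URL directly and then parse the response. You also don't need to convert the bytes to text yourself, since you can pass r.text to BS4: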
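import requests
from bs4 import BeautifulSoup
# note: this query string differs from the question's ?curr=01
r = requests.get('https://www.boi.org.il/currency.xml?/CURRENCIES/LAST_UPDATE')
# the 'xml' parser requires lxml and keeps tag names case-sensitive
soup = BeautifulSoup(r.text, 'xml')
# print the update date, then every rate in the feed
for update in soup.find_all('LAST_UPDATE'):
    print(update.text)
for rate in soup.find_all('RATE'):
    print(rate.text)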
Output:
2019-12-06
3.463
4.5463
3.1887
3.8434
2.3706
2.6287
0.5144
0.3791
0.2368
0.3653
3.5052
4.8841
0.0229
0.2145
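A likely reason for the original IndexError, as far as I can tell: html.fromstring runs an HTML parser, and HTML parsers normalize tag names to lowercase, so an XPath written with uppercase names such as /CURRENCIES/LAST_UPDATE finds nothing and tree[0] fails. The sketch below shows the same effect with the standard library's html.parser instead of lxml (that swap is my assumption, made so it runs without extra packages); SAMPLE and TagCollector are names made up for the illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the feed (hypothetical sample, not fetched live).
SAMPLE = "<CURRENCIES><LAST_UPDATE>2019-12-06</LAST_UPDATE></CURRENCIES>"


class TagCollector(HTMLParser):
    """Records start-tag names exactly as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Parse the sample as HTML: tag names come back lowercased.
collector = TagCollector()
collector.feed(SAMPLE)
collector.close()
html_tags = collector.tags

# Parse the same sample as XML: tag names keep their original case.
xml_root = ET.fromstring(SAMPLE)
xml_tags = [xml_root.tag] + [child.tag for child in xml_root]

print(html_tags)  # ['currencies', 'last_update']
print(xml_tags)   # ['CURRENCIES', 'LAST_UPDATE']
```

If you want to stay with lxml, parsing the bytes with lxml.etree.fromstring rather than lxml.html.fromstring should keep the uppercase names, so the original XPath would have something to match.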
Here is a working solution that uses no external libraries (only the Python standard library):
import xml.etree.ElementTree as ET
import http.client
connection = http.client.HTTPSConnection("www.boi.org.il")
connection.request("GET", "/currency.xml?curr=01")
response = connection.getresponse()
# parse only when the request succeeded
if response.status == 200:
    root = ET.fromstring(response.read())
    # find() returns the first matching element under the root
    print('LAST_UPDATE: {}'.format(root.find('.//LAST_UPDATE').text))
    print('RATE: {}'.format(root.find('.//RATE').text))
XML:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<CURRENCIES>
  <LAST_UPDATE>2019-12-06</LAST_UPDATE>
  <CURRENCY>
    <NAME>Dollar</NAME>
    <UNIT>1</UNIT>
    <CURRENCYCODE>USD</CURRENCYCODE>
    <COUNTRY>USA</COUNTRY>
    <RATE>3.463</RATE>
    <CHANGE>-0.115</CHANGE>
  </CURRENCY>
</CURRENCIES>
Output:
LAST_UPDATE: 2019-12-06
RATE: 3.463
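Note that find() only returns the first match, so the code above prints a single rate. If the feed contains several CURRENCY elements, something like the following sketch could collect all of them. It parses an inline copy of the sample XML shown above instead of hitting the network, and parse_rates and SAMPLE_XML are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample document, so the sketch runs offline.
SAMPLE_XML = """<CURRENCIES>
  <LAST_UPDATE>2019-12-06</LAST_UPDATE>
  <CURRENCY>
    <NAME>Dollar</NAME>
    <UNIT>1</UNIT>
    <CURRENCYCODE>USD</CURRENCYCODE>
    <COUNTRY>USA</COUNTRY>
    <RATE>3.463</RATE>
    <CHANGE>-0.115</CHANGE>
  </CURRENCY>
</CURRENCIES>"""


def parse_rates(xml_text):
    """Return (last_update, {currency_code: rate}) from a CURRENCIES document."""
    root = ET.fromstring(xml_text)
    last_update = root.findtext('LAST_UPDATE')
    rates = {}
    # findall() returns every direct CURRENCY child, not just the first one
    for currency in root.findall('CURRENCY'):
        code = currency.findtext('CURRENCYCODE')
        rates[code] = float(currency.findtext('RATE'))
    return last_update, rates


last_update, rates = parse_rates(SAMPLE_XML)
print(last_update)
print(rates)
```

To use it on the live feed you would pass response.read() from the http.client code above to parse_rates in place of SAMPLE_XML.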