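I am trying to scrape a table that has "graphical" elements (up/down arrows) in some of its cells. Unfortunately, the html_table function from the rvest library seems to skip these elements. This is what such a cell with an arrow looks like in the HTML: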
<td>
<span style="font-weight: bold; color: darkgreen">Ba2</span>
<i class="glyphicon glyphicon-arrow-down" title="negative outlook"></i>
</td>
The code I am using is:
require(rvest)
require(tidyverse)
url = "https://tradingeconomics.com/country-list/rating"
#bypass company firewall
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
tables <- content %>% html_table(fill = TRUE, trim=TRUE)
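But for the cell above, for example, it only gives me the string Ba2. Is there a way to include the arrow as well (as text, e.g. Ba2 neg)? If R has no such functionality, a solution in Python would also be useful.
Thanks!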
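I don't know whether this is possible in R, but in Python the following gives you the desired result.
I printed only the first few rows to give you an idea of what the data looks like.
pos indicates an up arrow, neg indicates a down arrow.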
from bs4 import BeautifulSoup
import requests

url = 'https://tradingeconomics.com/country-list/rating'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Locate the ratings table by its id and collect its rows
t = soup.find('table', attrs={'id': 'ctl00_ContentPlaceHolder1_ctl01_GridView1'})
tr = t.find_all('tr')

# Rows 1 to 9 only (row 0 is presumably the header row)
for i in range(1, 10):
    tds = tr[i].find_all('td')
    for j in tds:
        # The arrow is an <i> icon with no text, so look it up by its CSS class
        fa_down = j.find('i', class_='glyphicon-arrow-down')
        fa_up = j.find('i', class_='glyphicon-arrow-up')
        if fa_up:
            print(f'{j.text.strip()} (pos)')
        elif fa_down:
            print(f'{j.text.strip()} (neg)')
        else:
            print(f'{j.text.strip()}')
Output:
+------------+---------+-----------+-----------+---------+---------+
| Field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 |
+------------+---------+-----------+-----------+---------+---------+
| Albania | B+ | B1 | | | 35 |
| Andorra | BBB | | BBB+ | | 62 |
| Angola | CCC+ | Caa1 | CCC | | 21 |
| Argentina | CCC+ | Ca | CCC | CCC | 15 |
| Armenia | | Ba3 | B+ | | 16 |
| Aruba | BBB | | BB | | 52 |
| Australia | AAA | Aaa | AAA (neg) | AAA | 100 |
| Austria | AA+ | Aa1 | AA+ | AAA | 96 |
| Azerbaijan | BB+ | Ba2 (pos) | BB+ | | 48 |
+------------+---------+-----------+-----------+---------+---------+
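If installing requests and bs4 is not an option (for example behind the company firewall mentioned in the question), the same idea can be sketched with only the Python standard library and run against HTML that has already been saved to disk. This is a rough sketch, not a drop-in replacement: the names ARROW_SUFFIX, RatingTableParser and parse_rows are made up for illustration, and it assumes the arrows are always <i> elements carrying the glyphicon-arrow-up or glyphicon-arrow-down class, as in the cell shown in the question. It only reads td cells, so a header row made of th cells would come back as an empty list.

```python
from html.parser import HTMLParser

# Assumed mapping from the icon's CSS class to the text marker (hypothetical name).
ARROW_SUFFIX = {"glyphicon-arrow-up": "pos", "glyphicon-arrow-down": "neg"}


class RatingTableParser(HTMLParser):
    """Collect table rows, appending (pos)/(neg) to cells that contain an arrow icon."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._text = None  # text fragments of the current cell, None outside <td>
        self._suffix = ""  # arrow marker found in the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._text, self._suffix = [], ""
        elif tag == "i" and self._text is not None:
            # An <i> inside a cell: check its classes for an arrow icon
            classes = (dict(attrs).get("class") or "").split()
            for cls in classes:
                if cls in ARROW_SUFFIX:
                    self._suffix = f" ({ARROW_SUFFIX[cls]})"

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            cell = "".join(self._text).strip()
            self._row.append((cell + self._suffix).strip())
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Return the table rows found in an HTML string as lists of cell strings."""
    parser = RatingTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# The cell from the question, wrapped in a minimal table for the demo
sample = """
<table><tr>
<td>
<span style="font-weight: bold; color: darkgreen">Ba2</span>
<i class="glyphicon glyphicon-arrow-down" title="negative outlook"></i>
</td>
</tr></table>
"""
print(parse_rows(sample))  # [['Ba2 (neg)']]
```

To use it on the file saved by download.file in the question, read scrapedpage.html into a string and pass that string to parse_rows.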