Scraping special glyph characters in an HTML table



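I'm trying to scrape a table that has a "graphical" element (an up/down arrow) in some of its cells. Unfortunately, the html_table function from the rvest library seems to skip these elements. This is what such a cell with an arrow looks like in the HTML: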

<td>
<span style="font-weight: bold; color: darkgreen">Ba2</span>
<i class="glyphicon glyphicon-arrow-down" title="negative outlook"></i>
</td>

The code I'm using is:

require(rvest)
require(tidyverse)
url = "https://tradingeconomics.com/country-list/rating"
#bypass company firewall
download.file(url, destfile = "scrapedpage.html", quiet=TRUE)
content <- read_html("scrapedpage.html")
tables <- content %>% html_table(fill = TRUE, trim=TRUE)

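But for a cell like the one above, it gives me only the string Ba2. Is there a way to include the arrow as well (as text, e.g. Ba2 neg)? If R has no such feature, a solution in Python would also be useful.

Thanks!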

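I don't know whether this is possible in R, but in Python the following gives you the result you want.

I print only the first few rows to give you an idea of what the data looks like.

pos marks an up arrow and neg marks a down arrow.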

from bs4 import BeautifulSoup
import requests

url = 'https://tradingeconomics.com/country-list/rating'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# Locate the ratings table by its id
t = soup.find('table', attrs={'id': 'ctl00_ContentPlaceHolder1_ctl01_GridView1'})
tr = t.find_all('tr')
# Skip the header row (index 0) and process the first nine data rows
for i in range(1, 10):
    tds = tr[i].find_all('td')
    temp = []
    for j in tds:
        # The arrows are empty <i> elements identified only by their glyphicon class
        fa_down = j.find('i', class_='glyphicon-arrow-down')
        fa_up = j.find('i', class_='glyphicon-arrow-up')
        if fa_up:
            temp.append(f'{j.text.strip()} (pos)')
        elif fa_down:
            temp.append(f'{j.text.strip()} (neg)')
        else:
            temp.append(j.text.strip())
    # Print one list of cell strings per table row
    print(temp)

Output (first nine data rows, laid out as a table):
+------------+---------+-----------+-----------+---------+---------+
|  Field 1   | Field 2 |  Field 3  |  Field 4  | Field 5 | Field 6 |
+------------+---------+-----------+-----------+---------+---------+
|  Albania   |    B+   |     B1    |           |         |    35   |
|  Andorra   |   BBB   |           |    BBB+   |         |    62   |
|   Angola   |   CCC+  |    Caa1   |    CCC    |         |    21   |
| Argentina  |   CCC+  |     Ca    |    CCC    |   CCC   |    15   |
|  Armenia   |         |    Ba3    |     B+    |         |    16   |
|   Aruba    |   BBB   |           |     BB    |         |    52   |
| Australia  |   AAA   |    Aaa    | AAA (neg) |   AAA   |   100   |
|  Austria   |   AA+   |    Aa1    |    AA+    |   AAA   |    96   |
| Azerbaijan |   BB+   | Ba2 (pos) |    BB+    |         |    48   |
+------------+---------+-----------+-----------+---------+---------+
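
If you want to try the idea without hitting the site (for example behind a firewall, as in the question), here is a minimal sketch of the same approach using only Python's standard-library html.parser. It is not the code that produced the output above. The names ArrowTableParser, parse_rating_rows and ARROW_MARKERS are made up for this illustration, and the sample row reuses the cell from the question with an invented country name.

```python
from html.parser import HTMLParser

# Assumed mapping from glyphicon class to the marker appended to the cell text.
ARROW_MARKERS = {"glyphicon-arrow-up": "pos", "glyphicon-arrow-down": "neg"}


class ArrowTableParser(HTMLParser):
    """Collect table rows, appending '(pos)' or '(neg)' to cells holding an arrow icon."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # cells of the row currently being parsed
        self._text = None    # text fragments of the cell currently being parsed
        self._marker = None  # arrow marker found in the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._text, self._marker = [], None
        elif tag == "i" and self._text is not None:
            # The arrow is an empty <i>; only its class attribute identifies it.
            for cls in (dict(attrs).get("class") or "").split():
                if cls in ARROW_MARKERS:
                    self._marker = ARROW_MARKERS[cls]

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            # Collapse whitespace, then append the marker if an arrow was seen.
            cell = " ".join("".join(self._text).split())
            if self._marker:
                cell = f"{cell} ({self._marker})"
            self._row.append(cell)
            self._text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rating_rows(html):
    """Return every table row in `html` as a list of annotated cell strings."""
    parser = ArrowTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Sample row: the cell from the question plus a made-up country name.
sample = """
<table>
<tr><td>Exampleland</td>
<td>
<span style="font-weight: bold; color: darkgreen">Ba2</span>
<i class="glyphicon glyphicon-arrow-down" title="negative outlook"></i>
</td></tr>
</table>
"""
print(parse_rating_rows(sample))
```

To run it on the real page, you could pass the contents of the saved scrapedpage.html to parse_rating_rows. The sketch collects rows from every table in the document, so you may need to filter for the one you want.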
