R: HTML parser to check whether a piece of HTML is mostly bold, italic, etc.



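I have a column of HTML. For each row of that column, I want to determine (yes/no) whether the content is bold, italic, etc. Many of the HTML snippets are only partly bold, so I want to flag a snippet as bold only if more than 50% of it is bold.

For example, this one should be flagged as both bold and italic: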

html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i><b>We had a net loss of $1.</b></i><i><b>55</b></i><i><b> million for the year ended December 31, 201</b></i><i><b>6</b></i><i><b> and have an accumulated deficit of $</b></i><i><b>61.5</b></i><i><b> million as of December 31, 201</b></i><i><b>6</b></i><i><b>. To achieve sustainable profitability, we must generate increased revenue.</b></i></font></p>'

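How should I approach this? I considered using a regex to count the characters between the tags, but a proper HTML parser would be better. I don't know which package to use or where to start. Thanks.

I don't have an HTML-parser answer, but here is one using regular expressions: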

isBold <- function(text) grepl('<b>.*</b>', text)
isItalics <- function(text) grepl('<i>.*</i>', text)
isBold(html)
#[1] TRUE
isItalics(html)
#[1] TRUE

Data

html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i><b>We had a net loss of $1.</b></i><i><b>55</b></i><i><b> million for the year ended December 31, 201</b></i><i><b>6</b></i><i><b> and have an accumulated deficit of $</b></i><i><b>61.5</b></i><i><b> million as of December 31, 201</b></i><i><b>6</b></i><i><b>. To achieve sustainable profitability, we must generate increased revenue.</b></i></font></p>'

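Rather than relying on a single occurrence of a bold or italic tag, we can compute the ratio of one tag to the other tags and return TRUE only when that ratio exceeds some threshold.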

library(stringr)
# Share of <i> (or <b>) opening tags among all <b>/<i> opening tags in `text`
isItalics <- function(text) str_count(text, '<i>') / str_count(text, '<[bi]>') > 0.5
isBold <- function(text) str_count(text, '<b>') / str_count(text, '<[bi]>') > 0.5
isBold(html)
#[1] FALSE
isItalics(html)
#[1] TRUE

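This counts the occurrences of the b (or i) tag and divides that by the combined number of b and i tags, returning TRUE only when the result is greater than 50%. Note that this is a ratio of tag counts, not of the amount of text inside the tags. If needed, you can include more tags besides '<[bi]>'.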
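Since the question asks about the share of the text rather than the share of tags, a character-based variant may be closer to the goal. The following is only a minimal base-R sketch: the helper name `tag_char_share` and the sample string `snippet` are made up for illustration, and it assumes attribute-free, non-nested tags such as `<b>...</b>` on a single line.

```r
# Share of visible characters that sit inside <tag>...</tag>.
# Regex-based sketch: assumes plain <b>/<i> tags without attributes,
# no nesting of the same tag, and no newlines inside the tagged text.
tag_char_share <- function(text, tag) {
  strip_tags <- function(x) gsub("<[^>]+>", "", x)
  total <- nchar(strip_tags(text))
  if (total == 0) return(NA_real_)
  pattern <- sprintf("<%s>.*?</%s>", tag, tag)
  inside <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  sum(nchar(strip_tags(inside))) / total
}

# Hypothetical sample input, shorter than the data used in this thread
snippet <- '<p><i><b>Net loss of $1.55 million.</b></i> <i>See note 3.</i></p>'
tag_char_share(snippet, "b") > 0.5
tag_char_share(snippet, "i") > 0.5
```

For a whole column, something like `vapply(html_column, tag_char_share, numeric(1), tag = "b", USE.NAMES = FALSE) > 0.5` should work, where `html_column` stands for your character vector of HTML strings.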

Updated data (the output above was produced with this string)

html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>'
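If a real parser is preferred, as the question suggests, the same character share can be computed from the parsed tree. This is a hedged sketch using the xml2 package, which this thread does not otherwise use; the function name `parsed_tag_share` and the sample `snippet` are illustrative assumptions. It only looks at elements such as `<b>` or `<i>`, not at bold or italic set through `style` attributes.

```r
library(xml2)

# Share of text characters that have one of `tags` as an ancestor element.
# `tags` is a character vector so that synonyms like <strong> can be included.
parsed_tag_share <- function(html_string, tags) {
  doc <- read_html(html_string)
  total <- sum(nchar(xml_text(xml_find_all(doc, "//text()"))))
  if (total == 0) return(NA_real_)
  cond <- paste(sprintf("ancestor::%s", tags), collapse = " or ")
  tagged <- xml_find_all(doc, sprintf("//text()[%s]", cond))
  sum(nchar(xml_text(tagged))) / total
}

# Hypothetical sample input, shorter than the data used in this thread
snippet <- '<p><i><b>Net loss of $1.55 million.</b></i> <i>See note 3.</i></p>'
parsed_tag_share(snippet, c("b", "strong")) > 0.5
parsed_tag_share(snippet, c("i", "em")) > 0.5
```

Because the XPath test uses `ancestor::`, nested markup such as `<i><b>...</b></i>` counts toward both bold and italic, which is harder to get right with regular expressions.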
