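I have a column of HTML. For each row of that column, I want to determine (yes/no) whether the content is bold, italic, and so on. Many of the HTML snippets are bold in some parts and not in others, so I want to flag a snippet as bold only if more than 50% of it is bold.
For example, this one should be flagged as both bold and italic: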
html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i><b>We had a net loss of $1.</b></i><i><b>55</b></i><i><b> million for the year ended December 31, 201</b></i><i><b>6</b></i><i><b> and have an accumulated deficit of $</b></i><i><b>61.5</b></i><i><b> million as of December 31, 201</b></i><i><b>6</b></i><i><b>. To achieve sustainable profitability, we must generate increased revenue.</b></i></font></p>'
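How should I approach this? I considered using a regex to count the characters between <b> and </b>, but a proper HTML parser would be better. I don't know which package to use or where to start. Thanks.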
I don't have an answer that uses an HTML parser, but here is one with a regular expression:
isBold <- function(text) grepl('<b>.*</b>', text)
isItalics <- function(text) grepl('<i>.*</i>', text)
isBold(html)
#[1] TRUE
isItalics(html)
#[1] TRUE
Data
html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i><b>We had a net loss of $1.</b></i><i><b>55</b></i><i><b> million for the year ended December 31, 201</b></i><i><b>6</b></i><i><b> and have an accumulated deficit of $</b></i><i><b>61.5</b></i><i><b> million as of December 31, 201</b></i><i><b>6</b></i><i><b>. To achieve sustainable profitability, we must generate increased revenue.</b></i></font></p>'
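Instead of relying on a single occurrence of a bold or italic tag, we can compute the proportion of one tag relative to the others and return TRUE only when that proportion exceeds some threshold.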
library(stringr)
# Use the function argument `text`, not the global `html`
isItalics <- function(text) str_count(text, '<i>')/str_count(text, '<[bi]>') > 0.5
isBold <- function(text) str_count(text, '<b>')/str_count(text, '<[bi]>') > 0.5
isBold(html)
#[1] FALSE
isItalics(html)
#[1] TRUE
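This counts the occurrences of the b (or i) tag, divides that by the combined number of b and i tags, and returns TRUE only when the ratio is greater than 50%. Note that it compares tag counts, not the amount of text each tag covers. If needed, you can include more tags in addition to '<[bi]>'.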
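The question asks for the share of the text that is bold rather than the share of tags, so a character-based variant may be closer to the goal. The following is only a rough sketch in base R: `tag_share` is a hypothetical helper name, and it assumes the tag is never nested inside itself and that HTML entities can be counted as plain characters.

```r
# Share of the visible characters that sit inside <tag>...</tag> spans.
# Assumes `text` is a single string and that <tag> is not nested in itself.
tag_share <- function(text, tag) {
  # (?s) lets "." match newlines; ".*?" stops at the first closing tag
  pattern <- sprintf("(?s)<%s\\b[^>]*>.*?</%s>", tag, tag)
  spans <- regmatches(text, gregexpr(pattern, text, perl = TRUE, ignore.case = TRUE))[[1]]
  # Remove every tag so that only the visible text is measured
  strip_tags <- function(x) gsub("<[^>]+>", "", x)
  total <- nchar(strip_tags(text))
  if (total == 0) return(NA_real_)
  sum(nchar(strip_tags(spans))) / total
}

example <- "<p><b>abc</b>d</p>"
tag_share(example, "b")        # 0.75
tag_share(example, "b") > 0.5  # TRUE
```

For a whole column, something like `vapply(column, tag_share, numeric(1), tag = "b") > 0.5` should apply it row by row, where `column` stands for your character vector of HTML.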
Updated data
html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>'
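Since the question mentions that a proper HTML parser would be preferable, here is a minimal sketch of that route. It assumes the xml2 package is installed; `styled_share` is a hypothetical helper name, and only tags are inspected, so bold set through CSS such as `font-weight: bold` is not detected.

```r
library(xml2)

# Share of text characters that have one of `tags` as an ancestor element.
styled_share <- function(text, tags) {
  doc <- read_html(text)
  # Every text node in the parsed fragment
  all_text <- xml_find_all(doc, "//text()")
  # Text nodes nested inside any of the requested tags, e.g. ancestor::b or ancestor::strong
  cond <- paste(sprintf("ancestor::%s", tags), collapse = " or ")
  styled <- xml_find_all(doc, sprintf("//text()[%s]", cond))
  total <- sum(nchar(xml_text(all_text)))
  if (total == 0) return(NA_real_)
  sum(nchar(xml_text(styled))) / total
}

example <- "<p><b>abc</b>d</p>"
styled_share(example, c("b", "strong"))  # 0.75
```

Calling `styled_share(html, c("b", "strong")) > 0.5` and `styled_share(html, c("i", "em")) > 0.5` would then give the bold and italic flags for one snippet. Because the parser resolves nesting, this variant does not depend on how the tags are interleaved.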