r - Remove HTML tags from a string



I am trying to extract the text data from HTML markup using rvest.

Data:

[vc_row css_animation="" row_type="row" use_row_as_full_screen_section="no" type="full_width" angled_section="no" text_align="left" background_image_as_pattern="without_pattern"][vc_column][vc_column_text]n\n<h6 class="button" style="padding: 0px 42%;">Description</h6>n\n<ol>n\n t<li>Ideal for : Women</li>n\n t<li>Package Contents : 1 Pcs</li>n\n t<li>Fit Type : Regular, Relaxed, Classic and Slim Fit.</li>n\n t<li>Care Instructions : Machine Wash and Normal Wash.</li>n\n t<li>Occasion : Lough, Smart, Dressy, Business, Casual and Formal.</li>n\n t<li>Sleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.</li>n\n t<li>Browse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.</li>n\n t<li>Care Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.</li>n\n</ol>n\n<img class="alignnone wp-image-858" src="https://justelite.in/wp-content/uploads/2020/06/art-size.jpg" alt="" />n\n<h6 class="button" style="padding: 0px 42%;">Reviews</h6>n\n[/vc_column_text][/vc_column][/vc_row]

What I did:

html_text(read_html(as.character(data)))

The result still contains vc_row css_animation and some other tags that were not removed.

dput of the data:

structure(2L, .Label = c("", "[vc_row css_animation=\"\" row_type=\"row\" use_row_as_full_screen_section=\"no\" type=\"full_width\" angled_section=\"no\" text_align=\"left\" background_image_as_pattern=\"without_pattern\"][vc_column][vc_column_text]n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Description</h6>n\n<ol>n\n t<li>Ideal for : Women</li>n\n t<li>Package Contents : 1 Pcs</li>n\n t<li>Fit Type : Regular, Relaxed, Classic and Slim Fit.</li>n\n t<li>Care Instructions : Machine Wash and Normal Wash.</li>n\n t<li>Occasion : Lough, Smart, Dressy, Business, Casual and Formal.</li>n\n t<li>Sleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.</li>n\n t<li>Browse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.</li>n\n t<li>Care Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.</li>n\n</ol>n\n<img class=\"alignnone wp-image-858\" src=\"https://justelite.in/wp-content/uploads/2020/06/art-size.jpg\" alt=\"\" />n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Reviews</h6>n\n[/vc_column_text][/vc_column][/vc_row]"
), class = "factor")

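From what I can see, what you have are not proper HTML tags, since those are normally delimited by "<" and ">" (e.g. <h1>). Yours are surrounded by square brackets, like [h1]. Adapting the function linked above, you could do: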

s <- structure(2L, .Label = c("", "[vc_row css_animation=\"\" row_type=\"row\" use_row_as_full_screen_section=\"no\" type=\"full_width\" angled_section=\"no\" text_align=\"left\" background_image_as_pattern=\"without_pattern\"][vc_column][vc_column_text]n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Description</h6>n\n<ol>n\n t<li>Ideal for : Women</li>n\n t<li>Package Contents : 1 Pcs</li>n\n t<li>Fit Type : Regular, Relaxed, Classic and Slim Fit.</li>n\n t<li>Care Instructions : Machine Wash and Normal Wash.</li>n\n t<li>Occasion : Lough, Smart, Dressy, Business, Casual and Formal.</li>n\n t<li>Sleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.</li>n\n t<li>Browse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.</li>n\n t<li>Care Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.</li>n\n</ol>n\n<img class=\"alignnone wp-image-858\" src=\"https://justelite.in/wp-content/uploads/2020/06/art-size.jpg\" alt=\"\" />n\n<h6 class=\"button\" style=\"padding: 0px 42%;\">Reviews</h6>n\n[/vc_column_text][/vc_column][/vc_row]"
), class = "factor")
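# Remove HTML tags (<...>) as well as square-bracket shortcode tags ([...]).
# The brackets must be escaped with a double backslash inside an R string.
cleanFun <- function(htmlString) {
  return(gsub("<.*?>|\\[.*?\\]", "", htmlString))
}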
cleanFun(s)
#> [1] "n\nDescriptionn\nn\n tIdeal for : Womenn\n tPackage Contents : 1 Pcsn\n tFit Type : Regular, Relaxed, Classic and Slim Fit.n\n tCare Instructions : Machine Wash and Normal Wash.n\n tOccasion : Lough, Smart, Dressy, Business, Casual and Formal.n\n tSleeve Type : Short sleeves, 3/4th Sleeves, Full Sleeves ,Kimono Sleeves and Off Shoulder.n\n tBrowse our Brand In Love for more choices of Shrugs , Tops and T shirt and Western wear Collections.n\n tCare Instructions : Ensure washing the tee in cold water, don't iron directly on the print and don't dry in direct sunlight.n\nn\nn\nReviewsn\n"

Created on 2020-09-16 by the reprex package (v0.3.0)
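
As a follow-up to the answer above, here is a minimal base R sketch of a slightly stricter variant. It is not part of the original answer: the names strip_markup and demo are hypothetical, and the input is a shortened, made-up string in the same style as the question's data. It uses negated character classes instead of lazy quantifiers, only treats [...] as a tag when the bracket starts with a letter, underscore, or slash, and collapses the leftover whitespace.

```r
# Hypothetical helper (not from the original answer); base R only.
strip_markup <- function(x) {
  x <- as.character(x)                         # make factors explicit character vectors
  x <- gsub("<[^>]*>", "", x)                  # drop HTML-style tags such as <li> or </ol>
  x <- gsub("\\[/?[A-Za-z_][^]]*\\]", "", x)   # drop shortcode-style tags such as [vc_row ...] or [/vc_row]
  x <- gsub("[[:space:]]+", " ", x)            # collapse runs of spaces, tabs and newlines
  trimws(x)                                    # remove leading and trailing whitespace
}

# Shortened, made-up input in the same style as the question's data.
demo <- factor("[vc_row type=\"full_width\"][vc_column]\n<h6 class=\"button\">Description</h6>\n<ol>\n\t<li>Ideal for : Women</li>\n</ol>\n[/vc_column][/vc_row]")

strip_markup(demo)
#> [1] "Description Ideal for : Women"
```

Whether this is preferable depends on the data: restricting the bracket pattern keeps text such as "[1]" or "[ note ]" intact, but it would also keep any shortcode whose name starts with a digit or a space.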
