R - Working with HTML stored in a data frame instead of a web page, and extracting formatting tags



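I am trying to extract formatting tags from a column of HTML (and then record, for each row, whether it is bold, italic, what color it is, and so on). I am trying to work out whether to use regular expressions or an HTML parser, and would like to be pointed toward the approach worth investing in. However, I cannot figure out how to parse HTML from a data frame column rather than from a URL. Also, could anyone provide some basic code that extracts any formatting tags present in the HTML (or even a list of all tags and attributes, which I could then filter down to the relevant ones in a manually compiled list)?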

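An example of the kind of HTML I am working with, from which I need to get the font size, font family, font color, background, and italics: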

<div align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; font-size: 10pt; font-family: 'Times New Roman', Times; color: #000000; background: #FFFFFF"> These forward-looking statements are also affected by the risk factors described below in Part I, Item 1A ("Risk Factors") and those set forth from time to time in our filings with the Securities and Exchange Commission ("SEC"), which are available through our website at <i>www.exterran.com </i>and through the SEC's Electronic Data Gathering and Retrieval System ("EDGAR") at <i><u>www.sec.gov</u></i>. Important factors that could cause our actual results to differ materially from the expectations reflected in these forward-looking statements include, among other things: </div>

A possible solution that uses the XML package rather than rvest is as follows:

# Single quotes inside the literal are escaped so that the R string parses.
htmlstring <- '<div align="left" style="margin-left: 0%; margin-right: 0%; text-indent: 0%; font-size: 10pt; font-family: \'Times New Roman\', Times; color: #000000; background: #FFFFFF"> These forward-looking statements are also affected by the risk factors described below in Part I, Item 1A ("Risk Factors") and those set forth from time to time in our filings with the Securities and Exchange Commission ("SEC"), which are available through our website at <i>www.exterran.com </i>and through the SEC\'s Electronic Data Gathering and Retrieval System ("EDGAR") at <i><u>www.sec.gov</u></i>. Important factors that could cause our actual results to differ materially from the expectations reflected in these forward-looking statements include, among other things: </div>'
# asText = TRUE states explicitly that the input is HTML content, not a file path or URL.
htmlstring <- XML::htmlParse(htmlstring, asText = TRUE)
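Because the question is about HTML held in a data frame column rather than at a URL, the same parsing call can be applied row by row. The following is a minimal sketch, not a definitive implementation: the data frame `df`, its `html` column, and the `has_italic` / `has_bold` flag columns are hypothetical names chosen for illustration.

```r
library(XML)

# Hypothetical data frame: one HTML fragment per row in the `html` column.
df <- data.frame(
  id   = 1:2,
  html = c(
    '<div style="font-size: 10pt; color: #000000">Plain <i>italic</i> text</div>',
    '<p><b>Bold</b> only</p>'
  ),
  stringsAsFactors = FALSE
)

# Parse each cell as text (asText = TRUE), not as a file path or URL.
docs <- lapply(df$html, htmlParse, asText = TRUE)

# Per-row flags: does the fragment contain any <i> or <b> element?
df$has_italic <- vapply(docs, function(d) length(getNodeSet(d, "//i")) > 0, logical(1))
df$has_bold   <- vapply(docs, function(d) length(getNodeSet(d, "//b")) > 0, logical(1))
```

This sketch only looks for `<i>` and `<b>` elements. Real documents may also mark emphasis with `<em>`, `<strong>`, or CSS such as `font-style` and `font-weight`, so the XPath expressions would need extending for those cases.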

You can then use XPath to find what you need, for example the italic parts:

XML::getNodeSet(htmlstring, '//i')
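The same XPath approach can also list every tag in a fragment and pull out the `style` attribute, which is where the font size, font family, color, and background live in the example. The sketch below uses a shortened copy of the sample HTML. The helper `parse_style` is a hypothetical name, and it assumes simple `property: value;` declarations with no semicolons inside the values.

```r
library(XML)

# Shortened version of the sample fragment from the question.
html <- '<div align="left" style="font-size: 10pt; font-family: \'Times New Roman\', Times; color: #000000; background: #FFFFFF">See <i>www.exterran.com</i> and <i><u>www.sec.gov</u></i>.</div>'
doc <- htmlParse(html, asText = TRUE)

# Names of all elements inside <body> (htmlParse wraps fragments in <html><body>).
tags <- xpathSApply(doc, "//body//*", xmlName)

# Raw value of every style attribute in the document.
styles <- xpathSApply(doc, "//*[@style]", xmlGetAttr, "style")

# Hypothetical helper: split one style string into a named character vector.
# Assumes plain "property: value;" pairs without semicolons inside values.
parse_style <- function(style) {
  decls <- trimws(strsplit(style, ";", fixed = TRUE)[[1]])
  decls <- decls[nzchar(decls)]
  keys  <- trimws(sub(":.*$", "", decls))
  vals  <- trimws(sub("^[^:]*:", "", decls))
  setNames(vals, keys)
}

props <- parse_style(styles[[1]])
props[c("font-size", "font-family", "color", "background")]
```

Running `table(tags)` on a larger document gives a quick inventory of the tags that occur, which can then be filtered down to a hand-picked list of formatting tags.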
