Extracting text from within HTML tags in a file



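I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date that is wrapped in a specific HTML tag: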

<abbr title="(this is the text I need)" data-utime="

What is the easiest way to do this?

If you are using Excel VBA, set a reference to the MSHTML library (in the References menu it is titled Microsoft HTML Object Library).

Sub ScrapeDateAbbr()
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String
    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    Close lFile 'release the file handle once the contents are read
    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml
    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem
End Sub

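I assumed the file was local because you called it a source file. If you need to download it first, you will need another reference, to MSXML, and this code: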

Sub ScrapeDateAbbrDownload()
    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send
    Do
        DoEvents
    Loop Until xHttp.readyState = 4
    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText
    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem
End Sub

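If you are using Java, you can use Jsoup. That isn't clear from your question, though; please explain in more detail exactly what you are trying to do.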
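
Since the question does not name a language, here is a rough sketch of the same filter as the VBA above (keep only `abbr` tags that carry a `data-utime` attribute, then read their `title`) using Python's standard-library `html.parser`. It is not from either answer; the class name `AbbrTitleParser`, the helper `extract_dates`, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class AbbrTitleParser(HTMLParser):
    """Collect the title of every <abbr> tag that has a data-utime attribute."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag != "abbr":
            return
        attr_map = dict(attrs)
        title = attr_map.get("title")
        # Mirror the VBA check: only tags with a non-empty data-utime value
        if attr_map.get("data-utime") and title is not None:
            self.titles.append(title)


def extract_dates(html):
    """Return the title values of all matching <abbr> tags, in document order."""
    parser = AbbrTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical sample markup, invented for this example
sample = (
    '<p><abbr title="Monday, January 2, 2012" data-utime="1325500000">Jan 2</abbr>'
    '<abbr title="no timestamp here">x</abbr></p>'
)
print(extract_dates(sample))
```

For a real file, you would read its contents into a string first and pass that to `extract_dates`. Which encoding to use when opening the file depends on the page and is not something the question states.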
