我有一个文件要从中提取日期,它是一个HTML源文件,所以里面充满了我不需要的代码和短语。我需要提取包裹在特定HTML标签中的日期的每个实例:
缩写title="(这是我需要的文本)"data utile="
实现这一点最简单的方法是什么?
如果使用Excel VBA,请设置对MSHTML库的引用(引用菜单中标题为Microsoft HTML Object Library
)
Sub ScrapeDateAbbr()
Dim hDoc As MSHTML.HTMLDocument
Dim hElem As MSHTML.HTMLGenericElement
Dim sFile As String, lFile As Long
Dim sHtml As String
'read in the file
lFile = FreeFile
sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
Open sFile For Input As lFile
sHtml = Input$(LOF(lFile), lFile)
'put into an htmldocument object
Set hDoc = New MSHTML.HTMLDocument
hDoc.body.innerHTML = sHtml
'loop through abbr tags
For Each hElem In hDoc.getElementsByTagName("abbr")
'only those that have a data-utime attribute
If Len(hElem.getAttribute("data-utime")) > 0 Then
'get the title attribute
Debug.Print hElem.getAttribute("title")
End If
Next hElem
End Sub
我以为这个文件是本地的,因为你调用了一个源文件。如果你需要先下载它,你需要另一个参考MSXML和这个代码
Sub ScrapeDateAbbrDownload()
Dim xHttp As MSXML2.XMLHTTP
Dim hDoc As MSHTML.HTMLDocument
Dim hElem As MSHTML.HTMLGenericElement
Set xHttp = New MSXML2.XMLHTTP
xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
xHttp.send
Do
DoEvents
Loop Until xHttp.readyState = 4
'put into an htmldocument object
Set hDoc = New MSHTML.HTMLDocument
hDoc.body.innerHTML = xHttp.responseText
'loop through abbr tags
For Each hElem In hDoc.getElementsByTagName("abbr")
'only those that have a data-utime attribute
If Len(hElem.getAttribute("data-utime")) > 0 Then
'get the title attribute
Debug.Print hElem.getAttribute("title")
End If
Next hElem
End Sub
如果您使用的是Java,则可以使用Jsoup。这从你的问题中是不清楚的,请详细说明你到底想做什么