Web scraping an HTML page that has no tags as delimiters



I am trying to import all the lines of text from a web page into a string array. The URL is here: Vaticano-La Sacra Bibbia-Genesi-Cap.1.

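Unfortunately (perhaps by the web designer's choice), the tags have no ID or CLASS. All lines are separated by one or more <BR> elements. The start and end of the text are separated from a simple menu by 2 tags. A clean excerpt of the page code is here: jsfiddle.

I found a way to get at the text. What I do now in VBA is: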

Note: objDoc is a Public variable from another module, populated from .responseText without problems.

    Public Sub ScriviXHTML(strBook As String, intNumCap As Integer)
        Dim strDati2 As String
        Dim TagBr As IHTMLElementCollection
        Dim BrElement As IHTMLElement
        Dim intElement As Integer
        Dim objChild As Object
        ' Each variable needs its own "As String"; otherwise all but the last are Variants.
        Dim strData As String, strTextCont As String, strNodeVal As String, strWholeText As String

        Set objDoc2 = New HTMLDocument
        Set objDoc2 = objDoc
        Set objDoc = Nothing

        ' Put the HTML code of the web page into a string variable.
        strDati2 = objDoc2.body.innerHTML
        ' Collect every BR element in the page.
        Set TagBr = objDoc2.body.getElementsByTagName("BR")

        ' Loop over all BRs in the page.
        For Each BrElement In TagBr
            ' Here I try to get the NextSibling of the <br>,
            ' because it seems to contain the text I'm looking for.
            Set objChild = BrElement.NextSibling
            With objChild
                ' Here I try to read the text into the variables.
                ' This is where run-time error 438 sometimes occurs.
                strData = Trim("" & .Data & "")
                strTextCont = Trim("" & .textContent & "")
                strNodeVal = Trim("" & .NodeValue & "")
                strWholeText = Trim("" & .wholeText & "")
            End With
            intElement = intElement + 1
        Next BrElement
    End Sub

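Two questions:
1) In your opinion, is this the best way to achieve what I want to do?
2) Sometimes Element.NextSibling.Data does not exist and I get run-time error '438', so I manually move the execution point of the routine to get past the error. How can I intercept this error? (Please, not a simple On Error Resume Next!) Better still: how can I use an If ... Then ... End If statement to check whether the Data member exists in NextSibling?
Thanks.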

Well, you can get all the text as follows:

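    ' Requires a reference to "Microsoft HTML Object Library" for HTMLDocument.
    Public Sub GetInfo()
        Dim sResponse As String, xhr As Object, html As New HTMLDocument
        Set xhr = CreateObject("MSXML2.XMLHTTP")
        With xhr
            ' Synchronous GET request for the page.
            .Open "GET", "http://www.vatican.va/archive/ITA0001/__P1.HTM", False
            .send
            ' Convert the raw response bytes to a VBA string.
            sResponse = StrConv(.responseBody, vbUnicode)
            ' Discard anything that precedes the DOCTYPE declaration.
            sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
            html.body.innerHTML = sResponse
            ' Strip all tags, then replace non-breaking spaces and line feeds
            ' with ordinary spaces; the result goes into cell A1 of the active sheet.
            [A1] = Replace$(Replace$(regexRemove(html.body.innerHTML, "<([^>]+)>"), " &nbsp;", Chr$(32)), Chr$(10), Chr$(32))
        End With
    End Sub

    ' Removes every match of pattern from s and returns the result.
    Public Function regexRemove(ByVal s As String, ByVal pattern As String) As String
        Dim regex As Object
        Set regex = CreateObject("VBScript.RegExp")
        With regex
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .pattern = pattern
        End With
        If regex.test(s) Then
            regexRemove = regex.Replace(s, vbNullString)
        Else
            regexRemove = s
        End If
    End Function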
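
As for the second question, one possible way to avoid error 438 without On Error Resume Next is to test what kind of node NextSibling is before reading text from it. The sketch below is only an illustration under a few assumptions: TextLinesAfterBr and DemoTextLinesAfterBr are hypothetical names, the HTML in the demo is made up, and the project references "Microsoft HTML Object Library". It relies on the DOM convention that nodeType 3 denotes a text node; an element sibling such as another <br> has nodeType 1 and is skipped.

```vba
' Hypothetical helper: returns the text that directly follows each <br>.
' Requires a reference to "Microsoft HTML Object Library".
Public Function TextLinesAfterBr(ByVal doc As HTMLDocument) As Collection
    Dim lines As New Collection
    Dim br As Object      ' late-bound, so any node member can be called
    Dim node As Object
    Dim txt As String

    For Each br In doc.getElementsByTagName("BR")
        Set node = br.nextSibling
        ' nextSibling is Nothing when the <br> is the last child of its parent.
        ' The checks are nested because VBA's And does not short-circuit.
        If Not node Is Nothing Then
            ' nodeType 3 = text node; other node types are skipped.
            If node.nodeType = 3 Then
                txt = Trim$(node.nodeValue & vbNullString)
                If Len(txt) > 0 Then lines.Add txt
            End If
        End If
    Next br

    Set TextLinesAfterBr = lines
End Function

' Hypothetical usage with made-up HTML; prints each line found.
Public Sub DemoTextLinesAfterBr()
    Dim doc As New HTMLDocument
    Dim item As Variant

    doc.body.innerHTML = "intro<br>first line<br><br><b>bold</b><br>second line"
    For Each item In TextLinesAfterBr(doc)
        Debug.Print item
    Next item
End Sub
```

Note that text appearing before the first <br> of a parent (here "intro") is not the sibling of any <br>, so this approach would miss it; looping over the parent's childNodes with the same nodeType test may be a better fit if that text matters.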
