>我已经用vba编写了一个脚本,并结合了Selenium来解析网页中所有可用的公司名称。网页已激活延迟加载方法,因此每个滚动中只有 20 个链接可见。如果我滚动 2 次,那么可见的链接数量为 40,依此类推。该网页中有 1000 个链接可用。我的以下脚本可以到达该页面的底部,处理所有滚动并获取该网页中所有可用的名称。
但是,有必要在每次滚动后等待一段时间才能更新该网页的内容。这是我使用hardcoded delay
的地方,但是硬编码的过程非常不一致,有时它会在整个操作完成之前使浏览器退出。
我怎样才能.Wait 6000
修改这部分以使其Explicit Wait
而不是Hardcoded Wait
.
这是我到目前为止写的:
Sub Getlinks()
Dim driver As New ChromeDriver, prevlen&, curlen&
Dim posts As Object, post As Object
With driver
.get "http://fortune.com/fortune500/list/"
prevlen = .FindElementsByClass("company-title").Count
Do
prevlen = curlen
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
.Wait 6000 ''I like to kick out this hardcoded delay and use explicit wait in place
Set posts = .FindElementsByClass("company-title")
curlen = posts.Count
If prevlen = curlen Then Exit Do
Loop
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
这是一种完全不同的方法,不需要使用浏览器,而是提交一系列 Web 请求。使用此方法,等待页面加载不是问题。
通常,对于延迟加载页面,它会在您滚动时提交一个新请求以加载页面的数据。如果您监控网络流量,您可以发现发出的请求并模拟这些请求,我在下面完成了此操作。
结果应是公司名称列表,按升序排列,无论 Excel 的第一张是什么。
您需要的东西:
添加对以下内容的引用:
- Microsoft脚本运行时
- Microsoft XML v6.0
- 将 VBA-JSON 代码添加到项目中。你可以在这里找到
编辑
更改了代码以继续从站点拉取数据,直到列表中没有其他项目。感谢@Qharr指出这一点。
法典
Public Sub SubmitRequest()
Const baseURL As String = "http://fortune.com/api/v2/list/2358051/expand/item/ranking/asc/"
Dim Url As String
Dim startingNumber As Long
Dim j As Long
Dim getRequest As MSXML2.XMLHTTP60
Dim Json As Object
Dim Companies As Object
Dim Company As Variant
Dim CompanyArray As Variant
'Create an array to hold each company
ReDim CompanyArray(0 To 50000)
'Create a new XMLHTTP object so we can place a get request
Set getRequest = New MSXML2.XMLHTTP60
'The api seems to only support returning 100 records at a time
'So do in batches of 100
Do
'Build the url, the format is something like
'0/100, where 0 is the starting position, and 100 is the ending position
Url = baseURL & startingNumber & "/" & startingNumber + 100
With getRequest
.Open "GET", Url
.send
'The response is a JSON object, for this code to work -
'You'll need this code https://github.com/VBA-tools/VBA-JSON
'What is returned is a dictionary
Set Json = JsonConverter.ParseJson(.responseText)
Set Companies = Json("list-items")
'Keep checking in batches of 100 until there are no more
If Companies.Count = 0 Then Exit Do
'Iterate the dictionary and return the title (which is the name)
For Each Company In Companies
CompanyArray(j) = Company("title")
j = j + 1
Next
End With
startingNumber = startingNumber + 100
Loop
ReDim Preserve CompanyArray(j - 1)
'Dump the data to the first sheet
ThisWorkbook.Sheets(1).Range("A1:A" & j) = WorksheetFunction.Transpose(CompanyArray)
End Sub
你去吧:
Sub Getlinks()
Dim driver As New ChromeDriver
Dim pcount As Long, R as long
Dim posts As Object, post As Object
With driver
.get "http://fortune.com/fortune500/list/"
Do
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
Set posts = .FindElementsByClass("company-title")
pcount = posts.Count
Loop Until pcount = 1000
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
或者更好的是,边走边打印:
Sub Getlinksasyougo()
Dim driver As New ChromeDriver
Dim pcount As Long, R As Long, i As Long
Dim posts As Object, post As Object
With driver
.get "http://fortune.com/fortune500/list/"
i = 1
Do
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
Set posts = .FindElementsByClass("company-title")
pcount = posts.Count
If i <> pcount Then
For R = i To pcount - 1
Cells(R, 1) = posts(R + 1).Text
Next R
i = pcount
End If
Loop Until pcount = 1000
End With
End Sub
这是一种使用其中一个注释中讨论的"查找微调器元素"方法的方法,这有助于您避免指定您希望页面加载的元素数。微调器的类名实际上会根据它是否可见而变化,这使得在获取页面元素之前等待微调器变得可见 + 再次消失变得非常容易。
此方法仍涉及一些等待;默认情况下,它在每次尝试查找微调器后等待 1/10 秒,直到找到微调器或达到最大尝试次数。但这比每次等待 5 秒要快得多。
另外,不相关,但不要一次向单元格写入一个内容,这真的很慢。先将其写入数组+一次写入整个数组要快得多。
Sub getLinks()
Dim bot As New ChromeDriver
bot.Get "http://fortune.com/fortune500/list/"
Dim posts As WebElements
Dim numPosts As Long
Dim finishedScrolling As Boolean
finishedScrolling = False
Do Until finishedScrolling
'Set beginning post count and scroll down
Dim startPosts As Long
startPosts = numPosts
bot.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"
'Wait for spinner to become visible, then wait for up to 5 seconds for rehide
Call waitForElements(bot, "div[class^='F500-spinner ']", 50)
Call waitForElements(bot, "div[class^='F500-spinner hide']", 50)
'See if any new posts have loaded
Set posts = bot.FindElementsByClass("company-title")
numPosts = posts.Count
If numPosts = startPosts Then
finishedScrolling = True
End If
Loop
'Write text to results array
Dim post As WebElement
ReDim resultsArr(1 To posts.Count, 1 To 1) As String
Dim i As Long
i = 1
For Each post In posts
resultsArr(i, 1) = post.Text
i = i + 1
Next
'Write array to sheet
With ActiveSheet
.Range(.Cells(1, 1), .Cells(UBound(resultsArr, 1), 1)).Value = resultsArr
End With
End Sub
Sub waitForElements(bot As WebDriver, css As String, maxAttempts As Long, Optional waitTimeMS As Long = 100)
'Use a CSS selector string to wait for element(s) to appear on a page or to reach max number of attempts
'By default, bot waits 0.1 second after each attempt
Dim i As Long
Dim foundElem As Boolean
foundElem = False
Do Until foundElem
i = i + 1
If bot.FindElementsByCss(css).Count > 0 Then
foundElem = True
ElseIf i = maxAttempts Then
foundElem = True
Else
bot.Wait waitTimeMS
End If
Loop
End Sub
定义超时(允许经过的指定时间段(以摆脱硬编码延迟。超时需要硬编码。
此代码与原始代码之间的区别是:
- 循环本身一遍又一遍地运行(每次迭代不等待 6 秒(并检查新内容,直到找到新内容或达到超时。
- 如果延迟加载花费的时间比预期的要长,例如,当加载数字 21 到 50 时,循环将"等待"并尝试在超时中定义的最大时间内获取新内容。
- 缺点:在加载所有内容的最后一步,循环将花费与超时设置的秒数一样多的时间。
法典:
Sub Getlinks()
Dim driver As New ChromeDriver, prevlen&, curlen&
Dim posts As Object, post As Object
Dim timeout As Integer, startTime As Double
timeout = 10 ' set the timeout to 10 seconds
With driver
.get "http://fortune.com/fortune500/list/"
prevlen = .FindElementsByClass("company-title").Count
startTime = Timer ' set the initial starting time
Do
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
Set posts = .FindElementsByClass("company-title")
curlen = posts.Count
If curlen > prevlen Then
startTime = Timer ' reset start time if new elements found
prevlen = curlen ' set new prevlen
End If
Loop While Round(Timer - startTime, 2) <= timeout ' check if timeout is reached
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub
我不知道这是否有帮助,因为它仍然是一个"硬编码"解决方案,但您可以尝试延迟函数而不是等待函数,看看这是否有助于解决程序退出问题。
Function Delay(Seconds As Single)
Dim StopTime As Single: StopTime = Timer + Seconds
Do While Timer < StopTime
DoEvents
Loop
End Function
我想你快到了。
虽然我认为你不能避免等待,但解决方法是在你向下滚动时保持多次检查新帖子,等待时间更短。
下面的示例是检查新帖子 5 次,每次等待 2 秒,因此在声明页面结束之前总共需要 10 秒。调整这两个参数以适合。
Sub Getlinks()
Dim driver As New ChromeDriver, prevlen&, curlen&
Dim posts As Object, post As Object
' Counter for number of times when there are NO NEW POSTS
Dim NoIncreaseCount As Integer
Const MaxNoIncreaseCount As Integer = 5
Const WaitTime As Integer = 2000 ' 2 seconds wait time each scroll down
With driver
.get "http://fortune.com/fortune500/list/"
prevlen = .FindElementsByClass("company-title").Count
NoIncreaseCount = 0
Do Until NoIncreaseCount = MaxNoIncreaseCount
.ExecuteScript ("window.scrollTo(0, document.body.scrollHeight);")
.Wait WaitTime
Set posts = .FindElementsByClass("company-title")
curlen = posts.Count
If prevlen < curlen Then
' There are new Posts
prevlen = curlen
NoIncreaseCount = 0
Else
' No new Posts
NoIncreaseCount = NoIncreaseCount + 1
End If
Loop
For Each post In posts
R = R + 1: Cells(R, 1) = post.Text
Next post
End With
End Sub