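I am trying to develop a web scraping tool. I have a Python script and some JavaScript code; the Python script calls the JavaScript. The JavaScript extracts the relevant content from the web page and returns it to the Python script. The JavaScript works fine when I run it manually in the browser. Here is my JS code: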
var doc = ""
var path1 = document.getElementsByClassName("entry-header")[0]
doc = doc + path1.innerText
doc = doc + "\n"
var path2 = document.getElementsByClassName("entry-content")[0]
var cont = path2.getElementsByTagName("p")
for (var i=0; i<cont.length; i++)
{
doc = doc+cont[i].innerText
doc = doc+ "\n"
}
res()
function res()
{
return doc
}
Here is my Python code:
from selenium import webdriver
js = open("generalized.js", "r").read()
driver = webdriver.Firefox()
browser = webdriver.Firefox()
browser.get("http://www.geeksforgeeks.org/branch-and-bound-set-1-introduction-with-01-knapsack/")
result = driver.execute_script(js)
print result
But when it is called from Python, I get the following error:
Traceback (most recent call last):
File "sample.py", line 7, in <module>
result = driver.execute_script(js)
File "/home/sagar/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 543, in execute_script
'args': converted_args})['value']
File "/home/sagar/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 308, in execute
self.error_handler.check_response(response)
File "/home/sagar/anaconda2/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: TypeError: p[0] is undefined
Please help me solve this problem. Or is there another way to do web scraping?
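For some reason you are starting two browsers: the page is loaded in browser, but the script is executed in driver, which still has an empty page open. This works for me: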
from selenium import webdriver
import time
js = open("generalized.js", "r").read()
browser = webdriver.Firefox()
browser.get("http://www.geeksforgeeks.org/branch-and-bound-set-1-introduction-with-01-knapsack/")
time.sleep(1) # try to replace with an Explicit Wait
result = browser.execute_script(js)
print(result)
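The modified script, with a top-level return doc (execute_script runs the script as the body of an anonymous function, so only a top-level return hands a value back to Python; the value returned inside res() was being discarded):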
var doc = "";
var path1 = document.getElementsByClassName("entry-header")[0];
doc = doc + path1.innerText;
doc = doc + "\n";
var path2 = document.getElementsByClassName("entry-content")[0];
var cont = path2.getElementsByTagName("p");
for (var i=0; i<cont.length; i++)
{
doc = doc+cont[i].innerText;
doc = doc+ "\n";
}
return doc;