Scraping a dynamic website with a Python script: how do I get the values?

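I'm trying to scrape information from a website. So far I've been able to access the page, log in with a username and password, and then write the landing page's source to a separate .html/.txt file as needed.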

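Here's where the problem arises: on that landing page there is a table I want to scrape data from. If I manually right-click any of the integers in the table and select "Inspect", I find the integer with no problem. However, when I look at the page source as a whole, I don't see the integers, only the variable/parameter names. This leads me to believe the site is dynamic, i.e. the values are filled in by JavaScript after the initial HTML has loaded.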

How do I collect the data?

I've been going around in circles trying to scrape this site. Here is how the available tools have worked out for me so far:

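Firefox, IE, and Opera do not render the table. My guess is that this is a problem on the website's end. Only Chrome seems to work when I log in manually.
Selenium's Chromium package kept failing on me (on my Windows 7 laptop), and I even posted a separate question about it. For now I consider it a lost cause, but I'd gladly accept anyone's help.
Spynner's description looked promising, but its setup has frustrated me for quite some time, and the lack of a clear introduction only makes things harder for a beginner like me.
I prefer to code in Python, since it's the language I'm most comfortable with. I have a pending request for my company to install Visual Studio on my computer (to try this in C#), but I'm not holding my breath.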

In case my code is of any use, here is what I have so far, using Selenium with PhantomJS:

# Headless Browsing Using PhantomJS and Selenium
#
# PhantomJS is installed in current directory
#
from selenium import webdriver
import time

browser = webdriver.PhantomJS()
browser.set_window_size(1120, 550)  # need a fake browser size to fetch elements

def login_entry(username, password):
    login_email = browser.find_element_by_id('UserName')
    login_email.send_keys(username)
    login_password = browser.find_element_by_id('Password')
    login_password.send_keys(password)
    submit_elem = browser.find_element_by_xpath("//button[contains(text(), 'Log in')]")
    submit_elem.click()

browser.get("https://www.example.com")
login_entry('usr_name', 'pwd')
time.sleep(10)

test_output = open('phantomjs_test_source_output.html', 'w')
test_output.write(repr(browser.page_source))
test_output.close()
browser.quit()

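P.S. If anyone thinks I should tag this question with javascript, please let me know. I don't know JavaScript myself, but I have a feeling it may be part of the problem or the solution.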

Try something like this. With dynamic pages you sometimes need to wait for the data to load before reading it.

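from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# my_driver: your WebDriver instance; my_time: timeout in seconds;
# my_expected_element: a locator tuple such as (By.ID, 'some-id'),
# with By imported from selenium.webdriver.common.by
WebDriverWait(my_driver, my_time).until(EC.presence_of_all_elements_located(my_expected_element))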
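
To make the idea of an explicit wait more concrete, here is a minimal pure-Python sketch of the polling loop that a call like `WebDriverWait(...).until(...)` performs, roughly: call a condition repeatedly until it returns something truthy or a timeout expires. The names `wait_until` and `WaitTimeout` are hypothetical and exist only for this illustration; this is not Selenium's own implementation.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        # Sleep briefly so the page has time to load more data.
        time.sleep(poll_interval)
```

With a driver, the condition could be something like `lambda: browser.find_elements_by_tag_name('td')`, which returns an empty (falsy) list until the table cells exist. Unlike a fixed `time.sleep(10)`, this returns as soon as the data is there and fails loudly if it never arrives.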

http://selenium-python.readthedocs.io/waits.html
https://seleniumhq.github.io/selenium/docs/api/py/webdriver_support/selenium.webdriver.support.expected_conditions.html
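
Once the wait has succeeded, the rendered source should contain the actual values, and they can be pulled out of the table. The following is a minimal sketch using only the standard library, assuming the data sits in ordinary `<tr>`/`<td>` markup (the real page's structure is not shown in the question, so this is a guess). `TableCellParser` and `extract_table_rows` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed


def extract_table_rows(html):
    """Return a list of rows, each a list of stripped cell strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

A call such as `extract_table_rows(browser.page_source)` would then give a list of rows whose cells can be converted with `int(...)` where appropriate.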
