Why don't BeautifulSoup and lxml work?

I am using the mechanize library to log in to a website. I checked, and the login works fine. The problem is that I can't use response.read() with BeautifulSoup or lxml.

#BeautifulSoup
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source)  #source.txt doesn't work either
for link in soup.findAll('a', {'class':'someClass'}):
    some_list.add(link)

This doesn't work: it finds no tags at all. When I use requests.get(url) instead, it works fine.

#lxml->html
response = browser.open(url)
source = response.read()
tree = html.fromstring(source)  #source.txt doesn't work either
print tree.text
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]')  #/text() doesn't work either
print like_pages

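This prints nothing. I know the problem is with the return type of response, because the same parsing works fine with requests.open(). What can I do? Could you provide sample code that uses response.read() for HTML parsing?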

By the way, what is the difference between a response object and a request object?

Thanks

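I found the solution. mechanize.Browser is an emulated browser, so it only fetches the raw HTML. The page I wanted to scrape adds the classes to its tags with JavaScript, so those classes are not in the raw HTML. The best option is to use a web driver. I used Selenium with Python. Here is the code: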

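from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get(url)  # a real browser runs the page's JavaScript before we query it
# Older Selenium API; Selenium 4 spells this driver.find_elements(By.XPATH, ...)
links = driver.find_elements_by_xpath('//a[@class="someClass"]')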

Note: you need to have Firefox installed. Alternatively, you can choose a different profile depending on the browser you want to use.
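
To see why the raw HTML is not enough, here is a minimal sketch using only the Python 3 standard library. The page content, the `ClassLinkFinder` class and the `find_links` helper are made up for illustration; `RAW_HTML` stands in for the bytes that `response.read()` returns, and `RENDERED_HTML` is a hand-written approximation of what the DOM looks like after a browser has run the script.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends the <a> tag without a class, and a
# script adds the class "someClass" only when a browser executes it.
RAW_HTML = b"""
<html><body>
<a id="buy" href="/item/1">Buy</a>
<script>
  document.getElementById("buy").className = "someClass";
</script>
</body></html>
"""

# Hand-written stand-in for the page after the script has run.
RENDERED_HTML = RAW_HTML.replace(b'<a id="buy"', b'<a id="buy" class="someClass"')


class ClassLinkFinder(HTMLParser):
    """Collects the href of every <a> tag that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes:
            self.hrefs.append(attr_map.get("href"))


def find_links(raw_bytes, wanted_class):
    # A static parser only sees the markup; it never executes <script>.
    parser = ClassLinkFinder(wanted_class)
    parser.feed(raw_bytes.decode("utf-8"))
    parser.close()
    return parser.hrefs


static_hits = find_links(RAW_HTML, "someClass")
rendered_hits = find_links(RENDERED_HTML, "someClass")
print(static_hits)    # no match: the class is absent from the raw markup
print(rendered_hits)  # the link is found once the class is in the markup
```

BeautifulSoup and lxml are static parsers in the same sense, so they see the same classless markup when given the output of `response.read()`.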

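A request is what the web client sends to the server: the URL the client wants, the HTTP verb to use (GET, POST, etc.), and, if you are submitting a form, usually the data you entered in the form. A response is the web server's reply to the client's request. It has a status code that indicates whether the request succeeded (usually 200 if there were no problems, or an error code such as 404 or 500). A response usually contains data, such as the HTML of a page or the binary data of a JPEG. It also has headers that give more information about that data (for example, the "Content-Type" header, which says what format the data is in).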

Quoted from @davidbuxton's answer at this link.
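
The parts of a response described above (status code, headers, body) can be inspected directly. The following is a rough sketch with the Python 3 standard library rather than mechanize or requests. `DemoHandler`, the `/page` path and the throwaway local server are assumptions made so that the example runs without network access.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers GET /page, 404 for anything else."""

    def do_GET(self):
        if self.path == "/page":
            body = b"<html><body><a class='someClass' href='/x'>x</a></body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port for the throwaway server.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = "http://127.0.0.1:%d" % server.server_port

# An opener with no proxies, so the request goes straight to localhost.
opener = build_opener(ProxyHandler({}))

# The request: a GET for a URL. The response: status, headers, body.
with opener.open(base_url + "/page") as response:
    ok_status = response.status
    content_type = response.headers["Content-Type"]
    body = response.read()  # raw bytes, ready to hand to an HTML parser

# A failed request still yields a response, just with an error status code.
try:
    opener.open(base_url + "/missing")
    missing_status = None
except HTTPError as err:
    missing_status = err.code
    err.close()

server.shutdown()
server.server_close()
print(ok_status, content_type, len(body), missing_status)
```

In requests, the same pieces are exposed on the returned object as `status_code`, `headers`, and `content` or `text`.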

Good luck
