Python: Selenium & PhantomJS



I am trying to scrape the following website: https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0

The text I want to get is:

Showing 114,877 results

The HTML code:

<div class="jobs-search-results__count-sort pt3">
<div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
Showing 114,877 results
</div>

My Python code is:

index_url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
java = '!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);'
browser = webdriver.PhantomJS()
browser.get(index_url)
browser.execute_script(java)
soup = BeautifulSoup(browser.page_source, "html.parser")
link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4" 
div = soup.find('div', {"class":link})
text = div.text

So far my code does not seem to work. I think it has something to do with executing the JavaScript.

I get the following error:


AttributeError                            Traceback (most recent call last)
<ipython-input-33-7cdc1c4e0894> in <module>()
6 link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
7 div = soup.find('div', {"class":link})
----> 8 text = div.text
AttributeError: 'NoneType' object has no attribute 'text'

The contents of the soup variable:

<html><head>
<script type="text/javascript">
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
      "&originalReferer=" + document.referrer.substr(0, 200) +
      "&sessionRedirect=" + encodeURIComponent(window.location.href);
}
</script>
</head><body></body></html>

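My solution uses webdriver.Chrome, because I have never used PhantomJS. To get the results text there are two cases: either you are logged in to LinkedIn from the driver instance, or you are not.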
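A quick way to tell which of the two cases applies is to look at the page source itself: the stub page shown in the question does nothing except redirect to an /authwall URL. The following is only a minimal sketch under that assumption. The helper name looks_like_authwall is hypothetical, and the "/authwall" marker is taken from the soup output in the question, so LinkedIn may use a different mechanism today.

```python
def looks_like_authwall(page_source):
    """Return True if the HTML looks like the login-redirect stub.

    Assumption: the stub contains the string "/authwall", as in the
    soup output posted in the question.
    """
    return "/authwall" in page_source


# Hypothetical samples standing in for driver.page_source
stub = ('<html><head><script>window.location.href = '
        '"https://www.linkedin.com/authwall?trk=bf";</script></head>'
        '<body></body></html>')
real = '<div class="results-count-string">Showing 114,877 results</div>'

print(looks_like_authwall(stub))
print(looks_like_authwall(real))
```

In a real session you would pass driver.page_source to the helper right after driver.get(url) and only continue parsing when it returns False.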

Suppose you are not logged in. Then the following code will do the job:

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
text = soup.find('div',{'class':'results-context'}).text
print(text)

Suppose you are logged in:

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
# "class" is a reserved word in Python, so use a different variable name
class_name = 'jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4'
# The div text starts with a newline; take the second line and strip the indentation
text = soup.find('div',{'class':class_name}).text.split('\n')[1].lstrip()
print(text)
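
If you need the number rather than the whole sentence, the extracted text can be reduced to an integer with a regular expression. This is only a sketch: parse_result_count is a hypothetical helper name, and it assumes the text has the form "Showing 114,877 results" with commas as thousands separators.

```python
import re


def parse_result_count(text):
    """Extract the first number from text such as 'Showing 114,877 results'.

    Returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop the thousands separators before converting to int
    return int(match.group().replace(",", ""))


count = parse_result_count("\n  Showing 114,877 results\n")
print(count)
```

Returning None instead of raising keeps the caller in control when the page layout changes and the count is missing.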
