How can we use Selenium WebDriver in colab.research.google.com?



I want to use the Selenium WebDriver for Chrome in colab.research.google.com for fast processing. I was able to install Selenium with !pip install selenium, but the Chrome webdriver needs a path to webdriverChrome.exe. How am I supposed to use it?

P.S. colab.research.google.com is an online platform that provides GPUs for fast computation on deep-learning problems. Please refrain from solutions such as webdriver.Chrome(path).

Google Colab was recently upgraded, and since Ubuntu 20.04+ no longer distributes chromium-browser outside of a snap package, you can install a compatible version from the Debian buster repository:

%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap
# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF
# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A
apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg
# Prefer debian repo for chromium* packages only
# Note: entries must be separated by blank lines
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500

Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300

Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF
# Install chromium and chromium-driver
apt-get update
apt-get install chromium chromium-driver
# Install selenium
pip install selenium

Then you can run Selenium like this:

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.headless = True
wd = webdriver.Chrome('chromedriver',options=chrome_options)
wd.get("https://www.website-url.com")

This one works in Colab:

!pip install selenium
!apt-get update 
!apt install chromium-chromedriver
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)

I made my own library to simplify it.

!pip install kora -q
from kora.selenium import wd
wd.get("https://www.website.com")

PS: I no longer remember how I searched and experimented until it worked, but I first wrote and shared it in this gist in December 2018.

Not enough reputation to comment. :(

@Thomas's answer still works as of 06.10.2021, but it needs one simple change, because right off the bat you will get DeprecationWarning: use options instead of chrome_options

Working code below:

!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=options)
wd.get("https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com")
wd.title

To use Selenium in Google Colab, run the following step in a Colab notebook:

!pip install kora -q

How to use it in Colab:

from kora.selenium import wd
wd.get("enter any website here")

You can also use it with Beautiful Soup:

import bs4 as soup
wd.get("enter any website here")
html = soup.BeautifulSoup(wd.page_source, "html.parser")  # name the parser explicitly to avoid bs4's warning

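Google Colab now uses Ubuntu 20.04, and you can't install chromium-browser there without snap. But you can install it from the Ubuntu 18.04 .deb files at security.ubuntu.com/ubuntu/pool/universe/c/chromium-browser/.

I made a Python script for this. It finds the latest version of chromium-browser and chromedriver for 18.04 and installs them on Google Colab, which runs Ubuntu 20.04.

The links on that site are updated regularly. You don't need the Debian repositories or apt keys.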

# Run this in a Colab/IPython cell: the lines starting with "!" are IPython shell escapes
import os
import re
import subprocess
import requests

# The deb files we need to install
deb_files_startstwith = [
    "chromium-codecs-ffmpeg-extra_",
    "chromium-codecs-ffmpeg_",
    "chromium-browser_",
    "chromium-chromedriver_"
]


def get_latest_version() -> str:
    # A request to security.ubuntu.com for getting latest version of chromium-browser
    # e.g. "112.0.5615.49-0ubuntu0.18.04.1_amd64.deb"
    url = "http://security.ubuntu.com/ubuntu/pool/universe/c/chromium-browser/"
    r = requests.get(url)
    if r.status_code != 200:
        raise Exception("Status code is not 200!")
    text = r.text
    # Find latest version (re.search returns the first matching link in the listing)
    pattern = r'<a\shref="chromium-browser_([^"]+.ubuntu0.18.04.1_amd64.deb)'
    latest_version_search = re.search(pattern, text)
    if latest_version_search:
        latest_version = latest_version_search.group(1)
    else:
        raise Exception("Cannot find latest version!")
    return latest_version


def download(latest_version: str, quiet: bool):
    deb_files = []
    for deb_file in deb_files_startstwith:
        deb_files.append(deb_file + latest_version)
    for deb_file in deb_files:
        url = f"http://security.ubuntu.com/ubuntu/pool/universe/c/chromium-browser/{deb_file}"
        # Download deb file
        if quiet:
            command = f"wget -q -O /content/{deb_file} {url}"
        else:
            command = f"wget -O /content/{deb_file} {url}"
        print(f"Downloading: {deb_file}")
        # os.system(command)
        !$command
        # Install deb file
        if quiet:
            command = f"apt-get install /content/{deb_file} >> apt.log"
        else:
            command = f"apt-get install /content/{deb_file}"
        print(f"Installing: {deb_file}\n")
        # os.system(command)
        !$command
        # Delete deb file from disk
        os.remove(f"/content/{deb_file}")


def check_chromium_installation():
    try:
        subprocess.call(["chromium-browser"])
        print("Chromium installation successful.")
    except FileNotFoundError:
        print("Chromium installation failed!")


def install_selenium_package(quiet: bool):
    if quiet:
        !pip install selenium -qq >> pip.log
    else:
        !pip install selenium


def main(quiet: bool):
    # Get the latest version of chromium-browser for ubuntu 18.04
    latest_version = get_latest_version()
    # Download and install chromium-browser for ubuntu 20.04
    download(latest_version, quiet)
    # Check if installation was successful
    check_chromium_installation()
    # Finally install selenium package
    install_selenium_package(quiet)


if __name__ == '__main__':
    quiet = True  # verboseness of wget and apt
    main(quiet)
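
The version lookup in the script above can be tried offline before pointing it at the real server. This is a sketch under stated assumptions, not the script's own logic: the HTML excerpt is made up, `version_key` and `latest_version` are hypothetical helpers, and instead of taking the first match (as `re.search` does) it collects every match and picks the numerically highest upstream version.

```python
import re

# Made-up excerpt of the directory listing, for illustration only. The second
# file name reuses the example from the script's comment; the first is invented.
SAMPLE_LISTING = """
<a href="chromium-browser_111.0.0.1-0ubuntu0.18.04.1_amd64.deb">older</a>
<a href="chromium-browser_112.0.5615.49-0ubuntu0.18.04.1_amd64.deb">newer</a>
"""

# Capture everything after "chromium-browser_" for Ubuntu 18.04 amd64 builds.
PATTERN = re.compile(
    r'<a\shref="chromium-browser_([^"]+ubuntu0\.18\.04\.1_amd64\.deb)"'
)


def version_key(file_suffix):
    """Turn '112.0.5615.49-0ubuntu0...' into (112, 0, 5615, 49) for numeric comparison."""
    upstream = file_suffix.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def latest_version(listing_html):
    """Return the highest-versioned 18.04 file suffix found in the listing."""
    candidates = PATTERN.findall(listing_html)
    if not candidates:
        raise ValueError("no chromium-browser package for 18.04 found")
    return max(candidates, key=version_key)


if __name__ == "__main__":
    print(latest_version(SAMPLE_LISTING))
```

Comparing tuples of integers rather than strings matters once version components differ in length, since a plain string comparison would rank "99" above "112".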

And try Selenium:

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
wd = webdriver.Chrome('chromedriver', options=chrome_options)
wd.get("https://www.google.com")
print(f"Page title: {wd.title}")

How to extract data from whoscored.com with Colab and Selenium?

#    https://www.whoscored.com
# install chromium, its driver, and selenium
!apt update
!apt install chromium-chromedriver
!pip install selenium
# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# open it, go to a website, and get results
wd = webdriver.Chrome(options=options)
wd.get("https://www.whoscored.com")
print(wd.page_source)  # results

Install the libraries:

!pip install selenium
!apt-get update
!apt install chromium-chromedriver

And set up a Chrome driver:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Set the path to the chromedriver executable
chromedriver_path = '/usr/bin/chromedriver'
# Set the Chrome driver options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Start the Chrome driver
driver = webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)
# Navigate to a website
driver.get('https://www.example.com')
# Quit the driver
driver.quit()

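If you encounter an error such as "WebDriverException: Message: Service chromedriver unexpectedly exited. Status code was: 1":

In the notebook page, press Ctrl + Shift + P, select "Use fallback runtime version", and try again.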
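
Before switching runtimes, it can be worth finding out why chromedriver exits. A minimal sketch, assuming the driver is expected on PATH under the name `chromedriver`; `chromedriver_diagnostics` is a hypothetical helper, and `--version` is the only chromedriver flag it uses.

```python
import shutil
import subprocess


def chromedriver_diagnostics(binary="chromedriver"):
    """Return the driver's version string, or a short description of what went wrong."""
    path = shutil.which(binary)
    if path is None:
        return f"{binary} not found on PATH"
    # Run the binary directly so its own error output is visible,
    # instead of Selenium's generic "unexpectedly exited" message.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    if result.returncode != 0:
        detail = result.stderr.strip() or result.stdout.strip()
        return f"{path} exited with status {result.returncode}: {detail}"
    return result.stdout.strip()


if __name__ == "__main__":
    print(chromedriver_diagnostics())
```

If the message mentions snap, the installed package is likely the transitional snap wrapper rather than a real driver, which is the situation the Debian and 18.04 workarounds earlier in this thread address.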

You can get rid of the .exe file by using WebDriverManager. Instead of this:

System.setProperty("webdriver.gecko.driver", "driverpath/.exe");
WebDriver driver = new FirefoxDriver();

you would write this:

WebDriverManager.firefoxdriver().setup();
WebDriver driver = new FirefoxDriver();

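All you need is to add the dependency to the POM file (I am assuming you are using Maven or some other build tool). See my full answer on how to use it at this link: Using WebDriverManager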
