Print all search results from a custom site



When I give it a command, how can my program print all the search results from a custom site (in my example: https://discordpy.readthedocs.io/en/latest/search.html?q=test)?

I want something like this:

site = f'https://discordpy.readthedocs.io/en/latest/search.html?q={search}'
for line in site.content:
    if str(line).startswith("<a"):
        print(str(line))

Is something like this possible?

You can do this with the Selenium web-scraping package.

First, run: pip install selenium

Then use the following script:

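from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://discordpy.readthedocs.io/en/latest/search.html?q=test")
time.sleep(2)  # give the page's JavaScript time to render the search results
# find_element_by_css_selector was removed in Selenium 4.3; use find_element with By
text = driver.find_element(By.CSS_SELECTOR, "body").text

print(text)
driver.quit()  # close the browser when done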

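Note: pressing Ctrl+Shift+I opens the browser's developer tools, where you can find the CSS selector of the element you want the scraper to print. You can use this on all kinds of websites. For further reference, see the Selenium documentation ;)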
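As a sketch of how a selector found in the developer tools could be used, the snippet below builds the search URL from a query, waits for the results to appear instead of sleeping for a fixed time, and prints each link. The helper names `build_search_url` and `print_result_links` are hypothetical, and the selector `#search-results a` is an assumption about this page's markup; check it in the developer tools and adjust it if it does not match.

```python
from urllib.parse import quote_plus

BASE_URL = "https://discordpy.readthedocs.io/en/latest/search.html"


def build_search_url(query, base_url=BASE_URL):
    """Return the search URL for a query, URL-encoding the search term."""
    return f"{base_url}?q={quote_plus(query)}"


def print_result_links(query, selector="#search-results a", timeout=10):
    """Open the search page and print the text and href of each matching link.

    The default selector is an assumption, not something confirmed by the site.
    """
    # Selenium is imported here so build_search_url works without it installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(build_search_url(query))
        # Wait until at least one matching element exists (raises on timeout).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        for link in driver.find_elements(By.CSS_SELECTOR, selector):
            print(link.text, link.get_attribute("href"))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling `print_result_links("test")` would open Chrome and print one line per link found. The explicit wait returns as soon as the first result is present, so later results may still be loading; a short extra delay may be needed if the list looks incomplete.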

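This site loads its search results dynamically with JavaScript, so loading the page with requests and parsing it with BeautifulSoup will not work. The solution is to load the page with Selenium. This example works out of the box in Colab:
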
!apt update
!apt install chromium-chromedriver
!pip install selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

web = 'https://discordpy.readthedocs.io/en/latest/search.html?q=test'
path = '/usr/bin/chromedriver'  # set the path of your chromedriver file

options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--window-size=1920,1080')  # Chrome expects width,height
options.add_argument('--headless')
options.add_argument('--disable-gpu')

# pass the chromedriver path explicitly through a Service object
driver = webdriver.Chrome(service=Service(path), options=options)
driver.get(web)
html = driver.page_source  # HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
results = [i.get_text() for i in soup.find_all('li')]
print(results)

Result:

['API Reference...– The extension name to load. It must be dot separated like\nregular Python imports if accessing a sub-module. e.g.\nfoo.test if you want to import foo/test.py.\n\nRaises\n\nExtensionNotFound – The extension could not be imported.\nExtensionAlrea...', "Commands...two are equivalent:\nfrom discord.ext import commands\n\nbot = commands.Bot(command_prefix='$')\n\n@bot.command()\nasync def test(ctx):\n    pass\n\n# or:\n\n@commands.command()\nasync def test(ctx):\n    pass\n\nbot.add_command(test)\n\n\nSince the Bot.com...", 'Migrating to v0.10.0....channels\nServer.members\n\nSome examples of previously valid behaviour that is now invalid\nif client.servers[0].name == "test":\n    # do something\n\n\nSince they are no longer lists, they no longer support indexing or any operation other than...', "Migrating to v1.0...ow use a File pseudo-namedtuple to upload a single file.\n# before\nawait client.send_file(channel, 'cool.png', filename='testing.png', content='Hello')\n\n# after\nawait channel.send('Hello', file=discord.File('cool.png', 'testing.png'))\n\n\nThis..."]
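The question asked for the `<a` elements rather than the text of each list item. As a minimal sketch, the rendered HTML can be scanned for anchors using only the standard library, which avoids the line-by-line `startswith("<a")` check (HTML is not guaranteed to put one tag per line). The names `LinkCollector` and `extract_links` are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Return a list of (link text, href) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

With the page source from the script above, `for text, href in extract_links(html): print(text, href)` would print every link on the page, including navigation links, so the list may need filtering to keep only the search results.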
