Is there a way to get the text inside a tag with specific attributes using bs4?



Suppose I have this in an HTML file:

<a rel="nofollow" class="result__a" href="some_link">Foo-baz</a>

How can I extract only Foo-baz using bs4?

Right now, I can get the href attribute with:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0",
}
page = requests.get('the_url', headers=headers).text
soup = BeautifulSoup(page, 'html.parser').find_all("a", class_="result__a")
for link in soup:
    print(link['href'])

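However, I cannot extract the text inside the specific tags that have these attributes.
I have tried different solutions from the documentation and Stack Overflow, but none of them seem to work. Or maybe I am just applying them wrong, since I am new to bs4.

The link is: https://html.duckduckgo.com/html/?q=test

Thanks for your help.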

Use link.text instead of link['href']:

import requests
from bs4 import BeautifulSoup

url = "https://html.duckduckgo.com/html/?q=test"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:84.0) Gecko/20100101 Firefox/84.0",
}
page = requests.get(url, headers=headers).text
# find_all returns every <a> tag whose class is "result__a"
soup = BeautifulSoup(page, "html.parser").find_all("a", class_="result__a")
for link in soup:
    # .text is the text between the opening and closing tag
    print(link.text)

Prints:

Test | Definition of Test by Merriam-Webster
Speedtest by Ookla - The Global Broadband Speed Test
Internet Speed Test | Fast.com
Tests.com Practice Tests
Speed test
Testing for COVID-19 | CDC
Xfinity Speed Test - Check Your Internet Speed
Internet Speed Test - AT&T Official Site
Speed Test by Speedcheck - Test your internet speed
Mic Test
Test - definition of test by The Free Dictionary
CPS Test - Check Clicks per Second
Free Personality Test | 16Personalities
CEA Test: MedlinePlus Medical Test
Test Innovators | Prep for Success
Join a Test Meeting - Zoom
TEST Synonyms: 83 Synonyms & Antonyms for TEST | Thesaurus.com
Speakeasy Internet Speed Test - Check Your Broadband Speed ...
Speed Test - Telstra
IQTest.com--The Original Free Online IQ Test
Speedtest by Ookla - Teste de Velocidade de Conexão da ...
Speedtest - Google Search
ADHD Test - Psych Central
Cisco Webex | Test online meeting
Login | Salesforce
Practice Tests, Tutoring & Prep Courses | Kaplan Test Prep
test | Origin and meaning of test by Online Etymology ...
Test English - Prepare for your English exam
Internet Speed Test - Check Your Internet Speed | Cox
A1C test - Mayo Clinic
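The same idea can be checked offline against the single tag from the question, without any network request. This is a minimal sketch: the variable names `link_text` and `link_href` are only illustrative, and filtering on `rel` in addition to the class is an assumption about which attributes you care about.

```python
from bs4 import BeautifulSoup

# The exact tag from the question, used as a tiny offline document.
html = '<a rel="nofollow" class="result__a" href="some_link">Foo-baz</a>'
soup = BeautifulSoup(html, "html.parser")

# Match on tag name plus two attributes: class and rel.
link = soup.find("a", class_="result__a", rel="nofollow")

link_text = link.get_text(strip=True)  # text between <a> and </a>
link_href = link["href"]               # attribute lookup, as in the question

print(link_text)
print(link_href)
```

`get_text(strip=True)` behaves like `.text` but also trims surrounding whitespace, which tends to help on real pages where the tag content is padded with newlines.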

Have a look at the SelectorGadget Chrome extension to grab CSS selectors.

If you want to scrape the JavaScript version of DuckDuckGo, you can use requests-html or selenium. Note: this is not the fastest solution.

You can also do this by parsing the data inside the <script> tags, but that requires more work.
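To give a rough idea of what parsing data out of a <script> tag can look like, here is a sketch on an invented page. The markup, the `id="results"` attribute and the JSON keys are all hypothetical and are not DuckDuckGo's actual structure; on a real page you would first have to inspect the source to find where the data lives, and it may need extra cleanup before it is valid JSON.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: this markup is invented purely for illustration.
html = """
<html><body>
<script id="results" type="application/json">
[{"title": "Foo-baz", "url": "some_link"}]
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the script tag, then decode its raw text as JSON.
script = soup.find("script", id="results")
data = json.loads(script.string)

titles = [item["title"] for item in data]
print(titles)
```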

Code (non-JavaScript version):

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://html.duckduckgo.com/html/?q=test&kl=us-en', headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
# each .result__body container holds one search result
for result in soup.select('.result__body'):
    title = result.select_one('.result__a').text
    # strip the leading "//" from the protocol-relative link
    url = result.select_one('.result__snippet')['href'].replace('//', '')
    snippet = result.select_one('.result__snippet').text
    print(f"{title}\n{url}\n{snippet}\n")
-----
'''
Test | Definition of Test by Merriam-Webster
duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com%2Fdictionary%2Ftest&rut=4749db61adf540ccd15b6f5aa6a68d4e6604b2c86e5bffe0a5380e06513eff93
Test definition is - a means of testing: such as. How to use test in a sentence.
...
'''
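Note that the URL in the output above is a DuckDuckGo redirect link, with the real destination percent-encoded in the `uddg` query parameter. If you want the destination itself, one possible approach is to decode that parameter with the standard library. This is a sketch: `extract_target` is a hypothetical helper name, and it assumes the redirect keeps the `uddg` parameter seen in the sample output.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(ddg_href):
    """Return the decoded `uddg` target, or the input unchanged if absent."""
    query = parse_qs(urlparse(ddg_href).query)  # parse_qs also percent-decodes
    return query.get("uddg", [ddg_href])[0]


# Redirect link as it appears in the href attribute (sample from the output above).
href = (
    "//duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.merriam%2Dwebster.com"
    "%2Fdictionary%2Ftest"
    "&rut=4749db61adf540ccd15b6f5aa6a68d4e6604b2c86e5bffe0a5380e06513eff93"
)
print(extract_target(href))
```

Links that carry no `uddg` parameter are returned as they are, so the helper can be applied to every result without a separate check.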

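Alternatively, you can use the DuckDuckGo Organic Results API from SerpApi. It is a paid API with a free plan.

The difference is that here you are parsing the non-JavaScript DuckDuckGo page (check this URL and you will see "latest news" and "inline images" results), whereas SerpApi parses the JavaScript version of that page. The only thing you really need to do is iterate over the JSON and pick out the data you want.

Code to integrate: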

import json, os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "duckduckgo",
    "q": "fus ro dah",
    "kl": "us-en"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    print(json.dumps(result, indent=2, ensure_ascii=False))
------------
'''
{
"position": 1,
"title": "Test | Definition of Test by Merriam-Webster",
"link": "https://www.merriam-webster.com/dictionary/test",
"snippet": "Test definition is - a means of testing: such as. How to use test in a sentence.",
"favicon": "https://external-content.duckduckgo.com/ip3/www.merriam-webster.com.ico"
}
...
'''
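Once you have the dictionary, picking out individual fields is plain dictionary access. The sketch below runs offline on the sample result shown above instead of calling the API; in real use `results` would come from `search.get_dict()`. The name `rows` is only illustrative.

```python
# Sample payload copied from the example output above; a real call would
# obtain it from search.get_dict().
results = {
    "organic_results": [
        {
            "position": 1,
            "title": "Test | Definition of Test by Merriam-Webster",
            "link": "https://www.merriam-webster.com/dictionary/test",
            "snippet": "Test definition is - a means of testing: such as. "
                       "How to use test in a sentence.",
            "favicon": "https://external-content.duckduckgo.com/ip3/www.merriam-webster.com.ico",
        }
    ]
}

# .get() with a default avoids a KeyError when a key is missing.
rows = [
    (r.get("position"), r.get("title"), r.get("link"))
    for r in results.get("organic_results", [])
]

for position, title, link in rows:
    print(f"{position}. {title}\n   {link}")
```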

Disclaimer: I work for SerpApi.
