I'm ultimately trying to parse the URL out of the page, but only when the text returned by job.get_text() contains one of the keywords in xx_web_job_alt_keywords.
xx_good_jobs = []
xx_web_job_alt_keywords = ['Website']
# <a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
each_job_link_details = soup.find_all('a', class_='result-title hdrlnk')
for job in each_job_link_details:
    if xx_web_job_alt_keywords in job.get_text():
        # append '//mywebsite.com/web/123.html' to list: xx_good_jobs
        xx_good_jobs.append(xx_web_job_alt_keywords.get('href', None))
How would you go about this?
import bs4, re
html = '''<a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
<a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
<a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>'''
soup = bs4.BeautifulSoup(html, 'lxml')
keywords = ['Website', 'Website', 'business']
regex = '|'.join(keywords)
# string= is the current name of this argument; older BeautifulSoup releases call it text=
for a in soup.find_all('a', class_="result-title hdrlnk", string=re.compile(regex, re.IGNORECASE)):
    print(a.get('href'))
Out:
//mywebsite.com/web/123.html
//mywebsite.com/web/123.html
//mywebsite.com/web/123.html
Edit:
keywords = ['Website', 'Website', 'business']
regex = '|'.join(keywords)
Out:
'Website|Website|business'
Just use a regular expression with | to match any of several keywords in the text of the a tag.
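One caveat with joining raw keywords: a keyword that contains regex metacharacters (for example 'C++') would be read as pattern syntax rather than literal text. A minimal sketch of a safer way to build the pattern, assuming the keywords are meant to be matched literally; build_keyword_pattern is a hypothetical helper name, not something from the answer above:

```python
import re

def build_keyword_pattern(keywords):
    """Compile a case-insensitive pattern that matches any keyword literally."""
    # dict.fromkeys drops duplicate keywords while keeping their order
    unique = list(dict.fromkeys(keywords))
    # re.escape neutralises characters such as '+' or '.'
    return re.compile('|'.join(re.escape(k) for k in unique), re.IGNORECASE)

pattern = build_keyword_pattern(['Website', 'Website', 'C++'])
print(pattern.pattern)
print(bool(pattern.search("Print business magazine's website management")))
print(bool(pattern.search("Senior C++ developer")))
print(bool(pattern.search("Gardening assistant")))
```

The compiled pattern can be passed to find_all in place of re.compile(regex, re.IGNORECASE). Note that an empty keyword list would yield an empty pattern, which matches every string, so that case is worth guarding against.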
Edit 2: if the keywords are spread over several lists, flatten them into a single list first:
keyword_lists = [['Website', 'Website', 'business'], ['Website1', 'Website1', 'business1'], ['Website2', 'Website2', 'business2']]
sum(keyword_lists, [])
Out:
['Website',
'Website',
'business',
'Website1',
'Website1',
'business1',
'Website2',
'Website2',
'business2']
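sum(keyword_lists, []) is fine for a handful of short lists, but it creates a new list at every addition, so it slows down as the number of sublists grows. As a sketch of one alternative, itertools.chain.from_iterable from the standard library flattens one level of nesting in a single pass:

```python
from itertools import chain

keyword_lists = [['Website', 'Website', 'business'],
                 ['Website1', 'Website1', 'business1'],
                 ['Website2', 'Website2', 'business2']]

# Flatten one level of nesting without repeated list concatenation
flat_keywords = list(chain.from_iterable(keyword_lists))
print(flat_keywords)
```

The result is the same list that sum(keyword_lists, []) produces, so it can be joined with '|' in the same way.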
Alternatively, you can take a more explicit approach and use a searching function:
xx_web_job_alt_keywords = ['Website']
def desired_links(tag):
    """Filters 'header' links having desired keywords in the text."""
    class_attribute = tag.get('class', [])
    is_header_link = tag.name == 'a' and 'result-title' in class_attribute and 'hdrlnk' in class_attribute
    link_text = tag.get_text()
    has_keywords = any(keyword.lower() in link_text.lower() for keyword in xx_web_job_alt_keywords)
    return is_header_link and has_keywords
xx_good_jobs = [link['href'] for link in soup.find_all(desired_links)]
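Note that we use the built-in any() function to check whether any of the keywords occurs in the text. Also note that we lowercase both the keywords and the text to handle differences in case.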
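The keyword check can also be tried on its own, without any HTML. A minimal sketch, where has_any_keyword is a hypothetical helper name introduced only for illustration:

```python
def has_any_keyword(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()  # lowercase the text once, not once per keyword
    # any() stops at the first keyword that is found
    return any(keyword.lower() in lowered for keyword in keywords)

print(has_any_keyword("Print business magazine's website management", ['Website']))
print(has_any_keyword("Some other header link", ['Website']))
```

This is plain substring matching, so 'Website' would also match text containing "websites"; whether that is desirable depends on the keywords being used.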
Demo:
In [1]: from bs4 import BeautifulSoup
In [2]: data = """
   ...: <div>
   ...:     <a class="result-title hdrlnk" href="//mywebsite.com/web/123.html" data-id="5966181668">Print business magazine's website management</a>
   ...:     <a class="result-title hdrlnk" href="//mywebsite.com/web/456.html" data-id="1234">Some other header link</a>
   ...: </div>"""
In [3]: soup = BeautifulSoup(data, "html.parser")
In [4]: xx_web_job_alt_keywords = ['Website']
In [5]: def desired_links(tag):
   ...:     """Filters 'header' links having desired keywords in the text."""
   ...:
   ...:     class_attribute = tag.get('class', [])
   ...:     is_header_link = tag.name == 'a' and 'result-title' in class_attribute and 'hdrlnk' in class_attribute
   ...:
   ...:     link_text = tag.get_text()
   ...:     has_keywords = any(keyword.lower() in link_text.lower() for keyword in xx_web_job_alt_keywords)
   ...:
   ...:     return is_header_link and has_keywords
   ...:
In [6]: xx_good_jobs = [link['href'] for link in soup.find_all(desired_links)]
In [7]: xx_good_jobs
Out[7]: [u'//mywebsite.com/web/123.html']