How to extract URLs by title with Beautiful Soup



I have a list of links I am interested in:

lis = ['https://example1.com', 'https://example2.com', ..., 'https://exampleN.com']

Inside the pages behind these links there are several URLs, and I want to extract some specific internal ones. Such URLs have the following form:

<a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>

How can I visit every element of lis and return, in a pandas DataFrame, the visited link together with only those URLs whose title is Url to news? For example:

visited_link, extracted_link
https://www.example1.com, NaN
https://www.example2.com, NaN
https://www.example3.com, https://interesting-linkN.com

Note that for elements of lis that do not contain any <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a> element, I want to return NaN.

I tried this:

import requests
from lxml import html

def extract_jpg_url(a_link):
    page = requests.get(a_link)
    tree = html.fromstring(page.content)
    # Here is the problem... not all interesting pages match this XPath.
    # How can I select by title instead?
    # (Apparently all the jpg URLs have this form: title="Url to news")
    interesting_link = tree.xpath(".//*[@id='object']//tbody//tr//td//span//a/@href")
    if len(interesting_link) == 0:
        return 'NaN'
    else:
        return 'image link ', interesting_link
then:
    df['news_link'] = df['urls_from_lis'].apply(extract_jpg_url)
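Regarding the inline comment above (selecting by title rather than by position in the tree), lxml's XPath supports attribute predicates directly. A minimal sketch, assuming the target anchors really carry title="Url to news" (the snippet below is made-up illustrative HTML):

```python
from lxml import html

snippet = """
<html><body>
  <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>
  <a href="https://other.com" title="Something else">Other</a>
</body></html>
"""

tree = html.fromstring(snippet)
# Select hrefs of <a> elements whose title attribute is exactly "Url to news"
links = tree.xpath('//a[@title="Url to news"]/@href')
print(links)  # ['https://interesting-linkN.com']
```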

However, this approach takes a very long time, and not all elements of lis match the given XPath (see the comment in the code). Any ideas on how I could do this?

This won't return exactly what you want (the NaN part), but it should give you a general idea of how to do this simply and efficiently.

from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import requests

def extract_urls(link):
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    # Match <a> tags by their title attribute
    results = soup.find_all('a', {'title': 'Url to news'})
    results = [x['href'] for x in results]
    return (link, results)

links = [
    "https://example1.com",
    "https://example2.com",
    "https://exampleN.com",
]

# Fetch and parse the pages concurrently with 10 worker threads
p = ThreadPool(10)
r = p.map(extract_urls, links)

for url, results in r:
    print(url, results)
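To also get the NaN behavior asked for in the question, the (url, results) pairs can be folded into a pandas DataFrame, letting empty result lists become NaN. A sketch along those lines, which assumes you only want the first match per page (the sample r below stands in for the output of the ThreadPool above):

```python
import numpy as np
import pandas as pd

# Example output in the shape returned by extract_urls: (visited link, list of matches)
r = [
    ("https://www.example1.com", []),
    ("https://www.example2.com", []),
    ("https://www.example3.com", ["https://interesting-linkN.com"]),
]

# Keep the first matching href per page, or NaN when the page had none
df = pd.DataFrame(
    [(url, results[0] if results else np.nan) for url, results in r],
    columns=["visited_link", "extracted_link"],
)
print(df)
```

If a page can contain several matching links and you want all of them, keep the full list in the column instead of taking `results[0]`.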
