How do I parse multiple URLs from a text file with BeautifulSoup?



I have been trying to scrape values from multiple URLs with BeautifulSoup.

I can do this successfully for a single URL, but what I would like to do is pull in a list of roughly 300 URLs from a separate text file.

How do I reference a text file of URLs, instead of a single URL, in source = requests.get('https://notmyrealurl.com').text?

Here is what I have written so far:

import requests
from bs4 import BeautifulSoup
source = requests.get('https://notmyrealurl.com').text
soup = BeautifulSoup(source, features="html.parser")
title = soup.find("meta", attrs={'itemprop': 'acquia_lift:content_keywords'})
print(title["content"] if title is not None else "No meta title given")

You can use open and iterate over the file line by line, with one URL per line.

import requests
from bs4 import BeautifulSoup
# Read the file line by line; each line holds one URL.
with open("your_txt.txt") as file:
    for line in file:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        source = requests.get(url).text
        soup = BeautifulSoup(source, features="html.parser")
        title = soup.find("meta", attrs={'itemprop': 'acquia_lift:content_keywords'})
        print(title["content"] if title is not None else "No meta title given")

I am assuming your txt file looks like this:

url1.com
url2.com
url3.com
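
One caveat with a file like this: requests raises a MissingSchema error for a URL that has no http:// or https:// prefix, so bare entries such as url1.com need a scheme before they can be fetched. A minimal sketch of reading the file with that in mind is below. load_urls is a hypothetical helper name, and defaulting to https:// is an assumption about the sites being scraped.

```python
def load_urls(path):
    """Return the non-empty lines of `path` as URLs, one per line."""
    urls = []
    with open(path, encoding="utf-8") as file:
        for line in file:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            if "://" not in url:
                # Assumption: the sites are served over HTTPS.
                url = "https://" + url
            urls.append(url)
    return urls
```

The returned list can then be looped over in place of the file object, for example for url in load_urls("your_txt.txt").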

Your text file needs to contain a list: ['url1', 'url2', 'url3', '...']. Here is the code to read the text file:

import ast
import requests
from bs4 import BeautifulSoup
# Read your text file
with open('yourtextfile.txt') as file:
    yourtextfile = file.read()
# Parse it into a list (ast.literal_eval only accepts literals, unlike eval)
listurl = list(ast.literal_eval(yourtextfile))
for url in listurl:
    source = requests.get(url).text
    soup = BeautifulSoup(source, features="html.parser")
    title = soup.find("meta", attrs={'itemprop': 'acquia_lift:content_keywords'})
    print(title["content"] if title is not None else "No meta title given")
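
With around 300 URLs, one unreachable site would raise an exception and stop the whole loop, and a request without a timeout can hang indefinitely. The sketch below is one possible way to make the loop more tolerant, not part of either answer above. The names extract_keywords and scrape and the 10-second timeout are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_keywords(html):
    """Return the content of the acquia_lift:content_keywords meta tag, or None."""
    soup = BeautifulSoup(html, features="html.parser")
    tag = soup.find("meta", attrs={"itemprop": "acquia_lift:content_keywords"})
    return tag.get("content") if tag is not None else None


def scrape(urls, timeout=10):
    """Fetch each URL and map it to its keywords, or to None on failure."""
    results = {}
    # A Session reuses connections across requests to the same host.
    with requests.Session() as session:
        for url in urls:
            try:
                response = session.get(url, timeout=timeout)
                response.raise_for_status()  # treat 4xx/5xx responses as failures
            except requests.RequestException as error:
                print(f"{url}: request failed ({error})")
                results[url] = None
                continue
            results[url] = extract_keywords(response.text)
    return results
```

Calling scrape(listurl) returns a dictionary keyed by URL, so the pages that failed or had no meta tag can be listed afterwards instead of being lost in the printed output.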
