How to remove a URL from monitoring while the script is running



I wrote a script that monitors some web pages and prints a notification whenever a particular HTML tag is found. The point is to run the script 24/7, and I want to be able to remove URLs while it is running. I currently have a database from which I will read the URLs to look up/remove.

import threading
import requests
from bs4 import BeautifulSoup
# Replacement for database for now
URLS = [
    'https://github.com/search?q=hello+world',
    'https://github.com/search?q=python+3',
    'https://github.com/search?q=world',
    'https://github.com/search?q=i+love+python',
]

def doRequest(url):
    while True:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    for url in URLS:
        threading.Thread(target=doRequest, args=(url,)).start()

The problem I'm currently facing is that doRequest sits in a while loop that runs forever. How can I remove a specific URL, e.g. https://github.com/search?q=world, while the script is running?

Approach 1: A simple way

What you want is to insert some termination logic into the while True loop so that it constantly checks for a termination signal.

For that, you can use threading.Event().

For example, you can add a stopping_event parameter:

def doRequest(url, stopping_event):
    while not stopping_event.is_set():
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

Create these events when starting the threads:

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()

Whenever you want to stop/remove a particular URL, you can simply call:

stopping_events[url].set()

That particular while loop will then stop and exit.
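As a minimal, runnable sketch of the idea (the worker/stop names are mine, and time.sleep stands in for the HTTP request and parsing work):

```python
import threading
import time

def worker(stop_event, results):
    # Stand-in for the polling loop: exits as soon as the event is set
    count = 0
    while not stop_event.is_set():
        count += 1
        time.sleep(0.01)
    results.append(count)

stop = threading.Event()
results = []
t = threading.Thread(target=worker, args=(stop, results))
t.start()
time.sleep(0.05)   # let the worker run a few iterations
stop.set()         # signal the loop to exit
t.join(timeout=1)
print(t.is_alive(), results[0] > 0)  # False True
```

Note the loop only notices the event between iterations, so a slow requests.get call can delay the exit by up to one iteration.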

You can even create a separate thread that waits for user input to stop a particular URL:

def manager(stopping_events):
    while True:
        url = input('url to stop: ')
        if url in stopping_events:
            stopping_events[url].set()

if __name__ == '__main__':
    # TODO read URLS from database instead of lists
    stopping_events = {url: threading.Event() for url in URLS}
    for url in URLS:
        threading.Thread(target=doRequest, args=(url, stopping_events[url])).start()
    threading.Thread(target=manager, args=(stopping_events,)).start()

Approach 2: A cleaner way

Instead of a fixed URL list, you can have one thread continually read the URL list from the database and feed it to the processing threads. This is the producer-consumer pattern. Now you aren't really removing any URLs; you simply keep processing whatever URL list the database contains at any later point, which automatically handles newly added/removed URLs.

import queue
import threading
import requests
from bs4 import BeautifulSoup

# Replacement for database for now
def get_urls_from_db(q: queue.Queue):
    while True:
        url_list = ...  # some db read logic
        for url in url_list:  # put newly read URLs into the queue
            q.put(url)        # (a bare map(q.put, url_list) is lazy in Python 3 and would enqueue nothing)

def doRequest(q: queue.Queue):
    while True:
        url = q.get()  # waits until a url is available in the queue
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            if soup.find("span", {"data-search-type": "Repositories"}).text.strip():  # if there are sizes
                sendNotifications({
                    'title': soup.find("input", {"name": "q"})['value'],
                    'repo_count': soup.find("span", {"data-search-type": "Repositories"}).text.strip()
                })
        else:
            print(url, response.status_code)

def sendNotifications(data):
    ...

if __name__ == '__main__':
    url_queue = queue.Queue()
    for _ in range(10):  # start 10 worker threads
        threading.Thread(target=doRequest, args=(url_queue,)).start()
    threading.Thread(target=get_urls_from_db, args=(url_queue,)).start()

get_urls_from_db continually reads URLs from the database and puts the database's current URL list into url_queue to be processed.

In doRequest, each iteration of the loop now gets a url from url_queue and processes it.
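A small sketch of why this works without busy-waiting: Queue.get() blocks until an item is available, so idle workers simply sleep until the producer puts something in (the delayed_producer name and the 0.05s delay are just for the demo):

```python
import queue
import threading
import time

q = queue.Queue()

def delayed_producer():
    time.sleep(0.05)  # simulate a slow db read
    q.put('https://github.com/search?q=world')

threading.Thread(target=delayed_producer).start()

# get() blocks here until the producer has put an item
url = q.get()
print(url)  # https://github.com/search?q=world
```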

One thing to watch out for is URLs being added faster than they can be processed; the queue length will then grow over time and consume a lot of memory.
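One way to guard against that, if you'd rather apply backpressure to the producer than grow without bound, is a bounded queue (the maxsize of 2 here is only for the demo; you'd pick something larger in practice):

```python
import queue

# Queue(maxsize=...) caps memory use: once full, put() blocks until a
# consumer takes an item, and put_nowait() raises queue.Full immediately.
q = queue.Queue(maxsize=2)
q.put('url-1')
q.put('url-2')

try:
    q.put_nowait('url-3')  # queue already holds maxsize items
    overflowed = False
except queue.Full:
    overflowed = True

print(overflowed, q.qsize())  # True 2
```

With a bounded queue, a blocking put() in get_urls_from_db naturally slows the producer down to match the workers' pace.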

This is arguably better because you now have fine-grained control over which URLs get processed, along with a fixed number of threads.
