Mocking and testing a function that returns the text from a URL



I have a function that takes a URL and returns the text from that URL.

from urllib.request import Request, urlopen
import bs4 as bs

def extract_raw_text_from_url(url, set_parser='lxml'):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # Set user agent as Mozilla. Otherwise: Error 403
        source = urlopen(req).read()  # Return source code
        parser = set_parser
        soup = bs.BeautifulSoup(source, parser)  # create BeautifulSoup object
        text = soup.get_text()  # get text of the website
    except ValueError:  # ToDo: Why is urllib.error.URLError unknown? I want to include it in the exception! It works in Colab!
        text = []
    return text

What is the right way to test this function? Since I think making a real request on every test run is bad practice, mocking the result seems like a good idea.

Any idea how to do that? I am using pytest, but I am still a beginner.

I think it depends on what you want to test. If you want to test the request itself, you should perform a real request every time (in fact, the web page may change from one day to the next, and a real request takes that into account).

If you want to test the parsing of a given HTML input, I think you can download the HTML page and put it in an assets (or similar) folder inside your tests; then you can try something like:

url = "assets/mywebpage1.html"
with open(url, 'r') as f:
    source = f.read()
# ...
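A self-contained sketch of such a test might look as follows. Note the assumptions: the stdlib `html.parser` stands in for BeautifulSoup, and a temporary file stands in for a real `assets/` folder, so the example runs with no third-party packages; in a real suite you would read the saved page from `assets/` instead.

```python
import tempfile
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Stdlib stand-in for soup.get_text(): collects all text nodes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def get_text(source):
    extractor = _TextExtractor()
    extractor.feed(source)
    return "".join(extractor.chunks)

def test_parse_saved_page():
    # In a real suite the page would live in assets/mywebpage1.html;
    # here it is written on the fly so the sketch is self-contained.
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write("<html><body><p>hello world</p></body></html>")
        path = f.name
    with open(path, "r") as f:
        source = f.read()
    assert "hello world" in get_text(source)

test_parse_saved_page()
```

Run under pytest, the `test_parse_saved_page` function is collected automatically; the explicit call at the end is only there so the sketch also runs as a plain script.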

Edit: I think two approaches are possible:

  1. Split the two operations into two different functions and test only parse_content_from_html(source), where source is obtained in the test routine as described above:
def extract_raw_text_from_url(url, set_parser='lxml'):
    try:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        source = urlopen(req).read()  # Return source code
        text = parse_content_from_html(source, set_parser)
    except ValueError:
        text = []
    return text

def parse_content_from_html(source, parser='lxml'):
    soup = bs.BeautifulSoup(source, parser)  # create BeautifulSoup object
    text = soup.get_text()  # get text of the website
    return text
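With the parsing split out, the test never touches the network. A minimal pytest-style sketch, assuming `beautifulsoup4` is installed (the built-in `html.parser` is used here so no lxml install is needed):

```python
import bs4

def parse_content_from_html(source, parser='html.parser'):
    soup = bs4.BeautifulSoup(source, parser)
    return soup.get_text()

def test_parse_content_from_html():
    source = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
    text = parse_content_from_html(source)
    assert "Title" in text
    assert "Some text." in text

test_parse_content_from_html()
```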
  2. Use a flag to distinguish loading local HTML from loading remote HTML. You can then call extract_raw_text_from_url("assets/mywebpage1.html", ..., local=True):
def extract_raw_text_from_url(url, set_parser='lxml', local=False):
    try:
        if local:
            with open(url, 'r') as f:
                source = f.read()
        else:
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # Set user agent as Mozilla. Otherwise: Error 403
            source = urlopen(req).read()  # Return source code
        parser = set_parser
        soup = bs.BeautifulSoup(source, parser)  # create BeautifulSoup object
        text = soup.get_text()  # get text of the website
    except ValueError:
        text = []
    return text
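As for the mocking the question actually asks about: with `unittest.mock.patch` you can replace `urlopen` so the test never makes a request. A sketch with a simplified stand-in function (it skips the BeautifulSoup step so the example is dependency-free; the key point is to patch `urlopen` where the code under test looks it up):

```python
import urllib.request
from urllib.request import Request
from unittest.mock import MagicMock, patch

def extract_raw_text(url):
    # Simplified stand-in for extract_raw_text_from_url (no parsing step).
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    return urllib.request.urlopen(req).read().decode()

def test_extract_raw_text_mocked():
    fake_response = MagicMock()
    fake_response.read.return_value = b"<html><p>canned page</p></html>"
    # Patch urlopen where the function looks it up, so no request is made.
    with patch("urllib.request.urlopen", return_value=fake_response):
        text = extract_raw_text("http://example.com")
    assert text == "<html><p>canned page</p></html>"
    fake_response.read.assert_called_once()

test_extract_raw_text_mocked()
```

If your module does `from urllib.request import urlopen` instead, patch the name in that module's namespace (e.g. `patch("yourmodule.urlopen", ...)`, where `yourmodule` is a placeholder for your actual module name), because `patch` replaces the name where it is looked up, not where it is defined.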
