Regex to check whether a link points to a file



How can I check whether a given link (URL) points to a file or to another web page?

What I mean is:

  • Page: https://stackoverflow.com/questions/
  • Page: https://www.w3schools.com/html/default.asp
  • File: https://www.python.org/ftp/python/3.7.2/python-3.7.2.2.exe
  • File: http://jmlr.org/papers/volume19/16-534/16-534.pdf#page=15

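Currently I do this with a rather hacky multi-step check. To work, it also requires converting relative links to absolute ones, adding the http prefix if it is missing, and removing '#' anchors and parameters. I am also not sure whether my whitelist of page extensions is complete.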

import re
def check_file(url):
    try:
        sub_domain = re.split(r'/+', url)[2]  # part after the 2nd run of slashes
    except IndexError:
        return False  # nothing there = main page, no file
    if not re.search(r'\.', sub_domain):
        return False  # no dot, no file
    if re.search(r'\.html?$|\.php$|\.asp$', sub_domain):
        return False  # whitelist some page extensions
    return True
tests = [
    'https://www.stackoverflow.com',
    'https://www.stackoverflow.com/randomlink',
    'https:////www.stackoverflow.com//page.php',
    'https://www.stackoverflow.com/page.html',
    'https://www.stackoverflow.com/page.htm',
    'https://www.stackoverflow.com/file.exe',
    'https://www.stackoverflow.com/image.png'
]
for test in tests:
    print(str(check_file(test)) + ': ' + test)
# False: https://www.stackoverflow.com
# False: https://www.stackoverflow.com/randomlink
# False: https:////www.stackoverflow.com//page.php
# False: https://www.stackoverflow.com/page.html
# False: https://www.stackoverflow.com/page.htm
# True: https://www.stackoverflow.com/file.exe
# True: https://www.stackoverflow.com/image.png

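Is there a clean, single-regex solution, or a library with established functions, that does this? I guess someone must have faced this problem before me, but unfortunately I couldn't find a solution here or elsewhere.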

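Aran-Fey's answer works well on well-behaved pages, which make up 99.99% of the web. But no rule says that a URL ending in a particular extension must resolve to content of a particular type. A poorly configured server could return HTML for a request to a page named "example.png", or an MPEG for a page named "example.php", or any other combination of content type and file extension.

The most accurate way to get content type information for a URL is to actually visit that URL and check the Content-Type in its headers. Most HTTP libraries can retrieve only the header information from a site, so this operation should be relatively quick even for very large pages. For example, if you are using requests, you could do: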

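import requests
def get_content_type(url):
    # HEAD fetches only the headers; requests.head does not follow redirects by default
    response = requests.head(url, allow_redirects=True, timeout=10)
    return response.headers.get('Content-Type')  # None if the header is missing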
test_cases = [
    "http://www.example.com",
    "https://i.stack.imgur.com/T3HH6.png?s=328&g=1",
    "http://php.net/manual/en/security.hiding.php",
]    
for url in test_cases:
    print("Url:", url)
    print("Content type:", get_content_type(url))

Result:

Url: http://www.example.com
Content type: text/html; charset=UTF-8
Url: https://i.stack.imgur.com/T3HH6.png?s=328&g=1
Content type: image/png
Url: http://php.net/manual/en/security.hiding.php
Content type: text/html; charset=utf-8
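
Once the Content-Type value is available, deciding between "page" and "file" comes down to comparing its media type. The following is a minimal sketch, not part of the answer above: the helper name is_file_content_type, the set of media types treated as pages, and the choice to return False when the header is missing are all assumptions made for illustration.

```python
# Media types treated as "web page" here; this set is an assumption.
PAGE_TYPES = {"text/html", "application/xhtml+xml"}

def is_file_content_type(content_type):
    """Return True if a Content-Type header value does not look like a web page."""
    if not content_type:
        return False  # header missing: nothing to decide on (assumed default)
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type not in PAGE_TYPES

print(is_file_content_type("text/html; charset=UTF-8"))  # False
print(is_file_content_type("image/png"))                 # True
```

The function only inspects a header string, so it can be applied to whatever value a HEAD request returned, for example the output of get_content_type.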

urlparse is your friend.

from urllib.parse import urlparse
def check_file(url):
    path = urlparse(url).path  # extract the path component of the URL
    name = path.rsplit('/', 1)[-1]  # discard everything before the last slash
    if '.' not in name:  # if there's no . it's definitely not a file
        return False
    ext = name.rsplit('.', 1)[-1].lower()  # extract the file extension, ignoring case
    return ext not in {'htm', 'html', 'php', 'asp'}
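
To see why urlparse helps here, look at how it splits a URL into components. The fragment ('#page=15') and the query string ('?s=328&g=1') end up in their own fields, so the path can be inspected without stripping them by hand. A small illustration using URLs that appear earlier on this page (the variable names are arbitrary):

```python
from urllib.parse import urlparse

pdf_parts = urlparse("http://jmlr.org/papers/volume19/16-534/16-534.pdf#page=15")
print(pdf_parts.scheme)    # http
print(pdf_parts.netloc)    # jmlr.org
print(pdf_parts.path)      # /papers/volume19/16-534/16-534.pdf
print(pdf_parts.fragment)  # page=15

img_parts = urlparse("https://i.stack.imgur.com/T3HH6.png?s=328&g=1")
print(img_parts.path)   # /T3HH6.png
print(img_parts.query)  # s=328&g=1
```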

The check can be simplified further with the pathlib module:

from urllib.parse import urlparse
from pathlib import PurePath
def check_file(url):
    path = PurePath(urlparse(url).path)
    ext = path.suffix[1:].lower()
    if not ext:
        return False
    return ext not in {'htm', 'html', 'php', 'asp'}
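
The question also mentions converting relative links to absolute ones, adding a missing http prefix, and dropping '#' anchors. check_file reads only the path component, so it already ignores fragments and query strings. If a cleaned-up URL is still needed, for example before fetching it, the standard library can do the preprocessing. This is a rough sketch under stated assumptions: normalize_url, its base parameter (the URL of the page the link was found on), and the default "http" scheme are hypothetical choices for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def normalize_url(url, base=None, default_scheme="http"):
    """Return an absolute URL without its '#fragment' (illustrative helper)."""
    url = url.strip()
    if base is not None:
        url = urljoin(base, url)  # resolve relative links against the page URL
    if not urlparse(url).scheme:
        # No scheme at all, e.g. "www.example.com/file.exe": prepend the assumed default.
        url = f"{default_scheme}://{url.lstrip('/')}"
    return urldefrag(url).url  # drop the fragment, keep everything else

print(normalize_url("/files/report.pdf#page=2",
                    base="https://example.com/docs/index.html"))
# https://example.com/files/report.pdf
print(normalize_url("www.example.com/file.exe"))
# http://www.example.com/file.exe
```

Inputs such as "example.com:8080/file" are not handled by this sketch, because urlparse may read the part before the colon as a scheme.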
