JS: Convert a URL to its simplest form



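I'm building a Node.js application that stores URLs in a database. I want to use the URL as the primary key to avoid storing duplicates. To do that, I need each URL in its simplest possible form, with redundant slashes, query parameters, and prefixes removed.

How can I convert all of the URLs listed below into the same string as the first one? Is there a safe way to do this that also accounts for other variations I may not have listed?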

website.com/coolpage/938921

https://website.com/coolpage/938921/

https://www.website.com/coolpage/938921/

http://website.com/coolpage/938921/

https://website.com/coolpage/938921/

https://website.com/coolpage/938921/?awesome=1

https://website.com/coolpage/938921?awesome=1

https:///website.com//coolpage//938921//

Use the standard Node.js url module.

Solution:

const { URL } = require('url');

function getBaseUrl(url) {
    const u = new URL(url);
    const result = `${u.host}${u.pathname}`
        .replace(/\/{2,}/g, '/')   // collapse runs of slashes into one
        .replace(/^www\./, '');    // drop a leading "www." only
    // cut off the trailing '/' character from the result
    return result.endsWith('/') ? result.slice(0, -1) : result;
}

Test:

const urls = [
    "https://website.com/coolpage/938921/",
    "https://www.website.com/coolpage/938921/",
    "http://website.com/coolpage/938921/",
    "https://website.com/coolpage/938921/",
    "https://website.com/coolpage/938921/?awesome=1",
    "https://website.com/coolpage/938921?awesome=1",
    "https:///website.com//coolpage//938921//"
    ];
for (let i = 0; i < urls.length; i++) {
    const u = getBaseUrl(urls[i]);
    console.log(`${i}: ${u}`);
}

Console output:

0: website.com/coolpage/938921
1: website.com/coolpage/938921
2: website.com/coolpage/938921
3: website.com/coolpage/938921
4: website.com/coolpage/938921
5: website.com/coolpage/938921
6: website.com/coolpage/938921
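The test list above leaves out the first URL from the question, website.com/coolpage/938921, because new URL() throws on input without a scheme. One way to cover it is sketched below. The name normalizeUrl and the choice to assume https for scheme-less input are illustrative assumptions, not part of the answer above.

```javascript
const { URL } = require('url');

// Hypothetical helper: reduce a URL to "host/path" with no scheme, query, or fragment.
function normalizeUrl(input) {
    // Assumption: input without an explicit http(s) scheme is treated as https.
    const withScheme = /^https?:\/\//i.test(input) ? input : `https://${input}`;
    const u = new URL(withScheme);
    // u.host is lowercased by the parser and omits the scheme's default port.
    const host = u.host.replace(/^www\./, '');
    // Collapse repeated slashes, then drop any trailing slash.
    const path = u.pathname.replace(/\/{2,}/g, '/').replace(/\/+$/, '');
    return `${host}${path}`;
}

console.log(normalizeUrl('website.com/coolpage/938921'));
// website.com/coolpage/938921
console.log(normalizeUrl('HTTPS://WWW.Website.com:443/coolpage/938921/#top'));
// website.com/coolpage/938921
```

The WHATWG parser already handles host case and default ports, so the only manual steps left are the "www." prefix and the slashes. Paths stay case-sensitive, which is usually what a primary key wants.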

Live example on runkit.com

Here is a function that does what you need:

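function convertURL(url) {
	// Drop the query string and fragment before splitting into segments
	var urlParts = url.split(/[?#]/)[0].split('/')
	var finalURL = ''
	urlParts.forEach((p) => {
		if (finalURL.length == 0) {
			// The first segment containing a dot is taken to be the host name
			if (p.includes('.')) {
				finalURL += p.replace(/^www\./, '')
			}
		}
		else if (p.length > 0) {
			// Empty segments come from doubled or trailing slashes; skip them
			finalURL += '/' + p
		}
	})
	return finalURL
}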
var url = convertURL('https://website.com/coolpage/938921/?awesome=1')
console.log(url)

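You can use String.prototype.replace() with the RegExp /\/+/g to replace each run of one or more forward slashes with a single /, and then String.prototype.match() with the RegExp /[a-z0-9]+\.[a-z0-9]+(?=\/+)\/[a-z0-9]+(?=\/+)\/[a-z0-9]+/ig to match the host name and path name of the URL.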

let urls = ["https://website.com/coolpage/938921/", "https://www.website.com/coolpage/938921/", "http://website.com/coolpage/938921/", "https://website.com/coolpage/938921/", "https://website.com/coolpage/938921/?awesome=1", "https://website.com/coolpage/938921?awesome=1", "https:///website.com//coolpage//938921//"];
let _URL = "website.com/coolpage/938921";
// Matches one or more consecutive forward slashes
let replaceForwardSlashes = /\/+/g;
// Matches "host.tld/segment/segment"; a leading "www." is left outside the match
let matchHostAndPathNames = /[a-z0-9]+\.[a-z0-9]+(?=\/+)\/[a-z0-9]+(?=\/+)\/[a-z0-9]+/ig;
// match() returns an array because of the g flag, so take its first element
let matchedURLS = urls.map(url => url.replace(replaceForwardSlashes, '/').match(matchHostAndPathNames)[0]);
console.log(matchedURLS, new Set(matchedURLS).size === 1, matchedURLS.every(u => u === _URL));
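To tie this back to the question's goal of using the normalized string as a primary key, here is a rough sketch of the duplicate check. UrlStore and toKey are hypothetical names, and the Map only stands in for a database table with a unique key; any of the normalizers in this thread could be passed in instead of toKey.

```javascript
const { URL } = require('url');

// Hypothetical in-memory stand-in for a table keyed by the normalized URL.
class UrlStore {
    constructor(normalize) {
        this.normalize = normalize; // function: raw URL string -> key string
        this.rows = new Map();      // key -> first raw URL seen for that key
    }
    // Returns true if the URL was stored, false if its key already existed.
    add(rawUrl) {
        const key = this.normalize(rawUrl);
        if (this.rows.has(key)) return false;
        this.rows.set(key, rawUrl);
        return true;
    }
}

// Assumed normalizer for the demo: host plus path, no "www.", no trailing slash.
const toKey = (raw) => {
    const u = new URL(raw);
    return (u.host + u.pathname).replace(/^www\./, '').replace(/\/+$/, '');
};

const store = new UrlStore(toKey);
console.log(store.add('https://website.com/coolpage/938921/'));             // true
console.log(store.add('http://www.website.com/coolpage/938921?awesome=1')); // false
console.log(store.rows.size);                                               // 1
```

With a real database the same idea would normally be enforced by a unique constraint on the key column, so that concurrent inserts cannot slip past the check.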
