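I'm building a Node.js application that stores URLs in a database. I want to use the URL as the primary key to avoid storing duplicates. To do that, I need each URL in its simplest possible form, with extra slashes, query parameters, and prefixes removed.
How can I convert all of the URLs listed below into the same string as the first one? Is there a safe way to do this that also accounts for variations I may not have listed?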
website.com/coolpage/938921
https://website.com/coolpage/938921/
https://www.website.com/coolpage/938921/
http://website.com/coolpage/938921/
https://website.com/coolpage/938921/
https://website.com/coolpage/938921/?awesome=1
https://website.com/coolpage/938921?awesome=1
https:///website.com//coolpage//938921//
Use the standard Node.js url module.
Solution:
const { URL } = require('url');

function getBaseUrl(url) {
  // new URL() throws on input without a scheme, so add one when it is missing
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(url);
  const u = new URL(hasScheme ? url : `https://${url}`);
  return `${u.host}${u.pathname}`
    .replace(/\/{2,}/g, '/') // collapse repeated slashes
    .replace(/^www\./, '')   // drop a leading "www." only
    .replace(/\/$/, '');     // cut off the trailing '/' character
}
Test:
const urls = [
  "https://website.com/coolpage/938921/",
  "https://www.website.com/coolpage/938921/",
  "http://website.com/coolpage/938921/",
  "https://website.com/coolpage/938921/",
  "https://website.com/coolpage/938921/?awesome=1",
  "https://website.com/coolpage/938921?awesome=1",
  "https:///website.com//coolpage//938921//"
];
for (let i = 0; i < urls.length; i++) {
  const u = getBaseUrl(urls[i]);
  console.log(`${i}: ${u}`);
}
Console output:
0: website.com/coolpage/938921
1: website.com/coolpage/938921
2: website.com/coolpage/938921
3: website.com/coolpage/938921
4: website.com/coolpage/938921
5: website.com/coolpage/938921
6: website.com/coolpage/938921
Live example on runkit.com
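To connect this back to the goal of using the URL as a primary key, here is a minimal sketch of deduplicating on the normalized form. The names normalizeUrl and dedupe are hypothetical, and the sketch makes a few assumptions that may not suit every site: input without a scheme is treated as https, http and https count as the same page, and the query string is always discarded. It relies on the global URL class available in recent Node.js versions.

```javascript
// Hypothetical helper: reduce a URL to "host/path" so it can serve as a key.
function normalizeUrl(input) {
  // Assumption: input without a scheme is meant to be an https URL.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(input);
  const u = new URL(hasScheme ? input : `https://${input}`); // throws on unparsable input
  return `${u.host}${u.pathname}`
    .replace(/\/{2,}/g, '/') // collapse repeated slashes
    .replace(/^www\./, '')   // drop a leading "www."
    .replace(/\/$/, '');     // drop the trailing slash
}

// Keep the first URL seen for each normalized key.
function dedupe(urls) {
  const seen = new Map();
  for (const url of urls) {
    const key = normalizeUrl(url);
    if (!seen.has(key)) seen.set(key, url);
  }
  return seen;
}

const stored = dedupe([
  'website.com/coolpage/938921',
  'https://www.website.com/coolpage/938921/?awesome=1',
  'https:///website.com//coolpage//938921//',
  'https://website.com/otherpage/1',
]);
console.log([...stored.keys()]);
```

Because the URL parser lowercases the host, differences in host capitalization also collapse to the same key. Sites where the query string selects the page would need to keep the relevant parameters in the key instead of dropping them.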
Here is a function that does what you need:
function convertURL(url) {
  // Drop the query string and fragment, then the scheme and a leading "www."
  var bare = url.split(/[?#]/)[0]
    .replace(/^[a-z][a-z0-9+.-]*:\/+/i, '')
    .replace(/^www\./i, '')
  // Splitting on '/' and discarding empty parts removes duplicate and trailing slashes
  return bare.split('/').filter(p => p.length > 0).join('/')
}
var url = convertURL('https://website.com/coolpage/938921/?awesome=1')
console.log(url)
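You can use String.prototype.replace() with the RegExp /\/+/g to match one or more forward slash characters and replace them with a single /, then String.prototype.match() with the RegExp /[a-z0-9]+\.[a-z0-9]+(?=\/+)\/[a-z0-9]+(?=\/+)\/[a-z0-9]+/ig to match the host name and path name of the URL. Note that this pattern expects a host followed by exactly two alphanumeric path segments, as in the URLs from the question.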
let urls = [
  "https://website.com/coolpage/938921/",
  "https://www.website.com/coolpage/938921/",
  "http://website.com/coolpage/938921/",
  "https://website.com/coolpage/938921/",
  "https://website.com/coolpage/938921/?awesome=1",
  "https://website.com/coolpage/938921?awesome=1",
  "https:///website.com//coolpage//938921//"
];
let _URL = "website.com/coolpage/938921";
let replaceForwardSlashes = /\/+/g;
let matchHostAndPathNames = /[a-z0-9]+\.[a-z0-9]+(?=\/+)\/[a-z0-9]+(?=\/+)\/[a-z0-9]+/ig;
// match() returns an array of matches; take the first so each entry is a string
let matchedURLS = urls.map(url => url.replace(replaceForwardSlashes, '/').match(matchHostAndPathNames)[0]);
console.log(matchedURLS, new Set(matchedURLS).size === 1, matchedURLS.every(u => u === _URL));
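For URLs with a different number of path segments, a chain of replacements may be easier to adapt than a single match pattern. The following is only a sketch under assumptions: stripToHostAndPath is a hypothetical name, the query string and fragment are always discarded, and the host is not lowercased because no URL parsing takes place.

```javascript
// Hypothetical regex-only normalizer; not limited to two path segments.
function stripToHostAndPath(url) {
  return url
    .replace(/[?#].*$/, '')                 // drop the query string and fragment
    .replace(/^[a-z][a-z0-9+.-]*:\/+/i, '') // drop the scheme and the slashes after it
    .replace(/^www\./i, '')                 // drop a leading "www."
    .replace(/\/+/g, '/')                   // collapse repeated slashes
    .replace(/\/$/, '');                    // drop the trailing slash
}

console.log(stripToHostAndPath('https:///website.com//coolpage//938921//'));
console.log(stripToHostAndPath('https://www.website.com/a/b/c/d/?awesome=1#top'));
```

Since this works on the raw string, it accepts input without a scheme, but it also accepts strings that are not valid URLs at all, so validating the input separately may be worthwhile.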