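Suppose I have a long string containing newlines and tabs, such as:
var x = "This is a long string.\n\t This is another one on next line.";
How can we split this string into tokens using a regular expression?
I don't want to use .split(' '), because I want to learn JavaScript's regular expressions.
A more complicated string could look like this: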
var y = "This @is a #long $string. Alright, lets split this.";
Now I want to extract only the valid words from this string, without special characters or punctuation, i.e. I want these:
var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];
var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];
Here is a jsfiddle example: http://jsfiddle.net/ayezutov/BjXw5/1/
Basically, the code is very simple:
var y = "This @is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // "one or more non-whitespace characters"; the g flag finds every occurrence, not just the first
var match = y.match(regex);
for (var i = 0; i < match.length; i++)
{
    document.write(match[i]);
    document.write('<br>');
}
Update: Basically, you can expand the list of separators: http://jsfiddle.net/ayezutov/BjXw5/2/
var regex = /[^\s.,!?]+/g;
Update 2: Letters only, all the time: http://jsfiddle.net/ayezutov/BjXw5/3/
var regex = /\w+/g;
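As a quick check against the two strings from the question, the \w+ pattern can be wrapped in a small helper. This is only a sketch: the helper name words is made up for illustration, and \w also accepts digits and underscores, not just letters.

```javascript
// Strings from the question, with the escape sequences written out.
const x = "This is a long string.\n\t This is another one on next line.";
const y = "This @is a #long $string. Alright, lets split this.";

// Hypothetical helper: \w+ matches runs of ASCII letters, digits and underscores.
// match() returns null when nothing matches, so fall back to an empty array.
function words(text) {
  return text.match(/\w+/g) || [];
}

const xwords = words(x);
const ywords = words(y);
console.log(xwords);
console.log(ywords);
```

The `|| []` fallback matters if the input might contain no word characters at all, since the loop in the answer above would otherwise fail on `match.length`.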
Use \s+ to tokenize the string.
exec can loop through the matches to remove the non-word (\W) characters.
var A = [], str = "This @is a #long $string. Alright, let's split this.",
    rx = /\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g, words;
while ((words = rx.exec(str)) != null) {
    A.push(words[1]);
}
A.join(', ')
/* returned value: (String)
This, is, a, long, string, Alright, let's, split, this
*/
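If only the words are needed, the same idea can probably be written without consuming the surrounding non-word characters. The sketch below is an alternative, not the pattern from this answer: it assumes an environment with String.prototype.matchAll (ES2020), and the names wordPattern and found are made up. It may behave differently on input that mixes letters with digits.

```javascript
const str = "This @is a #long $string. Alright, let's split this.";

// A letter followed by any number of letters or apostrophes.
const wordPattern = /[a-zA-Z][a-zA-Z']*/g;

// matchAll yields one match object per word; m[0] is the matched text.
const found = Array.from(str.matchAll(wordPattern), (m) => m[0]);
console.log(found.join(", "));
```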
var words = y.split(/[^A-Za-z0-9]+/);
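One thing to watch with split: when the text starts or ends with a separator (the sample string ends with a period), the result contains an empty string at that end. A minimal sketch that drops those entries, using the string from the question:

```javascript
const y = "This @is a #long $string. Alright, lets split this.";

// Split on runs of non-alphanumeric characters.
const raw = y.split(/[^A-Za-z0-9]+/);

// The trailing "." leaves an empty string at the end, so filter it out.
const words = raw.filter((w) => w.length > 0);
console.log(words);
```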
Here is a solution that uses regex groups to tokenize the text into different types of tokens.
You can test the code here: https://jsfiddle.net/u3mvca6q/5/
/*
  Basic Regex explanation:
  /       Regex start
  (\w+)   First group, words: \w matches an ASCII letter, digit or underscore; + means 1 or more
  |       or
  (,|!)   Second group, punctuation
  |       or
  (\s)    Third group, white spaces
  /       Regex end
  g       "global", enables looping over the string to capture one element at a time

  Regex result:
  result[0] : default group : any match
  result[1] : group1 : words
  result[2] : group2 : punctuation , !
  result[3] : group3 : whitespace
*/
var basicRegex = /(\w+)|(,|!)|(\s)/g;
/*
  Advanced Regex explanation:
  [a-zA-Z\u0080-\u00FF] instead of \w   Supports some Unicode letters instead of ASCII letters only. Find Unicode ranges here https://apps.timwhitlock.info/js/regex
  (\.\.\.|\.|,|!|\?)                    Identify ellipsis (...) and points as separate entities
  You can improve it by adding ranges for special punctuation and so on
*/
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;
var basicString = "Hello, this is a random message!";
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";
console.log("------------------");
var result = null;
do {
result = basicRegex.exec(basicString)
console.log(result);
} while(result != null)
console.log("------------------");
var result = null;
do {
result = advancedRegex.exec(advancedString)
console.log(result);
} while(result != null)
/*
Output:
Array [ "Hello", "Hello", undefined, undefined ]
Array [ ",", undefined, ",", undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "this", "this", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "is", "is", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "a", "a", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "random", "random", undefined, undefined ]
Array [ " ", undefined, undefined, " " ]
Array [ "message", "message", undefined, undefined ]
Array [ "!", undefined, "!", undefined ]
null
*/
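Reading the token type off the array index works, but it can be easier to follow if each token carries a label. The following is a sketch only, assuming named capture groups are available (ES2018); the function name tokenize and the labels word, punct and space are made up for illustration.

```javascript
// Hypothetical helper: label each token by the group that matched it.
function tokenize(text) {
  // Same three alternatives as basicRegex, written with named groups.
  const pattern = /(?<word>\w+)|(?<punct>[,!])|(?<space>\s)/g;
  const tokens = [];
  for (const m of text.matchAll(pattern)) {
    // Exactly one of the named groups is defined for each match.
    const [type, value] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    tokens.push({ type, value });
  }
  return tokens;
}

const tokens = tokenize("Hello, this is a random message!");
// Show everything except the whitespace tokens.
console.log(tokens.filter((t) => t.type !== "space"));
```

Characters that match none of the three alternatives are skipped silently, the same as with the exec loop above.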
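To extract pure word characters, we use the \w symbol. Whether this matches Unicode characters depends on the implementation, so check a regex reference for your language/library to see which applies.
See Alexander Yezutov's answer (Update 2) for how to apply this in an expression.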
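In JavaScript specifically, \w is limited to [A-Za-z0-9_], so accented letters split a word. A possible workaround, assuming an engine that supports Unicode property escapes (ES2018), is \p{L} with the u flag. The sample text below is adapted from the other answer, and the variable names are made up.

```javascript
// normalize("NFC") makes sure accented letters are single code points.
const text = "Et en français ? Avec des caractères spéciaux ...".normalize("NFC");

// \w only covers ASCII word characters, so "français" is cut at the "ç".
const asciiWords = text.match(/\w+/g) || [];

// With the u flag, \p{L} matches any Unicode letter.
const unicodeWords = text.match(/\p{L}+/gu) || [];

console.log(asciiWords);
console.log(unicodeWords);
```

Note that \p{L} covers letters only; digits and underscores would need to be added to the character class if they should count as part of a word.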