How to make a JavaScript regex match every line when piping a file



If I run the regex against the data as a string, I have no problem: all three of my lines match.

https://regex101.com/r/pHsTvV/1

const regex = /(?<email>((?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])))\s*\|\s*(?<name>([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?))\s*\|\s*(?<address>.*)\s*\|\s*(?<country>(\w|\.|\s*){1,})\s*\|\s*(?<phone>(\d|-| |\+|\(|\)|\.|\/){7,})/gm;
const str = `john.doe@gmail.test| John Doe| 160 Boston Rd| Chelmsford MA 11824| United States| 00088782000
jane.doe@aol.test| Jane Doe| 8415 45th St| Lyons IL 60534| United States| 0005800000
alicia.random123@gmail.test| Alicia Random| BLK 8, City Point| No.58 Wing Shun Street| Tsuen Wan| Not in U.S.| +00092262000`;
const lines = str.split('\n')
lines.forEach(line => {
    const test = regex.exec(str)
    if (test && test.groups) {
        console.dir(test.groups)
    } else {
        console.log('could not match')
    }
});

However, when I load the data from a txt file, JavaScript always fails to match one line out of every two:

import * as fs from 'fs';
import * as path from 'path';
import * as es from 'event-stream';

const regex = /(?<email>((?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])))\s*\|\s*(?<name>([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?))\s*\|\s*(?<address>.*)\s*\|\s*(?<country>(\w|\.|\s*){1,})\s*\|\s*(?<phone>(\d|-| |\+|\(|\)|\.|\/){7,})/gm;
const filePath = path.join(process.cwd(), 'data/test.txt')
var s = fs.createReadStream(filePath)
    .pipe(es.split())
    .pipe(es.mapSync(function (line: string) {
        let values = regex.exec(line.trim())
        if (values && values.groups) {
            console.dir(values.groups)
        } else {
            console.log(`COULD NOT MATCH`)
            console.log(line)
        }
    }).on('error', function (err) {
        console.log('Error while reading file.', err);
    })
    .on('end', function () {
        console.log('Read entire file.')
    })
    )

The test.txt file looks like this:

john.doe@gmail.test| John Doe| 160 Boston Rd| Chelmsford MA 11824| United States| 00088782000
jane.doe@aol.test| Jane Doe| 8415 45th St| Lyons IL 60534| United States| 0005800000
alicia.random123@gmail.test| Alicia Random| BLK 8, City Point| No.58 Wing Shun Street| Tsuen Wan| Not in U.S.| +00092262000

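Even in a file with 100 lines, one line out of every two always fails to match. When I read the file above, jane.doe@aol.test is the line that does not get matched.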

I tried the following to see whether the problem is specific to that line:

const regex = /(?<email>((?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])))\s*\|\s*(?<name>([a-zA-Z]{2,}\s[a-zA-Z]{1,}'?-?[a-zA-Z]{2,}\s?([a-zA-Z]{1,})?))\s*\|\s*(?<address>.*)\s*\|\s*(?<country>(\w|\.|\s*){1,})\s*\|\s*(?<phone>(\d|-| |\+|\(|\)|\.|\/){7,})/gm;
const uniqueStr = `jane.doe@aol.test| Jane Doe| 8415 45th St| Lyons IL 60534| United States| 0005800000`
const test = regex.exec(uniqueStr)
if (test && test.groups) {
    console.dir(test.groups)
} else {
    console.log('could not match')
    console.log(uniqueStr)
}

This does not match, yet if I try the same regex on regex101 it matches without any problem.

https://regex101.com/r/52kpRD/1

Take a look at the accepted answer to this question: RegExp is stateful.

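Essentially, your regex is an object that remembers, in its lastIndex property, the index in the string where its last match ended. The next call to exec() continues from that index instead of looking for a match from the start of the line again.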
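A minimal sketch of this behavior, using a short hypothetical pattern (key=value pairs) in place of the long regex from the question. The names re, lines and seen are made up for illustration; only the g flag and the reuse of one regex object across different strings matter here.

```javascript
// Hypothetical short pattern; the g flag makes exec() honor lastIndex.
const re = /(?<key>\w+)=(?<value>\w+)/g;
const lines = ['a=1', 'b=2', 'c=3'];

const seen = [];
for (const line of lines) {
    const before = re.lastIndex;   // where exec() will start searching
    const m = re.exec(line);       // same regex object, different string
    seen.push({ line, before, matched: m !== null, after: re.lastIndex });
}

console.log(seen);
// 'a=1': before 0, matched true,  after 3
// 'b=2': before 3, matched false, after 0 (search began at the end of the line)
// 'c=3': before 0, matched true,  after 3
```

A failed match resets lastIndex to 0, which is why the following line matches again and the failures alternate.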

So one solution is to manually reset regex.lastIndex on every call of the es.mapSync callback:

let s = fs.createReadStream(filePath)
    .pipe(es.split())
    .pipe(es.mapSync(function (line) {
        regex.lastIndex = 0; // Reset the RegExp index
        let values = regex.exec(line.trim())
        if (values && values.groups) {
            console.dir(values.groups)
        } else {
            console.log(`COULD NOT MATCH`)
            console.log(line)
        }
    }).on('error', function (err) {
        console.log('Error while reading file.', err);
    })
    .on('end', function () {
        console.log('Read entire file.')
    })
    )

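Note that this only happens because the same regex object, which carries the g flag, is defined once outside the callback and reused for every line. If you created the regex inside the mapSync() callback instead, each call would start with lastIndex at 0, which has the same effect. However, resetting lastIndex is simpler and probably performs better.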
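For comparison, here is a rough sketch of two alternatives to resetting lastIndex. It assumes a much shorter stand-in pattern (SOURCE) and made-up sample lines rather than the regex and data from the question. The first variant drops the g flag, since exec() ignores lastIndex when neither g nor y is set. The second builds a fresh RegExp on every call.

```javascript
// Hypothetical simplified pattern standing in for the long regex above.
const SOURCE = '^(?<email>[^|\\s]+)\\s*\\|\\s*(?<name>[^|]+)';

// Variant 1: no g flag, so exec() always searches from index 0.
const noGlobal = new RegExp(SOURCE);

// Variant 2: a new RegExp per call, so no lastIndex state is shared.
function matchFresh(line) {
    return new RegExp(SOURCE, 'g').exec(line);
}

// Made-up sample lines for illustration only.
const lines = ['john@x.test| John Doe', 'jane@x.test| Jane Doe', 'al@x.test| Al Ran'];

const viaNoGlobal = lines.map(l => noGlobal.exec(l)?.groups.email ?? null);
const viaFresh = lines.map(l => matchFresh(l)?.groups.email ?? null);

console.log(viaNoGlobal); // [ 'john@x.test', 'jane@x.test', 'al@x.test' ]
console.log(viaFresh);    // [ 'john@x.test', 'jane@x.test', 'al@x.test' ]
```

Since only one match per line is needed here, dropping the g flag may be the least intrusive change. Rebuilding the regex per line works too, at the cost of recompiling the pattern on every call.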
