Finding common words across multiple paragraphs in JavaScript


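Hello, I have 40 paragraphs, each containing about 250 words. I want to find the words common to the paragraphs and save the results in a comma-separated file. Here is an example.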

var para1 = "this is para one. I am cat. I am 10 years old. I like fish";
var para2 = "this is para two. I am dog. my age is 12. I can swim";
var para3 = "this is para three. I am cat. I am 9 years. I like rat";
var para4 = "this is para four. I am rat. my age is secret. I hate cat";
var para5 = "this is para five. I am dog. I am 10 years old. I like fish";

The result I need:

this,5

is,5

para,5

I,13

am,8

cat,3

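Like this. I also want to exclude some words, such as "I" and "am", which are unnecessary. However, I think that once I find a way to save the results as shown above, I can handle the exclusion part myself.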

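You can simply iterate over all the paragraphs and split each one on whitespace. If you want to exclude punctuation, it makes sense to do a basic regex replace before splitting.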

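Then you can collect all of the words into a dictionary, so that you keep a running total for each unique word across all paragraphs. It is usually best to convert the whole string to lowercase first, so that words differing only in case, such as "Hello" and "hello", are not counted separately.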

To add an exclusion list, check each word against the list before adding it to the dictionary.

var para1 = "this is para one. I am cat. I am 10 years old. I like fish";
var para2 = "this is para two. I am dog. my age is 12. I can swim";
var para3 = "this is para three. I am cat. I am 9 years. I like rat";
var para4 = "this is para four. I am rat. my age is secret. I hate cat";
var para5 = "this is para five. I am dog. I am 10 years old. I like fish";
const dict = {};
const wordsToExclude = ['this', 'i', 'am'];
const paras = [para1, para2, para3, para4, para5];
paras.forEach(p => {
  // Lowercase, strip periods and commas, then split into words
  const words = p.toLowerCase().replace(/[.,]/g, '').split(' ');
  words.forEach(w => {
    // GUARD: Exclude certain words
    if (wordsToExclude.includes(w)) {
      return;
    }

    // OPTIONAL GUARD: Exclude numbers
    if (w.match(/^\d+$/)) {
      return;
    }

    // Increment the running count for this word
    dict[w] = (dict[w] ? dict[w] : 0) + 1;
  });
});
console.log(dict);

Pro tip: if you only support very modern browsers or are using TypeScript, you can change the line dict[w] = (dict[w] ? dict[w] : 0) + 1; to use the nullish coalescing operator: dict[w] = (dict[w] ?? 0) + 1;
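
The question also asks for the counts to be saved as comma-separated text, which the code above does not do. The following is a minimal sketch of one way to do that, not a definitive implementation. The helper name countsToCsv, the sample object, and the file name counts.csv are hypothetical. It assumes the words contain no commas or quotes (true here, since punctuation is stripped before counting).

```javascript
// Hypothetical helper: turn a word-count object into CSV text,
// one "word,count" row per line, highest count first.
function countsToCsv(counts) {
  return Object.entries(counts)
    // Sort by count descending; break ties alphabetically
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([word, count]) => `${word},${count}`)
    .join('\n');
}

// Illustrative data only; in practice pass the dict built above
const sample = { cat: 3, para: 5, dog: 2 };
const csv = countsToCsv(sample);
console.log(csv);

// Assumption: when running under Node.js, the text could be saved with
// require('fs').writeFileSync('counts.csv', csv);
```

In a browser there is no direct file access, so the string would have to be offered as a download instead; which approach fits depends on where the script runs.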
