Counting identical strings in a huge list of strings



I have more than 600k lines of strings. I want to group identical strings and get the count for each.

For example:

i go to school
i like music
i like games
i like music
i like music
i like games
i like music

So the result would be:

i go to school , 1
i like games  , 2
i like music , 4

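What is the fastest way to do this?

The GroupBy method is what you want. Your strings need to be in a list or something else that implements IEnumerable<string>. File.ReadLines, suggested by spender, returns an IEnumerable<string> that reads the file line by line.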

var stringGroups = File.ReadLines("filename.txt").GroupBy(s => s);
foreach (var stringGroup in stringGroups)
    Console.WriteLine("{0} , {1}", stringGroup.Key, stringGroup.Count());

If you want them ordered from smallest to largest count (as in your example), just add an OrderBy:

...
foreach (var stringGroup in stringGroups.OrderBy(g => g.Count()))
    ...
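
As a self-contained check, the grouping and ordering above can be run against the sample lines from the question instead of a file. This is only a sketch: the `LineCounter` class and `CountAscending` method are names made up for illustration, and the in-memory array stands in for `File.ReadLines`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineCounter
{
    // Hypothetical helper: groups identical lines and orders the groups by count, smallest first.
    public static List<KeyValuePair<string, int>> CountAscending(IEnumerable<string> lines)
    {
        return lines
            .GroupBy(s => s)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderBy(p => p.Value)
            .ToList();
    }

    public static void Main()
    {
        // Sample data from the question, used in place of a file.
        var lines = new[]
        {
            "i go to school", "i like music", "i like games", "i like music",
            "i like music", "i like games", "i like music"
        };

        foreach (var pair in CountAscending(lines))
            Console.WriteLine("{0} , {1}", pair.Key, pair.Value);
    }
}
```

For a real file, `File.ReadLines(path)` can be passed in place of the array.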

You can use LINQ to do it:

// Verbatim string: in a regular literal, "\f" would be read as a form-feed escape.
IEnumerable<string> stringSource = File.ReadLines(@"C:\file.txt");
var result = stringSource
    .GroupBy(str => str)
    .Select(group => new {Value = group.Key, Count = group.Count()})
    .OrderBy(item => item.Count)
    .ToList();
foreach(var item in result)
{
    // item.Value - string value
    // item.Count - count
}

You can try this:


var groupedLines = System.IO.File.ReadAllLines(@"C:tempsamplelines.txt").GroupBy(x=>x);
groupedLines.ToList().ForEach(y => Console.WriteLine("Content: {0} - Occurrences: {1}", y.Key, y.Count()));

Another, "old-school" approach is to iterate over all lines and add each one to a Dictionary (if not already present). The key is the line and the value is the count.

var d = new Dictionary<string, Int32>();
foreach (var line in File.ReadAllLines(@"C:TempFileName.txt"))
     if (d.ContainsKey(line)) d[line]++; else d.Add(line, 1);

The advantage is that this also works with earlier framework versions.
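
A slightly tighter variant of the dictionary loop uses `TryGetValue`, which saves one hash lookup for every repeated line compared with `ContainsKey` followed by `d[line]++`. This is a minimal sketch rather than a measured recommendation: `DictionaryCounter` and `Count` are illustrative names that do not come from any answer here, and whether the difference matters on 600k lines is something to benchmark rather than assume.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryCounter
{
    // Hypothetical helper: counts how often each distinct line occurs.
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            int current;
            // TryGetValue leaves current at 0 when the line has not been seen yet.
            counts.TryGetValue(line, out current);
            counts[line] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Small in-memory sample used in place of a file.
        var lines = new[] { "i like music", "i like games", "i like music" };
        foreach (var pair in Count(lines))
            Console.WriteLine("{0} , {1}", pair.Key, pair.Value);
    }
}
```

Because the method accepts any `IEnumerable<string>`, it can be fed from `File.ReadAllLines` on older frameworks, or from `File.ReadLines` (available from .NET Framework 4) to avoid loading the whole file into memory.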
