I have more than 600k lines of strings. I want to group identical strings and get the count for each one.
For example:
i go to school
i like music
i like games
i like music
i like music
i like games
i like music
So the result would be:
i go to school , 1
i like games , 2
i like music , 4
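What is the fastest way to do this?
The GroupBy method is what you want. You need to have your strings in a list or in something that implements IEnumerable<string>. File.ReadLines, as suggested by spender, returns an IEnumerable<string> that reads the file line by line.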
var stringGroups = File.ReadLines("filename.txt").GroupBy(s => s);
foreach (var stringGroup in stringGroups)
    Console.WriteLine("{0} , {1}", stringGroup.Key, stringGroup.Count());
If you want them ordered from the smallest count to the largest (as in your example), just add an OrderBy:
...
foreach (var stringGroup in stringGroups.OrderBy(g => g.Count()))
...
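To make the GroupBy plus OrderBy idea easy to try without a file, here is a minimal sketch that runs on the sample lines from the question. It assumes a compiler that supports top-level statements (C# 9 or later); the names lines and counts are only illustrative, and the in-memory list stands in for File.ReadLines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample lines from the question, held in memory instead of read from a file.
var lines = new List<string>
{
    "i go to school", "i like music", "i like games", "i like music",
    "i like music", "i like games", "i like music",
};

// Group identical strings, count each group once, then sort by that count.
var counts = lines
    .GroupBy(s => s)
    .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
    .OrderBy(p => p.Value)
    .ToList();

foreach (var pair in counts)
    Console.WriteLine("{0} , {1}", pair.Key, pair.Value);
```

Projecting each group to a key and a count before sorting means Count() is evaluated once per group, rather than once for the sort and again when printing.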
You can use LINQ to do this:
// Verbatim string (@) so that "\f" is not interpreted as a form-feed escape.
IEnumerable<string> stringSource = File.ReadLines(@"C:\file.txt");
var result = stringSource
    .GroupBy(str => str)
    .Select(group => new {Value = group.Key, Count = group.Count()})
    .OrderBy(item => item.Count)
    .ToList();
foreach(var item in result)
{
// item.Value - string value
// item.Count - count
}
You can try this:
var groupedLines = System.IO.File.ReadAllLines(@"C:tempsamplelines.txt").GroupBy(x=>x);
groupedLines.ToList().ForEach(y => Console.WriteLine("Content: {0} - Occurrences: {1}", y.Key, y.Count()));
Another, "old-school" approach is to iterate over all lines and keep them in a Dictionary: add a line with a count of 1 if it is not already present, otherwise increment its count. The key is the line and the value is the count.
var d = new Dictionary<string, int>();
foreach (var line in File.ReadAllLines(@"C:\Temp\FileName.txt"))
{
    // Increment the count for a line seen before; otherwise start it at 1.
    if (d.ContainsKey(line)) d[line]++; else d.Add(line, 1);
}
The advantage is that this also works with earlier framework versions.
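For what it's worth, the same dictionary idea can be written with a single lookup per line by using TryGetValue instead of ContainsKey followed by the indexer. The sketch below is only an illustration: CountLines is a hypothetical helper name, the sample array stands in for the file, and the syntax (top-level statements, out var) assumes a modern compiler (C# 9 or later), so it does not share the old-framework advantage mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: counts how often each line occurs in a single pass.
static Dictionary<string, int> CountLines(IEnumerable<string> lines)
{
    var counts = new Dictionary<string, int>();
    foreach (var line in lines)
    {
        // TryGetValue sets 'current' to 0 when the line has not been seen yet.
        counts.TryGetValue(line, out var current);
        counts[line] = current + 1;
    }
    return counts;
}

// Sample lines from the question, used here instead of a file.
var sample = new[]
{
    "i go to school", "i like music", "i like games", "i like music",
    "i like music", "i like games", "i like music",
};

var result = CountLines(sample);
foreach (var pair in result)
    Console.WriteLine("{0} , {1}", pair.Key, pair.Value);
```

Because the helper accepts any IEnumerable<string>, it could also be fed File.ReadLines so the file is streamed rather than loaded whole. A Dictionary does not guarantee enumeration order, so sort the pairs if the output order matters.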