How to count occurrences of a character in a large (5+ GB) file using C#



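To give some context, I am trying to optimize the following code, which reads a file line by line, buffers the lines, and saves them to the database every 100 lines: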

using (StreamReader sr = new StreamReader(fileName, Encoding.Default)) 
{
    IList<string> list = new List<string>();
    int lineCount = 0;
    foreach (var line in sr.ReadLines((char)someEOL)) //ReadLines is an extension method that yield returns lines based on someEOL while reading character by character 
    {
        list.Add(line); //Keeping it simple for this example. In the actual code it goes through a bunch of operations
        if(++lineCount % 100 == 0) { //Will not work if the total number of lines is not a multiple of 100
            SaveToDB(list);
            list = new List<string>();
        }     
    }
    if(list.Count() > 0)
        SaveToDB(list); //I would like to get rid of this. This is for the case when total number of lines is not a multiple of 100.   
}

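As you will notice, SaveToDB(list) occurs twice in the code above. The second call is needed when total number of lines % 100 != 0 (for example, with 101 lines, the if(lineCount % 100 == 0) check would miss the last line). It is not a big hassle, but I wonder whether I can get rid of it.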

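To do that, if I could read the total number of lines before entering the foreach loop, I could write the if(lineCount % 100 == 0) check differently. But finding the total number of lines requires going through the file character by character to count someEOL, which is a definite no, because the file sizes range from 5 to 20 GB. Is there a way to do the count without a performance penalty (this seems doubtful to me, but maybe there is a solution)? Or is there another way to rewrite this to get rid of the extra SaveToDB(list) call?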
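For reference, the counting pass described above could look like the sketch below. It reads the file in large blocks instead of character by character and counts one byte value. The names ByteCounter, CountByte and CountByteInFile are hypothetical, and the sketch assumes someEOL is encoded as a single byte in the file (which holds for single-byte code pages, but not for every encoding). It still reads the whole file once, so it does not remove the cost of the extra pass.

```csharp
using System;
using System.IO;

public static class ByteCounter
{
    // Counts how many times `target` occurs in `stream`, reading block by block.
    // Assumption: the character of interest is encoded as exactly one byte.
    public static long CountByte(Stream stream, byte target, int bufferSize = 1 << 20)
    {
        var buffer = new byte[bufferSize];
        long count = 0;
        int read;
        // Stream.Read returns 0 once the end of the stream is reached.
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] == target)
                {
                    count++;
                }
            }
        }
        return count;
    }

    // Convenience wrapper that opens the file for sequential reading.
    public static long CountByteInFile(string path, byte target)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 1 << 20, FileOptions.SequentialScan))
        {
            return CountByte(fs, target);
        }
    }
}
```

A call such as ByteCounter.CountByteInFile(fileName, (byte)someEOL) would return the number of EOL bytes, which equals the number of lines only if the file ends with someEOL.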

Your code looks fine, except that it creates a new empty list every time 100 lines are read. In any case, you may want to try this approach:

var enumerator = sr.ReadLines((char)someEOL).GetEnumerator();
bool isValid = true; // declared once here; the loop below only assigns to it
for (int i = 1; isValid; i++)
{
    isValid = enumerator.MoveNext();
    if (isValid)
    {
        list.Add(enumerator.Current);
    }
    if (i % 100 == 0 || (!isValid && list.Count > 0))
    {
        SaveToDB(list);
        // It is better to clear the list than creating new one for each iteration, given that your file is big.
        list.Clear();
    }
}
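Another way to keep a single SaveToDB call is to move the batching into a small helper, so the calling loop only ever sees complete batches plus a final partial one. The sketch below is one possible shape for such a helper; BatchExtensions and Batch are hypothetical names, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

public static class BatchExtensions
{
    // Splits `source` into consecutive lists of at most `size` items.
    // The last list may be shorter; an empty source yields no lists.
    public static IEnumerable<List<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        return BatchIterator(source, size);
    }

    private static IEnumerable<List<T>> BatchIterator<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                // A fresh list per batch, so the caller may keep a reference to the old one.
                batch = new List<T>(size);
            }
        }
        // Emit whatever is left when the item count is not a multiple of `size`.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```

With this in place, the loop from the question would reduce to foreach (var batch in sr.ReadLines((char)someEOL).Batch(100)) SaveToDB(batch);. On .NET 6 and later, the built-in Enumerable.Chunk does a similar job and returns arrays instead of lists.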

I think you are looking for StreamReader.Peek():

sr.Peek().Equals(-1)

Code:

        string filepath = "myfile.txt";
        int lineCount = 0;
        List<string> list = new List<string>();
        using (StreamReader sr = File.OpenText(filepath))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
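                list.Add(line); // buffer the line first; without this, SaveToDB would always receive an empty list
                lineCount++;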
                if (lineCount % 100 == 0 || sr.Peek().Equals(-1))
                { 
                    SaveToDB(list);
                    list = new List<string>();
                }
            }
        }
