Efficiently searching text in a directory and its subdirectories using C#



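I am trying to search for specific occurrences of a string in some of the files belonging to a directory. (The search is also performed in subdirectories.) So far I have come up with this solution:

  1. Get the names of all files in the directory and its subdirectories.
  2. Open the files one by one.
  3. Search for the specific string.
  4. If a file contains it, store the file name in an array.
  5. Continue until the last file.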

    string[] fileNames = Directory.GetFiles(@"d:\test", "*.txt", SearchOption.AllDirectories);
    foreach (string sTem in fileNames)
    {
        foreach (string line in File.ReadAllLines(sTem))
        {
            if (line.Contains(SearchString))
            {
                MessageBox.Show("Found search string!");
                break;
            }
        }
    }
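
Step 4 of the list above (storing the matching file names) is not shown in the snippet, which only pops up a message box. A minimal sketch of that step could look like the following. The class and method names are made up for illustration, and it swaps `File.ReadAllLines` for `File.ReadLines` so that a file is read only up to its first match.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FileSearch
{
    // Hypothetical helper (not from the original post): returns every *.txt file
    // under root, including subdirectories, that contains searchString.
    public static List<string> FindFilesContaining(string root, string searchString)
    {
        var matches = new List<string>();
        foreach (string fileName in Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories))
        {
            // File.ReadLines streams the file; Any stops at the first matching line.
            if (File.ReadLines(fileName).Any(line => line.Contains(searchString)))
            {
                matches.Add(fileName);
            }
        }
        return matches;
    }
}
```

Calling `FileSearch.FindFilesContaining(@"d:\test", SearchString)` would then give the list of matching file names instead of a message box per hit.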
    

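Are there other approaches that are more efficient and faster than this? Using a batch file? That would be fine. Another solution is to use findstr (but how can it be used directly from a C# program, without a batch file?). What is the most efficient approach (or at least more efficient than what I did)? Code examples are much appreciated!

I found another solution.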

Process myproc = new Process();
myproc.StartInfo.FileName = "findstr";
myproc.StartInfo.Arguments = "/m /s /d:\"c:\\REQs\" \"madhuresh\" *.req";
myproc.StartInfo.RedirectStandardOutput = true;
myproc.StartInfo.UseShellExecute = false;

myproc.Start();
string output = myproc.StandardOutput.ReadToEnd();
myproc.WaitForExit();

Is running a process like this a good approach? Comments on this are welcome too!
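
With `/m`, findstr prints one file name per line, so the captured output still has to be split before it can be used as a list. The sketch below wraps the same idea in a helper. The class and method names are hypothetical, it assumes Windows with findstr on the PATH, it uses `/c:` so the text is searched as one literal string, and it assumes the search text contains no double quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class FindStrRunner
{
    // Splits findstr /m output (one file path per line) into a list.
    public static List<string> ParseFileList(string output)
    {
        return new List<string>(
            output.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries));
    }

    // Hypothetical wrapper: runs findstr in 'directory' and returns the files it reports.
    public static List<string> Run(string directory, string text, string filePattern)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "findstr",
            // /m = print file names only, /s = include subdirectories, /c: = literal string
            Arguments = string.Format("/m /s /c:\"{0}\" {1}", text, filePattern),
            WorkingDirectory = directory,
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output pipe cannot block the child process.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return ParseFileList(output);
        }
    }
}
```

The paths in the result are whatever findstr prints, so they may need to be combined with the working directory before the files are opened.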

Following @AbitChev's approach, here is a sleek one (I don't know whether it is efficient!). It searches the directory as well as all of its subdirectories:

IEnumerable<string> s = from file in Directory.EnumerateFiles(@"c:\directorypath", "*.req", SearchOption.AllDirectories)
                        from str in File.ReadLines(file)
                        //where str.Contains("Text@tosearched2")
                        where str.IndexOf(sSearchItem, StringComparison.OrdinalIgnoreCase) >= 0
                        select file;
foreach (string sa in s)
    MessageBox.Show(sa);

(This is for a case-insensitive search. Maybe it helps someone.) Please comment! Thanks.
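
One thing to be aware of: a query of the form `from file ... from str ... select file` yields the file name once per matching line. If each file should be reported only once, a variant along these lines might do. It is a sketch with made-up class and method names, using `Any` so that reading stops at the first match in each file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CaseInsensitiveSearch
{
    // Hypothetical helper: lazily yields each file under root (matching filePattern)
    // that contains text, ignoring case. Each file is yielded at most once.
    public static IEnumerable<string> FilesContaining(string root, string filePattern, string text)
    {
        return Directory.EnumerateFiles(root, filePattern, SearchOption.AllDirectories)
            .Where(file => File.ReadLines(file)
                .Any(line => line.IndexOf(text, StringComparison.OrdinalIgnoreCase) >= 0));
    }
}
```

Because both `EnumerateFiles` and `ReadLines` are lazy, nothing is read until the result is enumerated, for example in a `foreach` loop.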

Use Directory.EnumerateFiles() and File.ReadLines() - both load data lazily:

var matches =
    from file in Directory.EnumerateFiles(path)
    from line in File.ReadLines(file) // each element is one line of the file
    where line.Contains(pattern)
    select new
    {
        FileName = file, // file containing matched string
        Line = line // matched line
    };

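// Parallel variant with per-file error handling. "yield return" is not allowed
// inside a try block that has a catch clause, so the matches are collected in a
// thread-safe bag instead of being yielded.
var matches = new ConcurrentBag<Tuple<string, string>>();
Parallel.ForEach(Directory.EnumerateFiles(path), file =>
{
    try
    {
        foreach (var line in File.ReadLines(file))
        {
            if (line.Contains(pattern))
            {
                // Item1 = file containing matched string, Item2 = matched line
                matches.Add(Tuple.Create(file, line));
            }
        }
    }
    catch (SecurityException)
    {
        // swallow or log
    }
    catch (UnauthorizedAccessException)
    {
        // thrown when access to the file is denied; swallow or log
    }
});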

How about something like this:

var found = false;
string foundFile = null;
foreach (var file in Directory.EnumerateFiles(
            @"d:\tes\",
            "*.txt",
            SearchOption.AllDirectories))
{
    foreach (var line in File.ReadLines(file))
    {
        if (line.Contains(searchString))
        {
            found = true;
            foundFile = file; // remember which file matched
            break;
        }
    }
    if (found)
    {
        break;
    }
}
if (found)
{
    var message = string.Format("Search string found in \"{0}\".", foundFile);
    MessageBox.Show(message);
}

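The advantage of this is that only what is needed is loaded into memory, rather than the names of all the files followed by the full contents of each file.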


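I noticed you are using String.Contains, which performs an ordinal (case-sensitive and culture-insensitive) comparison.

This allows us to do a simple character-by-character comparison.

I would start with a small helper function: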

private static bool CompareCharBuffers(
    char[] buffer,
    int headPosition,
    char[] stringChars)
{
    // null checking and length comparison omitted
    var same = true;
    var bufferPos = headPosition;
    for (var i = 0; i < stringChars.Length; i++)
    {
        if (!stringChars[i].Equals(buffer[bufferPos]))
        {
            same = false;
            break;
        }
        // advance through the circular buffer, wrapping at its end
        bufferPos = (bufferPos + 1) % buffer.Length;
    }
    return same;
}

Then I would change the previous algorithm to use that function, like this:

var stringChars = searchString.ToCharArray();
var found = false;
string foundFile = null;

foreach (var file in Directory.EnumerateFiles(
            @"d:\tes\",
            "*.txt",
            SearchOption.AllDirectories))
{
    using (var reader = File.OpenText(file))
    {
        // circular buffer that holds exactly as many chars as the search string
        var buffer = new char[stringChars.Length];
        if (reader.ReadBlock(buffer, 0, buffer.Length - 1) 
                < stringChars.Length - 1)
        {
            // file is shorter than the search string
            continue;
        }
        var head = 0;
        var nextPos = buffer.Length - 1;
        var nextChar = reader.Read();
        while (nextChar != -1)
        {
            buffer[nextPos] = (char)nextChar;
            if (CompareCharBuffers(buffer, head, stringChars))
            {
               found = true;
               foundFile = file;
               break;
            }
            // move the head forward; the slot it leaves receives the next char
            head = (head + 1) % buffer.Length;
            if (head == 0)
            {
                nextPos = buffer.Length - 1;
            }
            else
            {
                nextPos = head - 1;
            } 
            nextChar = reader.Read();
        }
        if (found)
        {
            break;
        }
    }
}
if (found)
{
    var message = string.Format("Search string found in \"{0}\".", foundFile);
    MessageBox.Show(message);
}

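This holds only as many chars in memory as the search string contains, and uses a rolling buffer over each file. It therefore still works in the theoretical cases where a file contains no new lines and fills the whole disk, or where the search string itself contains a new line.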
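
To make the rolling-buffer idea easier to try in isolation, here is a small sketch that works on any `TextReader`, so it can be exercised with a `StringReader` instead of a file. The class and method names are made up for illustration, and it is a simplified restatement of the approach above rather than the exact code.

```csharp
using System.IO;

public static class RollingSearch
{
    // Hypothetical helper: true if the text supplied by reader contains pattern
    // (ordinal, case-sensitive). Only pattern.Length chars are kept in memory.
    public static bool ReaderContains(TextReader reader, string pattern)
    {
        if (pattern.Length == 0) return true;
        int n = pattern.Length;
        var window = new char[n]; // circular buffer with the last n chars read
        int writePos = 0;         // slot that receives the next char
        int filled = 0;           // number of valid chars in the window (at most n)
        int next;
        while ((next = reader.Read()) != -1)
        {
            window[writePos] = (char)next;
            writePos = (writePos + 1) % n;
            if (filled < n) filled++;
            if (filled < n) continue; // not enough chars for a comparison yet

            // Once the window is full, writePos points at its oldest char.
            bool same = true;
            for (int i = 0; i < n && same; i++)
            {
                same = window[(writePos + i) % n] == pattern[i];
            }
            if (same) return true;
        }
        return false;
    }
}
```

For example, `RollingSearch.ReaderContains(new StringReader("first line\nsecond line"), "line\nsecond")` finds a match that spans the line break, which a line-by-line `Contains` would miss.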


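As further work, I would turn the per-file part of the algorithm into a function and look into a multi-threaded approach.

So this would be the inner function: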

static bool FileContains(string file, char[] stringChars)
{
    using (var reader = File.OpenText(file))
    {
        var buffer = new char[stringChars.Length];
        if (reader.ReadBlock(buffer, 0, buffer.Length - 1) 
                < stringChars.Length - 1)
        {
            return false;
        }
        var head = 0;
        var nextPos = buffer.Length - 1;
        var nextChar = reader.Read();
        while (nextChar != -1)
        {
            buffer[nextPos] = (char)nextChar;
            if (CompareCharBuffers(buffer, head, stringChars))
            {
               return true;
            }
            head = (head + 1) % buffer.Length;
            if (head == 0)
            {
                nextPos = buffer.Length - 1;
            }
            else
            {
                nextPos = head - 1;
            } 
            nextChar = reader.Read();
        }
        return false;
    }
}

Then you could process the files in parallel like this:

var stringChars = searchString.ToCharArray();
if (Directory.EnumerateFiles(
            @"d:\tes\",
            "*.txt",
            SearchOption.AllDirectories)
    .AsParallel()
    .Any(file => FileContains(file, stringChars)))
{
    MessageBox.Show("Found search string!");
}

This works well. I searched for about 500 terms across 230 files in 0.5 milliseconds. It is very memory intensive, though: it loads every file into memory.

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class FindInDirectory
{
    public class Match
    {
        public string Pattern { get; set; }
        public string Directory { get; set; }
        public MatchCollection Matches { get; set; }
    }
    public static List<FindInDirectory.Match> Search(string directory, string searchPattern, List<string> patterns)
    {
        //find all file locations
        IEnumerable<string> files = System.IO.Directory.EnumerateFiles(directory, searchPattern, System.IO.SearchOption.AllDirectories);
        //load all text into memory for MULTI-PATTERN
        //this greatly increases speed, but it requires a ton of memory!
        Dictionary<string, string> contents = files.ToDictionary(f => f, f => System.IO.File.ReadAllText(f));
        List<FindInDirectory.Match> directoryMatches = new List<Match>();
        foreach (string pattern in patterns)
        {
            directoryMatches.AddRange
            (
                contents.Select(c => new Match
                {
                    Pattern = pattern,
                    Directory = c.Key, // path of the file that was searched
                    Matches = Regex.Matches(c.Value, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline)
                })
                .Where(c => c.Matches.Count > 0)//switch to > 1 when program directory is same or child of search
            );
        }
        return directoryMatches;
    }
}

Usage:

    static void Main(string[] args)
    {
        List<string> patterns = new List<string>
        {
            "class",
            "foreach",
            "main",
        };
        string searchPattern = "*.cs";
        string directory = @"C:\SearchDirectory";
        DateTime start = DateTime.UtcNow;
        FindInDirectory.Search(directory, searchPattern, patterns);
        Console.WriteLine((DateTime.UtcNow - start).TotalMilliseconds);
        Console.ReadLine();
    }

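You can use Tasks.Dataflow (this .dll is currently not part of .NET 4.5, but it can be downloaded from here) to create a "pipeline" that consumes all the files and searches for the explicit string. Take a look at this reference implementation.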
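
For readers who have not used Dataflow before, a two-stage pipeline could be sketched roughly as follows. This is not the reference implementation mentioned above. It assumes the System.Threading.Tasks.Dataflow NuGet package is referenced, and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // assumption: NuGet package System.Threading.Tasks.Dataflow

public static class DataflowSearch
{
    // Hypothetical helper: returns the files under root that contain text.
    public static async Task<List<string>> FindFilesAsync(string root, string filePattern, string text)
    {
        var matches = new ConcurrentBag<string>();
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // Stage 1: expand a directory into the paths of the files it contains.
        var listFiles = new TransformManyBlock<string, string>(
            dir => Directory.EnumerateFiles(dir, filePattern, SearchOption.AllDirectories));

        // Stage 2: scan files in parallel and record those that contain the text.
        var scan = new ActionBlock<string>(file =>
        {
            if (File.ReadLines(file).Any(line => line.Contains(text)))
            {
                matches.Add(file);
            }
        }, options);

        // Completion of stage 1 flows on to stage 2 once all paths are handed over.
        listFiles.LinkTo(scan, new DataflowLinkOptions { PropagateCompletion = true });
        listFiles.Post(root);
        listFiles.Complete();
        await scan.Completion;
        return matches.ToList();
    }
}
```

The order of the returned files is not defined, because the scanning stage runs several files at once.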
