Combining GroupBy and counting items matching a condition in LINQ



I am struggling to work out a LINQ statement to summarize some data. I am learning C# by developing a tool to help me clean up duplicate files. I already have a dictionary variable, fileResult, populated with file item information; it is defined as Dictionary<string, FileItem>. The items include Path, FileHash, and FolderDupFileCount (among others).
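The FileItem class itself is not shown in this post; based on the properties referenced in the code below, a rough sketch of it (an assumption, not the actual class) would be:

public class FileItem
{
    // Sketch only: property names are taken from the code in this post; the types are guesses
    public string Path { get; set; }
    public string FileHash { get; set; }
    public int FileHashGroupID { get; set; }
    public int FileHashCount { get; set; }
    public int FolderID { get; set; }
    public int FolderDupFileCount { get; set; }
    public int FolderDupFileCountID { get; set; }
    public string FolderDupFilesHash { get; set; }
}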

I have successfully used this LINQ expression to summarize all the distinct FileHash values, assign a group id, and count all the files that share the same hash.

var fileMD5Groups = fileResult
    .GroupBy(x => x.Value.FileHash)
    .Select((x, xid) => new { x.Key, count = x.Distinct().Count(), id = xid + 1 })
    .ToDictionary(y => y.Key, z => z);
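Each value in the resulting dictionary is the anonymous { Key, count, id } object, keyed by the hash, so a lookup looks roughly like this (the hash value here is only a placeholder):

// "d41d8cd9..." stands in for a hash that exists in fileResult
var group = fileMD5Groups["d41d8cd9..."];
Console.WriteLine($"{group.count} file(s) share this hash (group id {group.id})");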

Now I have the query below, which counts the files in each path. I am trying to work out how to modify this statement so that it counts the files in each path that have duplicates elsewhere (for each path, give me the count of files in that path that are duplicated in other paths).

// Group by Path and Count the files in this path that have duplicates
// fileResult contains a field called FileHash
var folderDuplicateCount =
    from file in fileResult
    group file by file.Value.Path into g
    where g.Count() > 1
    select new { Path = g.Key, FolderDupFileCount = g.Count() };

// Convert to dictionary
Dictionary<string, int> dupResults = folderDuplicateCount
    .ToDictionary(x => x.Path, x => x.FolderDupFileCount);
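One possible direction, sketched under the assumption that FileItem exposes Path and FileHash as above and that a file counts as duplicated whenever its hash occurs more than once overall: first collect the duplicated hashes, then count per path how many files carry one of them.

// Hashes that occur more than once anywhere in the scan
var duplicatedHashes = new HashSet<string>(
    fileResult.Values
        .GroupBy(f => f.FileHash)
        .Where(g => g.Count() > 1)
        .Select(g => g.Key));

// Per path: how many of its files carry one of those duplicated hashes
Dictionary<string, int> dupCountsByPath = fileResult.Values
    .Where(f => duplicatedHashes.Contains(f.FileHash))
    .GroupBy(f => f.Path)
    .ToDictionary(g => g.Key, g => g.Count());

Note that this counts a file as duplicated even when its twin sits in the same path; restricting it to duplicates in other paths only would require checking the set of paths per hash.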

I imagine this is simple for someone with the right skills, and I am working on becoming one of those people, so any help would be much appreciated.

Edit 1: Here is the complete method I am using.

public static bool UpdateFileHashResults(Dictionary<string, FolderItem> folderResult, Dictionary<string, FileItem> fileResult)
{
    var fileMD5Groups = fileResult
        .GroupBy(x => x.Value.FileHash)
        .Select((x, xid) => new { x.Key, count = x.Distinct().Count(), id = xid + 1 })
        .ToDictionary(y => y.Key, z => z);

    // Group by Path and count the files in this path which have the
    // same FileHash as files in other paths
    // fileResult contains a field called FileHash
    var folderDuplicateCount =
        from file in fileResult
        group file by file.Value.Path into g
        where g.Count() > 1
        select new { Path = g.Key, FolderDupFileCount = g.Count() };

    Dictionary<string, int> dupResults = folderDuplicateCount
        .ToDictionary(x => x.Path, x => x.FolderDupFileCount);

    timeItLinq.Stop();
    timeItAssignValue.Restart();

    foreach (var file in fileResult.ToList())
    {
        var ik = file.Key;
        var ivMD5Hash = file.Value.FileHash;
        var fResult = fileResult[ik];
        var ivFileFolder = file.Value.Path;
        fResult.FileHashGroupID = fileMD5Groups[ivMD5Hash].id;
        fResult.FileHashCount = fileMD5Groups[ivMD5Hash].count;

        if (RS.FoldersFound)
        {
            var folResult = folderResult[ivFileFolder];
            fResult.FolderID = folResult.FolderID;
            var dupCount = 0;
            if (dupResults.ContainsKey(ivFileFolder))
            {
                dupCount = dupResults[ivFileFolder];
            }
            fResult.FolderDupFileCount = dupCount;
            folResult.FolderDupFileCount = dupCount;
        }
    }
    return true;
}

Now, var fileResult = fileListing.FindFiles(fileList) is the first assignment, using this interface:

public interface IFileListing
{
    Dictionary<string, FileItem> FindFiles(IEnumerable<string> files);
}

For the folder results, var folderResult = FolderListing.FindFolders(folderPaths); uses the interface below.

public interface IFolderListing
{
    Dictionary<string, FolderItem> FindFolders(IEnumerable<string> folders);
}

Desired result: I am trying to get results grouped by path, with a count of the files in that folder whose FileHash matches files in other paths. So if a path has 10 files, and 2 of them have the same hash as files in another path, then the result for that path's .FolderDupFileCount should be 2.

I hope that makes the desired result clearer.

After learning more about LINQ and more trial and error, I found a working solution. Thanks to NetMage for the questions and comments that helped me think the problem through. I also changed the lambda names as suggested, though I'm not sure they are entirely consistent.

I am posting the working solution; however, I feel the code does not look very elegant, and there are probably better ways to do some of the things in this method. Any suggestions for simplifying and improving it would help me build good programming standards and habits.

For the solution, I removed the folderDuplicateCount query and modified the dupResults query. Since my fileResult dictionary has Path as a field, I used that dictionary instead of the folderResult variable.

The modified dupResults now gives the correct results. I also added two extra computed fields, DupFilesHash = string.Concat(frg.Select(fvg => fvg.FileHash)) and id = frgId + 1. These fields are helpers for updating fields in the dictionaries and are assigned under specific conditions. DupFilesHash is the concatenated hash of the files in a path that have duplicates in other paths. That hash string is then re-hashed to give a unique fingerprint representing those duplicates, which can be used to locate/match the duplicates elsewhere.
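HashTool.MD5StringHash is not shown in this post; presumably it returns a hex MD5 of its input string. A minimal sketch of such a helper, assuming that behavior:

using System;
using System.Security.Cryptography;
using System.Text;

public static class HashTool
{
    // Assumed behavior: MD5 of the UTF-8 bytes of the input, returned as a hex string
    public static string MD5StringHash(string input)
    {
        using var md5 = MD5.Create();
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
        return Convert.ToHexString(hash); // BitConverter.ToString(hash).Replace("-", "") before .NET 5
    }
}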

The biggest thing I could not work out was that after the first .GroupBy(frg => frg.Path) I did not seem to be able to access the other value fields. Then I saw an example showing frg.Select(fvg => fvg.FileHash), the light came on, and I learned something new.

public static bool UpdateFileHashResults(Dictionary<string, FolderItem> folderResult, Dictionary<string, FileItem> fileResult)
{
    // List of file hashes with a count of files with identical hashes
    var fileMD5Groups = fileResult.FileItemDictionaryToList()
        .GroupBy(kvg => kvg.FileHash)
        .Select((kvg, kvgId) => new
        {
            kvg.Key,
            count = kvg.Distinct().Count(),
            id = kvgId + 1
        })
        .ToDictionary(krg => krg.Key, kvg => kvg);

    // List of all folders and a count of the number of files in this folder
    // that have the same file hash in another folder(s)
    var dupResults = fileResult.FileItemDictionaryToList()
        .Where(frg => fileMD5Groups[frg.FileHash].count > 1)
        .GroupBy(frg => frg.Path)
        .Select((frg, frgId) => new
        {
            Path = frg.Key,
            NumberOfFilesWithDuplicates = frg.Count(),
            DupFilesHash = string.Concat(frg.Select(fvg => fvg.FileHash)),
            id = frgId + 1
        })
        .ToDictionary(frg => frg.Path, fvg => fvg);

    // Loop over all files and back load values into folder and file results dictionaries
    timeItAssignValue.Restart();
    foreach (KeyValuePair<string, FileItem> file in fileResult.ToList())
    {
        string ik = file.Key;
        string ivMD5Hash = file.Value.FileHash;
        FileItem fResult = fileResult[ik];
        string ivFileFolder = file.Value.Path;
        int fileHashCount = fileMD5Groups[ivMD5Hash].count;
        fResult.FileHashGroupID = fileMD5Groups[ivMD5Hash].id;
        fResult.FileHashCount = fileHashCount;

        if (RS.FoldersFound)
        {
            FolderItem folResult = folderResult[ivFileFolder];
            fResult.FolderID = folResult.FolderID;
            int dupCount = 0;
            int dupID = 0;
            string dupFilesHash = "";

            if (dupResults.ContainsKey(ivFileFolder) && fileHashCount > 1)
            {
                dupCount = dupResults[ivFileFolder].NumberOfFilesWithDuplicates;
                dupID = dupResults[ivFileFolder].id;
                dupFilesHash = dupResults[ivFileFolder].DupFilesHash;
                dupFilesHash = HashTool.MD5StringHash(dupFilesHash);
            }

            fResult.FolderDupFileCount = dupCount;
            folResult.FolderDupFileCount = dupCount;
            fResult.FolderDupFileCountID = dupID;
            folResult.FolderDupFileCountID = dupID;
            fResult.FolderDupFilesHash = dupFilesHash;
        }
    }
    return true;
}
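For completeness, the call site based on the assignments mentioned earlier would look roughly like this (fileList, folderPaths, and the listing instances come from the rest of the application and are not shown in the post):

// fileListing : IFileListing, FolderListing : IFolderListing (instances assumed to exist)
var fileResult = fileListing.FindFiles(fileList);
var folderResult = FolderListing.FindFolders(folderPaths);
bool updated = UpdateFileHashResults(folderResult, fileResult);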
