PowerShell: Compare 2 large CSV files to find users that don't exist in one of them

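I have two CSV files, each with roughly 10,000 users. I need to count how many users appear in csv1 but not in csv2. Right now I have the code below. However, I know it is probably very inefficient, since it may loop through up to 10,000 users 10,000 times. The code takes a long time to run, and I'm sure there must be a more efficient way. Any help or suggestions are appreciated. I'm new to PowerShell.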

foreach ($csv1User in $csv1) {
    $found = $false
    foreach ($csv2User in $csv2) {
        if ($csv1User.identifier -eq $csv2User.identifier)
        {
            $found = $true
            break
        }
    }
    if ($found -ne $true){
        $count++
    }
}

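If you replace the nested loops with HashSets, there are two ways to compute the exceptions between the two lists, that is, the identifiers that appear in only one of them:

Using `SymmetricExceptWith()`

The HashSet<T>.SymmetricExceptWith() method lets us compute the subset of items that are present in either collection but not in both. Note that this includes users found only in csv2 as well as users found only in csv1: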

# Create hashset from one list
$userIDs = [System.Collections.Generic.HashSet[string]]::new([string[]]$csv1.identifier)
# Pass the other list to `SymmetricExceptWith`
$userIDs.SymmetricExceptWith([string[]]$csv2.identifier)
# Now we have an efficient filter!
$relevantRecords = @($csv1;$csv2) |Where-Object { $userIDs.Contains($_.identifier) } |Sort-Object -Unique identifier
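
Since the question asks only about users in csv1 that are missing from csv2, a one-directional difference may be closer to what is needed. The following is a minimal sketch rather than part of the original answer: it swaps in `ExceptWith()` for `SymmetricExceptWith()`, assumes both files have an `identifier` column, and uses made-up sample data and the illustrative name `$onlyInCsv1`.

```powershell
# Hypothetical sample data standing in for the two imported CSV files
$csv1 = 'identifier', 'bob', 'sally', 'john', 'sue' | ConvertFrom-Csv
$csv2 = 'identifier', 'bill', 'sally', 'john', 'stan' | ConvertFrom-Csv

# Build a case-insensitive set from the first list of identifiers
$onlyInCsv1 = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]$csv1.identifier,
    [System.StringComparer]::OrdinalIgnoreCase
)

# Remove every identifier that also occurs in the second list
$onlyInCsv1.ExceptWith([string[]]$csv2.identifier)

# What remains are the users present in csv1 but not in csv2
$count = $onlyInCsv1.Count
$count   # → 2 (bob and sue)
```

A HashSet[string] compares case-sensitively by default, whereas `-eq` in the original loop ignores case. Passing `OrdinalIgnoreCase` is one way to keep the two behaviours aligned.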

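Using sets to track duplicates

Similarly, we can use hash sets to track which identifiers have been observed at least once and which have been observed more than once: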

# Create sets for tracking
$seenOnce = [System.Collections.Generic.HashSet[string]]::new()
$seenTwice = [System.Collections.Generic.HashSet[string]]::new()
# Loop through whole superset of records
foreach ($record in @($csv1; $csv2)) {
    # Always attempt to add to the $seenOnce set
    if (!$seenOnce.Add($record.identifier)) {
        # We've already seen this identifier once, add it to $seenTwice
        [void]$seenTwice.Add($record.identifier)
    }
}
# Just like the previous example, we now have an efficient filter!
$relevantRecords = @($csv1; $csv2) | Where-Object { $seenOnce.Contains($_.identifier) -and -not $seenTwice.Contains($_.identifier) } | Sort-Object -Unique identifier

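Using a hash table as a grouping construct

You can also use a dictionary type (for example, a [hashtable]) to group the records from both CSV files by their identifier, and then filter on the number of records stored in each dictionary entry: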

# Groups records on their identifier value
$groupsById = @{}
foreach ($record in @($csv1; $csv2)) {
    if (-not $groupsById.ContainsKey($record.identifier)) {
        $groupsById[$record.identifier] = @()
    }
    $groupsById[$record.identifier] += $record
}
# Filter based on number of records with a distinct identifier
$relevantRecords = $groupsById.GetEnumerator() | Where-Object { $_.Value.Count -eq 1 } | Select-Object -Expand Value
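
When only the one-directional count matters, a hashtable can also serve as a plain lookup index instead of a grouping structure. This is a minimal sketch under that assumption, with made-up sample data; `$inCsv2` and `$missing` are illustrative names, not part of the original answer.

```powershell
# Hypothetical sample data standing in for the two imported CSV files
$csv1 = 'identifier', 'bob', 'sally', 'john', 'sue' | ConvertFrom-Csv
$csv2 = 'identifier', 'bill', 'sally', 'john', 'stan' | ConvertFrom-Csv

# Index the second file by identifier (keys of @{} are case-insensitive)
$inCsv2 = @{}
foreach ($record in $csv2) {
    $inCsv2[$record.identifier] = $true
}

# Keep only the records from the first file whose identifier is not a key in the index
$missing = @($csv1 | Where-Object { -not $inCsv2.ContainsKey($_.identifier) })
$missing.Count   # → 2 (bob and sue)
```

Each `ContainsKey()` call is a hash lookup, so the second file is walked once to build the index and the first file once to filter it.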

If you are only looking for the count, this should be faster.

$csv2 = Import-Csv $csvfile2
# -notin keeps the rows of csv1 whose identifier is absent from csv2
Import-Csv $csvfile1 |
    Where-Object identifier -notin $csv2.identifier |
    Measure-Object | Select-Object -ExpandProperty Count

Here is a small example:

$csvfile1 = New-TemporaryFile
$csvfile2 = New-TemporaryFile
@'
identifier
bob
sally
john
sue
'@ | Set-Content $csvfile1 -Encoding UTF8
@'
identifier
bill
sally
john
stan
'@ | Set-Content $csvfile2 -Encoding UTF8
$csv2 = Import-Csv $csvfile2
# -notin keeps the rows of csv1 whose identifier is absent from csv2
Import-Csv $csvfile1 |
    Where-Object identifier -notin $csv2.identifier |
    Measure-Object | Select-Object -ExpandProperty Count

The output is simply:

2
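
The membership operator in the pipeline above scans the whole `$csv2.identifier` array for every row of the first file, so the work still grows with the product of the two file sizes. A possible variant, sketched here under the assumption that both files have an `identifier` column, does the lookup against a HashSet instead. The sample files are recreated with made-up data so the block runs on its own, and `$known` is an illustrative name.

```powershell
# Recreate two small sample files (made-up data) so the sketch is self-contained
$csvfile1 = New-TemporaryFile
$csvfile2 = New-TemporaryFile
'identifier', 'bob', 'sally', 'john', 'sue'   | Set-Content $csvfile1 -Encoding UTF8
'identifier', 'bill', 'sally', 'john', 'stan' | Set-Content $csvfile2 -Encoding UTF8

# Load the second file's identifiers into a case-insensitive set
$known = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)
foreach ($row in Import-Csv $csvfile2) {
    [void]$known.Add($row.identifier)
}

# Walk the first file and count rows whose identifier is not in the set
$count = 0
foreach ($row in Import-Csv $csvfile1) {
    if (-not $known.Contains($row.identifier)) { $count++ }
}
$count   # → 2 (bob and sue)

# Clean up the temporary files
Remove-Item $csvfile1, $csvfile2
```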
