I have a text file with more than 100,000 rows. The header and column fields are separated by tabs, and the file contains duplicate entries. I want to remove those duplicates with a batch script or a PowerShell script whenever the VATRegistrationNumber is repeated (i.e. an entry that already exists under GroupID 1000 appears again under GroupID 2000).
Input file:
Supplier GroupID Name1 City VATRegistrationNumber Region
9032512 1000 PENRITH PERFORMING AND VISUAL ARTS Penrith 97003605089 NSW
9032923 1000 TORONTO MULTI PURPOSE CENTRE INC Toronto 85659352794 NSW
9035991 1000 SYDNEY'S BEST MAINTENANCE AND CLEAN Panania 42616361365 NSW
9044080 1000 FIRST PEOPLES DISABILITY NETWORK Sydney 58169154330 NSW
9044292 1000 PUNCHY DIGITAL MEDIA PTY LTD South Melbourne 22166978669 VIC
9044691 1000 POLARON LANGUAGE SERVICES PTY LTD St Kilda 99120446362 VIC
9047594 1000 WORKSPACE COMMERCIAL FURNITURE PTY Rosebery 16619275213 NSW
9075591 1000 ENGADINE COMMUNITY SERVICES INC Engadine 39108386726 NSW
9076178 1000 CENTRAL COAST AGEING AND DISABILITY Newcastle West 60874001192 NSW
9076319 1000 YASS SOLDIERS' CLUB LTD Yass 44000988152 NSW
9076370 1000 ARTFUL FOOD CO. PTY LIMITED EMU PLAINS 52600861612 NSW
9078232 1000 WSG AUSTRALIA PTY LTD NORWOOD SOUTH 76604823430 SA
9078312 1000 CAZNER PTY LTD Manly 39648003570 NSW
9078463 1000 MILSONS POINT DIGITAL PTY LTD Milsons Point 88635600681 NSW
9079757 1000 RISK & SECURITY MANAGEMENT PTY LTD Brisbane 71613136048 QLD
9079793 1000 PETRUSIC, MILOJKA Seven Hills 86077460484 NSW
8487483 2000 FOTI, STEPHANIE BLACKTOWN 78620350307 NSW
7458490 2000 STAFF CHECK PTY LIMITED Marrickville 36227046572 NSW
75637 2000 ANDERSON, ALLAN GRAHAM Gladesville 49340951752 NSW
362789 2000 STAR TRACK EXPRESS PTY LIMITED CRONULLA 82104931562 NSW
952678 2000 CALCUTTA GROUP PTY LTD KATOOMBA 52699520223 NSW
2403417 2000 The trustee for Moore Family Trust NORWOOD SOUTH 76604823430 SA
94905 2000 MEYER PTY LTD Manly 39648003570 NSW
426783 2000 DAWSON DIGITAL PTY LTD Milsons Point 88635600681 NSW
8588214 2000 HOME MANAGEMENT PTY LTD Brisbane 71613136048 QLD
Desired output file:
Supplier GroupID Name1 City VATRegistrationNumber Region
9032512 1000 PENRITH PERFORMING AND VISUAL ARTS Penrith 97003605089 NSW
9032923 1000 TORONTO MULTI PURPOSE CENTRE INC Toronto 85659352794 NSW
9035991 1000 SYDNEY'S BEST MAINTENANCE AND CLEAN Panania 42616361365 NSW
9044080 1000 FIRST PEOPLES DISABILITY NETWORK Sydney 58169154330 NSW
9044292 1000 PUNCHY DIGITAL MEDIA PTY LTD South Melbourne 22166978669 VIC
9044691 1000 POLARON LANGUAGE SERVICES PTY LTD St Kilda 99120446362 VIC
9047594 1000 WORKSPACE COMMERCIAL FURNITURE PTY Rosebery 16619275213 NSW
9075591 1000 ENGADINE COMMUNITY SERVICES INC Engadine 39108386726 NSW
9076178 1000 CENTRAL COAST AGEING AND DISABILITY Newcastle West 60874001192 NSW
9076319 1000 YASS SOLDIERS' CLUB LTD Yass 44000988152 NSW
9076370 1000 ARTFUL FOOD CO. PTY LIMITED EMU PLAINS 52600861612 NSW
9078232 1000 WSG AUSTRALIA PTY LTD NORWOOD SOUTH 76604823430 SA
9078312 1000 CAZNER PTY LTD Manly 39648003570 NSW
9078463 1000 MILSONS POINT DIGITAL PTY LTD Milsons Point 88635600681 NSW
9079757 1000 RISK & SECURITY MANAGEMENT PTY LTD Brisbane 71613136048 QLD
9079793 1000 PETRUSIC, MILOJKA Seven Hills 86077460484 NSW
8487483 2000 FOTI, STEPHANIE BLACKTOWN 78620350307 NSW
7458490 2000 STAFF CHECK PTY LIMITED Marrickville 36227046572 NSW
75637 2000 ANDERSON, ALLAN GRAHAM Gladesville 49340951752 NSW
362789 2000 STAR TRACK EXPRESS PTY LIMITED CRONULLA 82104931562 NSW
952678 2000 CALCUTTA GROUP PTY LTD KATOOMBA 52699520223 NSW
This is what I have tried so far in PowerShell, but it does not produce the desired result because it removes more rows than I expect:
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Get-Content C:\Test\Input.txt | ForEach-Object {
    if ($Unique.Add(($_.Split('VATRegistrationNumber'))[-2])) { $_ }
} | Set-Content C:\Test\Output.txt
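One likely reason for the over-deletion, assuming Windows PowerShell: .Split('VATRegistrationNumber') binds to the char[] overload of String.Split, so the line is split on every individual character found in that string rather than on the column name, and [-2] is then usually not the bare VAT number. A minimal sketch illustrating this, with a made-up sample line:
# Hypothetical sample line (fields are tab-separated in the real file):
$line = "9078312`t1000`tCAZNER PTY LTD`tManly`t39648003570`tNSW"

# In Windows PowerShell this splits on every single character of
# 'VATRegistrationNumber' (V, A, T, R, e, g, ...), not on the word itself:
$parts = $line.Split('VATRegistrationNumber')
$parts[-2]    # not the bare VAT number, so rows are matched incorrectly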
Any help would be much appreciated. Thank you.
Update:
Thanks, everyone, for the help. I personally found the solution provided by iRon the most helpful; I only had to make a few adjustments to fit my requirements. Below is the final PowerShell code I am using now, which processes a file with more than 120,000 entries in under a minute:
$OutputTXT = ".\Output.txt"
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" | ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }
} |
ConvertTo-Csv -Delimiter "`t" -NoTypeInformation |
ForEach-Object { $_ -replace '"' } |
Set-Content $OutputTXT -Encoding Unicode
With Export-Csv, every output field was wrapped in double quotes. So I did some research and now use ConvertTo-Csv instead, because it lets me strip the double quotes with ForEach-Object { $_ -replace '"' }. The first line of the output also contained type information I did not want, which I removed with -NoTypeInformation.
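If you happen to be on PowerShell 7 or later (an assumption; the code above also works in Windows PowerShell 5.1), Export-Csv supports -UseQuotes Never, which makes the quote-stripping step unnecessary, and -NoTypeInformation is already the default behaviour there. A sketch under that assumption:
# PowerShell 7+ only: -UseQuotes Never writes the fields without quotes.
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" | ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }
} | Export-Csv .\Output.txt -Delimiter "`t" -UseQuotes Never -Encoding Unicode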
The solution provided by boxdog is also very good, and super fast:
$OutputTXT = ".\Output.txt"
Import-Csv 'input.txt' -Delimiter "`t" |
Group-Object -Property VATRegistrationNumber |
ForEach-Object {
    if($_.Count -gt 1) {
        $_.Group | Select-Object -First 1
    }
    else {
        $_.Group
    }
} |
ConvertTo-Csv -Delimiter "`t" -NoTypeInformation |
ForEach-Object { $_ -replace '"' } |
Set-Content $OutputTXT -Encoding Unicode
Select-Object -Unique is unnecessarily slow and exhausting. I would indeed use a HashSet for this, but rather than the Get-Content/Set-Content cmdlets, I recommend that you simply use the Import-Csv/Export-Csv cmdlets, as they automatically take care of your properties (columns):
$Unique = [System.Collections.Generic.HashSet[string]]::new()
Import-Csv .\Input.txt -Delimiter "`t" | ForEach-Object {
    if ($Unique.Add($_.VATRegistrationNumber)) { $_ }
} | Export-Csv .\Output.txt -Delimiter "`t"
This works for the given sample, although I'm not sure what the performance will be like with 100,000 entries. Give it a try and see if it helps.
Import-Csv 'input.txt' -Delimiter "`t" |
Group-Object -Property VATRegistrationNumber |
ForEach-Object {
    if($_.Count -gt 1) {
        $_.Group | Select-Object -First 1
    }
    else {
        $_.Group
    }
}
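As a side note on the pipeline above (my own shortened variant, not part of the original answer): Select-Object -First 1 also returns the single row of a one-entry group, so the if/else branch could arguably be collapsed:
# Keep only the first row seen for each VATRegistrationNumber.
Import-Csv 'input.txt' -Delimiter "`t" |
    Group-Object -Property VATRegistrationNumber |
    ForEach-Object { $_.Group[0] }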
@ECHO OFF
SETLOCAL enabledelayedexpansion
rem The following settings for the source directory, destination directory and
rem filenames are names that I use for testing, and deliberately include names
rem which include spaces, to make sure that the process works using such names.
rem These will need to be changed to suit your situation.
SET "sourcedir=u:\your files"
SET "destdir=u:\your results"
SET "filename1=%sourcedir%\q72613575.txt"
SET "outfile=%destdir%\outfile.txt"
:: remove variables starting #
FOR /F "delims==" %%b In ('set # 2^>Nul') DO SET "%%b="
(
FOR /f "usebackqdelims=" %%b IN ("%filename1%") DO (
SET "line=%%b"
CALL :isunique
IF DEFINED unique ECHO %%b
)
)>"%outfile%"
TYPE "%outfile%"
GOTO :EOF
:isunique
rem Note: for tab-delimited input, the character between "line:" and =" in the
rem next line should be a literal TAB (ASCII 9); it may be shown as a space here.
FOR %%e IN ("%line: =" "%") DO IF "%%~e" neq "" SET "vat=!region!"&SET "region=%%~e"
IF DEFINED #%vat% (SET "unique=") ELSE (SET "unique=Y"&SET "#%vat%=Y")
GOTO :eof
Pure batch solution.
Each line is read into %%b and assigned to line so that it can be processed in :isunique.
:isunique quotes each string on the line by replacing each separator with " ". Ignoring empty results, this leaves vat set to the second-to-last string on the line and region set to the last. unique is then set or cleared depending on whether the variable #vatnumber is already defined, and the vatnumber is recorded as having been encountered.
I duplicated your test data a number of times to make a data file of more than 100,000 lines; it took me about 4 minutes 30 seconds.
The problem with processing large data files in a batch script is the time it can take. An efficient batch-file solution should execute as few instructions as possible, as quickly as possible. The solution below uses very few instructions, but it relies on a large number of environment variables; if the number of distinct keys (the VATRegistrationNumber field) is large, performance will degrade as the file grows. Only a timing test can tell whether this method is fast enough...
@echo off
setlocal
for /F "delims==" %%v in ('set $ 2^>NUL') do set "%%v="
(for /F "tokens=1-6 delims= " %%a in (input.txt) do (
if not defined $%%e (
set "$%%e=1"
echo %%a %%b %%c %%d %%e %%f
)
)) > output.txt
Important: note that in the "delims= " option, and between each column in the echo command, the separator must be a TAB (ASCII 9) character. On this site that character is usually replaced by spaces.
PS - It would be very interesting if you posted timing tests for each of the solutions...
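A minimal sketch of such a timing test for the PowerShell variants, assuming each one has been saved to its own script file (the file names below are hypothetical):
# Hypothetical script names; point these at wherever each variant was saved.
foreach ($script in '.\dedup-hashset.ps1', '.\dedup-groupobject.ps1') {
    $elapsed = Measure-Command { & $script }
    '{0}: {1:n1} seconds' -f $script, $elapsed.TotalSeconds
}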