Runtime/performance of awk on a 4 GB file

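I wrote a script that gives me the line following a pattern (the line I need sits between a line of 46 '=' characters above it and another one below it), together with that line's line number. After that I run sed to format the result so that only the lines between the 46 '=' lines remain. I write that to a file so I can work with it further.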

The file I get from this is very small, with at most 30 matches.

I started with this:

awk '/^={46}$/{ n=NR+1 } n>=NR {print NR","$0}' $file1 | sed -n '2~4p' > tmpfile$1

This takes 115 seconds for a 4 GB file, 12 seconds for a 1 GB file, and 2 seconds for a 100 MB file.

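I noticed that the last match is always the same across all files but unique within each file, so I worked in an exit. The last match occurs somewhere after 50k-500k lines, and a lot of lines follow it: 67 million lines in the 4 GB file (last match at 71k), 26 million lines in the 1 GB file (last match at 168k), and 2 million lines in the 100 MB file (last match at 414k).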

awk '/^={46}$/{ n=NR+1 } n>=NR {print NR","$0} /*unique string here*/{exit}' $file1 | sed -n '2~4p' > tmpfile$1

The resulting times are:
70 seconds for the 4 GB file, 2 seconds for the 1 GB file, and 1 second for the 100 MB file.
That is an improvement.

I also tried a different order:

awk '1;/*unique string here*/{exit}' $file1 | awk '/^={46}$/{ n=NR+1 } n>=NR {print NR","$0}' | sed -n '2~4p' > tmpfile$1

This takes 70 seconds for the 4 GB file, 5 seconds for the 1 GB file, and 1 second for the 100 MB file.

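Now, while having an exit in awk is an improvement, I expected better performance for the 4 GB file given how early the last match occurs, at least judging by how much time I saved with the 1 GB file.
Since the 3rd awk is slower than the 2nd for the 1 GB file but takes the same time for the 4 GB file, I suspect I am running into some memory problem because the 4 GB file is simply too large. I am only using an Ubuntu VM with 2 CPUs and 4 GB of RAM.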
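One way to narrow this down (a minimal sketch, not something from the original post) is to time a command that does nothing except find the unique string and stop. If that alone is slow on the 4 GB file, the time is being spent on reading the file rather than on the matching logic. The temporary file and the marker text below are hypothetical stand-ins for $file1 and the real unique string.

```shell
# Hypothetical stand-ins: "$file" for $file1, "$marker" for the unique string.
file=$(mktemp)
printf '%s\n' 'a' 'b' 'the unique string' 'c' 'd' > "$file"
marker='the unique string'

# index() is a plain substring search (no regexp); exit stops reading
# the input at the first line that contains the marker.
last=$(awk -v m="$marker" 'index($0, m) { print NR; exit }' "$file")
echo "marker found on line $last"

rm -f "$file"
```

On the real input, prefixing the awk command with `time` shows how long it takes just to reach the last match.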

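This is my first time really using awk, sed, and scripting, so I don't know what to do now to get better times for the 4 GB file. For the 1 GB file I can live with 2 seconds.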

Example input:

Random text here
blab
==============================================
Here is the string I need
==============================================
------------------------
random stuff
------------------------
other stuff
==============================================
Here is the 2nd string I need
==============================================
i dont need this string here
Random stuff
==============================================
last string I need, that is the same across all files
==============================================
a lot of lines are following the last match

Output:

5,Here is the string I need
15,Here is the 2nd string I need
22,last string I need, that is the same across all files

edit1: I will update and try new things on Monday (spinning up a similar VM with more memory).

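edit2: After spinning up a new VM and running more tests on larger files (around 15 GB), with caching taken into account as a factor, I did not notice any big difference in runtime between the different pieces of code posted here.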

But the flag on, flag off approach {f=!f;next} is certainly much more elegant than my code, so thanks to James Brown and Ed Morton. If I could, I would accept both of your answers :)

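You don't need sed when you're using awk. You don't need to escape = because it is not a metacharacter. String concatenation is slow. Regexp comparisons are slower than string comparisons. Testing n>=NR makes no sense, because n is only ever greater than NR on the ==* line, which you don't want. You are currently printing the line after every == line, but you only want the lines between pairs of them. If your "unique string" is one of the lines you want printed, then just test for it where you print, rather than on every line in the file. Try this: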

$ awk -v OFS=',' '
    $0=="=============================================="{f=!f; next}
    f {print NR, $0; if (/unique string/) exit}
' file
5,Here is the string I need
15,Here is the 2nd string I need
22,last string I need, that is the same across all files
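To make the difference concrete, here is a small sketch that runs both approaches on a toy input. It rests on assumptions that are not in the original: the sample file is made up, the delimiters are shortened to "=====", and the regexp is relaxed to /^=+$/ so the short delimiters match.

```shell
# Hypothetical sample: short "=====" delimiters stand in for the 46-character ones.
sample=$(mktemp)
printf '%s\n' 'header' '=====' 'first wanted line' '=====' 'filler' \
              '=====' 'second wanted line' '=====' 'trailer' > "$sample"

# Logic from the question (regexp relaxed to /^=+$/): every delimiter line
# and the line after it are printed, so 4 lines come out per delimiter pair.
raw=$(awk '/^=+$/ { n=NR+1 } n>=NR { print NR "," $0 }' "$sample")

# Flag toggle: only the lines between an opening and a closing delimiter.
wanted=$(awk -v OFS=',' '/^=+$/ { f=!f; next } f { print NR, $0 }' "$sample")

printf 'question logic:\n%s\n\nflag toggle:\n%s\n' "$raw" "$wanted"
rm -f "$sample"
```

The first command prints eight lines for the two delimiter pairs, which is why the question needed sed -n '2~4p' to pick the second line out of every group of four. The flag version prints only the two wanted lines.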

To see how much difference the regexp comparison makes, you could also try this:

awk -v OFS=',' '
    /^={46}$/{f=!f; next}
    f {print NR, $0; if (/unique string/) exit}
' file

It may be faster still not to force awk to count the 46 = characters at all:

awk -v OFS=',' '
    /^=+$/{f=!f; next}
    f {print NR, $0; if (/unique string/) exit}
' file
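Before timing these variants against each other, it may be worth checking that they select the same lines. The sketch below does that for the string comparison and the /^=+$/ regexp. The input file is a hypothetical toy, and passing the delimiter in through a variable d is my own convenience, not part of the commands above.

```shell
# Build a line of 46 '=' characters without typing it out.
delim=$(awk 'BEGIN { while (i++ < 46) s = s "="; print s }')

# Hypothetical toy input; use the real file to get meaningful timings.
file=$(mktemp)
printf '%s\n' 'noise' "$delim" 'wanted 1' "$delim" 'noise' \
              "$delim" 'unique string' "$delim" 'tail' > "$file"

# Variant 1: exact string comparison against the delimiter.
by_string=$(awk -v OFS=',' -v d="$delim" '
    $0 == d { f=!f; next }
    f { print NR, $0; if (/unique string/) exit }
' "$file")

# Variant 2: regexp that accepts any line made only of "=".
by_regex=$(awk -v OFS=',' '
    /^=+$/ { f=!f; next }
    f { print NR, $0; if (/unique string/) exit }
' "$file")

[ "$by_string" = "$by_regex" ] && echo "same output"
rm -f "$file"
```

Once the outputs agree, each awk command can be prefixed with `time` and run on the large file. Running each one more than once helps separate the effect of the file cache from the effect of the comparison itself.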

How about this:

$ awk '/^={46}$/ {f=!f; next} f {print NR, $0}' file
5 Here is the string I need
15 Here is the 2nd string I need
22 last string I need, that is the same across all files

A line of =s flips the flag f. Lines after it are printed until the next line of =s flips the flag back.
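The flipping can be watched line by line with a small trace. This is only an illustrative sketch: the five-line input and the short "=====" delimiter (matched with /^=+$/) are made up for the example.

```shell
# Trace the value of f for each input line of a made-up five-line input.
trace=$(printf '%s\n' 'skip' '=====' 'keep' '=====' 'skip' |
awk '
    # A delimiter line flips the flag and is never printed itself.
    /^=+$/ { f = !f; print NR ": delimiter, f is now " f; next }
    # Any other line would be printed only while f is true.
    { print NR ": f=" (f+0), (f ? "printed" : "skipped") }
')
printf '%s\n' "$trace"
```

This prints:

1: f=0 skipped
2: delimiter, f is now 1
3: f=1 printed
4: delimiter, f is now 0
5: f=0 skipped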
