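I have a simple .txt log file to which an application appends lines while it is running. Each line consists of a timestamp followed by text of variable length: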
17-06-25 06:37:43 xxxxxxxxxxxxxxx
17-06-25 06:37:46 yyyyyyy
17-06-25 06:37:50 zzzzzzzzzzzzzzzzzzzzzzzzzzzz
...
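I need to extract all lines whose timestamp is later than a given date-time. This usually concerns only the last 20-40 log entries (lines).
The problem is that the file is large and keeps growing.
If all lines were the same length, I would use a binary search. But they are not, so I ended up with the following: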
Private Sub ExtractNewestLogs(dEarliest As Date)
    'Reads the whole file from the first line on; dEarliest is not used yet.
    Using oSRLog As New StreamReader(gsFilLog)
        Dim sLine As String = oSRLog.ReadLine()
        Do While sLine IsNot Nothing
            Debug.Print(sLine)
            sLine = oSRLog.ReadLine()
        Loop
    End Using
End Sub
This is not really fast.
Is there a way to read such a file "backwards", i.e. last line first? If not, what other options do I have?
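The function below uses a BinaryReader to read the last x bytes of the file and returns them as an array of lines. Extracting the last records from that is much faster than reading the whole log file. You can fine-tune the number of bytes to read with a rough estimate of how many bytes the last 20-40 log entries take up. On my PC, reading the last 10,000 characters of a 17 MB text file took 10 milliseconds.
Of course, this code assumes that your log file is plain ASCII text.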
'Requires Imports System.IO and Imports System.Text.
Private Function ReadLastbytes(filePath As String, x As Long) As String()
    Dim fileData() As Byte
    Using oFileStream As New FileStream(filePath, FileMode.Open, FileAccess.Read)
        Using oBinaryReader As New BinaryReader(oFileStream)
            'Start x bytes before the end, or at the beginning if the file is shorter.
            Dim lStart As Long = Math.Max(0L, oFileStream.Length - x)
            oBinaryReader.BaseStream.Seek(lStart, SeekOrigin.Begin)
            'Read everything from lStart to the end of the file.
            fileData = oBinaryReader.ReadBytes(CInt(oFileStream.Length - lStart))
        End Using
    End Using
    'Decode as ASCII and drop any NUL bytes.
    Dim sText As String = Encoding.ASCII.GetString(fileData).Replace(vbNullChar, "")
    'The first element is usually a partial line, because the read starts mid-line.
    Return sText.Split({vbCrLf}, StringSplitOptions.RemoveEmptyEntries)
End Function
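To get from the returned chunk to the lines that are newer than a given date-time, the timestamps still have to be compared. The following is a minimal sketch of that step, not part of the original answer. `FilterNewerThan` is a hypothetical helper name, and the format string "yy-MM-dd HH:mm:ss" is an assumption based on the sample lines in the question (the sample does not say whether the first field is the year or the day).

```vbnet
Imports System.Collections.Generic
Imports System.Globalization

Module LogTailFilter
    'Hypothetical helper: keeps the lines whose leading timestamp is later than dEarliest.
    'Assumes the first 17 characters look like "17-06-25 06:37:43" (yy-MM-dd HH:mm:ss).
    Public Function FilterNewerThan(asLines As String(), dEarliest As Date) As List(Of String)
        Dim oResult As New List(Of String)
        For Each sLine As String In asLines
            Dim dStamp As Date
            'Lines that are too short or do not parse (e.g. a partial first line) are skipped.
            If sLine.Length >= 17 AndAlso
               Date.TryParseExact(sLine.Substring(0, 17), "yy-MM-dd HH:mm:ss",
                                  CultureInfo.InvariantCulture, DateTimeStyles.None,
                                  dStamp) AndAlso
               dStamp > dEarliest Then
                oResult.Add(sLine)
            End If
        Next
        Return oResult
    End Function
End Module
```

A call could then look like FilterNewerThan(ReadLastbytes(gsFilLog, 10000), dEarliest). If the first complete line in the chunk is already newer than dEarliest, the chunk was probably too small and a larger x should be tried.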
I tried a binary search anyway, even though the file does not have a fixed line length.
First some considerations, then the code:
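Sometimes you need to extract the last n lines of a log file based on an ascending sort key at the start of each line. The key can really be anything, but in log files it typically represents a date-time, usually in the format YYMMDDHHNNSS (possibly with some interpunctuation).
Log files are usually text-based files with many lines, sometimes millions. Often log files have a fixed line width, in which case a specific key is easy to reach with a binary search. Probably just as often, however, log files have a variable line width. To access these, you can use an estimate of the average line width to calculate a file position from the end, and then process sequentially from there to EOF.
However, as shown below, a binary approach can be used for these files as well. The advantage shows as soon as file sizes grow. The maximum size of a log file is determined by the file system: NTFS allows 16 EiB (16 x 2^60 B) in theory; in practice, under Windows 8 or Server 2012, the limit is 256 TiB (256 x 2^40 B).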
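(What 256 TiB actually means: a typical log file is designed to be human-readable and rarely has much more than 80 characters per line. Assume your log file logs along happily and completely uninterrupted for 12 years, a total of 4,383 days of 86,400 seconds each. Your application could then write about 9 entries per millisecond into that log file and would only reach the 256 TiB limit in its 13th year.)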
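As a rough check of the parenthetical estimate above, assuming exactly 80 bytes per line and 12 years of 365.25 days each:

```latex
\[
\frac{256 \times 2^{40}\,\text{B}}{80\,\text{B/line}} \approx 3.52 \times 10^{12}\ \text{lines},
\qquad
4383 \times 86\,400\,\text{s} = 378\,691\,200\,\text{s} \approx 3.79 \times 10^{11}\,\text{ms}
\]
\[
\frac{3.52 \times 10^{12}\ \text{lines}}{3.79 \times 10^{11}\,\text{ms}} \approx 9.3\ \text{lines per millisecond}
\]
```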
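The big advantage of the binary approach is that n comparisons suffice for a log file of 2^n bytes, so the advantage grows rapidly with the file size: 10 comparisons are needed for a file of 1 KiB (1 per 102.4 B), but only 20 for 1 MiB (about 1 per 51.2 KiB), 30 for 1 GiB (about 1 per 34.1 MiB), and a mere 40 for 1 TiB (1 per 25.6 GiB).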
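The comparison counts above can be reproduced with a few lines of code. This is only an illustrative sketch; `ComparisonsNeeded` is a hypothetical name, and it counts idealized halvings of a byte window, ignoring the extra forward scan for the next CR/LF that a real log file search needs.

```vbnet
Module BinarySearchCost
    'Number of halvings needed to narrow a window of lBytes bytes down to one byte.
    Public Function ComparisonsNeeded(lBytes As Long) As Integer
        Dim iSteps As Integer = 0
        Dim lWindow As Long = lBytes
        Do While lWindow > 1
            lWindow = (lWindow + 1) \ 2 'Round up so odd sizes are not underestimated.
            iSteps += 1
        Loop
        Return iSteps
    End Function
End Module
```

For example, ComparisonsNeeded(1024) gives 10 and ComparisonsNeeded(1024L * 1024L) gives 20, so a file a thousand times larger costs only ten more comparisons.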
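On to the function. These assumptions are made: the log file is encoded in UTF-8, the log lines are separated by CR/LF sequences, and the timestamp is located at the start of each line in ascending order, probably in the format [YY]YYMMDDHHNNSS, possibly with some interpunctuation in between. (All of these assumptions could easily be changed and handled by overloaded function calls.)
In an outer loop, the binary narrowing is done by comparing against the provided earliest date to match. As soon as a new position within the stream has been determined, an independent forward search is made in an inner loop to find the next CR/LF sequence. The byte after this sequence marks the start of the record's key. If this key is larger than or equal to the one we are searching for, it is ignored. Only if the key found is smaller than the one we are searching for is its position remembered as a possible candidate for the record just before the ones we want. We end up with the last record whose key is smaller than the searched key.
In the end, all log records after this final candidate are returned to the caller as a string array.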
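The narrowing described above is easier to follow on an in-memory array first. The sketch below is not the file-based function; it is a plain lower-bound binary search over an already sorted array of keys, with the hypothetical name `FirstIndexNotBefore`, and it assumes ordinal string comparison matches the sort order of the keys.

```vbnet
Module LowerBoundSketch
    'Returns the index of the first key that is >= sEarliest, or asKeys.Length if none is.
    'asKeys must be sorted in ascending (ordinal) order.
    Public Function FirstIndexNotBefore(asKeys As String(), sEarliest As String) As Integer
        Dim iLow As Integer = 0
        Dim iHigh As Integer = asKeys.Length
        Do While iLow < iHigh
            Dim iMid As Integer = iLow + (iHigh - iLow) \ 2
            If String.CompareOrdinal(asKeys(iMid), sEarliest) < 0 Then
                iLow = iMid + 1   'Key too small: the answer lies to the right.
            Else
                iHigh = iMid      'Key qualifies: keep it and look further left.
            End If
        Loop
        Return iLow
    End Function
End Module
```

The file-based version differs mainly in that a byte position rarely falls on a line start, which is why it first scans forward to the next CR/LF and remembers the last record with a smaller key instead of the first record with a qualifying one.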
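The ExtractLogLines function below requires System.IO; SequenceEqual additionally needs System.Linq, which VB projects usually import by default.
Imports System.IO
Imports System.Linq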
'This function expects a log file organized in lines of varying lengths,
'delimited by CR/LF. Each line starts with a sort criterion of any kind (in
'log files typically YYMMDD HHMMSS), by which the lines are sorted in
'ascending order (newest log line at the end of the file). The earliest
'match allowed to be returned must be provided; the sort key's length is
'inferred from it. This key need not necessarily exist in the file. If it
'does, it can occur multiple times, like all other sort keys. The returned
'string array contains all lines that follow the last line found to have
'a key smaller than the provided sort key.
Public Shared Function ExtractLogLines(sLogFile As String,
                                       sEarliest As String) As String()
    Dim oFS As New FileStream(sLogFile, FileMode.Open, FileAccess.Read,
                              FileShare.Read)   'The log file as file stream.
    Dim lMin, lPos, lMax As Long                'Examined stream window.
    Dim i As Long                               'Iterator to find CR/LF.
    Dim abEOL(0 To 1) As Byte                   'Bytes to find CR/LF.
    Dim abCRLF() As Byte = {13, 10}             'Search for CR/LF.
    Dim bFound As Boolean                       'CR/LF found.
    Dim iKeyLen As Integer = sEarliest.Length   'Length of sort key.
    Dim sActKey As String                       'Key of examined log record.
    Dim abKey() As Byte                         'Reading the current key.
    Dim lCandidate As Long                      'File position of promising candidate.
    Dim sRecords As String                      'All wanted records.
    'The byte array accepting the records' keys is as long as the provided
    'key.
    ReDim abKey(0 To iKeyLen - 1)               '0-based!
    'We search the last log line whose sort key is smaller than the sort
    'key provided in sEarliest.
    lMin = 0                                    'Start at stream start
    lMax = oFS.Length - 1 - 2                   '0-based, and without terminal CRLF.
    Do
        lPos = (lMax - lMin) \ 2 + lMin         'Position to examine now (integer division).
        'Although the key to be compared with sEarliest is located after
        'lPos, it is important that lPos itself is not modified when
        'searching for the key.
        i = lPos                                'Iterator for the CR/LF search.
        bFound = False
        Do While i < lMax
            oFS.Seek(i, SeekOrigin.Begin)
            oFS.Read(abEOL, 0, 2)
            If abEOL.SequenceEqual(abCRLF) Then 'CR/LF found.
                bFound = True
                Exit Do
            End If
            i += 1
        Loop
        If Not bFound Then
            'Between lPos and lMax no more CR/LF could be found. This means
            'that the search is over.
            Exit Do
        End If
        i += 2                                  'Skip CR/LF.
        oFS.Seek(i, SeekOrigin.Begin)           'Read the key after the CR/LF
        oFS.Read(abKey, 0, iKeyLen)             'into a string.
        sActKey = System.Text.Encoding.UTF8.GetString(abKey)
        'Compare the actual key with the earliest key. We want to find the
        'largest key just before the earliest key.
        If sActKey >= sEarliest Then
            'Not interested in this one, look for an earlier key.
            lMax = lPos
        Else
            'Possibly interesting, remember this.
            lCandidate = i
            lMin = lPos
        End If
    Loop While lMin < lMax - 1
    'lCandidate is the position of the first record to be taken into account.
    'Note that we need the final CR/LF here, so that the search for the
    'next CR/LF sequence following below will match a valid first entry even
    'in case there are no entries to be returned (sEarliest being larger than
    'the last log line).
    ReDim abKey(CInt(oFS.Length - lCandidate - 1))  '0-based.
    oFS.Seek(lCandidate, SeekOrigin.Begin)
    oFS.Read(abKey, 0, CInt(oFS.Length - lCandidate))
    'We're done with the stream.
    oFS.Close()
    'Convert into a string, but omit the first line, then return as a
    'string array split at CR/LF, without the empty last entry.
    sRecords = (System.Text.Encoding.UTF8.GetString(abKey))
    sRecords = sRecords.Substring(sRecords.IndexOf(Chr(10)) + 1)
    Return sRecords.Split(ControlChars.CrLf.ToCharArray(),
                          StringSplitOptions.RemoveEmptyEntries)
End Function