How to collect a list of specific substrings on different substring triggers

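I have a large text dataset (~2 GB) of engineering information written in Cobol. I am trying to extract certain substrings from it and build a CSV list from the extracted data.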

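The substrings of interest appear at known positions in each record. However, the data itself has no unique identifier (primary key). It is just a list of data in which each "record" begins with a line that starts with "01". Every subsequent line belongs to the same record until the next "01". Whether a given line type is present may vary, but when it is present, its data appears at fixed positions.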

The data looks like this:

Line1: 01253820RELEVANTSUBSTRING39ALSORELEVANT0990
Line2: 02999IRRELEVANT
Line3: 0420180101RELEVANTMONTHLYDATA000MORERELEVANTDATA8980
Line4: 0420190101FURTHERRELEVANTMONTHLYDATA
Line5: 12000003848982IRRELEVANT
Line6: 0100NEWRECORD8932000
Line7: 0420100101MORE

I have been able to successfully extract the relevant substrings from each "01" line using the following code (partially included below):

    static void PopulateList()
    {
        using (StreamReader sr = new StreamReader(sourcePath))
        {
            string ctrl; // control key - indicates a new record if "01"
            List<TurbineModel> turbines = new List<TurbineModel>();
            List<string> lines = File.ReadAllLines(sourcePath).ToList();
            foreach (string line in lines)
            {
                if (line.Substring(0, 2) == "01")
                {
                    ctrl = line.Substring(0, 2);
                    TurbineModel newTurbine = new TurbineModel();
                    newTurbine.Ctrl = ctrl;
                    turbines.Add(newTurbine);
                }
            }
        }
    }

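This code works fine. However, there are also lines starting with "04" that contain additional information, and I have not been able to extract that information and group it with the current "01" list. I can extract the substrings from every line that starts with "04", but I cannot link each one to the "01" record that precedes it.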

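I need the code to do the following:

1) Find an "01" in the data and set up a new record
2) Extract the relevant information from the "01" line (as in the code above)
3) Skip subsequent lines until a "04" is reached
4) If a "04" is reached, extract the substrings from that line and group them with the "01" substrings
5) Continue scanning lines until a new "01" is reached, at which point a new record is set up and the process starts over
6) Output everything as CSV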

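What I cannot do is group the information so that I know which "04" belongs to which "01".

Any help you can offer is much appreciated. Please let me know if I can clarify anything.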

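It seems to me that all you need to do is create a class that can store the data from the "01" line and can also hold the relevant parts of the lines that follow it.

Below is an example in which we loop over each line in the file. If the line starts with "01", we create a new Item and store the line in its Data property (you could process the line contents here to populate other properties). If the line does not start with "01" and we have already created an Item, then if the line starts with "04" we add it to the item's AssociatedData property (again, you could process the line in some way and add only the relevant parts to the Item).

In the end, we have a list of Item objects, each created from a line starting with "01" and holding the "04" lines that follow it, up to the next line that starts with "01".

First, the Item class: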

    public class Item
    {
        public string Data { get; set; }
        public List<string> AssociatedData { get; set; } = new List<string>();

        // This returns a comma-separated line representing this item
        public string GetCsvString()
        {
            return $"{Data},{string.Join(",", AssociatedData)}";
        }
    }

Then the code that creates a list of these from the file data:

    public static List<Item> GetItems(string filePath)
    {
        var items = new List<Item>();
        Item current = null;

        foreach (var line in File.ReadAllLines(filePath))
        {
            if (line.StartsWith("01"))
            {
                // If there's already a current item, add it to our list
                if (current != null) items.Add(current);
                // Here we would parse the '01' line and set properties of the current item
                current = new Item {Data = line};
            }
            else if (line.StartsWith("04"))
            {
                // Here we would parse the '04' line and set properties of the current item
                current?.AssociatedData.Add(line);
            }
        }

        // Add the final item to our list
        if (current != null) items.Add(current);
        return items;
    }

The code that calls the above method would then look like this:

    var items = GetItems(@"f:\public\temp\temp.txt");

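To write the items out to a CSV file, it is probably best either to override the ToString() method on the Item class or to provide a GetCsvString() method that emits the relevant data in the correct format. After that, you can write the items to a csv file like this: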

    File.WriteAllLines(@"f:\public\temp\temp.csv", items.Select(item => item.GetCsvString()));
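One thing to watch with a plain string.Join is that an extracted field containing a comma or a quote would shift the CSV columns. The following is only a minimal sketch of one way to guard against that, using the common convention of quoting such fields and doubling embedded quotes. The names Csv, Escape and ToRow are hypothetical and not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, shown for illustration only.
public static class Csv
{
    // Quotes a field if it contains a comma, a quote, or a line break,
    // and doubles any embedded quotes. Other fields are returned unchanged.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV row: the "01" data first, then each "04" entry.
    public static string ToRow(string data, IEnumerable<string> associated)
    {
        var fields = new List<string> { Escape(data) };
        foreach (var entry in associated) fields.Add(Escape(entry));
        return string.Join(",", fields);
    }
}
```

With something like this in place, GetCsvString() could return Csv.ToRow(Data, AssociatedData) instead of interpolating the raw strings.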

Give this a try. It is a "chunk reader" :) I have used something similar in the past. It may need some work, but it parses your sample into 2 "chunks".

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace Solution
    {
        class Solution
        {
            static void Main(string[] args)
            {
                var reader = new ChunkReader();
                foreach (Chunk c in reader.Read(@"D:\test.txt"))
                {
                    Console.WriteLine(c.Header);
                }
                Console.ReadKey();
            }
        }

        internal class ChunkReader
        {
            // Lazily yields one Chunk per "01" line, reading the file line by line
            public IEnumerable<Chunk> Read(string filePath)
            {
                Chunk currentChunk = null;
                using (StreamReader reader = new StreamReader(File.OpenRead(filePath)))
                {
                    string currentLine;
                    while ((currentLine = reader.ReadLine()) != null)
                    {
                        if (currentLine.StartsWith("01"))
                        {
                            if (currentChunk != null)
                            {
                                yield return currentChunk;
                            }
                            currentChunk = new Chunk();
                            currentChunk.Contents.Add(currentLine);
                        }
                        else
                        {
                            // Lines before the first "01" are ignored
                            currentChunk?.Contents.Add(currentLine);
                        }
                    }
                }
                // Return the last chunk, unless the file had no "01" line at all
                if (currentChunk != null)
                {
                    yield return currentChunk;
                }
            }
        }

        internal class Chunk
        {
            public Chunk()
            {
                // A List keeps the lines in file order and keeps duplicate lines
                Contents = new List<string>();
            }

            public List<string> Contents { get; }

            public string Header
            {
                get
                {
                    return Contents.FirstOrDefault(s => s.StartsWith("01"));
                }
            }
        }
    }

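First, as others have suggested, if your file is very large you should consider an alternative to File.ReadAllLines(), because it loads the whole file into memory and can get expensive. But since that is not what the question is about, I will move past it.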
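For reference, one such alternative is File.ReadLines(), which enumerates the file lazily instead of loading every line up front. The following is only a rough sketch of how the grouping could be done over a lazy sequence. RecordReader, Group and GroupFile are hypothetical names chosen for this illustration, and each record is represented simply as a list of raw lines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, shown for illustration only.
public static class RecordReader
{
    // Groups lines into records: each record starts at an "01" line and
    // collects the "04" lines that follow it, until the next "01" line.
    public static IEnumerable<List<string>> Group(IEnumerable<string> lines)
    {
        List<string> current = null;
        foreach (var line in lines)
        {
            if (line.StartsWith("01", StringComparison.Ordinal))
            {
                if (current != null) yield return current;
                current = new List<string> { line };
            }
            else if (line.StartsWith("04", StringComparison.Ordinal))
            {
                // "04" lines seen before any "01" line are dropped
                current?.Add(line);
            }
        }
        if (current != null) yield return current;
    }

    // File.ReadLines yields lines one at a time, so only the record
    // currently being built has to be held in memory.
    public static IEnumerable<List<string>> GroupFile(string path)
    {
        return Group(File.ReadLines(path));
    }
}
```

Because both methods are lazy, each record can be written to the CSV as soon as it is produced, rather than after the entire file has been read.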

To start, here are two dummy functions that stand in for extracting the necessary data once you know whether a line starts with 01 or 04.

    static string Extract01Data(string line)
    {
        return line;
    }

    static string Extract04Data(string line)
    {
        return line;
    }
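Since the question says the substrings of interest sit at known positions, these placeholders would eventually slice fixed-width fields out of each line. Here is a hedged sketch of what that might look like. The offsets are assumptions read off the sample lines in the question, not the real record layout, and FixedWidth and Slice are hypothetical names.

```csharp
using System;

// Hypothetical helper, shown for illustration only.
public static class FixedWidth
{
    // Returns up to 'length' characters starting at 'start'. If the line is
    // too short, returns whatever is available (possibly an empty string)
    // instead of throwing the way a bare Substring call would.
    public static string Slice(string line, int start, int length)
    {
        if (start >= line.Length) return string.Empty;
        return line.Substring(start, Math.Min(length, line.Length - start)).Trim();
    }

    // Assumed layout of an "01" line, based only on the sample data:
    // one field of 17 characters at offset 8, one of 12 at offset 27.
    public static string Extract01Data(string line)
    {
        return Slice(line, 8, 17) + "," + Slice(line, 27, 12);
    }

    // Assumed layout of an "04" line: an 8-character date at offset 2.
    public static string Extract04Data(string line)
    {
        return Slice(line, 2, 8);
    }
}
```

For the first sample line in the question, Extract01Data would pick out RELEVANTSUBSTRING and ALSORELEVANT. The real offsets would have to come from the Cobol record definition.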

EDIT

Edited the answer to accommodate multiple lines starting with 04 after the first 01 line.

Also, a simple class to hold the resulting data:

    public class Record
    {
        public string OneInfo { get; set; }
        public List<string> FourInfo { get; set; } = new List<string>();
    }

Then here is my code, with explanations in the comments:

    static void Main()
    {
        var file = @"C:\Users\gurudeniyas\Desktop\CobolData.txt";
        var lines = File.ReadAllLines(file).ToList();
        var records = new List<Record>();

        for (var count = 0; count < lines.Count; count++)
        {
            var line = lines[count];
            var firstTwo = line.Substring(0, 2);

            // Iterate till we find a line that starts with 01
            if (firstTwo == "01")
            {
                // Create a Record and add 01 line related data
                var rec = new Record
                {
                    OneInfo = Extract01Data(line)
                };

                // Here we iterate over the following lines, looking for ones that start with 04
                // If we find them, extract 04 data and add it to the record
                // Break out of the loop if we find the next 01 line or EOF
                do
                {
                    count++;
                    if (count == lines.Count)
                        break;
                    line = lines[count];
                    firstTwo = line.Substring(0, 2);
                    if (firstTwo == "04")
                    {
                        rec.FourInfo.Add(Extract04Data(line));
                    }
                } while (firstTwo != "01");

                // If we found the next 01, backtrack count by 1 so the outer loop processes that line as a new record
                if (firstTwo == "01")
                {
                    count--;
                }
                records.Add(rec);
            }
        }
        Console.ReadLine();
    }

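If a "04" always comes after an "01", you can add an else branch as shown below and then access the last item in the list (this works because adding an item to a list appends it to the end).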

    foreach (string line in lines)
    {
        if (line.Substring(0, 2) == "01")
        {
            ctrl = line.Substring(0, 2);
            TurbineModel newTurbine = new TurbineModel();
            newTurbine.Ctrl = ctrl;
            turbines.Add(newTurbine);
        }
        else if (line.Substring(0, 2) == "04")
        {
            // The most recently added turbine is the "01" record this "04" line belongs to
            var lastTurbine = turbines[turbines.Count - 1];
            //do what you need to do with the "04" record monthly data here
        }
    }

Have you considered using a finite state machine algorithm? It seems ideal for this.
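To make the suggestion concrete, here is a minimal sketch of what a two-state machine for this format might look like. It is one possible reading of the idea rather than a definitive design. RecordStateMachine, its State enum and Run are hypothetical names, and each record is kept as a list of raw lines (the "01" line followed by its "04" lines).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, shown for illustration only.
public static class RecordStateMachine
{
    private enum State { SeekingRecord, InRecord }

    // Returns one list per record: the "01" line followed by its "04" lines.
    public static List<List<string>> Run(IEnumerable<string> lines)
    {
        var records = new List<List<string>>();
        var state = State.SeekingRecord;

        foreach (var line in lines)
        {
            bool is01 = line.StartsWith("01", StringComparison.Ordinal);
            bool is04 = line.StartsWith("04", StringComparison.Ordinal);

            switch (state)
            {
                case State.SeekingRecord:
                    // Ignore everything until the first "01" line
                    if (is01)
                    {
                        records.Add(new List<string> { line });
                        state = State.InRecord;
                    }
                    break;

                case State.InRecord:
                    if (is01)
                    {
                        // A new "01" closes the previous record and opens the next
                        records.Add(new List<string> { line });
                    }
                    else if (is04)
                    {
                        // Attach the "04" line to the record currently open
                        records[records.Count - 1].Add(line);
                    }
                    // Any other line type is skipped
                    break;
            }
        }
        return records;
    }
}
```

The explicit states mostly pay off if more line types need handling later, since each new rule becomes another case or transition instead of another nested condition.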
