How do I convert this text into the required array format and export it as CSV?



I extracted this text from a PDF using the pdftotext tool.

The text is structured as follows:

stage    title1    title2  title3  title4
I        value1    value2  value3  
II                         value5  value6
stage    Other1      Other2     Other3     Other4
I        otherval1   otherval2  otherval3  otherval4

Now I want to export this text to CSV with the proper columns and headers, or build an array like this:

[
"category" => "title1",
"score"    => "value1",
],
[
"category" => "title2",
"score"    => "value2",
],
[
"category" => "title3",
"score"    => "value3"
],
// unable to do this
[
"category" => "title3",
"score"    => "value5"
],
[
"category" => "title4",
"score"    => "value6",
],
.
.
// so on
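
For reference, a minimal sketch of how that target array could be built, assuming each text row has already been mapped to its column titles as an associative array (the parser shown further down does not produce this yet). The function name toCategoryScores and the 'stage' key are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: flatten rows keyed by column title into a list of
// ["category" => ..., "score" => ...] pairs, skipping the stage label.
function toCategoryScores(array $mappedRows): array
{
    $result = [];
    foreach ($mappedRows as $row) {
        foreach ($row as $category => $score) {
            if ($category === 'stage') {
                continue; // the stage label is not a score
            }
            $result[] = ['category' => $category, 'score' => $score];
        }
    }
    return $result;
}

// Assumed input: one associative array per stage row, missing columns omitted.
$mappedRows = [
    ['stage' => 'I',  'title1' => 'value1', 'title2' => 'value2', 'title3' => 'value3'],
    ['stage' => 'II', 'title3' => 'value5', 'title4' => 'value6'],
];

$scores = toCategoryScores($mappedRows);
```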

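The constraints are:

  • Column values in the stage I and stage II rows are optional, but every column has a value in at least one of the two rows
  • The stage II row is optional; it may or may not be present
  • If the stage II row is present, it contains at least one column value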

The problem I am facing is how to map

  • value5 to title3
  • value6 to title4
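
One possible way to get this mapping, sketched under the assumption that the extracted text keeps its column alignment (for example when pdftotext is run with its -layout option) and that the text is plain ASCII: record the character offset of every header, then assign each value to the header whose offset is closest to its own. The helper names tokensWithOffsets and mapRowToHeaders are hypothetical.

```php
<?php
// Hypothetical helper: return every non-space token with its byte offset.
function tokensWithOffsets(string $line): array
{
    preg_match_all('/\S+/', $line, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0]; // each entry is [token, offset]
}

// Hypothetical helper: map each value in a data row to the header whose
// start offset is nearest to the value's start offset.
function mapRowToHeaders(string $headerLine, string $dataLine): array
{
    $headers = tokensWithOffsets($headerLine);
    $mapped = [];
    foreach (tokensWithOffsets($dataLine) as [$value, $offset]) {
        $best = null;
        $bestDistance = PHP_INT_MAX;
        foreach ($headers as [$title, $headerOffset]) {
            $distance = abs($offset - $headerOffset);
            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $best = $title;
            }
        }
        $mapped[$best] = $value;
    }
    return $mapped;
}

$header = 'stage    title1    title2  title3  title4';
// Padded programmatically so that value5 starts under title3.
$stage2 = str_pad('II', 27) . 'value5  value6';

$row = mapRowToHeaders($header, $stage2);
```

Matching on the nearest offset rather than an exact one tolerates values that start a character or two away from their header. It would still fail if the extraction collapses runs of spaces, because the positions are then gone.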

Here is my parser code (PHP):

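// Split the extracted text into lines and drop the empty ones.
$rows = explode("\n", $pdfExtractedText);
$rows = array_values(array_filter($rows));
// Split each line on spaces; array_filter() discards the empty strings
// produced by runs of spaces, so the column positions are lost here.
$categories = array_values(array_filter(explode(" ", $rows[7])));
$stage1Scores = array_values(array_filter(explode(" ", $rows[8])));
$stage2Scores = array_values(array_filter(explode(" ", $rows[9])));
var_dump($categories);
var_dump($stage1Scores);
var_dump($stage2Scores);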

Output:

// categories
array:13 [
0 => "stage"
1 => "title1"
2 => "title2"
3 => "title3"
4 => "title4"
]
//values - Index preserved so that I can map with categories
array:14 [
0 => "I"
1 => "value1"
2 => "value2"
3 => "value3"
4 => "value4"
]
// index not preserved :(
array:2 [
0 => "II"
1 => "value5",
2 => "value6"
]

Then I tried this:

$csv = "";
$csv .= implode(",", $categories) . PHP_EOL;
$csv .= implode(",", $stage1Scores) . PHP_EOL;
$csv .= implode(",", $stage2Scores) . PHP_EOL;

and then wrote it to a file.
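
As a rough sketch of the export step, assuming each row is available as an associative array keyed by column title (which the code above does not yet provide), the built-in fputcsv() can write the file and handle quoting, with missing columns filled in as empty cells so every value stays under its header. The function name writeCsv is hypothetical.

```php
<?php
// Hypothetical helper: write a header line plus one CSV line per row,
// emitting an empty cell for any column the row does not contain.
function writeCsv(string $path, array $headers, array $rows): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    // Separator, enclosure and escape are passed explicitly; an empty
    // escape string turns off PHP's non-standard escape handling.
    fputcsv($handle, $headers, ',', '"', '');
    foreach ($rows as $row) {
        $line = [];
        foreach ($headers as $header) {
            $line[] = $row[$header] ?? '';
        }
        fputcsv($handle, $line, ',', '"', '');
    }
    fclose($handle);
}

$headers = ['stage', 'title1', 'title2', 'title3', 'title4'];
$rows = [
    ['stage' => 'I',  'title1' => 'value1', 'title2' => 'value2', 'title3' => 'value3'],
    ['stage' => 'II', 'title3' => 'value5', 'title4' => 'value6'],
];

$path = tempnam(sys_get_temp_dir(), 'csv');
writeCsv($path, $headers, $rows);
$written = file_get_contents($path);
```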
