PDF extraction of structured data without OCR

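I have been trying to extract data from PDF files with C#, including the data inside tables. My goal is to extract this data without OCR and without any third-party libraries and their licenses, and to do so without losing the data's structure. I need this to build a DLL for PDF automation.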

I believe the best way to achieve this is to use a library called iTextSharp. It is readily available as a NuGet package.

Here is an example:

using System;
using System.IO;
using System.Linq;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace Pdf2Text
{
    class Program
    {
        static void Main(string[] args)
        {
            // Expect the PDF path as the first command-line argument.
            if (!args.Any()) return;
            var file = args[0];
            if (!File.Exists(file)) return;

            // Write the extracted text next to the input file, with a .txt extension.
            var output = Path.ChangeExtension(file, ".txt");
            var bytes = File.ReadAllBytes(file);
            File.WriteAllText(output, ConvertToText(bytes), Encoding.UTF8);
        }

        private static string ConvertToText(byte[] bytes)
        {
            var sb = new StringBuilder();
            PdfReader reader = null;
            try
            {
                reader = new PdfReader(bytes);
                // iTextSharp page numbers are 1-based.
                for (var page = 1; page <= reader.NumberOfPages; page++)
                {
                    // AppendLine keeps the text of consecutive pages from running together.
                    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
                }
            }
            catch (Exception exception)
            {
                Console.WriteLine(exception.Message);
            }
            finally
            {
                // Release the reader even if extraction failed part-way.
                if (reader != null) reader.Close();
            }
            return sb.ToString();
        }
    }
}

P.S. Since you do not want an OCR solution, this approach will work. However, it will not work if the PDF holds its data inside images; for those, OCR is the only solution.
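To make that limitation easier to spot in practice, here is a minimal sketch that flags pages which produced no text and are therefore probably images. `ScanCheck` and `FindPagesWithoutText` are hypothetical names, not part of iTextSharp; the method only expects the per-page strings that `PdfTextExtractor.GetTextFromPage` returned in the loop above. An empty page only suggests a scan, it does not prove one.

```csharp
using System.Collections.Generic;

public static class ScanCheck
{
    // Returns the 1-based numbers of the pages whose extracted text is
    // null, empty, or whitespace only (hypothetical helper for illustration).
    public static List<int> FindPagesWithoutText(IEnumerable<string> pageTexts)
    {
        var result = new List<int>();
        var pageNumber = 0;
        foreach (var text in pageTexts)
        {
            pageNumber++;
            if (string.IsNullOrWhiteSpace(text)) result.Add(pageNumber);
        }
        return result;
    }
}
```

To use it, collect each page's text into a list inside the extraction loop instead of appending straight to the `StringBuilder`, then pass that list to `FindPagesWithoutText`.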

Give it a try and let me know how it goes.

using System;
using System.IO;
using System.Text;
using Word = Microsoft.Office.Interop.Word;

namespace PDF_EXTRACT
{
    public class pdfTohtm
    {
        // path: full path of the PDF; outpath: target path WITHOUT an extension (Word appends ".htm").
        public static string ConvertPdf(string path, string outpath)
        {
            Word.Application app = new Word.Application();
            Word.Document doc1 = null;
            try
            {
                // Configure Word before opening the file so the settings apply to the conversion.
                app.Visible = false;
                app.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone; // ask Word not to show alert dialogs
                app.FileValidation = Microsoft.Office.Core.MsoFileValidationMode.msoFileValidationSkip;
                app.AutomationSecurity = Microsoft.Office.Core.MsoAutomationSecurity.msoAutomationSecurityForceDisable;

                // Word converts the PDF into an editable document when it opens it.
                doc1 = app.Documents.Open(path, ConfirmConversions: false, ReadOnly: false);

                // Save as filtered HTML in UTF-8 so the file can be read back with a known encoding.
                doc1.WebOptions.Encoding = Microsoft.Office.Core.MsoEncoding.msoEncodingUTF8;
                doc1.SaveAs2(outpath, Word.WdSaveFormat.wdFormatFilteredHTML, ReadOnlyRecommended: false);

                string result = File.ReadAllText(outpath + ".htm", Encoding.UTF8);
                return "success:" + result;
            }
            catch (Exception e)
            {
                return "failed::::" + e;
            }
            finally
            {
                // Close the document and quit Word even when the conversion fails.
                if (doc1 != null) doc1.Close(SaveChanges: false);
                app.Quit();
                System.Runtime.InteropServices.Marshal.FinalReleaseComObject(app);
            }
        }
    }
}

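Explanation: this solution opens the PDF as an editable Word document and then saves it as an .htm file. The .htm file is then read back as plain text, so the output of this code is HTML markup. You can paste that markup into Excel to convert the PDF to Excel without losing the structure of the data.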
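If you would rather consume the table structure in code than paste it into Excel, the following is a rough sketch of pulling rows and cells out of the returned HTML. `HtmlTableReader` and `ReadRows` are hypothetical names, and the regex-based parsing is an assumption that only holds for simple, non-nested tables; a real HTML parser would be more robust.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTableReader
{
    private static readonly Regex RowPattern =
        new Regex(@"<tr\b[^>]*>(.*?)</tr>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex CellPattern =
        new Regex(@"<t[dh]\b[^>]*>(.*?)</t[dh]>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");
    private static readonly Regex WhitespacePattern = new Regex(@"\s+");

    // Returns one list of cell texts per <tr> found in the HTML.
    // Assumes tables are not nested inside each other.
    public static List<List<string>> ReadRows(string html)
    {
        var rows = new List<List<string>>();
        foreach (Match row in RowPattern.Matches(html))
        {
            var cells = new List<string>();
            foreach (Match cell in CellPattern.Matches(row.Groups[1].Value))
            {
                // Strip inner markup, decode entities such as &amp;, collapse whitespace.
                var text = TagPattern.Replace(cell.Groups[1].Value, " ");
                text = WebUtility.HtmlDecode(text);
                cells.Add(WhitespacePattern.Replace(text, " ").Trim());
            }
            if (cells.Count > 0) rows.Add(cells);
        }
        return rows;
    }
}
```

The input would be the string returned by `ConvertPdf` with its "success:" prefix removed. Each inner list then corresponds to one table row, which can be written out as CSV or into a spreadsheet.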

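Key notes:

1. This solution does not work if the PDF is a scanned copy. For such PDFs, as far as I know, OCR seems to be the only option.

2. For the parameter "path", pass the full path of the file. For the parameter "outpath", pass the path without an extension, e.g. C:\Users\username\folder\filename (no file extension such as ".htm" is needed).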
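As an illustration of note 2, here is a small sketch that derives an extension-less output path from the PDF's own location. `OutPathHelper` and `BuildOutPath` are hypothetical names introduced only for this example; placing the output next to the PDF is likewise an assumption, not something the code above requires.

```csharp
using System.IO;

public static class OutPathHelper
{
    // Builds the extension-less output path that ConvertPdf expects,
    // in the same folder as the PDF (hypothetical helper for illustration).
    public static string BuildOutPath(string pdfPath)
    {
        var fullPath = Path.GetFullPath(pdfPath);
        var folder = Path.GetDirectoryName(fullPath);
        var name = Path.GetFileNameWithoutExtension(fullPath);
        return Path.Combine(folder, name);
    }
}
```

A call would then look like `pdfTohtm.ConvertPdf(pdfPath, OutPathHelper.BuildOutPath(pdfPath))`.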
