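I'm a beginner in R and am having some trouble with the tm package. I need to extract specific data from pages 55 to 300 of a PDF, and thought R might be a good way to do it. (If anyone has a better idea, please let me know!) I did some searching and, after installing the tm and xpdf packages, I tried following this post and attempted zx8754's solution, but had no luck. I suspect it has something to do with the readPDF command, because I get the following: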
Error in readPDF(PdftotextOptions = "-layout") : unused argument (PdftotextOptions = "-layout")
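I thought this had to do with getting the tm package and xpdf to work together, so I read Tony Breyal's solution (I can't post more than 2 links), added pdfinfo and pdftotext to my environment variables (I'm on Win 8) and restarted. I'm sure I'm missing something. I now have pdftotext.exe in my R working directory. Can anyone help me configure this correctly, so that the tm package calls the xpdf executables properly and the readPDF function works as it should?
Again, I'm very new to this, so apologies if anything I've said is off base. I'd be very grateful for any help.
Thanks in advance,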
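Justin

To get you started, here is a complete example of a readPDF command for reading a PDF file. When I tried to retrieve the PDF file directly from the link you provided, readPDF threw an error, so I first downloaded the PDF file to my working directory.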
library(tm)
# File name
filename = "ea0607.pdf"
# Read the PDF file
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
                                                 language = "en",
                                                 id = "id1")
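With the xpdf engine that this call assumes, readPDF hands the conversion to the external pdftotext program, so it can only work if R is able to locate that executable. Below is a minimal sketch for checking this from within R, using only base functions. The helper name `find_tool` is made up for illustration and is not part of tm.

```r
# Hypothetical helper: report where (or whether) R can find an external tool.
find_tool <- function(tool) {
  path <- unname(Sys.which(tool))  # "" when the tool is not on the PATH
  if (nzchar(path)) path else NA_character_
}

find_tool("pdftotext")  # full path if found, NA otherwise
find_tool("pdfinfo")

# The PATH as seen by this R session, one entry per element
strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
```

If either lookup comes back as NA, the folder containing the xpdf executables is probably not on the PATH that the R session sees.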
The code above converts the PDF file to text and stores the result in doc. doc is actually a list, as the following code shows:
str(doc)
List of 2
 $ content: chr [1:23551] " STATE UNIVERSITY SYSTEM OF FLORIDA" "" "EXPENDITURE ANALYSIS" " 2006-2007" ...
 $ meta   :List of 7
  ..$ author       : chr "greg.jacques"
  ..$ datetimestamp: POSIXlt[1:1], format: "2007-12-10 11:33:48"
  ..$ description  : NULL
  ..$ heading      : chr " PGM=EASUSI-V01 STATE UNIVERSITY SYSTEM "| __truncated__
  ..$ id           : chr "ea0607.pdf"
  ..$ language     : chr "en"
  ..$ origin       : chr "Acrobat PDFMaker 8.1 for Word"
  ..- attr(*, "class")= chr "TextDocumentMeta"
 - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
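Because the object is an ordinary list underneath, its parts can be reached with `$`. The sketch below uses a small hand-built stand-in with the same two fields so that it runs without the PDF. The name `doc_demo` is made up, and its values are copied from the output above.

```r
# Stand-in with the same shape as the object returned by readPDF (illustrative only)
doc_demo <- list(
  content = c(" STATE UNIVERSITY SYSTEM OF FLORIDA", "", "EXPENDITURE ANALYSIS"),
  meta    = list(author = "greg.jacques", language = "en")
)

length(doc_demo$content)  # number of text lines
doc_demo$meta$author      # a single metadata field
names(doc_demo$meta)      # all metadata fields present in this stand-in
```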
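The text of the PDF file is stored in doc$content, while doc$meta contains various metadata about the PDF file. Each element of doc$content is one line of the PDF file. Here are lines 300 to 310 of the PDF file: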
doc$content[300:310]
[1] ""
[2] "and General (E&G) budget entity. The Expenditure Analysis continues to reflect special units separately and the"
[3] ""
[4] "traditional program components and related activities have been further defined to support the funding formula. The"
[5] ""
[6] "Expenditure Analysis format was revised in 1995-96 to include all activities in the funding formula as well as college"
[7] ""
[8] "detail by activity for the UF Health Science Center, the USF Health Science Center and the FSU Medical School. A"
[9] ""
[10] "definition of each follows:"
[11] ""
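To get from lines back to pages (for instance pages 55 to 300, as asked in the question), one option is to rely on the form feed character that pdftotext normally writes at each page break. This is only a sketch under that assumption: it supposes the form feeds survive in doc$content, which has not been checked against this file, and the helper name `lines_for_pages` is made up. It is demonstrated on a tiny made-up vector, not on the real document.

```r
# Hypothetical helper: keep only the lines that belong to pages first..last.
# Assumes each new page starts with a form feed ("\f") at the beginning of a line.
lines_for_pages <- function(lines, first, last) {
  # A line starting with "\f" opens a new page; page numbering starts at 1
  page <- 1L + cumsum(startsWith(lines, "\f"))
  keep <- page >= first & page <= last
  sub("^\f", "", lines[keep])  # drop the page-break marker itself
}

# Made-up three-page example
demo <- c("page one", "", "\fpage two", "still two", "\fpage three")
wanted <- lines_for_pages(demo, 2, 3)
wanted[nzchar(trimws(wanted))]  # also drop blank lines
```

On the real object the call would look like lines_for_pages(doc$content, 55, 300). pdftotext also has -f and -l options for the first and last page to convert, but whether they can be passed through control = list(text = ...) has not been verified here.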