How can I extract text under a specific heading from PDFs imported into R with the tm package?



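I am importing multiple PDFs into R with the tm package. From the content of each PDF I need the character vectors that contain the heading "CORPORATE INFORMATION". The problem is twofold. First, I cannot extract the vector with this heading. Second, the vector comes out in a very messy form, so I cannot link each person's name to the position they hold in the company, which is the kind of dataset I am trying to build. An example is shown below. Any help is welcome.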

vector_of_interest <- c("   CORPORATE INFORMATION\r\n   BOARD OF DIRECTORS                 REGISTERED OFFICE\r\n   Chuah Ah Bee                       Suite 12-02,12th Floor\r\n   Executive Chairman                 Menara Zurich\r\n   Chuah Hoon Phong                   170 Jalan Argyll, 10050 Penang\r\n   Group Managing Director            Telephone Number : 04-2296 318\r\n   Chan Kim Keow                      Facsimile Number : 04-2282 118\r\n   Executive Director\r\n   Loo Choo Gee\r\n   Executive Director                 COMPANY SECRETARIES\r\n   Chew Chee Khong\r\n   Executive Director                 Gunn Chit Geok\r\n   Ng Seng Bee                        (MAICSA 0673097)\r\n   Independent Non-Executive Director Chew Siew Cheng\r\n   Haji Ahmad Fazil Bin Haji Hashim   (MAICSA 7019191)\r\n   Independent Non-Executive Director\r\n   Goh Choon Aik\r\n   Independent Non-Executive Director SHARE REGISTRAR\r\n                                      Tricor Investor Services Sdn Bhd\r\n   AUDIT COMMITTEE                    Level 17, The Gardens North Tower\r\n                                      Mid Valley City\r\n   Ng Seng Bee                        Lingkaran Syed Putra\r\n   Chairman                           59200 Kuala Lumpur\r\n   Haji Ahmad Fazil Bin Haji Hashim   Telephone Number : 03-2264 3883\r\n   Member                             Facsimile Number : 03-2282 1886\r\n   Goh Choon Aik\r\n   Member\r\n                                      STOCK EXCHANGE LISTING\r\n   REMUNERATION COMMITTEE             Main Market of Bursa Malaysia Securities Berhad\r\n                                      Stock Code : 7174\r\n   Haji Ahmad Fazil Bin Haji Hashim   Stock Name : CAB\r\n   Chairman\r\n   Chuah Ah Bee\r\n   Member                             AUDITORS\r\n   Ng Seng Bee\r\n   Member                             Deloitte KassimChan\r\n                                      Chartered Accountants\r\n                                      4th Floor, Wisma Wang\r\n   NOMINATION COMMITTEE               251-A Jalan Burma\r\n                                      10350 Penang\r\n   Haji Ahmad Fazil Bin Haji Hashim\r\n   Chairman\r\n   Ng Seng Bee                        PRINCIPAL BANKERS\r\n   Member\r\n   Goh Choon Aik                      Malayan Banking Berhad\r\n   Member                             Hong Leong Bank Berhad\r\n                                      United Overseas Bank (Malaysia) Berhad\r\n10 CAB Annual Report 2012\r\n")
# My attempt
library(tm)
library(tidyverse)
library(stringr)
# "-layout" keeps the original page layout as far as possible.
# I also tried adding engine = "xpdf" before control.
Rpdf <- readPDF(control = list(text = "-layout"))
# Load the documents; cname is the path to the folder holding the PDFs
docs <- Corpus(DirSource(cname), readerControl = list(reader = Rpdf))
document <- content(docs[[1]])
corporate.info <- unlist(str_extract_all(document, "CORPORATE INFORMATION.+"))

The PDF can be found at this link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/4372609 The information is on page 10.
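One possible reason the attempt returns nothing, assuming each page arrives as a single string whose lines are separated by carriage return plus line feed: in stringr's default regex mode, "." does not match line terminators, so ".+" has nothing to consume after the heading. The sketch below uses a shortened stand-in for one page (the variable names are illustrative) and compares the default behaviour with `regex(..., dotall = TRUE)`.

```r
library(stringr)

# Shortened stand-in for one page of content(docs[[1]]), based on the example above
page <- "   CORPORATE INFORMATION\r\n   BOARD OF DIRECTORS\r\n   Chuah Ah Bee\r\n"

# Default mode: "." stops at line terminators, so ".+" cannot get past the heading
default_match <- str_extract(page, "CORPORATE INFORMATION.+")

# dotall = TRUE lets "." cross line breaks, so the rest of the page is captured
dotall_match <- str_extract(page, regex("CORPORATE INFORMATION.+", dotall = TRUE))
```

If the real content is instead one element per line, the same pattern would only ever return the remainder of the heading line, which is why collapsing or splitting by page is needed first.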

I found a solution:

First, I changed the default readPDF engine to xpdf:

# "-layout" keeps the original page layout as far as possible
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))
# Load the documents; cname is the path to the files
docs <- Corpus(DirSource(cname), readerControl = list(reader = Rpdf))

Second, I collapsed the text so that each document becomes a single character string:

document <- content(docs[[1]])
document <- paste(document, collapse = " ")
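As a small illustration of what the collapse does, here is a sketch on a toy vector. It assumes the extracted text keeps a form feed ("\f") at each page break, which is what the page-matching pattern in the next step relies on. The names `lines`, `pages`, and `page_of_interest` are made up for this example, and splitting on the form feed is shown as an alternative way to pick out a page, not as the method used in this answer.

```r
library(stringr)

# Toy stand-in for content(docs[[1]]): one element per line, "\f" starts a new page
lines <- c("Page one text", "\fCORPORATE INFORMATION", "Chuah Ah Bee", "\fPage three text")

# Collapse into a single string; the form feeds survive as page markers
document <- paste(lines, collapse = " ")

# Split on the literal form feed and keep the page that carries the heading
pages <- str_split(document, fixed("\f"))[[1]]
page_of_interest <- pages[str_detect(pages, "CORPORATE\\s+INFORMATION")]
```

Splitting first keeps each later regular expression confined to a single page.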

Third, I extracted the page containing the information I was looking for and used regular expressions to pull out the names:

corporate.info <- unlist(str_extract_all(document, "\f+.+CORPORATE+.+INFORMATION+.+\f"))
# "\f" (form feed) marks the beginning and end of a page
# ".+CORPORATE+.+INFORMATION+.+" matches the page with the heading I was interested in
# Extract three-word names; the backslash in \\s must be doubled inside an R string
corporate.info <- unlist(str_extract_all(corporate.info, "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}"))
corporate.info <- unique(corporate.info) # drop duplicates
# Blank out bank names; similar rules remove other matches that are not names
corporate.info <- str_replace_all(corporate.info, ".*Bank.*", "")
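The steps above return names only. To link each name to a position, as the question asks, one possible approach is to read the left-hand column of the "-layout" text and pair every role line with the line just before it. The sketch below rebuilds a few lines of the example page. It assumes the left column occupies the first 38 characters (read off the example, not guaranteed for other PDFs) and that roles can be recognised by the keywords Chairman, Director, or Member. All variable names are illustrative.

```r
library(stringr)

# A few rows of the example page, rebuilt with the same two-column layout:
# 3 leading spaces, left column padded to 35 characters, then the right column
left  <- c("BOARD OF DIRECTORS", "Chuah Ah Bee", "Executive Chairman",
           "Chuah Hoon Phong", "Group Managing Director",
           "Chan Kim Keow", "Executive Director")
right <- c("REGISTERED OFFICE", "Suite 12-02,12th Floor", "Menara Zurich",
           "170 Jalan Argyll, 10050 Penang", "Telephone Number : 04-2296 318",
           "Facsimile Number : 04-2282 118", "")
page <- paste0("   ", str_pad(left, width = 35, side = "right"), right,
               collapse = "\r\n")

# Split into lines and keep only the left column (assumed width: 38 characters)
page_lines <- str_split(page, fixed("\r\n"))[[1]]
left_col <- str_trim(str_sub(page_lines, 1, 38))
left_col <- left_col[left_col != ""]

# A role line contains one of the assumed keywords (case-sensitive, so the
# upper-case section heading "BOARD OF DIRECTORS" is not picked up)
role_idx <- which(str_detect(left_col, "Chairman|Director|Member"))
role_idx <- role_idx[role_idx > 1]  # a role needs a preceding name line

# Pair each role with the line immediately before it
directors <- data.frame(name = left_col[role_idx - 1],
                        position = left_col[role_idx],
                        stringsAsFactors = FALSE)
```

On the full page the committee sections repeat the same people with the roles Chairman and Member, so a real pipeline would also need to track which section heading each pair falls under.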
