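I am importing several PDFs into R with the tm package. From their content I need the character vectors that contain the heading "CORPORATE INFORMATION". The problem is twofold. First, I cannot extract the vector with this heading. Second, the vector comes out very messy, so I cannot link each person's name to the position they hold in the company, which is the kind of dataset I am trying to build. I show an example below. Any help is welcome.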
vector_of_interest <- c(" CORPORATE INFORMATION\r\n BOARD OF DIRECTORS REGISTERED OFFICE\r\n Chuah Ah Bee Suite 12-02,12th Floor\r\n Executive Chairman Menara Zurich\r\n Chuah Hoon Phong 170 Jalan Argyll, 10050 Penang\r\n Group Managing Director Telephone Number : 04-2296 318\r\n Chan Kim Keow Facsimile Number : 04-2282 118\r\n Executive Director\r\n Loo Choo Gee\r\n Executive Director COMPANY SECRETARIES\r\n Chew Chee Khong\r\n Executive Director Gunn Chit Geok\r\n Ng Seng Bee (MAICSA 0673097)\r\n Independent Non-Executive Director Chew Siew Cheng\r\n Haji Ahmad Fazil Bin Haji Hashim (MAICSA 7019191)\r\n Independent Non-Executive Director\r\n Goh Choon Aik\r\n Independent Non-Executive Director SHARE REGISTRAR\r\n Tricor Investor Services Sdn Bhd\r\n AUDIT COMMITTEE Level 17, The Gardens North Tower\r\n Mid Valley City\r\n Ng Seng Bee Lingkaran Syed Putra\r\n Chairman 59200 Kuala Lumpur\r\n Haji Ahmad Fazil Bin Haji Hashim Telephone Number : 03-2264 3883\r\n Member Facsimile Number : 03-2282 1886\r\n Goh Choon Aik\r\n Member\r\n STOCK EXCHANGE LISTING\r\n REMUNERATION COMMITTEE Main Market of Bursa Malaysia Securities Berhad\r\n Stock Code : 7174\r\n Haji Ahmad Fazil Bin Haji Hashim Stock Name : CAB\r\n Chairman\r\n Chuah Ah Bee\r\n Member AUDITORS\r\n Ng Seng Bee\r\n Member Deloitte KassimChan\r\n Chartered Accountants\r\n 4th Floor, Wisma Wang\r\n NOMINATION COMMITTEE 251-A Jalan Burma\r\n 10350 Penang\r\n Haji Ahmad Fazil Bin Haji Hashim\r\n Chairman\r\n Ng Seng Bee PRINCIPAL BANKERS\r\n Member\r\n Goh Choon Aik Malayan Banking Berhad\r\n Member Hong Leong Bank Berhad\r\n United Overseas Bank (Malaysia) Berhad\r\n10 CAB Annual Report 2012\r\n")
#my attempt
library(tm)
library(tidyverse)
library(stringr)
# "-layout" keeps the original page layout as much as possible; I have also tried adding engine = "xpdf" before control
Rpdf <- readPDF(control = list(text = "-layout"))
# load the documents; cname is the path to the files
docs <- Corpus(DirSource(cname), readerControl = list(reader = Rpdf))
document <- content(docs[[1]])
corporate.info <- unlist(str_extract_all(document, "CORPORATE INFORMATION.+"))
The PDF can be found at this link: http://www.bursamalaysia.com/market/listed-companies/company-announcements/4372609 The information is on page 10.
I found a solution:
First, I changed the default readPDF engine to xpdf:
Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout"))
# layout control in order to keep the original format as much as possible
docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf))
# load the documents; cname is the path to the files
Second, I collapsed the text so that each document is a single string (a character vector of length one):
document <- content(docs[[1]])
document <- paste(document, collapse = " ") # paste() with collapse already returns one string, so unlist() is not needed
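To see what this collapse step does without a PDF at hand, here is a minimal sketch in base R. The `toy_lines` vector is made up for illustration and only mimics what `content(docs[[1]])` might return, assuming the text comes back one line per element with a form feed ("\f") at each page break.

```r
# Hypothetical stand-in for content(docs[[1]]): one element per line of text.
# Assumption: a form feed ("\f") marks each page break.
toy_lines <- c(
  "Cover page",
  "\fCORPORATE INFORMATION",
  "Chuah Ah Bee",
  "Executive Chairman",
  "\fNotes to the accounts"
)

# Collapse the lines into one string for the whole document.
toy_document <- paste(toy_lines, collapse = " ")

# Split on the form feed to get one string per page,
# then keep the page(s) that contain the heading.
toy_pages <- strsplit(toy_document, "\f", fixed = TRUE)[[1]]
info_page <- toy_pages[grepl("CORPORATE INFORMATION", toy_pages, fixed = TRUE)]

print(length(toy_document))
print(info_page)
```

Splitting on the form feed is only one way to isolate a page; matching the whole page with a single regular expression should work as well, since in both cases the heading and the names end up in the same string.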
Third, I extracted the page containing the information I was looking for and used a regular expression to pull out the names:
corporate.info <- unlist(str_extract_all(document, "\f+.+CORPORATE+.+INFORMATION+.+\f"))
### "\f" (form feed) marks the beginning and end of a page
### ".+CORPORATE+.+INFORMATION+.+" matches the page with the heading I was interested in
corporate.info <- unlist(str_extract_all(corporate.info, "[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}\\s[A-Z]+[a-z]{1,8}")) # extract three-word names; the backslash in \\s must be doubled inside an R string
corporate.info <- unique(corporate.info) # drop duplicate names
corporate.info <- str_replace_all(corporate.info, ".*Bank.*", "") # blank out bank names; similar patterns clean up other false matches
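The regular expression above returns names only. For the original goal of linking each name to a position, one possible approach is sketched below in base R. It assumes, which may not hold for every page, that the "-layout" text keeps at least two spaces between the left and right columns and that names and positions alternate in the left column. `toy_page` is a shortened, hand-spaced excerpt of the page, not real pdftotext output, and the variable names are only illustrative.

```r
# Hypothetical, hand-spaced excerpt of the page.
# Assumption: "-layout" leaves a run of 2+ spaces between the two columns.
toy_page <- paste(
  " BOARD OF DIRECTORS                 REGISTERED OFFICE",
  " Chuah Ah Bee                       Suite 12-02, 12th Floor",
  " Executive Chairman                 Menara Zurich",
  " Chuah Hoon Phong                   170 Jalan Argyll, 10050 Penang",
  " Group Managing Director            Telephone Number : 04-2296 318",
  sep = "\r\n"
)

# One element per line of the page.
page_lines <- strsplit(toy_page, "\r\n", fixed = TRUE)[[1]]

# Keep only the left column: drop everything from the first run of 2+ spaces.
left_col <- sub(" {2,}.*$", "", trimws(page_lines))

# Drop all-caps section headings such as "BOARD OF DIRECTORS".
entries <- left_col[!grepl("^[A-Z ]+$", left_col)]

# Assumption: entries alternate name, position, name, position, ...
board <- data.frame(
  person   = entries[c(TRUE, FALSE)],
  position = entries[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
print(board)
```

On the real page this would need more care: lines that only have text in the right column (for example "Stock Code : 7174") would have to be filtered out first, perhaps by the amount of leading whitespace, and committee sections list the same people again with a different role.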