PDF reader that reads a folder of PDF files and returns a numpy array of keyword counts for each file read



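I have a script that finds and reads a single PDF file and returns the count of a specific word in that PDF. I would like to extend this script so that it reads all PDF files in a specific folder and creates a table (a numpy array) with the PDF names as rows and the specific words as columns, where each cell holds the count of that word in the corresponding PDF file.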

Here is the code that lets me count a specific word in a single file:

import PyPDF2
import textract
import nltk
nltk.download('punkt')
nltk.download('stopwords')  # needed for stopwords.words('english')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np

# Path to a single PDF. Some kind of for loop and a list to go through all files in the folder?
filename = r'path'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
# Collect the text of every page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
pdfFileObj.close()
# If no text could be extracted (e.g. a scanned PDF), fall back to OCR
if text == "":
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
tokens = word_tokenize(text)
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = stopwords.words('english')
# Drop stop words and punctuation
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
# Count the specific word. I want to count many words and place them into an array matched to the particular PDF, as explained in the first part.
keywords.count('metaphysics')
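
The single call to keywords.count('metaphysics') can be extended to several words at once. What follows is only a minimal sketch: the word list TARGET_WORDS and the helper count_targets are hypothetical names (only 'metaphysics' comes from the post), and lowercasing the tokens before counting is an assumption about what is wanted.

```python
from collections import Counter

# Hypothetical list of words to track; only 'metaphysics' appears in the post.
TARGET_WORDS = ['metaphysics', 'ethics', 'logic']


def count_targets(keywords, targets=TARGET_WORDS):
    """Return the count of each target word, in the same order as targets."""
    # Assumption: matching should ignore case, so tokens are lowercased first.
    counts = Counter(word.lower() for word in keywords)
    # A Counter returns 0 for words it has never seen.
    return [counts[target] for target in targets]


sample = ['Metaphysics', 'studies', 'being', 'metaphysics', 'logic']
row = count_targets(sample)
print(row)
```

Each call yields one row of counts, so one row per PDF could later be stacked into the table described at the top of the post.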

I am doing this as a hobby, and it is one of the most complex things I have tried to create.

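import os
import numpy as np

arr = np.array([], dtype=str)  # string dtype, since keywords are words
path = 'C:/Users/Kevin/Documents/'
for entry in os.listdir(path):
    if os.path.isfile(os.path.join(path, entry)):
        if entry.lower().endswith('.pdf'):
            # Use the full path so the file can be opened from any working directory
            filename = os.path.join(path, entry)
            # followed by your code
            # ....
            keywords = np.array(keywords)
            arr = np.concatenate((arr, keywords))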
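
Concatenating the keywords gives one long flat array, not the PDF-by-word table the question asks for. One possible way to get that table is sketched below. The names build_count_table and extract_tokens are hypothetical: extract_tokens stands in for whatever per-file extraction is used (for example the PyPDF2 and nltk steps from the question) and is assumed to take a file path and return a list of word tokens.

```python
import os
from collections import Counter

import numpy as np


def build_count_table(folder, targets, extract_tokens):
    """Return (pdf_names, table), where table[i, j] is the count of
    targets[j] in the file pdf_names[i].

    extract_tokens is a hypothetical callable supplied by the caller:
    it takes a file path and returns a list of word tokens.
    """
    # Keep only regular files whose name ends in .pdf (case-insensitive).
    pdf_names = sorted(
        entry for entry in os.listdir(folder)
        if entry.lower().endswith('.pdf')
        and os.path.isfile(os.path.join(folder, entry))
    )
    # One row per PDF, one column per target word.
    table = np.zeros((len(pdf_names), len(targets)), dtype=int)
    for i, name in enumerate(pdf_names):
        counts = Counter(extract_tokens(os.path.join(folder, name)))
        table[i] = [counts[word] for word in targets]
    return pdf_names, table
```

It could be called as pdf_names, table = build_count_table(path, ['metaphysics'], my_extractor), where my_extractor wraps the single-file code. The row labels are returned separately because a plain integer numpy array cannot hold the file names itself.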
