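How can I convert a PDF file to DOCX? Is there a way to do this with Python?
I have seen some pages that let users upload a PDF and get a DOC file back, such as PdfToWord.
Thanks in advance.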
If you have LibreOffice installed:
lowriter --invisible --convert-to doc '/your/file.pdf'
If you want to use Python:
    import os
    import subprocess

    # Walk the folder tree and convert every PDF found with LibreOffice Writer.
    for top, dirs, files in os.walk('/my/pdf/folder'):
        for filename in files:
            if filename.endswith('.pdf'):
                abspath = os.path.join(top, filename)
                # An argument list avoids shell quoting problems with unusual file names.
                subprocess.call(['lowriter', '--invisible', '--convert-to', 'doc', abspath])
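This is difficult because PDF is presentation-oriented, while a Word document is content-oriented. I have tested both of the following projects and can recommend them.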
- PyPDF2
- PDFMiner
However, you will certainly lose the presentation aspects (layout and formatting) in the conversion.
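To make that trade-off concrete, here is a minimal sketch that pulls the plain text out of a PDF with PDFMiner and writes it into a .docx as bare paragraphs. It assumes the pdfminer.six and python-docx packages are installed; python-docx is not mentioned in this answer and is an addition for illustration, and the names split_paragraphs and pdf_text_to_docx are made up for this example. Layout, fonts, tables, and images are not carried over.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    # Collapse line breaks (and page-break form feeds) left inside a block.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def pdf_text_to_docx(pdf_path, docx_path):
    """Copy the text of a PDF into a new .docx, one paragraph per block."""
    # Imported lazily so split_paragraphs works without these packages.
    from pdfminer.high_level import extract_text  # from pdfminer.six
    from docx import Document  # from python-docx

    document = Document()
    for paragraph in split_paragraphs(extract_text(pdf_path)):
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

Calling pdf_text_to_docx('in.pdf', 'out.docx') would then produce a text-only document, which is usually enough for searching or editing the content but not for reproducing the page.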
If you want to convert PDF -> an MS Word file type such as docx, I came across this.
Ahsin Shabbir wrote:
    import glob
    import win32com.client
    import os

    word = win32com.client.Dispatch("Word.Application")
    word.visible = 0

    pdfs_path = ""  # folder where the .pdf files are stored
    reqs_path = ""  # output folder; not defined in the original snippet, set it yourself

    for i, doc in enumerate(glob.iglob(pdfs_path + "*.pdf")):
        print(doc)
        filename = doc.split('\\')[-1]
        in_file = os.path.abspath(doc)
        print(in_file)
        wb = word.Documents.Open(in_file)
        # Drop the ".pdf" suffix and save under the same name as .docx
        out_file = os.path.abspath(reqs_path + filename[0:-4] + ".docx")
        print("outfile\n", out_file)
        wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
        print("success...")
        wb.Close()

    word.Quit()
This worked like a charm for me; it converted a 500-page PDF complete with formatting and images.
You can use the GroupDocs.Conversion Cloud SDK for Python without installing any third-party tools or software.
Sample Python code:
    # Import module
    import groupdocs_conversion_cloud

    # Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
    app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
    app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

    # Create instances of the API
    convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
    file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

    try:
        # Upload source file to storage
        filename = 'Sample.pdf'
        remote_name = 'Sample.pdf'
        output_name = 'sample.docx'
        strformat = 'docx'

        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
        response_upload = file_api.upload_file(request_upload)

        # Convert PDF to Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path = remote_name
        settings.format = strformat
        settings.output_path = output_name

        loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
        loadOptions.hide_pdf_annotations = True
        loadOptions.remove_embedded_files = False
        loadOptions.flatten_all_fields = True
        settings.load_options = loadOptions

        # Convert only the first page
        convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
        convertOptions.from_page = 1
        convertOptions.pages_count = 1
        settings.convert_options = convertOptions

        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)
        print("Document converted successfully: " + str(response))
    except groupdocs_conversion_cloud.ApiException as e:
        print("Exception when calling convert_document: {0}".format(e.message))
I am a developer evangelist at Aspose.
Based on the previous answers, this is the solution that worked best for me with Python 3.7.1:
    import win32com.client
    import os

    # INPUT/OUTPUT PATH
    pdf_path = r"""C:\path2pdf.pdf"""
    output_path = r"""C:\output_folder"""

    word = win32com.client.Dispatch("Word.Application")
    word.visible = 0  # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD

    # GET FILE NAME AND NORMALIZED PATH
    filename = pdf_path.split('\\')[-1]
    in_file = os.path.abspath(pdf_path)

    # CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath(output_path + '\\' + filename[0:-4] + ".docx")
    wb.SaveAs2(out_file, FileFormat=16)
    wb.Close()
    word.Quit()
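The Word-based snippets in this thread derive the output name by splitting the path on a backslash and slicing off the last four characters. As a slightly more robust alternative, the standard library's path helpers can do the same job. This is only a sketch: docx_path_for is a hypothetical name, and ntpath is used so that the Windows-style example behaves the same on any platform (on Windows, os.path is the same module).

```python
import ntpath  # Windows flavour of os.path, importable on every platform


def docx_path_for(pdf_path, output_dir):
    """Return the .docx path in output_dir that matches a Windows-style PDF path."""
    base_name = ntpath.basename(pdf_path)           # e.g. "path2pdf.pdf"
    stem, _extension = ntpath.splitext(base_name)   # e.g. "path2pdf"
    return ntpath.join(output_dir, stem + ".docx")


print(docx_path_for(r"C:\path2pdf.pdf", r"C:\output_folder"))
# prints C:\output_folder\path2pdf.docx
```

Unlike the fixed [0:-4] slice, splitext also copes with extensions that are not exactly four characters long.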
Using Adobe on your machine
If you have Adobe Acrobat on your machine, you can use the following function to save a PDF file as a docx file.
    # Open PDF file, use Acrobat Exchange to save file as .docx file.
    import win32com.client, win32com.client.makepy, os, winerror
    from win32com.client.dynamic import ERRORS_BAD_CONTEXT

    def PDF_to_Word(input_file, output_file):
        ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
        src = os.path.abspath(input_file)

        # Launch Adobe
        win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
        adobe = win32com.client.DispatchEx('AcroExch.App')
        avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')

        # Open file
        avDoc.Open(src, src)
        pdDoc = avDoc.GetPDDoc()
        jObject = pdDoc.GetJSObject()

        # Save as word document
        jObject.SaveAs(output_file, "com.adobe.acrobat.docx")
        avDoc.Close(-1)
Note that input_file and output_file need to be full paths like these:
- D:\OneDrive...\file.pdf
- D:\OneDrive...\dafad.docx
For Linux users with LibreOffice installed, try:
soffice --invisible --convert-to doc file_name.pdf
If you get an error like "Error: no export filter found, aborting", try this instead:
soffice --infilter="writer_pdf_import" --convert-to doc file_name.pdf
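The two commands above can be wrapped in Python so that the import filter is only added when the plain call does not produce a file. This is a sketch under a few assumptions: build_soffice_command and convert_with_fallback are hypothetical names, --outdir is an extra LibreOffice option that the commands above do not use, and soffice is expected to be on the PATH. I would not rely on the exit status alone to detect a failed conversion, so the sketch checks for the output file instead.

```python
import os
import subprocess


def build_soffice_command(pdf_path, out_dir, target="doc", use_pdf_filter=False):
    """Build the argument list for a soffice conversion call."""
    command = ["soffice", "--invisible"]
    if use_pdf_filter:
        # Ask LibreOffice to open the PDF with the Writer import filter.
        command.append("--infilter=writer_pdf_import")
    command += ["--convert-to", target, "--outdir", out_dir, pdf_path]
    return command


def convert_with_fallback(pdf_path, out_dir, target="doc"):
    """Try the plain conversion first, then retry with the PDF import filter."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    expected = os.path.join(out_dir, stem + "." + target)
    for use_pdf_filter in (False, True):
        subprocess.call(build_soffice_command(pdf_path, out_dir, target, use_pdf_filter))
        if os.path.exists(expected):
            return expected
    raise RuntimeError("LibreOffice did not produce " + expected)
```

For example, convert_with_fallback('file_name.pdf', '.') would run the first command and fall back to the second one only if no file_name.doc appears.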