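I am trying to export PDF > DOCX using Adobe's REST API: https://documentcloud.adobe.com/document-services/index.html#post-exportPDF
The problem I am facing is that I cannot save the result correctly on my machine (the file ends up corrupted). I found another thread with a similar goal, but its solution did not work for me. Here are the relevant parts of my script: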
url = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%7B%22reltype%22%3A%20%22http%3A%2F%2Fns.adobe.com%2Frel%2Fprimary%22%7D"
payload = {}
payload['contentAnalyzerRequests'] = json.dumps(
    {
        "cpf:engine": {
            "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
        },
        "cpf:inputs": {
            "params": {
                "cpf:inline": {
                    "targetFormat": "docx"
                }
            },
            "documentIn": {
                "dc:format": "application/pdf",
                "cpf:location": "InputFile"
            }
        },
        "cpf:outputs": {
            "documentOut": {
                "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                "cpf:location": docx_filename,
            }
        }
    }
)
myfile = {'InputFile': open(filename, 'rb')}
response = requests.request("POST", url, headers=headers, data=payload, files=myfile)
location = response.headers['location']
# ...
# polling here to make sure export is complete
# ...
if response.status_code == 200:
    print('Export complete, saving file locally.')
    write_to_file(docx_filename, response)

def write_to_file(filename, response):
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(1024 * 1024):
            f.write(chunk)
I think the problem (or at least a clue to the solution) is the following text at the beginning of the response:
--Boundary_357737_1222103332_1635257304781
Content-Type: application/json
Content-Disposition: form-data; name="contentAnalyzerResponse"
{"cpf:inputs":{"params":{"cpf:inline":{"targetFormat":"docx"}},"documentIn":{"dc:format":"application/pdf","cpf:location":"InputFile"}},"cpf:engine":{"repo:assetId":"urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"},"cpf:status":{"completed":true,"type":"","status":200},"cpf:outputs":{"documentOut":{"cpf:location":"output/pdf_test.docx","dc:format":"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}}}
--Boundary_357737_1222103332_1635257304781
Content-Type: application/octet-stream
Content-Disposition: form-data; name="output/pdf_test.docx"
... actual byte content starts here...
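Why is this being sent? Am I writing the contents to the file incorrectly (I also tried f.write(response.content), with the same result)? Should I be sending a different request to Adobe?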
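That extra text is there so that the server can send multiple files at once; see https://stackoverflow.com/a/20321259. Basically, the response you are getting contains two files: a JSON file named contentAnalyzerResponse and a Word document named output/pdf_test.docx.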
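As a rough illustration of splitting such a response into its parts, here is a minimal sketch that uses only the standard library's email parser. The helper name split_multipart is made up for this example, and it assumes the whole body fits in memory and that the response's Content-Type header carries the boundary. I have not run it against the Adobe service itself, so treat it as a starting point rather than a verified fix.

```python
from email import message_from_bytes


def split_multipart(content_type, body):
    """Return {part name: payload bytes} for a multipart/form-data body.

    Hypothetical helper: content_type is the value of the response's
    Content-Type header, body is the raw response bytes.
    """
    # The email parser needs the Content-Type header (it carries the
    # boundary), so prepend it to the raw body before parsing.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)

    parts = {}
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the outer container itself
        # The part name comes from: Content-Disposition: form-data; name="..."
        name = part.get_param("name", header="content-disposition")
        parts[name] = part.get_payload(decode=True)
    return parts


# Assumed usage with the final (polled) requests response:
# parts = split_multipart(response.headers["Content-Type"], response.content)
# with open(docx_filename, "wb") as f:
#     f.write(parts["output/pdf_test.docx"])
```

With the response shown in the question, the returned dictionary should have the keys contentAnalyzerResponse and output/pdf_test.docx, and only the bytes under the second key belong in the .docx file.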
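You can probably parse the files with parse_form_data from werkzeug.formparser, as shown here. I have done that successfully before, but I am not sure how to make it work with multiple files.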
As for your question about stripping the content: given what I said above, yes, stripping it the way you are doing is perfectly fine.
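Note: I would recommend opening the file in a text editor and checking the very end to make sure there is no extra --Boundary... text left over, which you would also want to strip.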
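If you would rather not eyeball the file, the same check can be sketched in a few lines. This relies on a .docx being a ZIP archive, so a cleanly stripped file should begin with the ZIP local file header signature PK\x03\x04. The function name check_saved_docx, the default marker --Boundary (taken from the response shown in the question), and the 200-byte tail window are all assumptions made for this example.

```python
ZIP_SIGNATURE = b"PK\x03\x04"  # local file header that starts a non-empty ZIP


def check_saved_docx(data, boundary_marker=b"--Boundary"):
    """Report multipart leftovers in bytes that should be a bare .docx.

    Hypothetical helper; boundary_marker is an assumed prefix based on the
    boundary lines seen in the question's response.
    """
    return {
        # False means something (e.g. part headers) precedes the document.
        "starts_with_zip_signature": data.startswith(ZIP_SIGNATURE),
        # True means a closing boundary line is probably still at the end.
        "boundary_in_tail": boundary_marker in data[-200:],
    }


# Assumed usage on the file written earlier:
# with open(docx_filename, "rb") as f:
#     print(check_saved_docx(f.read()))
```

A clean file should report starts_with_zip_signature as True and boundary_in_tail as False; anything else suggests there is still multipart text to strip.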