How to get the position of specified text using Xpdf or MuPDF

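I want to extract some specified text, together with its position, from a PDF file.

I know that Xpdf and MuPDF can parse PDF files, so I think they can help me with this task.

But how do I use these two libraries to get the text position?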

If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):

import json                     # needed for json.loads below
import fitz                     # the PyMuPDF module

doc = fitz.open("input.pdf")    # PDF input file
n = 0                           # page number to read (0-based)
page = doc[n]
# List of all words on the page, each with its position info
# (a rectangle containing the word).
wordlist = page.getTextWords()
# Or, if you are only interested in blocks of lines that belong together:
blocklist = page.getTextBlocks()
# If you need more detail, use the JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.getText("json"))
# Note: newer PyMuPDF releases spell these page.get_text("words"),
# page.get_text("blocks") and page.get_text("json").

If you are interested, we are on GitHub.
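
To go from a full word list to the position of one specific piece of text, the list can be filtered by its text field. The following is only a minimal sketch, not part of PyMuPDF: it assumes each word entry is a tuple shaped (x0, y0, x1, y1, text, block_no, line_no, word_no), which is how PyMuPDF documents its word output, and the helper name find_phrase and the sample data are made up for illustration. Matching is an exact, case-sensitive comparison of consecutive words.

```python
def find_phrase(words, phrase):
    """Return one bounding box (x0, y0, x1, y1) per occurrence of phrase.

    words: tuples shaped (x0, y0, x1, y1, text, block_no, line_no, word_no).
    This tuple layout is an assumption about the word list being passed in.
    """
    targets = phrase.split()
    if not targets:
        return []
    hits = []
    # Slide a window of len(targets) consecutive words over the list.
    for i in range(len(words) - len(targets) + 1):
        window = words[i:i + len(targets)]
        if [w[4] for w in window] == targets:
            # Merge the word rectangles into one enclosing rectangle.
            hits.append((
                min(w[0] for w in window),
                min(w[1] for w in window),
                max(w[2] for w in window),
                max(w[3] for w in window),
            ))
    return hits


# Made-up sample data standing in for the word list of a real page.
sample_words = [
    (72.0, 72.0, 100.0, 84.0, "Total", 0, 0, 0),
    (104.0, 72.0, 140.0, 84.0, "amount", 0, 0, 1),
    (144.0, 72.0, 170.0, 84.0, "due", 0, 0, 2),
    (72.0, 90.0, 100.0, 102.0, "Total", 0, 1, 0),
]

print(find_phrase(sample_words, "Total amount"))
```

With a real page, the word list from the snippet above would be passed in place of sample_words. Note that this sketch does not check whether the matched words sit on the same line, so a phrase that wraps across lines yields one rectangle spanning both.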

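MuPDF ships with several tools, one of which is pdfdraw.

If you run pdfdraw with the -tt option, it generates XML containing every character along with its exact position information.
From there you should be able to find what you need.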
