Why does my PDF contain invisible characters, and how can I filter them out with PDFBox?

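I use PDFBox to extract text from documents by extending PDFTextStripper. I have noticed that some of these documents contain invisible characters that are being extracted. I would like to filter these invisible characters out.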

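I have seen that there are already some Stack Overflow posts about this, for example:

PDFBox - Removing invisible text (by clip/filling paths issue)
Remove invisible text from pdf using pdfbox

I tried subclassing the PDFVisibleTextStripper class found here: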

https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/main/java/mkl/testarea/pdfbox2/extract/PDFVisibleTextStripper.java

However, I found that it filters out text that is actually visible. I used it in place of PDFTextStripper:

package com.example.foo;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

public class ExtractChars extends PDFVisibleTextStripper {
    Processor processor;

    public static void extract(PDDocument document, Processor processor) throws IOException {
        ExtractChars instance = new ExtractChars();
        instance.processor = processor;
        instance.setSortByPosition(true);
        instance.setStartPage(0);
        instance.setEndPage(document.getNumberOfPages());
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        Writer streamWriter = new OutputStreamWriter(stream);
        instance.writeText(document, streamWriter);
    }

    ExtractChars() throws IOException {}

    protected void writeString(String _string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            float height = text.getHeightDir();
            String character = text.getUnicode();
            int pageIndex = getCurrentPageNo() - 1;
            float left = text.getXDirAdj();
            float right = left + text.getWidthDirAdj();
            float bottom = text.getYDirAdj();
            float top = bottom - height;
            BoundingBox box = new BoundingBox(pageIndex, left, right, top, bottom);
            this.processor.process(character, box);
        }
    }

    public interface Processor {
        void process(String character, BoundingBox box);
    }
}

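I don't know whether something in my subclass needs to change for it to work correctly. If it helps, I can provide a PDF that shows this behavior, although it contains sensitive content that I would need to remove first.

Instead, I have created a minimal example (below) that shows the "invisible text" behavior I am seeing. The bulleted list ends with an item '24. a.', which can be highlighted and copy-pasted out in a PDF viewer such as macOS Preview.

This "a." is being extracted by PDFTextStripper, and I would like it not to be. I don't really understand why this happens. I suspect it has something to do with clipping, but I would be very grateful if someone could explain what is going on.

My end goal is to filter these characters out, so if you have a suggestion for the simplest way to handle this particular case, I would appreciate it. I don't think I need all of the general-purpose machinery in PDFVisibleTextStripper.

Thanks very much!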

%PDF-1.3
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [3 0 R]
/Count 1
/MediaBox [0 0 612 792]
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources 4 0 R
/Contents 6 0 R
/MediaBox [0 0 612 792]
>>
endobj
4 0 obj
<<
/Font <<
/TT2 5 0 R
>>
>>
endobj
5 0 obj
<<
/BaseFont
/OXRDVC+Helvetica
/Subtype /TrueType
/Type /Font
>>
endobj
6 0 obj
<<
>>
stream
q 0 54 612 648 re W n /Cs1 cs 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm Q
q 48 93.30545 516 569.4218 re W n /Cs1 cs 1 1 1 sc 48 93.30545 516 569.4218 re f 0 0 0 sc
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 66.86 589.28 Tm /TT2 1 Tf (24.  ) Tj ET Q
q 1 0 0 0.8181818 0 54 cm BT 7.99 0 0 7.99 96.86 40.39 Tm /TT2 1 Tf (a.  ) Tj ET Q 
endstream
endobj
trailer
<<
/Root 1 0 R
>>
%%EOF

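I figured out what was going on. The PDF contains a clipping rectangle that does not include the "a.". I tried using PDFVisibleTextStripper, but it removed text that was actually visible in other documents.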
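To make the clipping explanation concrete, here is a small sketch that redoes the arithmetic from the sample content stream above using only java.awt.geom. The class and method names (ClipArithmetic, effectiveClip, baselineOrigin) are made up for illustration; the numbers are copied from the two "re W n" operators, the "cm" operator, and the two "Tm" operators. It only looks at the baseline origin of each string, not at the full glyph outlines.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

public class ClipArithmetic {
    // Each "re W n" intersects the current clip with a new rectangle, so the
    // effective clip is the intersection of both rectangles in the stream.
    static Rectangle2D effectiveClip() {
        Rectangle2D outer = new Rectangle2D.Double(0, 54, 612, 648);
        Rectangle2D inner = new Rectangle2D.Double(48, 93.30545, 516, 569.4218);
        return outer.createIntersection(inner);
    }

    // Maps a text origin (the last two operands of Tm) through
    // "1 0 0 0.8181818 0 54 cm" to get its position in page space.
    static Point2D baselineOrigin(double tmX, double tmY) {
        AffineTransform cm = new AffineTransform(1, 0, 0, 0.8181818, 0, 54);
        return cm.transform(new Point2D.Double(tmX, tmY), null);
    }

    public static void main(String[] args) {
        Rectangle2D clip = effectiveClip();
        Point2D item24 = baselineOrigin(66.86, 589.28);
        Point2D itemA = baselineOrigin(96.86, 40.39);
        System.out.println("clip y-range: " + clip.getMinY() + " .. " + clip.getMaxY());
        System.out.println("'24.' origin " + item24 + " inside clip: " + clip.contains(item24));
        System.out.println("'a.'  origin " + itemA + " inside clip: " + clip.contains(itemA));
    }
}
```

The origin of "24." lands at roughly y = 536, inside the clip, while the origin of "a." lands at roughly y = 87, below the clip's lower edge at 93.30545. That is why viewers do not paint it even though the text is still present in the stream.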

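In the end, I wrote a class that inherits from PageDrawer and overrides the showGlyph method to get access to the characters drawn on the page. This method checks whether the character's bounding box lies outside getGraphicsState().getCurrentClippingPath().getBounds2D().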
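One caveat about the bounds-based check: getBounds2D() only gives the bounding rectangle of the clipping path, so it is exact for rectangular clips like the one in the sample but can be too generous for other shapes. The sketch below compares the two tests with plain java.awt.geom types. It assumes the clipping path is available as a java.awt.geom.Area; the class name ClipVisibility, both method names, and the glyph boxes are made up for illustration.

```java
import java.awt.geom.Area;
import java.awt.geom.Ellipse2D;
import java.awt.geom.Rectangle2D;

public class ClipVisibility {
    // Bounds-based test: the glyph counts as visible if its box overlaps
    // the bounding rectangle of the clip.
    static boolean visibleByBounds(Rectangle2D glyphBox, Area clip) {
        return glyphBox.intersects(clip.getBounds2D());
    }

    // Shape-based test: the glyph counts as visible only if its box overlaps
    // the interior of the clip itself.
    static boolean visibleByArea(Rectangle2D glyphBox, Area clip) {
        return clip.intersects(glyphBox);
    }

    public static void main(String[] args) {
        // Rectangular clip taken from the sample PDF; the glyph box is a rough,
        // illustrative guess at where the "a." would be drawn.
        Area rectClip = new Area(new Rectangle2D.Double(48, 93.30545, 516, 569.4218));
        Rectangle2D hiddenGlyph = new Rectangle2D.Double(96.86, 87.05, 10, 3.4);
        System.out.println(visibleByBounds(hiddenGlyph, rectClip)); // false
        System.out.println(visibleByArea(hiddenGlyph, rectClip));   // false

        // Elliptical clip: a box tucked into the corner of the bounding
        // rectangle passes the bounds test but not the shape test.
        Area roundClip = new Area(new Ellipse2D.Double(0, 0, 100, 100));
        Rectangle2D cornerGlyph = new Rectangle2D.Double(1, 1, 5, 5);
        System.out.println(visibleByBounds(cornerGlyph, roundClip)); // true
        System.out.println(visibleByArea(cornerGlyph, roundClip));   // false
    }
}
```

For documents that only clip to rectangles, the cheaper bounds test used in this answer should behave the same as the shape test.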

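Unfortunately, this means I no longer use PDFTextStripper, so I had to reimplement some of its behavior, such as sorting characters by position (I was using setSortByPosition(true)). Computing the correct bounding box of a character from the font size and displacement was also a little tricky.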

ExtractChars.java

package com.example.foo;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.font.*;
import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import org.apache.pdfbox.util.Vector;
import java.awt.geom.*;
import java.io.*;

// This class effectively renders the PDF document in order to extract its
// text. It intercepts the showGlyph function provided by PageDrawer. We used to
// use PDFTextStripper but that has no way to exclude clipped characters.
public class ExtractChars extends PageDrawerHelper {
    // Skip erroneous characters smaller than this height. This might never happen
    // but there are places in the code that divide by height, so guard against it.
    static final float MIN_CHARACTER_HEIGHT = 0.01f;

    Processor processor;

    ExtractChars(PageDrawerParameters params, float pageHeight, int pageIndex, Processor processor) throws IOException {
        super(params, pageHeight, pageIndex);
        this.processor = processor;
    }

    // We can't move this method up to the superclass because the Renderer is
    // different each time. It needs to build an instance of the current class.
    public static void extract(PDDocument document, Processor processor) throws IOException {
        Renderer renderer = new Renderer(document);
        renderer.processor = processor;
        for (int i = 0; i < document.getNumberOfPages(); i += 1) {
            PDPage page = document.getPage(i);
            renderer.pageHeight = page.getMediaBox().getHeight();
            renderer.pageIndex = i;
            renderer.renderImage(i);
        }
    }

    @Override
    public void showGlyph(Matrix matrix, PDFont font, int _code, String unicode, Vector displacement) throws IOException {
        if (unicode == null) { return; }

        // Get the width and height of the character relative to font size.
        // The height does not change but the width does, e.g. 'M' is wider than 'I'.
        float width = displacement.getX();
        float height = fontHeight(font) / 2;
        BoundingBox charBox = clippedBoundingBox(matrix, width, height);

        // Skip the character if it is outside the clipping region and not visible.
        if (charBox == null) { return; }
        float boxHeight = charBox.bottom - charBox.top;
        if (boxHeight < MIN_CHARACTER_HEIGHT) { return; }

        // We need the text direction so we can sort text in separate buckets based on this.
        int direction = textDirection(matrix);
        processor.process(unicode, charBox, direction);
    }

    // https://stackoverflow.com/questions/17171815/get-the-font-height-of-a-character-in-pdfbox#answer-17202929
    float fontHeight(PDFont font) {
        return font.getFontDescriptor().getFontBoundingBox().getHeight() / 1000;
    }

    int textDirection(Matrix matrix) {
        float a = matrix.getValue(0, 0);
        float b = matrix.getValue(0, 1);
        float c = matrix.getValue(1, 0);
        float d = matrix.getValue(1, 1);

        // This logic is copied from:
        // https://github.com/atsuoishimoto/pdfbox-ja/blob/master/src/main/java/org/apache/pdfbox/util/TextPosition.java
        if ((a > 0) && (Math.abs(b) < d) && (Math.abs(c) < a) && (d > 0)) {
            return 0;
        } else if ((a < 0) && (Math.abs(b) < Math.abs(d)) && (Math.abs(c) < Math.abs(a)) && (d < 0)) {
            return 180;
        } else if ((Math.abs(a) < Math.abs(c)) && (b > 0) && (c < 0) && (Math.abs(d) < b)) {
            return 90;
        } else if ((Math.abs(a) < c) && (b < 0) && (c > 0) && (Math.abs(d) < Math.abs(b))) {
            return 270;
        }
        return 0;
    }

    // We can't construct an instance of ExtractChars directly because its
    // constructor requires PageDrawerParameters which is private to the package.
    // Instead, make an instance via a renderer and forward the fields to it.
    static class Renderer extends PDFRenderer {
        Processor processor;
        float pageHeight;
        int pageIndex;

        Renderer(PDDocument document) {
            super(document);
        }

        protected PageDrawer createPageDrawer(PageDrawerParameters params) throws IOException {
            return new ExtractChars(params, pageHeight, pageIndex, processor);
        }
    }

    public interface Processor {
        void process(String character, BoundingBox box, int direction);
    }
}

PageDrawerHelper.java

package com.example.foo;

import org.apache.pdfbox.rendering.*;
import org.apache.pdfbox.util.*;
import java.awt.geom.*;
import java.io.*;

// This class provides utility methods to subclasses, mostly so they can check
// if the current content is being clipped and therefore should be skipped.
//
// We shouldn't really use inheritance for sharing code but this has the
// advantage of being able to call some methods of the PageDrawer superclass.
public class PageDrawerHelper extends PageDrawer {
    float pageHeight;
    int pageIndex;

    PageDrawerHelper(PageDrawerParameters params, float pageHeight, int pageIndex) throws IOException {
        super(params);
        this.pageHeight = pageHeight;
        this.pageIndex = pageIndex;
    }

    // Gets the bounding box for a matrix by transforming corner points and taking
    // the min/max values in the x- and y-directions. This ensures rotation and skew
    // are taken into account. This method returns null if the content is fully clipped.
    BoundingBox clippedBoundingBox(Matrix matrix, float width, float height) {
        Point2D p0 = matrix.transformPoint(0, 0);
        Point2D p1 = matrix.transformPoint(0, height);
        Point2D p2 = matrix.transformPoint(width, 0);
        Point2D p3 = matrix.transformPoint(width, height);
        BoundingBox contentBox = boundingBox(p0, p1, p2, p3);
        BoundingBox clippedBox = applyClipping(contentBox);
        return clippedBox;
    }

    // Converts from PDF coordinates (origin at bottom left) to a box whose
    // origin is at the top left of the page.
    BoundingBox boundingBox(Point2D p0, Point2D p1, Point2D p2, Point2D p3) {
        Point2D topLeft = topLeft(p0, p1, p2, p3);
        Point2D botRight = botRight(p0, p1, p2, p3);
        float left = (float)topLeft.getX();
        float right = (float)botRight.getX();
        float top = pageHeight - (float)botRight.getY();
        float bottom = pageHeight - (float)topLeft.getY();
        return new BoundingBox(pageIndex, left, right, top, bottom);
    }

    Point2D topLeft(Point2D... points) {
        double minX = points[0].getX();
        double minY = points[0].getY();
        for (int i = 1; i < points.length; i += 1) {
            minX = Math.min(minX, points[i].getX());
            minY = Math.min(minY, points[i].getY());
        }
        return new Point2D.Double(minX, minY);
    }

    Point2D botRight(Point2D... points) {
        double maxX = points[0].getX();
        double maxY = points[0].getY();
        for (int i = 1; i < points.length; i += 1) {
            maxX = Math.max(maxX, points[i].getX());
            maxY = Math.max(maxY, points[i].getY());
        }
        return new Point2D.Double(maxX, maxY);
    }

    // Intersects the box with the bounds of the current clipping path.
    // Returns null if nothing of the box remains.
    BoundingBox applyClipping(BoundingBox box) {
        Rectangle2D clip = getGraphicsState().getCurrentClippingPath().getBounds2D();
        float clipLeft = (float)clip.getMinX();
        float clipRight = (float)clip.getMaxX();
        float clipTop = pageHeight - (float)clip.getMaxY();
        float clipBottom = pageHeight - (float)clip.getMinY();
        float left = Math.max(box.left, clipLeft);
        float right = Math.min(box.right, clipRight);
        float top = Math.max(box.top, clipTop);
        float bottom = Math.min(box.bottom, clipBottom);
        if (left >= right || top >= bottom) {
            return null;
        } else {
            return new BoundingBox(pageIndex, left, right, top, bottom);
        }
    }
}

CharacterSorter.java

package com.example.foo;

import java.util.*;

public class CharacterSorter {
    ArrayList<String> characters;
    ArrayList<BoundingBox> boxes;
    ArrayList<Integer> directions;

    public CharacterSorter(ArrayList<String> characters, ArrayList<BoundingBox> boxes, ArrayList<Integer> directions) {
        this.characters = characters;
        this.boxes = boxes;
        this.directions = directions;
    }

    // Sorts the three parallel lists in place, keeping them aligned.
    public void sortByDirectionThenPosition() {
        List<Tuple> tuples = new ArrayList<>();
        for (int i = 0; i < characters.size(); i += 1) {
            tuples.add(new Tuple(characters.get(i), boxes.get(i), directions.get(i)));
        }
        Collections.sort(tuples);
        characters.clear(); boxes.clear(); directions.clear();
        for (Tuple tuple : tuples) {
            characters.add(tuple.character);
            boxes.add(tuple.box);
            directions.add(tuple.direction);
        }
    }

    // This helper class wraps the three fields associated with a single character
    // and provides a comparator function which mimics how PDFTextStripper orders
    // its characters when #setSortByPosition(true) is set.
    class Tuple implements Comparable<Tuple> {
        String character;
        BoundingBox box;
        Integer direction;

        Tuple(String character, BoundingBox box, Integer direction) {
            this.character = character;
            this.box = box;
            this.direction = direction;
        }

        @Override
        public int compareTo(Tuple other) {
            int primary = Integer.compare(box.pageIndex, other.box.pageIndex);
            if (primary != 0) { return primary; }

            // The remainder of this logic is copied and adapted from:
            // https://github.com/apache/pdfbox/blob/a78f4a2ea058181e5ed05d6367ba7556948331b8/pdfbox/src/main/java/org/apache/pdfbox/text/TextPositionComparator.java#L29-L70
            // Only compare text that is in the same direction.
            int secondary = Integer.compare(direction, other.direction);
            if (secondary != 0) { return secondary; }

            // Get the text direction adjusted coordinates.
            float x1 = box.left;
            float x2 = other.box.left;
            float pos1YBottom = box.bottom;
            float pos2YBottom = other.box.bottom;

            // Note that the coordinates have been adjusted so (0, 0) is in upper left.
            float pos1YTop = pos1YBottom - (box.bottom - box.top);
            float pos2YTop = pos2YBottom - (other.box.bottom - other.box.top);
            float yDifference = Math.abs(pos1YBottom - pos2YBottom);

            // We will do a simple tolerance comparison: characters whose vertical
            // extents overlap are treated as being on the same line.
            if (yDifference < .1 ||
                (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
                (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
            {
                return Float.compare(x1, x2);
            } else if (pos1YBottom < pos2YBottom) {
                return -1;
            } else {
                return 1;
            }
        }
    }
}
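
The BoundingBox class is used throughout the code above but its source is not shown. The following is only a guess at a minimal version, inferred from how it is constructed (pageIndex, left, right, top, bottom) and from the fields the other classes read. The width() and height() helpers are additions for illustration and are not used by the code above. In the project it would live in package com.example.foo; the package line is left out here so the snippet compiles on its own.

```java
// Hypothetical reconstruction: an immutable box on a page, in coordinates
// whose origin is at the top left (so top < bottom for a non-empty box).
public class BoundingBox {
    public final int pageIndex;
    public final float left;
    public final float right;
    public final float top;
    public final float bottom;

    public BoundingBox(int pageIndex, float left, float right, float top, float bottom) {
        this.pageIndex = pageIndex;
        this.left = left;
        this.right = right;
        this.top = top;
        this.bottom = bottom;
    }

    // Illustrative helpers, not referenced elsewhere.
    public float width() {
        return right - left;
    }

    public float height() {
        return bottom - top;
    }

    @Override
    public String toString() {
        return "BoundingBox[page=" + pageIndex + ", left=" + left + ", right=" + right
            + ", top=" + top + ", bottom=" + bottom + "]";
    }
}
```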
