How to identify and remove hidden text from the PDF using PDFBox java - pdfbox

I am reading text from a PDF using the PDFBox library and saving it to a text file. It also reads hidden text, which is not visible when the PDF is viewed in a PDF reader. My requirement is to find some characteristics of this hidden text that distinguish it from normal text.

One possible criterion for the texts to ignore in your example files is the text color, pure CMYK white in one case, 0.753 in a Gray Gamma 2.2 XYZ ICCBased colorspace in the other case.
So let's extend the text stripper by a color filtering option. This in particular means adding operator processors for color setting instructions as the PDFTextStripper by default ignores them:
public class PDFFilteringTextStripper extends PDFTextStripper {
public interface TextStripperFilter {
public boolean accept(TextPosition text, PDGraphicsState graphicsState);
}
public PDFFilteringTextStripper(TextStripperFilter filter) throws IOException {
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorSpace());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorSpace());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingColorN());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingColorN());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceGrayColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceGrayColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceRGBColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceRGBColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetStrokingDeviceCMYKColor());
addOperator(new org.apache.pdfbox.contentstream.operator.color.SetNonStrokingDeviceCMYKColor());
this.filter = filter;
}
@Override
protected void processTextPosition(TextPosition text) {
PDGraphicsState graphicsState = getGraphicsState();
if (filter.accept(text, graphicsState))
super.processTextPosition(text);
}
final TextStripperFilter filter;
}
(PDFFilteringTextStripper class)
Using that text stripper class, we can filter the white text from the first example PDF like this:
float[] colorToFilter = new float[] {0,0,0,0};
PDDocument document = ...;
PDFFilteringTextStripper stripper = new PDFFilteringTextStripper((text, gs) -> {
PDColor color = gs.getNonStrokingColor();
return color == null || !((color.getColorSpace() instanceof PDDeviceCMYK) && Arrays.equals(color.getComponents(), colorToFilter));
});
String text = stripper.getText(document);
(ExtractFilteredText test testExtractNoWhiteText...)
Similarly we can filter the gray text from the second example PDF like this:
float[] colorToFilter = new float[] {0.753f};
PDDocument document = ...;
PDFFilteringTextStripper stripper = new PDFFilteringTextStripper((text, gs) -> {
PDColor color = gs.getNonStrokingColor();
return color == null || !((color.getColorSpace() instanceof PDICCBased) && Arrays.equals(color.getComponents(), colorToFilter));
});
String text = stripper.getText(document);
(ExtractFilteredText test testExtractNoGrayText...)
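Both filters above compare the color components with Arrays.equals, which requires exact float equality. If the value stored in a PDF were, say, 0.7529 rather than exactly 0.753, such a filter would not match. A minimal sketch of a tolerance-based comparison, using only the JDK; the class name ColorMatch, the method closeTo and the tolerance 0.001 are assumptions for illustration and not part of PDFBox:

```java
final class ColorMatch {
    private ColorMatch() {}

    /**
     * Returns true if both arrays have the same length and every
     * component differs by at most the given tolerance.
     */
    static boolean closeTo(float[] actual, float[] expected, float tolerance) {
        if (actual == null || expected == null || actual.length != expected.length) {
            return false;
        }
        for (int i = 0; i < actual.length; i++) {
            float diff = Math.abs(actual[i] - expected[i]);
            // Written as a negated "<=" so that NaN components never match.
            if (!(diff <= tolerance)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float[] wanted = {0.753f};
        System.out.println(closeTo(new float[] {0.7529f}, wanted, 0.001f));
        System.out.println(closeTo(new float[] {0.5f}, wanted, 0.001f));
    }
}
```

Inside the filter lambda, Arrays.equals(color.getComponents(), colorToFilter) could then be swapped for ColorMatch.closeTo(color.getComponents(), colorToFilter, 0.001f).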
In a comment you asked
A quick question- this text in 0.753 in a Gray Gamma 2.2 XYZ ICCBased colorspace - invisible text? Or is it just because of the colorspace, text is not visible in PDF?
It is visible! (Thus, strictly speaking you should not remove it from the extracted text.)
It merely is quite small. On the title page zoom in on the year "2016":

Related

Trying to replace graphics resources in a PDF - PDFBox 2.0.8

I'm trying to manipulate image resources in some PDF files; the workflow is: extract image resources -> process each -> replace old ones with the new.
Simple task really, I have working code for extracting and replacing, but when I replace, the new file size is nearly twice the original.
To replace the images, I use PDResources.put(COSName, PDXObject). Any ideas what would cause the size increase in the resulting document? It happens even if I completely omit the middle step in the workflow to process each image resource.
public static void PDFBoxReplaceImages() throws Exception {
PDDocument document = PDDocument.load(new File("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf"));
PDPageTree list = document.getPages();
int counter = 0; // running index used to build the replacement image file names
for (PDPage page : list) {
PDResources pdResources = page.getResources();
for (COSName c : pdResources.getXObjectNames()) {
PDXObject o = pdResources.getXObject(c);
if (o instanceof PDImageXObject) {
counter++;
String path = "C:\\Users\\Markus\\workspace\\pdf-test\\images\\"+counter+".png";
PDImageXObject newImg =
PDImageXObject.createFromFile(path, document);
pdResources.put(c, newImg);
}
}
}
document.save("C:\\Users\\Markus\\workspace\\pdf-test\\book.pdf");
}

iText LocationTextExtractionStrategy/HorizontalTextExtractionStrategy splits text into single characters

I used an extended version of LocationTextExtractionStrategy to extract connected texts of a PDF and their positions/sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing texts with a different font (TTF). Suddenly these texts are split into single characters or small fragments.
For example, "Detail" is no longer one object within the locationalResult list but is split into six items (D, e, t, a, i, l).
I tried using the HorizontalTextExtractionStrategy by making the getLocationalResult method public:
public List<TextChunk> GetLocationalResult()
{
return (List<TextChunk>)locationalResultField.GetValue(this);
}
and using the PdfReaderContentParser to extract the texts:
reader = new PdfReader("some_pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
var strategy = parser.ProcessContent(i, new HorizontalTextExtractionStrategy());
foreach (HorizontalTextExtractionStrategy.HorizontalTextChunk chunk in strategy.GetLocationalResult())
{
// Do something with the chunk
}
but this also returns the same result.
Is there any other way to extract connected texts from a pdf?
I used an extended version of LocationTextExtractionStrategy to extract connected texts of a PDF and their positions/sizes. I did this by using the locationalResult. This worked well until I tested a PDF containing texts with a different font (TTF). Suddenly these texts are split into single characters or small fragments.
This problem is due to wrong expectations concerning the contents of the LocationTextExtractionStrategy.locationalResult private list member variable.
This list of TextChunk instances contains the pieces of text as they were forwarded to the strategy from the parsing framework (or probably as they were preprocessed by some filter class), and the framework forwards each single string it encounters in a content stream separately.
Thus, if a seemingly connected word in the content stream actually is drawn using multiple strings, you get multiple TextChunk instances for it.
There actually is some "intelligence" in the method getResultantText joining these chunks properly, adding a space where necessary and so on.
In case of your document, "DETAIL " usually is drawn like this:
[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
As you see, there are slight text insertion point moves between 'D' and 'E', 'T' and 'A', 'I' and 'L', and 'L' and ' '. (Such mini moves usually represent kerning.) Thus, you'll get individual TextChunk instances for 'D', 'ET', 'AI', 'L', and ' '.
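To make the fragmentation concrete, here is a minimal JDK-only sketch that pulls the individual string operands out of the TJ array quoted above. It is not how iText parses content streams; the class TjArrayStrings and its methods are made up for illustration, it only understands hex strings (not literal strings in parentheses), and it assumes two bytes per character code, which matches the four-hex-digit groups in this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TjArrayStrings {
    // Matches one hex string operand such as <00280037>.
    private static final Pattern HEX_STRING = Pattern.compile("<([0-9A-Fa-f]+)>");

    private TjArrayStrings() {}

    /** Returns the hex string operands in order, skipping the numeric adjustments between them. */
    static List<String> hexStrings(String tjArray) {
        List<String> result = new ArrayList<>();
        Matcher m = HEX_STRING.matcher(tjArray);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    /** Number of character codes in a hex string, ASSUMING 2 bytes (4 hex digits) per code. */
    static int codeCount(String hex) {
        return hex.length() / 4;
    }

    public static void main(String[] args) {
        String tj = "[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ";
        for (String s : hexStrings(tj)) {
            // Each of these strings reaches the extraction strategy as a separate piece of text.
            System.out.println("<" + s + "> -> " + codeCount(s) + " code(s)");
        }
    }
}
```

Each string between two numeric adjustments is its own piece of text, so the kerning values directly determine where the word is cut.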
Admittedly, the LocationTextExtractionStrategy.locationalResult member is not very well documented; but as it is a private member, this IMHO is forgivable.
That this worked well for many documents is due to many PDF creators not applying kerning and simply drawing connected text using single string objects.
The HorizontalTextExtractionStrategy is derived from the LocationTextExtractionStrategy and mainly differs from it in the way it arranges the TextChunk instances to a single string. Thus, you'll see the same fragmentation here.
Is there any other way to extract connected texts from a pdf?
If you want "connected texts" as in "atomic string objects in the content stream", you already have them.
If you want "connected texts" as in "visually connected texts, no matter where the constituent letters are drawn in the content stream", you have to glue those TextChunk instances together like the LocationTextExtractionStrategy and HorizontalTextExtractionStrategy do in getResultantText in combination with the comparison methods in their respective TextChunkLocationDefaultImp and HorizontalTextChunkLocation implementations.
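The gluing idea can be sketched without iText. The following JDK-only example is a simplified stand-in, not the algorithm of getResultantText: the Chunk class, the join method and the spaceGap threshold are hypothetical, and it assumes that all chunks of one line share exactly the same baseline y and that text runs left to right:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChunkJoiner {
    /** Hypothetical stand-in for a text chunk: text, baseline y, and horizontal extent. */
    static final class Chunk {
        final String text;
        final float y;
        final float startX;
        final float endX;

        Chunk(String text, float y, float startX, float endX) {
            this.text = text;
            this.y = y;
            this.startX = startX;
            this.endX = endX;
        }
    }

    private ChunkJoiner() {}

    /** Joins chunks line by line; a horizontal gap wider than spaceGap becomes a space. */
    static String join(List<Chunk> chunks, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upwards, so higher y comes first; within a line sort left to right.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.startX));
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (Float.compare(prev.y, c.y) != 0) {
                    sb.append('\n');            // different baseline: new line
                } else if (c.startX - prev.endX > spaceGap) {
                    sb.append(' ');             // large gap: word boundary
                }                               // small gap (kerning): glue directly
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("AI", 700f, 30.2f, 40f));
        chunks.add(new Chunk("D", 700f, 10f, 17f));
        chunks.add(new Chunk("L", 700f, 40.2f, 46f));
        chunks.add(new Chunk("ET", 700f, 16.8f, 30f));
        System.out.println(join(chunks, 2f));
    }
}
```

A real implementation has to tolerate slightly different baselines, rotated text and font-dependent space widths, which is what the comparison methods mentioned above take care of.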
After debugging deep into the iTextSharp library I figured out that my texts are drawn with the TJ operator as mkl also mentioned.
[<0027> -0.2<00280037> 0.2<0024002c> 0.2<002f> -0.2<0003>] TJ
iText processes these texts not as a single PdfString but as an array of PdfObjects which ends up in calling renderListener.RenderText(renderInfo) for each PdfString item in it (see ShowTextArray class and DisplayPdfString method). In the RenderText method however the information about the relation of the pdf strings within the array got lost and every item is added to locationalResult as an independent object.
As my goal is to extract the "argument of a single text drawing instruction", I extended the PdfContentStreamProcessor class with a new method, ProcessTexts, which returns a list of these atomic strings. My workaround is not very pretty, as I had to copy and paste some private fields and methods from the original source, but it works for me.
class PdfContentStreamProcessorEx : PdfContentStreamProcessor
{
private IDictionary<int, CMapAwareDocumentFont> cachedFonts = new Dictionary<int, CMapAwareDocumentFont>();
private ResourceDictionary resources = new ResourceDictionary();
private CMapAwareDocumentFont font = null;
public PdfContentStreamProcessorEx(IRenderListener renderListener) : base(renderListener)
{
}
public List<string> ProcessTexts(byte[] contentBytes, PdfDictionary resources)
{
this.resources.Push(resources);
var texts = new List<string>();
PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().CreateSource(contentBytes)));
PdfContentParser ps = new PdfContentParser(tokeniser);
List<PdfObject> operands = new List<PdfObject>();
while (ps.Parse(operands).Count > 0)
{
PdfLiteral oper = (PdfLiteral)operands[operands.Count - 1];
if ("Tj".Equals(oper.ToString()))
{
texts.Add(getText((PdfString)operands[0]));
}
else if ("TJ".Equals(oper.ToString()))
{
string text = string.Empty;
foreach (PdfObject entryObj in (PdfArray)operands[0])
{
if (entryObj is PdfString)
{
text += getText((PdfString)entryObj);
}
}
texts.Add(text);
}
else if ("Tf".Equals(oper.ToString()))
{
PdfName fontResourceName = (PdfName)operands[0];
float size = ((PdfNumber)operands[1]).FloatValue;
PdfDictionary fontsDictionary = resources.GetAsDict(PdfName.FONT);
CMapAwareDocumentFont _font;
PdfObject fontObject = fontsDictionary.Get(fontResourceName);
if (fontObject is PdfDictionary)
_font = GetFont((PdfDictionary)fontObject);
else
_font = GetFont((PRIndirectReference)fontObject);
font = _font;
}
}
this.resources.Pop();
return texts;
}
string getText(PdfString @in)
{
byte[] bytes = @in.GetBytes();
return font.Decode(bytes, 0, bytes.Length);
}
private CMapAwareDocumentFont GetFont(PRIndirectReference ind)
{
CMapAwareDocumentFont font;
cachedFonts.TryGetValue(ind.Number, out font);
if (font == null)
{
font = new CMapAwareDocumentFont(ind);
cachedFonts[ind.Number] = font;
}
return font;
}
private CMapAwareDocumentFont GetFont(PdfDictionary fontResource)
{
return new CMapAwareDocumentFont(fontResource);
}
private class ResourceDictionary : PdfDictionary
{
private IList<PdfDictionary> resourcesStack = new List<PdfDictionary>();
virtual public void Push(PdfDictionary resources)
{
resourcesStack.Add(resources);
}
virtual public void Pop()
{
resourcesStack.RemoveAt(resourcesStack.Count - 1);
}
public override PdfObject GetDirectObject(PdfName key)
{
for (int i = resourcesStack.Count - 1; i >= 0; i--)
{
PdfDictionary subResource = resourcesStack[i];
if (subResource != null)
{
PdfObject obj = subResource.GetDirectObject(key);
if (obj != null) return obj;
}
}
return base.GetDirectObject(key); // shouldn't be necessary, but just in case we've done something crazy
}
}
}

Apache PDFBox and PDF/A-3

Is it possible to use Apache PDFBox to process PDF/A-3 documents? (Especially for changing field values?)
The PDFBox 1.8 Cookbook says that it is possible to create PDF/A-1 documents with pdfaid.setPart(1);
Can I apply pdfaid.setPart(3) for a PDF/A-3 document?
If not: Is it possible to read in a PDF/A-3 document, change some field values and save it, so that I have no need for creation/conversion to PDF/A-3 but the document is still PDF/A-3?
How to create a valid PDF/A-{2,3}-{B,U,A} document: in this example I render the PDF page to an image and then create a valid PDF/A-x-y document containing that image (PDFBox 2.0.x).
public static void main(String[] args) throws IOException, TransformerException
{
String resultFile = "result/PDFA-x.PDF";
FileInputStream in = new FileInputStream("src/PDFOrigin.PDF");
PDDocument doc = new PDDocument();
try
{
PDPage page = new PDPage();
doc.addPage(page);
doc.setVersion(1.7f);
/*
// A PDF/A file needs to have the font embedded if the font is used for text rendering
// in rendering modes other than text rendering mode 3.
//
// This requirement includes the PDF standard fonts, so don't use their static PDFType1Font classes such as
// PDFType1Font.HELVETICA.
//
// As there are many different font licenses it is up to the developer to check if the license terms for the
// font loaded allows embedding in the PDF.
String fontfile = "/org/apache/pdfbox/resources/ttf/ArialMT.ttf";
PDFont font = PDType0Font.load(doc, new File(fontfile));
if (!font.isEmbedded())
{
throw new IllegalStateException("PDF/A compliance requires that all fonts used for"
+ " text rendering in rendering modes other than rendering mode 3 are embedded.");
}
*/
PDPageContentStream contents = new PDPageContentStream(doc, page);
try
{
PDDocument docSource = PDDocument.load(in);
PDFRenderer pdfRenderer = new PDFRenderer(docSource);
int numPage = 0;
BufferedImage imagePage = pdfRenderer.renderImageWithDPI(numPage, 200);
PDImageXObject pdfXOImage = LosslessFactory.createFromImage(doc, imagePage);
contents.drawImage(pdfXOImage, 0,0, page.getMediaBox().getWidth(), page.getMediaBox().getHeight());
contents.close();
}catch (Exception e) {
// TODO: handle exception
}
// add XMP metadata
XMPMetadata xmp = XMPMetadata.createXMPMetadata();
PDDocumentCatalog catalogue = doc.getDocumentCatalog();
Calendar cal = Calendar.getInstance();
try
{
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
// dc.setTitle(file);
dc.addCreator("My APPLICATION Creator");
dc.addDate(cal);
PDFAIdentificationSchema id = xmp.createAndAddPFAIdentificationSchema();
id.setPart(3); //value => 2|3
id.setConformance("A"); // value => A|B|U
XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(xmp, baos, true);
PDMetadata metadata = new PDMetadata(doc);
metadata.importXMPMetadata(baos.toByteArray());
catalogue.setMetadata(metadata);
}
catch(BadFieldValueException e)
{
throw new IllegalArgumentException(e);
}
// sRGB output intent
InputStream colorProfile = CreatePDFA.class.getResourceAsStream(
"../../../pdmodel/sRGB.icc");
PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
intent.setInfo("sRGB IEC61966-2.1");
intent.setOutputCondition("sRGB IEC61966-2.1");
intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
intent.setRegistryName("http://www.color.org");
catalogue.addOutputIntent(intent);
catalogue.setLanguage("en-US");
PDViewerPreferences pdViewer =new PDViewerPreferences(page.getCOSObject());
pdViewer.setDisplayDocTitle(true);
catalogue.setViewerPreferences(pdViewer);
PDMarkInfo mark = new PDMarkInfo(); // new PDMarkInfo(page.getCOSObject());
PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
catalogue.setMarkInfo(mark);
catalogue.setStructureTreeRoot(treeRoot);
catalogue.getMarkInfo().setMarked(true);
PDDocumentInformation info = doc.getDocumentInformation();
info.setCreationDate(cal);
info.setModificationDate(cal);
info.setAuthor("My APPLICATION Author");
info.setProducer("My APPLICATION Producer");
info.setCreator("My APPLICATION Creator");
info.setTitle("PDF title");
info.setSubject("PDF to PDF/A{2,3}-{A,U,B}");
doc.save(resultFile);
}catch (Exception e) {
throw new IllegalArgumentException(e);
}
}
PDFBox supports that, but please be aware that, because PDFBox is a low-level library, you have to ensure conformance yourself, i.e. there is no 'Save as PDF/A-3'. You might want to take a look at http://www.mustangproject.org, which uses PDFBox to support ZUGFeRD (electronic invoicing), which also needs PDF/A-3.

Unable to get page number when using RenderListener interface to find a piece of text in PDF

iText requires coordinates and a page number to create form fields at different places in an existing PDF.
My PDF is dynamic, so I decided to create the PDF with some identifier text, use TextRenderInfo to find the coordinates of that text, and then use those coordinates to create the text fields and other form fields.
ParsingHelloWorld.java
public void extractText(String src, String dest) throws IOException, DocumentException {
PrintWriter out = new PrintWriter(new FileOutputStream(dest));
PdfReader reader = new PdfReader(src);
PdfStamper stp = new PdfStamper(reader, new FileOutputStream(dest));
RenderListener listener = new MyTextRenderListener(out,reader,stp);
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
for ( int pageNum= 0; pageNum < reader.getNumberOfPages(); pageNum++ ){
PdfDictionary pageDic = reader.getPageN(pageNum);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
processor.processContent(ContentByteUtils.getContentBytesForPage(reader, pageNum), resourcesDic);
}
out.flush();
out.close();
stp.close();
}
MyTextRenderListener.java
public void renderText(TextRenderInfo renderInfo) {
if (renderInfo.getText().startsWith("Fill_in_TextField")){
// creates the text field using the coordinates from the renderInfo object
createTextField(renderInfo);
}else if (renderInfo.getText().startsWith("Fill_in_SignatureField")){
// creates the signature field using the coordinates from the renderInfo object
createSignatureField(renderInfo);
}
}
The problem is that the page number is only available in the extractText method of the ParsingHelloWorld class.
When renderText is called inside the MyTextRenderListener class while the page content is being processed, I cannot get the page number needed to generate the fields at the coordinates where the identifier text resides (e.g. Fill_in_TextField, Fill_in_SignatureField, etc.).
Any suggestions or ideas on how to get the page number in my scenario?
Thanks in advance.
That's easy. Add a member variable with a setter to MyTextRenderListener:
protected int page;
public void setPage(int page) {
this.page = page;
}
Now when you loop over the pages in ParsingHelloWorld, pass the page number to MyTextRenderListener:
listener.setPage(pageNum);
Now you have access to that number in the renderText() method and you can pass it to your createTextField() method.
Note that I think your loop is wrong. Page numbers don't start at page 0, they start at page 1.
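Put together, the pattern looks roughly like the sketch below. It uses only the JDK: TextListener is a hypothetical stand-in for iText's RenderListener, pages are modeled as lists of strings, and MarkerListener and scan are made-up names. Note that the listener variable has to be declared with the concrete class type (not the interface type, as in the question), otherwise setPage is not visible:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PageAwareListenerDemo {
    /** Hypothetical stand-in for iText's RenderListener; only the text callback is modeled. */
    interface TextListener {
        void renderText(String text);
    }

    /** Listener that remembers which page is currently being processed. */
    static final class MarkerListener implements TextListener {
        private int page;
        final List<String> hits = new ArrayList<>();

        void setPage(int page) {
            this.page = page;
        }

        @Override
        public void renderText(String text) {
            if (text.startsWith("Fill_in_TextField")) {
                // The page number is available here because the caller set it first.
                hits.add("page " + page + ": " + text);
            }
        }
    }

    private PageAwareListenerDemo() {}

    static List<String> scan(List<List<String>> pages) {
        MarkerListener listener = new MarkerListener(); // concrete type, so setPage is accessible
        // Page numbers are 1-based, so the loop runs from 1 to the page count inclusive.
        for (int pageNum = 1; pageNum <= pages.size(); pageNum++) {
            listener.setPage(pageNum);
            for (String text : pages.get(pageNum - 1)) {
                listener.renderText(text);
            }
        }
        return listener.hits;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("Hello"),
                Arrays.asList("Some text", "Fill_in_TextField_1"));
        System.out.println(scan(pages));
    }
}
```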

Pdfbox how to extract font type and style from pdf

How can I retrieve font type and style attributes from a PDF using PDFBox?
If you want to get the font of a single character in the PDF document, you can call textPosition.getFont().getFontDescriptor().getFontName(), where textPosition is an instance of the class TextPosition.
All characters of a PDF document are related to TextPosition objects.
You can get the TextPosition objects of a PDF document by overriding the processTextPosition(TextPosition t) method of PDFTextStripper or with the getCharactersByArticle() method of PDFTextStripper.
For the latter, extend the PDFTextStripper class like this:
public class MyPDFTextStripper extends PDFTextStripper {
public MyPDFTextStripper() throws IOException {
super();
}
public Vector<List<TextPosition>> myGetCharactersByArticle() {
return getCharactersByArticle();
}
}
... to get the list of TextPositions for a single page use:
MyPDFTextStripper stripper = new MyPDFTextStripper();
PDDocument doc = PDDocument.load(new File(filename));
stripper.setStartPage(pageNr+1);
stripper.setEndPage(pageNr+1);
stripper.getText(doc);
Vector<List<TextPosition>> list = stripper.myGetCharactersByArticle();
... and finally to get the font for a single character just type:
textPosition.getFont().getFontDescriptor().getFontName()
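The font name alone often hints at the style, for example "ABCDEF+Arial-BoldMT". Embedded subset fonts carry a tag of six uppercase letters followed by a plus sign in front of the real name. The following JDK-only sketch strips that tag and guesses bold/italic from the remaining name. The class FontNameHints and its methods are hypothetical, and the keyword matching is only a heuristic, since font names follow no binding convention; where the PDF sets them, the font descriptor's flags and weight are a more direct source:

```java
import java.util.Locale;

final class FontNameHints {
    private FontNameHints() {}

    /** Removes a subset tag such as "ABCDEF+" (six uppercase letters and a plus sign), if present. */
    static String stripSubsetPrefix(String fontName) {
        if (fontName != null && fontName.matches("[A-Z]{6}\\+.+")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    /** Heuristic: the name mentions "bold". */
    static boolean looksBold(String fontName) {
        return fontName != null && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** Heuristic: the name mentions "italic" or "oblique". */
    static boolean looksItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String lower = fontName.toLowerCase(Locale.ROOT);
        return lower.contains("italic") || lower.contains("oblique");
    }

    public static void main(String[] args) {
        String raw = "ABCDEF+Arial-BoldMT"; // example value, not taken from a real document
        String name = stripSubsetPrefix(raw);
        System.out.println(name + " bold=" + looksBold(name) + " italic=" + looksItalic(name));
    }
}
```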