PostScript code to un-hide hidden text in PDF

I have a PDF with some hidden text in it.
When I press [CTRL+a] I see the hidden text in my document viewer.
I can copy the text too, and I can extract it via pdftotext, but I can't recolor the text so that it shows up in the PDF viewer without pressing [CTRL+a].
So I had the idea that I could use PostScript to change the color of this text object.
But how can I determine what function sets the color or hides the text?

You cannot use PostScript to achieve what you want. You need to resort to manually editing the PDF file...
There are basically three ways to "hide" text:
It could be white (or any color) text on white (or same color as text) background.
It could be covered by another object, say, a white area, or an image.
It could be using Text Rendering Mode 3 ("3 Tr").
The first two cases I'll not explain here, because they are rather unlikely. For the third case you could proceed like this:
Use qpdf to unpack as many of the compressed 'streams' inside the PDF as possible, creating what qpdf calls the 'QDF mode' of a PDF:
qpdf --qdf --object-streams=disable input.pdf uncompressed.pdf
Open uncompressed.pdf in a good text editor, such as Vim.
Search for the sequence 3 Tr.
(Text rendering mode 3 is described in the PDF-1.7 specification as "Neither fill nor stroke text (invisible).")
Change it to 0 Tr, 1 Tr or 2 Tr and save the file.
(Text rendering mode 0 is "Fill text", the normal mode; mode 1 is "Stroke text"; mode 2 is "Fill, then stroke text." Mode 1 will only show the outlines...)
Re-compress the file:
qpdf uncompressed.pdf input-modified.pdf
Open the new file input-modified.pdf in your favourite PDF viewer. It should now show the "un-hidden" text.
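The manual search-and-replace step on the QDF file can also be scripted. The following is only a minimal sketch, assuming the file has already been unpacked with qpdf --qdf as described above; the function names unhide_text and rewrite_file and the output file name are made up for illustration. It replaces every 3 Tr with 0 Tr (plain fill; pass 1 or 2 for the stroked variants). Because the replacement has the same length as the original, stream lengths and cross-reference offsets stay valid. It is a blunt byte-level replacement, so a binary stream that happens to contain the same bytes would be changed too.

```python
import re

# Matches the operator sequence "3 Tr". The lookbehind avoids false hits
# such as "13 Tr" or "0.3 Tr", where the 3 is only part of a longer number.
INVISIBLE_TR = re.compile(rb"(?<![\d.+-])3(\s+)Tr\b")


def unhide_text(data, new_mode=0):
    """Return data with every "3 Tr" replaced by "<new_mode> Tr".

    new_mode: 0 = fill, 1 = stroke, 2 = fill then stroke.
    The whitespace between operand and operator is preserved, so the
    byte length of the data does not change.
    """
    if new_mode not in (0, 1, 2):
        raise ValueError("new_mode must be 0, 1 or 2")
    digit = str(new_mode).encode("ascii")
    return INVISIBLE_TR.sub(lambda m: digit + m.group(1) + b"Tr", data)


def rewrite_file(src, dst, new_mode=0):
    """Read an unpacked (QDF) PDF, un-hide its text, write the result."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(unhide_text(data, new_mode))
```

Calling rewrite_file("uncompressed.pdf", "uncompressed-visible.pdf") and then re-compressing the result with qpdf would correspond to the manual steps above.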
Update
Having received a sample of a PDF file with "hidden" text from the OP (via private channels), I can confirm now that the hiding indeed is achieved by using white text color (RGB-white).
To make such text visible:
Unpack the PDF, using qpdf --qdf --object-streams=disable in.pdf unpacked.pdf
Search for all occurrences of 1 1 1 rg and 1 1 1 RG. These set the RGB colors to white (the first one non-stroking, the second one for stroking operations).
Comments à la %%Contents for page N: in the QDF-version of the uncompressed PDF file will indicate for which page the color setting is valid. (Note, there may be multiple occurrences of the rg and RG operators, each one setting a different (or the same) color for the next drawing operation.)
Now replace the white colors by black ones, by overwriting the found occurrences with 0 0 0 rg and 0 0 0 RG. Do this not all at once, but one after the other and observe what changes on the respective page after saving the changes. (You may want to avoid painting white text to black if it is on a black background already!)
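To change the white colour settings one at a time, it helps to list them first. This is a rough sketch under a few assumptions: the file is the unpacked QDF version, the page comments look like "%% Contents for page N", and white is written literally as 1 1 1 (a variant such as 1.0 1.0 1.0, or white set through other colour operators, is not found). The names find_white_ops and blacken are hypothetical helpers, not part of any tool.

```python
import re

# Page comments that qpdf writes into QDF files (assumed format).
PAGE_MARK = re.compile(rb"%%\s*Contents for page (\d+)")
# "1 1 1 rg" (non-stroking) or "1 1 1 RG" (stroking), any whitespace between.
WHITE_OP = re.compile(rb"(?<![\d.])1(\s+)1(\s+)1(\s+)(rg|RG)\b")


def find_white_ops(data):
    """List (page, byte offset, operator) for each white RGB colour setting."""
    marks = [(m.start(), int(m.group(1))) for m in PAGE_MARK.finditer(data)]
    hits = []
    for m in WHITE_OP.finditer(data):
        page = None
        for pos, number in marks:
            if pos > m.start():
                break
            page = number  # last page comment seen before this hit
        hits.append((page, m.start(), m.group(4).decode("ascii")))
    return hits


def blacken(data, index):
    """Replace only the index-th white colour setting (0-based) with black.

    Whitespace is kept, so the byte length of the data stays the same.
    """
    m = list(WHITE_OP.finditer(data))[index]
    black = (b"0" + m.group(1) + b"0" + m.group(2) + b"0"
             + m.group(3) + m.group(4))
    return data[:m.start()] + black + data[m.end():]
```

Applying blacken to one index, saving, and checking the affected page in a viewer mirrors the "one after the other" advice above.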

Firstly, hidden text in PDF is done with a text rendering mode, not a colour. Text rendering mode 3 is 'neither stroke nor fill'. So changing the colour won't help you if this is how the text is drawn. Of course we can't tell if this is how the text has been drawn (but I suspect it is) because you haven't made the PDF file publicly available. In almost all cases if you want to discuss a particular file the best thing to do is make it public.
Secondly, you can't use PostScript to change a PDF file (well, you could write a PostScript program to interpret the PDF file, but that would be hard...)

Related

iText7 - create PDF with exact dimensions when printed - how?

I'm creating a simple PDF using iText7 (C#) but I need it to be printed at exactly the right size. Here's my code:
PdfWriter writer = new PdfWriter("output.pdf");
PdfDocument pdf = new PdfDocument(writer);
pdf.SetDefaultPageSize(iText.Kernel.Geom.PageSize.LETTER);
var page = pdf.AddNewPage();
page.SetCropBox(new iText.Kernel.Geom.Rectangle(36, 36, 7.5f * 72, 10 * 72));
PdfCanvas canvas = new PdfCanvas(page);
canvas.SetStrokeColor(ColorConstants.BLACK).SetLineWidth(3);
canvas.MoveTo(36, 36);
canvas.LineTo(36, 36 + 72); // Draw a line 1 inch long
canvas.LineTo(36 + 72, 36 + 72); // Draw a second line, perpendicular to the first, also 1 inch long
canvas.ClosePathStroke();
pdf.Close();
If I right-click the resulting PDF and select "Print", my triangle is off the bottom of the page.
When I open the resulting PDF in the PDF program I'm using (PDF Architect), it gives me a few options:
If I just click "Print", it gives me lines that are 1 1/16" long and start about 1/8" from the edge of the page, so by default PDF Architect seems to be taking the contents of my crop box and expanding it to the maximum page availability.
If I click on "Fit" before clicking "Print", that results in the desired output - lines 1" long, starting 1/2" from each side of the page. That works but is error-prone - too easy to forget to click "Fit" every time.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
You will not be able to completely control this from the PDF document. The PDF processor (e.g. viewing application) or the printer (driver) will always be able to scale the content up or down.
Apparently, PDF Architect has the "Fit" option enabled by default, so it scales the page to the selected paper size.
You are setting a crop box of 7.5x10 in. I assume you're printing to Letter sized (8.5x11 in) paper. So the 7.5x10 page will indeed be scaled up, and your content will become slightly larger.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
I would not set the crop box. When the pages in the PDF document are Letter size and the output paper is also Letter size, it should not matter whether the "Fit" option is enabled or not, as no scaling needs to happen. It's definitely not a foolproof solution, but at least it's less error-prone.
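The scaling effect of the 7.5x10 in crop box can be estimated with a little arithmetic. This is only a sketch of how a "fit to paper" option typically works (largest uniform scale at which the page still fits); fit_scale is a made-up helper, and the actual behaviour of PDF Architect is an assumption.

```python
def fit_scale(content_w, content_h, area_w, area_h):
    """Largest uniform scale at which the content still fits into the area."""
    return min(area_w / content_w, area_h / content_h)


# Crop box from the question: 7.5 x 10 in. Paper: Letter, 8.5 x 11 in.
scale_cropped = fit_scale(7.5, 10.0, 8.5, 11.0)

# Without a crop box the page itself is Letter size, so nothing is scaled.
scale_full = fit_scale(8.5, 11.0, 8.5, 11.0)

print("with crop box:    a 1 in line prints at %.3f in" % scale_cropped)
print("without crop box: a 1 in line prints at %.3f in" % scale_full)
```

With the full sheet as the target area, the cropped page is enlarged by about 10 %. The 1 1/16" measured in the question is a bit less than that, presumably because the viewer fits the page into the printer's printable area, which is smaller than the sheet.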

Libre Office Labels don't show up as "AcroFields" in iTextSharp?

So I've been trying to generate a report. I've tried quite a few things already, but there always seem to be problems. I'm currently trying iTextSharp 4.1.6.
My current strategy is to use LibreOffice to create a document with editable pdf fields, or I guess they are called "AcroFields". I'm not sure since I can't find a definition. But anyways, I assume that all of these are "AcroFields":
But if I put all of those into a form and export as pdf only some of them show up as AcroFields:
var reader = new PdfReader(File.ReadAllBytes("abc.pdf"));
foreach(var field in reader.AcroFields.Fields)
{
Console.WriteLine(((DictionaryEntry)field).Key);
}
> Text Box 1
Check Box 1
Numeric Field 1
Formatted Field 1
Date Field 1
List Box 1
Combo Box 1
Push Button 1
Option Button 1
Notice how Label Field 1 is not present. If it were present then doing a text replace might be easy. Except it's not present so it's looking like even iText can't do a simple text replace in a pdf. Is this true? How would you replace text in a pdf document using iTextSharp?
Notice how Label Field 1 is not present.
As there is no AcroForm form field type "label", form labels usually are drawn as regular page content in PDF files.
If it were present then doing a text replace might be easy. Except it's not present so it's looking like even iText can't do a simple text replace in a pdf. Is this true?
Indeed, in general there is no simple text replacement in a PDF.
How would you replace text in a pdf document using iTextSharp?
I would determine the bounding box coordinates of the text to replace using the iText text extraction feature with some extension that returns text plus coordinates. Then I'd remove that text by redaction using iText's PdfCleanUp... classes. Finally I'd add the replacement text as new text in the bounding box determined at start.
Unfortunately for you, both good text extraction and redaction are not present in your version 4.1.6; for this approach you should update at least to 5.5.x.
Alternatively, though, as you've been trying to generate a report, I assume the template design is in your hands. In that case you can put your labels into read-only text fields which you can change (they are read-only only to GUI users).

Lost some text when extracting pdf

I've tried to get all the text on the page by using iText, but I have no idea why every coordinate text loses the last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.

How is this pdf encoded? The font looks funny

I have seen this effect many times while reading PDF documents. Some PDFs have this funny smudged font which looks like a scanned image. However, I am able to select the text, and while I am selecting it the highlighted text appears differently, as seen in the images.
Default appearance
Appearance on selection of font
Overall, seems like some ocr is happening behind the scene.
The document reader I am using is Atril 1.12.2 document viewer.
My question is: What is encoded in the pdf, image or text? What is happening to text when I am selecting it?
Another nice change can be observed in the document shared by the OP:
What we see here indeed is the result of OCR. But it's not some OCR happening behind the scenes in the viewer; OCR has already happened before, and the results have been integrated into the PDF.
The PDF page actually contains a scanned image upon which invisible text is drawn.
As long as nothing is selected, Atril shows exactly that, you only see the scanned image. As soon as you start selecting text, though, it appears to cover the marked area in blue and display the marked (formerly invisible) text in white upon it.
In situations, therefore, in which the invisible text is not added exactly above the corresponding letters in the image, this might result in funny gaps like the one in the OP's screenshot after "multidimensional". In case of errors in the OCR output, one sees the erroneous data like in my screenshots.
Other PDF viewers often merely mark the text by applying some effect to the text area, e.g. inverting colors or overlaying a semi-transparent color.
It might be considered an advantage of the Atril approach that already in the selection process one sees the exact text one is selecting and probably eventually going to copy.
Inside the content stream
As mentioned above, the PDF page actually contains a scanned image upon which invisible text is drawn.
In the page content stream the corresponding instructions look like this:
1 0 0 1 0 0.2401 cm
(shift the coordinate system a minute bit up)
1 1 1 rg
1 i
/RelativeColorimetric ri
/R794 gs
0 0 576 719.5 re
f
(filling the image area with white)
q
576 0 0 719.5 0 0 cm
/Im0 Do
Q
(drawing the bitmap image)
1 0 0 1 0 -0.2401 cm
(shift the coordinate system a minute bit down, undoing the initial upshift)
BT
(beginning a text object)
0 0 0 rg
(setting the fill color to black)
/TT1 1 Tf
0.05 Tc
0 Tw
3 Tr
(selecting the font TT1 at size 1, a bit of extra space between characters, no extra space between words, and text rendering mode 3, i.e. invisible)
7.3 0 0 7.3 83.8 678.4401 Tm
(SOFTWARE-PRACTICE ) Tj
(setting the text coordinate system to be shifted by 83.8 horizontally and 678.4401 vertically and to be scaled by 7.3 and drawing some text)
0.08 Tc
7.4 0 0 7.1 175.2 678.4401 Tm
(AND ) Tj
(changing character spacing a bit, setting the text coordinate system to be shifted by 175.2 horizontally and 678.4401 vertically and to be scaled by 7.4 horizontally and 7.1 vertically and drawing some text)
...
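To check a content stream like this for invisible text without reading it by hand, a small script can track the text rendering mode and collect the strings shown while it is 3. The following is only an illustrative sketch, not a real PDF parser: the function name invisible_text is made up, it assumes an already decompressed stream given as text, it handles only simple Tj strings without nested parentheses (no TJ arrays, no hex strings), and it ignores that q/Q save and restore the rendering mode.

```python
import re

# Very small tokenizer: literal strings, names, numbers, operators.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"    # literal string, e.g. (AND )
    r"|/[^\s/()\[\]<>]+"       # name, e.g. /TT1
    r"|[-+]?\d*\.?\d+"         # number, e.g. 7.3 or -0.2401
    r"|[A-Za-z'\"*]+"          # operator, e.g. BT, Tr, Tj
)


def invisible_text(stream):
    """Return the strings drawn with Tj while text rendering mode 3 is set."""
    mode = 0          # the default text rendering mode is 0 (fill)
    operands = []     # operands collected since the last operator
    hidden = []
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/+-.0123456789":
            operands.append(tok)
            continue
        # Anything else is an operator that consumes the pending operands.
        if tok == "Tr" and operands and operands[-1][0] not in "(/":
            mode = int(float(operands[-1]))
        elif tok == "Tj" and operands and mode == 3:
            hidden.append(operands[-1][1:-1])  # strip the parentheses
        operands = []
    return hidden


sample = """BT
0 0 0 rg
/TT1 1 Tf
0.05 Tc
0 Tw
3 Tr
7.3 0 0 7.3 83.8 678.4401 Tm
(SOFTWARE-PRACTICE ) Tj
0.08 Tc
7.4 0 0 7.1 175.2 678.4401 Tm
(AND ) Tj
ET"""

print(invisible_text(sample))
```

For the excerpt above, both strings are reported, since 3 Tr is set once at the start of the text object and stays in effect for the following Tj operations.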
TL;DR
What is encoded in the pdf, image or text?
Both, the image plus invisible text upon it.
What is happening to text when I am selecting it?
Atril covers the text in blue and draws the selected (formerly invisible) text upon it in white.

Using whole space of pdf file

I am using Prawn to create a PDF file, but it always leaves some space/margins around the page. Can't we use the whole space of the PDF page without leaving any margins?
Thanks !!!
Are you referring to the page bounds?
The general space that is consumed on the page can be shown by the example code:
require 'prawn/core'
require 'prawn/layout'
Prawn::Document.generate('padded_box.pdf') do
  stroke_bounds
  text "Margin box"
  padded_box(25) do
    stroke_bounds
    text "Bounding box padded by 25 on all sides from the margins"
    padded_box(50) do
      stroke_bounds
      text "Bounding box padded by 50 on all sides from the parent bounds"
    end
  end
end
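This will draw the bounds of the page, showing the margin. The gap is the margin that is typically reserved for the printing area. In current Prawn versions the margin can be changed with the :margin option of Prawn::Document.generate, e.g. :margin => 0 to use the whole page.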