How is this PDF encoded? The font looks funny

I have seen this effect many times while reading PDF documents. Some PDFs have a funny, smudged font that looks like a scanned image. However, I am able to select the text, and while I am selecting it the highlighted text appears different, as seen in the images.
Default appearance
Appearance on selection of font
Overall, seems like some ocr is happening behind the scene.
The document reader I am using is Atril 1.12.2 document viewer.
My question is: What is encoded in the pdf, image or text? What is happening to text when I am selecting it?

Another nice change can be observed in the document shared by the OP:
What we see here is indeed the result of OCR. But it is not "some ocr happening behind the scene" in the viewer; the OCR has already happened beforehand, and its results have been integrated into the PDF.
The PDF page actually contains a scanned image upon which invisible text is drawn.
As long as nothing is selected, Atril shows exactly that: you only see the scanned image. As soon as you start selecting text, though, it appears to cover the marked area in blue and display the marked (formerly invisible) text in white upon it.
Therefore, wherever the invisible text is not placed exactly over the corresponding letters in the image, this can result in funny gaps like the one in the OP's screenshot after "multidimensional". Where the OCR output contains errors, one sees the erroneous text, as in my screenshots.
Other PDF viewers often merely mark the text by applying some effect to the text area, e.g. inverting colors or overlaying a semi-transparent color.
It might be considered an advantage of the Atril approach that already in the selection process one sees the exact text one is selecting and probably eventually going to copy.
Inside the content stream
As mentioned above, the PDF page actually contains a scanned image upon which invisible text is drawn.
In the page content stream the corresponding instructions look like this:
1 0 0 1 0 0.2401 cm
(shift the coordinate system a minute bit up)
1 1 1 rg
1 i
/RelativeColorimetric ri
/R794 gs
0 0 576 719.5 re
f
(filling the image area with white)
q
576 0 0 719.5 0 0 cm
/Im0 Do
Q
(drawing the bitmap image)
1 0 0 1 0 -0.2401 cm
(shift the coordinate system a minute bit down, undoing the initial upshift)
BT
(beginning a text object)
0 0 0 rg
(setting the fill color to black)
/TT1 1 Tf
0.05 Tc
0 Tw
3 Tr
(selecting the font TT1 at size 1, a bit of extra space between characters, no extra space between words, and text rendering mode 3, i.e. invisible)
7.3 0 0 7.3 83.8 678.4401 Tm
(SOFTWARE-PRACTICE ) Tj
(setting the text coordinate system to be shifted by 83.8 horizontally and 678.4401 vertically and to be scaled by 7.3 and drawing some text)
0.08 Tc
7.4 0 0 7.1 175.2 678.4401 Tm
(AND ) Tj
(changing character spacing a bit, setting the text coordinate system to be shifted by 175.2 horizontally and 678.4401 vertically and to be scaled by 7.4 horizontally and 7.1 vertically and drawing some text)
...
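To check whether a content stream draws text like this, one can track the text rendering mode while scanning the operators. The following is a minimal Python sketch, not a full PDF parser: invisible_strings is a hypothetical helper, it assumes an already decompressed content stream, and it ignores nested parentheses in strings, hex strings, TJ arrays, and inline images.

```python
import re

# Tokens: literal strings, array brackets, names, and everything else
# (numbers and operators). Nested parentheses are not handled.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|\[|\]|/[^\s/\[\]()<>]*|[^\s/\[\]()<>]+")
OPERAND_START = b"(/[]+-.0123456789"

def invisible_strings(stream: bytes) -> list:
    """Return the strings shown with Tj while text rendering mode 3 is active."""
    mode = 0          # default text rendering mode: fill
    saved = []        # modes saved by q, restored by Q
    operands = []
    found = []
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok[:1] in OPERAND_START:
            operands.append(tok)
            continue
        # Anything else is treated as an operator.
        if tok == b"q":
            saved.append(mode)
        elif tok == b"Q":
            mode = saved.pop() if saved else 0
        elif tok == b"Tr" and operands:
            mode = int(operands[-1])
        elif tok == b"Tj" and operands and mode == 3:
            found.append(operands[-1][1:-1].decode("latin-1"))
        operands.clear()
    return found

sample = (b"BT\n0 0 0 rg\n/TT1 1 Tf\n0.05 Tc\n0 Tw\n3 Tr\n"
          b"7.3 0 0 7.3 83.8 678.4401 Tm\n(SOFTWARE-PRACTICE ) Tj\n"
          b"0.08 Tc\n7.4 0 0 7.1 175.2 678.4401 Tm\n(AND ) Tj\nET")
print(invisible_strings(sample))
```

For the excerpt above, both strings are reported because 3 Tr stays in effect until another Tr instruction or a Q changes it.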
TL;DR
What is encoded in the pdf, image or text?
Both, the image plus invisible text upon it.
What is happening to text when I am selecting it?
Atril covers the text in blue and draws the selected (formerly invisible) text upon it in white.

Related

iText7 - create PDF with exact dimensions when printed - how?

I'm creating a simple PDF using iText7 (C#) but I need it to be printed at exactly the right size. Here's my code:
using iText.Kernel.Colors;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;

PdfWriter writer = new PdfWriter("output.pdf");
PdfDocument pdf = new PdfDocument(writer);
pdf.SetDefaultPageSize(iText.Kernel.Geom.PageSize.LETTER);
var page = pdf.AddNewPage();
// Crop box (x, y, width, height): 7.5 x 10 in, inset 1/2 in from every edge of the Letter page
page.SetCropBox(new iText.Kernel.Geom.Rectangle(36, 36, 7.5f * 72, 10 * 72));
PdfCanvas canvas = new PdfCanvas(page);
canvas.SetStrokeColor(ColorConstants.BLACK).SetLineWidth(3);
canvas.MoveTo(36, 36);
canvas.LineTo(36, 36 + 72); // Draw a line 1 inch long
canvas.LineTo(36 + 72, 36 + 72); // Draw a second line, perpendicular to the first, also 1 inch long
canvas.ClosePathStroke();
pdf.Close();
If I right-click the resulting PDF and select "Print", my triangle is off the bottom of the page.
When I open the resulting PDF in the PDF program I'm using (PDF Architect), it gives me a few options:
If I just click "Print", it gives me lines that are 1 1/16" long and start about 1/8" from the edge of the page, so by default PDF Architect seems to be taking the contents of my crop box and expanding it to the maximum page availability.
If I click on "Fit" before clicking "Print", that results in the desired output - lines 1" long, starting 1/2" from each side of the page. That works but is error-prone - too easy to forget to click "Fit" every time.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
You will not be able to completely control this from the PDF document. The PDF processor (e.g. viewing application) or the printer (driver) will always be able to scale the content up or down.
Apparently, PDF Architect has the "Fit" option enabled by default, so it scales the page to the selected paper size.
You are setting a crop box of 7.5x10 in. I assume you're printing to Letter sized (8.5x11 in) paper. So the 7.5x10 page will indeed be scaled up, and your content will become slightly larger.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
I would not set the crop box. When the pages in the PDF document are Letter size and the output paper is also Letter size, it should not matter whether the "Fit" option is enabled or not, as no scaling needs to happen. It's definitely not a foolproof solution, but at least it's less error-prone.
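The effect of fitting can be worked through with a little arithmetic. The following is a minimal Python sketch, assuming that "Fit" applies one uniform scale factor so that the visible page box fills the target area; fit_scale is a hypothetical helper and not part of iText or PDF Architect.

```python
def fit_scale(box_w, box_h, target_w, target_h):
    """Uniform scale factor that makes a box fill a target area without distortion."""
    return min(target_w / box_w, target_h / box_h)

# 7.5 x 10 in crop box fitted onto a full Letter sheet (8.5 x 11 in)
cropped = fit_scale(7.5, 10.0, 8.5, 11.0)

# Letter-size page (no crop box) on Letter paper
uncropped = fit_scale(8.5, 11.0, 8.5, 11.0)

print(cropped, uncropped)
```

With the crop box, a 1 in line would grow by about 10% on a full sheet; the somewhat smaller growth reported in the question would be consistent with the viewer fitting to a printable area slightly smaller than the sheet, though that is only a guess. Without the crop box the factor is 1, so fitting changes nothing.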

How the value of the tj operator is generated in a pdf document (justified text)

I can't understand, or find out, how the value of the tj operator is generated.
Here I paste the result before and after changing the alignment of the text (for the second block I switched to left-justified and then back to centered).
I think PDF uses some kind of PRNG, but I can't find out which kind.
Help please.
[(\003\024\027\005\003\030\036\b)-114.267(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.337(\b)-111.574(#\024\002\f\005\002\021\003\007\004\f\005\b)-117.089(\003\006\002\003\b)-114.08
[(\003\024\027\005\003\030\036\b)-114.366(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.327(\b)-111.693(#\024\002\f\005\002\021\003\007\004\f\005\b)-116.98(\003\006\002\003\b)-114.188
First of all, the PDF format does not explicitly support text justification. PDF does not even know text column definitions to justify text in!
All the PDF format supports is
setting or changing the text matrix (and text line matrix), scaling, character and word spacing explicitly and
drawing text pieces which implicitly changes the text matrix.
Thus, if a PDF processor changes the justification of a line of text, it actually first has to have determined
which text pieces belong together and form that line of text;
text pieces can be given as arguments of the Tj or TJ instructions (or, more rarely, the " or ' instructions); in simple cases the whole line is drawn using a single instruction but you cannot count on that in general; and
what the left and right borders of the text column are to justify between;
e.g. these borders might be standard values assumed by the processor for certain page formats or derived from the current clip path.
Having determined these data, the procedure differs for different kinds of justification:
left justification - position the text matrix at the left text column border at the height of the line and simply let the text drawing instructions follow;
right justification - calculate the width of the drawn line using the current font, position the text matrix at the right text column border minus that width at the height of the line, and let the text drawing instructions follow;
center justification - calculate the width of the drawn line using the current font, position the text matrix at the middle of the text column minus half that width at the height of the line, and let the text drawing instructions follow;
full justification - calculate the width of the drawn line using the current font, set the character spacing and word spacing (using the Tc and Tw instructions, probably with a tweak of the Tz horizontal scaling) to use up the difference between that width and the text column width, position the text matrix at the left text column border at the height of the line, and let the text drawing instructions follow;
or calculate the width of the drawn line using the current font, change the text drawing instructions to use up the difference between that width and the text column width (e.g. using the numeric TJ array argument values), position the text matrix at the left text column border at the height of the line, and let the changed text drawing instructions follow;
or even apply a combination of these methods.
(The changes applied for a full justification - character spacing, word spacing, changes to the text drawing instructions - obviously also have to be undone when later switching to another type of justification...)
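The calculations described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not what any particular PDF processor does: the function names are hypothetical, all lengths are in the same units, the text matrix is unscaled, horizontal scaling is 100%, and the line width has already been computed from the font metrics.

```python
def start_x(mode, left, right, line_width):
    """x position for the text matrix at the start of the line."""
    if mode == "left":
        return left
    if mode == "right":
        return right - line_width
    if mode == "center":
        return (left + right) / 2 - line_width / 2
    raise ValueError(f"unsupported mode: {mode}")

def word_spacing_for_full(left, right, line_width, space_count):
    """Tw value that spreads the leftover width over the spaces in the line.

    Tw only affects the single-byte character code 32, so this assumes
    the line's spaces are encoded that way.
    """
    if space_count == 0:
        return 0.0
    return (right - left - line_width) / space_count

def tj_adjustment(extra_gap, font_size):
    """Numeric TJ array element that adds extra_gap after a string.

    TJ numbers are in thousandths of a text space unit and are subtracted
    from the current position, so a wider gap needs a negative number.
    """
    return -1000.0 * extra_gap / font_size

print(start_x("center", 72, 540, 200))
print(word_spacing_for_full(72, 540, 428, 10))
print(tj_adjustment(4.0, 8))
```

The last function shows why values such as -114.267 in the excerpts are negative: each one pushes the following text piece slightly further to the right.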
Positioning the text matrix can happen using the Tm, Td, TD, and T* instructions.
By the way, the positioning and scaling of the text also is influenced by the current transformation matrix. Thus, cm instructions can also be used for justification. But this is less likely than the use of the instructions mentioned above...
Unfortunately you merely supplied an excerpt from the array argument of a TJ instruction before and after such a justification job. One sees that the numeric elements of that array change very slightly. Whether this actually is the justification itself (as per the second option of the full justification above) or merely some computational inaccuracy cannot be told without the context.

PostScript code to un-hide hidden text in PDF

I have a PDF with some hidden text in it.
When I press [CTRL+a] I see the hidden text in my document viewer.
I can copy the text too, and I can extract it via pdftotext, but I can't recolor the text so that I can view the hidden text in the PDF viewer without pressing [CTRL+a].
So I had the idea that I could use PostScript to change the color of this text object.
But how can I determine what function sets the color or hides the text?
You cannot use PostScript to achieve what you want. You need to resort to manually editing the PDF file...
There are basically three ways to "hide" text:
It could be white (or any color) text on white (or same color as text) background.
It could be covered by another object, say, a white area, or an image.
It could be using Text Rendering Mode 3 ("3 Tr").
The first two cases I'll not explain here, because they are rather unlikely. For the third case you could proceed like this:
Use qpdf to unpack as many of the compressed 'streams' inside the PDF as possible, creating what qpdf calls the 'QDF mode' of a PDF:
qpdf --qdf --object-streams=disable input.pdf uncompressed.pdf
Open uncompressed.pdf in a good text editor, such as Vim.
Search for the sequence 3 Tr.
(Text rendering mode 3 is described in the PDF-1.7 specification as "Neither fill nor stroke text (invisible).")
Change it to 1 Tr or 2 Tr and save the file.
(Text rendering mode 1 is "stroke text", mode 2 is "Fill, then stroke text." Mode 1 will only show the outlines...)
Re-compress the file:
qpdf uncompressed.pdf input-modified.pdf
Open the new file input-modified.pdf in your favourite PDF viewer. It should now show the "un-hidden" text.
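The search-and-replace step can also be scripted instead of done in an editor. The following is a minimal Python sketch under the assumption that it is applied to the bytes of the uncompressed QDF file; unhide_text is a hypothetical helper, and because it works on raw bytes it would also change a "3 Tr" that happens to occur inside a string or binary data.

```python
import re

# "3 Tr" as a standalone operand/operator pair: the 3 must not be the tail
# of a longer number, and Tr must be followed by whitespace or end of data.
INVISIBLE = re.compile(rb"(?<![\d.+-])3(\s+Tr)(?=\s|$)")

def unhide_text(data: bytes, new_mode: int = 2) -> bytes:
    """Replace text rendering mode 3 with a visible mode.

    Mode 0 fills, mode 1 strokes, mode 2 fills and then strokes.
    The replacement has the same length, so byte offsets in the file
    do not shift.
    """
    if new_mode not in (0, 1, 2):
        raise ValueError("new_mode must be 0, 1 or 2")
    digit = str(new_mode).encode("ascii")
    return INVISIBLE.sub(lambda m: digit + m.group(1), data)

before = b"0.05 Tc\n0 Tw\n3 Tr\n7.3 0 0 7.3 83.8 678.4401 Tm\n"
after = unhide_text(before)
print(after)
```

To use it on a file, read uncompressed.pdf in binary mode, pass the bytes through unhide_text, write the result to a new file, and then run the re-compression step as described.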
Update
Having received a sample of a PDF file with "hidden" text from the OP (via private channels), I can confirm now that the hiding indeed is achieved by using white text color (RGB-white).
To make such text visible:
Unpack the PDF, using qpdf --qdf --object-streams=disable in.pdf unpacked.pdf
Search for all occurrences of 1 1 1 rg and 1 1 1 RG. These set the RGB colors to white (the first one non-stroking, the second one for stroking operations).
Comments à la %%Contents for page N: in the QDF-version of the uncompressed PDF file will indicate for which page the color setting is valid. (Note, there may be multiple occurrences of the rg and RG operators, each one setting a different (or the same) color for the next drawing operation.)
Now replace the white colors by black ones, by overwriting the found occurrences with 0 0 0 rg and 0 0 0 RG. Do this not all at once, but one after the other and observe what changes on the respective page after saving the changes. (You may want to avoid painting white text to black if it is on a black background already!)
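Because the replacements are best done one at a time, it can help to list the candidates first. The following is a minimal Python sketch, assuming the page comments in the QDF file match the pattern "%% Contents for page N" (the exact comment format is an assumption taken from the description above); find_white_settings is a hypothetical helper.

```python
import re

PAGE = re.compile(rb"%%\s*Contents for page (\d+)")
# 1 1 1 rg / 1 1 1 RG, also accepting spellings such as 1.0 1.0 1.0 rg
ONE = rb"1(?:\.0*)?"
WHITE = re.compile(rb"(?<![\d.])" + ONE + rb"\s+" + ONE + rb"\s+" + ONE
                   + rb"\s+(rg|RG)(?=\s|$)")

def find_white_settings(data: bytes) -> list:
    """Return (page, byte offset, operator) for every RGB-white color setting."""
    events = [(m.start(), "page", int(m.group(1))) for m in PAGE.finditer(data)]
    events += [(m.start(), "white", m.group(1).decode("ascii"))
               for m in WHITE.finditer(data)]
    page = None  # stays None for matches before the first page comment
    hits = []
    for offset, kind, value in sorted(events, key=lambda e: e[0]):
        if kind == "page":
            page = value
        else:
            hits.append((page, offset, value))
    return hits

demo = (b"%% Contents for page 1\n1 1 1 rg\nBT (x) Tj ET\n"
        b"%% Contents for page 2\n0 0 0 rg\n1 1 1 RG\n")
for page, offset, op in find_white_settings(demo):
    print(page, offset, op)
```

Each reported offset points at the first "1" of the color operands, so the occurrence can be located in the editor and changed individually.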
Firstly, hidden text in PDF is done with a text rendering mode, not a colour. Text rendering mode 3 is 'neither stroke nor fill'. So changing the colour won't help you if this is how the text is drawn. Of course we can't tell if this is how the text has been drawn (but I suspect it is) because you haven't made the PDF file publicly available. In almost all cases if you want to discuss a particular file the best thing to do is make it public.
Secondly, you can't use PostScript to change a PDF file (well, you could write a PostScript program to interpret the PDF file, but that would be hard...)

How to cut out numbers from an image dynamically?

I've got to this stage:
I can find the numbers in the above image, but I need to cut them out so I can retain their order etc. However, as the number increases, the spacing and the position of the digits change.
So I think it should find a white pixel, then continue until it finds a solid black column, and then use those points to do a simple cut. Any help would be great.
A simple solution would be this:
Find the topmost horizontal line which contains white pixels
From that line find the first horizontal line which contains only black pixels
Those two lines are your upper and lower borders.
Between these borders proceed like this:
Find the leftmost vertical line which contains white pixels
From that line find the last vertical line which contains only black pixels and which comes directly after a line with white pixels.
Those two lines are your left and right borders.
The steps to separate single numbers can be performed analogously.
If you need to identify which numbers are in your picture, I recommend using specialized computer vision libraries.
Some VB.net pseudo code to get you going:
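Imports System.Drawing

' Returns the y coordinate of the topmost row containing a white pixel, or -1 if there is none.
Function FindTopBorder(image As Bitmap) As Integer
    For y = 0 To image.Height - 1
        For x = 0 To image.Width - 1
            Dim pixel As Color = image.GetPixel(x, y)
            ' Pure white only; loosen this check for anti-aliased or noisy images
            If pixel.R = 255 AndAlso pixel.G = 255 AndAlso pixel.B = 255 Then
                Return y
            End If
        Next
    Next
    ' Just in case there are no white pixels, or use an exception instead
    Return -1
End Function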
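The complete procedure, including the split into single digits, can be sketched in Python. This is a minimal illustration and not the answerer's code: split_digits is a hypothetical helper, the image is assumed to be already converted to a list of rows in which 1 means white (digit) and 0 means black (background), and the borders are returned as inclusive coordinates of the outermost rows and columns that contain white.

```python
def split_digits(grid):
    """Return one (left, top, right, bottom) box per run of columns with white pixels."""
    rows_with_white = [y for y, row in enumerate(grid) if any(row)]
    if not rows_with_white:
        return []
    top, bottom = rows_with_white[0], rows_with_white[-1]
    width = len(grid[0])

    # For each column, check for white pixels between the top and bottom borders.
    column_has_white = [
        any(grid[y][x] for y in range(top, bottom + 1)) for x in range(width)
    ]

    boxes = []
    start = None
    for x, has_white in enumerate(column_has_white):
        if has_white and start is None:
            start = x                      # left border of a digit
        elif not has_white and start is not None:
            boxes.append((start, top, x - 1, bottom))  # all-black column ends it
            start = None
    if start is not None:                  # digit touches the right image edge
        boxes.append((start, top, width - 1, bottom))
    return boxes

image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(split_digits(image))
```

The boxes come out in left-to-right order, which keeps the digits in sequence. Digits that touch each other would end up in one box; that case needs something like the connected-component approach.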
I would start looking into Connected component segmentation. You find a pixel which is within a character (number). Then run the connected component algorithm which finds all connected pixels under specific set of rules (e.g. slight deviation in color, stop at hard borders etc).
http://en.wikipedia.org/wiki/Connected-component_labeling
If you can use libraries, I'm sure OpenCV or similar libraries support this out of the box.
//edit
I see you need VB.net. Probably it is easiest to port some algorithm to VB or create one yourself.
See e.g. http://www.codeproject.com/Articles/336915/Connected-Component-Labeling-Algorithm
What to expect
Input
An image containing two shapes:
Output
Now each is separated into single images.
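For reference, connected-component labeling itself is short enough to sketch. The following Python example is a minimal illustration using a breadth-first flood fill with 4-connectivity; it is not the algorithm from the linked article, label_components is a hypothetical helper, and the image is assumed to be a list of rows in which 1 marks a foreground pixel.

```python
from collections import deque

def label_components(grid):
    """Return (component count, label grid); label 0 means background."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not grid[y][x] or labels[y][x]:
                continue
            # Unlabeled foreground pixel: start a new component and flood it.
            count += 1
            labels[y][x] = count
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and grid[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((nx, ny))
    return count, labels

shapes = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
count, labels = label_components(shapes)
print(count)
```

Each label can then be cut out using the bounding box of its pixels. Sorting the components by their leftmost x coordinate preserves the order of the digits.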

Printing PDF differs on different printers

I created an A4 PDF document using iTextSharp. It contains a simple table in which every cell is in fact a label on the page. I drew it using PdfContentByte, so the programmatic drawing area is 595 x 842 points. Thus I drew the rectangles (the table cells) using points as units.
I fixed the size and position (in points) of the printed content by checking printed pages on my printer. I used Acrobat Reader with these options: no scaling ('None') and the default paper size of 8.3" x 11.7".
Now if I print the same PDF on a different printer, the content (the table) is shifted to the left and/or towards the top. So the distances between the page's edges and the outer frame of the table differ between printers. Sometimes the content is cut off, but I know that different printers have different printable areas, so that is understood.
But I cannot understand why it is shifted. Are there other parameters that I don't know about?