How to add text object to existing pdf - pdf

I have a source pdf which I am modifying by adding text objects. I am using "Incremental Updates" which is mentioned in the PDF specification. But while adding text objects using this method I am making some mistakes due to which the pdf doesn't render properly in Adobe Reader 11. When the pdf is opened and I double-click on it, the added text objects get deleted. I figured out that this is due to text annotation.
Now I want to know how a new text object can be added using an incremental update. How do the Contents and RC entries of a free text annotation have to be maintained?
Also is it possible to disable or delete the annotation so that my problem can be avoided easily? Because I want a simple pdf, I don't want annotation options.
The source pdf I am using is here.
The modified pdf after adding text object is here.
I am not sure that source pdf is itself proper according to pdf specification.

First off, let me show you how easy things are if you can use a decent PDF library. I use iTextSharp as an example, but the same can also be done with others like PDFBox or PDFNet (already mentioned by @Ika in his answer):
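using iTextSharp.text;
using iTextSharp.text.pdf;

// Open the source document and stamp new content onto page 1.
PdfReader reader = new PdfReader(sourcePdf);
using (PdfStamper stamper = new PdfStamper(reader, targetPdfStream)) {
    Font FONT = new Font(Font.FontFamily.HELVETICA, 12, Font.BOLD, new GrayColor(0.75f));
    // Canvas that is drawn on top of the existing page content
    PdfContentByte canvas = stamper.GetOverContent(1);
    ColumnText.ShowTextAligned(
        canvas,
        Element.ALIGN_LEFT,
        new Phrase("Hello people!", FONT),
        36, 540, 0   // x, y, rotation in degrees
    );
}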
(Derived from the Webified iTextSharp Example StampText.cs explained in chapter 6 of iText in Action — 2nd Edition.)
(Which PDF library you choose depends on your general requirements and the available license models.)
If, in spite of the ease of use of such PDF libraries, you insist on doing it manually, here are some remarks:
First you have to find the Page dictionary of the page you want to add content to. Depending on the type of PDF this already might require decompression of object streams etc. but in your sample modified1.pdf that is not necessary:
7 0 obj
<</Rotate 90
/Type /Page
/TrimBox [ 9.54 6.12 585.68 835.88 ]
/Resources 8 0 R
/CropBox [ 0 0 595.22 842 ]
/ArtBox [ 9.54 18.36 585.68 842 ]
/Contents [ 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R ]
/Parent 6 0 R
/MediaBox [ 0 0 595.22 842 ]
/Annots 17 0 R
/BleedBox [ 9.54 6.12 585.68 835.88 ]
>>
endobj
You see the array of references to content streams. This is where you have to add new page content to. You can manipulate an existing stream or create a new stream and add it to that array.
(Most PDFs have their content stream compressed. For the general case, therefore, you'd have to decompress a stream before you can work on it. Thus, in my eyes, the easier way would be to start a new stream.)
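For the compressed case mentioned above, here is a minimal Python sketch. It assumes the stream dictionary names only /FlateDecode as filter and uses no predictor; the sample bytes are made up for illustration and are not taken from the PDF in question.

```python
import zlib

def inflate_stream(raw: bytes) -> bytes:
    """Decompress the bytes found between 'stream' and 'endstream'
    of an object whose dictionary says /Filter /FlateDecode."""
    return zlib.decompress(raw)

# Illustrative data only: compress a tiny content stream, then recover it.
original = b"BT /F1 12 Tf 36 540 Td (Hello) Tj ET\n"
raw_stream = zlib.compress(original)
print(inflate_stream(raw_stream))
```

Other filters (LZW, ASCII85, filter chains) would need additional handling, which is one more reason why adding a fresh stream is the simpler route.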
You chose to manipulate the last referenced stream 16 0 which in your PDF is uncompressed:
16 0 obj
<</Length 37 0 R>>
stream
S 1 0 0 1 13.183 0 cm 0 0 m
[...]
0 10 -10 -0 506.238 342.629 Tm
.13333 .11765 .12157 scn
-.0002 Tc
.0006 Tw
(the Bank and branch on which cheque is drawn\).)Tj
/F1 2 Tf
-15.1279 10.9462 Td
(abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~!##$%^&*aaaaaaaaaaaaa)Tj
/F2 1 Tf
015.1279 01.9462 Td
(ANAabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789)Tj
ET
endstream
endobj
Your additions, I gather, are the two 3-liners at the bottom which first select a font, then position the insertion point and finally print a selection of letters.
Now you say you added the text abc..z and ABC...Z just for testing, but the letters b, j, k, q, v etc. are not appearing in the PDF. The problem becomes even more visible for your second addition of letters; here only the capital 'A' and 'N' are displayed.
This is due to the fact that the fonts in question are embedded into the PDF --- fonts are embedded into PDFs to allow PDF viewers on systems which don't have the font in question, to display the PDF --- but they are not completely embedded, only the subset of characters required from that font.
Let's look for the font F2 for which only 'N' and 'A' appear:
According to the page object, the page resources can be found in object 8 0:
8 0 obj
<</Font <</F1 45 0 R /TT2 46 0 R /F2 47 0 R>>
/ExtGState <</GS2 48 0 R>>
/ProcSet [ /PDF /Text ]
/ColorSpace <</Cs6 49 0 R>>
>>
endobj
So F2 is defined in 47 0:
47 0 obj
<</Subtype /Type1
/Type /Font
/Widths [ 722 250 250 250 250 250 250 250 250 250 250 250 250 722 ]
/Encoding 52 0 R
/FirstChar 65
/FontDescriptor 53 0 R
/ToUnicode 54 0 R
/BaseFont /ILBPOB+TimesNewRomanPSMT-Bold
/LastChar 78
>>
endobj
In the referenced ToUnicode map 54 0 you see
54 0 obj
<</Length 55 0 R>>stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AAAAAA+F2+0) /Ordering (T1UV) /Supplement 0 >> def
/CMapName /AAAAAA+F2+0 def
/CMapType 2 def
1 begincodespacerange <41> <4e> endcodespacerange
2 beginbfchar
<41> <0041>
<4e> <004E>
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
endstream
endobj
In this mapping you see that only the character codes 0x41 'A' and 0x4e 'N' are mapped.
In your document the font is used only to print "NA" in the amount table cells and for nothing else. Thus, only those two letters 'N' and 'A' are embedded, which results in your addition with that font only outputting these letters.
Thus, to successfully add text to the page you either have to check the font resources associated with the page for the glyphs they provide (and restrict your additions to those glyphs) or you have to add your own font resource.
As the presence of characters in the encoding often is not as easy to see as it is here (ToUnicode is optional), I would propose that you add your own font resources. The PDF specification ISO 32000-1 explains how to do that.
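For the simple case shown here, the check can be sketched in a few lines of Python. This is only a rough aid under stated assumptions: it reads `bfchar` entries only (no `bfrange`), it assumes single-byte codes that coincide with the ASCII values of the wanted characters, and a ToUnicode entry does not strictly prove that a glyph is embedded. The function name `mapped_codes` is made up for this illustration.

```python
import re

def mapped_codes(cmap_text: str) -> dict:
    """Return {character code: Unicode string} for all bfchar entries.
    bfrange sections are ignored in this sketch."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, uni in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping

# The relevant part of the ToUnicode stream 54 0 quoted above
cmap = """
1 begincodespacerange <41> <4e> endcodespacerange
2 beginbfchar
<41> <0041>
<4e> <004E>
endbfchar
"""
codes = mapped_codes(cmap)
wanted = "ANAabc"
missing = sorted({ch for ch in wanted if ord(ch) not in codes})
print(codes)    # {65: 'A', 78: 'N'}
print(missing)  # ['a', 'b', 'c']
```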
Furthermore you state that the x and y position of the text is not displayed properly in the PDF. While you don't say what exactly you mean, you should be aware that in the content stream you can apply affine transformations to the coordinate system of the page, i.e. stretch, skew, rotate, and move the axes.
If you want to use the original coordinate system and not depend on the coordinates to be proper at your additions, you should add an initial content stream to the page containing a q operator (to save the current graphics state on the graphics state stack) and start your additions in a new final content stream with a Q operator (to restore the graphics state by removing the most recently saved state from the stack and making it the current state).
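The q/Q wrapping described above could be generated along these lines. This is a Python sketch, not a complete writer: the function names are made up, the font resource name passed in must already exist in the page's /Resources, and page rotation (/Rotate) is not compensated for.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def wrapper_streams(font: str, size: float, x: float, y: float, text: str):
    """Return (first, last) content stream bodies as bytes.

    `first` goes to the front of the /Contents array, `last` to its end.
    """
    first = b"q\n"       # save the initial graphics state
    last = (
        "Q\n"            # back to the page's initial graphics state
        "q\n"            # keep our own changes contained
        "BT\n"
        f"/{font} {size:g} Tf\n"
        f"1 0 0 1 {x:g} {y:g} Tm\n"
        f"({escape_pdf_string(text)}) Tj\n"
        "ET\n"
        "Q\n"
    ).encode("latin-1")
    return first, last

first, last = wrapper_streams("Xi0", 12, 36, 540, "Hello (people)!")
print(last.decode("latin-1"))
```

The /Length of each new stream object is then simply `len(first)` and `len(last)`.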
EDIT As a sample I applied the Java equivalent of the C# code at the top to your modified1.pdf with append mode activated. The following objects were changed or added as a result:
The page object 7 0 has been updated:
7 0 obj
<</CropBox[0 0 595.22 842]
/Parent 6 0 R
/Contents[69 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R 14 0 R 15 0 R 16 0 R 70 0 R]
/Type/Page
/Resources<<
/ExtGState<</GS2 48 0 R>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/ColorSpace<</Cs6 49 0 R>>
/Font<</F1 45 0 R/F2 47 0 R/TT2 46 0 R/Xi0 68 0 R>>
>>
/MediaBox[0 0 595.22 842]
/TrimBox[9.54 6.12 585.68 835.88]
/BleedBox[9.54 6.12 585.68 835.88]
/Annots 17 0 R
/ArtBox[9.54 18.36 585.68 842]
/Rotate 90
>>
endobj
If you compare with your former version, you see that
two new content streams have been added, 69 0 at the start and 70 0 at the end;
the resources are not an indirect object anymore but instead are directly included here;
the resources contain a new Font resource Xi0 at 68 0.
Now let's look at the added objects.
This is the font resource for Helvetica-Bold named Xi0 at 68 0:
68 0 obj
<</BaseFont/Helvetica-Bold
/Type/Font
/Encoding/WinAnsiEncoding
/Subtype/Type1
>>
endobj
Non-embedded, standard 14 font resources are not complicated at all...
Now there are the additional content streams. iText does compress them, but I'll show them in an uncompressed state here:
69 0 obj
<</Length 1>>stream
q
endstream
endobj
70 0 obj
<</Length 106>>stream
Q
q
0 1 -1 0 595.22 0 cm
q
BT
1 0 0 1 36 540 Tm
/Xi0 12 Tf
0.75 g
(Hello people!)Tj
0 g
ET
Q
Q
endstream
endobj
So the new content stream at the start stores the current graphic state, and the new one at the end retrieves that stored state, changes the coordinate system, positions for text insertion, selects font, font size, and the fill colour, and finally prints a string.
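As for the incremental update itself: changed and new objects are appended after the original bytes, followed by a cross-reference section covering just those objects and a trailer whose /Prev entry holds the offset of the previous cross-reference section. A minimal Python sketch follows. It assumes a classic xref table (no cross-reference streams) and generation 0 for all objects; the function name, the placeholder base file and the object body are made up for illustration.

```python
import re

def append_update(pdf: bytes, objects: dict, root_ref: str, size: int) -> bytes:
    """Append `objects` ({object number: body bytes}) as an incremental update.

    `root_ref` (e.g. "1 0 R") and `size` (highest object number + 1)
    are supplied by the caller.
    """
    # The last startxref of the existing file becomes /Prev of the new trailer.
    prev = int(re.findall(rb"startxref\s+(\d+)", pdf)[-1])
    out = bytearray(pdf)
    if not out.endswith(b"\n"):
        out += b"\n"
    offsets = {}
    for num in sorted(objects):
        offsets[num] = len(out)          # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n"
    for num in sorted(offsets):
        # One subsection per object keeps the sketch simple.
        out += b"%d 1\n" % num
        out += b"%010d 00000 n \n" % offsets[num]   # entries are 20 bytes
    out += b"trailer\n<</Size %d/Root %s/Prev %d>>\n" % (size, root_ref.encode(), prev)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

# Placeholder input, not a real PDF: only the final startxref matters here.
base = b"%PDF-1.4\n% (original body elided)\nstartxref\n9\n%%EOF\n"
new_objects = {70: b"<</Length 1>>\nstream\nQ\nendstream"}
updated = append_update(base, new_objects, root_ref="1 0 R", size=71)
print(updated[len(base):].decode("latin-1"))
```

An object that replaces an existing one (like the page object 7 0 above) is simply written again under the same number; the newer cross-reference entry takes precedence.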

Related

Convert png/jpeg to pdf 1.0 using ASCII85Encode and DCTEncode

I have been asked to modify a php script (written ages ago and running on a server) that outputs complex pdf 1.0 files, so that png and jpeg images could be added to the layout.
Since I have neither the time nor the budget to rewrite that script with a pdf API, I thought I'd try to use Imagick to convert those images into pdf, and then insert the code in the pdf 1.0 files.
The code is pretty straightforward :
$images = array('myfile.png');
$pdf = new Imagick($images);
$pdf->setImageFormat('pdf');
$pdf->writeImages('myfile.pdf', true);
But the problem is that Imagick produces pdf 1.4 which makes it impossible to insert in the pdf 1.0.
Ideally, the pdf 1.0 code for a jpeg should be as follows :
10 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /I1
/Width 41
/Height 32
/BitsPerComponent 8
/ColorSpace /DeviceRGB
/Length 11 0 R
/Filter [ /ASCII85Decode /DCTDecode ]
/DecodeParms [ null null ]
>>
stream
s4IA>!"M;*Ddm8XA,lT0!!3,S!/(=^#mq"I#n7=Q%MTB`&/6#u'GM3"-RKoP
+!`0d0J+n0/M].F4Z>5W3&*K\5s[k-9he>U6qpBM9he>V9he=_%M''a&JQQ-
(+(jp/2T1X9gM'>9he>V9he>V9he>V9he>V9he>V9he>V9he>V9he>V9he>V
9he>V9he>Vs1eUH#QPtI.0BS_!!3`5!tbS6_uZS4!!*6(!<E3%!<<*"z
!rr?'"9eu7#RLhG!<<-(!<E3%!<E0#z!!*'$!sAc3#7(VC$P3:="9AT+
"9J`2"pbG9%:K8;!YGM;+VHL55uD&(,&r/h"r*2nYsKZ\'iJKs1r43a6O+p#
#ZK.?#rjLCiM,kJK-j!M<+JG7UNAC1(t)FDAb*0\_p`bgo0t*lUkQ1#`73l?
V7":mjn2YdG(u<[[`6n\p,>KCB6T,tVmj^ukP$Aa86BPMLmY-NaOo_O#oP0P
8QfbQM4(?Rak>qS$5tBT8m5tUMOLQVb1c.W&HDk6!<NB,!sA`2#R1M<B)qu6
&Ha12&d1Kt#<!J)"YtXk'VVcu,Jd86d3S3jiGsO56W4_0F#50I(#e-)K*N_\
#f9!XP>n:nA49KVFCjJ&Z\66FFlW'_Pba#?Q,M25oVJt7e`HI)Ap/opVRFLq
k4U/]7os>ILR4pJa4KMK(aq#7=D2r8R&IO9f]`):)(#R;=_W/<RAma=g$/;>
)Cdd?>&&A#R]<sAg?SP7h#IQX"97'T$j-M1!YGMH!'^JNqTi0JV2B,5/+1D&
qY&K_UhL'P,TcprL_clM-<<$9%D*\4h-7C1%"o=_,sVFVoNUaa*h+di_3NU"
k;_r28P"/>FaZaaS#\$l[TWuQD%PouR$5i#aWmOL8La\eoR[)TFTj'NH!aTe
AZj,=2nPqt*do^$hu+mArd5r=d8#&f\R]0_p_c&NG>`4q?rq,4:FE\m52Uqh
4!#,#s3g/P5gS_5hq%As9&pS#O"I;X:G2YI4]AMl76M#^f!I<-!!E9]!AuC'
VuP6<Jj9rj!#9kjDc+-LB"+[[X5(pBaihl+BF!r_ge%#Xj\Di3%[A,;qE6u%
fIb'NU&P*d4&Y+u:G^pUG!a>cs4$9R?W?m12q$l3%[Pc';nn=4Di7ES1jFID
8p5n-4U*N5)0ki36cG<m2`8L_9hJBLrrE)Lms:m,VlA\bH_4e_g`59W[dO=i
:Q\$`;LD_cHqG>Q4`.4-E7P9q"XQ`9s46gqPfB/,VRG)T)d=9!=9&Ha&lUWO
rrE)P~>
endstream
endobj
So I thought I could encode straight the png/jpeg file using ASCII85Encode and DCTEncode to produce the code to encapsulate in pdf 1.0, but my php skills aren't good enough... So before I start some research, I'd like to know if it's the way to go. And I will appreciate any advice on that matter.
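On the encoding idea itself, a hedged sketch in Python (the same steps exist in PHP): the bytes of a JPEG file are already DCT-encoded, so only the ASCII85 step has to be applied; a PNG cannot be passed through DCTDecode and would first have to be converted to JPEG. The function name is made up, the width, height and 8-bit RGB colour space are assumed to be known by the caller, and /Length is written directly rather than as an indirect reference.

```python
import base64

def jpeg_image_xobject(jpeg: bytes, width: int, height: int, name: str = "I1") -> bytes:
    """Wrap raw JPEG file bytes as an ASCII85-encoded image XObject body."""
    # adobe=True adds "<~" in front and "~>" at the end;
    # the PDF filter only expects the trailing "~>", so drop the first two bytes.
    encoded = base64.a85encode(jpeg, wrapcol=60, adobe=True)[2:]
    header = (
        "<<\n/Type /XObject\n/Subtype /Image\n"
        f"/Name /{name}\n/Width {width}\n/Height {height}\n"
        "/BitsPerComponent 8\n/ColorSpace /DeviceRGB\n"
        f"/Length {len(encoded)}\n"
        "/Filter [ /ASCII85Decode /DCTDecode ]\n>>\n"
    ).encode("ascii")
    return header + b"stream\n" + encoded + b"\nendstream"

# Usage (file name and dimensions are placeholders):
# with open("img.jpg", "rb") as fh:
#     print(jpeg_image_xobject(fh.read(), 41, 32).decode("ascii"))
```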
As mentioned, there is no need to use encoded streams; the stream data can be supplied in many ways, for example as plain hex text typed in MS Notepad, as done here. KenS may recognise the MuPDF signature left by some of the manipulation/conversion.
Obj 4 is the content stream that scales the 5x5 image to 3.75 x 3.75 units.
Obj 6 holds the 75 bytes of image data (5 x 5 pixels, 3 bytes each) as ASCII hex rather than as compressed binary.
Obj 7 is the related mask (not always needed).
The critical point is that the decimal byte positions, and especially the lengths, must be close enough to the true values for the file to count as valid, because some readers may not accept an invalid file.
%PDF-1.0
%µ¶
1 0 obj
<</Type/Catalog/Pages 2 0 R>>
endobj
2 0 obj
<</Type/Pages/Count 1/Kids[3 0 R]>>
endobj
3 0 obj
<</Type/Page/MediaBox[0 0 3.75 3.75]/Rotate 0/Resources<</XObject<</Img3 6 0 R>>>>/Contents 4 0 R/Parent 2 0 R>>
endobj
4 0 obj
<</Length 34>>
stream
q
3.75 0 0 3.75 0 0 cm
/Img3 Do
Q
endstream
endobj
6 0 obj
<</Length 153/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB/Filter/ASCIIHexDecode>>
stream
ff2020ffffff20a020ffffff2020ffffffffffffffffffffffffffffffffff20
ffffffff404040ffffff20ffffffffffffffffffffffffffffffffff2020ffff
ffffffff20ffffff202020>
endstream
endobj
7 0 obj
<</Length 11/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 1/ColorSpace/DeviceGray/Filter/ASCIIHexDecode>>
stream
ffffffffff>
endstream
endobj
xref
0 8
0000000004 65536 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000000 00002 f
0000000243 00000 n
0000000326 00000 n
0000000647 00000 n
trailer
<</Size 8/Info<</Producer(Me)>>/Root 1 0 R>>
startxref
814
%%EOF
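Since the offsets and lengths are the error-prone part when writing such a file by hand, they can be computed instead. The following Python sketch serializes a list of object bodies and derives the xref table from the actual byte positions; the function names and the tiny one-page document are made up for illustration, and all objects are assumed to be in use with generation 0.

```python
def stream_object(dictionary: bytes, data: bytes) -> bytes:
    """Stream object body whose /Length is computed from the data."""
    return (b"<<" + dictionary + b"/Length %d>>\nstream\n" % len(data)
            + data + b"\nendstream")

def build_pdf(objects: list, header: bytes = b"%PDF-1.0\n") -> bytes:
    """Serialize objects 1..n (bodies as bytes) with a matching xref table."""
    out = bytearray(header)
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                 # where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"              # head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off         # each entry is 20 bytes
    out += b"trailer\n<</Size %d/Root 1 0 R>>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

pdf = build_pdf([
    b"<</Type/Catalog/Pages 2 0 R>>",
    b"<</Type/Pages/Count 1/Kids[3 0 R]>>",
    b"<</Type/Page/MediaBox[0 0 3.75 3.75]/Contents 4 0 R/Parent 2 0 R>>",
    stream_object(b"", b"0 0 m 3 3 l S"),        # a single diagonal line
])
print(pdf.decode("latin-1"))
```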
Thank you both for your help. I managed to insert png/jpeg images in the pdf as follows :
// converts png to jpg and removes transparency
$image = new Imagick('img.png');
$image->setImageBackgroundColor('white');
$image = $image->flattenImages();
$image->writeImage('img.jpg');
// converts jpg to bmp
$image = array('img.jpg');
$image = new Imagick($image);
$image->setImageFormat('bmp');
$image->writeImages('img.bmp', true);
// converts bmp to pdf
$image = array('img.bmp');
$image = new Imagick($image);
$image->setImageFormat('pdf');
$image->writeImages('img.pdf', true);
Then I get pdf code of my image in the format "/Filter [ /ASCII85Decode ]" that I can select and insert into the existing pdf 1.0.
Best,

PDF generation — How to merge multiple stream objects?

I'm currently generating PDF documents without the use of an external library, and it has been going well so far. I've written the document shown below with a text editor (vim) and it renders the expected results in at least two distinct PDF viewers (evince & gsview, running Linux).
This document produces three squares at top of the page, coming in different sizes, widths and colors.
My question is : is there a way to merge two stream objects into a new single one or, in other words, is there a way to compose sophisticated objects starting from simple ones, so we can easily refer to these composite objects, multiple times if needed ?
In the given example, object 5 0 obj is drawing a square, and following ones are just applying colors and coordinates transformations (through a matrix).
The PDF reference manual states that multiple stream contents passed as an array to page object's /Contents parameter are concatenated and processed as a single continuous stream, which totally does the trick… as long as the document remains small and simple!
In this same example, the /Contents array is indirectly passed through object 4 0 obj, which refers three times to 5 0 R, to draw the squares.
The ideal here would be to define three different objects, each referring to 5 0 R by itself, then invoke only these objects, a single time each, from the Contents array.
I tried adding subarrays inside it, which could in turn be embedded into dedicated objects and referenced indirectly, but it unfortunately doesn't work. :-(
Many thanks to anyone who can help (or tries to)!
PS: I'm doing it because I'm interested in the format itself and would like to produce some autogenerated documents from small scripts. Also, I'll probably embed them into a weakly powered appliance and I cannot afford relying on dozens of megabytes in dependencies.
But before this, I still tried to do that too, using PHP with TCPDF. If there's already some facilities dedicated to this that I would have missed, this is relevant to my interests too. :-)
Small.pdf (hand made PDF file)
%PDF-1.7
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/Count 1
/Kids [ 3 0 R ]
>>
endobj
3 0 obj
<<
/Type /Page
/MediaBox [ 0.000000 0.000000 1000.000000 1414.213562 ]
/Contents 4 0 R
>>
endobj
4 0 obj
% A simple array, just to avoid embedding it directly in /Page object (3 0 R here)
[
6 0 R 5 0 R % Red square
7 0 R 5 0 R % Green square
8 0 R 5 0 R % Blue square (tilted)
]
endobj
5 0 obj
% Draws a square, centered by default on lower left corner
<<
/Length 43
>>
stream
+20 +20 m
+20 -20 l
-20 -20 l
-20 +20 l s Q
endstream
endobj
6 0 obj
<<
/Length 63
>>
stream
/DeviceRGB CS
q
1.0 0.0 0.0 SC
2.0 w
1 0 0 -1 60 1354.213562 cm
endstream
endobj
7 0 obj
<<
/Length 49
>>
stream
q
0.0 1.0 0.0 SC
1.0 w
2 0 0 -2 190 1334.213562 cm
endstream
endobj
8 0 obj
<<
/Length 83
>>
stream
q
0.0 0.0 1.0 SC
5.0 w
0.707106781 0.707106781 -0.707106781 0.707106781 110 1250 cm
endstream
endobj
xref
0 9
0000000000 65535 f
0000000010 00000 n
0000000079 00000 n
0000000168 00000 n
0000000296 00000 n
0000000513 00000 n
0000000674 00000 n
0000000796 00000 n
0000000905 00000 n
trailer
<<
/Size 9
/Root 1 0 R
/ID [ <0000000000> <0000000001> ]
>>
startxref
01047
%%EOF
What you are looking for are form XObjects.
The pdf specification ISO 32000-1 characterizes them like this:
A form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects. A form XObject may be painted multiple times - either on several pages or at several locations on the same page - and produces the same results each time, subject only to the graphics state at the time it is invoked.
For details please read section 8.10 of the specification.
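Applied to the squares from the question, this could look roughly as follows. It is a Python sketch that only generates the relevant bytes: the resource name Sq, the helper names and the /BBox (chosen a bit larger than the square so that thick strokes are not clipped) are assumptions made for this illustration. The page additionally needs a /Resources entry such as `<< /XObject << /Sq 5 0 R >> >>` pointing at the form object.

```python
def form_xobject(content: bytes, bbox=(-25, -25, 25, 25)) -> bytes:
    """Body of a form XObject wrapping `content` (drawing operators only)."""
    header = b"<</Type/XObject/Subtype/Form/BBox[%d %d %d %d]/Length %d>>" % (
        *bbox, len(content))
    return header + b"\nstream\n" + content + b"\nendstream"

# The square itself, without any q/Q or colour operators.
SQUARE = b"20 20 m\n20 -20 l\n-20 -20 l\n-20 20 l\ns"

def place(name: str, rgb, width, matrix) -> bytes:
    """Page-content snippet: set state, move the origin, paint the form."""
    colour = " ".join(str(v) for v in rgb)
    m = " ".join(str(v) for v in matrix)
    return (f"q\n/DeviceRGB CS\n{colour} SC\n{width} w\n"
            f"{m} cm\n/{name} Do\nQ\n").encode("ascii")

# One content stream that paints the same form three times.
page_content = (
    place("Sq", (1, 0, 0), 2, (1, 0, 0, -1, 60, 1354.213562))
    + place("Sq", (0, 1, 0), 1, (2, 0, 0, -2, 190, 1334.213562))
    + place("Sq", (0, 0, 1), 5,
            (0.707106781, 0.707106781, -0.707106781, 0.707106781, 110, 1250))
)
print(form_xobject(SQUARE).decode("ascii"))
print(page_content.decode("ascii"))
```

Colour and line width set before `Do` carry into the form because it is painted with the graphics state in effect at the time of invocation, as the quoted passage says.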

pdf 1.2: How to display a graphical image?

I am trying to learn the structure of a PDF document from a guide. I could add the text and shapes with lines, but I am having a problem displaying the image.
The code I am writing to display an image is (on page 54):
%PDF-1.2
% based on e08.pdf
1 0 obj
<<
/Type /Page
/Parent 5 0 R
/Resources 3 0 R
/Contents 2 0 R
>>
endobj
2 0 obj
<< /Length 51 >>
stream
BT
/F1 24 Tf
1 0 0 1 260 254 Tm
/CS1 cs
63 127 127 sc
(Hello World)Tj
ET
100 0 127 sc
/CS2 CS
0 0 1 SC
315 226 m
299 182 l
339 208 l
291 208 l
331 182 l
b
100 0 0 100 65 326 cm
BI /W 36 /H 32 /BPC 8
/CS /DeviceGray
ID
ççççççççççççÕˇˇˇˇˇˇˇˇˇÕççççççççççç
çççççççççççç͡ˇˇˇˇˇˇˇˇÍçççççççççççç
ççççççççççç¢ˇˇˇˇˇˇˇˇˇˇˇ¢ççççççççççç
çççççççççç瑡ˇˇˇˇˇˇˇˇˇˇ‘ççççççççççç
ççççççççççˇˇˇˇˇˇˇˇˇˇˇîçççççççççç
ççççççççççøˇˇˇˇˇˇˇˇˇˇˇˇˇøçççççççççç
ççççççççççÒˇˇˇˇˇˇˇˇˇˇˇˇˇÒçççççççççç
çççççççç籡ˇˇˇˇˇˇˇˇˇˇˇˇˇˇ±ççççççççç
çççççççç瀡ˇˇˇˇˇˇˇˇˇˇˇˇˇˇ€ççççççççç
ççççççççõˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇõçççççççç
ççççççççDˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇDçççççççç
çççççççç¯ˇˇˇˇˇˇˇˇ¯ˇˇˇˇˇˇˇˇ¯çççççççç
çççççççÕˇˇˇˇˇˇˇˇˇÕˇˇˇˇˇˇˇˇˇÕçççççç
çççççç炡ˇˇˇˇˇˇˇÍç͡ˇˇˇˇˇˇˇ‚ççççççç
çççççç¢ˇˇˇˇˇˇˇˇˇÕçÕˇˇˇˇˇˇˇˇˇ¢ççççç
ççççç瑡ˇˇˇˇˇˇˇ¯ççç¯ˇˇˇˇˇˇˇˇ‘çççççç
çççççî¯ˇˇˇˇˇˇˇˇDçççÕˇˇˇˇˇˇˇˇ¯îççççç
çççççøˇˇˇˇˇˇˇˇˇõçççõˇˇˇˇˇˇˇˇˇøççççç
çççççÒˇˇˇˇˇˇˇˇ‘çççç瀡ˇˇˇˇˇˇˇÒççççç
ççç癡ˇˇˇˇˇˇˇˇ™çççç癡ˇˇˇˇˇˇˇˇ™ççç
ççç瑡ˇˇˇˇˇˇˇÒçççççççÒˇˇˇˇˇˇˇˇ‘çççç
çççõˇˇˇˇˇˇˇˇˇÕçççççççøˇˇˇˇˇˇˇˇˇõççç
çççDˇˇˇˇˇˇˇˇ¯îçççççççî¯ˇˇˇˇˇˇˇˇDççç
çççÒˇˇˇˇˇˇˇˇÕçççççççç瑡ˇˇˇˇˇˇˇÒççç
ççÕˇˇˇˇˇˇˇˇˇõçççççççççõˇˇˇˇˇˇˇˇˇÕçç
ç炡ˇˇˇˇˇˇˇÒDDDDDDçççç炡ˇˇˇˇˇˇˇ‚çç
çõˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ™ççççÕˇˇˇˇˇˇˇˇˇõç
瑡ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ‘çççççÒˇˇˇˇˇˇˇˇ‘ç
î¯ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇîççççDˇˇˇˇˇˇˇˇ¯î
ÕˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇøççççˇˇˇˇˇˇˇÕ
͡ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇÒçççç瑡ˇˇˇˇˇˇˇÒ
ˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇˇ™ççç癡ˇˇˇˇˇˇˇˇ
EI
endstream
endobj
3 0 obj
<<
/ProcSet[/PDF/Text]
/Font <</F1 4 0 R>>
/ColorSpace
<<
/CS1
[
/Lab
<<
/Range [-128 127 -128 127]
/WhitePoint [ 0.951 1 1.089]
>>
]
/CS2
[
/CalRGB
<<
/Gamma [2.222 2.222 2.222]
/Matrix
[
0.412 0.213 0.019
0.358 0.715 0.119
0.181 0.072 0.951
]
/WhitePoint [0.951 1 1.089]
>>
]
>>
>>
endobj
4 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont/Helvetica
>>
endobj
5 0 obj
<<
/Type /Pages
/Kids [ 1 0 R ]
/Count 1
/MediaBox [ 0 0 612 446 ]
>>
endobj
6 0 obj
<<
/Type /Catalog
/Pages 5 0 R
>>
endobj
trailer
<< /Root 6 0 R >>
What I expect from it is:
But when I open the file in Acrobat Reader DC 2015, I see the text and the star, but not the image logo.
Note:
I have formatted the code myself, so please let me know if it is not proper.
I assume that there are problems with the characters that are used to show the Adobe logo. I guess the characters should be binary data, and when the PDF is generated, they are converted to those symbols.
Here the author is using pdf 1.2, that is pretty old, but as far as I know it should not make a problem, since pdf is backward compatible.
My question:
Why can't I see the desired result shown in the image when using this code?
How do I get the codes needed in a PDF to display an image, i.e. the textual representation of the binary data (or even the binary itself) that I have used in my code?
Update:
As mentioned in the comment, cross reference table does not exist in my code, but when I generated that with pdftk tool, the result was the same.
The major problem with your inline image is that you try to create a binary data block using text.
The data between ID and EI is interpreted as a stream of a single (!) white space character followed by height x width x bits-per-component/8 x number-of-components data bytes, i.e. in your case (according to /W 36 /H 32 /BPC 8 /CS /DeviceGray) 32 * 36 * 8/8 * 1 bytes.
This is not the case in your sample. In your question you have that data block indented which adds numerous bytes to the stream. Furthermore you have lines containing different numbers of bytes (even though they may look equally long in an editor).
Your binary download differs substantially from your question text, e.g. instead of the ˇ characters filling the A you have . characters there. It suffers from unequal line lengths, too.
I assume you use a text editor to write that PDF which is a bad choice because you do not correctly see the real number of bytes used. Especially problematic are control characters and byte values not associated with a character in your encoding.
Let's therefore try something more simple and only use characters in the ASCII range and a smaller, simpler form:
Depending on your end-of-line sequence (their bytes are part of the data bytes!!) use either of the following two samples
in case of single byte end-of-line sequences (only CR or only LF, typical for Mac or Unix):
BI /W 5 /H 4 /BPC 8
/CS /DeviceGray
ID
zzzz
z..z
z..z
zzzz
EI
in case of two byte end-of-line sequences (CR LF, typical for DOS / MS Windows):
BI /W 6 /H 4 /BPC 8
/CS /DeviceGray
ID
zzzz
z..z
z..z
zzzz
EI
Do take care not to add any leading or trailing spaces! They would also be interpreted as data bytes!
The result looks like this in the first case
and this in the second case
The dark bar(s) on the right / on both right and left is/are due to the line ending character(s).
If you don't want such bars, you have to get rid of the line endings, e.g.
BI /W 4 /H 4 /BPC 8
/CS /DeviceGray
ID zzzzz..zz..zzzzz EI
resulting in
That all being said, please do yourself a favor and
don't create PDFs as text, e.g. in a text editor! While they can be understood to a certain degree in a text viewer, creating them in a text editor very soon becomes hell;
don't use inline images but instead image resources! Inline images have proved to be troublesome and in PDF 2.0 will be deprecated or at least restricted to very small sizes; and finally
don't use the PDF 1.2 reference but instead the current PDF standard ISO 32000-1! Adobe personnel called the old PDF references not normative in nature, so you cannot count on what they say.
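The byte count rule for the data between ID and EI described above can be checked with a few lines of Python before pasting data into a file. This is only a sketch with a made-up function name; it covers unfiltered inline images and pads each row to a whole number of bytes.

```python
def inline_image_bytes(width: int, height: int, bpc: int, components: int) -> int:
    """Number of data bytes expected between 'ID<whitespace>' and 'EI'.
    Each row is padded to a whole number of bytes."""
    row_bytes = (width * bpc * components + 7) // 8
    return row_bytes * height

# The image from the question: /W 36 /H 32 /BPC 8 /CS /DeviceGray
print(inline_image_bytes(36, 32, 8, 1))  # 1152

# The first small sample: four rows of "zzzz" plus one end-of-line byte each
data = b"zzzz\nz..z\nz..z\nzzzz\n"
print(len(data) == inline_image_bytes(5, 4, 8, 1))  # True
```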

PDF Tj command with angle brackets?

I'm trying to figure out where in an uncompressed PDF v1.4 document the Times font is used.
The /Font object describing the Times font within the PDF is object 65 as follows:
65 0 obj
<</Type /Font
/Subtype /TrueType
/BaseFont /PXAAAD+TimesNewRoman,Italic
/FirstChar 1
/LastChar 35
/Widths [250 333 333 333 500 500 500 500 500 500 500 500 500 500 333 722 722 833 666 610 500 556 500 443 443 500 277 443 500 389 389 277 500 443 500]
/FontDescriptor 205 0 R
/ToUnicode 206 0 R>>
endobj
It refers to a /FontDescriptor object 205 to further define the Times font object, and to a /ToUnicode map in object 206 which describes byte-to-unicode character mapping. EDIT: After Ritsaert's initial answer to the question below, I'm adding the font's /ToUnicode object here, to provide the mentioned CMap.
206 0 obj
<</Length 208 0 R>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
35 beginbfchar
<01> <0020>
<02> <0028>
<03> <0029>
<04> <002d>
<05> <0030>
<06> <0031>
<07> <0032>
...
<23> <0101>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj
I've now tracked down the use of the Times font object to a /Page object (one of many) like the following one which refers to font object 65 through the /F4 reference in its page /Resources:
12 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 432 648]
/Contents 92 0 R
/Resources <</Font <</F1 62 0 R
/F3 64 0 R
/F4 65 0 R>>
/ProcSet [/PDF /Text]>>
/Group <</S /Transparency
/CS /DeviceRGB>>>>
endobj
The /Contents stream (object 92 in the PDF file) is then full of text objects (enclosed in BT and ET), none of which contains text, but instead they use angle brackets full of numbers. For example, here is the only reference to the Times font /F4 whose use I'm trying to find:
92 0 obj
<</Length 93 0 R>>
stream
...
BT
0.5020 g
72.0000 615.1512 Td
/F4 12.0000 Tf
<0605> Tj
ET
...
endstream
endobj
But what do the angle brackets and the number <0605> refer to? A specific glyph in the font table? Looking at the PDF reference and section 5.3.2 I can't find mention of the angle brackets.
EDIT: Given the above code and the accepted answer that <0605> is a hex encoding of text, the <0605> are the entries <06> and <05> in the CMap object 206 and thus map to unicodes <0031> and <0030> respectively. That means, the string <0605> refers to U+0031 (a "1") and to U+0030 (a "0"), such that the Times font is used for the string "10" on page object 12.
What is going on here:
In the content stream the Tj operator is given the string <0605> to draw. A string between < and > is a hex string, hence the characters with codes 6 and 5 are drawn. The notation is explained in section 3.2.3 of the linked PDF reference.
Just before the text-drawing operator, the font F4 is selected using the Tf operator.
The resource dictionary of the page maps F4 to the font object 65 (generation 0). This font object is a subsetted TrueType font in which the character codes 1..35 are defined. No /Encoding is specified, so the font program's built-in encoding applies. The embedded subset thus arranges the characters in a non-standard manner (this occurs quite often).
Now if you want to know how these codes are linked to Unicode characters: the font has a ToUnicode entry referring to a stream that contains a CMap defining the mapping. This should be sufficient to convert the string to a Unicode string.
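The lookup can be sketched in Python as follows. The mapping is copied by hand from the first bfchar entries of the CMap in object 206, single-byte codes are assumed (as the codespace range <00> <FF> suggests), and the function name is made up for this illustration.

```python
def decode_hex_string(token: str, to_unicode: dict, code_bytes: int = 1) -> str:
    """Decode a PDF hex string such as '<0605>' via a code-to-Unicode map."""
    hex_digits = "".join(token.strip()[1:-1].split())   # drop < > and whitespace
    if len(hex_digits) % 2:
        hex_digits += "0"        # a missing final digit is assumed to be 0
    raw = bytes.fromhex(hex_digits)
    codes = [int.from_bytes(raw[i:i + code_bytes], "big")
             for i in range(0, len(raw), code_bytes)]
    # Unmapped codes become U+FFFD instead of raising an error.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)

to_unicode = {0x01: " ", 0x02: "(", 0x03: ")", 0x04: "-",
              0x05: "0", 0x06: "1", 0x07: "2"}
print(decode_hex_string("<0605>", to_unicode))  # 10
```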

Detect position of watermark in a pdf

I am on Ubuntu.
I have a pdf file with pages divided into a grid. Each block of the grid contains name/age/dob/photo of a candidate. Some records have a watermark "disqualified".
I need to scrape this pdf, with disqualified candidates in a separate list.
Using pyPdf I was able to get individual records, but it also includes watermarked candidates.
How to detect the watermark? If I can get the coordinates of the watermark, how do I match it with the candidate?
I am open to solutions other than python pyPdf
(Actually this is not an answer but merely an analysis too big for a comment.)
I don't know pyPdf (or any Python PDF classes) myself, but here is how the watermark is created for a sample entry; based upon this, anyone knowing pyPdf well enough may more easily advise.
The Roundup
Depending on how pyPDF (or other python PDF classes) allows access to the page content, there are two major basic approaches:
If the class returns information on content (text and image) in their order in the page content stream: The watermark image xobject is referred to right before the data of the entry. Thus, any entry preceded by the drawing of a xobject image is marked.
If otherwise the information is not given in the order indicated by the page content stream, coordinate comparison must be used, which per se is quite straightforward. In that case it might be of interest that the images are inserted with a [0.1 0 0 0.1 0 0] transformation matrix in action while the text is drawn with an identity transformation matrix.
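For the coordinate comparison, the matrices have to be combined before positions can be compared. A minimal Python sketch follows; the matrices are taken from the sample content stream shown in the details further down, the text position is worked out by hand from the Tm and Td operands of the name line, and the function names are made up for this illustration.

```python
def concat(m1, m2):
    """Matrix product m1 x m2 for PDF matrices given as (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2)

def image_bbox(ctm):
    """Bounding box of the unit square (where an image is painted) under ctm."""
    a, b, c, d, e, f = ctm
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [x * a + y * c + e for x, y in corners]
    ys = [x * b + y * d + f for x, y in corners]
    return min(xs), min(ys), max(xs), max(ys)

def contains(bbox, x, y):
    """True if the point (x, y) lies inside bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

outer = (0.1, 0, 0, 0.1, 0, 0)                       # "0.1 0 0 0.1 0 0 cm"
image_ctm = concat((1045, 0, 0, 495, 462.5, 6510.5), outer)
bbox = image_bbox(image_ctm)
print([round(v, 2) for v in bbox])   # roughly 46.25, 651.05, 150.75, 700.55

# Name line: Tm at (145.1, 708.5), then Td by (-90.35, -14.95), identity CTM
name_pos = (145.1 - 90.35, 708.5 - 14.95)
print(contains(bbox, *name_pos))
```

A text fragment whose position falls inside the box of a watermark image would then belong to a marked entry; how to obtain the positions from pyPdf is left open here.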
The Details
This is entry # 200; the other watermarked entry is constructed similarly:
Watermarking is done by means of an image xobject. There is but one image xobject defined for the page used by both watermarked entries:
4 0 obj
<</Type/Page/MediaBox [0 0 595 841]
/Rotate 0/Parent 3 0 R
/Resources<</ProcSet[/PDF /ImageC /ImageI /Text]
/ColorSpace 18 0 R
/ExtGState 19 0 R
/XObject 20 0 R
/Font 21 0 R
>>
/Contents 5 0 R
>>
endobj
20 0 obj
<</R17
17 0 R>>
endobj
17 0 obj
<</Subtype/Image
/ColorSpace 16 0 R
/Width 128
/Height 88
/BitsPerComponent 8
/Filter/FlateDecode/Length 463>>stream
[...]
endstream
endobj
In the content stream this xobject /R17 is inserted right before the data of the entry itself is drawn:
q 0.1 0 0 0.1 0 0 cm
[...]
q 1045 0 0 495 462.5 6510.5 cm
/R17 Do
Q
q
10 0 0 10 0 0 cm BT
0.000487366 Tc
/R10 8 Tf
1 0 0 1 86 650.75 Tm
(Sex : Male)Tj
0.000304794 Tc
-64 0 Td
(Age : 43)Tj
-0.000140686 Tc
-1 11.05 Td
(House No :)Tj
-0.00002085 Tc
1 31.95 Td
(Name :)Tj
0.00008575 Tc
/R12 7.15 Tf
25.5 17.8 Td
( 200 )Tj
ET
Q
1547.5 6475 485 535.5 re
S
q
10 0 0 10 0 0 cm BT
-0.000403137 Tc
/R14 8 Tf
1 0 0 1 145.1 708.5 Tm
(XVX0001081)Tj
0.000421651 Tc
/R14 7.05 Tf
-90.35 -14.95 Td
(Ramesh Kumar)Tj
0.000373332 Tc
/R10 7.05 Tf
-33 -12.75 Td
(Father's )Tj
0.000193787 Tc
7.3 TL
(Name)'
0.00037774 Tc
/R14 7.05 Tf
40.25 1.8 Td
(Ram Singh)Tj
0 Tc
2.5 -11.85 Td
(37)Tj
0.00137196 Tc
/R12 7.15 Tf
-5.25 13.35 Td
(:)Tj