Type 3 fonts conversion - pdf

I am converting Type 3 font glyphs from PDF to PostScript. The input file has inline images whose data streams have the FlateDecode filter applied, and the filter uses Predictor 15.
Can anybody help with how to carry the image streams over from PDF to PostScript?
This is how the input stream appears in the PDF:
32 0 obj
<<
/Length 342
>>
stream
37 0 4 -52 33 -1 d1
0.01 0 0 0.01 0 0 concat
gsave 2900 0 0 -5100 400 -100 concat
BI
/IM true
/W 29
/H 51
/BPC 1
/D[1
0]
/F/Fl
/DP<</Predictor 15
/Columns 29>>
ID xœ=Ì¡
Â`ÅñÿeÂLθ n`0>Ù`ñ
f[¦DŒF_ÁhC1ì%Ä)¶o.¢Ÿ"†ßá†s®àì]^ÏŠÅS³tFËÂÚ3sç'Æi èÐÇ:j‹¹¨åìOTÿ ª•ÉÙÕÅŸ¨‡¹Ó$°ÆΚWèÁ!¯Cê
÷0&f µtðV ©Ë÷iôíتÄ~Ø•Œöí&´« +ro#Ê‚ûÏÅùlßG'
EI gRestore
endstream
endobj
And here is what I am trying to write to the PostScript output:
/g21 {
37 0 4 -52 33 -1 setcachedevice
q
[0.01 0 0 0.01 0 0] concat
q
[2900 0 0 -5100 400 -100] concat
[ xœ…ѱNÃ0à3©p'l` ¢abä*‰'#‚W`KP¡00öQ`d# ¨CWž€u`‰štj4Ü]# /ù¤œíÿ| ÂìÊüå7úŠ‰V'‚ª¦zò¡9à*´º
m1Õ`ñ—íü‹­‡½Gù#ãÝAVxc¥Ž®"6oFܬJHÃB3(æod¾…xFP†o$!v±Ã»·0—gØY÷J$û„`´#zÊ
Oí¼œÑ¸é`Ê}ü…ñ.Z¯›cF4\¡*O¤ÑPÒYòî¦/éG‘qÑç¼2>öq<Üœ<
B˜5‚²¢ºÎ/èqUTUàoÓ9͔Π܉ä²z ‡S×ÛÙC(PA²š7è­T¾ŽCGÈRaLéåksnˆÃ0z<zø:ž=
]
0
<<
/ImageType 1
/Width 29
/Height 51
/ImageMatrix [29 0 0 -51 0 51]
/BitsPerComponent 1
/Decode [1 0]
/DataSource { 2 copy get exch 1 add exch }
<</Predictor 15
/Columns 29
>>
/FlateDecode filter
>>
imagemask
pop pop
gRestore
gRestore
} def

PostScript has mostly the same filters as PDF. You don't need to decompress the data, just use the FlateDecode filter in PostScript and leave the compressed data untouched.
Note you'll need Language Level 3 for Predictor 15 (or any other PNG predictor) but that shouldn't be a problem, level 3 has been the standard for 18 years.
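Otherwise you'll need to implement your own version of the FlateDecode filter which supports the PNG predictor. zlib will handle the Flate decompression; as far as I know, undoing the predictor is a separate step applied to the decompressed rows.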
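If you do end up decoding the data yourself, the following is a minimal Python sketch of that route, not a definitive implementation. It assumes the stream is plain zlib data and that every decompressed row starts with a PNG filter-type byte, which is how PDF treats any Predictor value of 10 or above. The function names (`paeth`, `undo_png_predictor`, `decode_flate_png`) are made up for illustration.

```python
import zlib


def paeth(a, b, c):
    """PNG Paeth predictor: pick whichever of left/up/upper-left is closest."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c


def undo_png_predictor(data, columns, colors=1, bits_per_component=8):
    """Reverse the per-row PNG filters on already-decompressed data."""
    # Bytes per complete pixel, rounded up to at least 1 (PNG rule).
    bpp = max(1, (colors * bits_per_component) // 8)
    row_len = (columns * colors * bits_per_component + 7) // 8
    out = bytearray()
    prev = bytearray(row_len)  # the row "above" the first row is all zeros
    for start in range(0, len(data), row_len + 1):
        ftype = data[start]  # filter-type byte that prefixes every row
        if ftype > 4:
            raise ValueError("unknown PNG filter type %d" % ftype)
        row = bytearray(data[start + 1:start + 1 + row_len])
        for i in range(len(row)):
            a = row[i - bpp] if i >= bpp else 0   # left
            b = prev[i]                           # up
            c = prev[i - bpp] if i >= bpp else 0  # upper-left
            if ftype == 1:
                row[i] = (row[i] + a) & 0xFF
            elif ftype == 2:
                row[i] = (row[i] + b) & 0xFF
            elif ftype == 3:
                row[i] = (row[i] + (a + b) // 2) & 0xFF
            elif ftype == 4:
                row[i] = (row[i] + paeth(a, b, c)) & 0xFF
        out += row
        prev = row
    return bytes(out)


def decode_flate_png(stream, columns, colors=1, bits_per_component=8):
    """Inflate the stream, then undo the PNG predictor."""
    return undo_png_predictor(zlib.decompress(stream), columns,
                              colors, bits_per_component)
```

For the glyph in the question this would be called with `columns=29` and `bits_per_component=1`, giving 4 bytes per row after the filter-type byte.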
[EDIT]
Your 'PostScript output' is incomplete: you are using PDF operators (q and Q) for which you have not provided a definition. Apart from anything else, this makes it impossible to run the code through an interpreter. Kindly supply a complete, simple example file, as requested. Not pasted code: I'm not inclined to go and create a file myself, and besides, binary doesn't cut and paste at all well.
Off the top of my head from desk checking I can't immediately see a problem, but since I can't run the code, I could easily be missing something.
[EDIT 2]
And that file, unsurprisingly, works fine.
You haven't supplied the PostScript file that you are creating. It's rather hard for me to tell what's wrong with the PostScript you created by looking at the PDF file you started with.
You could, of course, use Ghostscript (and I see you've used it to create the PDF file) to create a PostScript file, and then look at what that contains. If you set -dCompressFonts=false then the output font won't even be compressed.
For example:
37 0 4 -52 33 -1 d1
0.01 0 0 0.01 0 0 cm
q 2900 0 0 -5100 400 -99.9998 cm
BI
/IM true
/W 29
/H 51
/BPC 1
/D[1
0]
/F[/A85
/CCF]
/DP[null
<</K -1
/Columns 29>>]
ID
-D=,M5m+t^0_>op8\HM"Du]KKrr2rthqG/5qU_ik]$f$TlUslD91qoN93j0%dckk:ld^*DV25!+
!WX>~>
EI Q
Of course you'll need to look at the prolog to see how all the procedures used there are defined, but you can do that yourself; you certainly don't need me to do it. Notice that the imagemask uses the CCITTFax and ASCII85 decode filters; it's trivial to add additional filters. Since the data is guaranteed to be 'monochrome' (it's a mask), the CCITT filter generally gives superior compression to Flate.
Note that if you are really using Ghostscript 9.05 then you should upgrade; that version is 6 years old.
It might possibly help if you were to explain why you want to take an ugly, bitmapped, type 3 font from PDF and make an ugly, bitmapped type 3 PostScript font from it.
[EDIT 3]
Well, looking at your PostScript file, the definition of the glyphs does not match what you've put in your question. The actual content looks like this:
/g10135{
88 0 4 -70 82 8 setcachedevice
q
[
0.01 0 0 0.01 0 0 ] M
q
[7800 0 0 -7800 400 800 ]M
<<
/ImageType 1
/Width 78
/Height 78
/ImageMatrix [ 78 0 0 -78 0 78]
/BitsPerComponent 1
/Decode [1
0]
/DataSource ....binary data.....
<< /Predictor 15
/Columns 78
/BitsPerComponent 1>>
/FlateDecode filter def
>> imagemask
Q
Q
}bind def
You have not supplied either a file, procedure or string source as a value for the DataSource key in the dictionary. Essentially, the PostScript interpreter reads and tokenises the /DataSource key, and then proceeds to process the binary as PostScript. Unsurprisingly this causes an error 'syntaxerror in (binary token, type=156)' when processed with Ghostscript.
If you had got past that then you would have discovered that the filter operator takes a data source as well and you haven't supplied one for that either.
So you need to create a data source for your binary data. How you do that is up to you, but currentfile is one way, as is readstring, given that you know the string length.
So something like:
<<
/ImageType 1
/Width 29
/Height 51
/ImageMatrix [29 0 0 -51 0 51]
/BitsPerComponent 1
/Decode [1 0]
/DataSource
<length> string dup
currentfile exch readstring
.....binary data.....
<<
/Predictor 15
/Columns 29
>> /FlateDecode filter
>> imagemask
Obviously you'll have to fill in <length> yourself, since you know the string length. The dictionary argument to FlateDecode looks to me like it shouldn't be needed.
[EDIT 4]
I notice that this appears to be intended for commercial use. Nothing wrong with that, but I'm not going to do all your homework for you; if it's your job, it's up to you to learn the language well enough to do the job.
I'm skipping lightly over the actual implementation details below in an attempt to outline where you are going wrong. In practice things are a little more complex: I haven't discussed how the procedure stored in the CharStrings dictionary is created, or the difference with early name binding (which is an important concept in PostScript).
Your existing code is:
/g10135{
88 0 4 -70 82 8 setcachedevice
q
[
0.01 0 0 0.01 0 0 ] M
q
[7800 0 0 -7800 400 800 ]M
<<
/ImageType 1
/Width 78
/Height 78
/ImageMatrix [ 78 0 0 -78 0 78]
/BitsPerComponent 1
/Decode [1
0]
/DataSource {417 string dup
currentfile exch readstring}
...binary data....
<< /Predictor 15
/Columns 78
>>/FlateDecode filter def
>> imagemask
Q
Q
}bind def
So, the PostScript interpreter reads those bytes one at a time, and converts them into tokens. This either results in an executable token, which is executed, or an operation on one of the stacks.
So /g10135 is terminated by the { character, because that's a reserved character. The / introduces a name object, so we end up with the name object g10135 which we push on to the operand stack. The { character introduces an executable array so we put a mark on the operand stack.
Next we read 88, terminated by a white space character. That's a numeric so we store that on the operand stack, likewise the other numbers. The operand stack now contains:
/g10135
mark
88
0
4
-70
82
8
We then read setcachedevice, which is terminated by a white space. That is an executable name, so the interpreter starts looking through the dictionaries on the dictionary stack for a definition. Since it is a standard operator, we find it in systemdict and execute it. That consumes 6 operands from the operand stack; it has no other effects (actually it does, because we are executing inside a font, which is a bit special, but we'll ignore that for now).
Next we encounter a q. Again, this is looked up in the dictionaries on the dictionary stack to find a definition. It is defined in your own prolog as a gsave, so it takes no operands and returns no operands; it simply saves the graphics state, incrementing the save depth by 1.
I'm not going to go through the rest, as it would be tedious. Eventually, however, we reach your /DataSource. This is a name, so we push it on the operand stack. The next thing we encounter is a {, which starts a procedure definition, so we push a mark on the operand stack. We then encounter 417, so we push that, followed by string, dup, currentfile, exch and readstring, so our stack looks like:
/DataSource
mark
417
string
dup
currentfile
exch
readstring
Then we get the character }. That is the closing mark for an executable array, so we create the array and push it onto the operand stack:
/DataSource
{....}
Then we return to the procedure and continue executing it. The next thing we find is some binary data so we try to execute that as PostScript binary tokens. Because it isn't valid the interpreter throws an error.
Just creating an executable array is not sufficient to actually execute it. If you look at the outline code I posted at the end of edit 3 above you will note that I did not put the readstring and so on in an executable array, I simply allowed the interpreter to execute that code immediately.
By doing so, the readstring acts on currentfile (the actual PostScript program in this case) and reads bytes of data from the current point in that file. The current point will be immediately after the white space which terminates the readstring, i.e. at the actual binary data. The readstring operator reads enough bytes from the file to fill the string, leaving the string on the operand stack. The file pointer has moved on to the byte after the binary data, and the interpreter resumes token scanning at that point. So it then creates the FilterParams dictionary, puts the /FlateDecode name on the stack and executes the filter operator, which consumes the name, the dictionary and the string operands, returning a file object. That file object then becomes the value associated with the DataSource key in the image dictionary which is passed to the imagemask operator.
While I haven't tested that code, it's basically correct. There are of course other ways to achieve the same aim.
That's about as far as I'm prepared to go with this; you need to go and look at what I've written and compare it with your own program.
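The difference between wrapping the operators in { } and leaving them bare can be modelled with a toy scanner. This is only a Python sketch of the idea, not how a real interpreter is written: it does not implement any operators, it merely records which tokens would be executed immediately and which end up stored inside a procedure object. The names `scan` and `is_number` are hypothetical.

```python
def is_number(tok):
    """Very rough numeric check, good enough for this illustration."""
    return tok.lstrip("+-").replace(".", "", 1).isdigit()


def scan(tokens):
    """Return (operand_stack, executed_names) for a flat token list.

    Tokens between { and } are collected into a procedure object that is
    pushed on the stack; nothing inside it is executed.
    """
    stack, executed, open_procs = [], [], []
    for tok in tokens:
        if tok == "{":
            open_procs.append([])              # start collecting a procedure
        elif tok == "}":
            proc = ("proc", open_procs.pop())  # finished: it is just data
            (open_procs[-1] if open_procs else stack).append(proc)
        elif open_procs:
            open_procs[-1].append(tok)         # deferred, not executed
        elif tok.startswith("/") or is_number(tok):
            stack.append(tok)                  # literal name or number
        else:
            executed.append(tok)               # executable name runs now
    return stack, executed


# Wrapped in braces: readstring never runs, so the binary that follows
# would be scanned as PostScript tokens.
deferred = scan("/DataSource { 417 string dup currentfile exch readstring }".split())

# Bare operators: readstring runs at once and would consume the binary.
immediate = scan("/DataSource 417 string dup currentfile exch readstring".split())
```

In the first case the executed list is empty and the stack holds only the name and a procedure object; in the second, string through readstring are all executed as soon as they are scanned.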
Note that the simplest way to investigate this is to take the contents of the CharProc (excluding the setcachedevice) and just run that as a PostScript program.

Related

PostScript PDF (1.7), manually writing code

I'm trying to manually write a simple PDF file that contains a title, some text, and an image. I found one example of a manually written "Hello world" and managed to change some things, but I can't get it working for another text object. I have looked for help on the internet but with no luck; I guess not many people write their own PDF files.
This is what I have so far:
%PDF-1.7
1 0 obj % entry point
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Type /Pages
/MediaBox [ 0 0 200 200 ]
/Count 1
/Kids [ 3 0 R ]
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources <<
/Font <<
/F1 4 0 R
>>
>>
/Contents 4 0 R
>>
endobj
4 0 obj % page content
<<
/Length 20
>>
stream
BT
80 180 TD
/F1 14 Tf
(PDF) Tj
ET
endstream
endobj
5 0 obj % page content
<<
/Length 20
>>
stream
BT
50 70 TD
/F1 14 Tf
(this is a pdf) Tj
ET
endstream
endobj
trailer
<<
/Size 6
/Root 1 0 R
>>
startxref
492
%%EOF
I have tried adding another text object with the text "this is a pdf", but it won't show up. I don't know what could be wrong; I tried changing a few things but with no luck. I don't have the image part either, so some help with that would be nice.
This is a wiki about the "hello world" pdf I found:
http://www.gnupdf.org/Introduction_to_PDF
Adobe offers some explanation of how PDF works, but I can't find anything that would fix my problem:
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
This is not a valid PDF. If Acrobat opens it at all it's because it's given up on the xref table and done a full scan of the file, but your PDF is invalid. 4 0 obj is not a font, as you specified, and 5 0 obj is not accessed from anywhere.
The PDF specification requires an xref table which points to the exact byte position in the file of each object. You can't realistically write this by hand unless you intend to manually update the entire xref table every time you add or remove even 1 byte from the file.
You can write a PDF from scratch like this from code easily enough but it will not work to just open a PDF in notepad and start changing things because the index (xref) immediately becomes corrupt.
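As a rough illustration of generating the file from code, here is a minimal Python sketch that writes the objects and records each byte offset as it goes, so the xref table and startxref value are always consistent. The helper name `build_pdf` and the choice of Helvetica as the font are assumptions for the example; the object layout is one plausible way to fix the file in the question (a real font object, and the page's /Contents pointing at the content stream).

```python
def build_pdf(objects):
    """Serialise object bodies (bytes) as objects 1..n with a matching xref table."""
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0 (free list head)
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


content = b"BT /F1 14 Tf 80 180 Td (PDF) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 200 200] >>",
    b"<< /Type /Page /Parent 2 0 R"
    b" /Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    # /Length is computed from the actual content, not typed by hand.
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
]
pdf = build_pdf(objects)
```

Writing `pdf` to a file in binary mode should give something a viewer can open without having to rebuild the cross reference table.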
I'd also advise against putting comments throughout the file unless the comments start on new lines. Otherwise some PDF parsers will get confused as this is generally not expected. Usually PDF files do not contain comments (with the exception of the second line, which is recommended by Adobe to be a comment of some non-ASCII characters so FTP recognizes the file as binary) seeing as they are virtually impossible to write manually anyway.
http://www.adobe.com/devnet/pdf/pdf_reference.html
A few years ago, I wrote a book which covers exactly this sort of thing:
http://www.amazon.com/PDF-Explained-John-Whitington/dp/1449310028/
No free online version, I'm afraid. You can get all the same information from Adobe's own documentation, which is free, but it's a rather long document!

Attachment damages signature

I have a PDF document.
1) Adobe Reader opens the document fine.
2) I sign the document (using PDFBox) and everything is fine.
3) I attach a file to the original PDF (using the code from the cookbook on the PDFBox web page).
4) Adobe Reader opens the document with the attachment fine.
5) Now I have a document with an attachment.
6) I try to sign that document (the one with the attachment), and I have two problems:
First:
When I open the document, Adobe Reader tells me that the signature byte range is invalid.
Second:
When I try to close the document (that is, close Adobe Reader), Adobe Reader asks me:
Do you want to save changes to "original[with-attachment][signed]" before closing? I haven't found out why this happens.
Here are the test files, uploaded to Google Docs.
The cause of your issue is that the process of signing original[with-attachment].pdf creates an incremental update with a cross reference stream while the source file has a cross reference table. When adding incremental updates, the new cross references must be of the same type as the old ones.
It is quite possible that this error is due to the process attaching attach.txt misbehaving a bit, too: it stores the file as a PDF with a cross reference table even though the original was a file with a cross reference stream, but at the same time leaves some elements from the former cross reference dictionary in the trailer of the new file. These left-over elements (which do not belong in a trailer dictionary) probably make your signing process think the source already uses a cross reference stream.
As this change of cross reference style between incremental updates is forbidden, Adobe Reader tries to fix the document in memory. Such attempts to fix a document often give rise to unexpected 'Do you want to save changes to "original[with-attachment][signed]" before closing?' warnings.
In the course of fixing the PDF, the whole PDF is rearranged. This obviously causes the "signature byte range is invalid" error.
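A signing process could check which cross reference style the source file actually uses by looking at what the last startxref offset points to, rather than trusting keys found in the trailer. The following Python sketch shows that check under simplifying assumptions (a well-formed file, only the most recent cross reference section inspected); the function name `xref_kind` is hypothetical.

```python
import re


def xref_kind(pdf_bytes):
    """Classify the most recent cross reference section of a PDF.

    Returns "table" if startxref points at the keyword xref, "stream" if
    it points at an indirect object (a cross reference stream), and
    "unknown" otherwise.
    """
    idx = pdf_bytes.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref found")
    # The byte offset is the first token after the startxref keyword.
    offset = int(pdf_bytes[idx + len(b"startxref"):].split()[0])
    target = pdf_bytes[offset:offset + 32]
    if target.startswith(b"xref"):
        return "table"
    if re.match(rb"\s*\d+\s+\d+\s+obj", target):
        return "stream"
    return "unknown"
```

For the files shown below, this would report a table for original[with-attachment].pdf (startxref 49755) and a stream for the signed version (startxref 89569 pointing at 37 0 obj), which is exactly the mismatch described above.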
original.pdf
%PDF-1.3
%âãÏÓ
11 0 obj
<</Linearized 1/L 48987/O 13/E 37674/N 3/T 48682/H [ 480 178]>>
endobj
25 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<321A6D6DCD0785E8E35BD4B13115140A><59793561FB914D408936FC170763541A>]/Index[11 22]/Info 10 0 R/Length 77/Prev 48683/Root 12 0 R/Size 33/Type/XRef/W[1 2 1]>>stream
hÞbbd``b`jŒ â`–,õ#‚µÄb‰í±#Ä"Q{$¬rÄ‚MLŒ³€,F¬ÄÆK¿ Mi
endstream
endobj
startxref
0
%%EOF
32 0 obj
[.........]
endobj
8 0 obj
<</DecodeParms<</Columns 3/Predictor 12>>/Filter/FlateDecode/ID[<321A6D6DCD0785E8E35BD4B13115140A><59793561FB914D408936FC170763541A>]/Info 10 0 R/Length 50/Root 12 0 R/Size 11/Type/XRef/W[1 2 0]>>stream
hÞbb```bœ¬ÅÄÀ°“‰A\š‰H³Îbbà)²'ñ5&F§Û#yF€ xi
endstream
endobj
startxref
116
%%EOF
original[with-attachment].pdf
%PDF-1.3
%öäüß
1 0 obj
[.........]
endobj
xref
0 33
0000000000 65535 f
0000000015 00000 n
[...]
0000049667 00000 n
0000049737 00000 n
trailer
<<
/DecodeParms <<
/Columns 4
/Predictor 12
>>
/Filter /FlateDecode
/ID [<321A6D6DCD0785E8E35BD4B13115140A> <59793561FB914D408936FC170763541A>]
/Info 5 0 R
/Length 77
/Root 1 0 R
/Size 33
/Type /XRef
/W [1 2 1]
/Index [11 22]
>>
startxref
49755
%%EOF
original[with-attachment][signed].pdf
%PDF-1.3
%öäüß
1 0 obj
[....as above....]
startxref
49755
%%EOF
1 0 obj
[.........]
endobj
37 0 obj
<<
/ID [<DC60F4419C05967B81D7F64090027D7F> <DC60F4419C05967B81D7F64090027D7F>]
/Info 5 0 R
/Root 1 0 R
/Prev 49755
/Type /XRef
/Size 38
/Filter /FlateDecode
/Index [1 1 6 1 33 4]
/W [1 3 0]
/Length 31
>>
stream
xœcd8ú‘1&ˆ‘áØ.F†ã¾ŒŒ±ù#| VÚ
endstream
endobj
startxref
89569
%%EOF
A side remark
ID management: Your process adding attachments keeps the whole ID. Your signing process drops the whole original ID of the PDF and replaces it with a new one:
original.pdf
/ID[<321A6D6DCD0785E8E35BD4B13115140A><59793561FB914D408936FC170763541A>]
original[with-attachment].pdf
/ID [<321A6D6DCD0785E8E35BD4B13115140A> <59793561FB914D408936FC170763541A>]
original[signed].pdf
/ID [<A9F7159B1E5D8285A68475689B750214> <A9F7159B1E5D8285A68475689B750214>]
original[with-attachment][signed].pdf
/ID [<DC60F4419C05967B81D7F64090027D7F> <DC60F4419C05967B81D7F64090027D7F>]
Both approaches are wrong. Processes that manipulate a PDF and therefore create a new version of it shall keep the first ID entry and replace only the second one with a unique new one.
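The intended behaviour can be sketched in a few lines of Python. This is an illustration only: the function name `updated_file_id` is made up, and using 16 random bytes for the second entry is an assumption (the requirement is only that it be unique for the new version of the file).

```python
import os


def updated_file_id(current_id):
    """Return the /ID pair for a new version of an existing PDF.

    current_id is a (first, second) tuple of hex strings. The first entry
    identifies the original document and is kept unchanged; the second is
    replaced by a fresh value for the new version.
    """
    first, _old_second = current_id
    new_second = os.urandom(16).hex().upper()  # assumption: random is unique enough
    return (first, new_second)


original_id = ("321A6D6DCD0785E8E35BD4B13115140A",
               "59793561FB914D408936FC170763541A")
new_id = updated_file_id(original_id)
```

The resulting pair would then be written as `/ID [<first> <second>]` in the trailer or cross reference stream dictionary of the new revision.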

PDF Flag annotations

I am trying to (programmatically) write page numbers to all pages in a PDF file.
The object I use to write them looks like this:
493 0 obj
<</Length 96>>
stream
Q
/2 12 Tf
/DeviceRGB cs
0 0 0 scn
q
1 0 -0 1 298 32 cm
BT
1 0 0 1 -3.6 1.884 Tm
(2) Tj
ET
Q
endstream
endobj
It worked fine until I tried it on a page that uses the "/Rotate" entry:
23 0 obj
<</Parent 2 0 R /Rotate 180 /Contents [492 0 R 24 0 R 493 0 R ] ... >>
...
When I did so, the number I wrote came out upside down (and at the top of the page instead of the bottom).
I read about this in the PDF manual and found that I can use the annotation flags to indicate that I want the written number to stay fixed and not be affected by page rotation.
For that, I tried adding the corresponding flag (NoRotate) to the dictionary of object 493:
493 0 obj
<</Length 96 /F 16>>
stream
...
The only thing that actually happens is that the number I try to write doesn't show at all.
I tried putting different numbers into the "/F" entry, but they all lead to an invisible number.
I looked for examples in the manual and on the net, but didn't find any.
What am I doing wrong?
Maybe I am placing the "/F" in the wrong location?
According to Adobe's PDF Reference v1.7 (link to PDF), 8.4.2 Annotation Flags, the flag /F only applies to annotations -- objects with a /Type of /Annot, and appearing in a PDF as sticky notes, text edits, and clickable rectangles.
It seems you have to provide the rotation yourself, using the Tm operator.
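One way to work out that rotation is to compute a matrix that counteracts the page's /Rotate value and moves the text to the spot where it should appear on the displayed page. The Python sketch below does this for the four legal /Rotate values. It assumes the MediaBox starts at (0, 0) and that no CropBox shifts the visible area; the function name `upright_text_matrix` and the 612 x 792 page size in the example are assumptions, since the question does not give the page dimensions.

```python
def upright_text_matrix(rotate, x, y, width, height):
    """Matrix operands that make text appear upright at displayed (x, y).

    rotate is the page's /Rotate value (clockwise degrees), width and
    height are the unrotated MediaBox dimensions, and (x, y) is measured
    from the bottom-left corner of the page as displayed.
    """
    rotate %= 360
    if rotate == 0:
        return (1, 0, 0, 1, x, y)
    if rotate == 90:
        # Page is turned 90 degrees clockwise, so turn the text 90 back.
        return (0, 1, -1, 0, width - y, x)
    if rotate == 180:
        return (-1, 0, 0, -1, width - x, height - y)
    if rotate == 270:
        return (0, -1, 1, 0, y, height - x)
    raise ValueError("/Rotate must be a multiple of 90")


def tm_operator(matrix):
    """Format the six numbers as a Tm operation for a content stream."""
    return "%g %g %g %g %g %g Tm" % matrix


# Page number at displayed position (298, 32) on a page with /Rotate 180.
example = tm_operator(upright_text_matrix(180, 298, 32, 612, 792))
```

With these assumed dimensions, `example` is `-1 0 0 -1 314 760 Tm`: the text is flipped back upright and positioned near the top of the unrotated page, which is the bottom of the page as displayed.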

CGPDFScannerScan doesn't fire callback functions

I parse PDF files using Quartz.
Everything works fine except for one file, where the callback functions are not called at all.
My operator table has been created, and I added operators to it with CGPDFOperatorTableSetCallback. Everything seems OK, but the callbacks are just not called.
Do you have any idea what could cause this behaviour?
The page content is a large form XObject. Form XObjects are self contained graphic objects that use a content stream like the page.
You need to do the following: include the 'Do' operator in the list of scanned operators. When it is encountered, its operand is the symbolic name of an XObject. Get the 'Resources' dictionary from the page dictionary. From the 'Resources' dictionary get the 'XObject' dictionary. From the 'XObject' dictionary get your XObject using the symbolic name used with the 'Do' operator. From the XObject get the value of the 'Subtype' key. If it is 'Image', ignore the XObject because it is an image. If it is 'Form', then you have a form XObject. Get the stream from the XObject and scan it the same way you scanned the page content stream. You can reuse the same scanner class; you just need to keep a context in order to know which object you are scanning. Form XObjects can use other form XObjects, which are located in the parent form XObject's 'Resources' dictionary.
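The lookup and recursion described above can be sketched independently of the Quartz API. The Python below is only a model: dictionaries stand in for PDF dictionaries, a list of (operator, operands) pairs stands in for a scanned content stream, and the names `scan_stream`, `inner`, `outer` and so on are all hypothetical. It does not guard against XObjects that refer to each other in a cycle.

```python
def scan_stream(stream, resources, visit, depth=0):
    """Walk a content stream, descending into form XObjects on 'Do'.

    visit(depth, operator, operands) plays the role of the operator
    callbacks; resources is the dictionary that belongs to the stream
    currently being scanned (the scanning context).
    """
    for operator, operands in stream:
        visit(depth, operator, operands)
        if operator != "Do":
            continue
        # The operand of Do is a name looked up in /Resources /XObject.
        xobj = resources.get("XObject", {}).get(operands[0])
        if xobj is None or xobj.get("Subtype") != "Form":
            continue  # images (and missing entries) are not scanned
        # A form XObject carries its own stream and its own resources.
        scan_stream(xobj["stream"], xobj.get("Resources", {}), visit, depth + 1)


# Model data: the page draws Fm0, which draws text, a nested form and an image.
inner = {"Subtype": "Form", "Resources": {}, "stream": [("Tj", ["World"])]}
image = {"Subtype": "Image"}
outer = {
    "Subtype": "Form",
    "Resources": {"XObject": {"Fm0": inner, "Im0": image}},
    "stream": [("Tj", ["Hello"]), ("Do", ["Fm0"]), ("Do", ["Im0"])],
}
page_resources = {"XObject": {"Fm0": outer}}
page_stream = [("Do", ["Fm0"])]

seen = []
scan_stream(page_stream, page_resources,
            lambda depth, op, args: seen.append((depth, op, args[0])))
```

Note that the name Fm0 means different things at different depths, which is why the resources dictionary has to travel with the stream being scanned.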
Your page dictionary looks like this:
<<
/ArtBox[0.0 0.0 768.0 7066.0]
/BleedBox[0.0 0.0 768.0 7066.0]
/Contents 29 0 R
/CropBox[0.0 0.0 768.0 7066.0]
/Group 62 0 R
/MediaBox[0.0 0.0 768.0 7066.0]
/Parent 23 0 R
/Resources
<<
/ExtGState<</GS0 30 0 R>>
/XObject<</Fm0 61 0 R>>
>>
/Rotate 0
/TrimBox[0.0 0.0 768.0 7066.0]
/Type/Page
>>
The 'Fm0' is the name of the form XObject used in the page content stream, the operand for the 'Do' operator. Its resources dictionary looks like this:
/Resources
<<
/ColorSpace<</CS0 32 0 R>>
/ExtGState<</GS0 34 0 R/GS1 30 0 R>>
/Font<</T1_0 38 0 R/T1_1 40 0 R>>
/ProcSet[/PDF/Text]
/XObject<</Fm0 45 0 R/Fm1 48 0 R/Fm2 51 0 R/Fm3 54 0 R/Fm4 57 0 R/Fm5 60 0 R>>
>>
As you can see it uses several other form XObjects.

Explanation of part of a PDF file

This segment of a PDF file seems to cause Poppler to crash. Xpdf doesn't seem to choke on it. If I remove the /I1 Do and /I2 Do lines, the PDF file works fine. Can someone give me a quick explanation of what those might be doing? Let me know if you need to see other parts of the PDF file.
1289 0 obj
<<
/Length 72
>>
stream
q
360.00 0 0 583.20 0 0 cm
/I1 Do
Q
q
360.00 0 0 583.20 0 0 cm
/I2 Do
Q
endstream
endobj
I1 and I2 are either images or form XObjects. Probably Poppler cannot decode their content for some reason and crashes. Even if I saw the file, I do not know the internals of Poppler, so it would be difficult to guess what is causing the problem unless it is an obvious error in the PDF structure.