Making the PDF format readable and diff-able

I am wondering if anyone has thought of a way to display the PDF document format in a more human-readable form?
As it stands, comparing PDF files, or seeing exactly what has changed between two versions, is very difficult. Many changes aren't visible to the naked eye, since they are not part of the graphical representation (such as "created when" and similar).
So if a PDF is the result of an integration test, it is difficult to find the problem without a hex editor. It is also difficult to disregard "created when" in the comparison.
I am not talking about any interpretation or rendering, just converting the basic object types to some meta-language. For simplicity's sake, let's say XML, with nodes named as they are named in the PDF specification.
There are PDF parsers available for most programming languages. Still, I at least can't find any that have gone the distance and converted it to something readable.
Or have I missed it?
Edit:
To clarify (example from the specification):
BI % Begin inline image object
/W 17 % Width in samples
/H 17 % Height in samples
/CS /RGB % Color space
/BPC 8 % Bits per component
/F [ /A85 /LZW ] % Filters
Would become:
<BI>
<W>17</W>
<H>17</H>
<CS><RGB/></CS>
<BPC>8</BPC>
<F>
<item>A85</item>
<item>LZW</item>
</F>
</BI>
..and so on.
Binary data could either be extracted to a file or just shown as a hash or size.
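To flesh out what I mean by "converting the basic object types to some meta-language": a minimal, hypothetical C++ sketch along the lines below would already cover the inline-image example. It is not a real PDF parser (it only splits the comment-stripped fragment above on whitespace, and every name in it is made up for illustration); a real tool would also have to handle strings, streams, nested dictionaries and the cross-reference table, which is where the existing parser libraries would come in.
// Minimal, hypothetical sketch - not a real PDF parser. It only handles the
// whitespace-separated tokens of the inline-image dictionary shown above
// (comments already stripped) and prints the proposed XML meta-language.
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    // The example fragment from the question, without the BI/EI markers.
    std::string fragment = "/W 17 /H 17 /CS /RGB /BPC 8 /F [ /A85 /LZW ]";

    std::istringstream in(fragment);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok)
        tokens.push_back(tok);

    std::cout << "<BI>\n";
    for (std::size_t i = 0; i < tokens.size(); ++i)
    {
        std::string key = tokens[i].substr(1);   // key name, leading '/' dropped
        std::string value = tokens[++i];
        if (value == "[")                        // array value: emit <item> children
        {
            std::cout << "  <" << key << ">\n";
            while (tokens[++i] != "]")
                std::cout << "    <item>" << tokens[i].substr(1) << "</item>\n";
            std::cout << "  </" << key << ">\n";
        }
        else if (value[0] == '/')                // name value: emit an empty element
        {
            std::cout << "  <" << key << "><" << value.substr(1) << "/></" << key << ">\n";
        }
        else                                     // numeric value: emit as text
        {
            std::cout << "  <" << key << ">" << value << "</" << key << ">\n";
        }
    }
    std::cout << "</BI>\n";
    return 0;
}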

Related

Can't get Ghostscript "viewraw.ps" or "viewrgb.ps" programs to work (scrambled output)

I've had good results in the past using the "viewjpeg.ps" PostScript program included with Ghostscript to place JPEG images into generated PDFs. Now I'm trying to do the same for bitmaps, and I just haven't been able to make it work. My hunch is that the program I need is either "viewraw.ps" or "viewrgb.ps," and I can see that the parameters expected will be a bit different from those passed to "viewjpeg.ps."
So far this is what I have:
"C:\Program Files\gs\gs9.10\bin\gswin64c.exe" -q -sDEVICE=pdfwrite -DNOSAFER -r200x200 -sOutputFile=o.pdf z:\home\dell\reporting\viewrgb.ps -c "(out002.bmp) 6800 viewrgb"
This gets pretty close to what I want, but my bitmap (though clearly identifiable) is scrambled in the output PDF: compressed vertically, upside-down, and somewhat wrong in color.
I have attempted to address these issues by tweaking the "width" parameter (6800 above). My bitmap is 1,700 pixels wide, and uses 4 bytes per pixel, so 1,700 * 4 = 6,800 seemed like a logical choice. I've also tried 1,700 (width in pixels) and 54,400 (bits per image row). 5,100 (3 * 1,700) seemed to work best, but it's still wrong.
Note that "viewjpeg.ps" does not expect a "width" parameter, so I haven't had to deal with this before. (It was an examination of "viewrgb.ps" that made me realize this parameter was required.)
Can anyone spot my mistake, or maybe point me to an example that uses "viewraw.ps" or "viewrgb.ps"?
You haven't said (or I missed it) what format your 'bitmaps' are, and you haven't supplied an example to look at so I can't tell (or experiment).
You say your output is 4 bytes per pixel so that's either CMYK or something like RGBa. Either way viewrgb isn't going to work, because it only expects 3 channels. It's intended to view the output of the Ghostscript bitrgb device.
Viewraw just reads raw data (straight image samples, no header, IIRC) and it's CMYK, so unless your 4 bytes are CMYK it's not going to be correct either.
Since both of these expect raw formats, they don't expect a header; if your image format includes a header, it will be treated as image data, which will certainly cause the image to be drawn incorrectly.
Both of these PostScript programs will display a usage message on the back channel if you invoke them incorrectly.
You don't need -dNOSAFER with such an old version of Ghostscript (9.10).
-r has little effect on pdfwrite and will have no effect at all when you feed it an image as input; you should probably omit that.
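For what it's worth, here is a hedged sketch of the sort of round trip viewrgb.ps is intended for: first produce headerless 24-bit RGB raw data with the bitrgb device, then view that. The file names are placeholders, and the 5100 operand follows your own reading of the width parameter as 3 bytes x 1700 pixels; I haven't verified that against viewrgb.ps itself.
"C:\Program Files\gs\gs9.10\bin\gswin64c.exe" -q -sDEVICE=bitrgb -r200 -o out002.rgb input.pdf
"C:\Program Files\gs\gs9.10\bin\gswin64c.exe" -q -sDEVICE=pdfwrite -o o.pdf z:\home\dell\reporting\viewrgb.ps -c "(out002.rgb) 5100 viewrgb"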

Avoiding fragmenting of text extracted from PDF after processing with Ghostscript

After processing with Ghostscript, I sometimes see whitespace breaking up words, as seen with pdftotext or in a PDF viewer when searching or selecting. This is possibly unrelated, but the anomalies seem to correspond to kerning variations in the rendered font.
Is there a way to avoid this?
For example, from GS 9.23 (also occurred with earlier versions):
gs -sDEVICE=pdfwrite \
-dNOPAUSE -dQUIET -dPARANOIDSAFER -dBATCH \
-sOutputFile=./output.pdf input.pdf
Excerpt from pdftotext input.pdf:
Review this manual before
operating deep cleaner
while pdftotext output.pdf:
Re vie w t his m a nua l be fore
ope ra t ing de e p c le a ne r
Ghostscript and the pdfwrite device (as explained in VectorDevices.htm) do not simply 'fiddle' with the input when producing a PDF file. The input (from whatever source: PDF, PostScript, XPS, PCL, PCL-XL) is fully interpreted into marking operations, and those marking operations are sent to the device, which turns them back into PDF constructs.
So the low level (PDF) format describing the page need not bear any relation to the low level format of the input. In particular you cannot expect the PDF operations in the input to be reflected in the output.
The visual appearance will be the same (or should be, because that's the main goal), but the actual operations won't be.
The reason for the difference in the text output is because, basically, there is no 'metadata' in a PDF file that describes words, paragraphs, columns etc. When you extract text from a PDF file what you actually get is a series of character codes and positions.
It's up to the text extraction code to try and make some sense of that. I'd guess that pdftotext is using the rather naive approach of assuming that text strings are words.
This is problematic because there are numerous different ways to handle kerning, justification and other spacings in PDF. You could do something like:
(Te) Tj
10 0 Td
(st) Tj
Or:
[(Te) 2 (st)] TJ
The pdfwrite device doesn't know what the original was, so what it emits could be either of those, depending on some heuristics. The chances of it matching the original are low.
I suspect that pdftotext would regard the first operation as "Te st" and the second operation as "Test".
One possible solution would be to use Ghostscript's txtwrite device to extract the text; it might do a better job.
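For reference, a typical txtwrite invocation looks like this (the output file name is a placeholder):
gs -sDEVICE=txtwrite -o extracted.txt output.pdf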
As with your other question, it would be best to supply examples when asking these kinds of questions, because without that it's pretty much guesswork.
TL;DR
Is there a way to avoid this?
No.

Losslessly Compress PDF Generated from PostScript

I am generating multiple EPS files, which contain several PostScript drawing commands that are not necessarily encoded efficiently. The first update in the answer to this question describes similar inefficiencies.
Each of my EPS files is around 18 MB, and the resulting PDF files are around 3 MB. I am generating the PDF files using epstopdf, which enables some sort of compression by default.
Are there any suggestions for how to further reduce the resulting PDF file sizes without changing the quality (e.g. rasterizing the vector graphics)?
I tried reducing the precision of the coordinates from 8 digits past the decimal to 3. This reduced the EPS file sizes to about 14 MB, but, counter-intuitively, the PDF file sizes slightly increased.
Update 1: The EPS files contain several occurrences of the sample code below for different coordinates and colors.
newpath
1 setlinejoin
1 setlinecap
<<
/BBox [322 384.0417 615.0087 651.9958]
/Domain [322 384.0417 615.0087 651.9958]
/ShadingType 6
/ColorSpace [/DeviceRGB]
/DataSource
[
0
350.00000000 651.99583594
336.00000000 645.75890880
336.00000000 645.75890880
322.00000000 639.52198166
339.17140372 627.26533984
339.17140372 627.26533984
356.34280743 615.00869803
370.19224806 621.16169097
370.19224806 621.16169097
384.04168868 627.31468392
367.02084434 639.65525993
367.02084434 639.65525993
0.23047 0.29688 0.75
0.23047 0.29688 0.75
0.41081 0.54141 0.93366
0.41112 0.54178 0.93388
]
>>
gsave
322 615.0087 62.04169 36.98714 rectclip
shfill
grestore
Update 2: I have been able to reduce the PDF file sizes by about 15% by running pdftocairo, followed by:
gs -dCompatibilityLevel=1.4 -dPDFSETTINGS=/default -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dDetectDuplicateImages=true -sOutputFile=out.pdf in_.pdf
PostScript is a programming language and PDF is not, so often you can actually create a smaller PostScript program than the resulting PDF file.
The 'inefficiencies' you mention in your EPS program, and the precision of the input numbers, are completely irrelevant to the size of the PDF file. The operators in PDF do not have the same names as the operators in PostScript, so a 'moveto' in PostScript does not simply get transliterated into a 'moveto' in the resulting PDF file. The precision of numbers in the output PDF file is not tied to the precision of the numbers in the input.
In addition, PostScript interpreters often use fixed-precision arithmetic (Ghostscript, for example, uses 24:8), so (e.g.) 1.5 on the input may not be produced as 1.5 on the output; it may instead become 1.49999999.
So the result of this, basically, is that nobody can tell why your PDF files are as large as they are without seeing them. I would suggest that a 6:1 reduction in size is pretty reasonable, personally. If you post a representative example somewhere, it's possible someone could look at it and might be able to offer some suggestions, but without seeing the content it's not really possible to tell.
NB rendering the content would most likely increase the size of the PDF file, unless you render at a really low resolution.
EDIT
The supplied example is simply a shading dictionary; the PDF file will contain almost exactly the same data for that particular construct. It's already about as compact as you could expect, and I very much doubt this is the sort of thing occupying 18 MB of source; that would be an enormous amount of shadings. There is no realistic way to make that smaller, and rendering it to a bitmap (even at very low resolution) would actually make it larger.
It's entirely possible the EPS contains things like a bitmap preview, which will, of course, be removed when creating a PDF. It may also (depending on the creating application) contain the original document, stored as comments, which will also be removed when creating a PDF file. Without seeing the original EPS it's not really possible to suggest much.
I'm afraid posting little bits of the file isn't going to help really.

How to search my PDF with grep?

I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (the list below is not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications which convert fonts to so-called 'outlines'. The reasons for this seemingly strange behaviour may be:
Circumventing licensing problems (when a certain font disallows its embedding).
Imposing a handicap upon attempts to extract the text.
An accidentally wrong setting in the PDF-generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name                       type         encoding       emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10             Type 1       Builtin        yes yes no       96  0
VPAFLY+CMBX12              Type 1       Builtin        yes yes no       97  0
CWAIXW+CMTI12              Type 1       Builtin        yes yes no       98  0
OBMDLT+CMR12               Type 1       Builtin        yes yes no       99  0
In this case, text extraction (and your method of grepping for strings) should work: even though the column named uni (telling whether a toUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not contain custom but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1).
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!

Unicode in PDF

My program generates relatively simple PDF documents on request, but I'm having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in brackets:
(something)
There is also the option to escape a character with octal codes:
(\527)
but this only goes up to 512 characters. How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.
Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.
The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in laying out.
Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.
Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.
In the PDF reference in chapter 3, this is what they say about Unicode:
Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography).
For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase.)
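As a concrete illustration of that rule: a text string (for example a document-information or outline entry, not a content-stream show operand) containing "Hello" would be written in UTF-16BE as a hex string, with the FEFF byte order marker in front. The /Title key here is just an example context:
/Title <FEFF00480065006C006C006F>   % FE FF BOM followed by "Hello" in UTF-16BE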
The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter — and a long one at that — devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of exercise. The solution you discovered — use a 3rd party library to do the work for you — is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.
Algoman's answer is wrong about many things. You can make a PDF document with Unicode in it and it's not rocket science, though it needs some work.
Yes, he is right that to use more than 255 characters in one font you have to create a composite font (CIDFont) PDF object.
Then you just reference the actual TrueType font you want to use in the DescendantFonts entry of the composite font.
The trick is that after that you have to use the glyph indices of the font instead of character codes. To get this index map you have to parse the cmap table of the font: get the contents of the font with the GetFontData function and follow the TTF specification.
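For orientation, the object structure looks roughly like this; the object numbers, the font name and the Identity-H encoding are placeholder choices for the common case, not something your code has to copy literally:
10 0 obj
<<
  /Type /Font
  /Subtype /Type0
  /BaseFont /AAAAAA+SomeTrueTypeFont
  /Encoding /Identity-H            % string bytes are then 2-byte glyph indices
  /DescendantFonts [11 0 R]
  /ToUnicode 13 0 R                % needed for search/copy, as mentioned below
>>
endobj
11 0 obj
<<
  /Type /Font
  /Subtype /CIDFontType2
  /BaseFont /AAAAAA+SomeTrueTypeFont
  /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >>
  /FontDescriptor 12 0 R           % points at the embedded TrueType font program
  /CIDToGIDMap /Identity
>>
endobj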
And that's it! I've just done it and now I have a Unicode PDF!
Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020
And yes, don't forget the /ToUnicode entry, as #user2373071 pointed out, or the user will not be able to search your PDF or copy text from it.
As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.
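For orientation, the body of such a /ToUnicode stream is a small CMap program; a minimal sketch, with two made-up mappings (glyph 0001 to U+0048 "H", glyph 0002 to U+20AC "€"), looks roughly like this:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <0048>
<0002> <20AC>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end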
See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.
Check out PDFBox or iText.
http://www.adobe.com/devnet/pdf/pdf_reference.html
I have worked on this subject for several days now and what I have learned is that unicode is (as good as) impossible in pdf. Using 2-byte characters the way plinth described only works with CID-Fonts.
Seemingly, CID-Fonts are a pdf-internal construct and they are not really fonts in that sense - they seem to be more like graphics subroutines that can be invoked by addressing them (with 16-bit addresses).
So to use unicode in pdf directly:
you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?), extract character metrics etc.
you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
also, 2-byte characters don't even cover the full Unicode space
IMHO, these points make it absolutely unfeasible to use unicode directly.
What I am doing instead now is using the characters indirectly in the following way:
For every font, I generate a codepage (and a lookup table for fast lookups); in C++ this would be something like:
std::map<std::string, std::vector<wchar_t> > Codepage;
std::map<std::string, std::map<wchar_t, int> > LookupTable;
Then, whenever I want to put some unicode string on a page, I iterate over its characters, look them up in the lookup table and, if they are new, add them to the codepage like this:
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
    {
        LookupTable[fontname][*i] = Codepage[fontname].size();
        Codepage[fontname].push_back(*i);
    }
}
Then, I generate a new string in which the characters from the original string are replaced by their positions in the codepage, like this:
static std::string hex = "0123456789ABCDEF";
std::string result = "<";
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    int id = LookupTable[fontname][*i] + 1;
    result += hex[(id & 0x00F0) >> 4];
    result += hex[(id & 0x000F)];
}
result += ">";
for example, "H€llo World!" might become <01020303040506040703080905>
and now you can just put that string into the pdf and have it printed, using the Tj operator as usual...
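In a content stream that would look something like this (the font name, size and position are placeholders):
BT
/F1 12 Tf
72 720 Td
<01020303040506040703080905> Tj
ET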
But you now have a problem: the pdf doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the pdf file. This is done by adding an /Encoding dictionary to the Font object and setting its Differences array.
For the "H€llo World!" example, this Font-Object would work:
5 0 obj
<<
  /F1
  <<
    /Type /Font
    /Subtype /Type1
    /BaseFont /Times-Roman
    /Encoding
    <<
      /Type /Encoding
      /Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
    >>
  >>
>>
endobj
I generate it with this code:
ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
    (*stream) << " /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;
    (*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
    for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
        (*stream) << " /" << GlyphName(*j) << "\n";
    (*stream) << " ] >>";
    (*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";
Notice that I use a global font-register - I use the same font names /F1, /F2,... throughout the whole pdf document. The same font-register object is referenced in the /Resources Entry of all pages. If you do this differently (e.g. you use one font-register per page) - you might have to adapt the code to your situation...
So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling "GlyphName(*j)". I have generated this method with a BASH-Script from the list found at
http://www.jdawiseman.com/papers/trivia/character-entities.html
and it looks like this
const std::string GlyphName(wchar_t UnicodeCodepoint)
{
    switch(UnicodeCodepoint)
    {
        case 0x00A0: return "nonbreakingspace";
        case 0x00A1: return "exclamdown";
        case 0x00A2: return "cent";
        ...
    }
}
A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.
Inside the pdf, different codepages are represented by different fonts, so to switch between codepages you would have to switch fonts, which could theoretically blow your pdf up quite a bit, but I, for one, can live with that...
dredkin's answer has worked fine for me in the forward direction (unicode text to PDF representation).
I was writing an increasingly convoluted comment there about the reverse direction (PDF representation to text, when copying from the PDF document), explained by user2373071. The method referred to throughout this thread is the definition of a /ToUnicode map (which, incidentally, is optional). I found it simplest to map from glyphs to characters using the beginbfrange srcCode1 srcCode2 [ dstString1 m ] endbfrange construct.
This seems to work OK in Adobe Reader, but two glyphs (0x100 and 0x1ef) cause the mapping for cyrillic characters to fail in browsers and SumatraPDF (the copy/paste provides the glyph IDs instead of the characters). By excluding those two glyphs I made it work there. I really can't see what's special about these glyphs; it's independent of font (i.e. it's the same glyphs, but different characters, in Times/Georgia/Palatino), and these values are, afaik, identically mapped in UTF-16. Any ideas welcome!
However, and more importantly,
I have reached the conclusion that the whole /ToUnicode mechanism is fundamentally flawed in concept, because many fonts re-use glyphs for multiple characters. Consider simple ones like 0x20 and 0xa0 (ordinary and non-breaking space) and 0x2d and 0xad (hyphen and soft hyphen); these two pairs are in the 8-bit character range. Slightly beyond that are 0x3b and 0x37e (semicolon and Greek question mark). And it would be quite reasonable to re-use cyrillic small a and latin small a, and similar homoglyphs. So the point is, in the non-ASCII world that prompts us to worry about Unicode at all, we will encounter a one-to-many mapping from glyphs to characters, and will therefore be bound to pick up the wrong character at some point - which rather removes the point of being able to extract the text in the first place.
The other method in the (1.7) PDF reference is to use /ActualText instead of /ToUnicode. This is better in principle, because it completely avoids the homoglyph problem I've mentioned above, and the overhead is probably bearable, but it only seems to be implemented in Adobe Reader (i.e. I haven't got anything consistent or meaningful from SumatraPDF or four browsers).
I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:
Are you sure you are using a font that supports all the characters you need?
In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...