How can I convert a PDF from Google Docs to images? [or: GoogleDocs' PDF export is horrible!]

I exported a document from Google Docs as PDF (just simple pages with one of the pre-defined themes) and, as I usually do, used ImageMagick's convert to turn the pages into images, but it failed (even with the latest version) and showed no errors.
Ghostscript also failed.
Other tools such as pdfinfo, mutool or qpdf don't report any error, yet the conversion still fails even if their rebuild or clean commands are applied.
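(For the record, these are the kinds of rebuild/clean commands I mean - sketches with placeholder file names:
mutool clean input.pdf cleaned.pdf
qpdf input.pdf rebuilt.pdf)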
Only pdfimages complains and gives me: Syntax Error: Missing or invalid Coords in shading dictionary

OK, I tried to reproduce some bugs using Google Slides.
However, my bugs are different from yours. Read on for some details...
Google Docs does indeed create a horrible PDF syntax today. I say 'today' because I gave up on Google Docs years ago. The reason: it was always very unstable for me in the past. GoogleDocs' developers seem to change the code they activate for users all the time, and debugging the created PDFs was always a moving target for me.
When I exported the slideshow I had created to PDF and then ran the tools you mentioned on it, I got 4 different results within 20 minutes!
In one case, Mac OS X's Preview.app was unable to render anything but 3 white pages, while Adobe's Acrobat Pro rendered it (without any error message) somehow garbled and different from the GoogleDocs web preview.
In another case, Acrobat Pro showed 3 white pages, while Preview.app rendered it in a garbled way!
Unfortunately, I didn't save the different versions for closer inspection. The latest PDF I analysed, however, gave the following details.
Ghostscript:
pdfkungfoo#mbp:> gs -o PDFExportBug-%03d.jpg -sDEVICE=jpeg PDFExportBug.pdf
GPL Ghostscript 9.10 (2013-08-30)
Copyright (C) 2013 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 3.
Page 1
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Page 2
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
Page 3
**** Error reading a content stream. The page may be incomplete.
**** File did not complete the page properly and may be damaged.
**** This file had errors that were repaired or ignored.
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
ImageMagick:
convert creates white-only images from the PDF pages.
(That's no wonder, because it does not process the PDFs directly, but employs Ghostscript as its delegate to convert the PDF to a raster format first, which is then familiar ground for ImageMagick to continue processing... You can see details of this process by adding -verbose to your ImageMagick command line.)
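If you want to take ImageMagick out of the equation, you can call Ghostscript directly; a sketch, with the output name and resolution picked arbitrarily:
gs -o page-%03d.png -sDEVICE=png16m -r150 PDFExportBug.pdf
If that fails in the same way, the problem lies with the PDF or with Ghostscript, not with ImageMagick.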
qpdf:
Using qpdf --check yields this result:
pdfkungfoo#mbp:> qpdf --check PDFExportBug.pdf
checking GoogleSlidesPDFExportBug.pdf
PDF Version: 1.4
File is not encrypted
File is not linearized
PDFExportBug.pdf (file position 9269):
unknown token while reading object (0.0000-11728996)
pdfimages:
Unlike what you discovered, my error message was this:
pdfkungfoo#mbp:> pdfimages -list PDFExportBug.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Warning (9276): Badly formatted number
Syntax Warning (9292): Badly formatted number
Syntax Warning (9592): Badly formatted number
Syntax Warning (9608): Badly formatted number
Syntax Warning (4907): Badly formatted number
Syntax Warning (4907): Badly formatted number
Syntax Warning (9908): Badly formatted number
Syntax Warning (9924): Badly formatted number
Syntax Warning (8212): Badly formatted number
Syntax Warning (8212): Badly formatted number
When I check the file offsets 9276, 9292, ..., 8212 in a text editor, I indeed find the following lines in the PDF code:
Line 412: 0.0000-11728996
Line 413: 0.0000-11728996
Line 466: 0.0000-11728996
Line 467: 0.0000-11728996
Line 522: 0.0000-11728996
Line 523: 0.0000-11728996
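(To jump straight to such a file offset without counting lines, here is a sketch using dd with the position 9269 that qpdf reported above:
dd if=PDFExportBug.pdf bs=1 skip=9269 count=32 2>/dev/null
This dumps the 32 bytes starting at that offset to the terminal.)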
PDF code in text editor:
Looking at the context of these lines, one sees the following:
32
0
obj
<<
/ShadingType
2
/ColorSpace
/DeviceRGB
/Function
<<
/FunctionType
2
/Domain
[
0
1
]
/Range
[
0
1
0
1
0
1
]
/C0
[
0.5882353
0.05882353
0.05882353
]
/C1
[
0.78431374
0.1254902
0.03529412
]
/N
1
>>
/Coords
[
0.000000000000053689468
0.0000
-11728996
0.0000
-11728996
26.832815
]
/Extend
[
true
true
]
>>
endobj
That's right: GoogleDocs gave me a PDF that puts every single token on a line of its own!
PDF code, if Google had formatted it less horribly:
These lines are part of a code snippet that would probably be formatted like this if the Google PDF export weren't as horrible as it in fact is:
32 0 obj
<<
/ShadingType 2
/ColorSpace /DeviceRGB
/Function << /FunctionType 2
/Domain [ 0 1 ]
/Range [ 0 1 0 1 0 1 ]
/C0 [ 0.5882353 0.05882353 0.05882353 ]
/C1 [ 0.78431374 0.1254902 0.03529412 ]
/N 1
>>
/Coords [ 0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815 ]
/Extend [ true true ]
>>
endobj
PDF code compared to the PDF specification:
So GoogleDocs' PDF uses /ShadingType 2 (axial shading). This shading type requires a 'shading dictionary' with an entry for the /Coords key whose value should be an array of 4 numbers, [x0 y0 x1 y1]. These numbers specify the starting and ending coordinates of the axis (expressed in the shading's target coordinate space).
However, instead of a /Coords array of 4 numbers, it uses one of 6 numbers: [0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815].
But Coords arrays with 6 numbers are to be used with /ShadingType 3 (radial shading).
The 6 numbers [x0 y0 r0 x1 y1 r1] then represent, according to ISO 32000:
"[...] the centres and radii of the starting and ending circles, expressed in the shading’s target coordinate space. The radii r0 and r1 shall both be greater than or equal to 0. If one radius is 0, the corresponding circle shall be treated as a point; if both are 0, nothing shall be painted."
15 minutes later, I exported the PDF again, but now I got these lines:
/Coords
[
0.000000000000053689468
0.0000-11728996
0.0000-11728996
26.832815
]
As you'll notice, the /Coords array now indeed has 4 entries -- but 0.0000-11728996 isn't a valid number!
In any case, the particular numbers in my objects 32, 33 and 34 do look funny somehow:
Either they are meant to be 6 numbers:
[0.000000000000053689468 0.0000 -11728996 0.0000 -11728996 26.832815]
Then they can only be meant for a /ShadingType 3 (radial shading)
But they are noted in the context of /ShadingType 2 (axial shading)
Or they are meant to be 4 numbers:
[0.000000000000053689468 0.0000-11728996 0.0000-11728996 26.832815]
Then 0.0000-11728996 are not valid numbers.
Fix
So the fix could be to...
...either change the /ShadingType 2 to /ShadingType 3 and keep the array of 6 numbers,
...or keep the /ShadingType 2 and throw away 2 of the 6 numbers, keeping only 4 (but which ones?).
I decided (arbitrarily) to first try /ShadingType 2 and delete these two numbers: -11728996 0.0000.
I was lucky: convert now processes the PDF pages into JPEGs (which means the Ghostscript command called by convert was also working correctly).
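If you want to reproduce such a manual fix yourself, remember that editing a PDF in a text editor invalidates its cross-reference offsets. One way around that is qpdf's QDF mode; a sketch, with made-up file names:
qpdf --qdf PDFExportBug.pdf editable.pdf
# ...fix the /Coords array in your text editor...
fix-qdf editable.pdf > repaired.pdf
fix-qdf recomputes the object offsets that your edit shifted around.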
Good luck with your continued use of GoogleDocs for creating PDFs...
...but don't count me in!
Update
Here is a link to a GoogleDoc currently exhibiting one of the bug variants explained above:
To see the bug, save it as a PDF. Then open it in a text editor.
Should the doc behind this link stop exporting buggy PDFs and stop exhibiting the details I've described above, then Google will have applied a fix... (until they break it again?!?)

Related

PAC2021 reports "Error while parsing the PDF document (Operator 'cm' is not allowed in this current state)" - but why?

I am using the PAC2021 application to analyze some PDFs for compliance towards the PDF/UA standards.
For some files (all generated by my company's software) the following error is reported:
Error while parsing the PDF document (Operator 'cm' is not allowed in this current state)
There are no further explanations, just this phrase. An example of PDF page content that raises the above error is:
/P << /MCID 0 >> BDC
BT
0 g
1 0 0 1 81 639.75 cm
/F0 10 Tf
1 0 0 1 0 4.45 Tm
(Hello, world!) Tj
ET
EMC
Apparently PAC2021 does not like the "cm" in the fourth line, but why?
I went through the PDF specification documents and could not find an explanation of why this should be considered a syntax error. None of the PDF readers I have tried complains about such content; I also ran the same document through Adobe Preflight for PDF/UA, and it reported the document as fully compliant.
So I'm wondering: does this content violate a special restriction of the PDF/UA format? If so, where can I find its definition? Or is it an error in the PAC2021 report?

PDF Dimensions of Page Out of Range Errors from Ghostscript

I'm trying to produce new PDFs that alter the dimensions of only the first page (using CropBox). I used a modified version of How do I crop pages 3&4 in a multipage pdf using ghostscript.
Here is what's strange: everything runs properly, but when I open the PDFs in typical applications (Preview, Acrobat, etc.), they either crash or I get a "Warning: Dimensions of Page May be Out of Range" error. In Acrobat, only one page will display, even though the page count is 2, 45, 60, or whatever.
Even stranger: I emailed the PDFs to someone to see if it was a machine-specific issue. In Gmail, everything looks fine in Google Apps' PDF viewer. So the process 'worked', but it looks like there's something about the dimensions or page size that is throwing other apps off.
I've tried multiple GS options (-dPDFFitPage, -dPrinted=false, -dUseCropBox, changing the paper size to something other than legal), but nothing seems to work.
I'm attaching a version of a PDF that underwent this process and generates these errors as well. https://www.dropbox.com/s/ka13b7bvxmql4d2/imfwb.pdf?dl=0
The modified command is below. xmin, ymin, xmax, ymax, height and width are variables defined elsewhere in the bigger script of which GS is a part; their values are grabbed using pdfinfo.
gs \
-o output/#{filename} \
-sDEVICE=pdfwrite \
-c \"<</EndPage {
0 eq {
pop /Page# where {
/Page# get
1 eq {
(page 1) == flush
[/CropBox [#{xmin} #{ymin} #{xmax} #{ymax}] /PAGE pdfmark
true
}
{
(not page 1) == flush
[/CropBox [0 #{height.to_f} #{width.to_f} #{height.to_f}] /PAGE pdfmark
true
} ifelse
}{
true
} ifelse
}
{
false
}
ifelse
}
>> setpagedevice\" \
-f #{filename}"
`#{cmd}`
For pages after the first you set
[/CropBox [0 #{height.to_f} #{width.to_f} #{height.to_f}] /PAGE pdfmark
I.e. a crop box with zero height!
E.g. in case of your sample document page 2 has the crop box [0 792.0 612.0 792.0].
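(You can check the boxes yourself with poppler's pdfinfo; a sketch, where -box prints the page boxes and -f/-l restrict the output to page 2:
pdfinfo -f 2 -l 2 -box imfwb.pdf)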
This surely is not what you want...
If you really want to "produce new PDFs that alter dimensions only the first page (using CropBox)", why do you change the crop box of later pages at all? Simply don't do anything in that case!
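A sketch of how the EndPage procedure might look if it leaves every page after the first untouched (untested, and using the same #{...} placeholders as your script):
gs \
  -o output/#{filename} \
  -sDEVICE=pdfwrite \
  -c "<</EndPage {
    0 eq {
      pop /Page# where {
        % only page 1 gets a new CropBox; all other pages are left alone
        /Page# get 1 eq {
          [/CropBox [#{xmin} #{ymin} #{xmax} #{ymax}] /PAGE pdfmark
        } if
        true
      } {
        true
      } ifelse
    } {
      false
    } ifelse
  }>> setpagedevice" \
  -f #{filename}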
Why "Dimensions of Page May be Out of Range"?
Well, ISO 32000-1 in its normative Annex C declares:
The minimum page size should be 3 by 3 units in default user space
Thus, according to that older PDF specification a page height of 0 indeed is out of range for PDF!
Meanwhile, though, ISO 32000-2 has dropped that requirement, so strictly speaking a page height of zero should be nothing to complain about...

Why does ghostscript replace fontnames with "CairoFont"?

I use ghostscript to optimize pdf files (mostly with respect to size), for which it does a great job. The command that I use is:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-dCompatibilityLevel=1.4 -sOutputFile=out.pdf in.pdf
However, it seems that this replaces fonts (or subsets them) and does not preserve their names: it replaces them with CairoFont. How can I get ghostscript to preserve the fontnames?
Example:
A simple pdf file (created with Inkscape), with a single text element in it (Nimbus Roman), serves as the input (in.pdf), for which pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
PMLNBT+NimbusRomanNo9L Type 1 yes yes yes 5 0
However, after running ghostscript over the file pdffonts reports:
name type emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
OEPSCM+CairoFont-0-0 Type 1C yes yes no 8 0
So, is there a way to have ghostscript (or libcairo?) preserve the name of the font?
The input file is uploaded here.
Ghostscript doesn't change the font name, but there are, in fact, several different font 'names' in a PDF file.
In the case of your file, the PDF FontDescriptor object has a name:
<<
/Type /FontDescriptor
/FontName /PMLNBT+NimbusRomanNo9L
/Flags 4
/FontBBox [ -168 -281 1031 924 ]
/ItalicAngle 0
/Ascent 924
/Descent -281
/CapHeight 924
/StemV 80
/StemH 80
/FontFile 7 0 R
>>
which refers to a FontFile stream
/FontFile 7 0 R
That stream contains the following:
%!PS-AdobeFont-1.0: NimbusRomNo9L-Regu 1.06
%%Title: NimbusRomNo9L-Regu
%Version: 1.06
%%CreationDate: Thu Aug 2 13:14:49 2007
%%Creator: frob
%Copyright: Copyright (URW)++,Copyright 1999 by (URW)++ Design &
%Copyright: Development; Cyrillic glyphs added by Valek Filippov (C)
%Copyright: 2001-2005
% Generated by FontForge 20070723 (http://fontforge.sf.net/)
%%EndComments
FontDirectory/NimbusRomNo9L-Regu known{/NimbusRomNo9L-Regu findfont dup/UniqueID known pop false {dup
/UniqueID get 5020931 eq exch/FontType get 1 eq and}{pop false}ifelse
{save true}{false}ifelse}{false}ifelse
11 dict begin
/FontType 1 def
/FontMatrix [0.001 0 0 0.001 0 0 ]readonly def
/FontName /CairoFont-0-0 def
Do you see the FontName in the actual font? It's called CairoFont-0-0.
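(If you want to peek at such a FontFile stream yourself, here is a sketch using qpdf's inspection options, assuming the stream lives in object 7 as in this file:
qpdf --show-object=7 --filtered-stream-data in.pdf)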
This brings me back to a point which I reiterate frequently, here and elsewhere: when you process a PDF file with Ghostscript and emit a new PDF file using the pdfwrite device, you are not 'optimising', 'converting', 'subsetting' or in any general sense manipulating the content of the original PDF file.
What Ghostscript does is interpret the PDF file; this produces a set of marking operations (such as 'stroke', 'fill', 'image', etc.) which it sends to the selected Ghostscript device. Most Ghostscript devices will then use the graphics library to render the operations to a bitmap, and when the page is complete, write the bitmap to a file. The 'high level' or 'vector' devices instead repackage the operations into another Page Description Language. In the case of pdfwrite, that's a PDF file.
What this means in practice is that the emitted PDF file has nothing (apart from appearance) in common with the original PDF file. In particular, the description of the objects may be different.
So in your case, the pdfwrite device doesn't know what the font was called in the original PDF Font object. It does know that the font that was defined was called CairoFont-0-0, so that's what it calls the font when it emits it.
Frankly, this is another piss-poor example from Cairo, to go along with defining each page as containing transparency whether it does or not: the FontName in the Font object is supposed to be the same as the name in the font stream.
It's pretty clear that the FontName has been altered, given the rest of the boilerplate there.

-dSubsetFonts=false option stops showing TrueType fonts /glyphshow

I have a PostScript file that uses TrueType fonts. However, I want to include rarely used characters like the registered sign (®) and right/left single/double quotes (’, “, etc.).
So I used glyphshow and called the glyphs by name:
%!
<< /PageSize [419.528 595.276] >> setpagedevice
/DeviceRGB setcolorspace
% Page 1
%
% Set the Original to be the top left
%
0 595.276 translate
1 -1 scale
gsave
%
% Save this state before moving x y specifically for images
%
1 -1 scale
/BauerBodoniBT-Roman findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 0 setrgbcolor
10 -40 moveto /quoteright glyphshow
10 -80 moveto /registered glyphshow
/Museo-700 findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 1 setrgbcolor
10 -120 moveto /quoteright glyphshow
10 -180 moveto /registered glyphshow
showpage
When I execute this PostScript using the following command (chosen due to my requirement that the PDF be editable in Illustrator, i.e. it can be opened with all fonts intact), the PDF shows nothing, but seems to contain the glyphs if you copy and paste from the PDF into a text file.
gs -o gly_subsetfalse.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dSubsetFonts=false -dPDFSETTINGS=/prepress textglyph.ps
However, the above command causes issues when pulling the PDF into Illustrator: the rare glyphs become unrecognisable (', Æ). Normal characters and regular glyphs seem fine, i.e. /a glyphshow and plain show text appear both in the PDF and in Illustrator.
So, it seems that having the SubsetFonts option set to true shows the rare glyphs, but this stops me from pulling the PDF into Illustrator.
Attached are the TrueType fonts for reference and two PDFs (one with the subsetfont option being true and the other not - the default).
I have also tried the following command, with the same ill results (no visible glyphs appearing in the PDF, and Illustrator incorrectly showing the glyphs):
gs -o gly_subsetfalse_embedallfonts.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dPDFSETTINGS=/prepress -dSubsetFonts=false -dEmbedAllFonts=true textglyph.ps
But with this command I also get a Preflight error from the PDF, if that helps:
"Glyph width info in PDF does not match width info in embedded font"
Attached are all the files spoken about above - click here.
Encoding the font also does not produce good results.
I have encoded a TrueType (and a Type42) font in my PostScript and listed a few new characters to glyphshow.
Results are:
Command 1:
gs -o encode_ttf_subset_false.pdf -sDEVICE=pdfwrite -dSubsetFonts=false encode.ps
Results 1:
Opening the PDF in Acrobat does NOT display any of the glyphshow characters.
Command 2:
gs -o encode_ttf_subset_true.pdf -sDEVICE=pdfwrite encode.ps
Results 2:
Opening the PDF in Acrobat DOES show the glyphshow characters, but Illustrator does not.
Command 3:
gs -o encode_ttf_subset_false_embedtrue.pdf -sDEVICE=pdfwrite -dSubsetFonts=false -dEmbedAllFonts=true encode.ps
Results 3:
Same as Result 1 (glyphshow characters do not appear).
Below is my new PostScript with the encoded TTF and Type42 fonts (I've also included them in the download further below).
Is this a bug at least with Ghostscript?
/museobold findfont dup %%%%% This is the Type42 Font
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/museobold-Esp currentdict definefont pop
end
/museobold-Esp 18 selectfont
72 600 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
/BauerBodoniBT-Roman findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/BauerBodoniBT-Roman-Esp currentdict definefont pop
end
/BauerBodoniBT-Roman-Esp 18 selectfont
72 630 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
showpage
Click here to download the following: BBBTRom.ttf (TrueType font); 3 pdfs (results 1, 2 and 3); museobold (a TrueType font converted to Type42 using ttftotype42) and encode.ps.
This comes back to your problem of using Illustrator as a general PDF application: it can't do that. Now, as you note, you've found ways round that in the past; this time I believe you are out of luck.
The PostScript glyphshow operator doesn't have a PDF equivalent. Also, because of the way glyphshow works, we cannot simply use any existing font instance to store the glyph (because the glyph may not be, and probably isn't, present in the Encoding). As a result pdfwrite does the only thing it can: it makes a new font which consists only of the glyphs used by glyphshow from the specific original font's CharStrings.
Because we don't have an Encoding to work from, we have to use a custom (symbolic) Encoding (because fonts in a PDF file have to have an Encoding), which, from your previous experience, I suspect means that Illustrator is unable to read the font we embed.
Using glyphshow with pdfwrite is something I would not encourage.
Now, having said that, there should not be a problem with the PDF file when SubsetFonts is true, though I do have an open bug report which sounds similar. You haven't actually said which version of Ghostscript you are using, so I can't be sure if it's the same problem (nor do I have the same fonts etc.). Note that this is not (I believe) related to your problem with Illustrator; that's caused by your use of glyphshow and some Illustrator limitation.
As a general rule I would not use -dPDFSETTINGS, certainly not while trying to debug a problem, nor would I limit the output to PDF 1.3.
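A stripped-down invocation for debugging might look like this sketch, keeping only the option actually under test:
gs -o gly_debug.pdf -sDEVICE=pdfwrite -dSubsetFonts=false textglyph.ps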

Setting the photometric interpretation tag for a multi-page tiff

While trying to convert a multipage document from a tiff to a pdf, I encountered the following problem:
↪ tiff2pdf 0271.f1.tiff -o 0271.f1.pdf
tiff2pdf: No support for 0271.f1.tiff with no photometric interpretation tag.
tiff2pdf: An error occurred creating output PDF file.
Does anybody know what causes this and how to fix it?
This happens because one or more of the pages in the multi-page tiff does not have the photometric interpretation tag set. This is a required tag, so that means your tiffs are technically invalid (though I bet they work fine anyway).
To fix this, you must identify the page (or pages) lacking the photometric interpretation tag and fix it.
To identify the page, you can simply run something like:
↪ tiffinfo your-file.tiff
This will spit out the info for every page of your tiff. For each good page, you'll see something like:
TIFF Directory at offset 0x105c0 (67008)
Subfile Type: (0 = 0x0)
Image Width: 1760 Image Length: 2639
Resolution: 300, 300 pixels/inch
Bits/Sample: 1
Compression Scheme: CCITT Group 4
Photometric Interpretation: min-is-white
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 2639
Planar Configuration: single image plane
Software: ScanFix(TM) Enhanced ImageGear Version: 11.00.024
DateTime: Mon Oct 31 15:11:07 2005
Artist: 1996-2001 AccuSoft Co., All rights reserved
If you have a bad page, it'll lack the photometric interpretation section, and you can fix it with:
↪ tiffset -d $page-number -s 262 0 your-file.tiff
Note that the value of zero is the default for the photometric interpretation tag, whose tag number is 262. You can see the other values for this tag in the TIFF specification.
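For reference, the most common values for tag 262 in the TIFF 6.0 specification are: 0 = WhiteIsZero (min-is-white), 1 = BlackIsZero (min-is-black), 2 = RGB, and 3 = palette color.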
If your tiff has a lot of pages (like mine does), you may not be able to easily identify the bad page by eye. In that case, you can take a brute force approach, setting the photometric interpretation for all pages to the default value.
# First, split the tiff into many one-page files.
# (tiffsplit names its output files prefix + aaa, aab, ... with a .tif
# extension, so give it a prefix that won't match the original file)
↪ tiffsplit your-file.tiff page-
# Then, set the photometric interpretation to the default for all pages
↪ find . -name 'page-*.tif' -exec tiffset -s 262 0 '{}' \;
# Then rejoin the pages
↪ tiffcp page-*.tif out-file.tiff
A lot of tedious work, but it gets the job done.