PDFBox 2.0.4 - Blank page (200+ inches height) with Adobe Reader 2015


I have a very big screenshot (48.69 x 1220.93 cm) which I’m converting to PDF using PDFBox 2.0.4.
It works well when I open the generated PDF using the Mac Preview app, but not with Adobe Reader version 2015. It shows only a long blank page and says the dimensions are 19.17 x 200 inches. I’m aware that early versions of the PDF spec had a limit of 200 inches height. So I've tried setting the pdf version to 1.7 but it didn’t work:
org.apache.pdfbox.pdmodel.PDDocument#setVersion
org.apache.pdfbox.cos.COSDocument#setVersion
Both Adobe Reader and Preview say the version of the PDF is 1.7. I can open smaller PDFs in Adobe Reader without problems.

As @Tilman already said in his comment,
The media box is 1380 x 34609. 1 unit = 1/72 inch
Unfortunately this is beyond the size a specification conforming pdf reader has to support:
The minimum page size should be 3 by 3 units in default user space; the maximum should be 14,400 by 14,400 units. In versions of PDF earlier than 1.6, the size of the default user space unit was fixed at 1/72 inch, yielding a minimum of approximately 0.04 by 0.04 inch and a maximum of 200 by 200 inches. Beginning with PDF 1.6, the size of the unit may be set on a page-by-page basis; the default remains at 1/72 inch.
(Table C.1 – Architectural limits - ISO 32000-1)
To support a document page as large as desired here, one should use a larger default user space unit, e.g. 3/72".
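
As a rough sketch of the arithmetic behind that suggestion, the following Python snippet finds the smallest whole-number user unit that brings the 1380 x 34609 media box under the 14,400-unit limit. The function names and the choice to round up to a whole number are illustrative assumptions, not part of PDFBox or the specification.

```python
import math

# Maximum page dimension in default user space units (ISO 32000-1, Table C.1).
MAX_UNITS = 14400


def smallest_user_unit(width_pt, height_pt, max_units=MAX_UNITS):
    """Return the smallest whole-number user unit multiplier that keeps
    both page dimensions within max_units (hypothetical helper)."""
    return max(1, math.ceil(max(width_pt, height_pt) / max_units))


def scaled_media_box(width_pt, height_pt, user_unit):
    """Express a page size given in 1/72 inch as enlarged user space units."""
    return (width_pt / user_unit, height_pt / user_unit)


unit = smallest_user_unit(1380, 34609)
box = scaled_media_box(1380, 34609, unit)
print(unit, box)
```

For the page in question this yields a multiplier of 3, which matches the 3/72" unit suggested above. The media box would then shrink to roughly 460 x 11536 units, and the page content would need to be scaled down by the same factor.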

Related

Google docs pdf conversion of an A4 size document is too wide

I have an A4 Page Setup for a Google document. When I convert it to PDF via the Download option the width of the page is 8.28 inches rather than the specified 8.27 inches. I've used the Add-on "Page Sizer" but the best that I could get was 8.19 inches wide after lots of trial and error.
Has anyone else experienced this?
Thanks.

ghostscript shrinking pdf doesn't work anymore

first question here.
So I was using the ghostscript command to shrink my pdf, which yielded good results (around 30-40% decrease in size). However, one day last week it stopped shrinking them and instead returned a pdf of the same size or even a bit heavier (around 1% or less). Therefore I don't know what's going on, since the command used to work fine and I was able to shrink some pdfs easily...
I will note that when using gs on my pdfs it always returns an error about some glyphs missing in the GlyphLessFont, but I don't think it's related to my issue (though if you could point me towards fixing the GlyphLessFont error that would be much appreciated).
Here's the command I use :
`gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf`
Here's also a pdf sample that was shrunk correctly (original file size 4.7 MB / shrunk version 2.9 MB) https://nofile.io/f/39Skta4n25R/bulletin1_ocr.pdf
EDIT: light version that worked for the file above : https://nofile.io/f/QOKfG34d5Cg/bulletin1_light.pdf
Here's the input and output file of another pdf that didn't work
(input) https://nofile.io/f/sXsU0Mcv35A/bulletin15_ocr.pdf
(output through the gs command above) https://nofile.io/f/STdJYqqt6Fq/out.pdf
You'll notice that both the input and output files are 27.6 MB, whereas the first file was reduced.
I would also add that I've performed OCR on these pdfs using pdfocr and the tesseract engine, and that's why I didn't try to convert to png to reduce the size: I need the extra OCR layer so that we can publish those files on our website, and we want them to be lighter if possible.
Final info : ghostscript -v is 9.10 (2013-08-30) and tesseract is 3.03 with leptonica-1.70 and pdfocr is 0.1.4
Hope you guys can help !
EDIT2: while waiting for the answer I continued scanning and OCRing the documents, and it appears that after passing my pdf through pdfocr it was shrunk like it used to be with ghostscript. Therefore I wonder if the pdfocr script does the shrinking with ghostscript, since I know it invokes it for other tasks during the OCR process.

The PDF has a media size of 35.44 by 50.11 inches, is that really the size of the original ?
Given that you appear to commonly use OCR I assume that, in general, your PDF files simply consist of very large images. In that case the major impact on the file size is going to come from downsampling the images. If you look at the documentation you can see that the /screen settings downsample images to 72 dpi, with a threshold of 1.5 (so images over 72 * 1.5 = 108 dpi will be downsampled to 72, anything less is regarded as not worth it)
Your PDF file has a media size of 35.44 x 50.11 inches. It's rather a large file (26 pages) so I'll limit myself to considering page 1. On this page there is one image, and a bunch of invisible text, placed there by Tesseract. The image on page 1 is an 8-bit RGB image with dimensions 2481x3508, and it covers the entire page.
So the resolution of that image is 2481 / 35.44 by 3508 / 50.11 = 70.01 x 70.01 dpi
Since that is less than 72 dpi, pdfwrite isn't going to downsample it.
Had your media been 8.5 x 11 inches then the image would have had an effective resolution of 2481 / 8.5 by 3508 / 11 = 291.9 x 318.9 dpi and so would have been downsampled by a factor of about 4.
However..... for me your 'working' PDF file also has a large media size, and the images are also already below the downsampling resolution. When I run that file using your command line, the output file is essentially the same size as the input file.
So I'm at a loss to see how you could ever have experienced the reduced file size. Perhaps you could post the reduced file as well.
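
The resolution check described above can be sketched in Python as follows, using the 2481 x 3508 pixel image from page 1. This is a simplified model of the rule as stated in this answer (downsample when the effective resolution exceeds target x threshold), not Ghostscript's actual code, and the function names are hypothetical.

```python
def effective_dpi(pixels_w, pixels_h, media_w_in, media_h_in):
    """Effective resolution of an image that covers the whole page."""
    return (pixels_w / media_w_in, pixels_h / media_h_in)


def will_downsample(dpi, target_dpi=72, threshold=1.5):
    """Simplified rule from the answer: downsample only when the image
    resolution exceeds target_dpi * threshold (108 dpi for /screen)."""
    return dpi > target_dpi * threshold


# Page as found in the file: 35.44 x 50.11 inch media.
large = effective_dpi(2481, 3508, 35.44, 50.11)
# Same image if the media had been 8.5 x 11 inches.
letter = effective_dpi(2481, 3508, 8.5, 11)

print(large, will_downsample(large[0]))
print(letter, will_downsample(letter[0]))
```

On the oversized media the image sits at about 70 dpi and is left alone; on 8.5 x 11 inch media it would be close to 300 dpi and well over the threshold.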
EDIT
So, the reason that your files are smaller after passing through Ghostscript is because the vast majority of the content is the scanned pages. These are stored in the PDF file as DCT encoded images (JPEG).
The resolution of the images is low enough (see above) that they are not downsampled. However, the way that old versions of Ghostscript work is that image data is always decompressed on reading, and then recompressed when writing.
Because JPEG is a lossy image format, this means that the decompressed and recompressed image is of lower quality than the original, and the way that loss of quality is applied means that the data compresses better.
So a quirk of the way that Ghostscript works results in you losing quality, but getting smaller files. Note that for current versions of Ghostscript, the JPEG data is passed through unchanged, unless your configuration requires it to be downsampled, or colour converted.
So why doesn't it compress the other file ? Well for current code, of course, which is what I'm using, it won't, because the image doesn't need downsampling or anything.
Now, when I run it through an old version of Ghostscript which I have here (9.10, chosen because that's what your working reduced file is using) then I do indeed see the file size reduced. It goes down from 26MB to 15MB.
When I look at your 'not working' reduced file, I see that it has been produced by Ghostscript 9.23, not Ghostscript 9.10.
So the reason you see a difference in behaviour is because you have upgraded to a newer version of Ghostscript which does a better job of preserving the image data unchanged.
If you really want to reduce the quality of the images you can set -dPassThroughJPEGImages=false but IMO you'd do better to either get the media size of the original PDF correct (surely the pages are not really 35x50 inches ?) or set the ColorImageResolution to a lower value.
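
As a hedged illustration of those two options, the Python sketch below assembles the command line from the question with the extra switches appended. The helper name, the file names, and the resolution value of 40 are assumptions for illustration only; the snippet just builds and prints the argument list and does not run Ghostscript.

```python
def build_gs_command(input_pdf, output_pdf,
                     color_image_resolution=None,
                     pass_through_jpeg=True):
    """Build a pdfwrite command line based on the one in the question
    (hypothetical helper; file names are placeholders)."""
    cmd = [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/screen",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
    ]
    if not pass_through_jpeg:
        # Force JPEG images to be decompressed and recompressed.
        cmd.append("-dPassThroughJPEGImages=false")
    if color_image_resolution is not None:
        # Lower the target resolution for colour images.
        cmd.append(f"-dColorImageResolution={color_image_resolution}")
    cmd.append(f"-sOutputFile={output_pdf}")
    cmd.append(input_pdf)
    return cmd


# Example: recompress JPEGs and use an arbitrary lower target resolution.
cmd = build_gs_command("in.pdf", "out.pdf",
                       color_image_resolution=40,
                       pass_through_jpeg=False)
print(" ".join(cmd))
```

Whether a given resolution actually triggers downsampling still depends on the threshold discussed earlier, so the value would need to be chosen against the real image resolution.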

what unit is pdf resolution in and does DPI affect it

When inspecting a PDF file, a 'resolution' can be seen:
In this example the 'Resolution' is 595x841. There is no 'DPI' shown in the above file dialog, however when exporting as PDF it is possible to set the DPI. Regardless of what DPI setting is chosen, the 'Resolution' is always the same.
For A6 'resolution' is 297x419, about half in each dimension.
What unit is resolution measured in, and is it linked directly to the size only, or does the DPI affect it (and I just haven't been setting it correctly)?

The documentation is not clear, but PDF dimensions (referred to as 'Resolution' by the Mac file info dialog) are in 'user units', which then correlate to 'points'. Prior to PDF version 1.6, it's a 1:1 relationship.
Since PDF version 1.6 there is a setting /UserUnit, which is usually 1, and user units x /UserUnit = points.
A 'point' is ALWAYS 1/72 of an inch, so 0.35277777777777775mm per point. (Though you can scale a PDF to fit different sized pages in most print dialogs.)
PDFs do NOT themselves have a DPI. PDFs contain things, some of which MAY have the equivalent: text and vector images are scalable and do not, whereas each raster image has its own 'ppi' (pixels per inch).
When creating a PDF, you may use a raster image with a very high resolution, but the exporting application typically downsamples raster images so that their 'ppi' is similar to the 'DPI' of the page (usually chosen on export of the PDF). This means that the smaller you make an image on the page, the lower the resolution it needs in order to have the desired 'ppi'. Downsampling thus stops the PDF being unnecessarily large (filesize) by removing lots of unneeded detail.
So in the file dialog, 'Resolution' is always in 'user units'; multiply it by /UserUnit to get 'points', and that divided by 72 is the intended physical size in inches. You can of course print an A4 PDF at A2 size; it just means the embedded raster images will each have half their 'ppi' available to the printer.
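
To make the conversion concrete, here is a minimal Python sketch that turns the 'Resolution' figures from the question into physical sizes. The function name is hypothetical, and a /UserUnit of 1 is assumed unless stated otherwise.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72


def physical_size_mm(width_units, height_units, user_unit=1.0):
    """Convert page dimensions in user units to millimetres.
    user_unit is the page's /UserUnit value (assumed 1 by default)."""
    width_in = width_units * user_unit / POINTS_PER_INCH
    height_in = height_units * user_unit / POINTS_PER_INCH
    return (width_in * MM_PER_INCH, height_in * MM_PER_INCH)


a4 = physical_size_mm(595, 841)
a6 = physical_size_mm(297, 419)
print(a4)
print(a6)
```

The results round to 210 x 297 mm and 105 x 148 mm, which are the A4 and A6 paper sizes. The export DPI never enters the calculation.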

Adobe Acrobat XI changes file version automatically after pages are deleted from the document

I've got a version 1.4 PDF created using the R function "pdf". The file contains six pages and is 135 KB. Now I want each of these pages in a separate file in order to include it as a picture in LaTeX. Since I have the full Adobe Acrobat and not only Adobe Reader, deleting pages isn't a problem, but after a page is deleted from the document Adobe Acrobat automatically changes the version to 1.6, which then causes problems in LaTeX.
I've now tried to save it as version 1.4 PDF, which itself isn't a problem, but the file size then increases from 28 KB to 759 KB and my final PDF mustn't be larger than 3 MB. I've already played a bit with the compression settings, but the size doesn't really change. Why does Adobe change the version automatically and how can I extract the pages without blowing up the size that much?

Acrobat is always setting the PDF version to its own level, even if the file itself would be compliant to an earlier standard. It has been doing so since Acrobat 2…
You can control quite a few things when you do Save as… --> Optimized PDF. There you can also set the standard at which the document is saved, and many more things.
About the file size, it really depends on what your document contains. It is also possible that your PDF creation tool creates an incomplete document, and saving it in Acrobat will create a more complete one (think of embedded fonts, etc.).

Poor image rendering with Google Docs PDF viewer

I used Word 2007 to create a PDF file with a 1526px * 900px image filling a whole page. This is not the first time it's happened, but Google Docs PDF viewer absolutely mangles the colour rendering, making it unusable.
I've taken screenshots at the same zoom level in Google Docs viewer and Foxit Reader.
Here's an image for comparison:
It's awful! I've tried messing about with some things, but can't find anything that can correct this issue.

In Chrome you can select "Print" and then "Save as PDF". The image quality in the saved PDF file will go up significantly, compared to the one from "Download as PDF". Google seems to be optimizing images to preserve bandwidth.

Let it be recorded here, 16 months after the present original posting by Turkeyphant and a similar posting [1] on the Docs+Drive product forum, that the problem appears to have been fixed within about the past week. Since that time, when a pdf (or Word) file is opened that resides on the Docs+Drive cloud, the file is rendered with what appears to be proper 24-bit color. The treatment whereby the color was reduced to 5 bits, which could encode 32 colors or 32 shades of gray or 16 of each, depending on the image, has been abandoned.
To the best of my knowledge the Docs+Drive staff have not announced this change, either on their Blog or on their product forum. I noticed the change a few days ago and noted it on the conversation [1].
[1] (2013-05-21) Problem in pdf-viewer with color images
https://productforums.google.com/d/msg/docs/_bdfiYgjF2s/5PDMdp9MhFQJ

It might have something to do with compression of the image in the PDF.
I mean, PDF supports JPEG2000-encoded images (JPXDecode Filter) and PDF Reference states that:
From a single JPEG2000 data stream, multiple versions of an image may
be decoded. These different versions form progressions along four
degrees of freedom: sampling resolution, color depth, band, and
location. For example, with a resolution progression, a thumbnail
version of the image may be decoded from the data, followed by a
sequence of other versions of the image, each with approximately four
times as many samples (twice the width times twice the height) as the
previous one. The last version is the full-resolution image.
Google Docs viewer might be displaying only first version of the image (with lower resolution or lower color depth) thus producing "awful" output.

Perhaps the attached pair of images will help towards clarifying what is happening with color in images that are rendered through the Google Docs pdf viewer. I inserted the Wikipedia image RGB_Color_Solid_Cube (1024*1024 pixels) into an otherwise empty Google Docs text document, converted it to pdf, and viewed the resulting pdf files two ways: once through the Google Docs+Drive pdf viewer and once through the regular pdf viewer of the Chrome or Firefox browser. Then I made screenshots. Here is the RGB Color Cube via the Docs PDF Viewer and here is the RGB Color Cube via a regular browser PDF Viewer.
The color resolution in the Docs PDF Viewer version is really awful; it looks like 64 colors at most. Maybe someone else is able to recognize this kind of rendering and identify the problem better.

This is related to compression and it's something that you can't change in the default view of Google Docs Viewer. The simple solution is to upload the PDF and just serve it from the site in an iFrame. Here is an example:
Problem Embedding Google Docs PDF Solution
Mike
