How do I export an A5 doc to an A4 pdf without rescaling? - pdf

I have an A5 sized doc file that needs to be printed, yet the press needs them on A4 sized pages, centered, unscaled. When trying to export it from Office Word, you can adjust paper size, but only the left and top margins are kept and the content is spread in width to fill the paper (text size remains unchanged). I've tried PDF Architect / PDF Creator, but when it's about printing on A4 sized pages, the result is messed up fonts, messed up line wrapping and worse quality images.
Are there any tools that can preserve size, scale (in this case, centered and no scale), font, line wrapping and image quality as well or is it too much to ask from free tools? Proprietary tools are no option at the moment.

MS Word has poor options for exporting to PDF.
I see a few ways to resolve the issue:
Change the paper size in Word, then manually recalculate and enlarge the margins so that the original page area sits in the center of the bigger page.
The best solution I see, though, is to find the appropriate options on the printing device that will print the document (something like "don't resize original doc pages, centered; output page size A4").
Try to emulate printing with http://www.dopdf.com/ (or similar software). I'm pretty sure it's possible to "print to PDF" with your requirements, and then you get a PDF which you can use for printing on the real device.
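For the first option (enlarging the margins by hand), the arithmetic can be sketched as below. This is only an illustration: the page sizes are the ISO 216 values in millimetres, and the example margins passed to enlarged_margins are hypothetical, so read the real ones from Word's page setup before using the result.

```python
# ISO 216 page sizes in millimetres (width, height), portrait orientation.
A5 = (148.0, 210.0)
A4 = (210.0, 297.0)


def centering_offsets(inner, outer):
    """Extra margin needed on each side so `inner` sits centered on `outer`
    without any scaling. Returns (horizontal, vertical) in mm."""
    dx = (outer[0] - inner[0]) / 2
    dy = (outer[1] - inner[1]) / 2
    if dx < 0 or dy < 0:
        raise ValueError("inner page does not fit on outer page")
    return dx, dy


def enlarged_margins(original, inner=A5, outer=A4):
    """Add the centering offsets to the document's existing margins.
    `original` holds left/right/top/bottom margins in mm (assumed values)."""
    dx, dy = centering_offsets(inner, outer)
    return {
        "left": original["left"] + dx,
        "right": original["right"] + dx,
        "top": original["top"] + dy,
        "bottom": original["bottom"] + dy,
    }


if __name__ == "__main__":
    print(centering_offsets(A5, A4))
    # Hypothetical A5 margins of 20 mm left/right and 25 mm top/bottom.
    print(enlarged_margins({"left": 20, "right": 20, "top": 25, "bottom": 25}))
```

Because the same amount is added on both sides, the text area keeps its original width and height, which is what should keep the line wrapping unchanged.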

Related

How is hidden text stored in OCR-enhanced PDF files

// EDIT 26.03.2018 - Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm looking for some details about PDF files. It's most important for me that the files will be usable for a very long time and, if possible, that OCR is applied automatically to new files (which does not really seem to be possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less). All three variants have their pros and cons, but they also seem to take different approaches to storing the data in PDF files. Let me explain:
a File OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer) and after a preflight-script I'm able to see the text which is stored hidden:
a File OCRed with ABBYY FineReader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to reproduce, these files seem to have a background text layer, which contains the OCRed text and lies underneath the image that is shown to the user in the end. Unfortunately this seems to be loaded separately, which is confusing while opening the file with Adobe Acrobat...
a File OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused.... Does anyone know how these programs are storing their hidden text information really?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs are storing their hidden text information really?
You have correctly found out that the approach of ABBYY FineReader is different from that of Adobe Acrobat and of Tesseract:
ABBYY creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3 which draws nothing).
The difference between the latter two results is the choice of font used:
Acrobat uses regular standard 14 fonts for which a PDF viewer has a font program to render them as normal glyphs.
Tesseract uses a font named GlyphLessFont, for which it embeds a font program in the result file. When rendered, the glyphs in this font do not show as our normal Latin glyphs but merely as empty space.
Considering the visual effect you observed for the ABBYY result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract), is mostly a mere matter of taste. They are used only in the invisible rendering mode anyways.
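To make the image-first, invisible-text layout more concrete, here is a rough sketch that assembles such a page content stream as a string. It is not taken from any of the three sample files: the resource names Im0 and F1 and the word positions are hypothetical, and a real file additionally needs the page resources (image XObject and font) that these names refer to.

```python
def image_then_invisible_text(image_name, page_w, page_h, words, font="F1"):
    """Build a content stream that paints the scan first and then draws the
    OCR text with text rendering mode 3 (neither fill nor stroke).

    `words` is a list of (x, y, font_size, text) tuples in PDF points.
    `image_name` and `font` are resource names (hypothetical here)."""
    parts = [
        "q",                                  # save graphics state
        f"{page_w} 0 0 {page_h} 0 0 cm",      # scale unit square to the page
        f"/{image_name} Do",                  # paint the scanned image
        "Q",                                  # restore graphics state
        "BT",                                 # begin text object
        "3 Tr",                               # rendering mode 3: invisible
    ]
    for x, y, size, text in words:
        # Escape the characters that are special inside a literal string.
        escaped = (
            text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        )
        parts.append(f"/{font} {size} Tf")    # select font and size
        parts.append(f"1 0 0 1 {x} {y} Tm")   # position the word
        parts.append(f"({escaped}) Tj")       # show the (invisible) text
    parts.append("ET")                        # end text object
    return "\n".join(parts)


if __name__ == "__main__":
    print(image_then_invisible_text("Im0", 595, 842, [(72, 700, 12, "Hello")]))
```

The text-under-image layout described for the first tool would instead place the text object before the image and leave the rendering mode at its default, so the text is really painted and only hidden by the image drawn on top of it.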

How to resize a PDF page with itext without scaling the content (in Java)

I have been trying for days to find a solution for my problem: I want to resize an existing pdf from A4 to a given individual smaller page size. And I need the real page size to be changed, not the crop box or something like that.
The original pdf will always consist of only one page and all content (e.g. texts (some with hyperlinks), images and tables) will fit into the wanted page size. In fact I want to trim the pdf page to a rectangle that exactly fits the existing content (the content starts at the upper left corner).
As I found no way to change the page size of an existing pdf page, I tried to create a new pdf with the wanted page size and copy all the content of the original pdf to the new pdf. But that doesn't work either (I can create the new pdf page with the wanted size, but I cannot copy the content).
Any solution (iText 5 or 7) is welcome.

Adjust PDF scale to print

In the context of my studies I often receive PDF files written in LaTeX, with big margins.
When I have to print those files, I like to print them with 2 pages per sheet to spare paper. But I then have a lot of white-space and the text is quite small.
So I'm looking for a way to scale the page contents first and only then print them 2 pages per sheet, to avoid losing space and to have the text as big and readable as possible.
Does anyone have an idea of how I could do that, either programmatically, scripted, or on a "step-by-step commands" basis?
(Note that I have no access to the LaTeX code, otherwise I would just change the margins...)
I used FinePrint to do this on Windows. But there are some alternatives, which I haven't tried:
https://superuser.com/questions/190869/fineprint-alternative-on-linux
https://superuser.com/questions/107687/good-virtual-printers-with-cropping-for-windows-and-linux
Here are previous answers (all mine) which provide building blocks that will help you construct your own programmatic or scripted or "some step-by-step commands" solution:
PDF Manipulation: "2-Up" page layout (SuperUser)
Linux-based tool to chop PDFs into multiple pages (SuperUser)
Convert PDF 2 sides per page to 1 side per page (SuperUser)
How can I split a PDF's pages down the middle? (SuperUser)
Cropping a PDF using Ghostscript 9.01 (StackOverflow)
Split one PDF page into two (StackOverflow)
PDF - Remove White Margins (StackOverflow)
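Whichever tools end up doing the cropping and the 2-up imposition, the gain from trimming the margins first can be estimated with a small calculation. The sketch below is only that estimate, not a PDF tool: it assumes A4 pages and an A4 sheet measured in PDF points, and the 72 pt margins in the example are a made-up value.

```python
A4_PT = (595.0, 842.0)  # A4 in PDF points (width, height), rounded


def two_up_scale(page, margins=(0, 0, 0, 0), sheet=A4_PT):
    """Scale factor each page gets when two pages are placed side by side
    on a landscape `sheet`, after trimming `margins` from the page.

    `margins` is (left, bottom, right, top) in points."""
    left, bottom, right, top = margins
    w = page[0] - left - right
    h = page[1] - bottom - top
    if w <= 0 or h <= 0:
        raise ValueError("margins leave no content area")
    slot_w = sheet[1] / 2  # landscape sheet: the long edge is split in two
    slot_h = sheet[0]
    # Fit the trimmed page into one slot while keeping its aspect ratio.
    return min(slot_w / w, slot_h / h)


if __name__ == "__main__":
    print(two_up_scale(A4_PT))                     # untouched pages
    print(two_up_scale(A4_PT, (72, 72, 72, 72)))   # assumed 1 inch margins
```

With these assumed numbers the factor goes from about 0.71 to about 0.85, so the printed text would be roughly a fifth larger than with plain 2-up printing.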

How do I shave a few KB off a PDF?

I have a scanned greyscale PDF of a set of official school transcripts that has been compressed to 1MB. Actually, it's 1023655 bytes. I am trying to upload the document to an online application that has a maximum file size of 1MB.
My attempts to further compress the PDF via the same website have not worked.
I have tried using Neevia, but any further compression makes the lightest of the three pages completely white (the first two pages are black print on a blue background, and the third is light grey print on a white background).
I've tried using Mac Preview to save it as black and white (unreadable), and to resize it (blurry).
I have GIMP at my disposal, but otherwise I don't have any experience with photo or document manipulation. How do I shave those kilobytes off this PDF?
You could try looking at the bit depth of the grey scale. For example, if it's currently 16-bit grey scale (2^16, or 65536 shades of grey), you could try using an 8-bit grey scale (256 shades) or 4-bit (16 shades). You've already tried one form of this, going to 1-bit (2 shades, i. e. black and white), but without first taking a look at adjusting the contrast to make the text really stand out, you'll often end up with illegible files.
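As a rough illustration of what the bit depth means for size, the following sketch computes the number of grey levels and the uncompressed size of one page image. The pixel dimensions are an assumption (an A4 page scanned at 150 dpi), not values read from the actual file, and compressed sizes will not shrink in exactly the same proportion.

```python
def shades(bits):
    """Number of distinct grey levels at a given bit depth."""
    return 2 ** bits


def raw_gray_bytes(width_px, height_px, bits):
    """Uncompressed size of a greyscale image in bytes.
    Each row is padded to a whole number of bytes."""
    row_bytes = (width_px * bits + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    # Assumed page size: A4 at 150 dpi is about 1240 x 1754 pixels.
    for bits in (16, 8, 4, 1):
        size = raw_gray_bytes(1240, 1754, bits)
        print(bits, "bit:", shades(bits), "shades,", size, "bytes raw")
```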
If you download and install CutePDF, you can open the PDF file and go to print it, select the CutePDF printer, and you will be prompted to save a new PDF file. Chances are this new PDF file will be much smaller.

Poor image rendering with Google Docs PDF viewer

I used Word 2007 to create a PDF file with a 1526px * 900px image filling a whole page. This is not the first time it's happened, but Google Docs PDF viewer absolutely mangles the colour rendering, making it unusable.
I've taken screenshots at the same zoom level in Google Docs viewer and Foxit Reader.
Here's an image for comparison:
It's awful! I've tried messing about with some things, but can't find anything that can correct this issue.
In Chrome you can select "Print" and then "Save as PDF". The image quality in the saved PDF file will go up significantly, compared to the one from "Download as PDF". Google seems to be optimizing images to preserve bandwidth.
Let it be recorded here, 16 months after the present original posting by Turkeyphant and a similar posting [1] on the Docs+Drive product forum, that the problem appears to have been fixed within about the past week. Since that time, when a pdf (or Word) file is opened that resides on the Docs+Drive cloud, the file is rendered with what appears to be proper 24-bit color. The treatment whereby the color was reduced to 5 bits, which could encode 32 colors or 32 shades of gray or 16 of each, depending on the image, has been abandoned.
To the best of my knowledge the Docs+Drive staff have not announced this change, either on their Blog or on their product forum. I noticed the change a few days ago and noted it on the conversation [1].
[1] (2013-05-21) Problem in pdf-viewer with color images
https://productforums.google.com/d/msg/docs/_bdfiYgjF2s/5PDMdp9MhFQJ
It might have something to do with the compression of the image in the PDF.
I mean, PDF supports JPEG2000-encoded images (JPXDecode filter) and the PDF Reference states that:
From a single JPEG2000 data stream, multiple versions of an image may
be decoded. These different versions form progressions along four
degrees of freedom: sampling resolution, color depth, band, and
location. For example, with a resolution progression, a thumbnail
version of the image may be decoded from the data, followed by a
sequence of other versions of the image, each with approximately four
times as many samples (twice the width times twice the height) as the
previous one. The last version is the full-resolution image.
Google Docs viewer might be displaying only the first version of the image (with lower resolution or lower color depth), thus producing the "awful" output.
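To illustrate the resolution progression described in the quote, this small sketch lists the image sizes a decoder could produce from one JPEG2000 stream, each step doubling width and height. The number of decomposition levels (3) is an assumption, and nothing here confirms that the PDF in question actually contains a JPEG2000 image; the 1526 x 900 size is taken from the question.

```python
def resolution_progression(width, height, levels):
    """Image sizes from the smallest version up to full resolution, where
    each step doubles width and height (about four times the samples).
    `levels` is the number of halving steps (an assumed value)."""
    sizes = []
    for d in range(levels, -1, -1):
        factor = 2 ** d
        # Ceiling division, so odd sizes round up rather than lose a pixel.
        w = -(-width // factor)
        h = -(-height // factor)
        sizes.append((w, h))
    return sizes


if __name__ == "__main__":
    for w, h in resolution_progression(1526, 900, 3):
        print(w, "x", h, "=", w * h, "samples")
```

A viewer that stopped after the first entry would show a version with only a small fraction of the samples of the full image.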
Perhaps the attached pair of images will help towards clarifying what is happening with color in images that are rendered through the Google Docs pdf viewer. I inserted the Wikipedia image RGB_Color_Solid_Cube (1024*1024 pixels) into an otherwise empty Google Docs text document, converted it to pdf, and viewed the resulting pdf files two ways: once through the Google Docs+Drive pdf viewer and once through the regular pdf viewer of the Chrome or Firefox browser. Then I made screenshots. Here is the RGB Color Cube via the Docs PDF Viewer and here is the RGB Color Cube via a regular browser PDF Viewer.
The color resolution in the Docs PDF Viewer version is really awful; it looks like 64 colors at most. Maybe someone else is able to recognize this kind of rendering and identify the problem better.
This is related to compression and it's something that you can't change in the default view of Google Docs Viewer. The simple solution is to upload the PDF and just serve it from the site in an iFrame. Here is an example:
Problem Embedding Google Docs PDF Solution
Mike