Ghostscript adds whitespace no matter what bounding box I use - pdf

I'm trying to convert a page of a PDF to an image. This works for most PDFs I've tried, but this one in particular always ends up with a lot of whitespace on one side or strange scaling.
I've tried every combination of the fixed media, fixed resolution, fit page, and crop/bleed/trim/art box parameters to fix the issue, but nothing works. The best I get is the right content size, but offset and chopped off.
Here's what it should look like, according to every PDF reader I've tried:
Here's a link to the PDF (8 MB) for testing.
https://drive.google.com/file/d/1ErS3KxADb1YAdzM7FG7T5dO8QnW4l1AQ/view?usp=sharing
Edit 1:
Here's what it looks like using just -dUseCropBox without a cropbox override:
I'm using Ghostscript.NET with very simple code. I create a rasterizer, call Open(PDF file, Ghostscript DLL in bytes), then GetPage(DPI, page number). To use other flags, I add a custom switch to the rasterizer before calling Open:
using(var rasterizer = new GhostscriptRasterizer()) {
//rasterizer.CustomSwitches.Add("-dFIXEDMEDIA");
//rasterizer.CustomSwitches.Add("-dFIXEDRESOLUTION");
//rasterizer.CustomSwitches.Add("-dPSFitPage");
//rasterizer.CustomSwitches.Add("-dFitPage");
//rasterizer.CustomSwitches.Add("-dPDFFitPage");
//rasterizer.CustomSwitches.Add("-dUseCropBox");
//rasterizer.CustomSwitches.Add("-dPrinted");
//rasterizer.CustomSwitches.Add("-dUseBleedBox");
//rasterizer.CustomSwitches.Add("-dUseTrimBox");
//rasterizer.CustomSwitches.Add("-dUseArtBox");
//rasterizer.CustomSwitches.Add("-sPAPERSIZE=letter");
//rasterizer.CustomSwitches.Add("-dORIENT1=true");
//etc
rasterizer.Open(pdfFilePath, ghostscriptDLL);
img = rasterizer.GetPage(dpi, pageNumber);
img.Save(pageFilePath, imageFormat);
}
I'll try again with the latest version of just ghostscript (no .NET) and see if that makes a difference.
Edit 2:
Using just gswin64c version 9.55.0 and -dUseCropBox works as KenS said. Since I don't need Ghostscript.NET to do that, that's a good resolution.
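For anyone scripting the standalone executable, here is a minimal Python sketch of how such a command line could be assembled. Only gswin64c and -dUseCropBox come from the post above; the png16m device, the 150 DPI default, the file names, and the helper name build_gs_command are assumptions made for illustration.

```python
from typing import List


def build_gs_command(pdf_path: str, png_path: str, page: int,
                     dpi: int = 150, gs_exe: str = "gswin64c") -> List[str]:
    """Build a Ghostscript command line that renders one page to PNG,
    using the page's CropBox instead of its MediaBox."""
    return [
        gs_exe,
        "-dBATCH", "-dNOPAUSE",       # run non-interactively
        "-sDEVICE=png16m",            # 24-bit PNG output (assumed device)
        f"-r{dpi}",                   # resolution in DPI (assumed value)
        f"-dFirstPage={page}",        # render only the requested page
        f"-dLastPage={page}",
        "-dUseCropBox",               # the switch that fixed the whitespace
        f"-sOutputFile={png_path}",
        pdf_path,
    ]


if __name__ == "__main__":
    cmd = build_gs_command("input.pdf", "page1.png", page=1)
    print(" ".join(cmd))
    # To actually render, pass the list to subprocess.run(cmd, check=True).
```

Building the arguments as a list, rather than one string, avoids quoting problems with paths that contain spaces.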

Related

Typo3 LTS9 PDF dimensions are not read and displayed in 0x0

I am having an issue with PDFs in the latest Typo3 release. If I add a PDF to the Image content element, I get this:
The file info looks like this:
Checking the Image Processing Test of Typo3, no errors are returned. PDF/AI also seems to be fine.
I tested several PDFs and AI files as well; they don't show dimensions either.
I suspect that the 'identify' command does not work from within Typo3, although it still returns perfect results from the shell.
Any idea where to look?
Multiple reasons are possible:
you just need to reimport the metadata (scheduler task)
your PDF is encoded in an unusual format (PDF offers more than one way to include the title image)
missing or wrong permissions:
maybe a different program is executed from PHP than from the command line.
maybe the file can't be accessed correctly by Ghostscript when it is started from the web server

Ghostscript - create a pdf with multiple identical pages and keep size down

I'm trying to use Ghostscript to create a PDF with multiple identical pages. I will later use this together with another multi-page PDF to stamp unique information onto every page.
Is it possible to use Ghostscript to create such a PDF and keep the size of the final file down? Maybe there is a flag I have not noticed that can do this better than the script below?
I have tried a regular merge command like the one below, but the size of the resulting PDF grows a lot: the original file of 2,061MB, merged into a 100-page PDF, results in a final size of 46,117MB.
"C:\Program Files\gs\gs9.20\bin\gswin64.exe"^
-dBATCH^
-dNOPAUSE^
-q^
-sDEVICE=pdfwrite^
-sOutputFile=outputpdf.pdf^
"inputpdf.pdf"^
"inputpdf.pdf"^
"inputpdf.pdf"(and so on 100 times)
You can construct such a file manually easily enough, which is much smaller, by reusing the page content stream for each page.
However, Ghostscript's pdfwrite device won't do that, not least because it can't. It cannot know in advance that the page it's about to receive is the same as the previous page. As a result it will create a new page content stream for each page, and create new content for it.
Note that resources (forms, patterns, colour spaces, image XObjects etc) which are used on each page will be reused on other pages.
However, it seems to me that you're already getting nearly a 5:1 ratio (2k * 100 pages = 200Kb, the final file is 46Kb) though in fairness a good bit of that 2Kb is 'stuff' around the page.
Without seeing your input file I can't really comment any further, but frankly I doubt it's possible to make it any smaller without hand-crafting the file. What's the problem with a 46Kb file anyway?
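To illustrate what "reusing the page content stream" means when hand-crafting a file, here is a minimal Python sketch that writes a PDF from scratch in which every page object points at the same content stream. It is not what pdfwrite produces and it does not read an existing PDF; the function name, the page size, and the single diagonal line used as content are all assumptions chosen to keep the example short.

```python
def build_shared_content_pdf(num_pages: int,
                             content: bytes = b"0 0 m 100 100 l S") -> bytes:
    """Return a PDF whose pages all reference one shared content stream.

    Object layout: 1 = catalog, 2 = page tree, 3 = the shared stream,
    4 .. 3 + num_pages = the page objects.
    """
    kids = " ".join(f"{4 + i} 0 R" for i in range(num_pages))
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        (f"<< /Type /Pages /Count {num_pages} /Kids [{kids}] "
         f"/MediaBox [0 0 612 792] >>").encode("ascii"),
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    # Every page is identical and points at object 3 for its content.
    page = b"<< /Type /Page /Parent 2 0 R /Resources << >> /Contents 3 0 R >>"
    objects.extend([page] * num_pages)

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte offset for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # mandatory free entry for object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    with open("shared.pdf", "wb") as fh:
        fh.write(build_shared_content_pdf(100))
```

Each extra page adds only a small page object and one cross-reference entry, while the stream itself is stored once, which is why a file built this way stays small no matter how heavy the page content is.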

Rotating PDF's less than 90 degrees

I'm working with a bunch of PDF files, some of which have been scanned at a bit of an angle. Adobe Acrobat allows me to rotate PDF files by 90 or 180 degrees. But is there a way to rotate a PDF just a few degrees - just enough to make it straighter?
I could perhaps take a screenshot, open it in Photoshop and rotate it, then somehow convert the Photoshop file to a PDF. However, that seems like a really clumsy process.
For whole pages, PDF only supports /Rotate values that are multiples of 90 degrees, because that is (of course) simple. What you need to do is rotate the contents, not the page. So you need to use something which can remake the PDF file for you.
You could use either Ghostscript or MuPDF to do this. Either will require some coding:
MuPDF will require coding in C,
Ghostscript will require you to do some PostScript programming.
Using Ghostscript you would need to define a BeginPage procedure which rotates the content by a small amount and moves the origin of the content slightly as well (because the rotation rotates around the origin, which is at the bottom left, not the centre).
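The origin shift mentioned above can be worked out as a single transformation matrix. The following Python sketch is only the arithmetic, not a BeginPage procedure; the function names are hypothetical, and the matrix is given in the [a b c d e f] order used by PDF and PostScript.

```python
import math


def rotation_about_centre(width: float, height: float, degrees: float):
    """Return (a, b, c, d, e, f) rotating counter-clockwise by `degrees`
    about the centre of a width x height page, origin at bottom left."""
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    cx, cy = width / 2.0, height / 2.0
    # A plain rotation pivots on the origin; e and f move the origin so
    # that the page centre ends up where it started.
    e = cx - cx * cos_t + cy * sin_t
    f = cy - cx * sin_t - cy * cos_t
    return (cos_t, sin_t, -sin_t, cos_t, e, f)


def apply_matrix(m, x: float, y: float):
    """Transform the point (x, y) with a PDF-style matrix."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


if __name__ == "__main__":
    m = rotation_about_centre(612, 792, 3)
    print("matrix:", [round(v, 4) for v in m])
    print("centre maps to:", apply_matrix(m, 306, 396))
```

The e and f values are the "moves the origin slightly" part: for small angles they are a few points, but without them the content drifts toward one corner of the page.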
Here is a short utility script for rotating pages (written in Perl). It converts each page of the input PDF to a PDF XObject Form, rotates the form, then outputs the rotated page.
#! /usr/bin/perl
use warnings; use strict;
use PDF::API2;
use Getopt::Long;
my $degrees = 3;
my $scale = 1.0;
my $x = 0;
my $y = 0;
GetOptions ("rotate=i" => \$degrees, "scale=f" => \$scale, "x=f" => \$x, "y=f" => \$y)
or die "usage: $0 IN_PDF OUT_PDF --rotate=DEG --scale=ALPHA --x=POINTS --y=POINTS";
my $infile = shift (@ARGV);
my $outfile = shift (@ARGV);
my $pdf_in = PDF::API2->open($infile);
my $pdf_out = PDF::API2->new;
foreach my $pagenum (1 .. $pdf_in->pages) {
my $page_in = $pdf_in->openpage($pagenum);
#
# create a new page
#
my $page_out = $pdf_out->page(0);
my @mbox = $page_in->get_mediabox;
$page_out->mediabox(@mbox);
my $xo = $pdf_out->importPageIntoForm($pdf_in, $pagenum);
#
# lay up the input page in the output page
# note that you can adjust the position and scale, if required
#
my $gfx = $page_out->gfx;
$gfx->rotate($degrees);
$gfx->formimage($xo, $x, $y, $scale);
}
$pdf_out->saveas($outfile);
You'll need to ensure the PDF::API2 and Getopt::Long modules are installed from CPAN.
By default the script rotates 3 degrees anticlockwise; this is configurable via the --rotate option.
There are also -x, -y and --scale options to allow fine adjustments of the positioning and scale of the output pages.
This question has also been asked on unix.stackexchange.com.
Another option is using LaTeX:
\documentclass{standalone}
\usepackage{graphicx}
\begin{document}
\includegraphics[angle=-1.5]{odd-scan}
\end{document}
In this case, I have the file odd-scan.pdf (a slightly rotated one page scan) in the same folder as the LaTeX file rotated.tex with the content above and then I run pdflatex rotated.tex. The output is a file rotated.pdf with the PDF rotated by 1.5 degrees clockwise.
(I assume a *nix-style environment. On Windows, you can follow these instructions in Cygwin, although I think you might have to build MuPDF from source there as it doesn't appear to be in the Cygwin repos. If you don't want to do that and you're okay with rasterizing the PDF, ImageMagick is in the Cygwin repos and can do the whole job if needed—see below.)
MuPDF's mutool utility can do this. Say you have a PDF file rotate_me.pdf and you want a version of it rotated by 20° clockwise written to a file rotated.pdf:
#!/bin/bash
mutool draw -R 20 -o rotated.pdf rotate_me.pdf
(mutool draw docs)
You can also rasterize the PDF using mutool convert, work with the image files, and then create a new PDF from them (this assumes rotate_me.pdf has between a hundred and a thousand pages—edit the %3d to your liking):
#!/bin/bash
# - for whatever reason convert's `rotate` is counter-clockwise
# - %nd is replaced with the page number
mutool convert -O rotate=-20 -o 'rotated_%3d.png' rotate_me.pdf
(mutool convert docs)
Once you've done whatever else you need to do to the image files and you're ready to turn them back into a PDF, you can use ImageMagick:
#!/bin/bash
magick convert $(ls | grep -P 'rotated_[0-9]{3}\.png') rotated_finished.pdf
(If you get an error saying the security policy for PDFs doesn't permit this, you may need to edit /etc/ImageMagick-7/policy.xml and comment out or remove the <policy domain="coder" rights="none" pattern="PDF" /> line. Be aware of this Ghostscript pre-v9.24 vulnerability which that security policy may be intended to mitigate. If you're working with files you made yourself, you should be safe here, but you may want to re-enable this policy afterwards depending on your needs and environment. If you're not working with files you made yourself, especially PDFs, be careful, whether you have a pre-v9.24 Ghostscript installed or not. PDF as a format is very complex and offers many different places to squirrel away maliciousness, and practically speaking you can never be 100% confident that the software you're using to work with it is perfectly hardened.)
ImageMagick can also rasterize PDFs on its own, although it's a bit more complicated. For example:
#!/bin/bash
magick convert -density 150 -rotate 20 rotate_me.pdf rotated.pdf
This might look similar to the mutool draw command, but the difference is that ImageMagick will rasterize the input PDF and then use the resulting images to make the output PDF, so you can use all the regular ImageMagick transformations with this command.
Anyway, -density is for DPI. It will default to 72 DPI if you don't pass that argument, which is likely to not look very good. Also, ImageMagick doesn't seem to be quite as smart as MuPDF about margins and things like that as far as PDFs go, so you may need to do more work with it than this to get reasonable output for your use case. If you do have access to both MuPDF and ImageMagick, I think doing the rasterization with MuPDF and then doing further work on the resulting images with ImageMagick tends to give the nicest results with the least work, but of course that may or may not be practical for you.
(magick convert docs)
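To get a feel for what a given -density value means in pixels, the conversion is simply points divided by 72, times the DPI. A small Python sketch follows; the helper name is hypothetical and the result is approximate, since each tool may round slightly differently.

```python
def raster_size(width_pt: float, height_pt: float, dpi: float = 72):
    """Approximate pixel size of a page rendered at `dpi`.

    PDF page dimensions are in points, and there are 72 points per inch.
    """
    return (round(width_pt * dpi / 72), round(height_pt * dpi / 72))


if __name__ == "__main__":
    # US Letter is 612 x 792 points.
    print(raster_size(612, 792))        # the 72 DPI default
    print(raster_size(612, 792, 150))   # with -density 150
```

At the 72 DPI default a Letter page is only 612 by 792 pixels, which is why text tends to look rough without a higher density.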
Rasterization has obvious disadvantages if your PDF is vector-based—increased file size, fixed resolution, loss of flexibility, etc. Also, even if your PDF is already storing raster graphics, you may lose text data or the like from it in the conversion. If the PDF is really horrible, though, sometimes this is the least painful approach. You can OCR it if needed once you've cleaned it up using Tesseract, often with superior results to whatever may have been done before you arrived.
This can be done with cpdf:
cpdf -rotate-contents 5 in.pdf -o out.pdf
(Rotates around the centre of the page by five degrees)
I had this problem at one time. I don't know how many pages you have.
What I did was print the pages that were off, use a paper cutter to square them up, and rescan them. Hope this helps.
And yes, I've tried to find some type of program to fix this, and I still have not found one.

PdfBox write Page to Image scales Signatures

When I create an image from a signed PDF page, the resulting image shows the signatures, but they are not displayed correctly.
For example, the original contains two signatures next to each other in the bottom section.
In the resulting image the signatures look like they have been scaled up and are overlapping.
Furthermore, there's a signature in the top right corner. This signature looks scaled up in the resulting image and is cut off to the right. What is happening here? What am I doing wrong? I'm pretty new to working with PDFs on this level.
Hope that makes sense. Please see below for the differences (I've cut out other content).
Here's the code I'm using:
List<PDPage> pages = inputDocument.getDocumentCatalog().getAllPages();
PDPage page = pages.get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, PDF_RESOLUTION);
String fileName = "converted_image_" + (i + 1);
ImageIOUtil.writeImage(image, "png", fileName, BufferedImage.TYPE_INT_RGB, PDF_RESOLUTION);
here's the original
and now the distorted version
As suggested by Tilman Hausherr, I was using the current 1.8.x stable release which has problems with annotations appearances. This led to the seen behaviour. Testing with the current 2.0 SNAPSHOT solves this problem.
Now we are eagerly awaiting the release of 2.x :)
From what I've seen, they totally reworked how creating images from a PDF(Page) should be done so I'm not sure about the probability of a backport.
Hope that helps for anyone else coming across this.

When using pdfpages in LaTeX, how to avoid page breaks before the first page ?

I am creating a large LaTeX document, and my appendix has reproductions of several booklets that I have as PDFs. I am trying to create a section header and then include the pages at a slightly lower scale. For example:
\section{Booklet about Yada Yada Yada}
\includepdf[pages={-}, frame=true, scale=0.8]{booklet_yadayada.pdf}
However, pdfpages does two annoying things. First, it devotes one output document page to each included document page. I can live with that as I am using 80% scale. The main problem, however, is that the first page is also a new page, so I have a page with just a section title, and then a separate page with the booklet.
Is there some way to get pdfpages to be a little smarter here?
\includepdf uses \includegraphics internally, so something like
\section{Foo}
\fbox{\includegraphics[page=1,scale=0.8]{foo.pdf}}
would include the page without starting a new one, although it only does one page at a time.
For me the following worked just fine:
\includepdf[pages=1,pagecommand=\section{Section Heading}]{testpdf}
\includepdf[pages=2-,pagecommand={}]{testpdf}
I tried this solution too, but \includepdf keeps the advantage of outputting the file over the margins (the output is centered relative to the edges of the page).
So I opened pdfpages.sty and searched for the \newpage command. I deleted the first occurrence (line 326), just to try, and after saving and compiling again, there was no page break anymore.
Use the minipage environment:
\chapter*{Sujet du stage}
%\fbox{
\begin{minipage}{\textwidth}
\includepdf[scale=0.8]{../sujet-stage/main.pdf}
\end{minipage}
It doesn't add any extra page and it works with includepdf.
Thanks for all the answers - I couldn't for the life of me figure out what logic \includepdf uses to insert blank pages; the trick with including the first page via \includegraphics solved most (but not all) of those problems; so here are some notes:
First, out of curiosity, I have also tried to use only \includepdf, but split in two parts:
\includepdf[pages=1]{MYINCLDOC.pdf}
\includepdf[pages=2-last]{MYINCLDOC.pdf}
... unfortunately, this has the same problem as the question in OP.
Since @WASE's answer, there are now multiple \newpages in the source (pdfpages.sty). I tried reading the source, but I found it quite difficult; so I tried temporarily setting \newpage to \relax only for \includepdf - and that puts all pages in the document on top of each other; so it is probably not a good idea to get rid of \newpage blindly.
Just \includegraphics[page=1,scale=0.8]{foo.pdf} works - but (as @WASE also notes) it is aligned at the top-left corner of the page body, which is to say inside the margins; for a full page we'd want the pdf inclusion overlaid over the whole page, margins included.
This page: graphics - How do I add an image in the upper, left-hand corner using TikZ and graphicx - TeX - LaTeX points to several possibilities for positioning on page over the margins; but for me, the best solution for a full page PDF inclusion is to use package tikz to center it to the page:
\begin{tikzpicture}[remember picture,overlay]
\node at (current page.center) {\includegraphics[page=1]{MYINCLDOC.pdf}};
\end{tikzpicture}
\includepdf[pages=2-last]{MYINCLDOC.pdf}
After this is done, as a bonus, I have also experienced:
Proper targets of PDF bookmarks (going to the right page when clicked)
If you use package pax, the data seems to be included also for the \includegraphics standalone first page, so no difference there
If you have a twoside document - pdfpages, with the above split of the first page in \includegraphics, will now (seemingly) correctly insert the equivalent of \cleardoublepages between pdfs that are included back to back (so I don't have to insert such a command manually).
Hope this helps someone,
Cheers!
I'm a little late, but the following solution worked for me:
\includepdf[pages={-},angle=90, scale=0.7]{lorem-ipsum.pdf}
All pages are imported, scaled and rotated by 90 degrees.
Works with Texmaker 5.0.4