PdfBox write Page to Image scales Signatures - pdf

when trying to create an Image from a signed PDF Page, the resulting image shows the signatures but the signatures are not displayed correctly.
For example, the original contains two signatures next to each other in the bottom section.
In the resulting image the signatures look like they have been scaled up and are overlapping.
Furthermore, there's a signature in the top right corner. This signature looks scaled up in the resulting image and is cut off to the right. What is happening here? What am I doing wrong? I'm pretty new to working with PDFs on this level.
Hope that makes sense. Please see below for the differences (I've cut out other content).
Here's the code I'm using:
List<PDPage> pages = inputDocument.getDocumentCatalog().getAllPages();
PDPage page = pages.get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, PDF_RESOLUTION);
String fileName = "converted_image_" + (i + 1);
ImageIOUtil.writeImage(image, "png", fileName, BufferedImage.TYPE_INT_RGB, PDF_RESOLUTION);
here's the original
and now the distorted version

As suggested by Tilman Hausherr, I was using the current 1.8.x stable release which has problems with annotations appearances. This led to the seen behaviour. Testing with the current 2.0 SNAPSHOT solves this problem.
Now we are eagerly awaiting the release of 2.x :)
From what I've seen, they totally reworked how creating images from a PDF(Page) should be done so I'm not sure about the probability of a backport.
Hope that helps for anyone else coming across this.

Related

Ghostscript adds whitespace no matter what bounding box I use

I'm trying to convert a page of a PDF to an image. I'm successful with most PDF's I've tried with but this one in particular always ends up with a lot of whitespace on one side or strange scaling.
I've tried every combination of every fixed media, fixed resolution, fit page, use crop/bleed/trim/art box, etc. parameter to fix the issue but nothing does it. The best I get is the right content size but offset and chopped off.
Here's what it should look like, according to every PDF reader I've tried:
Here's a link to the PDF (8 MB) for testing.
https://drive.google.com/file/d/1ErS3KxADb1YAdzM7FG7T5dO8QnW4l1AQ/view?usp=sharing
Edit 1:
Here's what it looks like using just -dUseCropBox without a cropbox override:
I'm using Ghostscript.NET with very simple code. I create a rasterizer, call Ope(PDF file, ghostscript dll in bytes), then GetPage(DPI, page number). To use other flags I add a custom switch to the rasterizer before calling open
using(var rasterizer = new GhostscriptRasterizer()) {
//rasterizer.CustomSwitches.Add("-dFIXEDMEDIA");
//rasterizer.CustomSwitches.Add("-dFIXEDRESOLUTION");
//rasterizer.CustomSwitches.Add("-dPSFitPage");
//rasterizer.CustomSwitches.Add("-dFitPage");
//rasterizer.CustomSwitches.Add("-dPDFFitPage");
//rasterizer.CustomSwitches.Add("-dUseCropBox");
//rasterizer.CustomSwitches.Add("-dPrinted");
//rasterizer.CustomSwitches.Add("-dUseBleedBox");
//rasterizer.CustomSwitches.Add("-dUseTrimBox");
//rasterizer.CustomSwitches.Add("-dUseArtBox");
//rasterizer.CustomSwitches.Add("-sPAPERSIZE=letter");
//rasterizer.CustomSwitches.Add("-dORIENT1=true");
//etc
rasterizer.Open(pdfFilePath, ghostscriptDLL);
img = rasterizer.GetPage(dpi, pageNumber);
img.Save(pageFilePath, imageFormat);
}
I'll try again with the latest version of just ghostscript (no .NET) and see if that makes a difference.
Edit 2:
Using just gswin64c version 9.55.0 and -dUseCropBox works as KenS said. Since I don't need Ghostscript.NET to do that, that's a good resolution.
Using just gswin64c version 9.55.0 and -dUseCropBox works as KenS said. Since I don't need Ghostscript.NET to do that, that's a good resolution.

Java- stretch pdf pages content

How can I make the page larger in the way that the text/images will stretch
according to the new size?
I only found ways to scale down but not scale up... any idea?
Thanks in advance!!!
There are two conceptional options:
For each page enlarge MediaBox and CropBox, prepend current transformation matrix scaling operation to the page content stream or array of streams, and update annotation positions and sizes.
For each page set the property UserUnit to a value > 1.
The second option can e.g. be implemented like this using PDFBox and a stretch factor of 1.7:
PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getPages()) {
page.getCOSObject().setFloat("UserUnit", 1.7f);
}
document.save(TARGET);
(The second option obviously is much easier to implement than the first one. Feature incomplete PDF viewers might ignore this value, though. If you need to support such incomplete viewers, you probably have to go for the first option.)

How to detect if a pdf is text searchable or non text searchable?

i have a set of pdfs, from which i want to process( VB.NET) only those which are non text searchable, can you please tell me how to go about this?
Generally speaking, the way to do this is open up each page and rip the content stream and see if any text operators are executed that place text on the page.
Let me explain what that means - PDF content is a small RPN language that contains operations that mark the page in some way. For example, you might see something like this:
BT 72 400 Td /F0 12 Tf (Throatwarbler Mangrove) Tj ET
Which means:
Begin a text area
Set the position of the text baseline to (72, 400) in PDF units
Set the font to a resource named F0 from the current page's font resource dictionary
Draw the text "Throatwarbler Mangrove"
End a text area
So you can try short cuts
does my page's resource dictionary contain any fonts?
This will fail in some cases because some PDF generation tools put fonts into the resource
dictionary and don't use them (false positive). It will also fail if the page content contains a Form XObject which contains text (false negative).
does my page's content stream have BT/ET opertors?
This will get you closer, but will fail if there is not content in them (false positive) or if they're not present, but there's a Form XObject which contains text (false negative).
So really, the thing to do is to execute the entire page's content stream, including recursing on all XObject to look for text operators.
Now, there's another approach that you can take using my Atalasoft's software (disclaimer, I work for Atalasoft and have written most of the PDF handling code, I also worked on Acrobat versions 1-4). Instead of asking, does this page contain any text, you can ask "does this page contain only a single image?"
bool allPagesImages = true;
using (Document doc = new Document(inputStream))
{
foreach (Page p in doc.Pages)
{
if (!p.SingleImageOnly)
{
allPagesImages = false;
break;
}
}
}
Which will leave allPagesImages with a pretty decent indication that each page is all images, which if you're looking to OCR is the non-searchable documents, is probably what you really want.
The down side is that this will be a very high price for a single predicate, but it also gets you a PDF rasterizer and the ability to extract the images directly out of the file.
Now, I have no doubt that a solid engineer could work their way through the PDF spec and write some code to extend iTextPdfSharp to do this task I think that if I sat down with it, I might be able to write that predicate in a few days, but I already know most of the PDF spec. So it might take you more like two weeks to a month. So your choice.
I think this option could be your consideration, though I haven't tested the code yet but I think it can be done by read the properties for each PDF files that you want to proceed.
You might check this link :
http://www.codeguru.com/columns/vb/manipulating-pdf-files-with-itextsharp-and-vb.net-2012.htm
You have to read the producer properties right after you proceeded it. That's just only example. But my advice please include your code here so we can give a try to help you. Bless you

Exception releasing context after CGContextDrawPDFPage only for some pages

I've inherited some iOS code which opens a source PDF and creates a CGContextRef to which we draw a single page from the source document. The problem is that there are certain pages with one document, our help document unfortunately, which causes this code to crash.
The end goal is to cache 8 pages at a time to improve the user experience.
CFMutableDataRef consumerData = CFDataCreateMutable(kCFAllocatorDefault, 0);
CGDataConsumerRef contextConsumer = CGDataConsumerCreateWithCFData(consumerData);
CGPDFPageRef page = CGPDFDocumentGetPage(sourceDocument, pageNumber);
const CGRect mediaBox = CGPDFPageGetBoxRect(page, kCGPDFCropBox);
CGContextRef pdfOutContext = CGPDFContextCreate(contextConsumer, &mediaBox, NULL);
CGContextDrawPDFPage(pdfOutContext, page); //If I comment out this line, no exception occurs
CGPDFPageRelease(page);
CGPDFContextEndPage(pdfOutContext);
CGPDFContextClose(pdfOutContext); //EXC_BAD_ACCESS
CGContextRelease(pdfOutContext);
(This is a simplified version of the code, the original opens a source document and a page, checks for null on page and ctx, and then writes ctx to a new document.)
There's no issue if, instead of drawing to a PDF context, I draw to a UIGraphics context created thusly:
CGContextRef graphicsContext = UIGraphicsGetCurrentContext();
There is also no issue when I draw other things to the PDF context.
Plus, this works for 99% of the documents and for 75% of the pages within the offending document. The offending document renders correctly in several PDF viewers.
So I don't think there's a memory issue on my part. I'm fairly confident that there is something within the CGPDF code that is buggy (and I say that only after spending a week trying to resolve this issue).
My question is, is there some other way I should/could be doing this?
There is enough evidence that this is a bug introduced in iOS5 for us to work around the issue rather than trying to solve it. So we ended up removing the caching. It's only marginally slower on an iPad 1 than it was with caching for a 200 page document, so the product manager decided that was acceptable (when compared to simply crashing).
We also tried writing the doc to an image and displaying that, but it wasn't any faster and produced a result of inferior quality (especially when zooming).
EDIT
Submitted the bug to Apple. It turns out to have already been reported. The original bug is 10428451 and is being looked at by their engineers.
You're missing CGPDFContextBeginPage.

When using pdfpages in LaTeX, how to avoid page breaks before the first page ?

I am creating a large LaTeX document, and my appendix has reproductions of several booklets that I have as PDFs. I am trying to create a section header and then include the pages at a slightly lower scale. For example:
\section{Booklet about Yada Yada Yada}
\includepdf[pages={-}, frame=true, scale=0.8]{booklet_yadayada.pdf}
However, pdfpagex does two annoying things. First, it devotes one output document page for included document page. I can live with that as I am using 80% scale. The main problem, however, is that the first page is also a new page, so I have a page with just a section title, and then a separate page with the booklet.
Is there some way to get pdfpages to be a little smarter here?
\includepdf uses \includegraphics internally, so something like
\section{Foo}
\fbox{\includegraphics[page=1,scale=0.8]{foo.pdf}}
would include the page without starting a new one, although it only does one page at a time.
For me the following worked just fine:
\includepdf[pages=1,pagecommand=\section{Section Heading}]{testpdf}
\includepdf[pages=2-,pagecommand={}]{testpdf}
I tried this solution too, but \includepdf keeps the advantage of outputting the file over the margin (the output is centered from the edges of the page).
So I openned pdfpages.sty, and I searched for \newpage command. I deleted the first occurance (line 326), just to try, and after saving then compiling again, there were no page break anymore.
Use the minipage environement :
\chapter*{Sujet du stage}
%\fbox{
\begin{minipage}{\textwidth}
\includepdf[scale=0.8]{../sujet-stage/main.pdf}
\end{minipage}
It doesn't add any extra page and it works with includepdf.
Thanks for all the answers - I couldn't for the life of me figure out what logic \includepdf uses to insert blank pages; the trick with including the first page via \includegraphics solved most (but not all) of those problems; so here are some notes:
First, out of curiosity, I have also tried to use only \includepdf, but split in two parts:
\includepdf[pages=1]{MYINCLDOC.pdf}
\includepdf[pages=2-last]{MYINCLDOC.pdf}
... unfortunately, this has the same problem as the question in OP.
Since #WASE's answer, there are now multiple \newpages in the source (pdfpages.sty). I tried reading the source, but I found it quite difficult; so I tried temporarily setting \newpage to \relax only for \includepdf - and that puts all pages in the document on top of each other; so probably not a good idea to get rid of \newpage blindly.
Just \includegraphics[page=1,scale=0.8]{foo.pdf} works - but (as #WASE also note) it is aligned at the top-left corner of the page body, which is to say inside the margins; for a full page we'd want the pdf inclusion overlaid over the whole page, margins included.
This page: graphics - How do I add an image in the upper, left-hand corner using TikZ and graphicx - TeX - LaTeX points to several possibilities for positioning on page over the margins; but for me, the best solution for a full page PDF inclusion is to use package tikz to center it to the page:
\begin{tikzpicture}[remember picture,overlay]
\node at (current page.center) {\includegraphics[page=1]{MYINCLDOC.pdf}};
\end{tikzpicture}
\includepdf[pages=2-last]{MYINCLDOC.pdf}
After this is done, as a bonus, I have also experienced:
Proper targets of PDF bookmarks (going to the right page when clicked)
If you use package pax, the data seems to be included also for the \includegraphics standalone first page, so no difference there
If you have a twoside document - pdfpages, with the above split of the first page in \includegraphics, will now (seemingly) correctly insert the equivalent of \cleardoublepages between pdfs that are included back to back (so I don't have to insert such a command manually).
Hope this helps someone,
Cheers!
I'm a little late, but the following solution worked for me:
\includepdf[pages={-},angle=90, scale=0.7]{lorem-ipsum.pdf}
All pages are imported, scaled and rotated by 90 degrees.
Works with Texmaker 5.0.4