Java- stretch pdf pages content - pdf

How can I make the page larger in the way that the text/images will stretch
according to the new size?
I only found ways to scale down but not scale up... any idea?
Thanks in advance!!!

There are two conceptional options:
For each page enlarge MediaBox and CropBox, prepend current transformation matrix scaling operation to the page content stream or array of streams, and update annotation positions and sizes.
For each page set the property UserUnit to a value > 1.
The second option can e.g. be implemented like this using PDFBox and a stretch factor of 1.7:
PDDocument document = PDDocument.load(SOURCE);
for (PDPage page : document.getPages()) {
page.getCOSObject().setFloat("UserUnit", 1.7f);
}
document.save(TARGET);
(The second option obviously is much easier to implement than the first one. Feature incomplete PDF viewers might ignore this value, though. If you need to support such incomplete viewers, you probably have to go for the first option.)

Related

Ghostscript - create a pdf with multiple identical pages and keep size down

Im trying to use Ghostscript to create a PDF with multiple identical pages. I will later use this together with another multipaged PDF to stamp on unique information onto every page.
Is it possible to use Ghostscript to create such a PDF and keep the size of the final file down? Maby there is a flag that i have not noticed that can do this in a better way than the script below?
I have tried to use a regular merge command like the one below but the size of the resulting PDF grows alot and the original file size of 2,061MB merged to a 100page pdf results in a final size of 46,117MB.
"C:\Program Files\gs\gs9.20\bin\gswin64.exe"^
-dBATCH^
-dNOPAUSE^
-q^
-sDEVICE=pdfwrite^
-sOutputFile=outputpdf.pdf^
"inputpdf.pdf"^
"inputpdf.pdf"^
"inputpdf.pdf"(and so on 100 times)
You can construct such a file manually easily enough, which is much smaller, by reusing the page content stream for each page.
However Ghostscript's pdfwrite device won;t do that, not least because it can't. It cannot know in advance that the page its about to receive is the same as the previous page. As a result it will create a new page content stream for each page, and create new content for it.
Note that resources (forms, patterns, colour spaces, image XObjects etc) which are used on each page will be reused on other pages.
However, it seems to me that you're already getting nearly a 5:1 ratio (2k * 100 pages = 200Kb, the final file is 46Kb) though in fairness a good bit of that 2Kb is 'stuff' around the page.
Without seeing your input file I can't really comment any further, but frankly I doubt its possible to make it any smaller without hand-crafting the file. What's the problem with a 46Kb file anyway ?

PdfBox write Page to Image scales Signatures

when trying to create an Image from a signed PDF Page, the resulting image shows the signatures but the signatures are not displayed correctly.
For example, the original contains two signatures next to each other in the bottom section.
In the resulting image the signatures look like they have been scaled up and are overlapping.
Furthermore, there's a signature in the top right corner. This signature looks scaled up in the resulting image and is cut off to the right. What is happening here? What am I doing wrong? I'm pretty new to working with PDFs on this level.
Hope that makes sense. Please see below for the differences (I've cut out other content).
Here's the code I'm using:
List<PDPage> pages = inputDocument.getDocumentCatalog().getAllPages();
PDPage page = pages.get(0);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, PDF_RESOLUTION);
String fileName = "converted_image_" + (i + 1);
ImageIOUtil.writeImage(image, "png", fileName, BufferedImage.TYPE_INT_RGB, PDF_RESOLUTION);
here's the original
and now the distorted version
As suggested by Tilman Hausherr, I was using the current 1.8.x stable release which has problems with annotations appearances. This led to the seen behaviour. Testing with the current 2.0 SNAPSHOT solves this problem.
Now we are eagerly awaiting the release of 2.x :)
From what I've seen, they totally reworked how creating images from a PDF(Page) should be done so I'm not sure about the probability of a backport.
Hope that helps for anyone else coming across this.

How to detect if a pdf is text searchable or non text searchable?

i have a set of pdfs, from which i want to process( VB.NET) only those which are non text searchable, can you please tell me how to go about this?
Generally speaking, the way to do this is open up each page and rip the content stream and see if any text operators are executed that place text on the page.
Let me explain what that means - PDF content is a small RPN language that contains operations that mark the page in some way. For example, you might see something like this:
BT 72 400 Td /F0 12 Tf (Throatwarbler Mangrove) Tj ET
Which means:
Begin a text area
Set the position of the text baseline to (72, 400) in PDF units
Set the font to a resource named F0 from the current page's font resource dictionary
Draw the text "Throatwarbler Mangrove"
End a text area
So you can try short cuts
does my page's resource dictionary contain any fonts?
This will fail in some cases because some PDF generation tools put fonts into the resource
dictionary and don't use them (false positive). It will also fail if the page content contains a Form XObject which contains text (false negative).
does my page's content stream have BT/ET opertors?
This will get you closer, but will fail if there is not content in them (false positive) or if they're not present, but there's a Form XObject which contains text (false negative).
So really, the thing to do is to execute the entire page's content stream, including recursing on all XObject to look for text operators.
Now, there's another approach that you can take using my Atalasoft's software (disclaimer, I work for Atalasoft and have written most of the PDF handling code, I also worked on Acrobat versions 1-4). Instead of asking, does this page contain any text, you can ask "does this page contain only a single image?"
bool allPagesImages = true;
using (Document doc = new Document(inputStream))
{
foreach (Page p in doc.Pages)
{
if (!p.SingleImageOnly)
{
allPagesImages = false;
break;
}
}
}
Which will leave allPagesImages with a pretty decent indication that each page is all images, which if you're looking to OCR is the non-searchable documents, is probably what you really want.
The down side is that this will be a very high price for a single predicate, but it also gets you a PDF rasterizer and the ability to extract the images directly out of the file.
Now, I have no doubt that a solid engineer could work their way through the PDF spec and write some code to extend iTextPdfSharp to do this task I think that if I sat down with it, I might be able to write that predicate in a few days, but I already know most of the PDF spec. So it might take you more like two weeks to a month. So your choice.
I think this option could be your consideration, though I haven't tested the code yet but I think it can be done by read the properties for each PDF files that you want to proceed.
You might check this link :
http://www.codeguru.com/columns/vb/manipulating-pdf-files-with-itextsharp-and-vb.net-2012.htm
You have to read the producer properties right after you proceeded it. That's just only example. But my advice please include your code here so we can give a try to help you. Bless you

UILables, Text Flow and Layouts

Consider the following, I have paragraph data being sent to a view which needs to be placed over a background image, which has at the top and the bottom, fixed elements (fig1)
Fig1.
My thought was to split this into 4 labels (Fig1.example2) my question here is how I can get the text to flow through labels 1 - 4 given that label 1,2 & 3 ar of fixed height. I assumed here that label 3 should be populated prior to 4 hence the layout in the attached diagram.
Can someone suggest the best way of doing this with maybe an example?
Thanks
Wish I could help more, but I think I can at least point you in the right direction.
First, your idea seems very possible, but would involve lots of calculations of text size that would be ugly and might not produce ideal results. The way I see it working is a binary search of testing portions of your string with sizeWithFont: until you can get the best guess for what the label will fit into that size and still look "right". Then you have to actually break up the string and track it in pieces... just seems wrong.
In iOS 6 (unfortunately doesn't apply to you right now but I'll post it as a potential benefit to others), you could probably use one UILabel and an NSAttributed string. There would be a couple of options to go with here, (I haven't done it so I'm not sure which would be the best) but it seems that if you could format the page with html, you can initialize the attributed string that way.
From the docs:
You can create an attributed string from HTML data using the initialization methods initWithHTML:documentAttributes: and initWithHTML:baseURL:documentAttributes:. The methods return text attributes defined by the HTML as the attributes of the string. They return document-level attributes defined by the HTML, such as paper and margin sizes, by reference to an NSDictionary object, as described in “RTF Files and Attributed Strings.” The methods translate HTML as well as possible into structures of the Cocoa text system, but the Application Kit does not provide complete, true rendering of arbitrary HTML.
An alternative here would be to just use the available attributes, setting line indents and such according to the image size. I haven't worked with attributed strings at this level, so I the best reference would be the developer videos and the programming guide for NSAttributedString. https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/AttributedStrings/AttributedStrings.html#//apple_ref/doc/uid/10000036-BBCCGDBG
For lesser versions of iOS, you'd probably be better off becoming familiar with CoreText. In the end you'll be rewarded with a better looking result, reusability/flexibility, the list goes on. For that, I would start with the CoreText programming guide: https://developer.apple.com/library/mac/#documentation/StringsTextFonts/Conceptual/CoreText_Programming/Introduction/Introduction.html
Maybe someone else can provide some sample code, but I think just looking through the docs will give you less of a headache than trying to calculate 4 labels like that.
EDIT:
I changed the link for CoreText
You have to go with CoreText: create your AttributedString and a CTFramesetter with it.
Then you can get a CTFrame for each of your textboxes and draw it in your graphics context.
https://developer.apple.com/library/mac/#documentation/Carbon/Reference/CTFramesetterRef/Reference/reference.html#//apple_ref/doc/uid/TP40005105
You can also use a UIWebView

When using pdfpages in LaTeX, how to avoid page breaks before the first page ?

I am creating a large LaTeX document, and my appendix has reproductions of several booklets that I have as PDFs. I am trying to create a section header and then include the pages at a slightly lower scale. For example:
\section{Booklet about Yada Yada Yada}
\includepdf[pages={-}, frame=true, scale=0.8]{booklet_yadayada.pdf}
However, pdfpagex does two annoying things. First, it devotes one output document page for included document page. I can live with that as I am using 80% scale. The main problem, however, is that the first page is also a new page, so I have a page with just a section title, and then a separate page with the booklet.
Is there some way to get pdfpages to be a little smarter here?
\includepdf uses \includegraphics internally, so something like
\section{Foo}
\fbox{\includegraphics[page=1,scale=0.8]{foo.pdf}}
would include the page without starting a new one, although it only does one page at a time.
For me the following worked just fine:
\includepdf[pages=1,pagecommand=\section{Section Heading}]{testpdf}
\includepdf[pages=2-,pagecommand={}]{testpdf}
I tried this solution too, but \includepdf keeps the advantage of outputting the file over the margin (the output is centered from the edges of the page).
So I openned pdfpages.sty, and I searched for \newpage command. I deleted the first occurance (line 326), just to try, and after saving then compiling again, there were no page break anymore.
Use the minipage environement :
\chapter*{Sujet du stage}
%\fbox{
\begin{minipage}{\textwidth}
\includepdf[scale=0.8]{../sujet-stage/main.pdf}
\end{minipage}
It doesn't add any extra page and it works with includepdf.
Thanks for all the answers - I couldn't for the life of me figure out what logic \includepdf uses to insert blank pages; the trick with including the first page via \includegraphics solved most (but not all) of those problems; so here are some notes:
First, out of curiosity, I have also tried to use only \includepdf, but split in two parts:
\includepdf[pages=1]{MYINCLDOC.pdf}
\includepdf[pages=2-last]{MYINCLDOC.pdf}
... unfortunately, this has the same problem as the question in OP.
Since #WASE's answer, there are now multiple \newpages in the source (pdfpages.sty). I tried reading the source, but I found it quite difficult; so I tried temporarily setting \newpage to \relax only for \includepdf - and that puts all pages in the document on top of each other; so probably not a good idea to get rid of \newpage blindly.
Just \includegraphics[page=1,scale=0.8]{foo.pdf} works - but (as #WASE also note) it is aligned at the top-left corner of the page body, which is to say inside the margins; for a full page we'd want the pdf inclusion overlaid over the whole page, margins included.
This page: graphics - How do I add an image in the upper, left-hand corner using TikZ and graphicx - TeX - LaTeX points to several possibilities for positioning on page over the margins; but for me, the best solution for a full page PDF inclusion is to use package tikz to center it to the page:
\begin{tikzpicture}[remember picture,overlay]
\node at (current page.center) {\includegraphics[page=1]{MYINCLDOC.pdf}};
\end{tikzpicture}
\includepdf[pages=2-last]{MYINCLDOC.pdf}
After this is done, as a bonus, I have also experienced:
Proper targets of PDF bookmarks (going to the right page when clicked)
If you use package pax, the data seems to be included also for the \includegraphics standalone first page, so no difference there
If you have a twoside document - pdfpages, with the above split of the first page in \includegraphics, will now (seemingly) correctly insert the equivalent of \cleardoublepages between pdfs that are included back to back (so I don't have to insert such a command manually).
Hope this helps someone,
Cheers!
I'm a little late, but the following solution worked for me:
\includepdf[pages={-},angle=90, scale=0.7]{lorem-ipsum.pdf}
All pages are imported, scaled and rotated by 90 degrees.
Works with Texmaker 5.0.4