How to set Page Scaling option in Apache PDFBox - pdfbox

In my app, I am using Apache PDFBox to render a PDF file and to silently print that file.
PDFBox works fine for rendering the PDF, but I run into an issue when scaling comes into the picture.
I want to set the page scaling before printing the PDF.
In Acrobat Reader's print dialog, there are four options for printing the PDF:
1> Fit
2> Actual Size
3> Shrink oversized pages
4> Custom Scale
I want to set page scaling to Actual Size. How can I do that using Apache PDFBox?

I had this same problem and I felt like it was way too difficult to do this one simple thing. Even though it's not ideal, I eventually settled for the following - not even using the PDFBox printing.
The following code converts the pages to images one at a time and uses java2d to resize and print them out.
PDDocument pdfdoc = PDDocument.load(pdfPane.pdfFile);
@SuppressWarnings("unchecked")
final List<PDPage> pdfPages = pdfdoc.getDocumentCatalog().getAllPages();
PrinterJob pjob = PrinterJob.getPrinterJob();
pjob.setJobName(pdfPane.pdfFile.getName());
pjob.setPrintable(new Printable()
{
@Override
public int print( Graphics g, PageFormat pf, int page ) throws PrinterException
{
// page is zero-based, so page == pdfPages.size() is already past the last page
if (page >= pdfPages.size())
return NO_SUCH_PAGE;
try
{
g.drawImage(pdfPages.get(page).convertToImage()
,(int)pf.getImageableX()
,(int)pf.getImageableY()
,(int)pf.getImageableWidth()
,(int)pf.getImageableHeight()
,null);
}
catch (IOException e)
{
LoggerUtil.error(e);
}
return PAGE_EXISTS;
}
});
pjob.print();
pdfdoc.close();
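The snippet above stretches each page image to the full imageable area, which matches neither "Fit" nor "Actual Size" and can distort the aspect ratio. Below is a minimal sketch of the arithmetic behind those two options, using only the JDK. The class and method names (`PrintScalingSketch`, `actualSize`, `fit`) are made up for illustration, and the page size is assumed to be given in points (1/72 inch), the unit Java 2D printing works in. If upgrading is an option, PDFBox 2.x also ships `org.apache.pdfbox.printing.PDFPrintable`, which accepts a `Scaling` constant such as `Scaling.ACTUAL_SIZE`.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: computes where to draw a page
// inside the printable (imageable) area for two scaling modes.
final class PrintScalingSketch {

    // "Actual Size": keep the page at 100 % and center it. Anything that does
    // not fit inside the imageable area ends up outside of it (clipped).
    static Rectangle2D.Double actualSize(double pageW, double pageH, Rectangle2D area) {
        double x = area.getX() + (area.getWidth() - pageW) / 2;
        double y = area.getY() + (area.getHeight() - pageH) / 2;
        return new Rectangle2D.Double(x, y, pageW, pageH);
    }

    // "Fit": one uniform scale factor so the whole page fits, aspect ratio kept.
    static Rectangle2D.Double fit(double pageW, double pageH, Rectangle2D area) {
        double scale = Math.min(area.getWidth() / pageW, area.getHeight() / pageH);
        double w = pageW * scale;
        double h = pageH * scale;
        double x = area.getX() + (area.getWidth() - w) / 2;
        double y = area.getY() + (area.getHeight() - h) / 2;
        return new Rectangle2D.Double(x, y, w, h);
    }

    public static void main(String[] args) {
        // Assumed example: US Letter paper with quarter-inch margins.
        Rectangle2D area = new Rectangle2D.Double(18, 18, 576, 756);
        System.out.println(actualSize(612, 792, area));
        System.out.println(fit(612, 792, area));
    }
}
```

Inside `print(...)`, the x, y, width and height of the returned rectangle would take the place of the four `pf.getImageable...()` arguments passed to `g.drawImage`.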

Related

PDFBox getText not returning all of the visible text

I am using PDFBox to extract text from my PDF document. It retrieves the text, but not all of it (specifically, seems like title/header and footer texts are missing). The parts that are missing are not images and are extracted when using text view in foxit reader.
I am using version 1.8.12 and made a test case with 2.0.2 just to see if it would return more of the content.
This is the code I used for 2.0.2:
public static void main(String[] args) {
    File file = new File("D:\\file.pdf");
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSuppressDuplicateOverlappingText(false);
        // print the extracted text so it can be compared with the viewer
        System.out.println(stripper.getText(doc));
    } catch (Exception e) {
        System.out.println("Exception occurred: " + e);
    }
}
Now I wonder are there any settings I missed? Is PDFBox failing because text is on top of some decorative elements (rectangle under text)?
Thanks
EDIT: link to file in question
As discussed in the comments, the text wasn't missing, but at the "wrong" position. By default, PDFBox text extraction extracts the characters as they come in the content stream, but they don't always come in a "natural" way. PDF files are created by software, not by humans.
An alternative is to use the sort option:
stripper.setSortByPosition(true);
However, as mkl pointed out, if the text is in two columns, you won't like the result either.
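To illustrate the difference between content stream order and position order, here is a small self-contained sketch. It is not PDFBox's actual algorithm; the `Fragment` class, its coordinates (with y measured from the top of the page) and the sample texts are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: NOT PDFBox's algorithm, just the idea behind
// "content stream order" versus "sorted by position".
final class SortByPositionSketch {

    // Hypothetical text fragment; y is measured from the top of the page.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments in the order given (the "content stream" order).
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Sorts top to bottom, then left to right, before joining.
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return join(copy);
    }

    public static void main(String[] args) {
        // The producing software happened to write the footer first.
        List<Fragment> stream = new ArrayList<>();
        stream.add(new Fragment("Footer", 50, 780));
        stream.add(new Fragment("Title", 50, 40));
        stream.add(new Fragment("Body", 50, 120));
        System.out.println(join(stream));             // Footer Title Body
        System.out.println(sortedByPosition(stream)); // Title Body Footer
    }
}
```

The same sort also shows the two-column problem: fragments of a left and a right column that share the same y end up interleaved line by line instead of being read column by column.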

How to add watermark to a landscape file using pdfbox

I'm using PDFBox 1.8.11 and FOP to add watermarks to PDFs. It works nicely for most input PDF files.
However, I get a problem when the file is in landscape: the watermark is rotated 90 degrees to the right.
I had a similar problem with a visible signature, which is fixed thanks to the solution in "sign landscape file". Any idea how to make the watermark rotation work? Thanks in advance!
The original picture for the watermark is:
[image: up arrow]
After the FOP watermark, the image is rotated:
[image: image rotated]
Apologies for the late answer.
The idea for the watermark here is to add some transforms to the original PDF using Apache FOP. You can find a Java code example and an FO template example on the Apache FOP website.
In any case, I will illustrate the example here too:
1. The Java code showing how to use FOP
import org.apache.fop.apps.*;
import org.xml.sax.*;
import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;
class rendtest {
private static FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
private static TransformerFactory tFactory = TransformerFactory.newInstance();
public static void main(String args[]) {
OutputStream out;
try {
//Load the stylesheet
Templates templates = tFactory.newTemplates(
new StreamSource(new File(args[1])));
//First run (to /dev/null)
out = new org.apache.commons.io.output.NullOutputStream();
FOUserAgent foUserAgent = fopFactory.newFOUserAgent();
Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, foUserAgent, out);
Transformer transformer = templates.newTransformer();
transformer.setParameter("page-count", "#");
transformer.transform(new StreamSource(new File(args[0])),
new SAXResult(fop.getDefaultHandler()));
//Get total page count
String pageCount = Integer.toString(fop.getResults().getPageCount());
//Second run (the real thing)
out = new java.io.FileOutputStream(args[2]);
out = new java.io.BufferedOutputStream(out);
try {
foUserAgent = fopFactory.newFOUserAgent();
fop = fopFactory.newFop(MimeConstants.MIME_PDF, foUserAgent, out);
transformer = templates.newTransformer();
transformer.setParameter("page-count", pageCount);
transformer.transform(new StreamSource(new File(args[0])),
new SAXResult(fop.getDefaultHandler()));
} finally {
out.close();
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
For the problem I had with rendering landscape PDFs: in the FOP template you only need to add one more attribute to say that the file is in landscape layout.
The attribute is reference-orientation="90". With it set, the other definitions in the FOP template are applied properly.

iTextSharp rotated PDF page reverts orientation when file is rasterized at print house

Using iTextSharp, I am creating a PDF composed of a collection of existing PDFs. Some of the included PDFs are in landscape orientation and need to be rotated, so I do the following:
private static void AdjustRotationIfNeeded(PdfImportedPage pdfImportedPage, PdfReader reader, int documentPage)
{
float width = pdfImportedPage.Width;
float height = pdfImportedPage.Height;
if (pdfImportedPage.Rotation != 0)
{
PdfDictionary pageDict = reader.GetPageN(documentPage);
pageDict.Put(PdfName.ROTATE, new PdfNumber(0));
}
if (width > height)
{
PdfDictionary pageDict = reader.GetPageN(documentPage);
pageDict.Put(PdfName.ROTATE, new PdfNumber(270));
}
}
This works great. The included PDFs are rotated to portrait orientation if needed. The PDF prints correctly on my local printer.
This file is sent to a fulfillment house, and unfortunately, the landscape included files do not print properly when going through their printer and rasterization process. They use Kodak (Creo) NexRip 11.01 or Kodak (Creo) Prinergy 6.1. machines. The fulfillment house's suggestion is to: "generate a new PDF file after we rotate pages or make any changes to a PDF. It is as easy as exporting out to a PostScript and distilling back to a PDF."
I know iTextSharp doesn't support PostScript. Is there another way iTextSharp can rotate included PDFs to hold the orientation when rasterized?
First let me assure you that changing the rotation in the page dictionary is the correct procedure to achieve what you want. As far as I can see your code, there's nothing wrong with it. You are doing the right thing.
Unfortunately, you are faced with a third party product over which you have no control that is not doing the right thing. How to solve this?
I have written an example called IncorrectExample. I have named it that way because I don't want it to be used in a context that is different from yours. You can safely ignore all the warnings I added: they are not meant for you. This example is very specific to your problem.
Please try the following code:
public void manipulatePdf(String src, String dest)
throws IOException, DocumentException {
// Creating a reader
PdfReader reader = new PdfReader(src);
// step 1
Rectangle pagesize = getPageSize(reader, 1);
Document document = new Document(pagesize);
// step 2
PdfWriter writer
= PdfWriter.getInstance(document, new FileOutputStream(dest));
// step 3
document.open();
// step 4
PdfContentByte cb = writer.getDirectContent();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
pagesize = getPageSize(reader, i);
document.setPageSize(pagesize);
document.newPage();
PdfImportedPage page = writer.getImportedPage(reader, i);
if (isPortrait(reader, i)) {
cb.addTemplate(page, 0, 0);
}
else {
cb.addTemplate(page, 0, 1, -1, 0, pagesize.getWidth(), 0);
}
}
// step 5
document.close();
reader.close();
}
public Rectangle getPageSize(PdfReader reader, int pagenumber) {
Rectangle pagesize = reader.getPageSizeWithRotation(pagenumber);
return new Rectangle(
Math.min(pagesize.getWidth(), pagesize.getHeight()),
Math.max(pagesize.getWidth(), pagesize.getHeight()));
}
public boolean isPortrait(PdfReader reader, int pagenumber) {
Rectangle pagesize = reader.getPageSize(pagenumber);
return pagesize.getHeight() > pagesize.getWidth();
}
I have taken the pages.pdf file as an example. This file is special in the sense that it has two pages in landscape that are created in a different way:
one page is a page of which the width is smaller than the height (sounds like it's a page in portrait), but as there's a /Rotate value of 90 added to the page dictionary, it is shown in landscape.
the other page isn't rotated, but it has a height that is smaller than the width.
In my example, I am using the classes Document and PdfWriter to create a copy of the original document. This is wrong in general because it throws away all interaction. I should use PdfStamper or PdfCopy instead, but it is right in your specific case because you don't need the interactivity: the final purpose of the PDF is to be printed.
With Document, I create new pages using a new Rectangle that uses the lowest value of the dimensions of the existing page as the width and the highest value as the height. This way, the page will always be in portrait. Note that I use the method getPageSizeWithRotation() to make sure I get the correct width and height, taking into account any possible rotation.
I then add a PdfImportedPage to the direct content of the writer. I use the isPortrait() method to find out if I need to rotate the page or not. Observe that the isPortrait() method looks at the page size without taking into account the rotation. If we did take into account the rotation, we'd rotate pages that don't need rotating.
The resulting PDF can be found here: pages_changed.pdf
As you can see, some information got lost: there was an annotation on the final page: it's gone. There were specific viewer preferences defined for the original document: they're gone. But that shouldn't matter in your specific case, because all that matters for you is that the pages are printed correctly.
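For readers wondering what the six numbers in `cb.addTemplate(page, 0, 1, -1, 0, pagesize.getWidth(), 0)` do: they form a transformation matrix. The sketch below reproduces that matrix with `java.awt.geom.AffineTransform` (no iText involved) and maps the corners of a landscape page onto a portrait page. The class name and the page sizes (792 x 612 onto 612 x 792) are assumptions chosen for the example.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Illustration with java.awt.geom only: the six numbers a, b, c, d, e, f
// describe x' = a*x + c*y + e and y' = b*x + d*y + f, which is the same
// argument order that AffineTransform's six-double constructor uses.
final class RotateLandscapeSketch {

    // Matrix from the answer for non-portrait pages: a 90 degree
    // counterclockwise rotation, then a shift to the right by the width
    // of the new portrait page.
    static AffineTransform landscapeToPortrait(double portraitWidth) {
        return new AffineTransform(0, 1, -1, 0, portraitWidth, 0);
    }

    static Point2D map(AffineTransform t, double x, double y) {
        return t.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Assumed sizes: a 792 x 612 landscape page on a 612 x 792 portrait page.
        AffineTransform t = landscapeToPortrait(612);
        System.out.println(map(t, 0, 0));     // lower-left corner of the source
        System.out.println(map(t, 792, 612)); // upper-right corner of the source
    }
}
```

With these sizes, every corner of the landscape page lands inside the 612 x 792 portrait page, which is why the translation uses the width of the new page.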

Crop PDF & add margins

I have a PDF with a CropBox size of 6" wide x 9" high. I need to add it to a standard letter-sized PDF. If I change the CropBox size, then the cropmarks become visible. So ideally what I'd like to do is crop out just the visible portion of the page, then pad the sides so that the total height and width is letter-sized.
Is this possible using PDFBox or another Java class?
Have you found an answer to your problem? I have been facing the same scenario this week.
I have a standard letter-size (8.5" x 11") PDF A, containing a header, a footer, and a form. I have no control over that PDF's generation, so the header and footer are a bit dirty and I need to remove them. My first approach was to extract the form into a Box (any type of box works), and then export it as a new PDF page. The problem is that my new Box has a certain size (let's say 6" x 7"), and after thorough research into the docs, I was unable to find a way to embed it into an 8.5" x 11" PDF B; the output PDF was the same size as my Box. All attempts led either to a blank PDF file of the right size, or to a PDF containing my form but with the wrong dimensions.
I then had no choice but to use another approach. It isn't very clean, but hey, when working with PDFs, black magic and workarounds are the main topic. I simply kept the original PDF A and blanked out all the unwanted parts. That is, I created rectangles, filled them with white, and covered up the sections I wanted to hide. The result is a PDF file of the right dimensions, containing only my form. Hooray! Technically, the header and footer are still present in the page; there was no way to actually remove them, and I was only able to hide them (this doesn't make any difference to the end user as long as you're not hiding sensitive data).
I realize your question was submitted 2 years ago, but I had a very hard time finding a proper answer to my question online, so here's me giving back to the community, and hoping I can help future developers save some time. If you actually found a way to extract a box and embed it in a standard-size page, please post your answer!
Here is my code, by the way:
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import java.awt.Color;
import java.io.*;
import java.util.List;
// This code doesn't actually extract PDF elements per se
// It fills 2 rectangles in white to hide the header and the footer of our PDF page
public class ex {
// Arbitrary values obtained in a very obscure way
static int PAGE_WIDTH = 615;
static int PAGE_HEIGHT = 815;
@SuppressWarnings("unchecked")
public static void main(String[] args) throws IOException, COSVisitorException {
File inputFile = new File("C:\\input.pdf");
File outputFile = new File("C:\\output.pdf");
PDDocument inputDoc = PDDocument.load(inputFile);
PDDocument outputDoc = new PDDocument();
List<PDPage> pages = inputDoc.getDocumentCatalog().getAllPages();
PDPageContentStream pageCS = null;
// Lets paint our pages white !
for (PDPage page : pages) {
pageCS = new PDPageContentStream(inputDoc, page, true, false);
pageCS.setNonStrokingColor(Color.white);
// Bottom rectangle (PDF user space has its origin at the bottom-left corner)
pageCS.fillRect(0, 0, PAGE_WIDTH, 30);
// Top rectangle
pageCS.fillRect(0, PAGE_HEIGHT-30, PAGE_WIDTH, 30);
pageCS.close();
outputDoc.addPage(page);
}
// Save to file
outputFile.delete();
outputDoc.save(outputFile);
// Wait until the end to close all documents, or else you get an error
inputDoc.close();
outputDoc.close();
}
}
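A note on the coordinates used for the two white rectangles: PDF user space normally has its origin at the bottom-left corner with y growing upward, so the band starting at y = 0 covers the footer and the band starting at page height minus band height covers the header. The sketch below only computes those two bands. The class and method names are invented, `Rectangle2D.Double` is used purely as a holder for x, y, width and height, and an unrotated page whose media box starts at (0, 0) is assumed. In real code the page size would better be read from the page itself (in PDFBox 1.8, for example, via `PDPage.findMediaBox()`) than hard-coded.

```java
import java.awt.geom.Rectangle2D;

// Sketch: computes header and footer bands in PDF user space, where the
// origin is the bottom-left corner and y grows upward.
final class BandSketch {

    // Band of the given height along the bottom edge (the footer area).
    static Rectangle2D.Double footerBand(double pageW, double bandH) {
        return new Rectangle2D.Double(0, 0, pageW, bandH);
    }

    // Band of the given height along the top edge (the header area).
    static Rectangle2D.Double headerBand(double pageW, double pageH, double bandH) {
        return new Rectangle2D.Double(0, pageH - bandH, pageW, bandH);
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points.
        System.out.println(footerBand(612, 30));
        System.out.println(headerBand(612, 792, 30));
    }
}
```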
I have adapted John's answer a little bit; maybe this will help someone.
I changed the loop to create a new rectangle with the wanted dimensions. The rectangle is then set as the page's MediaBox and CropBox, and the page is added to the new document. I used this snippet to crop a black border out of a long scanned document.
Note that this changes the size of the pages.
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import java.io.File;
import java.io.IOException;
import java.util.List;
public class Main {
@SuppressWarnings("unchecked")
public static void main(String[] args) throws IOException, COSVisitorException {
File inputFile = new File("/path/to/your/file");
File outputFile = new File("/path/to/your/file");
PDDocument inputDoc = PDDocument.load(inputFile);
PDDocument outputDoc = new PDDocument();
List<PDPage> pages = inputDoc.getDocumentCatalog().getAllPages();
// Shrink each page's MediaBox and CropBox to the wanted rectangle
for (PDPage page : pages) {
PDRectangle rectangle=new PDRectangle();
rectangle.setLowerLeftX(0);
rectangle.setLowerLeftY(0);
rectangle.setUpperRightX(500);
rectangle.setUpperRightY(680);
page.setMediaBox(rectangle);
page.setCropBox(rectangle);
outputDoc.addPage(page);
}
// Save to file
// outputFile.delete();
outputDoc.save(outputFile);
// Wait until the end to close all documents, or else you get an error
inputDoc.close();
outputDoc.close();
}
}
Rather than passing a rectangle to the PDPage constructor, you can do this to set the page box to any size:
PDRectangle box = new PDRectangle(pageWidth, pageHeight);
page.setMediaBox(box); // MediaBox > BleedBox > TrimBox/CropBox
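For the original question (a 6" x 9" crop box that should end up padded to letter size), the padding itself is simple arithmetic in points (72 per inch). The sketch below only computes the offset needed to center the cropped area on the larger page; the class and method names are invented, and how the offset is applied (for instance by drawing the cropped page onto a new letter-sized page) depends on the PDF library and is not shown.

```java
// Sketch of the arithmetic only; applying the offset is left to the PDF library.
final class CenterOnLetterSketch {

    static final double POINTS_PER_INCH = 72.0;

    // Returns {tx, ty}: how far the content has to be shifted so that the
    // crop box (lower-left corner and size, in points) ends up centered on
    // a target page of targetW x targetH points.
    static double[] centerOffset(double cropLlx, double cropLly,
                                 double cropW, double cropH,
                                 double targetW, double targetH) {
        double tx = (targetW - cropW) / 2 - cropLlx;
        double ty = (targetH - cropH) / 2 - cropLly;
        return new double[] { tx, ty };
    }

    public static void main(String[] args) {
        // 6" x 9" crop box at the origin, centered on 8.5" x 11" letter paper.
        double[] offset = centerOffset(0, 0,
                6 * POINTS_PER_INCH, 9 * POINTS_PER_INCH,
                8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH);
        System.out.println(offset[0] + ", " + offset[1]); // 90.0, 72.0
    }
}
```

So a 6" x 9" area gets 90 points (1.25") of padding on the left and right and 72 points (1") at the top and bottom.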

Page by page conversion of PDF into TIFF with proper compression

Problem
There are PDF documents with different types of objects inside. There is plain text. There can be scanned images that are B&W, and also other images that are true color. The resolution can be quite high for both (~1789x2711).
I need to convert the PDF into a set of single page TIFF files. There are quite good tools for that, for example IrfanView and ImageMagick. The problem is that I have to define a single compression type for all the pages.
Using JPG for all pages would result in losing details in B&W images, and the files would be huge compared to lossless fax compression.
Using lossless fax compression for all pages would wipe out the colors and details of true color images.
Idea
It would be nice to examine the PDF page by page. I could check the content of each page: what kind of images it contains, and which compression is recommended for that particular page. I think this can be done with iText, but I don't know exactly how it should be done. A second thing is that I want to do this analysis without fully reading the PDF file. Is that possible?
Maybe the fastest solution would be to create a list of pages for each compression type with iText analysis, and then to call IrfanView to process the chosen pages with the proper compression.
Any ideas and recommendations are welcome.
UPDATE:
I now have an answer. It does not cover all requirements, and it's not freeware. Any open-source ideas? Maybe Java-based solutions?
This can be done with DotImage DotPdf from Atalasoft (cue the obligatory "I work there and work on these products"). Here is how I would do this task in C#:
PdfImageSource source = new PdfImageSource(pdfStream);
while (source.HasMoreImages()) {
AtalaImage image = source.AcquireNext();
string fileName = GetNextTiffName();
using (FileStream outStm = new FileStream(fileName, FileMode.Create)) {
TiffEncoder encoder = new TiffEncoder();
encoder.Compression = SelectCompression(image.PixelFormat);
image.Save(outStm, encoder, null);
}
source.Release(image);
}
private TiffCompression SelectCompression(PixelFormat pf)
{
switch (pf) {
// 1 bit? use CCITT G4
case PixelFormat.Pixel1bppIndexed: return TiffCompression.Group4FaxEncoding;
// 24 bit? use JPEG
case PixelFormat.Pixel24bppBgr: return TiffCompression.JpegCompression;
// all else, Lzw
default: return TiffCompression.Lzw;
}
}
You can make SelectCompression do pretty much whatever you want. If you select an invalid compression for that pixel format, the encoder will use an appropriate lossless one in its place (for example, if you select CCITT for 24bit color, the encoder will instead use Lzw).
Our PDF decoder knows when a PDF page is just gray and returns a gray image. It does NOT do anything to get you to 1 bit (this is so antialiased text looks good); however, you could threshold the gray image and look at the overall differences between the thresholded result and the original gray image to determine whether it could go to 1 bit.
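Since the question also asked about Java, here is a rough JDK-only sketch of the same idea: pick a compression per rendered page image, and for gray pages check whether the page is close enough to pure black and white to go to 1 bit. It is not the Atalasoft API. The class and method names, the tolerance of 32 and the 2 % limit are arbitrary assumptions, and the returned strings are meant as labels (they follow what I believe are the compression type names of the TIFF ImageIO plugin in JDK 9 and later). Counting mid-tone pixels is a simplification of the "threshold and compare" check described above.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;

// Hypothetical helper: chooses a TIFF compression label for one page image.
final class CompressionChoiceSketch {

    // True if at most maxFraction of the pixels are real mid-tones, i.e.
    // further than tolerance away from both black (0) and white (255).
    static boolean isNearlyBilevel(BufferedImage gray, int tolerance, double maxFraction) {
        Raster raster = gray.getRaster();
        long midTones = 0;
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                int v = raster.getSample(x, y, 0);
                if (v > tolerance && v < 255 - tolerance) {
                    midTones++;
                }
            }
        }
        long total = (long) gray.getWidth() * gray.getHeight();
        return midTones <= maxFraction * total;
    }

    static String chooseCompression(BufferedImage image) {
        switch (image.getType()) {
            case BufferedImage.TYPE_BYTE_BINARY:
                // already bilevel: fax (Group 4) compression
                return "CCITT T.6";
            case BufferedImage.TYPE_BYTE_GRAY:
                // a nearly bilevel gray page would still have to be
                // thresholded to 1 bit before it is actually encoded
                return isNearlyBilevel(image, 32, 0.02) ? "CCITT T.6" : "LZW";
            case BufferedImage.TYPE_3BYTE_BGR:
            case BufferedImage.TYPE_INT_RGB:
                // true color: lossy JPEG
                return "JPEG";
            default:
                // everything else: lossless LZW
                return "LZW";
        }
    }

    public static void main(String[] args) {
        BufferedImage color = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        System.out.println(chooseCompression(color));
    }
}
```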
Here's how you could do a set of pages:
public void ExtractNPages(Stream pdfStream, params int[] pageIndexes)
{
PdfImageSource source = new PdfImageSource(pdfStream);
foreach (int i in pageIndexes) {
AtalaImage image = source[i]; // implied Acquire
string fileName = GetNextTiffName();
using (FileStream outStm = new FileStream(fileName, FileMode.Create)) {
TiffEncoder encoder = new TiffEncoder();
encoder.Compression = SelectCompression(image.PixelFormat);
image.Save(outStm, encoder, null);
}
source.Release(image);
}
}
So now you can just call ExtractNPages(stm, 0, 2, 4, 6);