decrease font size on existing text in pdf - pdf

I have a huge PDF containing more than 1000 pages and need to edit existing text. Of course, the text was added by me, using PDFBox to add text to each page (example ...). The font size was too big, so the text runs off the page.
Now I want to decrease the font size so that the text stays within the page limits, or alternatively clear the existing text and replace it with the same text in a new font size.

Credits to Tilman Hausherr for the answer:
If you used the code you linked to, then you will find the "added message" in the content stream array, as the second last item. Get the array with PDPage.getCOSObject().getItem(COSName.CONTENTS), remove the entry, and save the file.
public void removeStamp(File src) throws IOException {
    PDDocument doc = PDDocument.load(src);
    PDPageTree pages = doc.getPages();
    for (PDPage page : pages) {
        // The cast assumes that /Contents is an array of streams
        COSArray array = (COSArray) page.getCOSObject().getItem(COSName.CONTENTS);
        // Drop the last content stream of the page
        array.remove(array.size() - 1);
    }
    doc.save(src);
    doc.close();
}
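Once the old entry is gone, the text has to be added again at a size that fits. The following is only a minimal sketch of the size calculation, not part of the original answer: `FontFit` and `fitFontSize` are hypothetical names, and the width of the text at font size 1 is assumed to be known already (in PDFBox it can be derived from `PDFont.getStringWidth(text) / 1000f`).

```java
/** Hypothetical helper: picks a font size so that one line of text fits a given width. */
final class FontFit {
    private FontFit() {}

    /**
     * @param widthAtSize1   width of the text at font size 1, in user units
     * @param availableWidth usable width, e.g. page width minus both margins
     * @param preferredSize  size to use when the text already fits
     * @return the preferred size, or a smaller one if the text would overflow
     */
    static float fitFontSize(float widthAtSize1, float availableWidth, float preferredSize) {
        if (widthAtSize1 <= 0 || availableWidth <= 0 || preferredSize <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Text width grows linearly with the font size, so this is the largest size that fits.
        float largestThatFits = availableWidth / widthAtSize1;
        return Math.min(preferredSize, largestThatFits);
    }
}
```

For a text that is 50 units wide at size 1 and a usable width of 500 units, a preferred size of 24 is reduced to 10; a shorter text keeps the preferred size.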

Related

How to convert existing pdf files to A4 size using pdfbox?

I want to set the page size of an existing document to A4.
I am using pdfbox for watermarking. I used the following link to add the watermark. Here I am using another file which contains the watermark text. Later we only add this layer as an overlay to the original file.
The problem arises when the file with the watermark text has a different size than the original document to which the watermark is to be added. In that case the watermark is not positioned properly.
Version: I am using pdfbox 1.8. I tried with 2.0 but I am more comfortable with this version.
Here is the code:
PDDocument originalPdfFile = PDDocument.load(filename);
PDRectangle pdRect = new PDRectangle(595, 842); // Here I am setting width and height in terms of points
List PageList = originalPdfFile.getDocumentCatalog().getAllPages();
int noOfPages = PageList.size();
System.out.println("No of pages in original document=" + noOfPages);
PDPage page = new PDPage();
//PDPage page=new PDPage(PDPage.PAGE_SIZE_A4);
//Here also I tried to add page size
for (int i = 0; i < PageList.size(); i++) {
    page = (PDPage) PageList.get(i);
    System.out.println("Original Document size in page before cropping: " + (i + 1) + ", Page Resolution: " + page.getMediaBox());
    page.setMediaBox(pdRect);
    System.out.println("Original Document size in page after cropping: " + (i + 1) + ", Page Resolution: " + page.getMediaBox());
    //System.out.println("Original Document size in page: "+i+", Height: "+page.getMediaBox().getHeight()+",Width: "+page.getMediaBox().getWidth());
    PDRectangle rec = page.getMediaBox();
    generateWatermarkText(organisationName, rec);
}
HashMap<Integer, String> overlayGuide = new HashMap<Integer, String>();
for (int i = 0; i < originalPdfFile.getNumberOfPages(); i++) {
    overlayGuide.put(i + 1, "C:/drm/final/final.pdf");
    //watermarktext.pdf is the document which is a one page PDF with your watermark image in it.
}
Overlay overlay = new Overlay();
overlay.setInputPDF(originalPdfFile);
overlay.setOutputFile(filename);
overlay.setOverlayPosition(Overlay.Position.FOREGROUND);
overlay.overlay(overlayGuide, false);
//pdf will have the original PDF with watermarks.
The above code adds the watermark successfully, but I am not able to shrink the page.
This line
PDRectangle pdRect=new PDRectangle(595, 842);
crops the page, but it cuts off the contents of the page, which I don't want. I want the contents to be scaled so that they fit the page, and the page should have the specified size (A4 in my case).
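Changing the MediaBox only changes the visible window; it does not scale what is drawn. Fitting the contents therefore needs a scale factor (and usually an offset to centre the result) applied to the page content. The sketch below only covers that arithmetic. `FitToPage` and `fit` are hypothetical names, the source box is assumed to start at (0, 0) with no page rotation, and actually applying the transformation to the content stream is not shown.

```java
/** Hypothetical helper: computes how to scale a page so that it fits a target size. */
final class FitToPage {
    static final float A4_WIDTH = 595f;   // points, as used in the question
    static final float A4_HEIGHT = 842f;  // points, as used in the question

    private FitToPage() {}

    /**
     * @return {scale, translateX, translateY}: a uniform scale factor that keeps the
     *         aspect ratio, and the offsets that centre the scaled page on the target
     */
    static float[] fit(float srcWidth, float srcHeight, float dstWidth, float dstHeight) {
        if (srcWidth <= 0 || srcHeight <= 0) {
            throw new IllegalArgumentException("source size must be positive");
        }
        // The smaller ratio guarantees that both dimensions fit.
        float scale = Math.min(dstWidth / srcWidth, dstHeight / srcHeight);
        float translateX = (dstWidth - srcWidth * scale) / 2f;
        float translateY = (dstHeight - srcHeight * scale) / 2f;
        return new float[] { scale, translateX, translateY };
    }
}
```

A page of 1190 x 1684 points fits A4 with a scale of 0.5 and no offset; a page of 1190 x 842 points also gets a scale of 0.5 and is moved up by 210.5 points to sit in the middle.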

PDFBox getText not returning all of the visible text

I am using PDFBox to extract text from my PDF document. It retrieves the text, but not all of it (specifically, seems like title/header and footer texts are missing). The parts that are missing are not images and are extracted when using text view in foxit reader.
I am using version 1.8.12 and made a test case with 2.0.2 just to see if it would return more of the content.
This is the code I used for 2.0.2:
public static void main(String[] args) {
    File file = new File("D:\\file.pdf");
    try {
        PDDocument doc = PDDocument.load(file);
        PDFTextStripper stripper = new PDFTextStripper();
        //stripper.setSuppressDuplicateOverlappingText(false);
        stripper.getText(doc);
    } catch (Exception e) {
        System.out.println("Exc errirs ");
    }
}
Now I wonder are there any settings I missed? Is PDFBox failing because text is on top of some decorative elements (rectangle under text)?
Thanks
EDIT: link to file in question
As discussed in the comments, the text wasn't missing, but at the "wrong" position. By default, PDFBox text extraction extracts the characters as they come in the content stream, but they don't always come in a "natural" way. PDF files are created by software, not by humans.
An alternative is to use the sort option:
stripper.setSortByPosition(true)
However, as mkl pointed out, if the text is in two columns, you won't like the result either.
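To illustrate why both orders can disappoint, here is a deliberately simplified model. It is not PDFBox's actual algorithm: `PositionSortDemo` and `Fragment` are hypothetical, and each fragment is reduced to a text plus the coordinates of its start, with y growing upwards as in PDF.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Simplified model of "content stream order" versus "sorted by position". */
final class PositionSortDemo {
    private PositionSortDemo() {}

    /** A piece of text and the position where it starts (PDF coordinates, y grows upwards). */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the fragments in the order in which they were written. */
    static String joinInStreamOrder(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.joining(" "));
    }

    /** Joins the fragments top to bottom, and left to right within the same height. */
    static String joinSortedByPosition(List<Fragment> fragments) {
        Comparator<Fragment> topToBottomThenLeftToRight =
                Comparator.comparingDouble((Fragment f) -> -f.y)
                          .thenComparingDouble(f -> f.x);
        return fragments.stream()
                .sorted(topToBottomThenLeftToRight)
                .map(f -> f.text)
                .collect(Collectors.joining(" "));
    }
}
```

With fragments written as Footer (y = 30), Header (y = 800), Body (y = 400), sorting yields "Header Body Footer". With two columns written as L1, L2, R1, R2, where L1/R1 and L2/R2 share a height, sorting yields "L1 R1 L2 R2", which interleaves the columns.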

Increase left margin of an existing pdf using iTextSharp [duplicate]

My web application signs PDF documents. I would like to let users download the original PDF document (not signed) but adding an image and the signers in the left margin of the pdf document.
I've seen this idea in another web application, and I would like to do the same. Of course I would like to do it using itext library.
I have attached two images, the original PDF document (not signed) and the modified PDF document.
First this: it is important to change the document before you digitally sign it. Once digitally signed, these changes will break the signature.
I will break up the question in two parts and I'll skip the part about the actual watermarking as this is already explained here: How to watermark PDFs using text or images?
This question is not a duplicate of that question, because of the extra requirement to add an extra margin to the left.
Take a look at the primes.pdf document. This is the source file we are going to use in the AddExtraMargin example with the following result: primes_extra_margin.pdf. As you can see, a half-inch margin was added to the left of each page.
This is how it's done:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    int n = reader.getNumberOfPages();
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    // properties
    PdfContentByte over;
    PdfDictionary pageDict;
    PdfArray mediabox;
    float llx, lly, ury;
    // loop over every page
    for (int i = 1; i <= n; i++) {
        pageDict = reader.getPageN(i);
        mediabox = pageDict.getAsArray(PdfName.MEDIABOX);
        llx = mediabox.getAsNumber(0).floatValue();
        lly = mediabox.getAsNumber(1).floatValue();
        ury = mediabox.getAsNumber(3).floatValue();
        // move the left edge of the media box 36 user units (half an inch) outwards
        mediabox.set(0, new PdfNumber(llx - 36));
        over = stamper.getOverContent(i);
        over.saveState();
        over.setColorFill(new GrayColor(0.5f));
        // gray rectangle covering the new margin; its height is the page height (ury - lly)
        over.rectangle(llx - 36, lly, 36, ury - lly);
        over.fill();
        over.restoreState();
    }
    stamper.close();
    reader.close();
}
The PdfDictionary we get with the getPageN() method is called the page dictionary. It has plenty of information about a specific page in the PDF. We are only looking at one entry: the /MediaBox. This is only a proof of concept. If you want to write a more robust application, you should also look at the /CropBox and the /Rotate entry. Incidentally, I know that these entries don't exist in primes.pdf, so I am omitting them here.
The media box of a page is an array with four values that represent a rectangle defined by the coordinates of its lower-left and upper-right corner (usually, I refer to them as llx, lly, urx and ury).
In my code sample, I change the value of llx by subtracting 36 user units. If you compare the page size of both PDFs, you'll see that we've added half an inch.
We also use these coordinates to draw a rectangle that covers the extra half inch. Now switch to the other watermark examples to find out how to add text or other content to each page.
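The box arithmetic described above can be checked in isolation. This is a minimal sketch with hypothetical names (`ExtraMargin`, `extendLeft`, `marginStrip`) that works on plain `{llx, lly, urx, ury}` arrays instead of iText objects; 36 user units correspond to half an inch at the default 72 units per inch.

```java
/** Hypothetical helper: media box arithmetic for adding a margin on the left. */
final class ExtraMargin {
    static final float UNITS_PER_INCH = 72f; // default user space unit

    private ExtraMargin() {}

    /** Returns a copy of {llx, lly, urx, ury} whose left edge is moved outwards by margin. */
    static float[] extendLeft(float[] mediaBox, float margin) {
        float[] result = mediaBox.clone();
        result[0] = mediaBox[0] - margin;
        return result;
    }

    /** Returns {x, y, width, height} of the strip that the new margin occupies. */
    static float[] marginStrip(float[] mediaBox, float margin) {
        float llx = mediaBox[0];
        float lly = mediaBox[1];
        float ury = mediaBox[3];
        // The strip spans the full page height, which is ury - lly.
        return new float[] { llx - margin, lly, margin, ury - lly };
    }
}
```

For a media box of `{0, 0, 595, 842}` and a margin of 36, the new box is `{-36, 0, 595, 842}` (631 units wide) and the strip to fill is `{-36, 0, 36, 842}`.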
Update:
If you need to scale down the existing pages, please read "Fix the orientation of a PDF in order to scale it".

Splitting at a specific point in PDFBox

I would like to generate a new PDF by concatenating certain individual pages, but the last page has to be split at a certain point (i.e. all content above a limit is to be included and everything below is to be excluded; I only care about the items having their upper left corner above a line). Is that possible using PDFBox?
One way to achieve the task, i.e. to split a page at a certain point (i.e. all content above a limit to be included and everything below to be excluded) would be to prepend a clip path.
You can use this method:
void clipPage(PDDocument document, PDPage page, BoundingBox clipBox) throws IOException
{
    PDPageContentStream pageContentStream = new PDPageContentStream(document, page, true, false);
    pageContentStream.addRect(clipBox.getLowerLeftX(), clipBox.getLowerLeftY(), clipBox.getWidth(), clipBox.getHeight());
    pageContentStream.clipPath(PathIterator.WIND_NON_ZERO);
    pageContentStream.close();
    COSArray newContents = new COSArray();
    COSStreamArray contents = (COSStreamArray) page.getContents().getStream();
    newContents.add(contents.get(contents.getStreamCount()-1));
    for (int i = 0; i < contents.getStreamCount()-1; i++)
    {
        newContents.add(contents.get(i));
    }
    page.setContents(new PDStream(new COSStreamArray(newContents)));
}
to clip the given page along the given clipBox. (It first creates a new content stream defining the clip path and then arranges this stream to be the first one of the page.)
E.g. to clip the content of a page along the horizontal line 650 units above the bottom, do this:
PDPage page = ...
PDRectangle cropBox = page.findCropBox();
clipPage(document, page, new BoundingBox(
        cropBox.getLowerLeftX(),
        cropBox.getLowerLeftY() + 650,
        cropBox.getUpperRightX(),
        cropBox.getUpperRightY()));
For a running example look here: ClipPage.java.
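Note that a clip path cuts through anything that straddles the line. If whole items should be kept or dropped based on their upper left corner, as the question describes, that decision has to be made separately. The sketch below shows only the coordinate logic with hypothetical names (`ClipLine`, `regionAbove`, `upperLeftIsAbove`) on plain `{llx, lly, urx, ury}` arrays; how item positions are obtained is not covered here.

```java
/** Hypothetical helper: coordinates for splitting a page at a horizontal line. */
final class ClipLine {
    private ClipLine() {}

    /** Returns {llx, lly, urx, ury} of the region that stays visible above the cut line. */
    static float[] regionAbove(float[] cropBox, float offsetFromBottom) {
        float cutY = cropBox[1] + offsetFromBottom;
        if (cutY > cropBox[3]) {
            throw new IllegalArgumentException("cut line lies above the page");
        }
        return new float[] { cropBox[0], cutY, cropBox[2], cropBox[3] };
    }

    /** True if an item whose upper edge is at itemUpperY starts above the cut line. */
    static boolean upperLeftIsAbove(float itemUpperY, float[] cropBox, float offsetFromBottom) {
        return itemUpperY > cropBox[1] + offsetFromBottom;
    }
}
```

For a crop box of `{0, 0, 612, 792}` and a line 650 units above the bottom, the visible region is `{0, 650, 612, 792}`; an item starting at y = 700 is kept, one starting at y = 600 is not.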

Crop PDF & add margins

I have a PDF with a CropBox size of 6" wide x 9" high. I need to add it to a standard letter-sized PDF. If I change the CropBox size, then the cropmarks become visible. So ideally what I'd like to do is crop out just the visible portion of the page, then pad the sides so that the total height and width is letter-sized.
Is this possible using PDFBox or another Java class?
Have you found an answer to your problem? I have been facing the same scenario this week.
I have a standard letter-size (8.5" x 11") PDF A, containing a header, a footer, and a form. I have no control over that PDF's generation, so the header and footer are a bit dirty and I need to remove them. My first approach was to extract the form into a Box (any type of box works), and then export it as a new PDF page. The problem is, my new Box is a certain size (let's say 6" x 7"), and after thorough research into the docs, I was unable to find a way to embed it into an 8.5" x 11" PDF B; the output PDF was the same size as my Box. All scenarios led either to a blank PDF file of the right size, or to a PDF containing my form but with the wrong dimensions.
I then had no choice but to use another approach. It isn't very clean, but hey, when working with PDFs, black magic and workarounds are the main topic. I simply kept the original PDF A, and blanked out all the unwanted parts. That means I created rectangles, filled them with white, and covered up the sections I wanted to hide. The result is a PDF file of the right dimensions, containing only my form. Hooray! Technically, the header and footer are still present in the page; there was no way to actually remove them, I was only able to hide them (this doesn't make any difference to the end user as long as you're not hiding sensitive data).
I realize your question was submitted 2 years ago, but I had a very hard time finding a proper answer to my question online, so here's me giving back to the community, and hoping I can help future developers save some time. If you actually found a way to extract a box and embed it in a standard-size page, please post your answer!
Here is my code, by the way:
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import java.awt.Color;
import java.io.*;
import java.util.List;
// This code doesn't actually extract PDF elements per se
// It fills 2 rectangles in white to hide the header and the footer of our PDF page
public class ex {
    // Arbitrary values obtained in a very obscure way
    static int PAGE_WIDTH = 615;
    static int PAGE_HEIGHT = 815;
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws IOException, COSVisitorException {
        File inputFile = new File("C:\\input.pdf");
        File outputFile = new File("C:\\output.pdf");
        PDDocument inputDoc = PDDocument.load(inputFile);
        PDDocument outputDoc = new PDDocument();
        List<PDPage> pages = inputDoc.getDocumentCatalog().getAllPages();
        PDPageContentStream pageCS = null;
        // Lets paint our pages white !
        for (PDPage page : pages) {
            pageCS = new PDPageContentStream(inputDoc, page, true, false);
            pageCS.setNonStrokingColor(Color.white);
            // Rectangle at y = 0: the bottom edge, since the PDF origin is the lower-left corner
            pageCS.fillRect(0, 0, PAGE_WIDTH, 30);
            // Rectangle at y = PAGE_HEIGHT - 30: the top edge
            pageCS.fillRect(0, PAGE_HEIGHT-30, PAGE_WIDTH, 30);
            pageCS.close();
            outputDoc.addPage(page);
        }
        // Save to file
        outputFile.delete();
        outputDoc.save(outputFile);
        // Wait until the end to close all documents, or else you get an error
        inputDoc.close();
        outputDoc.close();
    }
}
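The hard-coded PAGE_WIDTH and PAGE_HEIGHT could instead be derived from each page's media box. A minimal sketch of that calculation, with hypothetical names (`CoverStrips`, `bottomStrip`, `topStrip`) and the box passed as a plain `{llx, lly, urx, ury}` array; it assumes an unrotated page:

```java
/** Hypothetical helper: rectangles that cover the bottom and top edge of a page. */
final class CoverStrips {
    private CoverStrips() {}

    /** Returns {x, y, width, height} of a strip along the bottom edge of the box. */
    static float[] bottomStrip(float[] mediaBox, float height) {
        float width = mediaBox[2] - mediaBox[0];
        return new float[] { mediaBox[0], mediaBox[1], width, height };
    }

    /** Returns {x, y, width, height} of a strip along the top edge of the box. */
    static float[] topStrip(float[] mediaBox, float height) {
        float width = mediaBox[2] - mediaBox[0];
        // The PDF origin is the lower-left corner, so the top strip starts at ury - height.
        return new float[] { mediaBox[0], mediaBox[3] - height, width, height };
    }
}
```

For a letter page `{0, 0, 612, 792}` and 30-unit strips, this gives `{0, 0, 612, 30}` for the bottom and `{0, 762, 612, 30}` for the top.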
I have adapted John's answer a little bit; maybe this will help someone.
I have changed the loop to create a new rectangle, with the wanted dimensions. Then the rectangle is set to the page and afterwards added to the new document. I used this snippet to crop a black border out of a long scanned document.
Notice that this will change the size of the pages.
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import java.io.File;
import java.io.IOException;
import java.util.List;
public class Main {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws IOException, COSVisitorException {
        File inputFile = new File("/path/to/your/file");
        File outputFile = new File("/path/to/your/file");
        PDDocument inputDoc = PDDocument.load(inputFile);
        PDDocument outputDoc = new PDDocument();
        List<PDPage> pages = inputDoc.getDocumentCatalog().getAllPages();
        // Set the media box and crop box of every page to the wanted rectangle
        for (PDPage page : pages) {
            PDRectangle rectangle = new PDRectangle();
            rectangle.setLowerLeftX(0);
            rectangle.setLowerLeftY(0);
            rectangle.setUpperRightX(500);
            rectangle.setUpperRightY(680);
            page.setMediaBox(rectangle);
            page.setCropBox(rectangle);
            outputDoc.addPage(page);
        }
        // Save to file
        // outputFile.delete();
        outputDoc.save(outputFile);
        // Wait until the end to close all documents, or else you get an error
        inputDoc.close();
        outputDoc.close();
    }
}
Other than passing a rectangle to the PDPage constructor, you can do this to set the CropBox to any size:
PDRectangle box = new PDRectangle(pageWidth, pageHeight);
page.setMediaBox(box); // MediaBox > BleedBox > TrimBox/CropBox
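For the numbers in the original question (a 6" x 9" visible area on a letter-sized page), the padding on each side follows from converting inches to points at 72 points per inch. This is only a sketch of that arithmetic with hypothetical names (`CenterOnLetter`, `inchesToPoints`, `centeringOffset`); placing the cropped content at the computed offset is a separate step that is not shown.

```java
/** Hypothetical helper: padding needed to centre a smaller page on a larger one. */
final class CenterOnLetter {
    static final float POINTS_PER_INCH = 72f;

    private CenterOnLetter() {}

    static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    /**
     * @return {offsetX, offsetY}: where the lower-left corner of the inner page has to go
     *         on the outer page so that the padding is equal on opposite sides
     */
    static float[] centeringOffset(float innerWidth, float innerHeight,
                                   float outerWidth, float outerHeight) {
        if (innerWidth > outerWidth || innerHeight > outerHeight) {
            throw new IllegalArgumentException("inner page does not fit into outer page");
        }
        return new float[] { (outerWidth - innerWidth) / 2f, (outerHeight - innerHeight) / 2f };
    }
}
```

A 6" x 9" area is 432 x 648 points and letter is 612 x 792 points, so the content needs 90 points of padding left and right and 72 points top and bottom.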