Move PDF content using PDFBox

I need to be able to specify a rectangular area on a PDF page and move the text and graphic content of that area to a new location on the same page using PDFBox. Any graphics (lines, pictures, etc.) selected by the area should each move as a whole unit.
The PDF documents being modified originate as text-based PCL and are converted to PDF using a third-party tool. I can answer technical questions about these documents if needed.
This Stack Overflow question is exactly what I am after, but it seems to have been abandoned before a working solution was found.
I would bounty this question if I had a few more reputation points.
If you can help with any aspect of this issue I would appreciate your assistance, thank you.

I'm not as familiar with PDFBox as I should be, but any library should be able to do the following; I know the one I represent can.
Create a new blank page that's the same size as your original. Copy the content of the original to an XObject and apply that to the blank page. Add a white rectangle to the page to obscure the rectangle in question. Clip the content of the original page to the rectangle you want to "move". Create a second XObject from that. Apply it to the new page in the position you want.
If PDFBox is capable of it, sanitize the new page to remove the hidden content under the white box.
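At the content-stream level, the steps above come down to a handful of PDF operators. The following is a minimal sketch in plain Java that only builds that operator sequence as text. It assumes the original page content has already been wrapped in a form XObject registered under the hypothetical resource name Fm0; the class and method names are made up for illustration. It is not PDFBox code, and it does not remove the hidden content under the white rectangle.

```java
import java.util.Locale;

/** Builds the PDF content-stream operators for "moving" a rectangular region. */
final class MoveRegionSketch {

    /** Formats a number with a '.' decimal separator, as PDF syntax requires. */
    static String num(double v) {
        return String.format(Locale.ROOT, "%.2f", v);
    }

    /**
     * form   : resource name of the form XObject holding the original page (assumed, e.g. "Fm0")
     * x, y   : lower-left corner of the source rectangle, in user space units
     * w, h   : width and height of the source rectangle
     * dx, dy : offset from the source position to the target position
     */
    static String moveRegion(String form, double x, double y, double w, double h,
                             double dx, double dy) {
        String rect = num(x) + " " + num(y) + " " + num(w) + " " + num(h) + " re";
        StringBuilder ops = new StringBuilder();
        // 1. Draw the complete original page.
        ops.append("q /").append(form).append(" Do Q\n");
        // 2. Cover the source rectangle with an opaque white fill.
        ops.append("q 1 1 1 rg ").append(rect).append(" f Q\n");
        // 3. Translate, clip to the (now shifted) source rectangle, and draw the page
        //    again, so only the selected region shows up at the new position.
        ops.append("q 1 0 0 1 ").append(num(dx)).append(" ").append(num(dy)).append(" cm ")
           .append(rect).append(" W n /").append(form).append(" Do Q\n");
        return ops.toString();
    }

    public static void main(String[] args) {
        // Move a 200 x 50 region at (100, 600) down by 300 units.
        System.out.print(moveRegion("Fm0", 100, 600, 200, 50, 0, -300));
    }
}
```

Because the translation is applied before the clip path is defined, the same rectangle coordinates serve both for the white cover and for the clip. With PDFBox 2.x, the equivalent sequence would normally be produced through PDPageContentStream (saveGraphicsState, transform, addRect, clip, drawForm), with LayerUtility.importPageAsForm supplying the form XObject; treat that mapping as a pointer to check against the version in use rather than a tested recipe.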

Related

Automatically remove all PDF content outside a crop area

For a deck of lecture slides, I have extracted several vector illustrations from a PDF-file. I did this by highlighting the relevant area in Preview.app, copying, and opening a new file from the clipboard.
The figures look just fine, even though I noticed that the files are a little large. When I open them in Illustrator, I can see what's described in the screenshot – that all of the page content is still there, it's just hidden because it lies outside the crop area.
Now I could simply remove everything except the relevant figures in Illustrator, but I would much rather automate the process, since I have a large number of figures.
How can I automate this process such that everything outside the crop area is discarded and everything inside it is preserved as a vector image?
You can use a redaction utility to remove the content.
Go to https://doxiview.cib.de/showcase/index.html?locale=default
Choose the redact tool
Upload your PDF
On the right, choose Select Area and set the redact fill color to white
Mark all the content you want to remove
Click Apply
Download the PDF
Afterwards you can crop the PDF, and the removed content will no longer be in the file.
There's no need to rasterize. Just crop the pages then use Acrobat DC to "Sanitize" the document. That will completely remove any non-visible parts of the file.
In Acrobat Pro, go to Preflight and select the setting below.
Then click Edit on the right.
You should be able to create Adobe droplets with this preflight setting for automation.

How to resize a PDF page with itext without scaling the content (in Java)

I have been trying for days to find a solution to my problem: I want to resize an existing PDF from A4 to a given smaller custom page size. I need the real page size to be changed, not the crop box or something like that.
The original PDF will always consist of only one page, and all content (e.g. text (some with hyperlinks), images and tables) will fit into the desired page size. In fact, I want to trim the PDF page to a rectangle that exactly fits the existing content (the content starts at the upper left corner).
As I found no way to change the page size of an existing PDF page, I tried to create a new PDF with the desired page size and copy all the content of the original PDF to it. But that doesn't work either (I can create the new PDF page with the desired size, but I cannot copy the content).
Any solution (iText 5 or 7) is welcome.

Cropping a region from a PDF page with PDFBox

I am trying to crop a region out of a PDF page programmatically. Specifically, my input is going to be a single page PDF and a bounding box on the page. Output is going to be a PDF that contains the characters, graphics paths and images from the original PDF, and it should look like the original PDF. In other words, I want a function that is similar to cropping a region out of an image, but with PDFs.
Three questions:
Is it at all possible to do? From my knowledge of PDFs, it seems possible. But I'm no expert, so I would like to know first if there are some things I'm missing here.
Is there any open source software for this?
Can PDFBox do this currently? I couldn't find such a functionality but I might have missed it. Does anybody know of any attempt of doing this?
1- Yes, this is called the crop box.
2- Yes, e.g. PDFBox.
3- Yes, just open a PDF, set a crop box, and save it:
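import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

PDDocument doc = PDDocument.load(new File(...));    // PDFBox 2.x; 3.x uses Loader.loadPDF instead
PDPage page = doc.getPage(0);                       // first page (zero-based index)
page.setCropBox(new PDRectangle(20, 20, 200, 400)); // x, y, width, height
doc.save(...);
doc.close();
The numbers in PDRectangle are user space units. 1 unit = 1/72 inch.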
Note that the contents outside the cropbox are not gone, they are just hidden.
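Since the crop box is given in user space units, a small conversion helper can make the numbers easier to reason about. The sketch below is plain Java with hypothetical names (PdfUnits, mmToUnits and so on), and it assumes the default user unit of 1/72 inch stated above.

```java
import java.util.Locale;

/** Converts between inches/millimetres and PDF user space units (assumed 72 per inch). */
final class PdfUnits {
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double inchesToUnits(double inches) {
        return inches * UNITS_PER_INCH;
    }

    static double mmToUnits(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    static double unitsToInches(double units) {
        return units / UNITS_PER_INCH;
    }

    public static void main(String[] args) {
        // Size of the crop box from the answer, PDRectangle(20, 20, 200, 400), in inches.
        System.out.printf(Locale.ROOT, "crop width  = %.3f in%n", unitsToInches(200));
        System.out.printf(Locale.ROOT, "crop height = %.3f in%n", unitsToInches(400));
        // A 10 mm margin expressed in user space units.
        System.out.printf(Locale.ROOT, "10 mm       = %.3f units%n", mmToUnits(10));
    }
}
```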

iTextSharp - when extracting a page it fails to carry over Adobe rectangle highlighting important info

Per the following site...
http://forums.asp.net/t/1630140.aspx?extracting+pdf+pages+using+itextsharp
...I use the function ExtractPages to produce a new PDF based on a range of page numbers. My problem is that I noticed a rectangle on the 2nd page of a PDF was not carried over with the page. This makes me worry that Adobe comments are perhaps not being carried over either when the pages get extracted.
Is there a way I can adjust this code so that comments and objects like rectangles are brought over to the new PDF when ExtractPages is called? Am I missing some syntax, or is that not available in version 5.5.0 of iTextSharp?
Your use of the verb extract in the context of extracting pages is confusing. People will think you want to extract text from a page. In reality, you want to import or copy pages.
The example you refer to uses PdfWriter. That's wrong: you should use PdfStamper (if only one existing PDF is involved) or PdfCopy (if multiple existing PDFs are involved). See my answer to the question How to keep original rotate page in itextSharp (dll) to find out why the example on forums.asp.net is a really, really bad example.
The fact that a page has "a rectangle" is irrelevant. Maybe the rectangle is an annotation. In that case, you're throwing that rectangle away by using the wrong example. Maybe the origin of the page is different from 0,0...
If your purpose is to create a new PDF containing only a selection of pages of the original PDF, please read my answer to Function that I can use to remove a single page from a PDF using iText

Reading text + graphic (like lines) info from an existing pdf

I want to read an existing PDF and extract the text and graphics information. Of the graphics, I currently just need the drawn lines. There are many vendor components for reading PDF text, but are there ones that can give graphics info too? Though free/open source is preferred, I'm OK with commercial ones too.
The requirement is:
For every page in PDF:
Reading text blocks
Getting to know the canvas co-ordinate of the text block (rectangle containing the block). Note, for text with higher font size, the rect size will change.
Lines - need collection of (x1,y1,x2,y2) for every line in a page in pdf
Thanks,
- Seeker
This is my field, though the question is a bit old. Hopefully this still helps.
You leave some room for assumptions, so here are mine:
You seek a script, rather than stand-alone software.
Your object is archival.
If you are running command-line scripts:
Use the command-line script detailed at http://stefaanlippens.net/extract-images-from-pdf-documents
If you are running server-side code using ImageMagick or GraphicsMagick functions:
Something like "convert -background white -flatten test1.pdf test1.jpg" (ImageMagick) will render the whole PDF page into a JPEG. If you then want to crop it to the image(s), the best script(s) for that depend on the context of the project.
A rather complex question. If you wish to provide more details about the project, then I can provide some more guidance. Best of luck.
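For the line part of the requirement, straight segments appear in a page's decoded content stream as "x y m" (move to) followed by "x y l" (line to) operators. The following is a simplified sketch in plain Java with hypothetical names that collects (x1,y1,x2,y2) tuples from such a stream. It assumes whitespace-separated tokens and deliberately ignores rectangles (re), curves, path closing, the current transformation matrix (cm), and whether a path is actually stroked, so the coordinates are only page coordinates when no transformation is in effect.

```java
import java.util.ArrayList;
import java.util.List;

/** Collects straight line segments from the text of a decoded PDF content stream. */
final class LineSegmentSketch {

    /** One segment, in the coordinates used inside the content stream. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    static List<Segment> extract(String contentStream) {
        List<Segment> segments = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;
        boolean hasPoint = false; // true once an "m" has set the current point

        for (String token : contentStream.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            // Numeric operands are queued until the operator that consumes them arrives.
            if (token.matches("[+-]?(\\d+\\.?\\d*|\\.\\d+)")) {
                operands.add(Double.parseDouble(token));
                continue;
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
                hasPoint = true;
            } else if (token.equals("l") && n >= 2 && hasPoint) {
                double x = operands.get(n - 2);
                double y = operands.get(n - 1);
                segments.add(new Segment(curX, curY, x, y));
                curX = x;
                curY = y;
            }
            operands.clear(); // every operator consumes its operands
        }
        return segments;
    }

    public static void main(String[] args) {
        for (Segment s : extract("10 20 m 110 20 l 110 70 l S")) {
            System.out.println(s.x1 + "," + s.y1 + " -> " + s.x2 + "," + s.y2);
        }
    }
}
```

A PDF library does this more robustly. In PDFBox, for example, PDFGraphicsStreamEngine can be subclassed to receive moveTo and lineTo callbacks while the library handles stream decoding and the graphics state; that pointer is offered as a starting place, not as a tested solution.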