Actually cropping a PDF with PDF Clown

My objective is to actually crop a PDF file with PdfClown.
Many tools and libraries can "crop" a PDF by changing the page's CropBox. This hides content outside a rectangular area, but the content is still there: it can be accessed through a PDF parser, and the file size does not change.
What I need instead is to create a new page containing only the content inside the rectangular area.
So far I've tried scanning the contents and selectively cloning them, but I haven't succeeded yet. Any suggestions on using PdfClown for that?
I've seen that someone is trying something similar with PDFBox ("Cropping a region from a PDF page with PDFBox"), also without success so far.

A bit late, but maybe it helps someone.
I am successfully doing what you are asking for, but with other libraries.
Required libraries: iText 4 or 5 and Ghostscript
Step 1 (with pseudo code):
Using iText, create a PdfWriter instance with a blank Document. Open a PdfReader on the original file you want to crop. Import the page, get a PdfTemplate object from the source, set its BoundingBox property to the desired crop box, wrap the template in an iText Image object and paste it onto the new page at an absolute position.
Imports System.IO
Imports iTextSharp.text
Imports iTextSharp.text.pdf
Dim reader As New PdfReader(sourcefile)
Dim doc As New Document()
Dim writer As PdfWriter = PdfWriter.GetInstance(doc, New FileStream(outputfilename, FileMode.Create))
' the document must be open before pages can be imported or content added
doc.Open()
' get the source page as an imported page
Dim page As PdfImportedPage = writer.GetImportedPage(reader, indexOfPageToGet)
' create a PdfTemplate object at the original size - see the "iText in Action" book, page 91, for full details
Dim pdftemp As PdfTemplate = writer.DirectContent.CreateTemplate(page.Width, page.Height)
' paste the original page onto the template; see the iText documentation for what these parameters do (scaling, mirroring)
pdftemp.AddTemplate(page, 1, 0, 0, 1, 0, 0)
' now the critical part: set the BoundingBox property on the template. This makes all objects outside the rectangle invisible.
' llx, lly, urx, ury are placeholders for the corners of the new crop box
pdftemp.BoundingBox = New iTextSharp.text.Rectangle(llx, lly, urx, ury)
' write the template to the output and free its memory; it can still be placed, but no longer modified
writer.ReleaseTemplate(pdftemp)
' wrap the template in an iText Image object; with this img object, absolute positioning on the final page is much easier
Dim img As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(pdftemp)
' set the img position
img.SetAbsolutePosition(x, y)
' set an optional rotation if needed
img.RotationDegrees = 0
' finally, this adds the actual content to the new document
doc.Add(img)
' cleanup
doc.Close()
reader.Close()
writer.Close()
The output file will look cropped, but the objects are still present in the PDF stream, and the file size will probably be almost unchanged.
Step 2:
Using Ghostscript with the pdfwrite output device and the appropriate command-line parameters, you can re-process the PDF from Step 1. This will give you a much smaller PDF. See the Ghostscript documentation for the arguments: https://www.ghostscript.com/doc/9.52/Use.htm
This step actually gets rid of the objects that lie outside the bounding box, which is the requirement you asked for in your original post, at least for the files that I deal with.
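As a rough illustration of Step 2, here is a minimal Python sketch that builds and runs a Ghostscript pdfwrite command. The helper names (build_gs_command, rewrite_pdf) and the default binary name "gs" are assumptions made for this example, and the flag set is only a common baseline rather than the exact parameters used above. Whether clipped objects really disappear depends on the file.

```python
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, gs_binary: str = "gs") -> List[str]:
    """Build a Ghostscript command that re-processes src through pdfwrite.

    gs_binary is an assumption: typically "gs" on Linux/macOS and
    something like "gswin64c" on Windows.
    """
    return [
        gs_binary,
        "-dBATCH",               # exit once the input has been processed
        "-dNOPAUSE",             # do not prompt between pages
        "-dSAFER",               # restrict file access from the PDF/PostScript
        "-sDEVICE=pdfwrite",     # re-generate the document as a new PDF
        f"-sOutputFile={dst}",
        src,
    ]


def rewrite_pdf(src: str, dst: str) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(src, dst), check=True)
```

Calling rewrite_pdf("step1.pdf", "step2.pdf") would then rewrite the Step 1 output (both file names are placeholders).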
Optional Step 3:
Using mutool clean with the -g option, you can garbage-collect unused objects. Your original PDF probably had many objects in its xref table, which increase the file size; after cropping, some of them may no longer be needed.
https://mupdf.com/docs/manual-mutool-clean.html
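The optional Step 3 can be sketched the same way in Python. The helper name build_mutool_clean_command and the file names are hypothetical; the command itself follows the usual "mutool clean [options] input.pdf output.pdf" form, with -g asking for unused objects to be garbage-collected.

```python
import subprocess
from typing import List


def build_mutool_clean_command(src: str, dst: str,
                               mutool_binary: str = "mutool") -> List[str]:
    """Build a 'mutool clean -g' command that drops unused objects."""
    return [mutool_binary, "clean", "-g", src, dst]


def clean_pdf(src: str, dst: str) -> None:
    """Run mutool; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_mutool_clean_command(src, dst), check=True)
```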
PDF is a tricky format. Normally I would agree with @Tilman Hausherr. My suggestion may not work for all files and covers the 'almost impossible' scenario, but it works for all the cases that I deal with.

Related

In iTextSharp it is possible change the width & height of a PdfTemplate object. Can we do the same to an iText7 PdfCanvas/Xobject?

I'm converting some iTextSharp-heavy VB.net code into iText7, and a part of the old iTextSharp code changes a Pdftemplate object's width & height to adapt to a given situation.
Now in iText7, I have a PdfCanvas object bound to a PdfFormXObject serving the same role as the former PdfTemplate object. So far, so good.
But alas, I have this old code to contend with:
Dim oObjectTemplate As PdfTemplate = oContainerTemplate
dTemplateSizeIncrease = oObject.FontSize * 4
oObjectTemplate.Width += CSng(dTemplateSizeIncrease)
oObjectTemplate.Height += CSng(dTemplateSizeIncrease)
I tried looking into the robust documentation for an answer, but there's little wisdom out there to be found for converting iTextSharp to iText7.
By this point in the code, the object template (and its iText7 counterpart) have already been through a bit of logic and have certain values already set. I'm not eager to have to make a new instance to accommodate a size change.
So... is there a way to resize an iText7 PdfFormXObject once it's been made?
iText 5 merely changes the bbox of a Form XObject, so it's totally possible to do the same thing in iText 7 - just set the modified bbox to the PdfFormXObject instance. Example code (it's in Java, but very easy to convert to C# or VB.NET):
Rectangle bbox = formXObject.getBBox().toRectangle();
bbox.setHeight(bbox.getHeight() + 100);
bbox.setWidth(bbox.getWidth() + 100);
formXObject.setBBox(new PdfArray(bbox));

How do I use iTextSharp (or iText) to crop and copy a page from one PDF to another

I've written code to do the following:
Take a PDF of a certain page size (e.g., 8.5" x 11")
Create a new PDF with a larger page size (e.g., 17" x 11")
Impose the original PDF onto the new one (e.g., 2-up such that the resulting new PDF has the original PDF side-by-side)
To do this, I use the PdfWriter.GetImportedPage method to get the current page from the original PDF, then use the PdfContentByte.AddTemplate(page, x, y) method to place the original page onto the current page of the new PDF.
My new challenge is that I need to crop the original PDF before adding it to the new PDF. For example, imagine I want to crop 2" off each side of the original PDF before imposing it. The input PDF would still be 8.5" x 11" and the new PDF would still be 17" x 11", but the two "copies" of the original PDF in the new one would each have had 2" removed from their top, right, bottom and left sides.
Hopefully these images can make this clearer. Here's what I have now, doing a 2-up imposition. (This is working swimmingly.)
But here's what I need to do:
I know that I can alter the display of the PDF in a viewer by using the MediaBox or CropBox settings, but those settings aren't respected by AddTemplate. I know that with AddTemplate I can use a transform matrix to position the page or to scale or rotate it, but I don't want to shrink the original PDF, I want to crop it.
Thanks
I found that I can use the BoundingBox of the imported page to crop it prior to adding it to the new PDF (via AddTemplate).
So my code looks something like this:
// reader is the PdfReader opened on the original PDF
PdfImportedPage page = writer.GetImportedPage(reader, pageNumber);
// Crop!
page.BoundingBox = new Rectangle(llx, lly, urx, ury);
// Add to new PDF
writer.DirectContent.AddTemplate(page, x, y);
That does the trick!
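For the 2" example from the question, the rectangle values can be worked out as follows. This is only a sketch of the arithmetic in Python, with hypothetical helper names (crop_box, placement_offset). It assumes the default PDF unit of 72 points per inch, and that the bounding box is expressed in the imported page's own coordinates, so the visible region ends up at (x + llx, y + lly) when the page is placed at (x, y).

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # default PDF user space unit

Box = Tuple[float, float, float, float]  # (llx, lly, urx, ury) in points


def crop_box(page_w_in: float, page_h_in: float, margin_in: float) -> Box:
    """Rectangle left after trimming margin_in inches from every side."""
    margin = margin_in * POINTS_PER_INCH
    return (
        margin,
        margin,
        page_w_in * POINTS_PER_INCH - margin,
        page_h_in * POINTS_PER_INCH - margin,
    )


def placement_offset(target_x: float, target_y: float, box: Box) -> Tuple[float, float]:
    """(x, y) to pass when placing the page so that the lower-left corner
    of the cropped region lands at (target_x, target_y) on the new page."""
    llx, lly, _, _ = box
    return (target_x - llx, target_y - lly)


box = crop_box(8.5, 11, 2)                # (144.0, 144.0, 468.0, 648.0)
offset = placement_offset(0.0, 0.0, box)  # (-144.0, -144.0)
```

In other words, placing the cropped page at the same x and y as before leaves the remaining content where it was; shifting by the negative of llx and lly moves it flush with the target corner.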

Lost some text when extracting pdf

I've tried to get all the text on the page using iText, but I have no idea why every coordinate text loses its last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.
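One way to spot this kind of hidden text layer is to look for the Tr operator in a page's content stream: "3 Tr" selects text rendering mode 3, which neither fills nor strokes the glyphs. The following Python sketch is deliberately naive and is not how iText works internally. It assumes you already have the decompressed content stream as text, and it does not skip string literals, so it may over-count. The function name count_text_render_modes is made up for this example.

```python
import re
from collections import Counter

# A single digit 0-7 followed by the Tr operator, both delimited by whitespace.
_TR_OPERATOR = re.compile(r"(?<!\S)([0-7])\s+Tr(?!\S)")


def count_text_render_modes(content: str) -> Counter:
    """Count how often each text rendering mode is selected in a content stream."""
    return Counter(int(m.group(1)) for m in _TR_OPERATOR.finditer(content))


sample = "BT /F1 12 Tf 3 Tr (N 31) Tj ET"
modes = count_text_render_modes(sample)
invisible_text_present = modes[3] > 0  # True for this sample
```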

How to insert values into an existing PDF on the fly?

I have a PDF with some fields that accept values from the user (for example, a "bio data" form). How can I insert the user's input into the correct fields of the existing PDF and generate the filled-in PDF?
If I use iTextSharp, how can I choose the coordinates at which to print the values?
Are there any design tools for drawing rectangular fields that accept values?
I ask because my PDF template has lots of fields to be filled in by the user.
Thanks in advance.
There are two possibilities:
Your original PDF is a form:
You can check this by checking if the PDF has any fields as explained here: convert pdf editable fields into text using java programming
You'll need to adapt the Java code to C# code or you can use RUPS as shown in my answer to the question How to get specific types from AcroFields? Like PushButtonField, RadioCheckField, etc
In this case, filling out the form is easy:
PdfStamper pdfStamper = new PdfStamper(new PdfReader(templateFile), new FileStream(fileName, FileMode.Create));
AcroFields acroFields = pdfStamper.AcroFields;
acroFields.SetField(key, value);
pdfStamper.FormFlattening = true;
pdfStamper.Close();
You can have as many lines with SetField() as you want. In these lines key is the field name as defined in the original form; value is the value you want to add at the position(s) of that field.
The line with the pdfStamper.FormFlattening is optional. If you set that value to true, all interactivity will be removed: the form will no longer be a form. If you remove the line or set that value to false, then the form will still be a form. You'll be able to change the content of the fields and extract the value of the fields.
Your original PDF is not a form:
A PDF may look like a form to the human eye, but if it doesn't have AcroForm fields (and no XFA either), then a machine won't consider it as being a form. In this case, you have to understand that all the content is fixed at fixed coordinates on the page. You can add content at absolute positions, but the original content won't move.
There are different ways to add content to an existing PDF and they all involve PdfStamper. Once you have obtained PdfContentByte object from this PdfStamper then you can add text as explained in the documentation. Read the sections Manipulating existing PDFs and Absolute positioning of text or take a look at the content tagged with the keyword PdfStamper. The watermark examples should be interesting too.
I would advise against this second approach, as it is very hard to find the exact coordinates to use. If your PDF isn't a form, turn it into a form using Adobe Acrobat and use the first approach. The first approach is much more future-proof: if you ever have to change something in your form, you can change that form without having to change your code (provided that you preserve the original field names).
iTextSharp lets you do this with its PdfStamper class.
Just a sample for your reference.
// create a PdfReader instance and read the existing PDF file into it by providing its path
PdfReader pdfReader = new PdfReader(FILE_PATH);
// create a stamper instance to edit the existing file
PdfStamper pdfStamper = new PdfStamper(pdfReader, Response.OutputStream);
// perform your edit operations here...
// close the PdfStamper instance
pdfStamper.Close();

How to draw Matrix Code with PdfSharp?

I need to make a PDF report via PdfSharp. The report must include a QRCode, or data matrix code, but I can't seem to be able to draw it on the page.
The values it's asking for are value as String and length as Integer so here's what I'm doing:
Dim myNewCode As New PdfSharp.Drawing.BarCodes.CodeDataMatrix("1234567890", 10)
Then I try to draw it:
gfx.DrawMatrixCode(myNewCode, myXPoint)
It asks for an XPoint location so I set it to this:
Dim myXPoint As New XPoint(500,500)
Which only needs values for x and y.
It compiles OK, but when I try to open the file I get the following error:
An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem
My Acrobat version is 11.0.5, and there is no problem opening other PDF files which already contain these kind of codes.
Specify the size to get a correct PDF file:
var myXSize = new XSize(100, 100);
var myNewCode = new PdfSharp.Drawing.BarCodes.CodeDataMatrix("1234567890", 10, myXSize);
var myXPoint = new XPoint(200, 300);
gfx.DrawMatrixCode(myNewCode, myXPoint);
Please note that due to legal reasons, the open source version of PDFsharp does not include the implementation of the Data Matrix Code and shows dummy images instead.
Another option would be to use a 3rd party library (ZXing) to generate the QR Code bitmap and draw it as a bitmap with DrawImage() on the PDF.