How do I use iTextSharp (or iText) to crop and copy a page from one PDF to another

I've written code to do the following:
Take a PDF of a certain page size (e.g., 8.5" x 11")
Create a new PDF with a larger page size (e.g., 17" x 11")
Impose the original PDF onto the new one (e.g., 2-up such that the resulting new PDF has the original PDF side-by-side)
To do this, I use the PdfWriter.GetImportedPage method to get the current page from the original PDF, then use the PdfContentByte.AddTemplate(page, x, y) method to place the original page onto the current page of the new PDF.
My new challenge is that I need to crop the original PDF before adding it to the new PDF. For example, imagine I want to crop 2" off each side of the original PDF before imposing it. The input PDF would still be 8.5" x 11" and the new PDF would still be 17" x 11", but the two "copies" of the original PDF in the new one would have had 2" removed from their top, right, bottom and left sides.
Hopefully these images can make this clearer. Here's what I have now, doing a 2-up imposition. (This is working swimmingly.)
But here's what I need to do:
I know that I can alter the display of the PDF in a viewer by using the MediaBox or CropBox settings, but those settings aren't respected by AddTemplate. I know that with AddTemplate I can use a transform matrix to position the page or to scale or rotate it, but I don't want to shrink the original PDF, I want to crop it.
Thanks

I found that I can use the BoundingBox of the imported page to crop it prior to adding it to the new PDF (via AddTemplate).
So my code looks something like this:
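// reader is a PdfReader opened on the source PDF
PdfImportedPage page = writer.GetImportedPage(reader, pageNumber);
// Crop! (llx, lly, urx, ury are in points, in the source page's coordinates)
page.BoundingBox = new Rectangle(llx, lly, urx, ury);
// Add to new PDF
writer.DirectContent.AddTemplate(page, x, y);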
That does the trick!
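One detail to keep in mind: as far as I understand it, the bounding box clips in the imported page's own coordinate system, so the visible region keeps its original offset from the page origin. To land the lower-left corner of the cropped region at a chosen spot, subtract llx and lly from the offsets passed to AddTemplate. The arithmetic can be sketched like this, in Python purely for illustration (it is not iText code; crop_and_place and its parameters are hypothetical names, and it assumes an unrotated page whose MediaBox starts at (0, 0)):

```python
POINTS_PER_INCH = 72.0

def crop_and_place(page_w, page_h, margin, target_x, target_y):
    """Return ((llx, lly, urx, ury), (x, y)), all in points.

    The first tuple is the bounding box that trims `margin` off every side.
    The second is the AddTemplate offset that puts the lower-left corner of
    the cropped region at (target_x, target_y) on the output page.
    """
    llx, lly = margin, margin
    urx, ury = page_w - margin, page_h - margin
    if urx <= llx or ury <= lly:
        raise ValueError("margin leaves nothing visible")
    return (llx, lly, urx, ury), (target_x - llx, target_y - lly)

# 8.5" x 11" source, 2" trimmed on every side, two copies on a 17" x 11" sheet.
page_w, page_h = 8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH
margin = 2 * POINTS_PER_INCH

# Keep each cropped copy where it sat in its half of the sheet.
box, left_offset = crop_and_place(page_w, page_h, margin, margin, margin)
_, right_offset = crop_and_place(page_w, page_h, margin, page_w + margin, margin)

print(box)           # (144.0, 144.0, 468.0, 648.0)
print(left_offset)   # (0.0, 0.0)
print(right_offset)  # (612.0, 0.0)
```

With these targets the offsets come out the same as for the uncropped 2-up layout, which is why the original x and y values keep working. Moving the cropped copies elsewhere (for example, butting them together) only changes target_x and target_y.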

Related

iText7 - create PDF with exact dimensions when printed - how?

I'm creating a simple PDF using iText7 (C#) but I need it to be printed at exactly the right size. Here's my code:
PdfWriter writer = new PdfWriter("output.pdf");
PdfDocument pdf = new PdfDocument(writer);
pdf.SetDefaultPageSize(iText.Kernel.Geom.PageSize.LETTER);
var page = pdf.AddNewPage();
page.SetCropBox(new iText.Kernel.Geom.Rectangle(36, 36, 7.5f * 72, 10 * 72));
PdfCanvas canvas = new PdfCanvas(page);
canvas.SetStrokeColor(ColorConstants.BLACK).SetLineWidth(3);
canvas.MoveTo(36, 36);
canvas.LineTo(36, 36 + 72); // Draw a line 1 inch long
canvas.LineTo(36 + 72, 36 + 72); // Draw a second line, perpendicular to the first, also 1 inch long
canvas.ClosePathStroke();
pdf.Close();
If I right-click the resulting PDF and select "Print", my triangle is off the bottom of the page.
When I open the resulting PDF in the PDF program I'm using (PDF Architect), it gives me a few options:
If I just click "Print", it gives me lines that are 1 1/16" long and start about 1/8" from the edge of the page, so by default PDF Architect seems to be taking the contents of my crop box and expanding it to the maximum page availability.
If I click on "Fit" before clicking "Print", that results in the desired output - lines 1" long, starting 1/2" from each side of the page. That works but is error-prone - too easy to forget to click "Fit" every time.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
You will not be able to completely control this from the PDF document. The PDF processor (e.g. viewing application) or the printer (driver) will always be able to scale the content up or down.
Apparently, PDF Architect has the "Fit" option enabled by default, so it scales the page to the selected paper size.
You are setting a crop box of 7.5x10 in. I assume you're printing to Letter sized (8.5x11 in) paper. So the 7.5x10 page will indeed be scaled up, and your content will become slightly larger.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
I would not set the crop box. When the pages in the PDF document are Letter size and the output paper is also Letter size, it should not matter whether the "Fit" option is enabled or not, as no scaling needs to happen. It's definitely not a foolproof solution, but at least it's less error-prone.
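The scaling described in this answer can be estimated with a little arithmetic. The sketch below is Python used only for illustration, and it assumes the viewer scales the crop box uniformly until it just fits the target area; fit_scale is a hypothetical helper, and the printable area used in the last case is an invented value, not something measured from PDF Architect or any printer.

```python
def fit_scale(content_w, content_h, area_w, area_h):
    """Uniform scale that makes the content fill the area without overflowing."""
    return min(area_w / content_w, area_h / content_h)

crop_box = (7.5, 10.0)   # inches, the crop box set in the question
letter = (8.5, 11.0)     # inches, Letter paper

# Crop box fitted to the full sheet: a 1" line would print 1.1" long.
scale_to_sheet = fit_scale(*crop_box, *letter)

# No crop box: a Letter page on Letter paper needs no scaling at all.
scale_full_page = fit_scale(*letter, *letter)

# ASSUMPTION: a printer whose printable area is 8.0" x 10.625". With that
# made-up area a 1" line would print 1.0625" (1 1/16") long.
scale_to_printable = fit_scale(*crop_box, 8.0, 10.625)

print(round(scale_to_sheet, 4))      # 1.1
print(round(scale_full_page, 4))     # 1.0
print(round(scale_to_printable, 4))  # 1.0625
```

The second case is the reason for dropping the crop box: when page and paper sizes match, the fit factor is 1 whether or not "Fit" is enabled.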

Lost some text when extracting pdf

I've tried to get all the text on the page by using iText, but I have no idea why every coordinate loses its last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.

Actually cropping a PDF with PDF Clown

My objective is actually cropping a PDF file with PdfClown.
There are a lot of tools and libraries that allow cropping a PDF by changing its CropBox. This hides the contents outside a rectangular area, but the content is still there: it can be accessed through a PDF parser, and the file size does not change.
On the contrary what I need is creating a new page containing only the contents inside the rectangular area.
So far I've tried scanning contents and selectively cloning them. But I didn't succeed yet. Any suggestions on using PdfClown for that?
I've seen someone trying something similar with PDFBox ("Cropping a region from a PDF page with PDFBox"), also without success so far.
A bit late, but maybe it helps someone:
I am successfully doing what you are asking for, but with other libraries.
Required libraries : iText 4 or 5 and Ghostscript
Step 1 with pseudo code
Using iText, Create a PDFWRITER instance with a blank Doc. Open a PDFREADER object to the original file you want to crop. Import the Page, get a PDFTemplate Object from the source, set its .boundingBox property to the desired cropbox, wrap the template into an iText Image object and paste it onto the new page at an absolute position.
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Dim reader As New PdfReader(sourcefile)
Dim doc As New Document()
Dim writer As PdfWriter = PdfWriter.GetInstance(doc, New System.IO.FileStream(outputfilename, System.IO.FileMode.Create))
' the document must be open before any content can be added
doc.Open()
' get the source page as an imported page
Dim page As PdfImportedPage = writer.GetImportedPage(reader, indexOfPageToGet)
' create a PdfTemplate object at original size from the source - see iText in Action book page 91 for full details
Dim pdftemp As PdfTemplate = page.CreateTemplate(page.Width, page.Height)
' paste the original page onto the template object; see the iText documentation for what those parameters do (scaling, mirroring)
pdftemp.AddTemplate(page, 1, 0, 0, 1, 0, 0)
' now the critical part - set the BoundingBox property on the template. This makes all objects outside the rectangle invisible
' (llx, lly, urx, ury are placeholders for the corners of the new crop box, in points)
pdftemp.BoundingBox = New iTextSharp.text.Rectangle(llx, lly, urx, ury)
' the template will not be modified anymore, so it can be written out and released
writer.ReleaseTemplate(pdftemp)
' create an iText Image object as a wrapper around the template - with this img object, absolute positioning on the final page is much easier
Dim img As iTextSharp.text.Image = iTextSharp.text.Image.GetInstance(pdftemp)
' set img position
img.SetAbsolutePosition(x, y)
' set optional rotation if needed
img.RotationDegrees = 0
' finally, this adds the actual content to the new document
doc.Add(img)
' cleanup
doc.Close()
reader.Close()
writer.Close()
The output file will look cropped, but the objects are still present in the PDF stream, so the file size will probably have changed very little at this point.
Step 2:
Using Ghostscript and output device pdfwrite, combined with the correct command line parameters you can re-process the PDF from Step 1. This will give you a much smaller PDF. See Ghostscript documentation for the arguments https://www.ghostscript.com/doc/9.52/Use.htm
This step actually gets rid of the objects that are outside the bounding box, which is the requirement you asked for in your original post. It does so at least for the files that I deal with.
Optional Step 3:
Using mutool clean with the -g option you can garbage-collect unused objects in the xref table. Your original PDF probably had a lot of objects, which increase the file size. After cropping, some of them may not be needed anymore.
https://mupdf.com/docs/manual-mutool-clean.html
PDF is a tricky format. Normally I would agree with @Tilman Hausherr: my suggestion may not work for all files and covers the 'almost impossible' scenario, but it works for all cases that I deal with.

How to draw Matrix Code with PdfSharp?

I need to make a PDF report via PdfSharp. The report must include a QRCode, or data matrix code, but I can't seem to be able to draw it on the page.
The values it's asking for are value as String and length as Integer so here's what I'm doing:
Dim myNewCode As New PdfSharp.Drawing.BarCodes.CodeDataMatrix("1234567890", 10)
Then I try to draw it:
gfx.DrawMatrixCode(myNewCode, myXPoint)
It asks for an XPoint location so I set it to this:
Dim myXPoint As New XPoint(500,500)
Which only needs values for x and y.
It compiles OK, but when I try to open the file I get the following error:
An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem
My Acrobat version is 11.0.5, and there is no problem opening other PDF files which already contain this kind of code.
Specify the size to get a correct PDF file:
var myXSize = new XSize(100, 100);
var myNewCode = new PdfSharp.Drawing.BarCodes.CodeDataMatrix("1234567890", 10, myXSize);
var myXPoint = new XPoint(200, 300);
gfx.DrawMatrixCode(myNewCode, myXPoint);
Please note that due to legal reasons, the open source version of PDFsharp does not include the implementation of the Data Matrix Code and shows dummy images instead.
Another option would be to use a 3rd party library (ZXing) to generate the QR Code bitmap and draw it as a bitmap with DrawImage() on the PDF.

Printing PDF differs on different printers

I created an A4 PDF document using iTextSharp. It contains a simple table in which every cell is in fact a label on the page. I drew it using PdfContentByte, so the size of the programmatic drawing area is 595 x 842 points. Thus I drew the rectangles (the table cells) using points as units.
I fixed the size and position (in points) of the printed content by checking printed pages on my printer. I used Acrobat Reader with the options: no scaling ('None') and default paper size 8.3" x 11.7".
Now if I print the same PDF on a different printer, the content (the table) is shifted to the left and/or towards the top. So the distances between the page's edges and the outer frame of the table are different on different printers. Sometimes the content is cut off, but I know that different printers have different printable areas, so that is understood.
But I cannot understand why it is shifted. Are there other parameters that I don't know about?
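For reference, the figures quoted in this question are consistent with each other: a PDF point is 1/72 inch, so 595 x 842 points is the A4 sheet that Acrobat reports as roughly 8.3" x 11.7". A minimal conversion sketch, in Python for illustration only (points_to_inches and points_to_mm are hypothetical helper names):

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

def points_to_inches(points):
    """Convert PDF user-space points to inches (1 pt = 1/72 in)."""
    return points / POINTS_PER_INCH

def points_to_mm(points):
    """Convert PDF user-space points to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH

a4_points = (595, 842)
a4_inches = tuple(round(points_to_inches(v), 2) for v in a4_points)
a4_mm = tuple(round(points_to_mm(v), 1) for v in a4_points)

print(a4_inches)  # (8.26, 11.69)
print(a4_mm)      # (209.9, 297.0)
```

The same helpers can turn a measured shift on paper back into points when comparing printouts from different printers.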