I'm creating a simple PDF using iText7 (C#) but I need it to be printed at exactly the right size. Here's my code:
PdfWriter writer = new PdfWriter("output.pdf");
PdfDocument pdf = new PdfDocument(writer);
pdf.SetDefaultPageSize(iText.Kernel.Geom.PageSize.LETTER);
var page = pdf.AddNewPage();
page.SetCropBox(new iText.Kernel.Geom.Rectangle(36, 36, 7.5f * 72, 10 * 72));
PdfCanvas canvas = new PdfCanvas(page);
canvas.SetStrokeColor(ColorConstants.BLACK).SetLineWidth(3);
canvas.MoveTo(36, 36);
canvas.LineTo(36, 36 + 72); // Draw a line 1 inch long
canvas.LineTo(36 + 72, 36 + 72); // Draw a second line, perpendicular to the first, also 1 inch long
canvas.ClosePathStroke();
pdf.Close();
If I right-click the resulting PDF and select "Print", my triangle is off the bottom of the page.
When I open the resulting PDF in the PDF program I'm using (PDF Architect), it gives me a few options:
If I just click "Print", it gives me lines that are 1 1/16" long and start about 1/8" from the edge of the page, so by default PDF Architect seems to be taking the contents of my crop box and expanding it to the maximum page availability.
If I click on "Fit" before clicking "Print", that results in the desired output - lines 1" long, starting 1/2" from each side of the page. That works but is error-prone - too easy to forget to click "Fit" every time.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
You will not be able to completely control this from the PDF document. The PDF processor (e.g. viewing application) or the printer (driver) will always be able to scale the content up or down.
Apparently, PDF Architect has the "Fit" option enabled by default, so it scales the page to the selected paper size.
You are setting a crop box of 7.5 x 10 in. I assume you're printing to Letter-sized (8.5 x 11 in) paper. So the 7.5 x 10 in page will indeed be scaled up to fill the paper, and your content will become slightly larger.
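To put a number on "slightly larger", here is a minimal sketch. It assumes the viewer scales the crop box uniformly until it fills the whole Letter sheet; `FitScale` is a hypothetical helper, not part of iText or PDF Architect.

```csharp
using System;

static class PrintScaleSketch
{
    // Uniform scale factor that makes a box of (boxW x boxH) fill a sheet of
    // (paperW x paperH) without overflowing it. All sizes are in inches.
    // Assumption: the viewer fits to the full sheet, not to the printable area.
    public static double FitScale(double boxW, double boxH, double paperW, double paperH)
    {
        return Math.Min(paperW / boxW, paperH / boxH);
    }

    static void Main()
    {
        // Crop box from the question (7.5 x 10 in) on Letter paper (8.5 x 11 in).
        double scale = FitScale(7.5, 10.0, 8.5, 11.0);
        Console.WriteLine($"scale = {scale:F3}");
        Console.WriteLine($"a 1 in line prints as {1.0 * scale:F3} in");
    }
}
```

Under that assumption the factor is 1.1, with the page height as the limiting dimension. A viewer that fits to the printer's printable area rather than the full sheet would use a somewhat smaller factor, which would be consistent with the 1 1/16" lines reported in the question.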
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
I would not set the crop box. When the pages in the PDF document are Letter size and the output paper is also Letter size, it should not matter whether the "Fit" option is enabled or not, as no scaling needs to happen. It's definitely not a foolproof solution, but at least it's less error-prone.
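A minimal sketch of that suggestion, assuming the same iText 7 setup as in the question. The code is the question's own with the SetCropBox call dropped, so the page stays a full Letter page and the drawing coordinates themselves keep the content 1/2" away from the edges.

```csharp
using iText.Kernel.Colors;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas;

class LetterPageWithoutCropBox
{
    static void Main()
    {
        PdfWriter writer = new PdfWriter("output.pdf");
        PdfDocument pdf = new PdfDocument(writer);
        pdf.SetDefaultPageSize(PageSize.LETTER);   // 612 x 792 points
        PdfPage page = pdf.AddNewPage();

        // No SetCropBox call: the visible page is the full Letter media box.

        PdfCanvas canvas = new PdfCanvas(page);
        canvas.SetStrokeColor(ColorConstants.BLACK).SetLineWidth(3);
        canvas.MoveTo(36, 36);                     // 1/2 inch from the left and bottom edges
        canvas.LineTo(36, 36 + 72);                // 1 inch up
        canvas.LineTo(36 + 72, 36 + 72);           // 1 inch to the right
        canvas.ClosePathStroke();                  // close the triangle and stroke it
        pdf.Close();
    }
}
```

Whether the result prints at exactly 100% still depends on the viewer and printer driver, as noted above.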
I've written code to do the following:
Take a PDF of a certain page size (e.g., 8.5" x 11")
Create a new PDF with a larger page size (e.g., 17" x 11")
Impose the original PDF onto the new one (e.g., 2-up such that the resulting new PDF has the original PDF side-by-side)
To do this, I use the PdfWriter.GetImportedPage method to get the current page from the original PDF, then use the PdfContentByte.AddTemplate(page, x, y) method to place the original page onto the current page of the new PDF.
My new challenge is that I need to crop the original PDF before adding it to the new PDF. For example, imagine I want to crop 2" off of the original PDF before imposing it. The input PDF would still be 8.5" x 11" and the new PDF would still be 17" x 11", but the two "copies" of the original PDF in the new one would have had 2" removed from its top, right, bottom and left sides.
Hopefully these images can make this clearer. Here's what I have now, doing a 2-up imposition. (This is working swimmingly.)
But here's what I need to do:
I know that I can alter the display of the PDF in a viewer by using the MediaBox or CropBox settings, but those settings aren't respected by AddTemplate. I know that with AddTemplate I can use a transform matrix to position the page or to scale or rotate it, but I don't want to shrink the original PDF, I want to crop it.
Thanks
I found that I can use the BoundingBox of the imported page to crop it prior to adding it to the new PDF (via AddTemplate).
So my code looks something like this:
PdfImportedPage page = writer.GetImportedPage(pageNumber);
// Crop!
page.BoundingBox = new Rectangle(llx, lly, urx, ury);
// Add to new PDF
writer.DirectContent.AddTemplate(page, x, y);
That does the trick!
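For the example in the question (2" trimmed from every side of an 8.5" x 11" page), the rectangle coordinates can be worked out in points, at 72 points per inch. The sketch below is plain arithmetic and assumes the page's lower-left corner is at (0, 0); `CropBounds` is a hypothetical helper, not an iTextSharp method.

```csharp
using System;

static class CropSketch
{
    const float PointsPerInch = 72f;

    // Returns the lower-left and upper-right corners (in points) of a page
    // after trimming the same amount from all four sides.
    // Assumption: the page origin is (0, 0).
    public static (float llx, float lly, float urx, float ury) CropBounds(
        float pageWidthIn, float pageHeightIn, float trimIn)
    {
        float trim = trimIn * PointsPerInch;
        return (trim,
                trim,
                pageWidthIn * PointsPerInch - trim,
                pageHeightIn * PointsPerInch - trim);
    }

    static void Main()
    {
        var b = CropBounds(8.5f, 11f, 2f);
        Console.WriteLine($"{b.llx}, {b.lly}, {b.urx}, {b.ury}");
    }
}
```

For these inputs the values are 144, 144, 468 and 648, which would be passed as llx, lly, urx and ury. Since the bounding box presumably only clips the template and does not move its origin, the x and y passed to AddTemplate may need to be reduced by llx and lly to place the cropped area where the uncropped page used to start.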
I've tried to get all the text on the page using iText, but I have no idea why every coordinate text loses its last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS
In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately the text in the text drawing instructions and the text drawn using paths does not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.
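One way to check for this situation from code is to look at the rendering mode of each text chunk. The sketch below assumes that iText 7's `TextRenderInfo.GetTextRenderMode()` reports the mode set by the `Tr` operator; the class name `RenderModeListener` is made up for this example, and the file path is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical listener: prints every text chunk with its text rendering mode.
class RenderModeListener : IEventListener
{
    public void EventOccurred(IEventData data, EventType type)
    {
        if (data is TextRenderInfo info)
        {
            // Mode 3 means the text is neither filled nor stroked, i.e. invisible.
            Console.WriteLine($"mode {info.GetTextRenderMode()}: {info.GetText()}");
        }
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        // Only text rendering events are of interest here.
        return new List<EventType> { EventType.RENDER_TEXT };
    }
}

class Program
{
    static void Main()
    {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
        PdfCanvasProcessor parser = new PdfCanvasProcessor(new RenderModeListener());
        parser.ProcessPageContent(pdfDoc.GetFirstPage());
        pdfDoc.Close();
    }
}
```

If every chunk is reported with mode 3, the extractable text is an invisible layer and what the viewer displays comes from something else, here the stroked paths.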
I have seen this effect many times while reading PDF documents. Some PDFs have this funny smudged font which looks like a scanned image. However, I am able to select the text, and while it is selected the highlighted text looks different, as seen in the images.
Default appearance
Appearance on selection of font
Overall, it seems like some OCR is happening behind the scenes.
The document reader I am using is Atril 1.12.2 document viewer.
My question is: What is encoded in the pdf, image or text? What is happening to text when I am selecting it?
Another nice change can be observed in the document shared by the OP:
What we see here is indeed the result of OCR. But it's not OCR happening behind the scenes in the viewer; the OCR has already happened beforehand, and its results have been integrated into the PDF.
The PDF page actually contains a scanned image upon which invisible text is drawn.
As long as nothing is selected, Atril shows exactly that, you only see the scanned image. As soon as you start selecting text, though, it appears to cover the marked area in blue and display the marked (formerly invisible) text in white upon it.
In situations, therefore, in which the invisible text is not added exactly above the corresponding letters in the image, this might result in funny gaps like the one in the OP's screenshot after "multidimensional". In case of errors in the OCR output, one sees the erroneous data like in my screenshots.
Other PDF viewers often merely mark the text by applying some effect to the text area, e.g. inverting colors or overlaying a semi-transparent color.
It might be considered an advantage of the Atril approach that already in the selection process one sees the exact text one is selecting and probably eventually going to copy.
Inside the content stream
As mentioned above, the PDF page actually contains a scanned image upon which invisible text is drawn.
In the page content stream the corresponding instructions look like this:
1 0 0 1 0 0.2401 cm
(shift the coordinate system a minute bit up)
1 1 1 rg
1 i
/RelativeColorimetric ri
/R794 gs
0 0 576 719.5 re
f
(filling the area the image is going to cover with white)
q
576 0 0 719.5 0 0 cm
/Im0 Do
Q
(drawing the bitmap image)
1 0 0 1 0 -0.2401 cm
(shift the coordinate system a minute bit down, undoing the initial upshift)
BT
(beginning a text object)
0 0 0 rg
(setting the fill color to black)
/TT1 1 Tf
0.05 Tc
0 Tw
3 Tr
(selecting the font TT1 at size 1, a bit of extra space between characters, no extra space between words, and text rendering mode 3, i.e. invisible)
7.3 0 0 7.3 83.8 678.4401 Tm
(SOFTWARE-PRACTICE ) Tj
(setting the text coordinate system to be shifted by 83.8 horizontally and 678.4401 vertically and to be scaled by 7.3 and drawing some text)
0.08 Tc
7.4 0 0 7.1 175.2 678.4401 Tm
(AND ) Tj
(changing character spacing a bit, setting the text coordinate system to be shifted by 175.2 horizontally and 678.4401 vertically and to be scaled by 7.4 horizontally and 7.1 vertically and drawing some text)
...
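The arithmetic behind the `Tm` remarks can be sketched as follows. A text matrix `a b c d e f` maps a text-space point (x, y) to (a*x + c*y + e, b*x + d*y + f), so with a font size of 1 the scale factors in the matrix act as the effective font size. The sketch ignores the current transformation matrix (the two `cm` shifts above cancel each other) and character spacing; `Apply` is a hypothetical helper.

```csharp
using System;

static class TextMatrixSketch
{
    // Applies the matrix written in a content stream as "a b c d e f Tm"
    // to a point (x, y) in text space.
    public static (double x, double y) Apply(
        double a, double b, double c, double d, double e, double f,
        double x, double y)
    {
        return (a * x + c * y + e, b * x + d * y + f);
    }

    static void Main()
    {
        // 7.3 0 0 7.3 83.8 678.4401 Tm  (first text chunk above)
        var origin = Apply(7.3, 0, 0, 7.3, 83.8, 678.4401, 0, 0);
        var unitUp = Apply(7.3, 0, 0, 7.3, 83.8, 678.4401, 0, 1);

        Console.WriteLine($"text origin: {origin.x:F4}, {origin.y:F4}");
        Console.WriteLine($"height of one text-space unit: {unitUp.y - origin.y:F4}");
    }
}
```

The origin lands at (83.8, 678.4401), and one text-space unit is about 7.3 points tall, which is why a font selected at size 1 still ends up at a readable size.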
TL;DR
What is encoded in the pdf, image or text?
Both, the image plus invisible text upon it.
What is happening to text when I am selecting it?
Atril covers the text in blue and draws the selected (formerly invisible) text upon it in white.
I would like to draw 2 texts onto my PDF.
The first text should be aligned to the top left corner.
This works fine.
I'm using:
canvas = stamper.GetOverContent(i)
watermarkFont = iTextSharp.text.pdf.BaseFont.CreateFont(iTextSharp.text.pdf.BaseFont.HELVETICA, iTextSharp.text.pdf.BaseFont.CP1252, iTextSharp.text.pdf.BaseFont.NOT_EMBEDDED)
watermarkFontColor = iTextSharp.text.BaseColor.RED
canvas.MoveTo(0, 0) 'I think the canvas is the space that we draw onto. My documents always start at position X=0 and Y=0, so move to 0,0 should be fine
canvas.BeginText()
canvas.SetFontAndSize(watermarkFont, 12)
canvas.SetColorFill(watermarkFontColor)
canvas.ShowTextAligned(Element.ALIGN_TOP, uText, 0, 830, 0) 'is 830 the width of the available space?
canvas.EndText()
Now I would like to draw another text approximately 100 pixels below the first text.
I'm using:
canvas.MoveTo(0, 100) 'let's draw the second text at X=100, Y=100
canvas.BeginText()
canvas.SetFontAndSize(watermarkFont, 12)
canvas.SetColorFill(watermarkFontColor)
canvas.ShowTextAligned(Element.ALIGN_CENTER, uBewirtung, 0, 830, 0)
canvas.EndText()
The second text however doesn't show up at all.
I suspect I'm drawing outside the document, but I don't see my mistake.
The MoveTo() method is meant for drawing paths (lines and shapes in graphics state), not text (in text state). It adds an m operator to the content stream. If you are a PDF specialist, you should use the SetTextMatrix() method inside your BT/ET text block: What does setTextMatrix of contentByte class in iText do?
Note the if; it is important. If you are not a PDF specialist, you shouldn't be toying around with those methods. You should use ColumnText.ShowTextAligned() instead of BeginText(), EndText() and all of the lines you added in-between. Those methods are meant for people who speak PDF syntax.
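A minimal sketch of the ColumnText.ShowTextAligned() route, written in C# against iTextSharp 5. The file names, the sample strings and the coordinates are assumptions for illustration. x and y are in points measured from the lower-left corner of the page (assuming the default coordinate system), so on an A4 page (595 x 842 points) y = 830 is near the top edge and y = 730 is 100 points below it.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class StampTwoTexts
{
    static void Main()
    {
        // "input.pdf" and "stamped.pdf" are placeholder file names.
        PdfReader reader = new PdfReader("input.pdf");
        using (FileStream output = new FileStream("stamped.pdf", FileMode.Create))
        {
            PdfStamper stamper = new PdfStamper(reader, output);
            Font font = new Font(Font.FontFamily.HELVETICA, 12, Font.NORMAL, BaseColor.RED);

            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                PdfContentByte canvas = stamper.GetOverContent(i);

                // First text: left-aligned, close to the top-left corner.
                ColumnText.ShowTextAligned(canvas, Element.ALIGN_LEFT,
                    new Phrase("First text", font), 10, 830, 0);

                // Second text: same x, 100 points further down the page.
                ColumnText.ShowTextAligned(canvas, Element.ALIGN_LEFT,
                    new Phrase("Second text", font), 10, 730, 0);
            }

            stamper.Close();
        }
        reader.Close();
    }
}
```

No MoveTo(), BeginText() or EndText() calls are needed here; the position of each text is given entirely by the x and y arguments.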
I created an A4 PDF document using iTextSharp. It contains a simple table in which every cell is in fact a label on the page. I drew it using PdfContentByte, so the programmatic drawing area is 595 x 842 points, and I drew the rectangles (the table cells) using points as units.
I tuned the size and position (in points) of the printed content by checking printed pages from my printer. I used Acrobat Reader with these options: no scaling ('None') and default paper size 8.3" x 11.7".
Now if I print the same PDF on a different printer, the content (table) is shifted to the left and/or towards the top. So the distances between the page edges and the outer frame of the table differ between printers. Sometimes the content is cut off, but I know that different printers have different printable areas, so that is understood.
But I cannot understand why it is shifted. Are there other parameters that I don't know about?