Lost some text when extracting pdf

I've tried to get all the text on the page using iText, but I have no idea why every coordinate text loses its last two characters.
PdfDocument pdfDoc = new PdfDocument(new PdfReader(@"E:\Coding\COOR.pdf"));
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
parser.ProcessPageContent(pdfDoc.GetFirstPage());
Console.Write(strategy.GetResultantText());
pdfDoc.Close();
Console.WriteLine("Great!");
Console.ReadKey();
You can also download my code from
https://1drv.ms/u/s!Al1hUSZtR4OjwU3XVBRQGneVaZlS

In short
The reason for that "lost text" is that the missing "text" isn't there to start with!
In detail
The contents of your PDF file are constructed in a misleading manner.
On the one hand there are very many path definitions which then are stroked (drawn). These drawings create what you can see in a viewer, both text and table lines.
On the other hand there are a few text drawing instructions to draw text using text rendering mode 3 which is... invisible! These drawings create the text you can copy&paste in a viewer or extract using iText.
Unfortunately, the text in the text drawing instructions and the text drawn using paths do not match completely. The text you retrieve via copy&paste or text extraction, therefore, differs from your expectations.
Also, the glyph sizes and positions are not exactly the same.
To illustrate this I made the text drawing instructions use the normal (fill) text rendering mode. The top left corner which originally looks like this:
with that change looks like this:
As you see the formerly invisible text is only approximately at the same position as the visible drawings, and it is somewhat broken: The symbol for degrees is weirdly represented as "¡ã", and the longitude fractional seconds and the following symbol for seconds are missing.
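The rendering mode switch described above can be sketched on an already decoded content stream. This is only a naive illustration and not the code used for the screenshots: the class and method names are made up, it assumes the stream has already been decompressed into a string, and a plain regular expression like this would also touch a "3 Tr" that happens to sit inside a string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive sketch: locate and rewrite "3 Tr" (invisible text) in a decoded content stream. */
final class RenderModeSketch {
    // An integer operand followed by the Tr operator, both delimited by whitespace or stream ends.
    private static final Pattern TR = Pattern.compile("(?<!\\S)(\\d+)(\\s+)Tr(?!\\S)");

    /** Counts Tr instructions that select mode 3 (neither fill nor stroke, i.e. invisible). */
    static int countInvisibleModes(String content) {
        Matcher m = TR.matcher(content);
        int count = 0;
        while (m.find()) {
            if (m.group(1).equals("3")) {
                count++;
            }
        }
        return count;
    }

    /** Rewrites every "3 Tr" to "0 Tr" (fill) so the hidden text becomes visible. */
    static String makeInvisibleTextVisible(String content) {
        Matcher m = TR.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String mode = m.group(1).equals("3") ? "0" : m.group(1);
            m.appendReplacement(out, Matcher.quoteReplacement(mode + m.group(2) + "Tr"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A robust version would tokenize the content stream properly instead of pattern matching on it.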
To correctly extract the originally visible data, you'll need to use OCR instead of text extraction.

Related

iText7 - create PDF with exact dimensions when printed - how?

I'm creating a simple PDF using iText7 (C#) but I need it to be printed at exactly the right size. Here's my code:
PdfWriter writer = new PdfWriter("output.pdf");
PdfDocument pdf = new PdfDocument(writer);
pdf.SetDefaultPageSize(iText.Kernel.Geom.PageSize.LETTER);
var page = pdf.AddNewPage();
page.SetCropBox(new iText.Kernel.Geom.Rectangle(36, 36, 7.5f * 72, 10 * 72));
PdfCanvas canvas = new PdfCanvas(page);
canvas.SetStrokeColor(ColorConstants.BLACK).SetLineWidth(3);
canvas.MoveTo(36, 36);
canvas.LineTo(36, 36 + 72); // Draw a line 1 inch long
canvas.LineTo(36 + 72, 36 + 72); // Draw a second line, perpendicular to the first, also 1 inch long
canvas.ClosePathStroke();
pdf.Close();
If I right-click the resulting PDF and select "Print", my triangle is off the bottom of the page.
When I open the resulting PDF in the PDF program I'm using (PDF Architect), it gives me a few options:
If I just click "Print", it gives me lines that are 1 1/16" long and start about 1/8" from the edge of the page, so by default PDF Architect seems to be taking the contents of my crop box and expanding it to the maximum page availability.
If I click on "Fit" before clicking "Print", that results in the desired output - lines 1" long, starting 1/2" from each side of the page. That works but is error-prone - too easy to forget to click "Fit" every time.
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?

You will not be able to completely control this from the PDF document. The PDF processor (e.g. viewing application) or the printer (driver) will always be able to scale the content up or down.
Apparently, PDF Architect has the "Fit" option enabled by default, so it scales the page to the selected paper size.
You are setting a crop box of 7.5x10 in. I assume you're printing to Letter sized (8.5x11 in) paper. So the 7.5x10 page will indeed be scaled up, and your content will become slightly larger.
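As a rough illustration of that scaling, the factor a "fit to paper" option applies can be estimated as the smaller of the two width and height ratios. The sketch below assumes the viewer scales uniformly and fits the crop box to the full Letter sheet; a real printer's printable area is usually a little smaller than the sheet, so the actual factor would be somewhat lower. All names are made up for the example.

```java
/** Sketch: uniform scale factor applied when a page box is fitted to a target area. */
final class FitScaleSketch {
    static final double POINTS_PER_INCH = 72.0;

    /** Largest uniform scale at which the content box still fits inside the target area. */
    static double fitScale(double contentW, double contentH, double targetW, double targetH) {
        return Math.min(targetW / contentW, targetH / contentH);
    }

    public static void main(String[] args) {
        // 7.5 x 10 in crop box fitted to an 8.5 x 11 in sheet (assumption: no printer margins).
        double scale = fitScale(7.5 * POINTS_PER_INCH, 10 * POINTS_PER_INCH,
                                8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH);
        System.out.printf("scale = %.4f, so a 1 in line prints about %.4f in long%n", scale, scale);
    }
}
```

With these numbers the height ratio (11 / 10) is the limiting one, so lines come out up to about 10% longer than drawn.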
Is there a way to generate a PDF that contains information that says "I'm targeting this document at letter size, but I'm staying 1/2" away from all the edges, so when you print, if the printer has margins <= 1/2 inch you should be fine, and just print it exactly how I've described without any shrinking or enlarging"?
I would not set the crop box. When the pages in the PDF document are Letter size and the output paper is also Letter size, it should not matter whether the "Fit" option is enabled or not, as no scaling needs to happen. It's definitely not a foolproof solution, but at least it's less error-prone.

Libre Office Labels don't show up as "AcroFields" in iTextSharp?

I've been trying to generate a report. I've tried quite a few things already, but there always seem to be problems. I'm currently trying iTextSharp 4.1.6.
My current strategy is to use LibreOffice to create a document with editable pdf fields, or I guess they are called "AcroFields". I'm not sure since I can't find a definition. But anyways, I assume that all of these are "AcroFields":
But if I put all of those into a form and export as pdf only some of them show up as AcroFields:
var reader = new PdfReader(File.ReadAllBytes("abc.pdf"));
foreach(var field in reader.AcroFields.Fields)
{
Console.WriteLine(((DictionaryEntry)field).Key);
}
> Text Box 1
Check Box 1
Numeric Field 1
Formatted Field 1
Date Field 1
List Box 1
Combo Box 1
Push Button 1
Option Button 1
Notice how Label Field 1 is not present. If it were present then doing a text replace might be easy. Except it's not present so it's looking like even iText can't do a simple text replace in a pdf. Is this true? How would you replace text in a pdf document using iTextSharp?

Notice how Label Field 1 is not present.
As there is no AcroForm form field type "label", form labels usually are drawn as regular page content in PDF files.
If it were present then doing a text replace might be easy. Except it's not present so it's looking like even iText can't do a simple text replace in a pdf. Is this true?
Indeed, in general there is no simple text replacement in a PDF.
How would you replace text in a pdf document using iTextSharp?
I would determine the bounding box coordinates of the text to replace using the iText text extraction feature with some extension that returns text plus coordinates. Then I'd remove that text by redaction using iText's PdfCleanUp... classes. Finally I'd add the replacement text as new text in the bounding box determined at start.
Unfortunately for you, neither good text extraction nor redaction is present in your version 4.1.6; for this approach you should update at least to 5.5.x.
Alternatively, though, as you've been trying to generate a report, I assume the template design is in your hands. In that case you can put your labels into read-only text fields which you can change (they are read-only only to GUI users).

How do I use iTextSharp (or iText) to crop and copy a page from one PDF to another

I've written code to do the following:
Take a PDF of a certain page size (e.g., 8.5" x 11")
Create a new PDF with a larger page size (e.g., 17" x 11")
Impose the original PDF onto the new one (e.g., 2-up such that the resulting new PDF has the original PDF side-by-side)
To do this, I use the PdfWriter.GetImportedPage method to get the current page from the original PDF, then use the PdfContentByte.AddTemplate(page, x, y) method to place the original page onto the current page of the new PDF.
My new challenge is that I need to crop the original PDF before adding it to the new PDF. For example, imagine I want to crop 2" off of the original PDF before imposing it. The input PDF would still be 8.5" x 11" and the new PDF would still be 17" x 11", but the two "copies" of the original PDF in the new one would have had 2" removed from its top, right, bottom and left sides.
Hopefully these images can make this clearer. Here's what I have now, doing a 2-up imposition. (This is working swimmingly.)
But here's what I need to do:
I know that I can alter the display of the PDF in a viewer by using the MediaBox or CropBox settings, but those settings aren't respected by AddTemplate. I know that with AddTemplate I can use a transform matrix to position the page or to scale or rotate it, but I don't want to shrink the original PDF, I want to crop it.
Thanks

I found that I can use the BoundingBox of the imported page to crop it prior to adding it to the new PDF (via AddTemplate).
So my code looks something like this:
PdfImportedPage page = writer.GetImportedPage(reader, pageNumber); // reader: the PdfReader opened on the original PDF
// Crop!
page.BoundingBox = new Rectangle(llx, lly, urx, ury);
// Add to new PDF
writer.DirectContent.AddTemplate(page, x, y);
That does the trick!
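The llx, lly, urx and ury values above are not spelled out, so here is a small sketch of how they might be computed for the example from the question (an 8.5" x 11" page with 2" trimmed from every side). The helper names are hypothetical. The offset helper rests on an assumption: that AddTemplate(page, x, y) only translates the page's own coordinate origin to (x, y), so the cropped area's lower-left corner ends up at (x + llx, y + lly).

```java
/** Sketch: bounding box for trimming an equal margin from all four sides of a page. */
final class CropBoxSketch {
    static final float POINTS_PER_INCH = 72f;

    /** Returns {llx, lly, urx, ury} in points for a page size and trim given in inches. */
    static float[] croppedBox(float pageWidthIn, float pageHeightIn, float trimIn) {
        if (trimIn < 0 || 2 * trimIn >= pageWidthIn || 2 * trimIn >= pageHeightIn) {
            throw new IllegalArgumentException("trim leaves no visible area");
        }
        float llx = trimIn * POINTS_PER_INCH;
        float lly = trimIn * POINTS_PER_INCH;
        float urx = (pageWidthIn - trimIn) * POINTS_PER_INCH;
        float ury = (pageHeightIn - trimIn) * POINTS_PER_INCH;
        return new float[] {llx, lly, urx, ury};
    }

    /**
     * Offset to pass as x (or y) so the cropped corner lands at the target coordinate,
     * assuming the placement is a pure translation of the page's origin.
     */
    static float placementOffset(float target, float lowerLeft) {
        return target - lowerLeft;
    }
}
```

For the example page this gives the box (144, 144, 468, 648), and placing the cropped area at the left edge of the new page would need x = -144 under the stated assumption.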

How the value of the tj operator is generated in a pdf document (justified text)

I can't work out how the values of the TJ operator are generated.
Here are the results before and after changing how the text is displayed (for the second block I switched the alignment to left-justified and then back to centered).
I think the PDF software uses some kind of PRNG, but I can't find out which kind.
Help, please.
[(\003\024\027\005\003\030\036\b)-114.267(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.337(\b)-111.574(#\024\002\f\005\002\021\003\007\004\f\005\b)-117.089(\003\006\002\003\b)-114.08
[(\003\024\027\005\003\030\036\b)-114.366(\003\006\007\024\036\b)-113.297(\026\002\024\003\032\020\b)-113.327(\b)-111.693(#\024\002\f\005\002\021\003\007\004\f\005\b)-116.98(\003\006\002\003\b)-114.188

First of all, the PDF format does not explicitly support text justification. PDF does not even know text column definitions to justify text in!
All the PDF format supports is
setting or changing the text matrix (and text line matrix), scaling, character and word spacing explicitly and
drawing text pieces which implicitly changes the text matrix.
Thus, if a PDF processor changes the justification of a line of text, it actually first has to have determined
which text pieces belong together and form that line of text;
text pieces can be given as arguments of the Tj or TJ instructions (or more seldom the " or ' instructions); in simple cases the whole line is drawn using a single instruction but you cannot count on that in general; and
what the left and right borders of the text column are to justify between;
e.g. these borders might be standard values assumed by the processor for certain page formats or derived from the current clip path.
Having determined these data, the procedure differs for different kinds of justification:
left justification - position the text matrix at the left text column border at the height of the line and simply let the text drawing instructions follow;
right justification - calculate the width of the drawn line using the current font, position the text matrix at the right text column border minus that width at the height of the line, and let the text drawing instructions follow;
center justification - calculate the width of the drawn line using the current font, position the text matrix at the middle of the text column minus half that width at the height of the line, and let the text drawing instructions follow;
full justification - calculate the width of the drawn line using the current font, set the character spacing and word spacing (using the Tc and Tw instructions, probably with a tweak of the Tz horizontal scaling) to use up the difference between that width and the text column width, position the text matrix at the left text column border at the height of the line, and let the text drawing instructions follow;
or calculate the width of the drawn line using the current font, change the text drawing instructions to use up the difference between that width and the text column width (e.g. using the numeric TJ array argument values), position the text matrix at the left text column border at the height of the line, and let the changed text drawing instructions follow;
or even apply a combination of these methods.
(The changes applied when doing a full justification - character spacing, word spacing, changes of text drawing instructions - obviously additionally are undone when later again changing to another type of justification...)
Positioning the text matrix can happen using the Tm, Td, TD, and T* instructions.
By the way, the positioning and scaling of the text also is influenced by the current transformation matrix. Thus, cm instructions can also be used for justification. But this is less likely than the use of the instructions mentioned above...
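The arithmetic behind the four kinds of justification described above can be sketched as follows. This is a simplified illustration with made-up names, assuming the line width has already been measured with the current font, horizontal scaling is 100%, and the text matrix applies no extra scaling. For full justification it shows only the word spacing variant, and it relies on word spacing being applied to the single-byte character code 32 only, so it would not help with fonts that encode the space differently.

```java
/** Sketch: start positions and word spacing for justifying one line in a text column. */
final class JustifySketch {
    /** Left justification: start at the left column border. */
    static double leftStart(double columnLeft) {
        return columnLeft;
    }

    /** Right justification: right border minus the measured line width. */
    static double rightStart(double columnRight, double lineWidth) {
        return columnRight - lineWidth;
    }

    /** Center justification: middle of the column minus half the line width. */
    static double centerStart(double columnLeft, double columnRight, double lineWidth) {
        return (columnLeft + columnRight) / 2.0 - lineWidth / 2.0;
    }

    /** Full justification via Tw: spread the leftover width evenly over the spaces. */
    static double wordSpacing(double columnLeft, double columnRight, double lineWidth, int spaceCount) {
        if (spaceCount <= 0) {
            return 0.0; // nothing to stretch; a processor might fall back to Tc or TJ tweaks
        }
        return (columnRight - columnLeft - lineWidth) / spaceCount;
    }
}
```

For a column from x = 72 to x = 540 and a line 400 units wide with 8 spaces, this yields start positions 72, 140 and 106, and a word spacing of 8.5.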
Unfortunately you merely supplied an excerpt from the array argument of a TJ instruction before and after such a justification job. One sees that the numeric elements of that array change very slightly. Whether this actually is the justification itself (as per the second option of the full justification above) or merely some computational inaccuracy cannot be told without the context.
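To get a feeling for how slight these changes are: a numeric element of a TJ array is expressed in thousandths of a text space unit and is subtracted from the current position, so it moves the following text by -(value / 1000) x font size x horizontal scaling. The sketch below applies this to the first pair of values from the question. The font size of 12 is purely an assumption, as the excerpt does not show the Tf instruction, and the names are made up.

```java
/** Sketch: horizontal shift caused by a numeric element of a TJ array. */
final class TjShiftSketch {
    /**
     * Shift in unscaled text space units. Negative array values move the
     * following text to the right (in horizontal writing mode).
     */
    static double shift(double tjValue, double fontSize, double horizontalScale) {
        return -(tjValue / 1000.0) * fontSize * horizontalScale;
    }

    public static void main(String[] args) {
        double fontSize = 12.0; // assumption: not visible in the excerpt
        double before = shift(-114.267, fontSize, 1.0);
        double after = shift(-114.366, fontSize, 1.0);
        System.out.printf("before: %.6f, after: %.6f, difference: %.6f%n",
                before, after, after - before);
    }
}
```

At that assumed font size the two variants differ by roughly a thousandth of a unit per gap, which is far too small to see on screen or paper.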

Set a PDFFormField's value with a specific font size

I have a very simple use case for filling in an AcroForm. I have a single-line text field, and I would like to adjust the font size so that the text fits the width of the field.
The PDF spec mentions that a font size of 0 implies auto fit to width. However, PDFBOX-1419 and PDFBOX-1402 mention that this isn’t supported in PDFBox.
Hence I have some small logic to calculate the font sizes based on the widths etc. However, I’m facing problems setting the font size.
I’m seeing the behavior mentioned in PDFBOX-1419.
The field starts out with the incorrect font size. If I click into the field, it displays correctly. If I click outside the field, it reverts to the wrong display.
Code:
pdfFormField.getDictionary().setString(COSName.DA, "/Helv 10 Tf 0 g");
pdfFormField.setValue("Hello");
Any pointers or help would be much appreciated.
A simple example of such a PDF is here

PDFBox form field classes read the default appearance into a member variable early in their life cycle and do not pick up later changes in the form field dictionary they are based on. Thus, when creating the appearance stream during pdfFormField.setValue("Hello"), the former DA value is used.
After setting the default appearance, therefore, you have to instantiate the form field object anew. Then set the field value using this new object.
For sample code look at this answer to How to set the text of a PDTextbox to a color?; here the existing DA value of a text field is changed to contain a color setting operation before the field value is set.
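The question mentions some logic to calculate the font size from the field width but does not show it. Independent of PDFBox, such a calculation might look like the sketch below. It assumes the text width is available as the sum of the glyph widths in thousandths of a text space unit (which is how simple font widths are expressed), and the padding and maximum size parameters are made up for the example.

```java
/** Sketch: largest font size at which a single line of text fits into a field. */
final class FontFitSketch {
    /**
     * @param textWidthUnits sum of the glyph widths of the text, in 1/1000 text space units
     * @param fieldWidth     width of the field rectangle in points
     * @param padding        space to keep free on the left and on the right, in points
     * @param maxSize        upper limit so that short texts do not become huge
     */
    static double fitFontSize(double textWidthUnits, double fieldWidth, double padding, double maxSize) {
        if (textWidthUnits <= 0) {
            return maxSize; // empty text fits at any size
        }
        double available = fieldWidth - 2 * padding;
        if (available <= 0) {
            throw new IllegalArgumentException("field too narrow for the padding");
        }
        // Width at size s is textWidthUnits / 1000 * s, so solve that for s.
        double size = available * 1000.0 / textWidthUnits;
        return Math.min(size, maxSize);
    }
}
```

The resulting size would then go into the DA string in place of the fixed 10, before the field object is instantiated anew and the value is set.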
