Special characters in PDF form fields and global and field-based /DR

I have a question regarding a weird form field behaviour.
Two PDF documents both have text field(s) using Helvetica as the font.
Both are filled with values using the same iText logic (see the code below).
The field value (/V) is correct for both PDFs; however, the field appearance is not.
One PDF works fine, the other scrambles special characters like the euro symbol € or German characters like üöäß.
I tried to define a substitute font (as described in the book) but never got € and ß to work.
The only difference I could find is that a /DR dictionary is defined at field level in the non-working PDF (in addition to the global one). But even if I remove it, the € sign still doesn't work. Please note that I am not talking about Asian or other exotic Unicode characters here; all of them are part of the standard Helvetica font (as the other PDF proves).
Question(s):
Any ideas how to get the non-working PDF to correctly display the characters?
Or does the PDF violate the PDF spec somehow? (It was created using Acrobat, which makes that unlikely but not impossible.)
If you suggest replacing the form field font: how can I differentiate between working and non-working PDF files? I don't want to do that for perfectly valid and working files.
Update: The code is not the problem (I am certain of that since it is the same code for both), but for the sake of completeness here it is:
AcroFields acroFields = stamper.getAcroFields();
try {
    boolean successful = acroFields.setField("Mitarbeiter", "öäü߀#");
    if (!successful) {
        // throw some exception
    }
} catch (DocumentException de) {
    // some exception handling
}

I didn't find any clues in the PDF reference about this, but the font that is used for the field doesn't define an encoding. However: an encoding is defined at the level of the resource dictionary (/DR). If you use that encoding, then the appearance of the field is created correctly. Note that the ISO specification doesn't say anything about the existence of an /Encoding entry at the level of the resource dictionary.
I've made a small update to iText. You can check the changes in revision 6693. This way, iText will now check if the /DR dictionary has encoding values in case no encoding is defined at the level of the font. With this fix, your form is filled out correctly.
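If you cannot pick up a build that already contains that revision, a workaround you could try is to force the appearance to be built with a font that carries an explicit encoding. This is only a sketch, assuming the iText 5 AcroFields API (the field name "Mitarbeiter" is taken from the question; exception handling omitted):

import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.BaseFont;

AcroFields acroFields = stamper.getAcroFields();

// A Helvetica BaseFont with an explicit WinAnsi (Cp1252) encoding, which covers € and üöäß.
BaseFont helvetica = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

// Option 1: register it as a substitution font used when appearances are generated.
acroFields.addSubstitutionFont(helvetica);

// Option 2: replace the font of this specific field before setting its value.
acroFields.setFieldProperty("Mitarbeiter", "textfont", helvetica, null);

acroFields.setField("Mitarbeiter", "öäü߀#");

Whether this helps depends on the PDF at hand; the asker reports that a substitution font alone did not fix € and ß, so the /DR-aware fix in the revision above remains the proper solution.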

Related

Why is it so hard to convert PDF to plain text?

I needed to convert some PDFs back to text. I tried many software and online tools and the result was always mediocre.
Why is it so difficult, technically speaking?
Let's not assume you are talking about PDFs which merely wrap some bitmap image because it should be clear that in that case you can only resort to OCR with all its restrictions.
Let's instead assume that text is drawn in the PDF at hand.
What is drawn on a PDF page is determined by a sequence of instructions in the content stream of that page. "Text is drawn" on a page means that among those instructions there are some setting the font to use by the instructions to come, some setting the text position and direction to use by the instructions to come, and some actually drawing text given by "string arguments".
Text extraction is the task of taking the sequence of instructions from a content stream and, instead of drawing the text as indicated by the font and position setting instructions, exporting it in a sensible order using a standard encoding, usually the encoding of the character type of the programming language / platform used.
The first problem is to understand the encoding of the string arguments of those text drawing instructions:
each font can have its own encoding; to extract the text one cannot simply ignore everything but the instructions drawing text and concatenate their string contents, you always have to take the current font into account (some extremely simple text extractors ignore this and, therefore, fail pretty often to return something sensible);
there are a large number of predefined encodings, some reminiscent of encodings you know, e.g. WinAnsiEncoding, and many you likely don't know, e.g. Add-RKSJ-H; these encodings may use a constant number of bytes per glyph or they may be mixed-multibyte; so a text extractor must support very many encodings to start with;
encodings also may be completely ad-hoc and arbitrary; in particular in case of embedded subset fonts one often sees ad-hoc encodings generated by dealing out character codes from some starting value whenever one is needed; i.e. the first glyph in a given font used on a page is given the starting value as code, the next, different glyph is given the starting value plus one, the next, different one the starting value plus two, etc; "Hello World" and a starting value of 48 (ASCII value of '0') would result in "01223453627"; these fonts may contain a mapping to Unicode but they are not required to.
The next problem is to make sense out of the order of the strings:
the string drawing instructions may occur in an arbitrary order, e.g. "Hello" might be drawn "lo" first, then after moving back "el", then after again moving back "H"; to extract the text one cannot ignore text positioning instructions and simply concatenate text strings, you always have to take the current position into account (some simple text extractors ignore this and, therefore, can fail to return something sensible; see the sketch after this list);
multi-columnar text may present a difficulty, text may be drawn line by line, e.g. first the text of the top line of the first column, then the top line of the second column, then the second line of the first column, then the second line of the second column, etc.; there need not be any hints in the PDF that the text is multi-columnar.
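To illustrate how much the ordering matters in practice, here is a minimal sketch of position-aware extraction, assuming iText 5 and a hypothetical input.pdf (most extraction libraries offer something equivalent):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

PdfReader reader = new PdfReader("input.pdf");

// Follows the order of the instructions in the content stream; text drawn out of
// reading order comes out scrambled.
String naive = PdfTextExtractor.getTextFromPage(reader, 1, new SimpleTextExtractionStrategy());

// Sorts the text chunks by their position on the page before joining them, which
// usually restores a sensible reading order (but not always, e.g. for multi-columnar text).
String byLocation = PdfTextExtractor.getTextFromPage(reader, 1, new LocationTextExtractionStrategy());

reader.close();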
Another problem is to recognize formatting or styling artifacts:
spaces between words need not be created by drawing a space glyph, it may also be done by text position changing instructions; text extractors not trying to recognize gaps created by text positioning instructions may return a result without spaces; on the other hand the same technique can be used to draw adjacent glyphs at an optimal distance, aka kerning; text extractors trying to recognize gaps created by text positioning instructions may falsely return spaces where there should be none;
sometimes selected words are printed s p a c e d o u t for extra emphasis; in the extracted text these gaps might be presented as space characters which automatic postprocessing of the text may see as word separators;
usually for bold text one uses a different, bold font program; if that is not at hand, people sometimes get creative and emulate bold by printing the same text twice with a minute offset; with a slightly larger offset (or a different transformation) and a different color a shadow effect can be emulated; if the text extractor does not try to recognize this, you end up having some duplicate characters in the output.
More problems arise due to incomplete or wrong extra information:
ToUnicode maps of fonts (optional maps from character code to Unicode) may be incomplete or contain errors; there are, for example, many questions here on Stack Overflow dealing with incorrect ToUnicode maps for Indic scripts; the text extraction results reflect these errors;
there even are PDFs with contradictory information, e.g. with an error in the ToUnicode map but the correct information in an ActualText entry; this is used by some PDF creators to allow correct copy&paste from some programs (preferring an ActualText entry in such a situation) while injecting errors in the output of other programs (preferring ToUnicode information then).
Yet another problem arises if you expect the text extractor to extract only text eventually visible in the page:
text may be drawn outside the current clipping area or outside the visible page area; text extractors need to keep these in mind;
text may be drawn using the rendering mode "invisible"; text extractors have to keep an eye on the rendering mode;
text may be drawn using the same color as the background; to recognize this, a text extractor cannot merely look at the current instruction and a few graphics state details, it has to take into account anything drawn beforehand at the location of the text;
text may be drawn as a clip path; to recognize whether this text is visible in the end, a text extractor must keep track of what is drawn in the text area as long as the clip path is active;
text may be covered by something else later; a text extractor must drop recognized text in such a case; but depending on blend modes and transparency settings these coverings might or might not allow the text to shine through; thus, for a correct result the text extractor must for each glyph keep track of the color it's drawn with, the color of the backdrop, and what all those spiffy effects do with those colors later on; and of course, both glyph color and backdrop color can be interesting, e.g. some shading colors; and the color spaces involved may differ, requiring one to convert back and forth between color spaces; and so on.
Furthermore, text may be drawn where text extractors usually don't look:
some tools hide text from text extraction by putting it into a pattern and filling the page area with that pattern;
similarly there are type 3 fonts; each character in a type 3 font is represented by its own content stream; thus, a tool can draw all text in the content stream of a single type 3 font glyph and then draw that glyph on the page.
...
You surely have gotten an idea by now why text extraction results can be less than optimal. And be assured, the list above is not complete; there still are more complications for text extraction.

How do I make iText 7 diacritic mark stacking work correctly?

I have run into a problem with iText 7 where diacritic marks are painted on top of one another instead of stacking properly when multiple marks are used on a single character. Is there a setting that makes them appear correctly, or is this a bug in iText 7? Any help greatly appreciated. This can be observed if you create a text object in your PDF like below. Obviously, replace the relevant bit with an actual font object, rather than what I have in there.
new Text("ḗ and ṓ are characters that display incorrectly").setFont(<UNICODE COMPATIBLE FONT LIKE CHARIS>);
While Bruno and Benoit correctly pointed out that for advanced typography such as stacking diacritical marks you need the pdfCalligraph module, there is a workaround you can try at your own risk. If your combinations of base glyph and diacritics are real, meaning they occur in real texts in some languages or some other known contexts, then such combinations are most probably present in Unicode and have their own code points. For instance, in the text you provided, they are the Unicode characters U+1E17 and U+1E53. Some fonts contain such precomposed glyphs, so there is a second option besides drawing the base glyph and stacking diacritics on it: drawing the combined glyph. For example, ArialUni shipped with Windows does contain the above-mentioned glyphs.
To try this approach, you would need the following code for composing known Unicode base glyph + diacritics combinations into single glyphs:
import java.text.Normalizer;

String originalStr = "ḗ and ṓ are characters that display incorrectly";
// NFC composes base character + combining marks into single precomposed code points where possible.
String normalizedStr = Normalizer.normalize(originalStr, Normalizer.Form.NFC);
new Text(normalizedStr); // Use this normalized Text instance
The result that I got with ArialUni (screenshot omitted here) showed both precomposed characters rendering correctly.
But again as I mentioned do it at your own risk because it will only work if there are necessary combinations present both in Unicode and font. For correct rendering you still should use pdfCalligraph.
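For completeness, a minimal end-to-end sketch of this workaround with iText 7's layout API (the font file name arialuni.ttf and the output file name are just assumptions for illustration):

import java.text.Normalizer;

import com.itextpdf.io.font.PdfEncodings;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import com.itextpdf.layout.element.Text;

PdfDocument pdf = new PdfDocument(new PdfWriter("diacritics.pdf"));
Document document = new Document(pdf);

// Assumption: a font that actually contains the precomposed glyphs is available locally.
PdfFont font = PdfFontFactory.createFont("arialuni.ttf", PdfEncodings.IDENTITY_H);

String normalized = Normalizer.normalize("ḗ and ṓ are characters that display incorrectly", Normalizer.Form.NFC);
document.add(new Paragraph(new Text(normalized).setFont(font)));

document.close();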

Extract only the text from PDF files with CGPDFScanner

There are a number of questions (some answered and others not) about extracting simple text from PDF files. Stack Overflow has been helpful in pointing out that the Adobe PDF documentation is quite clear about detecting objects during parsing: i.e. one should use the 'BT' and 'ET' PDF reference operators to construct the callbacks when using CGPDFScanner.
The Apple documentation shows a callback example:
static void op_BT (CGPDFScannerRef s, void *info) {
    const char *name;
    if (!CGPDFScannerPopName(s, &name))
        return;
    printf("BT /%s\n", name);
}
And, among other CGPDFScanner calls, the above callback is set up by first creating:
myTable = CGPDFOperatorTableCreate();
CGPDFOperatorTableSetCallback (myTable, "BT", &op_BT);
All good so far, but the Apple documentation doesn't appear to help low-to-intermediate programmers like me understand the next step: beyond identifying the text block (presumably between the BT and ET callbacks?), what few steps/lines are needed during/in/outside the callback to capture the identified text block into an NSString?
Many thanks.
The first thing you should do is download the PDF reference. These days that's an ISO standard, but you can download the Acrobat SDK (http://www.adobe.com/devnet/acrobat.html) which contains an Adobe copy that will serve you just as well.
Read chapter 9. It'll teach you that on the one hand you need to understand text operators (Tj, ', ", TJ) and on the other hand you need to understand fonts and encodings.
The text operators are the operators that you can intercept that add "strings" to the PDF document; while all text operators must appear between BT and ET blocks, intercepting these BT and ET blocks by itself isn't going to do much for you I think.
Fonts are important because they will define how the bytes used by those operators correspond to actual (Unicode) characters. So if you want to derive the meaning of the bytes you get from the PDF file, you need to know how to use fonts to derive that meaning.
Some additional points:
Don't assume BT and ET correspond to an actual text block or paragraph as you may know it from an application such as InDesign or Word. One text block may contain a whole page or a single character (or nothing).
There are also text state operators that determine how the text is going to be shown on the page. There are ways for example to draw invisible text; you may or may not wish to extract that type of text. If you don't, you'll need to support enough text state operators that you can tell the difference.
Not a small task :)
Update after looking at sample PDF
Because in comments the question was refined to indicate text extraction of a specific type of PDF file, let me add a little additional information.
1) Looking at the PDF file you reference, you won't be able to skip the font/encoding problem. The fonts in the sample PDF file are subsetted which means that you don't have "cleartext" in the PDF page description but instead indexes that have to be mapped through the encoding of the fonts used to get meaningful text.
2) Extracting the text is possible, if you look at the following output from pdfToolbox (warning, I'm affiliated rather heavily with this tool):
<page id="33">
<words>
<word txt="Senator">
<parts>
<part tlh="28.3481" tlv="868.534" trh="55.4455" trv="868.534" blh="28.3481" blv="859.902" brh="55.4455" brv="859.902"></part>
</parts>
</word>
<word txt="House,">
<parts>
<part tlh="57.5305" tlv="868.534" trh="82.123" trv="868.534" blh="57.5305" blv="859.902" brh="82.123" brv="859.902"></part>
</parts>
</word>
<word txt="85">
<parts>
<part tlh="84.208" tlv="868.534" trh="92.548" trv="868.534" blh="84.208" blv="859.902" brh="92.548" brv="859.902"></part>
</parts>
</word>
There are undoubtedly other tools which can give a similar (or better) result, so extracting the text by itself should be doable.
The big problem is going to be finding the text you're interested in, in the right order. The extraction I used here gives the text of each "word" and its position (bounding box) on the page. When you look through the XML and get to the table, the challenge is going to be deciding which text belongs to which table cell, where rows and columns end, and so on.
In a way this problem is harder than simply detecting lines of text, because you're dealing with a pretty dense table: where my problem was largely one-dimensional (gathering everything on the same line), this problem is two-dimensional.
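As a rough illustration of that two-dimensional step (not tied to pdfToolbox or any other particular tool; the Word record and the tolerance value are assumptions made up for this sketch), one way to start is to group the extracted word boxes into rows by their vertical position and then sort each row from left to right:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RowGrouper {
    // A word as reported by the extractor: its text and the lower-left corner of its bounding box.
    record Word(String txt, double x, double y) {}

    static List<List<Word>> groupIntoRows(List<Word> words, double yTolerance) {
        List<Word> sorted = new ArrayList<>(words);
        // Top of the page first, then left to right.
        sorted.sort(Comparator.comparingDouble((Word w) -> -w.y()).thenComparingDouble(Word::x));

        List<List<Word>> rows = new ArrayList<>();
        for (Word w : sorted) {
            List<Word> lastRow = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (lastRow != null && Math.abs(lastRow.get(0).y() - w.y()) <= yTolerance) {
                lastRow.add(w);   // same baseline (within tolerance): same row
            } else {
                List<Word> row = new ArrayList<>();
                row.add(w);
                rows.add(row);    // start a new row
            }
        }
        return rows;
    }
}

Columns would then have to be found by clustering the x positions inside the rows; as noted above, picking the right tolerances for a dense table is exactly where this gets hard.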

Do PDF files generally use "correct" character codes for font glyphs?

Say I have a PDF file that contains one or more embedded fonts. Here's my understanding of how a single character of text is rendered:
First, determine which font the character uses.
Use the font's "cmap," embedded in the PDF, to determine the font's glyph name for the given character. For example, the character '&' in PDF text might map to a glyph that the font internally calls 'ampersand'.
Use the font's "glyf" table to determine the bounding box / drawing instructions for the glyph name.
Here's my question: is a PDF cmap generally consistent? Put another way, if I encounter the character "&" in a PDF, can I be assured that the cmap will always map "&" to the ampersand glyph? Or does some PDF-generation software create its own arbitrary mapping between character codes and glyph names (which would be rather evil and possibly break in-PDF searching and text selection)?
Of course I realize it's possible for the cmap to use an unintuitive mapping -- I guess I'm asking, does this actually happen in the Real World?
My specific use-case is in the world of music fonts. I'm analyzing characters in a PDF to determine which music glyph each one represents (e.g., treble clef, notehead, etc.). I want to know how confident I can be that the combination of font name and character code will always result in the same glyph. For example, if I know the font name is "Opus" and the glyph is "#", can I assume that will always be mapped to the treble clef glyph? Or do I have to analyze the glyph's metrics to make sure it's actually a treble clef?
It differs from one PDF creator to another.
A fairly common method (alas!) is "order encountered", where the first character in the text stream gets mapped to \001, the next to \002 and so on. So the text "Hello" would be encoded as \001\002\003\003\004.
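A toy sketch of what such an "order encountered" encoder does (purely illustrative, not any particular producer's code):

import java.util.LinkedHashMap;
import java.util.Map;

// Assign each distinct character the next free code, in the order it is first encountered.
Map<Character, Integer> codes = new LinkedHashMap<>();
StringBuilder encoded = new StringBuilder();
for (char c : "Hello".toCharArray()) {
    int code = codes.computeIfAbsent(c, k -> codes.size() + 1);
    encoded.append(String.format("\\%03o", code)); // octal notation, like \001 \002 ...
}
System.out.println(encoded); // prints \001\002\003\003\004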
I want to know how confident I can be that the combination of font name and character code will always result in the same glyph.
In a single PDF document, if the same font object is used in different contexts, it will be true -- the mapping is defined inside the font object. If you encounter another font object that uses the same font but it points to another font stream (i.e., the font subset is embedded twice), then it may not be true. Each subset may have an encoding of its own.
Only if the font object contains a /ToUnicode mapping can you be confident that the values map to the correct characters.
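If you want to check that up front, a quick sketch using iText 5's low-level API (the file name input.pdf is just an example) is to inspect every font dictionary in a page's resources for a /ToUnicode entry:

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

PdfReader reader = new PdfReader("input.pdf");
PdfDictionary resources = reader.getPageResources(1);
PdfDictionary fonts = resources.getAsDict(PdfName.FONT);
if (fonts != null) {
    for (PdfName key : fonts.getKeys()) {
        PdfDictionary font = fonts.getAsDict(key);
        boolean hasToUnicode = font != null && font.get(PdfName.TOUNICODE) != null;
        System.out.println(key + " -> /ToUnicode present: " + hasToUnicode);
    }
}
reader.close();

Note that even a present /ToUnicode map can be wrong, as the text extraction answer above points out, so for the music-font use case it may still be safer to key on font name plus character code and verify against known glyph metrics.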

Altering an embedded TrueType font so it will be usable by Windows GDI

I am trying to render PDF content to a GDI device context (a 24bit bitmap to be exact). Parsing the PDF stream into PDF objects and rendering the PDF commands from the content dictionary works well, including font rendering.
Embedded fonts are decompressed from their FontFile streams and "loaded" using AddFontMemResourceEx. Now some embedded fonts omit some TrueType tables that are needed by GDI, like the 'name' table. Because of this, I tried to modify the font by parsing the TrueType subset font into its tables and modifying those tables that have data missing; missing tables are regenerated with information that is as correct as possible.
I use the Microsoft Font Validator tool to see how "correct" the generated font is. I still get a few errors, e.g. for the 'maxp' table the max values are usually too large (it is a subset), or the xAvgCharWidth field in the 'OS/2' table does not equal the calculated value, but errors like these do not stop other embedded fonts from being usable. The fonts embedded using PDFCreator are the ones that are problematic.
Questions:
How can I determine what I need to change in the font file in order for GDI to be able to use it?
Are there any other font validation tools that might give me insight into what is still wrong with the font file?
If needed: I can make an original fontfile and an altered fontfile available for download somewhere.
What modifications are made so far:
Make sure there is a 'head', 'hhea', 'maxp' and 'OS/2' section.
If we have a symbol font, clear out Panose and Unicode fields in the 'OS/2' section
Fill in correct values for WinAscent/Desc and TypoAsc/Desc if they were zero.
Fill in acceptable values for super/subscript/underline positions and sizes.
Scan all glyphs that are left and fill in X/Y min/max values in 'head'.
Rebuild the name section with information from the PDF file it came from.
Almost a year late, but I found the answer:
The font kind (symbol or not) should be the same for the 'cmap' table and the 'name' table. So if the cmap has a 3,0,4 cmap (MS, symbol, segment delta coding) and the name table contains 3,1,$0409 (MS, Unicode, enUS) entries, the font will not load.
It looks like the presence of a 'symbol cmap' determines if the font is considered a symbol font by Windows; flags in 'OS/2' don't seem to matter.
So if a font seems correct using 'Microsoft Font Validator', check if the symbol/unicode fields line up in the 'cmap' and 'name' tables.
With AddMemoryFont in GDI+ you can check its Status for any errors in the memory font, like NotTrueTypeFont for example.
One option for GDI may be to try to load the embedded font into a document/form yourself with TTLoadEmbeddedFont and then check the errors it returns. The only functions that supply more information than this one are CreateFontPackage/MergeFontPackage and their error codes, but I fail to see how they can be used in your situation.
Barring all this, have you had a chance to review PDFCreator's source code (assuming you're using the open source one as opposed to the commercial one)?