Parsing PDF with PDFBox shows unknown characters

When I try to parse a PDF file with PDFBox in Java, and the file was generated with cups-pdf, I get junk characters. Parsing works perfectly with common PDFs. I checked the fonts: the cups-pdf file uses FreeMono_00.ttf (but I didn't see such a font anywhere), while a working PDF shows ArialMT.
Is there anything I should do differently when parsing PDFs generated using cups-pdf?
Below is the code I'm using for parsing.
import java.io.FileInputStream;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // org.apache.pdfbox.text.PDFTextStripper in PDFBox 2.x

PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
String parsedText = pdfStripper.getText(pdDoc);
The output comes out like this:
)LOH1DPHDVGW[W
6XEMHFWVXEMHFWVVDPSOH
0HVVDJHVHQGLQJGHWDLOVDORQJZLWKSULQWILOH
8VHU1DPH$EGXOUD]DN30
8VHU,'D#DFRP
Copying and pasting from the PDF gives the same kind of result.

I'm only repeating what I've read; I'm inexperienced here. If there were more mavens answering PDF/PDFBox questions, I'd wait for one of them to answer.
I believe the font either doesn't contain Unicode tables at all, or has been embedded in the document without Unicode tables. If the text seems to be a simple substitution cipher for a single given document, that would tend to confirm this.
If the font is embedded, sometimes only a subset of the glyphs actually used is embedded. That is likely here, since the font is not installed on the system (as you said) and the original FreeMono font is large: over 4,000 glyphs. In that case, I fear that the correspondence between character code and glyph may be document-dependent, but I'm speculating.
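For what it's worth, the sample output above does look like a simple fixed-offset substitution: adding 0x1D (29) to every character code turns )LOH1DPH into FileName. A throwaway check along those lines (purely illustrative for this one document; the offset is an observation from the sample, not anything cups-pdf guarantees):

public class ShiftDecode {
    public static void main(String[] args) {
        String[] junk = { ")LOH1DPHDVGW[W", "6XEMHFWVXEMHFWVVDPSOH" };
        for (String line : junk) {
            StringBuilder sb = new StringBuilder();
            for (char c : line.toCharArray()) {
                sb.append((char) (c + 0x1D)); // offset observed in this sample
            }
            System.out.println(sb); // prints "FileNameasdtxt", "Subjectsubjectssample"
        }
    }
}

That a constant shift recovers readable text (spaces and punctuation were already lost in the extraction) strongly suggests the embedded subset maps character codes to glyphs without any Unicode information, exactly as described above.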

Nbspace not available

I am using pdfbox 2.0.9
I have a PDF containing only an AcroForm, and I want to set an nbspace character as a field value:
field.setValue("\u00A0");
But I get this error:
java.lang.IllegalArgumentException: U+00A0 ('nbspace') is not available in this font Courier encoding: WinAnsiEncoding
I understand that the font on the current field does not support this character.
How can I, with PDFBox 2.0.14, get the list of fonts available in my PDF?
This topic might be related: How to print `Non-breaking space` to a pdf using apache pdf box?
The text fields in your PDF use the font Helv.
The AcroForm resources define the font Helv with the following encoding:
5 0 obj
<<
/Type/Encoding
/Differences[
24/breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde
39/quotesingle
96/grave
128/bullet/dagger/daggerdbl/ellipsis/emdash/endash/florin/fraction
/guilsinglleft/guilsinglright/minus/perthousand/quotedblbase/quotedblleft
/quotedblright/quoteleft/quoteright/quotesinglbase/trademark/fi/fl/Lslash
/OE/Scaron/Ydieresis/Zcaron/dotlessi/lslash/oe/scaron/zcaron
160/Euro
164/currency
166/brokenbar
168/dieresis/copyright/ordfeminine
172/logicalnot/.notdef/registered/macron/degree/plusminus/twosuperior
/threesuperior/acute/mu
183/periodcentered/cedilla/onesuperior/ordmasculine
188/onequarter/onehalf/threequarters
192/Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla
/Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex
/Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis
/multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute/Thorn
/germandbls/agrave/aacute/acircumflex/atilde/adieresis/aring/ae
/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave/iacute
/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex/otilde
/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis/yacute
/thorn/ydieresis
]
>>
endobj
As there is no font program embedded for this font, this encoding is based on StandardEncoding. That base encoding does not contain a non-breaking space, and your Differences array does not add nbspace either.
Thus, you cannot draw a non-breaking space using that encoding, and therefore not using that Helv font either.
As far as I know, PDFBox does not supply replacement fonts in such a case, i.e. when asked to create a new text field appearance while setting a value which contains a character not supported by the form field's default appearance font encoding.
One work-around might be to not ask PDFBox to generate an appearance at all: instead mark the AcroForm with a NeedAppearances value of true and hope that a later PDF processor or viewer supplies a replacement font. There is no guarantee this works; the next processor needing appearances probably doesn't supply replacement fonts either. Nonetheless, there is at least a chance it does; see the sketch after the code below.
Depending on the exact version of PDFBox, though,
field.setValue(value);
may always trigger appearance generation. If that is the case for you, you have to set the field value like this:
field.getCOSObject().setString(COSName.V, value);
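Put together, the work-around might look like this (a minimal sketch assuming PDFBox 2.x; form.pdf and the field name myField are placeholders):

import java.io.File;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class NbspaceWorkaround {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("form.pdf"))) {
            PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
            // ask later viewers to build the field appearances themselves
            acroForm.setNeedAppearances(true);
            PDField field = acroForm.getField("myField");
            // set /V directly so PDFBox does not try (and fail) to build
            // an appearance containing the unsupported nbspace character
            field.getCOSObject().setString(COSName.V, "\u00A0");
            doc.save(new File("form-out.pdf"));
        }
    }
}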

unable to render Č character in pdf [duplicate]

I have a problem when adding characters such as "Č" or "Ć" while generating a PDF. I'm mostly using paragraphs for inserting some static text into my PDF report. Here is some sample code I used:
var document = new Document();
// writer added so the sample actually produces a file
PdfWriter.GetInstance(document, new FileStream("test.pdf", FileMode.Create));
document.Open();
Paragraph p1 = new Paragraph("Testing of letters Č,Ć,Š,Ž,Đ", new Font(Font.FontFamily.HELVETICA, 10));
document.Add(p1);
document.Close();
The output I get when the PDF file is generated looks like this: "Testing of letters ,,Š,Ž,Đ"
For some reason iTextSharp doesn't seem to recognize letters such as "Č" and "Ć".
THE PROBLEM:
First of all, you don't seem to be talking about Cyrillic characters, but about central and eastern European languages that use Latin script. Take a look at the difference between code page 1250 and code page 1251 to understand what I mean. [NOTE: I have updated the question so that it talks about Czech characters instead of Cyrillic.]
Second observation. You are writing code that contains special characters:
"Testing of letters Č,Ć,Š,Ž,Đ"
That is a bad practice. Code files are stored as plain text and can be saved using different encodings. An accidental switch from encoding (for instance: by uploading it to a versioning system that uses a different encoding), can seriously damage the content of your file.
You should write code that doesn't contain special characters, but that uses a different notation. For instance:
"Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110"
This will also make sure that the content doesn't get altered when compiling the code using a compiler that expects a different encoding.
Your third mistake is that you assume that Helvetica is a font that knows how to draw these glyphs. That is a false assumption. You should use a font file such as Arial.ttf (or pick any other font that knows how to draw those glyphs).
Your fourth mistake is that you do not embed the font. Suppose that you use a font you have on your local machine and that is able to draw the special glyphs, then you will be able to read the text on your local machine. However, somebody who receives your file, but doesn't have the font you used on his local machine may not be able to read the document correctly.
Your fifth mistake is that you didn't define an encoding when using the font (this is related to your second mistake, but it's different).
THE SOLUTION:
I have written a small example called CzechExample that results in the following PDF: czech.pdf
I have added the same text twice, but using a different encoding:
public static final String FONT = "resources/fonts/FreeSans.ttf";

public void createPdf(String dest) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(dest));
    document.open();
    Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
    Paragraph p1 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f1);
    document.add(p1);
    Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
    Paragraph p2 = new Paragraph("Testing of letters \u010c,\u0106,\u0160,\u017d,\u0110", f2);
    document.add(p2);
    document.close();
}
To avoid your third mistake, I used the font FreeSans.ttf instead of Helvetica. You can choose any other font as long as it supports the characters you want to use. To avoid your fourth mistake, I have set the embedded parameter to true.
As for your fifth mistake, I introduced two different approaches.
In the first case, I told iText to use code page 1250.
Font f1 = FontFactory.getFont(FONT, "Cp1250", true);
This will embed the font as a simple font into the PDF, meaning that each character in your String will be represented using a single byte. The advantage of this approach is simplicity; the disadvantage is that you shouldn't start mixing code pages. For instance: this won't work for Cyrillic glyphs.
In the second case, I told iText to use Unicode for horizontal writing:
Font f2 = FontFactory.getFont(FONT, BaseFont.IDENTITY_H, true);
This will embed the font as a composite font into the PDF, meaning that each character in your String will be represented using more than one byte. The advantage of this approach is that it is the recommended approach in the newer PDF standards (e.g. PDF/A, PDF/UA), and that you can mix Cyrillic with Latin, Chinese with Japanese, etc... The disadvantage is that you create more bytes, but that effect is limited by the fact that content streams are compressed anyway.
When I decompress the content stream for the text in the sample PDF, I can see that single bytes are used to store the text of the first line, while double bytes are used to store the text of the second line.
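Reconstructed from the Cp1250 code page (illustrative only, not copied from the actual file; the two-byte glyph IDs of the second line depend on the particular font subset), the first line would be stored roughly as:

(Testing of letters \310,\306,\212,\216,\320) Tj    % one byte per character (Cp1250 codes)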
You may be surprised that these characters look OK on the outside (when looking at the text in Adobe Reader) but don't correspond with what you see on the inside (in the raw content stream), but that's how it works.
CONCLUSION:
Many people think that creating PDF is trivial, and that tools for creating PDF should be a commodity. In reality, it's not always that simple ;-)
If you are using a FontProvider (for instance with pdfHTML), I managed to solve the display of the special characters by setting the registerShippedFreeFonts constructor parameter to true:
FontProvider dfp = new DefaultFontProvider(true, true, false);
See also: https://itextpdf.com/en/resources/books/itext-7-converting-html-pdf-pdfhtml/chapter-6-using-fonts-pdfhtml

PDF text extraction issue - font/capitalization inconsistencies

I am trying to extract text from a PDF book and keep running into an issue where sections of copied text fail to retain proper capitalization when pasted into a text document. I have the rights to reproduce the book and also have a license to use all the necessary fonts. At first I thought the issue was caused by the fonts not being embedded, but I checked and all fonts appear to be subset-embedded. The PDF uses over 100 fonts, each with one of the following properties:
TrueType Encoding: Ansi
TrueType (CID) Encoding: Identity-H
Type 1 (CID) Encoding: Identity-H
Type 1 Encoding: Custom
The languages within the book include English, German, Spanish and Italian. In German, capitalization is absolutely critical. The text tends to lose uppercase properties more often than lowercase ones.
An example of the error would be: WELD -> weld
I am really at a loss as to what to do here. I have requested that the owner of the book embed the fonts, which he has done as subsets, but the problem continues. I have tried saving the PDF file as PostScript and then running it through Distiller, which corrected much of the problem but in some cases resulted in text being replaced with different characters, or in numbers showing up as skulls. I understand that CID fonts might be contributing to the issue, but I have come across instances where a non-CID font had the same result.
What could be causing this issue? Is it that the fonts are subset- rather than fully embedded? Is there a better way to save the native file (InDesign) to a PDF that allows for better font extraction? Does it have to do with non-Unicode fonts, and if so, is there an alternative that does not require the owner to select different fonts?
Any and all assistance is greatly appreciated.
That's indeed funny. The sample PDF provided by the OP visibly contains uppercase characters, some of them in uppercase-only lines, some in mixed-case lines, which Adobe Reader extracts as lowercase characters.
You wonder
What could be causing this issue?
As an example of how that happens, let's look at Pelle Più bella.
In the page content that phrase actually looks like the visual representation in capital letters:
/T1_0 1 Tf
-0.025 Tc 12 0 0 12 379.5354 554.8809 Tm
(PELLE PI\331 BELLA)Tj
Looking at the font used, T1_0 (a DIN-Bold subset), we see that it claims to use WinAnsiEncoding, which would also indicate that those character codes in the page stream are to be interpreted as capital letters.
But the font also has a ToUnicode mapping, and this mapping maps
<41> <0061> — 'A' → a
<42> <0062> — 'B' → b
<43> <0043> — 'C' → C
<44> <0044> — 'D' → D
<45> <0065> — 'E' → e
<49> <0069> — 'I' → i
<4C> <006C> — 'L' → l
<4D> <004D> — 'M' → M
<4E> <006E> — 'N' → n
<50> <0050> — 'P' → P
<52> <0072> — 'R' → r
<53> <0053> — 'S' → S
<54> <0074> — 'T' → t
<D9> <00F9> — 'Ù' → ù
(I only extracted the mappings from character codes which in WinAnsiEncoding represent capital letters.)
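Run the string (PELLE PI\331 BELLA) through this map code by code and you get exactly Pelle Più bella: P→P, E→e, L→l, L→l, E→e, and so on. Any text extractor that honours the ToUnicode map will therefore output mixed case even though the glyphs drawn on the page are all capitals.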
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Sorry, I'm not really into InDesign. But that software being from Adobe, I would be surprised if this was a bug in InDesign or its PDF export. Could it instead be that there is some information in the InDesign file which tags PELLE PIÙ BELLA as Pelle Più bella, which InDesign then translates into this ToUnicode mapping during PDF export?
Does it have to do with non-unicode fonts and if so is there an alternative that does not require the owner to select different fonts?
In your sample document there are three fonts, all of them with an Encoding entry of WinAnsiEncoding and all of them embedded subsets, but only two, DIN-Medium and DIN-Bold, have such funny ToUnicode mappings, while Helvetica has no ToUnicode mapping at all. So it somehow is font-related; how exactly, I cannot say.
A workaround in case of your sample document would be to remove the ToUnicode mapping from the font dictionaries.
For example, using Java and the iText library, you can do that like this:
PdfReader reader = new PdfReader(INPUT);
for (int i = 1; i <= reader.getXrefSize(); i++)
{
    PdfObject obj = reader.getPdfObject(i);
    if (obj != null && obj.isDictionary())
    {
        PdfDictionary dic = (PdfDictionary) obj;
        if (PdfName.FONT.equals(dic.getAsName(PdfName.TYPE)))
        {
            // strip the (faulty) ToUnicode map from every font dictionary
            dic.remove(PdfName.TOUNICODE);
        }
    }
}
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(OUTPUT));
stamper.close();
reader.close();
After this manipulation Adobe Reader text extraction results in
PELLE PIÙ BELLA
This obviously only works in situations like the one in your sample document.
If in your other documents there is a mixture of fonts some of which require their respective ToUnicode map for text extraction while others are like the trouble fonts above, you might want to add some extra conditions to the Java code to only remove the map in the buggy font definitions.
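For example (a hypothetical filter replacing the unconditional remove inside the loop above; testing the BaseFont name for "DIN" merely illustrates the kind of condition one could use, based on the trouble fonts in this sample being DIN subsets):

PdfName baseFont = dic.getAsName(PdfName.BASEFONT);
// only strip the map from fonts known to carry the bogus mappings
if (baseFont != null && baseFont.toString().contains("DIN"))
{
    dic.remove(PdfName.TOUNICODE);
}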
No need to jump through PDF hoops. It isn't even a good text interchange format to begin with.
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Ask the file provider to make an RTF export. This will retain all used fonts and formatting.
Your WELD → weld problem might be caused by the font (if it contains both upper- and lowercase letters mapped to the same glyphs), by use of an OpenType feature such as All Capitals, or even by something like a badly created text-only stream inside the PDF.

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered about; Acrobat does wonders rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) is embedded and contains a single glyph. So far so good. The character is referenced by char code #1. The font's Encoding contains a Differences part: [ 1 /A ]; thus char code 1 maps to the character /A. Now, in the embedded subset font there is a cmap that matches no glyph at character 65 (i.e. capital A); instead, the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake and is the PDF correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding entry is not allowed and you should ignore it, always using a simple one-to-one encoding instead. So, all in all: if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
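A sketch of that check (assuming an iText 5-style API, as in the other answers on this page; fontDict stands for the font dictionary you are currently processing):

PdfDictionary descriptor = fontDict.getAsDict(PdfName.FONTDESCRIPTOR);
if (descriptor != null && descriptor.getAsNumber(PdfName.FLAGS) != null)
{
    int flags = descriptor.getAsNumber(PdfName.FLAGS).intValue();
    boolean symbolic = (flags & 4) != 0; // bit 3 (value 4) = Symbolic
    if (symbolic)
    {
        // ignore /Encoding and its /Differences entirely here and map
        // character codes straight through the font's built-in cmap
    }
}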
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact, it opens and renders correctly in a wide range of PDF consumers, so it really is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case you simply need to use the character code (0x01) directly in the cmap subtable. This will give you a GID of 36.
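For reference, the usual lookup heuristic for symbolic TrueType fonts (per the PDF specification's description of the (3,0) Microsoft Symbol cmap; CmapSubtable below is a hypothetical stand-in for whatever cmap API you use):

// Minimal sketch of the symbolic-font glyph lookup described above.
interface CmapSubtable {
    int glyphIdFor(int code); // returns 0 if the code is unmapped
}

static int resolveSymbolicGlyph(CmapSubtable cmap30, int code) {
    int gid = cmap30.glyphIdFor(code);          // try the raw character code
    if (gid == 0) {
        gid = cmap30.glyphIdFor(0xF000 | code); // then the F0xx symbol range
    }
    return gid;
}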