Can anybody help me with a character issue in PDFBox? I'm trying to print the letter "ń" (a Polish letter) and I'm getting something like þÿ J . Dı B R O W 2S0 :K0 3I.
Please help!
I came upon the same problem with Bulgarian. In short, I think there isn't an easy solution. Basically you need a Unicode font. If you use one of the standard 14 Type1 fonts (like Helvetica or Courier), they only support the basic Latin alphabet, so they can't do the job. You could load a Unicode TrueType font, but PDFBox has a hardcoded WinAnsiEncoding for all TrueType (and Type1) fonts, which is wrong. You could do what OpenOffice does, as far as I can see: create a subset of a font so that you don't embed the whole font file in the PDF. Unfortunately this functionality is missing in PDFBox, but there is a Jira issue with more information:
https://issues.apache.org/jira/browse/PDFBOX-922
If you find a good solution please share!
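For what it's worth, later PDFBox versions (2.x and up) can load and subset a Unicode TrueType font directly, which sidesteps the hardcoded-encoding problem described above. A minimal sketch, assuming a DejaVuSans.ttf on disk (the font path and output file name are just examples):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class PolishText {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            // Load a TrueType font that contains the needed glyphs;
            // PDFBox 2.x embeds a subset of it automatically.
            PDType0Font font = PDType0Font.load(doc, new File("DejaVuSans.ttf"));
            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(font, 12);
                cs.newLineAtOffset(50, 700);
                cs.showText("Jabłoński"); // contains ł and ń
                cs.endText();
            }
            doc.save("polish.pdf");
        }
    }
}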
You can change it to Unicode characters in the LegacyPDFStreamEngine Java class.
I'm trying to extract text from Arabic PDFs (raw text extraction, not OCR).
I tried many packages and tools and none of them worked: Python packages, PDFBox, the Adobe API, and many other tools. All of them failed to extract the text correctly; either they read the text left-to-right or they decode it incorrectly.
Here are two samples from different tools.
Sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
Sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
Original text
And yes, I can copy it and get the same rendered text.
Are there any tools that can extract Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used for its construction; we can see that in your example, where the page number 7 is shown in Arabic on the surface but is coded as a plain 7 in the text.
However, a greater problem is language support in fonts: in Notepad I had to accept a script font to see a similarity, but that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will, at best, look like the samples shown above.
Thus, in summary, your Sample 1 is as good as, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text, I open the txt file and normalize its content using the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
Check here for more info about normalization:
https://unicode.org/reports/tr15/
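As a concrete illustration, Java's built-in java.text.Normalizer performs these normalizations; the Arabic presentation form below is just an example code point:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // U+FE8D ARABIC LETTER ALEF ISOLATED FORM, a compatibility
        // presentation form of the kind PDF extraction often yields.
        String extracted = "\uFE8D";
        String normalized = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        // NFKC folds it back to the base letter U+0627 ARABIC LETTER ALEF.
        System.out.printf("U+%04X%n", (int) normalized.charAt(0));
    }
}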
I am having issues with previewing a PDF in Gmail. It doesn't recognize some of the international characters that I am using (it doesn't show letters like ą, ć, ś, but it does show, for example, ł). I am encoding the PDF with Cp1250.
Any ideas on what's going on?
It looks like you are using the Standard 14 Fonts and don't embed them into your PDF. PDF readers are required to bring along these fonts, but only with a limited character set which does not include ą, ć, or ś but which does include ł. This matches your observation:
(it doesn't show letters like ą ć ś, but it shows for example ł)
For details on these fonts, consult the PDF specification:
9.6.2.2 Standard Type 1 Fonts (Standard 14 Fonts)
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
In Annex D you'll find ł but not ą, ć, or ś.
I am trying to extract text from a PDF book and continue to run into an issue where sections of copied text fail to retain their proper capitalization when pasted into a text document. I have rights to reproduce the book and also have a license to use all necessary fonts. At first I thought the issue was caused by the fonts not being embedded, but I checked, and all fonts appear to be subset-embedded. The PDF uses over 100 fonts, each of which has one of the following properties:
TrueType Encoding: Ansi
TrueType (CID) Encoding: Identity-H
Type 1 (CID) Encoding: Identity-H
Type 1 Encoding: Custom
The languages within the book include English, German, Spanish, and Italian. In German, capitalization is absolutely critical. The text tends to lose uppercase properties more than lowercase ones.
An example of the error would be: WELD -> weld
I am really at a loss at what to do here. I have asked the owner of the book to embed the fonts, which he has done as subsets, but the problem continues. I tried saving the PDF as PostScript and running it through Distiller, which corrected much of the problem, but in some cases this resulted in text being replaced with different characters, or numbers showing up as skulls. I understand that CID fonts might be contributing to the issue, but I have come across instances where a non-CID font had the same result.
What could be causing this issue? Is it that the fonts are subset rather than fully embedded? Is there a better way to save the native file (InDesign) to a PDF that will allow for better font extraction? Does it have to do with non-Unicode fonts, and if so, is there an alternative that does not require the owner to select different fonts?
Any and all assistance is greatly appreciated.
That's indeed funny. The sample PDF provided by the OP indeed visibly contains upper case characters, some of them in upper case only lines, some in mixed case lines, which by Adobe Reader are extracted as lower case characters.
You wonder
What could be causing this issue?
As an example of how that happens, let's look at Pelle Più bella.
In the page content that phrase actually looks like the visual representation in capital letters:
/T1_0 1 Tf
-0.025 Tc 12 0 0 12 379.5354 554.8809 Tm
(PELLE PI\331 BELLA)Tj
Looking at the used font T1_0 (a DIN-Bold subset), we see that it claims to use WinAnsiEncoding, which would also indicate an interpretation of those character codes in the page stream as capital letters.
But the font also has a ToUnicode mapping, and this mapping maps
<41> <0061> — 'A' → a
<42> <0062> — 'B' → b
<43> <0043> — 'C' → C
<44> <0044> — 'D' → D
<45> <0065> — 'E' → e
<49> <0069> — 'I' → i
<4C> <006C> — 'L' → l
<4D> <004D> — 'M' → M
<4E> <006E> — 'N' → n
<50> <0050> — 'P' → P
<52> <0072> — 'R' → r
<53> <0053> — 'S' → S
<54> <0074> — 'T' → t
<D9> <00F9> — 'Ù' → ù
(I only extracted the mappings from character codes which in WinAnsiEncoding represent capital letters.)
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Sorry, I'm not really into InDesign. But that software being from Adobe, I would be surprised if this were a bug in InDesign or its PDF export. Could it instead be that there is some information in the InDesign file which tags PELLE PIÙ BELLA as Pelle Più bella, and which InDesign then translates into this ToUnicode mapping during PDF export?
Does it have to do with non-unicode fonts and if so is there an alternative that does not require the owner to select different fonts?
In case of your sample document there are three fonts, all of them with an Encoding entry WinAnsiEncoding, all of them being an embedded subset, but only two have such funny ToUnicode mappings, DIN-Medium and DIN-Bold, while Helvetica has no ToUnicode mapping. So it somehow is font related. How exactly I cannot say.
A workaround in case of your sample document would be to remove the ToUnicode mapping from the font dictionaries.
For example using Java and the iText library you can do that like this:
import java.io.FileOutputStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

// INPUT and OUTPUT are the source and target file names.
PdfReader reader = new PdfReader(INPUT);
// Walk all indirect objects and strip the ToUnicode entry
// from every font dictionary.
for (int i = 1; i <= reader.getXrefSize(); i++)
{
    PdfObject obj = reader.getPdfObject(i);
    if (obj != null && obj.isDictionary())
    {
        PdfDictionary dic = (PdfDictionary) obj;
        if (PdfName.FONT.equals(dic.getAsName(PdfName.TYPE)))
        {
            dic.remove(PdfName.TOUNICODE);
        }
    }
}
// Write the manipulated PDF back to disk.
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(OUTPUT));
stamper.close();
reader.close();
After this manipulation Adobe Reader text extraction results in
PELLE PIÙ BELLA
This obviously only works in situations like the one in your sample document.
If in your other documents there is a mixture of fonts some of which require their respective ToUnicode map for text extraction while others are like the trouble fonts above, you might want to add some extra conditions to the Java code to only remove the map in the buggy font definitions.
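For instance, in your sample document the trouble fonts are the DIN subsets, so one conceivable extra condition (a heuristic for this particular document, not a general test; the helper name is mine) would be:

// Limit the removal in the loop above to the known-bad fonts.
static boolean isSuspectFont(PdfDictionary fontDic)
{
    PdfName baseFont = fontDic.getAsName(PdfName.BASEFONT);
    return baseFont != null && baseFont.toString().contains("DIN");
}

and then only call dic.remove(PdfName.TOUNICODE) when isSuspectFont(dic) returns true. A more robust condition would parse the ToUnicode stream itself and flag fonts that map uppercase codes to lowercase code points.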
No need to jump through PDF hoops. It isn't even a good text interchange format to begin with.
Is there a better way to save the native file (InDesign) to a pdf that will allow for better font extraction?
Ask the file provider to make an RTF export. This will retain all used fonts and formatting.
Your WELD-weld problem might be because of the font (if it contains both upper- and lowercase mapped to the same glyphs), use of an OpenType feature such as All Capitals, or even something like a badly created text-only stream inside the PDF.
While parsing a PDF file, my parser encountered a Tf operator whose font dictionary has its Subtype entry set to TrueType. The Encoding entry is not present, and the symbolic flag is set.
My question is: how am I supposed to map the character codes to characters with no encoding?
The PDF Reference, section 5.5.5 Character Encoding, states that a TrueType font has internal data represented as tables in the font file. It seems that those tables would help me map the character codes. Am I getting it right? How can I extract that information from the font file?
The font file extracted from the PDF gave something like:
I read Apple's documentation, The TrueType Font File, but I still don't get how to extract that information from those tables.
Any help, links or reading suggestion would be greatly appreciated.
The symbolic flag means that the encoding is set to the [0..255] range. Every character code must be in this range; the font presents glyphs only for these codes.
Here is a great set of resources regarding TrueType and OpenType font formats.
You can use the FreeType library function FT_Get_Char_Index to go from a character code to a glyph index. See FT_Get_Char_Index.
You'll have to dump the TrueType font to a file and load it with FreeType to get an FT_Face first.
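If you are working in Java rather than C, Apache FontBox (the parsing library underneath PDFBox) can do the same lookup. A rough sketch, assuming the font program has been dumped to symbolic.ttf (the file name is an example, and the parse signature may differ between FontBox versions):

import java.io.File;
import org.apache.fontbox.ttf.CmapSubtable;
import org.apache.fontbox.ttf.CmapTable;
import org.apache.fontbox.ttf.TTFParser;
import org.apache.fontbox.ttf.TrueTypeFont;

public class CmapLookup {
    public static void main(String[] args) throws Exception {
        TrueTypeFont ttf = new TTFParser().parse(new File("symbolic.ttf"));
        CmapTable cmapTable = ttf.getCmap();
        for (CmapSubtable sub : cmapTable.getCmaps()) {
            // For symbolic fonts the (3,0) Microsoft Symbol subtable is
            // the relevant one; codes map straight to glyph indices.
            if (sub.getPlatformId() == 3 && sub.getPlatformEncodingId() == 0) {
                int gid = sub.getGlyphId(0x01);
                // Some symbol cmaps expect codes offset into the
                // 0xF000-0xF0FF range; try that if the plain code fails.
                if (gid == 0) {
                    gid = sub.getGlyphId(0xF001);
                }
                System.out.println("glyph index for code 0x01: " + gid);
            }
        }
        ttf.close();
    }
}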
While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered at; Acrobat does wonders rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not fully adhere to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) contains an embedded font subset with a single glyph. So far so good. The character is referenced by char code #1. The font's Encoding contains a Differences part: [ 1 /A ]; thus char 1 -> character /A. Now, the cmap in the embedded subset font matches no glyph at character 65 (i.e. capital A); instead, the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: Is this correct (or did I make a mistake and is the PDF correct) and what would you do to detect this situation in order to solve the rendering problem.
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding entry is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So, all in all, if it is a symbolic font I ignore the Encoding object altogether, and this solves the problem.
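A minimal sketch of that rule (the helper name is mine; the mask comes from the Flags entry of the font descriptor, where bit 3, value 4, is the symbolic flag):

// Decide whether to ignore a TrueType font's /Encoding entry:
// symbolic fonts use their built-in cmap with a one-to-one
// code mapping instead.
static boolean useBuiltInCmap(int fontDescriptorFlags)
{
    final int SYMBOLIC = 1 << 2; // flag bit 3, value 4
    return (fontDescriptorFlags & SYMBOLIC) != 0;
}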
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact, it opens and renders correctly in a wide range of PDF consumers, so it really is correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.