I have a generated PDF file with Japanese text. I use Arial Unicode MS as the font. Some letters display correctly, but for others I see gibberish like this: 〠ぇ》❥ (hope you can see it).
How do I get it to display Japanese characters instead?
Looks like Arial Unicode MS has two different versions, 0.84 and 1.01; the latter has better support for Japanese.
I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools and none of them worked: Python packages, PDFBox, the Adobe API, and many others. All of them failed to extract the text correctly; they either read the text left-to-right or decode it incorrectly.
Here are two samples from different tools.
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
Original text:
And yes, I can copy it and get the same rendered text.
Are there any tools that can extract Arabic text correctly?
The book link can be found here.
The text in a PDF is not the same as the text used for its construction. We can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is the language support in fonts: in Notepad I had to accept a script font to see a similarity, but that relies on font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
will, at best, look much like your samples above.
Thus, in summary, your Sample 1 is as good as, if not better than, any other simple attempt.
Later edit, from B.A.'s comment below:
I found a way to work around this: after extracting the text, I open the txt file and normalize its content using Python's unicodedata module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function available.
Check here for more info about normalization: https://unicode.org/reports/tr15/
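As a small illustration (a sketch, not part of the original answer), the same normalization can be done in Java with java.text.Normalizer; the input string below is the first line of sample 2 above:

import java.text.Normalizer;

public class NormalizeArabic {
    public static void main(String[] args) {
        // First line of sample 2: Arabic presentation forms
        // (U+FE70..U+FEFF) as extracted from the PDF, in visual order.
        String extracted = "ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ";

        // NFKC folds the presentation forms back to the base Arabic
        // letters (U+0600..U+06FF).
        String normalized = Normalizer.normalize(extracted, Normalizer.Form.NFKC);
        System.out.println(normalized);

        // Note: normalization fixes the code points, not the order;
        // text extracted in visual order may still need reordering.
    }
}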
So I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly, but when I try to extract the Hindi text, some characters are replaced by circles/squares.
I copied the Hindi text snippet directly from the PDF file to a Word document and I get the same squares for some characters.
PDFBox Version: 2.0.7
PDF Version: 1.6 (Acrobat 7.x)
Security details (PDF) and font details: (not reproduced here)
I cannot attach the PDF, but here is a snippet of the PDF file (from Adobe Acrobat Reader).
Note: I have drawn the black bar over part of it, as it contains someone's address.
Output of text extracted using PDFBox:
पता: कालकाजी, दि ण िद ी, िद ी - 110019
As you can see from the output of the PDFBox text extraction above, some of the characters are replaced by circles. The same happens when I manually copy from the PDF to a Word document.
I have also tried Tesseract OCR, but that gives even worse output. I would like to know of any other options I can try.
For instance, extracting the data using PDFBox not as text but as an image?
EDIT: I am also getting the following warning:
03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari
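A sketch of the render-to-image idea from the question, using PDFBox's PDFRenderer (available in 2.x; the file names here are placeholders); the resulting bitmap could then be handed to an OCR engine:

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class RenderPage {
    public static void main(String[] args) throws Exception {
        // Load the PDF (placeholder file name).
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(doc);
            // Render the first page at 300 DPI, a common choice for OCR input.
            BufferedImage image = renderer.renderImageWithDPI(0, 300);
            ImageIO.write(image, "png", new File("page-1.png"));
        }
    }
}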
I am having issues with previewing a PDF in Gmail. It doesn't recognize some of the international characters that I am using (it doesn't show letters like ą, ć, ś, but it does show, for example, ł). I am encoding the PDF with Cp1250.
Any ideas on what's going on?
It looks like you are using the Standard 14 Fonts and don't embed them into your PDF. PDF readers are required to bring along these fonts, but only with a limited character set, one that does not include ą, ć, or ś but does include ł, which matches your observation:
(it doesn't show letters like ą ć ś, but it shows for example ł)
For details on these fonts, consult the PDF specification:
9.6.2.2 Standard Type 1 Fonts (Standard 14 Fonts)
The PostScript names of 14 Type 1 fonts, known as the standard 14 fonts, are as follows: Times-Roman, Helvetica, Courier, Symbol, Times-Bold, Helvetica-Bold, Courier-Bold, ZapfDingbats, Times-Italic, Helvetica-Oblique, Courier-Oblique, Times-BoldItalic, Helvetica-BoldOblique, Courier-BoldOblique
These fonts, or their font metrics and suitable substitution fonts, shall be available to the conforming reader.
NOTE The character sets and encodings for these fonts are listed in Annex D. The font metrics files for the standard 14 fonts are available from the ASN Web site (see the Bibliography). For more information on font metrics, see Adobe Technical Note #5004, Adobe Font Metrics File Format Specification.
In Annex D you'll find ł but not ą, ć, or ś.
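The question doesn't say which library produces the PDF, so the following is an illustrative sketch only: with PDFBox 2.x, embedding a Unicode-capable TrueType font (DejaVuSans.ttf here is a placeholder) avoids the Standard 14 limitation entirely:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class EmbedUnicodeFont {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);

            // Embed a TrueType font that actually contains the Polish
            // glyphs, instead of relying on a non-embedded standard font.
            PDFont font = PDType0Font.load(doc, new File("DejaVuSans.ttf"));

            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(font, 12);
                cs.newLineAtOffset(50, 700);
                cs.showText("ą ć ś ł"); // all four render once the font is embedded
                cs.endText();
            }
            doc.save("polish.pdf");
        }
    }
}

Since the font travels inside the file, the preview no longer depends on what fonts Gmail's renderer happens to ship.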
I recently discovered an issue with IE10. We have a web page that displays English text beside a translation in Japanese. Some of the Japanese characters display as squares. In the view-source page all characters are correctly rendered. The database also has the characters correctly rendered. The unusual part is that when I highlight the characters with the cursor, they convert to the correct characters.
I believe IE10 has a bug.
Is anyone having a similar issue, or does anyone know of a fix? I have checked all language settings, regional settings, and browser font settings, and run many other tests. Nothing corrects this issue.
This issue was related to a combining character sequence, which some fonts and Windows applications support and some do not.
Such text uses two code points (a base character plus a combining mark) to represent a single character. Some fonts support this and some do not.
In this case, the characters at issue were the following:
ジ
シ and ゙
The latter two are characters that, combined, are intended to represent ジ.
The Unicode Standard's character tables define them like so:
Decimal   Character   Hex    Name
12472     ジ          30B8   KATAKANA LETTER ZI
12471     シ          30B7   KATAKANA LETTER SI
12441     っ゙          3099   COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (shown combined with small tsu っ)
So some fonts render 12471 + 12441 as 12472. This is what I found: the actual string contains 12471 + 12441 (0x30B7, 0x3099) rather than the single character 12472 (0x30B8).
Any time the font being used does not support this combination, a box is displayed. The challenge is that something as simple as someone creating a birthday card with a font that lacks support for the combining sequence can cause a PC to fail to display the character correctly.
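If you control the text being served, one fix is to normalize such strings to NFC, which composes the base letter and the combining mark into the single precomposed character that most fonts can display. A minimal Java sketch:

import java.text.Normalizer;

public class ComposeKana {
    public static void main(String[] args) {
        // U+30B7 (KATAKANA LETTER SI) + U+3099 (COMBINING KATAKANA-HIRAGANA
        // VOICED SOUND MARK): the two-code-point sequence described above.
        String decomposed = "\u30B7\u3099";

        // NFC composes the pair into the single character U+30B8
        // (KATAKANA LETTER ZI).
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.printf("U+%04X%n", (int) composed.charAt(0)); // U+30B8
    }
}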
Can anybody help me with a letter issue in PDFBox? I'm trying to print the letter "ń" (a Polish letter) and I'm getting something like þÿ J . Dı B R O W 2S0 :K0 3I.
Please help!
I came upon the same problem with Bulgarian. In short, I think there isn't an easy solution. Basically you need a Unicode font. If you use one of the standard 14 Type 1 fonts (like Helvetica or Courier), they only support the basic Latin alphabet, so they can't do the job. You could load a TrueType Unicode font, but PDFBox has a hardcoded WinAnsiEncoding for all TrueType (and Type 1) fonts, which is wrong. You could do what OpenOffice does, as far as I can see: create a subset of the font so that you don't embed the whole font file in the PDF. Unfortunately this functionality is missing in PDFBox, but there is a JIRA issue with more information:
https://issues.apache.org/jira/browse/PDFBOX-922
If you find a good solution please share!
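For what it's worth, this answer predates PDFBox 2.x: PDFBOX-922 has since been resolved there, and PDType0Font can embed a subset of a TrueType font directly. A minimal sketch, with the font path as a placeholder:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class SubsetExample {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            // In PDFBox 2.x, load() subsets the font by default, so only
            // the glyphs actually used end up embedded in the PDF; this is
            // the OpenOffice-style subsetting described as missing above.
            PDType0Font font = PDType0Font.load(doc, new File("DejaVuSans.ttf"));
            System.out.println("Embedded (subset) font: " + font.getName());
            // The font can then be used with PDPageContentStream.showText("ń").
        }
    }
}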
You can change it to Unicode characters in the LegacyPDFStreamEngine Java class.