Where can I get the Adobe-Identity-UCS CMap file?

I have a PDF file from which neither PDFBox nor iText 7 can extract text. The font is encoded with Identity-H and Adobe-Identity-UCS. The details of the ToUnicode CMap are given below.
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
The ToUnicode CMap is invalid. Is there any way to fix it?
I tried to download an intact Adobe-Identity-UCS CMap file to replace it with, but after a lot of Google searching I can't find one.
Any help? Thanks.
Edit:
Chinese-cidmap-broken.pdf

The ToUnicode CMap you show corresponds to the example ToUnicode CMap in the PDF specification ISO 32000 (either part), merely without any bfrange or bfchar section.
Thus, what you have essentially is a template into which one can put arbitrary mappings.
Concerning your question, therefore:
Is there any way to fix it?
Yes and no.
Yes, you can fix it by adding the appropriate bfrange or bfchar sections with the correct mappings.
BUT... to do so you need to know which codes map to which Unicode strings for the font at hand; the name Adobe-Identity-UCS by itself does not imply any particular mapping. So also:
No, not without additional information.
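For illustration only, here is a sketch of what the fixed CMap could look like once mappings are known; the two bfchar entries are hypothetical placeholders (CJK values, since your sample file name suggests Chinese text), not the real mappings for your font:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0001> <4E2D>
<0002> <6587>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end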
@Tilman, in his comment to your question, referenced one of his answers in which he showed how to add a missing ToUnicode CMap using information on the actual mappings gathered from different sources.

Related

PDF to Text for Urdu and Arabic using Ghostscript

I have a few PDF files which are in the Urdu language, and some of the PDF files are in the Arabic language.
I want to convert the PDF files to text format. I have issued the following Ghostscript code from the command line in my Windows 10 system:
gswin64c.exe -sDEVICE=txtwrite -o output.txt new.pdf
The text file is generated; however, the contents of the text file are not in the Urdu or Arabic language.
This is what it looks like (I have pasted a portion of the output, as it is huge):
ی첺جⰧ�� ہ셈ے
How can I properly convert PDF to text using Ghostscript?
Well basically the answer is that the PDF files you have supplied have 'not terribly good' ToUnicode CMap tables.
Looking at your first file we see that it uses one font:
26 0 obj
<<
/BaseFont /CCJSWK+JameelNooriNastaleeq
/DescendantFonts 28 0 R
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 29 0 R
/Type /Font
>>
endobj
That has a ToUnicode CMap in object 29; the ToUnicode CMap is meant to map character codes to Unicode code points. Looking at the first piece of text as an example we see:
/C2_0 1 Tf
13 0 0 13 39.1302 561.97 Tm
<0003>Tj
/Span<</ActualText<FEFF0645062A>>> BDC
<38560707>Tj
So that's character code 0x0003 (notice there is no marked content for the first character); looking at the ToUnicode CMap we see:
<0003> <0020>
So character code 0x0003 maps to Unicode code point U+0020, a space. The next two character codes are 3856 and 0707. Again consulting the ToUnicode CMap we see:
<3856> <062A0645>
So that single character code maps to two Unicode code points, U+062A and U+0645, which are 'Teh' ت and 'Meem' م.
So far so good. The next code is 0707; when we look that up in the ToUnicode CMap it comes up as 0xFFFD, which is the 'replacement character' �. Obviously that's meaningless.
We then have this:
0.391 0 Td
[<011C07071FEE>1 <0003>243.8 <2E93>]TJ
/Span<</ActualText<FEFF0644>>> BDC
<0707>Tj
EMC
So that's character codes 0x011C, 0x0707, 0x1FEE, 0x0003, 0x2E93 followed by 0x0707. Notice that the final <0707> is associated with a Marked Content definition which says the ActualText is Unicode 0x0644, which is the 'Lam' glyph ل
So clearly the ToUnicode CMap should be mapping 0707 to U+0644, and it doesn't.
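A corrected bfchar entry for that code would therefore simply read:
<0707> <0644>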
Now, when given a ToUnicode CMap, the text extraction code trusts it. So your problem with this file is that the ToUnicode CMap is 'wrong', and that's why the text is coming out incorrectly.
I haven't tried to debug further through the file, it is possible there are other errors.
Your second file has this ToUnicode CMap:
26 0 obj
<<
/Length 606
>>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AABACF+TT1+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /AABACF+TT1+0 def
/CMapType 2 def
1 begincodespacerange <0003> <0707> endcodespacerange
15 beginbfchar
<0003> <0020>
<0011> <002E>
<00e7> <062A>
<00ec> <062F>
<00ee> <0631>
<00f3> <0636>
<00f8> <0641>
<00fa> <0644>
<00fc> <0646>
<00fe> <0648>
<0119> <0647>
<011a> <064A>
<0134> <0066>
<013b> <006D>
<0707> <2423>
endbfchar
2 beginbfrange
<00e4> <00e5> <0627>
<011f> <0124> <0661>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
The first text in the file is:
<3718>Tj
And again, that's not in the CMap. Because the text extraction code prioritises the CMap (it's usually reliable), the missing entries cause the extraction to basically fail.
In addition to the fact that the ToUnicode CMaps are incorrect, the embedded fonts are subset and use an Identity-H CMap for drawing. That eliminates another source of information we could use.
Fundamentally, the only way you're going to get text out of that PDF file is manual transcription or OCR software.
Since you are using Ghostscript on Windows, the distributed binary includes Tesseract, so you could try using that with pdfwrite and an Urdu training file to produce a PDF file with a possibly better ToUnicode CMap. You could then extract the text from that PDF file.
You would have to tell the pdfwrite device not to use the embedded ToUnicode CMaps; see the UseOCR switch documented here: https://ghostscript.com/doc/9.56.1/VectorDevices.htm#PDF
Information on setting up the OCR engine and getting output is here: https://ghostscript.com/doc/9.56.1/Devices.htm#OCR-Devices
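As a sketch of what that could look like (assuming a Ghostscript build that includes Tesseract and an Urdu traineddata file that Tesseract can find; check the linked documentation for the exact requirements of your version):
gswin64c.exe -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage=urd -o ocr.pdf new.pdf
gswin64c.exe -sDEVICE=txtwrite -o output.txt ocr.pdf
The first command re-emits the PDF with OCR-derived ToUnicode information; the second extracts text from the result.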
You may get better results by using an 'image' OCR output and then using the text extraction on that file to get the text out.

asking for the unicode of letter conjunctions

I occasionally encounter some special characters while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look conjoined, and they actually exist in the PDF string as one character.
I checked the 'ToUnicode' CMap for these characters, but I found that the 'ToUnicode' CMap table is corrupted, therefore I cannot find their Unicode values.
For example, <012E> Tj will print fi as in the attached picture. But in the corresponding font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Could anybody let me know their Unicode code points? Is it possible to find them from the corresponding font program?
Thanks for any advice.
(attached images showing the rendered fi, tt, and ti ligatures)
First of all, what you call letter conjunctions is usually known as ligatures. Thus, I will use that term here from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to be attempting to find the correct ToUnicode mapping for ligature glyphs. For this, simply remember that the values of ToUnicode mappings do not need to be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi as in the attached picture. But in the corresponding font's ToUnicode CMap: <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
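Embedded in the font's ToUnicode CMap, following the bfchar syntax quoted above, that single repaired mapping would read:
1 beginbfchar
<012E> <00660069>
endbfchar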
If you want to use ligature code points nonetheless, consult the Wikipedia article on orthographic ligatures; it lists some ligature code points, in particular <FB01> for fi. So for your example:
<012E> <FB01>
But remember, their use is discouraged.

Without embedded fonts, is PDF limited to only 4,281 characters (of the AGL)? How to display more glyphs?

The Adobe Glyph List (AGL) is described as
a mapping of 4,281 glyph names to one or more Unicode characters.
From what I understand, those are PDF names like /Adieresis that allow one to specify the respective Unicode character U+00C4, and if my understanding is correct, those 4,281 names can be used to specify a mapping like the one done here for the font named /F1 in a page's /Resources dictionary:
<<
/Type /Page
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding <<
/Differences [ 1 /Adieresis /adieresis ]
>>
>>
>>
>>
The key issue, which I cannot wrap my head around, is that via the /Differences array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/built-in/standard set of PDF fonts, wouldn't I?
Basically, what I am asking is whether it is correct that displaying text containing any character not included in those 4,281 AGL characters would be impossible without embedding those glyphs into the produced PDF?
Also, I am confused that there is a /ToUnicode feature in PDF allowing one to associate glyphs/CMaps of embedded fonts with the Unicode characters those glyphs should represent (hence there was some thinking about "Unicode"), yet I cannot seem to find a way to use any reasonable Unicode code points or a half-way working encoding (i.e. UTF-8) to make use of the built-in fonts in PDF.
So is my assumption correct that, without going to the length of generating a font to embed within a PDF file, the text can only ever draw on the set of those 4,281 characters?
In order to support all 65,536 code points of Unicode's Basic Multilingual Plane, it would be required to generate a font containing the glyphs used in the text, since apart from those 4,281 AGL glyphs there seems to be no way to reference those Unicode characters, correct?
Motivation
It would be nice to have a way in PDF that would be the equivalent of HTML5's
<meta charset="utf-8">, allowing text to be encoded in one simple Unicode-compatible encoding, without having to deal with complicated things such as CIDs/GIDs/PostScript glyph names etc.
This answer first discusses the use of non-AGL names in differences arrays and the more encompassing encodings of composite fonts. Then it discusses which fonts a viewer actually does have to have available. Finally it considers all this in light of the clarifications accompanying your bounty offer.
AGL names and Differences arrays
First let's consider the focal point of your original question,
The key issue, which I cannot wrap my head around, is that via the /Differences array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/built-in/standard set of PDF fonts, wouldn't I?
Basically, what I am asking is whether it is correct that displaying text containing any character not included in those 4,281 AGL characters would be impossible without embedding those glyphs into the produced PDF?
i.e. your assumption is that only those 4,281 AGL glyph names can be used in the Differences array of the encoding entry of a simple font.
This is not the case; you can also use arbitrary names not found on the AGL. E.g. using this font
7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /Arial
/FirstChar 32
/LastChar 32
/Widths [500]
/FontDescriptor 8 0 R
/Encoding 9 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Arial
/FontFamily (Arial)
/Flags 32
/FontBBox [-665.0 -325.0 2000.0 1040.0]
/ItalicAngle 0
/Ascent 1040
/Descent -325
/CapHeight 716
/StemV 88
/XHeight 519
>>
endobj
9 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [32 /uniAB55]
>>
endobj
the instruction
( ) Tj
shows you a ꭕ ('LATIN SMALL LETTER CHI WITH LOW LEFT SERIF', U+AB55, which, if I saw correctly, is not on the AGL) on a system with Arial (ArialMT.ttf) installed.
Thus, to display an arbitrary glyph, you merely need a font known to you to contain that glyph under a known name, available to the PDF viewer in question. The name doesn't have to be an AGL name; it can be arbitrary!
Encodings of composite fonts
Furthermore, you often aren't even required to enumerate the characters you need, as long as they are covered by one of the predefined named encodings for composite fonts!
Here the Encoding shall be
The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts").
And among the predefined CMaps there are numerous CJK ones. As long as the viewer in question has access to a matching font, you can use a composite font with such an encoding to get access to a lot of CJK glyphs.
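As a sketch, such a composite font could look like the following; the object numbers, the STSong-Light base font, and the supplement number are illustrative choices, not requirements, and note that the referenced font descriptor would contain no FontFile entry, i.e. no embedded font program:
12 0 obj
<<
/Type /Font
/Subtype /Type0
/BaseFont /STSong-Light-UniGB-UTF16-H
/Encoding /UniGB-UTF16-H
/DescendantFonts [13 0 R]
>>
endobj
13 0 obj
<<
/Type /Font
/Subtype /CIDFontType0
/BaseFont /STSong-Light
/CIDSystemInfo << /Registry (Adobe) /Ordering (GB1) /Supplement 4 >>
/FontDescriptor 14 0 R
>>
endobj
Text shown with this font is encoded as UTF-16BE and selects glyphs from whatever matching Adobe-GB1 font the viewer has available.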
Which fonts does a viewer have to have available?
Thus, if the viewer in question has appropriate fonts available, you don't need to embed font programs to display any glyph. But which fonts does a viewer have available?
Usually a viewer will allow access to all fonts registered with the operating system it is running on, but strictly speaking it only has to have very few fonts available: PDF processors supporting PDF 1.0 to PDF 1.7 files only need to know the so-called standard 14 fonts, and pure PDF 2.0 processors need to know none at all.
Annex D of the specification clarifies the character ranges to support:
All characters listed in D.2, "Latin character set and encodings" shall be supported for the Times, Helvetica, and Courier font families, as listed in 9.6.2.2, "Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)" by a PDF processor that supports PDF 1.0 to 1.7.
D.4, "Symbol set and encoding" and D.5, "ZapfDingbats set and encoding" describe the character sets and built-in encodings for the Symbol and ZapfDingbats (ITC Zapf Dingbats) font programs, which belong to the standard 14 predefined fonts.
D.2 essentially is a table describing the StandardEncoding, MacRomanEncoding, WinAnsiEncoding, and PDFDocEncoding. These are all very similar single-byte encodings.
D.4 and D.5 contain a single table each, describing additional single-byte encodings.
Thus, all you can actually expect from a PDF 1.x viewer are these fewer than 1,000 characters!
(You wondered about this in comments to this answer to another question of yours.)
Concerning your clarifications
In your text accompanying your bounty offer you expressed a desire for
being enabled to create a "no frills" program that is able to generate PDF files, where the input data are UTF-8 Unicode strings. "No frills" being a reference to the fact that such software would ideally be able to skip handling font program data (such as creating a subset font program for inclusion into the PDF).
As explained above, you can do so, either by customized encodings of a number of simple fonts or by the more encompassing named encodings of composite fonts. If you know that the target PDF viewer has these fonts available, that is!
sketch a way that would actually allow characters from at least the Adobe-GB1 charset, as referenced via "UniCNS-UTF16-H", to be rendered in PDF viewers, while the PDF file does not have any font program embedded for achieving this.
"UniCNS−UTF16−H" just happens to be one of the predefined encodings allowable for composite fonts. Thus, you can use a composite font with this encoding without embedding the font program as long as the viewer has the appropriate fonts accessible. As far as Adobe Reader is concerned, this usually amounts to having the Extended Asian Language Pack installed.
the limitation of being unable to use anything other than WinAnsiEncoding, MacRomanEncoding, or MacExpertEncoding with those 14 standard fonts.
As explained above, you can only count on fewer than 1,000 glyphs being available for sure in an arbitrary PDF 1.x viewer. In a pure PDF 2.0 viewer you actually cannot count on even that!
The specification quotes above are from ISO 32000-2; similar requirements can already be found in ISO 32000-1.
Without embedded fonts, is PDF limited to only 4,281 characters (of the AGL)?
No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.
Basically, what I am asking is whether it is correct that displaying text containing any character not included in those 4,281 AGL characters would be impossible without embedding those glyphs into the produced PDF?
It is possible, yes, though you would ideally stick with a "standard" encoding, such as one of the Orderings. See "Predefined CMaps" in the PDF specification for these.
If you start making changes to the encoding, such as using Differences, then you are making run time font substitution for the PDF processing program much more difficult.
Regarding /ToUnicode: that is just for text extraction and has nothing to do with rendering. If you stick with a standard encoding as recommended above, it is not needed.
There is no 4,281-glyph limit inherent in PDF. I think you are a bit confused; you don't have to embed fonts in a PDF. Besides the standard 14 fonts that all PDF viewers should be able to handle, PDF software will look for fonts installed on the system when they are not embedded, so it's not as if you lose the ability to display glyphs at all just because no fonts are embedded.
You would define a different encoding with the Differences array if the base encoding doesn't reflect what is in the font.
ToUnicode comes into play for text extraction, as opposed to text showing.

PDF toUnicode cmap table restore

I have multiple PDF files without a 'ToUnicode' CMap table. The absence of this table prevents me from copying text from the PDF files.
As far as I know, there is a possibility to add a 'ToUnicode' mapping to a PDF file, but in my case adding static values is not an option; different files have different glyph codes.
So the question is the following: is there any possibility to restore the 'ToUnicode' CMap table, perhaps with the help of Ghostscript, or are there any options at all?
Thanks.
No, you cannot add ToUnicode CMaps to an existing PDF file using Ghostscript.
In the general case, you can't do it at all, except manually. As you note in the question, different files will be constructed to use different character code->Glyph mappings, which means that the character code to Unicode mapping will also be different.
Since the character code selection is often based on the order in which glyphs are used in a file (so the first glyph is character code 1, the second is character code 2, etc.), you can see that there is no prospect of identifying a 'one size fits all' solution.
You could use some kind of OCR to scan the rendered output, identify each glyph and find the Unicode code point for it. Then you could construct a CMap by identifying the character code for the glyph and mapping it to the Unicode value.
You could then add the ToUnicode CMap to the PDF file and update the font dictionary with the object number of the ToUnicode CMap.
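The serialisation step, at least, is mechanical. A minimal sketch in Python, assuming a dict of character codes to Unicode strings has somehow been obtained already (which, as described above, is the hard part); it is BMP-only, since non-BMP values would need UTF-16BE surrogate pairs:

def build_tounicode_cmap(mapping):
    # mapping: {int character code: str Unicode text}, BMP code points only.
    # Note: the spec allows at most 100 mappings per beginbfchar block;
    # a real implementation would chunk the entries accordingly.
    entries = "\n".join(
        "<%04X> <%s>" % (code, "".join("%04X" % ord(ch) for ch in text))
        for code, text in sorted(mapping.items())
    )
    header = ("/CIDInit /ProcSet findresource begin\n"
              "12 dict begin\nbegincmap\n"
              "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
              "/CMapName /Adobe-Identity-UCS def\n"
              "/CMapType 2 def\n"
              "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n")
    body = "%d beginbfchar\n%s\nendbfchar\n" % (len(mapping), entries)
    footer = "endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\n"
    return header + body + footer

# e.g. build_tounicode_cmap({0x0001: "A"})

Writing the result into the PDF as a stream object and pointing the font dictionary's /ToUnicode entry at it would still require a PDF manipulation tool.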
Ghostscript won't do any of that for you, and I haven't heard of any tool which will.

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered at; Acrobat does wonders in rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) has a single glyph embedded. So far so good. The character is referenced by character code #1. The font's Encoding contains a Differences part: [ 1 /A ], thus character code 1 -> character /A. Now, in the embedded subset font there is a cmap that matches no glyph at character 65 (i.e. capital A); the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake, and is the PDF correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So all in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
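For what it's worth, the detection reduces to testing the Symbolic flag; a minimal sketch in Python, assuming the /Flags integer has already been read from the font descriptor by your PDF parser:

SYMBOLIC = 1 << 2  # bit 3 of /Flags, with bits numbered from 1 as in the PDF spec

def should_ignore_encoding(is_truetype, flags):
    # Per the ISO 32000 remark above: for symbolic TrueType fonts, ignore
    # /Encoding (including /Differences) and map character codes directly
    # through the font's built-in cmap.
    return is_truetype and bool(flags & SYMBOLIC)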
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
The font in question is a TrueType font, so actually, yes, there are two kinds of 'encoding'. First there is the PDF/PostScript Encoding. This maps a character code to a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case you simply need to use the character code (0x01) directly in the cmap subtable. This will give you a GID of 36.
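If you want to inspect the embedded font's cmap table yourself to confirm that, one possible tool is the fontTools Python library (my choice here, not something the answer prescribes; the file name is hypothetical, representing the font program extracted from the PDF):

from fontTools.ttLib import TTFont

font = TTFont("extracted_subset.ttf")
for table in font["cmap"].tables:
    # symbolic TrueType fonts typically carry a (3,0) Microsoft Symbol subtable
    if table.platformID == 3 and table.platEncID == 0:
        glyph = table.cmap.get(0x01)  # look up the raw character code
        if glyph is not None:
            print(glyph, font.getGlyphID(glyph))  # expected: GID 36 for this file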