Use iText 5 with Latin Extended-A encoding without IDENTITY_H character encoding?

We are still using iText 5 and have a new requirement to support Latin Extended-A character encodings. If we use IDENTITY_H with, say, Arial Unicode, this is entirely possible. But since we cannot ship Arial Unicode with our software, we would like to use the normal Arial font (which supports Latin Extended-A).
Is this possible with iText 5? Or do we need iText 7's IExtraEncoding interface to achieve this?

Related

Generating PDF from scratch, how are glyphs mapped to character codes?

I want to generate a Portable Document Format (PDF) file with a program of my own.
I am going to experiment with my own typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid XeTeX, LuaTeX, and other engines.
I also want to store the glyph information internally in my program or library.
But where should the character code be specified in the PDF so that the viewer knows which characters are being copied or searched?
To generate glyphs, my naive approach is to store, in a local library, raster images or Bézier curve parameters corresponding to the characters.
According to the PDF Reference, that seems well possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose; at least, those can be dealt with later.
Initially, I think I may generate a Postscript, and use Ghostscript to convert that to PDF.
But it is pointed out here that Postscript does not support Unicode, which I will certainly use.
My option is then reduced to directly generating PDF from scratch.
My confusion is this: although my brute-force approach may render correctly, I guess the viewer would be unable to copy or search the resulting PDF, since I would not have specified the character codes anywhere.
In PDF Reference p.122, we see that there are several different objects.
What seems relevant are text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewable parts of scanned Google Books, from which you can copy strings correctly.
What is the method or field that specifies this?
However, in the various tables that follow in the PDF Reference, I see no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object to its character code.
If this can be done, the envisioned project would be easiest, since I could just extract the Bézier curve parameters of some open-source fonts (I believe that can be done) and translate them myself into the format PDF allows.
If both image- and path-objects are impossible to hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be to embed a custom font, synthesized at runtime, in the PDF.
This is mentioned verbally and briefly in p.364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and requires tremendous research.
I would appreciate recommendations for tutorials on embedding fonts, as they are not easy to find.
In fact, I find that example PDF files are themselves scarce, as most of them seem to come as LZ-compressed binary files (I guess).
Indeed, I tried compiling a "Hello world" PDF in a non-Computer-Modern font and opening it with a text editor, and all I saw was blanks, control characters, and mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF so that: there is shown a circle, but when you copy that, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode table (a CMap) which maps character codes to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the bold question, if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1 because it is in the ASCII position for 1 in the Encoding vector. Some of the other glyphs, notably sun, mars and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter and saturn are not recognized).
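The ToUnicode mapping described in this answer can be sketched in a few lines of Python. This is only an illustration, not code from the calendar program: the function name build_tounicode_cmap is made up, the font is assumed to use single-byte character codes, and the surrounding boilerplate follows the pattern commonly seen in ToUnicode CMap streams. Attaching the result to a font dictionary as its /ToUnicode stream is not shown.

```python
def build_tounicode_cmap(code_to_text):
    """Return the text of a ToUnicode CMap for a font with 1-byte codes.

    code_to_text maps a character code (0..255) to the Unicode string
    that a viewer should produce when that glyph is copied or searched.
    Hypothetical helper for illustration only.
    """
    # One bfchar block is limited to 100 entries; this sketch keeps to
    # a single block instead of splitting larger mappings.
    if not 0 < len(code_to_text) <= 100:
        raise ValueError("this sketch handles 1 to 100 entries")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
        f"{len(code_to_text)} beginbfchar",
    ]
    for code, text in sorted(code_to_text.items()):
        if not 0 <= code <= 0xFF:
            raise ValueError("character codes must fit in one byte")
        # Destination strings are written as UTF-16BE hex.
        target = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:02X}> <{target}>")
    lines += [
        "endbfchar",
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # The glyph stored at code 0x31 (the ASCII position of "1")
    # should copy out as the letter "A".
    print(build_tounicode_cmap({0x31: "A"}))
```

With an entry like this, the shape drawn for code 0x31 can be anything (a circle, a moon), while a viewer that honours ToUnicode would report "A" on copy.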

OpenType Layout tables used in font ArialMT are not implemented in PDFBox

I'm using the CMS Magnolia in one of our projects. In the log files there are many errors like:
OpenType Layout tables used in font ArialMT are not implemented in PDFBox
What impact does this have on a PDF? Can it be opened? Does it look 'nice' or is it broken in some way?
This is logged at INFO level if you are using the current version (2.0.11). It is only relevant if you use PDFBox to create PDFs; it means that certain advanced font features (GDEF, GSUB, GPOS) are not (yet) supported. These are needed for certain languages, e.g. Thai, Arabic, or Indian languages. They can also be used for ligatures in Latin-script languages (fl, fi, ffl, ffi).
Some work on this topic is being done in PDFBOX-4189, but there is still a lot to do.
As for Magnolia, PDFBox is used either for indexing PDF documents or for generating PDF previews. For the first use case the message is completely irrelevant; for the second it might mean that the preview is not as accurate as it could be. Nothing major either way, though. You can reconfigure log4j to stop seeing this message.

How can I get the original font names from a PDF generated by Ghostscript?

I have a PDF which was produced by Ghostscript 8.15. I need to process this PDF with my software, which extracts the font names from the PDF file and then performs some operations. But when I extract the font names from this PDF file, they are not what they should be. For example, the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All fonts are TTF.
NOTE:
Why I need to extract the font names:
There is a program named InPage, which was the most popular tool in India and Pakistan for writing documents in Urdu: before word processors supported Unicode, it was the only way to type Urdu on a computer. Because of the complexity of the Urdu script, this software uses 89 font files named NOORIN01 to NOORIN89. So many font files are needed to hold all the Urdu ligatures, of which there are more than 19 thousand; each file can contain only 255 ligatures, so this technique was used before Unicode. As a result, copying text from a PDF generated by this software and pasting it into MS Word produces garbage, and there was no way to extract text from such old PDF files. (Nowadays the software supports Unicode, but I am talking about old files.)
So I developed a program in C# to extract text from such old PDF files. The algorithm uses a database file that lists all 89 font names with all their ASCII codes, and in the next column I entered the corresponding Urdu ligature in Unicode. I process the PDF character by character together with its font, match the font name against my database, fetch the Unicode ligature, and display it in a text box. This way I get the Unicode text successfully.
My software worked fine on many PDF files, but a few days ago someone complained that it fails to extract text from this PDF. When I tested it, I found that the PDF does not contain the original font names, which is why my software cannot process it further. The properties of the PDF show the producer as GPL Ghostscript 8.15. I searched the net and studied the documentation related to fonts, but could not find any clue about how to decode the names and recover the original font names.
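The lookup described above can be sketched as follows. This is a rough Python illustration, not the C# program itself: the function names are made up, and the single table entry is a placeholder rather than real InPage data. The sketch also strips the six-letter subset tag ("ABCDEF+") that PDF producers commonly prepend to embedded font names. It cannot help with a name like 'TTE25A5F90t00', which has no relation to the original name.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "ABCDEF+NOORIN05".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(pdf_font_name):
    """Drop a leading subset tag, if present, from a PDF font name."""
    return _SUBSET_TAG.sub("", pdf_font_name)


def decode_run(pdf_font_name, codes, table):
    """Map (font name, character code) pairs to Unicode text.

    table is a dict keyed by (font_name, code); codes that are missing
    from the table become U+FFFD so that gaps stay visible.
    """
    font = base_font_name(pdf_font_name)
    return "".join(table.get((font, code), "\ufffd") for code in codes)


if __name__ == "__main__":
    # Placeholder entry for illustration only, not real InPage data.
    table = {("NOORIN05", 0x41): "\u0644\u0627"}
    print(decode_run("ABCDEF+NOORIN05", [0x41], table))
```

Keying the table on the pair (font name, code) reflects the fact that the same byte value stands for a different ligature in each of the 89 fonts, which is why the font name is indispensable.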
The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old. The current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix) then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names, possibly you are attempting something which simply isn't possible.
[EDIT]
OK so now I understand the problem, I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file then it's simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it, using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14-year-old file.

How to deal with Unicode character encoding issues while converting documents from PDF to text

I am trying to extract text from a PDF. The PDF contains text in Hindi (Unicode). The utility for extraction I am using is Apache PDFBox ( http://pdfbox.apache.org/). The extractor extracts the text, but the text is not recognizable. I tried changing between many encodings and fonts, but the expected text is still not recognized.
Here is an example:
Say text in PDF is : पवार
What it looks after extraction is: ̄Ö3⁄4ÖÖ ̧ü
Are there any suggestions?
PDF is – at its heart – a print format and thus records text as a series of visual glyphs, not as actual text. Originally it was never intended as a digital archive format and that still shows in many documents. With complex scripts, such as Arabic or Indic scripts that require glyph substitution, ligation and reordering you often get a mess, basically. What you usually get there are the glyph IDs that are used in the embedded fonts which do not have any resemblance to Unicode or an actual text encoding (fonts represent glyphs, some of which may be mapped to Unicode code points, but some are just needed for font-internal use, such as glyph variants based on context or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
PDF also has facilities to embed the text as text alongside the visual representation, but that's solely at the discretion of the generating application. I have heard Word tries very hard to retain that information when producing PDFs but many PDF generators do not (it usually works somewhat for Latin, that's probably why nearly no one bothers).
I think the best bet for you if the PDF doesn't have the plain text available is OCR on the PDF as an image.

How to pdflatex with CJK characters/font/encoding

What's the best way to combine pdflatex with CJK characters/font/encoding?
I'd like to generate a PDF that includes CJK characters, and in the future all possible Unicode characters.
I'm thinking about using 'The CJK package for LaTeX' for CJK characters specifically, but it seems not to have been maintained since 2006.
Can you suggest something better?
Actually, I upgraded to the latest version of TeX Live (the 2009 vintage) and all is fine with pdfpages/xelatex.
I can now work in full UTF-8/Unicode and use my favorite TTF and OTF fonts.