How can I get original font names from a PDF generated by Ghostscript?

I have a PDF that was produced by Ghostscript 8.15. I need to process this PDF with my software, which extracts font names from the PDF file and then performs some operations. But when I extract the font names from this PDF file, they are not the same as they should be. For example, the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All fonts are TTF.
NOTE:
Why I need to extract the fonts.
Actually, there is a piece of software named InPage which was very popular in India and Pakistan for writing documents in the Urdu language, because before word processors supported Unicode it was the only practical way to type Urdu on a computer. Due to the complexity of the Urdu script, this software uses 89 font files named NOORIN01 to NOORIN89. The reason for using so many font files is to cover all the Urdu ligatures, of which there are more than 19 thousand. Each font file can contain only 255 ligatures, which is why this technique was used before Unicode. Copying and pasting text from a PDF file generated by this software therefore produces garbage in MS Word, for the reason given above: the 89 font files. So there was no way to extract text from such old PDF files. (Nowadays this software supports Unicode, but I am talking about old files.)
So I developed software in C# to extract text from such old PDF files. The algorithm I am using: I created a database file which contains the names of all 89 font files together with their character codes, and in the next column I typed the corresponding Urdu ligature in Unicode. I process the PDF file character by character along with its font, match the font name and code against my database, fetch the Unicode ligature, and display it in a text box. In this way I get the Unicode text successfully. My software was working fine on many PDF files. But a few days ago I got a complaint from a person that my software fails to extract text from his PDF. When I tested it, I found that the PDF file doesn't contain the original font names, which is why my software is unable to do any further processing. When I checked the properties of this PDF file, it shows the PDF producer as GPL Ghostscript 8.15. So I searched the net and studied the documentation related to fonts, but still couldn't find any clue on how to decode and get the original font names.
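To illustrate the idea (my actual software is in C#), here is a minimal sketch of that per-character lookup using Apache PDFBox 2.x in Java; the class name and the ligature table are hypothetical and assumed to be filled from the database described above:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class LigatureExtractor extends PDFTextStripper {

    // font name -> (character code -> Urdu ligature), filled from the database file
    private final Map<String, Map<Integer, String>> ligatureTable;
    private final StringBuilder output = new StringBuilder();

    public LigatureExtractor(Map<String, Map<Integer, String>> ligatureTable) throws IOException {
        this.ligatureTable = ligatureTable;
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        for (TextPosition position : textPositions) {
            // e.g. "NOORIN05" in good files, "TTE25A5F90t00" in the problem file
            String fontName = position.getFont().getName();
            Map<Integer, String> codeMap = ligatureTable.get(fontName);
            if (codeMap == null) {
                output.append(position.getUnicode()); // unknown font: keep whatever the PDF maps to
                continue;
            }
            for (int code : position.getCharacterCodes()) {
                String ligature = codeMap.get(code);
                output.append(ligature != null ? ligature : position.getUnicode());
            }
        }
    }

    public static String extract(File pdf, Map<String, Map<Integer, String>> table) throws IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            LigatureExtractor extractor = new LigatureExtractor(table);
            extractor.getText(document); // drives writeString() for every text run
            return extractor.output.toString();
        }
    }
}
```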

The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old... The current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix), then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names, possibly you are attempting something which simply isn't possible.
[EDIT]
OK, so now I understand the problem. I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file, then it's simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14-year-old file.
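If you just want to see which font names a PDF actually carries (original names, or the 'TT...' pseudo-names described above), a short sketch along these lines with Apache PDFBox 2.x should do; the file name is a placeholder:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class ListFontNames {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            int pageNumber = 1;
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // A name like "TTE25A5F90t00" is the pseudo-random name assigned
                    // by the PostScript driver, not the original TrueType name.
                    System.out.println("page " + pageNumber + ": " + font.getName());
                }
                pageNumber++;
            }
        }
    }
}
```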

Related

PDF Copy Text Issue: Weird Characters

I tried to copy text from a PDF file but get some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is the video https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where the text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by drawing on information from beyond the PDF, or by applying OCR to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification, and
the heuristics used by those programs differ relevantly, with Okular's heuristics working best for your document.
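If you want to confirm that this is what is happening with your file, you can check whether the embedded fonts carry /ToUnicode CMaps at all (keeping in mind that the algorithm from the specification also has Encoding-based fallbacks, so a missing CMap is a strong hint rather than proof). A minimal sketch with Apache PDFBox 2.x, the file name being a placeholder:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class CheckToUnicode {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("problem.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // The /ToUnicode entry sits directly in the font dictionary.
                    boolean hasToUnicode = font.getCOSObject().containsKey(COSName.TO_UNICODE);
                    System.out.println(font.getName() + ": ToUnicode CMap "
                            + (hasToUnicode ? "present" : "missing"));
                }
            }
        }
    }
}
```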
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine-readable form, or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...

PDF file header sequence: Why is the byte stream '25 e2 e3 cf d3' used in so many documents?

I know that it informs a reader whether the PDF contains binary data or not.
But why "25 e2 e3 cf d3" and not random binary bytes? So many documents contain exactly that sequence.
Is it just because so many of them use the same PDF library?
Refs:
PDF format. function of %-started sequence
comp.text.pdf>pdf format
Looking through the PDFs I have here, it looks like a number of PDF processors use these very letters "%âãÏÓ", among them Adobe products.
Not all of those processors use the same basic PDF library, so the use of the same letters cannot be explained by something like that.
Most likely it is due to the fact that Adobe software creates PDFs with that second line comment. For many years developers of other software used example files produced by Adobe software as templates for the PDFs they created.
Yes, the specification ISO 32000-1 merely requires
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be immediately followed by a comment line containing at least four binary characters—that is, characters whose codes are 128 or greater.
(and the earlier PDF references also recommend the same), so there is no need to use the same binary characters.
But there also is no reason not to use them. Why deviate from the working example files produced by Adobe software in this regard?
Especially in the years before the ISO specification, when there were only the PDF references, one tended to make the document structure as Adobe-like as possible, because Adobe did not consider the PDF references normative. Thus, if your document was valid by the references, Adobe viewers could still reject it without that counting as a bug...
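For illustration, a small sketch (Java, the file name is a placeholder) that hex-dumps the first two lines of a PDF, so you can see the %PDF-x.y header and that binary comment line for yourself:

```java
import java.io.FileInputStream;
import java.io.IOException;

public class DumpPdfHeader {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("sample.pdf")) {
            byte[] buffer = new byte[64]; // the header and the comment line fit easily in 64 bytes
            int read = in.read(buffer);
            StringBuilder hex = new StringBuilder();
            int lines = 0;
            for (int i = 0; i < read && lines < 2; i++) {
                int b = buffer[i] & 0xFF;
                hex.append(String.format("%02x ", b));
                if (b == '\n') {
                    lines++; // stop after the header line and the comment line
                }
            }
            // For an Adobe-style file this typically ends with: 25 e2 e3 cf d3 0a
            // i.e. '%' followed by four bytes with codes of 128 or greater.
            System.out.println(hex.toString().trim());
        }
    }
}
```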

Pentaho doesn't generate PDF in UTF-8 encoding

I have a problem related to PDF export on the Pentaho BI platform. I'm not able to produce a correct PDF file encoded in UTF-8 that contains Spanish characters. The procedure works properly neither in the local Report Designer nor on the BI server. Special characters like 'ñ' or 'ç' are skipped in the PDF file. Generation in other formats works just fine (HTML, Excel, etc.).
I've been struggling with this issue for a few days, unable to find any solution, and would be grateful for any clue.
Thanks in advance
P.S. Report Designer and BI platform version 6.1.0.1
Seems like a font issue. Your font needs to know how to work with Unicode, and it needs to specify how to "draw" the characters you want.
Office programs (at least MS Office) by default automatically select a font which can render any character (if font substitution is enabled); however, PDF readers don't do this: they always use the exact font you've specified.
When selecting an appropriate font, you have to pay attention to the Unicode characters it supports and to the font's license: some fonts don't allow embedding, and Pentaho embeds the subset of the font that was actually used into the generated PDF files if the encoding is UTF-8 or Identity-H.
To install fonts on a Linux server, you need to copy the font files either to your java/lib/fonts/ folder or to /usr/share/fonts/, grant read rights to the server's user, and restart the server application.
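As a quick sanity check of whether the fonts visible to the JVM (including those dropped into java/lib/fonts/) can render the characters in question ('ñ', 'ç', etc.), something along these lines can be used; note that it only checks glyph coverage, not the embedding permissions mentioned above:

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;

public class FontCoverageCheck {
    public static void main(String[] args) {
        String sample = "ñç áéíóú";
        // List every font the JVM can see and report the ones covering the sample text.
        for (Font font : GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts()) {
            if (font.canDisplayUpTo(sample) == -1) { // -1 means all characters are displayable
                System.out.println(font.getFontName() + " covers all sample characters");
            }
        }
    }
}
```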

WordML to PDF conversion

We receive WordML documents, which are basically XML files generated from MS Word documents and which also contain all the formatting instructions. Now we have a requirement to convert these files to PDF. I looked at iText's XMLWorker to do this conversion. What it did was simply remove all the XML tags and give me all the content as a single paragraph in the PDF with no formatting.
How can I make sure that the generated PDF contains correctly formatted text from this WordML document?
iText's product XMLWorker requires you to handle each XML element manually (unless you have HTML as input). The XML schema for MS Word documents is extremely complicated, so you'd be working on that for a few years to get something that looks even remotely ok. In short, XMLWorker doesn't do what you think it does.
If you want MS Word to PDF conversion, you need another kind of solution. XDocReport (MIT license) is one of these, and it has plugins for both iText 2 (LGPL license) and iText 5 (AGPL license). Results are not perfect though.
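For what it's worth, if the WordML input can first be re-saved as .docx (for example by Word itself or LibreOffice), a sketch along these lines with XDocReport's converter is one possible route; the file names are placeholders and, as said, the results are not perfect:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.xwpf.converter.pdf.PdfConverter;
import org.apache.poi.xwpf.converter.pdf.PdfOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxToPdf {
    public static void main(String[] args) throws Exception {
        // XDocReport's converter works on .docx (OOXML), so WordML (Word 2003 XML)
        // would need to be converted to .docx first.
        try (FileInputStream in = new FileInputStream("input.docx");
             FileOutputStream out = new FileOutputStream("output.pdf")) {
            XWPFDocument document = new XWPFDocument(in);
            PdfConverter.getInstance().convert(document, out, PdfOptions.create());
        }
    }
}
```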

How to deal with unicode character encoding issues while converting documents from PDF to Text

I am trying to extract text from a PDF. The PDF contains text in Hindi (Unicode). The utility for extraction I am using is Apache PDFBox ( http://pdfbox.apache.org/). The extractor extracts the text, but the text is not recognizable. I tried changing between many encodings and fonts, but the expected text is still not recognized.
Here is an example:
Say the text in the PDF is: पवार
What it looks after extraction is: ̄Ö3⁄4ÖÖ ̧ü
Are there any suggestions?
PDF is – at its heart – a print format and thus records text as a series of visual glyphs, not as actual text. Originally it was never intended as a digital archive format and that still shows in many documents. With complex scripts, such as Arabic or Indic scripts that require glyph substitution, ligation and reordering you often get a mess, basically. What you usually get there are the glyph IDs that are used in the embedded fonts which do not have any resemblance to Unicode or an actual text encoding (fonts represent glyphs, some of which may be mapped to Unicode code points, but some are just needed for font-internal use, such as glyph variants based on context or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
PDF also has facilities to embed the text as text alongside the visual representation, but that's solely at the discretion of the generating application. I have heard Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (it usually works somewhat for Latin scripts; that's probably why nearly no one bothers).
I think the best bet for you if the PDF doesn't have the plain text available is OCR on the PDF as an image.
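If you go the OCR route, one possible sketch is to render the pages with PDFBox and feed the images to Tesseract via Tess4J; the Hindi traineddata ('hin') has to be installed, and the paths and file names below are only assumptions:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class OcrHindiPdf {
    public static void main(String[] args) throws IOException, TesseractException {
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // adjust to your installation
        tesseract.setLanguage("hin");                                    // Hindi traineddata

        try (PDDocument document = PDDocument.load(new File("hindi.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                // Render at 300 DPI; OCR quality depends heavily on the DPI and the glyphs.
                BufferedImage image = renderer.renderImageWithDPI(page, 300);
                System.out.println(tesseract.doOCR(image));
            }
        }
    }
}
```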