How to deal with Unicode character encoding issues while converting documents from PDF to text

I am trying to extract text from a PDF. The PDF contains text in Hindi (Unicode). The utility I am using for extraction is Apache PDFBox (http://pdfbox.apache.org/). The extractor extracts the text, but the text is not recognizable. I have tried changing between many encodings and fonts, but the expected text is still not recognized.
Here is an example:
Say text in PDF is : पवार
What it looks like after extraction is: ̄Ö3⁄4ÖÖ ̧ü
Are there any suggestions?
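For reference, the extraction the question describes boils down to something like the following sketch, assuming Java and Apache PDFBox 2.x (the file name is a placeholder):

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("hindi.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(doc);
            // When the embedded font lacks a usable ToUnicode CMap, this string
            // contains glyph-code gibberish rather than readable Hindi.
            System.out.println(text);
        }
    }
}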

PDF is – at its heart – a print format and thus records text as a series of visual glyphs, not as actual text. It was never originally intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or Indic scripts that require glyph substitution, ligation and reordering, you often just get a mess. What you usually get are the glyph IDs used in the embedded fonts, which bear no resemblance to Unicode or an actual text encoding (fonts represent glyphs, some of which may be mapped to Unicode code points, but some are only needed for font-internal use, such as glyph variants chosen based on context or ligatures). You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.
PDF also has facilities to embed the text as text alongside the visual representation, but that is solely at the discretion of the generating application. I have heard Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (extraction usually works passably for Latin script even without it, which is probably why almost no one bothers).
I think your best bet, if the PDF doesn't have the plain text available, is OCR on the PDF rendered as an image.
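A minimal sketch of that OCR route, assuming Apache PDFBox 2.x for rendering and the Tess4J wrapper around Tesseract with the Hindi language pack; the file and data paths are placeholders:

import net.sourceforge.tess4j.Tesseract;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

import java.awt.image.BufferedImage;
import java.io.File;

public class OcrPdf {
    public static void main(String[] args) throws Exception {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/path/to/tessdata"); // placeholder: Tesseract trained-data directory
        tesseract.setLanguage("hin");               // Hindi language pack

        try (PDDocument doc = PDDocument.load(new File("hindi.pdf"))) {
            PDFRenderer renderer = new PDFRenderer(doc);
            for (int i = 0; i < doc.getNumberOfPages(); i++) {
                // Render at 300 DPI; higher resolution generally helps OCR accuracy.
                BufferedImage image = renderer.renderImageWithDPI(i, 300);
                System.out.println(tesseract.doOCR(image));
            }
        }
    }
}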

Related

Generating PDF from scratch, how are glyphs mapped to character codes?

I want to generate a Portable Document Format (PDF) file with an original program of mine.
I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid using XeTeX, LuaTeX, and other such engines.
And I want to store the glyph information internally in my program or my library.
But where should the character codes be specified in the PDF so that the viewer program knows them when the glyphs are copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters.
According to the PDF Reference, that seems quite possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.
Initially I thought I might generate PostScript and use Ghostscript to convert that to PDF.
But it is pointed out here that PostScript does not support Unicode, which I will certainly use.
My options are then reduced to generating PDF directly from scratch.
My confusion is this: though my brute-force approach may render correctly, I guess the resulting PDF would be such that the viewer can neither copy nor search the text, since I would have specified the character codes nowhere.
In PDF Reference p.122, we see that there are several different objects.
What seems relevant are text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewed parts of scanned Google Books, in which you can copy strings correctly.
What is the method or field specifying that?
However, in the various tables that follow in the PDF Reference, I think there is no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object with its character code.
If this can be done, the envisioned project would be easiest, since I would just extract some open-source fonts' Bézier curve parameters (I believe that can be done) and translate them myself into the format PDF allows.
If neither image objects nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be embedding a custom font, synthesized at runtime, in the PDF.
This is mentioned verbally and briefly in p.364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and requires tremendous research.
I would like you to recommend some tutorials for embedding fonts; they are not easy to find.
In fact, I find example PDF files themselves are already scarce, as most of them seem to be LZ-compressed binary files (I guess).
Indeed, I tried to compile a "Hello world" PDF in a non-Computer-Modern font and open it with a text editor, and all I see is blanks, control characters, and mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF in which a circle is shown, but when you copy it, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode vector which maps to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the bold question: if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1 because it is in the ASCII position for 1 in the Encoding vector. Some other glyphs, notably sun, mars and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter and saturn are not recognized).
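To make the relationship concrete: if you let a library embed a real font for you, the Encoding and ToUnicode machinery is generated on your behalf. Below is a minimal sketch assuming Java and Apache PDFBox 2.x (the font path is a placeholder); as far as I know, PDFBox writes a ToUnicode CMap for fonts embedded this way, so the shown text stays copyable and searchable. Hand-building a Type 3 font from your own Bézier outlines is possible too, but then you must supply the Encoding (glyph names) and a ToUnicode CMap yourself.

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

import java.io.File;

public class EmbedFontExample {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage();
            doc.addPage(page);

            // Embed a TrueType font; PDFBox builds the width arrays and,
            // to my knowledge, a ToUnicode CMap. The font path is a placeholder.
            PDType0Font font = PDType0Font.load(doc, new File("/path/to/SomeFont.ttf"));

            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(font, 14);
                cs.newLineAtOffset(72, 720);
                cs.showText("Hello, world");
                cs.endText();
            }
            doc.save("embedded-font.pdf");
        }
    }
}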

PDF Copy Text Issue: Weird Characters

I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where the text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by using information from beyond the PDF, or by applying OCR to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification, and
the heuristics used by those programs differ significantly, and Okular's heuristics happen to work best for your document (a sketch for checking the first point follows below).
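A minimal inspection sketch, assuming Java and Apache PDFBox 2.x (the file name is a placeholder). A font without a ToUnicode CMap is not automatically unextractable, since the algorithm can also fall back on standard encodings, but for documents like the one described here a missing entry is a strong hint:

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

import java.io.File;

public class CheckToUnicode {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            int pageNo = 0;
            for (PDPage page : doc.getPages()) {
                pageNo++;
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    // Report whether the font dictionary carries a ToUnicode entry.
                    boolean hasToUnicode =
                        font.getCOSObject().containsKey(COSName.getPDFName("ToUnicode"));
                    System.out.printf("page %d: %s, ToUnicode: %s%n",
                        pageNo, font.getName(), hasToUnicode ? "present" : "missing");
                }
            }
        }
    }
}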
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0" (a rough sketch of the general idea follows after this list).
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
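For that last option, the general idea (not necessarily the exact approach from the linked answer) is to build a ToUnicode CMap stream and attach it to the font dictionary. A rough sketch, assuming Java, Apache PDFBox 2.x, a simple single-byte font, and purely hypothetical code-to-Unicode mappings that you would have to determine yourself by inspecting the glyphs; composite (CID) fonts need a different codespace range and two-byte codes:

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;

import java.io.File;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class AddToUnicode {
    // Hypothetical mapping: codes 0x01/0x02 -> U+092A/U+0935. The real codes
    // and Unicode values depend on the font and must be determined per font.
    private static final String CMAP =
        "/CIDInit /ProcSet findresource begin\n" +
        "12 dict begin\n" +
        "begincmap\n" +
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n" +
        "/CMapName /Adobe-Identity-UCS def\n" +
        "/CMapType 2 def\n" +
        "1 begincodespacerange\n" +
        "<00> <FF>\n" +
        "endcodespacerange\n" +
        "2 beginbfchar\n" +
        "<01> <092A>\n" +
        "<02> <0935>\n" +
        "endbfchar\n" +
        "endcmap\n" +
        "CMapName currentdict /CMap defineresource pop\n" +
        "end\n" +
        "end\n";

    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            // Simplification: first page only, same hypothetical CMap for every font.
            PDPage page = doc.getPage(0);
            PDResources resources = page.getResources();
            for (COSName name : resources.getFontNames()) {
                PDFont font = resources.getFont(name);
                PDStream toUnicode = new PDStream(doc);
                try (OutputStream out = toUnicode.createOutputStream()) {
                    out.write(CMAP.getBytes(StandardCharsets.US_ASCII));
                }
                font.getCOSObject().setItem(COSName.getPDFName("ToUnicode"), toUnicode);
            }
            doc.save("with-tounicode.pdf");
        }
    }
}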

How can get original font names from PDF generated by Ghostscript?

I have a PDF which was produced by Ghostscript 8.15. I need to process this PDF with my software, which extracts font names from the PDF file and then performs some operations. But when I extract font names from this PDF file, the names are not what they should be. For example: the original font name is 'NOORIN05', but the PDF file contains 'TTE25A5F90t00'. How can I decode these font names back to the original names? All fonts are TTF.
NOTE:
Why I need to extract the fonts.
Actually there is a piece of software named InPage which was very popular in India and Pakistan for writing documents in the Urdu language, because before word processors had Unicode support it was the only practical way to type Urdu on a computer. Due to the complexity of the Urdu language, this software uses 89 font files named NOORIN01 to NOORIN89. The reason for using so many font files is to hold all the Urdu ligatures, of which there are more than 19 thousand; each file can contain only 255 ligatures, so that is why they used this technique before Unicode. Copying and pasting text from a PDF file generated by this software results in garbage in MS Word, for the reason I mentioned above: the 89 font files. So there was no way to extract text from such old PDF files. (Nowadays this software supports Unicode, but I am talking about old files.) So I developed a piece of software in C# to extract text from such old PDF files. The algorithm I am using creates a database file which contains the names of all 89 font files with all the ASCII codes, and in the next column I typed the corresponding Urdu ligature in Unicode. I process the PDF file character by character along with its font, match the font name against my database file, get the Unicode ligature from the database, and then display it in a text box. In this way I get the Unicode text successfully. My software was working fine on many PDF files. But a few days ago I got a complaint from a person that my software fails to extract text from his PDF. When I tested it, I found that the PDF file does not contain the original font names, so my software is unable to proceed. When I checked the properties of this PDF file, it shows the PDF producer as GPL Ghostscript 8.15. I searched the net and studied documentation related to fonts but still couldn't find any clue for decoding and recovering the original font names.
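As an aside: listing the font names a PDF actually uses can be sketched roughly as follows, in Java with Apache PDFBox 2.x rather than the C# stack described above (the file name is a placeholder). Subset-embedded fonts usually carry a six-letter prefix such as ABCDEF+NOORIN05 that needs to be stripped before matching; this will not help, however, when the producer has replaced the names entirely, as in the TTE... case discussed in the answer below.

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

import java.io.File;

public class ListFontNames {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("inpage.pdf"))) {
            for (PDPage page : doc.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    String baseName = font.getName();
                    // Strip a six-letter subset prefix like "ABCDEF+" if present,
                    // so the remainder can be matched against the database of names.
                    String stripped = baseName.replaceFirst("^[A-Z]{6}\\+", "");
                    System.out.println(baseName + " -> " + stripped);
                }
            }
        }
    }
}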
The first thing you should do is try a more recent version of Ghostscript. 8.15 is 14 years old... The current version is 9.21.
If that does not preserve the original names (potentially including the usual subset prefix) then we'll need to see an example input file which exhibits the problem.
It might also be helpful if you were to explain why you need to extract the font names; possibly you are attempting something which simply isn't possible.
[EDIT]
OK, so now I understand the problem. I'm afraid the answer to your question is 'you can't get the original font name'.
The PDF file was created from the output of the (Adobe-created) Windows PostScript printer driver. When that embeds TrueType fonts into the PostScript stream as type 42 fonts, it gives them a pseudo-random name which is composed of 'TT' followed by some additional characters that may look like hex, but aren't.
Old versions of the Ghostscript pdfwrite device (and 8.15 is very old) simply used that name verbatim, and that's what has been used for the font names in the PDF file you supplied.
Newer versions are capable of digging further into the font and picking up the original font name which is present in the PostScript. Unfortunately the old versions didn't preserve that. Once you've thrown the information away there is no way to get it back again.
So if the only thing you have is this PDF file then its simply not possible to get the font names back. If the person who supplied you with the PDF file can remake it, using a more recent version of Ghostscript, then it will work. But I presume they don't have the PostScript program used to create a 14 year old file.

bad character encoding after xsl 1.0 transform

I'm just beginning to learn about encoding issues, and I've learned just enough to know that it's far more complicated (on Windows, at least) than I had imagined, and that I have a lot more to learn.
I have an XML document that I believe is encoded in UTF-8. I'm using a VB.net app to transform the XML (with XslCompiledTransform and XmlTextWriter) into a column-specific text file. Some characters in the XML are coming out wrong in the output text file. Example: an em-dash (—) is being turned into the three characters "â€”". When that happens the columns in the file are thrown off.
As I understand it, an em-dash is not even a "Unicode character". I wouldn't expect to have issues with it. But, I can make the problem come and go by changing the encoding specified in the VB.net app.
If I use this, the em-dash is preserved:
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(True))
If I use this, the em-dash is corrupted into "â€”":
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))
The True/False simply tells VB whether to write the BOM at the beginning of the file. As I understand it, the BOM is neither necessary nor recommended for UTF-8. So, my preference is False - but then I get the weird characters.
I have several questions:
How can I be certain that the xml file is UTF-8? Is there a Windows tool that can tell me that?
How can I be certain that the transformed file is actually bad? Could it be that the real problem is the editor I'm using to look at it? Both EmEditor and UltraEdit show the same thing.
I've tried using the XVI32 hex editor to look at the file. I want to know what is actually written to disk, rather than what some GUI program is displaying to me. But, even on a file that looks good in EmEditor, XVI32 shows me the bad characters. Could it be that XVI32 just doesn't understand non-ASCII characters? What Windows hex editor would you recommend for this purpose?
The XML file is 650 MB, and the final text file is 380 MB - so that limits the list of useful tools somewhat.
You say 'As I understand it, an em-dash is not even a "Unicode character".' What do you mean by that? The Unicode character set definitely contains a code for em dash: 2014 hex. In the UTF-8 encoding, it will be 3 bytes: E2, 80, 94.
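That is exactly the mechanism behind the three characters: the bytes E2 80 94 are correct UTF-8, but a program that reads them as Windows-1252 shows them as three separate characters. A small illustration of the effect, in Java only because that is the language used elsewhere on this page; the same thing happens in any language:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EmDashMojibake {
    public static void main(String[] args) {
        String emDash = "\u2014";                               // EM DASH
        byte[] utf8 = emDash.getBytes(StandardCharsets.UTF_8);  // E2 80 94

        // Reading those UTF-8 bytes as Windows-1252 yields three characters:
        // a-circumflex, euro sign, right double quotation mark.
        String misread = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(misread);
    }
}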
I suspect Martin Honnen is right that your editor is simply not showing the file properly. A couple of suggestions:
I'm not familiar with the editors you mention, but editors that handle different encodings will often silently choose an encoding by which to interpret the file (based on the BOM if there is one, and sometimes based on the character codes they see). They also typically have some way of showing which encoding they are interpreting the file as, and a way of telling them to load (or reload) the file as a particular encoding. If your editor does not have these features, I suggest you get one that does, such as EditPlus or Notepad++.
As for the hex editor, again I'm not familiar with the one you mention, but the whole point of a hex editor is to see the raw bytes. Such editors often offer a text view as well (often side by side with the hex view), and if they do, I would not rely on their handling of encoding. Just use them to view the hex bytes and see if the bytes for your em dash are the same in both files.
Another way viewing the file can go wrong: even if your editor is interpreting the file as UTF-8, not all fonts will have all Unicode characters in them, and for those characters not in the font they may display a little square or nothing at all. Try a few different fonts, or find one that purports to support Unicode (though no font supports ALL of Unicode, and there are several revisions of the Unicode spec which add more characters). Lucida Sans Unicode, I think, is one that will be on most Windows systems.
Another trick: I highly recommend the utility BabelMap. You can look up any Unicode character there and see what its Unicode value is, and you can copy the character from there, paste it into the file in your text editor, and see how it displays it.
UltraEdit offers several configuration settings for working with UTF-8 encoded files. There is Auto detect UTF-8 files in the File Handling - Unicode/UTF-8 Detection configuration dialog, which is enabled by default.
With this setting enabled, UltraEdit searches for the UTF-8 BOM. If it is not present, it searches the first few KB for a UTF-8 declaration, as is usually present in the head of HTML/XHTML files or in the first line of an XML file. If there is no BOM and no standardized encoding information at the top of the file, UltraEdit searches within the first 64 KB for byte sequences which look like UTF-8 encoded characters. If such a byte sequence can be found, the file is interpreted as a UTF-8 encoded file by UltraEdit. For example, a file containing only the 3 bytes E2 80 94 is interpreted as a UTF-8 encoded file.
UltraEdit indicates in the status bar at the bottom of the main window which encoding is detected and active (on save) for the active file. In the status bar either UTF-8 or U8- is displayed, depending on which status bar is used (advanced or basic) and which version of UltraEdit is used, as older versions have only the basic status bar.
Only files encoded in UTF-8 with no BOM, no UTF-8 character set or encoding declaration, and no UTF-8 encoded character within the first 64 KB are opened incorrectly as ANSI files. In such cases the user can use the enhanced File - Open command of UltraEdit and explicitly select UTF-8 encoding before opening the file with the Open button.
For completeness, there is also a configuration setting which can be manually added to uedit32.ini which results in opening all files not detected as UTF-16 as UTF-8 encoded files. This setting is useful for those who want to work only with UTF-8 encoded files, even when a file very often contains no characters with a code value greater than 127.
For more information about working with UTF-8 encoded files take a look in the UltraEdit forums. There are a few topics with lots of information about editing UTF-8 encoded files in UltraEdit.

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few different things, but I did not get very far with any of them:
Convert PDF to text. This does not work for me, as I lose the images and the structure of the document.
Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to successfully parse the HTML.
Convert PDF to XML. Same as above.
Does anyone have suggestions on how to tackle this problem?
There is essentially no easy cut-and-paste solution, because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).
On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".
To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what us humans would consider to be proper reading order).
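To illustrate the "loop over all text and look at its properties" idea: with Apache PDFBox (a sketch, assuming version 2.x; the file name is a placeholder) you can subclass PDFTextStripper and inspect the font size and position of each text fragment, which is the raw material for heading and paragraph heuristics:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class FontAwareStripper extends PDFTextStripper {

    public FontAwareStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> positions) throws IOException {
        // Each TextPosition carries the glyph's coordinates and font size;
        // unusually large sizes are a reasonable (if crude) heading signal.
        TextPosition first = positions.get(0);
        System.out.printf("%.1f pt at (%.1f, %.1f): %s%n",
            first.getFontSizeInPt(), first.getXDirAdj(), first.getYDirAdj(), text);
        super.writeString(text, positions);
    }

    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            FontAwareStripper stripper = new FontAwareStripper();
            stripper.setSortByPosition(true); // sort into reading order as far as possible
            stripper.getText(doc);            // triggers the writeString callbacks
        }
    }
}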
Parsing a PDF for headers and their sub-contents is really very difficult (that doesn't mean it's impossible) as PDFs come in various formats. But I recently encountered a tool named GROBID which can help in this scenario. I know it's not perfect, but if we provide proper training it can accomplish our goals.
GROBID is available as open source on GitHub.
https://github.com/kermitt2/grobid
You may use the following approach with iTextSharp or other open-source libraries (a rough Java sketch of the first two steps follows the list):
Read the PDF file with iTextSharp or a similar open-source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
Sort all text objects by coordinates so you have them all together
Then iterate through the objects and check the distance between them to see whether 2 or more objects can be merged into one paragraph or not
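A rough sketch of the first two steps, using the Java version of iText 5 (iTextSharp is its C# port; the file name is a placeholder). It collects each text chunk with the start point of its baseline and sorts the chunks by position; the paragraph-merging heuristic from the last step would then operate on the sorted list:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

import java.util.ArrayList;
import java.util.List;

public class CollectTextChunks {

    static class Chunk {
        final float x, y;
        final String text;
        Chunk(float x, float y, String text) { this.x = x; this.y = y; this.text = text; }
    }

    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf");
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        List<Chunk> chunks = new ArrayList<>();

        for (int page = 1; page <= reader.getNumberOfPages(); page++) {
            parser.processContent(page, new RenderListener() {
                public void beginTextBlock() {}
                public void endTextBlock() {}
                public void renderImage(ImageRenderInfo info) {}
                public void renderText(TextRenderInfo info) {
                    // Record the baseline start point and the decoded text of the chunk.
                    Vector start = info.getBaseline().getStartPoint();
                    chunks.add(new Chunk(start.get(Vector.I1), start.get(Vector.I2), info.getText()));
                }
            });
        }

        // Sort top-to-bottom, then left-to-right; the vertical gap between
        // consecutive chunks is the crude "distance" used for paragraph breaks.
        chunks.sort((a, b) -> {
            int byY = Float.compare(b.y, a.y); // top of the page first
            return byY != 0 ? byY : Float.compare(a.x, b.x);
        });
        chunks.forEach(c -> System.out.printf("(%.1f, %.1f) %s%n", c.x, c.y, c.text));
        reader.close();
    }
}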
Or you may use a commercial tool like ByteScout PDF Extractor SDK, which is capable of doing exactly this:
extract text and images along with analyzing the layout of the text
produce XML or CSV where text objects are merged or split into paragraphs inside a virtual layout grid
access objects via a special API that makes it possible to address each object via its "virtual" row and column index, disregarding how it is stored inside the original PDF.
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py or tabula-java.
I made a full tutorial on how to use tabula-py in this article. You can also use tabula in a web browser, as long as you have Java installed.
Unless it is Marked Content, PDF does not have a structure... You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the answers above, PDFs aren't very easy to parse. However, if you have certain additional information regarding the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
PDFBox is a PDF parsing tool that you can use for extracting text and images on top of which you can define your custom rules for parsing.
However, for parsing PDFs you need to have some prior knowledge of the general format of the PDF file. You can check out the following blog post, Document parsing, for more information regarding document parsing.
Disclaimer: I was involved in writing the blog post.
iText API:
PdfReader pr = new PdfReader("C:\\test.pdf");
References:
PdfReader