Hindi to English from PDF - pdf

I am not able to copy Hindi content from a PDF file. When I copy and paste the content, it changes to different Hindi characters.
Example-
Original- विधान सभा
After paste- नरधरन सभर
This is what it shows.
Can anybody help me get the exact Hindi characters?

What was used to create the PDF?
It was likely created with an embedded font subset and no ToUnicode mapping. The character codes used in the PDF's content are mapped to glyphs embedded in the PDF, and those glyphs are what gets displayed, but there is no mapping from the codes to regular Unicode code points, so copying them produces gibberish. The only way to extract the original content would be some form of OCR.
Another possibility is that the application you are pasting it into is not shaping the characters correctly.
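
To make the missing-mapping explanation concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions: real subset fonts map codes to glyph shapes rather than to single characters, and the names `build_subset` and `extract`, as well as the "guess a Latin letter" fallback, are invented for illustration.

```python
def build_subset(text):
    """Give each distinct character an arbitrary code (1, 2, 3, ...) in order of first use."""
    code_of = {}
    for ch in text:
        code_of.setdefault(ch, len(code_of) + 1)
    return code_of


def extract(codes, to_unicode=None):
    """Turn stored codes back into text, the way copy/paste has to."""
    if to_unicode is not None:
        # With a ToUnicode-style table every code resolves to the right character.
        return "".join(to_unicode[c] for c in codes)
    # Without one the viewer can only guess; this toy guess is "code = Latin letter".
    return "".join(chr(0x40 + c) for c in codes)


original = "विधान सभा"
code_of = build_subset(original)
codes = [code_of[ch] for ch in original]                 # what the page content stores
to_unicode = {code: ch for ch, code in code_of.items()}  # what a ToUnicode table would add

print(extract(codes, to_unicode))  # round-trips to the original text
print(extract(codes))              # same codes, no table: unrelated characters
```

The page looks identical in both cases because display only needs the code-to-glyph half. Copying needs the code-to-Unicode half, and that is the part that is missing.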

Related

PDF Unicode Text Not Encoding Properly

I have asked this question on Upwork.com and Adobe Support forums but no one has been able to provide an answer.
When I convert a Microsoft Word document to PDF (using either the Adobe Printer or Save As > PDF), the Unicode text in the document does not encode properly. The text appears correct, but when I copy and paste it to another program, such as Notepad, there are errors:
For example, the text in Microsoft Word is
बूढ़े पिता ने ऐसी भूमिका ...
The text appears correct in the PDF file, but when I copy and paste it into Notepad, it is:
बूढ़े 􀉟पता ने ऐसी भू􀉠मका ...
Note the boxes.
This erroneous encoding means it is not possible to search the PDF document properly. For example, if I search for पिता (the second word), I will not get a match. I would have to search 􀉟पता.
Embedding the fonts makes no difference.
Please advise how I can create a PDF properly, so that the encoding is exactly like it is in any other program, i.e. without any of these boxes or other issues.
Thank you

how do I extract the Arabic text of this PDF file correctly?

Today I tried to search for an Arabic word in a PDF file that contains Arabic content.
No PDF reader software can find any Arabic word in this PDF file.
So I dragged the PDF file into the Firefox browser, selected an area that contained some words using Inspect Element, and saw this:
hw ½oiC instead of آخرین سخن
What type of encoding is used in this PDF file?
How can I decode this to normal text?
It's difficult to comment on the file without seeing it, but a good starting point is to open it in Acrobat and either copy the text and paste it into a text editor, or search for the text. Either test will reveal whether the text can be extracted correctly.
If it can't be extracted properly then there's a good chance the font is lacking a ToUnicode entry (see Section 9.10.1 of the ISO 32000-1:2008 PDF specification for more information).
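
For a sense of what a ToUnicode entry contains, here is a minimal Python sketch that reads the `bfchar` part of such a CMap. The sample fragment is hand-written and its codes are made up; `parse_bfchar` is a hypothetical helper, and it ignores `bfrange` sections, which a real CMap may also use.

```python
import re

# A hand-written fragment in the shape of a ToUnicode CMap; the codes are invented.
SAMPLE_CMAP = """
3 beginbfchar
<01> <0622>
<02> <062E>
<03> <0631>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections of a CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE, so one code may map to several characters.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


table = parse_bfchar(SAMPLE_CMAP)
print("".join(table[code] for code in b"\x01\x02\x03"))
```

When a font has no such table, a viewer has nothing like `table` to consult, which is why search and copy both fail.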

How to export text document containing astral Unicode characters to PDF

I regularly create documents that need Unicode characters above U+FFFF. Unfortunately, OpenOffice and LibreOffice are both unable to correctly export these characters when creating a PDF. The actual data gets mangled by a completely asinine algorithm, while the display just consists of various overlapping question mark boxes.
This is not a font issue. I embed all used fonts in the PDF and all characters below U+FFFF work perfectly fine.
Until now I have been working around this issue by mapping the glyphs I need to a custom PUA font. This solves the display problems, but obviously makes the actual content of the text unsearchable and quite fragile. I haven’t been able to find any settings that might affect the handling of Unicode characters in PDF.
Therefore I have three questions:
Is there a way to make OpenOffice/LibreOffice handle astral characters correctly on PDF export?
If not, is there an external tool that can convert .odt files to PDF while preserving astral characters?
If not, is there another good rich-text editor using a different file format that can deal with astral characters in PDFs?
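
For background on why characters above U+FFFF are a special case: the Unicode values stored in a PDF's ToUnicode tables are written as UTF-16BE, where such characters take two 16-bit units (a surrogate pair) instead of one. The short Python sketch below only illustrates that difference. It says nothing about what OpenOffice or LibreOffice actually do internally, and `utf16be_hex` is a name made up for this example.

```python
def utf16be_hex(ch):
    """Hex form of a character in UTF-16BE, the form used for ToUnicode values."""
    return ch.encode("utf-16-be").hex().upper()


bmp = "\u0915"         # DEVANAGARI LETTER KA, below U+FFFF
astral = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A, above U+FFFF

print(utf16be_hex(bmp))     # one 16-bit unit: 0915
print(utf16be_hex(astral))  # two 16-bit units: D835DC00
```

Any exporter that assumes one 16-bit unit per character would have no correct way to write the second case.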

Copy text from PDF with custom FONT

I am trying to copy some text from a PDF, but when I paste it into a Word file, it is just garbage, something like മുഖവുര. The PDF is in the Malayalam language. File->Properties->Fonts lists BRHMalayalam (Embedded Subset).
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones in the document, because the document will probably use character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file and confirming the answer already given by @KenS, the problem with this PDF document is in fact how it's constructed, or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as subsets with a WinAnsi encoding. What is actually in the file is close enough to that encoding that the text copies out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set to WinAnsiEncoding, which makes no sense for a Malayalam font.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no choice when copying and pasting but to assume the characters in the PDF really are WinAnsi encoded, which means you end up with (garbled) Latin characters.
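
The WinAnsi fallback can be sketched in a few lines of Python. Note the assumptions: the glyph assignments in `GLYPH_FOR_CODE` are invented for illustration (the real BRHMalayalam layout is not known here), and `cp1252` stands in for WinAnsiEncoding, which it matches for the byte values used.

```python
# Hypothetical layout: which Malayalam glyph a legacy font draws for each byte code.
# These assignments are invented; they are not the real BRHMalayalam layout.
GLYPH_FOR_CODE = {0x61: "മ", 0x70: "ു", 0x4A: "ഖ", 0x68: "വ", 0x63: "ര"}

stored = bytes([0x61, 0x70, 0x4A, 0x68, 0x70, 0x63])  # codes in the page content

# What the reader sees: the embedded font draws a Malayalam glyph for each code.
displayed = "".join(GLYPH_FOR_CODE[code] for code in stored)

# What copy/paste yields: with no ToUnicode table, the codes are read as WinAnsi.
copied = stored.decode("cp1252")

print(displayed)
print(copied)
```

The same six bytes produce Malayalam on screen and Latin letters on the clipboard, because only the font knows what the codes were meant to be.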
Just convert the PDF file to a Word file, then edit, copy, or modify the text in the file. Simple :)
When you are done, go to File -> Save As and change the format from DOC to PDF. Hope that makes sense :)

PDF data extraction gives symbols/gibberish?

I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file, though, gives me gibberish output that I can't even copy and paste into this text area correctly. All sorts of Unicode weirdness.
If I copy and paste as normal into Excel/Notepad I get the same issue.
I assume it's something to do with a messed up character encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of software to do it.
Thanks!
An increasing number of PDF files use subsetted fonts, each of which is basically a custom encoding. Normally the font dictionary in the PDF should have a ToUnicode table so that text extraction can decode the font encoding and return the correct text.
Some PDF producers do this on purpose to prevent easy text extraction from documents such as financial reports. If there is only one font then you could decode it manually, but in my experience PDFs often contain multiple random encodings, which makes them nearly impossible to decode automatically.
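
As a rough idea of what manual decoding for a single font could look like, here is a Python sketch that learns a substitution table from one sample whose correct text is known and applies it to other text from the same font. The helper names and the sample strings are invented, and the approach assumes a strict one-character-to-one-character encoding, which many real files will not satisfy.

```python
def learn_table(garbled, known):
    """Pair up characters from a garbled sample and its known correct text."""
    if len(garbled) != len(known):
        raise ValueError("samples must align one character to one character")
    table = {}
    for g, k in zip(garbled, known):
        # setdefault stores k on first sight and returns the earlier value afterwards.
        if table.setdefault(g, k) != k:
            raise ValueError(f"{g!r} maps to more than one character")
    return table


def decode(garbled, table):
    """Apply the table, marking characters never seen in the sample with U+FFFD."""
    return "".join(table.get(ch, "\ufffd") for ch in garbled)


# Invented sample: pretend "Khoor" was copied where the page shows "Hello".
table = learn_table("Khoor", "Hello")
print(decode("Kroh", table))   # every character was in the sample
print(decode("Khoox", table))  # "x" was not, so it comes out as U+FFFD
```

With several fonts, each would need its own table, and a garbled character that stands for two different originals makes `learn_table` fail outright.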
One way to test for these types of PDFs is to open the file in Acrobat, select some text, copy it, and paste it into Notepad. If the text is garbled then the PDF is using a subsetted font without a usable ToUnicode table and there is not much more you can do. If Acrobat can't extract the text correctly then nothing else can. It may as well be a page of hieroglyphs.
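
If you need to run the copy-and-look test over many files, part of it can be automated once the text has been extracted. The Python sketch below is a heuristic only: it flags private-use, unassigned, and replacement characters, so it will miss garbling that happens to land on valid letters. The function names and the threshold are assumptions for this example.

```python
import unicodedata


def suspicious_chars(text):
    """Return characters that usually signal a failed code-to-Unicode mapping."""
    found = []
    for ch in text:
        # "Co" is Private Use, "Cn" is unassigned; U+FFFD is the replacement character.
        if unicodedata.category(ch) in ("Co", "Cn") or ch == "\ufffd":
            found.append(ch)
    return found


def looks_garbled(text, threshold=0.0):
    """True if the share of suspicious characters exceeds the threshold."""
    if not text:
        return False
    return len(suspicious_chars(text)) / len(text) > threshold


clean = "plain extracted text"
broken = "plain " + chr(0x10025F) + "xtracted text"  # a Plane 16 private-use code point

print(looks_garbled(clean))
print(looks_garbled(broken))
```

A result of False does not prove the text is right, so a quick visual check of a sample is still worthwhile.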