Detect embedded characters in PDF using PdfBox

I am extracting text from a PDF file using PdfBox. When the PDF does not contain any embedded fonts, everything works fine. The problem occurs when there are some embedded TrueType fonts. I discovered that in some cases the embedded fonts replace the shapes of the default characters with other shapes. For example, the character code for 'ï' is used to encode 'ł'. I am aware that I cannot get the real shape of the character without a mapping or OCR. I would like to know which characters might be redefined by the embedded fonts. My question is: how can I tell which characters in the PDF stream are defined by the embedded fonts?
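To make the goal concrete, here is a minimal sketch of the bookkeeping being asked about. It does not use PdfBox: `TextRun`, its fields, and the sample data are hypothetical stand-ins for whatever a PDF parser reports for each piece of shown text. It also assumes single-byte codes whose default decoding is WinAnsi (Python's `cp1252`).

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """One piece of shown text, as a PDF parser might report it (hypothetical)."""
    font_name: str   # resource name of the font in use, e.g. "F2"
    embedded: bool   # True if the font program is embedded in the PDF
    codes: bytes     # raw single-byte character codes from the content stream


def suspect_characters(runs):
    """Collect the characters whose default decoding cannot be trusted.

    Assumption: the default decoding is WinAnsi (cp1252). Every code drawn
    with an embedded font is treated as possibly redefined, because the
    embedded glyph for that code may differ from the default shape.
    """
    suspects = set()
    for run in runs:
        if run.embedded:
            suspects.update(run.codes.decode("cp1252", errors="replace"))
    return suspects


runs = [
    TextRun("F1", embedded=False, codes=b"plain"),
    TextRun("F2", embedded=True, codes=b"\xef\xf3"),  # default decoding: 'ï', 'ó'
]
print(sorted(suspect_characters(runs)))
```

This only narrows down which characters are suspect; it cannot say what the embedded glyphs actually look like.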

Related

How to export text document containing astral Unicode characters to PDF

I regularly create documents that need Unicode characters above U+FFFF. Unfortunately, OpenOffice and LibreOffice are both unable to correctly export these characters when creating a PDF. The actual data gets mangled by a completely asinine algorithm, while the display just consists of various overlapping question mark boxes.
This is not a font issue. I embed all used fonts in the PDF and all characters below U+FFFF work perfectly fine.
Until now I have been working around this issue by mapping the glyphs I need to a custom PUA font. This solves the display problems, but obviously makes the actual content of the text unsearchable and quite fragile. I haven’t been able to find any settings that might affect the handling of Unicode characters in PDF.
Therefore I have three questions:
Is there a way to make OpenOffice/LibreOffice handle astral characters correctly on PDF export?
If not, is there an external tool that can convert .odt files to PDF while preserving astral characters?
If not, is there another good rich-text editor using a different file format that can deal with astral characters in PDFs?

Ghostscript converting pdf to text file, output is unreadable

I was trying to convert a PDF document into a text file. Everything works until I open the output file and see that it is unreadable; the characters are in some Chinese font:
" 琀攀猀琀 "
This is my command line:
gswin64c.exe -ps2ascii -sDEVICE=txtwrite -sOutputFile=outputtext.txt test.pdf
Am I doing something wrong?
You haven't posted the file, so it's not possible to be absolutely certain, however....
Almost certainly the text in your PDF file is not encoded using an ASCII encoding scheme (it possibly contains subset fonts), and does not contain a ToUnicode CMap for the font in question. Additionally, the glyph names are not standard names (or it's a TrueType font, which doesn't have named glyphs).
Without any of the above information txtwrite doesn't have any clue what the character codes represent, and so simply emits them verbatim.
Given that you are seeing Chinese glyphs, I would suspect that the original font is a CIDFont, probably a TrueType font, that it is subset, and that it has no ToUnicode CMap.
In this case, the only way to get the text out will be to use OCR.
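One inexpensive check before turning to OCR: the sample in the question looks like what UTF-16 text becomes when its bytes are read in the wrong order. The sketch below is plain Python, not a Ghostscript feature, and the function name is made up for illustration. It swaps the byte order back. If the result is readable, the output file may only need to be opened with the right encoding; if it is still nonsense, the explanation above applies.

```python
def swap_utf16_byte_order(text: str) -> str:
    """Reinterpret text as if its UTF-16 code units had been read byte-swapped."""
    return text.encode("utf-16-be").decode("utf-16-le")


garbled = "琀攀猀琀"  # the characters quoted in the question, without the spaces
print(swap_utf16_byte_order(garbled))
```

For a whole file, the equivalent step would be to reopen it as UTF-16 with the other byte order and see whether the content becomes legible.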

Copy text from PDF with custom FONT

I am trying to copy some text from a PDF, but when I paste it into a Word file, it is just garbage, something like മുഖവുര. The PDF is in Malayalam. When I look at File->Properties->Fonts, it says BRHMalayalam (Embedded Subset), as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones in the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by @KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set to WinAnsiEncoding, which does not make sense at all.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no choice when copying and pasting but to assume that the characters in the PDF really are WinAnsi encoded, which means you end up with (garbled) Latin characters.
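The fallback described above can be sketched as follows. This is a simplified model, not the code of any particular viewer: `extract_text` and the sample mapping are hypothetical, codes are assumed to be single bytes, and Python's `cp1252` codec stands in for WinAnsiEncoding.

```python
from typing import Dict, Optional


def extract_text(codes: bytes, to_unicode: Optional[Dict[int, str]] = None) -> str:
    """Turn character codes into text the way a copy/paste operation might.

    If the font carries a ToUnicode-style mapping, use it. Otherwise fall
    back to the declared encoding, assumed here to be WinAnsi (cp1252).
    """
    if to_unicode is not None:
        # Codes missing from the mapping become U+FFFD (replacement character).
        return "".join(to_unicode.get(code, "\ufffd") for code in codes)
    return codes.decode("cp1252", errors="replace")


codes = b"ab"  # codes that the embedded subset draws as Malayalam glyphs

# Hypothetical mapping that such a font would need to ship for copying to work.
mapping = {0x61: "\u0d2e", 0x62: "\u0d32"}

print(extract_text(codes))           # no mapping: Latin letters come out
print(extract_text(codes, mapping))  # with mapping: Malayalam letters come out
```

The glyphs on the page are identical in both cases; only the text recovered from the character codes differs.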
Just convert the PDF file to a Word file, then edit, copy, or modify the text in that file. Simple :)
When you are done, go to File -> Save As and change the format from DOC to PDF. Hope you understood :)

Create ttf with a set of chars from other font

I'm programming an app that generates PDFs for several situations. In some of them there are Greek letters that should be displayed. My problem is that including a font like Arial that provides these characters adds some MB to my PDFs, because the whole font is included in the PDF.
Is there a way to include just the chars needed or to generate a "new" font that only includes the chars needed as another font?
The feature you are looking for is font subsetting. This is normally a function of the code making the pdf.
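As a rough illustration of the first step of subsetting, the sketch below collects the distinct characters that the generated text actually uses. The function name and sample strings are made up for this example; the real work of pruning the font to those characters is done by the PDF library or a font tool and is not shown.

```python
def code_points_needed(texts):
    """Return the distinct code points used in texts, formatted as U+XXXX."""
    used = {ord(ch) for text in texts for ch in text}
    return [f"U+{cp:04X}" for cp in sorted(used)]


labels = ["α = 0.5", "β > α"]  # hypothetical strings that will appear in the PDF
print(code_points_needed(labels))
```

A list like this is typically what a subsetting step takes as input, so the embedded font carries only the glyphs for those characters instead of the full character set.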

How to replace or modify the font or glyphs embedded in a PDF file?

I want to replace the font embedded in an existing PDF file programmatically (with iText).
iText itself does not seem to provide any data model for glyphs and fonts, but I believe it can let me retrieve and update the binary stream that contains the font.
It's OK even if I don't know which glyph is associated with which font; what I want to do is just replace them. To be precise, I want to embolden all glyphs in a PDF document.
Replacing fonts in rendering time is not an option because the output must be PDF with all information preserved as is.
Is there anyone who has done this before with iText or any other PDF libraries?
PDF files define a set of fonts (e.g. F0, F1, F2) and then define these separately, so you could theoretically rewrite the entry for F0. You would have to ensure the two fonts have the same spacing (or you will have to rewrite the PDF as well), and you would probably have to hack the PDF manually.
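The spacing constraint can be sketched with a toy model. This is not iText code: the dictionaries below are hypothetical stand-ins for a page's font resources, and `widths` maps a character code to its advance width. The replacement is accepted only if every code used in the document has the same width in both fonts.

```python
def replace_font(resources, name, new_font, used_codes):
    """Return a copy of resources with `name` pointing at new_font.

    Raises ValueError if any character code used in the document has a
    different advance width in the new font, since the text layout in the
    content streams would then no longer fit.
    """
    old_widths = resources[name]["widths"]
    new_widths = new_font["widths"]
    mismatched = [c for c in used_codes if old_widths.get(c) != new_widths.get(c)]
    if mismatched:
        raise ValueError(f"width mismatch for codes {mismatched}")
    updated = dict(resources)
    updated[name] = new_font
    return updated


regular = {"name": "Regular", "widths": {65: 600, 66: 600}}
bold = {"name": "Bold", "widths": {65: 600, 66: 600}}

resources = replace_font({"F0": regular}, "F0", bold, used_codes=[65, 66])
print(resources["F0"]["name"])
```

When the widths differ, the alternative mentioned above applies: the content streams themselves have to be rewritten so the text positions match the new font.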