Is it possible to replace subset embedded fonts by non-subset embedded fonts in PDF by any mechanism (for example using iText)?
The goal is to write some java routine which will takes an input PDF with subset fonts and outputs a PDF with fully embedded fonts, with a full copy of the entire character set of a fonts are stored in the PDF.
Any advice would have helped me a lot.
Related
I am not a programmer, but a normal user who uses Linux.
I want to use Ghostscript to DISPLAY Pdf files, not to CREATE Pdf files. (I have never used Ghostscript until now).
But I want Ghostscript to automatically replace all fonts with other fonts when I open the PDF. No matter if the fonts are embedded or not.
With which fonts should the fonts be replaced?
Answer: I want to create a list of fonts, that I want to be available for replacement.
But which of these fonts on the list should be used?
Answer: The one that best matches the metric of the font to be replaced.
Is it possible to do this somehow?
You can't get Ghostscript to do what you are asking. If a PDF file contains fonts Ghostscript will use those fonts, it will only substitute if it cannot find an embedded font.
The reason for this is simple; the font embedded in the PDF file is the correct font. It's Metrics are correct, and the mapping form character code to the appropriate glyph selector in the font will be correct.
It's also a non-trivial problem to select from a list of fonts the one which 'best matches the metrics of the font to be replaced'. What characteristics should be considered ? How should those be determined ?
When a font is not embedded then Ghostscript will consult its own list of fonts and CIDFonts. Both of these lists can be customised, the documentation is here
But since a substitute font is always going to be a compromise, you can't tell Ghostscript not to use the embedded fonts in a PDF. Well technically you could, by modifying the PDF interpreter, but you say you aren't a programmer, so I doubt you will want to try that.
I am generating PDF files which contain English and Chinese characters (using the Ruby Prawn library). I don't want to embed a Chinese font file in the generated PDF files, because these files need to stay small. So I'm wondering if I could just mentioning a Chinese font name in my PDF files, and have the PDF readers correctly rendering the Chinese characters because the PDF readers would already have the Chinese font file.
Is that something sensible? If so is there any commonly used Chinese font that one can expect to be installed in most of the PDF readers used by Chinese people?
The best way to ensure that a PDF file can be displayed on a any reader, is to use partially embedded fonts (also known as font subset). In PDF, you don't need to include the whole font with your document, having a subset with just the glyphs that were used in the file is enough for the file to be portable.
I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by #KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.
Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :)
and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)
I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file though, gives me gibberish output that I can't even copy and paste into this textarea correctly. All sorts of unicode weirdness.
If I copy and paste as per normal into excel/notepad I get the same issue.
I assume it's something to do with a messed up character encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of SW to do it.
Thanks!
There are an increasing number of PDF files the used subsetted fonts which is basically a custom encoding. Normally the font descriptor in the PDF should have a ToUnicode table to allow the text extraction to decode the font encoding and return the correct text.
Some PDF producers are doing this on purpose to prevent easy PDF text extraction for things such as financial reports. If there is only one font then you could manually decode the font but in my experience I have seen PDF's with multiple random encodings which makes it nearly impossible to decode automatically.
One way to test for these types of PDF's is to open the file in Acrobat, select some text, copy it and then paste it into Notepad. If the text is garbled then the PDF is using a subsetted font and there is not much more you can do. If Acrobat can't extract the text correctly then nothing else can. It may as well be a page of hieroglyphs.
I want to replace the font embedded in an existing PDF file programmatically (with iText).
iText itself does not seem to provide any data model for glyphs and fonts, but I believe it can let me retrieve and update the binary stream that contains the font.
It's OK even if I don't know which glyph is associated to which font - what I want to do is just to replace them. To be precise, I want to embold all glyphs in a PDF document.
Replacing fonts in rendering time is not an option because the output must be PDF with all information preserved as is.
Is there anyone who has done this before with iText or any other PDF libraries?
PDF files define a set of fonts (ie F0, F1, F2) and then define these separately so you could theoretically rewrite the entry for F0. You would have to ensure the 2 fonts have the same spacing (or you will have to rewrite the PDF as well), and probably hack the PDF manually.