Im programming an app that generates PDFs for several situations. In some of them there are greek letters that should be displayed. My Problem is that including a font like arial that provides these chars adds some mb to my pdfs because the whole font is included in the pdf.
Is there a way to include just the chars needed or to generate a "new" font that only includes the chars needed as another font?
The feature you are looking for is font subsetting. This is normally a function of the code making the pdf.
Related
I'm currently working on a project that can convert HTML canvas to PDF, user can select the font and draw the text in the canvas and export as a PDF(vector), but there's a problem that user can enter other language text that the font doesn't really support it. It's shown fine in the canvas because the browser did the font fallback mechanism maybe to grab the system font as a fallback, but in the exported PDF it's all corrupted. I've embedded the font in the PDF but the font doesn't have corresponding glyph, and the PDF reader like adobe doesn't have font fallback mechanism so it all become .nodef
I have two ideas but that aren't really satisfying.
1. Collect all glyph from each sentence and create a new font
Walk through each char and check if current font has corresponding glyph, if so, adding it to the new font list, if not, using an alternative font from the font stack #1 as the fallback to get the glyph and adding it to the new font list, then finally converting it as a new font and embedding it in the new PDF.
It seems good but in reality the performance of generating new font is terrible.
(I was using Opentype.js to load and write a new font, when we exported the font by using toArrayBuffer method, it took 10 mins for 6,000 words)
#1, Font stack is a stack like ['Crimson Text', 'Pt Sans', 'Noto Sans'], if the first font can't find corresponding glyph then go next until the end we give up.
2. If encountered any missing char, change the font-family of that sentence to Arial Unicode MS or Noto
It's pretty simple but it converts every word in the sentence to Arial Unicode MS or Noto, besides, it's hard to find a good font that contains most of language's glyph and we can't use font stack mechanism because we only can use one font in one sentence.
My goal is to make the exported PDF similar with the canvas that user drew, hoping someone can give me some direction 😥, many thanks
The usual solution would be to embed all four fonts in your stack plus noto (all suitably subsetted, preferably), and switch between them mid-word as required.
Building a new frankenfont from the fonts as you suggest is not required, though I admire the ambition!
I have an application, that extracts headings out of pdf files. The documents, that the application is supposed to work with, all have more or less coherent structure and formatting, in fact, telling if a text chunk is bold or not, is very important. Recently I came across a bunch of files, where some chunks visually appear bold, but do not have "bold" piece in string representation of font. The following SO thread how can i get text formatting with iTextSharp helped me to understand, that there is one more way of making text appear bold. However in my case calling GetTextRenderMode() does not help either, as it returns 0 as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect it using iTextSharp ?
You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtye /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.
I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by #KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.
Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :)
and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)
I have PDF files which are dynamically generated, with text, vectors, and subsetted fonts. I can see which fonts are used in various viewers - is there a way of displaying the actual subsetted characters of those fonts?
For example, I see the document contains the subsetted subsetted fonts "AAAAAC+FreeMono" and "AAAAAD+DejaVuSans". How do I find how many characters were subsetted from these fonts, and what characters they were?
(I tried loading the fonts in FontForge, but it just crashes while opening the file)
The solution is to save the font data to a file and load it into a font editor. A subset font file is still a valid font file but it is possible that FontForge expects some data in the font that is not there. I have seen also many fonts that are not properly subset and this could also cause loading problems in a font editor.
I am extracting text from a PDF file using PdfBox. When the PDF does not contain any embedded fonts everything works fine. The problem occurs when there are some TrueType embedded fonts. I discovered that in same cases the embedded fonts replace the shape of default characters with some other shapes. For example a char code for 'ï' is used to encode 'ł'. I am aware that I cannot get the real shape of the character without any mapping or OCR. I would like to know which characters might be redefined by the embedded characters. My question is how can I know which characters in the PDF stream are defined by the embedded fonts?