How to handle PDF font fallback perfectly when generating it? - pdf

I'm currently working on a project that can convert HTML canvas to PDF, user can select the font and draw the text in the canvas and export as a PDF(vector), but there's a problem that user can enter other language text that the font doesn't really support it. It's shown fine in the canvas because the browser did the font fallback mechanism maybe to grab the system font as a fallback, but in the exported PDF it's all corrupted. I've embedded the font in the PDF but the font doesn't have corresponding glyph, and the PDF reader like adobe doesn't have font fallback mechanism so it all become .nodef
I have two ideas but that aren't really satisfying.
1. Collect all glyph from each sentence and create a new font
Walk through each char and check if current font has corresponding glyph, if so, adding it to the new font list, if not, using an alternative font from the font stack #1 as the fallback to get the glyph and adding it to the new font list, then finally converting it as a new font and embedding it in the new PDF.
It seems good but in reality the performance of generating new font is terrible.
(I was using Opentype.js to load and write a new font, when we exported the font by using toArrayBuffer method, it took 10 mins for 6,000 words)
#1, Font stack is a stack like ['Crimson Text', 'Pt Sans', 'Noto Sans'], if the first font can't find corresponding glyph then go next until the end we give up.
2. If encountered any missing char, change the font-family of that sentence to Arial Unicode MS or Noto
It's pretty simple but it converts every word in the sentence to Arial Unicode MS or Noto, besides, it's hard to find a good font that contains most of language's glyph and we can't use font stack mechanism because we only can use one font in one sentence.
My goal is to make the exported PDF similar with the canvas that user drew, hoping someone can give me some direction 😥, many thanks

The usual solution would be to embed all four fonts in your stack plus noto (all suitably subsetted, preferably), and switch between them mid-word as required.
Building a new frankenfont from the fonts as you suggest is not required, though I admire the ambition!

Related

How to detect Font Family in PDF?

I have a pdf which contains many fonts and What is the best way to check whether it contains font that belongs to Arial font family ?
Is this possible in any language?
I couldn't find any library or language that could do this.
So, I tried by converting pdf to image using ImageMagick and segmented all alphabets present in the image(pdf).And then I tried to compare all segmented alphabets to segmented images of arial font family alphabets which worked fine.
I created all datasets using MS Word.But arial font family looks different in different editors.By 'looks different', I mean the segmented image of same alphabet has different pixel values in different editors.And also alphabet of 10pt size has different dimensions in different editors.So, this method doesn't work.
Any Suggestions on how to do this? May be using svg file or ps file
I also came to learn that, In pdf's alphabets are rendered using Bezier curves, where each bezier curve is drawn using some control points and nodes.
Are these control points, same for all alphabets that belong to one font family? If Yes, how to extract control points of alphabets in pdf as these can be used to detect font family.
There can be three types of text in your document:
text that isn't real text, but part of a raster image,
vector text drawn by PDF syntax without using a real font,
vector text using a real font.
The answer to your question depends on the type of text you're confronted with:
There is no way to extract font information if the text is not real text, but part of a raster image. You need an OCR tool to convert the pixels into characters, but you won't get any info about the font family. You could try to compare pixels, but you've already tried that and you've discovered that this isn't trivial (one might consider your current solution as a bad workaround / bad design).
You describe text that is drawn on a page using Bézier curves. Although, it's possible to draw text like this, you won't find many PDFs that are drawn like this. The reason is obvious: every time you'd need a specific glyph, let's say A, you would need to add the syntax to draw that glyph on the page, leading to plenty of redundant PDF syntax.
PDFs usually work with fonts. A font is stored in a PDF file using a font dictionary. The syntax that makes up a page refers to that font using a name that can be chosen by the PDF producer, but that corresponds with an entry in the page resources that contains a reference to the font dictionary. Each font has an encoding mapping characters to glyphs. In the page content we use characters, based on these characters, glyphs will be selected in the font.
You are asking about the font family. This information is stored in the font dictionary. Take a look at my answer to the question What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp and you'll get an idea of what such a font dictionary looks like.
Do you see the /BaseFont entry in the font dictionary? It has values such as JOJJAH+TT116t00. In this case, the name of the font is "TT116t00", but what is "JOJJAH"? That is explained in my answer to the question What are the extra characters in the font name of my PDF?
Not all fonts are embedded. Sometimes the name of the font is sufficient for the viewer to know what the glyphs look like. For instance: there are 14 Standard Type 1 fonts that every viewer should be able to render.
Arial isn't one of those fonts, so if you want to be sure that Arial is rendered correctly, that font needs to be embedded. The font dictionary will refer to a Font Descriptor where you'll find the syntax to draw glyphs using linear paths, Bézier curves, etc. Suppose that you need the character A, then the font descriptor will contain some syntax that knows how to draw that character. The font dictionary will also have a map that maps the character A to the glyph A. Now when you need that glyph in your content, you can just use the character A and that will refer to the syntax that draws the glyph A. That syntax is stored inside the PDF only once.
Suppose that a PDF has the full Arial font embedded, then the value of /BaseFont would be Arial. However, if we'd embed the full Arial font, the PDF would be bloated. There are way too many characters in Arial; we don't need them all. That's why we'll only embed one or more subsets. When you see 6 characters followed by a + sign in the /BaseFont entry, you have discovered a font subset.
Getting the /BaseFont entry of a font dictionary can be done using different libraries. On the official iText site, we have different Q&As that explain how to Inspect a PDF. There's also an example that lists the fonts used in a PDF. Maybe that can be helpful.
NOTE: as explained in the help section, more specifically on the page What topics can I ask about here?, you will find rule #4: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam.
I have given you the general information about where to find font information inside a PDF, but it's not allowed for you to ask questions to recommend the best tool to do this. Sorry for that.

How to convert italic font to normal font in pdf using some library?

Is there any way to convert Italic font, Bold font in my pdf to normal font using some library like Imagemagick or GhostScript etc. ?
Basically the answer is 'no' though there are several levels of caveat in there.
The most common scenario for a PDF file is that it contains an embedded font, and that font is subset. In this case the font will use a custom Encoding, so that when you see 'Hello' on your monitor, the actual character codes might be 'Axtte' or similar gibberish. If the font also contain a ToUnicode table you could, technically, create an embedded subset of the regular font from the same family as the bold or italic and embed that, and it would work. This would be an immense amount of work.
If the font isn't subset then it may not contain a custom Encoding, which would make that task easier, because you wouldn't have to re-encode the replacement.
If the font isn't embedded, then you need only change the font name in the Font object, because the PDF consumer will have to find a substitute anyway.
Note that, because PDF is a binary format, with an index (xref) containing the offset of every object in the file, any changes will mean that the xref table has to be reconstructed, again a considerable task.
I'm not aware of any tools which would do any of this for you automatically, you'd have to write your own, though some things could be done automatically. MuPDF for example will 'fix' a PDF file which has an incorrect xref table for you.
And even after all that, the likelihood is that the spacing would be different for the italic or bold font compared to the regular font anyway, and would look peculiar if you replaced them with a regular font.
So, fundamentally, no.
In low-level PDF you can apply some rendering flags in front of a text stream. Like the "Rendering Mode" Tr operation. For instance, in this scenario you can include the rendering of text outline and increase outline drawing width with the command sequence 0.4 w 2 Tr which will cause Normal text to become more "bold" (There are other better ways to accomplish this using the Font Description dictionary). However, one can also employ this tactic to slim down bold text using a clipped thicker outline, but this may not be ideal.
As for italic, most fonts contain a metric indicating their italic angle, and you can use this to add a faux italic using a shear CTM transformation matrix with the cm operation. Once again, this may work better to add an italic shear, but may also have some success in removing it.
See the PDF Reference.
This will require a library with lower level PDF building and you would have to do it manually, but it is possible technically.

extracting italic word from PDF using iText [duplicate]

I have an application, that extracts headings out of pdf files. The documents, that the application is supposed to work with, all have more or less coherent structure and formatting, in fact, telling if a text chunk is bold or not, is very important. Recently I came across a bunch of files, where some chunks visually appear bold, but do not have "bold" piece in string representation of font. The following SO thread how can i get text formatting with iTextSharp helped me to understand, that there is one more way of making text appear bold. However in my case calling GetTextRenderMode() does not help either, as it returns 0 as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect it using iTextSharp ?
You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtye /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.

What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp

I have an application, that extracts headings out of pdf files. The documents, that the application is supposed to work with, all have more or less coherent structure and formatting, in fact, telling if a text chunk is bold or not, is very important. Recently I came across a bunch of files, where some chunks visually appear bold, but do not have "bold" piece in string representation of font. The following SO thread how can i get text formatting with iTextSharp helped me to understand, that there is one more way of making text appear bold. However in my case calling GetTextRenderMode() does not help either, as it returns 0 as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect it using iTextSharp ?
You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtye /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.

Possible to edit PDF without embedded font installed?

I have a PDF and need to edit its text.
It has an embedded font and I'm not able to find the font to install. Is it possible to edit the text and maintain its embedded font when I don't have the font installed?
I'm editing with Acrobat X and its warning me that I don't have the font and forcing me to change the font of the text I want to edit to one that I have installed. I've Googled for a couple hours and found the font family, but not the variation that's embedded. Because the font is already in the document, I would have thought Acrobat or another software program could tap it to allow me to edit, though I'm guessing I'm missing something.
It depends...
1.
Most likely, your embedded font is a subset of the full font. That means that it contains only these glyphs (shapes that represent printable characters) which were required as representations for characters used by the original PDF.
If your edit wants to insert a character that wasn't present in the original PDF, Acrobat (or any other editing method) has to use a different font. The font embedded in the PDF simply doesn't have glyph that is suitable for your edit!
Also, your subsetted font's name is not exactly like the full set font's name: it uses has a randomly composed 6 uppercase character prefix with a +-sign to build the used font name, like ABCXYZ+Arial.
You could employ the free (as in beer) Acrobat Plugin FontReporter to generate a list of all glyphs contained in your font.
You can also use this answer:
"How can I extract embedded fonts from a PDF as valid font files?"
to extract the font in question.
Then open the extracted font (which will be the subset variant, mind you!) in FontForgehttp://fontforge.github.io/en-US/), and there check two things:
the original font's license
the original font's creator
With that knowledge you could then definitely find the variant of the family that is embedded (or rather: that was present when the PDF generating and font embedding software created the subset).
2.
If the (subsetted) font you are looking at uses a so called "custom encoding", it may be almost impossible to edit the PDF file with standard tools (even if you have the same font locally installed that was used to create the PDF file).