extracting italic word from PDF using iText [duplicate] - pdf

I have an application, that extracts headings out of pdf files. The documents, that the application is supposed to work with, all have more or less coherent structure and formatting, in fact, telling if a text chunk is bold or not, is very important. Recently I came across a bunch of files, where some chunks visually appear bold, but do not have "bold" piece in string representation of font. The following SO thread how can i get text formatting with iTextSharp helped me to understand, that there is one more way of making text appear bold. However in my case calling GetTextRenderMode() does not help either, as it returns 0 as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect it using iTextSharp ?

You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtye /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.

Related

How to detect Font Family in PDF?

I have a pdf which contains many fonts and What is the best way to check whether it contains font that belongs to Arial font family ?
Is this possible in any language?
I couldn't find any library or language that could do this.
So, I tried by converting pdf to image using ImageMagick and segmented all alphabets present in the image(pdf).And then I tried to compare all segmented alphabets to segmented images of arial font family alphabets which worked fine.
I created all datasets using MS Word.But arial font family looks different in different editors.By 'looks different', I mean the segmented image of same alphabet has different pixel values in different editors.And also alphabet of 10pt size has different dimensions in different editors.So, this method doesn't work.
Any Suggestions on how to do this? May be using svg file or ps file
I also came to learn that, In pdf's alphabets are rendered using Bezier curves, where each bezier curve is drawn using some control points and nodes.
Are these control points, same for all alphabets that belong to one font family? If Yes, how to extract control points of alphabets in pdf as these can be used to detect font family.
There can be three types of text in your document:
text that isn't real text, but part of a raster image,
vector text drawn by PDF syntax without using a real font,
vector text using a real font.
The answer to your question depends on the type of text you're confronted with:
There is no way to extract font information if the text is not real text, but part of a raster image. You need an OCR tool to convert the pixels into characters, but you won't get any info about the font family. You could try to compare pixels, but you've already tried that and you've discovered that this isn't trivial (one might consider your current solution as a bad workaround / bad design).
You describe text that is drawn on a page using Bézier curves. Although, it's possible to draw text like this, you won't find many PDFs that are drawn like this. The reason is obvious: every time you'd need a specific glyph, let's say A, you would need to add the syntax to draw that glyph on the page, leading to plenty of redundant PDF syntax.
PDFs usually work with fonts. A font is stored in a PDF file using a font dictionary. The syntax that makes up a page refers to that font using a name that can be chosen by the PDF producer, but that corresponds with an entry in the page resources that contains a reference to the font dictionary. Each font has an encoding mapping characters to glyphs. In the page content we use characters, based on these characters, glyphs will be selected in the font.
You are asking about the font family. This information is stored in the font dictionary. Take a look at my answer to the question What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp and you'll get an idea of what such a font dictionary looks like.
Do you see the /BaseFont entry in the font dictionary? It has values such as JOJJAH+TT116t00. In this case, the name of the font is "TT116t00", but what is "JOJJAH"? That is explained in my answer to the question What are the extra characters in the font name of my PDF?
Not all fonts are embedded. Sometimes the name of the font is sufficient for the viewer to know what the glyphs look like. For instance: there are 14 Standard Type 1 fonts that every viewer should be able to render.
Arial isn't one of those fonts, so if you want to be sure that Arial is rendered correctly, that font needs to be embedded. The font dictionary will refer to a Font Descriptor where you'll find the syntax to draw glyphs using linear paths, Bézier curves, etc. Suppose that you need the character A, then the font descriptor will contain some syntax that knows how to draw that character. The font dictionary will also have a map that maps the character A to the glyph A. Now when you need that glyph in your content, you can just use the character A and that will refer to the syntax that draws the glyph A. That syntax is stored inside the PDF only once.
Suppose that a PDF has the full Arial font embedded, then the value of /BaseFont would be Arial. However, if we'd embed the full Arial font, the PDF would be bloated. There are way too many characters in Arial; we don't need them all. That's why we'll only embed one or more subsets. When you see 6 characters followed by a + sign in the /BaseFont entry, you have discovered a font subset.
Getting the /BaseFont entry of a font dictionary can be done using different libraries. On the official iText site, we have different Q&As that explain how to Inspect a PDF. There's also an example that lists the fonts used in a PDF. Maybe that can be helpful.
NOTE: as explained in the help section, more specifically on the page What topics can I ask about here?, you will find rule #4: Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam.
I have given you the general information about where to find font information inside a PDF, but it's not allowed for you to ask questions to recommend the best tool to do this. Sorry for that.

How to convert italic font to normal font in pdf using some library?

Is there any way to convert Italic font, Bold font in my pdf to normal font using some library like Imagemagick or GhostScript etc. ?
Basically the answer is 'no' though there are several levels of caveat in there.
The most common scenario for a PDF file is that it contains an embedded font, and that font is subset. In this case the font will use a custom Encoding, so that when you see 'Hello' on your monitor, the actual character codes might be 'Axtte' or similar gibberish. If the font also contain a ToUnicode table you could, technically, create an embedded subset of the regular font from the same family as the bold or italic and embed that, and it would work. This would be an immense amount of work.
If the font isn't subset then it may not contain a custom Encoding, which would make that task easier, because you wouldn't have to re-encode the replacement.
If the font isn't embedded, then you need only change the font name in the Font object, because the PDF consumer will have to find a substitute anyway.
Note that, because PDF is a binary format, with an index (xref) containing the offset of every object in the file, any changes will mean that the xref table has to be reconstructed, again a considerable task.
I'm not aware of any tools which would do any of this for you automatically, you'd have to write your own, though some things could be done automatically. MuPDF for example will 'fix' a PDF file which has an incorrect xref table for you.
And even after all that, the likelihood is that the spacing would be different for the italic or bold font compared to the regular font anyway, and would look peculiar if you replaced them with a regular font.
So, fundamentally, no.
In low-level PDF you can apply some rendering flags in front of a text stream. Like the "Rendering Mode" Tr operation. For instance, in this scenario you can include the rendering of text outline and increase outline drawing width with the command sequence 0.4 w 2 Tr which will cause Normal text to become more "bold" (There are other better ways to accomplish this using the Font Description dictionary). However, one can also employ this tactic to slim down bold text using a clipped thicker outline, but this may not be ideal.
As for italic, most fonts contain a metric indicating their italic angle, and you can use this to add a faux italic using a shear CTM transformation matrix with the cm operation. Once again, this may work better to add an italic shear, but may also have some success in removing it.
See the PDF Reference.
This will require a library with lower level PDF building and you would have to do it manually, but it is possible technically.

Copy text from PDF with custom FONT

I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
After looking at the file, and confirming the answer already given by #KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.
Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :)
and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)

What are the ways of checking if piece of text in PDF documernt is bold using iTextSharp

I have an application, that extracts headings out of pdf files. The documents, that the application is supposed to work with, all have more or less coherent structure and formatting, in fact, telling if a text chunk is bold or not, is very important. Recently I came across a bunch of files, where some chunks visually appear bold, but do not have "bold" piece in string representation of font. The following SO thread how can i get text formatting with iTextSharp helped me to understand, that there is one more way of making text appear bold. However in my case calling GetTextRenderMode() does not help either, as it returns 0 as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect it using iTextSharp ?
You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtye /TrueType, we see that the /ItalicAngle is 0, and... we see that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.

How to bold a text in PDF?

I'm developing a new function to "my" program. This function is able to write PDF files by the simple way, making a simple text file with some codes of PDF standard.
I'm trying to understand how it works yet, but my first problem is about how to apply bold on some line of my document.
I've already downloaded the PDF REFERENCES GUIDE, but I've not found nothing about it.
Any idea?
PDF is not like HTML where you can apply formatting tags for emphasis. As you've read in the PDF reference, all that you do in PDF is to setup a graphics environment (colours used, fonts used, etc) and then put text on the page.
If you want to have something show in bold, use a font that is bold. If you want to have something show in italic, use a font that is italic.
Older software used dirty tricks to create "bold-alike" text, but the good (and easy) way to do it is to make sure you select the correct font before you start drawing text.