Scraping Text from PDF with underlines and strikethroughs

I have a PDF that contains many underlines and strikethroughs in the text. I would like to be able to convert this PDF to HTML. I have tried many different tools, and all of them will sometimes catch the underlines and strikethroughs as text formatting, and at other times will convert the underlines and strikethroughs to graphics, which is (as far as I can tell) useless to me.
I would really like to know how these programs differentiate between underlines that format text and underlines that are converted to graphics, and how I might be able to access the document and capture everything as text formatting.
I may be taking the wrong approach here and am open to any possible solutions; I think I just need to be pointed in the right direction.
Thank you in advance for any assistance.

PDF has no notion of underlined or struck-through text; there are only lines drawn on top of, or just below, the text.
PDF tools that detect underlines and strikethroughs usually look for a line drawing that is close enough to the text, or apply some similar heuristic, and then add the corresponding style to the text output when converting to another format. However, this kind of approach will never work in 100% of cases, which is why the same tool sometimes emits text formatting and sometimes leaves the line as a graphic.
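To make the "close enough" heuristic concrete, here is a minimal sketch in Python. It is not how any particular tool does it. The `TextBox` and `HLine` types, the 80% overlap requirement and the offset thresholds are all assumptions made up for illustration; a real converter would take these positions from its own PDF parser and tune the thresholds per font.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBox:
    """Hypothetical text run, in PDF user space (y grows upward)."""
    text: str
    x0: float         # left edge
    x1: float         # right edge
    baseline: float   # y coordinate of the text baseline
    font_size: float


@dataclass
class HLine:
    """Hypothetical horizontal line segment found in the page content."""
    x0: float
    x1: float
    y: float


def classify_line(box: TextBox, line: HLine,
                  min_overlap: float = 0.8) -> Optional[str]:
    """Guess whether `line` decorates `box`.

    Returns 'underline', 'strikethrough', or None. All thresholds are
    illustrative assumptions, not values taken from any PDF library.
    """
    width = box.x1 - box.x0
    if width <= 0 or box.font_size <= 0:
        return None

    # The line must cover most of the text horizontally.
    overlap = min(box.x1, line.x1) - max(box.x0, line.x0)
    if overlap / width < min_overlap:
        return None

    # Vertical offset from the baseline, as a fraction of the font size.
    offset = (line.y - box.baseline) / box.font_size
    if -0.25 <= offset <= 0.0:
        return "underline"       # at or slightly below the baseline
    if 0.2 <= offset <= 0.45:
        return "strikethrough"   # roughly through the middle of lowercase letters
    return None


box = TextBox("hello", x0=100.0, x1=150.0, baseline=700.0, font_size=12.0)
print(classify_line(box, HLine(100.0, 150.0, 698.5)))  # line just below the baseline
print(classify_line(box, HLine(100.0, 150.0, 703.6)))  # line through the letters
```

Any line that falls outside these windows (a table rule, a box border, an underline drawn as a filled rectangle rather than a stroke) stays a graphic, which matches the inconsistent behaviour described in the question.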

Related

How to get the underlined text from PDF file?

Hi everyone!
I am trying to extract underlined text from a PDF file with iText, and it is proving very difficult. I have searched for a solution for a long time and have learned how to get the text's font family, font size and location, but not whether it is underlined.
Looking forward to your help!
Thank you!

It might not be possible with iText, but you can achieve this with PDFBox to some extent.
Look at this: https://stackoverflow.com/a/40039407/4353762
But beware that it might not work in some cases: the library needs to know the font and the font's descriptor. If you give it a PDF with an unknown font type, the descriptor will be null and the code will simply break with a NullPointerException.
If you want to handle the NullPointerException yourself, look at the underlines and strikeThrough methods of PDFStyledTextStripper.java.

What are the ways of checking whether a piece of text in a PDF document is bold using iTextSharp

I have an application that extracts headings from PDF files. The documents it is supposed to work with all have a more or less coherent structure and formatting, so telling whether a text chunk is bold is very important. Recently I came across a bunch of files where some chunks visually appear bold but do not have "bold" in the string representation of the font. The SO thread "how can i get text formatting with iTextSharp" helped me understand that there is one more way of making text appear bold. However, in my case calling GetTextRenderMode() does not help either, as it returns 0, as if it were normal text. So are there any other ways of making text appear bold, and is it possible to detect them using iTextSharp?

You are making the assumption that the font inside your PDF file knows if it's bold or not. Let's take a look inside and check if your assumption is correct.
This is what the subset JOJJAH of the font TT116t00 looks like when you look at the internals of the PDF file you have shared:
We see that the font is of subtype /TrueType, that the /ItalicAngle is 0, and... that the 3rd bit of the /Flags is set. Let's check the PDF reference to find out what this tells us:
I quote:
The font contains glyphs outside the Adobe standard Latin character set.
The glyphs look bold, because the glyphs are drawn in a way that they appear bold. You see the font as bold because you are human. However, when a machine looks at the font, it doesn't have a clue that the font is bold. A machine just follows the instructions stored in the /FontFile2 stream.
In short: iTextSharp doesn't have any indications that the font is bold.
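As a rough illustration of what reading the /Flags entry involves, the sketch below decodes the integer into the flag names defined for font descriptors in the PDF reference. Bit positions are 1-based, as in the reference. The helper name `decode_font_flags` and the sample value 4 (only the 3rd bit set) are assumptions for illustration; the actual /Flags value of the font in question is not reproduced here.

```python
from typing import List

# Font descriptor flag bits as listed in the PDF reference (1-based positions).
FONT_FLAGS = {
    1: "FixedPitch",
    2: "Serif",
    3: "Symbolic",
    4: "Script",
    6: "Nonsymbolic",
    7: "Italic",
    17: "AllCap",
    18: "SmallCap",
    19: "ForceBold",
}


def decode_font_flags(flags: int) -> List[str]:
    """Return the names of all flag bits set in a /Flags integer."""
    return [name for bit, name in FONT_FLAGS.items()
            if flags & (1 << (bit - 1))]


# Hypothetical value with only the 3rd bit set, as described above.
print(decode_font_flags(4))  # ['Symbolic']
```

Note that even bit 19 (ForceBold) is only a rendering hint for small text sizes, so its absence does not prove a font is not bold and its presence is not a dependable "this is bold" marker. When neither the font name nor the flags help, the boldness lives only in the glyph outlines.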

Converting pdf to vector image

I'm trying to use PDF content (mathematics) in my webpage. I basically want to convert the PDF to some vector image. Converting the PDF to SWF does the job very well, but as Flash isn't supported on every platform, I'm trying to find another solution.
I read about SVG, but as those PDFs contain a lot of mathematics, the output of the converters I found is really ugly and incorrect.
I've also thought about retyping the content in LaTeX and displaying it using MathJax. In some ways this is the best solution, but it is also very time consuming.
The only thing I want is to convert it to a nice vector image; I don't want to change the content or anything else. Besides converting to SWF or retyping it, is there any other solution?
Edit:
this is svg output
and here original pdf

The only solution I could find is Illustrator.
Just open the pdf, save as svg, and choose to embed all used glyphs.
Result is perfect:
https://dl.dropboxusercontent.com/u/58922976/Sol-10.1.svg

If Flash mostly works for you, what about using Flash and falling back to a raster image on platforms without Flash?

Your PDF is a little difficult for reasons that are probably not apparent to you.
The core problem is that some of the graphics in the document are actually drawn using custom glyphs. You can see this if you copy and paste the text out of Acrobat: there are a variety of unusual characters in there that don't seem to serve any useful purpose. Those are the squares at the bottom of your SVG with EEs and FFs in them.
However these characters are actually custom glyphs for things like the braces around the matrices at the bottom of the page. So they are both fairly important and also very specific to this document.
I tried ABCpdf .NET to convert your PDF to SVG. It worked fine apart from these custom glyphs at the bottom. The output was about 90KB. It looked very similar to your Inkscape SVG output but was a bit smaller (the Inkscape one is 160KB).
The only way to get rid of these non-Unicode glyphs is to vectorize the text. I did this using ABCpdf and the output looked fine in SVG. But... vectorized text is big and SVG isn't a particularly efficient medium. The output was about 1MB! Zipped, it goes down to half that, but it's still nowhere near as efficient as the original PDF.
The problems I am seeing here are going to be universal whatever format you use. These custom characters are always going to be problematic whether you output to SVG, SWF, HTML canvas, VML or indeed any vector format.
So what would I suggest? Well the obvious vector format that is widely used on the web is... PDF!
I know it's not quite what you're looking for but I think this is the realistic solution given the constraints above. :-)

How to "mask" certain text in a PDF document

I have a PDF document and I want to mask certain text blocks. The reason is that I don't want this text to be indexed, nor do I want the information to be easily accessible by selecting and copying the text block.
What would be the right way to do this?
I guess converting the text to a raster image would be a bad idea, and I don't know whether there is a tool that can apply special restrictions to only certain parts of the text.

You will need a program that can convert a font into a series of shapes.
Illustrator may have the functionality you want: see here and here.

PDF Text Direction

How is text direction for right-to-left languages, like Arabic, encoded in PDF? My understanding is that since PDF is fundamentally a graphical format, the concept of text direction doesn't really need to be encoded. Rather, the glyphs simply need to be painted on-screen from right to left. However, the PDF reference manual mentions an attribute called WritingMode, where you can specify combinations of left-to-right, right-to-left, top-to-bottom and bottom-to-top.
So my questions are:
(1) If my understanding is correct, and RTL or LTR is merely expressed by the way the glyphs are painted on-screen, what is the point of the WritingMode attribute?
(2) If there is no actual directionality information encoded in the PDF file, other than the order the glyphs are painted, how does a PDF-to-Text program know if a given line is supposed to be read right-to-left or left-to-right? (I suppose the PDF program could just check if the Unicode codepoints extracted from a ToUnicode map fall into a range that corresponds to an RTL language.)

WritingMode is only for Tagged PDF, if I'm reading the spec correctly. If a PDF doesn't contain the appropriate logical structure, you don't get WritingMode.
The general answer, as I understand it, is "it depends". In R-L writing, you probably have the text advance info encoded in the font and a single text placement will advance the text to the right place. I say 'probably' because it might be that the actual generation software ignores this and places each glyph on its own, regardless of the text advance in the font. Then you get fun languages like Arabic and Hebrew which aren't strictly R-L, as numbers are still L-R within a R-L line.
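The check suggested in the question (looking at the Unicode code points recovered from a ToUnicode map) can be sketched with Python's standard `unicodedata` module. This is only an illustration of that idea, not what any given PDF-to-text tool does; the function name `guess_direction` and the simple majority vote are assumptions.

```python
import unicodedata


def guess_direction(text: str) -> str:
    """Guess the base direction of already-extracted text.

    Counts characters with a strong bidirectional class:
    'R' / 'AL' (Hebrew, Arabic letters) versus 'L' (Latin and others).
    Digits and punctuation are weak or neutral and are ignored.
    """
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            rtl += 1
        elif bidi == "L":
            ltr += 1
    if rtl == 0 and ltr == 0:
        return "neutral"
    return "rtl" if rtl > ltr else "ltr"


print(guess_direction("hello"))       # ltr
print(guess_direction("مرحبا 123"))   # rtl: the digits do not count either way
```

This only tells you how the line should be read. It says nothing about the order in which the glyphs were painted, so an extractor still has to sort glyphs by position before applying it, and embedded left-to-right runs such as numbers need proper bidirectional handling.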

Text direction will be set in the Trm (the text rendering matrix).
