PDF iText TOC generation

I have to merge multiple PDF documents into a single PDF document. Besides this, I have to generate a TOC. The original documents will contain text with a specific style (say H1); this special text becomes part of the TOC.
I have used iText to merge the multiple PDF files, but I am unable to find an example/API for parsing the document to find all the content that has style H1.
Generating the TOC is the next challenge.

You don't. PDFs don't have styles. They have a "current graphics state", which includes:
the current transformation matrix (CTM)
stroke & fill colors
the clipping path
font & size
gobs of other text state stuff (char spacing, word spacing, leading, text render mode...)
There is also a separate text transformation matrix, which is combined with the CTM.
So first you have to track all this stuff (which iText can mostly do for you). Then you have to determine how big "H1" text is and latch on to all the text rendered at that on-screen size, taking the CTM, text matrix, and font size into account (which, IIRC, iText will also do for you).
And just to make life more exciting for folks like yourself, it's entirely possible that the text you're looking at isn't text at all. It could be paths, or a bitmap... at which point you need OCR, and I don't think you'll get much in the way of size info with OCR.
You'll need to write a render listener (iText's RenderListener interface) that determines the final size of a given piece of text (and whether or not it's part of the last piece) and filters out all the text that's too small. You'll then build your TOC from the text you find.
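As a starting point, here is a hedged sketch using iText 5's parser API. The 18 pt threshold, the class name, and the file name are assumptions you'd adapt to your documents:

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.*;

    public class HeadingCollector implements RenderListener {
        private static final float H1_SIZE = 18f; // assumed threshold; tune for your documents

        @Override
        public void renderText(TextRenderInfo info) {
            // Effective rendered size: vertical distance between the ascent
            // and descent lines, which iText has already pushed through the
            // CTM and text matrix.
            float size = info.getAscentLine().getStartPoint().get(Vector.I2)
                       - info.getDescentLine().getStartPoint().get(Vector.I2);
            if (size >= H1_SIZE) {
                System.out.println("Possible heading: " + info.getText());
            }
        }

        @Override public void beginTextBlock() {}
        @Override public void endTextBlock() {}
        @Override public void renderImage(ImageRenderInfo info) {}

        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("merged.pdf"); // placeholder name
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                parser.processContent(page, new HeadingCollector());
            }
            reader.close();
        }
    }

A real implementation would also need to merge consecutive fragments belonging to the same heading (the "part of the last piece" problem above) and record the page number of each hit for the TOC.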

qpdf - replace text in existing PDF file

This is the first time I'm working with PDFs at this level, so please be patient with my noob question. I understand the logical and physical structure of a PDF file at a basic level.
I have a PDF that contains a dummy ID that needs to be replaced. To check whether there is a way to do this, I used qpdf to expand the PDF using
qpdf --qdf --object-streams=disable orig.pdf expanded.pdf
Using a hex editor I located the dummy ID in expanded.pdf and changed the value by
simply swapping two digits
<001800180017> Tj => <001700170018> Tj
and saved it. Opening expanded.pdf in Acrobat didn't show the modification. The original
ID 443 is still rendered, but searching for "443" doesn't find it. When searching for
"334", the modified content, I get the rendered original ID 443 highlighted.
The PDF consists of text and vector graphics. When I insert additional digits (which obviously invalidates the offsets in the xref), I get an error message about a missing font and all digits are shown as dots, but the vector graphics are still in place. This seems to indicate that the ID is not part of the graphics.
What did I miss?
EDIT 1:
After mkl's comment, I did a deeper analysis of my PDF and found that, besides the obvious graphic content, all text was rendered by a series of m/l/c commands followed by a BT/ET section. The stroke and non-stroke colors in the BT/ET section were both 0,0,0.
Is this because of the embedded non-standard font that is used?
Are PDFs with embedded fonts usually done this way? A graphics part for the visual representation and a transparent (hidden) text part just to get searching and highlighting capabilities?
Looking back, I wonder what I did to get the dots when I first modified the content. It seems impossible, and I can't reproduce it either.
Thanks
Tom
First off, the following is merely guesswork as you could not share the pdf in question. Educated guesswork but guesswork nonetheless.
You report that you changed the value by simply swapping two digits in the text drawing instruction argument and now can successfully search for the value with swapped digits but that Acrobat didn't show the modification.
Furthermore you observed that all text was rendered by a series of m/l/c commands followed by a BT/ET section.
The main situation in which one observes text being rendered as arbitrary vector graphics (a series of m/l/c commands), is in pdfs in which the producer didn't want text extraction to be possible and replaced text drawing instructions by arbitrary vector graphics instructions.
This apparently is not the case in your pdf as the text drawing instructions are not replaced but merely supplemented by the vector graphics ones.
Supposing that this construct is used for a reason and not by accident, I can only assume that the pdf producer was not willing or allowed to embed the font in question but wanted the specific font appearance to be displayed without having to count on the font being installed on the computer the pdf is viewed on.
Thus, the text appearance is drawn using arbitrary vector graphics instructions and the following text drawing instructions actually draw nothing but merely make the text searchable and extractable. This way there is no need to embed the apparent font face as font program. (Text drawing instructions can be made to draw nothing either by using a font with all blank glyphs or by using the text rendering mode "invisible".)
If this assumption turns out to be correct, your task to replace the dummy id requires not merely editing the arguments of the text drawing instructions but also replacing the arbitrary vector graphics instructions showing the dummy id appearance by other instructions showing the actual id.
If you happen to have the font in question and are willing and able to embed it, you can actually replace the arbitrary vector graphics instructions by text drawing instructions using the font. Otherwise be prepared to also draw the actual id as arbitrary vector graphics.
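To see the construct in miniature, here is a hedged iText 5 sketch that produces it from scratch. A filled rectangle stands in for the glyph outlines, and the file name, text, and coordinates are placeholder assumptions:

    import java.io.FileOutputStream;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.BaseFont;
    import com.itextpdf.text.pdf.PdfContentByte;
    import com.itextpdf.text.pdf.PdfWriter;

    public class InvisibleTextDemo {
        public static void main(String[] args) throws Exception {
            Document doc = new Document();
            PdfWriter writer = PdfWriter.getInstance(doc,
                    new FileOutputStream("invisible.pdf"));
            doc.open();
            PdfContentByte cb = writer.getDirectContent();

            // 1. The visible appearance: arbitrary vector graphics
            //    (a placeholder rectangle instead of real glyph outlines).
            cb.rectangle(100, 700, 30, 12);
            cb.fill();

            // 2. Invisible but searchable text drawn over it
            //    (text rendering mode 3, i.e. "3 Tr" in the content stream).
            BaseFont font = BaseFont.createFont(); // Helvetica, not embedded
            cb.beginText();
            cb.setFontAndSize(font, 10);
            cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
            cb.setTextMatrix(100, 702);
            cb.showText("443");
            cb.endText();

            doc.close();
        }
    }

Search and copy/paste find "443", while the rendered appearance comes entirely from the vector graphics underneath, which matches the behaviour you observed.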

How to convert a multi-page PDF table to a spreadsheet format?

I have a huge PDF file with 300+ pages on which a big 10+ column table is spread. I am using Linux and would like to have a simple command line command which would convert this table to a text importable to a spreadsheet.
Currently I am using pdftotext -layout, which gives quite good results, except that every page is considered independently and the column widths and positions change from page to page (due to the different maximum column content width on each page), so I cannot simply import the resulting text file into a spreadsheet application and split it into columns at fixed widths.
I have tried to crop every column on every page (their positions are identical across the whole PDF file), but in the result the empty rows are merged together, so the rows with content end up shifted with respect to each other.
If pdftotext had an option to convert the file with a STRICT LAYOUT (not dependent on column content width), that would help. Or if I could stack all the pages of the PDF file onto a single page, that could also solve it.
What are the options to solve this problem?
You are misunderstanding the nature of the content of a PDF file. There are no tables in PDF, and there is (generally) no metadata to describe the content as a table. The text you see on the page may not be laid out in reading order.
For example the PDF file might contain a line of text drawn at the top of the page, then one at the bottom, then a paragraph in the middle before jumping back up to the top for a headline.
In addition, there may be no spaces between two text fragments. Text is drawn at an absolute position on the page, so you can draw (for example) cell A, then move the current point by, say, 1 cm, then draw cell B, and so on. Since there are no 'space' characters between the two cells, a naive text extraction will naturally assume the two pieces of text are continuous.
The STRICT LAYOUT you want isn't impossible, but you can't do it with a simple text file, because the original layout isn't made up of simple text characters; the space between two characters, or between two fragments of text, is often produced by moving the current point before drawing the text.
Ghostscript's txtwrite device in its simplest mode attempts to replicate the layout by replacing the white space with actual space characters in a fixed pitch font. This 'might' be good enough for you, but it equally well might not. That's because it operates by defining the smallest distance used on the page as being one space character. All distances between text is then replaced by a number of space characters, as many as are required to make up the space. This can (and often does) result in very wide output files with a lot of white space.
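For reference, the txtwrite device is invoked along these lines (the file names are placeholders):

    gs -sDEVICE=txtwrite -o table.txt input.pdf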
Essentially what you seem to want isn't really possible, you can't take a rich format like PDF and replicate it, including the layout, with nothing more than text characters.

How to convert italic font to normal font in pdf using some library?

Is there any way to convert an italic font or bold font in my PDF to a normal font, using some library like ImageMagick or Ghostscript?
Basically the answer is 'no' though there are several levels of caveat in there.
The most common scenario for a PDF file is that it contains an embedded font, and that font is subset. In this case the font will use a custom Encoding, so that when you see 'Hello' on your monitor, the actual character codes might be 'Axtte' or similar gibberish. If the font also contains a ToUnicode table you could, technically, create an embedded subset of the regular font from the same family as the bold or italic and embed that, and it would work. This would be an immense amount of work.
If the font isn't subset then it may not contain a custom Encoding, which would make that task easier, because you wouldn't have to re-encode the replacement.
If the font isn't embedded, then you need only change the font name in the Font object, because the PDF consumer will have to find a substitute anyway.
Note that, because PDF is a binary format, with an index (xref) containing the offset of every object in the file, any changes will mean that the xref table has to be reconstructed, again a considerable task.
I'm not aware of any tools which would do any of this for you automatically, you'd have to write your own, though some things could be done automatically. MuPDF for example will 'fix' a PDF file which has an incorrect xref table for you.
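For the non-embedded case, the renaming can be scripted with a library. Here is a hedged iText 5 sketch; the file names and replacement font name are placeholders, and the "not embedded" test shown is deliberately crude:

    import java.io.FileOutputStream;
    import com.itextpdf.text.pdf.PdfDictionary;
    import com.itextpdf.text.pdf.PdfName;
    import com.itextpdf.text.pdf.PdfObject;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;

    public class RenameNonEmbeddedFonts {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("in.pdf");
            for (int i = 1; i < reader.getXrefSize(); i++) {
                PdfObject obj = reader.getPdfObject(i);
                if (obj == null || !obj.isDictionary()) continue;
                PdfDictionary dict = (PdfDictionary) obj;
                // Crude test: a font dictionary with no FontDescriptor is a
                // standard-14-style, non-embedded font. Real PDFs need a more
                // careful check (a FontDescriptor without any FontFile* entry).
                if (PdfName.FONT.equals(dict.getAsName(PdfName.TYPE))
                        && dict.getAsDict(PdfName.FONTDESCRIPTOR) == null) {
                    dict.put(PdfName.BASEFONT, new PdfName("Helvetica"));
                }
            }
            // The stamper rewrites the file and rebuilds the xref table,
            // which takes care of the offset problem mentioned above.
            PdfStamper stamper = new PdfStamper(reader,
                    new FileOutputStream("out.pdf"));
            stamper.close();
            reader.close();
        }
    }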
And even after all that, the likelihood is that the spacing would be different for the italic or bold font compared to the regular font anyway, and would look peculiar if you replaced them with a regular font.
So, fundamentally, no.
In low-level PDF you can apply some rendering settings in front of a text stream, such as the text rendering mode (Tr) operator. For instance, you can turn on rendering of the text outline and increase the outline drawing width with the command sequence 0.4 w 2 Tr, which will cause normal text to become more "bold" (there are better ways to accomplish this using the FontDescriptor dictionary). One can also employ this tactic to slim down bold text using a clipped, thicker outline, but this may not be ideal.
As for italic, most fonts contain a metric indicating their italic angle, and you can use this to add a faux italic by applying a shear to the CTM with the cm operator. Once again, this works better for adding an italic shear than for removing one, but it may have some success there too.
See the PDF Reference.
This will require a library with lower-level PDF building capabilities, and you would have to do it manually, but it is technically possible.
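Here is a hedged iText 5 sketch of both tricks when generating new content (editing an existing stream this way is much harder). The line width, shear factor, coordinates, and file name are illustrative assumptions:

    import java.io.FileOutputStream;
    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.BaseFont;
    import com.itextpdf.text.pdf.PdfContentByte;
    import com.itextpdf.text.pdf.PdfWriter;

    public class FauxStyles {
        public static void main(String[] args) throws Exception {
            Document doc = new Document();
            PdfWriter writer = PdfWriter.getInstance(doc,
                    new FileOutputStream("faux.pdf"));
            doc.open();
            BaseFont font = BaseFont.createFont(); // Helvetica, not embedded
            PdfContentByte cb = writer.getDirectContent();

            // Faux bold: fill and stroke the glyph outlines ("2 Tr") with a
            // widened stroke ("0.4 w"), as described above.
            cb.beginText();
            cb.setFontAndSize(font, 12);
            cb.setLineWidth(0.4f);
            cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_FILL_STROKE);
            cb.setTextMatrix(72, 720);
            cb.showText("Faux bold text");
            cb.endText();

            // Faux italic: shear the text matrix; 0.21 is roughly tan(12°),
            // a typical italic angle.
            cb.beginText();
            cb.setFontAndSize(font, 12);
            cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_FILL);
            cb.setTextMatrix(1, 0, 0.21f, 1, 72, 700);
            cb.showText("Faux italic text");
            cb.endText();

            doc.close();
        }
    }

A real implementation would read the shear from the font's ItalicAngle metric in the FontDescriptor, as suggested above, rather than hard-coding it.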

Get x/y and width/height of all characters in a PDF using GhostScript

I need to get the x/y, width/height, and page number of each individual character in a PDF, ideally as percentages.
Clearly, Ghostscript is able to do this, as it wouldn't otherwise be possible to convert PDFs to raster images. Is there a simple way to get Ghostscript to give me this information, or am I going to need to modify the source to hook into this functionality?
Glyphs are rendered to bitmaps (using FreeType) and stored in the glyph cache tagged with the font and matrix so that they can be uniquely identified. When text is rendered to the page the cache is consulted first and if a hit exists that bitmap is drawn at the current point. If not then the glyph is rendered and cached.
However, extremely large point sizes are left uncached, and rendered each time to avoid filling up or overflowing the cache.
So in order to retrieve this information using Ghostscript you would need to write a device which has a set of text methods. You would need to capture the bitmaps from the glyph in order to determine the width and height of the glyphs, and the current point would give you the position on the page. The output_page method would tell you that a page had completed, so you would need to track the page number yourself.
You could look at the txtwrite device to see how text is processed, and the epswrite device to see how to retrieve bitmaps, you'll need some combination of both.
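As a possible shortcut before writing your own device: recent Ghostscript versions give txtwrite a TextFormat switch, and, if I recall the documentation correctly, TextFormat=0 emits the extracted text decorated with position and point-size information rather than a layout approximation. That is per text fragment, not per-glyph bounding boxes, but it may get you part of the way; check the documentation of your Ghostscript version before relying on it:

    gs -sDEVICE=txtwrite -dTextFormat=0 -o positions.xml input.pdf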
Be aware that 'text' in a PDF file need not be text. What appears to be text can be bitmaps, or vectors. Text can be encoded in unusual ways, and there may be no way to retrieve Unicode or other identifiable information about the glyphs (again the txtwrite device shows how you might extract such information if possible).
Also, fonts are not always embedded in PDF files, in which case a substitute font is used, which would mess up your width/height information.
This is quite a big project.

How Does a PDF Store Text

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.
To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.
PDF contains several different types of objects, not only vector or raster drawing instructions. Text, in particular, is represented by text elements. These include a string of characters that should be drawn at certain positions using a specific font.
Text extraction from PDFs can be a complicated affair because the file format is oriented for page layout. A text element may be an entire paragraph, or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
If you are lucky enough to deal with Tagged PDF files such as PDF/A or PDF/UA, text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.
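As a concrete illustration of the extractor side, here is a hedged sketch using iText 5 (any comparable library would do; the file name is a placeholder). The library resolves the font-specific encodings and, where present, the ToUnicode maps; the location strategy re-orders text chunks by position, since content-stream order need not match reading order:

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    public class DumpText {
        public static void main(String[] args) throws Exception {
            PdfReader reader = new PdfReader("input.pdf");
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                // Sorts the page's text fragments into reading order by
                // their positions before concatenating them.
                System.out.println(PdfTextExtractor.getTextFromPage(
                        reader, page, new LocationTextExtractionStrategy()));
            }
            reader.close();
        }
    }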
Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text