Does PDF have style, header and footer information like DOCX? - pdf

Does a PDF have style, header and footer information, the way a DOCX file has separate XML files for these?

Regular PDFs don't have styles, only different fonts (for instance, Helvetica is one font and Helvetica-Bold is another font of the same family).
They don't have headers and footers, just as they don't have paragraphs, section titles, table rows or table cells. Everything you see on a PDF page is just a bunch of glyphs, paths and shapes drawn on a canvas.
However, if your PDF is a Tagged PDF, it contains something known as the StructTreeRoot. This means that, apart from the presentation of the content, you also have a tree structure that stores the semantics of the content. This structure contains references to the content on the different pages, allowing you (for instance) to find out which lines belong together in a paragraph, which parts of the page are "artifacts" (such as a repeating header or footer), which content is organized as a table, and so on.
Tagged PDF is a requirement for PDF/A Level A and PDF/UA documents. A majority of the PDF files you can find in the wild aren't tagged (properly).
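As a rough illustration of the point about the StructTreeRoot, here is a minimal Python sketch that scans the raw bytes of a file for that key. The function name looks_tagged is made up for this example, and the check is only a heuristic: if the document catalog is stored inside a compressed object stream, the key will not be visible to a plain byte scan, and finding the key says nothing about whether the tagging is done properly. A real PDF library should be used for anything serious.

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude heuristic: does the raw file mention a structure tree root?

    Assumption: the document catalog is stored uncompressed, so the
    /StructTreeRoot key appears literally in the file bytes.
    """
    return b"/StructTreeRoot" in pdf_bytes


# Hypothetical usage (the file name is a placeholder):
# with open("document.pdf", "rb") as fh:
#     print(looks_tagged(fh.read()))
```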

Related

What governs the text selection order of PDFs, and how can it be improved when generating PDFs?

A number of PDFs, particularly those exported by presentation software, desktop publishing tools or LaTeX, seem to have an illogical text selection order.
For example, selecting parts of a math equation in one of my documents seems to randomly select another large block of equations elsewhere on the page, even though they are separated by body text. Is this a problem in the PDF viewer (Mac Preview) or in the PDF file itself? What procedures should be followed when programmatically generating PDFs to ensure a logical ordering for text selection?
Text selection in PDF viewers is determined by an algorithm in the viewer. Different viewers have different algorithms and yield different results. Some viewers will leverage the structure tags if they are present; others will ignore the tags even when present.
Unfortunately, there is nothing you can do as the PDF author to influence how any particular viewer groups the text rendering instructions into words, then into blocks of text, then into page regions, and finally into a text selection.

How to identify the "Table of Contents" page using PDFBox

I am using the Apache PDFBox framework to read PDF text content.
I have to get the content of the "Table of Contents" page (if present in the PDF), so I need to be able to identify that page through the PDFBox API.
Kindly provide your suggestions.
The table of contents in a PDF file is not easily identified by any structure you can just pull from the PDF document. You will have to do text extraction and identify the table of contents by its properties.
PDF in general doesn't contain content structure such as a table of contents, chapters, headers, footers or even paragraphs or lines of text.
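One way to "identify the table of contents by its properties" can be sketched as follows. This is only a heuristic written in Python for brevity, not a PDFBox feature: it assumes the text of each page has already been extracted (for example with PDFBox's PDFTextStripper) and that a TOC page consists mostly of lines ending in a page number. The function name, the regular expression and the thresholds are all assumptions chosen for illustration.

```python
import re

# Assumed shape of a TOC entry: some text, then dot leaders or whitespace,
# then a page number at the very end of the line, e.g. "2 Fonts ........ 7".
TOC_LINE = re.compile(r"^\S.*?(?:\.{2,}|\s)\s*(\d{1,4})$")


def looks_like_toc(page_text: str, min_ratio: float = 0.6, min_lines: int = 3) -> bool:
    """Return True if most non-empty lines of the page look like TOC entries."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return False
    hits = sum(1 for ln in lines if TOC_LINE.match(ln))
    return hits / len(lines) >= min_ratio
```

Body text that happens to end in a number (a year, for instance) will also match a single line, which is why the decision is based on the ratio over the whole page rather than on any one line.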

How to find the used characters in a subsetted font?

I have PDF files which are dynamically generated, with text, vectors, and subsetted fonts. I can see which fonts are used in various viewers - is there a way of displaying the actual subsetted characters of those fonts?
For example, I see the document contains the subsetted fonts "AAAAAC+FreeMono" and "AAAAAD+DejaVuSans". How do I find how many characters were subsetted from these fonts, and which characters they were?
(I tried loading the fonts in FontForge, but it just crashes while opening the file.)
The solution is to save the font data to a file and load it into a font editor. A subset font file is still a valid font file, but it is possible that FontForge expects some data in the font that is not there. I have also seen many fonts that are not properly subset, which could also cause loading problems in a font editor.
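When saving the font data to files, it can help to separate the subset tag from the underlying font name. Subset font names in PDF carry a prefix of six uppercase letters followed by a plus sign, as in the names quoted in the question. The small Python sketch below splits the two parts; the function name is made up for this example. Note that the tag itself is arbitrary and does not encode which characters are in the subset; that information is only in the embedded font data.

```python
import re

# A subset font name is a six-uppercase-letter tag, a plus sign, then the base name.
SUBSET_NAME = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_name(font_name: str):
    """Return (tag, base_name); tag is None if the name has no subset prefix."""
    m = SUBSET_NAME.match(font_name)
    if m is None:
        return None, font_name
    return m.group(1), m.group(2)
```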

How Does a PDF Store Text?

I am attempting to gain a better understanding of how a PDF stores text. Generally speaking, when a PDF is created from an application like MS Word (or in my case SQL Server Reporting Services), how is text stored by the PDF? I would hope that the resulting document isn't OCR'ed in this particular scenario the way it would be if the original PDF document had been created from an image.
To get a bit more detailed, I am trying to understand how text extractors for PDFs work. My initial understanding of PDF was that it stored (PostScript) instructions on how to draw the "image" of the document to a page or a printer, and that there was no actual text contained within the document itself. Subsequently, I was thinking that a text extractor might reverse-engineer such instructions to generate the text that the PDF would otherwise generate. I am not confident of this, though.
PDF contains several different types of objects, not only vector or raster drawing instructions. Text in particular is represented by text elements. These include a string of characters that should be drawn at certain positions using a specific font.
Text extraction from PDFs can be a complicated affair because the file format is oriented toward page layout. A text element may be an entire paragraph or a single character. Even a single word may consist of several text elements if different typefaces are mixed. Also, the characters are not necessarily encoded in a standard encoding such as Unicode. They may be encoded in a way specific to a particular font.
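To make the idea of text elements concrete, here is a minimal Python sketch that walks a hand-written, uncompressed page content stream and collects each shown string together with the font selected at that point. The sample stream and the function name are invented for this example. It only handles the Tf operator (select font and size) and the Tj operator with literal strings; it ignores TJ arrays, hex strings, escape sequences and, most importantly, the font-specific encodings mentioned above, so on real files the collected bytes may not be readable text at all.

```python
import re

TOKEN = re.compile(
    r"/(?P<font>\S+)\s+(?P<size>[\d.]+)\s+Tf"   # "/F1 12 Tf": select font and size
    r"|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj"     # "(Hello) Tj": show a literal string
)


def show_text_ops(content: str):
    """Return a list of ((font_name, size), raw_string) for each Tj operator."""
    runs = []
    font = None
    for m in TOKEN.finditer(content):
        if m.group("font") is not None:
            font = (m.group("font"), float(m.group("size")))
        else:
            runs.append((font, m.group("text")))
    return runs


# Made-up content stream: one phrase split into two text elements
# because the font changes in the middle.
SAMPLE = """BT
/F1 12 Tf
72 720 Td
(Hello) Tj
/F2 12 Tf
( world) Tj
ET"""
```

Calling show_text_ops(SAMPLE) yields two separate runs, one per font, which is the situation described above where a single phrase is stored as several text elements.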
If you are lucky enough to deal with Tagged PDF files, such as PDF/A Level A or PDF/UA documents, text extraction can be a lot easier because text spans are identified as such, and a mapping to Unicode characters is defined.
Wikipedia doesn't have the complete specification but does serve as an introduction: http://en.wikipedia.org/wiki/Portable_Document_Format#Text

How to replace or modify the font or glyphs embedded in a PDF file?

I want to replace the font embedded in an existing PDF file programmatically (with iText).
iText itself does not seem to provide any data model for glyphs and fonts, but I believe it can let me retrieve and update the binary stream that contains the font.
It's OK even if I don't know which glyph is associated with which font - what I want to do is just replace them. To be precise, I want to embolden all glyphs in a PDF document.
Replacing fonts in rendering time is not an option because the output must be PDF with all information preserved as is.
Is there anyone who has done this before with iText or any other PDF libraries?
PDF files define a set of font names (e.g. F0, F1, F2) and then define the fonts themselves separately, so you could theoretically rewrite the entry for F0. You would have to ensure the two fonts have the same spacing (or you will have to rewrite the PDF as well), and you would probably have to hack the PDF manually.
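The indirection described above can be sketched in Python as follows: the page's resource dictionary maps short names such as F0 to indirect object references, and the font definitions live in those referenced objects. The sample dictionary and the function name are invented for this example, and the sketch assumes the /Font dictionary is written inline rather than being an indirect reference itself. It only shows where the mapping lives; it does not perform any replacement.

```python
import re


def font_resources(resources: str) -> dict:
    """Map font resource names to object numbers, e.g. {"F0": 7}.

    Assumption: the /Font dictionary is inline and not nested any further.
    """
    m = re.search(r"/Font\s*<<(.*?)>>", resources, re.S)
    if m is None:
        return {}
    pairs = re.findall(r"/(\S+)\s+(\d+)\s+\d+\s+R", m.group(1))
    return {name: int(num) for name, num in pairs}


# Made-up resource dictionary for illustration.
SAMPLE_RESOURCES = "<< /Font << /F0 7 0 R /F1 8 0 R >> /ProcSet [/PDF /Text] >>"
```

Pointing F0 at a different font would mean changing what object 7 contains (or which object F0 refers to), which is why the replacement font needs matching glyph widths for the existing layout to hold.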