What is "Tagged PDF"? - pdf

Can someone please explain what a "Tagged PDF" is, and how it differs from regular, non-tagged PDF?
Will tagged PDFs contain special content, such as XML, Rich Media, Javascript, or the like?
Which TeX-toolchains generate Tagged PDFs?

A Tagged PDF is a PDF file that contains meta-information around certain groups of PDF instructions inside a page's content stream. This meta-information has many use cases: text extraction, content reflow, document accessibility, geographic information in PDFs containing maps, etc.
If you need more details about this topic, I would recommend reading Chapter 10, "Document Interchange", of the Adobe PDF Reference, version 1.7.
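As a rough illustration of what "tagged" means at the file level: the PDF specification marks a Tagged PDF through the document catalog, which carries a /StructTreeRoot entry and a /MarkInfo dictionary with /Marked true. The sketch below is only a heuristic under stated assumptions: the function name looks_tagged and the sample bytes are made up for this example, and the check scans raw bytes, so it will miss catalogs stored in compressed object streams or a /MarkInfo given as an indirect reference. A real check should go through a PDF library.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes mention a structure tree
    and a /MarkInfo dictionary containing /Marked true."""
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    marked = re.search(rb"/MarkInfo\s*<<[^>]*/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and marked


# Hand-written catalog fragments for illustration, not complete PDF files.
tagged_catalog = b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>"
plain_catalog = b"<< /Type /Catalog /Pages 2 0 R >>"

print(looks_tagged(tagged_catalog))
print(looks_tagged(plain_catalog))
```

On real files you would call it as looks_tagged(open(path, "rb").read()), keeping in mind that a negative result from this heuristic is not conclusive.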

The main reason it is used is accessibility. With the correct tags, a screen reader (for a blind person) can understand where headings fall, what is a table, footnote, or graphic, and so on. There is also a feature called PDF Article Threading, which is useful for magazine or newspaper layouts where an article is split across boxes or pages.

Related

Better tiny format PDF over EPUB or MOBI?

I want to write a book. I'll do it without a publishing house. I want to produce the printed paper version. I already know LaTeX, so I can make beautiful PDFs.
Since many people have an ebook reader, I think it would be worth creating an ebook. I can produce a PDF in a very small format (not A4), but not an EPUB so easily.
Is it okay if I publish my book only as a PDF, or will people not buy my book because they prefer EPUB or MOBI?
I think a lot of this depends on the audience of your book. Some things to think about:
One of the biggest differences between PDF and EPUB is that EPUB documents reflow the content based on the size of the reader screen -- so you can support multiple sizes of devices with one file. Apparently PDF files are larger, but I think this depends on how the PDF is created, and whether the EPUB has embedded fonts, etc. See this SuperUser question about EPUB vs MOBI vs PDF.
Ages ago, when I worked on Kindle devices, I found the font and layout support lacking. EPUB does let you embed your own fonts into the file, and most -- but not all -- readers will support the font and render things exactly the way you want. I suspect MOBI files are still not there yet. It sounds like you are interested in how your book renders on readers, so maybe MOBI files are not for you.
There are programs out there such as Calibre that will convert files from one format to another, for what it's worth.

Creating a PDF viewer using iTextSharp

I am trying to create a PDF viewer using the iTextSharp library, but there doesn't seem to be any documentation anywhere about how I can accomplish this. I don't need to create a PDF file, just display one and give users the option to save the file or export it to a CSV file.
Can somebody please point me in the right direction?
iText is not a PDF viewer (nor is iTextSharp, for that matter), but it can be used to examine a PDF document. See for instance iText RUPS, a tool that allows you to look under the hood of a PDF, more specifically at the PDF objects stored in the file as well as at the content streams.
This would be the first step towards writing a PDF viewer. However, iTextSharp doesn't interpret the content stream of a page, nor the resources that belong to that page (such as image streams, glyph descriptions, etc.). If that's what you want to build, you need to consult ISO-32000-1. Note that it will probably take several man-years to create a decent viewer.
As for the requirement to export a PDF document to a CSV, this may be possible if your original PDF is a Tagged PDF, but it will be impossible for the majority of PDF documents, including documents that consist of scanned images and documents with no machine-recognizable structure.
Please understand that this is a general answer. A more specific answer cannot be given since your question is too broad for StackOverflow. All the answers you need can be found by using iText RUPS and reading ISO-32000-1 (there's a copy of ISO-32000-1 available on Adobe's web site).

I need some insight on PDF Bookmarks

I haven't done any programming to handle PDFs in depth, only PDF creation with PHP.
I've been asked to join a project whose requirement is to generate PDF bookmarks with titles created from selected text.
The scenario goes like this:
The user highlights some text in a given PDF file.
The user is prompted to enter the starting page number for the chapter (bookmark).
A bookmark is created with a title which points to the given page number.
Multi-level bookmarks to handle sub-chapters (like child nodes) should be supported.
Due to some constraints, the client would prefer this to be a web app if possible.
What platform/language/technology/library would you recommend?
Is it doable in a browser? Should this be a desktop app instead?
I am fluent in PHP/JavaScript and capable in Python, with tiny bits of experience handling PDF files (nothing further than generating formatted PDFs), and I am willing to learn anything new.
I've got some time to dig around and conceptualise it, so I'm very open to suggestions.
Any insight would be appreciated.
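Independent of which PDF library ends up writing the outline, the scenario above can be modelled as a small tree of (title, page) nodes. The following is a minimal sketch under assumptions: the Bookmark class, add_child and flatten are hypothetical names invented here, not part of any PDF library, and actually writing the outline into the file is left to whatever toolkit is chosen.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Bookmark:
    """One outline entry: a title taken from the selected text,
    the page it points to, and any sub-chapters."""
    title: str
    page: int
    children: List["Bookmark"] = field(default_factory=list)

    def add_child(self, title: str, page: int) -> "Bookmark":
        child = Bookmark(title, page)
        self.children.append(child)
        return child


def flatten(bookmarks: List[Bookmark], level: int = 1) -> Iterator[Tuple[int, str, int]]:
    """Walk the tree depth-first and yield (level, title, page) rows."""
    for bookmark in bookmarks:
        yield (level, bookmark.title, bookmark.page)
        yield from flatten(bookmark.children, level + 1)


# Example: one chapter with two sub-chapters, then a second chapter.
chapter1 = Bookmark("Introduction", 1)
chapter1.add_child("Background", 2)
chapter1.add_child("Scope", 4)
chapter2 = Bookmark("Methods", 7)

rows = list(flatten([chapter1, chapter2]))
for level, title, page in rows:
    print("  " * (level - 1) + f"{title} -> page {page}")
```

A flat (level, title, page) listing like this is easy to send from a browser front end to a server-side script, which keeps the user interaction and the PDF writing separate.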

Are there tutorials and examples of how to interpret PDF documents

I am using tools such as PDFBox to interpret PDF files (including text, strokes, glyphs and images) and can access the streams and dictionaries. I am not clear on how these components link together and how to interpret them. In particular I would like to know how to access fonts from the streams.
NOTE: I am not interested in tutorials on how to create PDF documents.
You should probably start by reading the PDF Reference. It's a huge file, but you can read only the relevant parts.
To understand font streams you basically need to read about the TrueType and Type1 font formats (not easy reading either). PDF may contain other font types, but TrueType and Type1 are probably the most widely used.
Fiddling with fonts can be complicated, so you will probably find it easier to use a font library such as FreeType for extracting information from PDF font streams.
There are lots of good articles on planetpdf.com, and many PDF developers run blogs with useful generic articles. We have run a whole load on our blog (http://www.jpedal.org/PDFblog/)
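To make the link between dictionaries and font streams a little more concrete: in the PDF specification, a font descriptor dictionary points to the embedded font program through one of three keys, /FontFile (Type 1), /FontFile2 (TrueType) or /FontFile3 (format given by the stream's /Subtype). The sketch below only illustrates that mapping by scanning raw bytes. The function name find_embedded_fonts and the sample descriptor are assumptions made for this example, and the scan will miss descriptors stored in compressed object streams, so a library such as PDFBox remains the proper way to walk these objects.

```python
import re

# Font descriptor keys from the PDF specification and the kind of
# embedded font program each one refers to.
FONT_FILE_KEYS = {
    b"/FontFile": "Type 1",
    b"/FontFile2": "TrueType",
    b"/FontFile3": "given by the stream's /Subtype (e.g. Type1C, CIDFontType0C, OpenType)",
}

# Matches e.g. "/FontFile2 12 0 R" (key, object number, generation number).
FONT_FILE_REF = re.compile(rb"(/FontFile[23]?)\s+(\d+)\s+(\d+)\s+R")


def find_embedded_fonts(pdf_bytes: bytes):
    """Return (key, object_number, kind) for each FontFile* reference found."""
    results = []
    for match in FONT_FILE_REF.finditer(pdf_bytes):
        key = match.group(1)
        results.append((key.decode("ascii"), int(match.group(2)), FONT_FILE_KEYS[key]))
    return results


# Hand-written font descriptor fragment for illustration only.
descriptor = b"<< /Type /FontDescriptor /FontName /ABCDEF+Sample /FontFile2 12 0 R >>"
fonts = find_embedded_fonts(descriptor)
print(fonts)
```

The object number returned here is the stream you would then decode and hand to a font library such as FreeType.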

How to remove duplicate pages from a PDF?

I have a PDF of an ebook which has many pages duplicated, and I would like to remove them.
There are many PDF toolkits that could do this. You don't say which language or framework, so it's hard to recommend one.
You could also use Adobe Acrobat if you don't want to do it programmatically.
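Whichever toolkit is used, the selection step is the same: decide which page indices to keep, then ask the toolkit to write only those pages. A minimal sketch of that step follows, assuming the toolkit can hand you each page's content as bytes. The function name pages_to_keep and the sample data are hypothetical, and only byte-for-byte identical pages are detected, so scanned duplicates that differ slightly will not match.

```python
import hashlib


def pages_to_keep(page_contents):
    """Return the indices of pages whose content has not been seen before,
    preserving the original page order."""
    seen = set()
    keep = []
    for index, content in enumerate(page_contents):
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(index)
    return keep


# Stand-in page contents; real input would come from a PDF toolkit.
pages = [b"page A", b"page B", b"page A", b"page C", b"page B"]
kept = pages_to_keep(pages)
print(kept)  # [0, 1, 3]
```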
Disclaimer: I work for Atalasoft.
Our DotImage Document Imaging Toolkit has PDF page manipulation classes. There is a tutorial here:
http://www.atalasoft.com/products/dotimage/white-papers/building-pdf-documents-with-dotimage
It shows adding pages, but we also support removing (and reordering) them.