Is there a way to extracting semantic informations from PDF? (converting PDF to pure XHTML) - pdf

I'm finding a way to extract semantic structural informations (like title, heading, paragraph or lists) from PDF. Because I want to get a pure structural data from PDF.
Finally, I want to create an pure XHTML from the PDF. With only structural informations. No design or layout.
I know, PDF can be created without any structural information. I don't consider those PDFs. Only regularly well-structured PDFs are considered.
I'm new to PDF. So I don't know it offers regular semantic structure or not. If it exists, it's library will offer it. So I want to know whether PDF spec has those information, and best way to get those information if exists.

I would highly recommend reading through the PDF spec:
http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
There isn't a "semantic structure" to the document like you might find in an HTML file; it's much more complicated.
The file format is largely based on a COS Object Tree, which is essentially a set of objects referencing each other in various manners, but not in any particular order (with some exceptions).
Some of these objects contain what you are likely after (document tages, etc). Moreover, these objects can be encoded in various ways.
Very complicated.
I would recommend looking at some of the well developed PDF libraries out there like iText:
http://itextpdf.com/

What do you mean by 'well-structured'?
If the PDFs contain marked content you can get an almost perfect extraction of semantic data. Otherwise it simply does not exist but might be 'guessed' in some cases.

Related

inserting an entir pdf into another by raw text manipulation

I need to include a pdf into another pdf that is being created by text manipulation, not through a package. (In particular, I'm using livecode, which is well suited to the generation of the information I need, and can easily do text manipulation).
Once included, I will be adding additional objects (primarily text, but also a few small squares).
I only need to be able to access the included pdf by page and area, such as (200,200) to (400,400) of page 5; I don't need any access to its objects.
Simply appending to the pdf won't do the job, as I'll actually be including multiple source pdfs into a single pdf output with my addition.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it. In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
Can this be done? Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
something that could be used on a one-time basis to convert an entire multi-page pdf wold also be a usable (but inferior) solution.
I've found that search engines aren't yielding usable results, as they are swamped with solutions for individual products, and not the pdf itself.
First of all, PDFs in general are not text data, they are binary. They may look textual as they contain identifiers built from ASCII values of words, but treating them as text, unless one and one's tools are extremely cautious, is a sure way to damage them.
But even if we assume such caution, unless your input PDFs are internally of a very simple and similar structure, creating code that allows to merge them and manipulate their content essentially is complexity-wise akin to creating a generic PDF library/package.
I would like to simply make the original pdf an indirect object in the output pdf, and then refer to and use it.
Putting them into one indirect object each would work if you needed them merely as an unchanged attachment. But you want to change them.
In particular, I would like to avoid having to "disassemble" the source pdf into components to build a new cross-reference table.
You will at least have to parse ("disassemble") the objects related to the pages you want to manipulate, add the manipulated versions thereof, and add cross references for the changed objects.
And you only mention cross reference tables. Don't forget that in case of a general solution you also have to be able to handle cross reference streams and object streams.
Or do I need to make new absolute references for each object in every dictionary, and to every reference to them? (I only need to be able to refer to regions and page, not the actual objects).
If you really want to merge the source PDFs into a target one, you'll indeed need to renumber the objects from most source PDFs.
If as a target a portable collection (aka portfolio) of the source PDFs would suffice, you might not need to do that. In that case you merely have to apply the changes you want to the source PDFs (by means of incremental updates, if you prefer), and then combine all those manipulated sources in a result portfolio.
I've found that search engines aren't yielding usable results
The cause most likely is that you underestimate the complexities of the format PDF. Combining and manipulating arbitrary existing PDFs usually requires you to use a third-party library or create the equivalent of such a library yourself.
Only manipulating existing PDFs is a bit easier, and so is combining PDFs in a portfolio. Nonetheless, even in this case you should have studied the PDF specification quite a bit.
Restricting oneself to string manipulations to implement this makes the task much more complex - I'd say impossible for generic PDFs, daring for PDFs of simple and similar build.

Does the PDF standard provide a way to store extractable (semantic) text?

PDF is very nice for humans to read, but it is pretty awful to extract the data from. There are tons of tools to extract the data from PDF (pdftotext from poppler, pdftohtml, XPdf, tabula, a-pdf, ...).
As you can see in questions like this, those tools are not optimal.
It would be better if the PDF contained already the data in a structured way to be extracted. Something like a striped-down version of HTML. Especially for tables, there is a lot of information lost. For example, when you convert a Word document to PDF and then to text.
Does the PDF standard provide a way to store the structure of a table? If not, is it possible to extend the PDF standard? What would be the process for that?
What you are looking for, most likely are tagged PDFs.
Tagged PDFs are specified in ISO 32000-1, section 14.8. They mark content parts as paragraphs, headers, lists (and list items), tables (and table rows, headers, and data cells) etc. with assorted attributes.
To do so they make use of the PDF logical structure facilities (see ISO 32000-1, section 12.7) which in turn use the marked content operators (see ISO 32000-1, section 12.6) to tag pieces of content streams with IDs which are referenced from a structure tree object model outside the content streams.
In a tagged PDF you can walk that structure tree like a XML DOM and retrieve the associated text pieces making use of the ID markers in the content.
For details please study the PDF specification ISO 32000-1 or its update ISO 32000.2.
Adobe shared a copy of ISO 32000-1 (merely replacing ISO headers and references), simply search the web for "PDF32000_2008". Currently it's located here: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Losing Aria/accessibility when converting from HTML to PDF

I am using ABCpdf to generate a collection of PDFs from HTML markup, and am struggling with making it fully accessible.
The HTML pages include several graphs which are created by CSS, and which are completely ignored by the screenreader.
I have tried using aria-label to give a written explanation of the graphs, but it is lost in the conversion. I have tried configuring the Gecko engine within ABCpdf in numerous ways, including scaling back security options, altering markup options, and adding special tags to explicitly include an element. The PDF is tagged and is rated as fully accessible by our evaluation program.
I haven't been able to find a way to include "hidden" text in the PDF for the purpose of screenreaders. Any help is appreciated!
EDIT: Due to security concerns, I am unable to display the actual data behind the graphs. Manual steps are also not an option due to the sheer number of generated PDFs, and a short timeline.
HTML-to-PDF conversion utilities are usually pretty basic and typically don't handle complex CSS very well at all. You may be better off taking a screen capture and then using alt-text to describe the intent of the graph. Sometimes the simplest approach is the most reliable.
Another way of approaching the issue would be to present the complete data set to users via a data table. That way, they can "see" everything contained in the graph, and it won't matter if the graph itself is inaccessible. If placing a giant data table in the middle of your document doesn't fit with your formatting, you can also include the data set in an appendix with a note or hyperlink in the text directing readers where they can go to access the entirety of information.

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. There are several libraries and CLI tools that offer to do this, but it turns out that none are able to reliably identify document structure. In particular I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text.
It is frequently noted that the PDF format does not have any concept of columns, or even words. Several answers to similar questions on SO mention this. The problem is so great that it even warrants academic research. This journal article notes:
All data objects in a PDF file are represented in a
visually-oriented way, as a sequence of operators which...generally
do not convey information about higher level text units such as
tokens, lines, or columns—information about boundaries between such
units is only available implicitly through whitespace
Hence, all extraction tools I have tried (iTextSharp, PDFLib TET, and Python PDFMiner) have failed to recognize text column boundaries. Of these tools, PDFLib TET performs best.
However, SumatraPDF, the very lightweight and open source PDF Reader, and many others like it can identify columns and text areas perfectly. If I open a document in one of these applications, select all the text on a page (or even the entire document with CTRL+A) copy and paste it into a text file, the text is rendered in the correct order almost flawlessly. It occasionally mixes the footer and header text into one of the columns.
So my question is, how can these applications do what is seemingly so difficult (even for the expensive tools like PDFLib)?
EDIT 31 March 2014: For what it's worth I have found that PDFBox is much better at text extraction than iTextSharp (notwithstanding a bespoke Strategy implementation) and PDFLib TET is slightly better than PDFBox, but it's quite expensive. Python PDFMiner is hopeless. The best results I have seen come from Google. One can upload PDFs (2GB at a time) to Google Drive and then download them as text. This is what I am doing. I have written a small utility that splits my PDFs into 10 page files (Google will only convert the first 10 pages) and then stitches them back together once downloaded.
EDIT 7 April 2014. Cancel my last. The best extraction is achieved by MS Word. And this can be automated in Acrobat Pro (Tools > Action Wizard > Create New Action). Word to text can be automated using the .NET OpenXml library. Here is a class that will do the extraction (docx to txt) very neatly. My initial testing finds that the MS Word conversion is considerably more accurate with regard to document structure, but this is not so important once converted to plain text.
I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for what you mention (I think) but the important one is focus.
You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". This means in its purest form it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself and there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish.
So if you want to be able to extract formatted text, you have to indeed look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what are lines, what are paragraphs and so on. Tables are notoriously difficult for example because they are so diverse.
Alternative strategies could be to:
Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve structure. Other PDF files may have that information as well.
Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge.
So why are some products better at this than others? Focus I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some more on higher-level PDF tasks. Some are oriented towards "office" use - some towards "graphic arts" use. Depending on your focus you may decide that a certain feature is worth a lot of attention or not.
Additionally, and that may seem like a lousy answer, but I believe it's actually true, this is an algorithmically difficult problem and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where - if you are clever and you have enough focus to put some of your attention on it, and especially if you have a good idea what the target market is you are writing this for - you'll get it right, while everybody else will get it mediocre.
(And no, I didn't get it right back then when I was writing that code - we never had enough focus to follow-through and make something that was really good)
To properly extract formatted text a library/utility should:
Retrieve correct information about properties of the fonts used in the PDF (glyph sizes, hinting information etc.)
Maintain graphics state (i.e. non-font parameters like text and page scaling etc.)
Implement some algorithm to decide which symbols on a page should be treated like words, lines or columns.
I am not really an expert in products you mentioned in your question, so the following conclusions should be taken with a grain of salt.
The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not have to deal with font details on a deeper level and they might not be that well tested in maintaining graphics state.
Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later. And fixing those will help to excel in text extraction.

embed serial number to PDF file?

To prevent the casual distribution of pdf document, is there any way such as embedding the serial number to the file?
My idea is to embed the id bound to user and enable to find who distribute the file.
I know it's not preventing the distribution but may discourage the casual distribution by the certain level.
Any other solution is also welcome.
Thanks!
Common way is placing of meta data, but you can easily remove them.
Let's search hideouts (most of them low-level)!
Non-mark text
Text under overlapping objects
Objects of older versions (doesn't noticed by reader, but there with redundant information)
Marks in streams between BX-EX (with weird information from readers point of view)
Information before %PDF-X
Information above %%EOF
Substitution of names for some elements (like font name)
Steganography
Manipulation from used fonts
Whitespacing
Images with setganograpy
My favorite are steganography and BX-EX block within stream, with proper compression and/or encryption it is hard to find (if do not know, where it is). To make search harder wrap some normal blocks with BX-EX.
Some of ways are easy to remove, some harder, but decided attacker will be able to find and sanitize them all. Think about copy-paste of text or print trough PDF-printer.
You can render transparent text. You can write text outside the media box of a page. You can add custom document property. There are plenty of ways to do this.
Why not create a digital id on the documents?