Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
How can I find out what's wrong with my PDF file? I'm making a web application for digitally signing PDF documents. To do that, I do a lot in the code, such as adding signature fields before inserting the actual signature. The code is too long to post here. It works fine for most PDF documents. However, for some PDFs it breaks the file so that it can't be opened with Acrobat Reader.
Is there any way I can find out what's wrong with my PDF? The problem is that Acrobat Reader gives me very general errors without further explanation.
The error I get is: There was an error opening this document. There was a problem reading this document (14).
Here is the signed pdf file that is making problems if someone wants to take a look:
https://easyupload.io/1a0i8x
There are many errors in the cross references of the PDF.
Actually it looks like the source PDF had a cross reference stream (or at least hybrid cross references) and your program stored the cross reference data from the stream as a cross reference table, ignoring the fact that the entries referring to objects in object streams make no sense in a cross reference table.
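To illustrate why such entries make no sense in a table, here is a minimal Python sketch. The names XrefStreamEntry and to_table_line are made up for this illustration and are not taken from the asker's code. A cross reference stream has three entry types, and only types 0 and 1 have an equivalent 20-byte line in a classic cross reference table:

```python
from typing import NamedTuple


class XrefStreamEntry(NamedTuple):
    """One entry of a cross reference stream (hypothetical helper type)."""
    kind: int    # 0 = free, 1 = in use at a byte offset, 2 = stored in an object stream
    field2: int  # kind 0: next free object; kind 1: byte offset; kind 2: object stream number
    field3: int  # kind 0/1: generation number; kind 2: index inside the object stream


def to_table_line(entry: XrefStreamEntry) -> bytes:
    """Convert an entry to a classic 20-byte xref table line, if that is possible."""
    if entry.kind == 0:
        return b"%010d %05d f \n" % (entry.field2, entry.field3)
    if entry.kind == 1:
        return b"%010d %05d n \n" % (entry.field2, entry.field3)
    # Type 2 objects have no byte offset of their own, so a table cannot describe them.
    raise ValueError("object lives in an object stream; keep a cross reference stream")
```

A signing tool that meets a type 2 entry therefore has to write a cross reference stream (or a hybrid file) instead of copying the numbers into a table.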
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 6 years ago.
I received a PDF file that uses unusual fonts.
The fonts look fine to the human eye,
but if I try to cut-and-paste them, I get a string of '???'.
Is it possible to replace the PDF document's defined fonts with normal fonts (e.g., on Foxit Phantom PDF editor)?
This may be possible, e.g. with PitStop Pro from Enfocus. However, as others indicated in the comments, it is possible that the fonts in the pdf and the pdf itself have had all information to make this possible removed.
Some more detail about this maybe:
The encoding in the PDF could tell software which character is to be shown, and then that character would be selected from the font for display, but it is also possible to create a PDF so it only says 'show glyph number 3 of the embedded font'. That is what the 'Identity-H' encoding you see in the summary does.
Note that the word glyph and not 'character' is specifically used when talking about the individual 'drawings' that make up a font to indicate that these things are only 'random' drawings until some information is added in the font to indicate which letter (or other character, like a number) they represent.
E.g. for the character 'lower-case-a', the font you currently look at has this glyph:
a
but other fonts will have something that may look completely different. Only because we have learned to read these different images as the letter lower-case-a do we think they are/represent 'the same letter'.
If this information is not present in the PDF, as in your case, it may still be possible to get it from the font included in the PDF: a font on your computer needs some way to allow a program to select the right glyph if it wants to display 'lower-case-a'. However, if the PDF is set up to simply say 'show glyph number 3 of the embedded font', this information isn't necessary anymore, and can be removed from the font before the font is put inside the PDF. This is done either to make the PDF smaller, or to prevent people from copying the text, e.g. of copyrighted works.
In this case, only OCR can help. I think Adobe Acrobat (the full version, not Adobe Reader) has added exactly that in one of the latest versions; however this means it is trying to guess the letter from the 'image' shown, so this may make mistakes.
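As a rough illustration of the difference, here is a small Python sketch. The function name and the glyph numbers are invented for the example and say nothing about the actual file. Text extraction only works when a glyph-to-character mapping is available:

```python
def extract_text(glyph_ids, to_unicode=None):
    """Map the glyph numbers shown on a page to characters.

    to_unicode is a hypothetical dict {glyph number: character}; when it is
    missing, every glyph comes out as '?', as in the question.
    """
    if not to_unicode:
        return "?" * len(glyph_ids)
    return "".join(to_unicode.get(glyph, "?") for glyph in glyph_ids)


page_glyphs = [3, 7, 3]  # 'show glyph 3, glyph 7, glyph 3'
print(extract_text(page_glyphs, {3: "a", 7: "b"}))  # mapping present
print(extract_text(page_glyphs))                    # mapping stripped
```

The drawing instructions are identical in both calls; only the lookup table that turns glyph numbers back into characters differs.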
I'm trying to edit an XSL-FO document to be processed in Java via Apache FOP. Are there any tags that might not be supported or that have become deprecated?
I didn't write the original XSL-FO. It was originally a WordML document that I converted via Word2FO, and I've gotten rid of junk characters and made sure all the tags are closed properly, so the only thing I can think of is that some of these tags might not be supported. Particularly:
Tags with Microsoft-related properties, including fonts like Arial-Word-Unicode, references to Microsoft-Office Smarttags, and other Word-related properties
SVG tags
External graphics tags with huge streams of nonsense text to represent an image.
I've been looking online for a list of these unsupported tags, but I can't seem to find anything.
I have found what I am looking for on this page.
http://xmlgraphics.apache.org/fop/compliance.html
Very useful for looking up what is/isn't supported.
I should really stop finding these solutions after asking these questions.
This question is answered, and the above link will probably help anyone who needs it answered too. Sorry to bother the rest of you.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I am trying to learn the PDF file format.
To this end I downloaded Adobe's PDF specification file, which is huge.
So to help me study the details of PDF, I want to follow its abstract explanations by looking in parallel at some real-world PDF files.
For example, one idea was to create a PDF file (using LaTeX) which has only one page and whose only content is a single character, a.
But when I open this PDF file in a hex editor (or in other tools that can show the internal PDF structure), there is a lot of binary or compressed content inside this PDF. For an example for what I see, look at the screenshot below:
I simply cannot identify which part of this binary represents my character a in this PDF.
The same happens with all the real-world PDF files I've tried so far. I simply cannot find any PDF files which contain working example code to help me understand the generic PDF language specification.
I would like others to explain to me: is there a practical way to study the PDF specification while at the same time verifying its bits and pieces with real PDF files?
I would like to know: which software tools are commonly used by PDF programmers that would help a newbie developer like me to dissect and un-compress existing binary PDF files so their source code can be investigated using a simple text editor? (Note: I'm not asking for a recommendation. In compliance with the SO FAQ I just want to know if such tools do exist, and which names they have.)
Is there a resource of freely available PDF files which don't contain binary and/or compressed content? Or how could I create my own such example files?
Are there (preferably free) PDF editors/parsers available which can visualize + dissect the raw binary data of PDF files and expose their structure?
I only need a first hook. The entry point, if you will, to the narrow path in the thick jungle of real world PDF files, which I then could follow along... while using the help of this bushwhacker called 'PDF Specification'.
The creators of iText (a Java/C# lib to create and manipulate PDFs) published a tool called RUPS.
From the sourceforge page:
RUPS is an abbreviation for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText® that allows you to look inside a PDF document and browse the different PDF objects and content streams. (Updating PDFs isn't possible yet.)
The way I helped myself to learn PDF syntax was this:
Looked for a tool that could de-compress PDFs (de-compress the internal streams).
Found qpdf, Jay Berkenbilt's commandline tool described as: "does structural, content-preserving transformations on PDF files".
Routinely running qpdf --qdf input.pdf decompressed-input.pdf.
Opening the newly created decompressed-input.pdf in a text editor.
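If you want to script these steps, a minimal Python sketch could look like this. It assumes qpdf is installed and on your PATH; the function names are mine, and the only qpdf option used is the --qdf flag mentioned above:

```python
import subprocess


def qdf_command(source, target):
    """Build the argument list for 'qpdf --qdf source target'."""
    return ["qpdf", "--qdf", source, target]


def decompress_pdf(source, target):
    """Run qpdf and write a text-editor-friendly copy of source to target."""
    result = subprocess.run(qdf_command(source, target))
    # qpdf exits with 3 when it produced output but printed warnings.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit code {result.returncode}")
```

Calling decompress_pdf("input.pdf", "decompressed-input.pdf") then matches the command line shown in the list.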
The --qdf mode of the tool transforms the binary and ASCII elements of PDFs in a very useful way, without changing their visual page appearance (and it's very fast):
Decompress previously compressed objects (exposing e.g. the PDF language source code of page element drawing operations).
Also expand object streams (ObjStrm).
Normalize the presentation of arrays, strings etc.
Re-number objects so they start from 1 0 obj and then present them in ascending order in the file.
Repair b0rken xref entries.
Add comments which contain an object's original identity in the original file.
Add comments for each page.
...and some more.
Looking at these (now mostly ASCII) files in a normal text editor is much easier than trying to figure out the original binary PDF.
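To see what "decompress" means here, a small Python sketch using only the standard library may help. The content stream below is a made-up example of how a page showing the single character a could be drawn; real files will differ in font name and coordinates. Streams marked /FlateDecode are zlib-compressed, which is why the character is not visible in a hex editor:

```python
import zlib

# Hypothetical page content: select font F1 at 12pt, move to (72, 720), show "a".
content = b"BT /F1 12 Tf 72 720 Td (a) Tj ET"

# This is roughly what ends up between 'stream' and 'endstream' in the file.
compressed = zlib.compress(content)


def inflate_stream(raw: bytes) -> bytes:
    """Undo /FlateDecode compression of a stream body."""
    return zlib.decompress(raw)


print(compressed)                  # unreadable bytes
print(inflate_stream(compressed))  # the drawing operators again
```

The (a) Tj operator in the inflated stream is the "show text" instruction the question was looking for.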
I would recommend taking a look at a few files using PDF Vole (a tool based on iText, and similar to RUPS).
PDF Vole and RUPS will both allow you to navigate through the structure of a PDF file, inspect the entries on every object, decompress compressed streams, decrypt the file when needed, look at the content of pages and annotations, and track down the relation between objects in the file.
For example this file:
Will look like this in PDF Vole:
You could also take a look at the class hierarchy of iText itself (which is almost 1-to-1 with the PDF spec) and the book that explains it, iText in Action.
If you are trying to generate PDF files via code, then this CodeProject source code might help.
The code along with the Adobe specification should get you going. I don't think there are many short cuts here. Understanding PostScript is going to take some study!
EDIT: and seeing as PDF's page-drawing operators are derived from PostScript (although a PDF is not literally compressed PostScript), something like RoPS could be handy too.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I've tried using LaTeX and DocBook for documenting programming tools, to get PDF output. What I've found is that these tools are excellent in some ways - easily versioned, and generating very usable PDF manuals. But there is a serious flaw. Code-snippets cannot simply be cut-and-pasted out of the PDF.
With DocBook, the problem is the loss of whitespace - mostly for indentation, but any repeated spaces seem to get stripped out. So, once you paste the snippet into a text editor, you'll need to clean up the indentation and vertical alignment. Not too much hassle for two or three lines, but it quickly gets annoying.
With LaTeX - well, it's a mess. The following was taken from a PDF generated using the LaTeX in MikTeX 2.8.
node myclas s
f f i e l d f i e l d 0 1 : i n t ;
f i e l d f i e l d 0 2 : ” char ” ;
g;
The intended example is...
node myclass
{
field field01 : int;
field field02 : "char*";
};
Other than the fact LaTeX plays with the quotes, the intended form is what you see in Adobe Reader - but not much like what you get from a cut-and-paste. Don't ask me what's going on with the spaces, or why the braces turned into letters, or what happened to the asterisk - I don't know!
Mostly, I've noticed these things playing with ways of keeping my own personal notes, and just went back to other ways. Some notes are in HTML or plain text, so I can version them. Others are in an old Journal program I've used for years. But I've written a tool that I may want to release soon - and I'll want to include a usable PDF manual, which will need to include examples.
So - is there a way of creating PDF documentation where the code snippets can be easily cut-and-pasted? Preferably a way that allows me to keep "sources" in versioned text files.
EDIT
Any solution must be portable. I will need to use it on Linux and on Windows XP.
EDIT
It looks like this may be impossible.
I've tried printing from Notepad++ to the Adobe Acrobat Pro 7 printer driver. The resulting document looked fine, but cutting and pasting gave the same missing whitespace problems as occur with DocBook.
I tried using the touchup text tool in Acrobat Pro to add leading spaces. These are preserved when you save and reload, but when you select text normally in Acrobat, they aren't included. As far as I can tell, you can only cut-and-paste including those spaces using the touchup text tool, which is obviously not included in Reader.
In other words, this looks like a fundamental limitation - not of the PDF format itself so much as the tools that work with it. There appears to be a general assumption at work here that whitespace is insignificant - which for my purposes obviously isn't true.
EDIT
One solution may be a "text field". I can add these fairly easily using Acrobat Pro, can set a fixed width font, enter multiple lines of text and make the field read only. In Acrobat Pro 7, the text in the field then isn't selectable - but in Reader 9 it is selectable and everything is preserved when you cut and paste.
The question is - can text fields be generated directly using some kind of markup language that is usable to create complete manuals?
I'd suggest enscript. I use it for producing archives and documentations.
Also, you can merge multiple source files that were converted to PostScript with enscript into a single PDF.
If your code is kept in external files, one way would be to attach the original file(s) as PDF attachments. This could be done with Docbook, LaTeX, DITA, and a few others.
For example, if you are using this method to include code in Docbook, you can write some code to your XSL customization layer for adding the external file as an attachment to the PDF. As far as I know, this is portable (although I haven't personally tried to open PDF files with attachments in Evince, Okular, Xpdf, etc to see what happens).
Even if you are processing the Docbook files using FOP, you should still be able to write something into your customization layer to attach files. See the section on PDF attachments. You could even output a link to the attachment below the codeblock in the PDF if you want to make it more discoverable to people.
A similar solution should be possible using LaTeX with the attachfile package.
Closed 2 years ago.
For a small project I have to parse pdf files and take a specific part of them (a simple chain of characters). I'd like to use python to do this and I've found several libraries that are capable of doing what I want in some ways.
But now, after some research, I'm wondering what the real structure of a PDF file is. Does anyone know if there is a spec or some explanation anywhere online? I've found a link on Adobe's site, but it seems to be a dead link :(
Here is a link to Adobe's reference material
http://www.adobe.com/devnet/pdf/pdf_reference.html
You should know though that PDF is only about presentation, not structure. Parsing will not come easy.
I found the GNU Introduction to PDF to be helpful in understanding the structure. It includes an easily readable example PDF file that they describe in complete detail.
Other helpful links:
PDF Succinctly book is longer and has helpful pictures.
Introduction to the Insides of PDF is a presentation that isn't as in-depth but gives a quick overview and has lots of pictures.
When I first started working with PDF, I found the PDF reference very hard to navigate.
It might help you to know that the overview of the file structure is found in syntax, and what Adobe call the document structure is the object structure and not the file structure. That is also found in Syntax. The description of operators is hidden away in Appendix A - very useful for understanding what is happening in content streams. If you ever have the pain of working with colour spaces you will find that hidden in Graphics! Hopefully these pointers will help you find things more quickly than I did.
If you are using windows, pdftron CosEdit allows you to browse the object structure to understand it. There is a free demo available that allows you to examine the file but not save it.
Here's the raw reference of PDF 1.7, and here's an article describing the structure of a PDF file. If you use Vim, the pdftk plugin is a good way to explore the document in an ever-so-slightly less raw form, and the pdftk utility itself (and its GPL source) is a great way to tease documents apart.
I'm trying to do pretty much the same thing. The PDF reference is a very difficult document to read. This tutorial is a better start I think.
This may help shed a little light:
(from page 11 of PDF32000.book)
PDF syntax is best understood by considering it as four parts, as shown in Figure 1:
• Objects. A PDF document is a data structure composed from a small set of basic types of data objects. Sub-clause 7.2, "Lexical Conventions," describes the character set used to write objects and other syntactic elements. Sub-clause 7.3, "Objects," describes the syntax and essential properties of the objects. Sub-clause 7.3.8, "Stream Objects," provides complete details of the most complex data type, the stream object.
• File structure. The PDF file structure determines how objects are stored in a PDF file, how they are accessed, and how they are updated. This structure is independent of the semantics of the objects. Sub-clause 7.5, "File Structure," describes the file structure. Sub-clause 7.6, "Encryption," describes a file-level mechanism for protecting a document’s contents from unauthorized access.
• Document structure. The PDF document structure specifies how the basic object types are used to represent components of a PDF document: pages, fonts, annotations, and so forth. Sub-clause 7.7, "Document Structure," describes the overall document structure; later clauses address the detailed semantics of the components.
• Content streams. A PDF content stream contains a sequence of instructions describing the appearance of a page or other graphical entity. These instructions, while also represented as objects, are conceptually distinct from the objects that represent the document structure and are described separately. Sub-clause 7.8, "Content Streams and Resources," discusses PDF content streams and their associated resources.
Looks like navigating a PDF file will require a little more than a passing effort.
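As a first step into the "file structure" part, here is a minimal Python sketch. The function name is mine, and it only handles the simple, unencrypted case. It reads the two anchors every reader starts from: the version in the header and the startxref offset near the end of the file:

```python
import re


def read_file_structure(data: bytes) -> dict:
    """Return the header version and the last startxref offset, if present."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # A file with incremental updates has several startxref sections;
    # the one closest to the end of the file is the one to start from.
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "startxref": int(offsets[-1]) if offsets else None,
    }
```

The returned offset points at the cross-reference section, which in turn says where each numbered object starts.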
If you want to parse PDF using Python, please have a look at PDFMiner. In my experience it is the best library for parsing PDF files to date.
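A minimal sketch of that approach follows. It assumes the pdfminer.six package is installed and uses its extract_text helper; find_first and the example pattern are my own illustration of pulling one specific string out of the result:

```python
import re


def extract_pdf_text(path):
    """Return all text of a PDF as one string (needs pdfminer.six installed)."""
    # Imported here so that find_first below works without the library.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def find_first(text, pattern):
    """Return the first match of a regular expression in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None
```

For example, find_first(extract_pdf_text("some_file.pdf"), r"Invoice No: \d+") would return the first invoice number found, where both the file name and the pattern are placeholders.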
Didier Stevens has a tool to parse PDF files:
http://didierstevens.com/files/software/pdf-parser_V0_4_3.zip
or here:
http://blog.didierstevens.com/programs/pdf-tools/ which catalogs several related PDF analysis tools.
Another tool is here:
http://mshahzadlatif.wordpress.com/2011/09/28/view-pdf-structure-using-adobe-acrobat-or-a-free-tool-called-pdfxplorer/
Extracting text from PDF is a hard problem because PDF has such a layout-oriented structure. You can see the docs and source code of my barely-successful attempt on CPAN (my implementation is in Perl). The PDF data structure is very cool and well designed, but it's easier to write than read.
One way to get some clues is to create a PDF file consisting of a blank page. I have CutePDF Writer on my computer, and made a blank Wordpad document of one page. Printed to a .pdf file, and then opened the .pdf file using Notepad.
Next, use a copy of this file and eliminate lines or blocks of text that might be of interest, then reload in Acrobat Reader. You'd be surprised at how little information is needed to make a working one-page PDF document.
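To show how little is needed, here is a Python sketch that writes a blank one-page PDF from scratch. This is my own minimal example, not the CutePDF output; the page size is US Letter and the function name is made up. The only fiddly part is that the xref table must contain the exact byte offset of each object:

```python
def build_minimal_pdf() -> bytes:
    """Assemble a blank one-page PDF, recording object offsets for the xref table."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where 'N 0 obj' starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # object 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset   # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the result with open("blank.pdf", "wb").write(build_minimal_pdf()) gives a file you can open in a text editor and compare against the reference manual line by line.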
I'm trying to make up a spreadsheet to create a PDF form from code.
You need the PDF Reference manual to start reading about the details and structure of PDF files. I suggest starting with version 1.7.
On windows I used a free tool PDF Analyzer to see the internal structure of PDF files.
This will help in your understanding when reading the reference manual.
(I'm affiliated with PDF Analyzer, no intention to promote)
To extract text from a PDF, try this on a Linux, BSD, etc. machine, or use Cygwin if on Windows:
pdftotext -layout some_pdf_file.pdf
A plain text file named some_pdf_file.txt is created. The simpler the PDF file layout, the more straightforward the .txt file output will be.
Octal escape codes (a backslash followed by three digits) are frequently present in the .txt file output and will look strange in text editors. These codes usually represent curly single and double quotes, bullet points, hyphens, etc. in the PDF.
To see the context where the octal codes appear, run this grep command, and keep the original PDF handy to see what character the codes represent in the PDF:
grep -a --color=always "\\\\[0-9][0-9][0-9]" some_pdf_file.txt
This will provide a unique list of the different octal codes in the document:
grep -ao "\\\\[0-9][0-9][0-9]" some_pdf_file.txt|sort|uniq
To convert these octal codes to their character equivalents, a combination of grep, sed, and bc can be used. I'll post the procedure to do that soon.
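In the meantime, here is a rough Python alternative to the grep/sed/bc route. It is a sketch under one big assumption: that the codes are single bytes in the Windows-1252 encoding, which is where curly quotes and bullets usually live. If your file uses another encoding, pass a different one:

```python
import re


def decode_octal_escapes(text: str, encoding: str = "cp1252") -> str:
    """Replace \\NNN octal escapes with the character they stand for."""
    def replace(match):
        value = int(match.group(1), 8)
        if value > 0xFF:
            return match.group(0)  # not a single byte, leave it untouched
        # Assumption: one byte in the given encoding per escape.
        return bytes([value]).decode(encoding, errors="replace")

    return re.sub(r"\\([0-7]{3})", replace, text)
```

With the default encoding, a line containing it\222s comes back with a curly apostrophe in place of the code.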