PDF content is not enough to reconstruct the PDF?

I open a PDF file "test.pdf" with Vim and copy its contents to another text buffer, which I save as "copy.pdf". I don't understand why "copy.pdf" is different: it can be opened as a PDF (the title shows), but the page is empty.
The same happens when I read the file in JavaScript with FileReader.readAsBinaryString and write it back to disk, so it is not related to how I copy in Vim.
Even stranger, the Finder says that the copy is actually 30KB bigger.
Where are the hidden bytes?

Usually when I see this sort of behavior and resulting blank pages, it is the result of using a program or process that is treating the binary information of a PDF as text in some form or another - for example, doing CR/LF conversion, tab to space conversion or interpreting the data as UTF-8 instead of binary. Doing any sort of transformation will ruin the binary streams within a PDF and will cause the offset bytes in the cross-reference table to become incorrect, causing the PDF to be unreadable.
Perhaps your process of writing back to disk is doing CR/LF conversion or otherwise treating your binary blob as non-binary?
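One possible source of the "hidden bytes", sketched in Python under the assumption that the binary string gets written out as UTF-8 text (the sample bytes below are made up, not taken from test.pdf): every byte from 0x80 to 0xFF turns into two bytes, so the copy grows and the offsets shift.

```python
# Made-up sample: a PDF header followed by a few high bytes, similar to
# what appears inside compressed PDF streams.
original = b"%PDF-1.4\n" + bytes([0x80, 0xE2, 0xFF, 0x0D, 0x0A])

# Treat the bytes as text: Latin-1 maps each byte to one character, and
# the text is then saved as UTF-8, as many editors and string-based file
# APIs do by default.
as_text = original.decode("latin-1")
mangled = as_text.encode("utf-8")

# Each byte in the range 0x80..0xFF becomes two bytes in the UTF-8 copy.
high_bytes = sum(1 for b in original if b >= 0x80)
print(len(original), len(mangled), high_bytes)


def copy_binary(src_path, dst_path):
    """Byte-for-byte copy: both files are opened in binary mode."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read())
```

A copy made with something like `copy_binary` (a hypothetical helper name) should have exactly the same size as the original, because nothing ever interprets the bytes as characters.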

Related

TJ and Tj operators showing garbage values after decoding

I have used the Python zlib library to decode streams that were compressed with FlateDecode. Until now, every PDF file I have worked with showed correct values in the Tj and TJ operators, but with this PDF the decoded text does not match what is displayed.
I am able to copy text from the PDF to Notepad without any issue, and pdftotext also gives the expected results with the correct words as output.
I have also used Adobe Preflight to inspect the document's internal structure and double-check the decoded text I get via zlib, but even that shows garbage values that don't match what's displayed in the PDF.
Why do I get these garbage values in the text operators, and how is pdftotext still able to get the correct results?
Also, how do I get correct results via Python/zlib?
The values in the TJ/Tj operators are PDF codepoints (normally one byte, sometimes two). You will need to see which font is in operation, then read the font encoding (there are many kinds). PDF text extraction is very hard. I wouldn't advise trying it yourself.
You have been lulled into a false sense of security by seeing PDF files in which the PDF codepoints happen to be exactly the same as the Unicode codepoints they represent, i.e. you have been looking at files which use simple font encodings.
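To make the codepoint idea concrete, here is a minimal sketch of what a text extractor does when the font carries a ToUnicode CMap. The CMap fragment, the code values and the function names are all invented for illustration; a real extractor also has to handle bfrange sections, multi-byte codes, and fonts with no ToUnicode entry at all.

```python
import re

# Hypothetical ToUnicode CMap fragment for a subsetted font.
TOUNICODE = """
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <0021>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Map each character code (int) to the Unicode string it stands for."""
    mapping = {}
    pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", cmap_text)
    for src, dst in pairs:
        # Destination values in a ToUnicode CMap are UTF-16BE code units.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_tj(hex_operand, mapping, code_bytes=1):
    """Decode a Tj operand given as hex digits, e.g. '010203'."""
    raw = bytes.fromhex(hex_operand)
    codes = [int.from_bytes(raw[i:i + code_bytes], "big")
             for i in range(0, len(raw), code_bytes)]
    # Codes missing from the map become U+FFFD REPLACEMENT CHARACTER.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(TOUNICODE)
print(decode_tj("010203", mapping))
```

Read directly as bytes, the operand `<010203>` is three control characters, which is the kind of "garbage" zlib output shows; only the font's own mapping turns it back into readable text.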

Source PDF Code Edited and It Is Not Visible

I just edited the PDF source code with a text editor, and after saving it none of the content was visible. I changed it back to what it was before, but the content is still not visible.
Could someone help me?
What do you mean by 'source code'?
Do you mean you opened a PDF file in a text editor and tried to alter it? That is pretty much doomed to failure. PDF is a binary format, and various parts of it are referenced by a cross-reference table, which points to precise offsets within the file. If you edit the PDF file in a text editor, then the editor may do any of the following:
1) Convert line endings between CR, LF, and CR/LF pairs.
2) Mangle 8-bit binary data into 'something else', which can include a local code page, depending on the editor you used.
3) Alter the offset of a critical object.
All of the above will cause the PDF file to be broken. Almost all of these changes are invisible, so if you 'change it as it was before' you probably didn't, you just changed the visible differences.
If the file is broken, then all you can do is replace it with your backup. You did back up the file before you started editing it, right?
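If you want to see the invisible differences for yourself, compare the two files byte by byte rather than by eye. A small sketch (the function name and the sample data are made up for illustration):

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Made-up data: both versions look the same in most editors, but the
# line endings differ, so every later byte offset has moved.
before = b"1 0 obj\r\n<< >>\r\nendobj\r\n"
after = b"1 0 obj\n<< >>\nendobj\n"
print(first_difference(before, after), len(before), len(after))
```

For real files, read both in binary mode, for example `open(path, "rb").read()`, before passing the contents to the function.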

PDF editing directly then deleting edits still leaves pdf corrupted

My PDF looked fine until I edited it, and now it still appears to be corrupted even after I took out my edits. A file diff program is saying that the two files are the same, but only one is displaying the information.
To reproduce:
1) Open PDF and make sure there is stuff inside of it
2) Open PDF in a text editor and add text at the top
3) Open PDF normally and it is empty
4) delete text added in step 2
5) PDF is still corrupted despite having SAME file contents
This also happens if I literally copy and paste the code from a PDF into a different file and try to open that. It won't open.
Is there any way to add text to a PDF and have it not corrupt?
PDF is a binary format. Even if it looks quite text'ish, it is not text. In particular PDF files usually contain binary data streams, e.g. for images or embedded fonts or compressed arbitrary content. Furthermore, PDFs rely on PDF objects starting at offsets noted in a cross reference table or stream in the file.
Many text editors, though, do not only apply the changes you type in to a document but also do other stuff, like unifying line breaks (DOS CRLF or Unix LF or Mac CR), replacing byte sequences they could not interpret by a special character (e.g. the Unicode REPLACEMENT CHARACTER) or dropping them altogether, etc.
The former (unifying line breaks) moves the data without updating the cross reference information, rendering it useless. If the bytes interpreted as line break characters were actually parts of binary stream data, the stream data also is damaged.
The latter (byte sequence replacement) usually damages contents of streams in the PDF with compressed data or other sensitive binary data beyond repair. Depending on the sequence length, this also moves data and so invalidates cross references.
Thus, using a text editor to edit a PDF usually is a sure way to break a PDF.
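The effect on the cross references can be reproduced in a few lines. The sketch below builds a tiny PDF-like byte string with a classic cross-reference table, checks that every offset lands on its object, and then repeats the check after an LF to CRLF conversion. It is a toy under stated assumptions: both helper names are made up, and it ignores cross-reference streams, incremental updates and everything else a real parser must handle.

```python
import re


def build_tiny_pdf() -> bytes:
    """Assemble a minimal PDF-like byte string with a classic xref table."""
    objects = [b"<< /Type /Catalog /Pages 2 0 R >>",
               b"<< /Type /Pages /Kids [] /Count 0 >>"]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def xref_is_consistent(data: bytes) -> bool:
    """True if startxref points at 'xref' and each in-use entry points at its object."""
    found = re.search(rb"startxref\s+(\d+)", data)
    if found is None:
        return False
    start = int(found.group(1))
    header = re.match(rb"xref\s+(\d+)\s+(\d+)\s+", data[start:])
    if header is None:
        return False
    first, count = int(header.group(1)), int(header.group(2))
    rows = re.findall(rb"(\d{10}) (\d{5}) ([nf])",
                      data[start + header.end():])[:count]
    for num, (off, gen, kind) in enumerate(rows, start=first):
        marker = b"%d %d obj" % (num, int(gen))
        if kind == b"n" and not data[int(off):].startswith(marker):
            return False
    return True


pdf = build_tiny_pdf()
converted = pdf.replace(b"\n", b"\r\n")  # what a "helpful" editor might do
print(xref_is_consistent(pdf), xref_is_consistent(converted))
```

The converted copy contains exactly the same visible text, yet the stored offsets no longer point where they claim to.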
Is there any way to add text to a PDF and have it not corrupt?
Yes, using PDF-aware software, e.g. Adobe Acrobat, but there are others as well. If you prefer a programming approach, use a good general-purpose PDF library. There are such libraries for many programming platforms.
For a very few types of changes, one can also use a hex editor (only replacing some bytes, not inserting or removing anything), but you really should know what you are doing.

PDF data extraction gives symbols/gibberish?

I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file though, gives me gibberish output that I can't even copy and paste into this textarea correctly. All sorts of unicode weirdness.
If I copy and paste as per normal into excel/notepad I get the same issue.
I assume it's something to do with a messed up character encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of SW to do it.
Thanks!
There are an increasing number of PDF files that use subsetted fonts, which basically amounts to a custom encoding. Normally the font dictionary in the PDF should have a ToUnicode table that allows text extraction to decode the font encoding and return the correct text.
Some PDF producers do this on purpose to prevent easy text extraction from documents such as financial reports. If there is only one font, you could decode it manually, but in my experience I have seen PDFs with multiple random encodings, which makes it nearly impossible to decode automatically.
One way to test for these types of PDFs is to open the file in Acrobat, select some text, copy it and then paste it into Notepad. If the text is garbled, then the PDF is using a subsetted font without usable encoding information and there is not much more you can do. If Acrobat can't extract the text correctly, then nothing else can. It may as well be a page of hieroglyphs.

How to change value of a textbox in a pdf

I have to make several certificates with the same design but different names. So I've tried to make an uncompressed pdf file with a place holder text and tried to change it with a text editor. For some reason it didn't work. I could only see a single letter of the replaced text.
When I try the same thing with an EPS file, it works, but since EPS doesn't keep (AFAIK) page orientation, there is a chance that something will be different with different names.
Does anyone know why this didn't work or how to change a text box in a pdf file (with sed)?
(I created the master pdf with Illustrator CS4)
Thank you
In general, editing PDFs in a text editor is a Bad Idea. PDFs depend on the byte offsets of various objects to not move.
If you KNOW your editor won't change the EOL bytes (or what it thinks are eol bytes), and you DO NOT change the length of the text entry's object as a whole, you're okay.
For example:
1 0 obj
<</Type/Annot/Subtype/Widget/V(PlaceHolder Value)/T(Field Title)...>>
endobj
If your new value is longer than "PlaceHolder Value", you're screwed. A shorter value has to be padded to the same length so that nothing after it moves.
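A rough sketch of that same-length rule in Python, assuming an uncompressed PDF in which the value is stored as a plain ASCII literal string. The function name and the sample bytes are made up, and the padding spaces become part of the field value:

```python
def replace_same_length(data: bytes, old: bytes, new: bytes) -> bytes:
    """Overwrite `old` with `new`, padded with spaces so no byte offset moves."""
    if len(new) > len(old):
        raise ValueError("new value is longer than the placeholder")
    if data.count(old) != 1:
        raise ValueError("placeholder must occur exactly once")
    patched = data.replace(old, new.ljust(len(old), b" "))
    # The total length must be unchanged, or the xref offsets break.
    assert len(patched) == len(data)
    return patched


# Simplified, made-up fragment of a field dictionary.
sample = b"<</V(PlaceHolder Value)/T(Field Title)>>"
print(replace_same_length(sample, b"PlaceHolder Value", b"Jane Doe"))
```

The file has to be read and written in binary mode for this to help; otherwise the editor-style conversions described above apply again on the way out.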
Most PDFs contain quite a bit of compressed binary data. Some of that data WILL be misinterpreted as EOL characters. Changing them will:
a: break your compressed stream
b: possibly change the byte offsets of the rest of the PDF.
When I hack on PDF files, I always use a hex editor.
Bottom Line: Don't mess with PDFs as a text stream. Mess with them as PDF files, using a PDF library. There's sure to be one capable of altering form field values in your language of choice.
You can also look into FDF and XFDF to see if they'll suit you better. Both file formats store field/value pairs and a reference to the form to use with those pairs. FDF uses PDF's syntax, while XFDF is an XML grammar. You can serve the [X]FDF to your end user and they will see the filled-in form.
WARNING: Unless the form is Reader Enabled (requires Acrobat (pro?)), they won't be able to save the version of the form they get after opening the [X]FDF, only view/print it. Of course they can save the [X]FDF, but many users might balk at this Strange New Format.
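For the FDF route, a minimal generator might look like the sketch below. It assumes ASCII-only field names and values, and the form path "certificate.pdf" is a placeholder; check the output against your own form before relying on it.

```python
def pdf_string(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def make_fdf(fields: dict, form_path: str) -> bytes:
    """Build a minimal FDF document that fills `fields` in the form at `form_path`."""
    entries = "".join(
        "<< /T %s /V %s >>\n" % (pdf_string(name), pdf_string(value))
        for name, value in fields.items()
    )
    body = (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /F " + pdf_string(form_path) + " /Fields [\n"
        + entries +
        "] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )
    # ASCII only in this sketch; other characters raise UnicodeEncodeError.
    return body.encode("ascii")


# Hypothetical usage: one FDF per certificate, all pointing at the same form.
print(make_fdf({"Field Title": "Jane Doe"}, "certificate.pdf").decode("ascii"))
```

Generating one small FDF per name leaves the master PDF untouched, so none of the offset problems above come into play.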