Editing a PDF directly, then deleting the edits, still leaves the PDF corrupted

My PDF looked fine until I edited it, and it still appears to be corrupted even after I took my edits back out. A file diff program says that the two files are the same, but only one of them displays its content.
To reproduce:
1) Open the PDF and make sure there is content inside it
2) Open the PDF in a text editor and add text at the top
3) Open the PDF normally: it is now empty
4) Delete the text added in step 2
5) The PDF is still corrupted despite having the same file contents
This also happens if I literally copy and paste the code from a PDF into a different file and try to open that. It won't open.
Is there any way to add text to a PDF without corrupting it?

PDF is a binary format. Even though it can look quite text-like, it is not text. In particular, PDF files usually contain binary data streams, e.g. for images, embedded fonts, or arbitrary compressed content. Furthermore, PDFs rely on PDF objects starting at the exact byte offsets recorded in a cross-reference table or stream in the file.
Many text editors, though, do not only apply the changes you type into a document but also do other things, like unifying line breaks (DOS CR LF, Unix LF, or Mac CR), replacing byte sequences they cannot interpret with a special character (e.g. the Unicode REPLACEMENT CHARACTER), or dropping them altogether, etc.
The former (unifying line breaks) moves the data without updating the cross-reference information, rendering it useless. If the bytes interpreted as line break characters were actually part of binary stream data, the stream data is damaged as well.
The latter (byte sequence replacement) usually damages the contents of streams holding compressed or other sensitive binary data beyond repair. Depending on the sequence length, it also moves data and so invalidates cross references.
Thus, using a text editor to edit a PDF is usually a sure way to break it.
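To make the offset dependency concrete, here is a small Python sketch (the file name is a placeholder, and it assumes a classic cross-reference table rather than a cross-reference stream). It reads the startxref pointer near the end of the file and checks what actually sits at that offset; once an editor has shifted bytes around, this check fails:

# Sketch: check the startxref pointer of a PDF that uses a classic
# cross-reference table. "input.pdf" is a placeholder name.
with open("input.pdf", "rb") as f:
    data = f.read()

tail = data[-1024:]                      # "startxref" sits near EOF
idx = tail.rfind(b"startxref")
offset = int(tail[idx + len(b"startxref"):].split()[0])

# In an intact file this prints b'xref'; after a text editor has
# moved bytes, the offset points into the middle of something else.
print(data[offset:offset + 4])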
Is there any way to add text to a PDF without corrupting it?
Yes, by using PDF-aware software, e.g. Adobe Acrobat, but there are others too. If you prefer a programmatic approach, use a good general-purpose PDF library. Such libraries exist for many programming platforms.
For a very few types of changes, one can also use a hex editor (only replacing some bytes, not inserting or removing anything), but you really should know what you are doing.
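If you go the library route, here is a minimal sketch using the Python pypdf library (just one of many options; the file names are placeholders). It adds the new text as a free-text annotation and lets the library rewrite the cross-reference table consistently:

from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText

reader = PdfReader("original.pdf")       # placeholder file names
writer = PdfWriter()
writer.append(reader)                    # copy all pages over

# Put a free-text annotation on the first page; pypdf recomputes
# all byte offsets when it serializes the new file.
note = FreeText(
    text="Added safely",
    rect=(50, 700, 250, 750),            # x0, y0, x1, y1 in points
    font_size="14pt",
    font_color="000000",
)
writer.add_annotation(page_number=0, annotation=note)

with open("annotated.pdf", "wb") as f:
    writer.write(f)

Any comparable library (iText, PDFBox, etc.) offers an equivalent workflow in its own language.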

Related

PDF content is not enough to reconstruct the PDF?

I open a PDF file "test.pdf" with Vim and copy its content to another text buffer that I save as "copy.pdf". I don't understand why "copy.pdf" is different: it can be opened as a PDF (the title shows), but the page is empty.
The same happens when I read the file in JavaScript with FileReader.readAsBinaryString and rewrite it to disk, so it is not related to how I copy in Vim.
Even stranger, Finder says that the copy is actually 30 KB bigger.
Where are the hidden bytes?
Usually when I see this sort of behavior and the resulting blank pages, it is because a program or process treated the binary content of the PDF as text in some form or another - for example, doing CR/LF conversion, tab-to-space conversion, or interpreting the data as UTF-8 instead of binary. Any such transformation ruins the binary streams within a PDF and makes the byte offsets in the cross-reference table incorrect, rendering the PDF unreadable.
Perhaps your process of writing back to disk is doing CR/LF conversion or otherwise treating your binary blob as non-binary?
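As a quick illustration in Python (hypothetical file names): a byte-for-byte copy survives, while a text-mode round trip of the same data usually does not:

import shutil

# Binary-safe copy: every byte is preserved, the copy opens fine.
shutil.copyfile("test.pdf", "copy.pdf")

# Text-mode round trip: decoding fallbacks and newline translation
# silently rewrite bytes inside the compressed streams, so the
# result is typically blank or unreadable.
with open("test.pdf", "r", encoding="utf-8", errors="replace") as src:
    text = src.read()
with open("broken.pdf", "w", encoding="utf-8", newline="\r\n") as dst:
    dst.write(text)

The size growth you observed is consistent with this: multi-byte replacement characters and inserted CR bytes inflate the file.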

Enabling select and copy of text content in PDF

What can prevent PDF-1.4 document's content from being selectable and copyable?
I'm generating PDF-1.4 documents using TTF fonts, which are successfully embedded in them.
Yet I can't select and copy the text from the document. I have studied the PDF-1.4 spec and found only one mention of copy-protecting a document, which has encrypting it as a prerequisite. And I don't encrypt the document.
So, ideally, I'd like to discover an exhaustive list of reasons that can prevent PDF text from being copied, and ways to control that.
There is only one reason: you are embedding your fonts partially. The information you are storing there is the minimum required for drawing the glyphs, but it is not enough to allow text extraction. For example, in Acrobat Professional, optimizing a file to reduce its size will have this effect, since everything that is not strictly required for presenting the content is discarded.

PDF data extraction gives symbols/gibberish?

I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file, though, gives me gibberish output that I can't even copy and paste into this text area correctly. All sorts of Unicode weirdness.
If I copy and paste as normal into Excel/Notepad, I get the same issue.
I assume it's something to do with a messed-up character-encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of software to do it.
Thanks!
There are an increasing number of PDF files that use subsetted fonts, which is basically a custom encoding. Normally the font descriptor in the PDF should have a ToUnicode table so that text extraction can decode the font encoding and return the correct text.
Some PDF producers do this on purpose to prevent easy PDF text extraction for things such as financial reports. If there is only one font, you could manually decode it, but in my experience I have seen PDFs with multiple random encodings, which makes it nearly impossible to decode automatically.
One way to test for these types of PDFs is to open the file in Acrobat, select some text, copy it, and then paste it into Notepad. If the text is garbled, the PDF is using a subsetted font and there is not much more you can do. If Acrobat can't extract the text correctly, nothing else can; it may as well be a page of hieroglyphs.
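If you would rather script that test than click through Acrobat, a crude heuristic along these lines can flag suspicious files (Python with the pypdf library; the file name and the 0.7 threshold are arbitrary choices, and it assumes the document's real text is mostly Latin-based):

from pypdf import PdfReader

def looks_garbled(path, threshold=0.7):
    # Extract the first page's text and measure how much of it is
    # plain ASCII; mostly non-ASCII output suggests raw glyph codes.
    text = PdfReader(path).pages[0].extract_text() or ""
    if not text.strip():
        return True
    ascii_count = sum(1 for ch in text if ch.isascii())
    return ascii_count / len(text) < threshold

print(looks_garbled("report.pdf"))       # hypothetical file name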

How to change value of a textbox in a pdf

I have to make several certificates with the same design but different names. So I tried to make an uncompressed PDF file with placeholder text and to change it with a text editor. For some reason it didn't work: I could only see a single letter of the replaced text.
When I try the same thing with an EPS file it works, but since EPS (AFAIK) doesn't keep page orientation, there is a chance that something will be different with different names.
Does anyone know why this didn't work, or how to change a text box in a PDF file (with sed)?
(I created the master pdf with Illustrator CS4)
Thank you
In general, editing PDFs in a text editor is a Bad Idea. PDFs depend on the byte offsets of various objects not moving.
If you KNOW your editor won't change the EOL bytes (or what it thinks are EOL bytes), and you DO NOT change the length of the text entry's object as a whole, you're okay.
For example:
1 0 obj
<</Type/Annot/Subtype/Widget/V(PlaceHolder Value)/T(Field Title)...>>
endobj
If your new value is longer than "PlaceHolder Value", you're screwed.
Most PDFs contain quite a bit of compressed binary data. Some of that data WILL be misinterpreted as EOL characters. Changing them will:
a: break your compressed stream
b: possibly change the byte offsets of the rest of the PDF.
When I hack on PDF files, I always use a hex editor.
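That hex-editor approach can also be scripted, as long as the replacement is padded to exactly the placeholder's length so that no byte offsets shift. A Python sketch (file names and the new value are hypothetical; it only works on literal strings that are not inside a compressed stream):

old = b"(PlaceHolder Value)"
name = b"Jane Doe"                       # hypothetical new value
assert len(name) <= len(old) - 2, "new value must fit the placeholder"
new = b"(" + name.ljust(len(old) - 2) + b")"   # pad with spaces

with open("master.pdf", "rb") as f:
    data = f.read()
with open("patched.pdf", "wb") as f:
    f.write(data.replace(old, new, 1))   # same length, offsets intact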
Bottom Line: Don't mess with PDFs as a text stream. Mess with them as PDF files, using a PDF library. There's sure to be one capable of altering form field values in your language of choice.
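For instance, with the Python pypdf library, updating a form field's value is a few lines (the field name and file names are hypothetical):

from pypdf import PdfReader, PdfWriter

reader = PdfReader("certificate_template.pdf")
writer = PdfWriter()
writer.append(reader)

# Set the field by its /T title; the library rewrites all byte
# offsets itself when it saves, so nothing gets corrupted.
writer.update_page_form_field_values(
    writer.pages[0], {"Field Title": "Jane Doe"}
)

with open("certificate.pdf", "wb") as f:
    writer.write(f)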
You can also look into FDF and XFDF to see if they'll suit you better. Both file formats store field/value pairs and a reference to the form to use with those pairs. FDF uses PDF's syntax, while XFDF is an XML grammar. You can serve the [X]FDF to your end user and they will see the filled-in form.
WARNING: Unless the form is Reader Enabled (requires Acrobat (pro?)), they won't be able to save the version of the form they get after opening the [X]FDF, only view/print it. Of course they can save the [X]FDF, but many users might balk at this Strange New Format.

Copy+pasting text from PDF results in garbage

I am writing a Master's thesis - an NLP system. I have one component - an extractor.
It extracts plain text from PDF files. There are a few PDF files that cannot be extracted correctly. The extractor (the PDFBox library) returns a string like this:
"┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h"
or
"10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17"
I checked each file that causes this extraction problem, and the text of all these files also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). Viewing them in these readers works fine, but after selecting the content and copying it to the clipboard I get the same wrong text (as described above - strings of semantically meaningless characters, or strings of digits and letters).
Could anybody help me???
Very often in such cases, where you can't select and copy'n'paste text from the Acrobat (Reader) window, there is another option which may work nevertheless:
Open 'File' menu,
select 'Save as...',
select 'Text (normal) (*.txt)',
browse to the target directory,
type the name you want to use for the text file.
You'll have all the text from all pages in the file and will need to locate the spot you originally wanted to copy'n'paste -- so it is not as convenient as direct copy'n'paste. But it works more reliably...
It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).
Update
You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.
Here is an example output which demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository that was created to provide well-commented PDF sample files which can easily be opened in a text editor:
$ pdffonts textextract-bad2.pdf
name                            type         encoding    emb sub uni object ID
------------------------------- ------------ ----------- --- --- --- ---------
BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
How to interpret this table?
The above PDF file uses two subsetted fonts (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column), Helvetica and Helvetica-Bold.
Both fonts are of type TrueType.
Both fonts use a WinAnsi encoding (a font encoding maps char identifiers used in the PDF source code to glyphs that should be drawn).
However, only the /Helvetica font has a /ToUnicode table available inside the PDF (for /Helvetica-Bold there is none), as indicated by the yes/no entries in the uni column.
The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.
A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copied'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because this table may be damaged, incorrect or incomplete -- as seen in many real-world PDF files, and as also demonstrated by a few companion files in the above linked GitHub repository.)
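If pdffonts is not at hand, roughly the same information can be pulled out programmatically. Here is a sketch using the Python pypdf library (the file name is a placeholder) that walks each page's font resources and reports the subset prefix and /ToUnicode presence:

import re
from pypdf import PdfReader

reader = PdfReader("textextract-bad2.pdf")   # placeholder name
seen = set()
for page in reader.pages:
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")
    if fonts is None:
        continue
    for ref in fonts.get_object().values():
        font = ref.get_object()              # resolve indirect refs
        base = str(font.get("/BaseFont", "?"))
        if base in seen:
            continue
        seen.add(base)
        sub = bool(re.match(r"/?[A-Z]{6}\+", base))  # subset prefix
        print(base, "sub:", sub, "uni:", "/ToUnicode" in font)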
If you are able to successfully select and copy the text in Adobe Reader -- indicating that the PDF does contain text objects -- but you can't paste the copied text into Notepad without it looking like a bunch of garbage characters, then the problem is probably related to the CMap that the selected text uses.
The PDF specification provides many options for the display of textual content and the related extraction of the text content. A CMap specifies the mapping from character codes to character selectors. The PDF spec outlines some predefined CMaps, but other CMaps can also be embedded.
My guess is that either the CMap for this text is corrupt or that the PDFBox library doesn't support this particular CMap. I suggest trying a different SDK just to see if you get any different results.
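For example, a cross-check with the Python pypdf library takes only a few lines (the file name is hypothetical). If a second extractor returns the same digit-and-letter soup, the fault is almost certainly the PDF's CMap rather than PDFBox:

from pypdf import PdfReader

# Run the problem file through a different SDK and compare output.
reader = PdfReader("problem.pdf")
for i, page in enumerate(reader.pages):
    print("--- page", i + 1, "---")
    print(page.extract_text())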
When the PDF is opened as a Gmail attachment in Chrome (the built-in PDF viewer), copying does produce normal, readable characters!
It worked for me when I had this problem and for others as well. I think the Chrome PDF viewer uses the Google Drive OCR automatically... It's like magic!
What was the PDF created with? Some PDFs do not contain any encoding information, just the data needed to draw the glyphs, so there is no way to extract the text.
Select the text you wish to copy.
Right click
Choose option "Export Selection as"
In the dialog box, choose a file name and save the new file as Rich Text Format (RTF)
Open RTF to see your text!
The best way to deal with this is to convert the PDF file to Word using this website:
https://www.ilovepdf.com/pdf_to_word
The garbage issue will be fixed.
The best way to deal with this (assuming you have Adobe Acrobat or something similar; not sure if Reader can do this) is to save the doc as a JPEG. Then recompile all the images into a single PDF, then use the OCR function to find text in the pages, and then you can copy and paste the text.
PDF is not a text document. It's more of a vector graphics format that can sometimes contain text. So there are some documents from which you can't extract text unless you are willing to do OCR. That's just the way it is.