I'm trying to hide a file inside the code of a PDF file. I've already searched for information to help me. I uncompressed the PDF using pdftk ( pdftk pdf.pdf output uncompress.pdf uncompress ). Then I tried different things, such as:
Insert a comment: I put " %TEXT_TO_HIDE " in the uncompressed PDF file's code.
Add a new object: I put " 0 0 obj << TEXT_TO_HIDE << endobj " in the uncompressed PDF file's code.
Modify an existing object.
Then I compressed it again using pdftk.
In each case, I get a new PDF that looks different from the original. It's not corrupted, but images have different colors and some of the original text is missing.
So, do you know any rules for changing a PDF's code without anyone noticing?
(PS: Sorry if my English is bad ^^ )
In general, you cannot modify a PDF file in a text editor and expect the file to remain compliant. PDF is a binary format and you need to read the PDF specification to figure out how to modify it.
That said, there are heaps of places where you can "hide" information in a PDF document. The real question is how much data you want to hide, and to what purpose. The purpose typically determines how secure exactly this needs to be.
As some examples:
1) PDF allows embedding complete files in the actual PDF file. This is not really secure as anyone with decent software can extract these files (but the file itself could still be secured of course).
2) PDF allows adding arbitrary objects anywhere (or almost anywhere) in the file. This is a great way to hide information, but someone with the right tools can browse the object tree (even if the file is compressed) and see what you did.
3) PDF allows adding for example white text on a white background or text behind other objects. Again, there are ways around this for people with the right software.
4) Adobe's PDF spec allows at least 1K of fluff after the %%EOF marker (although ISO 32000 does not). Keep in mind that this is visible to anyone opening the file with a decent text or binary editor. (Thanks Jongware).
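To illustrate how visible the data from point 4 is, here is a minimal Python sketch that returns whatever follows the last %%EOF marker. The function name and the sample bytes are made up for illustration, and the sample is not a valid PDF.

```python
def trailing_bytes_after_eof(data: bytes) -> bytes:
    """Return the bytes that follow the last %%EOF marker.

    Returns an empty byte string if there is no marker or nothing after it.
    """
    marker = b"%%EOF"
    pos = data.rfind(marker)
    if pos == -1:
        return b""
    # Drop the line break that normally terminates the marker line.
    return data[pos + len(marker):].lstrip(b"\r\n")


# Toy input that only mimics the end of a PDF file.
sample = b"%PDF-1.4\n...\n%%EOF\nhidden payload"
print(trailing_bytes_after_eof(sample))
```

Anyone can run a check like this, which is why this spot offers no real secrecy.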
In short, you need to define much better what exactly you want to accomplish and how "secure" secure is in your use case.
You should also consider how "robust" the method must be. Should someone be able to save your PDF file with Acrobat for example with the hidden code intact? Some of the above methods may not be robust enough to ensure that with absolute certainty.
I would like to read a PDF file as text (PostScript), add new objects to the file structure and save the final output as a new PDF. But if I just copy the PDF's PostScript content and paste it into a newly created PDF file (where encoding='ansi'), the file doesn't work.
I am sure that this may be an encoding issue, but I am not sure what I should do to get a valid PDF file after manipulating the original PostScript content.
Here is the piece of code that didn't work for me:
pdf_file = open('Input.pdf', 'r', encoding='ansi').read()
pdf_file_bytes = bytearray(pdf_file, 'ansi')
pdf_file = open('Output_bytes.pdf', 'wb').write(pdf_file_bytes)
And as I said, the output PDF is not valid!
First problem: the content of a PDF file is PDF, not PostScript.
Secondly, PDF is a binary file format, so if you copy and paste it, any kind of translation (such as CR/LF conversion) will break it.
You haven't said what programming language your code uses, though it looks like Python. If it is Python then reading the file as binary instead of text might help.
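Assuming it is Python, a binary round trip could look like the sketch below. The helper name copy_pdf_bytes and the temporary demo file are my own inventions; the point is only that 'rb'/'wb' style access leaves every byte untouched, with no decoding and no line-ending translation.

```python
import tempfile
from pathlib import Path


def copy_pdf_bytes(src, dst) -> int:
    """Copy a file byte for byte and return the number of bytes written."""
    data = Path(src).read_bytes()   # same as open(src, 'rb').read()
    Path(dst).write_bytes(data)     # same as open(dst, 'wb').write(data)
    return len(data)


# Demo with a stand-in file (not a real PDF) containing CR/LF pairs and
# bytes that are not valid text in most encodings.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Input.pdf"
    dst = Path(tmp) / "Output_bytes.pdf"
    src.write_bytes(b"%PDF-1.4\r\n\x80\x81\xfe\xff stream bytes\r\n%%EOF\r\n")
    copy_pdf_bytes(src, dst)
    print(src.read_bytes() == dst.read_bytes())
```

This only gives you an intact copy. Adding objects on top of that still requires updating the cross reference data, which is what a PDF library does for you.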
PDF is a complex file format consisting of various objects. Unless you understand the low-level syntax of the PDF specification thoroughly, it will be difficult to impossible to arbitrarily replace some bytes with other bytes and still end up with a valid PDF file.
More to the point, what are you trying to accomplish? There may be a high-level way of doing whatever you're trying to do that doesn't involve manipulating PDF syntax directly, e.g. if you need to modify a font, add an annotation, set the PDF version, etc. Otherwise, if you actually need to modify PDF syntax, you need to use a library capable of dealing with low-level objects.
My PDF looked fine until I edited it, and now it still appears to be corrupted even after I took out my edits. A file diff program is saying that the two files are the same, but only one is displaying the information.
To reproduce:
1) Open PDF and make sure there is stuff inside of it
2) Open PDF in a text editor and add text at the top
3) Open PDF normally and it is empty
4) Delete the text added in step 2
5) PDF is still corrupted despite having the SAME file contents
This also happens if I literally copy and paste the code from a PDF into a different file and try to open that. It won't open.
Is there any way to add text to a PDF and have it not corrupt?
PDF is a binary format. Even if it looks quite text'ish, it is not text. In particular PDF files usually contain binary data streams, e.g. for images or embedded fonts or compressed arbitrary content. Furthermore, PDFs rely on PDF objects starting at offsets noted in a cross reference table or stream in the file.
Many text editors, though, do not only apply the changes you type in to a document but also do other stuff, like unifying line breaks (DOS CRLF or Unix LF or Mac CR), replacing byte sequences they could not interpret by a special character (e.g. the Unicode REPLACEMENT CHARACTER) or dropping them altogether, etc.
The former (unifying line breaks) moves the data without updating the cross reference information, rendering it useless. If the bytes interpreted as line break characters were actually parts of binary stream data, the stream data also is damaged.
The latter (byte sequence replacement) usually damages contents of streams in the PDF with compressed data or other sensitive binary data beyond repair. Depending on the sequence length, this also moves data and so invalidates cross references.
Thus, using a text editor to edit a PDF usually is a sure way to break a PDF.
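The effect of unifying line breaks can be sketched in a few lines of Python. The byte string below is a toy that only mimics PDF structure (it is not a valid PDF), and the variable names are mine.

```python
# Toy byte string that only mimics PDF structure; it is not a valid PDF.
original = (
    b"%PDF-1.4\r\n"
    b"1 0 obj\r\n<< /Type /Catalog >>\r\nendobj\r\n"
    b"2 0 obj\r\n<< >>\r\nendobj\r\n"
)

# What an editor does when it "unifies" DOS line breaks to Unix ones.
normalized = original.replace(b"\r\n", b"\n")

# A cross reference table would have recorded this offset for object 2.
offset_before = original.find(b"2 0 obj")
offset_after = normalized.find(b"2 0 obj")

print("recorded offset:", offset_before)
print("actual offset after normalization:", offset_after)
print("bytes at recorded offset now:", normalized[offset_before:offset_before + 7])
```

Every removed CR shifts everything after it by one byte, so the recorded offset no longer points at the start of the object.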
Is there any way to add text to a pdf and have it not corrupt?
Yes, using PDF-aware software, e.g. Adobe Acrobat, but there are others, too. If you prefer a programming approach, use a good general-purpose PDF library. There are such libraries for many programming platforms.
For a very few types of changes, one can also use a hex editor (only replacing some bytes, not inserting or removing anything), but you really should know what you are doing.
I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this and to do that, which helps some people and not others (nothing helped me). It looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World ?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, and to ensure that, it tests a whole range of properties of the file itself and the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example) but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
One example of such a PDF browser tool is RUPS, but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1,570 bytes, so this task does not look overly large.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines form 12 groups of drawing commands, each of which represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator; it fills the (closed) path constructed so far with black. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
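If you want to check such a group of path commands without plotting it, a rough Python sketch like the following computes its bounding box. It assumes one operator per line, as in the qpdf output quoted above, and only understands m and l; the function name is mine.

```python
# The group of drawing commands quoted above, one operator per line.
sample = """\
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
"""


def path_bounding_box(content: str):
    """Return (min_x, min_y, max_x, max_y) over all moveto/lineto points."""
    xs, ys = [], []
    for line in content.splitlines():
        tokens = line.split()
        # "x y m" and "x y l" carry a point; "h" only closes the path.
        if len(tokens) == 3 and tokens[2] in ("m", "l"):
            xs.append(float(tokens[0]))
            ys.append(float(tokens[1]))
    return min(xs), min(ys), max(xs), max(ys)


x0, y0, x1, y1 = path_bounding_box(sample)
print("width:", x1 - x0, "height:", y1 - y0)
```

Width and height come out practically equal, and the five points only ever use two distinct x and two distinct y values, so this path is a plain square, not a glyph outline.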
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1. Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
2. Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job of supporting almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.
I have a piece of software called PDF2XL which is normally great for extracting tables of data from PDF files. I've used it with hundreds of files before.
This one file though, gives me gibberish output that I can't even copy and paste into this textarea correctly. All sorts of unicode weirdness.
If I copy and paste as per normal into excel/notepad I get the same issue.
I assume it's something to do with a messed up character encoding header in the PDF file? How can I change this? I'm on Windows and have no software that can edit PDFs, so if I need to edit/re-save it, please recommend a free piece of SW to do it.
Thanks!
An increasing number of PDF files use subsetted fonts with what is basically a custom encoding. Normally the font dictionary in the PDF should have a ToUnicode table that allows text extraction to decode the font encoding and return the correct text.
Some PDF producers leave this out on purpose to prevent easy PDF text extraction for things such as financial reports. If there is only one font, you could decode it manually, but in my experience I have seen PDFs with multiple random encodings, which makes it nearly impossible to decode automatically.
One way to test for these types of PDFs is to open the file in Acrobat, select some text, copy it and then paste it into Notepad. If the text is garbled, then the PDF is using such a subsetted font and there is not much more you can do. If Acrobat can't extract the text correctly, then nothing else can. It may as well be a page of hieroglyphs.
I have to make several certificates with the same design but different names. So I tried to make an uncompressed PDF file with placeholder text and tried to change it with a text editor. For some reason it didn't work. I could only see a single letter of the replaced text.
When I try the same thing with an EPS file, it works, but since EPS doesn't keep (AFAIK) page orientation, there is a chance that something will be different with different names.
Does anyone know why this didn't work or how to change a text box in a PDF file (with sed)?
(I created the master pdf with Illustrator CS4)
Thank you
In general, editing PDFs in a text editor is a Bad Idea. PDFs depend on the byte offsets of various objects to not move.
If you KNOW your editor won't change the EOL bytes (or what it thinks are EOL bytes), and you DO NOT change the length of the text entry's object as a whole, you're okay.
For example:
1 0 obj
<</Type/Annotation/Subtype/Widget/V(PlaceHolder Value)/T(Field Title)...>>
endobj
If your new value is longer than "placeholder value", you're screwed.
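Here is a minimal Python sketch of a length-preserving replacement on raw bytes. The helper name is mine, and the object is a shortened toy version of the example above. It assumes the file is uncompressed, the placeholder occurs exactly once, and the new value needs no escaping (no parentheses or backslashes).

```python
def replace_same_length(data: bytes, old: bytes, new: bytes) -> bytes:
    """Replace old with new, padded with spaces so the byte count stays equal."""
    if len(new) > len(old):
        raise ValueError("replacement is longer than the placeholder")
    if data.count(old) != 1:
        raise ValueError("placeholder must occur exactly once")
    # Padding keeps every later byte offset where the xref table expects it.
    return data.replace(old, new.ljust(len(old), b" "))


# Toy object modelled on the example above; not a complete PDF.
obj = (
    b"1 0 obj\n"
    b"<</Type/Annotation/Subtype/Widget/V(PlaceHolder Value)/T(Field Title)>>\n"
    b"endobj\n"
)
patched = replace_same_length(obj, b"PlaceHolder Value", b"Jane Doe")
print(len(obj) == len(patched))
print(patched.decode("ascii"))
```

Note that the padding spaces become part of the field value, so this is a hack at best. It also does nothing for text drawn in a page content stream, where kerning and subsetted fonts get in the way.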
Most PDFs contain quite a bit of compressed binary data. Some of that data WILL be misinterpreted as EOL characters. Changing them will:
a: break your compressed stream
b: possibly change the byte offsets of the rest of the PDF.
When I hack on PDF files, I always use a hex editor.
Bottom Line: Don't mess with PDFs as a text stream. Mess with them as PDF files, using a PDF library. There's sure to be one capable of altering form field values in your language of choice.
You can also look into FDF and XFDF to see if they'll suit you better. Both file formats store field/value pairs and a reference to the form to use with those pairs. FDF uses PDF's syntax, while XFDF is an XML grammar. You can serve the [X]FDF to your end user and they will see the filled-in form.
WARNING: Unless the form is Reader Enabled (requires Acrobat (pro?)), they won't be able to save the version of the form they get after opening the [X]FDF, only view/print it. Of course they can save the [X]FDF, but many users might balk at this Strange New Format.
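For the XFDF route, a file with field/value pairs could be generated roughly as in this Python sketch. The element names (xfdf, f, fields, field, value) and the namespace reflect my understanding of the XFDF format, so check them against the specification. The form file name and the field name "Name" are placeholders I made up.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"


def build_xfdf(pdf_href: str, values: dict) -> bytes:
    """Build a small XFDF document holding field/value pairs for a form."""
    ET.register_namespace("", XFDF_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{XFDF_NS}}}xfdf")
    # Reference to the PDF form these values belong to.
    ET.SubElement(root, f"{{{XFDF_NS}}}f", {"href": pdf_href})
    fields = ET.SubElement(root, f"{{{XFDF_NS}}}fields")
    for name, value in values.items():
        field = ET.SubElement(fields, f"{{{XFDF_NS}}}field", {"name": name})
        ET.SubElement(field, f"{{{XFDF_NS}}}value").text = value
    return ET.tostring(root, encoding="utf-8")


# Hypothetical form file and field name, one XFDF per certificate.
doc = build_xfdf("certificate.pdf", {"Name": "Jane Doe"})
print(doc.decode("utf-8"))
```

Generating one such file per name avoids touching the PDF bytes at all, but it only works if the master PDF has a real form field rather than plain page text.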