Problem with unlocking user-password secured PDF

I have a password-protected PDF file, and I know the correct user password. The problem is that I can only open it in Adobe Reader on Windows. Every other PDF viewer (and the Linux command-line tools for removing passwords) reports that the password is wrong.
Potential cause: the password is long (30 characters) and contains non-Latin (Polish) characters (such as ł ó ę ć ź ą). I tried things like a Unicode-to-ASCII converter, but that does not work.
Does anybody have an idea why it works only in Acrobat? I just want to open this document on Linux. Ideally I would remove the password.
EDIT: the document is secured with 128-bit AES; Acrobat mentions that it "can be opened by Acrobat 7.0 or newer". Printing, copying, etc. are not allowed.
EDIT2: thanks for the help in the comments. I tried SumatraPDF and it works, but it only allows printing to a PDF of non-searchable images.
I checked that it is based on the MuPDF engine, but MuPDF on Linux cannot deal with this file: it crashes.
SumatraPDF is open source; does anybody know how to modify it to print to PDF in the normal way?

SumatraPDF uses MuPDF as its engine for several formats, such as ePub, HTML and of course PDF. It can store (not remove) a known password as a hash, so there is no need to keep entering it for everyday reading or for adding comments to a PDF.
So if, as suggested by @mkl, using the password with local characters on a local PC works in SumatraPDF, it should also work in MuPDF-GL, which is a more basic viewer. Spoiler: I can certainly remove the password from my own simple challenge.pdf, which is encrypted with a 9-character password (8 sequentially alphabetic characters are a known semi-random sequence), and save it in MuPDF as unprotected.pdf, but nobody has cracked it yet :-)
However, MuPDF-GL has many more powerful abilities hidden under the surface.
Using MuPDF-GL, you should be able to open the file when it prompts for the password. Then press A, which starts the annotator (you don't need to add anything), and simply change the Save As settings.
Switch OFF incremental saving and set encryption to none before saving; if the file had errors, MuPDF will also have repaired whatever was needed in order to re-save it. There is no guarantee this will work in all cases, but it is worth a try.
If mupdf-gl does not work for you on Linux, you can try
MuTool:
mutool draw -p password -o unprotected.pdf protected.pdf
or qpdf, which can also rebuild a PDF with different restrictions given the correct input password(s):
qpdf --password='myverylongstring!"^$%' --decrypt protected.pdf unprotected.pdf
(the single quotes stop the shell from interpreting the special characters), or, if the password may cause command-line UTF problems, save it as the first line of a text file and use
qpdf --password-file=password.txt --decrypt protected.pdf unprotected.pdf
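If the mismatch comes from how the non-ASCII characters are turned into bytes, one thing to try is feeding the tool the same password in several byte encodings. The sketch below is only an illustration: the list of encodings, the function names and the file-name pattern are assumptions, and nothing here confirms which encoding the application that encrypted the file actually used.

```python
from pathlib import Path

# Assumption: encodings commonly used for Polish text. Which one (if any)
# the application that encrypted the PDF used is not known.
CANDIDATE_ENCODINGS = ("utf-8", "cp1250", "iso-8859-2")


def password_variants(password: str) -> dict[str, bytes]:
    """Encode the password in each candidate encoding that can represent it."""
    variants = {}
    for name in CANDIDATE_ENCODINGS:
        try:
            variants[name] = password.encode(name)
        except UnicodeEncodeError:
            # This encoding has no byte for at least one character; skip it.
            continue
    return variants


def write_password_files(password: str, directory: str) -> list[Path]:
    """Write one password file per encoding, with the password on the first line."""
    paths = []
    for name, raw in password_variants(password).items():
        path = Path(directory) / f"password-{name}.txt"  # hypothetical naming
        path.write_bytes(raw + b"\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Show how differently the same text is represented as bytes.
    for name, raw in password_variants("łóęćźą").items():
        print(f"{name:12} {len(raw):2d} bytes  {raw!r}")
```

Each generated file can then be tried in turn with the password-file form of the qpdf command.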
Lastly, if you wish to print a PDF file on Linux, you have two potential options as readers: an old Evince build works for me on 32-bit Windows, but on 64-bit I prefer the cutting-edge nightly Okular.

Related

Is it possible to obfuscate PDF file binary data?

Is it possible to obfuscate the bytes that are visible when a PDF file is opened with a hex editor? Also, I wonder if there is any problem in viewing the contents of the PDF file even if it is obfuscated.
You will always be able to see whatever bytes are within a file using a hex editor.
There might be ways to generate your PDF pages using methods that don't involve directly writing the text into the PDF (for example, using obfuscated JavaScript).
As answered above, the bytes of the file are always visible when it is viewed with a hex editor. However, there are some options to hide or protect data in the file:
You could encrypt either the whole PDF or parts of its data. Note that encryption/decryption always requires a secret. When the file is fully encrypted, you can't read it without the key.
You can add additional, similar data frames but mark them invisible in the PDF. Note that this technique inflates the size of the file.
You can use scripting languages that build up your PDF dynamically. Be aware that this could look suspicious to users or to anti-virus software.
You can use steganography tools to hide your data, for example steghide.
You can simply compress the data streams in the PDF, e.g. using gzip or similar compression tools. That way the content cannot be read directly. However, this is easy for anyone to recognize and decompress.
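To make the last point concrete, here is a small sketch of why compression hides text from a hex editor yet protects nothing. It uses Python's zlib module, which implements the same deflate algorithm as gzip and as PDF's FlateDecode filter; the sample content stream is made up for illustration.

```python
import zlib

# Hypothetical page content; real PDF content streams look broadly similar.
content = b"BT /F1 12 Tf (Secret invoice total: 42) Tj ET\n" * 20

compressed = zlib.compress(content)

# The readable text no longer appears in the stored bytes...
visible_before = b"Secret" in content
visible_after = b"Secret" in compressed

# ...but anyone can reverse the step, because no key is involved.
restored = zlib.decompress(compressed)

print("text visible in raw stream:       ", visible_before)
print("text visible in compressed stream:", visible_after)
print("round trip restores the content:  ", restored == content)
```

Because decompression needs no secret, this only deters casual inspection; encryption is the option that actually requires a key.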

I'd like to recognize the text of all pdfs on my computer and save them without moving them from their locations. Is it possible?

I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I start this process and it asks for the directory, I choose C:, my main hard drive.
It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I had removed all the PDFs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the Word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until the problem files were removed, even though no files remaining in the list were flagged as having issues.
My firm is trying to make sure all the PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without moving them from their various locations.
I think you can do this using a combination of:
regular Java: to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText: to iterate over the PDF document and extract all images
Tess4J: a port of Tesseract (Google's OCR engine) for Java, to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to pipe in all the files of a given directory.
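The file-listing step is the part that tripped up the Acrobat batch run, so here is a minimal sketch of it. It is written in Python rather than Java purely for brevity; the function names are hypothetical, and the `ocr` callable is a placeholder for whatever OCR step you choose (iText plus Tess4J, Tesseract, or something else), not a real API.

```python
import os
from pathlib import Path
from typing import Callable, Iterator


def find_pdfs(root: str) -> Iterator[Path]:
    """Yield every file under root whose name ends in .pdf (any letter case)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                yield Path(dirpath) / name


def process_in_place(root: str, ocr: Callable[[Path], None]) -> list[Path]:
    """Run the supplied OCR step on each PDF where it lives.

    Files that raise an error (for example password-protected ones) are
    collected and returned instead of stopping the whole batch.
    """
    failed = []
    for pdf in find_pdfs(root):
        try:
            ocr(pdf)
        except Exception:
            failed.append(pdf)
    return failed
```

Filtering by extension keeps Word documents out of the list, and collecting failures means one problem file does not block the rest.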

Pentaho doesn't generate PDF in UTF-8 encoding

I have a problem with PDF export in the Pentaho BI platform. I'm not able to produce a correct PDF file that is encoded in UTF-8 and contains Spanish characters. It works neither in the local Report Designer nor on the BI server. Special characters like 'ñ' or 'ç' are skipped in the PDF file. Generation in other formats (HTML, Excel, etc.) works just fine.
I've been struggling with this issue for a few days without finding any solution and would be grateful for any clue.
Thanks in advance
P.S. Report Designer and BI platform version 6.1.0.1
Seems like a font issue. Your font needs to support Unicode, and it needs to contain glyphs that specify how to "draw" the characters you want.
Office programs (at least MS Office) by default automatically select a font that can render each character (if font substitution is enabled). PDF readers don't do this: they always use the exact font you've specified.
When selecting an appropriate font, you have to pay attention to the Unicode characters it supports and to the font's license: some fonts don't allow embedding, and Pentaho embeds the subset of the font that was used into the generated PDF files if the encoding is UTF-8 or Identity-H.
To install fonts on a Linux server, copy the font files either to your java/lib/fonts/ folder or to /usr/share/fonts/, grant read rights to the server's user, and restart the server application.

Use ghostscript to delete a page (not extracting a range)

I know Ghostscript can use -dFirstPage and -dLastPage to make a file from only a range of pages, but I need it (or another command-line program) to delete the 2nd page of any PDF without the page range being given explicitly. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very, very messy) is to extract page 1, extract pages 3 to the end, and then merge the two PDFs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
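If the page to drop is not always the second one, the range arguments can be derived mechanically. The helper below is a hypothetical sketch (the function name and default file names are made up); it only builds the argument list and assumes the document has at least one page after the deleted one, since it relies on pdftk's `end` keyword instead of the page count.

```python
def pdftk_delete_page_args(page: int, src: str = "input.pdf",
                           dst: str = "output.pdf") -> list[str]:
    """Build a pdftk command line that keeps every page except `page`."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    ranges = []
    if page > 1:
        # Everything before the deleted page.
        ranges.append("1" if page == 2 else f"1-{page - 1}")
    # Everything after the deleted page, without knowing the page count.
    ranges.append(f"{page + 1}-end")
    return ["pdftk", src, "cat", *ranges, "output", dst]


if __name__ == "__main__":
    print(" ".join(pdftk_delete_page_args(2)))
```

The returned list can be handed to `subprocess.run(..., check=True)` when pdftk is installed.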
OK first things first, if you use Ghostscript's pdfwrite device you are NOT extracting, or deleting, or performing any other 'manipulation' operation on your source PDF file. I keep on reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript, it is completely interpreted into a series of graphical primitives, which are passed to the device. In general, the device renders the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file; in the case of pdfwrite, a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. You also don't need to bother: to drop page 2, run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files through pdfwrite together to create a single file with the pages from each combined.
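For anyone who still wants the Ghostscript route, the three passes can be written out as command lines. This is a sketch under assumptions: the function name and the intermediate file names are invented, the executable is assumed to be `gs` (on Windows it is typically `gswin64c`), and the second pass starts at page 3 because page 2 is the one being dropped. Keep in mind that every pass produces a newly written PDF, as explained above.

```python
def ghostscript_drop_page_two(src: str, dst: str,
                              part1: str = "part1.pdf",
                              part2: str = "part2.pdf") -> list[list[str]]:
    """Return the three Ghostscript invocations that omit page 2 of src."""
    base = ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH"]
    return [
        # Pass 1: page 1 only.
        base + ["-dLastPage=1", f"-sOutputFile={part1}", src],
        # Pass 2: page 3 to the end (LastPage deliberately not set).
        base + ["-dFirstPage=3", f"-sOutputFile={part2}", src],
        # Pass 3: feed both parts through pdfwrite to get one file.
        base + [f"-sOutputFile={dst}", part1, part2],
    ]


if __name__ == "__main__":
    for cmd in ghostscript_drop_page_two("input.pdf", "output.pdf"):
        print(" ".join(cmd))
```

Each list can be executed with `subprocess.run(cmd, check=True)`; printing them first makes it easy to check the arguments before running anything.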

Update a PDF to include an encrypted, hidden, unique identifier?

Background
The idea is this:
Person provides contact information for online book purchase
Book, as a PDF, is marked with a unique hash
Person downloads book
PDF passwords are easy to circumvent, or share
The ideal process would be something like:
Generate hash based on contact information
Store contact information and hash in database
Acquire book lock
Update an "include" file with hash text
Generate book as PDF (using pdflatex)
Apply hash to book
Release book lock
Send email with book download link
Technologies
The following technologies can be used (other programming languages are possible, but libraries will likely be limited to those supplied by the host):
C, Java, PHP
LaTeX files
PDF files
Linux
Question
What programming techniques (or open source software) should I investigate to:
Embed a unique hash (or other mark) to a PDF
Create a collusion-attack resistant mark
Develop a non-fragile (e.g., PDF -> EPS -> PDF still contains the mark) solution
Research
I have looked at the following possibilities:
Steganography
Natural Language Processing (NLP)
Convert blank pages in PDF to images; mark those images; reassemble PDF
LaTeX watermark package
ImageMagick
Issues
The possible solutions I have researched have the following issues:
Steganography. (a) Requires a master copy of the images, which are converted to EPS, which is CPU-intensive and time-consuming; (b) would the watermark survive PDF -> EPS -> PDF, or other types of conversion; (c) most images are drawings or screen captures, not photographs in PNG format.
LaTeX. Creates an image cache; any steganographic solution would have to intercept that process somehow.
NLP. Introduces grammatical errors; could change meaning of technical words.
Blank Pages. Immediately suspect; it is easy to replace suspicious blank pages.
Watermark Package. Draws visible marks.
ImageMagick. Draws visible marks.
What other solutions are possible?
Related Links
http://www.tcpdf.org/
invisible watermarks in images
Thank you!
I've done this for another project with PDFlib. We needed traceability for the generated PDFs in case the file was leaked. Basically:
Created a source template PDF with the content in place, and set the document master password with the required options (no edit, no print, no screen-reader, etc.)
At runtime, we applied a few watermarks (imposed page footer saying "This document checked out to user #12345", set a few of the metadata fields with user ID, download IP, download date/time, added a "this document copyright by..." cover page, etc...)
Optionally attach a user password to force a PW prompt when document is opened.
Since the latest PDF versions use AES-128 for their encryption, we just set a suitable randomly generated, high-entropy 128-character password. No one would ever be typing it in by hand, so being hard to type was irrelevant to us and actually preferable. The master password prevented end users from making any changes to the document. The various no-print/no-screen-read options are actually enforced by the PDF reader and are therefore bypassable, but it can't hurt to set them anyway.
The downside to this is that PDFlib's licensing is fairly steep. I don't know if any of the free PHP PDF libraries support the latest PDF encryption schemes, especially the master password stuff, but if your budget can support it, PDFlib is the way to go for secure document production.
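For the "generate hash based on contact information" step in the question, one possible sketch is a keyed hash (HMAC), so that a mark cannot be forged or recomputed by someone who only knows the buyer's details. Everything specific here is an assumption made for illustration: the field names, the separator, the e-mail normalisation and the choice of HMAC-SHA256. It is written in Python for brevity, although the question lists C, Java and PHP. It covers only the mark's value, not how to embed it, and does not address collusion resistance or surviving format conversion.

```python
import hashlib
import hmac


def purchase_mark(secret_key: bytes, name: str, email: str, order_id: str) -> str:
    """Derive a per-purchase identifier from contact details and a server-side key."""
    # Hypothetical field layout; the unit-separator character keeps
    # ("ab", "c") and ("a", "bc") from producing the same message.
    message = "\x1f".join((name, email.strip().lower(), order_id)).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"replace-with-a-long-random-server-secret"  # placeholder value
    print(purchase_mark(key, "Jane Doe", "jane@example.com", "order-0001"))
```

The resulting hex string is what would be stored in the database next to the contact information and then written into the "include" file or a metadata field.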