Background
The idea is this:
Person provides contact information for online book purchase
Book, as a PDF, is marked with a unique hash
Person downloads book
PDF passwords are easy to circumvent or share.
The ideal process would be something like:
Generate hash based on contact information
Store contact information and hash in database
Acquire book lock
Update an "include" file with hash text
Generate book as PDF (using pdflatex)
Apply hash to book
Release book lock
Send email with book download link
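The hash, lock, and include-file steps above could be sketched roughly as follows. This is only a sketch under assumptions: it uses a keyed hash (HMAC-SHA256 with a server-side secret) rather than a plain hash, so that a buyer cannot compute someone else's mark, and the names BuyerMark, markFor, writeInclude, and the \buyermark LaTeX macro are hypothetical. The file lock guards against other processes on the same host; threads inside one JVM would need their own synchronization.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class BuyerMark {
    /** Keyed hash of the buyer's contact details, returned as lowercase hex. */
    static String markFor(String contactInfo, byte[] serverSecret)
            throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(serverSecret, "HmacSHA256"));
        byte[] digest = mac.doFinal(contactInfo.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Writes the mark into the include file while holding an exclusive lock on lockFile. */
    static void writeInclude(Path lockFile, Path includeFile, String mark)
            throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            // Hypothetical macro name; the book's LaTeX source would \input this file.
            String tex = "\\newcommand{\\buyermark}{" + mark + "}\n";
            Files.write(includeFile, tex.getBytes(StandardCharsets.UTF_8));
            // pdflatex would be invoked here, before the lock is released.
        }
    }
}
```

The mark and the contact information would then be stored together in the database, so that a leaked copy can be traced back to a purchase.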
Technologies
The following technologies can be used (other programming languages are possible, but libraries will likely be limited to those supplied by the host):
C, Java, PHP
LaTeX files
PDF files
Linux
Question
What programming techniques (or open source software) should I investigate to:
Embed a unique hash (or other mark) to a PDF
Create a collusion-attack resistant mark
Develop a non-fragile (e.g., PDF -> EPS -> PDF still contains the mark) solution
Research
I have looked at the following possibilities:
Steganography
Natural Language Processing (NLP)
Convert blank pages in PDF to images; mark those images; reassemble PDF
LaTeX watermark package
ImageMagick
Issues
The possible solutions I have researched have the following issues:
Steganography. (a) Requires a master copy of the images, which are converted to EPS, which is CPU-intensive and time-consuming; (b) would the watermark survive PDF -> EPS -> PDF, or other types of conversion; (c) most images are drawings or screen captures, not photographs in PNG format.
LaTeX. Creates an image cache; any steganographic solution would have to intercept that process somehow.
NLP. Introduces grammatical errors; could change meaning of technical words.
Blank Pages. Immediately suspect; it is easy to replace suspicious blank pages.
Watermark Package. Draws visible marks.
ImageMagick. Draws visible marks.
What other solutions are possible?
Related Links
http://www.tcpdf.org/
invisible watermarks in images
Thank you!
I've done this for another project with PDFlib. We needed traceability for the generated PDFs in case the file was leaked. Basically:
Created a source template PDF with the content in place and set the document master password with the required options (no edit, no print, no screen reader, etc.).
At runtime, we applied a few watermarks (imposed page footer saying "This document checked out to user #12345", set a few of the metadata fields with user ID, download IP, download date/time, added a "this document copyright by..." cover page, etc...)
Optionally attach a user password to force a PW prompt when document is opened.
Since the latest PDF versions use AES-128 for their encryption, we just set a suitable randomly generated, high-entropy 128-character password. No one would ever be typing it in by hand, so being hard to type was irrelevant to us and actually preferable. The master password prevented end users from making any changes to the document. The various no-print/no-screen-reader options are enforced only by the PDF reader and are therefore bypassable, but it can't hurt to set them anyway.
The downside to this is that PDFlib's licensing is fairly steep. I don't know if any of the free PHP PDF libraries support the latest PDF encryption schemes, especially the master password stuff, but if your budget can support it, PDFlib's the way to go for secure document production.
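Generating such a password needs nothing beyond a cryptographically secure random source. A minimal sketch, assuming an alphanumeric alphabet is acceptable to the PDF library in use (the class and method names here are hypothetical, not part of PDFlib):

```java
import java.security.SecureRandom;

final class PdfPasswords {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns a random password of the given length drawn uniformly from ALPHABET. */
    static String random(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

Calling PdfPasswords.random(128) yields a password of the length described above. As far as I know, the standard PDF security handler for RC4 and AES-128 only uses the first 32 bytes of a password, so length beyond that may not add any strength.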
I've tried using Adobe Acrobat X Pro to "recognize text in multiple files."
When I started this process and it asked for the directory, I chose C:, my main hard drive.
It took hours to load, and when it did, the list of files it generated included Word documents as well. Adobe said I couldn't proceed until I removed the problem files.
Once I had removed all the PDFs Adobe flagged as having errors (like password protection) and the prompt remained, I assumed it meant the Word documents in the list.
So I manually removed those too. But Adobe still said that I couldn't proceed until the problem files were removed, even though none of the remaining files in the list were flagged as having issues.
My firm is trying to make sure all the PDFs we have are searchable. Currently, some are and some aren't. Our goal is to make them all searchable without removing them from their varied locations.
I think you can do this using a combination of
plain Java: to list all files in a directory that match a given criterion (e.g. their name ends with '.pdf')
iText: to iterate over the PDF document and extract all images
Tess4J: a Java wrapper for Tesseract (Google's OCR engine), to turn the extracted images back into text
Unless I am much mistaken, Tesseract even offers a crude version of this workflow for you, but only for one PDF at a time. So you'd still need some Windows/Linux scripting to pipe in all files of a given directory.
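The first step, finding every PDF under a directory, needs only the standard library. A minimal sketch (PdfFinder and findPdfs are hypothetical names); it skips files and folders it cannot read instead of aborting, which matters when scanning a whole drive such as C:. The iText and Tess4J steps are not shown.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PdfFinder {
    /** Recursively collects regular files under root whose names end in ".pdf" (any case). */
    static List<Path> findPdfs(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
                if (attrs.isRegularFile() && name.endsWith(".pdf")) {
                    found.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable files and folders instead of aborting the scan.
                return FileVisitResult.CONTINUE;
            }
        });
        return found;
    }
}
```

Each returned path could then be handed to the image-extraction and OCR steps, with the searchable result written back next to the original so the files stay in their existing locations.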
I have over 50 training documents (PDFs) at work. I would like to create a 'front end' that a user can 'run', which would provide a convenient access portal to all the PDFs available.
This needs to be able to be dropped onto my work colleagues' laptops (they don't have Office on there but do have Acrobat). It also needs to be able to be edited/added to as more PDF training materials are created.
I know that I could create a Word document that contained links to the PDFs, then convert that to a PDF itself. Or I could create an offline web page that linked to them, but I wondered if there was a better solution?
Like a way to compile an executable that would bring up a front-end and contain all the PDF files? I've seen similar things for car-repair manuals years ago, where you insert a CD, run an executable and get a nice front-end that essentially just allows you to browse PDF manuals.
Anyone know if this is possible and, if so, how to go about it?
Or does anyone know another viable solution to this?
Thanks
There are indeed various possibilities, depending on what the users have (Acrobat or Reader), and how you can control the distribution.
a) You create a front end PDF document which has links or buttons to open the subsequent documents residing in a subfolder or on the same level as the front end document.
b) You create a front end PDF document into which you embed the subsequent documents as Data Objects. You have buttons which export/open the embedded documents in a different window.
c) You create a front end PDF document into which you embed the subsequent documents as File Attachments (part of the Comments tools). You have buttons which open the embedded documents.
d) You would create a PDF Portfolio in Acrobat, containing the subsequent documents, and maybe provide an overview page from which you can open the documents.
Of these four approaches, a) would run in the largest number of PDF viewers, including those on mobile devices. The downside is that the subsequent documents are lying around as loose files, and your users may move, rename, or delete them.
The most elegant (and app-like) approach would be b). However, it requires smart PDF viewers, and you would have to make sure that the user's viewer is not too dumb.
Approach c) would be a compromise between integrity and portability. Approach d) would be quite nice for distribution, but it requires a PDF viewer by Adobe and most likely will not work in any mobile viewer.
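For comparison, the offline web page option raised in the question does not have to be maintained by hand: the index can be regenerated whenever a PDF is added. A minimal sketch, assuming the PDFs sit in the same folder as the generated page (IndexPage, escape, and render are hypothetical names, and file names are only HTML-escaped, not URL-encoded):

```java
import java.util.List;

final class IndexPage {
    /** Escapes the characters that are significant in HTML text and attributes. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders a simple HTML page with one link per PDF file name. */
    static String render(String title, List<String> pdfFileNames) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"><title>")
            .append(escape(title)).append("</title></head>\n<body>\n<h1>")
            .append(escape(title)).append("</h1>\n<ul>\n");
        for (String name : pdfFileNames) {
            html.append("  <li><a href=\"").append(escape(name)).append("\">")
                .append(escape(name)).append("</a></li>\n");
        }
        html.append("</ul>\n</body></html>\n");
        return html.toString();
    }
}
```

The returned string would be written to a file such as index.html next to the PDFs. Like approach a), this leaves the documents as loose files.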
We have software that creates user reports and saves them as PDF documents. We're using Ghostscript for this.
I'm aware that PDF is "normally" an export format which is not editable, but one of our customers needs the ability (for legal reasons) to edit these files.
I thought it might be possible to put the text in fillable form fields (like Adobe Acrobat offers) and save it that way. Is it possible to create text within a fillable form in a PDF and save it (with free tools like Ghostscript), so that the user can edit it later?
I read the Ghostscript documentation, but I didn't find anything.
Ghostscript isn't really a terrific tool for this. You'd be better off with a PDF generation library which can add the appropriate annotations to the page, if you're wedded to using annotations.
If the "content" must be edited by end users, using widget annotations is not a horribly bad way of doing things, except that every end user needs to have a copy of Acrobat, and if only some people are allowed to edit, you will likely have to play with owner password protection and permissions in order to prevent anyone else from changing field contents.
As for free tools, depending on the usage you could use iText or iTextSharp.
If you are required to be able to take the content of the document and be able to make changes to it on the fly, that's a trickier beast. If you can afford it (and it's certainly not free), my company, Atalasoft, publishes a product that I wrote that lets you build PDF documents from scratch or from templates and embed the .NET objects that create the content into the PDF itself. That means you can read those objects back out and change the content with a site-specific application, for example.
I want to generate a technical report from Lisp (AllegroCL in my case), and I have studied various packages/projects to help me do this.
Requirements:
Need to generate a PDF
May create an intermediate format like RTF, reStructuredText, HTML, Word DOC or LaTeX
Need to be flexible to be able to add content throughout my application
Need to handle Multi-Page, Headers, Footers, Tables, inclusion of Images.
Possibilities:
cl-pdf and cl-typesetting: I checked this one out and it works for now, but is there a better alternative?
Some LaTeX generator, but ???
Question:
Do you know of alternatives for easily generating (PDF) reports from Lisp? What is the best workflow to go for?
We have been using cl-pdf and cl-typesetting for the last 3 years, and they have numerous issues (confusion around encodings, silently not rendering things that don't fit, and so on), so I don't recommend basing new development on them.
Currently we are in the process of moving all our export mechanisms to the OpenDocument format. OpenOffice is perfectly happy with it, and there's a plugin for MS Office, too.
There's .fodt, the so-called flat OpenDocument text format, which is a single XML file describing a document. Generating it is as easy as generating XML files.
You can also make parts of your document read-only with a password: insert a section and mark it as read-only and password-protected. When generating the XML, you can generate random hashes as passwords.
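To give an idea of how little is needed for the flat XML format mentioned above, here is a minimal sketch that emits a one-heading document. It is written in Java only for illustration (the same string building carries over to Lisp), FlatOdt and render are hypothetical names, and styles, headers, footers, tables, and images are left out.

```java
import java.util.List;

final class FlatOdt {
    /** Escapes the characters that are significant in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders a flat OpenDocument text file with one heading and plain paragraphs. */
    static String render(String heading, List<String> paragraphs) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
           .append("<office:document")
           .append(" xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\"")
           .append(" xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\"")
           .append(" office:version=\"1.2\"")
           .append(" office:mimetype=\"application/vnd.oasis.opendocument.text\">\n")
           .append("<office:body><office:text>\n")
           .append("<text:h text:outline-level=\"1\">")
           .append(escape(heading)).append("</text:h>\n");
        for (String p : paragraphs) {
            xml.append("<text:p>").append(escape(p)).append("</text:p>\n");
        }
        xml.append("</office:text></office:body>\n</office:document>\n");
        return xml.toString();
    }
}
```

Saved with a .fodt extension, the result should open in OpenOffice or LibreOffice, and a headless conversion such as soffice --headless --convert-to pdf report.fodt should then produce the PDF.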
I'd like to write some (Java) code that takes a PDF document and creates named destinations from all of the bookmarks. I think the iText API is the easiest way of doing this, but I have never used the API before.
How would you go about writing this sort of code with the iText API? Can iText do the parsing needed to manipulate existing PDFs by itself? The kind of manipulations I am thinking of are:
Open,
Find bookmarks,
Create destinations,
Save,
Close.
Or is there a different API that would be better?
Followup: I submitted a patch to iText a few months ago (it has now been accepted and is part of HEAD) that adds text parsing capabilities to iText. PdfBox (mentioned below) has (had?) problems with reading newer PDFs that use xref streams instead of the older xref table format.
Another library that is very good at parsing existing PDF files is PDFBox. It can also be used for modifying an existing PDF. FYI: this is the text parser that Lucene uses.
I will also mention that iText does have the ability to parse a PDF file, it's just not great at parsing the text content on each page. If you are looking at accessing the PDF higher level constructs (Dictionaries, etc...) that are used for storing bookmarks, etc... and you don't mind getting your hands a little dirty with reading the PDF spec, you can absolutely do what you are asking about (we do it quite a bit ourselves).
The PDF Spec is big, but readable for the most part, and you don't have to worry about the bulk of it (which is geared towards actual page content and rendering) if all you are trying to do is extract bookmarks.
I'll just warn you up front that you may be disappointed with this. iText isn't really intended to be used as a parser. It's really more for creating entirely new PDF documents, but you can take a whack at it.
To start, using iText, you won't be able to modify the existing PDF document. What you can do, though, is to make a copy with the additional features that you want. (If somebody else knows better, please let me know, this drives me crazy.)
What you will want to do is create a PdfReader object from an input stream on your source file. Then create a PdfCopy object (which is just an extended PdfWriter that makes getting data from an existing source more convenient) for your destination.
As far as I can tell, the bookmarks cannot be obtained from iText at all. Another library may be needed. I think jpedal may have the ability to extract them (it can get them as an XML document, which you may then have to parse to get what you want). However you get them, you can then add them to a java.util.List and set that list as the outline on the PdfCopy. The bookmarks themselves are just HashMaps with a particular set of keys. I'm not sure what all of the values are, but they include "Title", "Action" (which seems to be where you'd specify that this is a named destination, though I don't know what that value would be), and "URI" (which is used if this is an external link; I suspect that this would specify the name of the named destination that you're linking to). Again, the specifics are hard to find.
Then iterate over the pages of the reader, importing each page to the PdfCopy. This page may help you.
Sorry I'm not more helpful to you. Good luck.
P.S. If anybody else knows of a better tool that's either (L)GPL or BSD licensed, I'd love to hear about it.