I want to make a PDF available on my website, but want to prevent the automatic parsing by bots that might not respect the normal PDF "security". The reason is that this is also commercially published and I am allowed to share for "personal use", but must not make it widely available that way. I originally created the PDF from Word.
I have tried using Ghostscript with the -dNoOutputFonts option to convert the text to outlines, but the result is ridiculously big (from 2.5 MB to 180 MB). Scrambling the text encoding seems a good option, but I barely found any posts discussing this. There seems to be a commercial solution, but I was unable to find a way to do this with e.g. Ghostscript or qpdf. Any suggestion on how to achieve this (or alternative solutions)?
Operating system: Windows 10 64bit
Available versions of Ghostscript: 9.18, 9.27
Simple example PDF
Well, that's the advantage of fonts: you only have to describe each character once. Convert to outlines and you need to describe it every time it is drawn, so yeah, much bigger.
Ghostscript's pdfwrite device goes to considerable effort to try and make text searchable, because in general people shout at us when a 'searchable' file becomes 'non-searchable'. So (amongst other things) it preserves any ToUnicode CMaps in the input file. To prevent simple indexing you need to avoid that. You haven't linked to a PDF file so I can't test this, but....
There are three places you need to edit:
/ghostpdl/Resource/Init/gs_pdfwr.ps, line 642, change:
/WantsToUnicode /GetDeviceParam .special_op {
exch pop
}{
//true
}ifelse
To:
//false
In the same file, at line 982, change:
/WantsToUnicode /GetDeviceParam .special_op {
exch pop
}{
//false
}ifelse
To:
//false
Then in /ghostpdl/Resource/Init/pdf_font.ps, line 614, change:
/WantsToUnicode /GetDeviceParam .special_op {
exch pop
}{
//false
}ifelse
To:
//false
That should prevent any ToUnicode information in the input file making it through to the output file. Depending on the operating system you are using, and the way Ghostscript has been built (you haven't said), you may need to tell Ghostscript to include that directory in its search path, which you do with -I/ghostpdl/Resource/Init.
You should also set -dSubsetFonts=true, which will emit all fonts as subsets; I think that's the default, but I can't immediately recall and it does no harm to set it. That means the first glyph that is encountered is encoded at index 1, the second at index 2, etc. So Hello World becomes 0x01, 0x02, 0x03, 0x03, 0x04, 0x05, 0x06, 0x04, 0x07, 0x03, 0x08. The ordering will be consistent throughout the file (obviously) but different for every font in the file and for every file. That should be adequately scrambled, I'd have thought. It certainly won't be possible to search/copy/paste trivially.
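A minimal sketch of the resulting invocation (the gswin64c path and file names below are placeholders of mine, and the -I path must point at the copy of Resource/Init you actually edited):

# Run Ghostscript with the edited Init resources over the input PDF.
# Paths and file names are examples only; adjust them to your setup.
import subprocess

GS = r"C:\Program Files\gs\gs9.27\bin\gswin64c.exe"   # placeholder install path

subprocess.run([
    GS,
    "-I/ghostpdl/Resource/Init",   # pick up the edited gs_pdfwr.ps / pdf_font.ps
    "-sDEVICE=pdfwrite",
    "-dSubsetFonts=true",          # subset fonts, giving per-file glyph indices
    "-o", "scrambled.pdf",
    "input.pdf",
], check=True)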
If you make an example file available I can test it.
Oh, it also just occurred to me that you might be able to get the same effect by using the ps2write device to create a PostScript file, then using the pdfwrite device to convert that back to PDF. The ps2write device can't embed ToUnicode CMaps, because there's no standard support in PostScript for that. Of course, it also means the content drops back to PostScript, which may result in other, unacceptable, quality/size changes.
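If you try that route, a rough sketch of the round trip (again, the executable path and file names are placeholders of mine):

# Round-trip through PostScript so the ToUnicode CMaps cannot survive,
# then convert back to PDF. Placeholder paths; adjust to your install.
import subprocess

GS = r"C:\Program Files\gs\gs9.27\bin\gswin64c.exe"

subprocess.run([GS, "-sDEVICE=ps2write", "-o", "intermediate.ps", "input.pdf"], check=True)
subprocess.run([GS, "-sDEVICE=pdfwrite", "-o", "output.pdf", "intermediate.ps"], check=True)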
I have a PDF that contains a long list of numbers, which was compressed using the JBIG2 algorithm.
When I look at the internal file structure of my file I can see that my pages are built with two different XObjects:
(Pictured is Adobe Acrobat Preflight -> Internal structure.)
I can easily look at the specifics of the first one, called "XIPLAYER0" (not pictured); it even gives me the information bit by bit if I want. The second one is the one I am interested in, though. In it I can see that the image is built using 2 "Symbol Dictionaries" (the first one marked grey). Is it possible to see the different entries in this dictionary? Or maybe even get some metadata for just one of them?
Sample PDF (outside link)
This is not really about PDF; PDF is just the container for the JBIG2 format and its symbol dictionary, which is what you're really interested in.
But, as a first step, you'll need to get the JBIG2 images out of the PDF:
Extract images from PDF, how to handle JBIG2 encoded
That SO mentions poppler, and poppler does have a Python binding/wrapper:
https://pypi.org/project/python-poppler/
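As a rough sketch of that extraction step, using poppler's pdfimages command-line tool rather than the Python binding (file names are placeholders, and I'm assuming a poppler build whose -all option keeps JBIG2 streams in their native format):

# Dump every image from the PDF in its native format; with JBIG2-compressed
# pages this should give files that JBIG2 tools such as jbig2dec can inspect.
import pathlib
import subprocess

pathlib.Path("extracted").mkdir(exist_ok=True)
subprocess.run(["pdfimages", "-all", "input.pdf", "extracted/img"], check=True)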
Once you get those JBIG2 files, maybe this can help:
jbig2_symbol_dict.c
The bigger project has a command-line util which has a "dump" option, but the source says it's not implemented:
case dump:
fprintf(stderr, "Sorry, segment dump not yet implemented\n");
break;
So if you're just curious/this is an academic question, the answer looks like "not really". If you need to read the text, how about OCR?
I have hand-crafted one PDF (https://media.24usoftware.com/d/PDFwithOnlyLink.pdf) and modified another one, generated from FileMaker, in order to add web link annotations (https://media.24usoftware.com/d/PDFwithLinks.pdf). They both pass validation at https://www.pdf-online.com/osa/validate.aspx, but for some reason, when I open them in Adobe Reader they get immediately modified, so when I try to close them Adobe Reader asks me if I want to save changes. I have no clue what the changes are or why they were made. Any ideas? Adobe's support claims there are syntax errors, but without providing any details about what's syntactically wrong with them.
For your hand-crafted PDF file, Ghostscript says your xref table is wrong: some of the entries are not exactly 20 bytes. That is a requirement for entries in the xref table; if you use \r or \n instead of \r\n to end an xref entry (as you have done), you must pad out the entry with white space.
I get the same warning on your modified file too.
See page 94 of the 1.7 PDF Reference Manual where it says:
Each entry is exactly 20 bytes long, including the end-of-line marker.
and then later:
If the file's end-of-line marker is a single character (either a carriage return or a line feed), it is preceded by a single space; if the marker is 2 characters (both a carriage return and a line feed), it is not preceded by a space. Thus, the overall length of the entry is always exactly 20 bytes.
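To make that concrete, here is a small check of my own (not part of the answer) showing the two legal entry forms:

# Both xref entry variants below are exactly 20 bytes long.
entry_crlf = b"%010d %05d n\r\n" % (1234, 0)   # CR LF ending, no padding needed
entry_lf   = b"%010d %05d n \n" % (1234, 0)    # single-character EOL needs the extra space
assert len(entry_crlf) == len(entry_lf) == 20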
You might want to think about using a different validation tool: this is a basic and very common error, and if that validator can't find it, that doesn't speak well for its quality.
Though I'm disappointed to see that Acrobat X Pro's own validation analyser can't find it either.....
I want to generate a Portable Document Format (PDF) file with an original program of mine.
I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid using XeTeX, LuaTeX, and other engines.
And I want to store the glyph information internally in my program or my library.
But where should the character codes be specified in the PDF so that the viewer program knows what they are when they are copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters.
According to the PDF Reference, that seems quite possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.
Initially, I thought I might generate PostScript and use Ghostscript to convert that to PDF.
But it is pointed out here that PostScript does not support Unicode, which I will certainly use.
My options are then reduced to generating PDF directly from scratch.
My concern is that, though my brute-force approach may render correctly, I guess the resulting PDF would be such that the viewer can neither copy nor search the text, since I would have specified the character codes nowhere.
On p. 122 of the PDF Reference, we see that there are several different kinds of objects.
What seem relevant are text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewed parts of scanned Google Books, in which you can copy strings correctly.
What is the method or field specifying that?
However, in the various tables that follow in the PDF Reference, I think there is no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object with its character code.
If this can be done, the envisioned project would be easiest, since I could just extract some open-source fonts' Bézier curve parameters (I believe that can be done) and translate them myself into the format PDF allows.
If neither image objects nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be embedding a custom font, synthesized at runtime, in the PDF.
This is mentioned verbally and briefly on p. 364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and requires tremendous research.
I would like you to recommend some tutorials on embedding fonts; they are not easy to find.
In fact, I find example PDF files are themselves already scarce, as most of them seem to come as LZ-compressed binary files (I guess).
Indeed, I tried compiling a "Hello world" PDF in a non-Computer-Modern font and opening it with a text editor, and all I saw was blanks, control characters, and mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF so that: there is shown a circle, but when you copy that, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector, which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode CMap which maps character codes to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the bold question: if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1, because it is in the ASCII position for 1 in the Encoding vector. Some of the other glyphs, notably sun, mars and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter and saturn are not recognized).
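Building on that, here is a rough hand-rolled sketch (independent of gridcal.ps; every object number, glyph name, and coordinate below is arbitrary, and this is only a sketch, not a vetted generator). It writes a one-page PDF whose single Type 3 glyph draws a filled circle, while the font's Encoding assigns it character code 65 and a ToUnicode CMap maps that code to U+0041, so selecting and copying the circle should paste the letter "A":

# Sketch only: a circle that copies as "A" via /Encoding plus /ToUnicode.

def dict_obj(num, body):
    # Serialize a plain dictionary object.
    return b"%d 0 obj\n%s\nendobj\n" % (num, body)

def stream_obj(num, data):
    # Serialize a stream object; /Length counts only the bytes between
    # "stream\n" and the EOL before "endstream".
    return (b"%d 0 obj\n<< /Length %d >>\nstream\n" % (num, len(data))
            + data + b"\nendstream\nendobj\n")

# Glyph program for /circle: advance width 800, circle approximated by Beziers.
circle = (b"800 0 d0\n"
          b"700 400 m\n"
          b"700 566 566 700 400 700 c\n"
          b"234 700 100 566 100 400 c\n"
          b"100 234 234 100 400 100 c\n"
          b"566 100 700 234 700 400 c\n"
          b"f")

content = b"BT /F1 100 Tf 50 50 Td (A) Tj ET"   # code 65 selects /circle

tounicode = (b"/CIDInit /ProcSet findresource begin\n"
             b"12 dict begin\nbegincmap\n"
             b"/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
             b"/CMapName /Adobe-Identity-UCS def\n/CMapType 2 def\n"
             b"1 begincodespacerange\n<00> <FF>\nendcodespacerange\n"
             b"1 beginbfchar\n<41> <0041>\nendbfchar\n"
             b"endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend")

objects = [
    dict_obj(1, b"<< /Type /Catalog /Pages 2 0 R >>"),
    dict_obj(2, b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>"),
    dict_obj(3, b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] "
                b"/Resources << /Font << /F1 4 0 R >> >> /Contents 7 0 R >>"),
    dict_obj(4, b"<< /Type /Font /Subtype /Type3 /FontBBox [0 0 800 800] "
                b"/FontMatrix [0.001 0 0 0.001 0 0] /CharProcs 5 0 R "
                b"/Encoding << /Type /Encoding /Differences [65 /circle] >> "
                b"/FirstChar 65 /LastChar 65 /Widths [800] /ToUnicode 8 0 R >>"),
    dict_obj(5, b"<< /circle 6 0 R >>"),
    stream_obj(6, circle),
    stream_obj(7, content),
    stream_obj(8, tounicode),
]

buf, offsets = b"%PDF-1.4\n", []
for obj in objects:
    offsets.append(len(buf))
    buf += obj

xref_pos = len(buf)
buf += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
for off in offsets:
    buf += b"%010d 00000 n \n" % off        # every entry exactly 20 bytes
buf += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
        % (len(objects) + 1, xref_pos))

with open("circle_is_A.pdf", "wb") as fh:
    fh.write(buf)

Dropping the /ToUnicode entry from the font dictionary would leave the page rendering exactly the same but give text extractors nothing reliable to map the code to.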
I tried to copy text from a PDF file but get some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe Reader cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where the text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by drawing on information from beyond the PDF, or by applying OCR to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification (a quick check for this is sketched below), and
the heuristics used by those programs differ considerably, with Okular's heuristics working best for your document.
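As a rough diagnostic of the first point (a sketch of mine, assuming the third-party pikepdf library; the access patterns are from memory, so treat them as an assumption):

# Report, per page, which font resources carry a /ToUnicode CMap; fonts
# without one are what force extractors back onto heuristics.
# Requires "pip install pikepdf".
import pikepdf

with pikepdf.open("input.pdf") as pdf:
    for page_no, page in enumerate(pdf.pages, start=1):
        try:
            fonts = page["/Resources"]["/Font"]
        except KeyError:
            continue  # page uses no font resources
        for name in fonts.keys():
            has_map = "/ToUnicode" in fonts[name]
            print(f"page {page_no} {name}: {'has' if has_map else 'NO'} /ToUnicode")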
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
We are developing a PDF parser to be used along with our system.
The requirement is that we store all the information from any PDF document and should be able to reproduce the document as such (with minimal changes from the original document).
We did some googling and found iTextSharp to be the best fit for our purpose.
We are developing our project using .NET.
As you might have guessed from the title, I am looking for a comparison of specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license. The 5.x versions are AGPL.
We would like to have a good comparison between the versions before choosing the LGPL version or buying a license for the AGPL one (we don't want to publish our code).
I did some browsing through the revision changes in iTextSharp, but I would like to know if any content exists that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to linesegment implementation; optimalization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in pdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException if resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk(() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel, it would also be like falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.