We are developing a PDF parser to be used alongside our system.
The requirement is that we store all the information from any PDF document and are able to reproduce the document as-is (with minimal changes from the original document).
We did some googling and found iTextSharp to be the best fit for our purpose.
We are developing our project using .NET.
As the title suggests, we need a comparison of specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license; the 5.x versions are AGPL.
We would like a good comparison between the versions before deciding whether to stay with the LGPL version or buy a license for the AGPL version (we don't want to publish our code).
I did some browsing through the iTextSharp revision changes, but I would like to know whether any resource exists that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël, who already answered in the comment section, I'm at the same time the most authoritative source and a biased one.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to the LineSegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in PdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException if resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
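Many of the entries above revolve around the same parsing classes (PdfReaderContentParser, the render listeners, and the extraction strategies). As a point of reference, a minimal iText 5 text-extraction sketch using those classes might look like the following; the file name and page number are placeholders, and the iTextSharp (.NET) port mirrors these class names:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageText {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf");   // placeholder file name
        // Convenience call; uses the location-aware strategy by default (see 5.0.1 above)
        System.out.println(PdfTextExtractor.getTextFromPage(reader, 1));

        // The same thing via PdfReaderContentParser (introduced in 5.0.2 above)
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        TextExtractionStrategy strategy =
                parser.processContent(1, new LocationTextExtractionStrategy());
        System.out.println(strategy.getResultantText());

        reader.close();
    }
}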
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel; it would also mean falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.
I want to generate a Portable Document Format (PDF) file with an original program of mine.
I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible.
So it would be ideal to avoid using XeTeX, LuaTeX, and other engines.
And I want to store the glyph information internally in my program or my library.
But where should the character code be specified in the PDF so that the viewer program knows it when text is copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters.
According to the PDF Reference, that seems quite possible.
I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.
Initially, I thought I might generate PostScript and use Ghostscript to convert that to PDF.
But it is pointed out here that PostScript does not support Unicode, which I will certainly use.
My options are then reduced to directly generating PDF from scratch.
My confusion is that, though my brute-force approach may render correctly, I guess the resulting PDF would be such that the viewer can neither copy nor search, since I would have specified the character codes nowhere.
In PDF Reference p.122, we see that there are several different objects.
What seems relevant are text objects, path objects, and image objects.
Is it possible to associate an image object with its character code?
As I recall, there are some scanned PDFs, for example the freely previewed parts of scanned Google Books, in which you can copy strings correctly.
What is the method or field specifying that?
However, in the various tables throughout the PDF Reference, I think there is no suitable slot for a Unicode code point.
Similarly, it is not clear how to associate a path object with its character code.
If this can be done, the envisioned project would be easiest, since I could just extract some open-source fonts' Bézier curve parameters (I believe that can be done) and translate them myself into the format PDF allows.
If neither image nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code.
Maybe a more correct way would be embedding a custom font, synthesized at runtime, in the PDF.
This is mentioned briefly on p. 364, sec. 5.8, "Embedded Font Programs".
That does seem rather difficult and requires tremendous research.
I would appreciate recommendations for tutorials on embedding fonts; they are not easy to find.
In fact, I find that example PDF files are themselves scarce, as most of them seem to be compressed binary (LZ-compressed, I guess).
Indeed, I tried to compile a "Hello world" PDF in a non-Computer-Modern font and open it with a text editor, and all I saw was blanks, control characters, and mojibake-like strings.
In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known?
For concreteness, can you generate a PDF that shows a circle, but when you copy it, you copy the character "A"?
The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector which is indexed by the character code and yields a glyph name. For copying out of the document, there must also be a ToUnicode CMap which maps character codes to Unicode code points.
If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.
To answer the bold question: if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1 because it is in the ASCII position for 1 in the Encoding vector. Some of the other glyphs, notably sun, mars, and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter, and saturn are not recognized).
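As an illustration of the bold question, here is a hedged sketch using iText 5 (Java) that defines a Type 3 font whose glyph for the character 'A' is simply a filled circle. Whether every viewer copies 'A' back out still depends on the Encoding and ToUnicode information actually written to the file, as described above, so treat this as a starting point rather than a guarantee:

import java.io.FileOutputStream;

import com.itextpdf.text.Document;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.Type3Font;

public class CircleGlyph {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("circle_glyph.pdf"));
        document.open();

        // Define a Type 3 font and give the character 'A' a glyph that is just a filled circle.
        Type3Font t3 = new Type3Font(writer, true);
        PdfContentByte glyph = t3.defineGlyph('A', 600, 0, 0, 600, 600); // advance width and bounding box in glyph space
        glyph.circle(300, 300, 280);
        glyph.fill();

        // The page shows a circle, but the content stream still records the character code for 'A'.
        document.add(new Paragraph("A", new Font(t3, 12)));
        document.close();
    }
}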
I have a PDF whose content stream looks like image1.
But once I opened the PDF in Adobe Acrobat DC and tried to change the reading order, the entire content stream changed (please see image2).
And here is the link to source pdf https://drive.google.com/file/d/1V2K3-2GdWG5DuTUv1fyfIIT54en70kI2/view
Is there a way to do the same programmatically (convert the content stream of graphical text to a proper stream)?
Thanks in advance!
Is there a way to do the same programmatically (convert the content stream of graphical text to a proper stream)?
First of all, both streams are proper; there merely are different (and in the case at hand considerably different) ways to create the same text on screen, each as valid as the other, and different PDF processors use different ways.
The processor that created your original PDF appears to have approached the task by dividing the text into small pieces (less than a text line) and drawing these pieces as independently as possible, i.e. as separate text objects (BT..ET) with text properties set in each (Tm, Tf, Tc), positioned and rescaled by transformation matrix changes (cm), and enveloped in save/restore graphics state instructions (q..Q).
Adobe Acrobat, on the other hand, appears to prefer the page main text to be contained in a single text object with text properties only set when they change and no text object or graphics state switches in-between.
Neither of these is more "proper" or more "graphical" than the other. If anything, these structures mirror how these instructions are stored or processed internally by the respective PDF processor.
That being said, you do want to convert from the former style into the latter.
The main problem is that the latter style is not standardized (at least there is no published document normatively describing it). So, while you can surely attempt to follow the lead of the example you have, you can never be sure that you understood the style exactly. Thus, you always have to expect differences emerging in special, not yet encountered situations. Furthermore, there is no guarantee Adobe will meticulously adhere to that style across software versions.
Nonetheless, you can of course attempt to follow the style (as you perceive it) as well as possible.
An implementation will have to walk through the respective content stream, keeping track of the current graphics state, and transform the text drawing (and related) instructions into a single text object for as long as possible.
You have tagged your question both itext and pdfbox. Thus, you appear to be undecided about which PDF library to use for implementing this. Here are some ideas for both choices:
For processing content streams and keeping track of the current graphics state, iText offers its com.itextpdf.text.pdf.parser API, in particular the PdfContentStreamProcessor (iText 5.x) / its com.itextpdf.kernel.pdf.canvas.parser API, in particular the PdfCanvasProcessor (iText 7.x).
You can extend them so that, in addition to analyzing the current contents, they also replace the content stream in question with an updated version, e.g. like I did in this answer for iText 5 or in this answer for iText 7.
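As a starting point (not the full content-stream editor from the linked answers), a minimal iText 7 sketch that runs a page through the graphics-state-aware PdfCanvasProcessor might look like this; the file name is a placeholder and the listener here merely extracts text, but the same processor is what a content-stream editor would build on:

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;

public class WalkContentStream {
    public static void main(String[] args) throws Exception {
        try (PdfDocument pdf = new PdfDocument(new PdfReader("input.pdf"))) { // placeholder file name
            LocationTextExtractionStrategy listener = new LocationTextExtractionStrategy();
            // The processor keeps track of the graphics state while it feeds
            // text/image/path events to the listener.
            PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
            processor.processPageContent(pdf.getFirstPage());
            System.out.println(listener.getResultantText());
        }
    }
}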
PDFBox for the same task offers the class hierarchy based on the PDFStreamEngine. Based on these classes it should similarly be possible to create a graphics state aware content stream editor.
Both libraries also offer simpler classes for parsing the content streams into sequences of instructions, but those classes don't keep track of the graphics state, leaving that for you to implement.
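For the PDFBox side, a rough sketch of that token-level approach (assuming PDFBox 2.x) is shown below; it simply re-parses and re-serializes each content stream unchanged, which is the place where a real implementation would merge the text objects:

import java.io.File;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

public class RewriteContentStreams {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) { // placeholder file names
            for (PDPage page : doc.getPages()) {
                // Parse the page content stream into a flat list of operands and operators.
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();

                // A real implementation would restructure the tokens here
                // (e.g. merge the BT..ET ranges); this sketch writes them back unchanged.
                PDStream newContents = new PDStream(doc);
                try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
                    new ContentStreamWriter(out).writeTokens(tokens);
                }
                page.setContents(newContents);
            }
            doc.save("output.pdf");
        }
    }
}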
I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is the video https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2, too) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other stack overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
What happens if the algorithm above fails to produce a Unicode value
This is where text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, or information from beyond the PDF, or by applying OCR to the glyph in question.
That the different programs you tried returned such different results shows that
your PDF does not contain the information required for the algorithm above from the PDF specification and
the heuristics used by those programs differ relevantly and Okular's heuristics work best for your document.
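You can check the first point programmatically. For example, here is a small PDFBox 2.x sketch (the file name is a placeholder) that lists, for every font on every page, whether a ToUnicode CMap is present; note that its absence is only a strong hint, since simple fonts with standard encodings and glyph names can still be extracted without one:

import java.io.File;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class ListToUnicodeInfo {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) { // placeholder file name
            int pageNo = 1;
            for (PDPage page : doc.getPages()) {
                PDResources resources = page.getResources();
                for (COSName name : resources.getFontNames()) {
                    PDFont font = resources.getFont(name);
                    boolean hasToUnicode = font.getCOSObject().containsKey(COSName.TO_UNICODE);
                    System.out.println("page " + pageNo + ", font " + font.getName()
                            + ": ToUnicode " + (hasToUnicode ? "present" : "missing"));
                }
                pageNo++;
            }
        }
    }
}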
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
Ask the source of the PDF for a version that contains proper information for text extraction.
Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
Apply OCR to the PDF in question.
Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
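For the last option, the mechanics of attaching a ToUnicode CMap with PDFBox can look roughly like the following sketch; the font resource name, the code-to-Unicode pairs in the CMap, and the file names are hypothetical and have to come from your analysis of the actual document (for multi-byte fonts the codespace range would be <0000> <FFFF> instead):

import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class AttachToUnicode {
    public static void main(String[] args) throws Exception {
        // Hand-written ToUnicode CMap: here the single code <01> is mapped to U+0041 ("A").
        String cmap =
                "/CIDInit /ProcSet findresource begin\n"
              + "12 dict begin\n"
              + "begincmap\n"
              + "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
              + "/CMapName /Custom-ToUnicode def\n"
              + "/CMapType 2 def\n"
              + "1 begincodespacerange\n<00> <FF>\nendcodespacerange\n"
              + "1 beginbfchar\n<01> <0041>\nendbfchar\n"
              + "endcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\n";

        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {          // placeholder
            PDFont font = doc.getPage(0).getResources()
                    .getFont(COSName.getPDFName("F1"));                          // hypothetical font name
            PDStream toUnicode = new PDStream(doc,
                    new ByteArrayInputStream(cmap.getBytes(StandardCharsets.US_ASCII)));
            font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicode);
            doc.save("with-tounicode.pdf");                                      // placeholder
        }
    }
}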
I have an MRC-compressed PDF (images are JPX encoded) which I cannot get redacted with iText 7 pdfSweep, as an ImageReadException is thrown.
Caused by: org.apache.commons.imaging.ImageReadException: Can't parse this format.
at org.apache.commons.imaging.Imaging.getImageParser(Imaging.java:731)
at org.apache.commons.imaging.Imaging.getImageInfo(Imaging.java:703)
at org.apache.commons.imaging.Imaging.getImageInfo(Imaging.java:637)
at com.itextpdf.pdfcleanup.PdfCleanUpFilter.processImage(PdfCleanUpFilter.java:343)
... 13 more
Do you know any workaround or solution for this issue? An obvious workaround would be to replace the JP2 (JPX) images in the PDF with some other image format and perform the redaction on this modified PDF; however, in this case the benefits of MRC compression are lost, not to mention the overall cost of such a conversion followed by redaction.
(iText developer here)
As you can see from the stack trace, iText uses org.apache.commons.imaging to handle the images.
In the past we have had some problems with known bugs in this external library.
A possible solution is to fork the org.apache.commons.imaging project, implement a fix, and submit a pull request.
This way, everyone benefits, and the change would automatically be included in iText as well.
Of course, should you be a paying customer, then reporting this problem through the iText support board might trigger us to do the pull request instead.
As for a workaround, I think you've already suggested the appropriate idea.
process all images
convert them to a different format (if needed)
feed the modified document into pdfSweep
In more detail (steps 1 and 2):
Using IEventListener you can obtain the underlying BufferedImage of a given resource, and you can then use a ByteArrayOutputStream and ImageIO to re-encode your image into standard JPG or PNG. You can then use iText to change the dictionary entry for this particular resource.
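A rough sketch of steps 1 and 2 at the level of the page resources (iText 7, Java) might look like the following. Decoding the JPXDecode stream via ImageIO assumes a JPEG 2000 plugin (for example jai-imageio-jpeg2000) is on the classpath, the file names are placeholders, and the sketch ignores details such as filter arrays, soft masks, and images shared between pages:

import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;

import javax.imageio.ImageIO;

import com.itextpdf.io.image.ImageDataFactory;
import com.itextpdf.kernel.pdf.PdfDictionary;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfStream;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.xobject.PdfImageXObject;

public class ReencodeJpxImages {
    public static void main(String[] args) throws Exception {
        try (PdfDocument pdf = new PdfDocument(new PdfReader("mrc.pdf"), new PdfWriter("no-jpx.pdf"))) {
            for (int p = 1; p <= pdf.getNumberOfPages(); p++) {
                PdfDictionary xObjects = pdf.getPage(p).getResources().getPdfObject()
                        .getAsDictionary(PdfName.XObject);
                if (xObjects == null) continue;
                for (PdfName name : new ArrayList<>(xObjects.keySet())) {
                    PdfStream stream = xObjects.getAsStream(name);
                    if (stream == null
                            || !PdfName.Image.equals(stream.getAsName(PdfName.Subtype))
                            || !PdfName.JPXDecode.equals(stream.getAsName(PdfName.Filter))) {
                        continue;
                    }
                    // Decode the raw JPEG 2000 codestream (requires a JP2-capable ImageIO plugin).
                    BufferedImage image = ImageIO.read(new ByteArrayInputStream(stream.getBytes(false)));
                    if (image == null) continue;
                    // Re-encode as PNG and swap the XObject entry for a new image.
                    ByteArrayOutputStream png = new ByteArrayOutputStream();
                    ImageIO.write(image, "png", png);
                    PdfImageXObject replacement =
                            new PdfImageXObject(ImageDataFactory.create(png.toByteArray()));
                    xObjects.put(name, replacement.getPdfObject());
                }
            }
        }
        // The rewritten file can then be fed into pdfSweep for redaction.
    }
}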
I'm looking for a way to remove all path objects from PDF file.
I suspect that this can probably be done with JavaScript in Adobe Acrobat, but I would really appreciate a tip on how to do it with Ghostscript or MuPDF tools.
Anyhow, any working solution is acceptable as the correct answer.
To do this with Ghostscript you would have to modify the pdfwrite device. In fact you would probably have to do something similar for any PDF interpreter.
What do you consider a 'path' object? A shfill, for example? How about text? How about text using a Type 3 font (which constructs paths)?
What about clip paths?
If you really want to pursue this I can tell you where to modify pdfwrite, provided you don't mind recompiling Ghostscript.
It's probably a dumb question, but why do you want to do this? Is it possible there might be another solution to your problem? If all you want to do is remove filled paths (or indeed stroked paths), one solution would be to run the file through ps2write to get PostScript, prepend code to redefine 'fill' and 'stroke' as no-ops, and then run the file back through pdfwrite to get a PDF.
[Added after reading comments]
PDF doesn't have a 'path' object, unlike XObject, which is a type of object. Paths are created by a series of operations such as 'newpath', 'moveto', 'curveto' and 'lineto'. Once you have built a path you then operate on it with 'fill' or 'stroke'. Note that PDF doesn't have a 'text' object type either.
This is why your approach doesn't work: you can't remove 'path objects' because there aren't any; the paths are created in the content stream. You can use a Form XObject to do something similar, but then the path construction is in the Form content stream, and it still isn't a separate object.
The same is true of PostScript, these are NOT any kind of object oriented languages. You cannot ' detect vector object of type path' in either language because there are no objects. In practice anything which isn't an image is a vector object, and is constructed from a path (and with clipping, even some images might be considered as paths)
The piece of PostScript you have highlighted adds a rectangle to a path (paths need not be contiguous in either PDF or PostScript) and then fills it. Note that, as is usual practice in PostScript, these are not directly using the PostScript operators, but are executing procedures which use the operators. The procedures are defined in the program prologue.
By the way, it looks like you used the pswrite device here (can't be sure with such a small sample). If this is the case you really want to start with ps2write instead. Otherwise you are going to end up with an awful lot of things degenerating to tiny filled rectangles (pswrite does this with many image types)
I didn't suggest that you try to 'decrypt' the ps2write output (it isn't encrypted, it's compressed).
What I suggested was to create a PostScript file, redefine the 'show' and/or 'fill' operators so that they do nothing, and then run the resulting PostScript program back through Ghostscript using the pdfwrite device. This will produce a PDF file where all stroked and/or filled objects are ignored.
[final addition]
I picked up your sample file and examined it.
I presume the bug you are seeing is that the PDF file uses a /Separation colour (surely it cannot fail to fill a rectangle) with an ICCBased alternate and no device space tint transform. In that case the current version of ps2write may solve your problem. It (currently, this is due to change) does not preserve /Separation colours and instead emits them as a device colour, by default RGB. So simply converting the files to PostScript and back to PDF may completely resolve your problem.
If you knew what the problem was, it would have been quicker to tell us; I could have given you that information and workaround in the first place.
Using ps2write I then created a PostScript version of the file (notice that the Separation colours are now RGB) and prefixed the PostScript program with two lines:
/fill {newpath} bind def
/stroke {newpath} bind def
Note that you must use an editor which preserves binary. Then running that PostScript program back through Ghostscript using the pdfwrite device I obtain a PDF file where the green 'decoration' which I think you are having a problem with is gone.
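For reference, that round trip can be driven from the command line roughly like this (file names are placeholders):

gs -sDEVICE=ps2write -o intermediate.ps input.pdf
(prepend the two redefinitions above to intermediate.ps, using an editor that preserves binary data)
gs -sDEVICE=pdfwrite -o output.pdf intermediate.ps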
So, there's a solution to your question, and a possibly better way to solve your problem as well.