I have several PDF files that have been OCR-processed (not by me). They contain both the scanned image and the OCR text. They seem to work fine in some viewers (iPhone/iPad), but not in others (Preview.app on macOS), which makes them somewhat awkward to read.
From googling around, it seems that the text and image may be layered incorrectly, or that there is a problem with the fonts used? I'm not even sure I'm using the correct vocabulary, as most hits I get are worthless.
Is it possible to use Ghostscript or something similar to batch-fix these files?
Example of "bad" rendering:
It's impossible to say what's wrong with the PDF file (or viewer) without seeing the PDF file, which also makes it hard to propose solutions!
You could certainly run the file through Ghostscript to the pdfwrite device, and use the -dFILTERTEXT switch to not process the text. The resulting document would therefore not contain the offending text, but would still contain the image.
Of course, it would then not be possible to search or highlight text in the resulting document.
You could instead use -dFILTERIMAGE which would remove the original image leaving the text behind. But then anything in the original document which was not text would now be missing.
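As a concrete sketch (file names are placeholders, and the FILTER switches need a reasonably recent Ghostscript):

gs -o no-text.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o no-image.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf

The -o switch sets the output file and implies -dBATCH -dNOPAUSE.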
The usual 'best practice' is to have the text drawn in rendering mode 3, which makes no marks. This allows you to see the original image without the OCR'd text interfering. It's possible that the viewer you are using is not honouring the text rendering mode, which would be a (fairly serious) bug in the viewer. The most recent versions of macOS seem to have some nasty bugs in the Quartz PDF rendering engine.
The other way to do this is to draw the text first and then put the original image on top of it, but that's hard to get wrong, so I suspect it's more likely the text rendering mode.
EDIT
The PDF file first draws the text, then draws the image on top of the text. The underlying text should not appear. mkl is quite correct in his comment.
The correct way to fix this is to fix the consumer which is rendering it incorrectly. As I mentioned above, the latest version of Quartz seems to have some fairly serious bugs; you might choose to raise this as a bug with Apple.
The only other solution would be to run this through something which will remove the text. Ghostscript can do this but there are implications; firstly it will no longer be possible to search/copy/paste text from the document. Secondly you would need to run quite a complex command line in order to prevent the decompressed JPX images being recompressed as JPEG, which would probably result in compromised quality. Finally the resulting file size would be larger.
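For what it's worth, a sketch of what such a command line might look like (untested against your files; setting the Encode*Images distiller parameters to false stores the decompressed images losslessly instead of recompressing them as JPEG, which is exactly why the output grows):

gs -o no-text.pdf -sDEVICE=pdfwrite -dFILTERTEXT \
   -dEncodeColorImages=false -dEncodeGrayImages=false input.pdf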
Related
Morning, everyone,
Quick question about PS2PDF. I use it to convert graphics that I produce directly in PostScript to PDF. While there is no visual problem with the PS files, I see a grid in my PDF viewer. At first I thought the problem was in the viewer, but it remains present when I compile my TeX files containing the figures with PDFLaTeX. Do you have any ideas for settings that can "fix" this display? Thanks in advance :)
Evince is independent of Ghostscript as far as PDF files are concerned, though I don't know what it uses to display PostScript files.
I believe what you are seeing is an artefact of the PDF rendering engine in use, and the way the PDF file is constructed (which is itself dependent on the way the PostScript is constructed).
Much of the content is drawn by creating little rectangles which are intended to butt up against each other (and basically do). However, depending on the resolution, the precise numerical accuracy of the calculations and the accuracy of the co-ordinates, it can be the case that these rectangles do not quite touch, leaving a theoretical gap between them.
You can see this occur with Adobe Acrobat, and zooming in and out changes where the lines appear (it changes the effective resolution, thereby changing the calculations from user space to device space, i.e. to the actual pixels on screen).
I cannot say for sure that the same problem exists with Evince, but I expect it does. With Acrobat I can turn off anti-aliasing, which is where the problem really arises: Acrobat is attempting to insert an anti-aliased pixel between the two rectangles, which leads to these faint lines. Turning it off (in Acrobat X: Edit -> Preferences -> Page Display -> Smooth line art) makes the lines disappear.
Ghostscript doesn't apply anti-aliasing by default, so these lines don't appear when rendering either the PostScript or the PDF files, but if I turn on anti-aliasing (-dGraphicsAlphaBits=4) then Ghostscript renders the lines in both the PostScript and the PDF file.
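If you want to reproduce this yourself, something like the following should show the difference (a sketch; output names and resolution are arbitrary):

# no anti-aliasing (the default): the grid does not appear
gs -o page-%03d.png -sDEVICE=png16m -r150 input.pdf
# with anti-aliased graphics: the faint grid lines show up
gs -o page-aa-%03d.png -sDEVICE=png16m -r150 -dGraphicsAlphaBits=4 input.pdf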
Essentially I think the problem is that your PDF viewer is using anti-aliasing and your PostScript viewer isn't, so they don't look the same.
I've got the following problem:
I want to print a PDF file as a booklet, using Adobe Acrobat Reader (in a copy shop; they have no better printing software). Unfortunately, Adobe shrinks my file down to the printable area. Instead I want it printed at 50% of the original size (because it's a booklet, every page is shrunk by half) without any further shrinking, with the margins simply cut off (just the edge of some pics etc., nothing important; the size matters).
My idea was to use some software to create a white margin around every page, covering the stuff in the non-printable area. Then Adobe would not shrink anything down.
Does anyone know a tool for my problem? I couldn't find one. (Running on either Windows or Ubuntu.)
I would prefer a command line tool, because I have a bunch of files to print.
Or is there a way to tell Adobe Reader not to shrink anything? (I know it works with normal printing, I just couldn't figure it out with booklet printing.)
Or are there any other ideas out there?
thanks in advance
Never mind, I found a solution:
I created a PDF template with a white margin, transparent in the middle.
Using 'pdftk' I can easily set my original file as the background of my template.
Done.
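For anyone with the same problem, a sketch of the pdftk call (assuming a one-page template.pdf whose middle is transparent; stamp lays the template's page over every page of the original, which amounts to the same thing as setting the original as the template's background):

pdftk original.pdf stamp template.pdf output framed.pdf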
EDIT 26.03.2018: Anyone who wants to continue my work can have a look at my source files: https://github.com/n0l0cale/ocr-sampledata
I'm actually looking for some details about PDF files. It's most important for me that the files remain usable for a very long time, and if possible OCR should be applied automatically to new files (which does not really seem possible with Adobe Acrobat...).
For that I've been looking at different solutions for OCRing my PDF files. I found three candidates which seem to do what they should (more or less), but all three variants have their pros and cons, and there seem to be different approaches to how they store the data in the PDF files. Let me explain:
A file OCRed with Adobe Acrobat:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ACROBAT.pdf
results in a file that Acrobat is able to open in one step (no preloading of any background layer), and after running a preflight script I'm able to see the text that is stored hidden:
A file OCRed with Abby Finereader:
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_ABBY.pdf
does not seem suitable for the default Adobe preflight script, as it does not display any additional layers:
But as far as I was able to reproduce it, these files seem to have a background text layer which contains the OCRed text and underlies the image that is shown to the user. Unfortunately this layer seems to be loaded separately, which is confusing when opening the file with Adobe Acrobat...
A file OCRed with Tesseract 4 (Alpha):
https://github.com/n0l0cale/ocr-sampledata/blob/master/A4%20sample_TESSERACT_oem2.pdf
is also doing some weird magic with the hidden text part:
But in all three cases I'm able to search for words in the files and see the text using "Remove hidden information" and selecting "hidden text":
I'm seriously confused... Does anyone know how these programs really store their hidden text information?
S.
P.S.: For those wondering what this ominous preflight script is: https://theblog.adobe.com/hidden-gems-in-acrobat-dc-how-to-optimize-hidden-ocr-text/
Does anyone know how these programs really store their hidden text information?
You have correctly found out that the approach of Abby Finereader differs from that of Adobe Acrobat and Tesseract:
Abby creates a page content stream in which the text is first drawn normally on the page and then covered by the scanned image.
Acrobat and Tesseract create content streams in which first the image is drawn and then the text is drawn invisibly (using text rendering mode 3, which draws nothing).
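You can verify this yourself: qpdf's QDF mode writes a copy of a file with decompressed, human-readable content streams which you can then inspect in a text editor (assuming qpdf is installed; file names are placeholders):

qpdf --qdf --object-streams=disable input.pdf expanded.pdf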
The difference between the latter two results is the choice of font used:
Acrobat uses the regular standard 14 fonts, for which a PDF viewer has its own font programs, so it renders them as normal glyphs.
Tesseract uses a font called GlyphLessFont, for which it embeds a font program into the result file. When rendered, the glyphs in this font do not show as normal Latin glyphs but merely as empty space.
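If you want to compare the fonts yourself, poppler's pdffonts utility lists every font in a file together with its type and whether it is embedded (assuming poppler-utils is installed; note the quotes because of the space in the file names):

pdffonts "A4 sample_ACROBAT.pdf"
pdffonts "A4 sample_TESSERACT_oem2.pdf"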
Considering the visual effect you observed for the Abby result, the approach used by Acrobat or Tesseract might be preferable.
Whether one prefers fonts with visually recognizable glyphs (as used by Acrobat) or without (as used by Tesseract) is mostly a matter of taste; they are only used in the invisible rendering mode anyway.
I'm trying to use PDF content (mathematics) on my webpage. I basically want to convert the PDF to some vector image. Converting the PDF to SWF does the job very well, but as Flash isn't supported on every platform, I'm trying to find another solution.
I read about SVG, but as those PDFs contain a lot of mathematics, the output of the converters I found is really ugly and incorrect.
I've also thought about retyping the LaTeX and displaying it using MathJax; in some ways this is the best solution, but it is also very time consuming.
The only thing I want is to convert it to a nice vector image; I don't want to change the content or anything else. Besides converting to SWF or retyping it, is there any other solution?
Edit:
this is the SVG output
and here is the original PDF
The only solution I could find is Illustrator.
Just open the PDF, save as SVG, and choose to embed all used glyphs.
The result is perfect:
https://dl.dropboxusercontent.com/u/58922976/Sol-10.1.svg
What about using Flash plus a raster image as a fallback on platforms without Flash, if Flash mostly works for you?
Your PDF is a little difficult for reasons that are probably not apparent to you.
The core problem with it is that some of the graphics in the document are actually drawn using custom glyphs. You can see this if you copy and paste the text out of Acrobat. There are a variety of unusual characters in there that don't seem to serve any useful purpose. That's those squares at the bottom of your SVG with EEs and FFs in them.
However these characters are actually custom glyphs for things like the braces around the matrices at the bottom of the page. So they are both fairly important and also very specific to this document.
I tried ABCpdf .NET to convert your PDF to SVG. It worked fine apart from these custom glyphs at the bottom. The output was about 90KB. It looked very similar to your inkscape SVG output but just a bit smaller (the inkscape one is 160KB).
The only way to get rid of these non-Unicode glyphs is to vectorize the text. I did this using ABCpdf and the output looked fine as SVG. But... vectorized text is big and SVG isn't a particularly efficient medium. The output was about 1MB! Zipped it goes down to half that, but it's still nowhere near as efficient as the original PDF.
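If you want to experiment with vectorizing the text without ABCpdf, reasonably recent Ghostscript versions can do the same thing with -dNoOutputFonts, which replaces all glyphs with plain vector outlines before you convert to SVG (a sketch; expect the same size blow-up described above):

gs -o vectorized.pdf -sDEVICE=pdfwrite -dNoOutputFonts input.pdf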
The problems I am seeing here are going to be universal whatever format you use. These custom characters are always going to be problematic whether you output to SVG, SWF, HTML canvas, VML or indeed any vector format.
So what would I suggest? Well the obvious vector format that is widely used on the web is... PDF!
I know it's not quite what you're looking for but I think this is the realistic solution given the constraints above. :-)
I would like to know if it's possible to convert a PDF to an image without fonts. My goal is to have only the image, without text.
And if yes, can I do it with ImageMagick/Ghostscript?
Here is an example.
The final image: http://crocodoc_public.s3.amazonaws.com/8b8aa154-45e3-41f9-a465-628e1b2e955d/images/page-001.png
and the original PDF: http://crocodoc.com/demo/efwpa (page 2). We can see that the text is overlaid on top of the image; what I want is to do the same.
So if I got you right, what you want is to remove some text from your PDF (not the fonts), and you want to do it programmatically. I suspect you already know that this will only be possible if the text is placed on some kind of separate layer in your PDF files. You can try to utilize iText for that. Beware: this will mean you have to invest some days learning how to use that library.
I too am on the lookout for something like that.
While playing with ImageMagick I tried the following command and got some unexpected results.
convert input.pdf -blur 0x0 output.jpg
This removed the text layers from the PDFs I tried.
I cannot guarantee that this will work for you, or that it is the right way to achieve it, but you may try.
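For what it's worth, ImageMagick delegates PDF rasterisation to Ghostscript anyway, so you can also call Ghostscript directly; reasonably recent versions have a -dFILTERTEXT switch that drops all text while rendering (a sketch; file names and resolution are placeholders):

gs -o page-%03d.png -sDEVICE=png16m -r150 -dFILTERTEXT input.pdf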
You can do that with Adobe Acrobat: select the text with the touch-up tool and delete it. I don't think you can do that with Ghostscript. You could consider editing the PDF by hand (qpdf helps).