Tesseract : Line detection too sensitive - pdf

I am trying to extract the text from .pdf files.
They are first converted to images, then given to Tesseract.
The recognition is good, but it produces too many line breaks.
For example, if the page is tilted a bit to the right, the sentence:
"I like Tesseract for reading text"
becomes:
"text reading for Tesseract like I"
And that's already after some processing, because the raw text is:
"textreadingforTesseractlikeI"
The bug has occurred ever since the source .pdf files have been at 300 DPI. I understand that the problem comes from the resolution, but I cannot find how to solve it.
Here is my Tesseract command:
Tesseract.exe dummy.pdf dumy-ocr.pdf --psm 12 --dpi 300 -l bvr+fra+eng+deu hocr pdf
First, I would like to solve the problem of too many lines,
then I would like to find out how to make the image perfectly straight.
Thank you in advance for your help.
https://i.stack.imgur.com/crmdO.jpg

You seem to be working backwards.
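The "many" lines, and thus the word reversal, are due to the anti-clockwise rotation: the right end of the tilted line sits highest on the page, so each word is read as its own line, from top to bottom: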
text"
reading
for
Tesseract
like
"I
Fix that first, and the words will naturally all be placed on the same line.
Leptonica, when used in conjunction with Tesseract, is supposed to help with the pre-processing, including deskew.
However, there is a very small but powerful open-source GUI and command-line tool for Windows, Linux, and macOS that you could use from a shell; see https://galfar.vevb.net/wp/projects/deskew/. It is also available on GitHub as an AppVeyor CI artifact, so for the most up-to-date version (currently 5 days old) follow the green tick at https://github.com/galfar/deskew
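To illustrate why a small anti-clockwise tilt splits one sentence into six reversed "lines", here is a toy model in Python. It is only a sketch of the geometry, not Tesseract's actual line-finding algorithm: the function name, the word spacing and the line tolerance are all made-up values for illustration.

```python
import math

def layout_lines(words, tilt_degrees, word_spacing=150.0, line_tolerance=5.0):
    """Toy model: group words into text lines by their vertical position.

    Assumptions (hypothetical, for illustration only): words are evenly
    spaced along one baseline, and two words share a line when their
    vertical offset is at most `line_tolerance` pixels.
    """
    slope = math.tan(math.radians(tilt_degrees))
    boxes = []
    for i, word in enumerate(words):
        x = i * word_spacing
        # Image y grows downward; an anti-clockwise tilt lifts the right end.
        y = -x * slope
        boxes.append((y, x, word))
    boxes.sort()  # top to bottom, then left to right

    lines = []
    line_top = None
    for y, x, word in boxes:
        if line_top is None or y - line_top > line_tolerance:
            lines.append([])  # too far below the current line: start a new one
            line_top = y
        lines[-1].append((x, word))
    return [[word for _, word in sorted(line)] for line in lines]

sentence = "I like Tesseract for reading text".split()
straight = layout_lines(sentence, tilt_degrees=0)
tilted = layout_lines(sentence, tilt_degrees=3)
```

With no tilt the model yields a single line in reading order; with a 3 degree tilt each word ends up on its own line, starting with "text", which is the reversal seen in the question.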

Related

PDF Table Lines Missing from GhostScript

I am trying to convert a PDF file to an image format (ideally PNG), but some of the table lines do not render in the output, which is an issue since the purpose of my conversion is to use computer vision on it.
I unfortunately do not have access to the file used to generate the PDF.
Thank you in advance for your help!
Attached is the ghostscript rendering vs the actual pdf:
Original
GhostScript
EDIT: Thanks for the answers. Here is what I had already tried:
Changing the scaling & Changing the Antialiasing (I doubt that any combination of this will work in Ghostscript at this point)
Converting to PostScript and then to PNG/PDF
Saving from a Browser
Saving from various virtual printers to PDF
Using Poppler to do the rendering
All to no avail. Digging deeper, I found some interesting things which may be helpful. Ghostscript does recognize the lines when using -sDEVICE=x11 and -sDEVICE=ps2write. That is, using Ghostscript to view the PDFs does work, but processing them into anything other than PostScript does not.
Also, printing into a PDF from Adobe Acrobat does fix my problem, however this is something that I need to be able to do from the command line on thousands of files.
Hope this helps!
EDIT2:
Link to a concerned file
https://transfer.sh/PuIF90/e176ad9824ddc6cb5e6aead2d389c131-filer.pdf
I thought that I would share the fix that I found. It turns out that a bunch of the PDFs we need to process were generated using a specific HTML5-to-PDF conversion tool, which turns each line of the PDF into a rectangle of size 0. The solution for me has been to automate decompressing the PDFs and looking through the resulting text for "A A A A re", with all the A's being numbers. Should the last or next-to-last A be a zero, I change it to 1.
For instance (once again, after decompressing the PDF):
1000 2000 0 14 re
to
1000 2000 1 14 re
Hope this helps someone else out there, and let me know if there is a more elegant way of doing this; I am still a beginner at all things PDF.
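For anyone wanting to automate the same replacement, here is a minimal Python sketch of the search-and-replace step. It assumes the PDF has already been decompressed and that the content is handled as text; the function name and the pattern are my own, not taken from any PDF library.

```python
import re

# A PDF number: optional sign, digits, optional fraction.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"

# "x y width height re" is the rectangle operator; separators are captured
# so they can be written back unchanged.
RECT_RE = re.compile(
    rf"(?<!\S)({_NUM})(\s+)({_NUM})(\s+)({_NUM})(\s+)({_NUM})(\s+re)(?!\S)"
)

def widen_zero_rects(content: str, min_size: str = "1") -> str:
    """Replace a zero width or height in every rectangle with `min_size`."""
    def fix(match):
        x, s1, y, s2, w, s3, h, tail = match.groups()
        if float(w) == 0:
            w = min_size
        if float(h) == 0:
            h = min_size
        return f"{x}{s1}{y}{s2}{w}{s3}{h}{tail}"
    return RECT_RE.sub(fix, content)

fixed = widen_zero_rects("1000 2000 0 14 re")
```

When the zero is written as a single "0", swapping it for "1" keeps the byte length the same. If the tool writes it differently (for example "0.0"), the stream lengths and offsets in the file would presumably need to be repaired afterwards.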

How to convert unusual unicode characters (UTF-8) to PDF?

I would like to convert a text file containing Unicode characters in UTF-8 to a PDF file. When I cat the file or look at it with vim, everything is great, but when I open the file with LibreOffice, the formatting is off. I have tried various fonts, none of which have worked. Is there a font file somewhere on my Ubuntu 16.04 system which is used for display in a terminal window? It seems that would be the font to tell LibreOffice to use.
I am not attached to LibreOffice. Any app that will convert the text file into a PDF file is fine. I have tried txt2pdf and pandoc without success.
This is what the file looks like
To be more specific about the problem, below is an example of what the above lines look like in LibreOffice using Liberation Mono font (no mono font does better):
I answered you by mail, but here is the answer. You are using some very specific characters, the most difficult to find being those in the Miscellaneous Symbols Unicode block. For instance, SESQUIQUADRATE, which is on your second line as ⚼.
A quick search led me to the following two candidates (for monospace fonts):
Everson Mono
GNU Unifont
As you can see, the block is partially covered by PragmataPro, which is a very good font. I tried an old version of it and found all your characters, but an issue occurred: the Sun character (rendered as ☉) seems to be printed twice as wide as the other characters. My version of this font is rather old and perhaps buggy.
Once you have chosen the font suiting your needs, you may be able to render your documents as PDF with various tools. I made all my experiments with txt2pdf which I use daily for many documents.
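To see which characters a candidate font has to cover, it can help to list the non-ASCII characters of the text file with their code points and Unicode names. The following is a small sketch using only Python's standard library; the helper names are made up for this example.

```python
import unicodedata

def list_special_characters(text: str):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in seen]

def in_miscellaneous_symbols(ch: str) -> bool:
    """True if the character lies in the Miscellaneous Symbols block (U+2600 to U+26FF)."""
    return 0x2600 <= ord(ch) <= 0x26FF

report = list_special_characters("Sun \u2609 and \u26bc, again \u2609")
```

Running it over the real file (read with encoding="utf-8") gives a list to check against the coverage tables of Everson Mono, GNU Unifont or any other candidate.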

How to troubleshoot badly rendered PDF file

I have a small PDF file, which is supposed to display just the string "Hello World!".
Unfortunately, it displays black boxes instead of the characters. I suppose there is some problem with the fonts, but I am not sure.
Is there a way to diagnose and troubleshoot this issue? All I see on the Internet is advice to do this or to do that, which helps some people and not others (nothing helped me). It looks like shooting in the dark to me.
Here is a concrete example. Why does this PDF display black squares instead of the string Hello World ?
EDIT
A bit of the context. I am trying to convert a trivial HTML to PDF using the wkhtmltopdf tool. It is an absolute frustration, because according to the Internet searches the tool is supposed to work and do it quite well. But the thing does not work for me and nothing I do changes this! Unfortunately, this tool seems the only free tool to convert HTML to PDF. This is a huge bummer.
If you want to find out whether a PDF is valid or what is wrong with it, there are a few general steps you can take:
1) Open it in Adobe Acrobat or Adobe Reader (on a desktop platform, not a tablet device). For a very long time the PDF format was owned by Acrobat and the way their software handles PDF is still close to the gold standard. However, there is a caveat with this; Acrobat is very, very smart in the way it handles PDF files and it will overlook or actively correct a number of mistakes other PDF engines might have a problem with...
2) Get yourself a preflight tool. These tools were invented for use in graphic arts, but have applications outside of it too. Popular examples are callas pdfToolbox (warning, I'm affiliated with this vendor!) or the "Preflight" plug-in you'll find in Adobe Acrobat Pro (which is actually also callas technology under the hood). Then preflight specifically against the PDF/A-1b or PDF/A-2b standard.
That last point deserves some more explanation. You should pick a PDF/A compliant preflight profile because the PDF/A (or PDF for Archival) standard is extremely picky. Its goal is to make sure that PDF files will still be readable in exactly the same way 50 years from now, and to ensure that, it tests a whole range of properties of the file itself and the different components in it. You might be able to ignore some of the errors you get (because some of them will be connected to the fact that the PDF/A identification isn't correct, for example), but I wouldn't ignore any other errors unless you understand exactly what they mean and why they aren't relevant.
PS: Can you make your test file available some other way? The file you shared in your question is useless I think. When I do "Download" I get a PDF file that doesn't contain text and doesn't have fonts in it. Those rectangles you see are exactly that - rectangles. So this PDF renders fine - it's the PDF generation process (or the fact that you stored the file on Google docs - I really have no clue what that might do) that went berserk apparently.
In addition to David's hints (first using a known good viewer and then some preflight tool), there is a third level in the inspection process:
3) Inspect the PDF with your own eyes and with the PDF specification (made available by Adobe here) at hand in a text viewer (for a first impression) and (if the cause of the issue at hand is not immediately visible) then in a PDF browsing tool (for in-depth analysis).
This step is quite cumbersome at first but after some time you learn your way around in the PDFs.
An example of such a PDF browser tool is RUPS, but there are others around, too.
'Small PDF file supposed to display "Hello World!"'
Not correct. The file you linked to does not contain any code that could render pixels on screen or on paper that a human brain would read as "Hello World!". The file indeed does only contain vector drawing operations which result in 12 black boxes.
The command line tool pdffonts does not indicate any font being used in the file:
pdffonts so-file-#15858199.pdf
What could still cause the "rendering" of the words you are looking for: some vector or pixel drawing code contained in the PDF. To find out about this, you'll have to look into the low level source code of the PDF.
The original file is 1,570 bytes, so this task does not look overly large.
'Is there a way to diagnose and troubleshoot this issue?'
Using qpdf, a "command-line program that does structural, content-preserving transformations on PDF files", you can expand all contained streams (which are normally compressed):
qpdf --qdf --object-streams=disable so-file-#15858199.pdf qdf-#15858199.pdf
The resulting file, qdf-#15858199.pdf, is 3,875 bytes. Now open it in a text editor. PDF object no. 6 (lines 66-219) contains the contents of the page. Lines 123-194 contain only the operators m (moveto), l (lineto) and h (closepath). These lines contain 12 different groups of drawing commands, where each one represents the path for one of the 12 black boxes you see rendered on screen or printed on paper:
102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
Line 196 contains
f
which is the fill operator: it actually fills the (closed) paths constructed so far with black. Nothing in the other lines (which I didn't analyze in detail) does any drawing that may resemble the shapes of any glyphs.
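The manual inspection above can be roughly automated. Here is a naive Python sketch that counts the path operators in an expanded content stream and checks whether any text-showing operator (Tj, TJ, ' or ") appears. It simply splits on whitespace, so it would miscount if a string literal contained a lone "m", "l" and so on; treat it as a first look, not a PDF parser. The function name is hypothetical.

```python
from collections import Counter

PATH_OPS = ("m", "l", "h", "f")          # moveto, lineto, closepath, fill
TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}   # operators that paint glyphs

def summarize_content_stream(stream: str) -> dict:
    """Count path operators and detect text-showing operators (naive tokenizer)."""
    counts = Counter(stream.split())
    summary = {op: counts[op] for op in PATH_OPS}
    summary["shows_text"] = any(counts[op] > 0 for op in TEXT_SHOW_OPS)
    return summary

sample = """102.400001 12.8000001 m
268.800004 12.8000001 l
268.800004 179.200002 l
102.400001 179.200002 l
102.400001 12.8000001 l
h
f"""
summary = summarize_content_stream(sample)
```

For the page in question one would expect 12 closed subpaths and shows_text to be False, which matches what pdffonts reports.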
'Unfortunately, this tool seems the only free tool to convert HTML to PDF'
Not correct either.
1. Assuming your "free" is meant as free as in liberty, then an alternative option is HTMLDOC.
HTMLDOC does not support specific fonts which may be assigned to your HTML input via CSS, but it does a good job in converting one or multiple HTML documents into a single PDF book containing chapters, page-numbering, page headers and footers and more. For all options available, see its full documentation.
2. Assuming your "free" is meant as free as in beer, then an alternative option (for private usage only) could be PrinceXML.
PrinceXML does an extraordinarily good job when it comes to supporting almost all CSS features your HTML document may be using. See its documentation and also some of the sample PDF files produced by PrinceXML.

Saving the output from DiffPDF / ComparePDF command line. - Comparing folders of PDF's

We have to do a comparison of about 1500 PDF's in one folder with 1500 PDF's in another to check for visual differences.
We have found DiffPDF (and comparepdf, the command-line version) for Windows, which is a lot faster than our automated Acrobat Pro comparisons.
So far I have used:
comparepdf -v=2 -c=a old.pdf new.pdf
but the problem with this is that it just returns "these files are different". Does anyone know of any way to save the output from the command line? You can do this from the GUI, but that would mean using something like TestComplete to automate it :(
Or are there better ways of comparing two PDFs visually, with output or highlighting?
Bonus points for C# .net libraries.
You could have a look at these answers to similar questions:
PDF compare on linux command line
How to compare two pdf files through command line
How to unit test a Python function that draws PDF graphics?
However, I have no idea if any of these would be performing faster than what your automated Acrobat Pro comparison does... Let me know if you found out, will you?
Shortcut:
For simplicity, let's assume your input files to be compared are similar enough, and each being only 1 page. (For multi-page input expand the base idea of this answer...)
The two most essential commands any such comparison boils down to are these:
compare.exe ^
%input1% ^
%input2% ^
-compose src ^
%output%.tmp.pdf
and
pdftk.exe ^
%output%.tmp.pdf ^
background %input1% ^
output %output%.pdf
The first command generates a PDF with all differential pixels colored in red. (A default resolution is used here, 72 dpi. For a more fine-grained view on pixel differences add -density 200 (that will mean: 200 dpi) or higher -- but your processing time will increase accordingly as will the disk space needed by the output...)
The second command tries to merge the resulting PDF with a background taken from %input1%.
Optionally, you may add -verbose -debug coder after the compare command for a better idea about what's going on.
compare.exe is a command-line tool from the great, great ImageMagick family of utilities (available for Linux, Windows, Unix and Mac OS X). But it requires a Ghostscript installation to use as a 'delegate' in order to be able to process PDF input. pdftk.exe is also a command-line utility, available for the same platforms. Both are Free Software.
After the first command, you'll have an output file which has only red pixels where there are differences found on the page.
After the second command, you'll have an output with all red 'diff' pixels in the context of the first input PDF.
Example output:
Here are screenshots of two 1-page PDF files with differences in their content:
Here are screenshots of the output produced by the two commands above:
The left one shows the intermediate result (after first command), with only the difference pixels displaying as red (identical pixels being white).
The screenshot on the right shows the red difference pixels, but this time with the input PDF file number 1 as a (gray) background (after second command).
(PDF input files courtesy of Mark Summerfield, author of the beautiful DiffPDF tool.)
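For the two-folder case from the question, the two commands above could be driven from a script. Below is a rough Python sketch, assuming that `compare` and `pdftk` are on the PATH (newer ImageMagick versions may want `magick compare` instead), that files are paired by identical names, and that each PDF has a single page. The function names and the ".diff.pdf" output suffix are made up for this example.

```python
import subprocess
from pathlib import Path

def build_commands(old_pdf: Path, new_pdf: Path, out_dir: Path):
    """Build the compare and pdftk argument lists for one pair of PDFs."""
    tmp = out_dir / f"{old_pdf.stem}.tmp.pdf"      # red difference pixels only
    final = out_dir / f"{old_pdf.stem}.diff.pdf"   # differences over the old PDF
    compare_cmd = ["compare", str(old_pdf), str(new_pdf),
                   "-compose", "src", str(tmp)]
    pdftk_cmd = ["pdftk", str(tmp), "background", str(old_pdf),
                 "output", str(final)]
    return compare_cmd, pdftk_cmd

def compare_folders(old_dir: Path, new_dir: Path, out_dir: Path) -> None:
    """Run both commands for every PDF present in both folders."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for old_pdf in sorted(old_dir.glob("*.pdf")):
        new_pdf = new_dir / old_pdf.name
        if not new_pdf.exists():
            print(f"missing counterpart: {new_pdf}")
            continue
        compare_cmd, pdftk_cmd = build_commands(old_pdf, new_pdf, out_dir)
        # compare may exit non-zero when the pages differ, so its exit
        # status is not treated as a failure here.
        subprocess.run(compare_cmd)
        subprocess.run(pdftk_cmd, check=True)

# Example call (hypothetical folder names):
# compare_folders(Path("old"), Path("new"), Path("diffs"))
```

Building the argument lists separately from running them makes it easy to print or log the commands first, which helps when something goes wrong on file 900 of 1500.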
I had the same problem: diffpdf is quick and nice, but GUI only.
[comparepdf] is a console tool, but it reports only an exit code (no diff itself).
[diff-pdf] has both a console mode and diff.pdf output, but it is slow and its output is not friendly.
I have tried to add the required code to diffpdf,
you can find it here: http://github.com/taurus-forever/diffpdf-console

Exported PDFs from Mathematica 8 won't print

UPDATE: I wrote to Wolfram support about this and will update the post if they can resolve the problem. Sorry for spamming SO with a technical support question, but here it remains in case anyone else is having the same issue.
Is anyone else having this problem with Mathematica 8? I recently upgraded and noticed that when I export Graphics to a PDF file, although the file appears fine on my computer, it prints as a blank page. For example, try
Rectangle[{1,1}]//
Graphics//
Export["~/test.pdf",#]&
which creates a PDF file containing a black square. This file opens fine, but if I send it to my department printer I just get a blank page. If I don't export the graphics but print the notebook from MM, no problem, the graphics print as expected. If I use MM 7 to do exactly the same thing, the PDF file prints as expected. Exporting to PNG in MM8 seems to work fine. And, using the context menu Save Graphics As ... or File > Save Selection As ... to create a PDF containing just the graphic also works. However, these graphics eventually get included in a TeX document, and it would be far better if I could continue using the script I've got that doesn't require any button clicking to generate them.
I'm running MM 8.0.0.0 on Mac OS 10.6.7. I have not been able to test this on another printer yet, but this printer has never given me problems before and prints other PDF documents fine. Any ideas why this is happening?
Wolfram Research responds:
...
This issue has been reported by other users as well and our developers are currently looking into it. I have added your details to the report so you can be notified when this is resolved.
In the meantime, the alternatives that you could try are:
Try a different printer.
Rasterize the image with the function 'Rasterize' before exporting. If the rasterized image loses some resolution, you could use the option 'ImageResolution' to edit this.
Rasterize[image, ImageResolution -> xxx]
Surely this is a bug (please report it to support@wolfram.com), but you can work around the problem by selecting the graphic and choosing File > Save Selection As... from the menu (or Save Graphic As... from the contextual menu). This produces a slightly different file that doesn't appear to exhibit the undesirable behavior we observe from Export[].
These problematic files, and LaTeX PDFs that include them, can be properly printed by Adobe Reader 10.1.2. That's if you're okay with installing and using a 450MB PDF reader.
I reproduced the problem (which led me to this question) with Mathematica 8.0.4.0 on Mac OS X 10.7.2. Wolfram suggested lame workarounds like Rasterize and told me:
This issue has been addressed by our developers, and a fix will be included in a future version of Mathematica.