OCRmyPDF: 'pages' parameter not working as expected even with optimization disabled

I'm using ocrmypdf and I only want the first page of each file to have its characters recognized. I'm trying to do this with
ocrmypdf -l por --force-ocr --pages 1 --optimize 0 input.pdf output.pdf
but even then it outputs
Start processing 10 pages concurrently
The files are in Portuguese, and some of them have text in fonts that I can't read in Python because the extracted string becomes a lot of "(cid:)" tokens; that's why I use --force-ocr.
Also, I have a lot of files (the files are actually a parameter for an application I'm developing), so this is taking too much time.
My operating system is Windows, if that helps somehow.
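A workaround I'm considering is to split off the first page before running OCR, so ocrmypdf only ever sees one page (a sketch, assuming qpdf is installed; the file names are placeholders):
qpdf input.pdf --pages . 1 -- first-page.pdf
ocrmypdf -l por --force-ocr first-page.pdf output.pdf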

Related

Batch converting EPS/PDF to PostScript

I'm on Windows and am trying to batch-convert 6000 PDF files to PostScript files. The reason is that I'm trying to do PDF imposition, as asked here; I first wanted to do it in R, as asked here. I found an R library, grImport, to handle vector graphics in R, but it needs .ps files.
I was already able to batch-convert .pdf files to .eps using Inkscape with this script. However, I need .ps for the R package. I was unable to do it using an Adobe Acrobat Pro Action (it simply doesn't work on the folder, and freezes when I try it on an individual file).
I have also tried Ghostscript, but setting -sDEVICE=pswrite throws an error saying the device is unknown. Also, I really could not get my head around GS.
How can I do this? (If you happen to know a solution to the main problem, sharing it would be very much appreciated.) Thanks in advance.
The pswrite device was deprecated years ago (it only produced level 1 output: big, ugly, and unscalable). You want to use the ps2write device, which produces level 2 PostScript.
A simple command line for Ghostscript would be:
gs -sDEVICE=ps2write -o out.ps input.pdf
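Since there are 6000 files, a loop in the Windows console may help (a sketch, assuming the console Ghostscript executable gswin64c is on your PATH; in a .bat file, double the % signs to %%):
for %f in (*.pdf) do gswin64c -sDEVICE=ps2write -o "%~nf.ps" "%f"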
There are tools for imposing PDF files; you can even (with some effort) do it with Ghostscript.

How to create MJPEG

I can't understand how to create a .mjpeg file. As far as I understand, it is simply a series of JPEG files. I searched online for a way to combine them into a single file, but didn't find any information. Some people said that one just needs to create a mini-server that shows one image after another.
I'm trying to use the following application, git://git.ideasonboard.org/uvc-gadget.git, to test UVC, and one of its options is a path to an mjpeg file. I'm not clear on whether it is possible to create an mjpeg file at all.
I would appreciate any help on how to create an mjpeg file so I can use it with the above-mentioned application.
I had a difficult time searching for this as well. It's especially misleading to read through mencoder's manpage, since it supports various movie containers but not the UVC payload format.
This seemed to work for me to record a bytestream from a webcam on Ubuntu 16.04:
gst-launch-1.0 v4l2src device=/dev/video0 ! 'image/jpeg,width=1280,height=720,framerate=30/1' ! \
filesink buffer-size=0 location=mystream.mjpeg
where 1280x720 at 30 fps is what guvcview says my webcam supports.
Source: link
Edit: Later I learned about v4l2-ctl:
v4l2-ctl -d /dev/video0 --list-formats-ext # identify a proper resolution/format
v4l2-ctl --set-fmt-video=width=1280,height=720,pixelformat=1
v4l2-ctl --stream-mmap=1 --stream-count=30 --stream-to=mystream.mjpeg
When the stream-count is set to 1, it makes a regular JPEG file that can be viewed with xdg-open. Otherwise, run file mystream.mjpeg to confirm the output has a proper resolution and frame count.
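If you already have the frames as individual JPEG files, note that an MJPEG file is essentially just JPEG frames concatenated back to back, so something like this should also work (a sketch; frame_%04d.jpg is a placeholder pattern for your numbered frames):
ffmpeg -framerate 30 -i frame_%04d.jpg -c:v copy mystream.mjpeg
In a pinch, cat frame_*.jpg > mystream.mjpeg produces the same kind of raw concatenated stream.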
Getting this data to actually work with uvc-gadget -i could be much more involved: it may require the appropriate patches, kernel configuration, and debugging. So far I have only gotten the uncompressed format to work in isochronous mode on my Raspberry Pi Zero. Hopefully you're further along.

Does Ghostscript require a GPU?

I'm converting PDFs to images using a Node.js package: https://www.npmjs.com/package/pdf2images-multiple
This works successfully in Docker on two different local machines, both of which have graphics cards. However, when I try to run this on a server in Google Cloud (which does not have a GPU), the following error occurs for particular PDF pages that have graphs:
error: message=Failed to convert page to image, killed=false, code=1, signal=null, cmd=gm convert -density 150 -quality 100 -sharpen 0x1.0 -trim '/usr/src/app/1161115-30-1kabyqq.2bteimgqfr.pdf[7]' '/usr/src/app/pdfimages1161115-30-10uod6h.siy6pyzaor/1161115-30-1kabyqq-7.png', stdout=, stderr=gm convert: "gs" "-q" "-dBATCH" "-dMaxBitmap=50000000" "-dNOPAUSE" "-sDEVICE=pnmraw" "-dTextAlphaBits=4" "-dGraphicsAlphaBits=4" "-r150x150" "-dFirstPage=8" "-dLastPage=8" "-sOutputFile=/usr/src/app/gmxHC5iw" "--" "/usr/src/app/gm0tibSq" "-c" "quit" (child process quit due to signal 11).
gm convert: Postscript delegate failed (/usr/src/app/1161115-30-1kabyqq.2bteimgqfr.pdf).
I've created an AWS instance with a GPU, and this error does not occur there. I'm looking for an environment variable that would skip the GPU variant in Ghostscript, at least until Google Cloud gets GPUs, or some alternative that I'm not seeing here.
The command in the error message is GraphicsMagick's gm convert, and GraphicsMagick's documentation says it doesn't use any GPU techniques:
http://www.graphicsmagick.org/FAQ.html#are-there-any-plans-to-use-opencl-or-cuda-to-use-a-gpu
Ghostscript does not need, and indeed is not capable of using, a GPU (beyond using X to display the bitmap). There is some SIMD code, but you can compile without that; obviously I have no idea how the Ghostscript you are using was compiled.
On Linux it's often impossible to move a binary from one box to another, because the ABI differs between the two systems in terms of things like the C runtime. Also, if the executable has been compiled with shared libraries (many distributions insist on doing this), then differing versions of the shared libraries might cause problems.
My guess is that, rather than the presence or absence of a GPU, there is some significant difference between the Google Cloud Linux and the AWS Linux.
The best way to deploy Ghostscript on Linux is to build it from source on the machine you intend to use; this is especially true if you intend to put it on multiple machines with different configurations.
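A minimal build sketch (the tarball name is a placeholder; use whichever release you download):
tar xzf ghostscript-9.xx.tar.gz
cd ghostscript-9.xx
./configure
make
sudo make install
gs --version   # confirm the newly built binary is the one being picked up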

Template rendering engine on Raspberry Pi

I have a project in which I am using a Raspberry Pi to print tickets to a thermal printer.
It is pretty much the same principle as in this video.
Tickets are generated from templates that may include text and images. Both text and images are dynamic; for example, I may want to print the current time. I receive the template as a .psd from a designer, and the thermal printer takes bitmap data. The Raspberry Pi communicates with the printer through a Python library. Everything must be done locally, as cloud access is not guaranteed. Performance is important.
I investigated several options:
Latex + ImageMagick
Webkit + Phantom.js
Pillow (Python Imaging Library), especially the module ImageDraw
The first option is not quite satisfactory because LaTeX generates a PDF file and ImageMagick is then very slow to convert it to a .png.
The second option is appealing, but if I am not mistaken, I would need to run a server locally.
The third option would be great because it would be pure Python, but it requires building a basic typesetting system on top of PIL.
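For the first option, a variant might be to skip ImageMagick and rasterize the PDF with Ghostscript directly, which could be faster (a sketch, assuming a 203 dpi printhead; ticket.tex is a placeholder template):
pdflatex ticket.tex
gs -sDEVICE=pnggray -r203 -o ticket.png ticket.pdf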
Has anyone faced a similar problem?

Converting PDF to JPG with Imagemagick is really slow after update

I have several servers set up with CentOS 5 (64-bit) and the default yum-installed versions of Ghostscript (v8.70) and ImageMagick (v6.2.8), which work really well and are very quick at converting PDF files into JPG previews.
On one of my servers I have removed both IM and GS and installed the latest versions, Ghostscript (v9.0.7) and ImageMagick (v6.8.5), from source, and the conversion speed has gone from around 0.5 seconds to 7.5 seconds for exactly the same original PDF.
I need to be able to run the later version of both in order to use the inkcov device for working out which pages are colour in multi-page PDFs (up to 200 pages).
Now, I am assuming this slowdown is due to the compile options, as I can't believe that the later versions are so much slower. I have searched around to try to find ways of optimising at the compilation stage (building Q8 rather than Q16 quality, etc.), but nothing seems to make much difference.
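One way I could narrow this down would be to time Ghostscript on its own, outside ImageMagick (a sketch; input.pdf is a placeholder):
time gs -sDEVICE=jpeg -dJPEGQ=90 -r150 -o preview-%d.jpg input.pdf
time gs -o - -sDEVICE=inkcov input.pdf
If gs alone stays fast, the slowdown would be in the new ImageMagick; if gs alone is slow, it would be the Ghostscript build itself.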
Thanks