I am using Tesseract OCR to extract an exclusively numeric string from a PDF file.
The PDF contains: 66600O3377.pdf
but Tesseract recognizes: 66600Q3377.pdf
The input is a TIFF file, and the quality is good enough (see the screenshot).
Is there a way to improve Tesseract's accuracy? I could always replace the Q with a 0, but I'm afraid of further unexpected mistakes.
This is covered in the Tesseract FAQ:
Run a tesseract command like this to only permit digits in input image:
tesseract imagename outputbase digits
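If the call is scripted, the same command can be wrapped so that any non-digit character in the result is reported instead of silently accepted. This is only a sketch: the helper names are made up for illustration, and it assumes Tesseract writes its result to outputbase.txt, which is its default for plain-text output.

```python
import re
import subprocess


def digits_command(image_path, output_base):
    """Build the FAQ command: tesseract imagename outputbase digits."""
    return ["tesseract", image_path, output_base, "digits"]


def is_all_digits(text):
    """Return True if the text consists only of ASCII digits."""
    return re.fullmatch(r"[0-9]+", text.strip()) is not None


def recognise_digits(image_path, output_base):
    """Run Tesseract, then fail loudly if a non-digit slipped through."""
    subprocess.run(digits_command(image_path, output_base), check=True)
    # Assumption: the recognised text ends up in <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        text = handle.read().strip()
    if not is_all_digits(text):
        raise ValueError(f"unexpected characters in OCR result: {text!r}")
    return text
```

The check at the end addresses the worry about further unexpected mistakes: a stray Q raises an error rather than ending up in the output.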
I need to convert a PDF or PostScript file to EPS. I tried using Ghostscript with the following command to convert PostScript to EPS:
gswin32.exe -o output.eps -sDEVICE=eps2write -dFitPage input.ps
Or PDF to EPS:
gswin32c.exe -q -dNOCACHE -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -o output.eps -dFitPage input.pdf
Both commands complete successfully, but they do not preserve the page size. The input PDF and PS files contain the same drawing, and both have a page size of 300x300 pts. You can download these files here and here. They look like this:
But after converting them to EPS, the results are these: PS to EPS and PDF to EPS. The first image below is the result of PS to EPS and the second is the result of PDF to EPS (they were opened in EPS Viewer, which rasterizes the image, hence the low quality):
As you can see, neither has the original 300x300 pts size. I've tried many Ghostscript options, but I can't manage to get an EPS with the right bounding box. I just need to convert a PDF or PS to EPS, whichever is easier or gives better results.
What you are asking for is, more or less, the exact opposite of what is normally required.
In general people want the EPS Bounding Box to be as tight as possible to the actual marks made by the EPS, because the normal use for an EPS file is to 'embed' it in another document. If you want extra white space you would normally add it around the EPS when you embed it.
Indeed, the EPS specification says that the BoundingBox comment should not include the white space. On page 8 of the EPSF specification:
"For an EPS file, the bounding box is the smallest rectangle that encloses all the marks painted on the single page of the EPS file"
Messing with Ghostscript switches isn't going to do anything helpful for you here: the device explicitly records the marks that are made by the input, and sets the BoundingBox from those.
Perhaps if you were to explain why you want to have an EPS file with incorrect BoundingBox comments it would be possible to make some suggestions, but Ghostscript is doing exactly what it should do here.
[addendum]
(see comment below, this is in reply)
I suspect you need to change your process in some way then. One solution is to have the PDF start by filling the entire page with white. Contrary to many people's expectations that counts as making a mark on the page so the entire page would then be considered as the BoundingBox.
As long as you are using the Ghostscript eps2write device, you could also parse the document for %%BeginPageSetup; the eps2write device still writes the original document size out in this section, e.g.:
%!PS-Adobe-3.0 EPSF-3.0
%%Invocation: path/gswin32c -dDisplayFormat=198788 -dDisplayResolution=96 -sDEVICE=eps2write -sOutputFile=? ?
%%BoundingBox: 101 132 191 256
%%HiResBoundingBox: 101.80 132.80 190.30 255.20
%%Creator: GPL Ghostscript GIT PRERELEASE 951 (eps2write)
....
....
%%EndProlog
%%Page: 1 1
%%BeginPageSetup
4 0 obj
<</Type/Page/MediaBox [0 0 300 300]
/Parent 3 0 R
/Resources<</ProcSet[/PDF]
>>
/Contents 5 0 R
>>
endobj
%%EndPageSetup
You can see here that the original media size was 300x300, even though the BoundingBox correctly reflects the marks made on the page. Note! This is characteristic of EPS files produced by the current version of eps2write, it won't work for EPS files from other sources and may not work with eps2write in the future.
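Parsing that section could look roughly like the following. It is a sketch under the same caveat: it relies on the /MediaBox entry appearing between %%BeginPageSetup and %%EndPageSetup exactly as in the listing above, and the function name is invented for illustration.

```python
import re

# Matches "/MediaBox [llx lly urx ury]" with integer or decimal numbers
MEDIABOX = re.compile(
    r"/MediaBox\s*\[\s*([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)\s*\]"
)


def media_box_from_eps(eps_text):
    """Return (llx, lly, urx, ury) from the page setup section, or None."""
    start = eps_text.find("%%BeginPageSetup")
    if start == -1:
        return None
    end = eps_text.find("%%EndPageSetup", start)
    if end == -1:
        return None
    match = MEDIABOX.search(eps_text[start:end])
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())
```

For the listing above this yields the 300x300 media size, while an EPS from another producer would most likely return None.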
Other than that you're stuck with finding the media size from the input and passing it separately to the program doing the insertion, presumably by putting the data in some other text file to accompany the EPS. Or, of course, manually or programmatically editing the urx,ury co-ordinates of the BoundingBox.
Ghostscript isn't going to do this for you I'm afraid.
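For the last option, editing the BoundingBox programmatically, a small script along these lines might do. It is a sketch only: it assumes the comments appear once each in the header, as in the listing above, and the function name is hypothetical. Bear in mind that the result no longer follows the EPS specification's definition of the bounding box.

```python
import re


def set_bounding_box(eps_text, llx, lly, urx, ury):
    """Overwrite the first %%BoundingBox and %%HiResBoundingBox comments.

    The four co-ordinates are expected to be integers, in points.
    """
    eps_text = re.sub(
        r"^%%BoundingBox:.*$",
        f"%%BoundingBox: {llx:d} {lly:d} {urx:d} {ury:d}",
        eps_text,
        count=1,
        flags=re.MULTILINE,
    )
    eps_text = re.sub(
        r"^%%HiResBoundingBox:.*$",
        f"%%HiResBoundingBox: {llx:.2f} {lly:.2f} {urx:.2f} {ury:.2f}",
        eps_text,
        count=1,
        flags=re.MULTILINE,
    )
    return eps_text
```

Calling set_bounding_box(text, 0, 0, 300, 300) on the example header would replace the tight box with the full 300x300 page.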
Given a directory with several jpg files (photos), I would like to create a single pdf file with one photo per page.
However, I would like the photos to be stored in the pdf file unchanged; i.e., I would like to avoid decoding and recoding.
So ideally I would like to be able to extract the original jpg files (maybe minus the metadata) from the pdf file, using, e.g., a Linux command line tool like pdfimages.
My ideas so far:
imagemagick convert. However, I am confused by the compression options: if I choose 100% quality, does it mean that the jpg is internally decoded and then re-encoded losslessly? (Which is obviously not what I want.)
pdflatex. Some people claim that the graphics package includes images losslessly, while others dispute that. In any case, pdflatex would be slightly more cumbersome (I would first have to find out the dimensions of the photos, then set the page size accordingly, and make sure that there are no margins, headers, etc.).
img2pdf (PyPI page):
Losslessly convert raster images to PDF without re-encoding PNG, JPEG, and
JPEG2000 images. This leads to a lossless conversion of PNG, JPEG and JPEG2000
images with the only added file size coming from the PDF container itself.
Other raster graphics formats are losslessly stored using the same encoding
that PNG uses. Since PDF does not support images with transparency and since
img2pdf aims to never be lossy, input images with an alpha channel are not
supported.
(pdfimages -all does the exact opposite.)
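To run this over a whole directory, the img2pdf command line (input files followed by -o and the output name) can be assembled from a file listing. A rough sketch, with a hypothetical helper name; it assumes the photos should appear in alphabetical order and that only .jpg/.jpeg files are wanted.

```python
def img2pdf_command(filenames, output_pdf):
    """Build an img2pdf call for the JPEG files among filenames, sorted by name."""
    jpegs = sorted(
        name for name in filenames
        if name.lower().endswith((".jpg", ".jpeg"))
    )
    return ["img2pdf", *jpegs, "-o", output_pdf]
```

The resulting list can be handed to subprocess.run(..., check=True); the filenames would typically come from os.listdir or pathlib.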
You could use the following small script which relies on HexaPDF (note: I'm the author of HexaPDF) to do this.
Note: Make sure you have Ruby 2.4 installed, then run gem install hexapdf to install hexapdf.
Here is the script:
require 'hexapdf'
doc = HexaPDF::Document.new
ARGV.each do |image_file|
  # Add the image file to the document and create a new page for it
  image = doc.images.add(image_file)
  page = doc.pages.add
  # Image and page dimensions
  iw = image.info.width.to_f
  ih = image.info.height.to_f
  pw = page.box(:media).width.to_f
  ph = page.box(:media).height.to_f
  # Scale factor that fits the image on the page while keeping its aspect ratio
  rw, rh = pw / iw, ph / ih
  ratio = [rw, rh].min
  iw, ih = iw * ratio, ih * ratio
  # Center the scaled image on the page
  x, y = (pw - iw) / 2, (ph - ih) / 2
  page.canvas.image(image, at: [x, y], width: iw, height: ih)
end
doc.write('images.pdf')
Just supply the images as arguments on the command line; the output file will be named images.pdf. Most of the code deals with centering and scaling the images to nicely fit onto the pages.
Another possibility for storing jpg images into a pdf file in a "lossless" way is provided by PoDoFo:
podofoimg2pdf is able to perform lossless conversion from JPEG to PDF by embedding the jpg file into the pdf container.
podofoimg2pdf
Usage: podofoimg2pdf [output.pdf] [-useimgsize] [image1 image2 image3 ...]
Options:
-useimgsize Use the imagesize as page size, instead of A4
Depending on what you wish to do with the files, on Windows, if the images are simple JPEG/GIF/TIFF/PNG files, you can store them in a CBZ, a ZIP, a folder or a zipped folder and view them with SumatraPDF, which has a Save As PDF option, so everything is done with one exe.
It will fail with files that are viewable but not acceptable as PDF inputs, such as WebP or HEIC, so check the filename extension in the viewer beforehand.
It should be lossless in practically all cases; however, you should round-trip with pdfimages -all and do a file compare between input and output to check that no bytes needed to be converted.
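The file compare after the round trip can be scripted by hashing both sides. A minimal sketch, assuming the extracted files are listed in the same order as the originals (the helper names are invented); note that a tool which strips metadata would make the hashes differ even though the image data is untouched.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_round_trip(originals, extracted):
    """Pair the files in order and report whether each pair is byte-identical."""
    return [
        (str(a), str(b), sha256_of(a) == sha256_of(b))
        for a, b in zip(originals, extracted)
    ]
```

Any pair reported as False deserves a closer look before trusting the workflow.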
I am using GIMP to convert grayscale PNM-files (scanned documents) to PDF.
My goal is a small filesize. (Ideally: viewable on different devices without any problems and maybe suitable long-term preservation - PDF/A?)
So far, so good. Trying to reproduce that process with ImageMagick in a batch script doesn't give me that same small filesize as in GIMP.
GIMP (Ver. 2.8.14) workflow:
Open File
Change resolution (density) to 300x300 Pixel/in
Set threshold to 127 (=50%)
Export as OutGIMP.pdf
ImageMagick (Ver. 6.7.9-0 2012-09-16 Q16) workflow:
convert Scan.pnm -density 300x300 -threshold 50% -monochrome OutA.pdf
convert Scan.pnm -density 300x300 -threshold 50% -monochrome OutB.png
convert OutB.png OutC.pdf
Using an example file, this results in:
OutGIMP.pdf: 141.195 Byte
OutA.pdf: 684.245 Byte
OutB.png: 137.246 Byte
OutC.pdf: 217.860 Byte
How can I get a PDF with ImageMagick that is at least as small as the GIMP-PDF?
Edit
Continuing the GIMP (Ver. 2.8.14) workflow from above with:
Scale to 100x100 Pixel/in while keeping the Imagesize
Export as OutGIMP_100ppi.pdf
strangely results in:
OutGIMP_100ppi.pdf: 179.123 Byte
This question is related to
Script (or some other means) to convert RGB to CMYK in PDF?
but it is far more specific. Please bear in mind that I am not an expert in print production ;)
Situation: For printing I am only allowed to use two colors, Cyan and Black. The print shop requires the final PDF to be in DeviceCMYK with only the channels C and K used.
pdflatex automatically does that (with the xcolor package) for all fonts and drawn objects, however I have more than 100 sketches/figures in PDF format which are embedded in the manuscript. Due to an admittedly badly designed workflow (late realization that Inkscape cannot export CMYK PDFs), all these figures were created in Inkscape, and thus are RGB PDFs.
However, the only used colors within Inkscape were RGB complements of CMY(K), e.g. 100% Cyan is (0,255,255) RGB and 50% K is (127,127,127) etc.
Problem: I need to convert all these PDF figures from RGB to DeviceCMYK (or alternatively the whole PDF of the final manuscript) with a specific conversion formula.
I did a lot of google research and tried the often suggested ways of using e.g. Ghostscript or various print production tools in Adobe Acrobat, however all of the conversion techniques I found so far wanted to use ICC color profiles or used some other conversion strategy which filled the channels MY and spared some C and K, for example.
I know the exact conversion formula for the raw color numbers from our Inkscape-RGBs to the channels C and K, however I do not know or find any program or tool that allows me to manually specify conversion formulas.
Question: Is there any workflow to convert my PDFs from RGB to C(MY)K manually with my own specific conversion formula for the raw numbers with the converted PDF being in DeviceCMYK using a tool, script or Adobe product?
Due to the large number of figures I would prefer a batched solution which doesn't require too much coding on my side, but if it is the only solution, I'd also be open to a workflow like "load/convert/save" within a program for every single figure, or to writing a small program with an easy-to-handle C++ PDF API, for example.
Limitations and additional info: A different file format (like TikZ figures) is not possible any more, since it does not work perfectly and the necessary adaptations to the figures would create too much overhead. One possibly helpful detail: since the figures are created in Inkscape, there are no raster images within the PDFs. I also do not want the figures to be converted to raster images during the color conversion.
Edit:
I have created an example of a RGB PDF-figure created with inkscape.
I also did a manual object-by-object color conversion to a CMYK PDF with Illustrator, to show what the result should look like. Illustrator stores the axial shading in a DeviceN colorspace with the colors cyan and black, which is close enough^^
Here is an idea, I think it will work if your PDF files are using exclusively the colorspaces DeviceGray, DeviceRGB and DeviceCMYK:
1- Convert all your PDF files to Postscript (with pdf2ps from ghostscript for example)
2- Write a PostScript program that redefines the operators setrgbcolor, setgray and setcolor with your own implementation in the PostScript language. Your implementation will internally use setcmykcolor and compute the values using your custom formula.
Here is an example for redefining the setgray operator:
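% The operator setcmykcolor expects 4 values in the stack
% When setgray is called, we can expect to have 1 value in the stack, we will
% use it for the black component of cmyk by adding 3 zeros and rolling the
% top 4 elements of the stack 3 times
% Note: this passes the gray value straight through as K; since gray 0 is
% black and K 1 is black, a real formula would usually use K = 1 - gray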
/setgray { 0 0 0 4 3 roll setcmykcolor } bind def
3- Paste your PostScript program at the beginning of each resulting ps file from step 1.
4- Convert all your files back to PDF (with ps2pdf for example)
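With more than 100 figures, step 3 is worth scripting. The following is only a sketch of the pasting step: the function names are hypothetical, the files are handled as raw bytes so that nothing in the PostScript is re-encoded, and running it twice on the same file would paste the program twice.

```python
from pathlib import Path


def with_prologue(ps_bytes, prologue_bytes):
    """Return the PostScript data with the prologue pasted at the very beginning."""
    if not prologue_bytes.endswith(b"\n"):
        prologue_bytes += b"\n"
    return prologue_bytes + ps_bytes


def patch_files(ps_paths, prologue_path):
    """Rewrite each PostScript file in place with the prologue prepended."""
    prologue = Path(prologue_path).read_bytes()
    for path in map(Path, ps_paths):
        path.write_bytes(with_prologue(path.read_bytes(), prologue))
```

Steps 1 and 4 would still be done with pdf2ps and ps2pdf around this.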
See it in action by saving this piece of code as sample.ps:
/setgray { 0 0 0 4 3 roll setcmykcolor } bind def
0.5 setgray
0 0 moveto
600 600 lineto
stroke
showpage
Convert it to PDF with ghostscript using this command line (I used version 9.14):
gswin64c.exe -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=sample.pdf sample.ps
The resulting PDF will have the following page content:
q 0.1 0 0 0.1 0 0 cm
/R7 gs
10 w
% The K operator is the PDF equivalent of setcmykcolor in postscript
0 0 0 0.5 K
0 0 m
3000 3000 l
S
Q
As you can see, the ps -> pdf conversion preserves the CMYK colors specified in PostScript with the setcmykcolor operator.
Maybe you can post your formula as a new question and someone could help you out translating it to postscript.
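Before translating a formula to PostScript it can help to write it down in an ordinary language and check a few known colors. The actual formula is not given in the question, so the mapping below is purely an assumed stand-in: it takes K from the brightest channel and C from the red deficit, and forces M and Y to zero. It is only meaningful for colors on the cyan/black axis.

```python
def rgb_to_ck(r, g, b):
    """Map RGB components in 0..1 to (c, m, y, k) using only the C and K channels.

    Assumed formula, for illustration only: k = 1 - max(r, g, b),
    c = (1 - r - k) / (1 - k), with m and y always zero.
    """
    k = 1.0 - max(r, g, b)
    if k >= 1.0:
        # Pure black: avoid dividing by zero
        return (0.0, 0.0, 0.0, 1.0)
    c = (1.0 - r - k) / (1.0 - k)
    return (c, 0.0, 0.0, k)
```

With this stand-in, RGB (0, 1, 1) maps to 100% cyan and (0.5, 0.5, 0.5) maps to 50% K, matching the two examples in the question.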
Since you have access to Illustrator, you might want to try importing the PDF into Illustrator and using Illustrator's scripting capabilities to iterate over the elements and replace fill/stroke RGB colors with their CMYK replacement colors.
The difficulty will be with the shading patterns (Gradients) used in the PDF; if they are imported as GradientColor, then in theory it's a matter of digging into the GradientColor to find the base RGB colors and substitute their CMYK replacement.
A very similar problem was solved using the ActivePDF.dll with C++ (or C#??).
I want to shrink PNG or JPG files on OS X. I only want to shrink them without affecting the image quality,
like tinypng.org does.
Is there a recommended library? I only know ImageMagick. Is there a way to do that natively, or is there another library to shrink/compress images without affecting the image quality?
My aim is to shrink the file size, for example:
logo.png >> 476 k before shrink
logo.png >> 50k after shrink
Edit: to be clear, I want to reduce the size of the file, not the image resolution.
TinyPNG.org works by using image quantisation - the similar colours in the image are converted into a HSV or RGB model and then merged depending on the distance.
How does it work?
...
When you upload a PNG (Portable Network Graphics) file, similar colours in your image are combined. This technique is called “quantisation”
...
src: http://tinypng.org
An answer here outlines a method of doing so: https://stackoverflow.com/a/492230/556479.
There are also some answers on this question which describe how you can do so on Mac OS using Objective-C: How do I reduce a bitmap to a known set of RGB colours
See Wikipedia for a more in depth guide: http://en.wikipedia.org/wiki/Color_quantization
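To make the idea concrete, here is the crudest possible form of quantisation: every channel is snapped to the centre of one of a few equal-width buckets, so nearby colours collapse into one. This is uniform quantisation written for illustration only; it is not what TinyPNG or any of the linked answers actually implement, and the function name is made up.

```python
def quantise(pixels, levels=4):
    """Snap each 0..255 channel of each pixel to the centre of one of `levels` buckets."""
    step = 256 / levels

    def snap(value):
        bucket = min(int(value // step), levels - 1)
        return int(bucket * step + step / 2)

    return [tuple(snap(channel) for channel in pixel) for pixel in pixels]
```

Fewer distinct colours means the PNG can use a small palette and compress better, which is where the file-size saving comes from.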
Did you have a problem using ImageMagick? It has a rich set of quantize functions such as
bool MagickQuantizeImage( MagickWand mgck_wnd,
float number_colors,
int colorspace_type,
float treedepth,
bool dither,
bool measure_error )
Here is a very thorough guide to quantization using imageMagick
My suggestion is to use http://pngnq.sourceforge.net. It will give better results than ImageMagick, and for the single example given on http://tinypng.org it also produces a very similar output. It is a tiny C implementation of the method presented in the paper "Kohonen Neural Networks for Optimal Colour Quantization". That alone is much better, since you are no longer relying on closed, unknown implementations.
Original (57 KB), tinypng.org (16 KB), pngnq (17 KB):
Using ImageMagick, the best quantization to 256 colors I can get uses the LAB colorspace and dithering by Floyd-Steinberg:
convert input.png -quantize LAB -dither FloydSteinberg -colors 256 output.png
This produces a 16 KB png, but it contains many more visual artifacts: