imagemagick - generated PDFs are blurry/hazy - pdf

I have an image of a website which I'm trying to convert to PDF. I have the image in several formats: PSD, PNG, JPG, TIFF, all saved losslessly.
I'm using the following command to convert the image to PDF:
convert -density 93 foo.jpg bar.pdf
Here is part of the original image:
And here is the same part, after converting to PDF:
As you can see, the second one is ever so slightly hazy. What's causing this, and how can I eliminate it? I've seen PDFs with crisp graphics, so I know it's possible.

If you are seeing the same results with multiple input types. The fuzziness is likely being caused by the anti-aliasing feature of your PDF Viewer. If using Acrobat, you can turn off image anti-aliasing by doing the following:
Go to Edit-->Preferences-->Page Display
Untick the option "Smooth Images" and hit "OK".
The crisp graphics you are seeing on other PDFs are likely due to the fact that they are vectorized graphics. Imagemagick is creating a PDF and embedding your image inside of it which may be subject to compression.
Also:
When using jpeg as input, add the "-quality 100" to your Imagemagick call to retain the highest quality possible.
Use a higher value for the "-density" parameter (I would recommend at least 150) to generate a higher resolution PDF.

Related

Pdf real cropping

I need to crop a pdf document using the linux shell and then extract the text just in that cropped pdf.
My idea was to crop a pdf using pdfcrop linux tool and then use a txt2pdf text extractor tool to extract the text just in the cropped area, but i've realized that i'm thinking on images, and when i try to do this the result is the same than doing it over the original, not cropped, pdf.
I guess it's a layer problem. As the pdf format works with layers, if i don't "crop" all the layers, the result is gonna include all the information from all the layers, which i don't want.
I would appreciate so much if someone has any idea of how i could do a real "all layers cropping" in a pdf. If its possible or if i should start thinking on another solution.
TY
Its not layers, its the fact that cropping a PDF usually involves simply setting the CropBox, which doesn't alter the actual contents of the PDF (other than the CropBox) at all. Most text extraction code will ignore the CropBox and extract all the text....
You could, with some effort, use Ghostscript to produce a genuinely cropped PDF (though note that partially cropped glyphs will still be included) and then extract the text from that. But that's pretty ugly.
Alternatively Ghostscript and MuPDF can both extract text with co-ordinate information, which may be enough for your needs.

Import vector graphics from PDF to GIMP

I need to extract vector graphics from a PDF image and import them into GIMP, either as paths or as high-resolution raster images. Specifically, I need to get contour lines from USGS topographical maps and overlay them on satellite images. Any suggestions?
So far I have tried:
--Using GIMP's native PDF importing function to import them as raster images. Problem: To do so at high resolution crashes my computer. Possible solution would be to import only a selected area of a PDF, but as far as I can tell this is not possible.
--Using ImageMagick to convert the PDF to a raster image. Problem: Used with the "-scale" parameter, "convert" appears to rasterize the PDF and then upscale it, leading to a choppy image.
--Using InkScape to extract the necessary vector elements from the PDF. Problem: InkScape freezes when I try to open a moderately large (25 Mb) PDF.
Any other ideas?
Many thanks,
treacl
The option you didn't mention above is to try to use the ghostscript program directly to render your output - ghostscript is used internally by GIMP to import PDF files, so you likely have it installed already.
There are tens of command switches to pass ghostscript for it to render a file into another format - the switches you need to pass are for determining the output size, resolution and which page to print. I didn't find any switch to select a portion of the page to be rendered - so, if your document is a single page, it is possible the generated file will still be to big for GIMP - but you will likely be able to crop it with ImageMagick, at least.
I guess the relevant command line for you would be something along:
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -sOutputFile=page.png -dFirstPage=<pagenumber> -dLastPage=<pagenumber> -r<dpiresolution> -f<filename.pdf>
If the resulting image is still too large to be generated or operated upon, you can try changing the output format to use a smaller color depth (this one is 3 bytes per pixel: png16m) . It should be possible to pass postscript commands to transform the device, so that the area of interest is scaled up to your page size (and the remaining parts are cropped out of the rendering) - that would be the definitive fix for you - but of the top of my head, I don't know how to do that with ghostscript.
Alternatively, you can try passing ImageMagick the -density parameter as suggested in the comments.

Quality degradation of a text pdf after pdf>png>pdf

I have a very specific requirement where i must automatically stamp every page of a PDF file (for a faxing application), so here's the process i've made:
step 1: Convert PDF to PNG, one png file per page
cmd1: gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -r400 -sOutputFile=image_raw.png input.pdf
cmd2: mogrify -resize 31.245% image_raw.png
input.pdf (input): https://www.dropbox.com/s/p2ajqxe99nc0h8m/input.pdf
image_raw.png (output): https://www.dropbox.com/s/4cni4w7mqnmr0t7/image_raw.png
step 2: Stamp every PNG file (using a third party tool ..)
image_stamped.png (output): https://www.dropbox.com/s/3ryiu1m9ndmqik6/image_stamped.png
step 3: Reconvert PNG files into one PDF file
cmd: convert -resize 1240x1753 -units PixelsPerInch -density 150x150 image_stamped.png output.pdf
output.pdf (output): https://www.dropbox.com/s/o9y0jp9b4pm08ci/output.pdf
The output file of the third step shal be "theoretically" the same as the input file in step 1 (plus the stamp on it) but it's not, the file is somehow blurry and it turns to be unreadeable for humans after faxing it since blurred pixels wouldnt pass through fax wires even if you may see no difference between input.pdf and output.pdf, try zooming in and you'll find that text characters are blurred on its edges.
What is the best parameters to play with at input (step 1) or output (step 3) ?
Thanks !
You are using anti-aliasing (TextAlphaBits=4). This 'smooths' the edges of text by introducing grey pixels between the black pixels of the text edges. At low resolutions (such as displays) this prevents the 'jaggies' in text and gives a more readable result. At higher resolutions its value is highly debatable.
Fax is a 1-bit monochrome medium, so the grayscale values have to be recreated by dithering. As you have discovered, this is not a good idea in a limited resolution device as it leads to a loss of sharpness.
I believe that if you remove the -dTextAlphaBits=4 you will see an immediate improvement. I would also suggest that you remove the GraphicsAlphaBits as well, since this will have the same effect on linework.
If you believe that you still want anti-aliasing you could try reducing the aggressiveness, you currnetly have it set to 4, try reducing it to 2.
Regarding the other comments;
Kurt is quite correct, as is fourat, and I'm afraid MarcB is mistaken, the -r400 sets the resolution for rendering, in dots per inch. If only one number is given it is used for both x and y resolution. It is possible to produce a fixed size raster using Ghostscript, but you use the -dFIXEDMEDIA with -sPAPERSIZE switches or the -g switch which also sets FIXEDMEDIA automatically.
While I do agree with yms and Kurt that converting the PDF to a bitmap format (PNG) and then back to PDF will result in a loss of quality, if the final PDF is only used for transmission via fax, it doesn't matter. The PDF must be rendered to a fax-resolution bitmap at some point in the process, its not a big problem if its done before the stamp is applied.
I don't agree with BitBank here, converting a vector representation to bitmap means rasterising it at a particular resolution. Once this is done, the resulting image cannot be rescaled without loss of quality, whereas the original vector representation can be as it is simply rendered again at a different resolution. Image in PDF refers to a bitmap, you can't have a vector bitmap. The image posted by yms clearly shows the effect of rendering a vector representation into an image.
One last caveat. I'm not familiar with the other tools being used here, but two of the command lines at least imply 'resize'. If you 'resize' a bitmap then the chances are that the tool will introduce the same kinds of artefacts (anti-aliasing) that you are having a problem with. Onceyou have created the bitmap you should not alter it at all. Its important that you create the PNG at the correct size in the first place.
And finally.....
I just checked your original PDF file and I see that the content of the page is already an image. Not only that its a DCT (JPEG) image. JPEG is a really poor choice of format for a monochrome image. Its a lossy compression format and always introduces artefacts into the image. If you open your original PDF file in Acrobat (or similar viewer) and zoom in, you can see that there are faint 'halos' around the text, you will also see that the text is already blurry.
You then render the image, quite probably at a different resolution to the original image resolution, and at the same time introduce more blurring by setting -dGraphicsAlphaBits. You then make further changes to the image data which I can't comment on. In the end you render the image again, to a monochrome bitmap. The dithering required to represent the grey pixels leads to your text being unreadable.
Here are some ways to improve this:
1) Don't convert text into images like this, it instantly leads to a quality loss.
2) Don't compress monochrome images using JPEG
3) If you are going to work with images, don't keep converting them back and forth, work with the original until you are done, then make a PDF file from that, if you really must.
4) If you really insist on doing all this, don't compound the problem by using more anti-aliasing. Remove the -dGraphicsAlphaBits from the command line. You might as well remove -dTextAlphaBits as well since your files contain no text. Please read the documentation before using switches and understand what it is you are doing.
You should really think about your workflow here. Obviously we don't know what you are doing or why, so there may well be good reasons why some things are not possible, but you should try and avoid manipulating images like this. Because these are not vector, every time you make a change to the image data you are potentially losing information which cannot be recovered at a later stage. By making many such transformations (and your workflow as depicted seems to perform as many as 5 transformations from the 'original' image data) you will unavoidably lose quality.
If possible retain everything as vector data. When it is unavoidable to move to image data, create the image data as you need it to be finally used, do not transform it further.
I've had a closer look at the files you provided, see here:
So, already the first image (image_raw), the result of the mogrify resize command, is fairly blurry at 1062x1375. While the blurriness does not get worse in the second image (image_stamped) which is the result of the third-party tool, the third image (extracted from your output.pdf), i.e. the result of that convert command, is even more blurred which is due to the graphic being resized (which is something you explicitly tell it to do).
I don't know at which resolution your fax program works, but there is more quality loss still, at least due to 24 bit colors to black-and-white transformation.
If you insist on the work flow (i.e. pdf->png->stamped png->pdf->fax) you should
in the initial rasterization already use the per-inch resolution your rastered image will have in all following steps (including fax transmission),
refrain from anti-aliasing and use of alpha bits (cf. KenS' answer), and
restrict the rasterized image to the colorspace available to the fax transmission, i.e. most likely black-and-white.
PS As KenS pointed out, already the original PDF is merely a container for an image (with some blur to start with). Therefore, an alternative way to improve your workflow is to extract that image instead of rendering it, to stamp that original image and only resize it (again without anti-aliasing) when faxing.

Prevent Ghostscript from rasterizing text?

I'm trying to convert PDFs to PCL (using ghostscript, but I'd love to hear alternative suggestions), and every driver (ghostscript device), including all of the built-ins and gutenprint generate PCL files many times larger than the input PDF. (This is the problem - I need my PCL to be about as small as the input).
Given that the text doesn't show up in the PCL file, I guess that Ghostscript is rasterizing the text. Is there a way to prevent GS generally, or just gutenprint, from doing that? I'd rather either have it embed the fonts, or not even embed the fonts (leave it to the printer to render the fonts)?
Unfortunately, there doesn't seem to be any documentation on this point.
There are 3 (I think) types of font in PCL. There are rendered bitmaps, TrueType fonts (in later versions) and the HPGL stick font.
PDF and PostScript Have type 1, 2 (CFF), 3 and 42 (TrueType, but not the same as PCL) and CIDFonts based on any of the preceding types.
The only font type the two have in common is TrueType, so in order to retain text, any font which was not TrueType would have top be converted into TrueType. This is not a simple task. So Ghostscript simply renders the text, which is guaranteed to work.
PDF is, in general, a much richer format than PCL< there are many PDF constructs (fonts, shading, stroke/fill in a single operation, transparency) which cannot be represented in PCL. So its entirely possible that the increase in size is nothing to do with text and fonts.
In fact, I believe that the PXL drivers in Ghostscript simply render the entire page to a bitmap at the required resolution, and then wrap that up with enough PCL to be successfully sent to a printer. (I could be mistaken on this point though)
Basically, you are not going to get PCL of a similar size to your PDF out of Ghostscript.
Here is a way to 'prevent Ghostscript from rasterizing text'. But its output will be PostScript. You may however succeed convert this PostScript to a PCL5e in an additional step.
The method will convert all glyphs into outline shapes for its PostScript output, and it does not work for its PDF or PCL output. The key here is the -dNOCACHE parameter:
gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf
Of course, converting font glyphs to outlines will take more space than keeping the original fonts embedded, because "fonts" are a space-optimized concept to store, retrieve and render glyph shapes.
Once you have this PostScript, you may be able to convert it to PCL5e with the help of either of the methods you tried before for PDF input (including {Apache?} FOP).
However, I have no idea if the output will be much smaller than versions with rasterized fonts (or even wholesome rasterized pages). But it may be worth a test.
Now vote down this answer too...
Update
Apparently, from version 9.15 (to be released during September/October 2014), Ghostscript will support a new command line parameter:
-dNoOutputFonts
which will cause the output devices pdfwrite, ps2write and eps2write to "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".
That means that the above command should be replaced by this:
gs -o somepdf.ps -dNoOutputFonts -sDEVICE=ps2write somepdf.pdf
Caveats: I've tested this with a few input files using a self-compiled Ghostscript based on current Git sources. It worked flawlessly in each case.

ImageMagick PDF to JPEG conversion results in green square where image should be

I'm attempting to convert a PDF to a JPEG using ImageMagick.
The PDF:
baby_aRCWTU.pdf
The command:
convert -density 260 -profile 'SWOP.icc' -profile 'sRGB.icm' 'baby_aRCWTU.pdf' 'baby_aRCWTU.jpg'
The resulting JPEG:
baby_aRCWTU.jpg
As you can see, the text is rendered nicely, but the embedded image shows up as a green square. Any ideas? This occurs with and without the colour profiles.
edit: reposted due to broken links
On a site we convert hundreds of PDF's on a daily basis where we need to create JPGs and we found it only reliable to convert the PDF's to postscript first.
We use the "pdftops" command, try
pdftops baby_aRCWTU.pdf baby_aRCWTU.ps
then your convert command above, but on the ps. Works for me, the image is then included.