Saving "heavy" figure to PDF in MATLAB - rendering problem - pdf

I generate a figure in MATLAB with a large number of elements (100,000+) and want to save it to a PDF file. With the zbuffer or painters renderer I get a very large file that opens slowly (over 4 MB); all points are stored in vector format. Using the OpenGL renderer rasterizes the figure in the PDF, which is fine for the plot but not good for the text labels. The file size is then about 150 KB.
Try this simplified code, for example:
x=linspace(1,10,100000);
y=sin(x)+randn(size(x));
plot(x,y,'.')
set(gcf,'Renderer','zbuffer')
print -dpdf -r300 testpdf_zb
set(gcf,'Renderer','painters')
print -dpdf -r300 testpdf_pa
set(gcf,'Renderer','opengl')
print -dpdf -r300 testpdf_op
The actual figure is much more complex with several axes and different types of plots.
Is there a way to rasterize the figure, but keep text labels as vectors?
Another problem with OpenGL is that it does not work in terminal mode (-nosplash -nodesktop -nodisplay) under Mac OS X; it looks like OpenGL is not supported there. I have to use terminal mode for automation. The MATLAB version I run is R2007b, on Mac OS X Server 10.4.

This is a funny one. Your problem is not Matlab, it's Ghostscript (Matlab creates PDFs by calling Ghostscript, at least on Windows). When I run
x=linspace(1,10,100000);
y=sin(x)+randn(size(x));
plot(x,y,'.')
print -dpsc2 test.ps
I got a 2 MB PS file (all vector, of course), which compressed to a 164 KB ZIP. One would expect more or less the same result when converting PS to PDF, but ps2pdf test.ps produced your 4 MB file!
Since you are on a Mac, you probably have Distiller. I'd give it a try — generate PS files as above, and then run them through Distiller; you should get a 150K vector PDF.
If you insist on rasterizing, I can suggest printing the figure without any axes or labels to a tiff, opening the tiff, and recreating axes and labels on top of it.
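A minimal sketch of that idea, based on the simplified example above (untested here; file names, resolution and the image placement are illustrative and will need tuning for a real figure):
% Rasterize only the data, keep the labels as vector text
fig1 = figure('Renderer', 'opengl');
plot(x, y, '.');
axis off                                        % no axes or labels in the raster
print(fig1, '-dtiff', '-r300', 'data_raster');  % writes data_raster.tif

fig2 = figure('Renderer', 'painters');
img = imread('data_raster.tif');
image([min(x) max(x)], [min(y) max(y)], flipud(img));
set(gca, 'YDir', 'normal');                     % put y back the right way up
xlabel('x'); ylabel('y');                       % these remain vector text in the PDF
print(fig2, '-dpdf', 'testpdf_mixed');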

If you don't want to go with a 2D histogram (i.e. an image where pixel brightness corresponds to density of points) as BlessedKey suggests, it looks like the only good way is to do the rasterizing yourself, as mentioned by AB.
getframe followed by frame2im seems to be the way to go for that. Unfortunately, getframe returns empty if you run with -nodisplay. Therefore, you'd have to save the figure as a .fig file and, on another computer, run a script that opens the figure, gets the content of the axes with getframe, displays the image from getframe, and then saves to PDF.
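A rough sketch of that script, assuming the saved figure is called myfigure.fig (the file name is just an example):
open('myfigure.fig');          % figure saved earlier on the -nodisplay machine
frame = getframe(gca);         % grab the rendered axes content as a raster
img = frame2im(frame);         % convert the captured frame to an image array
figure;
image(img); axis off           % show the raster; re-add labels/annotations here if needed
print -dpdf rasterized_figure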
By the way, as an alternative to simple plotting or a 2D histogram, you may want to look into scattercloud, which combines plotting the points with density information.

If at all possible you should try to subsample your problem before building the illustration. If you are plotting points on a curve then 10,000 is probably more than you need. A modern printer is only about 600 DPI, after all.
If the points are illustrating a cloud with some density properties, a better solution may be to build a two dimensional histogram first, and illustrate that with imshow or imagesc.
If multiple clouds are being illustrated with different colors, you may be interested in building one such image for each cloud and then combining them with transparency.
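A minimal sketch of the 2D-histogram idea mentioned above, using the x and y from the question (the bin counts are arbitrary, and toolbox functions such as hist3 are deliberately avoided):
nx = 200; ny = 100;                                % number of bins in x and y
xi = linspace(min(x), max(x), nx);
yi = linspace(min(y), max(y), ny);
ix = min(nx, max(1, round(interp1(xi, 1:nx, x)))); % x bin index of each point
iy = min(ny, max(1, round(interp1(yi, 1:ny, y)))); % y bin index of each point
counts = accumarray([iy(:) ix(:)], 1, [ny nx]);    % 2D histogram of point density
imagesc(xi, yi, counts); axis xy; colorbar
print -dpdf density_plot                           % one small image plus vector axes and labels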

Related

White gradient artifacts left over after converting an SVG file to PDF

I have an SVG file of a bar plot that I need to convert to a PDF. The bar plot was made in matplotlib, saved as a PDF and imported into Inkscape. I used Inkscape to add annotations to the figure and then exported it back to a PDF to be used in a final document.
This is what the PDF file looks like going into Inkscape
After adding text elsewhere on the figure and saving as a PDF I get the same plot with these white lines:
These are not your typical PDF rendering artifacts; closer inspection shows that they have a gradient to them.
I think this is somehow a product of the SVG file. I have used an online SVG-to-PDF converter and the lines are still present. Additionally, I use this method (matplotlib to Inkscape to PDF) to make all my figures, and I have not had this issue with any other figure.
I've found that Inkscape does this when you import a bar graph whose shading type does not match any of the preset Inkscape patterns. I've seen this exact issue when importing graphs from the R programming language and from Excel, so I don't think it's specific to matplotlib. I don't know the root cause; however, since I experience this problem a lot, I'll share the workaround options I typically employ. None is necessarily better than the others; which one I use depends on the situation.
Option 1) Convert the PDF to a .png bitmap image in some other program (GIMP, Photoshop, PowerPoint...), then embed the image in Inkscape. Make your changes, then export from Inkscape as a PDF. This has the disadvantage that the graph will no longer be a vector graphic. Use option 2 or 3 to keep it as vectors.
Option 2) Import the PDF into Inkscape, ungroup the PDF object, delete the striped fill in the bar graph, then recreate the fill using an Inkscape-made fill. In the worst cases I've actually made custom bar-graph patterns in Inkscape to exactly match the pattern I had before. This process is a pain.
Option 3) Create shapes that cover the artifacts, remove the border lines from the shapes, and use the eyedropper to make them exactly the same color as the good parts.
As I said, these workarounds don't come from an academic understanding of the problem; they just avoid it. But I hope they help you accomplish your task.

Blurry results on EPS/PDF import [duplicate]

I want to create a visualization of a matrix for some academic work. I decided to go about this by having the pixels in the image correspond to the values in the matrix. I created the nice small png that follows:
When properly scaled up, you get a very reasonable image:
This is a screenshot from within Inkscape. However, when I export this as a PDF, both Evince and Chrome do a terrible job at upscaling what should be very trivial, and instead I get something that looks like:
The PDF itself seems to scale well enough for printing, but unfortunately I do a lot of my editing without printing, and this looks unacceptable. I did find this incredibly old thread about people having a similar issue with Chrome's PDF viewer, and the "solution" was to just upscale the raster graphics. That works, but it is terribly inefficient.
Is anyone aware of a way to change the pdf so that it gets upscaled appropriately? Maybe a config change in evince or chrome that will render these properly? Even a nice way to go from a raster image to a vector image might be suitable?
The comments aggregated into an answer...
An image dictionary in a PDF has an (optional) boolean entry Interpolate. It is specified as a flag indicating whether image interpolation shall be performed by a conforming reader.
The program used by the OP to create the PDF, Inkscape, seems to have explicitly set this flag to true. Editing the PDF to unset this flag creates a file which looks as desired by the OP.
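One hand-edit that can achieve this, assuming the qpdf tools are available (they are an assumption here, not something the answer relies on; any editor that exposes the image dictionary would do):
qpdf --qdf --object-streams=disable figure.pdf figure_qdf.pdf
# open figure_qdf.pdf in a text editor, find "/Interpolate true" in the image
# dictionary and change it to "/Interpolate false", then repair the offsets:
fix-qdf figure_qdf.pdf > figure_fixed.pdf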
(This is also a solution proposed in the Inkscape forum thread eventually found by the OP, which is to save the PDF with high-resolution bitmaps embedded: File -> Inkscape Preferences -> Bitmaps -> Resolution for Create Bitmap Copy, set to 6000 dpi.)
The fact that interpolation looks different in different viewers and on different output media is by design. The PDF specification states on interpolation:
A conforming Reader may choose to not implement this feature of PDF, or may use any specific implementation of interpolation that it wishes.
A different way to get around this problem (especially as some PDF viewers have the tendency to not really live up to the specification and e.g. interpolate ignoring that flag) would be to use vector graphics here, drawing the bitmap pixels as rectangles. The result should be optimal.
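To illustrate the rectangles-instead-of-pixels idea, here is a small sketch in MATLAB (the OP's toolchain is not MATLAB and the matrix here is made up; any vector-capable plotting tool works the same way):
M = rand(8, 12) > 0.5;                        % the matrix to visualize
figure('Renderer', 'painters'); hold on
[rows, cols] = size(M);
for r = 1:rows
    for c = 1:cols
        shade = [1 1 1] * double(~M(r, c));   % black for 1, white for 0
        rectangle('Position', [c-1, rows-r, 1, 1], ...
                  'FaceColor', shade, 'EdgeColor', 'none');
    end
end
axis equal off
print -dpdf matrix_as_vectors                 % every cell is a vector rectangle, so it scales cleanly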

Import vector graphics from PDF to GIMP

I need to extract vector graphics from a PDF image and import them into GIMP, either as paths or as high-resolution raster images. Specifically, I need to get contour lines from USGS topographical maps and overlay them on satellite images. Any suggestions?
So far I have tried:
--Using GIMP's native PDF import to bring them in as raster images. Problem: doing so at high resolution crashes my computer. A possible solution would be to import only a selected area of the PDF, but as far as I can tell this is not possible.
--Using ImageMagick to convert the PDF to a raster image. Problem: Used with the "-scale" parameter, "convert" appears to rasterize the PDF and then upscale it, leading to a choppy image.
--Using Inkscape to extract the necessary vector elements from the PDF. Problem: Inkscape freezes when I try to open a moderately large (25 MB) PDF.
Any other ideas?
Many thanks,
treacl
The option you didn't mention above is to try to use the ghostscript program directly to render your output - ghostscript is used internally by GIMP to import PDF files, so you likely have it installed already.
There are tens of command switches to pass Ghostscript to make it render a file into another format; the switches you need are the ones that determine the output size, the resolution and which page to print. I didn't find any switch to select a portion of the page to be rendered, so if your document is a single page, the generated file may still be too big for GIMP, but you will likely be able to crop it with ImageMagick, at least.
I guess the relevant command line for you would be something along:
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -sOutputFile=page.png -dFirstPage=<pagenumber> -dLastPage=<pagenumber> -r<dpiresolution> -f<filename.pdf>
If the resulting image is still too large to be generated or operated upon, you can try changing the output format to use a smaller color depth (png16m is 3 bytes per pixel). It should be possible to pass PostScript commands to transform the device so that the area of interest is scaled up to your page size (and the remaining parts are cropped out of the rendering); that would be the definitive fix for you, but off the top of my head I don't know how to do that with Ghostscript.
Alternatively, you can try passing ImageMagick the -density parameter as suggested in the comments.
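For example, a rough sketch combining both suggestions (the file names, page number, resolution and crop geometry are placeholders, not values from your maps):
gs -dNOPAUSE -dBATCH -sDEVICE=png16m -sOutputFile=map.png -dFirstPage=1 -dLastPage=1 -r600 -f topo_map.pdf
convert map.png -crop 4000x4000+2000+3000 +repage map_area.png
# or let ImageMagick drive the rasterization at a given density instead of -scale:
convert -density 600 'topo_map.pdf[0]' -crop 4000x4000+2000+3000 +repage map_area.png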

Quality degradation of a text pdf after pdf>png>pdf

I have a very specific requirement where I must automatically stamp every page of a PDF file (for a faxing application), so here's the process I've set up:
step 1: Convert PDF to PNG, one png file per page
cmd1: gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -r400 -sOutputFile=image_raw.png input.pdf
cmd2: mogrify -resize 31.245% image_raw.png
input.pdf (input): https://www.dropbox.com/s/p2ajqxe99nc0h8m/input.pdf
image_raw.png (output): https://www.dropbox.com/s/4cni4w7mqnmr0t7/image_raw.png
step 2: Stamp every PNG file (using a third party tool ..)
image_stamped.png (output): https://www.dropbox.com/s/3ryiu1m9ndmqik6/image_stamped.png
step 3: Reconvert PNG files into one PDF file
cmd: convert -resize 1240x1753 -units PixelsPerInch -density 150x150 image_stamped.png output.pdf
output.pdf (output): https://www.dropbox.com/s/o9y0jp9b4pm08ci/output.pdf
The output file of the third step should theoretically be the same as the input file of step 1 (plus the stamp on it), but it's not: the file is somehow blurry, and it becomes unreadable for humans after faxing, since the blurred pixels don't survive fax transmission. Even if you see no difference between input.pdf and output.pdf at first, zoom in and you'll find that the text characters are blurred at their edges.
What are the best parameters to play with at input (step 1) or output (step 3)?
Thanks !
You are using anti-aliasing (TextAlphaBits=4). This 'smooths' the edges of text by introducing grey pixels between the black pixels of the text edges. At low resolutions (such as displays) this prevents the 'jaggies' in text and gives a more readable result. At higher resolutions its value is highly debatable.
Fax is a 1-bit monochrome medium, so the grayscale values have to be recreated by dithering. As you have discovered, this is not a good idea in a limited resolution device as it leads to a loss of sharpness.
I believe that if you remove the -dTextAlphaBits=4 you will see an immediate improvement. I would also suggest that you remove the GraphicsAlphaBits as well, since this will have the same effect on linework.
If you believe that you still want anti-aliasing, you could try reducing its aggressiveness: you currently have it set to 4, so try reducing it to 2.
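For instance, the first command might then look something like this (a sketch: it is simply your step 1 command with the alpha-bits switches removed):
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r400 -sOutputFile=image_raw.png input.pdf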
Regarding the other comments:
Kurt is quite correct, as is fourat, and I'm afraid MarcB is mistaken: the -r400 sets the resolution for rendering, in dots per inch. If only one number is given, it is used for both the x and y resolution. It is possible to produce a fixed-size raster using Ghostscript, but for that you use the -dFIXEDMEDIA and -sPAPERSIZE switches, or the -g switch, which also sets FIXEDMEDIA automatically.
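A sketch of the fixed-size variant (the pixel dimensions are roughly A4 at 600 dpi; -dPDFFitPage scales the page to the requested size):
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -g4960x7016 -dPDFFitPage -sOutputFile=image_raw.png input.pdf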
While I do agree with yms and Kurt that converting the PDF to a bitmap format (PNG) and then back to PDF will result in a loss of quality, if the final PDF is only used for transmission via fax, it doesn't matter. The PDF must be rendered to a fax-resolution bitmap at some point in the process; it's not a big problem if it's done before the stamp is applied.
I don't agree with BitBank here: converting a vector representation to a bitmap means rasterising it at a particular resolution. Once this is done, the resulting image cannot be rescaled without loss of quality, whereas the original vector representation can be, as it is simply rendered again at a different resolution. 'Image' in PDF refers to a bitmap; you can't have a vector bitmap. The image posted by yms clearly shows the effect of rendering a vector representation into an image.
One last caveat. I'm not familiar with the other tools being used here, but at least two of the command lines imply a 'resize'. If you resize a bitmap, the chances are that the tool will introduce the same kinds of artefacts (anti-aliasing) that you are having a problem with. Once you have created the bitmap you should not alter it at all. It's important that you create the PNG at the correct size in the first place.
And finally.....
I just checked your original PDF file and I see that the content of the page is already an image. Not only that, it's a DCT (JPEG) image. JPEG is a really poor choice of format for a monochrome image: it's a lossy compression format and always introduces artefacts into the image. If you open your original PDF file in Acrobat (or a similar viewer) and zoom in, you can see that there are faint 'halos' around the text; you will also see that the text is already blurry.
You then render the image, quite probably at a different resolution to the original image resolution, and at the same time introduce more blurring by setting -dGraphicsAlphaBits. You then make further changes to the image data which I can't comment on. In the end you render the image again, to a monochrome bitmap. The dithering required to represent the grey pixels leads to your text being unreadable.
Here are some ways to improve this:
1) Don't convert text into images like this, it instantly leads to a quality loss.
2) Don't compress monochrome images using JPEG
3) If you are going to work with images, don't keep converting them back and forth, work with the original until you are done, then make a PDF file from that, if you really must.
4) If you really insist on doing all this, don't compound the problem by using more anti-aliasing. Remove the -dGraphicsAlphaBits from the command line. You might as well remove -dTextAlphaBits as well since your files contain no text. Please read the documentation before using switches and understand what it is you are doing.
You should really think about your workflow here. Obviously we don't know what you are doing or why, so there may well be good reasons why some things are not possible, but you should try and avoid manipulating images like this. Because these are not vector, every time you make a change to the image data you are potentially losing information which cannot be recovered at a later stage. By making many such transformations (and your workflow as depicted seems to perform as many as 5 transformations from the 'original' image data) you will unavoidably lose quality.
If possible retain everything as vector data. When it is unavoidable to move to image data, create the image data as you need it to be finally used, do not transform it further.
I've had a closer look at the files you provided.
So, already the first image (image_raw), the result of the mogrify resize command, is fairly blurry at 1062x1375. While the blurriness does not get worse in the second image (image_stamped), which is the result of the third-party tool, the third image (extracted from your output.pdf), i.e. the result of that convert command, is even more blurred, which is due to the graphic being resized (something you explicitly tell it to do).
I don't know at which resolution your fax program works, but there is still more quality loss to come, at least from the conversion of 24-bit color to black and white.
If you insist on this workflow (i.e. pdf -> png -> stamped png -> pdf -> fax) you should (a command sketch follows this list):
in the initial rasterization already use the per-inch resolution your rastered image will have in all following steps (including fax transmission),
refrain from anti-aliasing and use of alpha bits (cf. KenS' answer), and
restrict the rasterized image to the colorspace available to the fax transmission, i.e. most likely black-and-white.
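A rough sketch of what step 1 might then look like (pngmono is Ghostscript's 1-bit monochrome PNG device; 204x196 dpi is a common 'fine' fax resolution, so adjust it to whatever your fax software actually expects):
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pngmono -r204x196 -sOutputFile=image_raw.png input.pdf
# stamp image_raw.png, then wrap it in a PDF without any further resizing:
convert -units PixelsPerInch -density 204x196 image_stamped.png output.pdf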
PS: As KenS pointed out, the original PDF is already merely a container for an image (with some blur to start with). Therefore, an alternative way to improve your workflow is to extract that image instead of rendering it, stamp that original image, and only resize it (again without anti-aliasing) when faxing.
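If poppler's pdfimages tool is available (an assumption; the answer above does not name a tool), the extraction step could look like this:
pdfimages -j input.pdf page    # writes the embedded DCT image(s) as page-000.jpg, page-001.jpg, ...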

PDF "Canonicalization"

I am writing a library to generate PDF reports using prawn reports.
One of the features I wish to add to my gem is the ability to provide a means of testing the generation of reports.
The problem is that two visually equal PDFs can have different files.
Is there a way to make sure that two visually equal PDFs have the same bits in the file? Something like XML canonicalization.
'Visual equality' (or 'visual similarity', where only a small percentage of pixels differs on each page) of two different PDFs can occur even if the internal structure of the PDF objects is very different. (Think of a page of 'text', which may use real fonts or which may use 'outline' vector graphics for each glyph's shape...)
That means this equality can only be determined by rendering the two files at the same resolution to page images and then comparing both image sets pixel by pixel. The result of the comparison could be another pixel image that shows all differing pixels as red, or, at your preference, just the number of pixels which do not agree.
I've described a scriptable way to do this, with the help of Ghostscript, pdftk and ImageMagick, in this answer:
How to unit test a Python function that draws PDF graphics?
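In outline, the approach is roughly the following (a sketch with made-up file names; see the linked answer for the full script):
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r150 -sOutputFile=a-%03d.png a.pdf
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r150 -sOutputFile=b-%03d.png b.pdf
compare -metric AE a-001.png b-001.png diff-001.png   # prints the number of differing pixels and
                                                      # writes an image highlighting the differences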
Alternatively, you may have a look at diffpdf (which is available for Linux, Unix, Mac OS X and Windows): it can also compare two PDF files visually.
[ Your literal question was this: "Is there a way to make sure that 2 visually equal PDF have the same bits in the file?" -- However, I'm not sure if you really meant it that way -- hence my above answer. Otherwise I'd have to say: If two PDF files are visually equal, just generate their respective MD5sum to determine if they have the same bits in each file... ]