I'm trying to use the command line program convert to take a PDF into an image (JPEG or PNG). Here is one of the PDFs that I'm trying to convert.
I want the program to trim off the excess white-space and return a high enough quality image that the superscripts can be read with ease.
This is my current best attempt. As you can see, the trimming works fine, I just need to sharpen up the resolution quite a bit. This is the command I'm using:
convert -trim 24.pdf -resize 500% -quality 100 -sharpen 0x1.0 24-11.jpg
I've tried to make the following conscious decisions:
resize it larger (has no effect on the resolution)
make the quality as high as possible
use the -sharpen (I've tried a range of values)
Any suggestions please on getting the resolution of the image in the final PNG/JPEG higher would be greatly appreciated!
It appears that the following works:
convert \
-verbose \
-density 150 \
-trim \
test.pdf \
-quality 100 \
-flatten \
-sharpen 0x1.0 \
24-18.jpg
It results in the left image. Compare this to the result of my original command (the image on the right):
(To really see and appreciate the differences between the two, right-click on each and select "Open Image in New Tab...".)
Also keep the following facts in mind:
The worse, blurry image on the right has a file size of 1.941.702 Bytes (1.85 MByte).
Its resolution is 3060x3960 pixels, using 16-bit RGB color space.
The better, sharp image on the left has a file size of 337.879 Bytes (330 kByte).
Its resolution is 758x996 pixels, using 8-bit Gray color space.
So, no need to resize; add the -density flag. The density value 150 is weird -- trying a range of values results in a worse looking image in both directions!
Personally I like this.
convert -density 300 -trim test.pdf -quality 100 test.jpg
It's a little over twice the file size, but it looks better to me.
-density 300 sets the dpi that the PDF is rendered at.
-trim removes any edge pixels that are the same color as the corner pixels.
-quality 100 sets the JPEG compression quality to the highest quality.
Things like -sharpen don't work well with text because they undo things your font rendering system did to make it more legible.
If you actually want it blown up, use -resize here and possibly a larger dpi value, something like targetDPI * scalingFactor. That will render the PDF at the resolution/size you intend.
Descriptions of the parameters on imagemagick.org are here
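To make the effect of -density concrete, here is a small Python sketch of the arithmetic. A PDF page is measured in points (1/72 inch), so the rendered pixel size is roughly the page size times density / 72. The US Letter page size (612x792 pt) is an assumption for illustration, not something stated for the PDF in the question, and the function names are made up for this example.

```python
# Sketch: how -density maps a PDF page (measured in points) to pixels.
# Assumption: a US Letter page of 612x792 pt; substitute your own page size.

POINTS_PER_INCH = 72  # PDF user-space units per inch


def rendered_size(page_pt, density):
    """Return (width, height) in pixels for a page rendered at `density` dpi."""
    width_pt, height_pt = page_pt
    return (round(width_pt * density / POINTS_PER_INCH),
            round(height_pt * density / POINTS_PER_INCH))


def density_for_scale(target_dpi, scaling_factor):
    """Density to request so the page is rendered directly at the enlarged size."""
    return target_dpi * scaling_factor


LETTER = (612, 792)
print(rendered_size(LETTER, 72))    # (612, 792)
print(rendered_size(LETTER, 150))   # (1275, 1650)
print(rendered_size(LETTER, 300))   # (2550, 3300)
print(density_for_scale(150, 2))    # 300
```

With the default 72 dpi followed by -resize 500%, such a page would come out at 3060x3960 pixels, which would be consistent with the blurry image described earlier: the page is rasterized small first and only then enlarged.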
I use pdftoppm on the command line to get the initial image, typically with a resolution of 300dpi, so pdftoppm -r 300, then use convert to do the trimming and PNG conversion.
I really haven't had good success with convert [update May 2020: actually: it pretty much never works for me], but I've had EXCELLENT success with pdftoppm. Here's a couple examples of producing high-quality images from a PDF:
[Produces ~25 MB-sized files per pg] Output uncompressed .tif file format at 300 DPI into a folder called "images", with files being named pg-1.tif, pg-2.tif, pg-3.tif, etc:
mkdir -p images && pdftoppm -tiff -r 300 mypdf.pdf images/pg
[Produces ~1MB-sized files per pg] Output in .jpg format at 300 DPI:
mkdir -p images && pdftoppm -jpeg -r 300 mypdf.pdf images/pg
[Produces ~2MB-sized files per pg] Output in .jpg format at highest quality (least compression) and still at 300 DPI:
mkdir -p images && pdftoppm -jpeg -jpegopt quality=100 -r 300 mypdf.pdf images/pg
For more explanations, options, and examples, see my full answer here:
https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844.
Related:
[How to turn a PDF into a searchable PDF w/pdf2searchablepdf] https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
Cross-linked:
How to convert a PDF into JPG with command line in Linux?
https://unix.stackexchange.com/questions/11835/pdf-to-jpg-without-quality-loss-gscan2pdf/585574#585574
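For batch use, the pdftoppm invocations above can be wrapped in a few lines of Python. This is only a sketch: the helper names pdftoppm_command and render are made up for this example, and it uses only the flags already shown above (-jpeg, -tiff, -r, -jpegopt quality=N) plus -png.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf, out_prefix, dpi=300, fmt="jpeg", jpeg_quality=None):
    """Build a pdftoppm argument list mirroring the commands above."""
    if fmt not in ("jpeg", "png", "tiff"):
        raise ValueError("fmt must be 'jpeg', 'png' or 'tiff'")
    cmd = ["pdftoppm", "-" + fmt, "-r", str(dpi)]
    # -jpegopt only makes sense for JPEG output.
    if fmt == "jpeg" and jpeg_quality is not None:
        cmd += ["-jpegopt", "quality=%d" % jpeg_quality]
    cmd += [str(pdf), str(out_prefix)]
    return cmd


def render(pdf, out_dir="images", **kwargs):
    """Run pdftoppm if it is installed; return the command that was executed."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)  # same as mkdir -p
    cmd = pdftoppm_command(pdf, Path(out_dir) / "pg", **kwargs)
    subprocess.run(cmd, check=True)
    return cmd


print(" ".join(pdftoppm_command("mypdf.pdf", "images/pg", jpeg_quality=100)))
```

Building the argument list separately from running it keeps the command easy to inspect or log before any file is written.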
Normally I extract the embedded image with 'pdfimages' at its native resolution, then use ImageMagick's convert to produce the needed format:
$ pdfimages -list fileName.pdf
$ pdfimages fileName.pdf fileName # save in .ppm format
$ convert fileName-000.ppm fileName-000.png
This generates the best and smallest result file.
Note: For lossy JPG embedded images, you have to use -j:
$ pdfimages -j fileName.pdf fileName # save in .jpg format
With a recent "poppler-utils" (0.50+, 2016) you can use -all, which saves lossy images as JPG and lossless ones as PNG, so a simple:
$ pdfimages -all fileName.pdf fileName
always extracts the content from the PDF at the best possible quality.
On Windows, which is less well served, you have to download a recent (0.68, 2018) 'poppler-utils' binary from:
http://blog.alivate.com.au/poppler-windows/
In ImageMagick, you can do "supersampling". You specify a large density and then resize down as much as desired for the final output size. For example with your image:
convert -density 600 test.pdf -background white -flatten -resize 25% test.png
Download the image to view it at full resolution for comparison.
I do not recommend saving to JPG if you are expecting to do further processing.
If you want the output to be the same size as the input, then resize to the inverse of the ratio of your density to 72. For example, -density 288 and -resize 25%. 288=4*72 and 25%=1/4
The larger the density the better the resulting quality, but it will take longer to process.
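The density/resize relationship described above can be written down as a short Python sketch. The function names are hypothetical; the only fact relied on is that ImageMagick's default density is 72 dpi.

```python
# Sketch of the supersampling arithmetic; 72 dpi is ImageMagick's default density.

BASE_DPI = 72


def supersample(factor, base_dpi=BASE_DPI):
    """Return (density, resize_percent) that keep the output at base_dpi size."""
    return base_dpi * factor, 100.0 / factor


def effective_dpi(density, resize_percent):
    """Resolution of the final image after rendering and then resizing."""
    return density * resize_percent / 100.0


density, percent = supersample(4)
print("-density %d -resize %g%%" % (density, percent))  # -density 288 -resize 25%
print(effective_dpi(600, 25))                            # 150.0
```

So the -density 600 -resize 25% command shown above ends up at the pixel size of a 150 dpi render, but each output pixel is averaged from a 4x4 block of the 600 dpi raster.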
I have found it both faster and more stable when batch-processing large PDFs into PNGs and JPGs to use the underlying gs (aka Ghostscript) command that convert uses.
You can see the command in the output of convert -verbose and there are a few more tweaks possible there (YMMV) that are difficult / impossible to access directly via convert.
However, it would be harder to do your trimming and sharpening using gs, so, as I said, YMMV!
It also gives you good results:
exec("convert -geometry 1600x1600 -density 200x200 -quality 100 test.pdf test_image.jpg");
Linux user here: I tried the convert command-line utility (for PDF to PNG) and I was not happy with the results. I found this to be easier, with a better result:
extract the pdf page(s) with pdftk
e.g.: pdftk file.pdf cat 3 output page3.pdf
open (import) that pdf with GIMP
important: change the import Resolution from 100 to 300 or 600 pixel/in
in GIMP export as PNG (change file extension to .png)
Edit:
Added picture, as requested in the Comments. Convert command used:
convert -density 300 -trim struct2vec.pdf -quality 100 struct2vec.png
GIMP : imported at 300 dpi (px/in); exported as PNG compression level 3.
I have not used GIMP on the command line (re: my comment, below).
For Windows (tested on W11):
magick.exe -verbose -density 150 "input.pdf" -quality 100 -sharpen 0x1.0 output.jpg
You need to install:
ImageMagick https://imagemagick.org/index.php
ghostscript
https://www.ghostscript.com/releases/gsdnld.html
Additional info:
Watch out when using the -flatten parameter, since it can produce only the first page as an image
Use the -scene 1 parameter to start the image names at index 1
The convert command mentioned in the question has been deprecated in favor of magick
One more suggestion is that you can use GIMP.
Just load the PDF file in GIMP->save as .xcf and then you can do whatever you want to the image.
I have used pdf2image, a simple Python library that works like a charm.
First install poppler on a non-Linux machine. You can just download the zip, unzip it in Program Files and add bin to the machine Path.
After that you can use pdf2image in Python like this:
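from pdf2image import convert_from_path
inputfile = "input.pdf"  # placeholder: path to the source PDF
outputpath = "out"       # placeholder: existing folder for the generated images
images_from_path = convert_from_path(
    inputfile,
    output_folder=outputpath,
    grayscale=True, fmt='jpeg')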
I am not good with Python but was able to make an exe of it.
Later you can call the exe with file input and output parameters. I have used it from C# and things are working fine.
Image quality is good. OCR works fine.
Edited:
Here is another finding of mine: you don't need to install Poppler for the conversion.
Just build your converter.exe from Python and place it in the bin folder of Poppler for Windows.
I suppose it will work on Azure as well.
The PNG file you attached looks really blurred. If you need additional post-processing for each image you generate as a PDF preview, you will decrease the performance of your solution.
2JPEG can convert the PDF file you attached to a nice sharp JPG and crop the empty margins in one call:
2jpeg.exe -src "C:\In\*.*" -dst "C:\Out" -oper Crop method:autocrop
I use icepdf an open source java pdf engine. Check the office demo.
package image2pdf;
import org.icepdf.core.exceptions.PDFException;
import org.icepdf.core.exceptions.PDFSecurityException;
import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.Page;
import org.icepdf.core.util.GraphicsRenderingHints;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
public class pdf2image {
public static void main(String[] args) {
Document document = new Document();
try {
document.setFile("C:\\Users\\Dell\\Desktop\\test.pdf");
} catch (PDFException ex) {
System.out.println("Error parsing PDF document " + ex);
} catch (PDFSecurityException ex) {
System.out.println("Error encryption not supported " + ex);
} catch (FileNotFoundException ex) {
System.out.println("Error file not found " + ex);
} catch (IOException ex) {
System.out.println("Error IOException " + ex);
}
// save page captures to file.
float scale = 1.0f;
float rotation = 0f;
// Paint each pages content to an image and
// write the image to file
for (int i = 0; i < document.getNumberOfPages(); i++) {
try {
BufferedImage image = (BufferedImage) document.getPageImage(
i, GraphicsRenderingHints.PRINT, Page.BOUNDARY_CROPBOX, rotation, scale);
RenderedImage rendImage = image;
try {
System.out.println(" capturing page " + i);
File file = new File("C:\\Users\\Dell\\Desktop\\test_imageCapture1_" + i + ".png");
ImageIO.write(rendImage, "png", file);
} catch (IOException e) {
e.printStackTrace();
}
image.flush();
}catch(Exception e){
e.printStackTrace();
}
}
// clean up resources
document.dispose();
}
}
I've also tried imagemagick and pdftoppm; both pdftoppm and icepdf give a higher resolution than imagemagick.
Please take note before down voting, this solution is for Gimp using a graphical interface, and not for ImageMagick using a command line, but it worked perfectly fine for me as an alternative, and that is why I found it needful to share here.
Follow these simple steps to extract images in any format from PDF documents
Download GIMP Image Manipulation Program
Open the Program after installation
Open the PDF document that you want to extract images from
Select only the pages of the PDF document that you would want to extract images from.
N/B: If you need only the cover images, select only the first page.
Click open after selecting the pages that you want to extract images from
Click on the File menu in GIMP when the pages open
Select Export as in the File menu
Select your preferred file type by extension (say png) below the dialog box that pops up.
Click on Export to export your image to your desired location.
You can then check your file explorer for the exported image.
That's all.
I hope this helps
get Image from Pdf in iOS Swift Best solution
import PDFKit
import UIKit
func imageFromPdf(pdfUrl: URL, atIndex index: Int, closure: @escaping ((UIImage) -> Void)) {
autoreleasepool {
// Instantiate a `PDFDocument` from the PDF file's URL.
guard let document = PDFDocument(url: pdfUrl) else { return }
// Get the page at the requested index.
guard let page = document.page(at: index) else { return }
// Fetch the page rect for the page we want to render.
let pageRect = page.bounds(for: .mediaBox)
let renderer = UIGraphicsImageRenderer(size: pageRect.size)
let img = renderer.image { ctx in
// Set and fill the background color.
UIColor.white.set()
ctx.fill(CGRect(x: 0, y: 0, width: pageRect.width, height: pageRect.height))
// Translate the context so that we only draw the `cropRect`.
ctx.cgContext.translateBy(x: -pageRect.origin.x, y: pageRect.size.height - pageRect.origin.y)
// Flip the context vertically because the Core Graphics coordinate system starts from the bottom.
ctx.cgContext.scaleBy(x: 1.0, y: -1.0)
// Draw the PDF page.
page.draw(with: .mediaBox, to: ctx.cgContext)
}
closure(img)
}
}
//Usage
let pdfUrl = URL(fileURLWithPath: "PDF URL")
self.imageFromPdf(pdfUrl: pdfUrl, atIndex: 0) { imageIS in
}
Many answers here concentrate on using magick (or its dependency GhostScript), as set by the OP's question, with a few suggesting Gimp as an alternative, without describing why some settings may work best for different cases.
Taking the OP's "sample", the requirement is a crisp, trimmed image that is as small as possible yet retains good readability. Here the result is 96 dpi in 58 KB (a very small increase on the 54 KB vector source), yet it remains a good image even when zoomed in. Compare that with 72 dpi (226 KB) in the accepted answer's image above.
The key point is that any image processor can be scripted to batch run from the command line using a profile as input, so here IrfanView (with or without GS) is set to auto crop the pdf page(s) and output at a default 96 dpi to PNG using only 4 bits per pixel, i.e. 16 shades of grey.
The size could be further reduced by dropping the resolution to 72, but 96 is an optimal setting for PNG screen display.
Use this commandline:
convert -geometry 3600x3600 -density 300x300 -quality 100 TEAM\ 4.pdf team4.png
This should correctly convert the file as you've asked for.
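The following Python script was written for the system Python on a Mac (Snow Leopard and upward; note that macOS 12.3 and later no longer ship /usr/bin/python, so a separately installed Python with PyObjC is needed there). It can be used on the command line with successive PDF files as arguments, or you can put it into a Run Shell Script action in Automator, and make a Service (Quick Action in Mojave).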
You can set the resolution of the output image in the script.
The script and a Quick Action can be downloaded from github.
#!/usr/bin/python
# coding: utf-8
import os, sys
import Quartz as Quartz
from LaunchServices import (kUTTypeJPEG, kUTTypeTIFF, kUTTypePNG, kCFAllocatorDefault)

resolution = 300.0 #dpi
scale = resolution/72.0

cs = Quartz.CGColorSpaceCreateWithName(Quartz.kCGColorSpaceSRGB)
whiteColor = Quartz.CGColorCreate(cs, (1, 1, 1, 1))
# Options: kCGImageAlphaNoneSkipLast (no trans), kCGImageAlphaPremultipliedLast
transparency = Quartz.kCGImageAlphaNoneSkipLast

#Save image to file
def writeImage (image, url, type, options):
    destination = Quartz.CGImageDestinationCreateWithURL(url, type, 1, None)
    Quartz.CGImageDestinationAddImage(destination, image, options)
    Quartz.CGImageDestinationFinalize(destination)
    return

# Return a path that does not exist yet by appending a counter if needed
def getFilename(filepath):
    i=0
    newName = filepath
    while os.path.exists(newName):
        i += 1
        newName = filepath + " %02d"%i
    return newName

if __name__ == '__main__':
    for filename in sys.argv[1:]:
        pdf = Quartz.CGPDFDocumentCreateWithProvider(Quartz.CGDataProviderCreateWithFilename(filename))
        numPages = Quartz.CGPDFDocumentGetNumberOfPages(pdf)
        shortName = os.path.splitext(filename)[0]
        prefix = os.path.splitext(os.path.basename(filename))[0]
        folderName = getFilename(shortName)
        try:
            os.mkdir(folderName)
        except OSError:
            print("Can't create directory '%s'"%(folderName))
            sys.exit()
        # For each page, create a file
        for i in range (1, numPages+1):
            page = Quartz.CGPDFDocumentGetPage(pdf, i)
            if page:
                #Get mediabox
                mediaBox = Quartz.CGPDFPageGetBoxRect(page, Quartz.kCGPDFMediaBox)
                x = Quartz.CGRectGetWidth(mediaBox)
                y = Quartz.CGRectGetHeight(mediaBox)
                x *= scale
                y *= scale
                r = Quartz.CGRectMake(0,0,x, y)
                # Create a Bitmap Context, draw a white background and add the PDF
                writeContext = Quartz.CGBitmapContextCreate(None, int(x), int(y), 8, 0, cs, transparency)
                Quartz.CGContextSaveGState (writeContext)
                Quartz.CGContextScaleCTM(writeContext, scale,scale)
                Quartz.CGContextSetFillColorWithColor(writeContext, whiteColor)
                Quartz.CGContextFillRect(writeContext, r)
                Quartz.CGContextDrawPDFPage(writeContext, page)
                Quartz.CGContextRestoreGState(writeContext)
                # Convert to an "Image"
                image = Quartz.CGBitmapContextCreateImage(writeContext)
                # Create unique filename per page
                outFile = folderName +"/" + prefix + " %03d.png"%i
                url = Quartz.CFURLCreateFromFileSystemRepresentation(kCFAllocatorDefault, outFile, len(outFile), False)
                # kUTTypeJPEG, kUTTypeTIFF, kUTTypePNG
                type = kUTTypePNG
                # See the full range of image properties on Apple's developer pages.
                options = {
                    Quartz.kCGImagePropertyDPIHeight: resolution,
                    Quartz.kCGImagePropertyDPIWidth: resolution
                }
                writeImage (image, url, type, options)
                del page
You can do it in LibreOffice Draw (which is usually preinstalled in Ubuntu):
Open PDF file in LibreOffice Draw.
Scroll to the page you need.
Make sure text/image elements are placed correctly. If not, you can adjust/edit them on the page.
Top menu: File > Export...
Select the image format you need in the bottom-right menu. I recommend PNG.
Name your file and click Save.
Options window will appear, so you can adjust resolution and size.
Click OK, and you are done.
convert -density 300 * airbnb.pdf
Looked perfect to me
It's actually pretty easy to do with Preview on a mac. All you have to do is open the file in Preview and save-as (or export) a png or jpeg but make sure that you use at least 300 dpi at the bottom of the window to get a high quality image.
This works for creating a single file from multiple PDFs and image files:
php exec('convert -density 300 -trim "/path/to/input_filename_1.png" "/path/to/input_filename_2.pdf" "/path/to/input_filename_3.png" -quality 100 "/path/to/output_filename_0.pdf"');
WHERE:
-density 300 = dpi
-trim = removes edge pixels that are the same color as the corner pixels (crops the surrounding border)
-quality 100 = quality vs compression (100 % quality)
-flatten ... for multi page, do not use "flatten"
Related
I am using PDFBox 2.0.8 to replace image in my application. I am able to extract the image and replace the same with another image of same dimension. However, there is no decrease in the size of PDF if there is decrease in the size of image. For example refer the documents/images in the below links. Original size of PDF is 93 KB. Extracted image is 91 KB. Replaced image is 54 KB. PDF size after image replacement is still 92 KB....
Original Document = http://35.200.192.44/download?fileName=/outbox/pdf/10_cert.pdf
Extracted Image = http://35.200.192.44/download?fileName=/outbox/pdf/image0.jpg
Replacement image = http://35.200.192.44/download?fileName=/outbox/pdf/image1.jpg
PDF after replacement = http://35.200.192.44/download?fileName=/outbox/pdf/10_cert1.pdf.
The change in size of PDF after replacement is not in the same proportion... Code snippet used for image replacement is
BufferedImage buffered_replacement_image_file = ImageIO.read(new File(replacement_image_file));
PDImageXObject replacement_img = JPEGFactory.createFromImage(doc, buffered_replacement_image_file);
resources.put(xObjectName, replacement_img);
The images in your two example PDFs are identical. This most likely is due to the way you load the image data, first creating a BufferedImage from the file and then creating a PDImageXObject from that BufferedImage. This causes the input image data to be expanded to a plain bitmap and then re-compressed to JPEG identically by JPEGFactory.createFromImage.
To use the JPEG data as they initially are, try this approach instead:
PDImageXObject replacement_img = JPEGFactory.createFromStream(doc, new FileInputStream(replacement_image_file));
resources.put(xObjectName, replacement_img);
or, if the replacement_image_file is not necessarily a JPEG file, like this
PDImageXObject replacement_img = PDImageXObject.createFromFileByExtension(new File(replacement_image_file), doc);
resources.put(xObjectName, replacement_img);
If this doesn't help, you most likely have other issues in your code, too, and need to show more of it.
Given a directory with several jpg files (photos), I would
like to create a single pdf file with one photo per page.
However, I would like the photos to be stored in the pdf file unchanged; i.e., I would like to avoid decoding and recoding.
So ideally I would like to be able to extract the original jpg files (maybe minus the metadata) from the pdf file, using, e.g., a linux command line tool like pdfimages.
My ideas so far:
imagemagick convert. However, I am confused by the compression options: If I choose 100% quality, does it mean that the jpg is internally decoded, and then encoded lossless? (Which is obviously not what I want?)
pdflatex. Some people claim that the graphics package includes images lossless, while others dispute that. In any case, pdflatex would be slightly more cumbersome (I would first have to find out the dimensions of the photos, then set the page size accordingly, make sure that there are no margins, headers etc etc).
img2pdf (PyPI page):
Losslessly convert raster images to PDF without re-encoding PNG, JPEG, and
JPEG2000 images. This leads to a lossless conversion of PNG, JPEG and JPEG2000
images with the only added file size coming from the PDF container itself.
Other raster graphics formats are losslessly stored using the same encoding
that PNG uses. Since PDF does not support images with transparency and since
img2pdf aims to never be lossy, input images with an alpha channel are not
supported.
(pdfimages -all does the exact opposite.)
You could use the following small script which relies on HexaPDF (note: I'm the author of HexaPDF) to do this.
Note: Make sure you have Ruby 2.4 installed, then run gem install hexapdf to install hexapdf.
Here is the script:
require 'hexapdf'
doc = HexaPDF::Document.new
ARGV.each do |image_file|
image = doc.images.add(image_file)
page = doc.pages.add
iw = image.info.width.to_f
ih = image.info.height.to_f
pw = page.box(:media).width.to_f
ph = page.box(:media).height.to_f
rw, rh = pw / iw, ph / ih
ratio = [rw, rh].min
iw, ih = iw * ratio, ih * ratio
x, y = (pw - iw) / 2, (ph - ih) / 2
page.canvas.image(image, at: [x, y], width: iw, height: ih)
end
doc.write('images.pdf')
Just supply the images as arguments on the command line, the output file will be named images.pdf. Most of the code deals with centering and scaling the images to nicely fit onto the pages.
Another possibility for storing jpg images into a pdf file in a "lossless" way is provided by PoDoFo:
podofoimg2pdf is able to perform lossless conversion from JPEG to PDF by embedding the jpg file into the pdf container.
podofoimg2pdf
Usage: podofoimg2pdf [output.pdf] [-useimgsize] [image1 image2 image3 ...]
Options:
-useimgsize Use the imagesize as page size, instead of A4
Depending on what you wish to do with the files, on windows, if the images are simpler jpeg/gif/tif/png you can store in a cbz, zip, folder or zipped folder and view with SumatraPDF which has the SaveAs PDF option thus all done with one exe.
It will fail with files that are viewable but not acceptable as PDF inputs such as webp or heic, so check in the viewer what the filename extension is before.
It should in practically all cases be lossless; however, you should roundtrip with pdfimages -all and do a file compare between input and output to check that no bytes needed to be converted.
I have a program that generates multiple SVG files in batch, which I then need to be able to combine (tiled) into one file, with a set whitespace and set width in cm (or mm).
I need either an existing script or a pointer to which libraries and languages I can use to accomplish this. Any suggestions where to start?
Here are some tools which can help you to create a SVG sprite sheet from your svg files:
SVG STACK
SVG UTILS
And then you can clean up your svg when all done with a tool like
SCOUR
Yes, as @victor-henriquez noted, you can use montage, but it’s a bit tricky. I got into it by activating the -verbose output, seeing that it created an inkscape command, and analysing that, which solved this issue for me.
montage -version
# Version: ImageMagick 7.0.7-31 Q16 x86_64 20180506
I wanted …
… to label desktop icons: use -label and -pointsize (tricky to get font size correct via pointsize but depending on density)
… to increase -density (it’s tricky to find a suitable number for the output)
… to stack and tile them orderly: use -tile 15x30 (here 15 columns x 30 rows)
… to add a margin on each sub-image: use -geometry '+40+0' (adds 40px horizontally but 0px vertically)
The resulting command was (add -verbose to get detailed processing information):
montage -label '%f' -pointsize 2 -density 300 *.svg \
-tile 15x30 \
-geometry '+40+0' \
./papirus-icons-mimetypes.png
If you additionally specify the desired output pixel size in the geometry, e.g. 96 pixels by 96 pixels with -geometry '96x96+40+0', it becomes even harder to understand what role -density plays. I failed to figure it out in depth ;-)
I used Victor gem https://github.com/DannyBen/victor
first_svg = File.open("first.svg").read
second_svg = File.open("second.svg").read
first_content = first_svg.split("\n")[1..-2].join(", ")
second_content = second_svg.split("\n")[1..-2].join(", ")
svg = Victor::SVG.new width: "100%", height: "100%"
svg << first_content
svg << second_content
svg.save 'final.svg'
You can have a look at montage, from ImageMagick: http://www.imagemagick.org/Usage/montage/
You can build your script around it.
I am using GS to do conversion from PDF to JPEG and following is the command that I use:
gs -sDEVICE=jpeg -dNOPAUSE -dBATCH -g500x300 -dPDFFitPage -sOutputFile=image.jpg image.pdf
In this command, as you can see, -g500x300 sets the converted image size (Width x Height).
Is there a way to just set the Width without having to input the Height so it will base on the width to scale the height using its original aspect ratio? I know it can be achieved by using ImageMagick convert where you simply put 0 on the height parameter i.e. -resize 500x0. I tried with GhostScript but I don't think that is the correct way to do it.
I decided not to use ImageMagick convert because it is very slow when converting a large multi-page PDF.
Thanks for the help!
This post explains why ghostscript is faster - https://serverfault.com/questions/167573/fast-pdf-to-jpg-conversion-on-linux-wanted, and the only workaround to fix it would involve modifying the imagemagick code.
Unfortunately, autodetermined output size is not supported by ghostscript. This is primarily because the -g option used is actually determining the device size that will hold the rendered output, and not the rendered output itself. That output size is changing because of the -dPDFFitPage switch which then tries to match the device size. And although you can define just the height of the jpeg 'device' using -dDEVICEHEIGHT=n, that will leave the device width at the unchanged default.
Although a somewhat tedious workaround, you can use ghostscript or imagemagick to get the width and height of the pdf page(s). To do this using ghostscript, see the answer to Using GhostScript to get page size. You can then calculate the proper width to set the -g flag to hold the aspect ratio. Bonus points if you can figure out a single set of commands to do all this :)
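The calculation step of that workaround is small enough to sketch in Python. This assumes the page width and height in points have already been obtained (for example with pdfinfo or the Ghostscript approach mentioned above); the function name gs_geometry and the example page sizes are made up for illustration.

```python
# Sketch: derive the -g value for a fixed width from the page's aspect ratio.
# Assumption: the page size in points is already known.


def gs_geometry(page_width_pt, page_height_pt, target_width_px):
    """Return a '-gWxH' option whose height preserves the page aspect ratio."""
    height_px = round(target_width_px * page_height_pt / page_width_pt)
    return "-g%dx%d" % (target_width_px, height_px)


# Hypothetical US Letter page (612x792 pt) scaled to 500 px wide.
print(gs_geometry(612, 792, 500))  # -g500x647
```

The resulting option would replace -g500x300 in the command from the question, still combined with -dPDFFitPage so the page is scaled to fill the device.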
You could write a PostScript program to do this readily enough. Here is a start:
%!
% usage: gs -sFile=____.pdf scale.ps
/File where not {
(\n *** Missing source file. \(use -sFile=____.pdf\)\n) =
Usage
} {
pop
}ifelse
% Get the width and height of a PDF page
%
/GetSize {
pdfgetpage currentpagedevice
1 index get_any_box
exch pop dup 2 get exch 3 get
/PDFHeight exch def
/PDFWidth exch def
} def
%
% The main loop
% For every page in the original PDF file
%
1 1 PDFPageCount
{
/PDFPage exch def
PDFPage GetSize
% In here, knowing the desired destination size
% calculate a scale factor for the PDF to fit
% either the width or height, or whatever
% The width and height are stored in PDFWidth and PDFHeight
PDFPage pdfgetpage
pdfshowpage
} for
pdfgetpage and pdfshowpage are internal Ghostscript extensions to the PostScript language for handling PDF files.
To resize an image with Ghostscript, use -dDownScaleFactor, e.g.:
gs -dBATCH -dNOPAUSE -r300 -dDownScaleFactor=3 -sDEVICE=png16m -sOutputFile=/tmp/26a0e9f7-3f26-437d-9a97-1653074e819a_%d.png /tmp/temp.pdf
On its own, -r300 would produce a huge image.
Scaling down by 3 reduces the size while maintaining the aspect ratio.
You can use this if setting an exact width is not important, which covers most use cases.
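As a rough guide to what this produces, the following Python sketch estimates the effective resolution and output size. The 612x792 pt page is an assumed example, the function name is hypothetical, and the exact rounding Ghostscript applies internally may differ by a pixel.

```python
# Sketch: approximate result of combining -r with -dDownScaleFactor.
# Assumption: a 612x792 pt page; Ghostscript's own rounding may differ slightly.


def downscaled(page_pt, resolution, factor):
    """Return (effective_dpi, (width_px, height_px)) after downscaling."""
    effective = resolution / factor
    width_pt, height_pt = page_pt
    size = (round(width_pt * effective / 72), round(height_pt * effective / 72))
    return effective, size


print(downscaled((612, 792), 300, 3))  # (100.0, (850, 1100))
```

In other words, -r300 with -dDownScaleFactor=3 gives roughly the pixel size of a 100 dpi render, computed from a 300 dpi raster.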
Is there any easy (scriptable) way to convert a PDF with vector images into a PDF with raster images? In other words, I want to generate a PDF with the exact same (un-rasterized) text but with each vector image replaced with a rasterized version.
I occasionally read PDFs of technical articles on my Kindle, and have found that reading a PDF directly is frustrating. Thankfully, Amazon's automatic conversion of PDFs to the Kindle format does a good job of reflowing the text portions of most of PDFs I have tried. However, while raster images seem to make it through the conversion process fine, vector images get horribly mangled. It would be great if I could easily convert a PDF so that all of its vector images were rasterized.
I am interested in any possible solutions, but a Linux- or Windows-based one would be preferable.
I had a similar issue, and solved it using ImageMagick's convert tool (http://www.imagemagick.org/script/index.php). It comes with Linux and runs fine on Windows/Cygwin or OS X
convert -density 300 largeVectorFileFromR.pdf out.pdf
With -density 300 you control resolution (as DPI).
Downside: Text is rasterized as well, I understand that Michael does not want this.
After some days searching for a solution, based on "Remove all text from PDF file" and "How to add a picture onto an existing pdf file?" I found an (ugly) scriptable solution:
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $INPUT_FILE && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfwrite -dFILTERTEXT $INPUT_FILE && \
convert -density $DPI -quality 100 /tmp/graphics.pdf /tmp/graphics.png && \
convert -density $DPI -quality 100 /tmp/graphics.png /tmp/graphics.pdf && \
pdftk /tmp/graphics.pdf stamp /tmp/onlytxt.pdf output $OUTPUT_FILE && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf /tmp/graphics.png
where we have three variables: INPUT_FILE, OUTPUT_FILE, and DPI. We split the textual and graphical contents via Ghostscript, convert the graphical content to a raster image (PNG) and join the two using pdftk.
I've been using this successfully to convert huge vector images for use in scientific papers.
Pitstop Pro v2 update 3 from Enfocus can do exactly that. It has an action called "Rasterize page content, keeping text" which works pretty well. It is a plugin to Adobe Acrobat so it requires a little more but is also available as a server solution.
It's a little complicated, but you asked for any possible solution. Furthermore this solution is not automatable.
1) Open the pdf with the vector images in Inkscape. Then select the whole image with the select tool (F1)
2) If the vector image consists of more than one svg graphic, press Ctrl + G (Object --> Group)
3) cut the grouped svg image Ctrl + x
4) open a new InkScape Window Ctrl + n and paste the image Ctrl + v
5) choose File --> export Bitmap (Shift + Ctrl + e), maybe you want to increase the dpi
6) go back to the first InkScape window, File --> import (Ctrl + i) and choose the previously exported bitmap
7) place the bitmap to the location where the svg image was
Save the pdf and the vector image is replaced by a bitmap image.
Here's one way to solve your problem:
Step 1: Use an online PDF-to-HTML converter, like the one here:
http://www.idrsolutions.com/online-pdf-to-html5-converter/
This tool converts the PDF into a set of images and a text overlay. The vector images should be converted to raster at this point.
Step 2: Convert the HTML+images back into PDF:
http://pdfcrowd.com/#convert_by_upload+with_options
The resulting PDF will have all the vector images rasterized, and all text will remain text, so you can select, copy, etc.
Convert the pdf to djvu with https://jwilk.net/software/pdf2djvu converter. Uncheck "antialias fonts,vectors..". It will reduce file size significantly and improve document load times.
I used the following:
gswin32c -o "%2" -dFirstPage=1 -dLastPage=1 -sDEVICE=pngalpha -r72x72 -dUseCropBox -dFitPage "%1" -dBATCH -dNOPAUSE
where %1 is the input file and %2 is the output. This can be used with LaTeX, the generated PNG has the same ratio and page size as the original PDF so the relative position of the image will not change.
Note that in Linux, you may need to use gs rather than gswin32c.
You can also set the page range and then print the pages back to PDF. The downside is that the text gets rasterized as well.
Inkscape is the best solution. I quickly made this rather unoptimized batch file that does exactly that; you can play with it and change options. ImageMagick convert, gs, or pdftoimages don't work as well as inkscape: they either don't export the layers or export them with bad quality:
#!/bin/bash
#set -xev
ORIGINAL_FOLDER=`pwd`
JPEGS=`mktemp -d`
unzip "$1" -d "$JPEGS"
cd "$JPEGS"
# expand the pdf into single-page pdfs
pdftk combined_to_do.pdf burst output pg_%04d.pdf
#1) print the pdfs to pngs as they are seen with alpha, layers, transparency etc; this cannot be done by ImageMagick convert or pdftoimages
ls ./pg*.pdf | xargs -L1 -I {} inkscape {} -z --export-dpi=300 --export-area-drawing --export-png={}.png
#2) Second change to jpgs
rm *.pdf
ls ./p*.png | xargs -L1 -I {} convert {} -quality 100 -density 300 {}.jpg
#3) This to make a pdf file out of every jpg image without loss of either resolution or quality:
ls -1 ./*jpg | xargs -L1 -I {} img2pdf {} -o {}.pdf
#4) This to concatenate the pdfpages into one:
pdftk *.jpg.pdf cat output combined.pdf
#5) And last I add an OCRed text layer that doesn't change the quality of the scan in the pdfs so they can be searchable:
pypdfocr combined.pdf
cp "$JPEGS/combined_ocr.pdf" "$ORIGINAL_FOLDER/$1_ocr.pdf"
cp "$JPEGS/combined.pdf" "$ORIGINAL_FOLDER/$1.pdf"
Based on Civ Lins solution, I came up with this:
#!/usr/bin/env sh
gs -o /tmp/onlytxt.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE $1 && \
gs -o /tmp/graphics.pdf -sDEVICE=pdfimage24 -dFILTERTEXT -r600 -dDownScaleFactor=6 $1 && \
pdftk /tmp/graphics.pdf multistamp /tmp/onlytxt.pdf output $2 && \
rm /tmp/onlytxt.pdf /tmp/graphics.pdf
(In contrast to the previous solution, it handles multipage PDFs and uses gs to directly render the rasterized image without the detour of convert.)