ImageMagick: generate PDF with special page numbering

I generate a PDF from a set of PNG files using this command:
convert -- $(ls -v -- src/*.png) out/book.pdf
Some of the files have names like -03.png, and these need smaller page numbers than the rest. But I get a PDF in which -01 gets page number 1, -02 gets number 2, and so on, while 01 only starts at page number 6.
The PDF is a scanned book with front matter, such as a table of contents, that isn't included in the page numbering. I remember having seen PDFs with special page numbers like vii before the normal Arabic numbers start.
I've tried using -scene -5 to add an offset to the page numbers, but this didn't change the result.
So what should I do instead to make page 01.png have page number 1, etc., while the files before it get some other numbers (negative or Roman, anything) and appear at the beginning of the document?

First, you want to sort the files numerically, taking the optional minus sign into account, which the command you show won't do. Second, what you are describing are PageLabels for PDF pages, which you can add using Ghostscript and the pdfmark operator.
Try this command:
ls src/*.png | \
sort -t / -k 2 -n | \
convert @- pdf:- | \
gs \
-sDEVICE=pdfwrite \
-o out/book.pdf \
-c '[{Catalog}<</PageLabels<</Nums[0<</P(-3)>>1<</P(-2)>>2<</P(-1)>>3<</S/D>>]>>>>/PUT pdfmark' \
-f -
This is for 3 pages labelled -3, -2 and -1, followed by any number of pages labelled 1, 2, 3, etc. Modify it according to your needs.
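Since you mention page labels like vii: the PageLabels dictionary also supports an /S (style) key, where /r means lowercase Roman numerals and /D means decimal digits, and each range restarts counting at 1 unless /St says otherwise. A minimal sketch (assuming five front-matter pages; the input file name book-unlabelled.pdf is hypothetical):
gs \
-sDEVICE=pdfwrite \
-o out/book.pdf \
-c '[{Catalog}<</PageLabels<</Nums[0<</S/r>>5<</S/D>>]>>>>/PUT pdfmark' \
-f out/book-unlabelled.pdf
This labels the first five pages i, ii, iii, iv, v and numbers the rest 1, 2, 3, etc.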

Related

Reordering large PDF files using command line tools

I'm working with PDF files that have hundreds of forms within them. Each form is 2 pages long, so in most files pages 1-2 is the first form, pages 3-4 is the second form, and so on.
However, there are several PDF files where the page order of the forms are reversed. In these cases, page 1 is the second page of the first form and page 2 is the first page of the first form, page 3 is the second page of the second form and page 4 is the first page of the second form, and so on.
I want to reorder the pages in these files so that they are in the sequence 2(1), 2(1)-1, 2(2), 2(2)-1, 2(3), 2(3)-1, ..., 2n, 2n-1 (that is: 2, 1, 4, 3, 6, 5, ...), where n = total number of pages / 2.
I've been looking for a way to do this using command line tools such as cpdf, pdftk, etc., but to no avail. The files are quite large so I would like to do it by using command line tools.
Any advice will be greatly appreciated!
The CIB pdf toolbox from CIB (https://www.cib.de) has a (non-free) command-line version which supports all possibilities of PDF merging in one run.
Did you try poppler-utils? I think that with its command-line utilities pdfseparate and pdfunite you can achieve all you want; a rough sketch follows below.
http://www.manpagez.com/man/1/pdfseparate/
http://www.manpagez.com/man/1/pdfunite/
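For example, a sketch (untested; assumes bash, that the total page count n is even and known, and enough disk space for the burst single-page files):
n=400   # total number of pages in in.pdf
pdfseparate in.pdf pg-%d.pdf
pdfunite $(for i in $(seq 1 $(( n / 2 ))); do printf 'pg-%d.pdf pg-%d.pdf ' $(( 2 * i )) $(( 2 * i - 1 )); done) out.pdf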
Would it matter to you if the order of the forms inside the PDF is changed? For example if instead of
form1-reversed,
form2-reversed,
form3-reversed
your resulting file would look like
form3,
form2,
form1
?
In this case you could just run PDFtk so that it completely reverses all the pages of the original file:
pdftk in.pdf cat r1-1 output reversed.pdf
(Prefixing a page number with the letter r references pages in reverse order. That means r1 is the last page...)
In case you are on an operating system which supports shell scripting (like Bash on Linux or macOS), you can generate the requested sequence of page numbers with something like this (assuming that your n == 10):
for i in {1..10}; do
echo -n "$(( 2 * ${i} )) ";
echo -n "$(( 2 * ${i} -1 )) ";
done
This will output 2 1 4 3 6 5 8 7 10 9. Now you could use this PDFtk command for re-ordering the pages as you want:
pdftk in.pdf cat $(for i in {1..10};do echo -n "$((2 * ${i})) ";echo -n "$((2*${i}-1 )) ";done) output out.pdf
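If n isn't known in advance, pdftk itself can report it: its dump_data operation prints a NumberOfPages line that is easy to parse. A sketch combining the two steps:
n=$(pdftk in.pdf dump_data | awk '/^NumberOfPages/ {print $2}')
pdftk in.pdf cat $(for i in $(seq 1 $(( n / 2 ))); do printf '%d %d ' $(( 2 * i )) $(( 2 * i - 1 )); done) output out.pdf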

Ghostscript: duplex prints the same side twice when NumCopies is bigger than 1

I'm using this command:
-q -dBATCH -dNOPAUSE -dNODISPLAY -dPDFFitPage \
-c "mark /BitsPerPixel 1 \
/NoCancel true \
/NumCopies 2 \
/Duplex true \
/OutputFile (%printer%Ricoh c2051) \
/UserSettings << /DocumentName (Arquivo Teste) \
/MaxResolution 500 >> \
(mswinpr2)finddevice putdeviceprops setdevice" -f "C:\Test123.pdf"
The PDF file has 3 pages. When I set NumCopies to 2, for example, the result is 3 sheets:
sheet 1 = the text of page one on both sides;
sheet 2 = the text of page two on both sides;
sheet 3 = the text of page three on both sides.
But when I print just one copy, the result is 2 sheets:
sheet 1: the text of page one, and on the other side the text of page two;
sheet 2: the text of page three, with the other side blank;
which is how duplex is supposed to work.
Does anyone know why this happens?
It's a consequence of the way mswinpr2 works: it doesn't care about you setting /Duplex true, because the device is not a duplexing device (your printer obviously is, but that's not the same thing). In fact, the majority of the command line will have no effect on the output.
When you set NumCopies, it prints each page 'NumCopies' times to the printer so if your printer is set to do duplexing then it prints the first copy of page 1 on the first side, then the second copy of page 1 on the second side (ie the back of page 1) then the first copy of page 2 on the third side (the front of page 2) and so on.
You cannot achieve multiple collated copies using the mswinpr2 device.
The command line you have suggests that you have the PostScript option for your printer. You could instead use the ps2write device to convert the PDF to PostScript and send that PostScript to the printer. The latest version of Ghostscript allows device-specific options to be injected into the output PostScript, so you could easily add NumCopies and Duplex there, assuming your printer has sufficient memory to do duplexing and NumCopies at the same time.
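A minimal sketch of that approach (assuming a Ghostscript version recent enough to support the PSDocOptions parameter of the ps2write device; the file paths are taken from the question):
gs \
-sDEVICE=ps2write \
-o "C:\Test123.ps" \
-sPSDocOptions="<< /Duplex true /NumCopies 2 >> setpagedevice" \
-f "C:\Test123.pdf"
The resulting PostScript file can then be sent to the printer as a raw job, so Duplex and NumCopies are handled by the printer itself rather than by the mswinpr2 device.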

Convert subsets of a PostScript file to PDF documents

I have a system that generates large quantities of PostScript files that each contain multiple, multi-page documents. I want to write a script that takes these large PostScript documents and outputs multiple PDF documents from each.
For example, one PostScript file contains 200 letters to customers, each of which is 10 pages long, so the file contains 2000 pages in total. From this one PS document I want to output 200 ten-page PDFs, one for each customer.
I'm thinking Ghostscript is the way to go for this level of document manipulation, but I'm not sure of the best approach. Is there a function in Ghostscript to take 'pages 1-10' of the input PS file? Do I have to output the entire PS file as 2000 separate single-page PS files and then combine them back together again?
Or is there a much simpler way of achieving my goal with something other than Ghostscript?
Many Thanks,
Ben
Technically this will be possible in the next release of Ghostscript, or using the HEAD code in the Git repository. It is now possible to switch devices when using pdfwrite which will cause the device to close and complete the current PDF file. Switching back again will start a new one.
Combine this with a BeginPage and/or EndPage procedure in the page device dictionary, and you should be able to do something like what you want.
Caveat: I haven't tried any of this, and it will take some PostScript programming to get it to work.
Because of the nature of PostScript, there is no way to extract the 'N'th page from a file, so there is no way to specify a range of pages.
As lsemi suggests you could first convert to one large PDF file and then extract the ranges you want. Ghostscript is able to use the FirstPage and LastPage switches to do this (unlike PostScript, it is possible to extract a specific page from a PDF file).
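For example, once you have a single large PDF (called 2000p.pdf below, as in the following answer), pulling out the first 10-page letter looks like this; -dFirstPage and -dLastPage are the standard switches for PDF input:
gs \
-o letter-001.pdf \
-sDEVICE=pdfwrite \
-dFirstPage=1 \
-dLastPage=10 \
2000p.pdf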
Well, you might first make the PS into one big PDF (generated directly by Ghostscript by printing to its pdfwrite device), and then "cut" the ranges you want from the big PDF using pdftk, which would be quite fast.
Create the complete PDF file first with the help of Ghostscript:
gs \
-o 2000p.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
2000p.ps
Use pdftk to extract PDF files with 10 pages each:
for i in $(seq 0 10 1990); do \
export start=$(( ${i} + 1 )); \
export end=$(( ${start} + 9 )); \
pdftk \
2000p.pdf \
cat ${start}-${end} \
output pages---${start}..${end}.pdf; \
done
You can have Ghostscript generate a 2000-page sample and test PDF for you by first creating a PostScript file named '2000p.ps' with these contents:
%!PS
% H1: large blue-ish Helvetica for the page number
/H1 {/Helvetica findfont 48 scalefont setfont .2 .2 1 setrgbcolor} def
% pageframe: red frame around the page area
/pageframe {1 0 0 setrgbcolor 2 setlinewidth 10 10 575 822 rectstroke} def
% gopageno: select the font and move to the page number position
/gopageno {H1 300 700 moveto } def
1 1 2000 {pageframe gopageno
4 string cvs         % convert the loop counter to a string
dup stringwidth pop  % measure its width ...
-1 mul 0 rmoveto     % ... and shift left so the number ends at x=300
show
showpage} for
and then run this command:
gs -o 2000p.pdf -sDEVICE=pdfwrite -g5950x8420 2000p.ps

High-res images from PDFs

I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document so the higher the resolution the better.
I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.
Recipe 1, pdfimages and ImageMagick:
First do:
$ pdfimages $MY_PDF.pdf foo"
Which results in several .pbm files (named foo-000.pbm, foo-001.pbm), etc.
Then for each *.pbm do:
$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension (-resize just serves to normalize everything).
Con: The orientation of the pages is lost, and they come out rotated in different directions (the rotations follow logical patterns, so they are probably the orientations in which the pages were fed to the scanner?).
Recipe 2, ImageMagick solo:
convert +adjoin $MY_PDF.pdf pages.tif
This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).
Pro: Orientation stays!
Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though there is some compression applied.
How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?
Sorry for the noise on this old topic, but Google took me here as one of the top results and it might take others, so I thought I'd post the solution to the OP's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick
In short: you have to tell ImageMagick at which density it should rasterize the PDF.
So convert -density 600x600 foo.pdf foo.png tells ImageMagick to treat the PDF as if it had a 600 dpi resolution, and thus to output much larger PNGs. In my case, the resulting foo.png was 5000x6600 px. You can optionally add -resize 3000x3000 (or whatever size you require) and it will be scaled down.
Note that as long as you only have vector graphics or text in your PDF files, the density may be set as high as needed. If the PDF contains rasterized images, however, it won't look any better if you set the density higher than those images' dpi, surprise! :)
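A quick way to check the resolution of the embedded images (assuming a reasonably recent poppler-utils, whose pdfimages has a -list option that prints the dimensions and ppi of every image in the file):
$ pdfimages -list foo.pdf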
Chris
I wanted to share my solution...it may not work for everyone, but since nothing else has come around maybe it will help someone else. I wound up going with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess at the orientation, which got me from (estimated) 25% rotated accurately to above 90%.
The flow is as follows:
Use pdfimages (apt-get install poppler-utils) to get a set of pbm
files (not shown below).
For each file:
Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
OCR each. The two with the lowest word count are likely the right-side up and upside down versions. This was over 99% accurate in my set of images processed to date.
From the two with the lowest word count, run the OCR output through a spell check. The file with the fewest spelling errors (i.e. the most recognizable words) is likely to be correct. For my set this was about 93% accurate (up from 25%), based on a sample of 500.
YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak to greyscale or color files, or files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed rather than accuracy, since I only need rough numbers and am throwing away the OCR output. Regarding performance: my nothing-special Linux desktop machine can run the whole script at about 2-3 files per second.
Here's the implementation in a simple bash script:
#!/bin/bash
# Rotates a pbm file in place.
# Pass a .pbm as the only arg.
file=$1
TMP="/tmp/rotation-calc"
mkdir -p $TMP
# Dependencies:
# convert: apt-get install imagemagick
# ocrad: sudo apt-get install ocrad
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"
# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$($BASENAME $file)
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"
cp $file $north_file
$CONVERT -rotate 90 $file $east_file
$CONVERT -rotate 180 $file $south_file
$CONVERT -rotate 270 $file $west_file
# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"
$OCRAD -f -F utf8 $north_file -o $north_text
$OCRAD -f -F utf8 $east_file -o $east_text
$OCRAD -f -F utf8 $south_file -o $south_text
$OCRAD -f -F utf8 $west_file -o $west_text
# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table
# Take the bottom two; these are likely right side up and upside down, but
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n $wc_table | $HEAD -2 > $bottom_two_wc_table
# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
txt=$(echo $record | $AWK '{ print $2 }')
misspelled_word_count=$(cat $txt | $ASPELL -l en list | wc -w)
echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $bottom_two_wc_table
# Do the sort, overwrite the input file, save out the text
winner=$($SORT -n $misspelled_words_table | $HEAD -1)
rotated_file=$(echo $winner | $AWK '{ print $4 }')
mv $rotated_file $file
# Clean up.
if [ -d $TMP ]; then
rm -r $TMP
fi
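Usage would be something like this (the script name fix-rotation.sh is hypothetical; the script rotates each page image in place):
for f in foo-*.pbm; do ./fix-rotation.sh "$f"; done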

Combining PDF files with Ghostscript: how to include original file names?

I have about 250 single-page pdf files that have names like:
file_1_100.pdf,
file_1_200.pdf,
file_1_300.pdf,
file_2_100.pdf,
file_2_200.pdf,
file_2_300.pdf,
file_3_100.pdf,
file_3_200.pdf,
file_3_300.pdf
...etc
I am using the following command to combine them to a single pdf file:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=finished.pdf file*pdf
It works perfectly, combining them in the correct order. However, when looking at finished.pdf, I want to have a reference that tells me the original filename for each page.
Does anyone have any suggestions? Can I add page names referencing the files or something?
It is fairly easy to put the file names into a list of Bookmarks which many PDF viewers can display.
This is done with PostScript using the 'pdfmark' distiller operator. For example, use the following
gs -sDEVICE=pdfwrite -o finished.pdf control.ps
where control.ps contains PS commands to print the pages and output the bookmark (/OUT) pdfmarks:
(examples/tiger.eps) run [ /Page 1 /Title (tiger.eps) /OUT pdfmark
(examples/colorcir.ps) run [ /Page 2 /Title (colorcir.ps) /OUT pdfmark
Note that you can also perform the enumeration using PS to automate the entire process:
/PN 1 def
(file*.pdf) {
/FN exch def
FN run
[ /Page PN /Title FN /OUT pdfmark % do the file and bookmark it by filename
/PN PN 1 add def % bump the page number
} 1000 string filenameforall
NB that the order of filenameforall enumeration is not specified, so you may want to sort the list to control the order, using the Ghostscript extension .sort (usage: array {lt} .sort).
Also, after thinking about this, I realized that if an input file has more than one page, there is a better way to set the bookmark to the correct page number, using the 'PageCount' device property.
[
(file*.pdf) { dup length string copy } 1000 string filenameforall
] % create array of filenames
{ lt } .sort % sort in increasing alphabetic order
/PN 1 def
{ /FN exch def
/PN currentpagedevice /PageCount get 1 add def % get current page count done (next is one greater)
FN run [ /Page PN /Title FN /OUT pdfmark % do the file and bookmark it by filename
} forall
The above creates an array of strings (copying them to unique string objects since filenameforall
just overwrites the string it is given), then sorts it, and finally processes the array of strings
using the forall operator. By using the PageCount device property to get the count of pages already produced, the page number (PN) for the bookmark will be correct. I have tested this snippet as 'control.ps'.
To stamp the filename on each page you can use a combination of Ghostscript and pdftk, taken from https://superuser.com/questions/171790/print-pdf-file-with-file-path-in-footer:
gs \
-o outdir/footer.pdf \
-sDEVICE=pdfwrite \
-c "5 5 moveto /Helvetica findfont 9 scalefont setfont (foobar-filename.pdf) show showpage"
pdftk \
foobar-filename.pdf \
stamp outdir/footer.pdf \
output outdir/merged_foobar-filename.pdf
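To label every page of the already-combined finished.pdf with its source file name in one pass, a possible variation (a sketch, untested; directory and output names are made up) is to build one footer page per input file and overlay them page by page with pdftk's multistamp operation, which applies page N of the stamp file to page N of the input:
mkdir -p footers
for f in file*.pdf; do
  gs -o "footers/footer-$f" -sDEVICE=pdfwrite \
     -c "5 5 moveto /Helvetica findfont 9 scalefont setfont ($f) show showpage"
done
pdftk footers/footer-*.pdf cat output footers/all-footers.pdf
pdftk finished.pdf multistamp footers/all-footers.pdf output finished-labelled.pdf
This assumes each input file is a single page, as in the question, so that footer pages and book pages line up one to one.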