Is there a way to discard previous pdfmark metadata? - pdf

I was trying to automate adding title, bookmarks and such to some PDFs I need. The way I came up with was to create a simple pdfmark script like this:
% pdfmark.ps
[ /Title (My document)
/Author(Me)
/DOCINFO pdfmark
[ /Title (First chapter)
/Page 1
/OUT pdfmark
Then generate a new PDF with ghostscript using:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=out.pdf in.pdf pdfmark.ps
If in.pdf doesn't contain any pdfmark data, this works fine. If it does, the results are wrong: for example, the title and author aren't modified, and bookmarks are appended instead of replaced.
Since I don't want to mess around modifying the PDF's corresponding PostScript, I'd like to know whether there is a command I can add to pdfmark.ps that deletes (or overwrites) the previous metadata.

I'll leave PostScript to others and show how to remove a PDF outline using the qpdf package (for qpdf and fix-qdf) and GNU sed.
From the qpdf manual:
In QDF mode, qpdf creates PDF files in what we call QDF form.
A PDF file in QDF form, sometimes called a QDF file, is a completely
valid PDF file that has %QDF-1.0 as its third line (after the pdf
header and binary characters) and has certain other characteristics.
The purpose of QDF form is to make it possible to edit PDF files,
with some restrictions, in an ordinary text editor.
(For a non-GNU/Linux system adapt the commands below.)
qpdf --qdf --compress-streams=n --decode-level=generalized \
--object-streams=disable -- in.pdf - |
sed --binary \
-e '/^[ ][ ]*\/Outlines [0-9][0-9]* [0-9] R/ s/[1-9]/0/g' |
fix-qdf > tmp.qdf
qpdf --coalesce-contents --compression-level=9 \
--object-streams=generate -- tmp.qdf out.pdf
where:
The 1st qpdf command converts the PDF file to QDF form for editing.
sed orphans the outlines in the QDF file by rooting them at the non-existent object 0.
fix-qdf repairs the QDF file after editing.
The 2nd qpdf command converts the QDF file back to PDF and compresses it.
qpdf input cannot be piped in, because qpdf needs to seek in its input file.
The sed command changes digits to zeros in the line containing
the indented text /Outlines.
Note that GNU sed is used for the non-standard --binary option
to avoid mishaps on an OS distinguishing between text and binary files.
Similarly, to strip annotations replace /Outlines with /Annots in
the -e above, or insert it in a second -e option to do both.
A patch utility other than sed will also do; often just one byte has
to be changed.
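The "second -e option" variant described above could be wrapped in a small shell function. This is only a minimal sketch: `orphan_refs` is a hypothetical helper name, it assumes the key names contain no regex metacharacters, and it leaves out --binary, which you may want to add back on systems that distinguish text from binary files.

```shell
#!/bin/sh
# orphan_refs KEY...   (hypothetical helper, not part of qpdf)
# Reads QDF text on stdin and zeroes the digits on every indented
# "/KEY n g R" line, using the same sed expression as above, once per key.
orphan_refs() {
    script=
    for key in "$@"; do
        script="${script}/^[ ][ ]*\\/${key} [0-9][0-9]* [0-9] R/ s/[1-9]/0/g;"
    done
    sed -e "$script"
}
```

It would sit in the pipeline where the plain sed call was, for example `qpdf --qdf ... -- in.pdf - | orphan_refs Outlines Annots | fix-qdf > tmp.qdf`.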
To quickly strip all non-page data (docinfo and outlines, among others, but not
annotations), qpdf's --empty option may be useful:
qpdf --coalesce-contents --compression-level=9 \
--object-streams=generate \
--empty --pages in.pdf 1-z -- out.pdf
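For the original goal (replacing the metadata rather than just removing it), the --empty step could be chained with the Ghostscript command from the question. The following is an untested sketch under that assumption; `retag` is a hypothetical function name, and the temporary file comes from mktemp.

```shell
#!/bin/sh
# retag IN OUT MARKS   (hypothetical helper)
# 1. qpdf copies only the pages of IN into a temporary file,
#    leaving docinfo and outlines behind.
# 2. gs rewrites that file and applies the pdfmarks in MARKS.
retag() {
    src=$1 out=$2 marks=$3
    tmp=$(mktemp) || return 1
    qpdf --empty --pages "$src" 1-z -- "$tmp" &&
        gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
           -sOutputFile="$out" "$tmp" "$marks"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Called as `retag in.pdf out.pdf pdfmark.ps`. Note that annotations are not stripped by this route, as mentioned above.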

Related

Get mutool to output "structured text (as xml)"

Following mutool's instructions for the draw command
https://mupdf.com/docs/manual-mutool-draw.html
How do I output "structured text (as xml)" when one of the output "vector formats" is "debug trace (as xml)" and the "output format is inferred from the output filename" ?
If I run
mutool draw -o "testfile.xml" "testfile.pdf"
it appears that I get the "debug trace (as xml)" file format.
What file extension should I use to ensure that the "structured text (as xml)" format is output?
The usage message if you run "mutool draw" with no arguments tells you which formats are supported, and what their file extensions are.
In your case, you want "stext" output.
mutool draw -o out.stext input.pdf
mutool draw -F stext -o out.xml input.pdf
Or, if you prefer, use the "mutool convert" command, which supports advanced output options through the -O argument:
mutool convert -o out.stext input.pdf
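If the output has to keep an .xml extension, the -F form shown above is the one to use, because the extension alone would select the debug trace. A minimal sketch for a batch of files follows; `stext_all` is a hypothetical name, and the input names are assumed to end in .pdf.

```shell
#!/bin/sh
# stext_all FILE.pdf...   (hypothetical helper)
# Writes structured text for each PDF to FILE.xml, forcing the
# format with -F so the .xml extension is not used to infer it.
stext_all() {
    for pdf in "$@"; do
        mutool draw -F stext -o "${pdf%.pdf}.xml" "$pdf" || return 1
    done
}
```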

splitting PDF files in 50 pages interval

I have a Ghostscript script that splits PDF books into 50-page chunks. The problem is that Ghostscript removes the transparency (I think this is called the alpha channel in technical terms: http://www.peteryu.ca/tutorials/publishing/pdf_manipulation_tips) of the annotations.
Look at the following paragraph from a book. The highlight was fully readable before the splitting.
Now, it is blacked out.
So I am looking for a way to do the splitting with another tool, such as PDFtk, that will not flatten my annotations.
Ultimately, I want to run the script on a folder of files using Hazel on a Mac.
Here is the Ghostscript script, if it helps ($1 is Hazel's way of passing in the file, I think):
echo "Page count: "
ournum=$(gs -q -dNODISPLAY -c "($1) (r) file runpdfbegin pdfpagecount = quit" 2>/dev/null)
declare -i counter
declare -i lastpage
counter=1
while [ "$counter" -le "$ournum" ]; do
    echo "$counter"
    newname="${1%.pdf}"
    reallynewname="$newname-$counter.pdf"
    # 50 pages per output file: pages counter .. counter+49
    lastpage=counter+49
    yes | gs -dBATCH -sOutputFile="$reallynewname" -dFirstPage="$counter" -dLastPage="$lastpage" -sDEVICE=pdfwrite "$1" >& /dev/null
    counter=lastpage+1
done
Can you guys help me with this?
Thanks
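Independently of which tool does the splitting, the page arithmetic for non-overlapping 50-page chunks can be isolated in a small function. This is a sketch only; `page_ranges` is a hypothetical name. Each printed range could then be handed to a page-extraction tool, for example qpdf's `--empty --pages in.pdf FIRST-LAST -- out.pdf` form, though whether the highlights survive that route unchanged would need to be checked on the actual files.

```shell
#!/bin/sh
# page_ranges TOTAL SIZE   (hypothetical helper)
# Prints one "first last" pair per chunk; the last chunk is
# shortened so it never runs past TOTAL.
page_ranges() {
    total=$1 size=$2 first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + size - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "$first $last"
        first=$((last + 1))
    done
}
```

For a 120-page book, `page_ranges 120 50` prints the pairs 1 50, 51 100 and 101 120.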

Ghostscript: Internal links annotations not-printing in PDF/A-1b

I'm trying to generate a PDF/A-1b document with Ghostscript 9.18 from a batch of scanned document pages. I want to cover the scanned table of contents on the first page with a layer of document-internal links. But Ghostscript returns an error:
GPL Ghostscript 9.18: Annotation set to non-printing,
not permitted in PDF/A, annotation will not be present in output file
On the command line, I use:
gs \
-sDEVICE=pdfwrite \
-dBATCH=true \
-dNOPAUSE=true \
-sPAPERSIZE=a4 \
-dSAFER=true \
-sColorConversionStrategy=UseDeviceIndependentColor \
-sOutputFile=out.pdf \
-dEmbedAllFonts=true \
-dPrinted=true \
-dPDFA=true \
-dPDFACompatibilityPolicy=1 \
-sPDFSETTINGS=screen \
-f raw.pdf \
-f meta.ps
Each link is defined like:
[ /Rect [ 10 10 100 100 ] /SrcPg 1 /Page 7 /Subtype /Link /ANN pdfmark
I've tried to force printing with the /F 3 and /F 4 PDF annotation flags, and at the gs level with -dPrinted=true, without any success.
Is there another way to generate internal links in a PDF/A file? Do I misunderstand the PDF/A standard?
There is no need to make your Link annotations non-printing. If you do not want them to have any visual appearance, just give them an appearance that does not draw anything (i.e. an empty appearance stream).
The PDF/A-1 standard mandates that all annotations that are visible (on screen) are also set to print (to ensure that the appearance of pages doesn't look different between display on a screen and printouts).
Unfortunately, I cannot help with how to apply this information in or with Ghostscript.

Ghostscript Merge pdf and create table of content page from merged files

I would like to generate a PDF file with a table of contents based on the merged files.
Let's say that I have these files: 1.pdf, 2.pdf and 3.pdf.
I would love to create a fourth PDF file containing the list with internal links to the different merged files.
Let's name it: toc.pdf. It should contain the list of the previous files with a pdfmark to link on the document.
I have succeeded merging the first three documents with the Ghostscript command:
gs -dBATCH -sDEVICE=pdfwrite -sPAPERSIZE=letter -dEPSFitPage -o merged.pdf 1.pdf 2.pdf 3.pdf
But I have failed looking for options on how to build the file toc.pdf with the internal links.
OK, first point: GS and the pdfwrite device aren't intended for this purpose.
I've explained this before, but it bears repeating, because people who don't understand how this system works aren't aware of the potential pitfalls. You aren't 'merging' files at all. When you process a PDF file with GS, it is fully interpreted and broken down into a sequence of graphics primitives. These are then passed to a 'device' that deals with them. Often the device renders the graphics to a bitmap, but in the case of pdfwrite it reassembles them into a brand new PDF file.
So the final PDF file is not created by chopping up pieces of the original files and rearranging them; it's an entirely new file that has the same appearance.
Now as to your actual request. If you want to do this you are going to have to do it manually, I don't think there is any tool which is going to do this for you.
The good news is that GS does accept and process most pdfmarks, so you can create a pdfmark, or series of them, which will do what you want. Of course, you are going to have to craft these specifically for each case, as you will need to know the page number within the final file as part of the pdfmark which means knowing how many pages are in each of the component files.
By the way, the EPSFitPage switch has no effect on any input which is not a well-formed EPS file. If you want to fit PDF files, use PDFFitPage.
Step 1:
gs -o 1_toc.pdf -sDEVICE=pdfwrite -c "[/Title (1.pdf) /OUT pdfmark" -f 1.pdf
gs -o 2_toc.pdf -sDEVICE=pdfwrite -c "[/Title (2.pdf) /OUT pdfmark" -f 2.pdf
gs -o 3_toc.pdf -sDEVICE=pdfwrite -c "[/Title (3.pdf) /OUT pdfmark" -f 3.pdf
Step 2:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile=toc.pdf 1_toc.pdf 2_toc.pdf 3_toc.pdf
Expanding on KenS's answer, based on this post: https://groups.google.com/d/msg/comp.text.pdf/TslRCZH6x70/X_veyNNMyTcJ
Assuming that 1.pdf, 2.pdf and 3.pdf each have 3 pages, you can try:
gs -o out.pdf -sDEVICE=pdfwrite \
-c "[/Page 1 /View [/XYZ null null null] /Title (file 1.pdf) /OUT pdfmark" \
-c "[/Page 4 /View [/XYZ null null null] /Title (file 2.pdf) /OUT pdfmark" \
-c "[/Page 7 /View [/XYZ null null null] /Title (file 3.pdf) /OUT pdfmark" \
-f merged.pdf
Of course, you should change the number after /Page and the string in the parentheses after /Title to match your files.
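The page numbers follow directly from the page counts of the component files, so the pdfmark lines can be generated rather than typed by hand. A minimal sketch, assuming you already know each file's page count; `outline_marks` is a hypothetical name, and titles containing parentheses or backslashes would need escaping first.

```shell
#!/bin/sh
# outline_marks TITLE PAGES [TITLE PAGES]...   (hypothetical helper)
# Prints one /OUT pdfmark per component file, pointing at the page
# on which that file starts in the merged document.
outline_marks() {
    page=1
    while [ "$#" -ge 2 ]; do
        printf '[/Page %d /View [/XYZ null null null] /Title (%s) /OUT pdfmark\n' "$page" "$1"
        page=$((page + $2))
        shift 2
    done
}
```

For three 3-page files, `outline_marks "file 1.pdf" 3 "file 2.pdf" 3 "file 3.pdf" 3` reproduces the three lines used in the command above (pages 1, 4 and 7). The output can be saved to a .ps file and passed to gs together with merged.pdf.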

How to add page numbers to Postscript/PDF

If you've got a large document (500 pages+) in Postscript and want to add page numbers, does anyone know how to do this?
Based on rcs's proposed solution, I did the following:
Converted the document to example.pdf and ran pdflatex addpages, where addpages.tex reads:
\documentclass[8pt]{article}
\usepackage[final]{pdfpages}
\usepackage{fancyhdr}
\topmargin 70pt
\oddsidemargin 70pt
\pagestyle{fancy}
\rfoot{\Large\thepage}
\cfoot{}
\renewcommand {\headrulewidth}{0pt}
\renewcommand {\footrulewidth}{0pt}
\begin{document}
\includepdfset{pagecommand=\thispagestyle{fancy}}
\includepdf[fitpaper=true,scale=0.98,pages=-]{example.pdf}
% fitpaper & scale aren't always necessary - depends on the paper being submitted.
\end{document}
or alternatively, for two-sided pages (i.e. with the page number consistently on the outside):
\documentclass[8pt]{book}
\usepackage[final]{pdfpages}
\usepackage{fancyhdr}
\topmargin 70pt
\oddsidemargin 150pt
\evensidemargin -40pt
\pagestyle{fancy}
\fancyhead{}
\fancyfoot{}
\fancyfoot[LE,RO]{\Large\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\begin{document}
\includepdfset{pages=-,pagecommand=\thispagestyle{fancy}}
\includepdf{target.pdf}
\end{document}
Easy way to change header margins:
% set margins for headers, won't shrink included pdfs
% you can remove the topmargin/oddsidemargin/evensidemargin lines
\usepackage[margin=1in,includehead,includefoot]{geometry}
you can simply use
pspdftool
http://sourceforge.net/projects/pspdftool
in this way:
pspdftool 'number(x=-1pt,y=-1pt,start=1,size=10)' input.pdf output.pdf
see these two examples (unnumbered and numbered pdf with pspdftool)
unnumbered pdf
http://ge.tt/7ctUFfj2
numbered pdf
http://ge.tt/7ctUFfj2
with this as the first command-line argument:
number(start=1, size=40, x=297.5 pt, y=10 pt)
I used to add page numbers to my pdf using latex like in the accepted answer.
Now I found an easier way:
Use enscript to create empty pages with a header containing the page number, and then use pdftk with the multistamp option to put the header on your file.
This bash script expects the PDF file as its only parameter:
#!/bin/bash
input="$1"
output="${1%.pdf}-header.pdf"
pagenum=$(pdftk "$input" dump_data | grep "NumberOfPages" | cut -d":" -f2)
enscript -L1 --header='||Page $% of $=' --output - < <(for i in $(seq "$pagenum"); do echo; done) | ps2pdf - | pdftk "$input" multistamp - output "$output"
I was looking for a postscript-only solution, using ghostscript. I needed this to merge multiple PDFs and put a counter on every page. Only solution I found was an old gs-devel posting, which I heavily simplified:
%!PS
% add page numbers at the bottom right of the document (20 units spacing, hardcoded below)
% Note: Page dimensions are expressed in units of the default user space (72nds of an inch).
% inspired by https://www.ghostscript.com/pipermail/gs-devel/2005-May/006956.html
globaldict /MyPageCount 1 put % initialize page counter
% executed at the end of each page. Before calling the procedure, the interpreter
% pushes two integers on the operand stack:
% 1. a count of previous showpage executions for this device
% 2. a reason code indicating the circumstances under which this call is being made:
% 0: During showpage or (LanguageLevel 3) copypage
% 1: During copypage (LanguageLevel 2 only)
% 2: At device deactivation
% The procedure must return a boolean value specifying whether to transmit the page image to the
% physical output device.
<< /EndPage {
exch pop % remove showpage counter (unused)
0 eq dup { % only run and return true for showpage
/Helvetica 12 selectfont % select font and size for following operations
MyPageCount =string cvs % get page counter as string
dup % need it twice (width determination and actual show)
stringwidth pop % get width of page counter string ...
currentpagedevice /PageSize get 0 get % get width from PageSize on stack
exch sub 20 sub % pagewidth - stringwidth - some extra space
20 moveto % move to calculated x and y=20 (0/0 is the bottom left corner)
show % finally show the page counter
globaldict /MyPageCount MyPageCount 1 add put % increment page counter
} if
} bind >> setpagedevice
If you save this to a file called pagecount.ps you can use it on command line like this:
gs \
-dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-sOutputFile=/path/to/merged.pdf \
-f pagecount.ps -f input1.pdf -f input2.pdf
Note that pagecount.ps must be given first (technically, right before the input file with which the page counting should start).
If you don't want to use an extra .ps file, you can also use a minimized form like this:
gs \
-dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-sOutputFile=/path/to/merged.pdf \
-c 'globaldict /MyPageCount 1 put << /EndPage {exch pop 0 eq dup {/Helvetica 12 selectfont MyPageCount =string cvs dup stringwidth pop currentpagedevice /PageSize get 0 get exch sub 20 sub 20 moveto show globaldict /MyPageCount MyPageCount 1 add put } if } bind >> setpagedevice' \
-f input1.pdf -f input2.pdf
Depending on your input, you may have to use gsave/grestore at the beginning/end of the if block.
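For reference, here is roughly what the gsave/grestore variant could look like, written out by a small shell function so it can be regenerated on demand. This is a sketch, not a tested drop-in: `write_pagecount_ps` is a hypothetical name, and the PostScript is the procedure from above with the drawing wrapped in a save and restore of the graphics state.

```shell
#!/bin/sh
# write_pagecount_ps FILE   (hypothetical helper)
# Writes the EndPage procedure to FILE, with the drawing commands
# bracketed so they do not disturb the page's own graphics state.
write_pagecount_ps() {
    cat > "$1" <<'EOF'
%!PS
globaldict /MyPageCount 1 put
<< /EndPage {
  exch pop
  0 eq dup {
    gsave
    /Helvetica 12 selectfont
    MyPageCount =string cvs dup stringwidth pop
    currentpagedevice /PageSize get 0 get
    exch sub 20 sub 20 moveto show
    grestore
    globaldict /MyPageCount MyPageCount 1 add put
  } if
} bind >> setpagedevice
EOF
}
```

After `write_pagecount_ps pagecount.ps`, the file is used exactly as in the command line shown earlier.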
This might be a solution:
convert postscript to pdf using ps2pdf
create a LaTeX file and insert the pages using the pdfpages package (\includepdf)
use pagecommand={\thispagestyle{plain}} or something from the fancyhdr package in the arguments of \includepdf
if postscript output is required, convert the pdflatex output back to postscript via pdf2ps
Further to captaincomic's solution, I've extended it to support the starting of page numbering at any page.
Requires enscript, pdftk 1.43 or greater and pdfjam (for pdfjoin utility)
#!/bin/bash
input="$1"
count=$2
blank=$((count - 1))
output="${1%.pdf}-header.pdf"
pagenum=$(pdftk "$input" dump_data | grep "NumberOfPages" | cut -d":" -f2)
(for i in $(seq "$blank"); do echo; done) | enscript -L1 -B --output - | ps2pdf - > /tmp/pa$$.pdf
(for i in $(seq "$pagenum"); do echo; done) | enscript -a ${count}- -L1 -F Helvetica#10 --header='||Page $% of $=' --output - | ps2pdf - > /tmp/pb$$.pdf
pdfjoin --paper letter --outfile /tmp/join$$.pdf /tmp/pa$$.pdf /tmp/pb$$.pdf &>/dev/null
cat /tmp/join$$.pdf | pdftk "$input" multistamp - output "$output"
rm /tmp/pa$$.pdf
rm /tmp/pb$$.pdf
rm /tmp/join$$.pdf
For example, place this in /usr/local/bin/pagestamp.sh and execute it like this:
pagestamp.sh doc.pdf 3
This will start the page numbering at page 3, which is useful when you have cover sheets, title pages, a table of contents, etc.
Unfortunately, enscript's --footer option is broken, so you cannot get the page numbering at the bottom using this method.
I liked the idea of using pspdftool (man page) but what I was after was page x out of y format and the font style to match the rest of the page.
To find out about the font names used in the document:
$ strings input.pdf | grep Font
To get the number of pages:
$ pdfinfo input.pdf | grep "Pages:" | tr -s ' ' | cut -d" " -f2
Glue it together with a few pspdftool commands:
$ in=input.pdf; \
out=output.pdf; \
indent=30; \
pageNumberIndent=49; \
pageCountIndent=56; \
font=LiberationSerif-Italic; \
fontSize=9; \
bottomMargin=40; \
pageCount=`pdfinfo $in | grep "Pages:" | tr -s ' ' | cut -d" " -f2`; \
pspdftool "number(x=$pageNumberIndent pt, y=$bottomMargin pt, start=1, size=$fontSize, font=\"$font\")" $in tmp.pdf; \
pspdftool "text(x=$indent pt, y=$bottomMargin pt, size=$fontSize, font=\"$font\", text=\"page \")" tmp.pdf tmp.pdf; \
pspdftool "text(x=$pageCountIndent pt, y=$bottomMargin pt, size=$fontSize, font=\"$font\", text=\"out of $pageCount\")" tmp.pdf $out; \
rm tmp.pdf;
Oh, it's a long time since I used postscript, but a quick dip into the blue book will tell you :) www-cdf.fnal.gov/offline/PostScript/BLUEBOOK.PDF
On the other hand, Adobe Acrobat and a bit of javascript would also do wonders ;)
Alternatively, I did find this: http://www.ghostscript.com/pipermail/gs-devel/2005-May/006956.html, which seems to fit the bill (I didn't try it)
You can use the free and open source pdftools to add page numbers to a PDF file with a single command line.
The command line you could use is (on GNU/Linux you have to escape the $ sign in the shell, on Windows it is not necessary):
pdftools.py --input-file ./input/wikipedia_algorithm.pdf --output ./output/addtext.pdf --text "\$page/\$pages" br 1 1 --overwrite
Regarding the --text option:
The first parameter is the text to add. Some placeholders are available. $page stands for the current page number, while $pages stands for the total number of pages in the PDF file. Thus the option so formulated would add something like "1/10" for the first page of a 10-page PDF document, and so on for the following pages
The second parameter is the anchor point of the text box. "br" will position the bottom right corner of the text box
The third parameter is the horizontal position of the anchor point of the text box as a percentage of the page width. Must be a number between 0 and 1, with the dot . separating decimals
The fourth parameter is the vertical position of the anchor point of the text box as a percentage of the page height. Must be a number between 0 and 1, with the dot . separating decimals
Disclaimer: I'm the author of pdftools
I am assuming you are looking for a PS-based solution. There is no page-level operator in PS that will allow you to do this. You need to add a footer-sort of thingy in the PageSetup section for each page. Any scripting language should be able to help you along.
I tried pspdftool (http://sourceforge.net/projects/pspdftool).
I eventually got it to work, but at first I got this error:
pspdftool: xreftable read error
The source file was created with pdfjoin from pdfjam, and contained a bunch of scans from my Epson Workforce as well as generated tag pages. I couldn't figure out a way to fix the xref table, so I converted to PS with pdf2ps and back to PDF with ps2pdf. Then I could use this to get nice page numbers in the bottom right corner:
pspdftool 'number(start=1, size=20, x=550 pt, y=10 pt)' input.pdf output.pdf
Unfortunately, it means that any text-searchable pages are no longer searchable because the text was rasterized in the ps conversion. Fortunately, in my case it doesn't matter.
Is there any way to fix or empty the xref table of a pdf file without losing what pages are searchable?
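One thing that may be worth trying for the xref question, since qpdf already appears elsewhere in this thread: qpdf attempts to reconstruct a damaged cross-reference table when it reads a file, and it writes a fresh one on output without touching the page content. Whether the result is then accepted by pspdftool has not been verified here; `rebuild_xref` is a hypothetical wrapper name.

```shell
#!/bin/sh
# rebuild_xref IN OUT   (hypothetical helper)
# Rewrites IN to OUT with qpdf. qpdf exits with 3 when it produced
# output but had to issue warnings (e.g. while recovering the xref
# table), so that status is treated as success here.
rebuild_xref() {
    qpdf "$1" "$2"
    rc=$?
    [ "$rc" -eq 0 ] || [ "$rc" -eq 3 ]
}
```

Usage would be `rebuild_xref input.pdf fixed.pdf && pspdftool 'number(start=1, size=20, x=550 pt, y=10 pt)' fixed.pdf output.pdf`.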
I took captaincomic's solution and added support for filenames containing spaces, plus some more information about the progress:
#!/bin/bash
clear
echo
echo This script adds page numbers to a given .pdf file.
echo
echo This script needs the packages pdftk and enscript.
echo If they are not installed, the script will fail.
echo Use the command sudo apt-get install pdftk enscript
echo to install them.
echo
input="$1"
output="${1%.pdf}-header.pdf"
echo input file is $input
echo output file will be $output
echo
pagenum=$(pdftk "$input" dump_data | grep "NumberOfPages" | cut -d":" -f2)
enscript -L1 --header='||Page $% of $=' --output - < <(for i in $(seq "$pagenum"); do echo; done) | ps2pdf - | pdftk "$input" multistamp - output "$output"
echo done.
I wrote the following shell script to solve this for LaTeX beamer style slides produced with inkscape (I pdftk cat the slides together into the final presentation PDF & then add slide numbers using the script below):
#!/bin/sh
# create working directory
tmpdir=$(mktemp --directory)
# read un-numbered beamer slides PDF from STDIN & create temporary copy
cat > $tmpdir/input.pdf
# get total number of pages
pagenum=$(pdftk $tmpdir/input.pdf dump_data | awk '/NumberOfPages/{print $NF}')
# generate latex beamer document with the desired number of empty but numbered slides
printf '%s' '
\documentclass{beamer}
\usenavigationsymbolstemplate{}
\setbeamertemplate{footline}[frame number]
\usepackage{forloop}
\begin{document}
\newcounter{thepage}
\forloop{thepage}{0}{\value{thepage} < '$pagenum'}{
\begin{frame}
\end{frame}
}
\end{document}
' > $tmpdir/numbers.tex
# compile latex file into PDF (2nd run needed for total number of pages) & redirect output to STDERR
pdflatex -output-directory=$tmpdir $tmpdir/numbers.tex >&2 && pdflatex -output-directory=$tmpdir $tmpdir/numbers.tex >&2
# add empty numbered PDF slides as background to (transparent background) input slides (page by
# page) & write results to STDOUT
pdftk $tmpdir/input.pdf multibackground $tmpdir/numbers.pdf output -
# remove temporary working directory with all intermediate files
rm -r $tmpdir >&2
The script reads STDIN & writes STDOUT printing diagnostic pdflatex output to STDERR.
So just copy-paste the above code in a text file, say enumerate_slides.sh, make it executable (chmod +x enumerate_slides.sh) & call it like this:
./enumerate_slides.sh < input.pdf > output.pdf [2>/dev/null]
It should be easy to adjust this to any other kind of document by adjusting the LaTeX template to use the proper documentclass, paper size & style options.
edit:
I replaced echo with $(which echo) because Ubuntu symlinks /bin/sh to dash, whose built-in echo interprets escape sequences by default and provides no -E option to override this behaviour. Note that, alternatively, you could escape every \ in the LaTeX template as \\.
edit:
I replaced $(which echo) with printf '%s' because in zsh, which echo returns "echo: shell built-in command" instead of /bin/echo.
See this question for details why I decided to use printf in the end.
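The point about printf can be seen in isolation with a few template lines. This is a small sketch; `emit_tex` is a hypothetical name. With the '%s\n' format, printf copies its arguments through verbatim, so sequences such as \b in \begin are not treated as escapes regardless of which shell runs the script.

```shell
#!/bin/sh
# emit_tex   (hypothetical helper)
# Prints LaTeX lines unchanged: backslashes in the arguments are
# data for %s, not escape sequences to be interpreted.
emit_tex() {
    printf '%s\n' \
        '\documentclass{beamer}' \
        '\begin{document}' \
        '\end{document}'
}
```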
Maybe pstops (part of psutils) can be used for this?
I have used LibreOffice Calc for this. Adding a page number field is easy using Insert->Field->Page Number. And then you can copy-and-paste this field to other pages; fortunately the position is not changed and the copy-and-paste can be done quickly with down arrow key and Ctrl+V. Worked for me for a 30 page article. Maybe prone to errors for a 500+ one!