I was surprised when I printed a PDF I had annotated with Okular: the printout was missing the annotations, even though they do show on the screen.
I have to save the annotated file as a printed PDF first, then print that.
Question:
How can I list all PDFs that have at least one annotation on at least one page?
Apparently, pdfinfo reports AcroForm when there is an annotation:
find -type f -iname "*.pdf" -exec pdfinfo {} \;
but it does not display the filename.
I'm not familiar with qpdf, but it does not seem to provide this information.
Thanks
Using pdfinfo from poppler-utils you can say,
find . -type f -iname '*.pdf' | while read -r pn
do pdfinfo "$pn" |
grep -q '^Form: *AcroForm' && printf '%s\n' "$pn"
done
to list the names of PDF files for which pdfinfo reports:
Form: AcroForm
However, in my tests it misses several PDFs with text annotations
and lists several without, so I'd avoid it for this job. Below are two
alternatives: qpdf supports all annotation subtypes,
python3-poppler-qt5 only a subset but can be much faster.
(For a non-POSIX shell adapt the commands in this posting.)
EDIT: find constructs edited to avoid unsafe and GNU-reliant {}s.
qpdf versions since 8.3.0 support a json representation
of non-content PDF data, and if you're on a system with the jq
JSON processor you can list unique PDF annotation types as
tab-separated values (in this case discarding the output and using
the exit code only):
find . -type f -iname '*.pdf' | while read -r pn
do qpdf --json --no-warn -- "$pn" |
jq -e -r --arg typls '*' -f annots.jq > /dev/null &&
printf '%s\n' "$pn"
done
where
--arg typls '*' specifies desired annotation subtypes, e.g. *
for all (the default), or Text,FreeText,Link for a selection
-e sets exit code 4 if no output was made (no annotations found)
-r produces raw (non-JSON) output
the jq script file annots.jq contains the following
#! /usr/bin/env jq-1.6
def annots:
( if ($typls | length) > 0 and $typls != "*"
then $typls
else
# annotation types, per Adobe's PDF Reference 1.7 (table 8.20)
"Text,Link,FreeText,Line,Square,Circle,Polygon"
+ ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
+ ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
+ ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
end | split(",")
) as $whitelist
| .objects
| .[]
| objects
| select( ."/Type" == "/Annot" )
| select( ."/Subtype" | .[1:] | IN($whitelist[]) )
| ."/Subtype" | .[1:]
;
[ annots ] | unique as $out
| if ($out | length) > 0 then ($out | @tsv) else empty end
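To see which annotation subtypes a given file actually contains (rather than discarding the output as in the loop above), the same script can be run against a single file; somefile.pdf is just a placeholder:
qpdf --json --no-warn -- somefile.pdf | jq -r --arg typls '*' -f annots.jq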
For many purposes it's tempting to use python-3.x with
python3-poppler-qt5
to handle the entire file list in one go,
find . -type f -iname '*.pdf' -exec python3 path/to/script -t 1,7 {} '+'
where the -t option lists the desired annotation subtypes, per
poppler documentation;
1 is AText and 7 is ALink. Without -t all subtypes known to
poppler (0 through 14) are selected, i.e. not all existing subtypes
are supported.
#! /usr/bin/env python3.8
import popplerqt5

# return True if any page of the PDF carries an annotation of one of the wanted subtypes
def gotAnnot(pdfPathname, subtypls):
    pdoc = popplerqt5.Poppler.Document.load(pdfPathname)
    for pgindex in range(pdoc.numPages()):
        annls = pdoc.page(pgindex).annotations()
        if annls is not None and len(annls) > 0:
            for a in annls:
                if a.subType() in subtypls:
                    return True
    return False

if __name__ == "__main__":
    import sys, getopt
    typls = range(14+1)  ## default: all subtypes
    opts, args = getopt.getopt(sys.argv[1:], "t:")
    for o, a in opts:
        if o == "-t" and a != "*":
            typls = [int(c) for c in a.split(",")]
    for pathnm in args:  # remaining arguments are the PDF pathnames
        if gotAnnot(pathnm, typls):
            print(pathnm)
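If the script is saved as, say, pdfannots.py (a name made up here), it can also be invoked directly on a few files, e.g. listing only those carrying Text annotations:
python3 pdfannots.py -t 1 report.pdf notes.pdf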
I need to url decode a string in a json structure using jq. I have a custom module defined under ~/.jq/urldecode.jq but when calling it:
jq '.http.referrer | url_decode::url_decode' file.json
I get the error message:
jq: 1 compile error
The module source is:
def url_decode:
# The helper function converts the input string written in the given
# "base" to an integer
def to_i(base):
explode
| reverse
| map(if 65 <= . and . <= 90 then . + 32 else . end) # downcase
| map(if . > 96 then . - 87 else . - 48 end) # "a" ~ 97 => 10 ~ 87
| reduce .[] as $c
# base: [power, ans]
([1,0]; (.[0] * base) as $b | [$b, .[1] + (.[0] * $c)]) | .[1];
. as $in
| length as $length
| [0, ""] # i, answer
| until ( .[0] >= $length;
.[0] as $i
| if $in[$i:$i+1] == "%"
then [ $i + 3, .[1] + ([$in[$i+1:$i+3] | to_i(16)] | implode) ]
else [ $i + 1, .[1] + $in[$i:$i+1] ]
end)
| .[1]; # answer
What is the proper syntax?
In theory, with your setup, you should be able to invoke jq along the lines of
jq 'import "urldecode" as url_decode;
.http.referrer | url_decode::url_decode' file.json
or more simply:
jq 'include "urldecode";
.http.referrer | url_decode' file.json
However, there are some circumstances in which theory does not quite apply. In such cases, the following workarounds may be used with jq 1.5 and 1.6:
(1) specify -L $HOME as a command-line parameter, and give the relative path name in the module specification. Thus, in your case, the command line would look like:
jq -L $HOME 'import ".jq/urldecode" as url_decode; ...
or:
jq -L $HOME 'include ".jq/urldecode"; ...
(2) Use the {search: _} feature, e.g.
jq 'include "urldecode" {search: "~/.jq"} ; ...' ...
jq 'import "urldecode" as url_decode {search: "~/.jq"} ; ...' ...
By default jq searches for modules in a hidden folder in your home directory, ~/.jq; the full default search path is ["~/.jq", "$ORIGIN/../lib/jq", "$ORIGIN/../lib"]. Module files use the .jq extension.
To reference the module you can use the import function, then follow your normal jq command after a semicolon. The "as lib" below allows you to change the name of the namespace as well:
jq 'import "urldecode" as lib; .http.referrer | lib::url_decode' file.json
You can override the location where the .jq files are stored with the -L option.
I'm not sure how to do custom modules in jq, but if you are using bash I would suggest piping to Perl for this. So far this is the easiest way I have found to quickly url-encode/decode HTML entities, and I typically use it in conjunction with jq:
echo 'http://domain.tld/?fields={fieldname_of_type_Tab}' |
perl -MHTML::Entities -pe 'decode_entities($_)'
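Combined with jq, the whole pipeline might look like this (a sketch; .http.referrer is the field from the question above):
jq -r '.http.referrer' file.json | perl -MHTML::Entities -pe 'decode_entities($_)'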
See also: Decode URL Unix/Bash Command Line (without sed)
I am currently trying to insert all our DT files v2 into BQ.
I already did it with the click file without spotting any trouble.
But it's not the same game with the activity and impression files.
I wrote a quick script to help me make the schema for the insertion:
import csv,json
import glob
data = []
for i in glob.glob('*.csv'):
    print i
    b = i.split("_")
    print b[2]
    with open(i, 'rb') as f:
        reader = csv.reader(f)
        row1 = next(reader)
        title = [w.replace(' ', '_').replace('/', '_').replace(':', '_').replace('(', '_').replace(')', '').replace("-", "_") for w in row1]
        print title
        for a in title:
            j={"name":"{0}".format(a),"type":"string","mode":"nullable"}
            print j
            if j not in data:
                data.append(j)
    with open('schema_' + b[2] + '.json', 'w') as outfile:
        json.dump(data, outfile)
After that, I use the small bash script below to insert all our data from our GCS.
#!/bin/bash
prep_files() {
  date=$(echo "$f" | cut -d'_' -f4 | cut -c1-8)
  echo "$n"
  table_name=$(echo "$f" | cut -d'_' -f1-3)
  bq --nosync load --field_delimiter=',' DCM_V2."$table_name""_""$date" "$var" ./schema/v2/schema_"$n".json
}
num=1
for var in $(gsutil ls gs://import-log/01_v2/*.csv.gz)
do
  if test $num -lt 10000
  then
    echo "$var"
    f=$(echo "$var" | cut -d'/' -f5)
    n=$(echo "$f" | cut -d'_' -f3)
    echo "$n"
    prep_files
    num=$(($num+1))
  else
    echo -e "Wait the next day"
    echo "$num"
    sleep $(( $(date -d 'tomorrow 0100' +%s) - $(date +%s) ))
    num=0
  fi
done
echo 'Import done'
But I have this kind of error:
Errors:
Too many errors encountered. (error code: invalid)
/gzip/subrange//bigstore/import-log/01_v2/dcm_accountXXX_impression_2016101220_20161013_073847_299112066.csv.gz: CSV table references column position 101, but line starting at position:0 contains only 101 columns. (error code: invalid)
So I checked the number of columns in my schema with:
$ awk -F',' '{print NF}'
But I have the right number of columns...
So I thought that was because we have commas in values (some publishers use a .NET framework that allows commas in URLs). But these values are enclosed in double quotes.
So I made a test with a small file:
id,url
1,http://www.google.com
2,"http://www.google.com/test1,test2,test3"
And this load works...
If someone has a clue to help me, that would be really great. :)
EDIT: I did another test by making the load with an already decompressed file.
Too many errors encountered. (error code: invalid)
file-00000000: CSV table references column position 104, but line starting at position:2006877004 contains only 104 columns. (error code: invalid)
I used this command to find the line: $ tail -c 2006877004 dcm_accountXXXX_activity_20161012_20161013_040343_299059260.csv | head -n 1
I get:
3079,10435077,311776195,75045433,1,2626849,139520233,IT,,28,0.0,22,,4003208,,dc_pre=CLHihcPW1M8CFTEC0woddTEPSQ;~oref=http://imasdk.googleapis.com/js/core/bridge3.146.2_en.html,1979747802,1476255005253094,2,,4233079,CONVERSION,POSTVIEW,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
After that: $ head -n1 dcm_account8897_activity_20161012_20161013_040343_299059260.csv | awk -F',' '{print NF}'
Response: 102
So, I have 104 columns in the first row and 102 on this one...
Anyone else have trouble with the DT files v2?
I had a similar issue and found the problem was due to a few records being split across two lines by carriage returns. Removing the \r characters solved the problem.
The line affected is usually not the line reflected in the error log.
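If carriage returns are indeed the culprit, one way to strip them before loading is a quick pass through tr (a sketch; the file names are placeholders for your gzipped DT files):
zcat dcm_account_activity.csv.gz | tr -d '\r' | gzip > dcm_account_activity_fixed.csv.gz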
I would open the CSV file in Google Sheets and compare the columns with the schema you generated.
Most probably you will find a bug in the schema.
I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices.
pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| sed '$d' \
| sed -r 's/ +/,/g; s/ //g' \
> output.csv
The resulting file should be in CSV spreadsheet format (comma separated value fields).
In other words, I want to improve the above command so that the output doesn't break at all. Any ideas?
I'll offer you another solution as well.
While in this case the pdftotext method works with reasonable effort, there may be cases where not each page has the same column widths (as your rather benign PDF shows).
Here the not-so-well-known, but pretty cool Free and OpenSource Software Tabula-Extractor is the best choice.
I myself am using the direct GitHub checkout:
$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
$ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
I wrote myself a pretty simple wrapper script like this:
$ cat ~/bin/tabulaextr
#!/bin/bash
cd ${HOME}/svn-stuff/git.tabula-extractor/bin
./tabula "$@"
Since ~/bin/ is in my $PATH, I just run
$ tabulaextr --pages all \
$(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
| tee my.csv
to extract all the tables from all pages and convert them to a single CSV file.
The first ten (out of a total of 8727) lines of the CSV look like this:
$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv
Retail Branding,Marketing Name,Device,Model
"","",AD681H,Smartfren Andromax AD681H
"","",FJL21,FJL21
"","",Luno,Luno
"","",T31,Panasonic T31
"","",hws7721g,MediaPad 7 Youth 2
3Q,OC1020A,OC1020A,OC1020A
7Eleven,IN265,IN265,IN265
A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
AG Mobile,Status,Status,Status
which in the original PDF look like this:
It even got these lines on the last page, 293, right:
nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A
which look on the PDF page like this:
TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!
Update
Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:
As Martin R commented, tabula-java is the new version of tabula-extractor and is actively developed. 1.0.0 was released on July 21st, 2017.
Download the jar file and, with the latest Java, run:
java -jar ./tabula-1.0.0-jar-with-dependencies.jar \
--pages=all \
./DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
> support_devices.csv
What you want is rather easy, but you're having a different problem also (I'm not sure you are aware of it...).
First, you should add -nopgbrk ("No page breaks, please!") to your command, so that the pesky ^L characters which would otherwise appear in the output need not be filtered out later.
Adding a grep -vE '(Supported Devices|^$)' will then filter out all the lines you do not want, including empty lines, or lines with only spaces:
pdftotext -layout -nopgbrk \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| grep -vE '(Supported Devices|^$|Marketing Name)' \
| gsed '$d' \
| gsed -r 's# +#,#g' \
| gsed 's# ##g' \
> output2.csv
However, your other problem is this:
Some of the table fields are empty.
Empty fields appear with the -layout option as a series of space characters, sometimes even two in the same row.
However, the text columns are not spaced identically from page to page.
Therefore you will not know from line to line how many spaces you need to regard as an "empty CSV field" (where you'd need an extra , separator).
As a consequence, your current code will show only one, two or three (instead of four) fields for some lines, and these fields end up in the wrong columns!
There is a workaround for this:
Add the -x ... -y ... -W ... -H ... parameters to pdftotext to crop the PDF column-wise.
Then append the columns with a combination of utilities like paste and column.
The following command extracts the first columns:
pdftotext -layout -x 38 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 1st-columns.txt
These are for second, third and fourth columns:
pdftotext -layout -x 214 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 2nd-columns.txt
pdftotext -layout -x 390 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 3rd-columns.txt
pdftotext -layout -x 567 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 4th-columns.txt
BTW, I cheated a bit: in order to get a clue about what values to use for -x, -y, -W and -H, I first ran this command to find the exact coordinates of the column header words:
pdftotext -f 1 -l 1 -layout -bbox \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | head -n 10
It's always good if you know how to read and make use of pdftotext -h. :-)
Anyway, how to append the four text files as columns side by side, with the proper CSV separator in between, you should find out yourself. Or ask a new question :-)
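For what it's worth, here is a minimal sketch using paste, assuming the four files have the same number of lines and the fields themselves contain no commas:
paste -d, 1st-columns.txt 2nd-columns.txt 3rd-columns.txt 4th-columns.txt > output.csv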
This can be done easily with an IntelliGet (http://akribiatech.com/intelliget) script as below
userVariables = brand, name, device, model;
{ start = Not(Or(Or(IsSubstring("Supported Devices",Line(0)),
IsSubstring("Retail Branding",Line(0))),
IsEqual(Length(Trim(Line(0))),0)));
brand = Trim(Substring(Line(0),10,44));
name = Trim(Substring(Line(0),45,79));
device = Trim(Substring(Line(0),80,114));
model = Trim(Substring(Line(0),115,200));
output = Concat(brand, ",", name, ",", device, ",", model);
}
For the case where you want to extract that tabular data from PDF over which you have control at creation time (for timesheets contracts your employees have to sign), the following solution will be cleaner:
Create a PDF form with field IDs.
Let people fill and save the PDF forms.
Use Apache PDFBox, an open source tool that allows you to extract form data from a PDF. It includes a command-line example tool PrintFields that you would call as follows to print the desired field information:
org.apache.pdfbox.examples.interactive.form.PrintFields file.pdf
For other options, see this question.
As an alternative to the above workflow, maybe you could also use a digital signature web service that allows PDF form filling and export of the data to tables. Such as SignRequest, which allows you to create templates and later export the data of signed documents. (Not affiliated, just found this myself.)
I am trying to write a file with the format "id file_absolute_path", which basically lists all the files in a folder recursively and gives each listed file an identifier like 1, 2, 3, 4.
I can get the absolute path of the files recursively using the following command:
ls -d -1 $PWD/**/*/*
However, I am unable to add an identifier to the output of the ls command. I am sure this can be done using awk, but I can't seem to solve it.
Pipe the output through cat -n.
Assuming x is your command:
x | awk '{print NR, $0}'
will number the output lines
Two possible commands:
ls -d -1 $PWD/**/*/* | cat -n
ls -d -1 $PWD/**/*/* | nl
nl numbers the lines of its input.
I hope this helps too.
There is a tool named nl for that.
ls -la | nl
If you do ls -i, you'll get the inode number which is great as an id.
The only potential issue with using inodes is if your folder spans multiple file systems, as an inode is only guaranteed to be unique within a file system.
ls -d -1 $PWD/**/*/* | awk '{x = x + 1} {print x " " $0}'
If you've got a large document (500 pages+) in Postscript and want to add page numbers, does anyone know how to do this?
Based on rcs's proposed solution, I did the following:
Converted the document to example.pdf and ran pdflatex addpages, where addpages.tex reads:
\documentclass[8pt]{article}
\usepackage[final]{pdfpages}
\usepackage{fancyhdr}
\topmargin 70pt
\oddsidemargin 70pt
\pagestyle{fancy}
\rfoot{\Large\thepage}
\cfoot{}
\renewcommand {\headrulewidth}{0pt}
\renewcommand {\footrulewidth}{0pt}
\begin{document}
\includepdfset{pagecommand=\thispagestyle{fancy}}
\includepdf[fitpaper=true,scale=0.98,pages=-]{example.pdf}
% fitpaper & scale aren't always necessary - depends on the paper being submitted.
\end{document}
or alternatively, for two-sided pages (i.e. with the page number consistently on the outside):
\documentclass[8pt]{book}
\usepackage[final]{pdfpages}
\usepackage{fancyhdr}
\topmargin 70pt
\oddsidemargin 150pt
\evensidemargin -40pt
\pagestyle{fancy}
\fancyhead{}
\fancyfoot{}
\fancyfoot[LE,RO]{\Large\thepage}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrulewidth}{0pt}
\begin{document}
\includepdfset{pages=-,pagecommand=\thispagestyle{fancy}}
\includepdf{target.pdf}
\end{document}
Easy way to change header margins:
% set margins for headers, won't shrink included pdfs
% you can remove the topmargin/oddsidemargin/evensidemargin lines
\usepackage[margin=1in,includehead,includefoot]{geometry}
You can simply use pspdftool (http://sourceforge.net/projects/pspdftool) in this way:
pspdftool 'number(x=-1pt,y=-1pt,start=1,size=10)' input.pdf output.pdf
See these two examples (unnumbered and numbered PDF with pspdftool):
unnumbered pdf: http://ge.tt/7ctUFfj2
numbered pdf: http://ge.tt/7ctUFfj2
with this as the first command-line argument:
number(start=1, size=40, x=297.5 pt, y=10 pt)
I used to add page numbers to my pdf using latex like in the accepted answer.
Now I found an easier way:
Use enscript to create empty pages with a header containing the page number, and then use pdftk with the multistamp option to put the header on your file.
This bash script expects the pdf file as its only parameter:
#!/bin/bash
input="$1"
output="${1%.pdf}-header.pdf"
pagenum=$(pdftk "$input" dump_data | grep "NumberOfPages" | cut -d":" -f2)
enscript -L1 --header='||Page $% of $=' --output - < <(for i in $(seq "$pagenum"); do echo; done) | ps2pdf - | pdftk "$input" multistamp - output "$output"
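For example, assuming the script is saved as addpagenumbers.sh (a made-up name) and made executable, the following stamps document.pdf and writes document-header.pdf next to it:
chmod +x addpagenumbers.sh
./addpagenumbers.sh document.pdf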
I was looking for a postscript-only solution, using ghostscript. I needed this to merge multiple PDFs and put a counter on every page. The only solution I found was an old gs-devel posting, which I heavily simplified:
%!PS
% add page numbers at the document's bottom right (20 units spacing, hardcoded below)
% Note: Page dimensions are expressed in units of the default user space (72nds of an inch).
% inspired by https://www.ghostscript.com/pipermail/gs-devel/2005-May/006956.html
globaldict /MyPageCount 1 put % initialize page counter
% executed at the end of each page. Before calling the procedure, the interpreter
% pushes two integers on the operand stack:
% 1. a count of previous showpage executions for this device
% 2. a reason code indicating the circumstances under which this call is being made:
% 0: During showpage or (LanguageLevel 3) copypage
% 1: During copypage (LanguageLevel 2 only)
% 2: At device deactivation
% The procedure must return a boolean value specifying whether to transmit the page image to the
% physical output device.
<< /EndPage {
exch pop % remove showpage counter (unused)
0 eq dup { % only run and return true for showpage
/Helvetica 12 selectfont % select font and size for following operations
MyPageCount =string cvs % get page counter as string
dup % need it twice (width determination and actual show)
stringwidth pop % get width of page counter string ...
currentpagedevice /PageSize get 0 get % get width from PageSize on stack
exch sub 20 sub % pagewidth - stringwidth - some extra space
20 moveto % move to calculated x and y=20 (0/0 is the bottom left corner)
show % finally show the page counter
globaldict /MyPageCount MyPageCount 1 add put % increment page counter
} if
} bind >> setpagedevice
If you save this to a file called pagecount.ps you can use it on command line like this:
gs \
-dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-sOutputFile=/path/to/merged.pdf \
-f pagecount.ps -f input1.pdf -f input2.pdf
Note that pagecount.ps must be given first (technically, right before the input file which the page counting should start with).
If you don't want to use an extra .ps file, you can also use a minimized form like this:
gs \
-dBATCH -dNOPAUSE \
-sDEVICE=pdfwrite -dPDFSETTINGS=/prepress \
-sOutputFile=/path/to/merged.pdf \
-c 'globaldict /MyPageCount 1 put << /EndPage {exch pop 0 eq dup {/Helvetica 12 selectfont MyPageCount =string cvs dup stringwidth pop currentpagedevice /PageSize get 0 get exch sub 20 sub 20 moveto show globaldict /MyPageCount MyPageCount 1 add put } if } bind >> setpagedevice' \
-f input1.pdf -f input2.pdf
Depending on your input, you may have to use gsave/grestore at the beginning/end of the if block.
This might be a solution (see the sketch after these steps):
convert postscript to pdf using ps2pdf
create a LaTeX file and insert the pages using the pdfpages package (\includepdf)
use pagecommand={\thispagestyle{plain}} or something from the fancyhdr package in the arguments of \includepdf
if postscript output is required, convert the pdflatex output back to postscript via pdf2ps
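A rough sketch of that round trip, assuming a wrapper like the addpages.tex shown earlier in this thread and original.ps as the input:
ps2pdf original.ps example.pdf
pdflatex addpages.tex
pdf2ps addpages.pdf numbered.ps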
Further to captaincomic's solution, I've extended it to support starting the page numbering at any page.
Requires enscript, pdftk 1.43 or greater, and pdfjam (for the pdfjoin utility).
#!/bin/bash
input="$1"
count=$2
blank=$((count - 1))
output="${1%.pdf}-header.pdf"
pagenum=$(pdftk "$input" dump_data | grep "NumberOfPages" | cut -d":" -f2)
(for i in $(seq "$blank"); do echo; done) | enscript -L1 -B --output - | ps2pdf - > /tmp/pa$$.pdf
(for i in $(seq "$pagenum"); do echo; done) | enscript -a ${count}- -L1 -F Helvetica#10 --header='||Page $% of $=' --output - | ps2pdf - > /tmp/pb$$.pdf
pdfjoin --paper letter --outfile /tmp/join$$.pdf /tmp/pa$$.pdf /tmp/pb$$.pdf &>/dev/null
cat /tmp/join$$.pdf | pdftk "$input" multistamp - output "$output"
rm /tmp/pa$$.pdf
rm /tmp/pb$$.pdf
rm /tmp/join$$.pdf
For example, place this in /usr/local/bin/pagestamp.sh and execute it like:
pagestamp.sh doc.pdf 3
This will start the page numbering at page 3, which is useful when you have cover sheets, title pages, a table of contents, etc.
The unfortunate thing is that enscript's --footer option is broken, so you cannot get the page numbering at the bottom using this method.
I liked the idea of using pspdftool (man page), but what I was after was a "page x out of y" format and a font style to match the rest of the page.
To find out about the font names used in the document:
$ strings input.pdf | grep Font
To get the number of pages:
$ pdfinfo input.pdf | grep "Pages:" | tr -s ' ' | cut -d" " -f2
Glue it together with a few pspdftool commands:
$ in=input.pdf; \
out=output.pdf; \
indent=30; \
pageNumberIndent=49; \
pageCountIndent=56; \
font=LiberationSerif-Italic; \
fontSize=9; \
bottomMargin=40; \
pageCount=`pdfinfo $in | grep "Pages:" | tr -s ' ' | cut -d" " -f2`; \
pspdftool "number(x=$pageNumberIndent pt, y=$bottomMargin pt, start=1, size=$fontSize, font=\"$font\")" $in tmp.pdf; \
pspdftool "text(x=$indent pt, y=$bottomMargin pt, size=$fontSize, font=\"$font\", text=\"page \")" tmp.pdf tmp.pdf; \
pspdftool "text(x=$pageCountIndent pt, y=$bottomMargin pt, size=$fontSize, font=\"$font\", text=\"out of $pageCount\")" tmp.pdf $out; \
rm tmp.pdf;
Here is the result:
Oh, it's been a long time since I used PostScript, but a quick dip into the blue book will tell you :) www-cdf.fnal.gov/offline/PostScript/BLUEBOOK.PDF
On the other hand, Adobe Acrobat and a bit of javascript would also do wonders ;)
Alternatively, I did find this: http://www.ghostscript.com/pipermail/gs-devel/2005-May/006956.html, which seems to fit the bill (I didn't try it)
You can use the free and open source pdftools to add page numbers to a PDF file with a single command line.
The command line you could use is (on GNU/Linux you have to escape the $ sign in the shell, on Windows it is not necessary):
pdftools.py --input-file ./input/wikipedia_algorithm.pdf --output ./output/addtext.pdf --text "\$page/\$pages" br 1 1 --overwrite
Regarding the --text option:
The first parameter is the text to add. Some placeholders are available. $page stands for the current page number, while $pages stands for the total number of pages in the PDF file. Thus the option so formulated would add something like "1/10" for the first page of a 10-page PDF document, and so on for the following pages
The second parameter is the anchor point of the text box. "br" will position the bottom right corner of the text box
The third parameter is the horizontal position of the anchor point of the text box as a percentage of the page width. Must be a number between 0 and 1, with the dot . separating decimals
The fourth parameter option is the vertical position of the anchor point on the text box as a percentage of the page height. Must be a number between 0 and 1, with the dot . separating decimals
Disclaimer: I'm the author of pdftools
I am assuming you are looking for a PS-based solution. There is no page-level operator in PS that will allow you to do this. You need to add a footer-sort of thingy in the PageSetup section for each page. Any scripting language should be able to help you along.
I tried pspdftool (http://sourceforge.net/projects/pspdftool).
I eventually got it to work, but at first I got this error:
pspdftool: xreftable read error
The source file was created with pdfjoin from pdfjam, and contained a bunch of scans from my Epson Workforce as well as generated tag pages. I couldn't figure out a way to fix the xref table, so I converted to ps with pdf2ps and back to pdf with ps2pdf. Then I could use this to get nice page numbers on the bottom right corner:
pspdftool 'number(start=1, size=20, x=550 pt, y=10 pt)' input.pdf output.pdf
Unfortunately, it means that any text-searchable pages are no longer searchable because the text was rasterized in the ps conversion. Fortunately, in my case it doesn't matter.
Is there any way to fix or empty the xref table of a pdf file without losing what pages are searchable?
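One thing that may be worth trying before resorting to the pdf2ps/ps2pdf round trip: qpdf can often repair a damaged xref table simply by rewriting the file, without rasterizing anything (a suggestion I have not verified on this particular file):
qpdf broken.pdf repaired.pdf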
I took captaincomic's solution and added support for filenames containing spaces, plus some more information about the progress:
#!/bin/bash
clear
echo
echo This script adds page numbers to a given .pdf file.
echo
echo This script needs the packages pdftk and enscript.
echo If they are not installed, the script will fail.
echo Use the command sudo apt-get install pdftk enscript
echo to install them.
echo
input="$1"
output="${1%.pdf}-header.pdf"
echo input file is $input
echo output file will be $output
echo
pagenum=$(pdftk "$input" dump_data | grep "NumberOfPages" | cut -d":" -f2)
enscript -L1 --header='||Page $% of $=' --output - < <(for i in $(seq "$pagenum"); do echo; done) | ps2pdf - | pdftk "$input" multistamp - output "$output"
echo done.
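A hypothetical invocation with a filename containing spaces (the quotes matter; the script name is made up):
./numberpages.sh "Annual Report 2016.pdf"
This writes the stamped copy to Annual Report 2016-header.pdf.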
I wrote the following shell script to solve this for LaTeX beamer style slides produced with inkscape (I pdftk cat the slides together into the final presentation PDF & then add slide numbers using the script below):
#!/bin/sh
# create working directory
tmpdir=$(mktemp --directory)
# read un-numbered beamer slides PDF from STDIN & create temporary copy
cat > $tmpdir/input.pdf
# get total number of pages
pagenum=$(pdftk $tmpdir/input.pdf dump_data | awk '/NumberOfPages/{print $NF}')
# generate latex beamer document with the desired number of empty but numbered slides
printf '%s' '
\documentclass{beamer}
\usenavigationsymbolstemplate{}
\setbeamertemplate{footline}[frame number]
\usepackage{forloop}
\begin{document}
\newcounter{thepage}
\forloop{thepage}{0}{\value{thepage} < '$pagenum'}{
\begin{frame}
\end{frame}
}
\end{document}
' > $tmpdir/numbers.tex
# compile latex file into PDF (2nd run needed for total number of pages) & redirect output to STDERR
pdflatex -output-directory=$tmpdir numbers.tex >&2 && pdflatex -output-directory=$tmpdir numbers.tex >&2
# add empty numbered PDF slides as background to (transparent background) input slides (page by
# page) & write results to STDOUT
pdftk $tmpdir/input.pdf multibackground $tmpdir/numbers.pdf output -
# remove temporary working directory with all intermediate files
rm -r $tmpdir >&2
The script reads STDIN & writes STDOUT printing diagnostic pdflatex output to STDERR.
So just copy-paste the above code in a text file, say enumerate_slides.sh, make it executable (chmod +x enumerate_slides.sh) & call it like this:
./enumerate_slides.sh < input.pdf > output.pdf [2>/dev/null]
It should be easy to adjust this to any other kind of document by adjusting the LaTeX template to use the proper documentclass, paper size & style options.
edit:
I replaced echo with $(which echo) since Ubuntu symlinks /bin/sh to dash, whose built-in echo interprets escape sequences by default and does not provide the -E option to override this behaviour. Note that alternatively you could escape every \ in the LaTeX template as \\.
edit:
I replaced $(which echo) by printf '%s' since in zsh, which echo returns echo: shell built-in command instead of /bin/echo.
See this question for details why I decided to use printf in the end.
Maybe pstops (part of psutils) can be used for this?
I have used LibreOffice Calc for this. Adding a page number field is easy using Insert->Field->Page Number. And then you can copy-and-paste this field to other pages; fortunately the position is not changed and the copy-and-paste can be done quickly with down arrow key and Ctrl+V. Worked for me for a 30 page article. Maybe prone to errors for a 500+ one!