Merge certain pdf pages in one pdf with pdftk

I have some pdf files
Lettera_Contributi_201701-1.pdf
Lettera_Contributi_201701-2.pdf
Lettera_Contributi_201701-3.pdf
so on...
and I'd like to merge only their 2nd pages into one pdf file. I've tried the following pdftk command on the list of files:
pdftk *.pdf cat 2 output test.pdf
but the result I get in test.pdf is just the first file's 2nd page.
Any ideas?
$ pdftk *.pdf cat 2 output test.pdf verbose
Command Line Data is valid.
Input PDF Filenames & Passwords in Order
( <filename>[, <password>] )
Lettera_Contributi_201701-1.pdf
Lettera_Contributi_201701-2.pdf
Lettera_Contributi_201701-3.pdf
Lettera_Contributi_201701-4.pdf
Lettera_Contributi_201701-5.pdf
Lettera_Contributi_201701-6.pdf
The operation to be performed:
cat - Catenate given page ranges into a new PDF.
The output file will be named:
test.pdf
Output PDF encryption settings:
Output PDF will not be encrypted.
No compression or uncompression being performed on output.
Creating Output ...
Adding page 2 X0X from Lettera_Contributi_201701-1.pdf

You can do it in two steps using 'find':
1) find all source PDFs in the current folder and execute 'pdftk' on each of them:
find . -name \*pdf -exec pdftk A={} cat A2 output {}_2 \;
(The above command finds all files whose names end with "pdf" and runs the command given after -exec on each one. The braces {} are replaced with the name of each file found.)
You'll get a set of new PDFs, each containing only the second page of its source. They will be named like: original_filename.pdf_2
e.g.
file1.pdf_2
file2.pdf_2
file3.pdf_2
2) now you can merge all the new PDFs (leave the * unescaped so that the shell expands it):
pdftk *pdf_2 cat output out.pdf
You will get out.pdf containing the second pages of all the original PDFs.
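The two steps can also be collapsed into a single pdftk call by giving every input its own handle. A page number without a handle is taken from the first input PDF only, which is why the original command returned a single page. The following is a minimal sketch, assuming at most 26 inputs (one letter per handle) and file names without whitespace. second_pages_cmd is a hypothetical helper name, and it only prints the command so that it can be inspected before it is run:

```shell
# Print one pdftk command that takes page 2 from every input file.
# Usage: second_pages_cmd OUTPUT INPUT...
second_pages_cmd() {
    out=$1; shift
    # Assumption: one single-letter handle per input, so at most 26 files.
    [ "$#" -le 26 ] || { echo "at most 26 input files" >&2; return 1; }
    inputs="" pages=""
    rest=ABCDEFGHIJKLMNOPQRSTUVWXYZ
    for f in "$@"; do
        h=${rest%"${rest#?}"}      # first remaining letter: A, B, C, ...
        rest=${rest#?}             # drop that letter from the pool
        inputs="$inputs $h=$f"     # e.g. A=first.pdf
        pages="$pages ${h}2"       # e.g. A2 = page 2 of handle A
    done
    echo "pdftk$inputs cat$pages output $out"
}
```

For example, second_pages_cmd test.pdf Lettera_Contributi_201701-*.pdf prints the full command line. Once it looks right, it can be piped to sh to execute it.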

Related

Print multiple PDF page ranges

I have a PDF with 200+ editable pages and need to hard-print them (print to PDF) into smaller PDF files (i.e. pages 1-2, 3-8, 9, 10-11, 12-14, etc.).
Is there a way to automate this since I do this exercise each month? Right now I have to manually print each sub section one at a time.

You can use Ghostscript to copy a range of pages from a PDF file to another.
For example, to write pages 3-8 from input.pdf to output.pdf you could run the following from the command prompt, using the command line options to specify the first and last page to process.
gswin64c.exe -sDEVICE=pdfwrite -dFirstPage=3 -dLastPage=8 -o output.pdf input.pdf
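Since the same set of ranges is needed every month, the command above can be generated once per range. This is a minimal sketch, not a tested workflow: gs_range_cmds and the pages_<range>.pdf output names are hypothetical, and the binary is assumed to be called gs (on Windows it is gswin64c.exe, as above). The function only prints the commands:

```shell
# Print one Ghostscript command per page range.
# Usage: gs_range_cmds INPUT RANGE...   (RANGE is "3-8" or a single page "9")
gs_range_cmds() {
    input=$1; shift
    for range in "$@"; do
        first=${range%-*}      # text before the dash (whole value if no dash)
        last=${range#*-}       # text after the dash (whole value if no dash)
        echo "gs -sDEVICE=pdfwrite -dFirstPage=$first -dLastPage=$last -o pages_$range.pdf $input"
    done
}
```

Calling gs_range_cmds input.pdf 1-2 3-8 9 10-11 12-14 prints five commands, one per output file. Pipe the output to sh to run them.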

Insert a blank page after each file/record while merging using pdftk

I have a job which creates N pdf files in a folder, and I am merging these files using pdftk. Usually each pdf file has two pages, but sometimes a pdf may run to 3 pages, so I need to insert a blank page after each pdf so that I can tell them apart.
Is there a way to achieve this?
Currently I am using the following script to merge the pdfs.
`
@echo off
pdftk *.pdf cat output Avinash.pdf
ren Avinash.pdf Avinash.doc
del *.pdf
ren Avinash.doc Avinash.pdf
`

How to get the hidden text layout that tesseract creates for pdf files?

I don't have much experience with ocr. Here's what I try:
tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf
The result is a perfectly structured hidden text layout - the words are on their exact places when searching the pdf.
My question is: can I get this layout as a file (hocr or html)?
(Config parameters preferred, not API.)
What I've tried:
tesseract -l eng -psm 1 image_str007_0001.jpg output hocr
and
hocr2pdf -i image_str007_001 -o output.pdf < output.hocr
In the file output.pdf the words are badly misplaced when searching through the text. Is the second tesseract command (the hocr one) not the right way to create the tesseract hocr layout file, or does the hocr2pdf app not create the pdf correctly?

How can I drop metadata fields (e.g., PageLabel fields) from PDFs?

I have used pdftk to change the "Info" metadata associated with a PDF. I currently have several PDFs with extraneous page labels and I cannot figure how to drop them. This is what I am currently doing:
$ pdftk example_orig.pdf dump_data output page_labels.orig
$ grep -v PageLabel page_labels.orig > page_labels.new
$ pdftk example_orig.pdf update_info page_labels.new output example_new.pdf
This does not remove the PageLabel* metadata which can be verified with:
$ pdftk example_orig.pdf dump_data | grep PageLabel
How can I programmatically remove this metadata from the PDF? It would be nice to do this with pdftk, but if there is another tool or way to do this on GNU/Linux, that would also work for me.
I need this because I am using LaTeX Beamer to generate presentations with the \setbeameroption{show notes on second screen} option, which generates a double-width PDF for showing notes on a second screen. Unfortunately, there seems to be a bug in pgfpages which results in incorrect and extraneous PageLabels in these files (example). If I generate a slides-only PDF, it generates the correct PageLabels (example). Since I can generate a correct set of PageLabels, one solution would be to replace the page labels in the first example with those in the second. That said, since there are extra page labels in the first example, I would need to remove them first.

Using a text editor to remove PDF metadata
If it is the first time you edit a PDF, make a backup copy first.
Open your PDF with a text editor that can handle binary blobs. vim -b will be fine.
Locate the /Info dictionary. Overwrite all the entries you do not want any more completely with blanks (an entry consists of /Key names plus the (some values) following them).
Be careful not to use more spaces than there were characters initially. Otherwise your xref table (the ToC of PDF objects) will be invalidated, and some viewers will report the PDF as corrupted.
For additional measure, locate the /XML string in your PDF. It should show you where your XMP/XML metadata section is (not all PDFs have them). Locate all the key values (not the <something keys>!) in there which you want to remove. Again, just overwrite them with blanks and be careful not to change the total length (neither longer, nor shorter).
In case your PDF does not make the /Info dictionary accessible, transform it with the help of qpdf.
Use this command:
qpdf --qdf --object-streams=disable orig.pdf qdf---orig.pdf
Apply the procedure outlined above. (The qdf---orig.pdf should now be much better suited for editing in a text editor.)
Re-compact your edited file:
qpdf qdf---orig.pdf edited---orig.pdf
Done! Enjoy your edited---orig.pdf. Check if it has all the data removed:
pdfinfo -meta edited---orig.pdf
Update
After looking at the sample PDF files provided, it became clear to me that the /PageLabel key is not part of the /Info dictionary (PDF's Document Information Dictionary), but of the /Root object.
That's probably one reason why pdftk was unable to update it with the method the OP described.
The other reason is the following: the PDF which the OP quoted as containing the correct page labels does in fact contain incorrect ones!
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 2
3 | 2
4 | 2
5 | 2
6 | 4
The other PDF (which supposedly contains extraneous page labels) is incorrect in a different way:
Logical Page No. | Page Label
-----------------+------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
6 | 4
My original advice about how to manually edit the classical metadata of a PDF remains valid. For the case of editing page labels you can apply the same method with a slight variation.
In the case of the OP's example files, a complication comes into play: the /Root object is not directly accessible, because it is hidden inside a compressed object stream (PDF object type /ObjStm). That means one has to decompress it with the help of qpdf first:
Use qpdf:
qpdf --qdf --object-streams=disable example_presentation-NOTES.pdf q-notes.pdf
Open the resulting file in binary mode with vim:
vim -b q-notes.pdf
Locate the 1 0 obj marker for the beginning of the /Root object, containing a dictionary named /PageLabels.
(a) To disable page labels altogether, just replace the /PageLabels string by /Pagelabels, using a lowercase 'l' (PDF is case sensitive, and will no longer recognize the keyword; you yourself could at some other time restore the original version should you need it.)
(b) To edit the page labels, first see how the consecutive labels for pages 1--6 are being referred to as
<feff0031>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
(These values are in BOM-marked hex, meaning 1, 2, 2, 2, 3, 4...)
Edit these values to read:
<feff0031>
[....]
<feff0032>
[....]
<feff0033>
[....]
<feff0034>
[....]
<feff0035>
[....]
<feff0036>
Save the file and run qpdf again in order to re-compress the PDF:
qpdf q-notes.pdf notes.pdf
These now hopefully are the page labels the OP is looking for....
Since the OP seems to be familiar with editing the output of pdftk's dump_data, he can possibly edit that output and use update_info to apply the fix to the PDF without needing to resort to qpdf and vim.
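To check which label a BOM-marked hex string such as <feff0031> stands for without decoding it by hand, a small shell function is enough. This is a sketch under stated assumptions: decode_label is a hypothetical helper, it expects the lowercase feff prefix shown above, and it handles ASCII labels only (UTF-16 code units 0000 to 007f), rejecting anything else:

```shell
# Decode a PDF hex string holding UTF-16BE text, e.g. <feff0031>, to ASCII.
decode_label() {
    hex=${1#"<"}; hex=${hex%">"}   # strip the angle brackets
    hex=${hex#feff}                # strip the byte order mark
    out=""
    while [ -n "$hex" ]; do
        unit=${hex%"${hex#????}"}  # next 4 hex digits = one UTF-16 code unit
        case $unit in
            00[0-7][0-9a-fA-F]) ;; # assumption: ASCII range only
            *) echo "unsupported code unit: $hex" >&2; return 1 ;;
        esac
        hex=${hex#????}
        # hex code unit -> octal escape -> character
        out="$out$(printf "\\$(printf '%03o' "0x$unit")")"
    done
    echo "$out"
}
```

Running decode_label on each hex string found in the /PageLabels dictionary, before and after editing, shows whether the labels now count up as intended.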
Update 2:
User @Iserni posted a very good, short and working answer, which limits itself to one command the OP already seems to be familiar with, pdftk, plus sed. It does not need a text editor to open the PDF, and it does not introduce an additional utility, qpdf, like my answer did.
Unfortunately @Iserni deleted it again after a comment of mine. I think his answer deserves to get the bounty, and I call on you to vote to "undelete" his answer!
So temporarily, I'll include a copy of @Iserni's answer here, until his is undeleted again:
Not sure if I correctly understood the problem. You can try with a butcher's solution: brute force replace the /PageLabels block with a different one which will not be recognized.
# Get a readable/writable PDF
pdftk file1.pdf output temp.pdf uncompress
# Mangle the PDF. Keep same length
sed -e 's|^/PageLabels|/BageLapels|g' < temp.pdf > mangled.pdf
# Recompress
pdftk mangled.pdf output final.pdf compress
# Remove temp file
rm -f temp.pdf mangled.pdf
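The "keep same length" requirement of the mangling step can be checked mechanically. The sketch below wraps the same sed substitution in a hypothetical helper, mangle_key, and fails if the byte count changed. Forcing LC_ALL=C so that sed treats the PDF as plain bytes is an assumption of this sketch, not part of the original answer:

```shell
# Apply the same-length key replacement and verify the size is unchanged.
# Usage: mangle_key INPUT OUTPUT
mangle_key() {
    in=$1 out=$2
    # LC_ALL=C: treat the file as bytes, not as text in the current locale
    LC_ALL=C sed -e 's|^/PageLabels|/BageLapels|' < "$in" > "$out"
    # The replacement has the same length as the original key, so the two
    # files must have the same byte count; fail otherwise.
    [ $(wc -c < "$in") -eq $(wc -c < "$out") ]
}
```

It would replace the sed line above, for example mangle_key temp.pdf mangled.pdf || echo "size changed", with the pdftk uncompress and compress steps left as they are.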

Comparison of two pdf files

I need to compare the contents of two nearly identical files and highlight the differing portions in the corresponding pdf file. I am using pdfbox. Please help me with at least the logic.

If you prefer a tool with a GUI, you could try this one: diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or should be buildable) on all platforms where Qt runs.

You can do the same thing with a shell script on Linux. The script wraps 3 components:
ImageMagick's compare command
the pdftk utility
Ghostscript
It's rather easy to translate this into a .bat Batch file for DOS/Windows...
Here are the building blocks:
pdftk
Use this command to split multipage PDF files into multiple singlepage PDFs:
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf burst output somewhere/2ndpdf_page_%03d.pdf
compare
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/firstpdf_page_001.pdf \
somewhere/2ndpdf_page_001.pdf \
-compose src \
somewhereelse/diff_page_001.pdf
Note, that compare is part of ImageMagick. But for PDF processing it needs Ghostscript as a 'delegate', because it cannot do so natively itself.
Once more, pdftk
Now you can again concatenate your "diff" PDF pages with pdftk:
pdftk \
somewhereelse/diff_page_*.pdf \
cat \
output somewhereelse/diff_allpages.pdf
Ghostscript
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output, so its PDF output does not work well for MD5-hash-based file comparisons.
If you want to automatically discover all cases which consist of purely white pages (that means: there are no visible differences in your input pages), you could also convert to a meta-data free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:
gs \
-o diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
diff_page_001.pdf
md5sum diff_page_001.bmp
Just create an all-white BMP page with its MD5sum (for reference) like this:
gs \
-o reference-white-page.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
-c "showpage quit"
md5sum reference-white-page.bmp
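Once the reference page and the per-page diff bitmaps exist, picking out the pages without visible differences is a short loop. This sketch swaps md5sum for a direct byte comparison with cmp, which gives the same answer for metadata-free bitmaps. unchanged_pages is a hypothetical helper name:

```shell
# Print every diff bitmap that is byte-identical to the all-white reference.
# Usage: unchanged_pages REFERENCE.bmp DIFF.bmp...
unchanged_pages() {
    ref=$1; shift
    for bmp in "$@"; do
        # cmp -s is silent and returns 0 only when the files are identical
        if cmp -s "$ref" "$bmp"; then
            echo "$bmp"
        fi
    done
}
```

For example, unchanged_pages reference-white-page.bmp diff_page_*.bmp lists the pages where the two PDFs render identically. Every page that is not listed needs a closer look.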

I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");
$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);
if($result[1] > 0.0){
// Files are DIFFERENT
}
else{
// Files are IDENTICAL
}
$im1->destroy();
$im2->destroy();
Of course, you need to install the ImageMagick bindings first:
sudo apt-get install php-imagick # Ubuntu/Debian (php5-imagick on older PHP 5 releases)

I have come up with a jar using apache pdfbox to compare pdf files - this can compare pixel by pixel & highlight the differences.
Check my blog : http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ for example & download.
To get page count
import com.taguru.utility.PDFUtil;
PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count
To get page content as plain text
//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");
// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);
// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);
To extract attached images from PDF
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");
// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);
// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);
To store PDF pages as images
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.savePdfAsImage("c:/sample.pdf");
To compare PDF files in text mode (faster – But it does not compare the format, images etc in the PDF)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
To compare PDF files in Binary mode (slower – compares PDF documents pixel by pixel – highlights pdf difference & store the result as image)
String file1="c:/files/doc1.pdf";
String file2="c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);
//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);

To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using homebrew, and run it.
The --view option didn't work for me, but the --output-diff did.
