PDF extract specific pages & merge with new filename

I have 2 pdf files (templates) from which I need to extract 1 page each and save as a combined pdf. Each pdf has a filename with a different 3 letter location indicator (e.g. LOC) for each of the agreement pdfs. I'm looking for a way to batch process these and save with the location indicator in the new combined filename. There are approx 500 locations.
Example files:
Agreement1_LOC.pdf - extract pg 3
Agreement2_LOC.pdf - extract pg 1
Agreement1_AAA.pdf - extract pg 3
Agreement2_AAA.pdf - extract pg 1
Save as LOC_combined.pdf (in same or new dir)
I'm looking for a way to batch process or loop through a directory. If it's easier, I have a list of all the filenames in .csv. I'm sure it could be done in python, powershell, or even batch file but I'm not very familiar with these. Trying to learn with real life example.
Using PDFtk pro, I can do it one at a time.
pdftk A=Agreement1_LOC.pdf B=Agreement2_LOC.pdf cat A3 B1 output LOC_combined.pdf
I found batch files for merging but none that save with portion of original filenames.
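A minimal Python sketch of the batch loop described above, assuming the files really are named Agreement1_LOC.pdf and Agreement2_LOC.pdf with a three-letter code, and that pdftk is on the PATH. The names `NAME_RE` and `build_commands` are hypothetical; the pdftk arguments are the same ones used in the one-at-a-time command.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern: "Agreement1_LOC.pdf" / "Agreement2_LOC.pdf", LOC = 3 letters.
NAME_RE = re.compile(r"^Agreement([12])_([A-Za-z]{3})\.pdf$")


def build_commands(filenames, out_dir="."):
    """Return one pdftk argument list per location that has both agreements."""
    by_loc = defaultdict(dict)
    for name in filenames:
        match = NAME_RE.match(Path(name).name)
        if match:
            by_loc[match.group(2)][match.group(1)] = name
    commands = []
    for loc in sorted(by_loc):
        pair = by_loc[loc]
        if "1" in pair and "2" in pair:  # skip locations with a missing file
            out = str(Path(out_dir) / f"{loc}_combined.pdf")
            commands.append(["pdftk", f"A={pair['1']}", f"B={pair['2']}",
                             "cat", "A3", "B1", "output", out])
    return commands


if __name__ == "__main__":
    # Process every matching PDF in the current directory.
    found = [str(p) for p in Path(".").glob("Agreement*_*.pdf")]
    for cmd in build_commands(found):
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it easy to print and inspect the commands for a few locations before processing all 500.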

Related

Why can't I convert certain TIF files that I received in a split archive?

I received a large number of document files, where each document has its own split archive for each page (i.e. file1.001,file1.002,file2.001,file3.001). These are meant to be TIF files that can easily be combined and converted into PDF documents.
However, some of these files will not convert through imagemagick. Some can simply be converted using a different program, which works fine. There are some files where this doesn't work. I tried converting them to .jpg, then to tif, but they won't convert to .jpg. Things got weird when I converted them to .png, as some of these files would have multiple output files associated with them.
This is hard to explain, but I'll try to give an example: file1.001 and file1.002 both show the same image when converted to tif and opened. However, when either of the tif documents is converted to .png, two .png files are created. One has the original page, but the other has a second page of the document that I could not view previously.
What could be causing this weird behavior, and how can I convert these to pdf more reliably?
I also used BlueBeam Staple to convert the files, if that helps at all.
Edit:
I've verified I'm on the latest imagemagick release, and I've been using it through PHP to process files. I'm running Windows 10.
Also, here's some example files to play around with. The first TIF actually shows the second page, instead of the page I normally see when I open the file.
Edit 2: Sorry, I thought uploading the image would preserve the file type. Here's a link to some test samples
When I convert your tiff to png, I get two files using IM 7.1.0-10 Q16-HDRI or IM 6.9.12-25 Q16 both on Mac OSX Sierra.
magick -quiet 294944.tif x.png
Produces two PNG files (images not shown).
Is this not what you get or expect?
P.S.
What are the other two files: 327924.001 327924.002
If those are some kind of split TIFF, then it does not look like libtiff, which ImageMagick uses to read TIFFs, can handle them. I get errors when attempting to use identify on them.
You definitely have some issue with whatever attempted to write those tiffs.
instrument 294944 page 1 of 2 = G4 199 dpi sheet 2 of 2 294944.tif (25.17 x 17.53 inches)
instrument 294944 page 2 of 2 = G4 199 dpi sheet 1 of 2 294944.tif (24.12 x 17.63 inches)
instrument 327501 page 1 of 1 = UN 72 dpi sheet 1 of 1 327924.001 (124.78 x 93.86 inches)
instrument 327924 page 1 of 2 = G4 400 dpi sheet 1 of 2 327924.002 (23.80 x 17.53 inches)
instrument 327924 page 2 of 2 = G4 400 dpi sheet 2 of 2 327924.002 (23.84 x 17.41 inches)
Two are identified as CCITT Group 4 Fax Encoding which is common for TIFFs of this type.
TIFF is a multi-image format, so a multipage fax can be viewed as one file, or four CMYK colour plates can be sent as one image file, either overlaid as a single check print or printed one at a time for quality inking.
The extension .tif (or .tiff) is usually applied to files with one or more pages (even 400+ for a long novel).
Names like part001.tif, part002.tif are usually applied to groups of multiple pages, or to single sequential pages (part1.001.tif, part1.002.tif).
Unfortunately, you have a mix. The names seem to follow a convention in which the extension indicates the number of pages (002 = 2 pages), but the pages are stored in inconsistent order, so you need to check which order was used for each file.
Also, the internal instrument number does NOT always reflect the filename: 327924.001 contains instrument 327501 (perhaps a transfer of interest?).
IN ADDITION, you have a mix of compression methods and resolutions, so you cannot be sure of the correct scale to apply.
The best way to resolve this is to decide how you want the pages regrouped and sequenced, use the correct scale for each page or group of pages, and then recombine them as desired into PDF.
For a large number of files, it would help to tabulate the pages by number, scale, size, compression, etc., and then process them in identical groups before reordering and merging.
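As a rough illustration of the tabulate-then-group idea, here is a small Python sketch. The records are copied by hand from the listing above; the tuple layout and the name `group_pages` are assumptions made for this example, and for a real batch the records would have to be collected from the files themselves.

```python
from collections import defaultdict

# Page records copied from the listing above: (file, sheet, compression, dpi).
PAGES = [
    ("294944.tif", 2, "G4", 199),
    ("294944.tif", 1, "G4", 199),
    ("327924.001", 1, "UN", 72),
    ("327924.002", 1, "G4", 400),
    ("327924.002", 2, "G4", 400),
]


def group_pages(pages):
    """Group pages that share compression and resolution, sorted by file and sheet."""
    groups = defaultdict(list)
    for file_name, sheet, compression, dpi in pages:
        groups[(compression, dpi)].append((file_name, sheet))
    return {key: sorted(members) for key, members in groups.items()}


for (compression, dpi), members in group_pages(PAGES).items():
    print(compression, dpi, members)
```

Each group can then be converted with one set of scale settings before the pages are reordered and merged.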

PDF File Merge Based on Filename

I have large batches of pdf files that must be merged.
Folder1 FileName Explanation: invoice12-105767-1510781492.pdf - 105767 is the component that will match with a pdf filename in Folder2.
"invoice12-" First section of the filename. This can sometimes be "invoice11-" or "invoice6-", so merging based on character length became challenging. The "invoicexx-" prefixes are based on where in the system the file came from.
"105767" Second part of the filename. This is the key component for matching and merging. It is the filename in Folder2 that the invoice belongs with.
"-1510781492.pdf" Third part of the filename is a system generated unique ID, which can contain more or less characters.
Folder1:
invoice12-105767-1510781492.pdf
invoice12-105768-1510781484.pdf
invoice12-105769-1510781469.pdf
Folder2:
105767.pdf
105768.pdf
105769.pdf
OutputFolder:
For example, I don't want to merge all the files in both folders into one huge file. I need them merged based on the Folder2 filename (105767.pdf + invoice12-105767-1510781492.pdf), specifically in that order.
The final output should be three pdf files merged in order as follows:
105767.pdf + invoice12-105767-1510781492.pdf to make 1 file named 105767.pdf
105768.pdf + invoice12-105768-1510781484.pdf to make 1 file named 105768.pdf
105769.pdf + invoice12-105769-1510781469.pdf to make 1 file named 105769.pdf
I would appreciate any assistance with a way to automate this process. I merge over 800 files per day. This small automation would shave hours off my day and save my wrist from carpal tunnel.
I primarily use Mac OS 10.13.1. I have looked around in the Mac's "Automator" program and cannot figure out how to get it to do what I need. (I did figure out a great way to split files into single pages.)
I downloaded pdftk server (since that is Mac compatible) but cannot figure out whether this type of match and merge is possible with this program.
I have Adobe Acrobat DC Professional and it does not seem to have this match and merge function.
I am even open to other paid programs. I just need a fairly future-proof way of getting this mundane task done through automation on my Mac.
You can take a look at the APDFL library examples that are provided with sample code. These libraries are supported on Mac, but are not free.
https://dev.datalogics.com/adobe-pdf-library/sample-program-descriptions/c1samples/#mergedocuments
Here is a snippet of the code you would need to use:
APDFLDoc doc1 ( csInputFileName1.c_str(), true);
APDFLDoc doc2 ( csInputFileName2.c_str(), true);
// Insert doc2's pages into doc1.
// Here, we've stated PDLastPage, which adds the pages after the last page of the target.
// If we specify PDBeforeFirstPage instead, doc2's pages will be inserted at the head of doc1.
PDDocInsertPages ( doc1.getPDDoc(),
                   PDLastPage,
                   doc2.getPDDoc(),
                   0,
                   PDAllPages,
                   PDInsertAll,
                   NULL, NULL, NULL, NULL);
doc1.saveDoc ( csOutputFileName.c_str(), PDSaveFull | PDSaveLinearized);
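For the matching step itself, a small Python sketch may help. It assumes the invoice names always follow the invoice<NN>-<key>-<id>.pdf pattern from the question and that pdftk server is installed; `plan_merges`, `build_command`, and the folder names used as defaults are hypothetical. The output goes to a separate folder so the Folder2 originals are not overwritten.

```python
import re
from pathlib import Path

# Assumed pattern from the examples: "invoice<NN>-<key>-<id>.pdf".
INVOICE_RE = re.compile(r"^invoice\d+-(\d+)-\d+\.pdf$")


def plan_merges(folder1_names, folder2_names):
    """Return (folder2_file, [matching invoices]) for every key found in both folders."""
    invoices = {}
    for name in sorted(folder1_names):
        match = INVOICE_RE.match(name)
        if match:
            invoices.setdefault(match.group(1), []).append(name)
    plan = []
    for name in sorted(folder2_names):
        key = Path(name).stem  # "105767.pdf" -> "105767"
        if key in invoices:
            plan.append((name, invoices[key]))
    return plan


def build_command(entry, folder1="Folder1", folder2="Folder2", out_dir="OutputFolder"):
    """Build a pdftk call that puts the Folder2 file first, then its invoices."""
    base, matched = entry
    inputs = [str(Path(folder2) / base)] + [str(Path(folder1) / m) for m in matched]
    return ["pdftk", *inputs, "cat", "output", str(Path(out_dir) / base)]
```

Each command can be passed to `subprocess.run(cmd, check=True)`; the file lists would come from something like `os.listdir("Folder1")` and `os.listdir("Folder2")`.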

Sejda merging PDFs from CSV filelist names

I recently installed sejda-console for merging pdf files from the command line.
The names of the input pdf files are in a CSV file named filelist-inputs.csv like this:
./Temp/source/046032.pdf,./Temp/source/048155.pdf
./Temp/source/049278.pdf,./Temp/source/050818.pdf,./Temp/source/052962.pdf
./Temp/source/052962.pdf,./Temp/source/054117.pdf
I need one output pdf file for the first line of the CSV file list, another output pdf file for the second line, another for the third line, and so on.
I tried a command line like this:
~$ sejda-console merge -l filelist-inputs.csv -o ./Temp/target/merged[FILENUMBER####].pdf
But it only creates a unique file named literally merged[FILENUMBER####].pdf, when I want 3 files:
merged0001.pdf
merged0002.pdf
merged0003.pdf
I've simplified the problem, because I need to merge more than 3500 pdf files in 700 output files.
Sejda takes all the values in the CSV and generates a single merged PDF. There isn't any option or setting in Sejda to achieve what you asked; you will need some scripting to loop through the CSV lines, create a CSV per line, and feed it to Sejda.
The output file name merged[FILENUMBER####].pdf is literally used because the PDF merge task generates one output file and it expects an explicit output file name. Prefixes like [CURRENTPAGE] or [FILENUMBER] are valid when used as -p argument in tasks generating multiple output PDF files (split tasks etc).
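The per-line loop could be sketched in Python roughly as follows. This is only an outline under stated assumptions: sejda-console is on the PATH, and the `-l` and `-o` options are used exactly as in the question's command. The function name `merge_per_line` and its `run` parameter (which exists so the external call can be swapped out for a dry run) are hypothetical.

```python
import csv
import subprocess
import tempfile
from pathlib import Path


def merge_per_line(csv_path, target_dir, run=subprocess.run):
    """Run one sejda-console merge per CSV line and return the output paths."""
    with open(csv_path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]  # ignore blank lines
    outputs = []
    for number, row in enumerate(rows, start=1):
        output = Path(target_dir) / f"merged{number:04d}.pdf"
        # Write this line alone to a temporary one-line CSV file list.
        with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                         newline="") as tmp:
            csv.writer(tmp).writerow(row)
        try:
            run(["sejda-console", "merge", "-l", tmp.name, "-o", str(output)],
                check=True)
        finally:
            Path(tmp.name).unlink()  # remove the temporary list
        outputs.append(output)
    return outputs
```

Calling `merge_per_line("filelist-inputs.csv", "./Temp/target")` would then produce merged0001.pdf, merged0002.pdf, and so on, one per CSV line.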

Can you split a PDF 'package' into separate files with CF8 or CF9?

The cfpdf tag has lots of options but I can't seem to find one for splitting apart a PDF package into separate files which can be saved to the file system.
Is this possible?
There's not a direct command, but you can achieve what you want to do in very few lines of code by using action="merge", with the "pages" attribute. So if you wanted to take a 20-page PDF and create 20 separate files, you could use getInfo to get the number of pages in the input document, then loop from 1 to that number, and in that loop, do a merge from your input document to a new output document for each iteration, with pages="#currentPage#" (or whatever your loop counter is)

How to do mail merge on top of a PDF?

I often get a PDF from our designer (built in Adobe InDesign) which is supposed to be sent out to thousands of people.
I've got the list with all the people, and it's easy doing a mail merge in OpenOffice.org. However, OpenOffice.org doesn't support the advanced PDF. I just want to output some text onto each page and print it out.
Here's how I do it now: print out 6,000 copies of the PDF, then put all of them into the printer again and print the name, address and other information on top. But that's expensive.
Sadly, I can't convert the PDF to an image and use that in OpenOffice.org because it grinds the computer to a halt. It also takes an extremely long time to send this job to the printer.
So, is there an easy way to do this mail merge (preferably in Python) without paying for third party closed solutions?
Now I've made an account. I fixed it by using the ingenious pdftk.
In my quest I totally overlooked the "background" and "overlay" features. My solution was this:
pdftk names.pdf background boat_background.pdf output out.pdf
You can easily create names.pdf with Python reportlab or similar PDF-creation scripts. It's best to use code for that: creating 6k pages took several hours in LibreOffice/OpenOffice, while it took just a few seconds using Python.
You could probably look at a PDF library like iText. If you have some programming knowledge and a bit of time you could write some code that adds the contact information to the PDFs
There are two much simpler and cheaper solutions.
First, you can do your mail merge directly in InDesign using DataMerge. This is a utility added to InDesign way back in CS. You export or save your names in CSV format. Import the data into an InDesign template and then drop in your name, address and such fields in the layout. Press Go. It will create a new document with all the finished letters or you can go right to the printer.
OR, you can export your data to an XML file and create a dynamic layout using XML placeholders in InDesign.
The book A Designer's Guide to Adobe InDesign and XML will teach you how to do this, or you can check out the Lynda.com videos for Dynamic workflows with InDesign and XML.
Very easy to do.
If you want to create separate PDF files for the mail merge, you can run out one long PDF with all the names in one file, then do an Extract to Separate PDF files in Acrobat Pro itself.
If you cannot get the template in a format other than PDF, a simple ad-hoc solution would be to:
convert the PDF into an image
put the image in the background of your (OpenOffice.org) document
position mail merge fields on top of the image
do the mail merge and print
Probably the best way would be to generate another PDF with the missing text, and overlay one PDF over the other. A quick Google found this link showing how to do it in Acrobat, and I'm sure there are other methods as well.
http://forums.macrumors.com/showthread.php?t=508226
For a no-mess, no-fuss solution, use iText to simply add the text to the pdf. For example, you can do the following to add text to a pdf document once loaded:
PdfContentByte cb= ...;
cb.BeginText();
cb.SetFontAndSize(font, fontSize);
float x = ...;
float y = ...;
cb.SetTextMatrix(x, y);
cb.ShowText(fieldValue);
cb.EndText();
From there on, save it as a different file, and print it.
However, I've found that form fields are the way to go with pdf document generation from templates.
If you have a template with form fields (added with Adobe Acrobat), you have one of two choices :
Create a FDF file, which is essentially a list of values for the fields on the form. A FDF is a simple text document which references the original document so that when you open up the PDF, the document loads with the field values supplied by the FDF.
Alternatively, load the template with a library like iText / iTextSharp, fill the form fields manually, and save it as a separate pdf.
A sample FDF file looks like this (stolen from Planet PDF) :
%FDF-1.2
%âãÏÓ
1 0 obj
<<
/FDF
<<
/F(Example PDF Form.pdf)
/Fields[
<<
/T(myTextField)
/V(myTextField default value)
>>
]
>>
>>
endobj
trailer
<</Root 1 0 R>>
%%EOF
Because of the simple format and the small size of the FDF, this is the preferred approach, and the approach should work well in any language.
As for filling the fields programmatically, you can use iText in the following way :
PdfAcroForm acroForm = writer.AcroForm;
acroForm.Put(new PdfName(fieldInfo.Name), new PdfString(fieldInfo.Value));
What about using a variable data program such as XMPie for Adobe InDesign? It's a plug-in that references your list of people (I think it might have to be a list in Excel, though).
One easy way would be to create a fillable pdf form from the original document in Acrobat and do a mail merge with the form and a csv.
PDF mail merges are relatively easy to do in python and pdftk. Fdfgen (pip install fdfgen) is a python library that will create an fdf from a python array, so you can save the excel grid to a csv, make sure that the csv headers match the name of the pdf form field you want to fill with that column, and do something like
import csv
import os
import subprocess
from fdfgen import forge_fdf
PDF_FORM = 'path/to/form.pdf'
CSV_DATA = 'path/to/data.csv'
# Read every CSV row as a dict keyed by the header names (Python 3: text mode)
with open(CSV_DATA, newline='') as infile:
    rows = list(csv.DictReader(infile))
for row in rows:
    # Create fdf
    filename = row['filename']  # Construct filename
    fdf_data = [(k, v) for k, v in row.items()]
    fdf = forge_fdf(fdf_data_strings=fdf_data)
    with open(filename + '.fdf', 'wb') as fdf_file:
        fdf_file.write(fdf)
    # Use PDFTK to create filled, flattened, pdf file
    cmds = ['pdftk', PDF_FORM, 'fill_form', filename + '.fdf',
            'output', filename + '.pdf', 'flatten', 'dont_ask']
    process = subprocess.Popen(cmds, stdout=subprocess.PIPE)
    stdout, stderr = process.communicate()
    returncode = process.poll()
    # Remove the temporary fdf once the filled pdf has been written
    os.remove(filename + '.fdf')
I've encountered this problem enough to write my own free solution, PdfZero. PdfZero has a mail merge feature to merge spreadsheets with PDF forms. You will still need to create a PDF form, but you can upload the form and csv to PdfZero, select which form fields you want filled with which columns, create a naming convention for each filled pdf using the csv data if needed, and batch generate the filled PDFs.
DISCLAIMER: I wrote PdfZero
Someone asked for specifics. I didn't want to sully my top answer with it, because you can do it how you like (and just knowing pdftk is up to it should give people the idea).
But here's some scripts I used ages ago:
csv_to_pdf.py
#!/usr/bin/env python3
# This makes one PDF page per name in the CSV file
# csv_to_pdf.py <CSV_FILE>
import csv
import sys
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib.units import cm, mm
in_db = csv.reader(open(sys.argv[1], newline=""))
outname = sys.argv[1].replace("csv", "pdf")
pdf = Canvas(outname)
next(in_db)  # skip the first row (assumed to be the header)
i = 0
for rad in in_db:
    pdf.setFontSize(11)
    adr = rad[1]  # the address text is in the second column
    tekst = pdf.beginText(2*cm, 26*cm)
    for a in adr.split('\n'):
        if not a.strip():
            continue  # skip empty address lines
        if a[-1] == ',':
            a = a[:-1]  # drop a trailing comma
        tekst.textLine(a)
    pdf.drawText(tekst)
    pdf.showPage()  # one page per CSV row
    i += 1
    if i % 1000 == 0:
        print(i)  # progress indicator
pdf.save()
When you've run this, you have a file with thousands of pages, each with only a name on it. This is when you can background the fancy PDF under all of them:
pdftk <YOUR_NEW_PDF_FILE.pdf> background <DESIGNED_FILE.pdf> output <MERGED.pdf>
You can use InDesign's data merge function, or you can do what you've been doing with printing a portion of the job, and then printing the mail merge atop that with Word or Open Office.
But also look into finding a company that can do variable data offset printing or dynamic publishing. Might be a little more expensive up front but can save a bundle when it comes to time, testing, even packaging and mailing.
Disclaimer: I'm the author of this tool.
I ran into this issue enough times that I built a free online tool for it: https://pdfbatchfill.com/
It assumes a PDF form as a template and uses that along with CSV form data to generate a single PDF or individual PDFs in a zip file.