PDF File Merge Based on Filename - pdf

I have large batches of pdf files that must be merged.
Folder1 FileName Explaination: invoice12-105767-1510781492.pdf - 105767 is the component that will match with a pdf filename in Folder2.
"invoice12-" First section of the filename. This can sometimes be "invoice11-" or "invoice6-" so merging based on character length became challenging. The "invoicexx-" are based on where in the system the file came from.
"105767" Second part of the filename. This is the key component for matching and merging. this will be the filename in Folder2 it belongs with.
"-1510781492.pdf" Third part of the filename is a system generated unique ID, which can contain more or less characters.
Folder1:
invoice12-105767-1510781492.pdf
invoice12-105768-1510781484.pdf
invoice12-105769-1510781469.pdf
Folder2:
105767.pdf
105768.pdf
105769.pdf
OutputFolder:
Example I don't want to merge all the files in both folders into 1 huge file. I need them merged based on the Folder2 filename. (105767.pdf + invoice12-105767-1510781492.pdf) in that order specifically, also.
The final output should be three pdf files merged in order as follows:
105767.pdf + invoice12-105767-1510781492.pdf to make 1 file named 105767.pdf
105768.pdf + invoice12-105768-1510781484.pdf to make 1 file named 105768.pdf
105769.pdf + invoice12-105769-1510781469.pdf to make 1 file named 105769.pdf
I would appreciate any assistance with a way to automate this process. I merge over 800 files per day. This small automation would shave hours off my day and my wrist from carpel tunnel.
I primarily use Mac OS 10.13.1. I have looked around in Mac's "Automater" program and cannot figure out how to get it to do what I need. (I did figure out a great way to split files into single pages)
I downloaded pdftk server (since that is Mac compatible) but cannot figure out if this type of match and merge is capable with this program.
I have Adobe Acrobat DC Professional and it does not seem to have this match and merge function.
I am even open to other paid programs. I just need a fairly future-proof way of getting this mundane task done through automation on my Mac.

You can take a look at the APDFL library examples that are provided with sample code. These libraries are supported on Mac, but are not free.
https://dev.datalogics.com/adobe-pdf-library/sample-program-descriptions/c1samples/#mergedocuments
Here is a snippet of the code you would need to use:
APDFLDoc doc1 ( csInputFileName1.c_str(), true);
APDFLDoc doc2 ( csInputFileName2.c_str(), true);
// Insert doc2's pages into doc1.
// Here, we've stated PDLastPage, which adds the pages just before the last page of the target.
// If we specify PDBeforeFirstPage instead, doc2's pages will be inserted at the head of doc1.
PDDocInsertPages ( doc1.getPDDoc(),
PDLastPage,
doc2.getPDDoc(),
0,
PDAllPages,
PDInsertAll,
NULL, NULL, NULL, NULL);
doc1.saveDoc ( csOutputFileName.c_str(), PDSaveFull | PDSaveLinearized);

Related

PDF extract specific pages & merge with new filename

I have 2 pdf files (templates) from which I need to extract 1 page each and save as a combined pdf. Each pdf has a filename with a different 3 letter location indicator (e.g. LOC) for each of the agreement pdfs. I'm looking for a way to batch process these and save with the location indicator in the new combined filename. There are approx 500 locations.
Example files:
Agreement1_LOC.pdf - extract pg 3
Agreement2_LOC.pdf - extract pg 1
Agreement1_AAA.pdf - extract pg 3
Agreement2_AAA.pdf - extract pg 1
Save as LOC_combined.pdf (in same or new dir)
I'm looking for a way to batch process or loop through a directory. If it's easier, I have a list of all the filenames in .csv. I'm sure it could be done in python, powershell, or even batch file but I'm not very familiar with these. Trying to learn with real life example.
Using PDFtk pro, I can do it one at a time.
pdftk A=Agreement1_LOC.pdf B=Agreement2_LOC.pdf cat A3 B1 output LOC_combined.pdf
I found batch files for merging but none that save with portion of original filenames.

finding a corrupted part from the parts of a split archive

I have 7 files with extensions like xyz.rar.001 - xyz.rar.007 clearly they are parts of a single file. I have all the 7 parts. I join them using a file joiner into a single file xyz.rar and try to unrar them with WINRAR , it says that archive is corrupted It is clear that 1 or 2 parts are corrupted. IS THERE ANY WAY TO FIND THEM ? Please help I don't want to re download all of them NOTE- winrar can detect a corrupt part if the parts were splitted using winrar (with extensions like part1.rar , part2.rar etc. ) but not if they are named as rar.001
Parts .001 - .006 should have the same size. Check if there is a file with a different byte size.
Are there multiple files in the RAR or just the one? With multiple you could run a Test and see which is the first file to fail.
I think it's strange that there is a second tool used to split the RAR archive up. (e.g. HJSplit) This lets me think that .002 could be a RAR archive too. Try opening xyz.rar.001 with WinRAR and test/exctract. It happens more that RAR archives have the extension .001 instead of .rar. An example.
Naming your archives in WinRAR like this can be accomplished by putting "xyz.rar.001" as Archive name on the General tab and checking "Old style volume names" on the Advanced tab.
If I then join the files with HJSplit, I get one .rar file (that is corrupt). When I Test it, it says "Next volume is required". In the diagnostic messages I can see "The required volume is absent" and "CRC failed in X. The file is corrupt"
If there is one file stored inside the RAR and the RAR is indeed just chopped up into 7 pieces, there is no way of telling without additional files such as .sfv or .par2. (unless the RAR does not use compression: you can parse the underlying file for errors and calculate the part where it goes wrong)

Can you split a PDF 'package' into separate files with CF8 or CF9?

The cfpdf tag has lots of options but I can't seem to find one for splitting apart a PDF package into separate files which can be saved to the file system.
Is this possible?
There's not a direct command, but you can achieve what you want to do in very few lines of code by using action="merge", with the "pages" attribute. So if you wanted to take a 20-page PDF and create 20 separate files, you could use getInfo to get the number of pages in the input document, then loop from 1 to that number, and in that loop, do a merge from your input document to a new output document for each iteration, with pages="#currentPage#" (or whatever your loop counter is)

Programmatically add comments to PDF header

Has anyone had any success with adding additional information to a PDF file?
We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.
Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.
Honestly, we can't figure out how to put information into a PDF file that doesn't break the PDF file.
Here is the start of one of our PDFs...
%PDF-1.4
%âãÏÓ
6 0 obj
<<
/Type /XObject
/Subtype /Image
/BitsPerComponent 8
/Width 854
/Height 130
/ColorSpace /DeviceRGB
/Filter /DCTDecode
/Length 17734>>
stream
In our PRN files, we would insert information like this:
%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1
My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?
Thank you,
David Walker
Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and as such ignored (the first two lines of the PDF actually are comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.
However:
The PDF format works with byte position references, so if you insert data into a finished PDF file, this will push the rest of the data away from their original position and thus break the file. You can also not append it to the file, because a PDF file has to end with
startxref
123456
%%EOF
(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.
Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.
Edit 2: As Javier pointed out correctly, you can also just add your data to the end and just add a copy of the three lines to the end of that. Boils down to the same thing, but it's a little easier.
PDFs are supposed to have multiple versions just appending at the end; but the very end must have the offset to the main reference table. Just read the last three lines, append your data and reattach the original ending.
You can either remove the original ending or let it there. PDF readers will just go to the end and use the second-to-last line to find the reference table.
Have you ever thought to embed your additional info inside the PDF as a separate file?
The generic PDF specification allows to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even .pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A- and PDF/X-* may impose some restrictions about embedded/attached files.)
That allows you to tie additional info and/or data to PDF files and allow for common storage and processing. Attached files are supposed to not disturb any PDF viewer's rendering.
I've used that feature frequently, for various purposes:
store the parent document (like .doc) inside the .pdf from which the .pdf was created in the first place;
tag a job ticketing information to a printfile that is sent to the printshop;
etc.pp.
Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) suggest to stay away from embedding/attaching binary files to PDF files --
because more and more Readers will by default stop you from easily extracting/detaching the embedded/attached files.
However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary lenght and internal format and attach it to the PDF:
MRN TEST000001
ACCT TEST0000000000001
DATE 2009-01-01
TIME 16:44:33.76
DOC_TYPE Clinical
DOC_NUM 192837475
DOC_VER 1
MORE_INFO blah blah
Hi, guys,
can you please process this file faster than usual? If you don't,
someone will be dying.
Seriously, David.
FWIW, the commandline tools pdftk.exe (Windows) and pdftk (Linux) are able to attach and detach embedded files from their container PDF. Acrobat Reader can also handle attachments.
You could setup/program/script your document server handling the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
Of course, the doctor who views the PDF would be able to see there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from un-authorized file detachments. And/or encode, obscure, rot13 the .txt. Not exactly rock-solid methods, but 99% of doctors wouldn't be able to accomplish it even if you teach them how to...)
You can still insert comments into a PDF file using the % character. But anyone would be able to access with a text editor.
Your vendor could remove these comments after post-processing, so it doesn't actually get to the doctors.
You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:
use CAM::PDF;
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
DOC_NUM => CAM::PDF::Node->new('number', 192837475),
DOC_VER => CAM::PDF::Node->new('number', 1),
});
$pdf->cleanoutput('out.pdf');
The Info node of the PDF then looks like this:
8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj
You can read the PRN data back out like so (simplistic code...)
my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
my $prndict = $pdf->getValue($prn);
for my $key (sort keys %{$prndict}) {
print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
}
}
Which makes output like this:
DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1
PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!
At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the lengths of each PDF block were hard coded in the document. So, we could not change the number of characters. We would just add extra spaces.
It worked great, the JS code executed an all.
Have you thought about using XMP?

How to do mail merge on top of a PDF?

I often get a PDF from our designer (built in Adobe InDesign) which is supposed to be sent out to thousands of people.
I've got the list with all the people, and it's easy doing a mail merge in OpenOffice.org. However, OpenOffice.org doesn't support the advanced PDF. I just want to output some text onto each page and print it out.
Here's how I do it now: print out 6.000 copies of the PDF, then put all of them into the printer again and just print out name, address and other information on top of it. But that's expensive.
Sadly, I can't make the PDF to an image and use that in OpenOffice.org because it grinds the computer to a halt. It also takes extremely long time to send this job to the printer.
So, is there an easy way to do this mail merge (preferably in Python) without paying for third party closed solutions?
Now I've made an account. I fixed it by using the ingenious pdftk.
In my quest I totally overlook the feature "background" and "overlay". My solution was this:
pdftk names.pdf background boat_background.pdf output out.pdf
Creating the names.pdf you can easily do with Python reportlab or similar PDF-creation scripts. It's best using code to do that, creating 6k pages took several hours in LibreOffice/OpenOffice, while it took just a few seconds using Python.
You could probably look at a PDF library like iText. If you have some programming knowledge and a bit of time you could write some code that adds the contact information to the PDFs
There are two much simpler and cheaper solutions.
First, you can do your mail merge directly in InDesign using DataMerge. This is a utility added to InDesign way back in CS. You export or save your names in CSV format. Import the data into an InDesign template and then drop in your name, address and such fields in the layout. Press Go. It will create a new document with all the finished letters or you can go right to the printer.
OR, you can export your data to an XML file and create a dynamic layout using XML placeholders in InDesign.
The book A Designer's Guide to Adobe InDesign and XML will teach you how to do this, or you can check out the Lynda.com videos for Dynamic workflows with InDesign and XML.
Very easy to do.
If you want to create separate PDFs files for the mail merge, you can run out one long PDF with all the names in one file then do an Extract to Separate PDF files in Acrobat Pro itself.
If you cannot get the template in another format than PDF a simple ad-hoc solution would be to
convert the PDF into an image
put the image in the backgroud of your (OpenOffice.org) document
position mail merge fields on top of the image
do the mail merge and print
Probably the best way would be to generate another PDF with the missing text, and overlay one PDF over the other. A quick Google found this link showing how to do it in Acrobat, and I'm sure there are other methods as well.
http://forums.macrumors.com/showthread.php?t=508226
For a no-mess, no-fuss solution, use iText to simply add the text to the pdf. For example, you can do the following to add text to a pdf document once loaded:
PdfContentByte cb= ...;
cb.BeginText();
cb.SetFontAndSize(font, fontSize);
float x = ...;
float y = ...;
cb.SetTextMatrix(x, y);
cb.ShowText(fieldValue);
cb.EndText();
From there on, save it as a different file, and print it.
However, I've found that form fields are the way to go with pdf document generation from templates.
If you have a template with form fields (added with Adobe Acrobat), you have one of two choices :
Create a FDF file, which is essentially a list of values for the fields on the form. A FDF is a simple text document which references the original document so that when you open up the PDF, the document loads with the field values supplied by the FDF.
Alternatively, load the template with with a library like iText / iTextSharp, fill the form fields manually, and save it as a seperate pdf.
A sample FDF file looks like this (stolen from Planet PDF) :
%FDF-1.2
%âãÏÓ
1 0 obj
<<<
/F(Example PDF Form.pdf)
/Fields[
<<
/T(myTextField)
/V(myTextField default value)
>>
]
>>
>> endobj trailer
<>
%%EOF
Because of the simple format and the small size of the FDF, this is the preferred approach, and the approach should work well in any language.
As for filling the fields programmatically, you can use iText in the following way :
PdfAcroForm acroForm = writer.AcroForm;
acroForm.Put(new PdfName(fieldInfo.Name), new PdfString(fieldInfo.Value));
What about using a variable data program such as - XMPie for Adobe Indesign. It's a plug-in that should reference to your list of people (think it might have to be a list in Excel though).
One easy way would be to create a fillable pdf form from the original document in Acrobat and do a mail merge with the form and a csv.
PDF mail merges are relatively easy to do in python and pdftk. Fdfgen (pip install fdfgen) is a python library that will create an fdf from a python array, so you can save the excel grid to a csv, make sure that the csv headers match the name of the pdf form field you want to fill with that column, and do something like
import csv
import subprocess
from fdfgen import forge_fdf
PDF_FORM = 'path/to/form.pdf'
CSV_DATA = 'path/to/data.csv'
infile = open(CSV_DATA, 'rb')
reader = csv.DictReader(infile)
rows = [row for row in reader]
infile.close()
for row in rows:
# Create fdf
filename = row['filename'] # Construct filename
fdf_data = [(k,v) for k, v in row.items()]
fdf = forge_fdf(fdf_data_strings=fdf_data)
fdf_file = open(filename+'.fdf', 'wb')
fdf_file.write(fdf)
fdf_file.close()
# Use PDFTK to create filled, flattened, pdf file
cmds = ['pdftk', PDF_FORM, 'fill_form', filename+'.fdf',
'output', filename+'.pdf', 'flatten', 'dont_ask']
process = subprocess.Popen(cmds, stdout=subprocess.PIPE)
stdout, stderr = process.communicate()
returncode = process.poll()
os.remove(filename+'.fdf')
I've encountered this problem enough to write my own free solution, PdfZero. PdfZero has a mail merge feature to merge spreadsheets with PDF forms. You will still need to create a PDF form, but you can upload the form and csv to pdfzero, select which form fields you want filled with which columns, create a naming convention for each filled pdf using the csv data if needed, and batch generate the filled PDfs.
DISCLAIMER: I wrote PdfZero
Someone asked for specifics. I didn't want to sully my top answer with it, because you can do it how you like (and just knowing pdftk is up to it should give people the idea).
But here's some scripts I used ages ago:
csv_to_pdf.py
#!/usr/bin/python
# This makes one PDF page per name in the CSV file
# csv_to_pdf.py <CSV_FILE>
import csv
import sys
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib.units import cm, mm
in_db = csv.reader(open(sys.argv[1], "rb"));
outname = sys.argv[1].replace("csv", "pdf")
pdf = Canvas(outname)
in_db.next()
i = 0
for rad in in_db:
pdf.setFontSize(11)
adr = rad[1]
tekst = pdf.beginText(2*cm, 26*cm)
for a in adr.split('\n'):
if not a.strip():
continue
if a[-1] == ',':
a = a[:-1]
tekst.textLine(a)
pdf.drawText(tekst)
pdf.showPage()
i += 1
if i % 1000 == 0:
print i
pdf.save()
When you've ran this, you have a file with thousands of pages, only with a name on it. This is when you can background the fancy PDF under all of them:
pdftk <YOUR_NEW_PDF_FILE.pdf> background <DESIGNED_FILE.pdf> <MERGED.pdf>
You can use InDesign's data merge function, or you can do what you've been doing with printing a portion of the job, and then printing the mail merge atop that with Word or Open Office.
But also look into finding a company that can do variable data offset printing or dynamic publishing. Might be a little more expensive up front but can save a bundle when it comes to time, testing, even packaging and mailing.
Disclaimer: I'm the author of this tool.
I ran into this issue enough times that I built a free online tool for it: https://pdfbatchfill.com/
It assumes a PDF form as a template and uses that along with CSV form data to generate a single PDF or individual PDFs in a zip file.