ColdFusion CFDOCUMENT with links to other PDFs

I am creating a PDF using the cfdocument tag at the moment. The PDF is not much more than a bunch of links to other PDFs.
So I create this PDF index and the links are all HREFs
Another PDF
if I set the localURL attribute to "no" my URLs have the whole web path in them:
Another PDF
if I set the localURL attribute to "yes" then I get:
Another PDF
So this index PDF is going to go onto a CD and all of the linked PDFs are going to sit right next to it so I need a relative link ... more like:
Another PDF
cfdocument does not seem to do this. I can change the file name in the link to "File:///Another_PDF.pdf", but this does not work either, because I don't know the drive letter of the CD drive, or whether the files are going to end up inside a directory on the CD.
Is there a way (possibly using iText or something) of opening up the PDF once it is created and converting the URL links to actual PDF GoTo tags?
I know this is kind of a stretch but I am at my wits end with this.
So I've managed to get into the objects, but I'm still struggling with the conversion.
Converting from:
5 0 obj<</C[0 0 1]/Border[0 0 0]/A<</URI(File:///75110_002.PDF)/S/URI>>/Subtype/Link/Rect[145 502 184 513]>>endobj
To this:
19 0 obj<</S/GoToR/D[0/XYZ null null 0]/F(75110_002.PDF)>>endobj
20 0 obj<</Subtype/Link/Rect[145 502 184 513]/Border[0 0 0]/A 19 0 R>>endobj
Wow this is really kicking my ass! :)
So I've managed to get the document open, loop through the Link Annotations, capture the Rect co-ordinates and the linked to document name (saved into an array of Structures) and then successfully deleted the Annotation which was a URI Link.
So I thought I could now loop over that array of structures and put the annotations back into the document using the createLink method or the setAction method. But all the examples I've seen of these methods are attached to a Chunk (of text). My document already has the text in place, so I don't need to remake the text links; I just need to put the links back in the same spot.
So I figured I could reopen the document, look for the actual text that was the link, and then attach the setAction to the already existing chunk of text ... but I can't find the text!!
I suck! :)

This thread has an example of updating the link actions, by modifying the pdf annotations. It is written in iTextSharp 5.x, but the java code is not much different.
The thread provides a solid explanation of how annotations work. But to summarize, you need to read in your source pdf and loop through the individual pages for annotations. Extract the links and use something like getFileFromPath() to replace them with a file name only.
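To make the "file name only" step concrete, here is a minimal sketch in plain Python (standard library only, no PDF library involved). The function name file_name_from_uri is made up for illustration; it only mimics what getFileFromPath() is used for here, namely reducing a link target to its last path segment.

```python
from urllib.parse import urlparse, unquote
import posixpath


def file_name_from_uri(uri: str) -> str:
    """Return only the last path segment of a link target (hypothetical helper)."""
    # Drop scheme, host and query string; keep the decoded path
    path = unquote(urlparse(uri).path)
    # Treat backslashes as separators too, in case a Windows path was embedded
    path = path.replace("\\", "/")
    return posixpath.basename(path)


if __name__ == "__main__":
    print(file_name_from_uri("File:///75110_002.PDF"))
    print(file_name_from_uri("http://example.com/docs/Another_PDF.pdf"))
```

Whether a bare file name is then resolved relative to the index PDF depends on the viewer, so treat this as the string handling only.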
I was curious, so I did a quick and ugly conversion of the iTextSharp code above. Disclaimer, it is not highly tested:
/**
Usage:
util = createObject("component", "path.to.ThisComponent");
util.fixLinks( "c:/path/to/sourceFile.pdf", "c:/path/to/newFile.pdf");
*/
component {
/**
Convert all absolute links, in the given pdf, to relative links (file name only)
@source - absolute path to the source pdf file
@destination - absolute path to save copy
*/
public function fixLinks( string source, string destination) {
// initialize objects
Local.reader = createObject("java", "com.lowagie.text.pdf.PdfReader").init( arguments.source );
Local.pdfName = createObject("java", "com.lowagie.text.pdf.PdfName");
// check each page for hyperlinks
for ( Local.i = 1; Local.i <= Local.reader.getNumberOfPages(); Local.i++) {
//Get all of the annotations for the current page
Local.page = Local.reader.getPageN( Local.i );
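Local.annotArray = Local.page.getAsArray( Local.PdfName.ANNOTS );
// pages without annotations return null, so skip them
if ( isNull( Local.annotArray ) ) { continue; }
Local.annotations = Local.annotArray.getArrayList();
// search annotations for links (<= so the last annotation is not skipped)
for (Local.x = 1; Local.x <= arrayLen(Local.annotations); Local.x++) {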
// get current properties
Local.current = Local.annotations[ Local.x ];
Local.dictionary = Local.reader.getPdfObject( Local.current );
Local.subType = Local.dictionary.get( Local.PdfName.SUBTYPE );
Local.action = Local.dictionary.get( Local.PdfName.A );
Local.hasLink = true;
//Skip this item if it does not have a link AND action
if (Local.subType != Local.PdfName.LINK || isNull(Local.action)) {
Local.hasLink = false;
}
//Skip this item if it does not have a URI
if ( Local.hasLink && Local.action.get( Local.PdfName.S ) != Local.PdfName.URI ) {
Local.hasLink = false;
}
//If it is a valid URI, update link
if (Local.hasLink) {
// extract file name from URL
Local.oldLink = Local.action.get( Local.pdfName.URI );
Local.newLink = getFileFromPath( Local.oldLink );
// replace link
// WriteDump("Changed link from ["& Local.oldLink &"] ==> ["& Local.newLink &"]");
Local.pdfString = createObject("java", "com.lowagie.text.pdf.PdfString");
Local.action.put( Local.pdfName.URI, Local.pdfString.init( Local.newLink ) );
}
}
}
// save all pages to new file
copyPDF( Local.reader , arguments.destination );
}
/**
Copy all pages in pdfReader to the given destination file
@pdfReader - pdf to copy
@destination - absolute path to save copy
*/
public function copyPDF( any pdfReader, string destination) {
try {
Local.doc = createObject("java", "com.lowagie.text.Document").init();
Local.out = createObject("java", "java.io.FileOutputStream").init( arguments.destination );
Local.writer = createObject("java", "com.lowagie.text.pdf.PdfCopy").init(Local.doc, Local.out);
// open document and save individual pages
Local.doc.open();
for (Local.i = 1; Local.i <= arguments.pdfReader.getNumberOfPages(); Local.i++) {
Local.writer.addPage( Local.writer.getImportedPage( arguments.pdfReader, Local.i) );
}
Local.doc.close();
}
finally
{
// cleanup
if (structKeyExists(Local, "doc")) { Local.doc.close(); }
if (structKeyExists(Local, "writer")) { Local.writer.close(); }
if (structKeyExists(Local, "out")) { Local.out.close(); }
}
}
}

I finally got it:
public function resetLinks( string source, string destination) {
try {
// initialize objects
Local.reader = createObject("java", "com.lowagie.text.pdf.PdfReader").init( arguments.source );
Local.pdfName = createObject("java", "com.lowagie.text.pdf.PdfName");
Local.annot = createObject("java", "com.lowagie.text.pdf.PdfAnnotation");
Local.out = createObject("java", "java.io.FileOutputStream").init( arguments.destination );
Local.stamper = createObject("java", "com.lowagie.text.pdf.PdfStamper").init(Local.reader, Local.out);
Local.PdfAction = createObject("java", "com.lowagie.text.pdf.PdfAction");
Local.PdfRect = createObject("java", "com.lowagie.text.Rectangle");
Local.PdfBorderArray = createObject("java", "com.lowagie.text.pdf.PdfBorderArray").init(javacast("float", "0"), javacast("float", "0"), javacast("float", "0"));
Local.newAnnots = [];
// check each page for hyperlinks
// Save the data to a structure then write it to an array
// then delete the hyperlink Annotation
for ( Local.i = 1; Local.i <= Local.reader.getNumberOfPages(); Local.i = Local.i + 1) {
//Get all of the annotations for the current page
Local.page = Local.reader.getPageN( Local.i );
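Local.annotArray = Local.page.getAsArray( Local.PdfName.ANNOTS );
// pages without annotations return null, so skip them
if ( isNull( Local.annotArray ) ) { continue; }
Local.annotations = Local.annotArray.getArrayList();
// search annotations for links (backwards, because matches are removed as we go)
for (Local.x = arrayLen(Local.annotations); Local.x > 0; Local.x--) {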
// get current properties
Local.current = Local.annotations[ Local.x ];
Local.dictionary = Local.reader.getPdfObject( Local.current );
Local.subType = Local.dictionary.get( Local.PdfName.SUBTYPE );
Local.action = Local.dictionary.get( Local.PdfName.A );
Local.hasLink = true;
//Skip this item if it does not have a link AND action
if (Local.subType != Local.PdfName.LINK || isNull(Local.action)) {
Local.hasLink = false;
}
//Skip this item if it does not have a URI
if ( Local.hasLink && Local.action.get( Local.PdfName.S ) != Local.PdfName.URI ) {
Local.hasLink = false;
}
//If it is a valid URI, update link
if (Local.hasLink) {
// extract file name from URL
Local.oldLink = Local.action.get( Local.pdfName.URI );
Local.newLink = getFileFromPath( Local.oldLink );
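Local.Rect = Local.dictionary.get( Local.PdfName.RECT );
// keep the struct in the Local scope so it does not leak into the variables scope
Local.arrayStruct = StructNew();
Local.arrayStruct.rectSTR = Local.Rect.toString();
Local.arrayStruct.link = Local.newLink;
Local.arrayStruct.page = Local.i;
ArrayAppend(Local.newAnnots, Local.arrayStruct);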
// Delete
Local.annotations.remove(Local.current);
}
}
}
// Now really remove them!
Local.reader.RemoveUnusedObjects();
// Now loop over the saved annotations and put them back!!
for ( Local.z = 1; Local.z <= ArrayLen(Local.newAnnots); Local.z++) {
// Parse the saved rect string into an array
Local.theRectArray = ListToArray(ReplaceNoCase(ReplaceNoCase(Local.newAnnots[Local.z].rectSTR, "[", ""), "]", ""));
// Create the GoToR action
Local.theAction = Local.PdfAction.gotoRemotePage(javacast("string", Local.newAnnots[Local.z].link), javacast("string", Local.newAnnots[Local.z].link), javacast("boolean", "false"), javacast("boolean", "false"));
// Create the Link Annotation with the above Action and the Rect
Local.theAnnot = Local.annot.createLink(Local.stamper.getWriter(), Local.PdfRect.init(javacast("int", Local.theRectArray[1]), javacast("int", Local.theRectArray[2]), javacast("int", Local.theRectArray[3]), javacast("int", Local.theRectArray[4])), Local.annot.HIGHLIGHT_INVERT, Local.theAction);
// Remove the border; the underlying underlined text already marks the item as a link
Local.theAnnot.setBorder(Local.PdfBorderArray);
// Add the Annotation to the Page
Local.stamper.addAnnotation(Local.theAnnot, Local.newAnnots[Local.z].page);
}
}
finally {
// cleanup
// close the stamper first so the output is written before the reader is released
if (structKeyExists(Local, "stamper")) { Local.stamper.close(); }
if (structKeyExists(Local, "reader")) { Local.reader.close(); }
if (structKeyExists(Local, "out")) { Local.out.close(); }
}
}
I couldn't have done this without the help of Leigh!!
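For anyone who wants to see the target structure without iText, here is a rough sketch in plain Python of the two string-level steps above: parsing the saved Rect string back into numbers, and formatting a link annotation with an inline GoToR action like the one I was aiming for. The function names are made up for illustration, and this only builds the dictionary text; it does not write a valid PDF (no object numbers, no cross-reference table).

```python
from typing import List


def parse_rect(rect_str: str) -> List[float]:
    """Turn a string such as '[145, 502, 184, 513]' into four floats."""
    inner = rect_str.strip().strip("[]")
    numbers = [float(part) for part in inner.replace(",", " ").split()]
    if len(numbers) != 4:
        raise ValueError("expected four coordinates, got %d" % len(numbers))
    return numbers


def escape_pdf_string(value: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def goto_remote_link(rect_str: str, file_name: str) -> str:
    """Build the text of a link annotation with an inline GoToR action."""
    llx, lly, urx, ury = parse_rect(rect_str)
    return (
        "<</Subtype/Link/Rect[%g %g %g %g]/Border[0 0 0]"
        "/A<</S/GoToR/D[0/XYZ null null 0]/F(%s)>>>>"
        % (llx, lly, urx, ury, escape_pdf_string(file_name))
    )


if __name__ == "__main__":
    print(goto_remote_link("[145, 502, 184, 513]", "75110_002.PDF"))
```

Parsing to floats rather than integers keeps fractional coordinates intact, which matters if the original Rect was not on whole points.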


Printing to pdf from Google Apps Script HtmlOutput

For years, I have been using Google Cloud Print to print labels in our laboratories on campus (to standardize) using a Google Apps Script custom HtmlService form.
Now that GCP is being deprecated, I am searching for a replacement. I have found a few options but am struggling to get the file to convert to a pdf, as would be needed with these other vendors.
Currently, when you submit a text/html blob to the GCP servers in GAS, the backend converts the blob to application/pdf (as evidenced by looking at the job details in the GCP panel on Chrome under 'content type').
That said, because these other cloud print services require pdf printing, I have tried for some time now to have GAS change the file to pdf format before sending to GCP and I always get a strange result. Below, I'll show some of the strategies that I have used and include pictures of one of our simple labels generated with the different functions.
The following is the base code for the ticket and payload that has worked for years with GCP
//BUILD PRINT JOB FOR NARROW TAPES
var ticket = {
version: "1.0",
print: {
color: {
type: "STANDARD_COLOR",
vendor_id: "Color"
},
duplex: {
type: "NO_DUPLEX"
},
copies: {copies: parseFloat(quantity)},
media_size: {
width_microns: 27940,
height_microns:40960
},
page_orientation: {
type: "LANDSCAPE"
},
margins: {
top_microns:0,
bottom_microns:0,
left_microns:0,
right_microns:0
},
page_range: {
interval:
[{start:1,
end:1}]
},
}
};
var payload = {
"printerid" : QL710,
"title" : "Blank Template Label",
"content" : HtmlService.createHtmlOutput(html).getBlob(),
"contentType": 'text/html',
"ticket" : JSON.stringify(ticket)
};
This generates the expected printout.
The following is the code used to convert to pdf:
var blob = HtmlService.createTemplate(html).evaluate().getContent();
var newBlob = Utilities.newBlob(html, "text/html", "text.html");
var pdf = newBlob.getAs("application/pdf").setName('tempfile');
var file = DriveApp.getFolderById("FOLDER ID").createFile(pdf);
var payload = {
"printerid" : QL710,
"title" : "Blank Template Label",
"content" : pdf,//HtmlService.createHtmlOutput(html).getBlob(),
"contentType": 'text/html',
"ticket" : JSON.stringify(ticket)
};
an unexpected result occurs:
This comes out the same way for direct coding in the 'content' field with and without .getBlob():
"content" : HtmlService.createHtmlOutput(html).getAs('application/pdf'),
Note the createFile line in the code above, which I used to test the pdf. This file is created as expected, though with the wrong dimensions for label printing (I am not sure how to convert to pdf with the appropriate margins and page size).
I have now tried to adopt Yuri's ideas; however, the conversion from html to document loses formatting.
var blob = HtmlService.createHtmlOutput(html).getBlob();
var docID = Drive.Files.insert({title: 'temp-label'}, blob, {convert: true}).id
var file = DocumentApp.openById(docID);
file.getBody().setMarginBottom(0).setMarginLeft(0).setMarginRight(0).setMarginTop(0).setPageHeight(79.2).setPageWidth(172.8);
This produces a document that looks like this (the picture also shows the expected output in my hand).
Does anyone have insights into:
How to format the converted pdf to contain appropriate height, width and margins.
How to convert to pdf in a way that would print correctly.
Here is a minimal code to get a better sense of context https://script.google.com/d/1yP3Jyr_r_FIlt6_aGj_zIf7HnVGEOPBKI0MpjEGHRFAWztGzcWKCJrD0/edit?usp=sharing
I've made the template (80 x 40 mm -- sorry, I don't know your size):
https://docs.google.com/document/d/1vA93FxGXcWLIEZBuQwec0n23cWGddyLoey-h0WR9weY/edit?usp=sharing
And there is the script:
function myFunction() {
// input data
var matName = '<b>testing this to <u>see</u></b> if it <i>actually</i> works <i>e.coli</i>'
var disposeWeek = 'end of semester'
var prepper = 'John Ruppert';
var className = 'Cell and <b>Molecular</b> Biology <u>Fall 2020</u> a few exercises a few exercises a few exercises a few exercises';
var hazards = 'Lots of hazards';
// make a temporary Doc from the template
var copyFile = DriveApp.getFileById('1vA93FxGXcWLIEZBuQwec0n23cWGddyLoey-h0WR9weY').makeCopy();
var doc = DocumentApp.openById(copyFile.getId());
var body = doc.getBody();
// replace placeholders with data
body.replaceText('{matName}', matName);
body.replaceText('{disposeWeek}', disposeWeek);
body.replaceText('{prepper}', prepper);
body.replaceText('{className}', className);
body.replaceText('{hazards}', hazards);
// make Italics, Bold and Underline
handle_tags(['<i>', '</i>'], body);
handle_tags(['<b>', '</b>'], body);
handle_tags(['<u>', '</u>'], body);
// save the temporary Doc
doc.saveAndClose();
// make a PDF
var docblob = doc.getBlob().setName('Label.pdf');
DriveApp.createFile(docblob);
// delete the temporary Doc
copyFile.setTrashed(true);
}
// this function applies formatting to text inside the tags
function handle_tags(tags, body) {
var start_tag = tags[0].toLowerCase();
var end_tag = tags[1].toLowerCase();
var found = body.findText(start_tag);
while (found) {
var elem = found.getElement();
var start = found.getEndOffsetInclusive();
var end = body.findText(end_tag, found).getStartOffset()-1;
switch (start_tag) {
case '<b>': elem.setBold(start, end, true); break;
case '<i>': elem.setItalic(start, end, true); break;
case '<u>': elem.setUnderline(start, end, true); break;
}
found = body.findText(start_tag, found);
}
body.replaceText(start_tag, ''); // remove tags
body.replaceText(end_tag, '');
}
The script just changes the {placeholders} with the data and saves the result as a PDF file (Label.pdf). The PDF looks like this:
There is one thing I'm not sure is possible: changing the size of the texts dynamically to fit them into the cells, as is done in your 'autosize.html'. Roughly, you can take the length of the text in the cell and, if it is bigger than some number, make the font size a bit smaller. You can probably use the jQuery textfill function from 'autosize.html' to get an optimal size and apply that size in the document.
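As a rough illustration of that length-based idea (in Python rather than Apps Script, just to show the arithmetic), something like the sketch below could pick a font size from the visible text length. Every number in it (40 characters that fit at full size, half a point per 10 extra characters, the 10 and 4 point limits) is an assumption to be tuned against the real label, and font_size_for is a made-up name.

```python
import re


def font_size_for(text: str, max_size: float = 10.0, min_size: float = 4.0,
                  fit_chars: int = 40, step: float = 0.5,
                  chars_per_step: int = 10) -> float:
    """Guess a font size from the visible length of the text (hypothetical heuristic)."""
    # The <b>, <i> and <u> tags are removed later, so they should not count
    visible = re.sub(r"</?[biu]>", "", text)
    overflow = len(visible) - fit_chars
    if overflow <= 0:
        return max_size
    # One step down for every started block of extra characters (ceiling division)
    steps = -(-overflow // chars_per_step)
    return max(min_size, max_size - steps * step)


if __name__ == "__main__":
    print(font_size_for("end of semester"))
    print(font_size_for("Cell and <b>Molecular</b> Biology <u>Fall 2020</u> a few exercises"))
```

A character count ignores glyph widths and line wrapping, so this can only be a first approximation.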
I'm not sure if I got you right. Do you need to make a PDF and save it on Google Drive? You can do that in Google Docs.
As an example:
Make a new document with your table and text. Something like this
Add this script into your doc:
function myFunction() {
var copyFile = DriveApp.getFileById(ID).makeCopy();
var newFile = DriveApp.createFile(copyFile.getAs('application/pdf'));
newFile.setName('label');
copyFile.setTrashed(true);
}
Every time you run this script it makes the file 'label.pdf' on your Google Drive.
The size of this pdf will be the same as the page size of your Doc. You can make any size of page with add-on: Page Sizer https://webapps.stackexchange.com/questions/129617/how-to-change-the-size-of-paper-in-google-docs-to-custom-size
If you need to change the text in your label before generating the pdf, and/or you need to change the name of the generated file, you can do that via script as well.
Here is a variant of the script that changes a font size in one of the cells if the label doesn't fit into one page.
function main() {
// input texts
var text = {};
text.matName = '<b>testing this to <u>see</u></b> if it <i>actually</i> works <i>e.coli</i>';
text.disposeWeek = 'end of semester';
text.prepper = 'John Ruppert';
text.className = 'Cell and <b>Molecular</b> Biology <u>Fall 2020</u> a few exercises a few exercises a few exercises a few exercises';
text.hazards = 'Lots of hazards';
// initial max font size for the 'matName'
var size = 10;
var doc_blob = set_text(text, size);
// if we got more than 1 page, reduce the font size and repeat
while ((size > 4) && (getNumPages(doc_blob) > 1)) {
size = size-0.5;
doc_blob = set_text(text, size);
}
// save pdf
DriveApp.createFile(doc_blob);
}
// this function takes the texts and a font size and puts the texts into the fields
function set_text(text, size) {
// make a copy
var copyFile = DriveApp.getFileById('1vA93FxGXcWLIEZBuQwec0n23cWGddyLoey-h0WR9weY').makeCopy();
var doc = DocumentApp.openById(copyFile.getId());
var body = doc.getBody();
// replace placeholders with data
body.replaceText('{matName}', text.matName);
body.replaceText('{disposeWeek}', text.disposeWeek);
body.replaceText('{prepper}', text.prepper);
body.replaceText('{className}', text.className);
body.replaceText('{hazards}', text.hazards);
// set font size for 'matName'
body.findText(text.matName).getElement().asText().setFontSize(size);
// make Italics, Bold and Underline
handle_tags(['<i>', '</i>'], body);
handle_tags(['<b>', '</b>'], body);
handle_tags(['<u>', '</u>'], body);
// save the doc
doc.saveAndClose();
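// get the PDF blob before trashing the copy
var docblob = doc.getBlob().setName('Label.pdf');
// delete the copy
copyFile.setTrashed(true);
// return blob (declared with var so it does not become an implicit global)
return docblob;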
}
// this function formats the text between html tags
function handle_tags(tags, body) {
var start_tag = tags[0].toLowerCase();
var end_tag = tags[1].toLowerCase();
var found = body.findText(start_tag);
while (found) {
var elem = found.getElement();
var start = found.getEndOffsetInclusive();
var end = body.findText(end_tag, found).getStartOffset()-1;
switch (start_tag) {
case '<b>': elem.setBold(start, end, true); break;
case '<i>': elem.setItalic(start, end, true); break;
case '<u>': elem.setUnderline(start, end, true); break;
}
found = body.findText(start_tag, found);
}
body.replaceText(start_tag, '');
body.replaceText(end_tag, '');
}
// this function takes the saved doc blob and returns its number of pages
function getNumPages(doc) {
var blob = doc.getAs('application/pdf');
var data = blob.getDataAsString();
var pages = parseInt(data.match(/ \/N (\d+) /)[1], 10);
Logger.log("pages = " + pages);
return pages;
}
It looks rather awful and hopeless. It turned out that Google Docs has no page counter. You need to convert your document into a PDF and count the pages of the PDF file. Gross!
The next problem: even if you manage to count the pages, you have no clue which of the cells overflowed. This script takes just one cell, changes its font size, counts pages, changes the font size again, and so on. But that doesn't guarantee success, because there can be another cell with long text inside. You could reduce the font size of all the texts, but that doesn't look like a great idea either.
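The shrink-and-recount loop itself is simple; here is a minimal sketch of it in Python with the page counter passed in as a function, so it can be tried without Google Docs. The name shrink_to_fit and the fake counter in the usage lines are made up for illustration. Unlike the script above, it returns None when even the smallest size does not fit, so the caller can see the failure instead of silently getting a two-page label.

```python
from typing import Callable, Optional


def shrink_to_fit(count_pages: Callable[[float], int], start: float = 10.0,
                  minimum: float = 4.0, step: float = 0.5) -> Optional[float]:
    """Return the largest tried font size that yields one page, or None."""
    size = start
    while size >= minimum:
        # count_pages stands in for "render the label at this size and count pages"
        if count_pages(size) <= 1:
            return size
        size -= step
    return None


if __name__ == "__main__":
    # Fake counter: pretend the label spills onto a second page above 7 pt
    print(shrink_to_fit(lambda size: 2 if size > 7 else 1))
    # Nothing fits: the caller gets None and can react
    print(shrink_to_fit(lambda size: 2))
```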

Get pdf-attachments from Gmail as text

I searched around the web & Stack Overflow but didn't find a solution. What I try to do is the following: I get certain attachments via mail that I would like to have as (Plain) text for further processing. My script looks like this:
function MyFunction() {
var threads = GmailApp.search ('label:templabel');
var messages = GmailApp.getMessagesForThreads(threads);
for (i = 0; i < messages.length; ++i)
{
j = messages[i].length;
var messageBody = messages[i][0].getBody();
var messageSubject = messages [i][0].getSubject();
var attach = messages [i][0].getAttachments();
var attachcontent = attach.getContentAsString();
GmailApp.sendEmail("mail", messageSubject, "", {htmlBody: attachcontent});
}
}
Unfortunately this doesn't work. Does anybody here have an idea how I can do this? Is it even possible?
Thank you very much in advance.
Best, Phil
Edit: Updated for DriveApp, as DocsList deprecated.
I suggest breaking this down into two problems. The first is how to get a pdf attachment from an email, the second is how to convert that pdf to text.
As you've found out, getContentAsString() does not magically change a pdf attachment to plain text or html. We need to do something a little more complicated.
First, we'll get the attachment as a Blob, a utility class used by several Services to exchange data.
var blob = attachments[0].getAs(MimeType.PDF);
So with the second problem separated out, and maintaining the assumption that we're interested in only the first attachment of the first message of each thread labeled templabel, here is how myFunction() looks:
/**
* Get messages labeled 'templabel', and send myself the text contents of
* pdf attachments in new emails.
*/
function myFunction() {
var threads = GmailApp.search('label:templabel');
var threadsMessages = GmailApp.getMessagesForThreads(threads);
for (var thread = 0; thread < threadsMessages.length; ++thread) {
var message = threadsMessages[thread][0];
var messageBody = message.getBody();
var messageSubject = message.getSubject();
var attachments = message.getAttachments();
var blob = attachments[0].getAs(MimeType.PDF);
var filetext = pdfToText( blob, {keepTextfile: false} );
GmailApp.sendEmail(Session.getActiveUser().getEmail(), messageSubject, filetext);
}
}
We're relying on a helper function, pdfToText(), to convert our pdf blob into text, which we'll then send to ourselves as a plain text email. This helper function has a variety of options; by setting keepTextfile: false, we've elected to just have it return the text content of the PDF file to us, and leave no residual files in our Drive.
pdfToText()
This utility is available as a gist. Several examples are provided there.
A previous answer indicated that it was possible to use the Drive API's insert method to perform OCR, but it didn't provide code details. With the introduction of Advanced Google Services, the Drive API is easily accessible from Google Apps Script. You do need to switch on and enable the Drive API from the editor, under Resources > Advanced Google Services.
pdfToText() uses the Drive service to generate a Google Doc from the content of the PDF file. Unfortunately, this contains the "pictures" of each page in the document - not much we can do about that. It then uses the regular DocumentService to extract the document body as plain text.
/**
* See gist: https://gist.github.com/mogsdad/e6795e438615d252584f
*
* Convert pdf file (blob) to a text file on Drive, using built-in OCR.
* By default, the text file will be placed in the root folder, with the same
* name as source pdf (but extension 'txt'). Options:
* keepPdf (boolean, default false) Keep a copy of the original PDF file.
* keepGdoc (boolean, default false) Keep a copy of the OCR Google Doc file.
* keepTextfile (boolean, default true) Keep a copy of the text file.
* path (string, default blank) Folder path to store file(s) in.
* ocrLanguage (ISO 639-1 code) Default 'en'.
* textResult (boolean, default false) If true and keepTextfile true, return
* string of text content. If keepTextfile
* is false, text content is returned without
* regard to this option. Otherwise, return
* id of textfile.
*
* @param {blob} pdfFile Blob containing pdf file
* @param {object} options (Optional) Object specifying handling details
*
* @returns {string} id of text file (default) or text content
*/
function pdfToText ( pdfFile, options ) {
// Ensure Advanced Drive Service is enabled
try {
Drive.Files.list();
}
catch (e) {
throw new Error( "To use pdfToText(), first enable 'Drive API' in Resources > Advanced Google Services." );
}
// Set default options
options = options || {};
options.keepTextfile = options.hasOwnProperty("keepTextfile") ? options.keepTextfile : true;
// Prepare resource object for file creation
var parents = [];
if (options.path) {
parents.push( getDriveFolderFromPath (options.path) );
}
var pdfName = pdfFile.getName();
var resource = {
title: pdfName,
mimeType: pdfFile.getContentType(),
parents: parents
};
// Save PDF to Drive, if requested
if (options.keepPdf) {
var file = Drive.Files.insert(resource, pdfFile);
}
// Save PDF as GDOC
resource.title = pdfName.replace(/pdf$/, 'gdoc');
var insertOpts = {
ocr: true,
ocrLanguage: options.ocrLanguage || 'en'
}
var gdocFile = Drive.Files.insert(resource, pdfFile, insertOpts);
// Get text from GDOC
var gdocDoc = DocumentApp.openById(gdocFile.id);
var text = gdocDoc.getBody().getText();
// We're done using the Gdoc. Unless requested to keepGdoc, delete it.
if (!options.keepGdoc) {
Drive.Files.remove(gdocFile.id);
}
// Save text file, if requested
if (options.keepTextfile) {
resource.title = pdfName.replace(/pdf$/, 'txt');
resource.mimeType = MimeType.PLAIN_TEXT;
var textBlob = Utilities.newBlob(text, MimeType.PLAIN_TEXT, resource.title);
var textFile = Drive.Files.insert(resource, textBlob);
}
// Return result of conversion
if (!options.keepTextfile || options.textResult) {
return text;
}
else {
return textFile.id
}
}
The conversion to DriveApp is helped with this utility from Bruce McPherson:
// From: http://ramblings.mcpher.com/Home/excelquirks/gooscript/driveapppathfolder
function getDriveFolderFromPath (path) {
return (path || "/").split("/").reduce ( function(prev,current) {
if (prev && current) {
var fldrs = prev.getFoldersByName(current);
return fldrs.hasNext() ? fldrs.next() : null;
}
else {
return current ? null : prev;
}
},DriveApp.getRootFolder());
}

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK ImageMagick won't work for this. Could iText do this? I have never used it and it's pretty complex, so I would need some hints.
EDIT:
The marked answer led me to the solution. For completeness, here is my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
// pages are 1-based, so include page n as well
for (int i = 1; i <= n; i++) {
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found Open Document. Adding additional page to Document...");
// skip the page only when blank-page removal is on and the page is blank
if (!removeBlankPages || !isBlankPage(reader, i)) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}
You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action — 2nd Edition the code samples from which are online and searchable by keyword from here.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF into single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple the loop index from the page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.
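The grouping logic is independent of the PDF library, so here is a small sketch of it in Python, working on a list of already extracted page texts. The name group_pages_by_header and the sample texts are made up for illustration; in a real tool the texts would come from the text extractor, and each returned group would be copied into its own output file.

```python
import re
from typing import List


def group_pages_by_header(page_texts: List[str], header_regex: str) -> List[List[int]]:
    """Group 1-based page numbers into records that each start at a header page."""
    pattern = re.compile(header_regex)
    records: List[List[int]] = []
    for page_number, text in enumerate(page_texts, start=1):
        if pattern.search(text):
            # Header found: this page opens a new record
            records.append([page_number])
        elif records:
            # No header: the page continues the current record
            records[-1].append(page_number)
        # Pages before the first header belong to no record and are skipped
    return records


if __name__ == "__main__":
    pages = ["RECORD A ...", "continued", "RECORD B ...", "RECORD C ...", "continued"]
    print(group_pages_by_header(pages, r"^RECORD "))
```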
Apache PDFBox has a PDFSplit utility that you can run from the command-line.
If you like Python, there's a nice library: PyPDF2. The library is pure Python, with a BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print(input1.getDocumentInfo())
print(input1.getNumPages())
text = input1.getPage(0).extractText()
# characters that windows-1250 cannot represent are written as backslash escapes
print(text.encode("windows-1250", errors='backslashreplace'))
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()
For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.

Attachments not showing up in pdf document - created using pdfbox

I'm trying to attach an swf file to a pdf document. Below is my code (excerpted from the pdfbox-examples). While I can tell from the size of the file, with and without the attachment, that the file is attached, I can't see or locate it in the pdf document. The textual content is displayed correctly. Can someone tell me what I'm doing wrong and help me fix the issue?
doc = new PDDocument();
PDPage page = new PDPage();
doc.addPage( page );
PDFont font = PDType1Font.HELVETICA_BOLD;
String inputFileName = "sample.swf";
InputStream fileInputStream = new FileInputStream(new File(inputFileName));
PDEmbeddedFile ef = new PDEmbeddedFile(doc, fileInputStream );
PDPageContentStream contentStream = new PDPageContentStream(doc, page,true,true);
//embedded files are stored in a named tree
PDEmbeddedFilesNameTreeNode efTree = new PDEmbeddedFilesNameTreeNode();
//first create the file specification, which holds the embedded file
PDComplexFileSpecification fs = new PDComplexFileSpecification();
fs.setEmbeddedFile(ef);
//now let's set some of the optional parameters
ef.setSubtype( "swf" );
ef.setCreationDate( new GregorianCalendar() );
//now add the entry to the embedded file tree and set in the document.
Map<String, COSObjectable> efMap = new HashMap<String, COSObjectable>();
efMap.put("My first attachment", fs );
efTree.setNames( efMap );
//attachments are stored as part of the "names" dictionary in the document catalog
PDDocumentNameDictionary names = new PDDocumentNameDictionary( doc.getDocumentCatalog() );
names.setEmbeddedFiles( efTree );
doc.getDocumentCatalog().setNames( names );
After struggling with the same thing, I've discovered this is a known issue. Attachments haven't worked for a while I guess.
Here's a link to the issue on the apache forum.
There is a hack suggested here that you can use.
I tried it and it worked!
The other workaround I found is, after you call setNames on your PDEmbeddedFilesNameTreeNode, to remove the limits:
((COSDictionary) efTree.getCOSObject()).removeItem(COSName.LIMITS);
Ugly hack, but it works, without having to recompile pdfbox.
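Some background on why removing the limits can help: as far as the PDF specification (ISO 32000-1, name trees) goes, a /Limits entry is required on intermediate and leaf nodes but is not allowed on the root node. Assuming the problem is that the root node written by the older PDFBox carries /Limits, the hack simply strips it. The sketch below is a plain-Python model of that structure (dicts only, hypothetical function name, no PDFBox calls), showing a root without Limits whose single leaf kid holds the sorted names and their limits.

```python
def build_embedded_files_tree(entries):
    """Model a PDF name tree as plain dicts: a root with one leaf kid.

    entries maps attachment names to file specifications (any values here).
    """
    if not entries:
        return {"Names": []}  # an empty root may hold Names directly
    keys = sorted(entries)    # name tree keys must be in sorted order
    names = []
    for key in keys:
        names.extend([key, entries[key]])  # Names is a flat [key, value, ...] array
    leaf = {"Names": names, "Limits": [keys[0], keys[-1]]}
    return {"Kids": [leaf]}   # note: no "Limits" entry on the root


if __name__ == "__main__":
    tree = build_embedded_files_tree({"My first attachment": "fs"})
    print("Limits" in tree)           # the root carries no Limits
    print(tree["Kids"][0]["Limits"])  # the leaf does
```

This is only an illustration of the layout a viewer expects, not a description of what any particular PDFBox version writes.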
Attachments work fine with the newer PDFBox 2.0:
public static boolean addAtachement(final String fileName, final String... attachements) {
if (Objects.isNull(fileName)) {
throw new NullPointerException("fileName shouldn't be null");
}
if (Objects.isNull(attachements)) {
throw new NullPointerException("attachements shouldn't be null");
}
Map<String, PDComplexFileSpecification> efMap = new HashMap<String, PDComplexFileSpecification>();
/*
* Load PDF Document.
*/
try (PDDocument doc = PDDocument.load(new File(fileName))) {
/*
* Attachments are stored as part of the "names" dictionary in the
* document catalog
*/
PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
/*
* First we need to get all the existed attachments, after that we
* can add new attachments
*/
PDEmbeddedFilesNameTreeNode efTree = names.getEmbeddedFiles();
if (Objects.isNull(efTree)) {
efTree = new PDEmbeddedFilesNameTreeNode();
}
/*
* Copy the existing entries into efMap so they are kept when
* setNames(efMap) is called below; getNames() may return null
*/
Map<String, PDComplexFileSpecification> existedNames = efTree.getNames();
if (existedNames != null) {
efMap.putAll(existedNames);
}
for (String attachement : attachements) {
/*
* Create the file specification, which holds the embedded file
*/
PDComplexFileSpecification fs = new PDComplexFileSpecification();
fs.setFile(attachement);
try (InputStream is = new FileInputStream(attachement)) {
/*
* This represents an embedded file in a file specification
*/
PDEmbeddedFile ef = new PDEmbeddedFile(doc, is);
/* Set some relevant properties of embedded file */
ef.setCreationDate(new GregorianCalendar());
fs.setEmbeddedFile(ef);
/*
* now add the entry to the embedded file tree and set in
* the document.
*/
efMap.put(attachement, fs);
}
}
efTree.setNames(efMap);
names.setEmbeddedFiles(efTree);
doc.getDocumentCatalog().setNames(names);
doc.save(fileName);
return true;
} catch (IOException e) {
System.out.println(e.getMessage());
return false;
}
}
An attached file leaves no visible trace on the pages themselves (such as an annotation), so you can't 'locate' it by flipping through the PDF.
In Acrobat Reader 9.x for example, you have to click on the "View Attachments" icon (looking like a paper-clip) on the left sidebar.

How to automate Photoshop?

I am trying to automate the process of scanning/cropping photos in Photoshop. I need to scan 3 photos at a time, then use Photoshop's Crop and Straighten Photos command, which creates 3 separate images. After that I'd like to save each of the newly created images as a PNG.
I looked at the JSX scripts and they seem to hold a lot of promise. Is what I described possible to automate in Photoshop using JavaScript or VBScript or whatever?
I just found this script, which did the job for me! It automatically crops and straightens the photos and saves each result to a directory you specify.
http://www.tranberry.com/photoshop/photoshop_scripting/PS4GeeksOrlando/IntroScripts/cropAndStraightenBatch.jsx
Save it locally, then run it in Photoshop via File > Scripts > Browse.
P.S. According to a comment in the script, it can also be executed directly by double-clicking it in Mac Finder or Windows Explorer.
Backup gist for the script here
I actually got the answer on the Photoshop forums over at Adobe. It turns out that Photoshop CS4 is totally scriptable via JavaScript and VBScript, and comes with a really kick-ass developer IDE that has everything you'd expect (debugger, watch window, color coding and more). I was totally impressed.
Following is an extract for reference:
you can run the following script that will create a new folder off the existing one and batch split all the files naming them existingFileName#001.png and put them in the new folder (edited)
#target Photoshop
app.bringToFront();
var inFolder = Folder.selectDialog("Please select folder to process");
if(inFolder != null){
var fileList = inFolder.getFiles(/\.(jpg|tif|psd)$/i);
var outfolder = new Folder(decodeURI(inFolder) + "/Edited");
if (outfolder.exists == false) outfolder.create();
for(var a = 0 ;a < fileList.length; a++){
if(fileList[a] instanceof File){
var doc= open(fileList[a]);
doc.flatten();
var docname = fileList[a].name.slice(0,-4);
CropStraighten();
doc.close(SaveOptions.DONOTSAVECHANGES);
var count = 1;
while(app.documents.length){
var saveFile = new File(decodeURI(outfolder) + "/" + docname +"#"+ zeroPad(count,3) + ".png");
SavePNG(saveFile);
activeDocument.close(SaveOptions.DONOTSAVECHANGES) ;
count++;
}
}
}
};
function CropStraighten() {
function cTID(s) { return app.charIDToTypeID(s); };
function sTID(s) { return app.stringIDToTypeID(s); };
executeAction( sTID('CropPhotosAuto0001'), undefined, DialogModes.NO );
};
function SavePNG(saveFile){
var pngSaveOptions = new PNGSaveOptions();
pngSaveOptions.embedColorProfile = true;
pngSaveOptions.formatOptions = FormatOptions.STANDARDBASELINE;
pngSaveOptions.matte = MatteType.NONE;
pngSaveOptions.quality = 1;
pngSaveOptions.PNG8 = false; //24 bit PNG
pngSaveOptions.transparency = true;
activeDocument.saveAs(saveFile, pngSaveOptions, true, Extension.LOWERCASE);
}
function zeroPad(n, s) {
n = n.toString();
while (n.length < s) n = '0' + n;
return n;
};
Visit here for complete post.
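One detail of the script worth noting: fileList[a].name.slice(0,-4) assumes a three-letter extension. That is safe for the jpg, tif and psd files the script filters for, but it would cut the base name wrongly if the filter were widened to, say, jpeg or tiff. A minimal sketch of the naming scheme in Python (used here only to illustrate the logic, since the Photoshop script itself is JavaScript), with the hypothetical helper `split_output_name`:

```python
import os


def split_output_name(source_name, count, width=3):
    """Build 'base#001.png' from a source file name and a 1-based counter."""
    # splitext removes the extension regardless of its length
    base, _ext = os.path.splitext(os.path.basename(source_name))
    # zfill plays the role of zeroPad() in the script
    return "{0}#{1}.png".format(base, str(count).zfill(width))


if __name__ == "__main__":
    print(split_output_name("scan.jpg", 1))
    print(split_output_name("scan.jpeg", 12))
```

The same idea in the JSX script would be to strip everything after the last dot instead of a fixed four characters.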
Have you tried using Photoshop Actions? I don't know about the scanning part, but the rest can all be done by actions quite easily.