Identify and extract or delete pages of a PDF based on a search string / text (action / javascript)

Identify and extract or delete pages of a PDF based on a search string / text (action / javascript) - pdf

Good Evening (UK)
I'm trying to filter down a 1500+ page PDF file to only the pages which include a certain text string (typically one or two words). My laptop is locked down with respect to installing more software BUT I have used action(script)s quite a bit
I get the error below when I try to install this action into Abobe Acrobat X Pro (Win 7):
screen dump of error
called "Extract Commented Pages"... supposed to be OK for X and XI this looks like what I want.....
I wondered if there was something simple causing the problem but the actionscript file is rather... busy to say the least.
I used to have an action that I think was based on a legal redaction script but it is filed somewhere!
If you have already got an action that does this or a version of the above that doesn't give the error I get (unable to import the Action.... The file is either invalid or corrupt) I will forever by indebted to your gratitude
Many thanks, have a good weekend!

I recently came across a script found at the following link: http://forums.adobe.com/thread/1077118
I'm having some issues getting the script to run in Acrobat, despite everything looking alright in the script itself. I'll update if I find any errors.
Here is a copy of the script:
// Set the word to search for here
var sWord = "forms";
// Source document = current document
var sd = this;
var nWords, currWord, fp, fpa = [], nd;
var fn = sd.documentFileName.replace(/\.pdf$/i, "");
// Loop through the pages
for (var i = 0; i < sd.numPages; i += 1) {
// Get the number of words on the page
nWords = sd.getPageNumWords(i);
// Loop through the words on the page
for (var j = 0; j < nWords; j += 1) {
// Get the current word
currWord = sd.getPageNthWord(i, j);
if (currWord === sWord) {
// Extract the current page to a new file
fp = fn + "_" + i + ".pdf";
fpa.push(fp);
sd.extractPages({nStart: i, nEnd: i, cPath: fp});
// Stop searching this page
break;
}
}
}
// Combine the individual pages into one PDF
if (fpa.length) {
// Open the document that's the first extracted page
nd = app.openDoc({cPath: fpa[0], oDoc: sd});
// Append any other pages that were extracted
if (fpa.length > 1) {
for (var i = 1; i < fpa.length; i += 1) {
nd.insertPages({nPage: i - 1, cPath: fpa[i], nStart: 0, nEnd: 0});
}
}
// Save to a new document and close this one
nd.saveAs({cPath: fn + "_searched.pdf"});
nd.closeDoc({bNoSave: true});
}

Related

How to exclude certain images from autosave in Gatan Digital Micrograph (GMS) in DM-script

I am trying to mimic the autosave function in GMS v3 so that I can use in version 1 and 2. I would like to first acknowledge that the main bulk of the script originates from Dr Bernhard Schaffer's "How to script... Digital Micrograph Scripting Handbook". I have modified it a bit, so that any new image recorded by the camera can be autosave into the file. However, I met some problems because if I decide to click on live-view image and move the image around, or using live-fft, the live view image or the FFT image will be saved as well. One of the ideas I have is to use the taggroup information such as the "Acquisition:Parameters:Parameter Set Name" because for live view or live-FFT, this would be either in search or focus mode. Another idea is to use the document ID e.g iDocID = idoc.ImageDocumentGETID() to locate the ID of the live image. However, i am clueless then how to use this information to exclude them from autosaving. Can anyone point to me how i can proceed with this script?
Below is the script
Class PeriodicAutoSave : Object
{
Number output
PeriodicAutoSave(Object self) Result("\n Object ID"+self.ScriptObjectGetID()+" created.")
~PeriodicAutoSave(Object self) Result("\n Object ID"+self.ScriptObjectGetID()+" destroyed")
Void Init2(Object self, Number op)
output=op
Void AutoSave_SaveAll(Object self)
{
String path, name, targettype, targettype1, area, mag, mode, search, result1
ImageDocument idoc
Number nr_idoc, count, index_i, index, iDocID, iDocID_search
path = "c:\\path\\"
name = "test"
targettype=="Gatan Format (*.dm4)"
targettype1 = "dm4"
If (output) Result("\n AutoSave...")
nr_idoc = CountImageDocuments()
For (count = 1; count<nr_idoc; count++)
{
idoc = GetImageDocument(count) //imagedocument
index = 1 // user decide the index to start with
index_i= nr_idoc - index
If (idoc.ImageDocumentIsDirty())
{
idoc = getfrontimagedocument()
iDocID = idoc.ImageDocumentGetID()
TagGroup tg = ImageGetTagGroup(idoc.ImageDocumentGetImage(0)) // cannot declare an 'img' for this line as it will prompt an error?
tg.TagGroupGetTagAsString("Microscope Info:Formatted Indicated Mag", mag)
Try{
{
idoc.ImageDocumentSavetoFile( "Gatan Format", path+index_i+"-"+name+"-"+mag+".dm4")
idoc.ImageDocumentSetName(index_i + "-"+name+"-"+mag+".dm4")
idoc.ImageDocumentClean()
}
If (Output) Result("\n\t saving: "+idoc.ImageDocumentGetCurrentFile())
}
Catch{
Result("\n image cannot be saved at the moment:" + GetExceptionString())
Break
}
Result("\ Continue autosave...")
}
}
}
}
Void LaunchAutoSave()
{
Object obj = Alloc(PeriodicAutoSave)
obj.Init2(2)
Number task_id = obj.AddMainThreadPeriodicTask("AutoSave_SaveALL",6)
//Sleep(10)
while(!shiftdown()) 1==2
RemoveMainThreadTask(task_id)
}
LaunchAutoSave()

thank you very much for your pointers! I have tried and it works very well with my script. as the 'TagGroupDoesTagExist' only refers to the taggroup, I modified further to include the tags I want to filter e.g "Search" or "Focus" and it seems to work well. The script that I modified to your existing ones is as below :
If (idoc.ImageDocumentIsDirty())
{
//now find out what is a filter condition and skip if it is true
skip = 0
TagGroup tg = idoc.ImageDocumentGetImage(0).ImageGetTagGroup()
tg.TagGroupGetTagAsString("Microscope Info:Formatted Indicated Mag", mag)
tg.TagGroupGetTagAsString("Acquisition:Parameters:Parameter Set Name", mode)
skip = tg.TagGroupDoesTagExist("Acquisition:Parameters:Parameter Set Name")
if(skip && (mode == "Search" || mode== "Focus")) continue

Your idea of filtering is a good one, but there is something strange with your for loop.
in
nr_idoc = CountImageDocuments()
For (count = 1; count<nr_idoc; count++)
{
idoc = GetImageDocument(count) //imagedocument
you iterate over all currently open imageDocuments (except the first one!?) and get them one by one, but then in
If (idoc.ImageDocumentIsDirty())
{
idoc = getfrontimagedocument()
you actually get the front-most (selected) document instead each time. Why are you doing this?
Why not go with:
number nr_idoc = CountImageDocuments()
for (number count = 0; count<nr_idoc; count++)
{
imagedocument idoc = GetImageDocument(count)
If (idoc.ImageDocumentIsDirty())
{
// now find out what is a filter condition and skip if it is true
number skip = 0
TagGroup tg = idoc.ImageDocumentGetImage(0).ImageGetTagGroup()
skip = tg.TagGroupDoesTagExist("Microscope Info:Formatted Indicated Mag")
if (skip) continue
// do saving
}
}

Automating Photoshop to edit en rename +600 files with the name of the folder

I have +600 product images on my mac already cut out and catalogued in their own folder. They are all PSD's and I need a script that will do the following.
Grab the name of the folder
Grab all the PSD's in said folder
Combine them in one big PSD in the right order (the filenames are saved sequentially as 1843, 1845, 1846 so they need to open in that order)
save that PSD
save the separate layers as PNG with the name from the folder + _1, _2, _3
I have previous experience in Bash (former Linux user) and tried for hours in Automator but to no success.

Welcome to Stack Overflow. The quick answer is yes this is possible to do via scripting. I might even suggest breaking down into two scripts, one to grab and save the PSDs and the second to save out the layers.
It's not very clear about "combining" the PSDs or about "separate layers, only I don't know if they are different canvas sizes, where you want each PSD to be positioned (x, y offsets & layering) Remember none of use have your files infront of us to refer from.
In short, if you write out pseudo code of what is it you expect your code to do it makes it easier to answer your question.
Here's a few code snippets to get you started:
This will open a folder and retrieve alls the PSDs as an array:
// get all the files to process
var folderIn = Folder.selectDialog("Please select folder to process");
if (folderIn != null)
{
var tempFileList = folderIn.getFiles();
}
var fileList = new Array(); // real list to hold images, not folders
for (var i = 0; i < tempFileList.length; i++)
{
// get the psd extension
var ext = tempFileList[i].toString();
ext = ext.substring(ext.lastIndexOf("."), ext.length);
if (tempFileList[i] instanceof File)
{
if (ext == ".psd") fileList.push (tempFileList[i]);
// else (alert("Ignoring " + tempFileList[i]))
}
}
alert("Files:\n" + fileList.length);
You can save a png with this
function save_png(afilePath)
{
// Save as a png
var pngFile = new File(afilePath);
pngSaveOptions = new PNGSaveOptions();
pngSaveOptions.embedColorProfile = true;
pngSaveOptions.formatOptions = FormatOptions.STANDARDBASELINE;
pngSaveOptions.matte = MatteType.NONE; pngSaveOptions.quality = 1;
activeDocument.saveAs(pngFile, pngSaveOptions, false, Extension.LOWERCASE);
}
To open a psd just use
app.open(fileRef);
To save it
function savePSD(afilePath)
{
// save out psd
var psdFile = new File(afilePath);
psdSaveOptions = new PhotoshopSaveOptions();
psdSaveOptions.embedColorProfile = true;
psdSaveOptions.alphaChannels = true;
activeDocument.saveAs(psdFile, psdSaveOptions, false, Extension.LOWERCASE);
}

How to name tabs in pdf export to xls

I have a PDF form that has many form fields. My requirement is to export the form data (some of it anyway) into an excel (xls) format on a network drive that is being picked up and used by another process (that I do not have access or code to change) to load into a database using Acrobat Javascript when a button on the PDF is clicked. As this is a distributed form and the network drive is common, it did not make sense to setup ODBC connections to the database nor did I have access to do so.
The issues I am having are:
I need to specifically name the Worksheet so that the process correctly processes the xls file.
I need to save the information to the network drive without prompting the user, which the code below does.
The fieldNames seem to be losing spaces when they export to xls.
So far nothing I tried has worked nor do any of the references I have gone though provide such information. I can push data into a .csv and .txt files and have tried creating a new Report something like the following:
var rep = new Report();
var values = "";
...
rep.writeText(values);
var docRep = rep.open("myreport.pdf");
docRep.saveAs("/c/temp/Upload.txt","com.adobe.acrobat.plain-text")
docRep.closeDoc(true);
app.alert("Upload File Saved", 3);
but it only allows the .txt extension not the xls or csv extension. I managed to export the csv in another way.
Below is a small snippet of my code:
var fieldNames = [];
var result ="";
fieldNames.push("Inn Code");
fieldValues.push('"' + this.getField("Hotel Info Inn Code").value + '"');
fieldNames.push("Business Unit");
fieldValues.push('"' + this.getField("Hotel Info Business Unit").value + '"');
for ( var i=0; i < fieldNames.length; i++ ) {
result += (fieldNames[i] + "\t");
}
for ( var i=0; i < fieldValues.length; i++ ) {
result += (fieldValues[i] + "\t");
}
this.createDataObject({cName: "Upload.xls", cValue: cMyC});
this.exportDataObject({cName: "Upload.xls", nLaunch: 0});
Any help or suggestions provided would be greatly appreciated!

Update (For anyone else who encounters this need).
I determined that the tab name using the method above is the same as the save name. The other issue I encountered was that the xls file was really a csv with an xls extension. To resolve this I decided to export an csv I am formatting or a text file and an external process to reformat the file as an xls.
var fieldNames = [];
var fieldValues = [];
var result = '';
// FIELD VALUES
fieldNames.push('"Column Name 1"');
fieldValues.push('\r\n' + '"' + this.getField("Field Name 1").value + '"');
fieldNames.push('"Column Name 2"');
fieldValues.push('"' + this.getField("Field Name 2").value + '"');
for ( var i=0; i < fieldNames.length; i++ ) {
if (i != fieldNames.length-1){
result += (fieldNames[i] + ",");
} else {
result += (fieldNames[i]);
}
}
for ( var i=0; i < fieldValues.length; i++ ) {
if (i != fieldValues.length-1){
result += (fieldValues[i] + ",");
} else {
result += (fieldValues[i]);
}
}
I used the following to output a csv file:
this.createDataObject('UploadFile.csv', result);
this.exportDataObject({ cName:'UploadFile.csv', nLaunch:'0'});
The issue is that this prompts the user where to save the file and I wanted a silent specific location so I wound up doing the following in place of the create and export object above to output a txt file silently without prompting the user:
var rep = new Report();
rep.writeText(result);
var docRep = rep.open("myreport.pdf");
docRep.saveAs("/c/temp/UploadFile.txt","com.adobe.acrobat.plain-text")
docRep.closeDoc(true);
You will have to go into Edit > Preferences > Security (Enhanced) then choose the output folder (C:\Temp) in my case to output silently if you chose to employ this method.

Split PDF into separate files based on text

I have a large single pdf document which consists of multiple records. Each record usually takes one page however some use 2 pages. A record starts with a defined text, always the same.
My goal is to split this pdf into separate pdfs and the split should happen always before the "header text" is found.
Note: I am looking for a tool or library using java or python. Must be free and available on Win 7.
Any ideas? AFAIK imagemagick won't work for this. May itext do this? I never used and it's
pretty complex so would need some hints.
EDIT:
Marked Answer led me to solution. For completeness here my exact implementation:
public void splitByRegex(String filePath, String regex,
String destinationDirectory, boolean removeBlankPages) throws IOException,
DocumentException {
logger.entry(filePath, regex, destinationDirectory);
destinationDirectory = destinationDirectory == null ? "" : destinationDirectory;
PdfReader reader = null;
Document document = null;
PdfCopy copy = null;
Pattern pattern = Pattern.compile(regex);
try {
reader = new PdfReader(filePath);
final String RESULT = destinationDirectory + "/record%d.pdf";
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 1; i < n; i++) {
final String text = PdfTextExtractor.getTextFromPage(reader, i);
if (pattern.matcher(text).find()) {
if (document != null && document.isOpen()) {
logger.debug("Match found. Closing previous Document..");
document.close();
}
String fileName = String.format(RESULT, i);
logger.debug("Match found. Creating new Document " + fileName + "...");
document = new Document();
copy = new PdfCopy(document,
new FileOutputStream(fileName));
document.open();
logger.debug("Adding page to Document...");
copy.addPage(copy.getImportedPage(reader, i));
} else if (document != null && document.isOpen()) {
logger.debug("Found Open Document. Adding additonal page to Document...");
if (removeBlankPages && !isBlankPage(reader, i)){
copy.addPage(copy.getImportedPage(reader, i));
}
}
}
logger.exit();
} finally {
if (document != null && document.isOpen()) {
document.close();
}
if (reader != null) {
reader.close();
}
}
}
private boolean isBlankPage(PdfReader reader, int pageNumber)
throws IOException {
// see http://itext-general.2136553.n4.nabble.com/Detecting-blank-pages-td2144877.html
PdfDictionary pageDict = reader.getPageN(pageNumber);
// We need to examine the resource dictionary for /Font or
// /XObject keys. If either are present, they're almost
// certainly actually used on the page -> not blank.
PdfDictionary resDict = (PdfDictionary) pageDict.get(PdfName.RESOURCES);
if (resDict != null) {
return resDict.get(PdfName.FONT) == null
&& resDict.get(PdfName.XOBJECT) == null;
} else {
return true;
}
}

You can create a tool for your requirements using iText.
Whenever you are looking for code samples concerning (current versions of) the iText library, you should consult iText in Action — 2nd Edition the code samples from which are online and searchable by keyword from here.
In your case the relevant samples are Burst.java and ExtractPageContentSorted2.java.
Burst.java shows how to split one PDF in multiple smaller PDFs. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
final String RESULT = "record%d.pdf";
// We'll create as many new PDFs as there are pages
Document document;
PdfCopy copy;
// loop over all the pages in the original PDF
int n = reader.getNumberOfPages();
for (int i = 0; i < n; ) {
// step 1
document = new Document();
// step 2
copy = new PdfCopy(document,
new FileOutputStream(String.format(RESULT, ++i)));
// step 3
document.open();
// step 4
copy.addPage(copy.getImportedPage(reader, i));
// step 5
document.close();
}
reader.close();
This sample splits a PDF in single-page PDFs. In your case you need to split by different criteria. But that only means that in the loop you sometimes have to add more than one imported page (and thus decouple loop index and page numbers to import).
To recognize on which pages a new dataset starts, be inspired by ExtractPageContentSorted2.java. This sample shows how to parse the text content of a page to a string. The central code:
PdfReader reader = new PdfReader("allrecords.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
System.out.println("\nPage " + i);
System.out.println(PdfTextExtractor.getTextFromPage(reader, i));
}
reader.close();
Simply search for the record start text: If the text from page contains it, a new record starts there.

Apache PDFBox has a PDFSplit utility that you can run from the command-line.

If you like Python, there's a nice library: PyPDF2. The library is pure python2, BSD-like license.
Sample code:
from PyPDF2 import PdfFileWriter, PdfFileReader
input1 = PdfFileReader(open("C:\\Users\\Jarek\\Documents\\x.pdf", "rb"))
# analyze pdf data
print input1.getDocumentInfo()
print input1.getNumPages()
text = input1.getPage(0).extractText()
print text.encode("windows-1250", errors='backslashreplacee')
# create output document
output = PdfFileWriter()
output.addPage(input1.getPage(0))
fout = open("c:\\temp\\1\\y.pdf", "wb")
output.write(fout)
fout.close()

For non coders PDF Content Split is probably the easiest way without reinventing the wheel and has an easy to use interface: http://www.traction-software.co.uk/pdfcontentsplitsa/index.html
hope that helps.

How to automate Photoshop?

I am trying to automate the process of scanning/cropping photos in Photoshop. I need to scan 3 photos at a time, then use Photoshop's Crop and Straighten Photos command, which creates 3 separate images. After that I'd like to save each of the newly created images as a PNG.
I looked at the JSX scripts and they seem to a lot of promise. Is what I described possible to automate in Photoshop using JavaScript or VBScript or whatever?

I just found this script just did the work for me! It automatically crop & straighten the photo and save each result to directory you specified.
http://www.tranberry.com/photoshop/photoshop_scripting/PS4GeeksOrlando/IntroScripts/cropAndStraightenBatch.jsx
Save it to local then run it in the PS=>File=>Command=>Browse
P.S I found in the comment it said the script can be executed directly by double clicking from Mac Finder or Windows Explorer.
Backup gist for the script here

I actually got the answer on the Photoshop forums over at adobe. It turns out that Photoshop CS4 is totally scriptable via JavaScript, VBScript and comes with a really kick-ass Developer IDE, that has everything you'd expect (debugger, watch window, color coding and more). I was totally impressed.
Following is an extract for reference:
you can run the following script that will create a new folder off the existing one and batch split all the files naming them existingFileName#001.png and put them in the new folder (edited)
#target Photoshop
app.bringToFront;
var inFolder = Folder.selectDialog("Please select folder to process");
if(inFolder != null){
var fileList = inFolder.getFiles(/\.(jpg|tif|psd|)$/i);
var outfolder = new Folder(decodeURI(inFolder) + "/Edited");
if (outfolder.exists == false) outfolder.create();
for(var a = 0 ;a < fileList.length; a++){
if(fileList[a] instanceof File){
var doc= open(fileList[a]);
doc.flatten();
var docname = fileList[a].name.slice(0,-4);
CropStraighten();
doc.close(SaveOptions.DONOTSAVECHANGES);
var count = 1;
while(app.documents.length){
var saveFile = new File(decodeURI(outfolder) + "/" + docname +"#"+ zeroPad(count,3) + ".png");
SavePNG(saveFile);
activeDocument.close(SaveOptions.DONOTSAVECHANGES) ;
count++;
}
}
}
};
function CropStraighten() {
function cTID(s) { return app.charIDToTypeID(s); };
function sTID(s) { return app.stringIDToTypeID(s); };
executeAction( sTID('CropPhotosAuto0001'), undefined, DialogModes.NO );
};
function SavePNG(saveFile){
pngSaveOptions = new PNGSaveOptions();
pngSaveOptions.embedColorProfile = true;
pngSaveOptions.formatOptions = FormatOptions.STANDARDBASELINE;
pngSaveOptions.matte = MatteType.NONE;
pngSaveOptions.quality = 1;
pngSaveOptions.PNG8 = false; //24 bit PNG
pngSaveOptions.transparency = true;
activeDocument.saveAs(saveFile, pngSaveOptions, true, Extension.LOWERCASE);
}
function zeroPad(n, s) {
n = n.toString();
while (n.length < s) n = '0' + n;
return n;
};
Visit here for complete post.

Have you tried using Photoshop Actions? I don't now about the scanning part, but the rest can all be done by actions quite easily.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Identify and extract or delete pages of a PDF based on a search string / text (action / javascript) - pdf

Related

How to exclude certain images from autosave in Gatan Digital Micrograph (GMS) in DM-script

Automating Photoshop to edit en rename +600 files with the name of the folder

How to name tabs in pdf export to xls

Split PDF into separate files based on text

How to automate Photoshop?

Categories

Resources