PDFBox: Facing an issue while extracting a certain image from the top of each page

Recently I asked THIS QUESTION about saving all the images present in a PDF file to the file system, and I was able to save the images successfully.
I tested my code on a lot of PDF files and it ran just fine. But today I came across THIS PDF file, from which it is not able to extract some images (attached below).
Can anyone please tell me what else I can do to extract these images? Is it even possible to extract them? Are they really images or something else? I would really appreciate the help.
My code (please ignore the hardcoding, as I am still testing this out):
function fn_getAllImages()
{
    var strPdf = "C:\\Users\\a614923\\Desktop\\haka\\Work\\2017\\10. October\\31\\test.PDF";
    var strout = "C:\\Users\\a614923\\Desktop\\haka\\Work\\2017\\10. October\\31\\Newfolder\\img";
    var intPage = 2; // for the 2nd page (the image is present in the 2nd page)
    var objPdf = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(strPdf);
    var objPage = objPdf.getDocumentCatalog().getAllPages().get(intPage-1);
    var objImages = objPage.getResources().getXObjects().values().toArray();
    var objImage, objImgBuffer, objImageFile;
    for (var i = 0; i < objImages.length; i++)
    {
        objImage = objImages.items(i);
        Log.Message(objImage.toString());
        // aqString.Find returns -1 when the substring is not found
        if (aqString.Find(objImage.toString(), "PDXObjectForm", 0, false) != -1)
        {
            continue; // skip form XObjects, only image XObjects are written out
        }
        else
        {
            objImage.write2file_2(strout+i);
            //objImgBuffer = objImage.getRGBImage();
            //objImageFile = JavaClasses.java_io.File.newInstance(strout+i+".png");
            //JavaClasses.javax_imageio.ImageIO.write(objImgBuffer,"png",objImageFile);
        }
    }
}
The image in the PDF file which I want to save(the one inside the red box below):

Related

Extract Text from Multipage Attachment PDF Using Google Apps Script

I have a Gmail attachment PDF with multiple scanned pages. When I use Google Apps Script to save the blob from the attachment to a Drive file, open the PDF manually from Google Drive, then select Open With Google Docs, all of the text from the PDF is displayed as a Google Doc. However, when I save the blob as a Google Doc with OCR, only the text from the image on the first page is saved to a Doc, accessed either manually or by code.
The code to get the blob and process it is:
function getAttachments(desiredLabel, processedLabel, emailQuery){
  // Find emails
  var threads = GmailApp.search(emailQuery);
  if(threads.length > 0){
    // Iterate through the emails
    for(var i in threads){
      var mesgs = threads[i].getMessages();
      for(var j in mesgs){
        var processingMesg = mesgs[j];
        var attachments = processingMesg.getAttachments();
        var processedAttachments = 0;
        // Iterate through attachments
        for(var k in attachments){
          var attachment = attachments[k];
          var attachmentName = attachment.getName();
          var attachmentType = attachment.getContentType();
          // Process PDFs
          if (attachmentType.includes('pdf')) {
            processedAttachments += 1;
            var pdfBlob = attachment.copyBlob();
            var filename = attachmentName + " " + processedAttachments;
            processPDF(pdfBlob, filename);
          }
        }
      }
    }
  }
}
function processPDF(pdfBlob, filename){
  // Saves the blob as a PDF.
  // All pages are displayed if I click on it from Google Drive after running this script.
  let pdfFile = DriveApp.createFile(pdfBlob);
  pdfFile.setName(filename);
  // Saves the blob as an OCRed Doc.
  let resources = {
    title: filename,
    mimeType: "application/pdf"
  };
  let options = {
    ocr: true,
    ocrLanguage: "en"
  };
  let file = Drive.Files.insert(resources, pdfBlob, options);
  let fileID = file.getId();
  // Open the file to get the text.
  // Only the text of the image on the first page is available in the Doc.
  let doc = DocumentApp.openById(fileID);
  let docText = doc.getBody().getText();
}
If I try to use Google Docs to read the PDF without OCR directly, I get Exception: Invalid argument, for example:
DocumentApp.openById(pdfFile.getId());
How do I get the text from all of the pages of the PDF?
DocumentApp.openById is a method that can only be used for Google Docs documents.
pdfFile can only be "opened" with DriveApp: DriveApp.getFileById(pdfFile.getId());
Opening a file with DriveApp lets you use the File methods on it (for example getBlob() and getName()).
When it comes to OCR conversion, your code works correctly for me and converts all pages of a PDF document to Google Docs, so the source of your error is likely the attachment itself or the way you retrieve the blob.
Mind that OCR conversion is not good at preserving formatting, so a two-page PDF might be collapsed into a one-page Doc, depending on the formatting of the PDF.

creating a PDF via google apps script but creates the PDF with an additional blank page

I have an Apps Script bound to a spreadsheet that creates a PDF file from the sheet. It creates a single-page PDF and saves it in a folder in Drive. Up until recently, this worked perfectly. Now every time I run the code it does what it is supposed to, but the file has a second page that is blank. When I create the PDF manually via File / Download as / PDF document, it creates the PDF as it should, with only one page. I have tried this with both the original and the copy that the script temporarily creates. Both work when done manually. I am looking for some suggestions on what could have gone wrong and what to change. Here is an example of the code:
function makePDF() {
  var sheet1 = SpreadsheetApp.getActive().getSheetByName('eTimesheet');
  var sheet2 = SpreadsheetApp.getActive().getSheetByName('Time Sheet');
  var sheet3 = SpreadsheetApp.getActive().getSheetByName('Data');
  var triggercell3 = sheet1.getRange('M33').getValue();
  if (triggercell3 == 'GO'){
    var techNumber = sheet3.getRange('B5').getValue();
    var date = sheet3.getRange('B3').getValue();
    var fileID = sheet3.getRange('B7').getValue();
    var pdfName = "TimeSheet- "+ techNumber + " " + date;
    var ss = SpreadsheetApp.getActive();
    var folder = DriveApp.getFolderById(fileID);
    sheet2.showSheet();
    sheet1.hideSheet();
    //Copy whole spreadsheet
    var destSpreadsheet = SpreadsheetApp.open(DriveApp.getFileById(ss.getId()).makeCopy("tmp_convert_to_pdf", folder));
    //save to pdf
    var theBlob = destSpreadsheet.getBlob().getAs('application/pdf').setName(pdfName);
    var newFile = folder.createFile(theBlob);
    DriveApp.getFileById(destSpreadsheet.getId()).setTrashed(true);
    sheet1.showSheet();
    sheet2.hideSheet();
    sheet1.getRange('M33').clearContent();
  }
}
thanks for any assistance...
Without an example, first question I have is... if you delete a few rows from the sheet and run the script, are you back to 1 page?
I am asking in case the issue is just margin settings. If this is the issue, maybe you can adjust rows or use the UrlFetchApp.fetch approach as PDF page formatting can be specified (eg: margin size).
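The UrlFetchApp.fetch approach relies on the spreadsheet's export URL. Below is a minimal sketch of building such a URL with explicit margins. The query parameter names (size, portrait, fitw, gridlines, top_margin and so on) are not officially documented and are assumptions based on common usage, and buildPdfExportUrl is a hypothetical helper name, so treat this as a starting point rather than a guaranteed API.

```javascript
// Hypothetical helper: builds a Sheets "export as PDF" URL with page settings.
// ASSUMPTION: the query parameters below are undocumented and may change.
function buildPdfExportUrl(spreadsheetId, sheetId, marginInches) {
  var params = {
    format: 'pdf',
    gid: sheetId,            // id of the single sheet to export
    size: 'letter',          // paper size
    portrait: 'true',        // page orientation
    fitw: 'true',            // fit the sheet to the page width
    gridlines: 'false',
    top_margin: marginInches,
    bottom_margin: marginInches,
    left_margin: marginInches,
    right_margin: marginInches
  };
  // Join the parameters in insertion order into a query string
  var query = Object.keys(params).map(function (key) {
    return key + '=' + encodeURIComponent(params[key]);
  }).join('&');
  return 'https://docs.google.com/spreadsheets/d/' + spreadsheetId + '/export?' + query;
}
```

In Apps Script, the resulting URL would typically be fetched with UrlFetchApp.fetch(url, {headers: {Authorization: 'Bearer ' + ScriptApp.getOAuthToken()}}), and the response's getBlob() passed to folder.createFile().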

Mass export all images as individual JPEGs in InDesign?

I am new to InDesign. I have a file that contains images with Photoshop clipping paths. I want to export all the clipped images to a folder. I have tried "Copy Links To" and it successfully exported the original images. However, I do not want the original images but the clipped images instead. Is there a way for me to export all the clipped images as JPEG and not the original linked images? In short, I want to export the images without their background. I hope I'm making sense. I have about 800-1000 images, so a batch processing method would be highly appreciated.
I found this script in one of the posts here and modified it a bit to suit my needs. It appears to work in most of my INDD documents, but it fails in others. I wonder why. I sometimes get this error message:
Error string: null is not an object
Source: fileName = File ( rect.graphics[0].itemLink.filePath ).name;
I also noticed that it skips some objects and won't download all of the images. I guess it skips those that are not in rectangles.
test();
function test()
{
    var myDoc = app.activeDocument,
        apis = myDoc.allPageItems, rect, fileName;
    while ( rect = apis.pop() )
    {
        if ( !(rect instanceof Rectangle) || !rect.graphics[0].isValid ){ continue;}
        fileName = File ( rect.graphics[0].itemLink.filePath ).name;
        fileName = fileName.replace( /\.[a-z]{2,4}$/i, '.jpg' );
        app.jpegExportPreferences.exportResolution = 2400;
        app.jpegExportPreferences.jpegQuality = JPEGOptionsQuality.MAXIMUM;
        //give it a unique name
        var myFile = new File ("C:/Users/RANFacistol-Mata/Desktop/Image Trial/"+ fileName);
        rect.exportFile(ExportFormat.JPG, myFile);
    }
}
Is there a way for me to modify this script so that, instead of iterating through all the rectangles, I iterate through all of the objects, much like clicking this next button,
and then check whether each object contains an image (jpg, tiff, psd, ai, eps)? If it does, I will export it as scripted above.
Thank you for your help!
You can traverse the links present inside the document with the following snippet; this will take less time than the snippet above.
You can also get the type of each link ('eps', 'pdf', etc.) from the linkType attribute and its file path from the filePath attribute of each link object.
var theDoc = app.documents.item(0);
var theLinkLen = theDoc.links.length;
for(var i = 0; i < theLinkLen; ++i)
{
    var link = theDoc.links.item(i);
    alert("link name \"" + link.name + "\"" + " has type \"" + link.linkType + "\""+ " with filePath \"" + link.filePath + "\"");
}
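To decide which links are worth exporting, the link name can be checked against the image types listed in the question. This is a minimal sketch in plain JavaScript; isImageFileName and toJpegName are hypothetical helper names, and the check assumes the file extension reflects the actual file type. It avoids Array.prototype.indexOf so it should also run in older script engines.

```javascript
// Extensions treated as images (taken from the list in the question)
var IMAGE_EXTENSIONS = { jpg: true, jpeg: true, tif: true, tiff: true, psd: true, ai: true, eps: true };

// Returns true when the file name ends in one of the image extensions
function isImageFileName(fileName) {
  var match = /\.([a-z0-9]+)$/i.exec(fileName);
  if (!match) {
    return false; // no extension at all
  }
  return IMAGE_EXTENSIONS.hasOwnProperty(match[1].toLowerCase());
}

// Replaces the last extension with .jpg, e.g. for the export file name
function toJpegName(fileName) {
  return fileName.replace(/\.[a-z0-9]+$/i, '.jpg');
}
```

Inside the loop above, a call such as isImageFileName(link.name) could gate the export, with toJpegName(link.name) supplying the output file name.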

How to Download PDF Links in Column and Save to Common Folder

We have a column that contains links to PDFs that starts on line 4 (e.g B4:B). I am trying to find a way to automatically download the PDF files that are accessed via the links to a folder on Drive. This is what I have so far:
function savePDFs() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  for (var i = 3; i < data.length; i++) {
    Logger.log(data[i][1]);
  }
}
Presumably the above code writes the links in column B (index value of [1]), starting on row 4 (i value of 3), i.e. B4, down to the bottom of the data set (i.e. data.length).
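The indexing can be checked outside Apps Script with a plain 2D array shaped like the result of getValues(). This is a minimal sketch; collectLinks is a hypothetical helper, and it assumes that a usable link is any cell value starting with http:// or https://.

```javascript
// Collects link-like values from one column of a 2D array (as returned by getValues()).
// firstRowIndex and columnIndex are zero-based: row 4 -> 3, column B -> 1.
function collectLinks(data, firstRowIndex, columnIndex) {
  var links = [];
  for (var i = firstRowIndex; i < data.length; i++) {
    var value = String(data[i][columnIndex] || '').trim();
    // ASSUMPTION: only http(s) URLs count as links; empty cells are skipped
    if (/^https?:\/\//i.test(value)) {
      links.push(value);
    }
  }
  return links;
}
```

With the sheet from the question, collectLinks(data, 3, 1) would return the links from B4 downwards.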
I'm now confused about how to access the PDF links that are written to the logger and save the files to a folder.
Would someone be willing to help me out? I'm currently having to go to each link, click Save Link As... and then navigate to the folder that I'd like to save the linked PDF to. My hope is to modify the above process using code.
Update: I found this bit of code here that may help me out. Note, I changed the PDF link to a currently valid PDF link.
var urlOfThePdf = 'http://download.p4c.philips.com/l4b/9/929000277411_eu/929000277411_eu_pss_aenaa.pdf'; // an example of online pdf file
var folderName = 'GAS'; // an example of folder name
function saveInDriveFolder(){
  // DocsList has been shut down; DriveApp is its replacement
  var folder = DriveApp.getFoldersByName(folderName).next(); // get the first folder with that name
  var file = UrlFetchApp.fetch(urlOfThePdf); // get the file content as blob
  folder.createFile(file); // create the file directly in the folder
}
Okay, I'm going to go and noodle with the data that is in the logger to confirm that the data is in properly formatted PDF links, then I'm going to test this new bit of code out. I feel like I'm getting close.
You can't force a file download from an Apps Script; you would have to try that from HtmlService, and I'm not sure it will work.
For your need, I would recommend creating a dedicated folder, adding all the PDFs to it, and using the download function of the Drive interface to download all the files in one click.
In Drive, a file can be put in several folders, so the PDF files stay in the original folder, but you create a new folder, 'PDF for download' for example, and put them in it. To do that from the Drive interface, press Shift+Z when the file(s) are selected.
For your current list of files, you just have to call an add-to-folder function in your loop. You can use this function.
function addFileToFolder(id){
  var folderPDF = DriveApp.getFolderById("Id OFFolder to put pdf");
  var file = DriveApp.getFileById(id);
  folderPDF.addFile(file);
}
EDIT: This function walks the list of URLs, fetches each file and saves a copy in a dedicated folder on the user's Drive.
function downloadInDriveFolder(){
  var folderID = 'Id of the folder'; // put id of the folder
  var folder = DriveApp.getFolderById(folderID); // get the folder
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  for (var i = 3; i < data.length; i++) {
    // getBlob() returns a Blob that createFile() accepts; getContent() returns a raw byte array
    var blob = UrlFetchApp.fetch(data[i][1]).getBlob();
    var pdf = DriveApp.createFile(blob);
    pdf.setName(data[i][0]); // use the value in column A as the file name
    folder.addFile(pdf);
  }
}
Well I figured it out. I was expecting more code, but this does it for me:
function listPDFs() {
  var out = new Array();
  var row = 3; //row index of 0 = row 1
  var column = 4; // column index of 0 = column A
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  var folder = DriveApp.getFolderById("this is where you paste your folder id"); // destination folder (this is the 0978SDFSDFKJHSDF078Y98hkyo looking value when you right click your folder and select "Get Link")
  for (var i=row ; i<data.length ; i++) {
    if(data[i][column] !== "") {
      var file = UrlFetchApp.fetch(data[i][column]);
      folder.createFile(file);
    }
  }
  return;
}
As you can see, I included a row and column variable so that I could easily change these.
I haven't figured out how to assemble them into a merged PDF, but I did figure out that I could sort them by date (which places the top most item first) and then right click and select "Open With...PDF Mergy", which then moves the PDFs into PDF Mergy and merges them up in the correct order. You can find PDF Mergy in the Chrome App Store. If I figure out how to automatically call PDF Mergy from GAS, I'll post that up--but for the time being the above code has saved us a ton of time...so I'm calling it good enough for the time being.

Hyperlink Detection from PDF

I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (maybe third party) to extract the hyperlink meta information from the PDF, such as coordinates, link type and destination address? Any help is highly appreciated.
I have already tried iText and PDFBox, but with no major success; even some third-party software does not give me the desired output.
I have tried the following code in Java using iText:
PdfReader myReader = new PdfReader("pdf File Path");
PdfDictionary pageDict = myReader.getPageN(1);
PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
System.out.println(annots);
ArrayList<String> dests = new ArrayList<String>();
if(annots != null)
{
    for(int i=0; i<annots.size(); ++i)
    {
        PdfDictionary annotDict = annots.getAsDict(i);
        PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
        if (subType != null && PdfName.LINK.equals(subType))
        {
            PdfDictionary action = annotDict.getAsDict(PdfName.A);
            if(action != null && PdfName.URI.equals(action.getAsName(PdfName.S)))
            {
                dests.add(action.getAsString(PdfName.URI).toString());
            } // else { its an internal link }
        }
    }
}
System.out.println(dests);
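Once the URI string and the annotation's /Rect array (four numbers giving two diagonally opposite corners, normally lower-left and upper-right, in page coordinates) have been read, the link type and coordinates asked for can be derived with simple logic. This is a minimal, language-neutral sketch written in plain JavaScript; describeLink is a hypothetical helper, and it assumes the /Rect values have already been converted to numbers.

```javascript
// Turns a link URI and a PDF /Rect array into a plain description object.
// rect is [x1, y1, x2, y2]; the corners are normalized in case they are swapped.
function describeLink(uri, rect) {
  var isMailto = /^mailto:/i.test(uri);
  var left = Math.min(rect[0], rect[2]);
  var bottom = Math.min(rect[1], rect[3]);
  return {
    type: isMailto ? 'mailto' : 'url',
    // for mailto links, the destination is the address without the scheme
    destination: isMailto ? uri.replace(/^mailto:/i, '') : uri,
    x: left,
    y: bottom,
    width: Math.abs(rect[2] - rect[0]),
    height: Math.abs(rect[3] - rect[1])
  };
}
```

Note that PDF page coordinates have their origin at the bottom-left of the page, so y here is measured from the bottom edge.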
You can use Docotic.Pdf library for links extraction (disclaimer: I work for the company).
Below is the code that opens the specified file, finds all hyperlinks, collects information about the position of each link and draws a rectangle around each link.
After that the code creates new PDF (with links in rectangles) and a text file with collected information. In the end, both created files are opened in default viewers.
public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
{
    using (PdfDocument doc = new PdfDocument(inputFile))
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < doc.Pages.Count; i++)
        {
            PdfPage page = doc.Pages[i];
            foreach (PdfWidget widget in page.Widgets)
            {
                PdfActionArea actionArea = widget as PdfActionArea;
                if (actionArea == null)
                    continue;
                PdfUriAction linkAction = actionArea.Action as PdfUriAction;
                if (linkAction == null)
                    continue;
                Uri url = linkAction.Uri;
                PdfRectangle rect = actionArea.BoundingBox;
                // add information about found link into string buffer
                sb.Append("Page ");
                sb.Append(i.ToString());
                sb.Append(" : ");
                sb.Append(rect.ToString());
                sb.Append(" ");
                sb.AppendLine(url.ToString());
                // draw rectangle around found link
                page.Canvas.DrawRectangle(rect);
            }
        }
        // save document with highlighted links and text information about links to files
        doc.Save(outputFile);
        System.IO.File.WriteAllText(outputTxt, sb.ToString());
        // open created PDF and text file in default viewers
        System.Diagnostics.Process.Start(outputTxt);
        System.Diagnostics.Process.Start(outputFile);
    }
}
You can use the sample code with a call like this:
ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");
If your PDFs are copy protected, you need to start with step 1; if they're free to copy, you can start with step 2.
Step 1: convert your PDFs into Word .doc files using Adobe Acrobat Pro or an online PDF-to-Word converter:
http://www.pdfonline.com/pdf2word/index.asp
Step 2: copy-paste the whole document into the input window here (you can also download the lightweight HTML tool):
http://www.surf7.net/services/value-added-services/free-web-tools/email-extractor-lite/
Select 'url' as 'Type of address to extract', select your separator, hit extract and that's it.
Hope it works, cheers.
One possibility would be to use a custom JavaScript in Acrobat that enumerates the "words" on the page and then reads out their quads. From that you get the coordinates to create a link (or to compare with the links on the page), as well as the actual text (that's the "word(s)").
If it is "only" to set the border of the existing links, you can also use another Acrobat JavaScript that enumerates the links of the document and sets their border color property (you may need to set the width as well).
(If you prefer "buy" over "make", feel free to contact me in private; such things are part of my standard "repertoire".)