Extract Text from Multipage Attachment PDF Using Google Apps Script

I have a Gmail attachment PDF with multiple scanned pages. When I use Google Apps Script to save the blob from the attachment to a Drive file, open the PDF manually from Google Drive, then select Open With Google Docs, all of the text from the PDF is displayed as a Google Doc. However, when I save the blob as a Google Doc with OCR, only the text from the image on the first page is saved to a Doc, accessed either manually or by code.
The code to get the blob and process it is:
function getAttachments(desiredLabel, processedLabel, emailQuery){
  // Find emails
  var threads = GmailApp.search(emailQuery);
  if(threads.length > 0){
    // Iterate through the emails
    for(var i in threads){
      var mesgs = threads[i].getMessages();
      for(var j in mesgs){
        var processingMesg = mesgs[j];
        var attachments = processingMesg.getAttachments();
        var processedAttachments = 0;
        // Iterate through attachments
        for(var k in attachments){
          var attachment = attachments[k];
          var attachmentName = attachment.getName();
          var attachmentType = attachment.getContentType();
          // Process PDFs
          if (attachmentType.includes('pdf')) {
            processedAttachments += 1;
            var pdfBlob = attachment.copyBlob();
            var filename = attachmentName + " " + processedAttachments;
            processPDF(pdfBlob, filename);
          }
        }
      }
    }
  }
}
function processPDF(pdfBlob, filename){
  // Saves the blob as a PDF.
  // All pages are displayed if I click on it from Google Drive after running this script.
  let pdfFile = DriveApp.createFile(pdfBlob);
  pdfFile.setName(filename);
  // Saves the blob as an OCRed Doc.
  let resources = {
    title: filename,
    mimeType: "application/pdf"
  };
  let options = {
    ocr: true,
    ocrLanguage: "en"
  };
  let file = Drive.Files.insert(resources, pdfBlob, options);
  let fileID = file.getId();
  // Open the file to get the text.
  // Only the text of the image on the first page is available in the Doc.
  let doc = DocumentApp.openById(fileID);
  let docText = doc.getBody().getText();
}
If I try to use Google Docs to read the PDF without OCR directly, I get Exception: Invalid argument, for example:
DocumentApp.openById(pdfFile.getId());
How do I get the text from all of the pages of the PDF?

DocumentApp.openById only works on Google Docs documents, which is why calling it with the ID of a PDF throws Exception: Invalid argument.
pdfFile can only be "opened" with DriveApp: DriveApp.getFileById(pdfFile.getId());
Opening a file with DriveApp gives you a File object, so you can call the Drive File methods on it (for example getBlob()).
As for the OCR conversion, your code works correctly for me and converts all pages of a PDF document to Google Docs, so the source of your error is likely the attachment itself or the way you retrieve the blob.
Keep in mind that OCR conversion is not good at preserving formatting, so a two-page PDF might be collapsed into a one-page Doc, depending on the formatting of the PDF.
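To illustrate the first point, here is a minimal sketch of a guard that checks the file's MIME type before handing it to DocumentApp. The helper names are hypothetical, and DocumentApp is passed in as a parameter only so the logic can be exercised outside Apps Script.

```javascript
// MIME type Drive reports for native Google Docs files.
var GOOGLE_DOC_MIME = 'application/vnd.google-apps.document';

// True only for files that DocumentApp.openById can open.
function canOpenWithDocumentApp(mimeType) {
  return mimeType === GOOGLE_DOC_MIME;
}

// Hypothetical helper: returns the Doc's text, or null for any other file type.
// `file` is a DriveApp File; `documentApp` is DocumentApp in Apps Script.
function readDocTextOrNull(file, documentApp) {
  if (!canOpenWithDocumentApp(file.getMimeType())) {
    return null; // e.g. a PDF: convert it with OCR first
  }
  return documentApp.openById(file.getId()).getBody().getText();
}
```

In Apps Script this would be called as readDocTextOrNull(DriveApp.getFileById(id), DocumentApp).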

Related

Only part of a PDF is converting

I have a PDF I am trying to extract the text from.
To do this, I have tried to get the contents into a Google Doc.
The PDF has 1180 pages (3MB) but only the first 77 pages are being converted to text.
I have tried Drive.Files.insert and Drive.Files.copy, but get the same result.
I also tried to convert the PDF using MS Word and referencing that file (2.5MB) - with the same result.
I cannot see anything in either the PDF or Word that would indicate an "end of file" that would stop the rest of the document converting. There are no error messages - just 6.5% of what I need. I can only assume it was originally smaller PDFs that were merged.
Is there something else I should be looking at? Has anyone encountered this before?
I can manipulate the PDFtext string to get the data I need, but can't convert more than the first 77 pages.
This is what I am using to get the text string I require.
function txtPDF() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sht = ss.getSheetByName('Sheet1');
  var mycell = sht.getRange('B1');
  var myPdfID = mycell.getValue().toString();
  var PDFblob = DriveApp.getFileById(myPdfID).getBlob();
  var resource = {
    title: PDFblob.getName(),
    // mimeType: PDFblob.getContentType()
  };
  // var tmpfile = Drive.Files.insert(resource, PDFblob, {ocr: true, ocrLanguage: "en"});
  var tmpfile = Drive.Files.copy(resource, myPdfID, {convert: true, ocr: true, ocrLanguage: "en"});
  var doc = DocumentApp.openById(tmpfile.id);
  // var doc = Drive.Files.copy({}, 'WordFileID', {'convert': true});
  // var doc = DocumentApp.openById('WordFileID');
  var PDFtext = doc.getBody().getText();
  // Drive.Files.remove(doc.getId());
};
It looks like I fell foul of the Google Drive limits.
Documents can be up to 50MB - which was not an issue.
There is however a limit of 1.02 million characters. The 1180 pages exceeded this, so I guess I was lucky to get anything returned at all.
Maximum file sizes on Google Drive
According to the documentation, PDF-to-Docs conversion is limited to files of 2 MB or less. To convert larger files properly, look for alternatives outside of Google.
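If the character limit is the cause, one rough way to spot a truncated conversion is to check how close the extracted text is to that limit. This is only a sketch: the 1.02 million figure is the one quoted above, the helper name and the default margin are hypothetical, and it is an assumption that the conversion stops near the limit rather than somewhere earlier.

```javascript
// Character limit for a Google Doc, as quoted in the answer above.
var DOC_CHAR_LIMIT = 1020000;

// Hypothetical check: true when the text length is within `margin`
// characters of the limit, which suggests the conversion was cut off.
function isNearDocCharLimit(text, margin) {
  var slack = (margin === undefined) ? 1000 : margin;
  return text.length >= DOC_CHAR_LIMIT - slack;
}
```

In the question's script this could be called as isNearDocCharLimit(PDFtext) right after getText().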

How to convert PDF in google drive to Text by google script

I am working on code to convert a PDF in Google Drive to text. I can convert a PDF with the code below. However, when I specify a PDF in Google Drive by URL (e.g. var url = "https://drive.google.com/open?id=1gs-WvPPPPPPP0-iaawwadafa--";), the code doesn't work.
var url = "https://www.test.com/sites/g/files/xyz123.pdf";
// It works fine. And, if changes to google drive URL, it doesn't work.
var blob = UrlFetchApp.fetch(url).getBlob();
// No error showed up at getBlob.
Logger.log(blob) // shows "Blob" after changes to google drive URL
Logger.log(blob.getName()) // shows "open.html" after changes to google drive URL
Logger.log(blob.getContentType()) // shows "text/html" after changes to google drive URL
var resource = {
  title: blob.getName(),
  mimeType: blob.getContentType()
};
var file = Drive.Files.insert(resource, blob, {ocr: true, ocrLanguage: "en"});
// At above line, Error message "OCR is not supported for files of type text/html (line 64, file "Code")" showed.
var doc = DocumentApp.openById(file.id);
var text = doc.getBody().getText();
return text;
I would like to convert the PDF files in a Google Drive folder to text one by one.
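The "open.html" and "text/html" values logged above suggest that fetching the Drive link returns a web page rather than the PDF itself. One possible workaround, sketched here under that assumption, is to pull the file ID out of the link and read the blob through DriveApp instead of UrlFetchApp. The helper name and the two URL shapes it recognizes are assumptions, not an exhaustive list.

```javascript
// Hypothetical helper: extracts the file ID from common Drive link shapes,
// e.g. ".../open?id=<ID>" or ".../file/d/<ID>/view". Returns null if none match.
function extractDriveFileId(url) {
  var byQuery = /[?&]id=([\w-]+)/.exec(url);
  if (byQuery) {
    return byQuery[1];
  }
  var byPath = /\/d\/([\w-]+)/.exec(url);
  return byPath ? byPath[1] : null;
}

// In Apps Script, the blob would then come from Drive rather than from a fetch:
//   var id = extractDriveFileId(url);
//   var blob = DriveApp.getFileById(id).getBlob();
```

The resulting blob can be passed to Drive.Files.insert exactly as in the snippet above. To process a whole folder, iterating over DriveApp.getFolderById(folderId).getFiles() yields File objects directly, so no URL parsing is needed.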

Apps Script save as pdf doesn't include drawings and images

I want to save a Google Doc file as a pdf in the same Google Drive folder as my current file. I know I can download the file as a pdf, but then I have to upload it into the same Google Drive folder. I am trying to skip the upload step.
I have created a script to accomplish all of this, but I cannot get the images and drawings to be included in the resulting pdf.
Here is my code:
function onOpen() {
  // Add a custom menu to the document.
  var ui = DocumentApp.getUi();
  var menu = ui.createAddonMenu();
  menu.addItem('Save As PDF','saveToPDF')
    .addToUi();
}
function saveToPDF(){
  var currentDocument = DocumentApp.getActiveDocument();
  var parentFolder = DriveApp.getFileById(currentDocument.getId()).getParents();
  var folderId = parentFolder.next().getId();
  var currentFolder = DriveApp.getFolderById(folderId);
  var pdf = currentDocument.getAs('application/PDF');
  pdf.setName(currentDocument.getName() + ".pdf");
  // Check if the file already exists and add a datecode if it does
  var hasFile = DriveApp.getFilesByName(pdf.getName());
  if(hasFile.hasNext()){
    var d = new Date();
    // getFullYear() gives the four-digit year; getYear() is deprecated
    var dateCode = d.getFullYear() + "" + ("0" + (d.getMonth() + 1)).slice(-2) + "" + ("0" + (d.getDate())).slice(-2);
    pdf.setName(currentDocument.getName() + "_" + dateCode +".pdf");
  }
  // Create the file (puts it in the root folder)
  var file = DriveApp.createFile(pdf);
  // Add to source document original folder
  currentFolder.addFile(file);
  // Remove the new file from the root folder
  DriveApp.getRootFolder().removeFile(file);
}
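A side note on the date code in saveToPDF: Date.getYear() is deprecated and, per the ECMAScript specification, returns the year minus 1900, so a YYYYMMDD code is safer built on getFullYear(). A minimal sketch, with hypothetical helper names:

```javascript
// Left-pads a one-digit number with a zero, e.g. 5 -> "05".
function pad2(n) {
  return ('0' + n).slice(-2);
}

// Builds a YYYYMMDD string from a Date, using local time.
function formatDateCode(d) {
  return d.getFullYear() + pad2(d.getMonth() + 1) + pad2(d.getDate());
}
```

In saveToPDF, the dateCode line could then become var dateCode = formatDateCode(new Date());.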
Is there another way to create the pdf, save to the current Google Drive folder, and not lose the images?
UPDATE
I just tested and realized that even if I export as a pdf, the images and drawings aren't included. There has to be a way to do this.
UPDATE 2
I have been testing some more and have learned a few things:
Images in the header/footer are included if they are In line, but if I use Wrap text or Break text they are not.
Images in the body can be any of the three
However, if I use the "Project Proposal" template, they include an image in the footer with Break text and it exports to pdf. I can't tell why their image is any different.
I don't want to use In line because I want the image to touch both sides of the page and In line will always leave at least 1 pixel to the left of the image.

Get text from PDF in Google

I have a PDF document that is saved in Google Drive. I can use the Google Drive Web UI search to find text in the document.
How can I programmatically extract a portion of the text in the document using Google Apps Script?
See pdfToText() in this gist.
To invoke the OCR built in to Google Drive on a PDF file, e.g. myPDF.pdf, here is what you do:
function myFunction() {
  var pdfFile = DriveApp.getFilesByName("myPDF.pdf").next();
  var blob = pdfFile.getBlob();
  // Get the text from pdf
  var filetext = pdfToText( blob, {keepTextfile: false} );
  // Now do whatever you want with filetext...
}
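pdfToText() is defined in the gist, not in Apps Script itself. For orientation only, here is a rough sketch of what an OCR helper of this kind might do; it is not the gist's code. The function name and the keepDoc option are hypothetical, it assumes the Advanced Drive Service (Drive API v2) is enabled, and the services are passed in as parameters so the logic can be tested outside Apps Script.

```javascript
// Hypothetical stand-in for a pdfToText-style helper (not the gist's code).
// `drive` is the Advanced Drive Service (v2); `documentApp` is DocumentApp.
function ocrPdfToText(blob, drive, documentApp, options) {
  var opts = options || {};
  var resource = { title: blob.getName() };
  // Upload the PDF and ask Drive to OCR it into a Google Doc.
  var docFile = drive.Files.insert(resource, blob, {
    ocr: true,
    ocrLanguage: opts.ocrLanguage || 'en'
  });
  // Read the recognized text back out of the converted Doc.
  var text = documentApp.openById(docFile.id).getBody().getText();
  // Unless asked to keep it, delete the temporary Doc again.
  if (!opts.keepDoc) {
    drive.Files.remove(docFile.id);
  }
  return text;
}
```

In Apps Script this would be called as ocrPdfToText(blob, Drive, DocumentApp).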

How to Download PDF Links in Column and Save to Common Folder

We have a column that contains links to PDFs that starts on line 4 (e.g B4:B). I am trying to find a way to automatically download the PDF files that are accessed via the links to a folder on Drive. This is what I have so far:
function savePDFs() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  for (var i = 3; i < data.length; i++) {
    Logger.log(data[i][1]);
  }
}
Presumably the above code writes the links in column B (index [1]) to the logger, starting at row 4 (i = 3), i.e. B4, and continuing to the bottom of the data set (data.length).
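The index arithmetic can be checked in isolation with a small pure function over the array that getValues() returns. This is a sketch with a hypothetical helper name; it also skips empty cells, which the loop above does not.

```javascript
// Collects non-empty cell values from one column of a 2D values array.
// firstRowIndex 3 means sheet row 4; columnIndex 1 means column B.
function collectLinks(values, firstRowIndex, columnIndex) {
  var links = [];
  for (var i = firstRowIndex; i < values.length; i++) {
    var cell = values[i][columnIndex];
    if (cell !== '' && cell != null) {
      links.push(String(cell));
    }
  }
  return links;
}
```

With the sheet above, collectLinks(sheet.getDataRange().getValues(), 3, 1) would return the links from B4 downwards.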
I'm now confused about how to access the PDF links that are written to the logger and save the linked files to a folder.
Would someone be willing to help me out? I'm currently having to go to each link, click Save Link As... and then navigate to the folder that I'd like to save the linked PDF to. My hope is to modify the above process using code.
Update: I found this bit of code here that may help me out. Note, I changed the PDF link to a currently valid PDF link.
var urlOfThePdf = 'http://download.p4c.philips.com/l4b/9/929000277411_eu/929000277411_eu_pss_aenaa.pdf';// an example of online pdf file
var folderName = 'GAS';// an example of folder name
function saveInDriveFolder(){
  // DocsList has been retired; DriveApp looks the folder up by name instead
  var folder = DriveApp.getFoldersByName(folderName).next();// get the folder
  var file = UrlFetchApp.fetch(urlOfThePdf).getBlob(); // get the file content as blob
  folder.createFile(file);//create the file directly in the folder
}
Okay, I'm going to go and noodle with the data that is in the logger to confirm that the data is in properly formatted PDF links, then I'm going to test this new bit of code out. I feel like I'm getting close.
You can't force a file download from an Apps Script; you would have to try that from HtmlService, and I'm not sure it would work.
For your need, I would recommend creating a dedicated folder, adding all the PDFs to it, and then using the download function of the Drive interface to download all the files in one click.
In Drive, a file can be in several folders, so the PDF files stay in their original folder while you create a new folder, 'PDF for download' for example, and add them to it. To do that from the Drive interface, press Shift+Z while the file(s) are selected.
For your current list of files, you just have to call an add-to-folder function in your loop. You can use this function.
function addFileToFolder(id){
  var folderPDF = DriveApp.getFolderById("Id OFFolder to put pdf");
  var file = DriveApp.getFileById(id);
  folderPDF.addFile(file);
}
EDIT: This function goes through the list of URLs, fetches each file, and saves a copy in a dedicated folder on the user's Drive.
function downloadInDriveFolder(){
  var folderID = 'Id of the folder';// put id of the folder
  var folder = DriveApp.getFolderById(folderID);// get the folder
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  for (var i = 3; i < data.length; i++) {
    // getBlob() returns a Blob, which is what DriveApp.createFile() expects
    var blob = UrlFetchApp.fetch(data[i][1]).getBlob();
    var pdf = DriveApp.createFile(blob);
    pdf.setName(data[i][0]);//Put as name of the file the value in col A
    folder.addFile(pdf);
  }
}
Well I figured it out. I was expecting more code, but this does it for me:
function listPDFs() {
  var out = new Array();
  var row = 3; //row index of 0 = row 1
  var column = 4; // column index of 0 = column A
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  var folder = DriveApp.getFolderById("this is where you paste your folder id"); // destination folder (this is the 0978SDFSDFKJHSDF078Y98hkyo looking value when you right click your folder and select "Get Link")
  for (var i=row ; i<data.length ; i++) {
    if(data[i][column] !== "") {
      var file = UrlFetchApp.fetch(data[i][column]);
      folder.createFile(file);
    }
  }
  return
}
As you can see, I included a row and column variable so that I could easily change these.
I haven't figured out how to assemble them into a merged PDF, but I did figure out that I could sort them by date (which places the top most item first) and then right click and select "Open With...PDF Mergy", which then moves the PDFs into PDF Mergy and merges them up in the correct order. You can find PDF Mergy in the Chrome App Store. If I figure out how to automatically call PDF Mergy from GAS, I'll post that up--but for the time being the above code has saved us a ton of time...so I'm calling it good enough for the time being.