How to convert a PDF in Google Drive to text with Google Apps Script - pdf

I am working on code that converts a PDF in Google Drive to text. The code below converts a PDF fetched from an ordinary URL. However, when I point it at a PDF in Google Drive by URL (e.g. var url = "https://drive.google.com/open?id=1gs-WvPPPPPPP0-iaawwadafa--";), the code doesn't work.
var url = "https://www.test.com/sites/g/files/xyz123.pdf";
// It works fine. And, if changes to google drive URL, it doesn't work.
var blob = UrlFetchApp.fetch(url).getBlob();
// No error showed up at getBlob.
Logger.log(blob) // shows "Blob" after changes to google drive URL
Logger.log(blob.getName()) // shows "open.html" after changes to google drive URL
Logger.log(blob.getContentType()) // shows "text/html" after changes to google drive URL
var resource = {
title: blob.getName(),
mimeType: blob.getContentType()
};
var file = Drive.Files.insert(resource, blob, {ocr: true, ocrLanguage: "en"});
// At above line, Error message "OCR is not supported for files of type text/html (line 64, file "Code")" showed.
var doc = DocumentApp.openById(file.id);
var text = doc.getBody().getText();
return text;
I would like to convert the PDF files in a Google Drive folder to text, one by one.
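One possible way around this, sketched under assumptions: the `open.html` / `text/html` output above suggests that `UrlFetchApp.fetch()` on a Drive "open" link returns the HTML viewer page rather than the PDF, so the sketch below pulls the file ID out of the URL and reads the blob through `DriveApp` instead. The helper names `extractDriveFileId`, `pdfBlobToText` and `convertFolderPdfsToText` are hypothetical, and the OCR step assumes the Drive advanced service (v2) is enabled, as in the question's own `Drive.Files.insert` call.

```javascript
// Hypothetical helper: pull the file ID out of a Google Drive URL.
// Handles ".../d/<id>/..." and "...?id=<id>" style links; returns null otherwise.
function extractDriveFileId(url) {
  var match = /\/d\/([A-Za-z0-9_-]+)/.exec(url) ||
              /[?&]id=([A-Za-z0-9_-]+)/.exec(url);
  return match ? match[1] : null;
}

// Convert one PDF blob to text through a temporary OCR'd Google Doc.
// Assumes the Drive advanced service (v2) is enabled in the script project.
function pdfBlobToText(blob) {
  var resource = { title: blob.getName(), mimeType: blob.getContentType() };
  var file = Drive.Files.insert(resource, blob, { ocr: true, ocrLanguage: "en" });
  var text = DocumentApp.openById(file.id).getBody().getText();
  Drive.Files.remove(file.id); // delete the temporary Google Doc
  return text;
}

// Hypothetical driver: convert every PDF in a folder, one file at a time.
// Returns an object that maps each PDF name to its extracted text.
function convertFolderPdfsToText(folderId) {
  var files = DriveApp.getFolderById(folderId).getFilesByType(MimeType.PDF);
  var results = {};
  while (files.hasNext()) {
    var pdf = files.next();
    results[pdf.getName()] = pdfBlobToText(pdf.getBlob());
  }
  return results;
}
```

For a single file given by URL, `DriveApp.getFileById(extractDriveFileId(url)).getBlob()` would replace the `UrlFetchApp.fetch(url).getBlob()` line, provided the script's user has access to the file.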

Related

Only part of a PDF is converting

I have a PDF I am trying to extract the text from.
To do this, I have tried to get the contents into a Google Doc.
The PDF has 1180 pages (3MB) but only the first 77 pages are being converted to text.
I have tried Drive.Files.insert and Drive.Files.copy, but get the same result.
I also tried to convert the PDF using MS Word and referencing that file (2.5MB) - with the same result.
I cannot see anything in either the PDF or Word that would indicate an "end of file" that would stop the rest of the document converting. There are no error messages - just 6.5% of what I need. I can only assume it was originally smaller PDFs that were merged.
Is there something else I should be looking at? Has anyone encountered this before?
I can manipulate the PDFtext string to get the data I need, but can't convert more than the first 77 pages.
This is what I am using to get the text string I require.
function txtPDF() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sht = ss.getSheetByName('Sheet1');
  var mycell = sht.getRange('B1');
  var myPdfID = mycell.getValue().toString();
  var PDFblob = DriveApp.getFileById(myPdfID).getBlob();
  var resource = {
    title: PDFblob.getName(),
    // mimeType: PDFblob.getContentType()
  };
  // var tmpfile = Drive.Files.insert(resource, PDFblob, {ocr: true, ocrLanguage: "en"});
  var tmpfile = Drive.Files.copy(resource, myPdfID, {convert: true, ocr: true, ocrLanguage: "en"});
  var doc = DocumentApp.openById(tmpfile.id);
  // var doc = Drive.Files.copy({}, 'WordFileID', {'convert': true});
  // var doc = DocumentApp.openById('WordFileID');
  var PDFtext = doc.getBody().getText();
  // Drive.Files.remove(doc.getId());
};
It looks like I fell foul of the Google Drive limits.
Documents can be up to 50MB - which was not an issue.
There is however a limit of 1.02 million characters. The 1180 pages exceeded this, so I guess I was lucky to get anything returned at all.
Maximum file sizes on Google Drive
According to the documentation, PDF-to-Docs conversion is limited to files of 2MB or less. To convert larger files properly, look for alternatives outside of Google.
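As a rough illustration of how those two limits could be checked in a script, here is a minimal sketch. The constants are simply the figures quoted in the answers above (they may have changed since), and `conversionWarnings` is a hypothetical helper name.

```javascript
// Limits quoted in the answers above; treat them as assumptions and adjust as needed.
var MAX_CONVERT_BYTES = 2 * 1024 * 1024; // "2MB or less" for PDF-to-Docs conversion
var MAX_DOC_CHARS = 1020000;             // "1.02 million characters" per document

// Hypothetical helper: list the reasons a conversion may be incomplete.
function conversionWarnings(pdfSizeBytes, extractedCharCount) {
  var warnings = [];
  if (pdfSizeBytes > MAX_CONVERT_BYTES) {
    warnings.push("PDF is larger than the quoted 2MB conversion limit");
  }
  if (extractedCharCount >= MAX_DOC_CHARS) {
    warnings.push("extracted text reached the quoted character limit and may be truncated");
  }
  return warnings;
}
```

In the question's function this could be called as `conversionWarnings(DriveApp.getFileById(myPdfID).getSize(), PDFtext.length)`, logging any warnings before the text is used.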

Extract Text from Multipage Attachment PDF Using Google Apps Script

I have a Gmail attachment PDF with multiple scanned pages. When I use Google Apps Script to save the blob from the attachment to a Drive file, open the PDF manually from Google Drive, then select Open With Google Docs, all of the text from the PDF is displayed as a Google Doc. However, when I save the blob as a Google Doc with OCR, only the text from the image on the first page is saved to a Doc, accessed either manually or by code.
The code to get the blob and process it is:
function getAttachments(desiredLabel, processedLabel, emailQuery){
  // Find emails
  var threads = GmailApp.search(emailQuery);
  if(threads.length > 0){
    // Iterate through the emails
    for(var i in threads){
      var mesgs = threads[i].getMessages();
      for(var j in mesgs){
        var processingMesg = mesgs[j];
        var attachments = processingMesg.getAttachments();
        var processedAttachments = 0;
        // Iterate through attachments
        for(var k in attachments){
          var attachment = attachments[k];
          var attachmentName = attachment.getName();
          var attachmentType = attachment.getContentType();
          // Process PDFs
          if (attachmentType.includes('pdf')) {
            processedAttachments += 1;
            var pdfBlob = attachment.copyBlob();
            var filename = attachmentName + " " + processedAttachments;
            processPDF(pdfBlob, filename);
          }
        }
      }
    }
  }
}
function processPDF(pdfBlob, filename){
  // Saves the blob as a PDF.
  // All pages are displayed if I click on it from Google Drive after running this script.
  let pdfFile = DriveApp.createFile(pdfBlob);
  pdfFile.setName(filename);
  // Saves the blob as an OCRed Doc.
  let resources = {
    title: filename,
    mimeType: "application/pdf"
  };
  let options = {
    ocr: true,
    ocrLanguage: "en"
  };
  let file = Drive.Files.insert(resources, pdfBlob, options);
  let fileID = file.getId();
  // Open the file to get the text.
  // Only the text of the image on the first page is available in the Doc.
  let doc = DocumentApp.openById(fileID);
  let docText = doc.getBody().getText();
}
If I try to use Google Docs to read the PDF without OCR directly, I get Exception: Invalid argument, for example:
DocumentApp.openById(pdfFile.getId());
How do I get the text from all of the pages of the PDF?
DocumentApp.openById is a method that can only be used for Google Docs documents.
pdfFile can only be "opened" with DriveApp: DriveApp.getFileById(pdfFile.getId());
Opening a file with DriveApp gives you a File object and lets you use its methods on the file.
When it comes to OCR conversion, your code works correctly for me and converts all pages of a PDF document to Google Docs, so the source of your error is likely the attachment itself or the way you retrieve the blob.
Mind that OCR conversion is not good at preserving formatting, so a two-page PDF might be collapsed into a one-page Doc, depending on the formatting of the PDF.
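The first point can be sketched as a guard that checks the MIME type before calling `DocumentApp.openById`. This is only an illustration: `canOpenWithDocumentApp` and `getTextIfGoogleDoc` are hypothetical names, and the MIME string is the value of `MimeType.GOOGLE_DOCS`.

```javascript
// MIME type of a native Google Docs document (the value of MimeType.GOOGLE_DOCS).
var GOOGLE_DOCS_MIME = "application/vnd.google-apps.document";

// Hypothetical helper: true only for files DocumentApp.openById() can open.
function canOpenWithDocumentApp(mimeType) {
  return mimeType === GOOGLE_DOCS_MIME;
}

// Hypothetical usage in Apps Script: return the text of a Google Doc,
// or null for anything else (such as an unconverted PDF).
function getTextIfGoogleDoc(fileId) {
  var file = DriveApp.getFileById(fileId);
  if (!canOpenWithDocumentApp(file.getMimeType())) {
    return null; // e.g. "application/pdf": convert with OCR first
  }
  return DocumentApp.openById(fileId).getBody().getText();
}
```

With such a guard, a PDF file ID yields `null` instead of the "Invalid argument" exception from the question.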

How to upload a png file to S3 Bucket?

I am trying to upload a PNG file to an S3 bucket using the S3-for-Google-Apps-Script library:
// get the image blob
const imgBlob = UrlFetchApp.fetch('imageUrl').getBlob();
// init S3 instance
const s3 = S3.getInstance(awsAccessKeyId, awsSecretKey);
// upload the image to S3 bucket
s3.putObject(bucketName, 'test.png', imgBlob, { logRequests:true });
The file uploads to S3, but it is corrupted.
If I download the image and try to open it, I get this error:
"It may be damaged or use a file format that Preview doesn’t recognize."
So, how can I upload a .png file to an Amazon S3 bucket?
The upload succeeds when I pass a base64 string to s3.putObject():
const base64 = Utilities.base64Encode(imgBlob.getBytes());
s3.putObject(bucketName, 'test.png', base64, { logRequests:true });
// go to S3 and clicking on the link I can see the base64 string
But this uploads the data as a string: when I go to S3 and click on test.png, I see something like "iVBORw0KGgoAAAANSUhEUgAAAgAAAAI ... II=". I want to see the actual image, not a string.
I understand your situation and goal as follows.
In your situation, the base64 data of the image can be uploaded, but the uploaded object is string data, not an image.
Your goal is to upload a publicly shared image to S3 as an image file, using the image URL.
If so, how about this answer?
Issue and workaround:
Looking at the script of "S3-for-Google-Apps-Script", s3.putObject() cannot use a URL directly, and the blob passed in is converted to a string with getDataAsString(). I think this is the cause of your issue, because converting binary image data to a string corrupts it.
In this answer, I propose modifying the "S3-for-Google-Apps-Script" GAS library so that the byte array is used as the payload.
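To see why a string conversion damages a PNG, here is a small demonstration in plain Node.js (not Apps Script, and not part of the library). It assumes UTF-8, which is the default encoding of `getDataAsString()`, and round-trips the 8-byte PNG signature through a string.

```javascript
// Plain Node.js demonstration (not Apps Script): the first 8 bytes of every PNG file.
var pngSignature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Decode the bytes as UTF-8 text, then encode the text back to bytes.
var asText = pngSignature.toString("utf8");
var roundTripped = Buffer.from(asText, "utf8");

// 0x89 is not valid UTF-8 on its own, so it is replaced by U+FFFD,
// which re-encodes as three bytes (EF BF BD).
console.log(pngSignature.length);               // 8
console.log(roundTripped.length);               // 10
console.log(pngSignature.equals(roundTripped)); // false
```

The same kind of substitution happens to every invalid byte sequence in the image, which is consistent with the uploaded file no longer opening as a PNG.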
Usage:
First, please copy the GAS project of S3-for-Google-Apps-Script and modify it as follows.
Modified script:
About S3.prototype.putObject in the file of S3.gs, please modify as follows.
From:
request.setContent(object.getDataAsString());
To:
request.setContent(object.getBytes());
And, about S3Request.prototype.setContent in the file of S3Request.gs, please modify as follows.
From:
if (typeof content != 'string') throw 'content must be passed as a string'
To:
// if (typeof content != 'string') throw 'content must be passed as a string'
And, about S3Request.prototype.getContentMd5_ in the file of S3Request.gs, please modify as follows.
From:
return Utilities.base64Encode(Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, this.content, Utilities.Charset.UTF_8));
To:
return Utilities.base64Encode(Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, this.content));
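The digest change matters for the same reason: an MD5 computed over the text form of the payload does not describe the raw bytes that are actually sent. The following is a plain Node.js analogue using the built-in `crypto` module, not the library's code, and `contentMd5` is a hypothetical helper name.

```javascript
var crypto = require("crypto");

// Hypothetical helper: base64-encoded MD5 of a payload, as used for Content-MD5.
function contentMd5(payload) {
  return crypto.createHash("md5").update(payload).digest("base64");
}

// The 8-byte PNG signature stands in for real image data.
var bytes = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
var md5OfBytes = contentMd5(bytes);                 // digest of the raw bytes
var md5OfText = contentMd5(bytes.toString("utf8")); // digest of the lossy text form

console.log(md5OfBytes === md5OfText); // false
```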
Sample script:
With the library modified as above, please test the following script.
const imageUrl = "###"; // Please set the image URL.
const s3 = S3.getInstance(awsAccessKeyId, awsSecretKey); // Please set them.
const imageBlob = UrlFetchApp.fetch(imageUrl).getBlob();
s3.putObject(bucketName, 'test.png', imageBlob, { logRequests:true });
With this, the request is built by the modified library and the image is sent as binary data.
Judging from the official PutObject documentation, I thought that a byte array could be used as the payload.
References:
PutObject
S3-for-Google-Apps-Script

Get text from PDF in Google

I have a PDF document that is saved in Google Drive. I can use the Google Drive Web UI search to find text in the document.
How can I programmatically extract a portion of the text in the document using Google Apps Script?
See pdfToText() in this gist.
To invoke the OCR built in to Google Drive on a PDF file, e.g. myPDF.pdf, here is what you do:
function myFunction() {
var pdfFile = DriveApp.getFilesByName("myPDF.pdf").next();
var blob = pdfFile.getBlob();
// Get the text from pdf
var filetext = pdfToText( blob, {keepTextfile: false} );
// Now do whatever you want with filetext...
}
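Once `filetext` holds the OCR output, extracting a portion of it is ordinary string handling. Here is a minimal sketch, assuming the wanted portion sits between two known markers; `extractBetween`, the markers and the sample text are all made up for illustration.

```javascript
// Hypothetical helper: return the text between two markers, or null if either is missing.
function extractBetween(text, startMarker, endMarker) {
  var start = text.indexOf(startMarker);
  if (start === -1) return null;
  start += startMarker.length;
  var end = text.indexOf(endMarker, start);
  if (end === -1) return null;
  return text.substring(start, end).trim();
}

// Made-up sample text standing in for the OCR output.
var filetext = "Invoice No: 12345\nDate: 2020-01-31\nTotal: 99.50";
console.log(extractBetween(filetext, "Invoice No:", "\n")); // 12345
console.log(extractBetween(filetext, "Total:", "\n"));      // null (no newline after the total)
```

OCR output often contains stray whitespace or misread characters, so a regular expression with some tolerance may work better than exact markers on real documents.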

How to upload an Excel file to Google Docs

I am using C# and want to upload an Excel file to Google Docs. I use the syntax below to upload an xls file:
//use Content-Type: text/csv
entry.MediaSource = new MediaFileSource("E:\\Emailcontent.xls", "text/csv");
But it is not working: after the upload, the file is converted to CSV, and I don't want this conversion. I just want to upload my Excel file to Google Docs as it is. Please help me upload the Excel file without conversion. Thanks in advance.
string USERNAME = "xxx#gmail.com";
string PASSWORD = "xxxxx";
// Start the service and set credentials
DocumentsService service = new DocumentsService("MyDocumentsListIntegration-v1");
service.setUserCredentials(USERNAME, PASSWORD);
Authenticator authenticator = new ClientLoginAuthenticator("TestApi", Google.GData.Client.ServiceNames.Documents, service.Credentials);
DocumentEntry entry = new DocumentEntry();
// Set the document title
entry.Title.Text = "Legal Contract";
entry.IsSpreadsheet = true;
// Set the media source
//entry.MediaSource = new MediaFileSource("E:\\New Microsoft Office Word Document.doc", "application/msword");
entry.MediaSource = new MediaFileSource("E:\\Emailcontent.xls", "text/csv");
// Define the resumable upload link
Uri createUploadUrl = new Uri("https://docs.google.com/feeds/upload/create-session/default/private/full");
AtomLink link = new AtomLink(createUploadUrl.AbsoluteUri);
link.Rel = ResumableUploader.CreateMediaRelation;
entry.Links.Add(link);
// Set the service to be used to parse the returned entry
entry.Service = service;
// Instantiate the ResumableUploader component.
ResumableUploader uploader = new ResumableUploader();
// Set the handlers for the completion and progress events
uploader.AsyncOperationCompleted += new AsyncOperationCompletedEventHandler(OnDone);
uploader.AsyncOperationProgress += new AsyncOperationProgressEventHandler(OnProgress);
// Start the upload process
uploader.InsertAsync(authenticator, entry, new object());
You are passing an xls (Excel) file as text/csv:
entry.MediaSource = new MediaFileSource("E:\\Emailcontent.xls", "text/csv");
If you want to upload it as xls, use
entry.MediaSource = new MediaFileSource("E:\\Emailcontent.xls", "application/vnd.ms-excel");
Here is the wikipedia/google search that I used:
http://en.wikipedia.org/wiki/Internet_media_type
To ensure documents aren't converted when you upload them, you should also append ?convert=false to the upload URI.
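The two fixes (declaring the real MIME type and appending convert=false) are independent of the client library. Here is a minimal JavaScript sketch of the idea; `mimeTypeFor` and `withConvertFalse` are hypothetical helpers, and the lookup table only covers the types mentioned in this thread.

```javascript
// Hypothetical lookup table: file extension -> MIME type to declare on upload.
var MIME_BY_EXTENSION = {
  xls: "application/vnd.ms-excel",
  csv: "text/csv",
  doc: "application/msword"
};

// Pick the MIME type from the file extension; fall back to a generic binary type.
function mimeTypeFor(path) {
  var ext = path.split(".").pop().toLowerCase();
  return Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, ext)
    ? MIME_BY_EXTENSION[ext]
    : "application/octet-stream";
}

// Append convert=false, respecting any query string already on the URI.
function withConvertFalse(uri) {
  return uri + (uri.indexOf("?") === -1 ? "?" : "&") + "convert=false";
}

console.log(mimeTypeFor("E:\\Emailcontent.xls"));
console.log(withConvertFalse("https://docs.google.com/feeds/upload/create-session/default/private/full"));
```

In the C# code above, the equivalent would be to pass "application/vnd.ms-excel" to MediaFileSource and to build createUploadUrl from the URI with ?convert=false appended.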