I have a PDF I am trying to extract the text from.
To do this, I have tried to get the contents into a Google Doc.
The PDF has 1180 pages (3MB) but only the first 77 pages are being converted to text.
I have tried Drive.Files.insert and Drive.Files.copy, but get the same result.
I also tried to convert the PDF using MS Word and referencing that file (2.5MB) - with the same result.
I cannot see anything in either the PDF or the Word file that would indicate an "end of file" and stop the rest of the document from converting. There are no error messages, just 6.5% of what I need. I can only assume it was originally several smaller PDFs that were merged.
Is there something else I should be looking at? Has anyone encountered this before?
I can manipulate the PDFtext string to get the data I need, but can't convert more than the first 77 pages.
This is what I am using to get the text string I require.
function txtPDF() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sht = ss.getSheetByName('Sheet1');
  var mycell = sht.getRange('B1');
  var myPdfID = mycell.getValue().toString();
  var PDFblob = DriveApp.getFileById(myPdfID).getBlob();
  var resource = {
    title: PDFblob.getName(),
    // mimeType: PDFblob.getContentType()
  };
  // var tmpfile = Drive.Files.insert(resource, PDFblob, {ocr: true, ocrLanguage: "en"});
  var tmpfile = Drive.Files.copy(resource, myPdfID, {convert: true, ocr: true, ocrLanguage: "en"});
  var doc = DocumentApp.openById(tmpfile.id);
  // var doc = Drive.Files.copy({}, 'WordFileID', {'convert': true});
  // var doc = DocumentApp.openById('WordFileID');
  var PDFtext = doc.getBody().getText();
  // Drive.Files.remove(doc.getId());
}
It looks like I fell foul of the Google Drive limits.
Documents can be up to 50MB, which was not an issue.
There is, however, a limit of 1.02 million characters per document. The 1180 pages exceeded this, so I guess I was lucky to get anything returned at all.
Maximum file sizes on Google Drive
According to the documentation, PDF-to-Docs conversion is limited to files of 2MB or less. To convert larger files properly, look for alternatives outside of Google.
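One possible workaround, sketched here, is to split the PDF into smaller page ranges with a tool outside Google and convert each piece separately so that no single conversion hits the character limit. The helper below only plans the ranges. The name planPageChunks and the chunk size of 50 pages are assumptions chosen for illustration (the question got 77 pages through, so 50 leaves some headroom); neither is part of any Google API.

```javascript
// Hypothetical helper: plan 1-based, inclusive page ranges so that each
// piece of a large PDF can be split off and converted on its own.
function planPageChunks(totalPages, pagesPerChunk) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError('totalPages must be a positive integer');
  }
  if (!Number.isInteger(pagesPerChunk) || pagesPerChunk < 1) {
    throw new RangeError('pagesPerChunk must be a positive integer');
  }
  var chunks = [];
  for (var first = 1; first <= totalPages; first += pagesPerChunk) {
    // The last chunk may be shorter than pagesPerChunk.
    var last = Math.min(first + pagesPerChunk - 1, totalPages);
    chunks.push({ first: first, last: last });
  }
  return chunks;
}

// Page count from the question; 50 pages per chunk is an assumed value.
var chunks = planPageChunks(1180, 50);
```

Each planned range would then be cut out with an external PDF tool and uploaded as its own file before running the OCR conversion on it.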
I have a Gmail attachment PDF with multiple scanned pages. When I use Google Apps Script to save the blob from the attachment to a Drive file, open the PDF manually from Google Drive, then select Open With Google Docs, all of the text from the PDF is displayed as a Google Doc. However, when I save the blob as a Google Doc with OCR, only the text from the image on the first page is saved to a Doc, accessed either manually or by code.
The code to get the blob and process it is:
function getAttachments(desiredLabel, processedLabel, emailQuery){
  // Find emails
  var threads = GmailApp.search(emailQuery);
  if(threads.length > 0){
    // Iterate through the emails
    for(var i in threads){
      var mesgs = threads[i].getMessages();
      for(var j in mesgs){
        var processingMesg = mesgs[j];
        var attachments = processingMesg.getAttachments();
        var processedAttachments = 0;
        // Iterate through attachments
        for(var k in attachments){
          var attachment = attachments[k];
          var attachmentName = attachment.getName();
          var attachmentType = attachment.getContentType();
          // Process PDFs
          if (attachmentType.includes('pdf')) {
            processedAttachments += 1;
            var pdfBlob = attachment.copyBlob();
            var filename = attachmentName + " " + processedAttachments;
            processPDF(pdfBlob, filename);
          }
        }
      }
    }
  }
}
function processPDF(pdfBlob, filename){
  // Saves the blob as a PDF.
  // All pages are displayed if I click on it from Google Drive after running this script.
  let pdfFile = DriveApp.createFile(pdfBlob);
  pdfFile.setName(filename);
  // Saves the blob as an OCRed Doc.
  let resources = {
    title: filename,
    mimeType: "application/pdf"
  };
  let options = {
    ocr: true,
    ocrLanguage: "en"
  };
  let file = Drive.Files.insert(resources, pdfBlob, options);
  let fileID = file.getId();
  // Open the file to get the text.
  // Only the text of the image on the first page is available in the Doc.
  let doc = DocumentApp.openById(fileID);
  let docText = doc.getBody().getText();
}
If I try to use Google Docs to read the PDF without OCR directly, I get Exception: Invalid argument, for example:
DocumentApp.openById(pdfFile.getId());
How do I get the text from all of the pages of the PDF?
DocumentApp.openById can only be used for Google Docs documents.
pdfFile can only be "opened" with DriveApp: DriveApp.getFileById(pdfFile.getId());
Opening a file with DriveApp gives you access to the Drive File methods on that file.
When it comes to OCR conversion, your code works correctly for me and converts all pages of a PDF document to Google Docs, so the source of your error is likely the attachment itself or the way you retrieve the blob.
Mind that OCR conversion is not good at preserving formatting, so a two-page PDF might be collapsed into a one-page Doc, depending on the formatting of the PDF.
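The distinction above can be checked in code before calling DocumentApp.openById. The following is a minimal sketch: isGoogleDoc and getTextIfGoogleDoc are hypothetical helper names, and the MIME type string is the one Drive reports for native Google Docs files.

```javascript
// MIME type that Drive reports for native Google Docs files.
var GOOGLE_DOCS_MIME = 'application/vnd.google-apps.document';

// Hypothetical helper: true only for files DocumentApp can open.
function isGoogleDoc(mimeType) {
  return mimeType === GOOGLE_DOCS_MIME;
}

// Apps Script usage (defined here but not called; it needs the
// DriveApp and DocumentApp services of the Apps Script runtime).
function getTextIfGoogleDoc(fileId) {
  var file = DriveApp.getFileById(fileId);
  if (!isGoogleDoc(file.getMimeType())) {
    // For example a PDF: it has to be converted to a Doc first.
    return null;
  }
  return DocumentApp.openById(fileId).getBody().getText();
}
```

With this guard, a PDF file ID returns null instead of raising "Exception: Invalid argument".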
I am working on code to convert a PDF in Google Drive to text. I can convert a PDF with the code below. However, when I specify a PDF in Google Drive by URL (e.g. var url = "https://drive.google.com/open?id=1gs-WvPPPPPPP0-iaawwadafa--";), the code doesn't work.
var url = "https://www.test.com/sites/g/files/xyz123.pdf";
// This works fine. With the Google Drive URL instead, it doesn't work.
var blob = UrlFetchApp.fetch(url).getBlob();
// No error is shown at getBlob.
Logger.log(blob); // shows "Blob" with the Google Drive URL
Logger.log(blob.getName()); // shows "open.html" with the Google Drive URL
Logger.log(blob.getContentType()); // shows "text/html" with the Google Drive URL
var resource = {
  title: blob.getName(),
  mimeType: blob.getContentType()
};
var file = Drive.Files.insert(resource, blob, {ocr: true, ocrLanguage: "en"});
// The line above fails with: "OCR is not supported for files of type text/html (line 64, file "Code")".
var doc = DocumentApp.openById(file.id);
var text = doc.getBody().getText();
return text;
I would like to convert the PDF files in a Google Drive folder to text one by one.
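The logged values ("open.html", "text/html") suggest that fetching the Drive link returns an HTML page rather than the PDF bytes, which would explain the OCR error. A possible workaround, sketched under that assumption, is to pull the file ID out of the URL and read the file through DriveApp instead of UrlFetchApp. extractDriveFileId and getPdfBlobFromDriveUrl are hypothetical helper names, and the two regular expressions only cover the "open?id=..." and "/d/.../" link shapes.

```javascript
// Hypothetical helper: extract the file ID from a Google Drive link.
// Returns null when the URL matches neither supported shape.
function extractDriveFileId(url) {
  // Shape 1: https://drive.google.com/open?id=<ID>
  var byQuery = /[?&]id=([A-Za-z0-9_-]+)/.exec(url);
  if (byQuery) {
    return byQuery[1];
  }
  // Shape 2: https://drive.google.com/file/d/<ID>/view
  var byPath = /\/d\/([A-Za-z0-9_-]+)/.exec(url);
  return byPath ? byPath[1] : null;
}

// Apps Script usage (defined here but not called; it needs DriveApp).
function getPdfBlobFromDriveUrl(url) {
  var id = extractDriveFileId(url);
  if (id === null) {
    throw new Error('No Drive file ID found in: ' + url);
  }
  // Reads the stored file itself, so the blob should keep the PDF content type.
  return DriveApp.getFileById(id).getBlob();
}
```

The blob returned by getPdfBlobFromDriveUrl could then be passed to Drive.Files.insert with the OCR options in place of the UrlFetchApp blob. For a whole folder, the same conversion could be run on each file obtained from the folder's file iterator.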
I am using Aspose.Words for .NET to replace some merge fields in my document and then save the file as a PDF, however, my formatting is getting messed up (even for non-merge fields) by the conversion to PDF (refer to the images). The code is quite simple so I don't see what I'm missing.
The word document, pre-processing:
The generated pdf:
As you can see some of the fields are indented a bit more instead of being nicely aligned.
My code for generating the PDF and replacing the merge fields is:
public async Task<Stream> GenerateContractAsync(string requestRegistrationId)
{
    var requestRegistration = await _requestRegistrationRepository
        .FindRequestRegistration(requestRegistrationId)
        .Include(rr => rr.Request.QualityType)
        .Include(rr => rr.User)
        .SingleOrDefaultAsync();
    var file = await _fileService
        .LoadFileAsync("Concept contract.docx");
    var user = requestRegistration.User;
    var document = new Aspose.Words.Document(file);
    document.MailMerge.Execute(
        new[]
        {
            "EmployeeName", "EmployeeDateOfBirth", "EmployeePlaceOfBirth", "EmployeeSSN", "EmployeeCity",
            "EmployeeAddress", "ContractStartDate", "EmployeeFunction", "HourlyWage", "WageDeductionApplied"
        },
        new object[]
        {
            user.FullName, $"{user.Birthday:dd-MM-yyyy}", "Oss", user.Bsn, user.City,
            $"{user.PostalCode}, {user.City}", $"{requestRegistration.Request.StartDate:dd-MM-yyyy}",
            requestRegistration.Request.QualityType.Name, $"{requestRegistration.Request.HourlyRate:C}",
            user.PayrollTaxDiscountEnabled ? "Ja" : "Nee"
        }
    );
    var mergedDocumentStream = new MemoryStream();
    document.Save(mergedDocumentStream, SaveFormat.Pdf);
#if DEBUG
    mergedDocumentStream.Seek(0, SeekOrigin.Begin);
    await _fileService.SaveFileToDiskAsync($"{user.Id}-{DateTimeOffset.Now:g}.pdf", "", mergedDocumentStream);
#endif
    mergedDocumentStream.Seek(0, SeekOrigin.Begin);
    return mergedDocumentStream;
}
Any help would be greatly appreciated.
The problem occurs because of missing fonts. Please refer to the following article for details.
How Aspose.Words Uses True Type Fonts
In your case, you need to install the 'Verdana', 'Arial' and 'Cambria' fonts on the machine where you run this Aspose.Words code. Simply copying these font files from a Windows machine to the Mac machine should work.
I work with Aspose as Developer Evangelist.
I have an Apps Script bound to a spreadsheet that creates a PDF file from the sheet. It creates a single-page PDF and saves it in a folder in Drive. Up until recently, this worked perfectly. Now every time I run the code it does what it is supposed to, but the file has a second page that is blank. When I create the PDF manually via File > Download as > PDF document, it comes out as it should, with only one page. I have tried this with both the original and the copy that the script temporarily creates. Both work when done manually. I am looking for suggestions on what could have gone wrong and what to change. Here is an example of the code:
function makePDF() {
  var sheet1 = SpreadsheetApp.getActive().getSheetByName('eTimesheet');
  var sheet2 = SpreadsheetApp.getActive().getSheetByName('Time Sheet');
  var sheet3 = SpreadsheetApp.getActive().getSheetByName('Data');
  var triggercell3 = sheet1.getRange('M33').getValue();
  if (triggercell3 == 'GO'){
    var techNumber = sheet3.getRange('B5').getValue();
    var date = sheet3.getRange('B3').getValue();
    var fileID = sheet3.getRange('B7').getValue();
    var pdfName = "TimeSheet- "+ techNumber + " " + date;
    var ss = SpreadsheetApp.getActive();
    var folder = DriveApp.getFolderById(fileID);
    sheet2.showSheet();
    sheet1.hideSheet();
    //Copy whole spreadsheet
    var destSpreadsheet = SpreadsheetApp.open(DriveApp.getFileById(ss.getId()).makeCopy("tmp_convert_to_pdf", folder));
    //save to pdf
    var theBlob = destSpreadsheet.getBlob().getAs('application/pdf').setName(pdfName);
    var newFile = folder.createFile(theBlob);
    DriveApp.getFileById(destSpreadsheet.getId()).setTrashed(true);
    sheet1.showSheet();
    sheet2.hideSheet();
    sheet1.getRange('M33').clearContent();
  }
}
thanks for any assistance...
Without an example, the first question I have is: if you delete a few rows from the sheet and run the script, are you back to one page?
I am asking in case the issue is just margin settings. If it is, maybe you can adjust the rows or use the UrlFetchApp.fetch approach, since it lets you specify the PDF page formatting (e.g. margin size).
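As a rough sketch of the UrlFetchApp.fetch approach: the spreadsheet export URL accepts query parameters for page layout. These parameters (fitw, top_margin and so on) are widely used but not officially documented, so treat them as assumptions that may change. buildPdfExportUrl and fetchSheetPdf are hypothetical helper names, and the 0.25 inch margin is only an example value.

```javascript
// Hypothetical helper: build a PDF export URL for one sheet (gid) with
// explicit margins in inches. Parameter names are undocumented assumptions.
function buildPdfExportUrl(spreadsheetId, gid, marginInches) {
  var params = {
    format: 'pdf',
    gid: gid,
    portrait: 'true',
    fitw: 'true', // fit the sheet to the page width
    top_margin: marginInches,
    bottom_margin: marginInches,
    left_margin: marginInches,
    right_margin: marginInches
  };
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
  return 'https://docs.google.com/spreadsheets/d/' + spreadsheetId + '/export?' + query;
}

// Apps Script usage (defined here but not called; it needs UrlFetchApp
// and ScriptApp from the Apps Script runtime).
function fetchSheetPdf(spreadsheetId, gid) {
  var url = buildPdfExportUrl(spreadsheetId, gid, 0.25);
  var response = UrlFetchApp.fetch(url, {
    headers: { Authorization: 'Bearer ' + ScriptApp.getOAuthToken() }
  });
  return response.getBlob();
}
```

The returned blob could be named with setName(pdfName) and passed to folder.createFile in place of the blob from getAs('application/pdf').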
Recently I asked a question about saving all the images in a PDF file to the file system, and I was able to save the images successfully.
I tested my code on a lot of PDF files and it ran just fine. But today I came across a PDF file from which it is not able to extract some images (attached below).
Can anyone please tell me what else I can do to extract these images? Is it even possible to extract them? Are they really images or something else? I would really appreciate the help.
My code (please ignore the hardcoding, as I am still testing this out):
function fn_getAllImages()
{
  var strPdf = "C:\\Users\\a614923\\Desktop\\haka\\Work\\2017\\10. October\\31\\test.PDF";
  var strout = "C:\\Users\\a614923\\Desktop\\haka\\Work\\2017\\10. October\\31\\Newfolder\\img"
  intPage = 2; //for the 2nd page(the image is present in the 2nd page)
  var objPdf = JavaClasses.org_apache_pdfbox_pdmodel.PDDocument.load_3(strPdf);
  var objPage = objPdf.getDocumentCatalog().getAllPages().get(intPage-1);
  var objImages = objPage.getResources().getXObjects().values().toArray();
  var objImage, objImgBuffer, objImageFile;
  for(var i=0; i<objImages.length; i++)
  {
    objImage = objImages.items(i);
    Log.Message(objImage.toString());
    if(aqString.Find(objImage.toString(),"PDXObjectForm",0,false)>0)
    {
      continue;
    }
    else
    {
      objImage.write2file_2(strout+i);
      //objImgBuffer = objImage.getRGBImage();
      //objImageFile = JavaClasses.java_io.File.newInstance(strout+i+".png");
      //JavaClasses.javax_imageio.ImageIO.write(objImgBuffer,"png",objImageFile);
    }
  }
}
The image in the PDF file which I want to save(the one inside the red box below):