Get text from PDF in Google - pdf

I have a PDF document that is saved in Google Drive. I can use the Google Drive Web UI search to find text in the document.
How can I programmatically extract a portion of the text in the document using Google Apps Script?

See pdfToText() in this gist.
To invoke the OCR built in to Google Drive on a PDF file, e.g. myPDF.pdf, here is what you do:
function myFunction() {
var pdfFile = DriveApp.getFilesByName("myPDF.pdf").next();
var blob = pdfFile.getBlob();
// Get the text from pdf
var filetext = pdfToText( blob, {keepTextfile: false} );
// Now do whatever you want with filetext...
}

Related

how can I show the pdf file content inside from razor file

I created a document file from word and has exported as pdf . i want to show the pdf content inside the Div element in razor page. How can I show the pdf content from razor page. Please can you provide an example code how to show in blazor server side
If you stored your pdf file directly in your documents for example in the folder wwwroot/pdf.
wwwroot/pdf/test.pdf
You can display this PDF with this line of html bellow :
< embed src="pdf/test.pdf" style="width=100%; height=2100px;" />
It will provide you a pdf displayer with printing options !
If you want to go further, upload your file and then display it, I will recommend you to go check this explanation :
https://www.learmoreseekmore.com/2020/10/blazor-webassembly-fileupload.html
The upload for PDF files works the same as Img file, you need to go check IBrowserFile documentation.
You will see that it has a Size obj and a OpenReadStream() function that will help you get the display Url for your file (image or pdf)
If the site abow closes, this is the upload code that is shown on it :
#code{
List<string> imgUrls = new List<string>();
private async Task OnFileSelection(InputFileChangeEventArgs e)
{
foreach (IBrowserFile imgFile in e.GetMultipleFiles(5))
{
var buffers = new byte[imgFile.Size];
await imgFile.OpenReadStream().ReadAsync(buffers);
string imageType = imgFile.ContentType;
string imgUrl = $"data:{imageType};base64,{Convert.ToBase64String(buffers)}";
imgUrls.Add(imgUrl);
}
}
}
This code was written by Naveen Bommidi, author of the blog where I found this usefull code
If you want, as I said, upload a PDF and then display it.
You can use the same html line :
< embed src="#imgUrl" style="width=100%; height=2100px;" />
And your uploaded files will be displaying.
Example

Extract Text from Multipage Attachment PDF Using Google Apps Script

I have a Gmail attachment PDF with multiple scanned pages. When I use Google Apps Script to save the blob from the attachment to a Drive file, open the PDF manually from Google Drive, then select Open With Google Docs, all of the text from the PDF is displayed as a Google Doc. However, when I save the blob as a Google Doc with OCR, only the text from the image on the first page is saved to a Doc, accessed either manually or by code.
The code to get the blob and process it is:
function getAttachments(desiredLabel, processedLabel, emailQuery){
// Find emails
var threads = GmailApp.search(emailQuery);
if(threads.length > 0){
// Iterate through the emails
for(var i in threads){
var mesgs = threads[i].getMessages();
for(var j in mesgs){
var processingMesg = mesgs[j];
var attachments = processingMesg.getAttachments();
var processedAttachments = 0;
// Iterate through attachments
for(var k in attachments){
var attachment = attachments[k];
var attachmentName = attachment.getName();
var attachmentType = attachment.getContentType();
// Process PDFs
if (attachmentType.includes('pdf')) {
processedAttachments += 1;
var pdfBlob = attachment.copyBlob();
var filename = attachmentName + " " + processedAttachments;
processPDF(pdfBlob, filename);
}
}
}
}
}
}
function processPDF(pdfBlob, filename){
// Saves the blob as a PDF.
// All pages are displayed if I click on it from Google Drive after running this script.
let pdfFile = DriveApp.createFile(pdfBlob);
pdfFile.setName(filename);
// Saves the blob as an OCRed Doc.
let resources = {
title: filename,
mimeType: "application/pdf"
};
let options = {
ocr: true,
ocrLanguage: "en"
};
let file = Drive.Files.insert(resources, pdfBlob, options);
let fileID = file.getId();
// Open the file to get the text.
// Only the text of the image on the first page is available in the Doc.
let doc = DocumentApp.openById(fileID);
let docText = doc.getBody().getText();
}
If I try to use Google Docs to read the PDF without OCR directly, I get Exception: Invalid argument, for example:
DocumentApp.openById(pdfFile.getId());
How do I get the text from all of the pages of the PDF?
DocumentApp.openById is a method that can only be used for Google Docs documents
pdfFile can only be "opened" with the DriveApp - DriveApp.getFileById(pdfFile.getId());
Opening a file with DriveApp allows you to use the following methods on the file
When it comes to OCR conversion, your code works for me correctly to convert all pages of a PDF document to Google Docs, so you error source is likely come from the attachment itself / the way you retrieve the blob
Mind that OCR conversion is not good at preserving formatting, so a two page PDF might be collapsed into a one-page Docs - depneding on the formatting of the PDF

jspdf Some data is being cut

The jspdf library is being used to generate PDF files in html.
That's a really good thing.
But I have a problem with pdf.
The data is about three pages long, but if check the downloaded pdf file, I see only one page and the rest will be truncated.
Here's my code:
let pdfName = this.contractlist_detail.title
var doc = new jsPDF();
var NotoSansCJKjp;
doc.addFileToVFS('NotoSansCJKjp-Regular.ttf', VFS);
doc.addFont('NotoSansCJKjp-Regular.ttf', 'NotoSansCJKjp', 'Bold');
doc.setFont('NotoSansCJKjp', 'Bold');
doc.setFontSize(12);
var paragraph = data;
var lines = doc.splitTextToSize(paragraph, 150);
doc.text(15, 15, lines)
doc.save(pdfName + '.pdf');
How do I make all of my data visible to downloaded pdf without being truncated?
jspdf library doesn't handle multi-pages by its own. You need to add pages manually when content is cropped (you need also to calculate manually if text is cropped).
Here is the method to add a new page :
addPage method
a demo is available in section "two page hello world" to know how to use this method
enter link description here

How to convert PDF in google drive to Text by google script

I am working on the code to convert PDF in google drive to Text. And, I can convert PDF by below codes. However, when I tried to specify the PDF in google drive by URL (e.g. var url = "https://drive.google.com/open?id=1gs-WvPPPPPPP0-iaawwadafa--";). The code doesn't work.
var url = "https://www.test.com/sites/g/files/xyz123.pdf";
// It works fine. And, if changes to google drive URL, it doesn't work.
var blob = UrlFetchApp.fetch(url).getBlob();
// No error showed up at getBlob.
Logger.log(blob) // shows "Blob" after changes to google drive URL
Logger.log(blob.getName()) // shows "open.html" after changes to google drive URL
Logger.log(blob.getContentType()) // shows "text/html" after changes to google drive URL
var resource = {
title: blob.getName(),
mimeType: blob.getContentType()
};
var file = Drive.Files.insert(resource, blob, {ocr: true, ocrLanguage: "en"});
// At above line, Error message "OCR is not supported for files of type text/html (line 64, file "Code")" showed.
var doc = DocumentApp.openById(file.id);
var text = doc.getBody().getText();
return text;
I would like to convert PDF files in folder of google drive to TEXT one by one.

Print PDF in Website

I have been searching for days for a solution to this problem.
Description : I have a website which loads a PDF dynamically via an iFrame. The PDF is saved on the server and the user of the website can view the pdf on the website.
Problem : Introduce a Print button on website which prints the PDF which was created dynamically and saved on the server.
Is this even possible ? I am looking at a cross-browser implementation as well to make things worse. I have tried n number of JS options from the web but none of them seem to work. I can not seem to get the PDF printed in the same way as it looks. To put it short, I am trying to emulate the print button which appears on the PDF when it is loaded. Is there an option to pass the pdf document from the server to the print dialog box ?
Description : I have a website which loads a PDF dynamically via an iFrame. The PDF is saved on the server and the user of the website can view the pdf on the website.
Problem : Introduce a Print button on website which prints the PDF which was created dynamically and saved on the server.
Solution : I could not find an exact solution to this problem, but here is how I solved the problem -
Create the 'Print' as per req and redirect that to another page which has only the PDF.
Copy the previous PDF & Create new PDF with JS - this.print() such that when it opens up, the print dialog pops up directly to the user.
In the new page -
if ("Location of PDF " != null)
{
sPdf = "Location of PDF ";
PdfReader pReader = new PdfReader(sPdf);
Document document = new Document
(pReader.GetPageSizeWithRotation(ApplicationConstants.INDEX_ONE));
int n = pReader.NumberOfPages;
FileStream fs = new FileStream
("New PDF location",
FileMode.Create, FileAccess.Write);
PdfCopy copy = new PdfCopy(document, fs);
// Write to pdf
document.Open();
for (int i = ApplicationConstants.INDEX_ONE; i <= n; i++)
{
PdfImportedPage page = copy.GetImportedPage(pReader, i);
copy.AddPage(page);
}
copy.AddJavaScript("this.print(true);", true);
document.Close();
pReader.Close();
inStr = File.OpenRead("New PDF location");
while ((bytecnt = inStr.Read
(buffer, ApplicationConstants.INDEX_ZERO, buffer.Length))
> ApplicationConstants.INDEX_ZERO)
{
if (Context.Response.IsClientConnected)
{
Context.Response.ContentType = "application/PDF";
Context.Response.OutputStream.Write(buffer,
ApplicationConstants.INDEX_ZERO, buffer.Length);
Context.Response.Flush();
}
}
}
Please note that I am using itextsharp to inject the JS script into the new PDF. Hope this helps someone else. I am trying to find another solution without the usage of itextsharp or any other dll but this will have to do for now.
I am not sure if this will work, but you could try launching a popup window with a special version of your PDF file that opens the print dialog when opened. Then close the popup afterwards. This last part might be tricky since I think there is no clean way to know if the print dialog has been closed.