Convert multiple URL to individual PDFs

Question - I have a Google Sheet with about 200 Google Doc URLs. Is it possible to have a script that converts the URLs to individual PDF files and saves them to my desktop?
I have searched the internet high and low and I cannot find a solution. If anyone has any insight or can point me in the right direction, that would be really helpful.

One solution is to create a folder inside Drive, convert the documents to PDFs there, and download the folder as a .zip.
function convertDocuments() {
  /* Select the Spreadsheet */
  const SS_ID = "SPREADSHEET_ID"
  const SS = SpreadsheetApp.openById(SS_ID)
  const PDF_MIME = "application/pdf"
  const newFolder = DriveApp.createFolder('PDFs')
  /* Get the links */
  const getLinks = SS.getRange('A2:A').getValues()
  getLinks.forEach((cells) => {
    const link = cells[0]
    if (link === "") return
    /* Getting the ID from the URL */
    const parseID = link.toString().split("/")[5]
    /* CREATE THE PDF */
    const document = DriveApp.getFileById(parseID).getAs(PDF_MIME).copyBlob()
    /* Inserting the PDF into the folder */
    newFolder.createFile(document)
  })
  Logger.log(newFolder.getUrl())
  /* downloadFolder(newFolder.getId()) */
}
The steps are as follows:
Retrieve all links in column A
Use DriveApp to create a PDF for every link (the link must be parsed to retrieve the file ID)
Place the PDF inside the Drive folder
From here, you have two possibilities:
Use the UI to download the folder
Use this function (provided by @Tanaike) to get the download link directly. In my script it is referenced as downloadFolder
function downloadFolder(folderId) {
  const folder = DriveApp.getFolderById(folderId);
  const files = folder.getFiles();
  let blobs = [];
  while (files.hasNext()) {
    blobs.push(files.next().getBlob());
  }
  const zipBlob = Utilities.zip(blobs, folder.getName() + ".zip");
  const fileId = DriveApp.createFile(zipBlob).getId();
  const url = "https://drive.google.com/uc?export=download&id=" + fileId;
  Logger.log(url);
}
Documentation
getAs(contentType)
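The script above extracts the file ID with link.toString().split("/")[5], which assumes every URL has exactly the shape https://docs.google.com/document/d/ID/edit. As a minimal sketch, a regular expression can tolerate a few more URL shapes (such as links ending in open?id=ID) and return null for cells that are not Drive links. The helper name extractDriveFileId is hypothetical and not part of any Google API.

```javascript
// Hypothetical helper: pull a Drive file ID out of common Google Docs/Drive URL shapes.
// Returns null when no ID can be found, so the caller can skip that row.
function extractDriveFileId(link) {
  const text = String(link).trim();
  // Shape 1: .../d/<ID>/...  (for example .../document/d/<ID>/edit)
  const pathMatch = text.match(/\/d\/([A-Za-z0-9_-]+)/);
  if (pathMatch) return pathMatch[1];
  // Shape 2: ...?id=<ID> or ...&id=<ID>  (for example .../open?id=<ID>)
  const queryMatch = text.match(/[?&]id=([A-Za-z0-9_-]+)/);
  if (queryMatch) return queryMatch[1];
  return null;
}
```

Inside convertDocuments, the split call could then be replaced by const parseID = extractDriveFileId(link), followed by if (!parseID) return to skip rows that do not contain a usable link.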

Related

Need t2.gstatic URL parameters for Web Scraping

I am checking to see if I can use gstatic to scrape favicon from websites. Below will fetch the websites Favicon:
https://t2.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=https://yahoo.com&size=64
I understand that the URL parameters might not be for general use, but just checking if anyone knows where this might be documented?
UPDATE: I have just started building an app on Google Apps Script. I need to list website names along with their favicons and metadata like site description, etc. Currently the only approach is to read the webpage, parse it with beautifulSoup, and then locate the favicon. I came across the above link, which gives me the favicon directly! But I want to understand it better and am trying to locate more information on the URL parameters for gstatic.
I am also open to alternative ways to scrape a website from Google Apps Script...
Thanks

I believe your goal is as follows.
You want to retrieve the favicon from the websites.
You want to use the following sample URL.
https://t2.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=https://yahoo.com&size=64
From "I need to list website names along with their favicons and metadata like site description, etc.", I understand that you want to retrieve the favicon, title, and description of the site using Google Apps Script.
Sample script 1:
When your URL of https://t2.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=https://yahoo.com&size=64 is used, how about the following sample script? Please copy and paste the following script into the script editor of Google Apps Script, and run sample1 in the script editor.
function sample1() {
  const uri = 'https://t2.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&fallback_opts=TYPE,SIZE,URL&url=https://yahoo.com&size=64';
  const blob = UrlFetchApp.fetch(encodeURI(uri)).getBlob();
  DriveApp.createFile(blob);
}
When this script is run, the favicon is retrieved and that is saved as a file to the root folder of Google Drive.
When I saw the URL, it seems that the favicon is retrieved as the image data.
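To request favicons for several sites, the sample URL can be assembled from its parts. The sketch below only reuses the parameter names and values that appear in the sample URL; as the question notes, they are undocumented, so their meaning is an assumption. The helper name buildFaviconUrl is hypothetical.

```javascript
// Hypothetical helper: build a faviconV2 request URL for a given site.
// Parameter names and values are copied from the sample URL; they are undocumented.
function buildFaviconUrl(siteUrl, size) {
  const params = [
    "client=SOCIAL",
    "type=FAVICON",
    "fallback_opts=TYPE,SIZE,URL",
    // Encode the target URL so its own "?", "&" and "=" cannot leak into this query string.
    "url=" + encodeURIComponent(siteUrl),
    "size=" + encodeURIComponent(String(size)),
  ];
  return "https://t2.gstatic.com/faviconV2?" + params.join("&");
}
```

Because the url parameter is already percent-encoded here, the result should be passed to UrlFetchApp.fetch as-is; wrapping it in encodeURI again would double-encode the percent signs.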
Sample script 2:
When the favicon, title, and description of the site are retrieved, how about the following sample script?
function sample2() {
  const uri = 'https://yahoo.com'; // Please set the URL.
  const obj = { title: "", description: "", faviconUrl: "" };
  const res = UrlFetchApp.fetch(encodeURI(uri));
  const html = res.getContentText();
  const title = html.match(/<title>(.+?)<\/title>/i);
  // match() returns null when there is no <title>, so both checks must pass.
  if (title && title.length > 1) {
    obj.title = title[1];
  }
  const description = html.match(/<meta.+name\="description".+>/i);
  if (description) {
    const d = description[0].match(/content\="(.+)"/i);
    if (d && d.length > 1) {
      obj.description = d[1];
    }
  }
  const faviconUrl = html.match(/rel="icon".+?href\="(.+?)"/i);
  if (faviconUrl && faviconUrl.length > 1) {
    obj.faviconUrl = faviconUrl[1];
  }
  console.log(obj);
}
When this script is run, you can see the following value in the log.
{
"title":"Yahoo | Mail, Weather, Search, Politics, News, Finance, Sports & Videos",
"description":"Latest news coverage, email, free stock quotes, live scores and video are just the beginning. Discover more every day at Yahoo!",
"faviconUrl":"https://s.yimg.com/cv/apiv2/default/icons/favicon_y19_32x32_custom.svg"
}
Reference:
fetch(url)
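The regular expressions in sample script 2 expect double-quoted attributes in a fixed order (name before content, rel before href). As a hedged sketch, the parsing can be moved into a pure function that reads each meta and link tag separately, so attribute order and quote style matter less. The helper names getAttr and extractMeta are hypothetical, and this is still regex-based scraping rather than a real HTML parser: attribute values that contain ">" would break it.

```javascript
// Read one attribute value from a single tag string; accepts "double" or 'single' quotes.
function getAttr(tag, name) {
  const re = new RegExp("\\s" + name + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", "i");
  const m = tag.match(re);
  if (!m) return "";
  return m[1] !== undefined ? m[1] : m[2];
}

// Hypothetical helper: extract title, description and favicon URL from an HTML string.
function extractMeta(html) {
  const obj = { title: "", description: "", faviconUrl: "" };
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  if (title) obj.title = title[1].trim();
  // Collect every <meta ...> and <link ...> tag, then inspect them one by one.
  const tags = html.match(/<(?:meta|link)\s[^>]*>/gi) || [];
  tags.forEach((tag) => {
    const isMeta = /^<meta/i.test(tag);
    if (isMeta && !obj.description && getAttr(tag, "name").toLowerCase() === "description") {
      obj.description = getAttr(tag, "content");
    }
    // rel may hold several tokens, for example "shortcut icon".
    if (!isMeta && !obj.faviconUrl && /(^|\s)icon(\s|$)/i.test(getAttr(tag, "rel"))) {
      obj.faviconUrl = getAttr(tag, "href");
    }
  });
  return obj;
}
```

In Apps Script this would be called as extractMeta(UrlFetchApp.fetch(uri).getContentText()). The returned faviconUrl may be a relative path such as /favicon.ico, which still has to be resolved against the site URL.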

App script conversion of DOC to PDF ruins formatting

I created a simple script to convert all DOC files in a directory to PDF files. The script assumes the folder in Drive does not contain any other files. It also recursively iterates over the sub-directories and converts DOC to PDF as expected. Here's the script:
function convertDocToPdf(root) {
  if(!root) {
    root = DriveApp.getFoldersByName('conversion-test');
  }
  if(root.hasNext()) {
    var rootFolder = root.next();
    var files = rootFolder.getFiles();
    var folders = rootFolder.getFolders();
    while(files.hasNext()) {
      var file = files.next();
      if(!file) continue ;
      convert(file, rootFolder);
    }
    while(folders.hasNext()) {
      convertDocToPdf(folders);
    }
  }
}
function convert(file, rootFolder) {
  var blob = file.getBlob();
  var tmp = Drive.Files.insert({}, blob, {convert:true});
  var id = tmp["id"];
  var doc = DocumentApp.openById(id);
  var text = doc.getBody().getText();
  var filename = file.getName();
  var name = filename.split('.')[0];
  rootFolder.createFile(name + '.pdf', text);
  Drive.Files.remove(id);
}
I tested this with simple files that contain only one line of text, and it works. However, when I tried to convert a DOC file with images and other formatting (columns, tables), it removes all formatting, and after download the file looks empty.
Are there any ways of preserving the format? What am I missing in my code?

I believe your goal and your current situation are as follows.
You want to convert Google Document files to PDF files.
In your script, you can retrieve the Google Document files from the folder.
Modification points:
In convert(file, rootFolder), when file is a Google Document, the blob returned by var blob = file.getBlob(); is already in PDF format. But your script converts that PDF back to a Google Document, retrieves only the text, and then creates a file from that text. As a result, the PDF file contains only text data. I think that this is the reason for your issue.
To resolve this and convert the Google Document to a PDF file, I would like to modify the script as follows.
Modified script:
In this modification, I modified convert.
function convert(file, rootFolder) {
  if (file.getMimeType() != MimeType.GOOGLE_DOCS) return;
  var blob = file.getBlob();
  var filename = file.getName();
  var name = filename.split('.')[0];
  rootFolder.createFile(blob.setName(name + '.pdf'));
}
Note:
In this case, the Google Document is converted to PDF format by file.getBlob(). If you want to use the Drive API for this instead, you can use the following script.
From
var blob = file.getBlob();
To
var url = `https://www.googleapis.com/drive/v3/files/${file.getId()}/export?mimeType=${MimeType.PDF}`;
var blob = UrlFetchApp.fetch(url, {headers: {authorization: `Bearer ${ScriptApp.getOAuthToken()}`}}).getBlob();
Reference:
getBlob()
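One detail of the scripts above: filename.split('.')[0] keeps only the text before the first dot, so a file named report.v2.docx would become report.pdf. As a small sketch, the hypothetical helper baseName below removes only a trailing extension of one to five letters or digits and leaves other dots alone. The length limit is an assumption chosen for illustration.

```javascript
// Hypothetical helper: strip a trailing extension such as ".docx" but keep earlier dots.
// The 1-5 character limit is an arbitrary assumption about what counts as an extension.
function baseName(filename) {
  const stripped = filename.replace(/\.[A-Za-z0-9]{1,5}$/, "");
  // Names like ".env" would become empty, so fall back to the original name.
  return stripped === "" ? filename : stripped;
}
```

In convert, var name = baseName(filename); would take the place of the split call. Names that end in a dot followed by a short token that is not really an extension (for example v1.2) would still be shortened, so this is a heuristic only.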

Count total number of pages in pdf file

Every week, I'll be receiving a set of pdf files from my clients.
They will paste the PDF files into a specific Google Drive folder. I need the total number of pages of each PDF file. I was trying to write an Apps Script that records the PDF file name and the total number of pages in a particular Google Sheet.
I found code that was created for Google Docs here and here.
But that doesn't work. I am looking for an Apps Script that checks the particular Drive folder and writes the PDF file name and the total number of pages to the specific Google Sheet.
I have tried the script below.
function getNumberofPages() {
  var myFolder = DriveApp.getFoldersByName("Test").next();
  var files = myFolder.searchFiles('title contains ".PDF"');
  while (files.hasNext()) {
    var file = files.next();
    Logger.log(file.getName());
    Logger.log(file.length);
  }
}
But the length property does not work for PDF files.
Thanks in advance.

Unfortunately, Google APIs do not yet offer a method for directly retrieving the total number of pages in a PDF file. So how about these workarounds? Please choose the one that fits your situation.
Workaround 1:
This workaround counts the content streams in the PDF file. Content streams appear as the /Contents attribute.
When this is applied to your script, it becomes as follows.
Modified script:
function getNumberofPages() {
  var myFolder = DriveApp.getFoldersByName("Test").next();
  var files = myFolder.searchFiles('title contains ".PDF"');
  while (files.hasNext()) {
    var file = files.next();
    var n = file.getBlob().getDataAsString().split("/Contents").length - 1;
    Logger.log("fileName: %s, totalPages: %s", file.getName(), n)
  }
}
Although this workaround is simple, it might not work for all PDF files, as @mkl says. If this workaround cannot be used for your PDF files, how about the following workaround 2?
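A related text-scanning heuristic, shown here only as a sketch, is to count page objects instead of /Contents entries: each page dictionary is marked /Type /Page, while the page tree node is marked /Type /Pages. The helper name countPageObjects is hypothetical. It shares the limitation of workaround 1: when the page dictionaries are stored inside compressed object streams, they do not appear as plain text and the count will be too low.

```javascript
// Hypothetical helper: count "/Type /Page" entries in raw PDF text.
// The negative lookahead skips "/Type /Pages", which marks the page tree rather than a page.
function countPageObjects(pdfText) {
  const matches = pdfText.match(/\/Type\s*\/Page(?![A-Za-z])/g);
  return matches ? matches.length : 0;
}
```

In the script above it would be used as var n = countPageObjects(file.getBlob().getDataAsString());.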
Workaround 2:
In this workaround, an external API is used to retrieve the total number of pages in the PDF file. I used the Split PDF API of ConvertAPI. The page count is taken from the number of split files. To use this API, please check ConvertAPI and obtain your secret key.
Modified script:
function getNumberofPages() {
  var myFolder = DriveApp.getFoldersByName("Test").next();
  var files = myFolder.searchFiles('title contains ".PDF"');
  while (files.hasNext()) {
    var file = files.next();
    var url = "https://v2.convertapi.com/convert/pdf/to/split?Secret=#####"; // Please set your secret key.
    var options = {
      method: "post",
      payload: {File: DriveApp.getFileById(file.getId()).getBlob()},
    }
    var res = UrlFetchApp.fetch(url, options);
    res = JSON.parse(res.getContentText());
    Logger.log("fileName: %s, totalPages: %s", file.getName(), res.Files.length)
  }
}
I'm not sure about the number of PDF files and their sizes, so I didn't use the fetchAll method here. This is a sample script, so please modify it for your situation.
Note:
I can use these workarounds in my applications, but I have not been able to confirm them for all PDF files. So if these workarounds don't work for your PDF files, I'm sorry.
Reference:
PDF REFERENCE AND ADOBE EXTENSIONS TO THE PDF SPECIFICATION
ConvertAPI
Workaround 3:
As another approach, the pdf-lib library can be used. In that case, a sample script for retrieving the number of pages of PDF data is as follows.
async function myFunction() {
  const cdnjs = "https://cdn.jsdelivr.net/npm/pdf-lib/dist/pdf-lib.min.js";
  eval(UrlFetchApp.fetch(cdnjs).getContentText()); // Load pdf-lib
  const setTimeout = function (f, t) {
    // Overwrite setTimeout with Google Apps Script.
    Utilities.sleep(t);
    return f();
  };
  const myFolder = DriveApp.getFoldersByName("Test").next();
  const files = myFolder.searchFiles('title contains ".PDF"');
  const ar = [];
  while (files.hasNext()) {
    ar.push(files.next())
  }
  for (let i = 0; i < ar.length; i++) {
    const file = ar[i];
    const pdfData = await PDFLib.PDFDocument.load(new Uint8Array(file.getBlob().getBytes()));
    const n = pdfData.getPageCount();
    console.log("fileName: %s, totalPages: %s", file.getName(), n);
  }
}
Note:
I think that the above script works. However, if you copy the JavaScript from https://cdn.jsdelivr.net/npm/pdf-lib/dist/pdf-lib.min.js and paste it directly into your Google Apps Script project, the cost of loading it on every run can be avoided.

function menuItem() {
  var folder = DriveApp.getFoldersByName('Test').next();
  var contents = folder.searchFiles('title contains ".PDF"');
  var file;
  var name;
  var count;
  var data; // declared here; the original assigned it as an implicit global
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.clear();
  sheet.appendRow(["Name", "Number of pages"]);
  while (contents.hasNext()) {
    file = contents.next();
    name = file.getName();
    // Same heuristic as workaround 1: count the /Contents entries.
    count = file.getBlob().getDataAsString().split("/Contents").length - 1;
    data = [name, count];
    sheet.appendRow(data);
  }
}
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu('PDF Page Calculator')
    .addItem("PDF Page Calculator", 'menuItem')
    .addToUi();
}

Converting PDF to Google Docs

I managed to get a script running where the script automatically converts PDFs to a Google Doc format. The issue that we seem to be running into is that the PDFs have images in them as well. When we convert the PDF to Google Doc, the Google Doc does not have the images and only has the text. I believe the reason why this is happening is due to OCR. Is it possible that I could automate the script to convert the images on the PDF as well to Google Docs?
Here is the Script in question:
GmailToDrive('0BxwJdbZfrRZQUmhldGQ0b3FDTjA', '"Test Email"');
function GmailToDrive(folderID, gmailSubject){
  var threads = GmailApp.search('subject: ' + gmailSubject + ' -label: Imported'); // performs Gmail query for email threads
  for (var i in threads){
    var messages = threads[i].getMessages(); // finds all messages of threads returned by the query
    for(var j in messages){
      var attachments = messages[j].getAttachments(); // finds all attachments of found messages
      var timestamp = messages[j].getDate(); // receives timestamp of each found message
      var date = Utilities.formatDate(timestamp, "MST", "yyyy-MM-dd"); // rearranges the returned timestamp
      for(var k in attachments){
        var fileType = attachments[k].getContentType();
        Logger.log(fileType);
        // Compare with ===; the original used =, which assigns and is always truthy.
        if (fileType === 'application/pdf') { // if the attachment is a pdf then it will convert to a google doc.
          var fileBlob = attachments[k].copyBlob().setContentType('application/pdf');
          var resource = {
            title: fileBlob.getName(),
            mimeType: fileBlob.getContentType()
          };
          var options = {
            ocr: true
          };
          var docFile = Drive.Files.insert(resource, fileBlob, options);
        }
      }
    }
  }
}

The ocr option is intended to read characters out of images and PDF documents. The images themselves are not included in the uploaded result.
Have a look at the convert option instead.
The API documentation provides a test panel on the right-hand side where you can quickly try each parameter.

What is a blob URL and why it is used?

I am having trouble with blob URLs.
I was searching for src of a video tag on YouTube and I found that the video src was like:
src="blob:https://video_url"
I opened the blob URL that was in src of the video, but it gave an error. I can't open the link, but it was working with the src tag. How is this possible?
I have a few questions:
What is a blob URL?
Why it is used?
Can I make my own blob URL on a server?
Any additional details about blob URLs would be helpful as well.

Blob URLs (ref. W3C, official name) or Object URLs (ref. MDN and method name) are used with a Blob or a File object.
Regarding the question "I opened the blob URL that was in the src of the video; it gave an error and I can't open it, but it works in the src attribute. How is this possible?":
Blob URLs can only be generated internally by the browser. URL.createObjectURL() will create a special reference to the Blob or File object which later can be released using URL.revokeObjectURL(). These URLs can only be used locally in the single instance of the browser and in the same session (ie. the life of the page/document).
What is blob url?
Why it is used?
A Blob URL (Object URL) is a pseudo-protocol that allows Blob and File objects to be used as a URL source for things like images, download links for binary data, and so forth.
For example, you cannot hand an Image object raw byte data, as it would not know what to do with it. Images (which are binary data) must be loaded via URLs, and the same applies to anything that requires a URL as its source. Instead of uploading the binary data and then serving it back via a URL, it is better to use an extra local step that gives access to the data directly, without going via a server.
It is also a better alternative to Data URIs, which are strings encoded as Base64. The problem with a Data URI is that each char takes two bytes in JavaScript. On top of that, about 33% is added by the Base64 encoding. Blobs are pure binary byte arrays without that overhead, which makes them faster and smaller to handle.
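The size argument can be made concrete with a little arithmetic. Base64 encodes every 3 input bytes as 4 characters, which is where the roughly 33% comes from. The sketch below computes the encoded length and, under the answer's assumption of two bytes per character, the in-memory size of the resulting string. Both function names are hypothetical, and real engines may store such strings more compactly.

```javascript
// Number of Base64 characters needed for `byteLength` raw bytes (including "=" padding).
function base64Length(byteLength) {
  return Math.ceil(byteLength / 3) * 4;
}

// Rough in-memory size of that Base64 string, assuming 2 bytes per character
// as the answer states (an assumption; engines may use a more compact representation).
function base64StringBytes(byteLength) {
  return base64Length(byteLength) * 2;
}
```

For a 300,000-byte image, base64Length gives 400,000 characters, and base64StringBytes gives 800,000 bytes under the two-bytes-per-character assumption, compared with 300,000 bytes held by a Blob.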
Can i make my own blob url on a server?
No, Blob URLs/Object URLs can only be made internally in the browser. You can make Blobs and get File objects via the File Reader API, although BLOB just means Binary Large OBject and is stored as byte arrays. A client can request the data to be sent as either an ArrayBuffer or a Blob. The server should send the data as pure binary data. Databases often use the term Blob to describe binary objects as well, and in essence we are talking about byte arrays.
Any additional details?
You need to encapsulate the binary data as a BLOB object, then use URL.createObjectURL() to generate a local URL for it:
var blob = new Blob([arrayBufferWithPNG], {type: "image/png"}),
    url = URL.createObjectURL(blob),
    img = new Image();
img.onload = function() {
  URL.revokeObjectURL(this.src); // clean-up memory
  document.body.appendChild(this); // add image to DOM
}
img.src = url; // can now "stream" the bytes

This JavaScript function shows the difference between using a Blob (File API) and a data: URI to download a JSON file in the client browser:
/**
 * Save a text as file using HTML <a> temporary element and Blob
 * @author Loreto Parisi
 */
var saveAsFile = function(fileName, fileContents) {
  var downloadLink = document.createElement("a");
  downloadLink.download = fileName;
  downloadLink.style.display = "none";
  var objectUrl = null;
  if (typeof(Blob) != 'undefined') { // Alternative 1: using Blob
    var textFileAsBlob = new Blob([fileContents], {type: 'text/plain'});
    objectUrl = (window.URL || window.webkitURL).createObjectURL(textFileAsBlob);
    downloadLink.href = objectUrl;
  } else { // Alternative 2: using a data: URI
    downloadLink.href = 'data:text/plain;charset=utf-8,' +
      encodeURIComponent(fileContents);
  }
  // Attach the link before clicking; the original called removeChild
  // immediately instead of assigning a handler, and never attached the link
  // in every branch.
  document.body.appendChild(downloadLink);
  downloadLink.click();
  document.body.removeChild(downloadLink);
  if (objectUrl) {
    // Release the object URL after the click has been handled.
    setTimeout(function() {
      (window.URL || window.webkitURL).revokeObjectURL(objectUrl);
    }, 0);
  }
} // saveAsFile
/* Example */
var jsonObject = {"name": "John", "age": 30, "car": null};
saveAsFile('out.json', JSON.stringify(jsonObject, null, 2));
The function is called like saveAsFile('out.json', jsonString);. In the Blob branch, the browser downloads the generated file directly from the object URL created by URL.createObjectURL.
The else branch obtains the same result with a data: URI in the href attribute, but that approach has several limitations that the Blob API does not have.

I have modified a working solution to handle both cases: when a video is uploaded and when an image is uploaded. I hope it helps someone.
HTML
<input type="file" id="fileInput">
<div> duration: <span id='sp'></span></div>
Javascript
var fileEl = document.querySelector("input");
fileEl.onchange = function(e) {
var file = e.target.files[0]; // selected file
if (!file) {
console.log("nothing here");
return;
}
console.log(file);
console.log('file.size-' + file.size);
console.log('file.type-' + file.type);
console.log('file.actualName-' + file.name);
let start = performance.now();
var mime = file.type, // store mime for later
rd = new FileReader(); // create a FileReader
if (/video/.test(mime)) {
rd.onload = function(e) { // when file has read:
var blob = new Blob([e.target.result], {
type: mime
}), // create a blob of buffer
url = (URL || webkitURL).createObjectURL(blob), // create o-URL of blob
video = document.createElement("video"); // create video element
//console.log(blob);
video.preload = "metadata"; // preload setting
video.addEventListener("loadedmetadata", function() { // when enough data loads
console.log('video.duration-' + video.duration);
console.log('video.videoHeight-' + video.videoHeight);
console.log('video.videoWidth-' + video.videoWidth);
//document.querySelector("div")
// .innerHTML = "Duration: " + video.duration + "s" + " <br>Height: " + video.videoHeight; // show duration
(URL || webkitURL).revokeObjectURL(url); // clean up
console.log(performance.now() - start); // elapsed milliseconds
// ... continue from here ...
});
video.src = url; // start video load
};
} else if (/image/.test(mime)) {
rd.onload = function(e) {
var blob = new Blob([e.target.result], {
type: mime
}),
url = URL.createObjectURL(blob),
img = new Image();
img.onload = function() {
console.log('image');
console.dir('this.height-' + this.height);
console.dir('this.width-' + this.width);
URL.revokeObjectURL(this.src); // clean-up memory
console.log(performance.now() - start); // elapsed milliseconds
}
img.src = url;
};
}
var chunk = file.slice(0, 1024 * 1024 * 10); // read at most the first 10 MB
rd.readAsArrayBuffer(chunk); // read file object
};
jsFiddle Url
https://jsfiddle.net/PratapDessai/0sp3b159/

The OP asks:
What is blob URL? Why is it used?
A Blob is just a byte sequence. Browsers recognize Blobs as byte streams. It is used to get a byte stream from a source.
According to Mozilla's documentation
A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format. The File interface is based on Blob, inheriting blob functionality and expanding it to support files on the user's system.
The OP asks:
Can i make my own blob url on a server?
Yes, you can; there are several ways to do so. For example, try http://php.net/manual/en/function.ibase-blob-echo.php
Read more here:
https://developer.mozilla.org/en-US/docs/Web/API/Blob
http://www.w3.org/TR/FileAPI/#dfn-Blob
https://url.spec.whatwg.org/#urls

Blob URLs are used for showing files that the user uploaded, but there are many other purposes. For example, they can be used to make files harder to copy, which is why it is a little difficult to get a YouTube video as a video file without installing an extension. There are probably more answers. My research is mostly just me using Inspect to try to get a YouTube video, plus an online article.

Another use case of blob urls is to load resources from the server, apply hacks and then tell the browser to interpret them.
One such example would be to load template files or even scss files.
Here is the scss example:
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/sass.js/0.11.1/sass.sync.min.js"></script>
function loadCSS(text) {
  const head = document.getElementsByTagName('head')[0]
  const style = document.createElement('link')
  const css = new Blob([text], {type: 'text/css'})
  style.href = window.URL.createObjectURL(css)
  style.type = 'text/css'
  style.rel = 'stylesheet'
  head.append(style)
}
fetch('/style.scss').then(res => res.text()).then(sass => {
  Sass.compile(sass, ({text}) => loadCSS(text))
})
Now you could swap out Sass.compile for any kind of transformation function you like.
Blob URLs keep your DOM structure clean this way.
I'm sure by now you have your answers, so this is just one more thing you can do with it.
