Can I read PDF or Word Docs with Node.js?

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/) but anything for Node?

textract is a great lib that supports PDFs, Doc, Docx, etc.

Looks like there are a few for PDF, but I didn't find any for Word.
CPU-bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefit from using Node over any other language). A pragmatic approach would be to find a good tool and utilise it from Node.
I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/
While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
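As a rough illustration of calling an external tool from Node, here is a minimal sketch. It uses execFileSync rather than the exec call linked above, only because the synchronous form keeps the example short. The helper name runTool is made up for this sketch, the example runs the Node binary itself as a stand-in for a real converter, and the docsplit invocation in the comment is an assumption about its command line.

```javascript
const { execFileSync } = require('child_process');

// Run an external command and return its stdout as a trimmed string.
// Arguments are passed as an array, so no shell quoting is involved.
function runTool(command, args) {
  return execFileSync(command, args, { encoding: 'utf8' }).trim();
}

// Stand-in for a real converter: ask the current Node binary to print a line.
// With docsplit installed, the call might instead look like
//   runTool('docsplit', ['text', 'info.pdf'])   // assumed CLI usage
const output = runTool(process.execPath, ['-e', 'console.log("converted")']);
console.log(output);
```

In a server you would probably prefer the asynchronous execFile or spawn, since the synchronous variant blocks the event loop while the tool runs.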

You can easily convert one into the other, or use, for example, a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.
This can be done using the services of Livedocx, for example.
To use this service from Node, see node-livedocx (disclaimer: I am the author of this Node module).

I would suggest looking into unoconv for your initial conversion. It uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.
I'd set up a few workers with all the necessities installed, and use a request/response queue for handling the conversion (you may want to look into kue or zmq).
In general this is a CPU-bound, heavy task that should be offloaded. Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.
Note: I know this question is old, just wanted to provide a current answer for others coming across this.

You can use pdf-text for PDF files. It will extract text from a PDF into an array of text 'chunks', which is useful for doing fuzzy parsing on structured PDF text.
var pdfText = require('pdf-text')
var pathToPdf = __dirname + "/info.pdf"
pdfText(pathToPdf, function(err, chunks) {
//chunks is an array of strings
//loosely corresponding to text objects within the pdf
//for a more concrete example, view the test file in this repo
})
var fs = require('fs')
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
console.log(chunks)
})
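For docx files you can use mammoth; it will extract the text from .docx files.
var mammoth = require("mammoth");
mammoth.extractRawText({path: "./doc.docx"})
  .then(function(result){
    var text = result.value; // The raw text
    console.log(text);
    var messages = result.messages; // Any warnings produced during extraction
  })
  .catch(function(error){
    // Report failures such as a missing or unreadable file
    console.error(error);
  });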
I hope this will help.

For parsing pdf files you can use pdf2json node module
It allows you to convert pdf file to json as well as to raw text data.

Another good option if you only need to convert from Word documents is Mammoth.js.
Mammoth is designed to convert .docx documents, such as those created
by Microsoft Word, and convert them to HTML. Mammoth aims to produce
simple and clean HTML by using semantic information in the document,
and ignoring other details. For instance, Mammoth converts any
paragraph with the style Heading 1 to h1 elements, rather than
attempting to exactly copy the styling (font, text size, colour, etc.)
of the heading.
There's a large mismatch between the structure used by .docx and the
structure of HTML, meaning that the conversion is unlikely to be
perfect for more complicated documents. Mammoth works best if you only
use styles to semantically mark up your document.

Here is an example showing how to download and extract text from a PDF using PDF.js:
import _ from 'lodash';
import superagent from 'superagent';
import pdf from 'pdfjs-dist';
const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';
const main = async () => {
const response = await superagent.get(url).buffer();
const data = response.body;
const doc = await pdf.getDocument({ data }).promise; // getDocument returns a loading task
for (const i of _.range(doc.numPages)) {
const page = await doc.getPage(i + 1);
const content = await page.getTextContent();
for (const { str } of content.items) {
console.log(str);
}
}
};
main().catch(error => console.error(error));

You can use Aspose.Words Cloud SDK for Node.js to extract text from DOC/DOCX, OpenOffice, and PDF files. It's a paid API, but the free plan provides 150 free monthly API calls.
P.S.: I'm a developer evangelist at Aspose.
const { WordsApi, ConvertDocumentRequest } = require("asposewordscloud");
const fs = require('fs');
// Get Customer ID and Customer Key from https://dashboard.aspose.cloud/
const wordsApi = new WordsApi("xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxx");
const request = new ConvertDocumentRequest({
format: "txt",
document: fs.createReadStream("C:/Temp/02_pages.pdf"),
});
const outputFile = "C:/Temp/ConvertPDFtotxt.txt";
wordsApi.convertDocument(request).then((result) => {
console.log(result.response.statusCode);
console.log(result.body.byteLength);
fs.writeFileSync(outputFile, result.body);
}).catch(function(err) {
// Deal with an error
console.log(err);
});

Related

How can I render multiple URL's into a single PDF

I'm attempting to open a series of URL's to render the output, then combine into a single PDF using PhantomJS, but I cannot find any documentation on how to do this. I'm just using trial and error, but not getting anywhere - hoping somebody knows how to do this.
I'm not completely set on PhantomJS, so if you know of a better command line, node or JAVA tool that would be better, I'm all ears (or eyes in this case).
Here is the code I have that renders a single page. I've tried replicating the open/render, but it always overwrites the PDF instead of appending to it.
var page = require('webpage').create(),
system = require('system'),
fs = require('fs'),
pages = {
page1: 'http://localhost/test1.html',
page2: 'http://localhost/test2.html'
};
page.paperSize = {
format: 'A4',
orientation: 'portrait',
};
page.settings.dpi = "96";
// this renders a single page and overwrites the existing PDF or creates a new one
page.open(pages.page1, function() {
setTimeout(function() {
page.render('capture.pdf');
phantom.exit();
}, 5000);
});
PhantomJS renders one web page into one PDF file, so if you can merge several URLs into one HTML file, you could open it in PhantomJS and make a PDF.
But it would be simpler to make several PDFs and then merge them into one with something like pdftk at the end of the script, launching the merge command from the PhantomJS child_process module.
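The merge step could be sketched roughly as follows. This is a Node sketch rather than PhantomJS code, it assumes pdftk is installed and on the PATH, and the helper names and file names are made up for illustration. It relies on pdftk's "cat output" form for concatenating input files.

```javascript
const { execFileSync } = require('child_process');

// Build the argument list for: pdftk in1.pdf in2.pdf ... cat output merged.pdf
function buildPdftkArgs(inputFiles, outputFile) {
  if (inputFiles.length === 0) {
    throw new Error('at least one input PDF is required');
  }
  return [...inputFiles, 'cat', 'output', outputFile];
}

// Merge the given PDFs into one file; requires pdftk on the PATH.
function mergePdfs(inputFiles, outputFile) {
  execFileSync('pdftk', buildPdftkArgs(inputFiles, outputFile));
}

// Hypothetical file names, one per rendered URL.
const args = buildPdftkArgs(['page1.pdf', 'page2.pdf'], 'capture.pdf');
console.log('pdftk ' + args.join(' '));
```

Rendering each URL to its own file first (page1.pdf, page2.pdf, and so on) avoids the overwrite problem described in the question, because the merge only happens once every page has been written.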

Generating pdf in Angular 2

Me again with another Angular 2 question.
We are trying to generate a PDF file from an HTML source. I searched and searched trying to find an Angular 2 wrapper for the jsPDF or pdfmake libraries, but I can't find any. Is there something I am missing? Is there a pure JavaScript way, and is that good practice? Hope you guys can help.
Thanks in advance.
Francois
I was searching for the same thing some weeks ago. I decided to do the generating on the server side (Node.js in my case). However, you can do it on the client side with jsPDF, like you mentioned.
You don't need a wrapper; just include the script and then access jsPDF through the window object. A wrapper will make it easier to test, though.
I don't remember exactly, but I think it was something like this:
var doc = new window.jsPDF();
PDF is a complex file format. There may be some PDF parsers/generators built with JS, but they will be limited and slow; your best bet is to do something server-side.
HTML code:
<button type="button" (click)="downloadPdf()"
class="button">download</button>
Component.ts:
downloadPdf(){
this.authService.downloadPdf().subscribe(data => {
this.partnerDetails = data
} ); }
routes.js:
router.get('/downloadPdf',partnerCntrl.downloadPdf);
partnercntrl:
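module.exports.downloadPdf = function (req, res) {
  var fs = require('fs');
  var pdf = require('html-pdf');
  var html = fs.readFileSync('./test/businesscard.html', 'utf8');
  var options = { format: 'Letter' };
  // "result" avoids shadowing the outer response object "res"
  pdf.create(html, options).toFile('./businesscard.pdf', function(err, result) {
    if (err) {
      console.log(err);
      return res.status(500).send('PDF generation failed');
    }
    console.log(result); // { filename: '/app/businesscard.pdf' }
    res.download(result.filename); // send the file back (assumes an Express response)
  });
};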

Is there a gulp plugin to compile files contents to base64?

I have several html templates I require to "compile" and convert to base64 format. By compile I mean injecting JS and CSS inline, and then converting it to base64 format.
I tried gulp-base64 but that only works for images in CSS. Any ideas?
Edit: I just got an idea that maybe I could use gulp-foreach to process each file individually and then use a Buffer to convert each file's contents to base64 format. I know I could do something like:
> console.log(Buffer.from("Hello World").toString('base64'));
SGVsbG8gV29ybGQ=
> console.log(Buffer.from("SGVsbG8gV29ybGQ=", 'base64').toString('ascii'))
Hello World
But I'm not really sure how to do it, since I don't yet understand how file streams (I think they're called vinyl) work in gulp. Any help would be greatly appreciated.
I found your question while looking for a solution to the same problem. Your suggestion to use gulp-foreach led me to a solution, although I didn't use that package:
// import the appropriate plugins
const gulp = require('gulp');
const each = require('gulp-each');
const htmlToJs = require('gulp-html-to-js');
// I'm compiling a couple of small PDF files
gulp.task('compile:pdf', () =>
  gulp.src('./files/**/*.pdf')
    // use gulp-each to iterate over the files & convert the
    // files to a base64-encoded data URL
    .pipe(each((content, file, callback) => {
      // Buffer.from replaces the deprecated "new Buffer(...)" constructor
      const output = `data:application/pdf;base64,${Buffer.from(content).toString('base64')}`;
      // the first arg in this callback is the error; the second
      // is the content to pass along via the stream
      callback(null, output)
    }))
    // use gulp-html-to-js to convert the data URL to a JS module
    // which can be imported
    .pipe(htmlToJs())
    // and set the destination...
    .pipe(gulp.dest('./client/modules/helpers/files'))
);
The end result is a JS file with contents that look like this:
'use strict';
module.exports = 'data:application/pdf;base64,... base64 encoded string...';
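For readers who would rather avoid an extra plugin, the same encoding step can be sketched with Node's built-in stream module. This is only an illustration, not part of the answer above: the function names are made up, and the transform assumes file objects shaped like Vinyl files in buffer mode, i.e. with a `contents` Buffer.

```javascript
const { Transform } = require('stream');

// Encode a Buffer as a base64 data URL with the given MIME type.
function toDataUrl(buffer, mimeType) {
  return `data:${mimeType};base64,${buffer.toString('base64')}`;
}

// Object-mode transform that rewrites each file-like object's `contents`.
// Assumes objects shaped like Vinyl files in buffer mode ({ contents: Buffer }).
function dataUrlTransform(mimeType) {
  return new Transform({
    objectMode: true,
    transform(file, _encoding, callback) {
      file.contents = Buffer.from(toDataUrl(file.contents, mimeType));
      callback(null, file);
    },
  });
}

const example = toDataUrl(Buffer.from('Hello World'), 'text/plain');
console.log(example);
```

In a gulpfile this would slot in as `.pipe(dataUrlTransform('application/pdf'))`. Working on the raw `contents` Buffer sidesteps any string conversion of binary files.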

What is a blob URL and why it is used?

I am having trouble with blob URLs.
I was searching for src of a video tag on YouTube and I found that the video src was like:
src="blob:https://video_url"
I opened the blob URL that was in src of the video, but it gave an error. I can't open the link, but it was working with the src tag. How is this possible?
I have a few questions:
What is a blob URL?
Why it is used?
Can I make my own blob URL on a server?
Any additional details about blob URLs would be helpful as well.
Blob URLs (ref W3C, official name) or Object-URLs (ref. MDN and method name) are used with a Blob or a File object.
src="blob:https://crap.crap" I opened the blob url that was in src of
video it gave a error and i can't open but was working with the src
tag how it is possible?
Blob URLs can only be generated internally by the browser. URL.createObjectURL() will create a special reference to the Blob or File object which later can be released using URL.revokeObjectURL(). These URLs can only be used locally in the single instance of the browser and in the same session (ie. the life of the page/document).
What is blob url?
Why it is used?
Blob URL/Object URL is a pseudo protocol to allow Blob and File objects to be used as URL source for things like images, download links for binary data and so forth.
For example, you cannot hand an Image object raw byte data, as it would not know what to do with it. Images (which are binary data) must be loaded via URLs, and the same applies to anything that requires a URL as its source. Instead of uploading the binary data and then serving it back via a URL, it is better to use an extra local step to access the data directly without going via a server.
It is also a better alternative to Data URIs, which are strings encoded as Base64. The problem with Data URIs is that each char takes two bytes in JavaScript. On top of that, 33% is added due to the Base64 encoding. Blobs are pure binary byte arrays which do not have any significant overhead as Data URIs do, which makes them faster and smaller to handle.
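The Base64 overhead mentioned above can be checked with a few lines of Node. This is just a sketch of the arithmetic: every 3 raw bytes become 4 Base64 characters, and the helper name base64Length is made up for the example.

```javascript
// Length of the Base64 text for `byteCount` raw bytes (including padding).
function base64Length(byteCount) {
  return Math.ceil(byteCount / 3) * 4;
}

const raw = Buffer.alloc(3000);          // 3000 raw bytes
const encoded = raw.toString('base64');  // the same data as Base64 text
console.log(raw.length, encoded.length); // 3000 4000
```

The encoded text is one third larger than the raw data, before counting any per-character cost of holding it in a JavaScript string.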
Can i make my own blob url on a server?
No, Blob URLs/Object URLs can only be made internally in the browser. You can make Blobs and get File objects via the File Reader API, although BLOB just means Binary Large OBject and is stored as byte arrays. A client can request the data to be sent as either an ArrayBuffer or a Blob. The server should send the data as pure binary data. Databases often use Blob to describe binary objects as well, and in essence we are basically talking about byte arrays.
Any additional details?
You need to encapsulate the binary data as a BLOB object, then use URL.createObjectURL() to generate a local URL for it:
var blob = new Blob([arrayBufferWithPNG], {type: "image/png"}),
url = URL.createObjectURL(blob),
img = new Image();
img.onload = function() {
URL.revokeObjectURL(this.src); // clean-up memory
document.body.appendChild(this); // add image to DOM
}
img.src = url; // can now "stream" the bytes
This JavaScript function shows the difference between using the Blob File API and a data URI to download a JSON file in the client browser:
/**
 * Save text as a file using a temporary HTML <a> element and a Blob
 * @author Loreto Parisi
 */
var saveAsFile = function(fileName, fileContents) {
  var downloadLink = document.createElement("a");
  downloadLink.download = fileName;
  downloadLink.style.display = "none";
  if (typeof(Blob) != 'undefined') { // Alternative 1: using Blob
    var textFileAsBlob = new Blob([fileContents], {type: 'text/plain'});
    downloadLink.href = (window.URL || window.webkitURL).createObjectURL(textFileAsBlob);
  } else { // Alternative 2: using Data
    downloadLink.href = 'data:text/plain;charset=utf-8,' + encodeURIComponent(fileContents);
  }
  // remove the temporary link once it has been clicked
  downloadLink.onclick = function(event) {
    document.body.removeChild(event.target);
  };
  document.body.appendChild(downloadLink);
  downloadLink.click();
} // saveAsFile
/* Example */
var jsonObject = {"name": "John", "age": 30, "car": null};
saveAsFile('out.json', JSON.stringify(jsonObject, null, 2));
The function is called like saveAsFile('out.json', jsonString);. It creates a Blob that the browser downloads directly as a file, using the File API's URL.createObjectURL.
In the else branch, the same result is obtained via the href attribute plus a data URI, but this has several limitations that the Blob API does not have.
I have modified a working solution to handle both cases: when a video is uploaded and when an image is uploaded. Hope it will help someone.
HTML
<input type="file" id="fileInput">
<div> duration: <span id='sp'></span></div>
Javascript
var fileEl = document.querySelector("input");
fileEl.onchange = function(e) {
var file = e.target.files[0]; // selected file
if (!file) {
console.log("nothing here");
return;
}
console.log(file);
console.log('file.size-' + file.size);
console.log('file.type-' + file.type);
console.log('file.actualName-' + file.name);
let start = performance.now();
var mime = file.type, // store mime for later
rd = new FileReader(); // create a FileReader
if (/video/.test(mime)) {
rd.onload = function(e) { // when file has read:
var blob = new Blob([e.target.result], {
type: mime
}), // create a blob of buffer
url = (URL || webkitURL).createObjectURL(blob), // create o-URL of blob
video = document.createElement("video"); // create video element
//console.log(blob);
video.preload = "metadata"; // preload setting
video.addEventListener("loadedmetadata", function() { // when enough data loads
console.log('video.duration-' + video.duration);
console.log('video.videoHeight-' + video.videoHeight);
console.log('video.videoWidth-' + video.videoWidth);
//document.querySelector("div")
// .innerHTML = "Duration: " + video.duration + "s" + " <br>Height: " + video.videoHeight; // show duration
(URL || webkitURL).revokeObjectURL(url); // clean up
console.log(performance.now() - start); // elapsed time in ms
// ... continue from here ...
});
video.src = url; // start video load
};
} else if (/image/.test(mime)) {
rd.onload = function(e) {
var blob = new Blob([e.target.result], {
type: mime
}),
url = URL.createObjectURL(blob),
img = new Image();
img.onload = function() {
console.log('image');
console.dir('this.height-' + this.height);
console.dir('this.width-' + this.width);
URL.revokeObjectURL(this.src); // clean-up memory
console.log(performance.now() - start); // elapsed time in ms
}
img.src = url;
};
}
var chunk = file.slice(0, 1024 * 1024 * 10); // first 10 MB
rd.readAsArrayBuffer(chunk); // read file object
};
jsFiddle Url
https://jsfiddle.net/PratapDessai/0sp3b159/
The OP asks:
What is blob URL? Why is it used?
A Blob is just a byte sequence. Browsers recognize Blobs as byte streams. A blob URL is used to get that byte stream from a source.
According to Mozilla's documentation
A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format. The File interface is based on Blob, inheriting blob functionality and expanding it to support files on the user's system.
The OP asks:
Can i make my own blob url on a server?
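Yes, a server can produce and send BLOB (binary) data in several ways; for example, see http://php.net/manual/en/function.ibase-blob-echo.php. Note that this yields binary data, not a blob: URL, which only the browser can create.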
Read more here:
https://developer.mozilla.org/en-US/docs/Web/API/Blob
http://www.w3.org/TR/FileAPI/#dfn-Blob
https://url.spec.whatwg.org/#urls
Blob URLs are used for showing files that the user uploaded, but there are many other purposes. For example, they could be used for more secure file display: it is a little difficult to get a YouTube video as a video file without downloading an extension. There are probably more answers, though. My research is mostly just me using Inspect to try to get a YouTube video, plus an online article.
Another use case of blob urls is to load resources from the server, apply hacks and then tell the browser to interpret them.
One such example would be to load template files or even scss files.
Here is the scss example:
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/sass.js/0.11.1/sass.sync.min.js"></script>
function loadCSS(text) {
const head = document.getElementsByTagName('head')[0]
const style = document.createElement('link')
const css = new Blob([text], {type: 'text/css'})
style.href = window.URL.createObjectURL(css)
style.type = 'text/css'
style.rel = 'stylesheet'
head.append(style)
}
fetch('/style.scss').then(res => res.text()).then(sass => {
Sass.compile(sass, ({text}) => loadCSS(text))
})
Now you could swap out Sass.compile for any kind of transformation function you like.
Blob URLs keep your DOM structure clean this way.
I'm sure by now you have your answers, so this is just one more thing you can do with it.

How do I get data from a background page to the content script in google chrome extensions

I've been trying to send data from my background page to a content script in my Chrome extension, but I can't seem to get it to work. I've read a few posts online, but they're not really clear and seem quite high level. I've managed to get OAuth working using the OAuth contacts example from the Chrome samples. The authentication works: I can get the data and display it in an HTML page by opening a new tab.
I want to send this data to a content script.
I'm having a lot of trouble with this and would really appreciate it if someone could outline the explicit steps you need to follow to send data from a background page to a content script, or even better, some code. Any takers?
The code for my background page is below (I've excluded the OAuth parameters and other details):
function onContacts(text, xhr) {
  contacts = [];
  var data = JSON.parse(text);
  var realdata = data.contacts;
  for (var i = 0, person; person = realdata.person[i]; i++) {
    var contact = {
      'name' : person['name'],
      'emails' : person['email']
    };
    // this array "contacts" is read by the contacts.html page when opened in a new tab
    contacts.push(contact);
  }
  chrome.tabs.create({ 'url' : 'contacts.html'}); // sending data to new tab
  //chrome.tabs.executeScript(null,{file: "contentscript.js"}); // maybe this would work?
};
function getContacts() {
  oauth.authorize(function() {
    console.log("on authorize");
    setIcon();
    var url = "http://mydataurl/";
    oauth.sendSignedRequest(url, onContacts);
  });
};
chrome.browserAction.onClicked.addListener(getContacts);
As I'm not quite sure how to get the data into the content script, I won't bother posting the multiple versions of my failed content scripts. If I could just get a sample showing how to request the "contacts" array from my content script, and how to send the data from the background page, that would be great!
You have two options getting the data into the content script:
Using Tab API:
http://code.google.com/chrome/extensions/tabs.html#method-executeScript
Using Messaging:
http://code.google.com/chrome/extensions/messaging.html
Using Tab API
I usually use this approach when my extension will just be used once in a while, for example, setting the image as my desktop wallpaper. People don't set a wallpaper every second, or every minute. They usually do it once a week or even day. So I just inject a content script to that page. It is pretty easy to do so, you can either do it by file or code as explained in the documentation:
chrome.tabs.executeScript(tab.id, {file: 'inject_this.js'}, function() {
console.log('Successfully injected script into the page');
});
Using Messaging
If you constantly need information from your websites, it would be better to use messaging. There are two types of messaging: long-lived connections and single requests. Your content script (that you define in the manifest) can listen for extension requests:
chrome.extension.onRequest.addListener(function(request, sender, sendResponse) {
if (request.method == 'ping')
sendResponse({ data: 'pong' });
else
sendResponse({});
});
And your background page could send a message to that content script through messaging. As shown below, it will get the currently selected tab and send a request to that page.
chrome.tabs.getSelected(null, function(tab) {
chrome.tabs.sendRequest(tab.id, {method: 'ping'}, function(response) {
console.log(response.data);
});
});
Which method to use depends on your extension. I have used both. For an extension that will be used constantly, I use messaging (long-lived). For an extension that will not be used every time, you don't need the content script in every single page; you can just use the Tab API's executeScript, because it will inject a content script only when you need it.
Hope that helps! Do a search on Stack Overflow; there are many answers about content scripts and background pages.
To follow on Mohamed's point.
If you want to pass data from the background script to the content script at initialisation, you can generate another simple script that contains only JSON and execute it beforehand.
Is that what you are looking for?
Otherwise, you will need to use the message passing interface
In the background page:
// Subscribe to onVisited event, so that injectSite() is called once at every pageload.
chrome.history.onVisited.addListener(injectSite);
function injectSite(data) {
// get custom configuration for this URL in the background page.
var site_conf = getSiteConfiguration(data.url);
if (site_conf)
{
chrome.tabs.executeScript({ code: 'PARAMS = ' + JSON.stringify(site_conf) + ';' });
chrome.tabs.executeScript({ file: 'site_injection.js' });
}
}
In the content script page (site_injection.js)
// read config directly from background
console.log(PARAMS.whatever);
I thought I'd update this answer for current and future readers.
According to the Chrome API, chrome.extension.onRequest is "[d]eprecated since Chrome 33. Please use runtime.onMessage."
See this tutorial from the Chrome API for code examples on the messaging API.
Also, there are similar (newer) SO posts, such as this one, which are more relevant for the time being.
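As a rough sketch of what the earlier ping/pong example might look like with the newer API, the following uses chrome.runtime.onMessage, chrome.tabs.query and chrome.tabs.sendMessage. The function names handleMessage and pingActiveTab are made up for illustration, and both halves are shown in one block only for brevity; in a real extension the listener belongs in the content script and the sender in the background page.

```javascript
// Content-script side: answer 'ping' requests from the background page.
function handleMessage(request, sender, sendResponse) {
  if (request.method === 'ping') {
    sendResponse({ data: 'pong' });
  } else {
    sendResponse({});
  }
}

// Background side: send a 'ping' to the active tab and log the reply.
function pingActiveTab() {
  chrome.tabs.query({ active: true, currentWindow: true }, function (tabs) {
    chrome.tabs.sendMessage(tabs[0].id, { method: 'ping' }, function (response) {
      console.log(response.data);
    });
  });
}

// Register the listener only when running inside an extension.
if (typeof chrome !== 'undefined' && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(handleMessage);
}
```

Keeping the handler as a plain named function, separate from its registration, also makes it easy to exercise outside the browser.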