I have several HTML templates that I need to "compile" and convert to base64. By "compile" I mean injecting the JS and CSS inline, and then converting the result to base64.
I tried gulp-base64, but that only works for images in CSS. Any ideas?
Edit: I just got an idea: maybe I could use gulp-foreach to process each file individually and then use a Buffer to convert each file's contents to base64. I know I could do something like:
> console.log(new Buffer("Hello World").toString('base64'));
SGVsbG8gV29ybGQ=
> console.log(new Buffer("SGVsbG8gV29ybGQ=", 'base64').toString('ascii'))
Hello World
But I'm not really sure how to do it, since I don't yet understand very well how gulp's file streams (vinyl, I think they're called) work. Any help would be greatly appreciated.
I found your question while looking for a solution to the same problem. Your suggestion to use gulp-foreach led me to a solution, although I didn't use that package:
// import the appropriate plugins
const gulp = require('gulp');
const each = require('gulp-each');
const htmlToJs = require('gulp-html-to-js');

// I'm compiling a couple of small PDF files
gulp.task('compile:pdf', () =>
    gulp.src('./files/**/*.pdf')
        // use gulp-each to iterate over the files & convert the
        // files to a base64-encoded data URL
        .pipe(each((content, file, callback) => {
            const output = `data:application/pdf;base64,${Buffer.from(content).toString('base64')}`;
            // the first arg in this callback is the error; the second
            // is the content to pass along via the stream
            callback(null, output);
        }))
        // use gulp-html-to-js to convert the data URL to a JS module
        // which can be imported
        .pipe(htmlToJs())
        // and set the destination...
        .pipe(gulp.dest('./client/modules/helpers/files'))
);
The end result is a JS file with contents that look like this:
'use strict';
module.exports = 'data:application/pdf;base64,... base64 encoded string...';
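If it helps, consuming the generated module is then just a require; the path and filename here are hypothetical:

// Hypothetical consumer: the generated module simply exports
// the base64 data URL as a string.
const pdfDataUrl = require('./client/modules/helpers/files/manual.pdf.js');

// Use it anywhere a URL is accepted, e.g. as the href of a download link.
console.log(pdfDataUrl.slice(0, 28)); // "data:application/pdf;base64,"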
I'm using Expo to build a cross-platform application. In my app, I have a screen where the user can select images or videos to upload.
When I use expo-image-picker to select an image, it gives me an object whose uri starts with file:///, and I can use this uri to display the image.
When I use expo-image-picker-multiple to select multiple images, it gives me objects whose uri starts with asset-library://, and I can't use this uri to display the content or send it to the server.
How can I convert this asset-library:// uri to a file:// uri? What keywords should I use to get better results when searching for this problem, or which tool should I use? I can't really find a proper solution to this one. This occurs on iOS devices.
Thanks!
[EDIT]
Here is my code:
var assetUri = 'asset-library://....';
var tempDir = `${FileSystem.cacheDirectory}${Math.random().toString(36).substring(7)}.jpg`;

await FileSystem.copyAsync({
    from: assetUri,
    to: tempDir
});

try {
    var assetResult = await FileSystem.readAsStringAsync(tempDir, {
        encoding: FileSystem.EncodingType.UTF8
    });
    console.log(assetResult);
} catch (e) {
    console.log(e);
}
File 'file:///var/mobile/Containers/Data/Application/0E-FC42-4630-B3C7-537D5EFB7D1F/Library/Caches/ExponentExperienceData/ichardexpohong/gwmke.jpg' could not be read.
I wanted only the filename, so I used:

let filename = result.assets[0].uri.split('/').pop();

So maybe, if the path after asset-library:// and file:// is the same, you can take away the asset-library:// prefix and replace it with file://:

let uri = 'file://' + object.uri.split('asset-library://')[1];

Hope it helps.
Using Adobe PDF Embed API, you can register a callback:
this.adobeDCView = new window.AdobeDC.View(config);
this.adobeDCView.registerCallback(
    window.AdobeDC.View.Enum.CallbackType.SAVE_API,
    (metaData, content, options) => {
    }
);
According to the docs here (https://www.adobe.io/apis/documentcloud/dcsdk/docs.html?view=view), content is:
content: The ArrayBuffer of file content
When I debug this content using the Chrome inspector, it shows me that content is an Int8Array.
Normally when we upload a PDF file, the user selects a file, we read it as a data URI, get the base64, and push that to AWS. So I need to convert this PDF's data (an Int8Array) to base64, so I can also push it to AWS.
Everything I have found online converts a Uint8Array to base64, and I don't understand how to go from an Int8Array to a Uint8Array. I would think you could just add 128 to each signed int to get a value in the 0-255 range, but this doesn't seem to work.
I have tried using this:
let decoder = new TextDecoder('utf8');
let b64 = btoa(decoder.decode(content));
console.log(b64);
But I get this error:
ERROR DOMException: Failed to execute 'btoa' on 'Window': The string to be encoded contains characters outside of the Latin1 range.
Please help me figure out how to go from Int8Array to Base64.
I use the function in this answer.
For Embed API, use the "content" parameter from the save callback as the input to the function.
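Since the linked function isn't reproduced here, this is a rough sketch of a typical arrayBufferToBase64 helper (my own reconstruction, assuming a browser environment):

function arrayBufferToBase64(buffer) {
    // A Uint8Array view reinterprets the same underlying bytes as unsigned,
    // so there's no need to shift Int8Array values by 128.
    var bytes = new Uint8Array(buffer);
    var binary = '';
    for (var i = 0; i < bytes.byteLength; i++) {
        binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
}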
You can see a working example at this CodePen. The functional part is below.
adobeDCView.registerCallback(
    AdobeDC.View.Enum.CallbackType.SAVE_API,
    function (metaData, content, options) {
        /* Add your custom save implementation here...and based on that resolve or reject response in given format */
        var base64PDF = arrayBufferToBase64(content);
        var fileURL = "data:application/pdf;base64," + base64PDF;
        $("#submitButton").attr("href", fileURL);
        /* End save code */
        return new Promise((resolve, reject) => {
            resolve({
                code: AdobeDC.View.Enum.ApiResponseCode.SUCCESS,
                data: {
                    /* Updated file metadata after successful save operation */
                    metaData: { fileName: urlToPDF.split("/").slice(-1)[0] }
                }
            });
        });
    },
    saveOptions
);
I want to be able to get the file name and type from an image from my camera roll. Here is the link to the file:
"assets-library://asset/asset.JPG?id=ED7AC36B-A150-4C38-BB8C-B6D696F4F2ED&ext=JPG"
How do I extract the file name and MIME type from here? I'm particularly interested in the MIME type, because I can generate random names, but I need to get the type right when uploading the file.
Right now, there is a really good package to handle this called expo-asset-utils.
import AssetUtils from 'expo-asset-utils';
const { localUri, width, height } = await AssetUtils.resolveAsync(uri);
// localUri now contains the asset converted to a 'file://' URI
Try out npm i expo_file_name:

import { getFileName } from 'expo_file_name';

const uri = ...; // the value for the uri goes here
const fileName = getFileName(uri);
// the name of the file is then stored in the fileName variable
I see that CasperJS has a "download" function and an "on resource received" callback, but I do not see the contents of the resource in the callback, and I don't want to download the resource to the filesystem.
I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?
This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment, so I found out where PhantomJS's QtNetwork core puts resources when it caches them.
Long story short, here is my gist. You need the cache.js and mimetype.js files:
https://gist.github.com/bshamric/4717583
// for this to work, you have to call phantomjs with the cache enabled:
// usage: phantomjs --disk-cache=true test.js
var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

// this is the path that the QtNetwork classes use for caching files for their HTTP client;
// the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

// when a resource is received, include a reference to it in the cache object
page.onResourceReceived = function(response) {
    // I only cache images, but you can change this
    if (response.contentType.indexOf('image') >= 0) {
        cache.includeResource(response);
    }
};

// when the page is done loading, go through each cached resource and do
// something with it; I'm just saving them to files
page.onLoadFinished = function(status) {
    for (var index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("." + cache.cacheExtension, "." + ext);
        fs.write('saved/' + finalFile, cache.cachedResources[index].getContents(), 'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});
Then when you call phantomjs, just make sure the cache is enabled:
phantomjs --disk-cache=true test.js
Some notes:
I wrote this to get the images on a page without using the proxy or taking a low-res snapshot. QT uses compression on certain text resources, so you will have to deal with decompression if you use this for text files. I also ran a quick test pulling in HTML resources, and it didn't parse the HTTP headers out of the result. But this is useful to me; hopefully someone else will find it so. Modify it if you have problems with a specific content type.
I've found that until PhantomJS matures a bit, this is a bit of a headache for them, according to issue 158: http://code.google.com/p/phantomjs/issues/detail?id=158
So you want to do it anyway? I opted to go one level higher to accomplish this: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, downloaded and installed it, set it up, took their example code, and made this in proxy.py:
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

    def do_request(self, data):
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        #print '<< %s' % repr(data[:100])
        request_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        print "Content type: %s" % (headers['content-type'])
        if headers['content-type'] == 'text/x-comma-separated-values':
            f = open('data.csv', 'w')
            f.write(data)
            f.close()
        print ''
        return data

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()
Then I fire it up
python proxy.py
Next I execute phantomjs with the proxy specified...
phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js
You may want to turn web security back on; it was unnecessary for me, since I'm scraping just one source. You should now see a bunch of text flowing through your proxy console, and if anything lands with the MIME type "text/x-comma-separated-values", it'll be saved as data.csv. This also saves all the headers and everything, but if you've come this far I'm sure you can figure out how to pop those off.
One other detail: I found I had to disable gzip encoding. I could use zlib to decompress gzipped data from my own Apache web server, but if it comes out of IIS or similar, the decompression errors out, and I'm not sure about that part of it.
So my power company won't offer me an API? Fine! We do it the hard way!
Did not realize I could grab the source from the document object like this:
casper.start(url, function() {
    var js = this.evaluate(function() {
        return document;
    });
    this.echo(js.all[0].outerHTML);
});
More info here.
You can use Casper.debugHTML() to print out the contents of an HTML resource:
var casper = require('casper').create();

casper.start('http://google.com/', function() {
    this.debugHTML();
});

casper.run();
You can also store the HTML contents in a var using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in latest master).
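For example (a small sketch; getPageContent() returns the raw page source as a string):

casper.start('http://google.com/', function() {
    var html = this.getPageContent();
    // do something with the raw page source
    this.echo(html.length + ' bytes of HTML');
});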
I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/) but anything for Node?
textract is a great lib that supports PDFs, Doc, Docx, etc.
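A quick sketch of how it's called (the file path is just an example):

var textract = require('textract');

textract.fromFileWithPath('./report.docx', function(error, text) {
    if (error) {
        return console.error(error);
    }
    console.log(text); // the extracted plain text
});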
Looks like there are a few for PDF, but I didn't find any for Word.
CPU-bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefit using Node for it over any other language). A pragmatic approach would be to find a good tool and utilise it from Node.
I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/
While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
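Something along these lines, assuming docsplit is installed and on the PATH (treat the exact flags as a sketch and check docsplit's usage):

var exec = require('child_process').exec;

// Extract the text from a PDF into an output directory.
exec('docsplit text input.pdf --output ./out', function(err, stdout, stderr) {
    if (err) {
        return console.error(stderr);
    }
    console.log('Text extracted to ./out');
});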
You can easily convert one into the other, or use, for example, a .doc template to generate a .pdf file, but you will probably want to use an existing web service for this task.
This can be done using the services of LiveDocx, for example.
To use this service from Node, see node-livedocx (Disclaimer: I am the author of this node module).
I would suggest looking into unoconv for your initial conversion; it uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.
I'd set up a few workers with all the necessities installed and use a request/response queue to handle the conversions (you may want to look into kue or zmq); a sketch of the conversion call itself follows below.
In general this is a CPU-bound, heavy task that should be offloaded. Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.
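For the conversion step itself, a rough sketch (assuming unoconv and a LibreOffice install are available on the worker):

var execFile = require('child_process').execFile;

// Convert report.docx to report.pdf alongside it.
execFile('unoconv', ['-f', 'pdf', 'report.docx'], function(err) {
    if (err) {
        return console.error(err);
    }
    console.log('Wrote report.pdf');
});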
Note: I know this question is old; I just wanted to provide a current answer for others coming across this.
You can use pdf-text for PDF files. It will extract text from a PDF into an array of text 'chunks', which is useful for doing fuzzy parsing on structured PDF text.
var pdfText = require('pdf-text');

var pathToPdf = __dirname + "/info.pdf";

pdfText(pathToPdf, function(err, chunks) {
    // chunks is an array of strings
    // loosely corresponding to text objects within the pdf
    // for a more concrete example, view the test file in this repo
});

var fs = require('fs');
var buffer = fs.readFileSync(pathToPdf);

pdfText(buffer, function(err, chunks) {
    console.log(chunks);
});
For .docx files you can use mammoth, which will extract the text from .docx files.
var mammoth = require("mammoth");
mammoth.extractRawText({path: "./doc.docx"})
.then(function(result){
var text = result.value; // The raw text
console.log(text);
var messages = result.messages;
})
.done();
I hope this will help.
For parsing PDF files you can use the pdf2json node module.
It allows you to convert a PDF file to JSON, as well as to raw text data.
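A short sketch of its event-based API (from memory; check the pdf2json README for the current event names):

var PDFParser = require('pdf2json');

var pdfParser = new PDFParser();
pdfParser.on('pdfParser_dataError', function(err) {
    console.error(err.parserError);
});
pdfParser.on('pdfParser_dataReady', function(pdfData) {
    // pdfData is a JSON description of pages, text runs, fills, etc.
    console.log(JSON.stringify(pdfData));
});
pdfParser.loadPDF('./info.pdf');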
Another good option if you only need to convert from Word documents is Mammoth.js.
Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, and convert them to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.
There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.
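A short sketch of that conversion, mirroring the extractRawText example above (mammoth.convertToHtml):

var mammoth = require("mammoth");

mammoth.convertToHtml({path: "./doc.docx"})
    .then(function(result) {
        console.log(result.value);    // the generated HTML
        console.log(result.messages); // any warnings produced during conversion
    })
    .done();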
Here is an example showing how to download and extract text from a PDF using PDF.js:
import _ from 'lodash';
import superagent from 'superagent';
import pdf from 'pdfjs-dist';

const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';

const main = async () => {
    const response = await superagent.get(url).buffer();
    const data = response.body;
    const doc = await pdf.getDocument({ data });
    for (const i of _.range(doc.numPages)) {
        const page = await doc.getPage(i + 1);
        const content = await page.getTextContent();
        for (const { str } of content.items) {
            console.log(str);
        }
    }
};

main().catch(error => console.error(error));
You can use the Aspose.Words Cloud SDK for Node.js to extract text from DOC/DOCX, Open Office, and PDF files. It's a paid API, but the free plan provides 150 free monthly API calls.
P.S.: I'm a developer evangelist at Aspose.
const { WordsApi, ConvertDocumentRequest } = require("asposewordscloud");
const fs = require('fs');

// Get Customer ID and Customer Key from https://dashboard.aspose.cloud/
const wordsApi = new WordsApi("xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxx");

const request = new ConvertDocumentRequest({
    format: "txt",
    document: fs.createReadStream("C:/Temp/02_pages.pdf"),
});

const outputFile = "C:/Temp/ConvertPDFtotxt.txt";

wordsApi.convertDocument(request).then((result) => {
    console.log(result.response.statusCode);
    console.log(result.body.byteLength);
    fs.writeFileSync(outputFile, result.body);
}).catch(function(err) {
    // Deal with an error
    console.log(err);
});