Grab the resource contents in CasperJS or PhantomJS - phantomjs

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.
I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?

This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment so I found out where phantomjs's QTNetworking core put the resources when it caches them.
Long story short, here is my gist. You need the cache.js and mimetype.js files:
https://gist.github.com/bshamric/4717583
//for this to work, you have to call phantomjs with the cache enabled:
//usage: phantomjs --disk-cache=true test.js
var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');
//this is the path that QTNetwork classes uses for caching files for it's http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';
var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };
//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
//I only cache images, but you can change this
if(response.contentType.indexOf('image') >= 0)
{
cache.includeResource(response);
}
};
//when the page is done loading, go through each cachedResource and do something with it,
//I'm just saving them to a file
page.onLoadFinished = function(status) {
for(index in cache.cachedResources) {
var file = cache.cachedResources[index].cacheFileNoPath;
var ext = mimetype.ext[cache.cachedResources[index].mimetype];
var finalFile = file.replace("."+cache.cacheExtension,"."+ext);
fs.write('saved/'+finalFile,cache.cachedResources[index].getContents(),'b');
}
};
page.open(url, function () {
page.render('saved/google.pdf');
phantom.exit();
});
Then when you call phantomjs, just make sure the cache is enabled:
phantomjs --disk-cache=true test.js
Some notes:
I wrote this for the purpose of getting the images on a page without using the proxy or taking a low res snapshot. QT uses compression on certain text file resources and you will have to deal with the decompression if you use this for text files. Also, I ran a quick test to pull in html resources and it didn't parse the http headers out of the result. But, this is useful to me, hopefully someone else will find it so, modify it if you have problems with a specific content type.

I've found that until the phantomjs matures a bit, according to the issue 158 http://code.google.com/p/phantomjs/issues/detail?id=158 this is a bit of a headache for them.
So you want to do it anyways? I've opted to go a bit higher to accomplish this and have grabbed PyMiProxy over at https://github.com/allfro/pymiproxy, downloaded, installed, set it up, took their example code and made this in proxy.py
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO
class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):
def do_request(self, data):
data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1);
return data
def do_response(self, data):
#print '<< %s' % repr(data[:100])
request_line, headers_alone = data.split('\r\n', 1)
headers = Message(StringIO(headers_alone))
print "Content type: %s" %(headers['content-type'])
if headers['content-type'] == 'text/x-comma-separated-values':
f = open('data.csv', 'w')
f.write(data)
print ''
return data
if __name__ == '__main__':
proxy = AsyncMitmProxy()
proxy.register_interceptor(DebugInterceptor)
try:
proxy.serve_forever()
except KeyboardInterrupt:
proxy.server_close()
Then I fire it up
python proxy.py
Next I execute phantomjs with the proxy specified...
phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js
You may want to turn your security on or such, it was needless for me currently as I'm scraping just one source. You should now see a bunch of text flowing through your proxy console and if it lands on something with the mime type of "text/x-comma-separated-values" it'll save it as data.csv. This will also save all the headers and everything, but if you've come this far I'm sure you can figure out how to pop those off.
One other detail, I've found that I've had to disable gzip encoding, I could use zlib and decompress data in gzip from my own apache webserver, but if it comes out of IIS or such the decompression will get errors and I'm not sure about that part of it.
So my power company won't offer me an API? Fine! We do it the hard way!

Did not realize I could grab the source from the document object like this:
casper.start(url, function() {
var js = this.evaluate(function() {
return document;
});
this.echo(js.all[0].outerHTML);
});
More info here.

You can use Casper.debugHTML() to print out contents of a HTML resource:
var casper = require('casper').create();
casper.start('http://google.com/', function() {
this.debugHTML();
});
casper.run();
You can also store the HTML contents in a var using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in lastest master)

Related

Show local image file:///tmp/someimage.jpg

Scenario
I'm contributing for a OSS project that is build on BlazorServerSide and ElectronNET.API Version 9.31.1.
In an Electron window we would like to show images from local storage UI via <img> tag.
What I have tried:
I have tried with:
<img src="file:///home/dani/pictures/someimage.jpg" />
But doesn't work. Image doesn't appear. I have then tried to create electron window with WebSecurity = false, but also doesn't help (images appears as broken on UI):
var browserWindowOptions = new BrowserWindowOptions
{
WebPreferences = new WebPreferences
{
WebSecurity = false,
},
};
Task.Run(async () => await Electron.WindowManager.CreateWindowAsync(
browserWindowOptions,
$"http://localhost:{BridgeSettings.WebPort}/Language/SetCultureByConfig"
));
Finally, as workaround, I'm sending the images as data base64 in img src's attribute, but it looks like a dirty approach.
My Question:
My question is, how can I show on electron window picture files from local storage.
Some irrelevant info:
The open source line where I need assistance.
There are several ways to go about this, so I will try to cover the most relevant use cases. Some of this depends on the context of your project.
Access to local files behave as cross origin requests by default. You could try using the crossorigin=anonymous attribute on your image tag, but doesn't work because your local file system will not be responding with cross origin headers.
Disabling the webSecurity option is a workaround, but is not recommended for security reasons, and will not usually work correctly anyway if your html is not also loaded from the local file system.
Disabling webSecurity will disable the same-origin policy and set allowRunningInsecureContent property to true. In other words, it allows the execution of insecure code from different domains.
https://www.electronjs.org/docs/tutorial/security#5-do-not-disable-websecurity
Here are some methods of working around this issue:
1 - Use the HTML5 File API to load local file resources and provide the ArrayBuffer to ImageData to write the image to a <canvas> .
function loadAsUrl(theFile) {
var reader = new FileReader();
var putCanvas = function(canvas_id) {
return function(loadedEvent) {
var buffer = new Uint8ClampedArray(loadedEvent.target.result);
document.getElementById(canvas_id)
.getContext('2d')
.putImageData(new ImageData(buffer, width, height), 0, 0);
}
}
reader.onload = putCanvas("canvas_id");
reader.readAsArrayBuffer(theFile);
}
1.b - It is also possible to load a file as a data URL. A data URL can be set as source (src) on img elements with JavaScript. Here is a JavaScript function named loadAsUrl() that shows how to load a file as a data URL using the HTML5 file API:
function loadAsUrl(theFile) {
var reader = new FileReader();
reader.onload = function(loadedEvent) {
var image = document.getElementById("theImage");
image.setAttribute("src", loadedEvent.target.result);
}
reader.readAsDataURL(theFile);
}
2 - Use the Node API fs to read the file, and convert it into a base64 encoded data url to embed in the image tag.
Hack - Alternatively you can try loading the image in a BrowserView or <webview>. The former overlays the content of your BrowserWindow while the latter is embedded into the content.
// In the main process.
const { BrowserView, BrowserWindow } = require('electron')
const win = new BrowserWindow({ width: 800, height: 600 })
const view = new BrowserView()
win.setBrowserView(view)
view.setBounds({ x: 0, y: 0, width: 300, height: 300 })
view.webContents.loadURL('file:///home/dani/pictures/someimage.jpg')

How can I render multiple URL's into a single PDF

I'm attempting to open a series of URL's to render the output, then combine into a single PDF using PhantomJS, but I cannot find any documentation on how to do this. I'm just using trial and error, but not getting anywhere - hoping somebody knows how to do this.
I'm not completely set on PhantomJS, so if you know of a better command line, node or JAVA tool that would be better, I'm all ears (or eyes in this case).
Here is the code I have that renders a single page. I've tried replicating the open/render, but it always overwrites the PDF instead of appending to it.
var page = require('webpage').create(),
system = require('system'),
fs = require('fs'),
pages = {
page1: 'http://localhost/test1.html',
page2: 'http://localhost/test2.html'
};
page.paperSize = {
format: 'A4',
orientation: 'portrait',
};
page.settings.dpi = "96";
// this renders a single page and overwrites the existing PDF or creates a new one
page.open('pages.page1', function() {
setTimeout(function() {
page.render('capture.pdf');
phantom.exit();
}, 5000);
});
PhantomJS renders one web page into one PDF file, so if you can merge several URLs into one html file you could open it in PhantomJS and make a PDF.
But it would be simpler to make several PDFs and then merge them into one with something like pdfkt at the end of the script, launching merge command from PhantomJS child module

download csv (or other non html data) with phantomjs

How can I access simple csv data?
var webpage = require('webpage');
var csvPage = webpage.create();
var csvUrl= "http://www.scoach.ch/arcmsdownload/023c5c5aa58e6e0ff963ddcdea5ac016/CONTENT.csv/derivatives_2013-05-24.csv";
csvPage.open(csvUrl, function(status){
console.log("csv: " + csvPage.content);
});
This will give me just an empty html: which is not the expected result :-) I have tried several callbacks, but nothing helped me.
Thanks for your Help!
First, I'll just quickly point out that PhantomJS is overkill for this job. Use wget, curl, PHP file_get_contents, etc. However, I'm assuming this is part of a more complicated PhantomJS script, and you have a good reason.
I can only half answer your question, by showing you how to see the missing error messages:
var webpage = require('webpage');
var csvPage = webpage.create();
var csvUrl= "http://www.scoach.ch/arcmsdownload/023c5c5aa58e6e0ff963ddcdea5ac016/CONTENT.csv/derivatives_2013-05-24.csv";
csvPage.open(csvUrl, function(status){
console.log("status="+status);
console.log("csv: " + csvPage.plainText);
phantom.exit();
});
I made these changes:
Show the status (it is "fail")
Change to use plainText instead of content. (The latter wraps your content in html tags, which you don't want for csv).
Add phantom.exit(), just so it doesn't sit there at the end.
I don't know why the status is "fail", when I can get the file fine with wget. The next troubleshooting step is to add these two lines before calling csvPage.open:
csvPage.onResourceRequested = function (request) {
console.log('Request ' + JSON.stringify(request, undefined, 4));
};
csvPage.onResourceReceived = function (response) {
console.log('Receive ' + JSON.stringify(response, undefined, 4));
};
It is returning immediately, with 3878 bytes, even though I see a Content-Length header of 6,335,428. This might be a PhantomJS bug/limitation with either chunked encoding or very large files.
UPDATE: Another idea, for a short-term solution, is to call wget or curl from inside your PhantomJS script, using the new spawn or execFile commands: http://code.google.com/p/phantomjs/source/browse/examples/child_process-examples.js
This SO post might help.
Also note that PhantomJS is a separate web server from NodeJS, so using csv node libraries isn't an option.

How to precompile processingjs sketch to speed load times?

The loading times of my processingjs webpage are getting pretty hairy. How can I precache the compilation of processing to javascript?
It would be acceptable for my application to compile on first entering the webpage (maybe keeping the result in the local store?) and then reuse the compilation on subsequent loads.
There's two ways to drive down load time as experienced by the user. The first is using precompiled sketches, which is relatively easy: github repo, or even just download the master branch using github's download button (https://github.com/processing-js/processing-js), and then look for the "./tools/processing-helper.html" file. This is a helper page that lets you run or compile sketches to the JavaScript source that Processing.js uses. You will still need to run this alongside Processing, since it ties into the API provided, but you can use the "API only" version for that. Take the code it generates, prepend "var mySketch = ", and then do this on your page:
<script src="processing.api.js"></script>
<script>
function whenImGoodAndReady() {
var mySketch = (function.....) // generated by Processing.js helper
var myCanvas = document.getElementById('mycanvas');
new Processing(myCanvas, mySketch);
}
</script>
Just make sure to call the load function when, as the name implies, you're ready to do so =)
The other is to do late-loading, if you have any sketches that are initially off-screen.
There's a "lazy loading" extension in the full download for Processing.js - you can include that on your page, and it will make sketches load only once they're in view. That way you don't bog down the entire page load.
Alternatively, you can write a background loader that does the same thing as the lazy loading extension: turn off Processing.init, and instead gather all the script/canvas elements that represent Processing sketches, then loading them on a timeout using something like
var sketchList = [];
function findSketches() {
/* find all script/canvas elements */
for(every of these elements) {
sketchList.append({
canvas: <a canvas element>,
sourceCode: <the sketch code>
});
}
// kickstart slowloading
slowLoad();
}
function slowLoad() {
if(sketchList.length>0) {
var sketchData = sketchList.splice(0,1);
try {
new Processing(sketchData.canvas, sketchData.sourceCode);
setTimeout(slowLoad, 15000); // load next sketch in 15 seconds
} catch (e) { console.log(e); }
}
}
This will keep slow-loading your sketches until it's run out.

How do I get data from a background page to the content script in google chrome extensions

I've been trying to send data from my background page to a content script in my chrome extension. i can't seem to get it to work. I've read a few posts online but they're not really clear and seem quite high level. I've got managed to get the oauth working using the Oauth contacts example on the Chrome samples. The authentication works, i can get the data and display it in an html page by opening a new tab.
I want to send this data to a content script.
i'm having a lot of trouble with this and would really appreciate if someone could outline the explicit steps you need to follow to send data from a bg page to a content script or even better some code. Any takers?
the code for my background page is below (i've excluded the oauth paramaeters and other )
` function onContacts(text, xhr) {
contacts = [];
var data = JSON.parse(text);
var realdata = data.contacts;
for (var i = 0, person; person = realdata.person[i]; i++) {
var contact = {
'name' : person['name'],
'emails' : person['email']
};
contacts.push(contact); //this array "contacts" is read by the
contacts.html page when opened in a new tab
}
chrome.tabs.create({ 'url' : 'contacts.html'}); sending data to new tab
//chrome.tabs.executeScript(null,{file: "contentscript.js"});
may be this may work?
};
function getContacts() {
oauth.authorize(function() {
console.log("on authorize");
setIcon();
var url = "http://mydataurl/";
oauth.sendSignedRequest(url, onContacts);
});
};
chrome.browserAction.onClicked.addListener(getContacts);`
As i'm not quite sure how to get the data into the content script i wont bother posting the multiple versions of my failed content scripts. if I could just get a sample on how to request the "contacts" array from my content script, and how to send the data from the bg page, that would be great!
You have two options getting the data into the content script:
Using Tab API:
http://code.google.com/chrome/extensions/tabs.html#method-executeScript
Using Messaging:
http://code.google.com/chrome/extensions/messaging.html
Using Tab API
I usually use this approach when my extension will just be used once in a while, for example, setting the image as my desktop wallpaper. People don't set a wallpaper every second, or every minute. They usually do it once a week or even day. So I just inject a content script to that page. It is pretty easy to do so, you can either do it by file or code as explained in the documentation:
chrome.tabs.executeScript(tab.id, {file: 'inject_this.js'}, function() {
console.log('Successfully injected script into the page');
});
Using Messaging
If you are constantly need information from your websites, it would be better to use messaging. There are two types of messaging, Long-lived and Single-requests. Your content script (that you define in the manifest) can listen for extension requests:
chrome.extension.onRequest.addListener(function(request, sender, sendResponse) {
if (request.method == 'ping')
sendResponse({ data: 'pong' });
else
sendResponse({});
});
And your background page could send a message to that content script through messaging. As shown below, it will get the currently selected tab and send a request to that page.
chrome.tabs.getSelected(null, function(tab) {
chrome.tabs.sendRequest(tab.id, {method: 'ping'}, function(response) {
console.log(response.data);
});
});
Depends on your extension which method to use. I have used both. For an extension that will be used like every second, every time, I use Messaging (Long-Lived). For an extension that will not be used every time, then you don't need the content script in every single page, you can just use the Tab API executeScript because it will just inject a content script whenever you need to.
Hope that helps! Do a search on Stackoverflow, there are many answers to content scripts and background pages.
To follow on Mohamed's point.
If you want to pass data from the background script to the content script at initialisation, you can generate another simple script that contains only JSON and execute it beforehand.
Is that what you are looking for?
Otherwise, you will need to use the message passing interface
In the background page:
// Subscribe to onVisited event, so that injectSite() is called once at every pageload.
chrome.history.onVisited.addListener(injectSite);
function injectSite(data) {
// get custom configuration for this URL in the background page.
var site_conf = getSiteConfiguration(data.url);
if (site_conf)
{
chrome.tabs.executeScript({ code: 'PARAMS = ' + JSON.stringify(site_conf) + ';' });
chrome.tabs.executeScript({ file: 'site_injection.js' });
}
}
In the content script page (site_injection.js)
// read config directly from background
console.log(PARAM.whatever);
I thought I'd update this answer for current and future readers.
According to the Chrome API, chrome.extension.onRequest is "[d]eprecated since Chrome 33. Please use runtime.onMessage."
See this tutorial from the Chrome API for code examples on the messaging API.
Also, there are similar (newer) SO posts, such as this one, which are more relevant for the time being.