How can I make PhantomJS skip downloading certain kinds of resources? - phantomjs

PhantomJS has a loadImages setting,
but I want more control.
How can I make PhantomJS skip downloading certain kinds of resources,
such as CSS?
=====
Good news:
this feature has been added.
https://code.google.com/p/phantomjs/issues/detail?id=230
The gist:
page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

UPDATE (working):
Since PhantomJS 1.9, the existing answer no longer works. Use this code instead:
var webPage = require('webpage');
var page = webPage.create();
page.onResourceRequested = function(requestData, networkRequest) {
    // Escape the dot so it matches a literal "." rather than any character
    var match = requestData.url.match(/wordfamily\.js/g);
    if (match != null) {
        console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
        networkRequest.cancel(); // or .abort()
    }
};
If you use abort() instead of cancel(), it will trigger onResourceError.
See the PhantomJS docs for details.

Finally, you can try this: http://github.com/eugenehp/node-crawler
Otherwise, you can still try the approaches below with PhantomJS.
The easy way is to load the page -> parse the page -> exclude the unwanted resources -> load the result into PhantomJS.
Another way is simply to block the hosts in the firewall.
Optionally, you can use a proxy to block certain URLs and queries to them.
One more option is to load the page and then remove the unwanted resources, but I don't think that's the right approach here.
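The "load, parse, exclude, then load into PhantomJS" approach could look roughly like the sketch below. It is only an illustration under stated assumptions: `stripStylesheetLinks` is a hypothetical helper that filters markup with a regular expression (not a real HTML parser), the URL and markup are placeholders, and the filtered markup is handed to PhantomJS with `page.setContent(html, url)`. How the raw HTML is fetched is left out.

```javascript
// Hypothetical helper: remove <link rel="stylesheet" ...> tags from markup.
// A regex is only a rough filter; unusual markup may slip through.
function stripStylesheetLinks(html) {
    return String(html).replace(
        /<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi,
        ''
    );
}

// Wire it up only when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var baseUrl = 'http://example.com/'; // placeholder URL (assumption)
    var rawHtml = '<html><head><link rel="stylesheet" href="a.css"></head>' +
                  '<body>hi</body></html>'; // placeholder markup (assumption)
    // Load the filtered markup instead of letting PhantomJS fetch the page itself.
    page.setContent(stripStylesheetLinks(rawHtml), baseUrl);
    phantom.exit();
}
```

The same helper could be extended to other tags (for example `<script>`), but for anything beyond simple pages the onResourceRequested approach is likely more reliable than editing markup.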

Use page.onResourceRequested, as in example loadurlwithoutcss.js:
page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) ||
            requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
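To skip more than just CSS, the URL check can be pulled out into a small function that takes a list of extensions. The following is a minimal sketch, not part of PhantomJS itself: `isBlockedResource` and the `BLOCKED` list are hypothetical names chosen for illustration, and the check looks only at the URL, so resources served without a file extension are not caught.

```javascript
// Hypothetical helper: decide from the URL alone whether to skip a resource.
function isBlockedResource(url, blockedExtensions) {
    // Drop the query string and fragment so "style.css?v=3" still matches.
    var path = String(url).split(/[?#]/)[0];
    var match = /\.([a-z0-9]+)$/i.exec(path);
    if (!match) {
        return false;
    }
    return blockedExtensions.indexOf(match[1].toLowerCase()) !== -1;
}

// Example list of extensions to skip (an assumption; adjust as needed).
var BLOCKED = ['css', 'woff', 'ttf'];

// Only install the handler when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, networkRequest) {
        if (isBlockedResource(requestData.url, BLOCKED)) {
            console.log('Skipping: ' + requestData.url);
            networkRequest.abort();
        }
    };
}
```

Keeping the decision in a plain function makes it easy to check outside PhantomJS before relying on it in a crawl.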

Not for now (PhantomJS 1.7); it does NOT support that.
But a nasty workaround is to use an HTTP proxy, so you can filter out the requests that you don't need.

Related

SyncXHR in Page Dismissal Alternative

Since Google announced that it will disallow sync XHR in page dismissal, I haven't found a decent replacement for this feature. I've tried sendBeacon, but the 64KB payload limit makes it useless for my use case. For now, I've worked around it by setting the Chromium flag directly (#allow-sync-xhr-in-page-dismissal). But this is clearly not a final solution: it's not user-friendly to force users to tweak their own browser in order to use our app.
Is there an alternative to sync XHR in page dismissal?
var xhr;
function saveChanges() {
    xhr = new XMLHttpRequest();
    xhr.open('POST', url, false);
    xhr.send(post);
}
window.addEventListener('beforeunload', (event) => {
    saveChanges();
    if (xhr.readyState == 4) return;
    event.preventDefault();
    event.returnValue = '';
});
Credit to : https://groups.google.com/a/chromium.org/g/blink-dev/c/LnqwTCiT9Gs/m/wM0yAjcfAAAJ

Firefox setResponseHeader isn't working

I'm working on a web application where I need to access elements of an iframe using JavaScript. To do that, the iframe has to send an "Access-Control-Allow-Origin: *" header to the browser.
Unfortunately this doesn't happen, that's why I'm using an extension to modify the response headers, but for some reason, setResponseHeader doesn't work.
It gets even more confusing since I'm using setResponseHeader to strip X-Frame-Options, but when I'm setting a custom header, it just won't work.
I'm using Firefox's "Inspect Element"'s Network tab to observe the requests, and while it shows the request header being set correctly, it doesn't show the response header.
This is how I'm setting the request and response headers:
var chrome = require("chrome");
chrome.Cc["@mozilla.org/observer-service;1"].getService( chrome.Ci.nsIObserverService ).addObserver({
    observe : function(subject, topic, data) {
        var channel = subject.QueryInterface( chrome.Ci.nsIHttpChannel );
        channel.setRequestHeader("x-mysite-extended", "somedata", false);
    }
},"http-on-modify-request",false);
chrome.Cc["@mozilla.org/observer-service;1"].getService( chrome.Ci.nsIObserverService ).addObserver({
    observe : function(subject, topic, data) {
        var channel = subject.QueryInterface( chrome.Ci.nsIHttpChannel );
        channel.setResponseHeader("x-mysite-extended", "somedata", false);
    }
},"http-on-examine-response",false);
Again, the request header works according to the Network tab. I tried http-on-modify-request to set the response header, but that didn't work either.
This is how I'm stripping the X-Frame-Options header, which works:
let myListener =
{
    observe : function (aSubject, aTopic, aData)
    {
        console.log(aTopic);
        if (aTopic == "http-on-examine-response")
        {
            let channel = aSubject.QueryInterface(Ci.nsIHttpChannel);
            try
            { // getResponseHeader will throw if the header isn't set
                let hasXFO = channel.getResponseHeader('X-Frame-Options');
                if (hasXFO)
                {
                    // Header found, disable it
                    channel.setResponseHeader('X-Frame-Options', '', false);
                }
            }
            catch (e) {}
        }
    }
}
var observerService = Cc["@mozilla.org/observer-service;1"]
    .getService(Ci.nsIObserverService);
observerService.addObserver(myListener, "http-on-examine-response", false);
I've been trying to solve this for two hours now so any help is appreciated. Thanks.
You're adding an observer for http-on-examine-response; with this you can only getResponseHeader.
Change it to http-on-modify-request. Then you can setRequestHeader, though you can't getResponseHeader in http-on-modify-request.
This is scrap code but it worked for me:
observe : function(aSubject, aTopic, aData) {
// Make sure it is our connection first.
if (aSubject == channel) {
//this is our channel
//alert('is my mine');
cdxFire.myChannel = aSubject.QueryInterface(Components.interfaces.nsIHttpChannel);
if (cdxFire.myChannel.requestMethod == 'GET') {
//alert('its a get so need to removeObserver now');
//cdxFire.observerService.removeObserver(modHeaderListener, "http-on-modify-request");
}
if (aTopic == 'http-on-modify-request' && cdxFire.myChannel.requestMethod == 'POST') {
//can set headers here including cookie
try {
var xContentLength = httpChannel.getRequestHeader('Content-Length');
var xContentType = httpChannel.getRequestHeader('Content-Type');
//alert('content length is there so change it up');
cdxFire.myChannel.setRequestHeader('Content-Type','',false);
cdxFire.myChannel.setRequestHeader('Content-Type',xContentType,false);
cdxFire.myChannel.setRequestHeader('Content-Length','',false);
cdxFire.myChannel.setRequestHeader('Content-Length',xContentLength,false);

WL.download with multiple files (OneDrive API)

I'm trying to implement a OneDrive picker. The user can select files and then, when saving, I can get these files and download them.
I followed the OneDrive API documentation and ended up with this:
WL.init({ client_id: clientId, redirect_uri: redirectUri });
WL.login({ "scope": "wl.skydrive wl.signin" }).then(
    function(response) {
        openFromSkyDrive();
    },
    function(response) {
        log("Failed to authenticate.");
    }
);
function openFromSkyDrive() {
    WL.fileDialog({
        mode: 'open',
        select: 'single'
    }).then(
        function(response) {
            log("The following file is being downloaded:");
            log("");
            var files = response.data.files;
            for (var i = 0; i < files.length; i++) {
                var file = files[i];
                log(file.name);
                WL.download({ "path": file.id + "/content" });
            }
        },
        function(errorResponse) {
            log("WL.fileDialog errorResponse = " + JSON.stringify(errorResponse));
        }
    );
}
function log(message) {
    var child = document.createTextNode(message);
    var parent = document.getElementById('JsOutputDiv') || document.body;
    parent.appendChild(child);
    parent.appendChild(document.createElement("br"));
}
In the select option, you can set 'single' or 'multi' to let the user select one or more files from the picker.
But when I set 'multi', the WL.download method only works for the last file.
Thanks for your help!
PS: I didn't find a real solution on Stack Overflow or any other forum.
This is a quirk of the WL.download() function. It creates a hidden iframe to execute the download, but it uses the same iframe for all the downloads it does. So if you queue up two downloads in quick succession, it will navigate the iframe twice and you'll only end up actually downloading the last file. WL.download() does not expose when a download is complete, so you can't simply wait for one to finish before starting the next.
Unfortunately, the code sample is a bit misleading, putting the WL.download() calls in a for-loop. We've taken note of these issues.
In the meantime, to unblock yourself, you can get the download URL from the 'file.source' property and initiate the download yourself.
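One way to act on that suggestion is sketched below. It is an illustration only: `collectDownloadUrls` and `startDownloads` are hypothetical helpers, the code assumes each picked file carries a `source` URL as described above, and whether navigating an iframe to that URL actually triggers a save dialog depends on the headers the server sends.

```javascript
// Collect the direct download URL of every picked file.
// Entries without a usable `source` string are skipped.
function collectDownloadUrls(files) {
    return files
        .map(function (file) { return file.source; })
        .filter(function (url) { return typeof url === 'string' && url.length > 0; });
}

// Start one download per URL, each in its own hidden iframe, so that a
// later download does not navigate away from an earlier one.
// `doc` is the document to attach the iframes to (normally `document`).
function startDownloads(urls, doc) {
    return urls.map(function (url) {
        var frame = doc.createElement('iframe');
        frame.style.display = 'none';
        frame.src = url;
        doc.body.appendChild(frame);
        return frame;
    });
}
```

Inside the WL.fileDialog success callback this would be called as `startDownloads(collectDownloadUrls(response.data.files), document);` in place of the WL.download() loop.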

download csv (or other non html data) with phantomjs

How can I access simple csv data?
var webpage = require('webpage');
var csvPage = webpage.create();
var csvUrl = "http://www.scoach.ch/arcmsdownload/023c5c5aa58e6e0ff963ddcdea5ac016/CONTENT.csv/derivatives_2013-05-24.csv";
csvPage.open(csvUrl, function(status) {
    console.log("csv: " + csvPage.content);
});
This gives me just an empty HTML document, which is not the expected result :-) I have tried several callbacks, but nothing helped.
Thanks for your Help!
First, I'll just quickly point out that PhantomJS is overkill for this job. Use wget, curl, PHP file_get_contents, etc. However, I'm assuming this is part of a more complicated PhantomJS script, and you have a good reason.
I can only half answer your question, by showing you how to see the missing error messages:
var webpage = require('webpage');
var csvPage = webpage.create();
var csvUrl = "http://www.scoach.ch/arcmsdownload/023c5c5aa58e6e0ff963ddcdea5ac016/CONTENT.csv/derivatives_2013-05-24.csv";
csvPage.open(csvUrl, function(status) {
    console.log("status=" + status);
    console.log("csv: " + csvPage.plainText);
    phantom.exit();
});
I made these changes:
Show the status (it is "fail")
Change to use plainText instead of content. (The latter wraps your content in html tags, which you don't want for csv).
Add phantom.exit(), just so it doesn't sit there at the end.
I don't know why the status is "fail", when I can get the file fine with wget. The next troubleshooting step is to add these two lines before calling csvPage.open:
csvPage.onResourceRequested = function (request) {
    console.log('Request ' + JSON.stringify(request, undefined, 4));
};
csvPage.onResourceReceived = function (response) {
    console.log('Receive ' + JSON.stringify(response, undefined, 4));
};
It is returning immediately, with 3878 bytes, even though I see a Content-Length header of 6,335,428. This might be a PhantomJS bug/limitation with either chunked encoding or very large files.
UPDATE: Another idea, for a short-term solution, is to call wget or curl from inside your PhantomJS script, using the new spawn or execFile commands: http://code.google.com/p/phantomjs/source/browse/examples/child_process-examples.js
This SO post might help.
Also note that PhantomJS is a separate runtime from Node.js (it is not a Node.js module), so using Node CSV libraries isn't an option.
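Without a CSV library, simple data can still be split by hand once the text is available (for example from plainText or from the output of curl). The sketch below is a rough illustration: `parseSimpleCsv` is a hypothetical helper, the delimiter is a parameter because the format of the file in question is not known here, and it assumes there are no quoted fields that contain the delimiter or line breaks.

```javascript
// Hypothetical helper: split plain CSV text into an array of rows,
// where each row is an array of field strings.
// Assumes no quoted fields containing the delimiter or newlines.
function parseSimpleCsv(text, delimiter) {
    var sep = delimiter || ',';
    return String(text)
        .split(/\r?\n/)                                       // handle LF and CRLF
        .filter(function (line) { return line.length > 0; })  // drop empty lines
        .map(function (line) { return line.split(sep); });
}
```

In the script above this could be called as `parseSimpleCsv(csvPage.plainText, ';')` once the download itself succeeds.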

Grab the resource contents in CasperJS or PhantomJS

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.
I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?
This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment so I found out where phantomjs's QTNetworking core put the resources when it caches them.
Long story short, here is my gist. You need the cache.js and mimetype.js files:
https://gist.github.com/bshamric/4717583
//for this to work, you have to call phantomjs with the cache enabled:
//usage: phantomjs --disk-cache=true test.js
var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');
//this is the path that the QTNetwork classes use for caching files for their http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';
var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };
//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
    //I only cache images, but you can change this
    if (response.contentType.indexOf('image') >= 0) {
        cache.includeResource(response);
    }
};
//when the page is done loading, go through each cachedResource and do something with it,
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    //declare the loop variable so it doesn't leak into the global scope
    for (var index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("." + cache.cacheExtension, "." + ext);
        fs.write('saved/' + finalFile, cache.cachedResources[index].getContents(), 'b');
    }
};
page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});
Then when you call phantomjs, just make sure the cache is enabled:
phantomjs --disk-cache=true test.js
Some notes:
I wrote this for the purpose of getting the images on a page without using the proxy or taking a low res snapshot. QT uses compression on certain text file resources and you will have to deal with the decompression if you use this for text files. Also, I ran a quick test to pull in html resources and it didn't parse the http headers out of the result. But, this is useful to me, hopefully someone else will find it so, modify it if you have problems with a specific content type.
I've found that until the phantomjs matures a bit, according to the issue 158 http://code.google.com/p/phantomjs/issues/detail?id=158 this is a bit of a headache for them.
So you want to do it anyways? I've opted to go a bit higher to accomplish this and have grabbed PyMiProxy over at https://github.com/allfro/pymiproxy, downloaded, installed, set it up, took their example code and made this in proxy.py
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):
    def do_request(self, data):
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        #print '<< %s' % repr(data[:100])
        request_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        print "Content type: %s" %(headers['content-type'])
        if headers['content-type'] == 'text/x-comma-separated-values':
            f = open('data.csv', 'w')
            f.write(data)
        print ''
        return data

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()
Then I fire it up
python proxy.py
Next I execute phantomjs with the proxy specified...
phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js
You may want to turn your security on or such, it was needless for me currently as I'm scraping just one source. You should now see a bunch of text flowing through your proxy console and if it lands on something with the mime type of "text/x-comma-separated-values" it'll save it as data.csv. This will also save all the headers and everything, but if you've come this far I'm sure you can figure out how to pop those off.
One other detail, I've found that I've had to disable gzip encoding, I could use zlib and decompress data in gzip from my own apache webserver, but if it comes out of IIS or such the decompression will get errors and I'm not sure about that part of it.
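As noted above, the saved data.csv still starts with the status line and headers. A minimal sketch of popping them off is shown here in JavaScript, so it could run in the PhantomJS script after reading the file back. `splitHttpMessage` is a hypothetical helper, and it assumes the headers end at the first blank line (CRLF CRLF), as in an HTTP/1.x message that is not chunked or compressed.

```javascript
// Hypothetical helper: separate the header block from the body of a raw
// HTTP/1.x message. If no blank line is found, everything is treated as head.
function splitHttpMessage(raw) {
    var text = String(raw);
    var boundary = text.indexOf('\r\n\r\n');
    if (boundary === -1) {
        return { head: text, body: '' };
    }
    return {
        head: text.slice(0, boundary),
        body: text.slice(boundary + 4) // skip the CRLF CRLF separator
    };
}
```

The `body` part is then the CSV payload itself, while `head` keeps the status line and headers in case they are needed for debugging.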
So my power company won't offer me an API? Fine! We do it the hard way!
Did not realize I could grab the source from the document object like this:
casper.start(url, function() {
    var js = this.evaluate(function() {
        return document;
    });
    this.echo(js.all[0].outerHTML);
});
More info here.
You can use Casper.debugHTML() to print out contents of a HTML resource:
var casper = require('casper').create();
casper.start('http://google.com/', function() {
    this.debugHTML();
});
casper.run();
You can also store the HTML contents in a var using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in the latest master)