Download file depending on mimetype in casperjs - phantomjs

In a web scraping exercise, I need to click on links, let the content render if it is HTML, and download it otherwise. How do I accomplish this with CasperJS or some other tool on top of PhantomJS/SlimerJS?
As I understand it, PhantomJS/SlimerJS lack the APIs to support downloads. CasperJS has a download API, but I cannot see how to examine the MIME type so that HTML is rendered while other content is downloaded.

In both PhantomJS and SlimerJS you can register a listener to each received response:
page.onResourceReceived = function(response) {
    // ...
};
However, only in SlimerJS is response.body defined. By using this you can save the file. There is a full example in this blog post. (As that example shows, you must set page.captureContent to cover the files you want data for.)
There is no way to do this in PhantomJS 1.9.x (and I believe PhantomJS 2.x still has the same problem, but I have not personally confirmed this yet).
The other part of your question was about deciding what to save based on mime type. The full list of available fields shows you can use response.contentType.
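Putting those two pieces together, a minimal SlimerJS sketch might look like the following. The shouldDownload helper and the download_ file-name prefix are my own illustrative choices, not part of the API; the policy is kept in a pure function so it is easy to test on its own.

```javascript
// Decide whether a response should be saved to disk rather than rendered.
// Treat anything that is not HTML as a download candidate.
function shouldDownload(contentType) {
  if (!contentType) return false;
  var renderable = ['text/html', 'application/xhtml+xml'];
  // Strip parameters like "; charset=utf-8" before comparing.
  var base = contentType.split(';')[0].trim().toLowerCase();
  return renderable.indexOf(base) === -1;
}

// SlimerJS-only wiring (response.body stays undefined in PhantomJS):
// var page = require('webpage').create();
// var fs = require('fs');
// page.captureContent = [/.*/];  // capture response bodies for all MIME types
// page.onResourceReceived = function (response) {
//   if (response.stage === 'end' && shouldDownload(response.contentType)) {
//     fs.write('download_' + response.id, response.body, 'wb');
//   }
// };
// page.open(url);
```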

Related

Web APIs FileReader() hide download option

I've successfully integrated FileReader(), which renders a file from a Blob so the user can view and interact with it; however, the revised criteria state that the user shouldn't be allowed to download the document now.
The requirement is that the download icon is removed from the FileReader() but I can't seem to find a way of doing this, as it's baked into the actual Web API.
I started to write my own PDF viewer using a basic Vue to PDF package and adding custom controls but this is a bit of a monster and I'd like to avoid a complete re-write to remove one action.
Is there any way of removing the download CTA before it renders the PDF?
More context:
The PDF is rendered in the DOM from a Blob that's passed via an endpoint I hook into with Axios. I then readAsDataURL(blob) and finally turn the FileReader() result into a URL.createObjectURL(blob), which gives me the data I render as the canvas src to enable the PDF viewer. Unfortunately this can't be a PNG, as it needs multiple pages. The issue is that these are sensitive docs that may only be viewed on the portal, so the aim is to prevent users from easily downloading them (aware they could just print-screen).

How to open a page with Phantomjs without running js or making subsequent requests?

Is there a way to just load the server generated HTML (without any js or images)?
The docs seem a little sparse
The strength of PhantomJS lies exactly in its ability to emulate a real browser, which opens a page and makes all the subsequent requests. If you just want the HTML, curl or wget may be a better fit.
Nevertheless, there is a way to avoid running JavaScript or loading images: set the corresponding page settings (see http://phantomjs.org/api/webpage/property/settings.html):
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;
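As a minimal sketch of a complete script, with the two settings wrapped in a small helper so the assignments can be tested in isolation (the applyStaticFetchSettings name and the url placeholder are mine, not PhantomJS API):

```javascript
// Apply the settings that turn PhantomJS into a plain HTML fetcher.
// Note: settings must be set *before* page.open(), or they have no effect.
function applyStaticFetchSettings(page) {
  page.settings.javascriptEnabled = false; // don't execute the page's JS
  page.settings.loadImages = false;        // don't request images
  return page;
}

// PhantomJS usage (url is a placeholder):
// var page = applyStaticFetchSettings(require('webpage').create());
// page.open(url, function (status) {
//   if (status === 'success') {
//     console.log(page.content);  // the HTML as served, without JS side effects
//   }
//   phantom.exit();
// });
```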

Phantomjs equivalent of browser's "Save Page As... Webpage, complete"

For my application I need to programmatically save a copy of a webpage HTML along with the images and resources needed to render it. Browsers have this functionality in their Save page as... Webpage, complete options.
It is of course easy to save the rendered HTML of a page using phantomjs or casperjs. However, I have not seen any examples of combining this with downloading the associated images, and doing the needed DOM changes to use the downloaded images.
Given that this functionality exists in webkit-based browsers (Chrome, Safari) I'm surprised it isn't in phantomjs -- or perhaps I just haven't found it!
As an alternative to PhantomJS, you can use CasperJS to achieve the required result. CasperJS is a framework based on PhantomJS, with a variety of modules and classes that support and complement it.
An example of a script that you can use is:
casper.test.begin('test script', 0, function(test) {
    casper.start(url);
    casper.then(function() {
        // ...
    });
    casper.run(function() {
        // ...
        test.done();
    });
});
With this script, within a "step", you can perform your downloads, be it a single image, a document, the whole page, a screenshot, or whatever you need.
Have a look at the download, getPageContent, and capture / captureSelector methods in this link.
I hope these pointers can help you to go further!
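As a rough sketch of how those methods could combine into a "save page, complete" step: download each image, then rewrite the saved HTML so the img tags point at the local copies. The localizeImageSrcs helper, the saveDir variable, and the img_N file names are illustrative assumptions, not CasperJS API; the rewrite is a pure string transform so it can be tested on its own.

```javascript
// Rewrite img src attributes in saved HTML so they point at local copies.
// urlToLocal maps original URLs to the filenames they were downloaded to;
// URLs without a mapping are left untouched.
function localizeImageSrcs(html, urlToLocal) {
  return html.replace(/(<img\b[^>]*\bsrc=")([^"]+)(")/g, function (m, pre, url, post) {
    return pre + (urlToLocal[url] || url) + post;
  });
}

// CasperJS wiring (saveDir is a placeholder):
// casper.then(function () {
//   var urls = this.getElementsAttribute('img', 'src');
//   var map = {};
//   urls.forEach(function (u, i) {
//     var local = 'img_' + i;
//     this.download(u, saveDir + local);  // fetch each image to disk
//     map[u] = local;
//   }, this);
//   require('fs').write(saveDir + 'page.html',
//     localizeImageSrcs(this.getPageContent(), map), 'w');
// });
```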

phantomjs: how to generate a perfect html screenshot?

When I use Ctrl+S to save a page completely in my browser and then open it, it more or less resembles the original website.
However, when I render a site in PhantomJS and look at the generated HTML, it looks very different from the screenshot it produces.
How does PhantomJS produce an accurate image screenshot but not an accurate HTML snapshot? For example, when I take a screenshot of futureshop.ca and look at the HTML generated by PhantomJS, it looks like a completely different website. How do we resolve this?
For example, take this CGI proxy:
https://ultraproxy.us/perl/nph-proxy.cgi/en/00/http/www.bestbuy.ca/ (uses perl cgi)
vs.
http://prerender.herokuapp.com/http://www.bestbuy.ca (using phantomjs)
How does a Perl CGI script produce a more accurate-looking page? Is there a way to get PhantomJS to do the same? PhantomJS has the advantage of handling AJAX-loaded content, while the Perl CGI wouldn't.

Checking the contents of an embed tag using Selenium

We generate a pdf doc via a call to a web service that returns the path to the generated doc.
We use an embed html tag to display the pdf inline, i.e.
<div id="ctl00_ContentPlaceHolder2_ctl01_embedArea">
<embed wmode="transparent" src="http://www.company.com/vdir/folder/Pdfs/file.pdf" width="710" height="400"/>
</div>
I'd like to use selenium to check that the pdf is actually being displayed and if possible save the path, i.e. the src link into a variable.
Anyone know how to do this? Ideally we'd like to be able to then compare this pdf to a reference one but that's a question for another day.
As far as inspecting the pdf from selenium, you're more or less out of luck. The embed tag just drops a plugin into the page, and because a plugin isn't well represented in the DOM, Selenium can't get a very good handle on it.
However, if you're using Selenium RC you may want to consider getting the src of the embed element, then requesting that URL directly and evaluating the resulting PDF in code. Assuming your embed element looks like this: <embed id="embedded" src="http://example.com/static/pdf123.pdf" />, you can try something like this:
String pdfSrc = selenium.getAttribute("embedded@src");
Then make a web request to the pdfSrc URL and (somehow) validate that it's the one you want. It may be enough to just check that it's not a 404.
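One way to sketch that validation, here in Node rather than Java (the looksLikePdf helper is my own addition; the "%PDF-" prefix is the standard PDF magic-bytes signature):

```javascript
// Cheap sanity checks on the fetched resource: a real PDF response is
// not a 404 and its body starts with the "%PDF-" magic bytes.
function looksLikePdf(statusCode, body) {
  return statusCode === 200 &&
         typeof body === 'string' &&
         body.slice(0, 5) === '%PDF-';
}

// Usage sketch with Node's https module (pdfSrc would come from the
// getAttribute call above):
// var https = require('https');
// https.get(pdfSrc, function (res) {
//   var chunks = [];
//   res.on('data', function (c) { chunks.push(c); });
//   res.on('end', function () {
//     var body = Buffer.concat(chunks).toString('latin1');
//     console.log(looksLikePdf(res.statusCode, body));
//   });
// });
```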