PhantomJS: how to generate a perfect HTML screenshot?

When I use Ctrl+S to save a complete page in my browser and then open the saved copy, it more or less resembles the original website.
However, when I render a site in PhantomJS and look at the generated HTML, it looks very different from the screenshot PhantomJS produces.
How does PhantomJS produce an accurate image screenshot but not an accurate HTML snapshot? For example, when I take a screenshot of futureshop.ca and then look at the HTML generated by PhantomJS, it looks like a completely different website. How do we resolve this?
For example, take this CGI proxy:
https://ultraproxy.us/perl/nph-proxy.cgi/en/00/http/www.bestbuy.ca/ (uses perl cgi)
vs.
http://prerender.herokuapp.com/http://www.bestbuy.ca (using phantomjs)
How does a Perl CGI script produce a more accurate-looking page? Is there a way to get PhantomJS to do the same? PhantomJS has the advantage of handling AJAX-loaded content, which the Perl CGI script would not.

Related

How do I render a PDF from HTML with working named anchors?

Is there a way for a bunch of named anchors in a large HTML document to be clickable within a PhantomJS-generated PDF file?
For example, say I have a table of contents or a list of FAQ questions. Clicking a question or title takes me to its answer within the same HTML file, which is great. But when the same HTML is rendered into a PDF, each named anchor becomes an absolute URL (i.e. http://example.com/render.html#anchor_1), so clicking it opens a browser at that URL instead of jumping to the content within the PDF file.
So, basically, is it possible (and how?) for markup like this - https://fiddle.jshell.net/jyjuaaog/ - to work within the generated PDF?
BTW, this works great when "printing as a PDF file" in Google Chrome, but the links end up broken when rendered in PhantomJS, so there must be something I'm missing that I can't seem to find in the docs.
Any ideas?
Thanks!
Apparently there's a bug in PhantomJS preventing this. As suggested by PhantomJsCloud, a quick-and-dirty workaround would be to replace the links with page links.

Is it possible to see the live rendering effect of PhantomJS?

PhantomJS is a very cool tool for taking screenshots. But since it doesn't show you how your HTML renders until you save the image, it's quite tough to adjust small details with it. Is there any way to make the process easier?
For example, is there a way to output the page as rendered by PhantomJS live to somewhere, so that I can see how it actually renders?
Thanks,

Download file depending on mimetype in casperjs

In a web scraping exercise, I need to click on links, let them render if the content is HTML, and download the content otherwise. How do I accomplish this with CasperJS or some other tool on top of PhantomJS/SlimerJS?
As I understand it, PhantomJS and SlimerJS lack the APIs to support downloads. CasperJS has a download API, but I can't see how to examine the mime type so that HTML is rendered while other content is downloaded.
In both PhantomJS and SlimerJS you can register a listener to each received response:
page.onResourceReceived = function(response) {
...
}
However, only in SlimerJS is response.body defined. By using this you can save the file. There is a full example in this blog post. (As that example shows, you must set page.captureContent to cover the files you want data for.)
There is no way to do this in PhantomJS 1.9.x (and I believe PhantomJS 2.x still has the same problem, but I have not personally confirmed this yet).
The other part of your question was about deciding what to save based on mime type. The full list of available fields shows you can use response.contentType.
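As a rough sketch of the mime-type decision, the helper below looks at response.contentType and reports whether a response should be left to render or be saved. The name decideAction and the set of content types treated as HTML are assumptions for illustration, not part of either tool's API; the hook wiring only runs where the phantom global exists.

```javascript
// Hypothetical helper: decide what to do with a response based on its MIME type.
// `response` is assumed to have the shape passed to page.onResourceReceived
// (fields used here: contentType, stage, url).
function decideAction(response) {
    // Each resource is reported at stage 'start' and again at 'end';
    // act only once, when the resource is complete.
    if (response.stage !== 'end') {
        return 'ignore';
    }
    // contentType may carry parameters, e.g. "text/html; charset=utf-8".
    var type = String(response.contentType || '').split(';')[0].trim().toLowerCase();
    if (type === 'text/html' || type === 'application/xhtml+xml') {
        return 'render';
    }
    return 'download';
}

// Wire the helper up only when running inside PhantomJS/SlimerJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceReceived = function (response) {
        console.log(decideAction(response) + ' ' + response.url);
    };
}
```

In SlimerJS, the 'download' branch is where response.body could be written to disk; in PhantomJS the same branch can only record the URL for fetching by other means.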

PhantomJS: view dependents such as .js, not just final html

In PhantomJS, is there a way to view the dependents? For example, a page causes a JS script to load. Instead of just viewing the final browser HTML result, I'd like to see the JS script.
What I want to see has Content-Type: application/json.
It is possible.
PhantomJS provides a whole list of page callbacks.
Hooking into onResourceRequested and onResourceReceived will give you the desired information every time the page tries to load anything.
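A minimal sketch of that hooking, assuming you only care about responses whose content type starts with a given prefix (application/json in the question). The makeResourceLog helper and its property names are hypothetical, and http://example.com/ is a placeholder URL; the page wiring runs only inside PhantomJS.

```javascript
// Hypothetical helper: remember each requested URL by id, then record the
// completed responses whose content type starts with the wanted prefix.
function makeResourceLog(wantedType) {
    var requested = {};   // request id -> URL
    var matches = [];     // completed responses of the wanted type
    return {
        onRequested: function (requestData) {
            requested[requestData.id] = requestData.url;
        },
        onReceived: function (response) {
            var type = String(response.contentType || '').toLowerCase();
            if (response.stage === 'end' && type.indexOf(wantedType) === 0) {
                matches.push({
                    url: requested[response.id] || response.url,
                    status: response.status
                });
            }
        },
        matches: matches
    };
}

// Attach the log to a page only when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var log = makeResourceLog('application/json');
    var page = require('webpage').create();
    page.onResourceRequested = log.onRequested;
    page.onResourceReceived = log.onReceived;
    page.open('http://example.com/', function () {
        console.log(JSON.stringify(log.matches, null, 2));
        phantom.exit();
    });
}
```

This yields the URLs and status codes of the dependents rather than their bodies.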

Checking the contents of an embed tag using Selenium

We generate a pdf doc via a call to a web service that returns the path to the generated doc.
We use an embed html tag to display the pdf inline, i.e.
<div id="ctl00_ContentPlaceHolder2_ctl01_embedArea">
  <embed wmode="transparent" src="http://www.company.com/vdir/folder/Pdfs/file.pdf" width="710" height="400"/>
</div>
I'd like to use selenium to check that the pdf is actually being displayed and if possible save the path, i.e. the src link into a variable.
Anyone know how to do this? Ideally we'd like to be able to then compare this pdf to a reference one but that's a question for another day.
As far as inspecting the pdf from selenium, you're more or less out of luck. The embed tag just drops a plugin into the page, and because a plugin isn't well represented in the DOM, Selenium can't get a very good handle on it.
However, if you're using Selenium-RC you may want to consider getting the src of the embed element, then requesting that URL directly and evaluating the resulting PDF in code. Assuming your embed element looks like this <embed id="embedded" src="http://example.com/static/pdf123.pdf" /> you can try something like this
String pdfSrc = selenium.getAttribute("embedded@src");
Then make a web request to the pdfSrc URL and (somehow) validate that it's the one you want. It may be enough to just check that it's not a 404.
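One way to make the validation step concrete is sketched below, in JavaScript rather than the Java of the Selenium-RC call. It assumes the URL has already been fetched with whatever HTTP client is at hand, and looksLikePdf is a hypothetical name. It checks the status code and the "%PDF-" signature that PDF files begin with; comparing against a reference document is out of scope.

```javascript
// Hypothetical check: given the HTTP status and the start of the response body
// (as a string), decide whether the URL served something that looks like a PDF.
function looksLikePdf(statusCode, bodyStart) {
    if (statusCode !== 200) {
        return false;   // e.g. a 404 means the document is not there
    }
    // PDF files start with the signature "%PDF-" followed by a version number.
    return String(bodyStart).indexOf('%PDF-') === 0;
}
```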