How to open a page with Phantomjs without running js or making subsequent requests? - phantomjs

Is there a way to just load the server-generated HTML (without running any JS or loading images)?
The docs seem a little sparse.

The strength of PhantomJS is exactly its ability to emulate a real browser, which opens a page and makes all the subsequent requests. If you want just the HTML, curl or wget may be a better fit.
Nevertheless, there is a way to avoid running JS or loading images: set the corresponding page settings: http://phantomjs.org/api/webpage/property/settings.html
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;
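A minimal sketch of how those settings might be used (the URL is a placeholder; the settings must be applied before page.open):
var page = require('webpage').create();
// Disable JS execution and image loading before the page is opened
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;
page.open('http://example.com/', function(status) {
    if (status === 'success') {
        console.log(page.content); // the server-generated HTML, untouched by page JS
    }
    phantom.exit();
});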

Related

(Karate) How to intercept the XHR request response code?

I am testing login functionality on a third-party website. I have this URL, example.com/login. When I copy and paste it into the browser (Chrome), the page sometimes loads, but sometimes it does not (an empty blank white page).
The problem is that I have to run a script on this page to click one of the elements (all the elements are embedded inside #shadow-root). If the page loads, no problem, the script is evaluated successfully. But the page sometimes does not load and returns a 404 in response to an XHR request, and as a result my * eval(script("script")) step returns "js eval failed...".
So I found a solution: refresh the page, and to do that I am considering capturing the XHR response. If the status code is 404, refresh the page; if not, continue with the following steps.
Now, I think this may work, but I do not know how to implement Karate's HTTP request interception. And first of all, is that something that is doable?
I have looked into documentation here, but could not understand the examples.
https://github.com/karatelabs/karate/tree/master/karate-netty
Meanwhile, if there is another way of refreshing the page conditionally, I will be more than happy to hear about it. Thanks in advance.
First, using JavaScript you should be able to handle shadow roots: https://stackoverflow.com/a/60618233/143475
And the above answer links to advanced examples of executing JS in the context of the current page. I suggest you do some research into that, or get the help of someone who knows JS, the DOM and HTML well - and you should be able to find a way to know whether the XHR has been made successfully or not, for example based on whether some element on the page has changed.
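As a very rough illustration of that element-check idea, here is a sketch of page-context JavaScript that could be evaluated from the test to decide whether a refresh is needed (the host element and inner selector are hypothetical - adjust them to the real page):
// Hypothetical host element and inner selector: adjust to the real page
var host = document.querySelector('#login-widget');
var rendered = !!(host && host.shadowRoot && host.shadowRoot.querySelector('.login-form'));
rendered // returned to the test; refresh the page when this comes back false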
Finally, here is how you can do interception: https://stackoverflow.com/a/61372471/143475

Phantomjs equivalent of browser's "Save Page As... Webpage, complete"

For my application I need to programmatically save a copy of a webpage HTML along with the images and resources needed to render it. Browsers have this functionality in their Save page as... Webpage, complete options.
It is of course easy to save the rendered HTML of a page using phantomjs or casperjs. However, I have not seen any examples of combining this with downloading the associated images, and doing the needed DOM changes to use the downloaded images.
Given that this functionality exists in webkit-based browsers (Chrome, Safari) I'm surprised it isn't in phantomjs -- or perhaps I just haven't found it!
As an alternative to PhantomJS, you can use CasperJS to achieve the required result. CasperJS is a framework built on top of PhantomJS, with a variety of modules and classes that support and complement PhantomJS.
An example of a script that you can use is:
casper.test.begin('test script', 0, function(test) {
    casper.start(url);
    casper.then(function myFunction() {
        //...
    });
    casper.run(function () {
        //...
        test.done();
    });
});
With this script, within a then() step, you can perform your downloads, whether a single image, the full document, a capture of the page, a print, or whatever you need.
Take a look at the download and getPageContent methods and at capture / captureSelector in this link.
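As a rough, hedged sketch of what such a step might look like (the URL and file names are placeholders):
var fs = require('fs');
casper.then(function() {
    // Save the rendered HTML of the current page
    fs.write('page.html', this.getPageContent(), 'w');
    // Download one known resource (placeholder URL) to disk
    this.download('http://example.com/images/logo.png', 'logo.png');
    // Take a full-page screenshot as a fallback rendering
    this.capture('page.png');
});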
I hope these pointers can help you to go further!

Download file depending on mimetype in casperjs

In a web scraping exercise, I need to click on links, let them render the content if it is HTML, and download it otherwise. How do I accomplish this with CasperJS or some other tool on top of PhantomJS/SlimerJS?
As I understand it, PhantomJS/SlimerJS lack the APIs to support downloads. CasperJS has a download API, but I am not able to see how to examine the MIME type so that HTML is allowed to render while other content is downloaded.
In both PhantomJS and SlimerJS you can register a listener to each received response:
page.onResourceReceived = function(response) {
    // ...
};
However, only in SlimerJS is response.body defined. By using this you can save the file. There is a full example in this blog post. (As that example shows, you must set page.captureContent to cover the files you want data for.)
There is no way to do this in PhantomJS 1.9.x (and I believe PhantomJS 2.x still has the same problem, but I have not personally confirmed this yet).
The other part of your question was about deciding what to save based on mime type. The full list of available fields shows you can use response.contentType.
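A hedged SlimerJS sketch combining the two ideas (the capture patterns and output file naming are assumptions, not part of the original answer):
var fs = require('fs');
// Ask SlimerJS to keep response bodies for these content types
page.captureContent = [ /application\/pdf/, /image\/.+/ ];
page.onResourceReceived = function(response) {
    if (response.stage !== 'end' || !response.body) {
        return; // nothing to save yet
    }
    // Let HTML render in the page; write everything else to disk
    if (response.contentType && response.contentType.indexOf('text/html') === -1) {
        fs.write('downloaded_' + response.id, response.body, 'wb');
    }
};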

How to cache history.back like Safari in other browsers?

I want
history.back()
to be cached like Safari naturally does.
But this does not happen in other browsers.
How can I implement Safari-like caching of history.back() in other browsers?
You can cache the page resources in localStorage, but most modern browsers already do something similar (and better). Despite this native browser cache, the code generated from these resources still takes a while to be recalculated and applied.
You can give a little help to the browser structuring your website pages this way:
<script>
    if (!localStorage[location.pathname]) {
        // First visit: load this page from the server, then cache the generated result
        localStorage[location.pathname] = getGeneratedPage();
    } else {
        // Revisit: rebuild the page from the cached copy
        document.body.innerHTML = parseGeneratedPage(localStorage[location.pathname]);
    }
</script>
This is just a VERY generic example (a minimal sketch of the two helper functions follows the list below). The getGeneratedPage could be a function which stores ONLY:
The DOM tree after page load
CSS rules matched for this page
JS functions which have at least one listener
Base64 images (only recommended for small images or previews of big images)
etc
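A minimal, hypothetical sketch of what those two helpers might look like, assuming you persist only the rendered DOM (CSS rules, JS listeners and images would follow the same serialize/restore pattern):
// Hypothetical helpers for the generic example above
function getGeneratedPage() {
    return JSON.stringify({
        html: document.body.innerHTML, // the DOM tree after page load
        savedAt: Date.now()
    });
}
function parseGeneratedPage(stored) {
    var page = JSON.parse(stored);
    return page.html; // the caller assigns this to document.body.innerHTML
}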
Also, you can make a server-side version of this or something like Opera Turbo.
Well, there are countless ways to make your page load in the blink of an eye. Hope it helps.

Screen Scraping - still not working

I have browsed through many posts on this and have tried some of the suggestions, but I am still not understanding it fully.
I would like to scrape HTML pages where a script has to run (usually triggered by a click) before a link is displayed. Some posts mentioned Firebug and others talked about reverse engineering the code I need. But after trying to reverse engineer it, I still don't see how to get the data after tracing the script function.
jQuery('.category-selector').toggle(
    function() {
        var categoryList = jQuery('#category-list');
        categoryList.css('top', jQuery(this).offset().top + 43);
        jQuery('.category-selector img').attr('src', '/images/up_arrow.png');
        categoryList.removeClass('nodisplay');
    },
    function() {
        var categoryList = jQuery('#category-list');
        jQuery('.category-selector img').attr('src', '/images/down_arrow.png');
        categoryList.addClass('nodisplay');
    }
);
jQuery('.category-item a').click(
    function() {
        idToShow = jQuery(this).attr('id').substr(9);
        hideAllExcept(jQuery('#category_' + idToShow));
        jQuery('.category-item a').removeClass('activeLink');
        jQuery(this).addClass('activeLink');
    }
);
I am using VB.NET, and some sites were easy: using Firebug and looking at the script, I was able to pull the data that I needed. What would I do in this scenario? The link is http://featured.typepad.com/ and the categories are what I am trying to access. Notice the URL does not change.
Appreciate any responses.
My best suggestion would be to use Selenium for screen scraping. It is normally used for automated website testing but would fit your case well. I've used it to screen scrape AJAX pages on multiple occasions where the page was heavily JavaScript-dependent.
http://seleniumhq.org/projects/ide/
You can write your screen scraping code to run in .NET, and it can use Firefox or IE to drive the scraping.
With Selenium, you record a screen scraping session with the Selenium IDE in Firefox (look for the Firefox extension in the link above). That session can be exported as either an HTML template or C# code; it might be able to output VB as well.
You copy the C# or VB.NET output from the recorded session into a Selenium .NET project that you create, and then run the project through NUnit.
I'd suggest looking online for some help with getting Selenium started and working, but this should get you on your way.