For my application I need to programmatically save a copy of a webpage HTML along with the images and resources needed to render it. Browsers have this functionality in their Save page as... Webpage, complete options.
It is of course easy to save the rendered HTML of a page using phantomjs or casperjs. However, I have not seen any examples of combining this with downloading the associated images, and doing the needed DOM changes to use the downloaded images.
Given that this functionality exists in webkit-based browsers (Chrome, Safari) I'm surprised it isn't in phantomjs -- or perhaps I just haven't found it!
As an alternative to plain PhantomJS, you can use CasperJS to achieve the required result. CasperJS is a framework built on top of PhantomJS that adds a variety of modules and classes to support and complement it.
An example of a script that you can use is:
// `url` is assumed to hold the address of the page to save
casper.test.begin('test script', 0, function(test) {
    casper.start(url);
    casper.then(function myFunction() {
        //...
    });
    casper.run(function () {
        //...
        test.done();
    });
});
With this script, within a "step" (a casper.then callback), you can perform your downloads, whether that is a single image from the document, the whole page, a screenshot or anything else.
Have a look at the download, getPageContent and capture / captureSelector methods in this link.
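The "needed DOM changes" from the question can be approximated with a plain string rewrite once the page content is available. The following is only a rough sketch under a few assumptions: it is written for Node.js (it relies on the built-in URL class), the helper name localizeResources and the page_files directory are made up for illustration, and it only handles src attributes on img and script tags (stylesheets and CSS url() references are left out). Each entry it returns could then be fetched in a step with casper.download(url, file).

```javascript
// Hypothetical helper: rewrites the src attribute of <img> and <script> tags
// to local file names and returns the list of resources to download.
function localizeResources(html, baseUrl, dir) {
  var resources = [];   // [{ url: absolute URL, file: local path }]
  var fileByUrl = {};   // absolute URL -> local path, so duplicates share a file
  var pattern = /(<(?:img|script)\b[^>]*?\ssrc=)(["'])(.*?)\2/gi;

  var rewritten = html.replace(pattern, function (match, prefix, quote, value) {
    if (value === '' || /^data:/i.test(value)) {
      return match;     // empty or inline data: nothing to download
    }
    // Resolve relative references against the page address.
    var absolute = new URL(value, baseUrl).href;
    if (!fileByUrl[absolute]) {
      var name = new URL(absolute).pathname.split('/').pop() || 'resource';
      fileByUrl[absolute] = dir + '/' + resources.length + '-' + name;
      resources.push({ url: absolute, file: fileByUrl[absolute] });
    }
    return prefix + quote + fileByUrl[absolute] + quote;
  });

  return { html: rewritten, resources: resources };
}
```

The index prefix in the file name is there only to keep two different URLs with the same base name from overwriting each other.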
I hope these pointers can help you to go further!
Is there a way to load just the server-generated HTML (without running any JS or loading images)?
The docs seem a little sparse.
The strength of PhantomJS lies precisely in its ability to emulate a real browser, which opens a page and makes all the subsequent requests. If you just want the HTML, it may be better to use curl or wget.
Nevertheless, there is a way to avoid running JS and loading images: set the corresponding page settings (see http://phantomjs.org/api/webpage/property/settings.html):
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;
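For context, these two settings have to be applied before the page is opened. A minimal sketch of a complete script follows; the helper name makeStatic and the example.com address are placeholders of mine, and the PhantomJS part is guarded so that it is skipped when the file is run anywhere else.

```javascript
// Hypothetical helper: turns off script execution and image loading
// on a PhantomJS-style page object before it is opened.
function makeStatic(page) {
  if (!page.settings) {
    page.settings = {};
  }
  page.settings.javascriptEnabled = false;
  page.settings.loadImages = false;
  return page;
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
  var page = makeStatic(require('webpage').create());
  page.open('http://example.com/', function (status) {  // placeholder URL
    if (status === 'success') {
      console.log(page.content);  // HTML as served, with scripts not executed
    } else {
      console.log('Failed to load page');
    }
    phantom.exit();
  });
}
```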
In a web scraping exercise, I need to click on links, let them render if the content is HTML, and download it otherwise. How do I accomplish this with CasperJS or some other tool on top of PhantomJS/SlimerJS?
As I understand it, PhantomJS and SlimerJS lack the APIs to support downloads. CasperJS has a download API, but I cannot see how to examine the MIME type so that HTML is rendered while other content is downloaded.
In both PhantomJS and SlimerJS you can register a listener to each received response:
page.onResourceReceived = function(response) {
...
}
However, only in SlimerJS is response.body defined. By using this you can save the file. There is a full example in this blog post. (As that example shows, you must set page.captureContent to cover the files you want data for.)
There is no way to do this in PhantomJS 1.9.x (and I believe PhantomJS 2.x still has the same problem, but I have not personally confirmed this yet).
The other part of your question was about deciding what to save based on mime type. The full list of available fields shows you can use response.contentType.
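As a rough sketch of that decision, the listener below looks at response.contentType and hands every non-HTML response to a callback. The names isHtml, makeListener and saveBody are made up for illustration, and the list of MIME types treated as HTML is an assumption you may want to extend.

```javascript
// Returns true when a Content-Type value denotes an HTML document.
function isHtml(contentType) {
  if (!contentType) {
    return false;
  }
  // Drop parameters such as "; charset=UTF-8" before comparing.
  var mime = contentType.split(';')[0].trim().toLowerCase();
  return mime === 'text/html' || mime === 'application/xhtml+xml';
}

// Builds an onResourceReceived listener. `saveBody` is a hypothetical
// callback that would write the response to disk.
function makeListener(saveBody) {
  return function (response) {
    if (response.stage !== 'end') {
      return;  // wait until the resource has been fully received
    }
    if (!isHtml(response.contentType)) {
      saveBody(response);
    }
  };
}
```

It would be attached with page.onResourceReceived = makeListener(...). Keep in mind that, as noted above, response.body is only populated in SlimerJS.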
I want history.back() to be cached like Safari naturally does, but this does not happen in other browsers.
How can I implement Safari-like caching of history.back() in other browsers?
You can cache the page resources in localStorage, but most modern browsers already do something similar (and better). Even with this native browser cache, the code generated from these resources takes a while to be computed and applied.
You can give the browser a little help by structuring your website pages this way:
<script>
if (!localStorage[location.pathname]) {
    // load this page from the server, then store the generated result
    localStorage[location.pathname] = getGeneratedPage();
} else {
    document.body.innerHTML = parseGeneratedPage(localStorage[location.pathname]);
}
</script>
This is just a VERY generic example. The getGeneratedPage could be a function which stores ONLY:
The DOM tree after page load
CSS rules matched for this page
JS functions which have at least one listener
Base64 images (only recommended for small images or previews of big images)
etc
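To make the caching step a bit more concrete, here is a minimal sketch of the lookup logic on its own. The names loadCached and MemoryStorage are hypothetical; MemoryStorage is only a stand-in so the logic can be tried outside a browser, where localStorage would be passed instead.

```javascript
// Returns the cached page for `key`, or generates and stores it on a miss.
// `storage` is anything with getItem/setItem, such as window.localStorage.
function loadCached(storage, key, generate) {
  var cached = storage.getItem(key);
  if (cached !== null) {
    return { html: cached, fromCache: true };
  }
  var fresh = generate();
  try {
    storage.setItem(key, fresh);
  } catch (e) {
    // Quota exceeded or storage disabled: carry on without caching.
  }
  return { html: fresh, fromCache: false };
}

// Minimal in-memory stand-in for localStorage (illustration only).
function MemoryStorage() {
  this.items = {};
}
MemoryStorage.prototype.getItem = function (key) {
  return Object.prototype.hasOwnProperty.call(this.items, key)
    ? this.items[key]
    : null;
};
MemoryStorage.prototype.setItem = function (key, value) {
  this.items[key] = String(value);
};
```

In a page this could be called as loadCached(localStorage, location.pathname, getGeneratedPage), writing the result into document.body.innerHTML only when fromCache is true.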
Also, you can make a server-side version of this or something like Opera Turbo.
Well, there are countless ways to make your page load in the blink of an eye. Hope it helps.
I have browsed through many posts on this and have tried some of the suggestions, but I still do not fully understand it.
I would like to scrape HTML pages where a script only displays a link after something is clicked. Some people mentioned Firebug and others talked about reverse engineering the code I need. But after trying to reverse engineer it, I still don't see how to get the data after tracing the script function.
jQuery('.category-selector').toggle(
    function() {
        var categoryList = jQuery('#category-list');
        categoryList.css('top', jQuery(this).offset().top+43);
        jQuery('.category-selector img').attr('src', '/images/up_arrow.png');
        categoryList.removeClass('nodisplay');
    },
    function() {
        var categoryList = jQuery('#category-list');
        jQuery('.category-selector img').attr('src', '/images/down_arrow.png');
        categoryList.addClass('nodisplay');
    }
);
jQuery('.category-item a').click(
    function(){
        idToShow = jQuery(this).attr('id').substr(9);
        hideAllExcept(jQuery('#category_' + idToShow));
        jQuery('.category-item a').removeClass('activeLink');
        jQuery(this).addClass('activeLink');
    }
);
I am using VB.NET. Some sites were easy: using Firebug and looking at the script, I was able to pull the data that I needed. What would I do in this scenario? The link is http://featured.typepad.com/ and the categories are what I am trying to access. Notice that the URL does not change.
Appreciate any responses.
My best suggestion would be to use Selenium for screen scraping. It is normally used for automated website testing but would fit your case well. I've used it to screen scrape AJAX pages on multiple occasions where the page was heavily JavaScript dependent.
http://seleniumhq.org/projects/ide/
You can write your screen scraping code to run in .NET, and it can drive Firefox or IE to do the scraping.
With Selenium, you record a screen scraping session with the Selenium IDE in Firefox (look for the Firefox extension in the link above). That session can be output either as an HTML template or as C# code. It might be able to output VB as well.
You then copy the C# or VB.NET output into a Selenium .NET project that you create, and run that project through NUnit.
I'd suggest looking online for some help with getting Selenium set up and working, but this should get you on your way.
I've got an HTML form submitting to a PDF using cfdocument.
Within that pdf, I have a link at the bottom that goes to another policy. I need that link to open up on a new page, rather than _self.
I've tried using jQuery to open the window, but I'm not sure that is even possible, and I wasn't successful, to say the least.
So basically, I've got:
<cfdocument format="pdf">
stackoverflow
</cfdocument>
Not possible. Case closed!
Reason:
For my purpose, I'd need to be able to open another pdf in a browser window, but in order to do that, you would have to download the second one to Acrobat or another reader you've got.
Also, you're not able to use jquery to create the new window.
<a href="..." target="_blank">PDF</a>
Or with jQuery:
<a href="..." id="openPdfLink">PDF</a>
$(document).ready(function(){
    $('#openPdfLink').click(function(){
        window.open(this.href);
        return false;
    });
});