Screen Scraping - still not working - vb.net

I have browsed through many posts on this and tried some of the suggestions, but I still don't fully understand it.
I would like to scrape HTML pages where a script runs on the page and displays a link only after a click. Some posts mentioned Firebug and others talked about reverse engineering the code I need, but after tracing the script function I still don't see how to get at the data. The page runs this jQuery:
jQuery('.category-selector').toggle(
    function() {
        var categoryList = jQuery('#category-list');
        categoryList.css('top', jQuery(this).offset().top + 43);
        jQuery('.category-selector img').attr('src', '/images/up_arrow.png');
        categoryList.removeClass('nodisplay');
    },
    function() {
        var categoryList = jQuery('#category-list');
        jQuery('.category-selector img').attr('src', '/images/down_arrow.png');
        categoryList.addClass('nodisplay');
    }
);
jQuery('.category-item a').click(
    function() {
        idToShow = jQuery(this).attr('id').substr(9);
        hideAllExcept(jQuery('#category_' + idToShow));
        jQuery('.category-item a').removeClass('activeLink');
        jQuery(this).addClass('activeLink');
    }
);
I am using VB.NET, and some sites were easy: using Firebug to look at the script, I was able to pull the data I needed. What would I do in this scenario? The link is http://featured.typepad.com/ and the categories are what I am trying to access. Notice the URL does not change when a category is clicked.
Appreciate any responses.

My best suggestion would be to use Selenium for screen scraping. It is normally used for automated website testing, but it fits your case well; I've used it to screen scrape AJAX pages on multiple occasions where the page was heavily JavaScript dependent.
http://seleniumhq.org/projects/ide/
You can write your screen scraping code to run in .NET, and it can drive Firefox or IE to do the scraping.
With Selenium, you record a scraping session using the Selenium IDE in Firefox (look for the Firefox extension in the link above). That session can be exported as an HTML template or as C# code; it may be able to export VB.NET as well.
You then copy the exported C# or VB.NET code into a Selenium .NET project that you create, and run the project through NUnit.
I'd suggest looking online for help getting Selenium set up and working, but this should get you on your way.
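To make this concrete, here is a minimal sketch using the Selenium WebDriver bindings for Python (the same WebDriver API is available for .NET); the selectors come from the jQuery in the question, and the 10-second wait is an assumption:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
try:
    driver.get('http://featured.typepad.com/')

    # Click the selector so the page's script reveals the category links
    driver.find_element_by_css_selector('.category-selector').click()

    # Wait until the category list loses its 'nodisplay' class
    WebDriverWait(driver, 10).until(
        lambda d: 'nodisplay' not in
        d.find_element_by_id('category-list').get_attribute('class'))

    # The rendered DOM now contains the script-generated links
    for link in driver.find_elements_by_css_selector('.category-item a'):
        print(link.get_attribute('href'), link.text)
finally:
    driver.quit()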

Related

How to open a page with Phantomjs without running js or making subsequent requests?

Is there a way to just load the server-generated HTML (without any JS or images)? The docs seem a little sparse.
The strength of PhantomJS is exactly its ability to emulate a real browser, which opens a page and makes all the subsequent requests. If you just want the HTML, curl or wget may be a better fit.
Nevertheless, there is a way to skip running JavaScript and loading images: set the corresponding page settings (see http://phantomjs.org/api/webpage/property/settings.html):
// Settings must be set before the page is opened
var page = require('webpage').create();
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;

Selenium catch HTML every time change has been made in browser

Is it possible to use Selenium so that my code and the browser are integrated? I want to get the updated HTML page every time I make any change on the web page in the browser.
In other words, I would like to run my app, have it automatically start a browser, and every time I make a change on the web page have Selenium automatically pass the changed HTML to my Java/Python code. Selecting a dropdown item might be a good example.
Thanks!
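The scenario being described looks roughly like this with the Selenium Python bindings; the URL and element id here are hypothetical:

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('http://example.com/page-with-dropdown')  # hypothetical URL

html_before = driver.page_source

# Make a change in the browser, e.g. select a dropdown item
Select(driver.find_element_by_id('my-dropdown')).select_by_index(1)  # hypothetical id

# Re-read the page source to pick up the changed DOM
html_after = driver.page_source
print(html_before != html_after)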

Phantomjs equivalent of browser's "Save Page As... Webpage, complete"

For my application I need to programmatically save a copy of a webpage HTML along with the images and resources needed to render it. Browsers have this functionality in their Save page as... Webpage, complete options.
It is of course easy to save the rendered HTML of a page using phantomjs or casperjs. However, I have not seen any examples of combining this with downloading the associated images, and doing the needed DOM changes to use the downloaded images.
Given that this functionality exists in webkit-based browsers (Chrome, Safari) I'm surprised it isn't in phantomjs -- or perhaps I just haven't found it!
As an alternative to plain PhantomJS, you can use CasperJS to achieve the required result. CasperJS is a framework built on top of PhantomJS, with a variety of modules and classes that support and complement it.
An example of a script that you can use is:
var fs = require('fs');
var url = 'http://example.com';  // placeholder: the page to save

casper.test.begin('test script', 0, function(test) {
    casper.start(url);
    casper.then(function() {
        // Within a step you can save the rendered HTML, take a
        // screenshot, or download individual resources
        fs.write('page.html', this.getPageContent(), 'w');
        this.capture('page.png');
    });
    casper.run(function() {
        test.done();
    });
});
With this script, within a "step", you can perform your downloads, whether a single image, a document, the whole page, or a screenshot.
Take a look at the download, getPageContent, and capture / captureSelector methods in the CasperJS documentation.
I hope these pointers can help you to go further!
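The other half of the question, downloading the images and rewriting the DOM to point at the local copies, can be done as a post-processing step in any language. A rough sketch in Python, assuming the rendered HTML was already saved to page.html and using requests and BeautifulSoup:

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://example.com/'  # placeholder: the page the HTML came from
os.makedirs('assets', exist_ok=True)

with open('page.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for i, img in enumerate(soup.find_all('img', src=True)):
    # Download each image and point the tag at the local copy
    resp = requests.get(urljoin(base_url, img['src']))
    local_path = os.path.join('assets', 'img%d' % i)
    with open(local_path, 'wb') as out:
        out.write(resp.content)
    img['src'] = local_path

with open('page_complete.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))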

retrieving ad urls using scrapy and selenium

I am trying to retrieve the ad URLs for this website:
http://www.appledaily.com
The ad URLs are loaded using JavaScript, so a standard CrawlSpider does not work. The ads also change as you refresh the page.
I found this question here, and what I gathered is that we need to first use Selenium to load the page in a browser, then use Scrapy to retrieve the URL. I have some experience with Scrapy but none at all with Selenium. Can anyone show me, or point me to a resource on, how I can write a script to do that?
Thank you very much!
EDIT:
I tried the following, but neither attempt opens the ad banner. Can anyone help?
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://appledaily.com')
adBannerElement = driver.find_element_by_id('adHeaderTop')
adBannerElement.click()
2nd try:
adBannerElement =driver.find_element_by_css_selector("div[#id='adHeaderTop']")
adBannerElement.click()
A CSS attribute selector should not contain the # symbol: it should be "div[id='adHeaderTop']", or, more concisely, div#adHeaderTop.
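With the corrected selector, the second attempt becomes:

adBannerElement = driver.find_element_by_css_selector("div#adHeaderTop")
adBannerElement.click()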
Actually, on observing and analyzing the site and the event you are trying to carry out, I find that the noscript tag is what should interest you. Just get the HTML source of this node, parse out the href attribute, and request that URL.
That is equivalent to clicking the banner.
<noscript>
"<a href="http://adclick.g.doubleclick.net/aclk%253Fsa%...</a>"
</noscript>
(This is not the complete node information, just inspect the banner in Chrome and you will find this tag).
EDIT: Here is a working snippet that gives you the URL without clicking on the ad banner, taken, as mentioned, from the noscript tag.
WebDriver driver = new FirefoxDriver();
driver.navigate().to("http://www.appledaily.com");
WebElement objHidden = driver.findElement(By.cssSelector("div#adHeaderTop_ad_container noscript"));
if (objHidden != null) {
    String innerHTML = objHidden.getAttribute("innerHTML");
    String adURL = innerHTML.split("\"")[1];
    System.out.println("** " + adURL);  // URL you would get by clicking the ad
} else {
    System.out.println("<noscript> element not found...");
}
Though this is written in Java, the page source won't change, so the same approach works from Python.
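For reference, a rough Python equivalent of the Java snippet above, using the same selector and string splitting:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://www.appledaily.com')

hidden = driver.find_elements_by_css_selector('div#adHeaderTop_ad_container noscript')
if hidden:
    inner_html = hidden[0].get_attribute('innerHTML')
    ad_url = inner_html.split('"')[1]  # the URL you would get by clicking the ad
    print('**', ad_url)
else:
    print('<noscript> element not found...')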

capturing a browser refreshed event using Selenium Web Driver

I am writing a program to automate link validation on a site. Our site has more than 400 links per page, and we need to open each link and check that it returns a valid page, i.e. a 200; there are other requirements as well, such as checking whether the page is a 404 redirection page. Validating 400 links takes about 30 minutes or so.
My design is to integrate this with the front-end (Selenium) automation so that each time the browser loads a new page or refreshes, it triggers a new thread, passing the page source along for validating all the hrefs in it.
We are not following a page object model, otherwise I could trigger this in each page.
The question here is: is there any way to listen for a browser refresh or page load event using Selenium WebDriver?
Correct me if I misunderstand your question, but page refresh and page load can be two very different events for you if you are dealing with AJAX. You can try this article about the AJAX part,
and this one for Selenium custom event synchronization.
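For the AJAX part, a common trick is to poll the browser from WebDriver until the page reports it is idle. A minimal sketch in Python, assuming the page uses jQuery (jQuery.active counts in-flight AJAX requests):

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_page_idle(driver, timeout=30):
    # Wait for the document itself to finish loading...
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script('return document.readyState') == 'complete')
    # ...then wait for any jQuery AJAX requests to drain
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script('return window.jQuery ? jQuery.active == 0 : true'))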
This solution here is the most current one I could find.
Selenium can also execute JavaScript in the page, so these answers may help if you want to try that route too:
check-if-page-reloaded-or-refresh-in-js
is-page-reloaded-or-refreshed-using-jquery-or-javascript
post_detect_refresh_with_javascript
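WebDriver itself has no reload event to subscribe to, but you can approximate one from the driving code by waiting for a reference element to go stale. A sketch of that approach in Python; validate_links stands in for whatever validation routine you run on the new source:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.example.com')  # placeholder URL

while True:
    anchor = driver.find_element_by_tag_name('html')
    # Blocks until the <html> element goes stale, i.e. the browser
    # has navigated away or refreshed
    WebDriverWait(driver, 3600).until(EC.staleness_of(anchor))
    # A new document is in place: hand its source to the validator
    validate_links(driver.page_source)  # hypothetical validation routine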