I am trying to retrieve the ad URLs for this website:
http://www.appledaily.com
The ad URLs are loaded using JavaScript, so a standard CrawlSpider does not work. The ads also change as you refresh the page.
I found this question here, and what I gathered is that we first need to use Selenium to load the page in a browser, then use Scrapy to retrieve the URL. I have some experience with Scrapy but none at all with Selenium. Can anyone show me, or point me to resources on, how to write a script that does this?
Thank you very much!
EDIT:
I tried the following, but neither works for opening the ad banner. Can anyone help?
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://appledaily.com')
adBannerElement = driver.find_element_by_id('adHeaderTop')
adBannerElement.click()
2nd try:
adBannerElement =driver.find_element_by_css_selector("div[#id='adHeaderTop']")
adBannerElement.click()
The CSS selector should not contain the # symbol; it should be "div[id='adHeaderTop']", or, written more concisely, div#adHeaderTop.
Actually, on observing and analyzing the site and the event you are trying to trigger, I find that the noscript tag is what should interest you. Just get the HTML source of this node, parse out the href attribute, and fire that URL.
It will be equivalent to clicking the banner.
<noscript>
"<a href="http://adclick.g.doubleclick.net/aclk%253Fsa%...</a>"
</noscript>
(This is not the complete node information, just inspect the banner in Chrome and you will find this tag).
EDIT: Here is a working snippet that gives you the URL without clicking on the ad banner, extracted from the tag mentioned above.
WebDriver driver = new FirefoxDriver();
driver.navigate().to("http://www.appledaily.com");
WebElement objHidden = driver.findElement(By.cssSelector("div#adHeaderTop_ad_container noscript"));
if(objHidden != null){
String innerHTML = objHidden.getAttribute("innerHTML");
String adURL = innerHTML.split("\"")[1];
System.out.println("** " + adURL); ///URL when you click on the Ad
}
else{
System.out.println("<noscript> element not found...");
}
Though this is written in Java, the page source won't change.
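For reference, a rough Python equivalent of the Java snippet above (a sketch, assuming the same div#adHeaderTop_ad_container structure; not tested against the live site):

```python
def extract_ad_url(inner_html):
    """Pull the href out of the <noscript> markup by splitting on double
    quotes, mirroring the Java snippet's innerHTML.split("\"")[1]."""
    return inner_html.split('"')[1]

def fetch_ad_url():
    """Load the page in Firefox and read the ad URL from the noscript node."""
    from selenium import webdriver
    driver = webdriver.Firefox()
    try:
        driver.get("http://www.appledaily.com")
        hidden = driver.find_element_by_css_selector(
            "div#adHeaderTop_ad_container noscript")
        return extract_ad_url(hidden.get_attribute("innerHTML"))
    finally:
        driver.quit()
```

Calling fetch_ad_url() should return the same doubleclick.net URL you would reach by clicking the banner.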
I am trying to submit/call back after entering h-captcha-response and g-recaptcha-response with the solved token, but I don't understand how I am supposed to submit it.
How can I submit the hCaptcha without a form, button, or data-callback?
Here is the entire HTML of the page containing the hCaptcha.
https://justpaste.me/57J0
You have to look in the JavaScript files for the specific function (something like "testCaptcha") that submits the answer. Once you find it, you can call it like this:
captcha = yourTOKEN
driver.execute_script("""
let [captcha] = arguments
testCaptcha(captcha)
""", captcha)
Could you please specify a URL where you have this captcha? It would be helpful for finding this specific function.
How can I get page source of current page?
I call driver.get(link) and land on the main page. Then I use Selenium to navigate to another page (by tag and XPath), and once I'm on the right page I'd like to obtain its page source.
I tried driver.page_source() but I obtain the page source of the main page, not the current one.
driver = webdriver.Chrome(ccc)
driver.get('https://aaa.com')
check1 = driver.find_element_by_xpath('/html/body/div[1]/div/div[2]/button')
check1.click()
time.sleep(1)
check2=driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[2]/div[1]/div/a')
check2.click()
And after check2.click() I am on a page with a new link (this link only works by clicking, not by visiting it directly). How can I get the page source for this new link?
I need it in order to switch from Selenium to Beautiful Soup.
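For what it's worth, page_source is a property, not a method, and it always reflects whatever page the driver currently has focus on. If the click opened a new window or tab, you have to switch to it first. A sketch (newest_handle is my own helper, not a Selenium API):

```python
def newest_handle(before, after):
    """Return the window handle that appeared between two snapshots of
    driver.window_handles, or None if no new window opened (plain set
    difference; hypothetical helper)."""
    new = set(after) - set(before)
    return new.pop() if new else None

def source_after_click(driver, element):
    """Click an element and return the page source of whatever page the
    driver ends up on, switching to a new window if one was opened."""
    before = driver.window_handles
    element.click()
    handle = newest_handle(before, driver.window_handles)
    if handle:
        driver.switch_to.window(handle)
    return driver.page_source  # property: no parentheses
```

The returned string can then be handed straight to BeautifulSoup(html, "html.parser").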
I have used WebDriver and can display the sources of a page.
I can find broken links/images on any particular web page, but I am not able to find them across all the pages using Selenium. I have gone through many blogs but didn't find any working code. It would be a great help if any of you could help me fix this problem.
Collect all the href/src attributes on your page into a list, using the 'a' and 'img' tag names.
In Java, iterate over that list and set up an HttpURLConnection for each URL. Connect to it and check the response code. Google the logic and what the error response codes mean.
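A minimal sketch of that loop, shown in Python with the standard library's urllib in place of Java's HttpURLConnection (is_broken is my own rule-of-thumb helper):

```python
import urllib.request
import urllib.error

def is_broken(status):
    """Treat any 4xx/5xx response, or no response at all, as broken
    (hypothetical rule of thumb)."""
    return status is None or status >= 400

def check_urls(urls):
    """Issue a request to each URL and return {url: status_code},
    with None for URLs that could not be reached at all."""
    results = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                results[url] = resp.status
        except urllib.error.HTTPError as e:
            results[url] = e.code  # server answered with an error code
        except (urllib.error.URLError, ValueError):
            results[url] = None   # unreachable or malformed URL
    return results
```

You would feed check_urls() the list of href/src values collected with WebDriver, then report every entry for which is_broken() is true.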
If you want to check for broken images on all the pages, you can use the HttpClient library to check the status codes of the images on a page.
First, find all the images on each page using WebDriver.
Below is the syntax:
List<WebElement> imagesList = driver.findElements(By.tagName("img"));
Below is the syntax to get links
List<WebElement> anchorTagsList = driver.findElements(By.tagName("a"));
Now you need to iterate through each image and link and verify its response code with HttpStatus.
You can find example from here Find Broken / Invalid Images on a Page
You can find example from here Find Broken Links on a Page
I am trying to click on a link and check that it is active. However, the class is what determines whether it is active or not.
This is pagination for a web page, where I want to automate the web driver to navigate to different pages and ensure the link click has indeed taken the user to the correct page.
I am using Selenium2Library with Firefox.
Does anyone have any suggestions? Thanks.
Here's a solution in Java; I hope you can translate it to whatever language you use.
WebElement link = driver.findElement(By.cssSelector("[title='No. 2']"));
String linkClass = link.findElement(By.xpath("./..")).getAttribute("class");
if ("active".equals(linkClass)) {
link.click();
}
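Since Selenium2Library sits on top of Python, here is a rough Python translation of the same idea; has_class is my own helper, and unlike the exact-equality check above it also copes with a parent that carries several space-separated classes:

```python
def has_class(class_attr, name):
    """True if a space-separated class attribute contains `name`
    (hypothetical helper; tolerates a missing attribute)."""
    return name in (class_attr or "").split()

def click_if_active(driver):
    """Click the 'No. 2' pagination link only if its parent is active."""
    link = driver.find_element_by_css_selector("[title='No. 2']")
    parent = link.find_element_by_xpath("./..")   # immediate parent node
    if has_class(parent.get_attribute("class"), "active"):
        link.click()
```

The split-based check matters in practice because class="active current" would fail the Java snippet's "active".equals(...) test.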
I have browsed through many posts on this and have tried some of the suggestions, but I am still not understanding it fully.
I would like to scrape HTML pages that run a script which displays a link only after clicking. Some posts mentioned Firebug and others talked about reverse engineering the code I need, but after trying reverse engineering I still don't see how to get the data after tracing the script function.
jQuery('.category-selector').toggle(
function() {
var categoryList = jQuery('#category-list');
categoryList.css('top', jQuery(this).offset().top+43);
jQuery('.category-selector img').attr ('src', '/images/up_arrow.png');
categoryList.removeClass('nodisplay');
},
function() {
var categoryList = jQuery('#category-list');
jQuery('.category-selector img').attr('src', '/images/down_arrow.png');
categoryList.addClass('nodisplay');
}
);
jQuery('.category-item a').click(
function(){
idToShow = jQuery(this).attr('id').substr(9);
hideAllExcept(jQuery('#category_' + idToShow));
jQuery('.category-item a').removeClass('activeLink');
jQuery(this).addClass('activeLink');
}
);
I am using VB.NET, and some sites were easy: using Firebug and looking at the script, I was able to pull the data I needed. What would I do in this scenario? The link is http://featured.typepad.com/ and the categories are what I am trying to access. Notice that the URL does not change.
Appreciate any responses.
My best suggestion would be to use Selenium for screen scraping. It is normally used for automated website testing but would fit your case well. I've used it to screen scrape AJAX pages on multiple occasions where the page was heavily JavaScript dependent.
http://seleniumhq.org/projects/ide/
You can write your screen scraping code to run in .NET and it can use Firefox or IE to run your screen scraping with.
With Selenium, what you'll do is record a screen-scraping session with the Selenium IDE in Firefox (look for the Firefox extension at the link above). That session can output either an HTML template or C# code; it might be able to output VB as well.
You'll copy the C# or VB.NET output from the screen scrape into a Selenium .NET project that you create, and then run the Selenium project through NUnit.
I'd suggest looking online for some help with getting Selenium started and working but this should get you on your way.
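As a concrete sketch of what driving that page might look like (shown in Python for brevity rather than VB.NET; the '.category-selector' and '.category-item a' selectors and the 9-character id prefix are taken from the jQuery snippet quoted in the question, so treat them as assumptions):

```python
def category_id(anchor_id):
    """Strip the 9-character prefix from the anchor id, mirroring the
    page's own jQuery call attr('id').substr(9)."""
    return anchor_id[9:]

def scrape_categories(driver):
    """Open the category selector and click through each category link,
    collecting the page source displayed for each one."""
    driver.get("http://featured.typepad.com/")
    driver.find_element_by_css_selector(".category-selector").click()
    sources = {}
    for link in driver.find_elements_by_css_selector(".category-item a"):
        cid = category_id(link.get_attribute("id"))
        link.click()                      # fires the hideAllExcept handler
        sources[cid] = driver.page_source # content shown for this category
    return sources
```

Because the URL never changes, the point is that the browser executes the jQuery handlers for you, and you read page_source after each click instead of fetching new URLs.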