Scraping links with selenium - selenium

I am trying to scrape links to articles on a website. When the site loads, it lists only 5 articles; you have to click a "load more" button to display more of the list.
The HTML source contains links to only the first five articles.
I used Selenium with Python to automate clicking the "load more" button until the page is fully loaded with all article listings.
The question now is: how can I extract the links to all those articles?
After loading the site completely with Selenium, I got the HTML source with driver.page_source and printed it, but it still has links to only the first 5 articles.
I want to get the links to all the articles that were loaded into the page after clicking the "load more" button.
Can someone please help with a solution?

Maybe the links take some time to show up and your code reads driver.page_source before the source is updated. You can select the links with Selenium after an explicit wait, so that you can be sure the links that are dynamically added to the page are fully loaded. It is difficult to pin down exactly what you need without a link to your source, but (in Python) it should be something similar to:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
def condition(driver):
    """If the selector defined in the function retrieves 10 or more results, return the results.
    Else, return None.
    """
    selector = 'a.my_class'  # Selects all <a> tags with the class "my_class"
    els = driver.find_elements(By.CSS_SELECTOR, selector)
    if len(els) >= 10:
        return els
# Making an assignment only when the condition returns a truthy value when called (waiting up to 2 min):
links_elements = WebDriverWait(driver, timeout=120).until(condition)
# Getting the href attribute of the links
links_href = [link.get_attribute('href') for link in links_elements]
In this code, you are:
1. Constantly looking for the elements you want until there are 10 or more of them. You can do this by CSS selector (as in the example), XPath or another method. This gives you a list of Selenium objects as soon as the wait condition returns a truthy value; if that does not happen before the timeout, a TimeoutException is raised. See more on explicit waits in the documentation. You should write the appropriate condition for your case: expecting a certain number of links may not work if you are not sure how many links there will be in the end.
2. Extracting what you want from the Selenium objects. For that, call the appropriate method on the elements in the list you got from the step above.
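If you do not know how many links to expect, one possible condition is to wait until the number of matches stops growing between two polls. The sketch below is only an illustration under assumptions: the class name link_count_stable is made up, the selector 'a.my_class' is the placeholder from the example above, and "stable for one poll" is a heuristic that can fire too early on a slow page.

```python
class link_count_stable:
    """Wait condition: truthy once the number of matches is non-zero and
    unchanged since the previous poll; falsy otherwise."""

    def __init__(self, selector):
        self.selector = selector
        self.previous = -1  # no poll has happened yet

    def __call__(self, driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        els = driver.find_elements("css selector", self.selector)
        if els and len(els) == self.previous:
            return els  # count did not change since the last poll
        self.previous = len(els)
        return False

# Intended usage with a real driver (not executed here):
# links_elements = WebDriverWait(driver, timeout=120).until(link_count_stable('a.my_class'))
# links_href = [link.get_attribute('href') for link in links_elements]
```

WebDriverWait polls every 0.5 seconds by default, so a slow "load more" request could look stable too early; passing a larger poll_frequency to WebDriverWait makes the check more conservative.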

Related

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy, it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
This is the XHR URL:
https://www.bahiablancapropiedades.com/buscar/resultados/0
While scrolling the page, each request returns 8 records.
So get the total number of records with an XPath and divide it by 8, rounding up. That gives the number of XHR requests to make.
I had the same issue, and the logic below resolved it:
import math
pagination_count = ...  # total number of records shown on the page, extracted with your XPath
page_count = math.ceil(int(pagination_count) / 8)
for pagination_value in range(page_count):
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(pagination_value)
    # pass this url to your Scrapy function
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
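To see why the response only looks corrupted, here is a minimal standard-library sketch. The payload is invented for illustration (only the "view" key comes from the response described above), and html.parser stands in for parsel so that the snippet runs without extra dependencies. HrefCollector and links_from_xhr are hypothetical names.

```python
import json
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

def links_from_xhr(body):
    """Decode a JSON body whose "view" key holds an HTML fragment and
    return the links found in that fragment."""
    html = json.loads(body)["view"]  # json.loads undoes the escaping
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs

# Hypothetical payload: this HTML fragment is made up for illustration.
fragment = '<div class="item"><a href="/item/1">Item 1</a></div>'
raw_body = json.dumps({"view": fragment})  # quotes inside the HTML become \"
links = links_from_xhr(raw_body)
```

Printing raw_body shows the same kind of backslash-escaped markup seen in the Network tab, while links holds the clean href values.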

Scrapy: find HTTP call from button click

I am trying to scrape flyers from flipp.com/weekly_ads using Scrapy. Before I can scrape the flyers, I need to input my area code, and search for local flyers (on the site, this is done by clicking a button).
I am trying to input a value and simulate "clicking a button" using Scrapy.
Initially, I thought that I would be able to use FormRequest.from_response to find the form and input my area code as a value. However, the button is implemented in JavaScript, meaning that the form cannot be found.
So, I tried to find the HTTP call via Inspect Element > Developer Tools > Network > XHR to see if any of the calls would load the equivalent flipp page with the new, inputted area code (my area code).
Now, I am very new to Scrapy, and HTTP requests/responses, so I am unsure if the link I found is the correct one (as in, the response with the new area code), or not.
This is the request I found:
https://gateflipp.flippback.com/bf/flipp/data?locale=en-us&postal_code=90210&sid=10775773055673477
I used an arbitrary postal code for the request (90210).
I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:
How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?
If this is incorrect:
How do I input a value for a javascript button, and get the result using Scrapy?
import scrapy
import requests
import json
class flippSpider(scrapy.Spider):
    name = "flippSpider"
    postal_code = "M1T2R8"
    start_urls = ["https://flipp.com/weekly_ads"]
    def parse(self, response): #Input value and simulate button click
        return Request() #Find http call to simulate button click with correct field/value parameters
    def parse_formrequest(self, response):
        yield scrapy.Request("https://flipp.com/weekly_ads/groceries", callback= self.parse_groceries)
    def parse_groceries(self, response):
        flyers = []
        flyer_names = response.css("class.flyer-name").extract()
        for flyer_name in flyer_names:
            flyer = FlippspiderItem()
            flyer["name"] = flyer_name
            flyers.append(flyer)
            self.log(flyer["name"])
            print(flyer_name)
        return flyers
I expected to find the actual javascript button request within the XHR links but the one I found seems to be incorrect.
Edit: I do not want to use Selenium, it's slow, and I do not want a browser to pop up during execution of the spider.
"I suspect this is the incorrect request, but in the case that I am wrong, and this is correct:"
That is the correct URL to get the data powering that website; what you see on screen when you go to flipp.com/weekly_ads/groceries is just that data packaged in HTML.
"How do I navigate to - flipp.com/weekly_ads/groceries from this request, while maintaining the new area code?"
I am pretty sure you are asking the wrong question. You don't need to, and in fact navigating to flipp.com/weekly_ads/groceries will 100% not do what you want anyway. You can observe that when you click on "Groceries", the content changes but the browser does not navigate to any new page, nor does it make a new XHR request. Thus, everything that you need is in that JSON. What is happening is that they use the flyers.*.categories entries that contain "Groceries" to narrow down the 129 flyers that are returned to just those related to Groceries.
As for "maintaining the new area code," it's a similar "wrong question" because every piece of data that is returned by that XHR is scoped to the postal code in question. Thus, you don't need to re-submit anything, and nor would I expect any data that comes back from your postal_code=90210 request to contain 30309 (or whatever) data.
Believe it or not, you're actually in a great place: you don't need to deal with complicated CSS or XPath queries to liberate the data from its HTML prison: they are kind enough to provide you with an API to their data. You just need to deal with unpacking the content from their structure into your own.
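As a rough sketch of that unpacking step, the filter below keeps the flyers whose categories include "Groceries". It assumes that each entry under "flyers" carries a "categories" list of strings, as the flyers.*.categories path above suggests. The sample payload, its "id" and "merchant" fields, and the function name are invented for illustration and are not taken from the real API.

```python
import json

def flyers_in_category(payload, category):
    """Return the flyers whose "categories" list contains the given category."""
    return [
        flyer
        for flyer in payload.get("flyers", [])
        if category in flyer.get("categories", [])
    ]

# Hypothetical response body, shaped after the flyers.*.categories structure.
sample = json.loads("""
{"flyers": [
  {"id": 1, "merchant": "Store A", "categories": ["Groceries", "Pharmacy"]},
  {"id": 2, "merchant": "Store B", "categories": ["Electronics"]},
  {"id": 3, "merchant": "Store C", "categories": ["Groceries"]}
]}
""")

grocery_flyers = flyers_in_category(sample, "Groceries")
```

In a spider, the payload would come from json.loads(response.text) in the callback of a request to the XHR URL, with your own postal_code in the query string.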

Unable to locate element using Selenium

I have reviewed several questions pertaining to this popular topic but have not yet found a solution. I am trying to scrape a dynamic webpage that requires the user to click something and then enter some input. The site I am trying to scrape is here: https://a810-dobnow.nyc.gov/publish/#!/
I am trying to click where it says "Building Identification Number" and proceed to enter some input. I cannot seem to even locate the element I need to click. I used a wait and also checked to see if it was located in some other frame I needed to switch to, it is not as far as I can see:
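from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome("C:\\Users\\#####\\Downloads\\chromedriver_win32\\chromedriver.exe")
driver.get("https://a810-dobnow.nyc.gov/publish/#!/")
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="accordiongroup-9-9180-tab"]/h4/a/span/h3')))
driver.find_element(By.XPATH, '//*[@id="accordiongroup-9-9180-tab"]/h4/a/span/h3').click()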
I just loaded the page, and when I search the DOM for the XPath you provided, it fails to find a matching element.
I'd recommend using something like:
driver.find_element(By.XPATH, "//h3[contains(text(), 'Building Identification Number (BIN)')]").click()
Hope this helps

How to find out broken links/ images throughout all the pages of a website using selenium WebDriver?

I can find broken links/images on any particular web page, but I am not able to find them across all the pages of a site using Selenium. I have gone through many blogs but didn't find any working code. It would be a great help if any of you could help me fix this problem.
Collect the href attributes of all 'a' tags and the src attributes of all 'img' tags on your page into a list.
In Java, loop over the list and set up an HttpURLConnection for each URL. Connect to it and check the response code. Search online for the logic and for the meaning of the error response codes.
If you want to check for broken images on all the pages, you can use the HttpClient library to check the status codes of the images on a page.
First, find all the images on each page using WebDriver.
Below is the syntax:
List<WebElement> imagesList = driver.findElements(By.tagName("img"));
Below is the syntax to get the links:
List<WebElement> anchorTagsList = driver.findElements(By.tagName("a"));
Now you need to iterate through each image and verify its response code with HttpStatus.
You can find an example here: Find Broken / Invalid Images on a Page
You can find an example here: Find Broken Links on a Page
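To cover every page rather than a single one, the same check can be wrapped in a small crawl loop. The sketch below is in Python rather than Java and swaps WebDriver for the standard library's html.parser, so it only illustrates the idea. The fetch argument is a hypothetical caller-supplied function that returns (status_code, html_text) for a URL, with empty text for non-HTML resources such as images.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.urls.append(attrs["src"])

def find_broken(start_url, fetch):
    """Breadth-first crawl of the pages on start_url's host.

    Returns {url: status} for every URL whose status code is >= 400.
    URLs on other hosts are checked once but not crawled further.
    """
    host = urlparse(start_url).netloc
    seen, broken = {start_url}, {}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        status, html = fetch(page)
        if status >= 400:
            broken[page] = status
            continue
        if urlparse(page).netloc != host:
            continue  # external URL: status checked, links not followed
        collector = LinkCollector()
        collector.feed(html)
        for raw in collector.urls:
            url = urljoin(page, raw)  # resolve relative links
            if urlparse(url).scheme in ("http", "https") and url not in seen:
                seen.add(url)
                queue.append(url)
    return broken
```

A real fetch could be built on urllib.request or on a WebDriver session. Pages that render their links with JavaScript would need the browser-based collection described above instead of html.parser.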

Having one spider use items returned from another spider?

So I've written a spider that extracts certain desired links from a webpage and puts the URL, link text, and other information not necessarily contained in the <a> tag itself, into an item for each link.
How should I pass this item onto another spider which scrapes the URL provided in that item?
This question has been asked many times.
Below are some links on this site that answer your question.
Some answer it directly, i.e. by passing items to another function, but you may realise that you do not need to do it that way, so other methods are linked to show what's possible.
Using multiple spiders at in the project in Scrapy
Scrapy - parse a page to extract items - then follow and store item url contents
Scrapy: Follow link to get additional Item data?