Having one spider use items returned from another spider? - scrapy

So I've written a spider that extracts certain desired links from a webpage and puts the URL, link text, and other information not necessarily contained in the <a> tag itself into an item for each link.
How should I pass this item on to another spider that scrapes the URL provided in that item?

This question has been asked many times.
Below are some links on this site that answer your question.
Some answer it directly, i.e. by passing items to another callback, but you may realise that you do not need to do it that way, so other methods are linked to show what's possible; a minimal sketch of the usual pattern follows the links below.
Using multiple spiders at in the project in Scrapy
Scrapy - parse a page to extract items - then follow and store item url contents
Scrapy: Follow link to get additional Item data?
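A minimal sketch of that pattern, assuming Scrapy 1.7+ for cb_kwargs (older versions would use request.meta instead); the spider name, start URL, selectors and item fields are placeholders, not taken from your project:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['https://example.com/listing']  # placeholder listing page

    def parse(self, response):
        for a in response.css('a.article-link'):  # placeholder selector
            item = {
                'url': response.urljoin(a.attrib['href']),
                'text': a.css('::text').get(),
            }
            # Follow the extracted URL and carry the partially filled item along.
            yield scrapy.Request(item['url'], callback=self.parse_detail,
                                 cb_kwargs={'item': item})

    def parse_detail(self, response, item):
        # Complete the item with data from the linked page, then emit it.
        item['title'] = response.css('h1::text').get()  # placeholder field
        yield item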

Related

How to get data in dashboard with Scrapy?

I'm scraping some data about car rental from getaround.com. I recently saw that it was possible to get car availability with scrapy-splash from a calendar rendered with JavaScript. An example is given at this URL:
https://fr.getaround.com/location-voiture/liege/ford-fiesta-533656
The information I need is contained in the div tag with class owner_calendar_month. However, I saw that some data seem to be accessible in the div tag with class js_car_calendar calendar_large, whose data-path attribute specifies /dashboard/cars/533656/calendar. Do you know how to access this path, and how to scrape the data within it using Scrapy?
If you visit https://fr.getaround.com/dashboard/cars/533656/calendar directly, you get an error saying you have to be logged in to view the data. So, first of all, you would have to sign in to the website from Scrapy if you want to be able to scrape that data.
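A minimal sketch of that approach, assuming the site uses a standard HTML login form; the sign-in URL, form field names and selectors below are assumptions, not taken from the site:

import scrapy

class CalendarSpider(scrapy.Spider):
    name = 'getaround_calendar'
    # Assumed sign-in page; check the site's actual login URL.
    start_urls = ['https://fr.getaround.com/users/sign_in']

    def parse(self, response):
        # Submit the login form; the form field names here are assumptions.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'user[email]': 'you@example.com', 'user[password]': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The session cookies are kept by Scrapy, so the data-path URL can now be requested.
        yield scrapy.Request(
            'https://fr.getaround.com/dashboard/cars/533656/calendar',
            callback=self.parse_calendar,
        )

    def parse_calendar(self, response):
        # Parse the availability data here.
        self.logger.info('Calendar response length: %s', len(response.text))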

Scraping links with selenium

I am trying to scrape links to articles on a website. Normally when the site loads it lists only 5 articles, and it requires clicking a "load more" button to display the rest of the article list.
The HTML source has only the links to the first five articles.
I used Selenium with Python to automate clicking the "load more" button so the webpage loads completely with all article listings.
The question now is how I can extract the links to all those articles.
After loading the site completely with Selenium, I tried to get the HTML source with driver.page_source and printed it, but it still has only the links to the first 5 articles.
I want to get the links to all the articles that were loaded in the webpage after clicking the "load more" button.
Maybe the links take some time to show up and your code is reading driver.page_source before the source code is updated. You can select the links with Selenium after an explicit wait, so that you can make sure the links that are dynamically added to the web page are fully loaded. It is difficult to boil down exactly what you need without a link to your source, but (in Python) it should be something similar to:
from selenium.webdriver.support.ui import WebDriverWait

def condition(driver):
    """If the selector defined in the function retrieves 10 or more results, return the results.
    Else, return None.
    """
    selector = 'a.my_class'  # Selects all <a> tags with the class "my_class"
    els = driver.find_elements_by_css_selector(selector)
    if len(els) >= 10:
        return els

# Making an assignment only when the condition returns a truthy value when called (waiting until 2 min):
links_elements = WebDriverWait(driver, timeout=120).until(condition)

# Getting the href attribute of the links
links_href = [link.get_attribute('href') for link in links_elements]
In this code, you are:
Repeatedly looking for the elements you want until there are 10 or more of them. You can do this by CSS selector (as in the example), by XPath or by another method. This gives you a list of Selenium objects as soon as the wait condition returns a truthy value, or raises a timeout error after the given timeout. See more on explicit waits in the documentation. You should write the appropriate condition for your case; expecting a certain number of links may not be a good idea if you are not sure how many links there will be in the end.
Extracting what you want from those Selenium objects. For that, use the appropriate method on the elements in the list you got from the step above.

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given there looks pretty easy: it's an orderly JSON object with the data you want.
I want to scrape this: https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page looks weird, like corrupted HTML code (this is what shows up in the Network tab).
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL. While scrolling the page, it returns 8 records per request.
So get the total number of records via XPath and divide it by 8; that gives you the number of XHR requests to make. I ran into the same issue, and the logic below (inside the spider's parse method) resolved it:
pagination_count = response.xpath('...').get()  # XPath of the displayed total-results number
value = int(pagination_count) // 8  # 8 records per XHR request
for pagination_value in range(value + 1):
    url = 'https://www.bahiablancapropiedades.com/buscar/resultados/' + str(pagination_value)
    yield scrapy.Request(url, callback=self.parse_results)  # pass each URL to your parsing callback
It is not corrupted HTML; it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data, and others, like this one, will return the actual HTML to be added.
To get the elements, you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as what happens when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
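Inside a spider, the same idea could look roughly like this (a sketch; the spider name and the fields extracted in parse_item are placeholders):

import json

import scrapy
from parsel import Selector

class BahiaSpider(scrapy.Spider):
    name = 'bahia'
    start_urls = ['https://www.bahiablancapropiedades.com/buscar/resultados/0']

    def parse(self, response):
        data = json.loads(response.text)
        sel = Selector(text=data['view'])  # 'view' holds the HTML fragment
        for href in sel.css('a::attr(href)').getall():
            # Follow each property link and parse the detail page.
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Placeholder fields; extract whatever you need from each property page.
        yield {'url': response.url, 'title': response.css('h1::text').get()}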

How to follow lazy loading with scrapy?

I am trying to crawl a page that is using lazy loading to get the next set of items. My crawler follows normal links, but this one seems to be different:
The page:
https://www.omegawatches.com/de/vintage-watches
is followed by https://www.omegawatches.com/de/vintage-watches?p=2
But only if you load it within the browser; Scrapy will not follow the link.
Is there a way to make Scrapy follow pages 1, 2, 3, 4 automatically?
The page uses virtual scrolling, and the API through which it gets its data is
https://www.omegawatches.com/de/vintage-watches?p=1&ajax=1
It returns JSON data with various details, including the products as an HTML fragment, and an <a> tag with the class "link next" that indicates whether a next page exists.
Increase the page number until there is no <a> tag with the "link next" class, as in the sketch below.
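A rough sketch of that loop, assuming Scrapy 1.7+ for cb_kwargs; the JSON key holding the product markup and the product link selector are assumptions, so inspect the real response to confirm them:

import json

import scrapy
from parsel import Selector

class OmegaVintageSpider(scrapy.Spider):
    name = 'omega_vintage'
    base_url = 'https://www.omegawatches.com/de/vintage-watches?p={}&ajax=1'
    start_urls = [base_url.format(1)]

    def parse(self, response, page=1):
        data = json.loads(response.text)
        # Assumption: the HTML fragment with the products sits under a key
        # such as 'products'; check the actual JSON for the real key name.
        sel = Selector(text=data.get('products', ''))
        for href in sel.css('a.product-item-link::attr(href)').getall():  # placeholder selector
            yield {'url': href}
        # Keep paginating while the fragment still contains the "link next" anchor.
        if sel.css('a.link.next'):
            next_page = page + 1
            yield scrapy.Request(self.base_url.format(next_page),
                                 callback=self.parse,
                                 cb_kwargs={'page': next_page})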

Scrapy is not returning any data after a certain level of div

I am trying to crawl this website: https://www.firstpost.com/search/sachin-tendulkar
Steps followed:
a. fetch("https://www.firstpost.com/search/sachin-tendulkar")
b. view(response) --> everything is working as expected up to this point.
Once I start to extract the data with the syntax below, I am only able to get divs up to a certain level:
response.xpath('//div[@id="results"]').extract()
Beyond this div I am not able to access any other divs and their content.
I haven't faced this kind of issue in the past when developing crawlers for other websites. Is the issue site-specific?
Can you please let me know a way to crawl the internal divs?
Can you elaborate on "not able to access any other divs and their content"? Do you get any error?
I can access all the divs and their content. For example, the main content of the search results is inside the div gsc-expansionArea, which can be accessed via
//div[@class="gsc-expansionArea"]
and this gives you an iterable to work with.
Only the first result is outside this div; it can be accessed via another div:
//div[@class="gsc-webResult gsc-result"]
And the last sibling of this, //div[@class="gcsc-branding"], has no search results in it.
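For example, in the Scrapy shell the iteration could look roughly like this (the inner selectors for title and link are placeholders; adjust them to the actual markup of each result block):

# After fetch("https://www.firstpost.com/search/sachin-tendulkar") in scrapy shell:
for result in response.xpath('//div[@class="gsc-expansionArea"]/div'):
    # Placeholder inner selectors; inspect each result block to pick the right ones.
    title = result.xpath('.//a//text()').get()
    link = result.xpath('.//a/@href').get()
    print(title, link)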