How to follow lazy loading with Scrapy?

I am trying to crawl a page that is using lazy loading to get the next set of items. My crawler follows normal links, but this one seems to be different:
The page:
https://www.omegawatches.com/de/vintage-watches
is followed by https://www.omegawatches.com/de/vintage-watches?p=2
But only if you load it within the browser. Scrapy will not follow the link.
Is there a way to make Scrapy follow pages 1, 2, 3, 4 automatically?

The page uses virtual scrolling, and the API it fetches data from is
https://www.omegawatches.com/de/vintage-watches?p=1&ajax=1
It returns JSON containing various details, including the products as HTML and an <a> tag with class "link next" that indicates whether a next page exists.
Increase the page number until there is no <a> tag with the "link next" class.
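A rough sketch of a spider built on that endpoint (the JSON key assumed to hold the product HTML, "products", and the product-link selector are guesses; inspect the actual response to confirm them):

import json
import scrapy
from parsel import Selector

class VintageWatchesSpider(scrapy.Spider):
    name = "vintage_watches"
    base_url = "https://www.omegawatches.com/de/vintage-watches?p={}&ajax=1"

    def start_requests(self):
        yield scrapy.Request(self.base_url.format(1), callback=self.parse, cb_kwargs={"page": 1})

    def parse(self, response, page):
        data = json.loads(response.text)
        # "products" is an assumed key name; inspect the JSON to find the key
        # that actually contains the product list as HTML.
        sel = Selector(text=data.get("products", ""))
        # assumed selector for the product links inside that HTML fragment
        for href in sel.css("a.product-item-link::attr(href)").getall():
            yield {"url": href}
        # keep paging as long as the fragment still contains an <a class="link next">
        if sel.css("a.link.next"):
            next_page = page + 1
            yield scrapy.Request(self.base_url.format(next_page),
                                 callback=self.parse, cb_kwargs={"page": next_page})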

Related

How do I create an internal link inside FAQ content on a site using Vue 3 and Inertia.js?

I have an array of content coming from a database that will be displayed on a page as a group of FAQs. Some of the content will have links to other internal pages on the site. How do I link to the pages using Inertia's link component so that a full page refresh doesn't happen?
It depends on what is returned after following the link. If the response returns a full view, the page is reloaded. If it returns a small JSON payload (such as the Inertia page object), it can be processed without a full page reload.

How to get data in dashboard with Scrapy?

I'm scraping some data about car rental from getaround.com. I recently saw that it is possible to get car availability with scrapy-splash from a calendar rendered with JavaScript. An example is given at this URL:
https://fr.getaround.com/location-voiture/liege/ford-fiesta-533656
The information I need is contained in the div tag with class owner_calendar_month. However, I saw that some data seem to be accessible in the div tag with class js_car_calendar calendar_large, in which the data-path attribute specifies /dashboard/cars/533656/calendar. Do you know how to access this path, and how to scrape the data within it using Scrapy?
If you visit https://fr.getaround.com/dashboard/cars/533656/calendar you get an error saying you have to be logged in to view the data. So first of all you would have to create a method in Scrapy to sign in to the website if you want to be able to scrape that data.
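A minimal sketch of such a login flow, assuming a standard HTML sign-in form (the sign-in URL and the form field names are assumptions; check the real form in your browser):

import scrapy

class GetaroundCalendarSpider(scrapy.Spider):
    name = "getaround_calendar"
    # assumed sign-in page; check the real URL in your browser
    start_urls = ["https://fr.getaround.com/users/sign_in"]

    def parse(self, response):
        # FormRequest.from_response picks up hidden inputs (e.g. CSRF tokens) automatically
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "user[email]": "you@example.com",   # assumed field name
                "user[password]": "your-password",  # assumed field name
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # the session cookies from the login are reused for this request
        yield scrapy.Request(
            "https://fr.getaround.com/dashboard/cars/533656/calendar",
            callback=self.parse_calendar,
        )

    def parse_calendar(self, response):
        for month in response.css("div.owner_calendar_month"):
            yield {"month": month.css("::text").getall()}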

How to get the text from span with scrapy

I am trying to get the "4" next to "Shipment unit" from this page: https://www.yesasia.com/global/hospital-playlist-ost-kihno-kit-album-99s-version/1090424142-0-0-0-en/info.html
I am only getting None.
response.xpath("//div[#class='infoContent']/table/tbody/tr[8]/td/span/text()").get()
This website uses JavaScript. You must render the JS first, and then you can use your selectors to extract the info. There is a tool called Splash (used via the scrapy-splash plugin) that can help you do this.
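A minimal scrapy-splash sketch, assuming a Splash instance is running on localhost:8050 (the row index comes from the question's XPath; verify it against the rendered page):

import scrapy
from scrapy_splash import SplashRequest

class YesasiaSpider(scrapy.Spider):
    name = "yesasia"
    custom_settings = {
        # requires a running Splash instance, e.g.: docker run -p 8050:8050 scrapinghub/splash
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100},
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        url = ("https://www.yesasia.com/global/hospital-playlist-ost-kihno-kit-album-99s-version/"
               "1090424142-0-0-0-en/info.html")
        # wait briefly so the JavaScript-rendered content is in the returned HTML
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # row index taken from the question; adjust if the rendered table differs
        yield {
            "shipment_unit": response.xpath(
                "//div[@class='infoContent']/table//tr[8]/td/span/text()"
            ).get()
        }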

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy, it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
The XHR URL is:
https://www.bahiablancapropiedades.com/buscar/resultados/0
While scrolling, each request returns 8 records. So get the total number of records with an XPath, divide it by 8, and that gives you the number of XHR requests to make. I had the same issue, and the following logic resolved it:
total = int(response.xpath('...').get())  # XPath of the displayed total-results number
pages = total // 8                        # 8 records per XHR request
for page in range(pages + 1):
    url = 'https://www.bahiablancapropiedades.com/buscar/resultados/' + str(page)
    yield scrapy.Request(url, callback=self.parse_page)  # pass each URL to your parsing function
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
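Inside a spider the same idea applies; here is a rough sketch (parse_property and the h1 selector are hypothetical placeholders):

import json
import scrapy
from parsel import Selector

class BahiaSpider(scrapy.Spider):
    name = "bahia"
    start_urls = ["https://www.bahiablancapropiedades.com/buscar/resultados/0"]

    def parse(self, response):
        data = json.loads(response.text)
        sel = Selector(text=data["view"])  # "view" holds the escaped HTML fragment
        for href in sel.css("a::attr(href)").getall():
            # follow each listing link and scrape its detail page
            yield response.follow(href, callback=self.parse_property)

    def parse_property(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}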

Having one spider use items returned from another spider?

So I've written a spider that extracts certain desired links from a webpage and puts the URL, link text, and other information not necessarily contained in the <a> tag itself, into an item for each link.
How should I pass this item onto another spider which scrapes the URL provided in that item?
This question has been asked many times.
Below are some links on this site that answer your question.
Some answer it directly, i.e. passing items to another function, but you may realise that you do not need to do it that way, so other methods are linked to show what's possible (see the sketch after the links).
Using multiple spiders at in the project in Scrapy
Scrapy - parse a page to extract items - then follow and store item url contents
Scrapy: Follow link to get additional Item data?
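In most cases you do not need a second spider at all: carry the partially-filled item to the next callback with cb_kwargs (or request.meta in older Scrapy versions). A rough sketch with hypothetical selectors:

import scrapy

class LinksSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com/listing"]  # placeholder start page

    def parse(self, response):
        for link in response.css("a.result"):  # hypothetical selector for the desired links
            item = {
                "url": response.urljoin(link.attrib["href"]),
                "link_text": link.css("::text").get(),
            }
            # carry the partially-filled item to the next callback instead of a second spider
            yield response.follow(link, callback=self.parse_detail, cb_kwargs={"item": item})

    def parse_detail(self, response, item):
        item["page_title"] = response.css("h1::text").get()  # hypothetical extra field
        yield item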