How to get the text from a span with Scrapy

I am trying to get the "4" next to "Shipment unit" on this page: https://www.yesasia.com/global/hospital-playlist-ost-kihno-kit-album-99s-version/1090424142-0-0-0-en/info.html
I am only getting None from:
response.xpath("//div[@class='infoContent']/table/tbody/tr[8]/td/span/text()").get()

This website renders its content with JavaScript, so you must execute the JS first; only then can you use selectors to extract the info. Splash is a rendering service (used through the scrapy-splash plugin) that can do this for you.
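A minimal sketch of how that could look with scrapy-splash, assuming a Splash instance is running and the plugin is configured in settings.py (the spider name and wait time below are placeholders):

import scrapy
from scrapy_splash import SplashRequest

class YesAsiaSpider(scrapy.Spider):
    name = "yesasia"  # hypothetical spider name

    def start_requests(self):
        url = "https://www.yesasia.com/global/hospital-playlist-ost-kihno-kit-album-99s-version/1090424142-0-0-0-en/info.html"
        # Render the page in Splash and give the JS time to finish.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # The same XPath now runs against the rendered HTML.
        yield {"shipment_unit": response.xpath("//div[@class='infoContent']/table/tbody/tr[8]/td/span/text()").get()}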

Related

How to get data in dashboard with Scrapy?

I'm scraping some data about car renting from getaround.com. I recently saw that it was possible to get car availability with scrapy-splash from a calendar rendered with JavaScript. An example is given at this URL:
https://fr.getaround.com/location-voiture/liege/ford-fiesta-533656
The information I need is contained in the div tag with class owner_calendar_month. However, I saw that some data seems to be accessible in the div tag with class js_car_calendar calendar_large, whose data-path attribute specifies /dashboard/cars/533656/calendar. Do you know how to access this path, and how to scrape the data within it using Scrapy?
If you visit https://fr.getaround.com/dashboard/cars/533656/calendar you get an error saying you have to be logged in to view the data. So first of all you would have to create a method in Scrapy to sign in to the website if you want to be able to scrape that data.
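A hedged sketch of such a sign-in step using FormRequest.from_response (the login URL, form field names, and credentials below are placeholders, not getaround's actual form):

import scrapy

class GetaroundSpider(scrapy.Spider):
    name = "getaround"  # hypothetical spider name
    start_urls = ["https://fr.getaround.com/users/sign_in"]  # assumed login page

    def parse(self, response):
        # Submit the login form found on the page; the field names are assumptions.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"user[email]": "you@example.com", "user[password]": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The session cookie set at login lets you request the calendar path.
        yield scrapy.Request(
            "https://fr.getaround.com/dashboard/cars/533656/calendar",
            callback=self.parse_calendar,
        )

    def parse_calendar(self, response):
        # Extract the availability data here.
        ...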

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given there looks pretty easy: it's a tidy JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks.
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for each one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
This is the XHR URL:
https://www.bahiablancapropiedades.com/buscar/resultados/0
While scrolling, the page loads 8 records per request. So get the total record count with an XPath and divide it by 8; that gives you the number of XHR requests to make. I ran into the same issue, and the logic below resolved it:
# Inside your spider's parse method (assumes "import scrapy" at module level).
# Hypothetical XPath for the total result count; adjust it to the real page.
pagination_count = response.xpath("//span[@class='results-count']/text()").get()
pages = int(pagination_count) // 8  # 8 records per XHR request
for pagination_value in range(pages + 1):
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(pagination_value)
    yield scrapy.Request(url, callback=self.parse)  # pass each URL to your Scrapy parse callback
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
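Inside a spider, the same idea could look like this (a sketch; parse_property is a hypothetical callback):

import json
import parsel
import scrapy

class BahiaSpider(scrapy.Spider):
    name = "bahia"  # hypothetical spider name
    start_urls = ["https://www.bahiablancapropiedades.com/buscar/resultados/0"]

    def parse(self, response):
        # The HTML fragment lives inside the JSON under the "view" key.
        sel = parsel.Selector(json.loads(response.text)["view"])
        for href in sel.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_property)

    def parse_property(self, response):
        # Extract the details of a single listing here.
        ...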

How to follow lazy loading with Scrapy?

I am trying to crawl a page that is using lazy loading to get the next set of items. My crawler follows normal links, but this one seems to be different:
The page:
https://www.omegawatches.com/de/vintage-watches
is followed by https://www.omegawatches.com/de/vintage-watches?p=2
But only if you load it within the browser. Scrapy will not follow the link.
Is there a way to make Scrapy follow pages 1, 2, 3, 4 automatically?
The page uses virtual scrolling, and the API it fetches its data from is
https://www.omegawatches.com/de/vintage-watches?p=1&ajax=1
It returns JSON data containing various details, including the products as HTML and, if a next page exists, an a tag with the class link next.
Increase the page number until there is no a tag with the link next class.
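A sketch of that loop (the "html" key holding the product markup and the product link selector are assumptions here; inspect the actual JSON to confirm them):

import json
import parsel
import scrapy

class OmegaVintageSpider(scrapy.Spider):
    name = "omega_vintage"  # hypothetical spider name
    start_urls = ["https://www.omegawatches.com/de/vintage-watches?p=1&ajax=1"]

    def parse(self, response):
        page = response.meta.get("page", 1)
        data = json.loads(response.text)
        # "html" is an assumed key for the product markup.
        sel = parsel.Selector(data.get("html", ""))
        for href in sel.css("a.product-item-link::attr(href)").getall():  # assumed selector
            yield {"url": href}
        # Keep paginating while the returned HTML still contains <a class="link next">.
        if sel.css("a.link.next"):
            yield scrapy.Request(
                "https://www.omegawatches.com/de/vintage-watches?p=%d&ajax=1" % (page + 1),
                callback=self.parse,
                meta={"page": page + 1},
            )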

Response has nothing in it

I have been following the Scrapy tutorial, trying to create a very simple web scraper for warframe.market. I have about a year of coding experience from school, but no Python experience. I simply want to get the price of an item from the website. I used the following to scrape the page:
scrapy shell "https://warframe.market/items/hydroid_prime_set"
then I inspected the web page to find the individual elements that I am trying to scrape. I used this command to try to view the results I wanted:
response.css("div.order-row.d-flex.col-12").extract()
This did not work, so I used view(response) to see what I had scraped, and my cmd just waits endlessly at this point.
Is HTTPS stopping me from scraping? Am I using the wrong CSS selector in my response? Is the webpage too big? Could someone please show me where I went wrong?
Thanks
The response isn't empty, but it's rendered using JavaScript (you can verify this by inspecting response.body). The data itself is embedded in the page as JSON, so you can parse it directly; for example, try this in the shell:
import json
# The order data is embedded as JSON in the element with id "application-state".
data = json.loads(response.css('#application-state::text').extract_first())
for order in data.get('payload', {}).get('orders', []):
    print('"{}" price: {}'.format(order.get('user', {}).get('ingame_name'),
                                  order.get('platinum')))

Missing Angularjs HTML elements in Phantomjs

I am crawling a website. I needed to change the date of a given job when I suddenly realized that the element was missing. When I took a screen capture, the element really was missing. Is there any way to render that element? The website runs AngularJS; I noticed the ng- attributes in the HTML code. Here are the pictures: the first one is the desktop capture and the second one is from PhantomJS.
Normal date in web browser
No date, straight to the next label
I found how to solve this: just wait for AngularJS to load in PhantomJS. It takes roughly 5 seconds before it loads. The best thing to do here is a setTimeout function.
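If you drive PhantomJS from Python instead (for example through an older Selenium release that still ships the PhantomJS driver), the same fixed-wait idea looks like this sketch (the URL is a placeholder):

import time
from selenium import webdriver

driver = webdriver.PhantomJS()  # deprecated in newer Selenium releases
driver.get("https://example.com/jobs")  # placeholder URL for the AngularJS page
time.sleep(5)  # give AngularJS roughly 5 seconds to render, mirroring setTimeout
html = driver.page_source  # the date element should now be present in the DOM
driver.quit()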