Return image contents from Scrapy-Splash - scrapy

I'm using Scrapy-Splash requests to get a rendered screenshot of a page, but I also need the images on that page. I use the pipelines to download those images, but I was thinking - does this not make two requests for the same image? Once when Splash is rendering the page and once when I send a download request. Is there a way I can get the images returned by the Scrapy-Splash request?

You can enable response bodies (use either the response_body argument or splash.response_body_enabled=True) and then extract the images from the HAR export.
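A minimal sketch of that approach, assuming the render.json endpoint with HAR recording enabled; the spider and callback names are illustrative, and the HAR field layout follows the standard HAR format:

import base64

import scrapy
from scrapy_splash import SplashRequest

class ImageSpider(scrapy.Spider):
    name = 'splash_images'

    def start_requests(self):
        yield SplashRequest(
            'https://example.com',
            callback=self.parse,
            endpoint='render.json',
            args={'har': 1, 'response_body': 1},
        )

    def parse(self, response):
        # response.data holds the decoded render.json payload, including the HAR log
        for entry in response.data['har']['log']['entries']:
            content = entry['response']['content']
            if content.get('mimeType', '').startswith('image/'):
                # bodies are base64-encoded in the HAR export
                image_bytes = base64.b64decode(content.get('text', ''))
                # ... save image_bytes instead of re-downloading the image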

Related

I am trying to create a clone of the YouTube app using the YouTube API v3. How do I get the list of thumbnails together with the info that accompanies videos on the original YouTube page?

I am using this request:
https://www.googleapis.com/youtube/v3/search?part=snippet&type=video&q={search value}&maxResults=3&key={key}
This way I get JSON with the video thumbnails and descriptions I need, but it contains no URL for the channel picture and no info about how many views a video has.
I found a way to get the channel logo URL in a separate request:
https://www.googleapis.com/youtube/v3/search?part=snippet&channelId={id}
but it doesn't seem right to make an additional separate request for every single logo in the list, and another for the view statistics.
Also, is it possible to get the embedded video URLs in the same request as well?
You are right, it's impossible to get the associated channel thumbnail URL from a search query (q=${searchValue}) in a single request.
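However, since videos.list and channels.list both accept comma-separated IDs, you can cover all the search results with just two follow-up requests instead of one per item. A hedged sketch with the requests library; the key and query values are placeholders:

import requests

API = 'https://www.googleapis.com/youtube/v3'
KEY = '{key}'  # your API key

# 1) the search you already make
search = requests.get(f'{API}/search', params={
    'part': 'snippet', 'type': 'video', 'q': '{search value}',
    'maxResults': 3, 'key': KEY,
}).json()

video_ids = [item['id']['videoId'] for item in search['items']]
channel_ids = [item['snippet']['channelId'] for item in search['items']]

# 2) one request for all view counts (items[].statistics.viewCount)
stats = requests.get(f'{API}/videos', params={
    'part': 'statistics', 'id': ','.join(video_ids), 'key': KEY,
}).json()

# 3) one request for all channel logos (items[].snippet.thumbnails)
channels = requests.get(f'{API}/channels', params={
    'part': 'snippet', 'id': ','.join(channel_ids), 'key': KEY,
}).json()

As for the embedded video URL, it can be derived directly from the video ID as https://www.youtube.com/embed/{videoId}, so it needs no extra request.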

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy; it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks.
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for each one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL. While scrolling, the page loads 8 records per request.
So get the total number of records via XPath and divide it by 8; that gives you the number of XHR requests to make. I ran into the same issue, applied the logic below, and it resolved it:

pagination_count = response.xpath('...').get()  # XPath of the total count shown on the page
value = int(pagination_count) // 8  # 8 records per XHR request
for pagination_value in range(value + 1):
    url = 'https://www.bahiablancapropiedades.com/buscar/resultados/' + str(pagination_value)
    yield scrapy.Request(url, callback=self.parse)  # pass each URL to your Scrapy callback
It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
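Inside a spider, the same idea looks roughly like this; parse_property is a hypothetical callback for the per-item pages:

import json
from parsel import Selector

# inside your Spider class
def parse_results(self, response):
    # the JSON wraps the rendered HTML under the "view" key
    html = json.loads(response.text)['view']
    sel = Selector(text=html)
    for href in sel.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.parse_property)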

Why is there a time span between different network requests?

I'm optimizing the loading times in a web app and I don't know what the problem is. Firebug's Net panel is showing time gaps between requests.
Can someone explain this chart to me?
The gap between the requests can have two reasons:
Time needed to parse the requested page
When you request a URL, the browser needs to parse the returned contents to check whether they contain URLs to other resources like JavaScript files, CSS files, images, etc. Subsequently requested resources need to be parsed, too. For example, CSS files can contain references to images, so the contents of the CSS file first need to be parsed to get those URLs.
Dynamically requested resources
Using JavaScript, resources can be requested asynchronously. These requests can be triggered e.g. through AJAX or by dynamically inserting DOM nodes like <img src="xyz.png" alt=""> into the page.

Scrape the response that is loaded by an AJAX event

I am using the Scrapy tool to scrape content from a website, and I need your help with how to scrape a response that is dynamically loaded via AJAX.
When content loads via AJAX, the URL doesn't change, it stays the same, but the content changes, and it is on that event that I need to crawl.
Thank you,
G.kavirajan
# POST the AJAX form to fetch the "new" tab content
yield FormRequest(
    'http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php',
    formdata={'type': 'new', 'ajax': '1'},
    callback=self.your_callback_method)
Below are the URLs that you can easily catch using Fiddler or Firebug:
this is for the featured tab: http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=random
this is for the new tab: http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=new
You can request these URLs directly to get the results you require. Although the website uses POST requests to fetch data for these URLs, I tried a GET request with the same parameters and it also works properly.
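For example, inside your spider (parse_tab is a hypothetical callback):

# GET the AJAX endpoint for the "new" tab directly
yield scrapy.Request(
    'http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=new',
    callback=self.parse_tab)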

Is it possible to do low-level pixel inspection with Selenium?

Is it possible to inspect the value of a specific pixel in the browser-rendered page with Selenium? Can I get a buffer of the rendered page as an image?
Also, is it possible to send mouse-scroll-down commands to the browser?
Cheers
You can save a screenshot of the entire page and then manipulate the image file.
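A minimal sketch with Selenium and Pillow, assuming Chrome; note that in most drivers save_screenshot captures only the visible viewport, and scrolling can be done through JavaScript:

from selenium import webdriver
from PIL import Image

driver = webdriver.Chrome()
driver.get('https://example.com')

# save the rendered viewport as a PNG file
driver.save_screenshot('page.png')

# inspect a single pixel with Pillow
img = Image.open('page.png')
print(img.getpixel((100, 200)))  # colour value at x=100, y=200

# scroll down by executing JavaScript
driver.execute_script('window.scrollBy(0, 300);')

driver.quit()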