Splash not rendering page completely - scrapy

I have been learning to use Scrapy + Splash to scrape web pages that rely on JavaScript. I have a problem with one site, https://aukro.cz/mobilni-telefony: it does not render completely. I get the whole page, but with an empty list of products.
[Image: scraped page]
I have already tried changing the wait time and scrolling the page, with no effect. The Lua script is below:
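function main(splash, args)
  -- load the page, then give its JavaScript 20 seconds to run
  splash:go(args.url)
  assert(splash:wait(20))
  -- scroll down 300 px and take the screenshot
  local scroll_to = splash:jsfunc("window.scrollTo")
  scroll_to(0, 300)
  return {png=splash:png()}
end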
What else should I do? Thanks for help in advance.

Related

Scrapy Splash - Not Rendering the full content

I'm trying to scrape this site: https://tucson.craigslist.org/search/acc?postedToday=1#search=1~list~0~0. When I try it in the Splash web console with a wait time of 30 seconds, it sometimes renders the full page and sometimes does not. When I run it in Scrapy, it does not render the JavaScript-generated part at all; it just returns the content received on the first call to the page.
Can someone help me with this?
Notes:
This site uses localStorage to store the results and renders from there.
I deployed the Splash instances using Aquarium (three instances, 1 slot per instance, timeout 3600, private mode disabled).

Can scrapy-splash Ignore 504 HTTP Status?

I want to scrape web pages that load their content with JavaScript, so I use scrapy-splash, but some pages take a long time to load.
Like this:
I think the [processUser..] requests are what make it slower.
Is there any way to ignore those 504 pages? When I set the timeout to less than 90, I get a 504 Gateway Timeout error in the Scrapy shell or in spiders.
And can I get the resulting HTML (only responses with status 200) once the time I set has elapsed?
There's a mechanism in Splash to abort a request before it starts loading the body, which you can leverage using the splash:on_response_headers hook. However, in your case this hook can only catch and abort the request once the status and headers are in, and that is after it has finished waiting for the gateway timeout (504). So instead you might want to use the splash:on_request hook to abort the request before it's even sent, like so:
function main(splash, args)
  splash:on_request(function(request)
    if request.url:find('processUser') then
      request:abort()
    end
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    har = splash:har(),
  }
end
UPD: Another, and perhaps better, way to go about this is to set splash.resource_timeout before any requests take place:
function main(splash, args)
  splash.resource_timeout = 3
  ...
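The two ideas above (aborting requests by URL and setting a per-resource timeout) can be combined in one script and sent from Python. The following is only a minimal sketch: the helper name build_execute_payload and the custom arguments blocked_pattern and resource_timeout are assumptions made for illustration, not part of Splash or scrapy-splash. It relies on the Splash execute endpoint accepting lua_source and timeout and exposing any extra arguments to the script through args.

```python
import json

# Lua script for Splash's "execute" endpoint. It reads two custom
# arguments (names chosen for this sketch): args.resource_timeout and
# args.blocked_pattern.
LUA_SOURCE = """
function main(splash, args)
  -- give up on any single resource after this many seconds
  splash.resource_timeout = args.resource_timeout
  -- abort requests whose URL contains the blocked substring (plain match)
  splash:on_request(function(request)
    if request.url:find(args.blocked_pattern, 1, true) then
      request:abort()
    end
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""


def build_execute_payload(url, blocked_pattern="processUser",
                          resource_timeout=3, timeout=90):
    """Build the JSON-serializable arguments for a Splash 'execute' call."""
    return {
        "lua_source": LUA_SOURCE,
        "url": url,
        "timeout": timeout,                    # overall render timeout, seconds
        "blocked_pattern": blocked_pattern,    # custom arg read by the script
        "resource_timeout": resource_timeout,  # custom arg read by the script
    }


if __name__ == "__main__":
    payload = build_execute_payload("http://example.com")
    # Print everything except the script itself to keep the output short.
    print(json.dumps({k: v for k, v in payload.items() if k != "lua_source"}))
```

With scrapy-splash, a dictionary like this would typically be passed as SplashRequest(url, callback, endpoint='execute', args=payload); without Scrapy, it could be POSTed as JSON to the execute endpoint of a Splash instance (for example http://localhost:8050/execute, assuming a default local setup).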
When you are using Splash to render a webpage you are basically using a web browser.
When you ask Splash to render http://example.com:
1. Splash goes to http://example.com
2. Splash executes all of the JavaScript
   2.1 JavaScript makes some requests
   2.2 some requests return 50x codes
3. Splash returns page data
Unfortunately, Splash right now does not support any custom rules for blocking JavaScript requests. It just takes the page and does everything your browser would do without any add-ons: load everything without question.
All that being said, it's highly unlikely that those 50x requests are slowing down your page load; if they are, it shouldn't be by a significant amount.

Missing Angularjs HTML elements in Phantomjs

I am crawling a website. I need to change the date of a given job, but I realized that the element is missing. When I take a screen capture, the element really is missing. Is there any way to render that element? The website runs on AngularJS, as I noticed ng in the HTML code. Here are the pictures: the first one is the desktop capture and the second one is from PhantomJS.
[Image: normal date in web browser]
[Image: no date, straight to the next label]
I found how to solve this: just wait for AngularJS to load in PhantomJS. It takes an estimated 5 seconds to load. The best thing to do here is to use the setTimeout function.
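Since the 5 seconds above is only an estimate, a common alternative to a single fixed delay is to poll for the element until it appears or a deadline passes. The following is a minimal, generic sketch of that idea in Python rather than PhantomJS code; wait_until and the element_is_rendered check in the usage note are hypothetical names introduced here.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value or
    `timeout` seconds have passed. Returns True if the condition was met,
    False if the deadline expired first."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would pass a check such as wait_until(element_is_rendered, timeout=10), where element_is_rendered is whatever function reports that the date element exists on the page. This returns as soon as the element shows up instead of always waiting the full delay.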

Splash get html before finish Lua Script

I have a web page with strict AJAX pagination (there is only a button for the next page).
To get to, for example, page 5, the script has to press the Next button 5 times.
But after the script clicks, the data for the current page is lost.
Is it possible to return HTML content from the Lua script to Scrapy and then continue running the script?
Right now I use a bad approach: I merge the HTML for each page inside the Lua script and return it after the last page. I don't think that is a good solution.

yii change page without background slideshow refresh

I have a website which uses backgroundstretch to display a slideshow on the background of the website. I use a normal Yii website structure where the content is displayed by: , according to the url.
Now because of the reload the slideshow starts all over again when I go to another page. Is there a way to display the new pages without having to reload the background?
Thanks in advance!
It might be possible to load pages using ajax and replace the existing content on top of your rotating slide show, but it would likely be a big architectural change to your website.
Instead you might try storing the current slideshow index in a cookie, and when you load your next page, start the slideshow on the current image instead of the first image.