Scrapy Splash - not rendering the full content

I'm trying to scrape this site: https://tucson.craigslist.org/search/acc?postedToday=1#search=1~list~0~0. When I try it in the Splash web console with a wait time of 30 seconds, it sometimes renders the full page and sometimes it doesn't. When I run it in Scrapy, the JavaScript part is not rendered at all; it just returns the initial content (the content received on the first request to the page).
Can someone help me with this?
Note:
This site uses localStorage to store the results and renders from there.
I deployed the Splash instances using Aquarium (three instances, 1 slot per instance, a timeout of 3600, and private mode disabled).
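Since the page is rendered client-side from localStorage, a fixed wait is fragile: the results may or may not be in the DOM when the timer fires. A more robust pattern is to have the Lua script poll until the result elements actually appear. The sketch below is an assumption-laden illustration, not taken from the question: the selector `.cl-search-result` is a guess at the rendered listing rows, and the payload shape targets Splash's `/execute` endpoint.

```python
import json

# Lua script that polls for the JS-rendered result list instead of using
# a fixed wait. NOTE: '.cl-search-result' is an ASSUMED selector for the
# rendered listing rows; inspect the actual page and adjust it.
LUA_WAIT_FOR_RESULTS = """
function main(splash, args)
  assert(splash:go(args.url))
  -- poll up to ~30 s (60 x 0.5 s) for the list rendered from localStorage
  for _ = 1, 60 do
    local n = splash:evaljs(
      "document.querySelectorAll('.cl-search-result').length")
    if n > 0 then break end
    assert(splash:wait(0.5))
  end
  return {html = splash:html()}
end
"""

def splash_execute_payload(url, lua_source=LUA_WAIT_FOR_RESULTS):
    """Build the JSON body for a POST to Splash's /execute endpoint."""
    return json.dumps({"url": url, "lua_source": lua_source})
```

From a Scrapy spider, the same script would typically be passed as the `lua_source` argument of a `SplashRequest` with `endpoint='execute'`.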

Related

Is it possible to track if a page has finished loading with my current implementation of webView?

Picture of WebView
Hello, I am working on a login page for my app. Right now the webView login page works! However, there is one big issue with the current implementation.
Basically, for my login I call a series of functions:
1. Open a hidden website in a webView,
2. Inject JavaScript to log in to the website,
3. Change to the page of the website containing the data,
4. Extract the data,
5. Push to a new view, as the login was successful.
Now this all works, except that I had to hardcode the time for each function to take place using DispatchQueue.main.async. This is of course problematic because some of the functions vary in duration (for example, the time it takes to load the webpage), so my login succeeds only about 75% of the time. I need a way to track when the webView has finished loading so I can call the next function only once loading is done. However, every webView implementation I have seen with this feature uses a completely different structure, and when I tried those other structures I could not make certain things work, like my login function that uses evaluateJavascript.
Is there any way to get this behavior by adding something to my current implementation? Thanks!

How to use Scrapy-Splash

I'm trying to use Scrapy to crawl a site whose URL does not change as you page through it. I used Splash to simulate a CLICK; however, it only takes me to the second page. How can I keep advancing to the next page, and how do I crawl a site like that?
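One way to handle click-driven pagination is to do the clicking inside the Lua script itself, collecting the HTML after each click and returning all pages in one response. The sketch below is illustrative only: `a.next-page` is an assumed selector for the pagination control, and the loop bound comes from a hypothetical `max_pages` argument.

```python
import json

# Lua script that clicks the "next page" control repeatedly and returns
# the HTML after each click. NOTE: 'a.next-page' is an ASSUMED selector
# for the pagination element; replace it with the real one.
LUA_CLICK_THROUGH_PAGES = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(1))
  local pages = {splash:html()}
  for _ = 2, args.max_pages do
    local next_btn = splash:select('a.next-page')
    if not next_btn then break end
    next_btn:mouse_click()
    assert(splash:wait(1))
    pages[#pages + 1] = splash:html()
  end
  return {pages = pages}
end
"""

def click_pages_payload(url, max_pages=5):
    """Build the JSON body for Splash's /execute endpoint."""
    return json.dumps({"url": url, "max_pages": max_pages,
                       "lua_source": LUA_CLICK_THROUGH_PAGES})
```

The callback then receives `data['pages']`, a list of HTML snapshots, one per rendered page, which can each be fed to a normal Scrapy selector.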

Can scrapy-splash ignore a 504 HTTP status?

I want to scrape JavaScript-loaded web pages, so I use scrapy-splash, but some pages have very long load times.
Like this:
I think the [processUser..] requests are what make it slower.
Is there any way to ignore those 504 pages? When I set the timeout to less than 90, I get a 504 Gateway Timeout error in the Scrapy shell or in spiders.
And can I still get the resulting HTML (only the 200 responses) when the time I set runs out?
There's a mechanism in Splash to abort a request before it starts loading the body, which you can leverage using the splash:on_response_headers hook. However, in your case this hook can only catch and abort the page once the status and headers are in, and that is after it has finished waiting for the gateway timeout (504). So instead you may want the splash:on_request hook, which aborts the request before it's even sent, like so:
function main(splash, args)
  splash:on_request(function(request)
    if request.url:find('processUser') then
      request:abort()
    end
  end)
  assert(splash:go(args.url))
  assert(splash:wait(.5))
  return {
    har = splash:har(),
  }
end
UPD: Another, and perhaps better, way to go about this is to set splash.resource_timeout before any requests take place:
function main(splash, args)
  splash.resource_timeout = 3
  ...
When you use Splash to render a webpage, you are basically using a web browser.
When you ask Splash to render http://example.com:
1. Splash goes to http://example.com
2. Splash executes all of the JavaScript
2.1 the JavaScript makes some requests
2.2 some of those requests return 50x codes
3. Splash returns the page data
Unfortunately, Splash does not currently support custom rules for blocking JavaScript requests out of the box: it just takes the page and does everything your browser would do without any add-ons, loading everything without question.
All that being said, it's highly unlikely that those 50x requests are slowing down your page load; if they are, it shouldn't be by a significant amount.
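On the Scrapy side, the same URL filter from the Lua hook can be expressed in Python, and 504 responses can be allowed through to the callback instead of being dropped by the HTTP error middleware via the spider's `handle_httpstatus_list` attribute. A minimal sketch, assuming (as in the question) that the slow sub-requests all contain "processUser" in their URL:

```python
SLOW_URL_MARKER = "processUser"  # substring taken from the question's screenshot

def should_abort(url: str) -> bool:
    """Mirror of the Lua on_request filter: True for requests to abort."""
    return SLOW_URL_MARKER in url

# In a real spider, 504 responses can be let through to the callback by
# listing them on the spider class (hypothetical spider name):
#
#     class MySpider(scrapy.Spider):
#         handle_httpstatus_list = [504]
```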

Splash not rendering page completely

I have been learning to use scrapy + splash for scraping web pages with JS. I have a problem with one site - https://aukro.cz/mobilni-telefony - it does not render completely. Instead I get the whole page with an empty list of products.
scraped page
I have already tried modifying the wait time and scrolling the page, with no effect. Lua script below:
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(20))
  -- scroll down to trigger any lazy loading, then give it time to finish
  local scroll_to = splash:jsfunc("window.scrollTo")
  scroll_to(0, 300)
  assert(splash:wait(2))
  return {png = splash:png()}
end
What else should I do? Thanks in advance for any help.
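One thing worth trying instead of a single long wait is to poll for the product elements and render the full page height, since a partial viewport can leave below-the-fold content unrendered in the screenshot. This is a sketch under assumptions: `.auk-item-card` is a guessed selector for a product tile on aukro.cz and should be checked against the live page.

```python
import json

# Lua script: poll until product items appear (instead of one fixed 20 s
# wait) and expand the viewport to the full page before the screenshot.
# NOTE: '.auk-item-card' is an ASSUMED product-tile selector; verify it.
LUA_WAIT_FOR_PRODUCTS = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:set_viewport_full()
  -- poll up to ~20 s (40 x 0.5 s) for the product list to render
  for _ = 1, 40 do
    local n = splash:evaljs(
      "document.querySelectorAll('.auk-item-card').length")
    if n > 0 then break end
    assert(splash:wait(0.5))
  end
  return {png = splash:png(), html = splash:html()}
end
"""

def aukro_payload(url):
    """Build the JSON body for Splash's /execute endpoint."""
    return json.dumps({"url": url, "lua_source": LUA_WAIT_FOR_PRODUCTS})
```

Returning the HTML alongside the PNG also makes it easy to check whether the product markup ever arrived, which separates "not rendered" from "rendered but not screenshotted".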

Missing Angularjs HTML elements in Phantomjs

I am crawling a website. I needed to change the date of a given job when I suddenly realized that the element was missing. When I screen-capture it, the element really is missing. Is there any way to render that element? The website runs on AngularJS; I noticed the ng attributes in the HTML code. Here are the pictures: the first one is the desktop capture and the second one is from PhantomJS.
Normal date in web browser
No date, straight to the next label
I found out how to solve this: just wait for AngularJS to load in PhantomJS. It takes an estimated 5 seconds to load. The best thing to do here is to use the setTimeout function.