Can scrapy-splash Ignore 504 HTTP Status?

I want to scrape JavaScript-loaded web pages, so I use scrapy-splash, but some pages take a very long time to load.
I think the [processUser..] requests are what make it slow.
Is there any way to ignore the pages that return 504? When I set the timeout to less than 90 seconds, I get a 504 Gateway Timeout error in the Scrapy shell and in my spiders.
And can I still get the resulting HTML (just the 200 responses) once the timeout I set expires?

There's a mechanism in Splash to abort a request before it starts loading the body, which you can leverage with the splash:on_response_headers hook. In your case, however, that hook can only catch and abort the page once the status and headers are in, which is after Splash has already waited out the gateway timeout (504). So instead you might want the splash:on_request hook, which aborts the request before it is even sent:
function main(splash, args)
    -- Abort any request whose URL matches 'processUser' before it is sent
    splash:on_request(function(request)
        if request.url:find('processUser') then
            request:abort()
        end
    end)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        har = splash:har(),
    }
end
UPD: Another, and perhaps better, way to go about this is to set splash.resource_timeout before any requests take place:
function main(splash, args)
    splash.resource_timeout = 3
    ...
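For reference, a complete script with the timeout, following the same flow as the first example, might look like this (3 seconds is just an illustrative value; tune it to your pages):

function main(splash, args)
    -- Any single resource taking longer than 3 seconds is dropped,
    -- instead of stalling the render until the 90s gateway timeout
    splash.resource_timeout = 3
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        har = splash:har(),
    }
end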

When you are using Splash to render a webpage, you are basically using a web browser.
When you ask Splash to render http://example.com:
1. Splash goes to http://example.com
2. Splash executes all of the JavaScript
   2.1. The JavaScript makes some requests
   2.2. Some of those requests return 50x codes
3. Splash returns the page data
Unfortunately, Splash does not currently support custom rules for blocking JavaScript requests out of the box: it just takes the page and does everything your browser would do without any add-ons, loading everything without question.
All that being said, it's highly unlikely that those 50x requests are slowing down your page load; if they are, the impact shouldn't be significant.

Related

Scrapy Splash - Not Rendering the full content

I'm trying to scrape this site: https://tucson.craigslist.org/search/acc?postedToday=1#search=1~list~0~0. When I try it in the Splash web console with a wait time of 30 seconds, sometimes it renders the full page and sometimes it doesn't. When I run it in Scrapy, it doesn't render the JavaScript at all; it just returns the initial content (the content received on the first call to the page).
Can someone help me with this?
Note:
This site uses localStorage to store the results and renders from there.
I deployed the Splash instances using Aquarium (three instances, 1 slot per instance, 3600 timeout, private mode disabled).
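One thing worth trying instead of a fixed 30-second wait is polling for the rendered markup from the Lua script. This is only a sketch, and the '.result-info' selector is a guess at what the rendered listing contains:

function main(splash, args)
    assert(splash:go(args.url))
    -- Poll up to ~30s for the JS-rendered results instead of a fixed wait
    for _ = 1, 60 do
        local ready = splash:evaljs(
            "document.querySelectorAll('.result-info').length > 0")
        if ready then break end
        splash:wait(0.5)
    end
    return splash:html()
end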

Is it possible to track if a page finished loading with my current implementation of webView?

Hello, I am working on a login page for my app. Right now the webView login page works! However, there is one big issue with the current implementation.
Basically, for my login I call a series of functions:
1. Open a hidden website with a webView
2. Inject JavaScript to log in to the website
3. Change to the page of the website containing the data
4. Extract the data
5. Push to a new view, as the login was successful
Now this all works, except that I had to hardcode the time each function takes using DispatchQueue.main.async. This is problematic because some of the functions vary in duration, for example the time it takes to load the webpage, which means my login only succeeds about 75% of the time. I need a way to detect when the webView has finished loading, so I can call the next function only once loading is done. However, every webView example I have seen with this feature uses a completely different structure, and when I tried those structures I could not get certain things to work, like my login function that uses evaluateJavascript.
Is there any way to add this feature to my current implementation? Thanks!
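One common approach, sketched below under the assumption that you are using WKWebView (the names LoginWebView and onPageLoaded are made up for illustration), is to set a WKNavigationDelegate and chain each next step off the didFinish callback instead of a timed DispatchQueue delay:

import WebKit

final class LoginWebView: NSObject, WKNavigationDelegate {
    let webView = WKWebView()

    /// Assign the next step of the login sequence here; it runs
    /// when the page has actually loaded, not after a fixed delay.
    var onPageLoaded: ((WKWebView) -> Void)?

    override init() {
        super.init()
        webView.navigationDelegate = self
    }

    func load(_ url: URL) {
        webView.load(URLRequest(url: url))
    }

    // WebKit calls this when the main frame finishes loading.
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        onPageLoaded?(webView)
    }
}

Your existing evaluateJavaScript calls can run inside onPageLoaded. One caveat: didFinish fires when the main frame loads, so a page that keeps building its DOM via JavaScript afterwards may still need a short extra wait or a JS-side readiness check.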

(Karate) How to intercept the XHR request response code?

I am testing login functionality on a third-party website. I have this URL: example.com/login. When I copy and paste it into the browser (Chrome), the page sometimes loads and sometimes does not (an empty, blank white page).
The problem is that I have to run a script on this page to click one of the elements (all the elements are embedded inside #shadow-root). If the page loads, no problem: the script is evaluated successfully. But the page sometimes does not load, an XHR request returns a 404, and as a result my * eval(script("script")) step returns "js eval failed...".
The solution I found is to refresh the page, and to do that I am considering capturing the XHR response: if the status code is 404, refresh the page; if not, continue with the following steps.
I think this may work, but I do not know how to implement Karate's HTTP request interception. And first of all, is this even doable?
I have looked at the documentation here but could not understand the examples:
https://github.com/karatelabs/karate/tree/master/karate-netty
Meanwhile, if there is another way of refreshing the page conditionally, I would be more than happy to hear about it. Thanks, everyone, in advance.
First, using JavaScript you should be able to handle shadow roots: https://stackoverflow.com/a/60618233/143475
The answer above links to advanced examples of executing JS in the context of the current page. I suggest you research that, ideally with the help of someone who knows JS, the DOM, and HTML well; you should be able to find a way to tell whether the XHR succeeded, e.g. based on whether some element on the page has changed.
Finally, here is how you can do interception: https://stackoverflow.com/a/61372471/143475
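As a lighter-weight alternative to full interception, a conditional refresh based on what actually rendered might look something like this sketch (the #shadow-host selector, the retry counts, and the use of driver.refresh() from a JS expression are assumptions, not from the original post):

* driver 'https://example.com/login'
# wait for the document itself to settle
* retry(10, 1000).waitUntil("document.readyState == 'complete'")
# check whether the element hosting the #shadow-root actually rendered
* def loaded = script("!!document.querySelector('#shadow-host')")
# if not, reload the page once before continuing
* if (!loaded) driver.refresh()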

How to open a page with Phantomjs without running js or making subsequent requests?

Is there a way to just load the server generated HTML (without any js or images)?
The docs seem a little sparse
The strength of PhantomJS is exactly its ability to emulate a real browser, which opens a page and makes all the subsequent requests. If you want just the HTML, curl or wget may be a better fit.
Nevertheless, there is a way to avoid running JS or loading images: set the corresponding page settings (http://phantomjs.org/api/webpage/property/settings.html):
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;
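Put together, a minimal script might look like this (example.com stands in for your actual URL):

var page = require('webpage').create();

// Disable JS execution and image loading before opening the page
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;

page.open('http://example.com/', function (status) {
    if (status === 'success') {
        // page.content is the server-generated HTML only,
        // since no scripts ran and no images were fetched
        console.log(page.content);
    }
    phantom.exit();
});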

Performance testing a web app using Jmeter

I'm new to JMeter. I tried to performance test a web app using it.
It has 4 pages:
Login page (HTTP Authorization Manager)
Page 1
Page 2
Page 3
When I use my app in real time, it takes too long (> 2 sec) to go from one page to another, but the JMeter results show the pages loading quickly (avg time 668 ms).
Is JMeter hitting the pages individually (i.e., login page to page 1, login page to page 2, etc.)?
What I want to know is how my app performs with more samples for the scenario below.
Sequence: Login - go to page 1 - click a link - go to page 2 - click a link - go to page 3
Or is there any way to record a sequence and load test it with 100 users or so?
When I use my app in real time, it takes too long (> 2 sec) to go from one page to another, but the JMeter results show the pages loading quickly (avg time 668 ms).
There are some reasons why JMeter is faster:
JMeter fetches only the HTML page; a browser also loads images and other resources
JMeter doesn't render the HTML or execute JS; a browser does
Make some changes to your JMeter script:
Add an HTTP Cookie Manager
Add an HTTP Cache Manager
Add HTTP Request Defaults
Move the Login page under a Once Only Controller (as you won't log in on every iteration, right?); the resulting plan is sketched below
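With those changes, the test plan would be structured roughly like this (a sketch; the thread count is just an example):

Test Plan
└── Thread Group (e.g. 100 users)
    ├── HTTP Cookie Manager
    ├── HTTP Cache Manager
    ├── HTTP Request Defaults
    ├── Once Only Controller
    │   └── Login page (HTTP Authorization Manager)
    ├── Page 1
    ├── Page 2
    └── Page 3

Each thread then executes Login once, followed by Page 1-3 in sequence on every iteration, which matches the user flow you described.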