How to set cookies in Scrapy+Splash when JavaScript makes multiple requests?

When the JavaScript is loaded, it makes another AJAX request, and cookies should be set in its response. However, Splash does not keep any cookies across multiple requests. Is there a way to keep the cookies across all requests, or even to assign them manually between requests?

Yes, there is an example in the scrapy-splash README - see the Session Handling section. In short, first make sure all the required settings are in place, then use SplashRequest(url, endpoint='execute', args={'lua_source': script}) to send Scrapy requests. The rendering script should look like this:
function main(splash)
    splash:init_cookies(splash.args.cookies)
    -- ... your script
    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
There is also a complete example with cookie handling, header handling, etc. in the scrapy-splash README - see the last example there.
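For reference, the settings the README asks for look like this in settings.py (a sketch, assuming Splash runs locally on its default port 8050):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

And a minimal spider sketch tying it together (the spider name and URL are placeholders; SplashCookiesMiddleware picks up the cookies returned by the Lua script and re-sends them on subsequent SplashRequests):

import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""

class CookieSpider(scrapy.Spider):
    name = 'cookie_example'

    def start_requests(self):
        yield SplashRequest(
            'http://example.com',
            self.parse_result,
            endpoint='execute',
            args={'lua_source': lua_script},
        )

    def parse_result(self, response):
        # response.data is the table returned by the Lua script
        self.logger.info('Got cookies: %s', response.data['cookies'])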

Related

How to cache a whole Next.js HTML page in Redis?

I'm new to Next.js. I'm trying to cache the whole HTML page so that I can reduce the response time on the next call for that page.
I tried creating a custom server and saving the response that came from renderToHTML()/render(), but it didn't return any response.
Redis is not for caching whole HTML pages, or any other complex object types for that matter. It's a key–value database; in Redis you should keep your objects simple.
For your case, it's best to use stale-while-revalidate (SWR) Cache-Control headers in combination with getServerSideProps for server rendering in Next.js.
See here for an example.
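As an illustrative example (the numbers are arbitrary), getServerSideProps can set a response header like:

Cache-Control: public, s-maxage=10, stale-while-revalidate=59

With this, the cached page is served as fresh for 10 seconds; for up to 59 seconds after that, a stale copy may be served while the page is re-rendered in the background.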

How does scrapy-splash filter duplicates?

When using the scrapy-splash library to render JS, we add its custom DUPEFILTER_CLASS to the settings.py file:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
This seems to be used to filter out duplicate requests so that fewer requests are sent and the crawl is sped up. But what is the basis for filtering requests when using scrapy-splash? Is it the URL?
Duplicates are detected using the splash_request_fingerprint function. From looking at the code and issue 900 (still open), the URL is taken into account, but you also have the option of passing a meta parameter to the request if you want to differentiate it from another request with the same URL. We also have to look at scrapy.utils.request.request_fingerprint, because that function is called as well.
What is part of the fingerprint:
the URL of the request
the request method
the request's body
What's not part of the fingerprint:
the HTTP request headers (since include_headers is None by default)
URL fragments, which by default are not used to compute the fingerprint, unless request.meta['splash']['args'] contains the key url (the Splash URL is then canonicalized with keep_fragments=True)
It's useful to follow issue 900 in order to keep up to date. In the later comments, some recipes and examples for using/customizing fingerprinting are starting to emerge.
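To make the rules above concrete, here is a small sketch (the URLs are placeholders) using scrapy.utils.request.request_fingerprint, the function the dupefilter builds on:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# same URL, but different method and body -> different fingerprints
get_req = Request('http://example.com/page')
post_req = Request('http://example.com/page', method='POST', body='a=1')
assert request_fingerprint(get_req) != request_fingerprint(post_req)

# headers are ignored by default (include_headers is None)
token_req = Request('http://example.com/page', headers={'X-Token': 'secret'})
assert request_fingerprint(get_req) == request_fingerprint(token_req)

splash_request_fingerprint additionally hashes the request's meta['splash'] options on top of this, which is why two SplashRequests for the same URL but with different Splash arguments are not treated as duplicates.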

Prevent a Scrapy response from being added to the cache

I am crawling a website that returns pages with a captcha and a status code 200 suggesting everything is ok. This causes the page to be put into scrapy's cache.
I want to recrawl these pages later. But if they are in the cache, they won't get recrawled.
Is it feasible to overload the process_response function from the httpcache middleware, or to look for a specific string in the response HTML and override the 200 code with an error code?
What would be the easiest way to keep Scrapy from putting certain responses into the cache?
Scrapy uses scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache HTTP responses. To bypass this caching for a particular request, you can set the request meta key dont_cache to True, like:
yield Request(url, meta={'dont_cache': True})
The docs above also mention how to disable caching project-wide with a setting, if you are interested in that too.
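If you want to decide based on the response content instead (the captcha scenario from the question), another option is a custom cache policy. A minimal sketch, assuming the captcha pages can be recognized by a marker string in the body (the marker and the module path are placeholders):

from scrapy.extensions.httpcache import DummyPolicy

class SkipCaptchaPolicy(DummyPolicy):
    def should_cache_response(self, response, request):
        # assumption: captcha pages contain this marker in their HTML
        if b'captcha' in response.body.lower():
            return False
        return super().should_cache_response(response, request)

Then point Scrapy at it in settings.py:

HTTPCACHE_POLICY = 'myproject.policies.SkipCaptchaPolicy'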

Intercept Requests With Custom Responses in PhantomJS?

Is there a way to intercept a resource request and give it a response directly from the handler? Something like this:
page.onRequest(function(request) {
    request.reply({data: 123});
});
My use case is for using PhantomJS to render a page that makes calls to my API. In order to avoid authentication issues, I'd like to intercept all HTTP requests to the API and return the responses manually, without making the actual HTTP requests.
onResourceRequested almost does this, but doesn't have any modification capabilities.
Possibilities that I see:
I could store the page as a Handlebars template, and render the data into the page and pass it off as the raw html to PhantomJS (instead of a URL). While this would work, it would make changes difficult since I'd have to write the data layer for each webpage, and the webpages couldn't stand alone.
I could redirect to localhost, and have a server there that listens and responds to the requests. This assumes it would be OK to have an open, unauthenticated version of the API on localhost.
Add the data via page.evaluate to the page's global window object. This has the same problems as #1: I'd need to know a priori what data the page needs, and write server-side code unique to each page.
I recently needed to do this when generating PDFs with PhantomJS.
It's slightly hacky, but it seems to work.
var page = require('webpage').create(),
    server = require('webserver').create(),
    totallyRandomPortnumber = 29522,
    ...

// in my actual code, totallyRandomPortnumber is created by a Java application,
// because PhantomJS will report the port in use as '0' when listening to a random port,
// thereby preventing its reuse in page.onResourceRequested...
server.listen(totallyRandomPortnumber, function(request, response) {
    response.statusCode = 200;
    response.setHeader('Content-Type', 'application/json;charset=UTF-8');
    response.write(JSON.stringify({data: 'somevalue'}));
    response.close();
});

page.onResourceRequested = function(requestData, networkRequest) {
    if (requestData.url.indexOf('interceptme') != -1) {
        networkRequest.changeUrl('http://localhost:' + totallyRandomPortnumber);
    }
};
In my actual application I'm sending some data to PhantomJS to override requests/responses, so I'm doing more checking on URLs in both server.listen and page.onResourceRequested.
This feels like a poor man's interceptor, but it should get you (or whoever this may concern) going.

JMeter: How to test that a website renders a page regardless of the content

I have a requirement where the site only needs to respond to the user within a certain number of seconds, regardless of the contents.
Now there is an option in JMeter under HTTP Proxy Server -> URL Patterns to Exclude, and then to start recording.
There I can specify gif, css or other content to ignore. However, before starting the recording I would have to know in advance what kinds of content are going to be there.
Is there any specific parameter to pass to JMeter, or any other tool, that takes care of loading only the page itself, so that I can assert the response code of that page without recording the other contents of the page?
Thanks.
Use the standard HTTP Request sampler with the option Retrieve All Embedded Resources from HTML Files DISABLED (not checked), set via the sampler's control panel:
"It also lets you control whether or not JMeter parses HTML files for
images and other embedded resources and sends HTTP requests to
retrieve them."
NOTE: You may also define the same setting via HTTP Request Defaults.
NOTE: See also "Response size calculation" in the same HTTP Request article.
Add assertions to your HTTP samplers:
Duration Assertion: to test whether the response was received within a defined amount of time;
Response Assertion: to ensure that the request was successful,
e.g.
Response Field to Test = Response Code
Pattern Matching Rules = Equals
Patterns to Test = 200
You want to run a test that ignores resources after a certain number of seconds?
I don't understand what you are trying to accomplish by doing that.
Users will still receive those resources when they request your URL, so your tests won't be accurate.
I don't mean any disrespect, but is it possible that you misunderstood the requirements?
I assume the requirement is to load all the resources within a certain number of seconds, not to cut off the ones that fail to fit in that time.