Prevent Scrapy response from being added to cache - scrapy

I am crawling a website that returns pages containing a captcha together with a status code of 200, which suggests everything is OK. As a result, the page is put into Scrapy's cache.
I want to recrawl these pages later, but if they are in the cache they won't get recrawled.
Is it feasible to override the process_response function of the httpcache middleware, or to look for a specific string in the response HTML and replace the 200 code with an error code?
What would be the easiest way to keep Scrapy from putting certain responses into the cache?

Scrapy uses scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware to cache HTTP responses. To skip caching for a particular request, set the request meta key dont_cache to True, like this:
yield Request(url, meta={'dont_cache': True})
The HttpCacheMiddleware documentation also explains how to disable caching project-wide with a setting (HTTPCACHE_ENABLED), if you are interested in that too.
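Since the captcha only becomes visible once the response has arrived, another option is a custom cache policy selected through the HTTPCACHE_POLICY setting. The following is a minimal sketch, not a definitive implementation: the class name SkipCaptchaPolicy, the CAPTCHA_MARKER value, and the module path in the settings comment are assumptions made for illustration. It implements the same four methods as Scrapy's built-in policies by duck typing; subclassing scrapy.extensions.httpcache.DummyPolicy and overriding only should_cache_response should work as well.

```python
class SkipCaptchaPolicy:
    """Cache policy sketch that refuses to store captcha-looking responses."""

    # Hypothetical marker: replace with a string that really appears
    # on the captcha page of the site being crawled.
    CAPTCHA_MARKER = b"captcha"

    def __init__(self, settings):
        # Scrapy instantiates the policy with the project settings.
        self.ignore_http_codes = [
            int(code) for code in settings.getlist("HTTPCACHE_IGNORE_HTTP_CODES")
        ]

    def should_cache_request(self, request):
        # Every request is a candidate for caching in this sketch.
        return True

    def should_cache_response(self, response, request):
        # Honour the usual status-code exclusions first.
        if response.status in self.ignore_http_codes:
            return False
        # Do not store pages whose body contains the captcha marker.
        return self.CAPTCHA_MARKER not in response.body.lower()

    def is_cached_response_fresh(self, cachedresponse, request):
        # Cached entries never expire on their own in this sketch.
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        return True


# In settings.py (the module path is an assumption):
# HTTPCACHE_ENABLED = True
# HTTPCACHE_POLICY = "myproject.policies.SkipCaptchaPolicy"
```

With this in place the captcha pages still reach the spider with status 200, but they are not written to the cache, so a later run requests them again.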

Related

ResponseCache attribute causing non 200 responses to be cached

We are using the [ResponseCacheAttribute] from Microsoft.AspNetCore.Mvc.Core with a policy like so on action methods or controllers:
[ResponseCache(CacheProfileName = "Default")]
If a non-200 response such as 400, 403 or 500 is sent, it is cached as well. So the first time, we go to the server and get, for example, a bad request. The second time, no call is made to the server and the answer is still bad request (from disk cache).
I read in the documentation that when using response cache middleware only 200 responses are being cached. This attribute seems to be flawed and always adds the caching response header no matter what status code.
We like to define caching only on certain controllers or action methods. Not on all requests.
Does anyone know the solution for this?
I simulate the problem by using a status code result:
return StatusCode(500);
I would then expect every request to hit this code (with a breakpoint set) and the response never to be cached.

How to cache whole next.js's HTML page in Redis?

I'm new to Next.js. I'm trying to cache the whole HTML page in Next.js so that I can reduce the response time on the next call for that page.
I tried creating a custom server and saving the response that came from renderToHTML()/render(), but it didn't return any response.
Redis is not for caching whole HTML pages or any other complex object types for that matter. It's a key-value database. In Redis you should keep your objects simple.
For your case it's best to use stale-while-revalidate (SWR) cache-control headers in combination with getServerSideProps for server-rendering in Next.js.
See here for example.

How to set cookies in Scrapy+Splash when javascript makes multiple requests?

When the JavaScript is loaded, it makes another AJAX request, and cookies should be set in the response to it. However, Splash does not keep any cookies across multiple requests. Is there a way to keep the cookies across all requests, or even to assign them manually between requests?
Yes, there is an example in scrapy-splash README - see Session Handling section. In short, first, make sure that all settings are correct. Then use SplashRequest(url, endpoint='execute', args={'lua_source': script}) to send scrapy requests. Rendering script should be like this:
function main(splash)
    splash:init_cookies(splash.args.cookies)
    -- ... your script
    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end
There is also a complete example with cookie handling, header handling, etc. in the scrapy-splash README; see the last example there.
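On the Scrapy side, the request described above could be assembled roughly as follows. This is only a sketch: the helper splash_request_kwargs is a hypothetical name, and the splash:go and splash:html calls stand in for the "your script" and "other results" placeholders. It assumes the scrapy-splash middlewares are enabled as described in the README.

```python
# Lua script executed by Splash's "execute" endpoint.
LUA_SCRIPT = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    return {
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


def splash_request_kwargs(url):
    """Build keyword arguments for scrapy_splash.SplashRequest (hypothetical helper)."""
    return {
        "url": url,
        "endpoint": "execute",  # run the Lua script instead of a plain render
        "args": {"lua_source": LUA_SCRIPT},
    }


# Inside a spider, assuming scrapy-splash is installed and configured:
#     from scrapy_splash import SplashRequest
#     yield SplashRequest(callback=self.parse, **splash_request_kwargs(url))
```

Returning the cookies from the script is what lets the next request start from the state the previous one ended with.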

RESTlet redirect sending browser riap URI

I'm using RESTlet to handle PUT requests from a browser and after a successful PUT, I want to redirect the browser to different web page.
Seems like a standard PUT->REDIRECT->GET to me, but I'm not figuring out how to do it in my RESTlet resource.
Here is my code after the PUT has done the requested work:
getResponse().redirectSeeOther("/account");
However that results in the browser getting:
Response Headers
Location riap://application/account
Of course, the "riap" protocol is meaningless to the browser and "application" is not a server name. It seems like there ought to be a way to send a redirect back to the browser without building the entire URL in my redirectSeeOther() call. Building the URL seems like it could be error-prone.
Is there an easy way to redirect without building the whole URL from the ground up?
Thanks!
Sincerely,
Stephen McCants
I am not 100% sure what type of class you are trying to do this in, but try:
Reference reference = getRequest().getRootRef().clone().addSegment("account");
redirectSeeOther(reference);
I usually also then set the body as
return new ReferenceList(Arrays.asList(reference)).getTextRepresentation();
but that may not be necessary for all clients, or at all. I will usually use this style in a class that extends ServerResource - Restlet (2.0.x or 2.1.x).

Jmeter : How to test a website to render a page regardless of the content

I have a requirement where the site only needs to respond to the user within a certain number of seconds, regardless of the contents.
There is an option in JMeter under HTTP Proxy Server -> URL Patterns to Exclude, which is set before starting the recording.
Here I can specify gif, css or other content to ignore. However, before starting the recording I have to know which kinds of content are going to be there.
Is there a specific parameter to pass to JMeter, or any other tool, that loads only the page itself, so that I can assert the response code of that page and none of the page's other contents are recorded?
Thanks.
Use the standard HTTP Request sampler with the option Retrieve All Embedded Resources from HTML Files DISABLED (not checked); it is set via the sampler's control panel:
"It also lets you control whether or not JMeter parses HTML files for
images and other embedded resources and sends HTTP requests to
retrieve them."
NOTE: You may also define the same setting via HTTP Request Defaults.
NOTE: See also "Response size calculation" in the same HTTP Request article.
Add assertions to your HTTP samplers:
Duration Assertion: to test whether the response was received within a defined amount of time;
Response Assertion: to ensure that the request was successful,
e.g.
Response Field to Test = Response Code
Pattern Matching Rules = Equals
Patterns to Test = 200
Do you want to run a test that ignores resources after a certain number of seconds?
I don't understand what you are trying to accomplish by doing that.
Users will still receive those resources when they request your URL, so your tests won't be accurate.
I don't mean any disrespect, but is it possible that you misunderstood the requirements?
I assume the requirement is to load all the resources within a certain number of seconds, not to cut off the ones that fail to fit in that time?