How to check ETag before Request in Scrapy? - scrapy

I want to request a URL every minute. But before I request the whole page, I want to check if it is updated based on ETag/Content-length/Age in the header. How can I implement this in Scrapy?

Check out Scrapy's HTTP cache downloader middleware (HttpCacheMiddleware). It comes with an implementation of an RFC 2616 policy (RFC2616Policy), which does the following:
Do not attempt to store responses/requests with the no-store cache-control directive set
Do not serve responses from cache if no-cache cache-control directive is set even for fresh responses
Compute freshness lifetime from max-age cache-control directive
Compute freshness lifetime from Expires response header
Compute freshness lifetime from Last-Modified response header (heuristic used by Firefox)
Compute current age from Age response header
Compute current age from Date header
Revalidate stale responses based on Last-Modified response header
Revalidate stale responses based on ETag response header
Set Date header for any received response missing it
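As a rough sketch of how that policy could be switched on, the project settings below enable the HTTP cache and select the RFC 2616 policy. The setting names are Scrapy's own; the cache directory value is only an illustrative choice.

```python
# settings.py (sketch): enable Scrapy's HTTP cache with the RFC 2616 policy.

# The HTTP cache is disabled by default.
HTTPCACHE_ENABLED = True

# Use the HTTP-aware policy instead of the default DummyPolicy, so stale
# responses are revalidated with If-None-Match / If-Modified-Since.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Directory for cached responses, relative to the project data dir
# (illustrative value; it happens to match Scrapy's default).
HTTPCACHE_DIR = "httpcache"
```

With this in place, a request for a page whose cached copy is stale should be sent with the stored validators, and a 304 Not Modified reply is then answered from the cache rather than by downloading the whole page again.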

Related

What is the default value of Access-Control-Allow-Origin header?

Is "*" or the server's URI the default value for Access-Control-Allow-Origin header?
If the header is not set, does it mean that every origin has access to the resource?
There is no default value.
If it isn't set, then it isn't set. If it is set, then it must have an explicit value.
"If the header is not set, does it mean that every origin has access to the resource?"
No. It means that the Same Origin Policy is enforced as normal. No origins are granted permission.
"the server's URI"
There is no reason to ever set the Access-Control-Allow-Origin to be the server's own URL. Same Origin requests don't need permission from CORS.
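To make the "no default value" point concrete, here is a minimal Python sketch of how a server might decide whether to send the header at all. The function name `cors_headers` and the allowlist are hypothetical and not part of any framework.

```python
from typing import Dict, Optional, Set


def cors_headers(origin: Optional[str], allowed_origins: Set[str]) -> Dict[str, str]:
    """Return the CORS response headers for a request's Origin header.

    An empty dict means the header is simply not set, so the browser
    applies the Same Origin Policy as normal.
    """
    if origin is not None and origin in allowed_origins:
        # Echo the specific origin; Vary tells caches the response depends on it.
        return {"Access-Control-Allow-Origin": origin, "Vary": "Origin"}
    # No header, no permission: there is no implicit "*" or server-URI default.
    return {}
```

A same-origin request never needs an entry in the allowlist, because the browser does not consult CORS for it.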
I came across this while looking for the headers that work without CORS, and found this safelist from Mozilla: https://developer.mozilla.org/en-US/docs/Glossary/CORS-safelisted_request_header
A CORS-safelisted request header is one of the following HTTP headers:
Accept,
Accept-Language,
Content-Language,
Content-Type.

Accept-Encoding:gzip and Content-Encoding:gzip

What is the difference between the two HTTP headers?
Accept-Encoding:gzip
Content-Encoding:gzip
Accept-Encoding:
It is a request header. The HTTP client uses this header to tell the server which encoding(s) it supports. The server is allowed to send the response content in any of these encodings.
From MDN
The Accept-Encoding request HTTP header advertises which content encoding, usually a compression algorithm, the client is able to understand. Using content negotiation, the server selects one of the proposals, uses it and informs the client of its choice with the Content-Encoding response header.
Content-Encoding:
It is a response header. The HTTP server uses this header to tell the client which particular encoding the content has actually been encoded in.
From MDN
The Content-Encoding entity header is used to compress the media-type. When present, its value indicates which encodings were applied to the entity-body. It lets the client know, how to decode in order to obtain the media-type referenced by the Content-Type header.
If you want to see them in action, open the developer tools (Inspect Element) in Firefox or Chrome and switch to the Network tab. Look for Accept-Encoding in the request headers and Content-Encoding in the response headers.
Accept-Encoding
To paraphrase IETF internet standard RFC 7231, the Accept-Encoding request header field can be used by user agents to indicate which content codings are acceptable in the response.
The Accept-Encoding header can be quite complex, e.g.
Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
https://datatracker.ietf.org/doc/html/rfc7231#section-5.3.4
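To show how the q-values in a header like the one above could be read, here is a small Python sketch. The helper names `parse_accept_encoding` and `is_acceptable` are made up for illustration, and the parsing is deliberately simple (it assumes a well-formed header).

```python
from typing import Dict


def parse_accept_encoding(value: str) -> Dict[str, float]:
    """Map each listed content coding to its quality value (default 1.0)."""
    result: Dict[str, float] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                q = float(raw.strip())
        result[coding.strip().lower()] = q
    return result


def is_acceptable(coding: str, prefs: Dict[str, float]) -> bool:
    """True if the coding has a non-zero weight, falling back to '*'."""
    coding = coding.lower()
    if coding in prefs:
        return prefs[coding] > 0
    if "*" in prefs:
        return prefs["*"] > 0
    # With no explicit entry and no wildcard, only identity is acceptable.
    return coding == "identity"
```

For the example header, gzip and identity come out as acceptable, while anything else (for instance br) is rejected by the `*;q=0` entry.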
Content-Encoding
The Content-Encoding response header field indicates what content codings have been applied to the response representation. Content-Encoding is primarily used to allow the response entity to be compressed without losing the identity of its underlying media type.
The Content-Encoding value is simple and should be accompanied by a "Vary" header, e.g.
Content-Encoding: gzip
Vary: Accept-Encoding
https://datatracker.ietf.org/doc/html/rfc7231#section-3.1.2.2
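The server side of that exchange can be sketched in Python as follows. `encode_body` is a hypothetical helper, and the check is simplified: it only looks for a gzip token and ignores q-values, so a client sending `gzip;q=0` would be handled incorrectly.

```python
import gzip
from typing import Dict, Tuple


def encode_body(body: bytes, accept_encoding: str) -> Tuple[bytes, Dict[str, str]]:
    """Gzip the body when the client lists gzip; always emit Vary."""
    # Vary is sent either way, because the response depends on Accept-Encoding.
    headers = {"Vary": "Accept-Encoding"}
    # Simplified check: take the token before any ";" and ignore q-values.
    codings = [c.split(";")[0].strip().lower() for c in accept_encoding.split(",")]
    if "gzip" in codings:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return body, headers
```

The client reverses the step with `gzip.decompress` when it sees `Content-Encoding: gzip`; the Content-Type of the payload is unchanged.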

Apache: Disable Cache-Control: max-age?

A book about performance says that you should use Expires or Cache-Control: max-age, but not both.
Expires was easy to configure on my Apache.
Now I would like to disable the unneeded Cache-Control: max-age, but I don't know how to.
Your mention of both headers suggests that you're using mod_expires.
You cannot select only one header using mod_expires. The code that sets the headers in mod_expires.c unconditionally sets both headers to equivalent values:
apr_table_mergen(t, "Cache-Control",
                 apr_psprintf(r->pool, "max-age=%" APR_TIME_T_FMT,
                              apr_time_sec(expires - r->request_time)));
timestr = apr_palloc(r->pool, APR_RFC822_DATE_LEN);
apr_rfc822_date(timestr, expires);
apr_table_setn(t, "Expires", timestr);
However, using mod_headers may allow you to set what you want, using something like:
Header unset Cache-Control
There is a case for using both: Cache-Control allows much finer control than Expires, while Expires may help much older clients.
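To illustrate what "equivalent values" means in the mod_expires code above, here is a minimal Python sketch that derives both headers from the same expiry time. `expiry_headers` is a hypothetical helper, not an Apache API, and it assumes the request time is a timezone-aware UTC datetime.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
from typing import Dict


def expiry_headers(request_time: datetime, lifetime: timedelta) -> Dict[str, str]:
    """Build equivalent Expires and Cache-Control: max-age headers."""
    expires = request_time + lifetime
    return {
        # Relative form: seconds between the request and the expiry time.
        "Cache-Control": f"max-age={int(lifetime.total_seconds())}",
        # Absolute form: usegmt=True renders the zone as "GMT".
        "Expires": format_datetime(expires, usegmt=True),
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(expiry_headers(now, timedelta(hours=1)))
```

Both headers describe the same instant, which is why a client that understands max-age loses nothing when Expires is also present.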

Can browser display objects from its cache without receiving a 304 status code?

I'm trying to understand whether it is possible to avoid requests for some embedded objects by loading them directly from the cache, without asking the web server whether the object is still valid (I don't want the web server to respond with a 304 HTTP status code). Is it possible? Does the Expires header work this way? How?
Of course. Request:
<script src="my_js.php"></script>
Response:
<?php
header("HTTP/1.1 304 Not Modified");
header("Expires: Mon, 31 Dec 2035 12:00:00 GMT");
header("Cache-Control: max-age=" . (60 * 60 * 24 * 365));
echo "// this is a simple example";
?>
Solved
The browser loads resources from its cache without asking the web server only the first time you open the page (in a new tab or a new browser window).
On every other load, the browser ALWAYS asks the server about the resources saved in its cache, and the web server responds with 200 or 304.
Yes: set a distant Expires header and the asset will not be downloaded again until it expires.
If you remove the Last-Modified and ETag header, you will totally eliminate If-Modified-Since and If-None-Match requests and their 304 Not Modified Responses, so a file will stay cached without checking for updates until the Expires header indicates new content is available!
Source.
From my htaccess ...
<IfModule mod_headers.c>
    Header unset Pragma
    FileETag None
    Header unset ETag
    # cache images, PDFs and scripts until the far-future Expires date
    <FilesMatch "\.(ico|pdf|jpg|jpeg|png|gif|js)$">
        Header set Expires "Mon, 31 Dec 2035 12:00:00 GMT"
        Header unset ETag
        Header unset Last-Modified
    </FilesMatch>
    # cache html/htm/xml/txt/xsl files for 2 hours (7200 seconds)
    <FilesMatch "\.(html|htm|xml|txt|xsl)$">
        Header set Cache-Control "max-age=7200, must-revalidate"
    </FilesMatch>
</IfModule>
It doesn't seem to work: for example, Firebug's Net panel always shows a 200 status code, and the access.log file shows that the external objects are always requested by the browser.
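For reference, the decision a cache makes about whether it may reuse a stored response without contacting the server can be sketched roughly like this in Python. The function names are hypothetical, header names are matched case-sensitively for brevity, and the Age header and heuristic freshness are ignored, so this is a simplification of what real browsers do.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def freshness_lifetime(headers: Mapping[str, str],
                       response_time: datetime) -> Optional[timedelta]:
    """Freshness lifetime from max-age, else from Expires; None if neither."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return timedelta(seconds=int(value))
    if "Expires" in headers:
        return parsedate_to_datetime(headers["Expires"]) - response_time
    return None


def can_use_without_request(headers: Mapping[str, str],
                            response_time: datetime,
                            now: datetime) -> bool:
    """True if the stored response is still fresh, so no 304 round trip is needed."""
    cache_control = headers.get("Cache-Control", "").lower()
    if "no-cache" in cache_control or "no-store" in cache_control:
        return False
    lifetime = freshness_lifetime(headers, response_time)
    if lifetime is None:
        # No explicit lifetime: this sketch always revalidates.
        return False
    return now - response_time < lifetime
```

Note that an explicit reload in the browser typically forces revalidation regardless of freshness, which may explain why requests still show up in the access log while testing.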

Apache caching problem

I'm loading some json through apache as per:
http://arguments.callee.info/2010/04/20/running-apache-and-node-js-together/
The JSON however is outdated when I use the apache url. The node.js :8000 url serves the correct data.
How can I make sure apache doesn't cache json?
Thanks.
You can append a "cache killer" on the URL you are fetching asynchronously. That is some value that will always make the URL unique.
var url = "http://example.com/service.json?" + new Date().getTime();
A possible solution would be to set the Expires header to a date in the past and to make sure, via Cache-Control headers, that the browser does not cache any JSON files.
You can try adding this to your Apache config file:
<FilesMatch "\.json$">
    Header set Cache-Control "max-age=0, no-cache, no-store, must-revalidate"
    Header set Pragma "no-cache"
    Header set Expires "Thu, 01 Jan 1970 00:00:00 GMT"
</FilesMatch>
The mod_headers module needs to be enabled in Apache for this method to work.
If you are interested, you can read the underlying specification:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9