"Serve static assets with an efficient cache policy" - How many days is best practice? - browser-cache

Google's PageSpeed test suggests setting up browser caching for static content such as .js and .css files:
"A long cache lifetime can speed up repeat visits to your page."
I am setting this up, but for how long? There must be a best practice for this, as I don't know whether 7 days or 77 days is a good idea. It's probably subjective, site-dependent and all that usual stuff, so is longer simply better? A year, perhaps?
I have cache busting enabled, so future updates aren't an issue for my website.
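For concreteness, here is roughly what I have in mind, assuming Apache with mod_expires and mod_headers enabled (the file types are just examples; the one-year figure is the part I'm unsure about):

    <IfModule mod_expires.c>
        ExpiresActive On
        # Cache-busted assets can be given a far-future lifetime
        ExpiresByType text/css "access plus 1 year"
        ExpiresByType application/javascript "access plus 1 year"
        ExpiresByType image/png "access plus 1 year"
    </IfModule>

    <IfModule mod_headers.c>
        <FilesMatch "\.(css|js|png|jpg|gif|svg|woff2)$">
            # max-age is in seconds; 31536000 = 365 days
            Header set Cache-Control "public, max-age=31536000"
        </FilesMatch>
    </IfModule>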

Web Server Caches - in-memory vs the OS

I'm not entirely sure if this question would be better suited for something like Serverfault - however, since I'm a programmer, and not a sys-admin, I'm asking from the perspective of a programmer.
These days there are a HUGE number of options available for caching static web content. Things like Varnish or Squid are used throughout the industry.
However, I'm somewhat confused here. From a theoretical perspective, I don't see how the caching of static content requires the use of some 3rd party software apart from the web-server and OS.
Dynamic-content, (such as, the result of an expensive PHP script calculation or something), certainly could benefit from a good caching system.
But with static content, what do we gain by caching resources in memory? Wouldn't the OS page cache already provide the same benefits as a dedicated caching system like Varnish or Squid? Or am I missing some of the benefits?
Varnish, in fact, stores data in Virtual Memory using mmap - and lets the OS handle the page swapping. So, how exactly is this even different from just saving cached resources to disk and opening them with fread?
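To make the comparison concrete, here is a minimal Python sketch of the two read paths (asset.css is just a stand-in file name). On a warm hit, both are served out of the same OS page cache:

    import mmap

    # mmap path: the file's pages are mapped into the process address space;
    # the kernel faults them in on access and keeps them in the shared page cache.
    with open("asset.css", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            body = m[:]

    # plain read() path: on a warm hit the data comes out of the same page
    # cache; the bytes are simply copied into a user-space buffer.
    with open("asset.css", "rb") as f:
        body = f.read()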
You are correct. For static resources, the memory can just as well be put to use for the page cache instead of using Varnish.
Chaining caches (Varnish, the page cache) for identical content, where both compete for the same resource (server memory), is silly.
If you also have some dynamic content, you may choose to combine the two and serve everything from the cache for operational reasons. For example, it is simpler to collect access logs and statistics from a single software stack than from two. This also applies to things like staff training and security patching.

Crawling for Eternity

I've recently been building a new web app dealing with Recurring Events. These events can recur on a daily, weekly or monthly basis.
This is all working great. But when I started creating the Event Browser Page (which will be visible to the public internet), a thought crossed my mind.
If a crawler hits this page, which has next and previous buttons to browse the dates, won't it just continue forever? So I opted against plain HTML links and used AJAX instead, which means bots will not be able to follow the links.
But this method means I'm losing that functionality for users without JavaScript. Or is the number of users without JavaScript too small to worry about?
Is there a better way to handle this?
I'm also very interested in how bots like the Google crawler detect black holes like these and what they do to handle them.
Add a nofollow attribute to the individual links you don't want crawled, or a robots meta tag to the page; alternatively, disallow those paths in robots.txt. See the Robots Exclusion Standard.
You may still need to think about how to fend off ill-behaved bots which do not respect the standard.
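A minimal sketch of both page-source options, assuming a hypothetical /events browse page (adjust the paths and markup to your site):

    <!-- rel="nofollow" hints to crawlers not to follow the paging links -->
    <a href="/events?offset=1" rel="nofollow">Next</a>

    <!-- or keep the whole browse page out of the index with a robots meta tag -->
    <meta name="robots" content="noindex, nofollow">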
Even a minimally functional web crawler requires a lot more sophistication than you might imagine, and the situation you describe is not a problem. Crawlers operate on some variant of a breadth-first search, so even if they do nothing to detect black holes, it's not a big deal. Another typical feature of web crawlers that helps is that they avoid fetching a lot of pages from the same domain in a short time span, because otherwise they would inadvertently be performing a DOS attack against any site with less bandwidth than the crawler.
Even though it's not strictly necessary for a crawler to detect black holes, a good one might have all sorts of heuristics to avoid wasting time on low-value pages. For instance, it may choose to ignore pages that don't have a minimum amount of English (or whatever language) text, pages that contain nothing but links, pages that seem to contain binary data, etc. The heuristics don't have to be perfect, because the basic breadth-first nature of the search ensures that no single site can waste too much of the crawler's time, and the sheer size of the web means that even if it misses some "good" pages, there are always plenty of other good pages to be found. (Of course this is from the perspective of the web crawler; if you own the pages being skipped, it might be more of a problem for you, but companies like Google that run web crawlers are intentionally secretive about the exact details of things like that because they don't want people trying to outguess their heuristics.)
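To make the "not a big deal" point concrete, here is a minimal sketch of breadth-first crawling with a per-host budget. fetch_links is a placeholder for the real fetch-and-parse step, and real crawlers add politeness delays, robots.txt checks and many other safeguards on top of this:

    from collections import deque
    from urllib.parse import urlparse

    MAX_PAGES_PER_HOST = 1000  # budget: no single site can monopolise the crawl

    def fetch_links(url):
        """Placeholder for fetching `url` and returning the URLs it links to."""
        return []

    def crawl(seed_urls):
        queue = deque(seed_urls)          # breadth-first frontier
        seen = set(seed_urls)
        pages_per_host = {}
        while queue:
            url = queue.popleft()
            host = urlparse(url).netloc
            if pages_per_host.get(host, 0) >= MAX_PAGES_PER_HOST:
                continue                  # an endless calendar simply stops getting visits
            pages_per_host[host] = pages_per_host.get(host, 0) + 1
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)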

How important are website optimizations?

Currently I am running Apache and MySQL, and I hear people talking about gzipping content, something about ETags, using a CDN, adding Expires headers, minifying text documents, combining script files, etc. I downloaded a Firefox add-on called YSlow and noticed that many websites do not employ all of these tactics. I believe even Google has a D rating. So I ask, SO: how important are these optimizations?
That depends highly on your traffic and the resources at your disposal.
If you make the website for Joe's Pizza in the middle of nowhere, there is no real need to waste time optimising the site; it will likely get a handful of visits a day.
But Stack Overflow receives thousands of hits a minute (probably more), so they use a CDN, far-future expiry headers, minification, etc.
Honestly, if people aren't complaining it's probably not a big deal. If people are complaining, start by looking at the database.
In my years of web development, most web application performance problems have stemmed from the DB (this doesn't mean that all performance problems come from the DB, but it's a good place to start). While I am fascinated by things like minified JS and CSS sprites, I suspect these things do not make much difference in a "day in the life of your average web developer".
It's good that you consider these things, but unless you are working on an extremely high-traffic site, it probably won't make a difference.
It all depends on your application.
Minifying, for example, might be great for an application that depends heavily on external .js files. There is no reason NOT to do this: there is no overhead required and it potentially saves quite a few bytes.
Compression is great for certain content types, pointless for others (already-compressed media such as images), and involves a slight processing overhead when serving pages.
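For instance, enabling compression only for text formats might look like this, assuming Apache with mod_deflate (a sketch, not a complete configuration):

    <IfModule mod_deflate.c>
        # Compress text formats only; images and other already-compressed
        # binaries gain little and just burn CPU.
        AddOutputFilterByType DEFLATE text/html text/plain text/css
        AddOutputFilterByType DEFLATE application/javascript application/json
    </IfModule>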
CDNs are up to your affordability, content type and how dynamic the content is. You obviously don't need Akamai backing up the average Drupal site.
etc, etc, etc

Two Different Sites With the Same Content. Horrible Idea?

My boss wants to change our static web site into a dynamic one, in order to make it more interactive for users.
However, I strongly believe a big revamp may cause ranking turbulence, and we really cannot afford it. The reason is that we changed the structure of our web site a couple of months ago. Specifically, we eliminated a lot of pages and consolidated their content into single pages, i.e. putting images, tutorials, etc. for a product onto the page for the relevant product. Then, following suggestions from experts, we redirected all of the eliminated URLs via 301 to the corresponding product page. Unfortunately we experienced over a month of decline in our Google ranking. That is why I am so anxious about our performance in search engines.
I have figured out a wild solution, though I am not sure whether it is doable, so please give me your ideas!
I want to maintain two sites at the same time for a period: one dynamic and the other static. One will be served under www.<><><>.com while the other is at http://<><><>.com, and the latter should be static.
If everything goes well, maybe one or two months later, I can take the static site, http://<><><>.com, down.
Am I dreaming?
Thanks for your time for the lengthy reading and suggestion.
This is the wrong place to ask the question, but it's a horrible idea.
Maintaining the old site at a similar URL will confuse users and only delay the problem rather than solve it. A demo phase (YouTube, Gmail, Google Ads, etc. did this with their interfaces) where users are given the choice to switch to the new interface (cookies!) can ease radical transitions, but eventually you'll just want to make a clean cut.
If you fear bad consequences for your page rank, make sure the new site is better and redirect all indexed old links to the appropriate new sites where possible (at least for a transitional period).
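For example, with Apache (a sketch; example.com and the paths stand in for the real ones):

    # Map eliminated pages permanently onto their new product pages
    Redirect 301 /products/widget/images    /products/widget
    Redirect 301 /products/widget/tutorials /products/widget

    # Collapse the bare domain onto www so only one host gets indexed
    <IfModule mod_rewrite.c>
        RewriteEngine On
        RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
        RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
    </IfModule>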

robots.txt: disallow all but a select few, why not?

I've been thinking a while about disallowing every crawler except Ask, Google, Microsoft, and Yahoo! from my site.
The reasoning behind this is that I've never seen any traffic being generated by any of the other web-crawlers out there.
My questions are:
Is there any reason not to?
Has anybody done this?
Did you notice any negative effects?
Update:
Up until now I have used the blacklist approach: if I do not like a crawler, I add it to the disallow list.
I'm no fan of blacklisting, however, as it is a never-ending story: there are always more crawlers out there.
I'm not so much worried about the really ugly, misbehaving crawlers; they are detected and blocked automatically (and they typically do not ask for robots.txt anyhow :)
However, many crawlers are not really misbehaving in any way; they just do not seem to generate any value for me or my customers.
There are, for example, a couple of crawlers powering websites that claim they will be "The Next Google, Only Better". I've never seen any traffic coming from them, and I'm quite sceptical about them ever becoming better than any of the four search engines mentioned above.
Update 2:
I've been analysing the traffic to several sites for some time now, and it seems that a reasonably small site gets about 100 unique human visitors a day (i.e. visitors that I cannot identify as being non-human). About 52% of the generated traffic comes from automated processes.
About 60% of the automated visitors do not read robots.txt; the other 40% (21% of total traffic) do request robots.txt (this includes Ask, Google, Microsoft, and Yahoo!).
So my thinking is: if I block all the well-behaved crawlers that do not seem to generate any value for me, I could reduce bandwidth use and server load by around 12%-17% (the 21% of traffic that respects robots.txt, minus the share taken by the four engines I want to keep).
The internet is a publishing mechanism. If you want to whitelist your site, you're against the grain, but that's fine.
Do you want to whitelist your site?
Bear in mind that badly behaved bots which ignore robots.txt aren't affected anyway (obviously), and well-behaved bots are probably there for a good reason; it's just that the reason is opaque to you.
Whilst the other sites that crawl your site might not be sending any traffic your way, it's possible that they themselves are being indexed by Google et al. and so adding to your PageRank; blocking them from your site might affect this.
Is there any reason not to?
Do you want to be left out of something that could be including your site without your knowledge and indirectly bringing traffic your way?
If some strange crawlers are hammering your site and eating your bandwidth you may want to, but it is quite possible that such crawlers wouldn’t honour your robots.txt either.
Examine your log files and see what crawlers you have and what proportion of your bandwidth they are eating. There may be more direct ways to block traffic which is bombarding your site.
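For instance, a rough Python sketch that totals bytes served per user agent from an Apache "combined" format log (access.log is a stand-in path; adjust the pattern to your log format):

    import re
    from collections import defaultdict

    # Matches the Apache/Nginx "combined" log format and captures the
    # response size and the user-agent string.
    LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d+ (\d+|-) "[^"]*" "([^"]*)"')

    bytes_by_agent = defaultdict(int)
    with open("access.log") as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            size, agent = m.groups()
            bytes_by_agent[agent] += 0 if size == "-" else int(size)

    # Print the heaviest user agents first.
    for agent, total in sorted(bytes_by_agent.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{total:>12}  {agent}")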
Whitelisting in robots.txt is currently a bit awkward, as the original standard has no "Allow" field. The easy way is to put all the files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory.
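Another common pattern that works without "Allow" is to give each crawler you want to keep its own record with an empty Disallow line (meaning nothing is off limits for that agent) and to disallow everything for the rest. A sketch, using the user-agent tokens those engines have used historically (Slurp is Yahoo!, msnbot is Microsoft, Teoma is Ask); verify the current tokens before relying on them:

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: Teoma
    Disallow:

    # Everyone else: keep out
    User-agent: *
    Disallow: /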
My only worry is that you may miss the next big thing.
There was a long period when AltaVista was the search engine, possibly even more dominant than Google is now (there was no Bing or Ask, and Yahoo! was a directory rather than a search engine as such). Sites that blocked all but AltaVista back then would never have seen traffic from Google, and therefore never have known how popular it was getting unless they heard about it from another source, which might have put them at a considerable disadvantage for a while.
PageRank tends to be biased towards older sites. You don't want to appear newer than you are because you were blocking access via robots.txt for no reason. These guys, http://www.dotnetdotcom.org/, may be completely useless now, but maybe in five years' time the fact that you weren't in their index will count against you in the next big search engine.