Eliminating Expensive Back End System Calls and SEO - seo

We have a web site which makes expensive calls to a back end system to display product availability. I'd like to eliminate these calls for page views that are not actual customers. My first thought was to filter on user agent and if the requester is a spider / search engine crawler, to display a "Call for availability" or some such message (which would be the same message we would display if the backend systems were down for maintenance or generally unavailable) rather than make a call to the backend system for real availability.
In discussions with folks, there seems to be much concern over the availability icon (a very small icon, mind you) being different when being crawled vs. when a user is viewing or requesting the page - that we might be penalized for cloaking the search engines.
As the information we are displaying is a very small image icon, and we are not offering drastically different content to the search engines vs. live users, I really don't see cloaking as an issue - but I'd like to get some outside perspective.
Is simulating an "information not available" scenario for search engines acceptable practice when the overall content of the page does not change, or would it still qualify in some way as cloaking?

Why don't you make the "information" which you are displaying use javascript / ajax. That way when the page loads through a non-javascript enabled browser (e.g. search engine spider), this "expensive call" isn't made.
Alternatively you could put this information in an IFRAME on your page. And exclude indexing the page shown in the IFRAME through robots.txt or the META / robots tag.
Both approaches are completely "white hat" although I think the 2nd is more so.

Related

SEO: Allowing crawler to index all pages when only few are visible at a time

I'm working on improving the site for the SEO purposes and hit an interesting issue. The site, among other things, includes a large directory of individual items (it doesn't really matter what these are). Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
The directory is large - having about 100,000 items in it. Naturally, on any of the pages only a few items are listed. For example, on the main site homepage, there are links to about 5 or 6 items, from some other page there links to about a dozen different items, etc.
When real users visits the site, they can use search form to find item by keyword or location - so there would be a list produced matching their search criteria. However when, for example, a google crawler visits the site, it won't even attempt to put a text into the keyword search field and submit the form. Thus as far as the bot is concern, after indexing the entire site, it has covered only a few dozen items at best. Naturally, I want it to index each individual item separately. What are my options here?
One thing I considered is to check the user agent and IP ranges and if the requestor is a bot (as best I can say), then add a div to the end of the most relevant page with links to each individual item. Yes, this would be a huge page to load - and I'm not sure how google bot would react to this.
Any other things I can do? What are best practices here?
Thanks in advance.
One thing I considered is to check the user agent and IP ranges and if
the requestor is a bot (as best I can say), then add a div to the end
of the most relevant page with links to each individual item. Yes,
this would be a huge page to load - and I'm not sure how google bot
would react to this.
That would be a very bad thing to do. Serving up different content to the search engines specifically for their benefit is called cloaking and is a great way to get your site banned. Don't even consider it.
Whenever a webmaster is concerned about getting their pages indexed having an XML sitemap is an easy way to ensure the search engines are aware of your site's content. They're very easy to create and update, too, if your site is database driven. The XML file does not have to be static so you can dynamically produce it whenever the search engines request it (Google, Yahoo, and Bing all support XML sitemaps). You can find out mroe about XML sitemaps at sitemaps.org.
If you want to make your content available to search engines and want to benefit from semantic markup (i.e. HTML) you should also make sure your all of content can be reached through hyperlinks (in other words not through form submissions or JavaScript). The reason for this is twofold:
The anchor text in the links to your items will contain the keywords you want to rank well for. This is one of the more heavily weighted ranking factors.
Links count as "votes", especially to Google. Links from external websites, especially related websites, are what you'll hear people recommend the most and for good reason. They're valuable to have. But internal links carry weight, too, and can be a great way to prop up your internal item pages.
(Bonus) Google has PageRank which used to be a huge part of their ranking algorithm but plays only a small part now. But it still has value and links "pass" PageRank to each page they link to increasing the PageRank of that page. When you have as many pages as you do that's a lot of potential PageRank to pass around. If you built your site well you could probably get your home page to a PageRank of 6 just from internal linking alone.
Having an HTML sitemap that somehow links to all of your products is a great way to ensure that search engines, and users, can easily find all of your products. It is also recommended that you structure your site so more important pages are closer to the root of your website (home page) and then as you branch out gets to sub pages (categories) and then to specific items. This gives search engines an idea of what pages are important and helps them organize them (which helps them rank them). It also helps them follow those links from top to bottom and find all of your content.
Each item has its own details page, which is accessed via
http://www.mysite.com/item.php?id=item_id
or
http://www.mysite.com/item.php/id/title
This is also bad for SEO. When you can pull up the same page using two different URLs you have duplicate content on your website. Google is on a crusade to increase the quality of their index and they consider duplicate content to be low quality. Their infamous Panda Algorithm is partially out to find and penalize sites with low quality content. Considering how many products you have it is only a matter of time before you are penalized for this. Fortunately the solution is easy. You just need to specify a canonical URL for your product pages. I recommend the second format as it is more search engine friendly.
Read my answer to an SEO question at the Pro Webmaster's site for even more information on SEO.
I would suggest for starters having an xml sitemap. Generate a list of all your pages, and submit this to Google via webmaster tools. It wouldn't hurt having a "friendly" sitemap either - linked to from the front page, which lists all these pages, preferably by category, too.
If you're concerned with SEO, then having links to your pages is hugely important. Google could see your page and think "wow, awesome!" and give you lots of authority -- this authority (some like to call it link juice" is then passed down to pages that are linked from it. You ought to make a hierarchy of files, more important ones closer to the top and/or making it wide instead of deep.
Also, showing different stuff to the Google crawler than the "normal" visitor can be harmful in some cases, if Google thinks you're trying to con it.
Sorry -- A little bias on Google here - but the other engines are similar.

http-equiv refresh & SEO

I have web pages that may or may not need to do some pre-processing before it can be displayed to the user. What I have done is to display this page with a message (e.g. "please wait ..."), then do an http-equiv refresh to the same page. When it refreshes to the same page, the pre-processing is done and the actual content is displayed.
Will this harm me in terms of SEO?
No one can answer this question with a solid answer however a page that the content changes which seems to be a result of a form or something, this isn't something that should be indexed, you should probably disallow search engines from accessing that page by using robots.txt
This would really depend on the type of page/URL you are on and what type of content you are generating/selecting for the user.
The best option would be to have a "standard" version of the page and have that load by default. If the content should vary based on user/session or cookie SEs will only see the "default version". Be careful not to shoot yourself in the foot though. A user demonstrating the same behavior as a SE should also see the default content.

Can search engines index pages generated by server side code?

I'm guessing a site like stack overflow doesn't keep an html file around for every question ever asked. Instead, server-side code creates the page every time a question is clicked on(I think). Is it possible for search engines to index every quesiton on Stack Overflow, or would a page-per-question need to be kept in the directory so the search engine can crawl it?
Yes. Search engines can index dynamically generated pages no problem. In fact, from the search engine bot's perspective, it can't really even distinguish between a dynamically generated page and a static one.
You might be interested by the Dynamic URLs vs. static URLs post on the Official Google Webmaster Central Blog.
Yes it's perfectly possible - when a link is followed the server returns HTML just like any other web page. The only difference is that the server generated it, rather than a person.
As far as the client (be it a browser or search engine) is concerned, there is no difference between a server-generated page and a static file. They're virtually indistinguishable (depending on how the page is generated, it might be missing Last-Modified headers, etc). As such, yes, search engines can index generated pages without a problem.
That said, there is something to be said for giving them a hint. Using sitemaps, for example, gives a search engine a nice listing of all your pages, so it's less likely to miss them. More importantly, it can summarize last modified times, to focus the search engine's attention on what has changed recently. This isn't mandatory, but it does help - regardless of whether the pages are static HTML or generated.
Any link that uses a GET can be followed by most crawlers. Anything that requires a POST will generally be ignored.
The mechanism for generating the page is irrelevant.
yes if this is not restricted by robot.txt or meta tags.Search engine requests web page like normal user,no one have access to server side code(if your site isn't hacked))
Search engines can see pretty much anything on a given Web page that isn't hidden behind client-side code (i.e., JavaScript).
So, if there's a URL that you can enter into your browser's address bar to get this page, and this page is linked to from somewhere, a search engine will find it and "see" the same content that you do. The fact that the page was generated dynamically by a server is irrelevant to a search engine, since what is sent to a browser upon requesting a URL is still just an HTML file.
In other words, that HTML file doesn't exist in the same form on the server - i.e., it's actually some server-side code that generates HTML, not a static HTML file - but that's not what a search engine is crawling though and indexing, rather links to document URLs that are exactly what you see in your browser's address bar.

How does Google track search result clicks? Is this the best way?

As the question states, I'm trying to figure out how google tracks clicks on search results. When you view the source, you find the following:
<em>Yahoo</em>!
The function rwt is, which is pretty messy:
windows.rwt=function(b,d,e,g,h,f,i,j){
var a=encodeURIComponent||escape,c=b.href.split("#");
b.href=["/url?sa=t\x26source\x3dweb",d?"&oi="+a(d):"",e?"&cad="+a(e):"","&ct=",a(g),"&cd=",a(h),"&url=",a(c[0]).replace(/\+/g,"%2B"),"&ei=7_C2SbqXBMW0-AbU4OWnCw",f?"&usg="+f:"",i,c[1]?"#"+c[1]:""].join("");
b.onmousedown="";
return true};
So it looks like Google is changing the href of the a tag to /url?... which I'm assuming is where their tracking is. From LiveHeaders in Firefox, it looks like this page is redirecting the browser to the original href of the a tag.
Is this correct and is this the best method of tracking clicks on links on your site, such as ads?
It's actually changing the href of the link rather than the window location. It's setting b.href, and b refers to the link itself. This runs in onmousedown, so when you release the mouse and the click is handled you magically get sent to that new href.
Any click tracking pretty much comes down to sending the user to some equivalent of Google's /url?... script, counting the click, and performing a 302 redirect to the real destination.
This javascript href replacement has the advantage of automatically filtering out any robots that don't run scripts. The downside is that it also filters out any real people that have javascript disabled. If, like Google, you just care which link is most popular with your real human users, this works out quite well. The clicks that you do record should be representative of real human traffic, and you can safely ignore the clicks from non-javascript users because they probably have the same preferences anyway.
Most adverts just link straight to the counting URL with no javascript replacement. This means that you definitely count every real click on the link, but you need to worry about filtering out requests from robots, since they'll now see your counting URL too.
Which you prefer really depends on why you want to track the clicks.
I think most people expect ads to click through via some sort of tracking system, so I shouldn't worry too much about following this particular javascript implementation - as much as anything that's probably there to ensure that the user sees the correct link in the browsers status bar, that various other interesting bits of info (search terms, position on the result set at the time, who you are, etc) are sent across (without you realising it) and that the links still work if JavaScript is disabled.
Generally, yes directing the user through some tracking page with the ID of the ad they have clicked on, and possibly some additional indication of where they have come from is sensible - that way you aren't relying on other mechanisms (such as JS event handlers) to track clicks on the links, it's certainly the way most ad systems I've used work.

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page.
<meta name="robots" content="index,nofollow">
I am concerned that if we remove the 'nofollow', the impact to our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:
1) Is there a way to remove the 'nofollow' from the meta tag, but prevent spiders from following only certain links on the page? I have read mixed opinions on rel="nofollow", is this a viable option?
<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>
2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages, then stopped.
3) Our search results pages have the standard next/previous links, which would in theory cause spiders to hit pages recursively to infinity, what is the effect of this on SEO?
I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.
Note our Search results pages and paging links are not bot-friendly, in that they are not re-written and have a ?name=value query string, but from what I've seen spiders no longer just abort when they see the '?' as the results pages ARE getting indexed with decent page rank.
To be honest you are looking at nofollow wrong. Chances are the search spiders are already especially Google, Yahoo, and MSN searching the nofollow pages, because they still have to hit those pages to see if they have a noindex.
The real problem is nofollow doesn't actually mean don't follow, it just means don't pass on my reputation to this link. So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and robot commands on links will not effect performance because they are already hitting your site. To confirm this just look at your HTTP Server Log.
So my vote is that you will not see any problem with removing the robot limits.
I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.
I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.
There is an option in webmaster tools to set the crawl rate for your site.
Google bots are pretty intelligent about not traversing an entire database of dynamically-generated pages, as long as the URLs give some hint that they are dynamic (i.e. file extension of .asp or .jsp, etc. and numeric ids as query parameters). If you use rewrite rules to make your URLs "friendly", then the bots have a harder time determining whether or not it's a static page they are reading or a dynamically generated page. See this Google article for more information about dynamic vs. static URLs.
You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.