I have web pages that may or may not need to do some pre-processing before it can be displayed to the user. What I have done is to display this page with a message (e.g. "please wait ..."), then do an http-equiv refresh to the same page. When it refreshes to the same page, the pre-processing is done and the actual content is displayed.
Will this harm me in terms of SEO?
No one can answer this question with a solid answer however a page that the content changes which seems to be a result of a form or something, this isn't something that should be indexed, you should probably disallow search engines from accessing that page by using robots.txt
This would really depend on the type of page/URL you are on and what type of content you are generating/selecting for the user.
The best option would be to have a "standard" version of the page and have that load by default. If the content should vary based on user/session or cookie SEs will only see the "default version". Be careful not to shoot yourself in the foot though. A user demonstrating the same behavior as a SE should also see the default content.
Related
If a homepage on a website has a content if a user is not logged in and another content when the user login, would a search engine bot be able to crawl the user specific content?
If they are not able to crawl, then I can duplicate the content from another part of the website to make it easily accessible to users who have mentioned their needs at the registration time.
My guess is no, but I would rather make sure before I do something stupid.
You cannot assume that crawler support cookies, but you can identify the crawler and let the crawler be "Logged in" in your site by code. However this will open up for any user to pretend being a crawler to gain the data in the logged in area.
The bot will be able to see all the content in your document. If the content does not exist in the document, then it will not be seen by the bot. If it exists in the document but is hidden from view, the crawler will be able to pick it up.
Even if this could be done it is against the terms for most search engines to show the crawler content that is not the same as what any user will get on entry and can cause your site to be banned from the index.
This is why sites like expertsexchange have to provide the answer if you scroll all the way to the bottom even though they try to make it look like you have to register. (This is only possible if you enter expertsexchange with a google referer btw, for this reason)
I'm guessing a site like stack overflow doesn't keep an html file around for every question ever asked. Instead, server-side code creates the page every time a question is clicked on(I think). Is it possible for search engines to index every quesiton on Stack Overflow, or would a page-per-question need to be kept in the directory so the search engine can crawl it?
Yes. Search engines can index dynamically generated pages no problem. In fact, from the search engine bot's perspective, it can't really even distinguish between a dynamically generated page and a static one.
You might be interested by the Dynamic URLs vs. static URLs post on the Official Google Webmaster Central Blog.
Yes it's perfectly possible - when a link is followed the server returns HTML just like any other web page. The only difference is that the server generated it, rather than a person.
As far as the client (be it a browser or search engine) is concerned, there is no difference between a server-generated page and a static file. They're virtually indistinguishable (depending on how the page is generated, it might be missing Last-Modified headers, etc). As such, yes, search engines can index generated pages without a problem.
That said, there is something to be said for giving them a hint. Using sitemaps, for example, gives a search engine a nice listing of all your pages, so it's less likely to miss them. More importantly, it can summarize last modified times, to focus the search engine's attention on what has changed recently. This isn't mandatory, but it does help - regardless of whether the pages are static HTML or generated.
Any link that uses a GET can be followed by most crawlers. Anything that requires a POST will generally be ignored.
The mechanism for generating the page is irrelevant.
yes if this is not restricted by robot.txt or meta tags.Search engine requests web page like normal user,no one have access to server side code(if your site isn't hacked))
Search engines can see pretty much anything on a given Web page that isn't hidden behind client-side code (i.e., JavaScript).
So, if there's a URL that you can enter into your browser's address bar to get this page, and this page is linked to from somewhere, a search engine will find it and "see" the same content that you do. The fact that the page was generated dynamically by a server is irrelevant to a search engine, since what is sent to a browser upon requesting a URL is still just an HTML file.
In other words, that HTML file doesn't exist in the same form on the server - i.e., it's actually some server-side code that generates HTML, not a static HTML file - but that's not what a search engine is crawling though and indexing, rather links to document URLs that are exactly what you see in your browser's address bar.
Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact phone etc. details who want them visible on the site but would rather they not be google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for, can't really be done, Google either takes the entire page, or none of it.
You could do some sneaky tricks though like insert the part of the page you don't want indexed in an iFrame and use robots.txt to ask Google not to index that iFrame.
In short NO - unless you use cloaking with is discouraged by Google.
Please check out the official documentation from here
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
Found useful resource for using certain duplicate content and not to allow index by search engine for such content.
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index>
At your server detect the search bot by IP using PHP or ASP. Then feed the IP addresses that fall into that list a version of the page you wish to be indexed. In that search engine friendly version of your page use the canonical link tag to specify to the search engine the page version that you do not want to be indexed.
This way the page with the content that do want to be index will be indexed by address only while the only the content you wish to be indexed will be indexed. This method will not get you blocked by the search engines and is completely safe.
Yes definitely you can stop Google from indexing some parts of your website by creating custom robots.txt and write which portions you don't want to index like wpadmins, or a particular post or page so you can do that easily by creating this robots.txt file .before creating check your site robots.txt for example www.yoursite.com/robots.txt.
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.
As the question states, I'm trying to figure out how google tracks clicks on search results. When you view the source, you find the following:
<em>Yahoo</em>!
The function rwt is, which is pretty messy:
windows.rwt=function(b,d,e,g,h,f,i,j){
var a=encodeURIComponent||escape,c=b.href.split("#");
b.href=["/url?sa=t\x26source\x3dweb",d?"&oi="+a(d):"",e?"&cad="+a(e):"","&ct=",a(g),"&cd=",a(h),"&url=",a(c[0]).replace(/\+/g,"%2B"),"&ei=7_C2SbqXBMW0-AbU4OWnCw",f?"&usg="+f:"",i,c[1]?"#"+c[1]:""].join("");
b.onmousedown="";
return true};
So it looks like Google is changing the href of the a tag to /url?... which I'm assuming is where their tracking is. From LiveHeaders in Firefox, it looks like this page is redirecting the browser to the original href of the a tag.
Is this correct and is this the best method of tracking clicks on links on your site, such as ads?
It's actually changing the href of the link rather than the window location. It's setting b.href, and b refers to the link itself. This runs in onmousedown, so when you release the mouse and the click is handled you magically get sent to that new href.
Any click tracking pretty much comes down to sending the user to some equivalent of Google's /url?... script, counting the click, and performing a 302 redirect to the real destination.
This javascript href replacement has the advantage of automatically filtering out any robots that don't run scripts. The downside is that it also filters out any real people that have javascript disabled. If, like Google, you just care which link is most popular with your real human users, this works out quite well. The clicks that you do record should be representative of real human traffic, and you can safely ignore the clicks from non-javascript users because they probably have the same preferences anyway.
Most adverts just link straight to the counting URL with no javascript replacement. This means that you definitely count every real click on the link, but you need to worry about filtering out requests from robots, since they'll now see your counting URL too.
Which you prefer really depends on why you want to track the clicks.
I think most people expect ads to click through via some sort of tracking system, so I shouldn't worry too much about following this particular javascript implementation - as much as anything that's probably there to ensure that the user sees the correct link in the browsers status bar, that various other interesting bits of info (search terms, position on the result set at the time, who you are, etc) are sent across (without you realising it) and that the links still work if JavaScript is disabled.
Generally, yes directing the user through some tracking page with the ID of the ad they have clicked on, and possibly some additional indication of where they have come from is sensible - that way you aren't relying on other mechanisms (such as JS event handlers) to track clicks on the links, it's certainly the way most ad systems I've used work.
I Produced a page which I have no intention to let Search Engines find and claw it.
The advisable solution is robot.txt. But it is not applicable in my situation.
So I isolated this page from my site by clearing all links from other pages to this page, and never put its URL in external sites.
Logically, then, it is impossible for search engines to find out this page. And that means no matter how many out-bound links nesting in this page, the PR of site is save.
Am I right?
Thank you very much!
Hope this question is programming related!
No, there's still a chance your page can be found by search engine crawlers. For example, it's been speculated that data from the Google Toolbar can be used to alert Googlebot to the presence of a page. And there's still a chance others might link to your page from external sites if the URL becomes known.
Your best bet is to add a robots meta tag to your page, this will prevent it from being indexed, and prevent crawlers from following any links:
<meta name="robots" content="noindex,nofollow" />
If it is on the internet and not restricted, it will be found. It may make it harder to find, but it is still possible a crawler may happen across it.
What is the link so I can check? ;)
If you have outbound links on this "isolated" page then your page will probably show up as a referrer in the logs of the linked-to page. Depending on how much the owners of the linked-to page track their stats, then they may find your page.
I've seen httpd log files turn up in Google searches. This in turn may lead others to find your page, including crawlers and other robots.
The easiest solution might be to password protect the page?