While building a site on its actual live hosting platform, is there a way to tell Google not to index the website yet? I found the following:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=93710
But would that tell them to never come back, or would they simply see the noindex tag and not list the results? Then, when the crawler returns later and my site is good to go, I would remove the noindex tag and the site would start getting indexed?
Sounds like you want to use a robots.txt file instead:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&topic=2370588&ctx=topic
Update your robots.txt file when you want your content to be indexed.
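As a sketch, a staging robots.txt that blocks all crawling is just two lines. The Python standard library's robotparser can confirm how a compliant crawler reads it (the example.com URLs here are placeholders, not from the question):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks all crawlers from the whole site.
# Serve this at https://example.com/robots.txt while the site is in staging.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# No crawler that honors robots.txt may fetch any page:
print(parser.can_fetch("Googlebot", "https://example.com/any/page"))  # False
```

Once the site is ready, replacing `Disallow: /` with an empty `Disallow:` (or removing the file) opens it up again.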
You can use the robots.txt method.
You can specify which pages may be crawled, and Google checks the file again before indexing each time it comes back. So you can delete the file later in order to get fully indexed.
More Information
About /robots.txt
Robots.txt File Generator
You can always change it. Google and other robots find your page when it is linked from another page; as long as nothing links to it, it won't be found. Also, once your site is up, chances are it will appear far back in the list of results.
Related
We have a deprecated domain, www.deprecateddomain.com. The specific detail is that we have a reverse proxy redirecting all requests from this domain to the new one, www.newdomain.com.
The problem is that when you type "deprecateddomain.com" into Google search, there is a link to www.deprecateddomain.com in the search results besides results for "newdomain.com". That means there are such entries in Google's index. Our customer doesn't want to see links to the old site.
It was suggested that we create a fake robots.txt with a Disallow: / directive for www.deprecateddomain.com, plus reverse proxy rules to serve this file from some directory. But after investigating the subject I started to doubt that it will help. Will it remove entries for the old domain from the index?
Why not just create a request in Search Console to remove www.deprecateddomain.com from the index? In my opinion that might help.
Anyway, I'm a novice on this question. Could you give me advice on what to do?
Google takes time to remove old/obsolete entries from its rankings, especially for low-traffic or low-value pages. You have no control over it. Google needs to revisit each page to see the redirection you have implemented.
So DO NOT implement a disallow on the old website, because it will make the problem worse. Bots won't be able to crawl those pages and see the redirection you have implemented, so they will stay in the rankings longer.
You must also make sure you implement a proper 301 redirect (i.e. a permanent one, not a temporary one) for all pages of the old website. Otherwise, some pages may stay in the rankings for quite some time.
If some pages are obsolete and should be deleted rather than redirected, return a 404 for them. Google will remove them quickly from its index.
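As a sketch, assuming the old site runs Apache with mod_rewrite enabled, the per-page 301s and 404/410s described above might look like this (the rule patterns are hypothetical):

```apache
# .htaccess on www.deprecateddomain.com
RewriteEngine On

# Pages that are gone rather than moved: return 410 Gone (handled first)
# RewriteRule ^some-obsolete-page$ - [G]

# Everything else: permanent redirect to the same path on the new domain
RewriteCond %{HTTP_HOST} ^(www\.)?deprecateddomain\.com$ [NC]
RewriteRule ^(.*)$ https://www.newdomain.com/$1 [R=301,L]
```

The order matters: the "gone" rules must come before the catch-all redirect, or they will never match.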
We do a lot of email marketing, and sometimes developers will put the HTML file out on the image server (I know the easy answer is to not do this), but those HTML files end up getting indexed by Google and eventually rank high in search results. This in turn makes the SEO companies want us to remove these pages. Is it possible to have Google not index anything from our subdomain? We have image.{ourUrl}.com where we put all these files.
Would putting a robots.txt file in the main directory do it? Or would we need to add that robots.txt file to every directory?
Is there an easy way to blanket this?
A robots.txt file would just stop crawling; files might still be indexed. A noindex directive would work; you could use an X-Robots-Tag HTTP header. See here: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
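For example, assuming the image subdomain is served by Apache with mod_headers enabled, a server-wide X-Robots-Tag header is a one-line sketch:

```apache
# In the vhost config (or .htaccess) for image.{ourUrl}.com:
# send "noindex" on every response, so even crawled files stay out of the index
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex"
</IfModule>
```

Unlike a meta tag, this works for any file type (images, PDFs, stray HTML), and one directive in the subdomain's config covers every directory.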
I have a problem with lots of 404 errors on one site. I figured out that these errors are happening because google is trying to find pages that no longer exist.
Now I need to tell Google not to index those pages again.
I found some solutions on the internet about using a robots.txt file. But this is not a site that I built; I just need to fix those errors. The thing is, those pages are generated. They do not physically exist in that form, so I cannot add anything to the PHP code.
And I am not quite sure how to add those to robots.txt.
When I just write:
User-agent: *
noindex: /objekten/anzeigen/haus_antea/5-0000001575*
and hit test button in webmaster tools
I get this from Googlebot:
Allowed
Detected as a directory; specific files may have different restrictions
And I do not know what that means.
I am new to this kind of stuff, so please write your answer as simply as possible.
Sorry for my bad English.
I think Google will automatically remove pages that return a 404 error from its index. Google will not display these pages in the results, so you don't need to worry about that.
Just make sure that these pages are not linked from other pages. If they are, Google may try to index them from time to time. In that case you should return a 301 (Moved Permanently) redirect to the correct URL. Google will follow 301 redirects and use the target URL instead.
Robots.txt is only necessary if you want to remove pages that are already in the search results. But I think pages with a 404 status code will not be displayed there anyway.
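For reference, if you do decide to use robots.txt: `noindex:` is not a supported robots.txt directive, which is why the webmaster tools test reported the URL as "Allowed". Blocking crawling of those generated URLs would use the standard `Disallow:` form instead:

```
User-agent: *
Disallow: /objekten/anzeigen/haus_antea/5-0000001575
```

A `Disallow` rule is a prefix match, so this covers every URL starting with that path; no trailing wildcard is needed.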
So I have a website that I recently made changes to, and one of the changes was removing a page from the site. I deleted the page, it doesn't exist anymore.
However, when you search for my site, one of the results is the page that I deleted. People are clicking on the page and getting an error.
How do I remove that page from the search results?
Here is the solution
First add your site to Google Webmaster Tools. Then go to Site Configuration --> Crawler Access --> Remove URL. Click "New removal request", add the page you want to remove, and make sure you have added that page to your site's robots.txt. Google will deindex the page within 24 hours.
You simply wait for Google's robots to find out that it doesn't exist anymore.
A trick that used to work is to upload a sitemap to Google in which you add the URL of the deleted page, set it to top priority, and mark it as changing every day. That way the Google robots will prioritize that page and find out more quickly that it's not there anymore.
There might be other ways but none that are known to me.
You can remove specific pages using the webmaster tools I believe.
Yahoo Web tools offer a similar service as I understand it.
This information was correct the last time I tried to do this a little while ago.
You should go to https://www.google.com/webmasters/tools/removals and remove the pages which you want.
Is it possible to fine-tune directives to Google to such an extent that it will ignore part of a page, yet still index the rest?
There are a couple of different issues we've come across which would be helped by this, such as:
RSS feed/news ticker-type text on a page displaying content from an external source
users entering contact details (phone numbers etc.) who want them visible on the site but would rather they not be Google-able
I'm aware that both of the above can be addressed via other techniques (such as writing the content with JavaScript), but am wondering if anyone knows if there's a cleaner option already available from Google?
I've been doing some digging on this and came across mentions of googleon and googleoff tags, but these seem to be exclusive to Google Search Appliances.
Does anyone know if there's a similar set of tags to which Googlebot will adhere?
Edit: Just to clarify, I don't want to go down the dangerous route of cloaking/serving up different content to Google, which is why I'm looking to see if there's a "legit" way of achieving what I'd like to do here.
What you're asking for can't really be done; Google either takes the entire page or none of it.
You could do some sneaky tricks though like insert the part of the page you don't want indexed in an iFrame and use robots.txt to ask Google not to index that iFrame.
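A sketch of that iframe trick, with hypothetical file and directory names:

```html
<!-- main page: indexed normally; the sensitive part lives in a framed file -->
<p>This text will be indexed.</p>
<iframe src="/noindex/ticker.html"></iframe>

<!-- and in robots.txt, block crawling of the framed file's directory:
User-agent: *
Disallow: /noindex/
-->
```

The framed content still renders for visitors, but a robots.txt-compliant crawler never fetches `/noindex/ticker.html`.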
In short, NO - unless you use cloaking, which is discouraged by Google.
Please check out the official documentation from here
http://code.google.com/apis/searchappliance/documentation/46/admin_crawl/Preparing.html
Go to section "Excluding Unwanted Text from the Index"
<!--googleoff: index-->
here will be skipped
<!--googleon: index-->
I found a useful resource for keeping certain duplicate content out of the search engine index:
<p>This is normal (X)HTML content that will be indexed by Google.</p>
<!--googleoff: index-->
<p>This (X)HTML content will NOT be indexed by Google.</p>
<!--googleon: index-->
On your server, detect the search bot by IP using PHP or ASP. Then serve the IP addresses on that list a version of the page you wish to have indexed. In that search-engine-friendly version of your page, use the canonical link tag to point the search engine away from the page version that you do not want indexed.
This way, only the content you wish to be indexed will be indexed, and the page will be indexed under its canonical address only. This method will not get you blocked by the search engines and is completely safe.
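The canonical link tag the answer mentions is a single element in the page's head; the URL here is a placeholder:

```html
<!-- tells search engines which URL is the preferred, indexable version -->
<link rel="canonical" href="https://www.example.com/preferred-page/">
```

Note that `rel="canonical"` is a hint about which duplicate to index, not a noindex directive, so on its own it does not guarantee content stays out of the index.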
Yes, you can definitely stop Google from indexing some parts of your website by creating a custom robots.txt and listing the portions you don't want indexed, like wp-admin or a particular post or page. Before creating it, check your site's existing robots.txt, for example at www.yoursite.com/robots.txt.
All search engines either index or ignore the entire page. The only possible way to implement what you want is to:
(a) have two different versions of the same page
(b) detect the browser used
(c) If it's a search engine, serve the second version of your page.
This link might prove helpful.
There are meta-tags for bots, and there's also the robots.txt, with which you can restrict access to certain directories.
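A sketch of both mechanisms mentioned above, with hypothetical paths:

```html
<!-- in the <head> of a page you want kept out of the index -->
<meta name="robots" content="noindex, nofollow">

<!-- and in robots.txt, restricting crawler access to a directory:
User-agent: *
Disallow: /private/
-->
```

The two behave differently: the meta tag lets the page be crawled but keeps it out of the index, while the robots.txt rule blocks crawling entirely (and a blocked page can still be indexed from links alone).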