I'm working with Symfony 1.4 and I want to stop Google from indexing my website. What's the best code to use?
robots: noindex, nofollow
(Note that "disallow" is not a valid value for the robots meta tag; the recognized values are noindex and nofollow.)
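In Symfony 1.4 that value goes into the metas section of view.yml, which the include_metas() helper in your layout turns into the meta element. A minimal sketch, assuming the default frontend application layout:

```yaml
# apps/frontend/config/view.yml
default:
  metas:
    robots: noindex, nofollow
```

This renders <meta name="robots" content="noindex, nofollow" /> on every page, provided the layout calls include_metas() in its head.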
Just have a look at the robots.txt approach:
User-agent: *
Disallow: /
They do different things:
By using a robots.txt, you can disallow crawling of your site.
By using a meta-robots element, you can disallow indexing of your site.
You can't effectively combine them: if robots.txt blocks crawling, the crawler never fetches the page and so never sees the meta-robots element.
So you have to decide:
Googlebot should never visit your URLs, but it might index your URLs (learning about them from different sources): use robots.txt.
Googlebot should never index your URLs, but it might visit them: use meta-robots.
Related
I want to disallow all bots from crawling and indexing a site, except Googlebot. I want to allow Google to index the index (/) URL, but nothing else. Preferably in robots.txt.
Do you have any ideas on how to achieve this? Thanks!
You'll need to use a robots.txt file.
Just create one in the public folder of the website and add:
User-agent: Googlebot
Allow: /$
Disallow: /

User-agent: *
Disallow: /
This allows Googlebot to crawl only the index (/) URL ($ anchors the end of the URL, and Google honors the longest matching rule), while all other well-behaved crawlers are blocked from the whole website.
For details refer: https://support.google.com/webmasters/answer/6062596?hl=en
I know this could be considered off topic, but I still thought I'd answer in case it helps.
As John Conde suggested, try webmasters....
I created a new website and I do not want it to be crawled by search engines as well as not appear in search results.
I already created a robots.txt
User-agent: *
Disallow: /
I have an HTML page. I wanted to use
<meta name="robots" content="noindex">
but Google's documentation says it should only be used on pages that are not blocked by robots.txt, because a crawler blocked by robots.txt never fetches the page and so never sees the noindex tag at all.
Is there any way I can use both noindex as well as robots.txt?
There are two solutions, neither of which are elegant.
You are correct that even if you Disallow: /, your URLs might still appear in the search results, likely without a meta description and with a Google-generated title.
Assuming you are only doing this temporarily, the recommended approach is to put basic HTTP auth in front of your site. This isn't great, since users will have to enter a username and password, but it will prevent your site from being crawled and indexed.
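With Apache, a minimal sketch of that protection in an .htaccess file (the .htpasswd path and realm name here are assumptions; create the password file with the htpasswd utility):

```apacheconf
# .htaccess — require a login for the whole site
AuthType Basic
AuthName "Staging site"
AuthUserFile /var/www/.htpasswd
Require valid-user
```

Crawlers then receive a 401 response and can neither crawl nor index anything behind it.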
If you can't or don't want to put basic auth in front of your site, the alternative is to keep Disallow: / in your robots.txt file and use Google Search Console to regularly purge the Google index by requesting that the site be removed from the index.
This is inelegant in multiple ways.
You'll have to monitor the search results to see if URLs get indexed
You'll have to manually request the removal in the Google Search Console
Google really didn't intend for the removal feature to be used in this fashion, and who knows if they'll start ignoring your requests over time. But I'd imagine it would actually continue to work even though they'd prefer you didn't use it that way.
How do I tell crawlers / bots not to index any URL that has /node/ pattern?
The following rule has been in my robots.txt since day one, but I noticed that Google has still indexed a lot of URLs that have
/node/ in them, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there a directive that says: do not index any URL that contains /node/?
Should I write something like the following:
Disallow: /node/*
Update:
The real problem is despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory; it is how Drupal 6 exposes its content. I guess that's my problem: node is not a directory, merely part of the URLs Drupal generates for content. How do I handle this? Will the following work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any URL that starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
Disallow: /node/* is exactly what you want to do. The major search engines support wildcards in their robots.txt notation, and the * character means "any sequence of characters". See Google's notes on robots.txt for more.
update
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the X-Robots-Tag HTTP header. With Apache, this can be done by placing the following in an .htaccess file in your node directory:
Header set X-Robots-Tag "noindex"
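Since /node/ in this case is a URL path rather than a physical directory, a hedged alternative (assuming Apache 2.4+ with mod_headers enabled) is to set the header conditionally from the site root's .htaccess:

```apacheconf
# Send X-Robots-Tag: noindex for any URL containing /node/
<If "%{REQUEST_URI} =~ m#/node/#">
    Header set X-Robots-Tag "noindex"
</If>
```

Unlike a robots.txt Disallow, this lets crawlers fetch the pages and then tells them not to index, which is what actually removes URLs from the index.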
Your original Disallow was fine. Jim Mischel's comment seemed spot on and makes me wonder whether it was just taking time for Googlebot to fetch the updated robots.txt and then drop the relevant pages from the index.
A couple additional thoughts:
First, your page URLs may appear in Google search results even if you've blocked them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.
I'm working on a big website and I want to put it online before it's fully finished.
I'm working locally and the database is getting really big, so I wanted to upload the website and continue working on it on the server, while allowing people in so I can test.
The question is whether this is OK for SEO; there are a lot of SEO-related things that are incomplete. For example: there are no friendly URLs, no sitemap, no .htaccess file, and lots of 'in-construction' sections.
Will Google penalize me forever? How does it work? Does Google index and record the structure of the site just once, or is it constantly updating and checking for changes? Will using User-agent: * Disallow: / in robots.txt fully stop Google from indexing it? Can I change the robots.txt file later and have Google index it again? What do you recommend?
Sure, just put a robots.txt file in your web root so that Google doesn't start indexing it.
Like this:
User-agent: *
Disallow: /
This is how I understand the issue:
Google will reach your website if someone submits your URL at http://www.google.com/addurl/ or if there is a link to your website on another already-indexed website.
When Google reaches your website, it will look at the robots.txt and see what rules are there. If you disallow it with rules like the following, Google will not index your website for the moment.
User-agent: *
Disallow: /
But Google will visit your website again after some days and do the same as the first time. If it doesn't find the robots.txt, or finds rules that allow it in, like the following, it will start indexing the website's pages and content.
User-agent: *
Allow: /
As for putting the website online now or not: if you disallow Google with robots.txt, it makes no difference, so go with whatever works better for you.
Note:
I am not 100% sure about the rules I mentioned in this answer, as Google is always changing its indexing techniques.
Also, what I said about Google applies to other search engines such as Yahoo and Bing as well, but it's not a guaranteed rule for every search engine, just common behavior; another search engine might still index all your links even though your robots.txt disallows it.
I used to put a staging version of my websites on the live environment to test before launching the real version, always with a robots.txt like this, and I never found any of those staging links in Google, Bing, or Yahoo.
As long as your security is not beta quality, it's a good idea to get your site online as early as possible.
Google indexes your site periodically, and will index more frequently as it detects more frequent changes and/or your pagerank increases.
Let's say I have a web site for hosting community generated content that targets a very specific set of users. Now, let's say in the interest of fostering a better community I have an off-topic area where community members can post or talk about anything they want, regardless of the site's main theme.
Now, I want most of the content to get indexed by Google. The notable exception is the off-topic content. Each thread has its own page, but all the threads are listed in the same folder, so I can't just exclude search engines from a folder somewhere. It has to be per-page. A traditional robots.txt file would get huge, so how else could I accomplish this?
This will work for all well-behaving search engines, just add it to the <head>:
<meta name="robots" content="noindex, nofollow" />
If using Apache, I'd use mod_rewrite to alias robots.txt to a script that could dynamically generate the necessary content.
Edit: If using IIS you could use ISAPI_Rewrite to do the same.
You can implement it by substituting robots.txt with a dynamic script that generates the output.
With Apache you could make a simple .htaccess rule to achieve that.
RewriteRule ^robots\.txt$ /robots.php [NC,L]
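The rewrite target (robots.php above, but any server-side language works) only needs to print the rules. A minimal Python sketch of the generator's logic, where the /offtopic/thread- URL prefix and the thread IDs are assumptions for illustration:

```python
def build_robots(offtopic_thread_ids, prefix="/offtopic/thread-"):
    """Build robots.txt content disallowing each off-topic thread page."""
    lines = ["User-agent: *"]
    # One Disallow line per off-topic thread, everything else stays crawlable.
    lines += [f"Disallow: {prefix}{tid}" for tid in offtopic_thread_ids]
    return "\n".join(lines) + "\n"

print(build_robots([101, 205]))
```

In practice the script would pull the IDs from the database, so robots.txt stays current without manual edits.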
Similarly to James Marshall's suggestion, in ASP.NET you could use an HttpHandler to redirect calls to robots.txt to a script which generates the content.
Just for those threads, make sure your <head> contains a noindex meta tag. That's one more way to tell search engines not to index a page, besides blocking it in robots.txt.
Just keep in mind that a robots.txt disallow will NOT prevent Google from indexing pages that are linked from external sites; all it does is prevent crawling. See http://www.webmasterworld.com/google/4490125.htm or http://www.stonetemple.com/articles/interview-matt-cutts.shtml.
You can prevent search engines from reading or indexing your content by using restrictive robots meta tags. That way, spiders will follow your instructions and index only the pages you want.
To block dynamic web pages with robots.txt, use rules like these:
User-agent: *
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
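To see how such wildcard rules evaluate, here is a hedged Python sketch of Google-style robots.txt matching (most specific, i.e. longest, matching rule wins, and Allow wins ties; real parsers handle more edge cases):

```python
import re

# The rules from the robots.txt above, in order.
RULES = [
    ("Disallow", "/setnewsprefs?"),
    ("Disallow", "/index.html?"),
    ("Disallow", "/?"),
    ("Allow", "/?hl="),
    ("Disallow", "/?hl=*&"),
]

def to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return "^" + body + ("$" if anchored else "")

def is_allowed(path, rules=RULES):
    best = None  # (pattern length, verdict) of the most specific match
    for verdict, pattern in rules:
        if re.match(to_regex(pattern), path):
            if (best is None or len(pattern) > best[0]
                    or (len(pattern) == best[0] and verdict == "Allow")):
                best = (len(pattern), verdict)
    return best is None or best[1] == "Allow"

print(is_allowed("/?hl=en"))      # Allow: /?hl= is the longest match here
print(is_allowed("/?hl=en&q=x"))  # Disallow: /?hl=*& is longer still
```

So /?hl=en stays crawlable while /?hl=en&q=x, with extra query parameters, is blocked.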