In robots.txt can I write the following relative URL for the sitemap file?
sitemap: /sitemap.ashx
Or do I have to use the complete (absolute) URL for the sitemap file, like:
sitemap: http://subdomain.domain.com/sitemap.ashx
Why I wonder:
I own a new blog service, www.domain.com, that allows users to blog at accountname.domain.com.
I use wildcards, so all subdomains (accounts) point to: "blog.domain.com".
In blog.domain.com I put the robots.txt to let search engines find the sitemap.
But, due to the wildcards, all user accounts share the same robots.txt file. That's why I can't use the second alternative. And for now I can't use URL rewriting for .txt files. (I guess that later versions of IIS can handle this?)
According to the official documentation on sitemaps.org it needs to be a full URL:
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line including the full URL to the sitemap:
Sitemap: http://www.example.com/sitemap.xml
Google's crawlers won't resolve relative URLs here, which is why it's always recommended to use absolute URLs for better crawlability and indexability.
Therefore, you cannot use this variation:
> sitemap: /sitemap.xml
Recommended syntax is
Sitemap: https://www.yourdomain.com/sitemap.xml
Note:
Don't forget to capitalise the first letter in "sitemap"
Don't forget to put a space after "Sitemap:"
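For context, here is a minimal robots.txt sketch (with a placeholder domain) showing where the sitemap line typically sits alongside the usual directives:
User-agent: *
Disallow:
Sitemap: https://www.yourdomain.com/sitemap.xml
The Sitemap line is independent of the User-agent groups, so it can appear anywhere in the file.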
Good technical & logical question my dear friend.
No, in a robots.txt file you can't use a relative URL for the sitemap; you need to use the complete URL of the sitemap.
It's better to go with "sitemap: https://www.example.com/sitemap_index.xml"
In the line above, note the space after the colon.
I'd also like to support Deepak's answer.
Is it possible to add SEO tags such as 'noodp' to a robots.txt file instead of using <meta> tags? I am trying to avoid messing with our CMS template, although I suspect that I may have to...
Could I try something similar to this...
User-Agent: *
Disallow: /hidden
Sitemap: www.example.com
noodp:
I think robots.txt takes precedence over meta tags? For noindex for instance, the crawler will not even see the page in question. For something like noodp however, is this still the case?
You can't do this with robots.txt, but you can get the same effect using the X-Robots-Tag response header.
Add something like this to the appropriate part of your .htaccess file:
Header set X-Robots-Tag "noodp"
This tells the server to include the following line in the response headers:
X-Robots-Tag: noodp
Search engines (that support X-Robots-Tag) will interpret this header line exactly the same way they would interpret 'noodp' in a robots meta tag. In general, you can put anything in an X-Robots-Tag header that you can put in a robots meta tag. Note that the page must not be blocked by robots.txt, otherwise the crawler will never request the page, and will therefore never see the header.
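If you only want the header on certain pages rather than site-wide, a scoped sketch (assuming Apache with mod_headers enabled, and a hypothetical match on .html files) would be:
<FilesMatch "\.html$">
  Header set X-Robots-Tag "noodp"
</FilesMatch>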
No.
The goal of the robots.txt file is to tell robots what they may crawl, not what they should do with what they have crawled.
<meta> robots (or X-Robots-Tag instructions) and robots.txt instructions are two very distinct things.
Google gives good information about this in its article Learn about robots.txt file:
robots.txt should only be used to control crawling traffic
If you want to add some robots instructions without messing with your CMS, HTTP header X-Robots-Tag might be a good solution. You can try to add it through your server config.
I'd like to start using specific landing pages in a marketing campaign. A quick search on google shows how to disallow specific pages and/or directories using a robots.txt file. (link)
If I don't want the search engines to index these landing pages, should I put single-page entries in the robots.txt file, or should I put them in specific directories and disallow the directories?
My concern is that anybody can read a robots.txt file and if the actual page names are visible within the robots.txt file it defeats the purpose.
"It defeats the purpose." How so? The purpose of robots.txt is to prevent crawlers from reading particular files or groups of files. Whether you exclude the individual files or put them all in a directory and exclude that directory is irrelevant as far as the crawler's behavior is concerned.
The benefit to putting them all in directories is that your robots.txt file is smaller and easier to manage. You don't have to add a new entry every time you create a new landing page.
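For example, with all landing pages grouped under a hypothetical /landing/ directory, one rule covers present and future pages:
User-agent: *
Disallow: /landing/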
You're right that putting a file name in robots.txt lets anybody who reads the file know that the file is there. That shouldn't be a problem. If you have sensitive information that you don't want others to see then it shouldn't be accessible, regardless of whether it's mentioned in robots.txt. Because if the file is publicly accessible, then a bot is going to find it even if you don't mention it in robots.txt.
robots.txt is just a guideline. The existence of a disallow line in robots.txt doesn't prevent an unfriendly crawler from looking at those pages. It just tells the crawler that you don't want them looking at those pages. But crawlers can ignore robots.txt. They shouldn't, and you can block them if they do, but robots.txt itself is more like a stop sign than a road block.
You should be able to simply use the NOINDEX META tag in the HEAD of your page.
http://www.robotstxt.org/meta.html
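That is, something like this in the <head> of each landing page:
<meta name="robots" content="noindex">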
How do I tell crawlers / bots not to index any URL that has /node/ pattern?
The following has been in place since day one, but I noticed that Google has still indexed a lot of URLs that have
/node/ in them, e.g. www.mywebsite.com/node/123/32
Disallow: /node/
Is there anything that states: do not index any URL that has /node/ in it?
Should I write something like following:
Disallow: /node/*
Update:
The real problem is despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this URL e.g. www.mywebsite.com/node/123/32
/node/ is not a physical directory; this is how Drupal 6 exposes its content. I guess this is my problem: node is not a directory, merely part of the URLs Drupal generates for the content. How do I handle this? Will the following work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any url that starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
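Putting that together, a minimal robots.txt sketch for your case (where /node/ always follows the host) would be:
User-agent: *
Disallow: /node/
The */node/ form is only needed if /node/ can also appear deeper in the path.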
Disallow: /node/* is exactly what you want to do. Search engines support wildcards in their robots.txt notation, and the * character means "any characters". See Google's notes on robots.txt for more.
Update:
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the X-Robots-Tag HTTP header. This can be done by placing the following in an .htaccess file in your node directory:
Header set X-Robots-Tag "noindex"
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've included it in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health->"Fetch as Google" to see real time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.
We have a blog in a sub-directory of the main URL:
http://www.domain.com/blog/
The blog is run by WordPress and we are using Google Sitemap Generator to create the XML file.
We have an index of all of our sitemaps in the main sitemap.xml which leads to many sitemaps.
From an SEO standpoint would it be best to link directly to the sitemap that is under the blog directory:
e.g. http://www.domain.com/blog/sitemap.xml
or should we do a (daily) cron job to copy the file to the main domain's directory:
e.g. http://www.domain.com/sitemap_blog.xml
which will be linked from the main index with the other sitemaps.
What is the best way from an SEO standpoint?
It doesn't matter where the sitemap is; you will want to register its location with the search engines you want to be able to find it. The main thing, though, is to have a link to your sitemap location in the robots.txt file using the following line:
Sitemap: <sitemap_location>
Your robots.txt file should be in your domain's root.
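In your case that line would point at the blog sitemap (or at a sitemap index that references it), e.g.:
Sitemap: http://www.domain.com/blog/sitemap.xml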
Let's say I have a web site for hosting community generated content that targets a very specific set of users. Now, let's say in the interest of fostering a better community I have an off-topic area where community members can post or talk about anything they want, regardless of the site's main theme.
Now, I want most of the content to get indexed by Google. The notable exception is the off-topic content. Each thread has its own page, but all the threads are listed in the same folder, so I can't just exclude search engines from a folder somewhere. It has to be per-page. A traditional robots.txt file would get huge, so how else could I accomplish this?
This will work for all well-behaved search engines; just add it to the <head>:
<meta name="robots" content="noindex, nofollow" />
If using Apache I'd use mod_rewrite to alias robots.txt to a script that could dynamically generate the necessary content.
Edit: If using IIS you could use ISAPIrewrite to do the same.
You can implement it by substituting robots.txt with a dynamic script that generates the output.
With Apache you could make a simple .htaccess rule to achieve that.
RewriteRule ^robots\.txt$ /robots.php [NC,L]
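A minimal sketch of what such a robots.php could output (the thread-lookup function and the URL pattern are hypothetical placeholders for your own data layer):
<?php
// Hypothetical sketch: emit a Disallow line for every off-topic thread.
header('Content-Type: text/plain');
echo "User-agent: *\n";
foreach (get_off_topic_thread_ids() as $id) { // hypothetical lookup against your database
    echo "Disallow: /threads/" . $id . "\n";  // hypothetical URL pattern
}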
Similarly to James Marshall's suggestion, in ASP.NET you could use an HttpHandler to redirect calls to robots.txt to a script which generates the content.
Just for that thread, make sure your <head> contains a noindex meta tag. That's one more way to tell search engines not to index your page, other than blocking it in robots.txt.
Just keep in mind that a robots.txt disallow will NOT prevent Google from indexing pages that have links from external sites, all it does is prevent crawling internally. See http://www.webmasterworld.com/google/4490125.htm or http://www.stonetemple.com/articles/interview-matt-cutts.shtml.
You can stop search engines from reading or indexing your content by using restrictive robots meta tags. That way, spiders will follow your instructions and index only the pages you want.
To block dynamic web pages with robots.txt, use rules like these:
User-agent: *
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&