Correct way to add robots.txt and hide it? - authentication

I have a secret folder on my hosting that visitors must not be able to see. I've added a robots.txt to htdocs:
User-agent: *
Disallow: /super-private/
However, if a visitor goes to http://example.com/robots.txt, they can see the name of the private folder. Is there anything to be done? .htaccess maybe?

robots.txt is not the solution here. All it does is tell things like search engine spiders that a particular URL should not be indexed; it doesn't prevent access.
Put a .htaccess file in super-private containing the following:
Deny From All
Once you've done this, there's no need for robots.txt, as it'll be inaccessible anyway. If you want to allow access to certain people, then look into authentication with .htaccess.
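For example, a minimal sketch of .htaccess Basic authentication might look like this (the /path/to/.htpasswd path is a placeholder for wherever you keep the password file, which is created with Apache's htpasswd utility):
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
Require valid-user
Only users listed in that .htpasswd file can then reach the folder; everyone else gets a 401.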

Don't mention this private folder in robots.txt. Then simply deny access to it with .htaccess:
deny from all
Also, if there are no links to this super-private folder on your other pages, robots should never know of its existence, but denying access is still a good thing to do if this folder should never be directly accessed by clients.
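Note that deny from all is Apache 2.2 syntax; on Apache 2.4 and later (without the mod_access_compat module) the equivalent directive would be:
Require all denied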

Related

Single robots.txt file for all subdomains

I have a site (example.com) and have my robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt files will all be identical to that of example.com. I know that I can place a file at the root of each subdomain, but I'm wondering if it's possible to redirect crawlers searching for robots.txt on any subdomain to example.com/robots.txt?
Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say that redirects "will be generally followed". If you add your subdomains into Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test the subdomain robots.txt files that are 301 redirecting. They should come back as working.
However, with that said, I would strongly suggest that you just symlink the files into place so that each robots.txt responds with a 200 OK at the appropriate URL. This is much more in line with the original robots.txt specification as well as Google's documentation, and who knows exactly how Bing or Yahoo will handle redirects over time.
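For example, if every subdomain has its own document root, the symlink approach might just be a couple of commands like these (the directory layout is an assumption; adjust the paths to your hosting setup):
ln -s /var/www/example.com/robots.txt /var/www/foo.example.com/robots.txt
ln -s /var/www/example.com/robots.txt /var/www/bar.example.com/robots.txt
Each subdomain then serves the shared file with a 200 OK at its own /robots.txt URL, and there's only one copy to maintain.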

How to Disallow Landing Pages Using robots.txt file?

I'd like to start using specific landing pages in a marketing campaign. A quick search on Google shows how to disallow specific pages and/or directories using a robots.txt file. (link)
If I don't want the search engines to index these landing pages, should I put single-page entries in the robots.txt file, or should I put them in specific directories and disallow the directories?
My concern is that anybody can read a robots.txt file and if the actual page names are visible within the robots.txt file it defeats the purpose.
"It defeats the purpose." How so? The purpose of robots.txt is to prevent crawlers from reading particular files or groups of files. Whether you exclude the individual files or put them all in a directory and exclude that directory is irrelevant as far as the crawler's behavior is concerned.
The benefit to putting them all in directories is that your robots.txt file is smaller and easier to manage. You don't have to add a new entry every time you create a new landing page.
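For instance, if the landing pages all lived under one directory (the /landing/ name here is just a placeholder), a single rule covers every current and future page in it:
User-agent: *
Disallow: /landing/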
You're right that putting a file name in robots.txt lets anybody who reads the file know that the file is there. That shouldn't be a problem. If you have sensitive information that you don't want others to see then it shouldn't be accessible, regardless of whether it's mentioned in robots.txt. Because if the file is publicly accessible, then a bot is going to find it even if you don't mention it in robots.txt.
robots.txt is just a guideline. The existence of a disallow line in robots.txt doesn't prevent an unfriendly crawler from looking at those pages. It just tells the crawler that you don't want them looking at those pages. But crawlers can ignore robots.txt. They shouldn't, and you can block them if they do, but robots.txt itself is more like a stop sign than a road block.
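And if a particular crawler does ignore robots.txt, one way to turn the stop sign into a road block is an .htaccess rule keyed on its User-Agent string. A rough sketch ("BadBot" is a placeholder for the offending agent):
SetEnvIfNoCase User-Agent "BadBot" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot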
You should be able to simply use the NOINDEX META tag in the HEAD of your page.
http://www.robotstxt.org/meta.html
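A minimal example of that tag, placed in the <head> of each landing page:
<meta name="robots" content="noindex">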

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images I use for the website)
/php/ (PHP scripts that return a blank page, for example checkemail.php, which checks an email address, or register.php, which puts data into a SQL database and sends an email)
/error/ (my 401, 403, 404, 406, and 500 HTML error pages)
/include/ (header.html and footer.html I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site preview, and Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it's disallowed in robots.txt, the same way as CSS and JavaScript. Putting your images in robots.txt generally prevents them from being indexed in Google image search. Google image search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting the URLs that perform actions. Good call to use robots.txt there.
/error -- If your site is set up correctly, the spiders will probably never know what directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
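Putting that together, a robots.txt along these lines would match the advice above (a sketch, assuming you take the suggestion to leave the other folders crawlable):
User-agent: *
Disallow: /php/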

preventing from the site any external links

I am using DokuWiki, and since we've tried to secure it as much as possible, the best remaining protection for us is to keep its location on our server secret. Therefore we want to make sure no link can be clicked on any page that would reveal the location of our infrastructure. Is there any way to configure this restriction in DokuWiki, or are there known ways to pass URLs through a third party?
Have you tried protecting the site with .htaccess and .htpasswd? It's a good solution for keeping others out of your site.
And if the site is online, you should include a robots.txt to keep crawlers from indexing it:
User-agent: *
Disallow: /
Hope this helps.
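If you go the .htpasswd route, the password file is usually created with Apache's htpasswd utility, something like this (the path and user name are placeholders):
htpasswd -c /path/to/.htpasswd wikiuser
The .htaccess in the wiki directory then points its AuthUserFile directive at that file and requires a valid user.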

robots.txt - exclude any URL that contains "/node/"

How do I tell crawlers / bots not to index any URL that has /node/ pattern?
The following has been in place since day one, but I noticed that Google has still indexed a lot of URLs that have /node/ in them, e.g. www.mywebsite.com/node/123/32:
Disallow: /node/
Is there anything that states "do not index any URL that has /node/ in it"?
Should I write something like following:
Disallow: /node/*
Update:
The real problem is that despite:
Disallow: /node/
in robots.txt, Google has indexed pages under this path, e.g. www.mywebsite.com/node/123/32.
/node/ is not a physical directory; this is how Drupal 6 exposes its content. I guess this is my problem: node is not a directory, merely part of the URLs Drupal generates for the content. How do I handle this? Will this work?
Disallow: /*node
Thanks
Disallow: /node/ will disallow any URL that starts with /node/ (after the host). The asterisk is not required.
So it will block www.mysite.com/node/bar.html, but will not block www.mysite.com/foo/node/bar.html.
If you want to block anything that contains /node/, you have to write Disallow: */node/
Note also that Googlebot can cache robots.txt for up to 7 days. So if you make a change to your robots.txt today, it might be a week before Googlebot updates its copy of your robots.txt. During that time, it will be using its cached copy.
Disallow: /node/* is exactly what you want to do. Search engines support wildcards in their robots.txt notation, and the * character means "any characters". See Google's notes on robots.txt for more.
Update:
An alternative way to make sure search engines stay out of a directory, and all directories below it, is to block them with the robots HTTP header. This can be done by placing the following in an htaccess file in your node directory:
Header set x-robots-tag: noindex
Your original Disallow was fine. Jim Mischel's comment seemed spot on and would cause me to wonder if it was just taking time for Googlebot to fetch the updated robots.txt and then unindex relevant pages.
A couple additional thoughts:
Your page URLs may appear in Google search results even if you've included them in robots.txt. See: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449 ("...While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web."). To many people, this is counter-intuitive.
Second, I'd highly recommend verifying ownership of your site in Google Webmaster Tools (https://www.google.com/webmasters/tools/home?hl=en), then using tools such as Health -> "Fetch as Google" to see real-time diagnostics related to retrieving your page. (Does that result indicate that robots.txt is preventing crawling?)
I haven't used it, but Bing has a similar tool: http://www.bing.com/webmaster/help/fetch-as-bingbot-fe18fa0d . It seems well worthwhile to use diagnostic tools provided by Google, Bing, etc. to perform real-time diagnostics on the site.
This question is a bit old, so I hope you've solved the original problem.