Will a Drupal robots.txt disallow still be recorded in the Apache log file?

I set up some rules in robots.txt for a specific user agent on my Drupal website.
I have one question: when this agent tries to access the website, will its requests still be logged in my Apache access_log file?

Disallow in robots.txt does not technically prevent user agents from accessing your website; each user agent decides for itself whether to honour your robots.txt.
Drupal does nothing with the content of your robots.txt file by default, and that content has no effect on your server's logs: Apache writes every request it handles to access_log, whether or not the requested path is disallowed.
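So yes, the access will be logged, and you can confirm it by grepping access_log for the agent's User-Agent string. A minimal sketch; the log path, IP address, and "BadBot" agent name are hypothetical stand-ins:

```shell
# Simulate one combined-format access_log entry for a bot that ignored robots.txt
echo '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /private/ HTTP/1.1" 200 512 "-" "BadBot/1.0"' > /tmp/access_log.sample

# The request shows up in the log even though /private/ is disallowed in robots.txt
grep -c 'BadBot' /tmp/access_log.sample   # prints 1
```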

Related

How can I prevent search engines from crawling a subdomain of my website?

I have cPanel installed on my website.
I went to the Domains section in cPanel and clicked on Subdomains.
I assigned the subdomain name (e.g. personal.mywebsite.com).
It also wanted me to assign a document root folder, so I assigned mywebsite.com/personal.
If I create a robots.txt in my website root (e.g. mywebsite.com) containing:
User-agent: *
Disallow: /personal/
will it also block personal.mywebsite.com? What should I do?
Thanks
When you want to block URLs on personal.example.com, crawlers will request http://personal.example.com/robots.txt (or the https equivalent).
It doesn't matter how your server organizes folders in the backend; all that matters is which robots.txt is served when that URL is accessed.
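For example, since personal.mywebsite.com's document root is the mywebsite.com/personal folder, a robots.txt placed inside that folder is what gets served at http://personal.mywebsite.com/robots.txt. A sketch that blocks the entire subdomain:

```
User-agent: *
Disallow: /
```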

Single robots.txt file for all subdomains

I have a site (example.com) with my robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt files will all be identical to that of example.com. I know I can place a file at the root of each subdomain, but I'm wondering if it's possible to redirect crawlers requesting robots.txt on any subdomain to example.com/robots.txt.
Sending a redirect for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say redirects "will be generally followed". If you add your subdomains to Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test the subdomain robots.txt files that 301 redirect; they should come back as working.
That said, I would strongly suggest that you simply symlink the files into place, so that each robots.txt responds with a 200 OK at the appropriate URL. That is much more in line with the original robots.txt specification as well as Google's documentation, and who knows how Bing or Yahoo will handle redirects over time.
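The symlink approach can be sketched as follows, using temporary directories in place of the real document roots (all paths here are hypothetical):

```shell
# Stand-in docroots for example.com and foo.example.com
mkdir -p /tmp/docroots/example.com /tmp/docroots/foo.example.com

# The canonical robots.txt lives in the main site's docroot
printf 'User-agent: *\nDisallow: /private/\n' > /tmp/docroots/example.com/robots.txt

# Symlink it into each subdomain's docroot so every host serves it with a 200 OK
ln -sf /tmp/docroots/example.com/robots.txt /tmp/docroots/foo.example.com/robots.txt

# The subdomain now serves the identical rules
cat /tmp/docroots/foo.example.com/robots.txt
```

Updating the one file in example.com's docroot then updates every subdomain at once.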

Disallow crawling of the CDN site

So I have a site, http://www.example.com.
The JS/CSS/images are served from a CDN, http://xxxx.cloudfront.net or http://cdn.example.com; both are the same thing. The CDN serves any type of file, including my PHP pages, and Google somehow started crawling the CDN as well; two sites, actually: cdn.example.com and http://xxxx.cloudfront.net. Consider that:
I am NOT trying to set up a subdomain or a mirror site. If that happens, it is a side effect of setting up a CDN.
The CDN is some web server, not necessarily Apache; I do not know what type of server it is.
There is no request processing on the CDN. I don't think you can put custom files on it; whatever the CDN serves, it fetches from the origin server.
How do I prevent the crawling of the PHP pages?
Should I allow crawling of images from cdn.example.com or from example.com? The image links inside the HTML all point to cdn.example.com. If I allow crawling of images only from example.com, there is practically nothing to crawl, since there are no links to those images. If I allow crawling of images from cdn.example.com, doesn't that leak away the SEO benefits?
Some alternatives I considered, based on Stack Overflow answers:
Write a custom robots_cdn.txt and serve it based on HTTP_HOST, as suggested in many Stack Overflow answers.
Serve a new robots.txt from the subdomain. As explained above, I don't think the CDN can be treated like a subdomain.
Do a 301 redirect to www.example.com when HTTP_HOST is cdn.example.com.
Suggestions?
Related question: How to disallow a mirror site (on a subdomain) using robots.txt?
You can put a robots.txt in your root directory so that it is served at cdn.example.com/robots.txt. In that robots.txt you can disallow all crawlers with the following rule:
User-agent: *
Disallow: /
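The first alternative from the question (serving a different robots file depending on HTTP_HOST) can be sketched with Apache mod_rewrite on the origin server; the robots_cdn.txt filename is the question's own hypothetical, and this only works because the CDN fetches everything from the origin:

```apache
RewriteEngine On
# When the request arrives via the CDN hostname, serve the CDN-specific rules
RewriteCond %{HTTP_HOST} ^cdn\.example\.com$ [NC]
RewriteRule ^robots\.txt$ /robots_cdn.txt [L]
```

Requests for http://www.example.com/robots.txt are unaffected and still get the regular file.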

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images i use for the website)
/php/ (PHP scripts that return a blank page, e.g. checkemail.php, which validates an email address, or register.php, which writes data to a SQL database and sends an email)
/error/ (my error 401,403,404,406,500 html pages)
/include/ (header.html and footer.html I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you list them in robots.txt. Google uses them to render your pages for site preview, and has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it is in robots.txt, the same way as CSS and JavaScript. Putting your images in robots.txt generally prevents them from being indexed in Google Image Search, which may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting URLs that perform actions. Good call to use robots.txt.
/error -- If your site is set up correctly, spiders will probably never learn which directory your error pages are served from. Error pages are generally served at the URL that caused the error, so the spider never sees their actual URL. That isn't the case if you redirect to them, but redirecting to error pages isn't recommended practice anyway. So I would say there is no need to put them in robots.txt.
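Put together, the advice above amounts to a robots.txt that disallows only the action-performing PHP scripts and leaves everything else crawlable:

```
User-agent: *
Disallow: /php/
```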

Correct way to add robots.txt and hide it?

I have a secret folder on my hosting which should not be seen by visitors. I've added a robots.txt to htdocs:
User-agent: *
Disallow: /super-private/
However, if a visitor goes to http://example.com/robots.txt, they can see the name of the private folder. Is there anything to be done? .htaccess maybe?
robots.txt is not the solution here. All it does is tell things like search engine spiders that a particular URL should not be indexed; it doesn't prevent access.
Put a .htaccess file in super-private containing the following:
Deny From All
Once you've done this, there's no need to mention the folder in robots.txt, as it will be inaccessible anyway. If you want to allow access to certain people, look into authentication with .htaccess.
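Note that Deny From All is Apache 2.2 syntax; on Apache 2.4 and later (without the mod_access_compat compatibility module) the equivalent .htaccess content is:

```apache
# .htaccess inside super-private/ (Apache 2.4+ access control)
Require all denied
```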
Don't mention the private folder in robots.txt at all. Simply deny access to it with .htaccess:
deny from all
If no other pages link to this super-private folder, robots should never learn of its existence, but denying access is still the right thing to do if the folder should never be directly accessible to clients.