How can I prevent search engines from crawling a subdomain on my website? - Apache

I have cPanel installed on my website.
In cPanel, I went to the Domains section and clicked on Subdomains.
I assigned the subdomain name (e.g. personal.mywebsite.com).
It also asked me to assign a document root folder, so I assigned mywebsite.com/personal.
If I create a robots.txt in my website root (e.g. mywebsite.com) containing:
User-agent: *
Disallow: /personal/
Can it also block personal.mywebsite.com? What should I do?
Thanks

When you want to block URLs on personal.example.com, crawlers look for http://personal.example.com/robots.txt (or https instead of http).
It doesn't matter how your server organizes folders in the backend; the only thing that matters is which robots.txt is served when that URL is requested.
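In your cPanel setup the subdomain's document root is the /personal folder, so whatever robots.txt you place there is what crawlers receive at personal.mywebsite.com/robots.txt. A minimal sketch of that file (on disk it lives at mywebsite.com/personal/robots.txt):
# robots.txt served at personal.mywebsite.com/robots.txt
User-agent: *
Disallow: /
The Disallow: /personal/ rule in the main site's robots.txt is still useful, but it only covers mywebsite.com/personal/... URLs, not the subdomain.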

Related

Will a Drupal robots.txt Disallow still be recorded in the Apache log file?

I set up some rules in robots.txt for a specific user agent on my Drupal website.
I have one question: when this agent tries to access the website, will the access still be logged in my Apache access_log file?
Disallow in robots.txt does not technically prevent user agents from accessing your website; each user agent decides whether or not to honour your robots.txt.
By default, Drupal does nothing with the content of your robots.txt file, and the robots.txt content doesn't affect your server's logs at all, so any request the agent actually makes will still show up in access_log.
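If you want to actually refuse a specific agent at the server level rather than merely asking it to stay away, Apache can deny the requests itself. A minimal sketch for Apache 2.4, assuming mod_setenvif is available; "BadBot" is a placeholder for the agent string you see in your access_log:
# Sketch: deny a specific user agent instead of relying on robots.txt.
# "BadBot" is a hypothetical agent name; substitute the real one.
BrowserMatchNoCase "BadBot" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
Note that even denied requests are still written to access_log (with a 403 status), so this doesn't change the logging behaviour either.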

Noindex Only One Subdomain

I am having difficulty finding information on how to completely noindex one particular subdomain via .htaccess (from my understanding, that's the best way?). It is important to me that only that one subdomain and its files are never indexed or crawlable.
I have an Apache server that uses Plesk and the subdomain is for an email software we use for newsletter campaigns etc.
The subdomain is "mail" (e.g. https://mail.test.com) and my goal is to make only "mail" noindex, because the software has SEO features that can end up harming our site in general.
Create a robots.txt inside the subdomain's document root with the following content:
User-agent: *
Disallow: /
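Since the question asks about .htaccess specifically: Disallow only stops compliant crawlers from fetching pages, and a URL can still end up in the index if other sites link to it. A more direct way to keep the subdomain out of the index is an X-Robots-Tag response header. A sketch for an .htaccess in the subdomain's document root, assuming mod_headers is enabled:
# .htaccess sketch for the subdomain's document root (assumes mod_headers).
# Tells crawlers not to index or follow anything served from this host.
Header set X-Robots-Tag "noindex, nofollow"
If you go this route, don't also Disallow the whole subdomain in robots.txt: a crawler that is forbidden from fetching the pages never sees the header.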

Single robots.txt file for all subdomains

I have a site (example.com) with its robots.txt set up in the root directory. I also have multiple subdomains (foo.example.com, bar.example.com, and more to come in the future) whose robots.txt will all be identical to that of example.com. I know that I can place a file at the root of each subdomain, but I'm wondering if it's possible to redirect crawlers searching for robots.txt on any subdomain to example.com/robots.txt?
Sending a redirect header for your robots.txt file is not advised, nor is it officially supported.
Google's documentation specifically states:
Handling of robots.txt redirects to disallowed URLs is undefined and discouraged.
But the documentation does say that redirects will generally be followed. If you add your subdomains to Google Webmaster Tools and go to "Crawl > Blocked URLs", you can test the subdomain robots.txt files that are 301 redirecting; the test should come back as working.
However, with that said, I would strongly suggest that you just symlink the files into place so that each robots.txt responds with a 200 OK at the appropriate URL. This is much more in line with the original robots.txt specification, as well as Google's documentation, and who knows exactly how Bing / Yahoo will handle redirects over time.
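If the subdomains have separate document roots and symlinks are awkward in your setup, an Alias in each vhost achieves the same 200 OK without copying the file around. A sketch, assuming you control the Apache vhost configuration; the paths and hostnames are made up:
# Vhost sketch: each subdomain serves the main site's robots.txt with a 200 OK.
<VirtualHost *:80>
    ServerName foo.example.com
    DocumentRoot /var/www/foo
    # Point /robots.txt at the shared file instead of redirecting to it.
    Alias /robots.txt /var/www/example.com/robots.txt
</VirtualHost>
Either way, each host answers /robots.txt directly, so you never depend on how a given crawler treats redirects.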

Avoid Google indexing subdomains

As far as I have searched, I was not able to find a proper answer to this kind of problem.
I have a few domains set up on the same cPanel account.
One of them is the main domain, and the rest are secondary domains.
cPanel automatically creates subdomains when you add a secondary domain, something like:
http://secondary.maindomain.com
My problem is that Google indexed my pages under both addresses.
Like:
secondary.com/blabla.html
secondary.maindomain.com/blabla.html
How can I remove those pages from Google's index? And how can I prevent those subdomains from being indexed in the future?
For this purpose you can add a robots.txt with a Disallow rule to stop search engines from indexing your files or directories.
For example, to stop Google from indexing the subdomain, place a robots.txt with the following entries in the document root of your subdomain:
User-agent: Googlebot
Disallow: /
or for all search engines:
User-agent: *
Disallow: /
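Since the pages are already indexed under both hosts, another option is to 301 redirect the auto-created subdomain to the canonical domain; the duplicates then drop out of the index over time. A sketch for an .htaccess in the secondary domain's document root, assuming mod_rewrite and the hostnames from the question:
# .htaccess sketch: send the auto-created subdomain to the canonical domain.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^secondary\.maindomain\.com$ [NC]
RewriteRule ^(.*)$ http://secondary.com/$1 [R=301,L]
Pick one approach per host: if robots.txt blocks all crawling of the subdomain, Google can never see the redirects.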

Disallow crawling of the CDN site

So I have a site, http://www.example.com.
The JS/CSS/images are served from a CDN, http://xxxx.cloudfront.net or http://cdn.example.com; they are both the same thing. The CDN just serves any type of file, including my PHP pages, and Google somehow started crawling that CDN site as well; two sites actually, cdn.example.com and http://xxxx.cloudfront.net. Consider:
I am NOT trying to set up a subdomain or a mirror site. If that happens, it is a side effect of setting up a CDN.
The CDN is some web server, not necessarily Apache. I do not know what type of server it is.
There is no request processing on the CDN; it just fetches things from the origin server. I don't think you can put custom files on the CDN itself; whatever you need served by the CDN has to come from the origin server.
How do I prevent the crawling of the PHP pages?
Should I allow crawling of images from cdn.example.com or from example.com? The links to images inside the HTML all point to cdn.example.com. If I allow crawling of images only from example.com, there is practically nothing to crawl, since there are no links to such images. If I allow crawling of images from cdn.example.com, doesn't that leak away the SEO benefits?
Some alternatives that I considered, based on Stack Overflow answers:
Write a custom robots_cdn.txt and serve it based on HTTP_HOST, as suggested in many Stack Overflow answers.
Serve a new robots.txt from the subdomain. As I explained above, I do not think the CDN can be treated like a subdomain.
Do a 301 redirect to www.example.com when HTTP_HOST is cdn.example.com.
Suggestions?
A related question: How to disallow a mirror site (on a subdomain) using robots.txt?
You can put a robots.txt in your root directory so that it will be served at cdn.example.com/robots.txt. In this robots.txt you can disallow all crawlers with the following setting:
User-agent: *
Disallow: /
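One caveat: because the CDN pulls everything from the origin, that same file would normally also be served at www.example.com/robots.txt and block your main site. To disallow crawling only when robots.txt is requested via the CDN hostnames (the first alternative above), you can serve a separate file based on HTTP_HOST. A sketch for an .htaccess in the origin's document root, assuming mod_rewrite and a robots_cdn.txt (containing the Disallow-all rules) that you create next to the real robots.txt:
# Origin .htaccess sketch: serve robots_cdn.txt when requested via a CDN host.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^cdn\.example\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} \.cloudfront\.net$ [NC]
RewriteRule ^robots\.txt$ /robots_cdn.txt [L]
With this in place, www.example.com keeps its normal robots.txt while both CDN hostnames hand out the Disallow-all version.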