I am having difficulty finding information on how to completely noindex one particular subdomain via .htaccess (from my understanding, that's the best way?). It is important to me that only that one subdomain and its files are never indexed or crawlable.
I have an Apache server managed with Plesk, and the subdomain is for email software we use for newsletter campaigns etc.
The subdomain is "mail" (e.g. https://mail.test.com) and my goal is to noindex only "mail", because the software has SEO features that can wind up harming our site in general.
Create a robots.txt inside the subdomain's document root with the following content:
User-agent: *
Disallow: /
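Since the question asks for noindex specifically: robots.txt only stops compliant crawlers from crawling, and a blocked URL can still appear in search results. A complementary option is an X-Robots-Tag response header sent from that subdomain. A minimal .htaccess sketch for the mail subdomain's document root, assuming Apache's mod_headers is enabled:
# .htaccess in the mail subdomain's document root (requires mod_headers)
<IfModule mod_headers.c>
    # Ask crawlers not to index anything served from this host
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>
Note that a crawler has to be able to fetch a page to see this header, so if the goal is de-indexing, the header only works while robots.txt does not block crawling.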
Several domains are configured as add-ons to my primary hosting account (shared hosting).
The directory structure looks like this (primary domain is example.com):
public_html (example.com)
_sub
ex1 --> displayed as example-realtor.com
ex2 --> displayed as example-author.com
ex3 --> displayed as example-blogger.com
(The SO requirement to use example.com as the domain makes the explanation more difficult; for example, sub ex1 might point to plutorealty and ex2 might point to amazon, or some other business sub-hosting with me. The point is that each ex# is a different company's website, so mentally substitute something normal and different for each "example".)
Because these domains (ex1, ex2, etc.) are add-on domains, they are accessible in two ways (ideally, the second method is known only to me):
(1) http://example1.com
(2) http://example.com/_sub/ex1/index.php
Again, example1.com is a totally unrelated website/domain name, separate from example.com.
QUESTIONS:
(a) How will the site be indexed in search engines? Will both (1) and (2) show up in search results? (It is undesirable for method 2 to show up in Google.)
(b) Should I put a robots.txt in public_html that disallows each folder in the _sub folder? E.g.:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/
(c) Is there a more common way to configure add-on domains?
This robots.txt would be sufficient; you don't have to list anything that comes after /_sub/:
User-agent: *
Disallow: /_sub/
This would disallow bots (those that honor robots.txt) from crawling any URL whose path starts with /_sub/. But that doesn't necessarily stop these bots from indexing the URLs themselves (e.g., listing them in their search results).
Ideally you would redirect from http://example.com/_sub/ex1/ to http://example1.com/ with HTTP status code 301. How that works depends on your server (for Apache, you could use a .htaccess file). Then everyone ends up on the canonical URL for your site.
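A minimal sketch of that 301 in public_html/.htaccess, assuming Apache with mod_rewrite enabled (the ex1/example1.com pairing follows the question; repeat the rule for each add-on folder):
RewriteEngine On
# Redirect the internal /_sub/ path to the add-on domain's canonical URL
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^_sub/ex1/(.*)$ http://example1.com/$1 [R=301,L]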
Do not use multi-site setups like this carelessly: they affect the ranking of the main domain as well, and black-hat or spammy subdirectory sites can drag it down.
My suggestion: if the main site is the one that matters, set all of the subdomain sites to noindex.
robots.txt:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/
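For clarity, "noindex" here means a robots meta tag (or the equivalent X-Robots-Tag response header) on every page of the sub-sites, for example:
<meta name="robots" content="noindex, nofollow">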
I have cPanel installed on my website.
I went to the Domains section in cPanel.
I clicked on Subdomains.
I assigned the subdomain name (e.g. personal.mywebsite.com).
It also asked me to assign a document root folder; I assigned mywebsite.com/personal.
If I create a robots.txt in my website root (e.g. mywebsite.com) containing:
User-agent: *
Disallow: /personal/
Will it also block personal.mywebsite.com?
What should I do?
Thanks.
When you want to block URLs on personal.example.com, the robots.txt must be reachable at http://personal.example.com/robots.txt (or the https equivalent).
It doesn't matter how your server organizes folders on the backend; it only matters which robots.txt is served when accessing that URL.
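In this setup, that means placing the file in the subdomain's document root (mywebsite.com/personal per the question), so it is served at http://personal.mywebsite.com/robots.txt. A minimal sketch of its content:
User-agent: *
Disallow: /
A robots.txt in the main root with Disallow: /personal/ would only affect URLs under http://mywebsite.com/personal/, not the subdomain.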
As far as I have searched, I was not able to find a proper answer to this kind of problem.
I have a few domains installed on the same cPanel account.
One of them is the main domain, and the rest are secondary domains.
cPanel automatically creates subdomains when you add a secondary domain, something like:
http://secondary.maindomain.com
My problem is that Google has indexed my pages at both addresses, like:
secondary.com/blabla.html
secondary.maindomain.com/blabla.html
How can I remove those pages from Google's index?
And how can I avoid those subdomains being indexed in the future?
For this purpose you can add a robots.txt to the document root with Disallow rules to stop search engines such as Google from crawling your files or directories.
For example, to keep your subdomain out of Google, add the entries below to a robots.txt placed in the document root of your subdomain:
User-agent: Googlebot
Disallow: /
or for all search engines:
User-agent: *
Disallow: /
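One caveat: in the usual cPanel add-on setup, secondary.maindomain.com and secondary.com share the same document root, so a single blocking robots.txt would block both hosts. A sketch of serving a blocking file only to the auto-created subdomain, assuming Apache with mod_rewrite (robots_noindex.txt is a hypothetical file name):
RewriteEngine On
# Serve the blocking robots file only on the auto-created subdomain
RewriteCond %{HTTP_HOST} ^secondary\.maindomain\.com$ [NC]
RewriteRule ^robots\.txt$ robots_noindex.txt [L]
Here robots_noindex.txt would contain the User-agent/Disallow lines above. For pages that are already indexed, Google Search Console's URL removal tool can speed up de-indexing.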
So I have a site http://www.example.com.
The JS/CSS/images are served from a CDN: http://xxxx.cloudfront.net or http://cdn.example.com; they are both the same thing. Now the CDN just serves any type of file, including my PHP pages. Google somehow started crawling that CDN site as well; two sites actually, from cdn.example.com and from http://xxxx.cloudfront.net. Considering:
I am NOT trying to set up a subdomain or a mirror site. If that happens, it is a side effect of me trying to set up a CDN.
The CDN is some web server, not necessarily Apache; I do not know what type of server it is.
There is no request processing on the CDN; it just fetches things from the origin server. I don't think you can put custom files on the CDN itself; whatever it serves comes from the origin server.
How do I prevent the crawling of PHP pages?
Should I allow crawling of images from cdn.example.com or from example.com? The links to images inside the HTML all point to cdn.example.com. If I allow crawling of images only from example.com, there is practically nothing to crawl, since there are no links to those images. And if I allow crawling of images from cdn.example.com, does that not leak away the SEO benefits?
Some alternatives I considered, based on Stack Overflow answers:
Write a custom robots_cdn.txt and serve it based on HTTP_HOST.
Serve a new robots.txt from the subdomain. As I explained above, I do not think the CDN can be treated like a subdomain.
Do a 301 redirect to www.example.com when HTTP_HOST is cdn.example.com.
Suggestions?
A related question: How Disallow a mirror site (on sub-domain) using robots.txt?
You can put a robots.txt in your root directory so that it is served at cdn.yourdomain.com/robots.txt. In this robots.txt you can disallow all crawlers with the setting below:
User-agent: *
Disallow: /
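One caveat with this: the CDN fetches everything from the origin, so a blocking robots.txt in the root could also be served at www.example.com/robots.txt and block the main site. Option (1) from the question, serving a different file based on HTTP_HOST, can be sketched as follows, assuming the origin runs Apache with mod_rewrite and that the CDN forwards the requested Host header to the origin (robots_cdn.txt is the custom file named in the question):
RewriteEngine On
# Serve the blocking file when the request comes in under a CDN hostname
RewriteCond %{HTTP_HOST} ^(cdn\.example\.com|xxxx\.cloudfront\.net)$ [NC]
RewriteRule ^robots\.txt$ robots_cdn.txt [L]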
I have around 300+ subdomains on my site, and I need to block them from being indexed by search engines.
I saw this robots.txt code:
User-agent: *
Disallow: /
But I would need to do it for every subdomain. Is there an easier way to do it with a single robots.txt file in the root of the main domain?
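robots.txt is resolved per hostname, so a file in the main domain's root does not cover the subdomains by itself. One possible approach, assuming all the subdomains are served by the same Apache instance and you can edit the virtual host configuration (the paths and names below are placeholders):
# Wildcard vhost for the subdomains; keep it after the main domain's
# vhost so www.example.com still matches its own configuration first
<VirtualHost *:80>
    ServerName sub.example.com
    ServerAlias *.example.com
    DocumentRoot /var/www/subdomains
    # One shared robots.txt for every subdomain caught by this vhost
    Alias /robots.txt /var/www/robots_block_all.txt
</VirtualHost>
Here robots_block_all.txt would contain the User-agent: * and Disallow: / pair shown above.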