Robots.txt and sub-folders - SEO

Several domains are configured as add-ons to my primary hosting account (shared hosting).
The directory structure looks like this (primary domain is example.com):
public_html (example.com)
    _sub
        ex1 --> displayed as example-realtor.com
        ex2 --> displayed as example-author.com
        ex3 --> displayed as example-blogger.com
(The SO requirement to use example as the domain makes this harder to explain; for instance, ex1 might point to plutorealty and ex2 to amazon, or some other business sub-hosting with me. The point is that each ex# is a different company's website, so mentally substitute something normal and different for each "example".)
Because these domains (ex1, ex2, etc) are add-on domains, they are accessible in two ways (ideally, the 2nd method is known only to me):
(1) http://example1.com
(2) http://example.com/_sub/ex1/index.php
Again, example1.com is a totally unrelated website/domain name from example.com
QUESTIONS:
(a) How will the sites be indexed on search engines? Will both (1) and (2) show up in search results? (It is undesirable for method 2 to show up in Google.)
(b) Should I put a robots.txt in public_html that disallows each folder in the _sub folder? E.g.:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/
(c) Is there a more common way to configure add-on domains?

This robots.txt would be sufficient, you don’t have to list anything that comes after /_sub/:
User-agent: *
Disallow: /_sub/
This would stop bots (those that honor robots.txt) from crawling any URL whose path starts with /_sub/. But it doesn't necessarily stop those bots from indexing the URLs themselves (e.g., listing them in their search results).
Ideally you would redirect from http://example.com/_sub/ex1/ to http://example1.com/ with HTTP status code 301. How to do that depends on your server (on Apache, you could use a .htaccess file). Then everyone ends up at the canonical URL for your site.
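On Apache, such a redirect could look like the following sketch, placed in public_html/.htaccess. It assumes mod_rewrite is enabled and uses the question's placeholder domains; substitute your real ones:

```apache
# public_html/.htaccess (assumes mod_rewrite is enabled)
RewriteEngine On

# Only redirect requests that arrive via the primary domain, so the
# add-on domains (whose document roots live under _sub/) are unaffected.
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^_sub/ex1(/.*)?$ http://example1.com$1 [R=301,L]

RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^_sub/ex2(/.*)?$ http://example2.com$1 [R=301,L]
```

The RewriteCond on the Host header matters on shared hosting: without it, the rules could also fire for requests arriving via the add-on domains themselves and cause a redirect loop.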

Do not use a multi-site setup like this carelessly. With Google, the ranking of the main domain is affected too, for example if the sub-directory sites use black-hat techniques or generate spam. My suggestion: if the main site is the important one, make all the sub-sites noindex.
robots.txt:
User-agent: *
Disallow: /_sub/
Disallow: /_sub/ex1/
Disallow: /_sub/ex2/
Disallow: /_sub/ex3/

Related

Noindex Only One Subdomain

I am having difficulty finding information on how to completely noindex only one particular subdomain, via .htaccess (from my understanding, that's the best way?). It is important for me that only that one subdomain and its files are never indexed or crawlable.
I have an Apache server that uses Plesk and the subdomain is for an email software we use for newsletter campaigns etc.
The subdomain is "mail" (e.g. https://mail.test.com), and my goal is to make only "mail" noindex, because for some reason the software has SEO features that can wind up harming our site in general.
Create robots.txt inside subdomain document root with the following content:
User-agent: *
Disallow: /
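Note that Disallow only stops compliant crawlers from fetching pages; a URL that is linked from elsewhere can still end up indexed. If the goal is a guaranteed noindex, a sketch of an alternative, assuming Apache with mod_headers and an .htaccess in the subdomain's document root:

```apache
# .htaccess in the document root of mail.test.com
# (assumes mod_headers is enabled)
Header set X-Robots-Tag "noindex, nofollow"
```

Don't combine this with Disallow: / on the same host: if crawlers can never fetch the pages, they never see the header. Pick one approach.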

robots.txt for disallowing Google from following specific URLs

I have included a robots.txt in the root directory of my application in order to tell Google's bots not to follow the URL http://www.test.com/example.aspx?id=x&date=10/12/2014, or any URL with the same path but different query string values. For that I have used the following piece of code:
User-agent: *
Disallow:
Disallow: /example.aspx/
But I found in Webmaster Tools that Google is still following this page and has cached a number of URLs with the specified path. Is it that the query strings are creating the problem? As far as I know, Google doesn't bother about query strings, but just in case: am I using it correctly, or does something else need to be done to achieve this?
Your instruction is wrong:
Disallow: /example.aspx/
This is blocking all URLs in the directory /example.aspx/.
If you want to block all URLs of the file /example.aspx, use this instruction:
Disallow: /example.aspx
You can test it with Google Webmaster Tools.
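The difference is easy to check locally. Here is a small sketch using Python's urllib.robotparser, with the URL from the question; it only demonstrates how the paths match, since whether a given bot honors the rules is up to the bot:

```python
from urllib import robotparser

def blocked(rules: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow fetching url."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", url)

# Trailing slash: only matches URLs *under* a directory named /example.aspx/
with_slash = "User-agent: *\nDisallow: /example.aspx/"
# No trailing slash: matches the file itself, including any query string
without_slash = "User-agent: *\nDisallow: /example.aspx"

url = "http://www.test.com/example.aspx?id=x&date=10/12/2014"
print(blocked(with_slash, url))     # False: the rule never matches the file
print(blocked(without_slash, url))  # True: the file is blocked
```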

How may I prevent search engines from crawling a subdomain of my website?

I have cPanel installed on my website.
I went to the Domains section on cPanel
I clicked on subdomains.
I assigned the subdomain name (e.g : personal.mywebsite.com )
It wanted me to assign document root folder also. I assigned mywebsite.com/personal
If I create a robots.txt in my website root (e.g., website.com):
User-agent: *
Disallow: /personal/
Can it also block personal.mywebsite.com? What should I do?
Thanks
When you want to block URLs on personal.example.com, the relevant file is http://personal.example.com/robots.txt (or https instead of http, as applicable).
It doesn’t matter how your server organizes folders in the backend, it only matters which robots.txt is available when accessing this URL.

Avoid google indexing subdomains

As far as I have searched, I was not able to find a proper answer for this kind of problem.
I have a few TLDs installed on the same cPanel account.
One of them is the main domain, and the rest are secondary domains.
cPanel automatically creates subdomains when you add a secondary domain, something like:
http://secondary.maindomain.com
My problem is that Google indexed my pages from both addresses.
Like:
secondary.com/blabla.html
secondary.maindomain.com/blabla.html
How can I remove those indexes from google? And
How can I avoid those subdomains being indexed for the future?
For this purpose you can add a robots.txt to your document root path with Disallow rules to stop Google or any other search engine from crawling your files or directories.
For example, to avoid Google indexing your subdomain, add the entries below to a robots.txt and place it in the document root of your subdomain:
User-agent: Googlebot
Disallow: /
or for all search engines:
User-agent: *
Disallow: /

How to block all subdomains of a site from search engines?

I have around 300+ subdomains on my site, and I need to block them from being indexed by search engines.
I saw this robots.txt code
User-agent: *
Disallow: /
But I need to do it for every subdomain. Is there an easier way, with a single robots.txt file in the root of the main domain?