In robots.txt, only allow crawling for the subdomain, NOT the subdirectory, on shared hosting?

I just changed the DNS settings so the folder /forum is now a subdomain instead of a subdirectory. If I do a robots.txt file and say:
User-agent: *
Disallow: /forum
Will that disallow crawling for the subdirectory AND subdomain?
I want to disallow crawling of the subdirectory, but ALLOW crawling of the subdomain. Note: this is on shared hosting so both the subdirectory and subdomain can be visited. This is why I have this issue.
So, how can I permit crawling only for the subdomain?

That's the correct way if you want to stop crawling. But note: if the URLs are already indexed, they won't be removed.
The approach I would prefer is to set all pages to "noindex, follow" via meta tags, or, even better, use the canonical tag to point the search engines to the subdomain URL.
On a given URL like "http://www.yourdomain.com/directoryname/post-of-the-day", put this into your <head>:
<link rel="canonical" href="http://directoryname.yourdomain.com/post-of-the-day" />
The latter URL will then be the only one shown in the SERPs.
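If you go the meta-tag route instead, a minimal sketch (my own example, placed in the <head> of every page served from the subdirectory) would be:
<meta name="robots" content="noindex, follow">
This keeps links being followed while asking search engines to drop the subdirectory URLs from their index.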

Related

How to set up a subdomain for subpage redirects

I'm working on a multi-site web server hosting both WordPress and MediaWiki. My WordPress site is at https://example.com, and I want the MediaWiki site to be at https://wiki.example.com. Since MediaWiki uses /mediawiki or /mediawiki/index.php in its URLs, I would need both of those to redirect to the subdomain (for instance, https://www.example.com/mediawiki/index.php/Main_Page should change to https://wiki.example.com/Main_Page). I assume this requires 1) editing the DNS records with my domain registrar and 2) adding redirects to my .htaccess file.
What do I add to my DNS records to allow for the subdomain?
What do I add to .htaccess to remove /mediawiki from the URL?
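A rough, untested sketch under the assumptions in the question (the host name wiki and the /mediawiki paths come from the question; everything else is an assumption): in DNS, add an A record for wiki pointing at the same server IP (or a CNAME for wiki pointing at example.com). Then, in the .htaccess of the main site's document root, something like:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteRule ^mediawiki(/index\.php)?/?(.*)$ https://wiki.example.com/$2 [R=301,L]
For https://wiki.example.com/Main_Page to resolve on the subdomain, MediaWiki there would also need its short-URL configuration ($wgArticlePath) set up accordingly.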

Prestashop multishop: how no-index Alias subdomain

In a shop built with PrestaShop 1.7.6.1 we have created a resellers "view".
At the moment we have the main B2C webshop on www.domainname.com and a view on reseller.domainname.com (for the B2B market).
For SEO reasons (duplicate product sheets, etc.), I would like the entire subdomain alias "reseller.domainname.com" NOT to be indexed.
I can NOT simply upload a robots.txt via FTP, as there is no document root dedicated to that alias, so it is impossible to add a rule dedicated to that URL (it's not a real subdomain).
Is it possible to do this via the .htaccess file?
Is there any way to prevent indexing of reseller.domainname.com?
Thank you
Do you mean that both websites share the same document root (a common scenario with third-level subdomain multishops)?
In that case the solution is to edit your .htaccess like this:
RewriteRule ^robots\.txt$ robots/%{HTTP_HOST}.txt [L]
This way you can have a different robots.txt for each shop named:
robots/mysite1.com.txt
robots/mysubdomain.mysite2.com.txt
Most likely you will want to add
User-agent: *
Disallow: /
to the robots.txt of the reseller shop.
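Putting the pieces together, a minimal sketch (assuming both shops share one document root and a robots/ folder is created inside it):
.htaccess in the shared document root:
RewriteEngine On
RewriteRule ^robots\.txt$ robots/%{HTTP_HOST}.txt [L]
robots/reseller.domainname.com.txt (blocks all crawling of the reseller view):
User-agent: *
Disallow: /
robots/www.domainname.com.txt (leaves the main shop fully crawlable):
User-agent: *
Disallow: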

Ad-hoc shortened URL using Redirect / Rewrite?

I want to print a short, easy-to-type URL on paper brochures.
So that people can type example.com/foo into their smartphone browser, and the browser will display an existing page, say http://example.com/bar/yada.php.
I see that most pages about mod_rewrite involve regex, but what if I only need manually defined single pages?
Should I have an actual foo directory in the web root, containing a .htaccess file?
The following did what I needed, placed in the .htaccess at webroot.
An actual foo directory need not exist.
RedirectMatch 301 "(?i)^/foo$" "/bar/yada.php"
RedirectMatch 301 "(?i)^/foo/$" "/bar/yada.php"

robots.txt allow all except few sub-directories

I want my site to be indexed by search engines except for a few sub-directories. These are my robots.txt settings:
robots.txt in the root directory
User-agent: *
Allow: /
Separate robots.txt in the sub-directory (to be excluded)
User-agent: *
Disallow: /
Is this the correct way, or will the root-directory rule override the sub-directory rule?
No, this is wrong.
You can’t have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host.
If you want to disallow crawling of URLs whose paths begin with /foo, use this record in your robots.txt (http://example.com/robots.txt):
User-agent: *
Disallow: /foo
This allows crawling everything (so there is no need for Allow) except URLs like
http://example.com/foo
http://example.com/foo/
http://example.com/foo.html
http://example.com/foobar
http://example.com/foo/bar
…
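So for excluding a few sub-directories, one root robots.txt with one Disallow line per directory is enough. A sketch with placeholder directory names (note that a trailing slash limits the rule to URLs inside that directory, unlike the prefix match on /foo above):
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /staging/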
Yes, there are other options, for example:
User-agent: *
Disallow: /
The above directive is useful if you are developing a new website and do not want search engines to index your incomplete website.
Also, you can find more advanced info here.
You can manage them with a robots.txt, which sits in the root directory. Make sure to put your Allow patterns before your Disallow patterns.

Disable crawling for subdomain

I want to disable crawling for my subdomains.
For example:
my main domain is maindomain.com
subdomain_one.com (add-on domain)
subdomain_two.com (add-on domain)
So I want to disable crawling for subdomain_one.maindomain.com.
I have used this in robot.txt:
User-agent: *
Disallow: /subdomain_one/
Disallow: /subdomain_two/
The file must be called robots.txt, not robot.txt.
If you want to disallow all bots to crawl your subdomain, you have to place a robots.txt file in the document root of this subdomain, with the following content:
User-agent: *
Disallow: /
Each host needs its own robots.txt. You can’t specify subdomains inside of the robots.txt, only beginnings of URL paths.
So if you want to block all files on http://sub.example.com/, the robots.txt must be accessible from http://sub.example.com/robots.txt.
It doesn’t matter how your sites are organized on the server-side, it only matters what is publicly accessible.
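Concretely, with the example host names above, that means two separate robots.txt files, one served from each host (the permissive file for the main domain is just an assumed example):
http://sub.example.com/robots.txt
User-agent: *
Disallow: /
http://example.com/robots.txt
User-agent: *
Disallow: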