Robots.txt: http://example.com vs. http://www.example.com - SEO

I have a situation where we have two code bases that need to stay intact:
the old site, http://example.com,
and the new site, http://www.example.com.
The old site (no WWW) supports some legacy code and has the rule:
User-agent: *
Disallow: /
But in the new version (with WWW) there is no robots.txt.
Is Google applying the old (no-WWW) robots.txt as its rule for both sites? And will adding
User-agent: *
Allow: /
to the (WWW) side override this?
Changing robots.txt in the old codebase is not an option at this time.

No. www.example.com and the bare example.com are separate hosts, and the robots.txt from one of them is not used for the other. Since the www host has no robots.txt at all, crawlers already treat it as fully crawlable, so there is nothing to override.
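If you want to make the www host's policy explicit anyway, a minimal sketch of what could be served at http://www.example.com/robots.txt (an empty Disallow is the standard allow-all form, more widely supported than Allow: /):
User-agent: *
Disallow: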

Related

Prestashop multishop: how to no-index an alias subdomain

In a shop created with PS 1.7.6.1 we have created a resellers "view".
At the moment we have the main webshop for B2C at www.domainname.com and a view at reseller.domainname.com (for the B2B market).
For SEO reasons (duplicate product sheets, etc.), I would like to NOT index the entire alias subdomain "reseller.domainname.com".
I can NOT proceed via FTP with a robots.txt file, as there is no document root dedicated to that alias, so it is impossible to add a rule dedicated to that URL (it's not a real subdomain).
Is it possible to proceed via the .htaccess file?
Is there any way to prevent indexing of URLs under reseller.domainname.com?
Thank you
Do you mean that both websites share the same document root (a common scenario with third-level-subdomain multishops)?
In that case the solution is to edit your .htaccess like this:
RewriteRule ^robots\.txt$ robots/%{HTTP_HOST}.txt [L]
This way you can have a different robots.txt for each shop named:
robots/mysite1.com.txt
robots/mysubdomain.mysite2.com.txt
Most likely you will want to put
User-agent: *
Disallow: /
in the robots.txt of the reseller shop.
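A slightly fuller sketch of that .htaccess, assuming mod_rewrite is enabled and a robots/ directory exists in the shared document root (the directory name is illustrative):
RewriteEngine On
# Serve a per-host file, e.g. robots/www.domainname.com.txt or
# robots/reseller.domainname.com.txt, depending on the Host header
RewriteRule ^robots\.txt$ robots/%{HTTP_HOST}.txt [L]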

I have a 302 redirect pointing to www. but Googlebot keeps crawling non-www URLs

Do you know if it is possible to force robots to crawl www.domaine.com and not domaine.com? In my case, I have a web app that serves prerendered (cached) URLs via prerender.io (so crawlers can see the HTML), but only on www.
So when robots crawl domaine.com, they get no data.
The redirection is automatic (domaine.com > http://www.domaine.com) in Nginx, but no results.
I should add that in my sitemap, the URLs all use www.
My Nginx redirect:
server {
    listen *:80;
    server_name stephane-richin.fr;
    location / {
        if ($http_host ~ "^([^\.]+)\.([^\.]+)$") {
            rewrite ^/(.*) http://www.stephane-richin.fr/$1 redirect;
        }
    }
}
Do you have an idea?
Thank you!
If you submitted a sitemap with the correct URLs a week ago, it seems strange that Google keeps requesting the old ones.
Anyway - you’re sending the wrong status code in your non-www to www redirect. You are sending a 302 but should be sending a 301. Philippe explains the difference in this answer:
Status 301 means that the resource (page) is moved permanently to a new location. The client/browser should not attempt to request the original location but use the new location from now on.
Status 302 means that the resource is temporarily located somewhere else, and the client/browser should continue requesting the original URL.
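A sketch of what the permanent redirect could look like in Nginx, using a dedicated server block and return rather than an if/rewrite pair (host names taken from the config quoted above):
server {
    listen 80;
    server_name stephane-richin.fr;
    return 301 http://www.stephane-richin.fr$request_uri;
}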
Could you have a robots.txt file with
User-agent: *
Disallow: /
on domaine.com and a different one with
User-agent: *
Disallow:
on www.domaine.com?
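If both hosts are served by the same Nginx server block, one way to sketch that is a host-dependent robots.txt location (the /var/www/robots/ path is hypothetical):
location = /robots.txt {
    # serves /var/www/robots/domaine.com.txt or
    # /var/www/robots/www.domaine.com.txt, depending on the requested host
    alias /var/www/robots/$host.txt;
}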

robots.txt: allow all except a few sub-directories

I want my site to be indexed by search engines, except for a few sub-directories. The following are my robots.txt settings:
robots.txt in the root directory
User-agent: *
Allow: /
Separate robots.txt in the sub-directory (to be excluded)
User-agent: *
Disallow: /
Is this the correct way, or will the root directory rule override the sub-directory rule?
No, this is wrong.
You can’t have a robots.txt in a sub-directory. Your robots.txt must be placed in the document root of your host.
If you want to disallow crawling of URLs whose paths begin with /foo, use this record in your robots.txt (http://example.com/robots.txt):
User-agent: *
Disallow: /foo
This allows crawling everything (so there is no need for Allow) except URLs like
http://example.com/foo
http://example.com/foo/
http://example.com/foo.html
http://example.com/foobar
http://example.com/foo/bar
…
There is also:
User-agent: *
Disallow: /
This directive is useful if you are developing a new website and do not want search engines to index your incomplete website.
You can manage all of this with a single robots.txt that sits in the root directory. If you combine Allow and Disallow rules, put your Allow patterns before your Disallow patterns, since some parsers honor the first rule that matches.
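For the scenario in the question, a single root robots.txt along these lines is enough (a sketch; /private/ and /tmp/ stand in for whichever sub-directories you want excluded):
User-agent: *
Disallow: /private/
Disallow: /tmp/
Everything not matched by a Disallow line is crawlable by default, so no Allow line is needed.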

Disable crawling for subdomain

I want to disable crawling for my subdomains.
For example:
my main domain is maindomain.com
subdomain_one.com (add-on domain)
subdomain_two.com (add-on domain)
So I want to disable crawling for subdomain_one.maindomain.com.
I have used this in robot.txt:
User-agent: *
Disallow: /subdomain_one/
Disallow: /subdomain_two/
The file must be called robots.txt, not robot.txt.
If you want to disallow all bots to crawl your subdomain, you have to place a robots.txt file in the document root of this subdomain, with the following content:
User-agent: *
Disallow: /
Each host needs its own robots.txt. You can’t specify subdomains inside of the robots.txt, only beginnings of URL paths.
So if you want to block all files on http://sub.example.com/, the robots.txt must be accessible from http://sub.example.com/robots.txt.
It doesn’t matter how your sites are organized on the server-side, it only matters what is publicly accessible.

What does this robots.txt mean? Doesn't it allow any robots?

User-agent: *
Disallow:
Disallow: /admin
Disallow: /admin
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
I found this robots.txt file in one of my websites' root folders. I don't know whether I made it or who did.
I think this file keeps robots out of the admin folder. This is good.
But I wonder: does it also block all robots from all files on my website?
I've changed it with this file:
User-agent: *
Disallow: /admin
Allow: /
Sitemap: http://www.myadress.com/ext/sm/Sitemap_114.xml
PS: this website has not been getting indexed for a long time. Was the old file the problem?
The old file, other than the duplicate /admin entry, was correct. In the original robots.txt standard there is no Allow: command; an empty 'Disallow:' is what opens the site up to robots.
http://www.free-seo-news.com/all-about-robots-txt.htm
Specifically, check item #7 under 'Things you should avoid' and #1 under 'Tips and Tricks'
Possibly. I read your old file the same way you did - that the root was being disallowed. Either way, your new robots file is set up the way I expect you want it.