SEO question: where should my blog sitemap.xml live? - SEO

We have a blog in a sub-directory of the main URL:
http://www.domain.com/blog/
The blog runs on WordPress and we are using Google Sitemap Generator to create the XML file.
We have an index of all of our sitemaps in the main sitemap.xml, which links to many individual sitemaps.
From an SEO standpoint, would it be best to link directly to the sitemap that is under the blog directory,
e.g. http://www.domain.com/blog/sitemap.xml
or should we run a daily cron job to copy the file to the main domain's directory,
e.g. http://www.domain.com/sitemap_blog.xml
which would be linked from the main index with the other sitemaps?
What is the best way from an SEO standpoint?

It doesn't matter where the sitemap is. You will want to register its location with each search engine that should be able to find it. The main thing, though, is to link to your sitemap location in the robots.txt file using the following line:
Sitemap: <sitemap_location>
Your robots.txt file should be in your domain's root.
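To double-check that the Sitemap lines in a robots.txt file are picked up, a short script can parse the file and list them. This is a minimal sketch using Python's standard-library robots.txt parser; the robots.txt content and the helper name listed_sitemaps are assumptions for illustration, with the blog sitemap URL taken from the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; only the blog sitemap URL comes from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://www.domain.com/sitemap.xml
Sitemap: http://www.domain.com/blog/sitemap.xml
"""


def listed_sitemaps(robots_txt: str) -> list:
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(listed_sitemaps(ROBOTS_TXT))
```

Both the index at the root and the sitemap under /blog/ are reported, so either location can be declared this way.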

Related

How to customize DNN robots.txt to allow a module specific sitemap to be crawled by search engines?

I am using the EasyDNN News module for the blog, news articles, etc. on our DNN website. The core DNN sitemap does not include the articles generated by this module, but the module creates its own sitemap.
For example:
domain.com/blog/mid/1005/ctl/sitemap
When I try to submit this sitemap to Google, it says my Robots.txt file is blocking it.
Looking at the Robots.txt file that ships with DNN, I noticed the following lines under the Slurp and Googlebot user-agents:
Disallow: /*/ctl/ # Slurp permits *
Disallow: /*/ctl/ # Googlebot permits *
I'd like to submit the module's sitemap, but I'd like to know why the /ctl is disallowed for these user-agents, and what would the impact be if I just removed these lines from the file? Specifically, as it pertains to Google crawling the site.
As an added reference, I have read the article below about avoiding a duplicate content penalty by disallowing specific urls that contain /ctl such as login, register, terms, etc. I'm wondering if this is why DNN just disallowed any url with /ctl.
http://www.codeproject.com/Articles/18151/DotNetNuke-Search-Engine-Optimization-Part-Remov
The proper way to do this would be to use the DNN Sitemap provider, something that is pretty darn easy to do as a module developer.
I don't have a blog post/tutorial on it, but I do have sample code which can be found in
http://dnnsimplearticle.codeplex.com/SourceControl/latest#cs/Providers/Sitemap/Sitemap.cs
This will allow custom modules to add their own information to the DNN Sitemap.
The reason /ctl/ is disallowed is that the Login/Registration/Profile controls are normally loaded with a URL such as site?ctl=login, which is typically not something people want to have indexed.
The other option is to just edit the robots.txt file.
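To see why the module sitemap is reported as blocked, the wildcard rule can be tested against the sitemap path. The following is a rough sketch of wildcard matching as commonly documented for robots.txt ('*' matches any run of characters, a trailing '$' anchors the end, otherwise the rule is a prefix match). The helper rule_matches and the second sample path are hypothetical, and real crawlers may differ in details.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a URL path against a robots.txt rule that may contain wildcards."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix-match behaviour.
    return re.match(regex, path) is not None


# The module sitemap from the question falls under the shipped rule.
print(rule_matches("/*/ctl/", "/blog/mid/1005/ctl/sitemap"))  # True
# A hypothetical ordinary article URL does not.
print(rule_matches("/*/ctl/", "/blog/my-first-post"))  # False
```

Any edited rule can be checked the same way before deploying it, for example to confirm that a narrower rule still covers the login and registration URLs.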

How to Disallow Landing Pages Using robots.txt file?

I'd like to start using specific landing pages in a marketing campaign. A quick search on Google shows how to disallow specific pages and/or directories using a robots.txt file.
If I don't want the search engines to index these landing pages, should I add single-page entries to the robots.txt file, or should I put the pages in specific directories and disallow the directories?
My concern is that anybody can read a robots.txt file and if the actual page names are visible within the robots.txt file it defeats the purpose.
"It defeats the purpose." How so? The purpose of robots.txt is to prevent crawlers from reading particular files or groups of files. Whether you exclude the individual files or put them all in a directory and exclude that directory is irrelevant as far as the crawler's behavior is concerned.
The benefit to putting them all in directories is that your robots.txt file is smaller and easier to manage. You don't have to add a new entry every time you create a new landing page.
You're right that putting a file name in robots.txt lets anybody who reads the file know that the file is there. That shouldn't be a problem. If you have sensitive information that you don't want others to see then it shouldn't be accessible, regardless of whether it's mentioned in robots.txt. Because if the file is publicly accessible, then a bot is going to find it even if you don't mention it in robots.txt.
robots.txt is just a guideline. The existence of a disallow line in robots.txt doesn't prevent an unfriendly crawler from looking at those pages. It just tells the crawler that you don't want them looking at those pages. But crawlers can ignore robots.txt. They shouldn't, and you can block them if they do, but robots.txt itself is more like a stop sign than a road block.
You should be able to simply use the NOINDEX META tag in the HEAD of your page.
http://www.robotstxt.org/meta.html
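To confirm that a landing page really carries the tag, the page's HTML can be checked with a small parser. This is only a sketch under assumptions: the class RobotsMetaParser, the helper has_noindex and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical landing page markup.
PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Offer</body></html>'
print(has_noindex(PAGE))  # True
```

Keep in mind that a crawler can only see this tag if it is allowed to fetch the page, so a Disallow rule for the same URL would hide the tag from it.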

robots.txt which folders to disallow - SEO?

I am currently writing my robots.txt file and have some trouble deciding whether I should allow or disallow some folders for SEO purposes.
Here are the folders I have:
/css/ (css)
/js/ (javascript)
/img/ (images i use for the website)
/php/ (PHP scripts that return a blank page, for example checkemail.php, which checks an email address, or register.php, which puts data into a SQL database and sends an email)
/error/ (my error 401,403,404,406,500 html pages)
/include/ (header.html and footer.html I include)
I was thinking about disallowing only the PHP pages and allowing the rest.
What do you think?
Thanks a lot
Laurent
/css and /js -- CSS and JavaScript files will probably be crawled by Googlebot whether or not you have them in robots.txt. Google uses them to render your pages for site preview. Google has asked nicely that you not put them in robots.txt.
/img -- Googlebot may crawl this even when it is in robots.txt, the same way as CSS and JavaScript. Putting your images in robots.txt generally prevents them from being indexed in Google image search. Google image search may be a source of visitors to your site, so you may wish to be indexed there.
/php -- It sounds like you don't want spiders hitting the URLs that perform actions. Good call to use robots.txt.
/error -- If your site is set up correctly, the spiders will probably never know what directory your error pages are served from. They generally get served at the URL that has the error, and the spider never sees their actual URL. This isn't the case if you redirect to them, which isn't recommended practice anyway. As such, I would say there is no need to put them in robots.txt.
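The resulting policy (block only /php/, leave everything else crawlable) can be checked with Python's standard-library robots.txt parser. This is a minimal sketch: the helper is_allowed and the file names under /css/, /js/ and /img/ are hypothetical, while register.php and checkemail.php come from the question.

```python
from urllib.robotparser import RobotFileParser

# Only the action scripts are disallowed; every other folder stays crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /php/
"""


def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Return True if the given user agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for path in ("/css/site.css", "/js/app.js", "/img/logo.png", "/php/register.php"):
    print(path, is_allowed(ROBOTS_TXT, path))
```

Only the path under /php/ is reported as blocked. Note that this parser does plain prefix matching, which is all this rule needs.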

Can a relative sitemap url be used in a robots.txt?

In robots.txt can I write the following relative URL for the sitemap file?
sitemap: /sitemap.ashx
Or do I have to use the complete (absolute) URL for the sitemap file, like:
sitemap: http://subdomain.domain.com/sitemap.ashx
Why I wonder:
I own a new blog service, www.domain.com, that allows users to blog on accountname.domain.com.
I use wildcards, so all subdomains (accounts) point to "blog.domain.com".
In blog.domain.com I put the robots.txt to let search engines find the sitemap.
But, due to the wildcards, all user accounts share the same robots.txt file. That's why I can't use the second alternative. And for now I can't use URL rewriting for .txt files. (I guess that later versions of IIS can handle this?)
According to the official documentation on sitemaps.org it needs to be a full URL:
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line including the full URL to the sitemap:
Sitemap: http://www.example.com/sitemap.xml
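A quick way to catch a relative value before deploying a robots.txt file is to check every Sitemap line for a scheme and a host. The following is a small sketch, not an official validator; the helper name relative_sitemap_values is an assumption, and the two sample lines are the ones from the question.

```python
from urllib.parse import urlparse


def relative_sitemap_values(robots_txt: str) -> list:
    """Return Sitemap values that are not absolute http(s) URLs."""
    problems = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the URL's own colons stay intact.
        field, _, value = line.partition(":")
        if field.strip().lower() != "sitemap":
            continue
        value = value.strip()
        parsed = urlparse(value)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(value)
    return problems


print(relative_sitemap_values("sitemap: /sitemap.ashx"))
# ['/sitemap.ashx']
print(relative_sitemap_values("Sitemap: http://subdomain.domain.com/sitemap.ashx"))
# []
```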
The Sitemap line is expected to hold a fully qualified URL, and crawlers should not be relied on to resolve a relative path, which is why an absolute URL is always recommended for better crawlability and indexability.
Therefore, you cannot use this variation:
sitemap: /sitemap.xml
The recommended syntax is:
Sitemap: https://www.yourdomain.com/sitemap.xml
Note:
Don't forget to capitalise the first letter in "Sitemap".
Don't forget to put a space after "Sitemap:".
Good technical and logical question, my dear friend.
No, in the robots.txt file you can't use a relative URL for the sitemap; you need to use the complete URL.
It's better to go with "sitemap: https://www.example.com/sitemap_index.xml"
Note the space after the colon.
I would also like to support Deepak.

Is Addon domain affecting SEO

I am just a learner in the field of SEO, and I have a main domain and an addon domain. Both have separate websites. Consider main.com as my main domain and addon.com as my addon domain, which points to a sub-directory called "addon".
I can access addon.com in the following 3 ways:
addon.com
main.com/addon
addon.main.com
Are these URLs indexed separately by search engines? If so, how can I prevent this?
Does a search engine treat main.com/addon as a page of main.com?
I am not sure whether I need to worry about all this or can just leave it as it is. I searched Google but couldn't find a clear answer.
It may be too late to answer. However, it may benefit others.
A primary domain and its subdomain or addon domain will not be linked by the search engines automatically unless you link them, purposefully or inadvertently. The exception is when both of these conditions are true:
Your web root (normally public_html) has no index page.
Directory indexing of your web root is enabled, which eventually exposes the sub-folder attached to your addon domain to Google and the entire world.
In that scenario a robots.txt solution is not recommended, because search engines may ignore robots.txt rules.
Google will only index pages if they are linked to or listed in the sitemap. You can stop addon.main.com or main.com/addon from being indexed by using a noindex tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
or by disallowing it in the robots.txt.
The search engine will consider main.com/addon a page of main.com. If the sites are completely separate, I'd recommend using a separate domain (preferably a keyword-rich domain), but it's up to you really.
Here we have three domain names with the same content. All three return a 200 OK HTTP status, so they look like duplicates of the same content. A canonical tag on every page would improve this.
The best approach would be to create a redirect in the subdomain panel in cPanel so that at least addon.main.com redirects to addon.com.
Then, you can add a robots.txt to the root path of the primary domain containing
User-agent: *
Disallow: /addon/
so that no robot will visit main.com/addon.
Google gives less weight to a site hosted on a subdomain of another domain.
Superbad for SEO
If you are hosting for SEO and love the convenience of cPanel, then forget hosting domains as addon domains.
#Vasanthan R.P.
It's an excellent question, often overlooked by SEO professionals. +1 for you.