robots.txt configuration - SEO

I have a few questions about this robots.txt file.
User-agent: *
Disallow: /administrator/
Disallow: /css/
Disallow: /func/
Disallow: /images/
Disallow: /inc/
Disallow: /js/
Disallow: /login/
Disallow: /recover/
Disallow: /Scripts/
Disallow: /store/com-handler/
Disallow: /store/img/
Disallow: /store/theme/
Disallow: /store/StoreSys.swf
Disallow: config.php
This is going to block crawlers from all files inside each folder, right? Or do I have to add an asterisk at the end of each folder name?
I think this should do it, but I'm not sure whether I need to add Allow: / right after User-agent; I suppose it isn't needed.
Is anything wrong with this robots.txt file?
PS: If someone can suggest a validation tool for local use, I would be glad.
Thanks.

It's fine as is, if I understand what you want. E.g.
/administrator/
/css/subpage
are both blocked, but
/foo
is allowed. Note that Allow is a less widely supported extension, designed only to override a previous Disallow. You might use it if, for instance, despite your
Disallow: /images/
you decide you want a particular image allowed. So,
Allow: /images/ok_image
All other images remain blocked. You can see http://www.searchtools.com/robots/robots-txt.html for more info, including a list of checkers.
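For local validation (per the PS), Python's standard library ships a robots.txt parser you can run against the file without any web service. A minimal sketch, assuming the file above is saved as robots.txt next to the script and using example.com purely as a placeholder host:
import urllib.robotparser
# Load the rules from the local robots.txt and test a few paths.
parser = urllib.robotparser.RobotFileParser()
with open("robots.txt") as f:
    parser.parse(f.read().splitlines())
# Folder rules match by path prefix, so no trailing asterisk is needed.
for url in ("http://example.com/administrator/index.php",
            "http://example.com/css/subpage",
            "http://example.com/foo"):
    print(url, "allowed" if parser.can_fetch("*", url) else "blocked")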

Related

What is causing URLs to be blocked in Webmaster Tools on a BigCommerce store?

My client has a store on BigCommerce and is using Google Webmaster Tools to monitor search performance. Google is telling the client that certain URLs are blocked on mobile by the robots.txt file. It seems any URL that ends with ?sort=newest is being blocked.
The contents of the HTTP robots.txt file are:
User-agent: *
Disallow: /account.php
Disallow: /cart.php
Disallow: /checkout.php
Disallow: /finishorder.php
Disallow: /login.php
Disallow: /orderstatus.php
Disallow: /postreview.php
Disallow: /productimage.php
Disallow: /productupdates.php
Disallow: /remote.php
Disallow: /search.php
Disallow: /viewfile.php
Disallow: /wishlist.php
Disallow: /admin/
Disallow: /_socialshop/
The contents of the HTTPS robots.txt file are:
User-agent: *
Disallow: /
User-agent: google-xrawler
Allow: /feeds/*
Obviously, there is no mention of ?sort=newest in the robots.txt files.
How can I stop the ?sort=newest pages from being blocked on mobile search?
While I understand your question, blocking URL filters is actually best practice. You can read more about it on Moz under the heading "No Indexed Filters".
BigCommerce offers some additional information around this, including how to indicate to Google which search parameters, if any, should be crawled.
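If you want to see locally how a given robots.txt treats those URLs, you can test it with a parser. A rough sketch using Python's standard library; the store URL here is made up:
import urllib.robotparser
https_rules = """\
User-agent: *
Disallow: /
User-agent: google-xrawler
Allow: /feeds/*
"""
parser = urllib.robotparser.RobotFileParser()
parser.parse(https_rules.splitlines())
# Disallow: / matches every path for User-agent: *, so URLs ending in
# ?sort=newest are disallowed by this file even though it never names them.
print(parser.can_fetch("*", "https://store.example.com/shop-all/?sort=newest"))  # False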

Trying to disallow one page in subdirectory

I am trying to disallow one page in a subdirectory.
The robots.txt code I am using is:
User-Agent: *
Disallow:
Disallow: /form.aspx
But form.aspx is in the process folder, and my URL looks like:
www.yoursite.com/process/form.aspx
So how can I disallow form.aspx in robots.txt? Is the format given above right?
Please guide me.
If you want to block http://example.com/process/form.aspx and allow everything else, you can use:
# robots.txt on <http://example.com/robots.txt>
User-agent: *
Disallow: /process/form.aspx
Note that this would also block URLs like http://example.com/process/form.aspx.foo, http://example.com/process/form.aspx/bar, etc.
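You can confirm that prefix behaviour locally, e.g. with Python's built-in parser (the example.com URLs are placeholders):
import urllib.robotparser
parser = urllib.robotparser.RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /process/form.aspx
""".splitlines())
# The rule is a path prefix, so suffixed variants are blocked too,
# while other pages under /process/ stay crawlable.
print(parser.can_fetch("*", "http://example.com/process/form.aspx"))      # False
print(parser.can_fetch("*", "http://example.com/process/form.aspx/bar"))  # False
print(parser.can_fetch("*", "http://example.com/process/other.aspx"))     # True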

making robots.txt

I am making a robots.txt for my website. Can anybody confirm that I am doing it correctly? If I am wrong, please tell me how to write it in the correct form.
admincp, adminpp, etc. are folders on my hosting server:
User-agent: *
Disallow: /admincp/
Disallow: /adminpp/
Disallow: /Advertise with us/
Disallow: /ajax/
Disallow: /banner/
Disallow: /cont_img/
Disallow: /corcel/
Disallow: /css/
Disallow: /fbold/
Disallow: /images/
Disallow: /img/
Disallow: /js/
Disallow: /pic/
Disallow: /Scripts/
Disallow: /textpimg/
Disallow: /thumb_uploadtopics/
Disallow: /upload_p1/
Disallow: /uploadtopics/
You can read this article to get a better grip on writing robots.txt:
http://seoroi.com/seo-faq/robotstxt-what-it-is-why-its-used-and-how-to-write-it/
For example, you can write:
User-agent: *
Disallow: /src/
Disallow: /cgi-bin/
Disallow: /~zohaib/
Disallow: /temp/
Yes, it seems you are doing it correctly.
If you have written the folder and page names correctly, then it is right; I don't see any problem in your robots.txt file.
Some explanation, if you want it:
User-agent: * --> means the rules apply to all search engine bots that crawl your site.
Disallow: --> lists the folders and pages you don't want those bots to visit.

Blocking folders in between allowed content

I have a site with the following structure:
http://www.example.com/folder1/folder2/folder3
I would like to disallow indexing in folder1, and folder2.
But I would like the robots to index everything under folder3.
Is there a way to do this with the robots.txt?
From what I have read, I think that everything inside a specified folder is disallowed.
Would the following achieve my goal?
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Yes, it works... however, Google has a tool to test your robots.txt file.
You only need to go to Google Webmaster Tools (https://www.google.com/webmasters/tools/)
and open the section "Site configuration -> Crawler access".
All you would need is:
user-agent: *
Crawl-delay: 0
Sitemap:
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Allow: /
At least Googlebot will see the more specific Allow for that one directory and disallow everything else from folder1 on down. This is backed up by this post by a Google employee.
Line breaks in records are not allowed, so your original robots.txt should look like this:
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Possible improvements:
Specifying Allow: / is superfluous, as it’s the default anyway.
Specifying Disallow: /folder1/folder2/ is superfluous, as Disallow: /folder1/ is sufficient.
As Sitemap is not per record, but for all bots, you could specify it as a separate block.
So your robots.txt could look like this:
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Sitemap: http://example.com/sitemap
(Note that the Allow field is not part of the original robots.txt specification, so don’t expect all bots to understand it.)
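To sanity-check the final file locally before uploading it, you can run it through a parser. A rough sketch using Python's standard library; note that this parser applies rules in the order they appear (which happens to agree with the intent here), whereas Google documents a most-specific (longest path) match, so other crawlers may behave differently:
import urllib.robotparser
rules = """\
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/
"""
parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())
# URLs under /folder1/folder2/folder3 match the Allow line; everything
# else under /folder1/ falls through to the Disallow line.
print(parser.can_fetch("*", "http://example.com/folder1/folder2/folder3/page"))  # True
print(parser.can_fetch("*", "http://example.com/folder1/folder2/other"))         # False
print(parser.can_fetch("*", "http://example.com/elsewhere"))                     # True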

robots.txt: ignore all folders but crawl all files in root

Should I then do
User-agent: *
Disallow: /
Is it as simple as that? Or will that stop the files in the root from being crawled as well?
Basically, that is what I am after: crawling all the files/pages in the root, but not any of the folders at all. Or am I going to have to specify each folder explicitly, i.e.
Disallow: /admin
Disallow: /this
.. etc
Thanks
nat
Your example will block all the files in the root as well.
There isn't a "standard" way to easily do what you want without specifying each folder explicitly.
Some crawlers, however, do support extensions that allow pattern matching. You could disallow all bots that don't support pattern matching, but allow those that do.
For example
# disallow all robots
User-agent: *
Disallow: /
# let Googlebot read HTML and PDF files
User-agent: Googlebot
Allow: /*.html
Allow: /*.pdf
Disallow: /
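If you need to cover crawlers that don't support pattern matching, the explicit per-folder list mentioned above can be generated rather than typed by hand. A small sketch, assuming the script is run from the site's document root (that path is an assumption):
import os
# Emit one Disallow line per top-level directory in the document root,
# leaving files in the root itself crawlable.
print("User-agent: *")
for name in sorted(os.listdir(".")):
    if os.path.isdir(name):
        print(f"Disallow: /{name}/")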