I am making a robots.txt for my website. Can anybody confirm whether I am doing it correctly? If I am wrong, please tell me how to write it in the correct form.
admincp, adminpp, etc. are folders on my hosting server:
User-agent: *
Disallow: /admincp/
Disallow: /adminpp/
Disallow: /Advertise with us/
Disallow: /ajax/
Disallow: /banner/
Disallow: /cont_img/
Disallow: /corcel/
Disallow: /css/
Disallow: /fbold/
Disallow: /images/
Disallow: /img/
Disallow: /js/
Disallow: /pic/
Disallow: /Scripts/
Disallow: /textpimg/
Disallow: /thumb_uploadtopics/
Disallow: /upload_p1/
Disallow: /uploadtopics/
You can read this article to get a better grip on writing robots.txt:
http://seoroi.com/seo-faq/robotstxt-what-it-is-why-its-used-and-how-to-write-it/
For example, you can write:
User-agent: *
Disallow: /src/
Disallow: /cgi-bin/
Disallow: /~zohaib/
Disallow: /temp/
Yes, it seems you are doing it correctly.
If you have written your folder and page names correctly, then you are right; I do not see any problem in your robots.txt file.
Some explanation, if you want to know:
User-agent: * --> means the rules that follow apply to all search engine bots.
Disallow: ---> names a folder or page that you do not want those bots to crawl.
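To make the two directives concrete, here is a minimal sketch that feeds a shortened copy of the rules from the question into Python's standard urllib.robotparser and asks whether a few paths may be crawled. The user agent name "MyCrawler" and the file names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admincp/
Disallow: /adminpp/
Disallow: /images/
Disallow: /js/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "MyCrawler" is a hypothetical user agent; it falls under the "*" record.
for path in ("/admincp/login.php", "/images/logo.png", "/index.html"):
    verdict = "allowed" if parser.can_fetch("MyCrawler", path) else "blocked"
    print(path, "->", verdict)
```

Each Disallow value is treated as a prefix of the URL path, so everything below a listed folder is covered as well.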
I'm trying to respect the robots.txt file while web crawling, and I encountered something weird. The robots.txt URL I'm trying to access is: https://podatki.gov.si/robots.txt
If I open this link in Chrome, I get this:
User-agent: *
Disallow: /
But if I open this link with Internet Explorer or Selenium WebDriver (ChromeDriver), I get this:
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used: http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
User-agent: *
Crawl-delay: 10
# CSS, JS, Images
Allow: /misc/*.css$
Allow: /misc/*.css?
Allow: /misc/*.js$
Allow: /misc/*.js?
Allow: /misc/*.gif
Allow: /misc/*.jpg
Allow: /misc/*.jpeg
Allow: /misc/*.png
Allow: /modules/*.css$
Allow: /modules/*.css?
Allow: /modules/*.js$
Allow: /modules/*.js?
Allow: /modules/*.gif
Allow: /modules/*.jpg
Allow: /modules/*.jpeg
Allow: /modules/*.png
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /themes/*.css$
Allow: /themes/*.css?
Allow: /themes/*.js$
Allow: /themes/*.js?
Allow: /themes/*.gif
Allow: /themes/*.jpg
Allow: /themes/*.jpeg
Allow: /themes/*.png
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/
Why does this happen? The latter seems to be a generic robots.txt file, maybe something autogenerated?
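One way to narrow this down might be to request the file outside any browser with different User-Agent headers and compare what comes back. The sketch below uses only Python's standard urllib.request; the two user agent strings are placeholders, and whether this particular server varies its response by User-Agent at all is an assumption to be tested, not a known fact.

```python
import urllib.request

ROBOTS_URL = "https://podatki.gov.si/robots.txt"


def build_request(url, user_agent):
    """Create a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, user_agent, timeout=10):
    """Download the URL and return the body as text (performs a network call)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def compare(url, agents):
    """Print the first line of the response seen by each user agent."""
    for agent in agents:
        body = fetch_text(url, agent)
        first_line = body.splitlines()[0] if body else "(empty)"
        print(agent, "->", first_line)


# Example call (needs network access); both agent strings are placeholders:
# compare(ROBOTS_URL, ["ExampleBrowser/1.0", "ExampleCrawler/1.0"])
```

If both requests return the same text, the difference more likely comes from something else, such as caching or redirects between host names.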
I have observed the same behavior as follows:
When I accessed the webpage https://podatki.gov.si/robots.txt manually, I got:
User-agent: *
Disallow: /
When I accessed the webpage https://podatki.gov.si/robots.txt using ChromeDriver and Chrome, I got the same long, generic file that is quoted in full in the question (the one that starts with the "# robots.txt" comment block and sets Crawl-delay: 10).
robots.txt
As per robotstxt.org, website owners use the robots.txt file to give instructions about their site to web robots. This is called the Robots Exclusion Protocol.
It works as follows:
A robot wants to visit a website URL, e.g. http://www.example.com/welcome.html.
Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The User-agent: * means this section applies to all robots.
The Disallow: / tells the robot that it should not visit any pages on the site.
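The sequence just described can be sketched with Python's standard library. This is only an illustration: "ExampleBot" is a made-up user agent, and the rules are passed in as a list instead of being downloaded.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url):
    """Return the robots.txt location for the host that serves page_url."""
    return urljoin(page_url, "/robots.txt")


def may_visit(robots_lines, user_agent, page_url):
    """Apply already-downloaded robots.txt lines to a single page URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)


page = "http://www.example.com/welcome.html"
rules = ["User-agent: *", "Disallow: /"]

print(robots_url_for(page))                 # where the robot looks first
print(may_visit(rules, "ExampleBot", page)) # whether it may then fetch the page
```

A well-behaved robot would skip the page whenever may_visit returns False.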
There are two important considerations when using robots.txt:
Robots can ignore your robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention.
The robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
Outro
When Chrome is driven by ChromeDriver, navigator.webdriver is set. This property defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, so that alternate code paths can be triggered during automation. That is presumably why you see more content from the robots.txt in the automated session.
You can find a relevant discussion in "Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection".
I am making my robots.txt file, but I am a little unsure about how to write the disallow rules for Googlebot-Image. I want to allow the Google bots to crawl my site, except for the paths I have disallowed below. This is what I made:
User-agent: Googlebot
Disallow:
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
User-agent: Googlebot-Image
Disallow:
/images/graphics/erhvervserfaring/
/images/graphics/uddannelse/
sitemap: http://www.example.com/sitemap.xml
Should User-agent: Googlebot and User-agent: Googlebot-Image be written together, like this instead?
User-agent: Googlebot-Image
User-agent: Googlebot
Disallow:
Disallow: /courses/
/portfolio/portfolio-template.php/
/images/graphics/erhvervserfaring/
/images/graphics/uddannelse/
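No, you should write them as separate records, each with its own Disallow lines. Every path needs its own "Disallow:" prefix, and an empty "Disallow:" line should be dropped, since an empty value means nothing is disallowed.
Also, you should copy the Disallow rules into both records, because a crawler follows only the record that matches it.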
User-agent: Googlebot-Image
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/
User-agent: Googlebot
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/
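As a rough check, the Googlebot-Image record from the question and the one from this answer can be compared with Python's standard urllib.robotparser. Google's own parser may behave differently in details, so treat this as a sketch; the image file name is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def parse(text):
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Record from the question: the two image paths have no "Disallow:" prefix.
broken = parse("""\
User-agent: Googlebot-Image
Disallow:
/images/graphics/erhvervserfaring/
/images/graphics/uddannelse/
""")

# Record from the answer: one "Disallow:" per path.
fixed = parse("""\
User-agent: Googlebot-Image
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/
""")

image = "/images/graphics/uddannelse/diploma.png"  # hypothetical file name
print("question's record allows it:", broken.can_fetch("Googlebot-Image", image))
print("answer's record allows it:  ", fixed.can_fetch("Googlebot-Image", image))
```

In this parser the bare path lines are skipped because they are not "field: value" pairs, so the question's record ends up blocking nothing.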
As a reference, you can see what Facebook and Apple did in their robots.txt files.
My client has a store in BigCommerce and is using Google Webmaster Tools to monitor search performance. Google is telling the client that certain URLs are blocked on mobile by the robots.txt file. It seems any URL that ends with ?sort=newest is being blocked.
The contents of the HTTP robots.txt file are:
User-agent: *
Disallow: /account.php
Disallow: /cart.php
Disallow: /checkout.php
Disallow: /finishorder.php
Disallow: /login.php
Disallow: /orderstatus.php
Disallow: /postreview.php
Disallow: /productimage.php
Disallow: /productupdates.php
Disallow: /remote.php
Disallow: /search.php
Disallow: /viewfile.php
Disallow: /wishlist.php
Disallow: /admin/
Disallow: /_socialshop/
The contents of the HTTPS robots.txt file are:
User-agent: *
Disallow: /
User-agent: google-xrawler
Allow: /feeds/*
Obviously, there is no mention of ?sort=newest in the robots files.
How can I stop the ?sort=newest pages from being blocked on mobile search?
While I understand your question, blocking URL filters is actually best practice. You can read more about it on Moz under the heading "No Indexed Filters".
BigCommerce offers some additional information on this and on how to indicate to Google which search parameters should be crawled, if any.
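To see what each of the two files quoted in the question permits for a URL ending in ?sort=newest, they can be run through Python's standard urllib.robotparser. This is only a sketch: the HTTP rules are shortened, the category URL is hypothetical, and the thread does not establish whether this is what Google's report is based on.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the HTTP file from the question.
HTTP_RULES = """\
User-agent: *
Disallow: /account.php
Disallow: /cart.php
Disallow: /search.php
Disallow: /admin/
"""

# The HTTPS file from the question.
HTTPS_RULES = """\
User-agent: *
Disallow: /

User-agent: google-xrawler
Allow: /feeds/*
"""


def allowed(rules, user_agent, path):
    """Check one path against robots.txt text for the given user agent."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


url = "/shoes/?sort=newest"  # hypothetical category URL
print("http :", allowed(HTTP_RULES, "Googlebot", url))
print("https:", allowed(HTTPS_RULES, "Googlebot", url))
```

Under this parser the HTTPS file blocks every path for crawlers that fall under the "*" record, with or without a sort parameter.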
I have a site with the following structure:
http://www.example.com/folder1/folder2/folder3
I would like to disallow indexing in folder1 and folder2.
But I would like the robots to index everything under folder3.
Is there a way to do this with the robots.txt?
From what I have read, I think that everything inside a specified folder is disallowed.
Would the following achieve my goal?
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Yes, it works. However, Google has a tool to test your robots.txt file.
You only need to go to Google Webmaster Tools (https://www.google.com/webmasters/tools/)
and open the section "site configuration -> crawler access".
All you would need is:
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Allow: /
At least Googlebot will see the more specific Allow for that one directory and disallow everything else under folder1. This is backed up by this post by a Google employee.
Line breaks in records are not allowed, so your original robots.txt should look like this:
user-agent: *
Crawl-delay: 0
Sitemap: <Sitemap url>
Allow: /folder1/folder2/folder3
Disallow: /folder1/folder2/
Disallow: /folder1/
Allow: /
Possible improvements:
Specifying Allow: / is superfluous, as it’s the default anyway.
Specifying Disallow: /folder1/folder2/ is superfluous, as Disallow: /folder1/ is sufficient.
As Sitemap is not per record, but for all bots, you could specify it as a separate block.
So your robots.txt could look like this:
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/
Sitemap: http://example.com/sitemap
(Note that the Allow field is not part of the original robots.txt specification, so don’t expect all bots to understand it.)
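The suggested file can be checked locally with Python's standard urllib.robotparser, as sketched below. "ExampleBot" and the page names are made up. Note that this particular parser applies the first rule that matches, so the Allow line has to come before the Disallow line, as it does here; crawlers that pick the most specific rule do not depend on the order.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 0
Allow: /folder1/folder2/folder3
Disallow: /folder1/

Sitemap: http://example.com/sitemap
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical pages at each level of the folder structure.
paths = [
    "/folder1/page.html",
    "/folder1/folder2/page.html",
    "/folder1/folder2/folder3/page.html",
    "/index.html",
]
for path in paths:
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(path, "->", verdict)
```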
I have a few doubts about this robots file.
User-agent: *
Disallow: /administrator/
Disallow: /css/
Disallow: /func/
Disallow: /images/
Disallow: /inc/
Disallow: /js/
Disallow: /login/
Disallow: /recover/
Disallow: /Scripts/
Disallow: /store/com-handler/
Disallow: /store/img/
Disallow: /store/theme/
Disallow: /store/StoreSys.swf
Disallow: config.php
This is going to block crawlers from all files inside each folder, right?
Or do I have to add an asterisk at the end of each folder name?
I think this should do it, but I'm not sure whether I have to add Allow: / right after User-agent. I suppose it isn't needed.
Is anything wrong in this robots file?
PS: If someone can suggest a validation app for local use, I would be glad.
Thanks.
It's fine as is, if I understand what you want, apart from the last rule, which needs a leading slash (Disallow: /config.php) to match anything. E.g.
/administrator/
/css/subpage
are both blocked, but
/foo
is allowed. Note that Allow is a less widely supported extension designed only to counter a previous Disallow. You might use it if, for instance, despite your
Disallow: /images/
you decide you want a particular image allowed. So,
Allow: /images/ok_image
All other images remain blocked. You can see http://www.searchtools.com/robots/robots-txt.html for more info, including a list of checkers.
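For a quick local check, Python's standard urllib.robotparser can serve as a small validator. The sketch below uses a shortened copy of the file from the question plus the Allow line discussed above; "ExampleBot" is a made-up user agent. This parser applies the first matching rule, so the Allow line is placed before Disallow: /images/; other parsers may resolve the overlap differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the file from the question, plus the Allow example.
ROBOTS_TXT = """\
User-agent: *
Allow: /images/ok_image
Disallow: /administrator/
Disallow: /css/
Disallow: /images/
Disallow: config.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def report(paths, user_agent="ExampleBot"):
    """Print whether each path may be crawled by the given user agent."""
    for path in paths:
        verdict = "allowed" if parser.can_fetch(user_agent, path) else "blocked"
        print(path, "->", verdict)


report([
    "/administrator/",
    "/css/subpage",
    "/foo",
    "/images/ok_image",
    "/images/other.png",
    "/config.php",  # compare with the slash-less "Disallow: config.php" rule
])
```

Rules are compared as prefixes of the URL path, which always begins with a slash, so it is worth looking at how /config.php is reported before relying on the last line of the file.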