I am writing my robots.txt file, but I am unsure how to disallow paths for Googlebot-Image. I want to allow Googlebot to crawl my site, except for the paths I have disallowed below. This is what I have so far:
User-agent: Googlebot
Disallow:
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
User-agent: Googlebot-Image
Disallow:
/images/graphics/erhvervserfaring/
/images/graphics/uddannelse/
sitemap: http://www.example.com/sitemap.xml
Should the User-agent: Googlebot and User-agent: Googlebot-Image lines be written together, like this instead?
User-agent: Googlebot-Image
User-agent: Googlebot
Disallow:
Disallow: /courses/
/portfolio/portfolio-template.php/
/images/graphics/erhvervserfaring/
/images/graphics/uddannelse/
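No, you should write them as separate groups, each with its own Disallow lines.
A crawler follows only the group that most specifically matches its name, so once Googlebot-Image has its own group it no longer reads the Googlebot rules. That is why the Disallow lines have to be copied into both groups. Each path also needs its own Disallow: prefix, and the empty Disallow: line should be dropped.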
User-agent: Googlebot-Image
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/
User-agent: Googlebot
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/
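As a quick local check, the two groups above can be fed to Python's standard urllib.robotparser. This is only a sketch: robotparser uses plain prefix matching and is not Google's own matcher, and the www.example.com URLs and file names are placeholders rather than the asker's real pages.

```python
from urllib.robotparser import RobotFileParser

# The two groups from the answer above, one Disallow line per path.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/

User-agent: Googlebot
Disallow: /courses/
Disallow: /portfolio/portfolio-template.php/
Disallow: /images/graphics/erhvervserfaring/
Disallow: /images/graphics/uddannelse/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Hypothetical paths, chosen only to exercise the rules.
print(allowed("Googlebot-Image", "/images/graphics/uddannelse/photo.png"))
print(allowed("Googlebot", "/courses/intro.html"))
print(allowed("Googlebot", "/index.html"))
```

The first two calls should report that the path is blocked and the last one that it is allowed, which matches the intent of the rules.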
For reference, you can look at what Facebook and Apple did in their robots.txt files.
I'm trying to respect the robots.txt file while web crawling, and I encountered something weird. The robots.txt URL I'm trying to access is: https://podatki.gov.si/robots.txt
If I open this link in Chrome, I get this:
User-agent: *
Disallow: /
But if I open this link with Internet Explorer or Selenium WebDriver (ChromeDriver), I get this:
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used: http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
User-agent: *
Crawl-delay: 10
# CSS, JS, Images
Allow: /misc/*.css$
Allow: /misc/*.css?
Allow: /misc/*.js$
Allow: /misc/*.js?
Allow: /misc/*.gif
Allow: /misc/*.jpg
Allow: /misc/*.jpeg
Allow: /misc/*.png
Allow: /modules/*.css$
Allow: /modules/*.css?
Allow: /modules/*.js$
Allow: /modules/*.js?
Allow: /modules/*.gif
Allow: /modules/*.jpg
Allow: /modules/*.jpeg
Allow: /modules/*.png
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /themes/*.css$
Allow: /themes/*.css?
Allow: /themes/*.js$
Allow: /themes/*.js?
Allow: /themes/*.gif
Allow: /themes/*.jpg
Allow: /themes/*.jpeg
Allow: /themes/*.png
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/
Why does this happen? The latter seems to be a generic robots.txt file, maybe something autogenerated?
I have observed the same behavior as follows:
When I accessed the webpage https://podatki.gov.si/robots.txt manually, I got:
User-agent: *
Disallow: /
When I accessed the webpage https://podatki.gov.si/robots.txt using ChromeDriver and Chrome, I got:
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used: http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
User-agent: *
Crawl-delay: 10
# CSS, JS, Images
Allow: /misc/*.css$
Allow: /misc/*.css?
Allow: /misc/*.js$
Allow: /misc/*.js?
Allow: /misc/*.gif
Allow: /misc/*.jpg
Allow: /misc/*.jpeg
Allow: /misc/*.png
Allow: /modules/*.css$
Allow: /modules/*.css?
Allow: /modules/*.js$
Allow: /modules/*.js?
Allow: /modules/*.gif
Allow: /modules/*.jpg
Allow: /modules/*.jpeg
Allow: /modules/*.png
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /themes/*.css$
Allow: /themes/*.css?
Allow: /themes/*.js$
Allow: /themes/*.js?
Allow: /themes/*.gif
Allow: /themes/*.jpg
Allow: /themes/*.jpeg
Allow: /themes/*.png
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/
robots.txt
According to robotstxt.org, website owners use the robots.txt file to give instructions about their site to web robots. This is called the Robots Exclusion Protocol.
It works as follows:
A robot wants to visit a website URL, e.g. http://www.example.com/welcome.html.
Before it does so, it first checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /
The User-agent: * means this section applies to all robots.
The Disallow: / tells the robot that it should not visit any pages on the site.
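The check described above can be sketched with Python's standard urllib.robotparser. The robot name ExampleBot and the helper may_visit are made up for illustration, and the rules are passed in as text rather than downloaded, so nothing here touches the network.

```python
from urllib.robotparser import RobotFileParser


def may_visit(robots_lines, user_agent, url):
    """Decide whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# The two lines the robot finds in http://www.example.com/robots.txt.
robots_lines = ["User-agent: *", "Disallow: /"]

# ExampleBot is a hypothetical robot name.
decision = may_visit(robots_lines, "ExampleBot",
                     "http://www.example.com/welcome.html")
print(decision)  # False: the robot should skip the page
```

A well-behaved robot would run this kind of check before every request and skip the URL whenever the result is False.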
There are two important considerations when using robots.txt:
Robots can ignore your robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
The robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.
Outro
When Chrome is driven by ChromeDriver, the navigator.webdriver property is set. It defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, so that alternate code paths can be triggered during automation. That is presumably why the automated session is shown different, longer robots.txt content than a manual visit.
You can find a relevant discussion in Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
My client has a store on BigCommerce and is using Google Webmaster Tools to monitor search performance. Google is telling the client that certain URLs are blocked on mobile by the robots.txt file. It seems any URL that ends with ?sort=newest is being blocked.
The contents of the HTTP robots.txt file are:
User-agent: *
Disallow: /account.php
Disallow: /cart.php
Disallow: /checkout.php
Disallow: /finishorder.php
Disallow: /login.php
Disallow: /orderstatus.php
Disallow: /postreview.php
Disallow: /productimage.php
Disallow: /productupdates.php
Disallow: /remote.php
Disallow: /search.php
Disallow: /viewfile.php
Disallow: /wishlist.php
Disallow: /admin/
Disallow: /_socialshop/
The contents of the HTTPS robots.txt file are:
User-agent: *
Disallow: /
User-agent: google-xrawler
Allow: /feeds/*
Obviously, there is no mention of ?sort=newest in either robots.txt file.
How can I stop the ?sort=newest pages from being blocked on mobile search?
While I understand your question, blocking URL filters is actually best practice. You can read more about it on Moz under the heading "No Indexed Filters".
BigCommerce offers some additional information on this, including how to indicate to Google which search parameters should be crawled, if any.
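For what it is worth, the two files quoted in the question can be compared locally with Python's standard urllib.robotparser. This is a sketch under assumptions: store.example.com and the /shoes/ path are placeholders (the real store domain is not given), and robotparser is a simple prefix matcher, not Google's own. robots.txt rules apply per protocol and host, so the HTTP and HTTPS files are evaluated separately.

```python
from urllib.robotparser import RobotFileParser

# The HTTP file from the question.
HTTP_RULES = """\
User-agent: *
Disallow: /account.php
Disallow: /cart.php
Disallow: /checkout.php
Disallow: /finishorder.php
Disallow: /login.php
Disallow: /orderstatus.php
Disallow: /postreview.php
Disallow: /productimage.php
Disallow: /productupdates.php
Disallow: /remote.php
Disallow: /search.php
Disallow: /viewfile.php
Disallow: /wishlist.php
Disallow: /admin/
Disallow: /_socialshop/
"""

# The HTTPS file from the question, copied as written.
HTTPS_RULES = """\
User-agent: *
Disallow: /

User-agent: google-xrawler
Allow: /feeds/*
"""


def can_crawl(rules_text, agent, url):
    """Evaluate one robots.txt text for one agent and URL."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical category URL on a placeholder domain.
PAGE = "store.example.com/shoes/?sort=newest"
http_ok = can_crawl(HTTP_RULES, "Googlebot", "http://" + PAGE)
https_ok = can_crawl(HTTPS_RULES, "Googlebot", "https://" + PAGE)
print("HTTP :", http_ok)   # True
print("HTTPS:", https_ok)  # False
```

If the blocked URLs are reported under https://, this suggests (but does not prove) that the Disallow: / line in the HTTPS file is what blocks them, rather than any rule about ?sort=newest.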
I'd like to disallow all subdirectories in my folder /search but allow indexing the search folder itself (I have content on /search).
Testing this does not work:
User-Agent: *
Allow: /search/
Disallow: /search/*
Your rules are close. Try a slight adjustment: anchor the Allow rule with $ so that it matches only /search/ itself:
User-Agent: *
Disallow: /search/*
Allow: /search/$
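To see why the adjusted rules behave differently, here is a minimal sketch of the precedence described in RFC 9309 and in Google's documentation: the longest matching pattern wins, and Allow wins a tie. The helper names pattern_to_regex and is_allowed are made up for this example, measuring specificity by raw pattern length is a simplification, and real crawlers may differ in details.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (directive, pattern) pairs from one user-agent group."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


adjusted = [("Disallow", "/search/*"), ("Allow", "/search/$")]
original = [("Allow", "/search/"), ("Disallow", "/search/*")]

print(is_allowed(adjusted, "/search/"))        # True
print(is_allowed(adjusted, "/search/shoes/"))  # False
print(is_allowed(original, "/search/"))        # False
```

Under this model the original rules fail because Disallow: /search/* is one character longer than Allow: /search/ and also matches /search/ itself, whereas Allow: /search/$ ties with the Disallow pattern in length and the tie goes to Allow.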
I am making a robots.txt for my website. Can anybody confirm that I am doing it correctly? If I am wrong, please tell me how to write it in the correct form.
admincp, adminpp, etc. are folders on my hosting server:
User-agent: *
Disallow: /admincp/
Disallow: /adminpp/
Disallow: /Advertise with us/
Disallow: /ajax/
Disallow: /banner/
Disallow: /cont_img/
Disallow: /corcel/
Disallow: /css/
Disallow: /fbold/
Disallow: /images/
Disallow: /img/
Disallow: /js/
Disallow: /pic/
Disallow: /Scripts/
Disallow: /textpimg/
Disallow: /thumb_uploadtopics/
Disallow: /upload_p1/
Disallow: /uploadtopics/
You can read this article to get a better grip on writing Robots.txt
http://seoroi.com/seo-faq/robotstxt-what-it-is-why-its-used-and-how-to-write-it/
For example, you can write:
User-agent: *
Disallow: /src/
Disallow: /cgi-bin/
Disallow: /~zohaib/
Disallow: /temp/
Yes, it seems you are doing it correctly.
As long as your folder and page names are written correctly, I don't see any problem in your robots.txt file.
Some notes, in case you want to know:
User-agent: * means the rules apply to all search engine bots.
Disallow: names a folder or page that you don't want crawlers to visit.
I have a few doubts about this robots file.
User-agent: *
Disallow: /administrator/
Disallow: /css/
Disallow: /func/
Disallow: /images/
Disallow: /inc/
Disallow: /js/
Disallow: /login/
Disallow: /recover/
Disallow: /Scripts/
Disallow: /store/com-handler/
Disallow: /store/img/
Disallow: /store/theme/
Disallow: /store/StoreSys.swf
Disallow: config.php
This is going to block crawlers from all files inside each folder, right?
Or do I have to add an asterisk at the end of each folder name?
I think this should do it, but I'm not sure whether I have to add Allow: / right after User-agent. I suppose it isn't needed.
Anything wrong in this robots file?
PS: If someone can suggest a validation app for local use, I would be glad.
Thanks.
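It's fine as is, if I understand what you want, apart from the last rule, which should start with a slash (Disallow: /config.php) to match anything. E.g.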
/administrator/
/css/subpage
are both blocked, but
/foo
is allowed. Note that Allow is a less supported extension designed only to counter a previous Disallow. You might use it if, for instance, despite your
Disallow: /images/
you decide you want a particular image allowed. So,
Allow: /images/ok_image
All other images remain blocked. You can see http://www.searchtools.com/robots/robots-txt.html for more info, including a list of checkers.
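For a rough local check, Python's standard urllib.robotparser can evaluate the rules without any network access. This is a sketch, not a full validator: only a subset of the question's rules is included, www.example.com is a placeholder host, and robotparser applies the first matching rule in file order, so the Allow line is listed before the Disallow it counters.

```python
from urllib.robotparser import RobotFileParser


def build_parser(text):
    """Parse robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def blocked(parser, path):
    # www.example.com is a placeholder host.
    return not parser.can_fetch("*", "http://www.example.com" + path)


# A subset of the rules from the question, copied as written.
site_rules = build_parser("""\
User-agent: *
Disallow: /administrator/
Disallow: /css/
Disallow: /images/
Disallow: config.php
""")

for path in ["/administrator/", "/css/subpage", "/foo", "/config.php"]:
    print(path, "blocked" if blocked(site_rules, path) else "allowed")

# Allow used to counter a Disallow, with Allow listed first for this parser.
allow_rules = build_parser("""\
User-agent: *
Allow: /images/ok_image
Disallow: /images/
""")

print(blocked(allow_rules, "/images/ok_image"))   # False
print(blocked(allow_rules, "/images/other.png"))  # True
```

With this parser, /administrator/ and /css/subpage come out blocked and /foo allowed. /config.php also comes out allowed, because the pattern config.php has no leading slash and so never matches a URL path.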