How to tell search engines to use my updated robots.txt file? - seo

Previously, I had blocked search engine robots from crawling my website using a robots.txt file, but now I want to unblock them.
I updated the robots.txt file to allow search engine robots to crawl my website, but it seems the search engines are still using my old robots.txt file. How can I tell the search engines to use my new robots.txt file? Or is there something wrong with my robots.txt file?
The content of my old robots.txt file:
User-agent: *
Disallow: /
The content of my new robots.txt file:
User-agent: *
Allow: /
# Disallow these directories, url types & file-types
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Disallow: /cgi-bin
Disallow: /readme.html
Disallow: /license.txt
Disallow: /*?*
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*/wp-*
Disallow: /*/feed/*
Disallow: /*/*?s=*
Disallow: /*/*.js$
Disallow: /*/*.inc$
Allow: /wp-content/uploads/
User-agent: ia_archiver*
Disallow: /
User-agent: duggmirror
Disallow: /
Sitemap: https://example.com/sitemap.xml

This will need to be done independently for each search engine; otherwise it will simply happen over time as each crawler re-fetches your robots.txt. For Google, use Google Search Console: its robots.txt Tester shows you the version Google currently has and lets you submit the updated file for recrawling.
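Before asking Google to re-fetch anything, it can be worth confirming that your server is actually handing out the updated file, since crawlers typically cache robots.txt for a while (Google caches it for up to about a day). A minimal Python sketch, assuming your site is at https://example.com and the file you just deployed is saved locally as robots.txt (both are placeholders):

from urllib.request import urlopen

# example.com is a placeholder; use your own domain.
with urlopen("https://example.com/robots.txt") as resp:
    live = resp.read().decode("utf-8")

# Compare with the copy you just deployed; any difference means the server
# (or a CDN/cache in front of it) is still handing out the old file.
with open("robots.txt", encoding="utf-8") as fh:
    local = fh.read()

print("up to date" if live.strip() == local.strip() else "still serving an older copy")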

Related

SE robots don't index pages from sitemap.xml

I uploaded a sitemap to my site, but not all of the URLs have been indexed. I have a bunch of URLs that have not been indexed by Google, and I don't know why this happens.
Right now, I have 716 URLs without indexing.
Looking at which URLs have not been indexed, here are some examples:
All of these URLs are completely accessible; if you click on any of them, the page loads correctly:
https://www.calzadosniza.es/es/mujer/zapatos-mujer/zapato-descubierto-puntera-charol-ancho-juan-mastre-108-7920#/62-tallas_grandes-40/116-color-azul
https://www.calzadosniza.es/es/mujer/sandalias-mujer/sandalia-cuna-pala-cruzada-combi-plata-glenda-porronet-6551-porronet-8751#/62-tallas_grandes-40/114-color-blanco
https://www.calzadosniza.es/es/mujer/botas-y-botines-mujer/bota-militar-cordon-piso-volumen-2670-tekila-3999#/63-tallas_grandes-41/113-color-negro
If I inspect one of them, for example https://www.calzadosniza.es/es/mujer/zapatos-mujer/zapato-descubierto-puntera-charol-ancho-juan-mastre-108-7920#/62-tallas_grandes-40/116-color-azul, it is still reported as not indexed.
My robots.txt file is:
# Allow Directives
Allow: */modules/*.css
Allow: */modules/*.js
Allow: */modules/*.png
Allow: */modules/*.jpg
# Private pages
Disallow: /*?orderby=
Disallow: /*?orderway=
Disallow: /*?tag=
Disallow: /*?id_currency=
Disallow: /*?search_query=
Disallow: /*?back=
Disallow: /*?n=
Disallow: /*&orderby=
Disallow: /*&orderway=
Disallow: /*&tag=
Disallow: /*&id_currency=
Disallow: /*&search_query=
Disallow: /*&back=
Disallow: /*&n=
Disallow: /*controller=addresses
Disallow: /*controller=address
Disallow: /*controller=authentication
Disallow: /*controller=cart
Disallow: /*controller=discount
Disallow: /*controller=footer
Disallow: /*controller=get-file
Disallow: /*controller=header
Disallow: /*controller=history
Disallow: /*controller=identity
Disallow: /*controller=images.inc
Disallow: /*controller=init
Disallow: /*controller=my-account
Disallow: /*controller=order
Disallow: /*controller=order-slip
Disallow: /*controller=order-detail
Disallow: /*controller=order-follow
Disallow: /*controller=order-return
Disallow: /*controller=order-confirmation
Disallow: /*controller=pagination
Disallow: /*controller=password
Disallow: /*controller=pdf-invoice
Disallow: /*controller=pdf-order-return
Disallow: /*controller=pdf-order-slip
Disallow: /*controller=product-sort
Disallow: /*controller=search
Disallow: /*controller=statistics
Disallow: /*controller=attachment
Disallow: /*controller=guest-tracking
# Directories
Disallow: */cache/
Disallow: */classes/
Disallow: */config/
Disallow: */controllers/
Disallow: */css/
Disallow: */download/
Disallow: */js/
Disallow: */localization/
Disallow: */log/
Disallow: */mails/
Disallow: */modules/
Disallow: */override/
Disallow: */pdf/
Disallow: */src/
Disallow: */tools/
Disallow: */translations/
Disallow: */upload/
Disallow: */vendor/
Disallow: */web/
Disallow: */webservice/
# Files
Disallow: /*es/password-recovery
Disallow: /*es/address
Disallow: /*es/addresses
Disallow: /*es/login
Disallow: /*es/cart
Disallow: /*es/discount
Disallow: /*es/order-history
Disallow: /*es/identity
Disallow: /*es/my-account
Disallow: /*es/order-follow
Disallow: /*es/credit-slip
Disallow: /*es/order
Disallow: /*es/search
Disallow: /*es/guest-tracking
Disallow: /*es/order-confirmation
Disallow: /*ca/password-recovery
Disallow: /*ca/address
Disallow: /*ca/addresses
Disallow: /*ca/login
Disallow: /*ca/cart
Disallow: /*ca/discount
Disallow: /*ca/order-history
Disallow: /*ca/identity
Disallow: /*ca/my-account
Disallow: /*ca/order-follow
Disallow: /*ca/credit-slip
Disallow: /*ca/order
Disallow: /*ca/search
Disallow: /*ca/guest-tracking
Disallow: /*ca/order-confirmation
Disallow: /*gl/password-recovery
Disallow: /*gl/address
Disallow: /*gl/addresses
Disallow: /*gl/login
Disallow: /*gl/cart
Disallow: /*gl/discount
Disallow: /*gl/order-history
Disallow: /*gl/identity
Disallow: /*gl/my-account
Disallow: /*gl/order-follow
Disallow: /*gl/credit-slip
Disallow: /*gl/order
Disallow: /*gl/search
Disallow: /*gl/guest-tracking
Disallow: /*gl/order-confirmation
Disallow: /*eu/password-recovery
Disallow: /*eu/address
Disallow: /*eu/addresses
Disallow: /*eu/login
Disallow: /*eu/cart
Disallow: /*eu/discount
Disallow: /*eu/order-history
Disallow: /*eu/identity
Disallow: /*eu/my-account
Disallow: /*eu/order-follow
Disallow: /*eu/credit-slip
Disallow: /*eu/order
Disallow: /*eu/search
Disallow: /*eu/guest-tracking
Disallow: /*eu/order-confirmation
So why are all these URLs not being indexed when I upload my sitemap to Google Search Console?
Am I doing something wrong?
A sitemap is not a directive for search engines; it is just a recommendation. Search engines can't crawl all pages, so use the "priority" field in your sitemap.
Also try checking the HTML of the unindexed pages manually; there may be a prohibiting tag:
<meta name="robots" content="noindex" />
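With hundreds of unindexed pages, the same meta-tag check can be scripted rather than done by hand. A rough Python sketch (the URL is one of the examples from the question; a real HTML parser would be more robust than a regular expression):

import re
from urllib.request import urlopen

urls = [
    "https://www.calzadosniza.es/es/mujer/zapatos-mujer/"
    "zapato-descubierto-puntera-charol-ancho-juan-mastre-108-7920",
    # ...add the rest of the unindexed URLs here
]

# Rough check for a robots meta tag whose content includes "noindex"
# (assumes the name attribute appears before content, the common order).
noindex = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]*content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

for url in urls:
    html = urlopen(url).read().decode("utf-8", errors="replace")
    print(("noindex found: " if noindex.search(html) else "no prohibition tag: ") + url)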

robots.txt allows all URLs

robots.txt
# robots.txt
User-agent: *
Disallow: */admin/*
Disallow: */Edit/*
Disallow: */edit/*
Disallow: */SendEmailToResetPassword/*
Disallow: */GetBidsAndComments/*
Disallow: */User/GetLoginUserId/*
Disallow: */Account/SendEmailToResetPassword/*
Disallow: */DeleteConfirmed/*
I am testing with the robots.txt Tester in Google Search Console.
URLs it allows:
http://www.exp.com/ClassifiedAds/DeleteConfirmed/22c1c7b6-e114-4f29-8844-d3ae5e89950b
http://www.exp.com/Account/SendEmailToResetPassword?ReturnUrl=%2Fitem%2Fhome-appliances-80b30b83-7ba9-4565-9e44-0d40d3f7f1d6
http://www.exp.com/api/Electronic/GetBidsAndComments?id=51426&_=1548892800023
How can I disallow these URLs?
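One detail that may explain at least the second and third URLs: in Googlebot-style matching, a pattern that ends in /* only matches when a literal / follows that segment, so a URL that continues with a query string (or simply ends there) slips through, while a rule without the trailing /* acts as a plain prefix and does match. A small Python sketch that mimics that wildcard behaviour, shown here only as an illustration of the matching rules, using the third URL from above:

import re

def rule_matches(pattern, url):
    # Googlebot-style matching: '*' matches any run of characters, a trailing
    # '$' anchors the end, and otherwise the pattern only has to match a
    # prefix of the URL (path plus query string).
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex + ("$" if anchored else ""), url) is not None

url = "/api/Electronic/GetBidsAndComments?id=51426&_=1548892800023"

print(rule_matches("*/GetBidsAndComments/*", url))  # False: no '/' after the segment
print(rule_matches("*/GetBidsAndComments", url))    # True: plain prefix-style rule

So, if the intent is to block those controller URLs regardless of what follows them, dropping the trailing /* from the Disallow lines should do it; the first URL, which does have a slash after DeleteConfirmed, looks like it should already be blocked and may be worth re-testing.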

prevent googlebot from indexing file types in robots.txt and .htaccess

There are many Stack Overflow questions on how to prevent Googlebot from indexing, for instance, .txt files. There's this:
robots.txt
User-agent: Googlebot
Disallow: /*.txt$
.htaccess
<Files ~ "\.txt$">
Header set X-Robots-Tag "noindex, nofollow"
</Files>
However, what is the syntax for both of these when trying to prevent two types of files from being indexed? In my case: .txt and .doc.
In your robots.txt file:
User-agent: Googlebot
Disallow: /*.txt$
Disallow: /*.doc$
More details at Google Webmasters: Create a robots.txt file
In your .htaccess file:
<FilesMatch "\.(txt|doc)$">
Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
More details here: http://httpd.apache.org/docs/current/sections.html
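After deploying the .htaccess change, it's worth confirming the header is actually being sent. A quick Python check (the URL is a placeholder; point it at a real .txt or .doc file on your site):

from urllib.request import urlopen

# Placeholder URL; substitute a real .txt or .doc file on your site.
with urlopen("http://www.example.com/file.txt") as resp:
    print(resp.headers.get("X-Robots-Tag"))  # expected: noindex, nofollow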

Sitemap/robots.txt configuration conflict

My robots.txt contains the following rules:
Disallow: /api/
Allow: /
Allow: /apiDocs
The /apiDocs URL is in the sitemap, but according to Google Webmaster Tools, these robots.txt rules prohibit it from being crawled. I want to prevent all URLs that match /api/* from being crawled, but allow the URL /apiDocs to be crawled.
How should I change my robots.txt to achieve this?
Line breaks aren’t allowed in a record (you have one between your Disallow and the two Allow lines).
You don’t need Allow: / (it’s the same as Disallow:, which is the default).
You disallow crawling of /api/ (which is any URL whose path starts with "/api/"), so there is no need for Allow: /apiDocs, as it's allowed anyway.
So your fallback record should look like:
User-Agent: *
Disallow: /login/
Disallow: /logout/
Disallow: /admin/
Disallow: /error/
Disallow: /festival/subscriptions
Disallow: /artistSubscription
Disallow: /privacy
Disallow: /terms
Disallow: /static
Disallow: /api/
When a bot is matched by this "fallback" record, it is allowed to crawl URLs whose paths start with /apiDocs.
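Because these rules are plain path prefixes (no wildcards), Python's standard-library robots.txt parser evaluates them the same way, so you can sanity-check the record before deploying it. A minimal sketch with just the /api/ rule:

from urllib.robotparser import RobotFileParser

robots = """\
User-Agent: *
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(robots.splitlines())

# /api/... is blocked, while /apiDocs is allowed with no explicit Allow needed.
print(rp.can_fetch("*", "https://example.com/api/v1/users"))  # False
print(rp.can_fetch("*", "https://example.com/apiDocs"))       # True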

Need to block subdomain using robots.txt which is on same directory level

I have one problem.
I have two domain names, for example www.testing.com and new.testing.com, and I do not want new.testing.com to appear in any search engine. I have added a robots.txt to new.testing.com. Both sites share the same parent directory:
--httpdoc
----testing.com
----new.testing.com
So I want to know: can I handle both sites using the single robots.txt of testing.com? Is that possible?
Please suggest a solution if possible.
The best thing you can do is add separate robots.txt files, one in each directory. You should have:
testing.com/robots.txt and
new.testing.com/robots.txt
After adding the robots.txt file to new.testing.com, add the following to it to keep search engines away:
User-agent: *
Disallow: /
Alternatively, you can serve a different robots.txt for the subdomain with a mod_rewrite rule in your .htaccess (subdomain.website.com is the example hostname here):
RewriteEngine on
RewriteCond %{HTTP_HOST} ^subdomain\.website\.com$
RewriteRule ^robots\.txt$ robots-subdomain.txt
Then add the following to /robots-subdomain.txt:
User-agent: *
Disallow: /
The following rules work for me:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^subdomain\.maindomain\.com$ [NC]
RewriteRule ^/robots.txt$ /nobots.txt [L]
And add 'nobots.txt' to the root directory as follows:
User-agent: *
Disallow: /
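Once the rewrite rule is in place, you can check from the outside that each host serves its own rules (using the question's example hostnames):

from urllib.request import urlopen

# Hostnames from the question; new.testing.com should return the blanket
# "Disallow: /" file, while the main site keeps its normal robots.txt.
for host in ("http://testing.com", "http://new.testing.com"):
    print(host)
    print(urlopen(host + "/robots.txt").read().decode("utf-8"))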