I would like to disallow all parameters in a specific url.
If I add this rule:
Disallow: /*?*
It works, but it applies to all URLs.
What I would like to do:
Disallow: /my-specific-url/*?*
But according to Google Webmaster Tools, this rule doesn’t work.
Your example looks like it should be working, but you do need to include the User-agent line. The following robots.txt file:
User-agent: *
Disallow: /my-specific-url/*?*
Will block the following URLs:
http://example.com/my-specific-url/?
http://example.com/my-specific-url/?a=b
but it will not block the following:
http://example.com/my-specific-url/
http://example.com/some-other-url/?a=b
Note that the trailing * is harmless but serves no useful purpose. A cleaner way to do exactly the same thing would be:
User-agent: *
Disallow: /my-specific-url/*?
Also note that wildcards are supported by the major search engines, but they are not supported by many other crawlers.
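If you want to sanity-check a wildcard rule like this offline, here is a minimal Python sketch that mimics the Google-style matching described above (the helper name pattern_to_regex is made up for illustration; it is not an official parser):

import re

def pattern_to_regex(pattern):
    # Treat '*' as "any sequence of characters" and a trailing '$' as an
    # end-of-path anchor; everything else is matched literally.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

rule = pattern_to_regex("/my-specific-url/*?")
for path in ["/my-specific-url/?", "/my-specific-url/?a=b",
             "/my-specific-url/", "/some-other-url/?a=b"]:
    print(path, "blocked" if rule.match(path) else "not blocked")

Running this prints "blocked" for the first two paths and "not blocked" for the last two, matching the lists above.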
While you can't use regular expressions, you are allowed to use wildcards:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#url-matching-based-on-path-values
Have you tried something like:
Disallow: /my-specific-url/*var1=*
I'm trying to set it up where www.url.com/folder is disallowed, but www.url.com/folder/1 is allowed. I have it set up as follows:
User-agent: *
Disallow: /folder
Allow: /folder/*
which works when testing with the Google robots.txt tester, but if I look at the logs, I can see Googlebot hitting all of the URLs other than /folder.
Am I missing something? Should Allow go first?
I think this one should work:
User-agent: *
Disallow: /folder/$
Allow: /folder/*
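Note that for Googlebot the order of the Allow and Disallow lines shouldn't matter: the longest matching pattern wins, and conflicts are resolved in favour of the less restrictive (Allow) rule. Here is a rough, self-contained Python sketch of that precedence using the original rules from the question (the helper names are just illustrative):

def matches(pattern, path):
    # Simplified matcher: a trailing '*' in a prefix rule is redundant,
    # so strip it and compare prefixes.
    return path.startswith(pattern.rstrip("*"))

def is_allowed(rules, path):
    # Googlebot-style precedence: the longest matching pattern wins,
    # and Allow beats Disallow on a tie; file order is irrelevant.
    best = None
    for directive, pattern in rules:
        if matches(pattern, path):
            key = (len(pattern), directive == "Allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

rules = [("Disallow", "/folder"), ("Allow", "/folder/*")]
for path in ["/folder", "/folder/1"]:
    print(path, "allowed" if is_allowed(rules, path) else "blocked")

This prints "blocked" for /folder and "allowed" for /folder/1, which matches what the Google robots.txt tester reports for those rules.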
I am trying to disallow one page in a subdirectory.
This is the robots.txt code I am using:
User-Agent: *
Disallow:
Disallow: /form.aspx
but form.aspx is in the process folder, and my URL looks like
www.yoursite.com/process/form.aspx
So how can I disallow form.aspx in robots.txt?
Is the robots.txt format given above right?
Please guide me.
If you want to block http://example.com/process/form.aspx and allow everything else, you can use:
# robots.txt on <http://example.com/robots.txt>
User-agent: *
Disallow: /process/form.aspx
Note that this would also block URLs like http://example.com/process/form.aspx.foo, http://example.com/process/form.aspx/bar, etc.
I have written some rules to block a few URLs in robots.txt. Now I want to verify those rules. Are there any tools for verifying robots.txt?
I have written this rule:
Disallow: /classifieds/search*/
to block these URLs:
http://example.com/classifieds/search?filter_states=4&filter_frieght=8&filter_driver=2
http://example.com/classifieds/search?keywords=Covenant+Transport&type=Carrier
http://example.com/classifieds/search/
http://example.com/classifieds/search
I also want to know the difference between these rules:
Disallow: /classifieds/search*/
Disallow: /classifieds/search/
Disallow: /classifieds/search
Your rule Disallow: /classifieds/search*/ does not do what you want it to do.
First, note that the * character has no special meaning in the original robots.txt specification. But some parsers, like Google’s, use it as a wildcard for pattern matching. Assuming that you have this rule for those parsers only:
From your example, this rule would only block http://example.com/classifieds/search/. The three other URLs don’t have a / after search.
Disallow: /classifieds/search
→ blocks all URLs whose paths start with /classifieds/search
Disallow: /classifieds/search/
→ blocks all URLs whose paths start with /classifieds/search/
Disallow: /classifieds/search*/
→ for parsers following the original spec: blocks all URLs whose paths start with /classifieds/search*/
→ for parsers that use * as wildcard: blocks all URLs whose paths start with /classifieds/search, followed by anything, followed by /
For blocking the four example URLs, simply use the following:
User-agent: *
Disallow: /classifieds/search
This will block, for example:
http://example.com/classifieds/search?filter=4
http://example.com/classifieds/search/
http://example.com/classifieds/search/foo
http://example.com/classifieds/search
http://example.com/classifieds/search.html
http://example.com/classifieds/searching
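If you want to check rules like this locally, Python's standard library ships a robots.txt parser. A rough sketch (note that urllib.robotparser implements the original specification only, so it does not understand the * and $ wildcard extensions; it is a reliable check for plain prefix rules like the one above, but not for the wildcard variants):

from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() takes the robots.txt content as a list of lines, so rules can
# be tested before they are published.
rp.parse([
    "User-agent: *",
    "Disallow: /classifieds/search",
])

for url in [
    "http://example.com/classifieds/search?filter=4",
    "http://example.com/classifieds/search/",
    "http://example.com/classifieds/search",
    "http://example.com/classifieds/searching",
    "http://example.com/classifieds/other",
]:
    print(url, rp.can_fetch("*", url))

Every URL whose path starts with /classifieds/search comes back False (disallowed); the last one comes back True.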
The problem with using robots.txt is that it cannot actually block anything; it just asks web crawlers nicely not to crawl certain areas of your site.
As for verification: provided the syntax is valid, it should work, and you can monitor your server logs to see whether known compliant bots avoid those directories after reading robots.txt. This, of course, relies on the bots accessing your site complying with the standard.
There are a lot of online validators that can be used, such as http://www.frobee.com/robots-txt-check
And when it comes to those three rules:
> **Disallow: /classifieds/search*/**
Disallow anything inside a directory where the name starts with search, but not the directory itself
> **Disallow: /classifieds/search/**
Disallow anything inside the directory named search
> **Disallow: /classifieds/search**
Disallow any directory starting with search
I haven't tested this myself, but did you try a robots.txt checker?
As for the difference between the three rules, I'd say that
Disallow: /classifieds/search*/ disallows all subdirectories of /classifieds/ beginning with "search"
Disallow: /classifieds/search/ only disallows the /classifieds/search/ directory
Disallow: /classifieds/search disallows visiting a file called /classifieds/search
I'm working on a WordPress site that has a login portal where users can access 'classified' documents in PDF, DOC, and a few other formats. The files are uploaded via the media manager, so they are always stored in /wp-content/uploads.
I need to make sure these file types are not shown in search results. I've made some rules in .htaccess and robots.txt that I think will work, but it's very hard to test, so I was hoping someone could glance over them and let me know whether they'll do what I'm expecting. One thing in particular I wasn't sure of: would the Disallow: /wp-content/ rule stop the X-Robots-Tag from being seen?
.htaccess - under # END WordPress
# do not index specified file types
<IfModule mod_headers.c>
<FilesMatch "\.(doc|docx|xls|xlsx|pdf|ppt|pptx)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
</IfModule>
robots.txt - complete
User-agent: *
Disallow: /feed/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /xmlrpc.php
Disallow: /wp-
Disallow: /growers-portal
Disallow: /growers-portal/
Disallow: /grower_posts
Disallow: /grower_posts/
Sitemap: http://www.pureaussiepineapples.com.au/sitemap_index.xml
Neither of those stops anyone from reading your "classified" documents. To do that you really want to restrict access to logged-in users.
The robots tag will keep the files out of the search results.
However, robots.txt does not stop files from appearing in the search results. Google takes that directive to mean it can't crawl the file, but it can still include it in the index.
This causes an interesting scenario: your robots.txt stops Google from reading the robots tag, so it does not know you want the file out of the index.
So, if you're not going to physically control access to the files, I would use the robots tag but not the robots.txt directives.
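If you do rely on the X-Robots-Tag header, it is worth confirming that the server is actually sending it. One quick way is to request one of the uploaded files and inspect the response headers, for example with a short Python snippet (the file name below is just a placeholder):

import urllib.request

# Placeholder URL; substitute one of the real uploads under /wp-content/uploads
url = "http://www.pureaussiepineapples.com.au/wp-content/uploads/example.pdf"

req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    # If the .htaccess FilesMatch rule is active, this should print "noindex"
    print(resp.headers.get("X-Robots-Tag"))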
I'm trying to prevent Google, Yahoo, et al. from hitting my /products/ID/purchase page and am unsure how to do it.
I currently block them from hitting the sign-in page with the following:
User-agent: *
Disallow: /sign_in
Can I do something like the following?
User-agent: *
Disallow: /products/*/purchase
Or should it be:
User-agent: *
Disallow: /purchase
I assume you want to block /products/ID/purchase but allow /products/ID.
Your last suggestion would only block pages whose paths start with /purchase:
User-agent: *
Disallow: /purchase
So this is not what you want.
You'd need your second suggestion:
User-agent: *
Disallow: /products/*/purchase
This would block all URLs that start with /products/, followed by any character(s), followed by /purchase.
Note: It uses the wildcard *. In the original robots.txt "specification", this is not a character with special meaning. However, some search engines extended the spec and use it as a kind of wildcard. So it should work for Google and probably some other search engines, but you can't bet that it would work with all the other crawlers/bots.
So your robots.txt could look like:
User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase
Also note that some search engines (including Google) might still list a URL in their search results (without title/snippet) although it is blocked in robots.txt. This might be the case when they find a link to a blocked page on a page that is allowed to be crawled. To prevent this, you'd have to noindex the document.
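If you want to double-check how the wildcard rule behaves before deploying it, here is a tiny Python sketch of the * matching (illustrative only, not an official parser):

import re

# Google-style reading of "Disallow: /products/*/purchase":
# '*' stands for any sequence of characters within the path.
rule = re.compile("^" + re.escape("/products/") + ".*" + re.escape("/purchase"))

for path in ["/products/123/purchase", "/products/123", "/purchase"]:
    print(path, "blocked" if rule.match(path) else "not blocked")

Only /products/123/purchase is reported as blocked; /products/123 and /purchase are not.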
According to Google, Disallow: /products/*/purchase should work.
But according to robotstxt.org, this doesn't work.