The Problem
I am trying to block a path that contains a certain URL parameter with robots.txt. I want to block the path regardless of where this URL parameter appears in the query string.
What I have tried
After reading several answers, I have tried:
Disallow: /*?param=
Also:
Disallow: /*?param=*
These only block the path if param is the first parameter in the query string, not if it appears later in the URL.
I also tried:
Disallow: /*?*param=*
While this works, it also blocks any path that has a URL parameter whose name merely ends in param (for example otherparam=), and as such is not an acceptable solution.
The Question
How can I block a path that contains a specific URL parameter regardless of where it appears in the query string?
If you want to block:
/path/file?param=value
/path/file?a=b&param=value
but you do not want to block:
/path/file?otherparam=value
/path/file?a=b&otherparam=value
You need to use two disallow lines, like this:
User-agent: *
Disallow: /*?param=
Disallow: /*&param=
There is no way to do this reliably with a single line.
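To see why the two lines together cover both cases, here is a small PHP sketch that emulates the wildcard matching with regular expressions. It is an illustration of the pattern semantics, not a real robots.txt parser:

<?php
// Emulate the two Disallow patterns: "*" in robots.txt behaves like ".*".
$patterns = array(
    '#^/.*\?param=#', // Disallow: /*?param=  (param is the first parameter)
    '#^/.*&param=#',  // Disallow: /*&param=  (param appears later)
);

$urls = array(
    '/path/file?param=value',          // blocked by the first pattern
    '/path/file?a=b&param=value',      // blocked by the second pattern
    '/path/file?otherparam=value',     // blocked by neither
    '/path/file?a=b&otherparam=value', // blocked by neither
);

foreach ($urls as $url) {
    $blocked = preg_match($patterns[0], $url) || preg_match($patterns[1], $url);
    echo $url, ' => ', ($blocked ? 'blocked' : 'allowed'), "\n";
}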
Related
I want to forward this URL
https://demo.example.com/page1?v=105
to
https://www.example.com/page1
It's not working if I set the page rule to match the URL including the ?v=105 part. But if I remove the ?v= part from the match, it works.
Is there any way to include the ?v= part in this page rule?
The rule you mentioned works only for that exact URL: https://demo.example.com/page1?v=105. If it has any additional query parameter or even one additional character, e.g. https://demo.example.com/page1?v=1050, it won't match the rule.
If you need complicated URL matching that depends on specific query strings, you might need to use Bulk Redirects or a Cloudflare Worker.
I want to use robots.txt to disallow URLs based on a URL parameter with a numeric range.
I have a website with GET parameters like:
example.com/show?id_item=1
to
example.com/show?id_item=999
I want to disallow id_item 1 to 500.
Is it possible to disallow an id_item range in robots.txt without writing tons of lines (500 in this case)?
It depends on the range. It’s easy for your example (1 to 999, disallowing 1 to 500):
User-agent: *
Disallow: /show?id_item=1
Disallow: /show?id_item=2
Disallow: /show?id_item=3
Disallow: /show?id_item=4
Disallow: /show?id_item=500
This disallows any id_item that starts with "1", "2", "3", "4", or "500". Note that single-digit IDs 5-9 and two-digit IDs 50-99 also fall in the 1 to 500 range but are not covered by these prefixes; they would need their own entries (for example with $, see below).
So URLs like these will be disallowed:
https://example.com/show?id_item=1
https://example.com/show?id_item=19
https://example.com/show?id_item=150
https://example.com/show?id_item=1350
https://example.com/show?id_item=1foo
If you expect IDs higher than 999, it doesn’t work like that anymore (because IDs like "1001" would also be disallowed). You might have to make use of Allow then (but this feature isn’t part of the original robots.txt spec, so not necessarily supported by all consumers), and the list becomes longer.
Depending on the range, $ might be useful. It indicates the end of the URL (but this is also a feature that’s not part of the original robots.txt spec, so it’s not necessarily supported by all robots.txt parsers). For example, the following line would only block ID "500":
Disallow: /show?id_item=500$
No, there is really no way to do this with robots.txt, other than having 500 lines, one for each number. (not a recommendation!) The closest thing is the wildcard extension "*", but this will match a string of any length, made up of any characters. There is no way to match a specific pattern of digits, which is what you would need to match a numeric range.
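For completeness, if you did go the one-line-per-ID route, you wouldn't write the lines by hand; a throwaway script can generate them. A minimal PHP sketch (the output format and the use of the non-standard $ anchor are illustrative assumptions):

<?php
// Generate one exact-match Disallow line per ID, using the non-standard
// "$" end-of-URL anchor so that e.g. id_item=50 doesn't also block 500-509.
for ($id = 1; $id <= 500; $id++) {
    echo "Disallow: /show?id_item={$id}\$\n";
}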
If your goal is to keep these pages out of the search engines, then the best way to do this is to add code to selectively block these pages with robots meta tags or x-robots-tag headers whenever the id is in the target range.
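A minimal sketch of that approach, assuming a PHP backend (the parameter name and range bounds are from the question; everything else is illustrative):

<?php
// Send a noindex signal when id_item is in the target range (1 to 500).
$id = isset($_GET['id_item']) ? (int) $_GET['id_item'] : 0;

if ($id >= 1 && $id <= 500) {
    // HTTP header variant:
    header('X-Robots-Tag: noindex');
    // Or, alternatively, print this inside the page <head>:
    // echo '<meta name="robots" content="noindex">';
}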
If your goal is to prevent the pages from being crawled at all (e.g. to reduce server load) then you are out of luck. You will have to choose between blocking all of them (with Disallow: /show?id_item=) or none of them.
I have various products, each with its own set path, e.g.:
electronics/mp3-players/sony-hg122
fitness/devices/gymboss
I want to be able to access URLs in this format, for example:
http://www.mysite.com/fitness/devices/gymboss
http://www.mysite.com/electronics/mp3-players/sony-hg122
My strategy was to override the "init" function of the SiteController in order to catch the paths and then direct it to my own implementation of a render function. However, this doesn't allow me to catch the path.
Am I going about it the wrong way? What would be the correct strategy to do this?
** EDIT **
I figure I have to make use of the URL manager. But how do I dynamically add path formats if they are all custom in a database?
Eskimo's setup is a good solid approach for most Yii systems. However, for yours, I would suggest creating a custom UrlRule to query your database:
http://www.yiiframework.com/doc/guide/1.1/en/topics.url#using-custom-url-rule-classes
Note: the URL rules are parsed on every single Yii request, so be careful in there. If you aren't efficient, you can rapidly slow down your site. By default rules are cached (if you have a cache set up), but I don't know if that applies to dynamic DB rules (I would think not).
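A minimal sketch of such a rule class (Yii 1.1), assuming a hypothetical Product model with a path column holding the full slug; the class and attribute names are illustrative:

<?php
// Custom URL rule that resolves paths like "fitness/devices/gymboss"
// against the database.
class ProductUrlRule extends CBaseUrlRule
{
    public function createUrl($manager, $route, $params, $ampersand)
    {
        // Build "fitness/devices/gymboss"-style URLs for product/view links.
        if ($route === 'product/view' && isset($params['path'])) {
            return $params['path'];
        }
        return false; // this rule does not apply, let other rules try
    }

    public function parseUrl($manager, $request, $pathInfo, $rawPathInfo)
    {
        // Look the whole path up in the database (consider caching this).
        $product = Product::model()->findByAttributes(array('path' => $pathInfo));
        if ($product !== null) {
            $_GET['path'] = $pathInfo;
            return 'product/view';
        }
        return false; // not a product path, fall through to the next rule
    }
}

You would then register it as an entry in the urlManager rules array, e.g. array('class' => 'application.components.ProductUrlRule').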
In your URL manager config (protected/config/main.php), set urlFormat to path, and optionally set showScriptName to false (this hides the index.php part of the URL):
'urlManager' => array(
    'urlFormat' => 'path',
    'showScriptName' => false,
),
Next, in your rules, you could setup something like:
'catalogue/<category_url:.+>/<product_url:.+>' => 'product/view',
So what this does is route any request with a structure like catalogue/electronics/ipods to the ProductController actionView. You can then access the category_url and product_url portions of the URL like so:
$_GET['category_url'];
$_GET['product_url'];
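For illustration, a hypothetical actionView using those parameters (the Product model and its attributes are assumptions, not part of the original answer):

<?php
// ProductController sketch. Yii 1.1 can bind action parameters from
// $_GET automatically, so these arguments map to the rule's named parts.
class ProductController extends Controller
{
    public function actionView($category_url, $product_url)
    {
        $product = Product::model()->findByAttributes(array(
            'category_url' => $category_url,
            'slug'         => $product_url,
        ));
        if ($product === null) {
            throw new CHttpException(404, 'Product not found.');
        }
        $this->render('view', array('product' => $product));
    }
}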
How this rule works is: any URL which starts with the word catalogue (directly after your domain name), followed by another part of the path (category_url) and another (product_url), will be directed to that controller/action.
You will notice that in my example I am preceding the category and product with the word catalogue. Obviously you could replace this with whatever you prefer or leave it out altogether. The reason I have put it in is: consider the following URL:
http://mywebsite.com/site/about
If you left out the 'catalogue' portion of the URL and defined your rule only as:
'<category_url:.+>/<product_url:.+>' => 'product/view',
the URL manager would see the site portion of the URL as the category_url value, and the about portion as the product_url. To prevent this you can either keep the catalogue portion of the URL, or define rules for the non-catalogue pages (i.e. define a rule for site/about).
Rules are interpreted top to bottom, and only the first matching rule is applied. Obviously you can add as many rules as you need for as many different URL structures as you have. An example rules array for that case is sketched below.
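For example, a rules array that handles site/about explicitly before the catch-all product rule (a sketch; adjust the routes to your app):

'rules' => array(
    'site/about' => 'site/about', // match the static page first
    '<category_url:.+>/<product_url:.+>' => 'product/view', // catch-all
),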
I hope this gets you on the right path; feel free to comment with any questions or clarifications you need.
I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all the URLs that have the parameters dir, order, and p should be ignored. How do I do so with robots.txt?
Here's a solution if you want to disallow all query strings:
Disallow: /*?*
or, if you want to be more precise about your query string (note that this only matches URLs where all three parameters appear in exactly this order):
Disallow: /*?dir=*&order=*&p=*
You can also specify in robots.txt which URLs to allow:
Allow: /new-printer$
The $ will make sure that only the exact URL /new-printer is allowed.
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
You can block those specific query string parameters with the following lines:
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked. Be aware that these patterns also match parameters whose names merely end in dir, order, or p (for example subdir=).
Register your website with Google WebMaster Tools. There you can tell Google how to deal with your parameters.
Site Configuration -> URL Parameters
You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, for example:
<meta name="robots" content="noindex, nofollow">
I thought using colons in URIs was "illegal". Then I saw that vimeo.com is using URIs like http://www.vimeo.com/tag:sample.
What do you feel about the usage of colons in URIs?
How do I make my Apache server work with the "colon" syntax because now it's throwing the "Access forbidden!" error when there is a colon in the first segment of the URI?
Colons are allowed in the URI path. But you need to be careful when writing relative URI paths with a colon since it is not allowed when used like this:
<a href="tag:sample">
In this case tag would be interpreted as the URI’s scheme. Instead you need to write it like this:
<a href="./tag:sample">
Are colons allowed in URLs?
Yes, unless it's in the first path segment of a relative-path reference
So for example you can have a URL like this:
https://en.wikipedia.org/wiki/Template:Welcome
And you can use it normally as an absolute URL or in relative variants:
<a href="https://en.wikipedia.org/wiki/Template:Welcome">Welcome Template</a>
<a href="//en.wikipedia.org/wiki/Template:Welcome">Welcome Template</a>
<a href="/wiki/Template:Welcome">Welcome Template</a>
But this would be invalid:
<a href="Template:Welcome">Welcome Template</a>
because the "Template" here would be mistaken for the protocol scheme.
You would have to use:
<a href="./Template:Welcome">Welcome Template</a>
to use a relative link from a page on the same level in the hierarchy.
The spec
See RFC 3986, Section 3.3:
https://www.rfc-editor.org/rfc/rfc3986#section-3.3
The path component contains data, usually organized in hierarchical form, that, along with data in the non-hierarchical query component (Section 3.4), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI.

If a URI contains an authority component, then the path component must either be empty or begin with a slash ("/") character. If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//"). In addition, a URI reference (Section 4.1) may be a relative-path reference, in which case the first path segment cannot contain a colon (":") character. The ABNF requires five separate rules to disambiguate these cases, only one of which will match the path substring within a given URI reference. We use the generic term "path component" to describe the URI substring matched by the parser to one of these rules. [emphasis added]
Example URL that uses a colon:
https://en.wikipedia.org/wiki/Template:Welcome
Also note the difference between Apache on Linux and Windows: Apache on Windows doesn't allow colons in the first segment of the URL (presumably because colons aren't legal in Windows file names, so the URL-to-filesystem mapping fails). Linux has no problem with this, however.