I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all URLs that contain the parameters dir, order, and p should be ignored. How do I do that with robots.txt?
Here's a solution if you want to disallow all query strings:
Disallow: /*?*
or, if you want to be more precise about your query string (note that this pattern matches only when the parameters appear in this exact order):
Disallow: /*?dir=*&order=*&p=*
You can also tell robots.txt which URLs to allow:
Allow: /new-printer$
The $ ensures that the rule matches only the exact URL /new-printer and nothing beyond it.
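If it helps to see how the * and $ wildcards behave, here is a rough Python sketch of that matching logic, checked against the question's URL (an approximation for illustration, not Google's actual implementation):

import re

def robots_pattern_matches(pattern: str, path_and_query: str) -> bool:
    # Approximate Google-style robots.txt matching:
    # '*' matches any run of characters, a trailing '$' anchors the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    if anchored:
        regex += "$"
    return re.match(regex, path_and_query) is not None

print(robots_pattern_matches("/*?*", "/new-printers?dir=asc&order=price&p=3"))          # True (blocked)
print(robots_pattern_matches("/new-printer$", "/new-printers?dir=asc&order=price&p=3"))  # False (not matched)
print(robots_pattern_matches("/new-printer$", "/new-printer"))                           # True (matched by the Allow rule)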
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
You can block those specific query string parameters with the following lines:
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in the query string, it will be blocked.
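Note that Disallow rules only take effect inside a User-agent group, so in the robots.txt file those lines would sit under one, for example (the wildcard group below is an assumption; scope it to specific crawlers if you prefer):

User-agent: *
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=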
Register your website with Google Webmaster Tools. There you can tell Google how to handle your parameters.
Site Configuration -> URL Parameters
You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag.
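For example, a standard noindex robots meta tag in each affected page's <head> looks like this (the X-Robots-Tag: noindex HTTP response header is the equivalent for non-HTML responses):

<meta name="robots" content="noindex">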
Related
I want to forward this URL
https://demo.example.com/page1?v=105
to
https://www.example.com/page1
It doesn't work if I set the rule to the full URL including the ?v=105 part. But if I remove the ?v= part, it works.
Is there any way to include the ?v= part in this page rule?
The rule you mentioned works only for that exact URL, https://demo.example.com/page1?v=105. If the URL has an additional query parameter or even one extra character, e.g. https://demo.example.com/page1?v=1050, it won't match the rule.
If you need to use complicated URL matches that require specific query strings, you might need to use bulk redirects or a Cloudflare worker.
I am trying to connect to this API endpoint; some parameters, such as roomTypes and addOns, require more parameters inside them. What should the URL look like?
Here is what I am trying, unsuccessfully:
https://api.lodgify.com/v2/quote/308200/?arrival=2020-10-02&departure=2020-10-07&propertyId=308200&roomTypes=[Id=373125, People=5]&addOns=[]
See image of Documentation
The correct format for the parameters is as follows:
https://api.lodgify.com/v2/quote/{PropertyID}?arrival={DATE}&departure={DATE}&roomTypes[0].id={RoomID}&roomTypes[0].people={PEOPLE}
It seems you have a space (whitespace) between Id and People in your URL; a URL must not contain a literal space.
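As a small sketch of building that URL in Python (the property ID, dates, and room values are taken from the question; whether the API expects the brackets literally or percent-encoded, the key point is one key=value pair per field and no spaces):

from urllib.parse import urlencode

base = "https://api.lodgify.com/v2/quote/308200"
params = {
    "arrival": "2020-10-02",
    "departure": "2020-10-07",
    "roomTypes[0].id": "373125",    # indexed parameter format from the answer above
    "roomTypes[0].people": "5",
}
url = f"{base}?{urlencode(params)}"  # urlencode percent-encodes the brackets as %5B/%5D
print(url)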
I want to use my robots.txt to disallow URLs based on a URL parameter with a numeric range.
I have a website with GET parameters like:
example.com/show?id_item=1
to
example.com/show?id_item=999
I want to disallow id_item values from 1 to 500. Is it possible to disallow an id_item range in robots.txt without writing tons of lines (500 in this case)?
It depends on the range. For your example (IDs 1 to 999, disallowing 1 to 500), a handful of prefix rules covers most of it:
User-agent: *
Disallow: /show?id_item=1
Disallow: /show?id_item=2
Disallow: /show?id_item=3
Disallow: /show?id_item=4
Disallow: /show?id_item=500
This disallows any id_item that starts with "1", "2", "3", "4", or "500". (IDs in the target range that don't start with those strings, such as 5-9 and 50-99, would still need lines of their own, e.g. exact matches using the $ described below.)
So URLs like these will be disallowed:
https://example.com/show?id_item=1
https://example.com/show?id_item=19
https://example.com/show?id_item=150
https://example.com/show?id_item=1350
https://example.com/show?id_item=1foo
If you expect IDs higher than 999, it doesn’t work like that anymore (because IDs like "1001" would also be disallowed). You might have to make use of Allow then (but this feature isn’t part of the original robots.txt spec, so not necessarily supported by all consumers), and the list becomes longer.
Depending on the range, $ might be useful. It indicates the end of the URL (but this is also a feature that’s not part of the original robots.txt spec, so it’s not necessarily supported by all robots.txt parsers). For example, the following line would only block ID "500":
Disallow: /show?id_item=500$
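If you do end up listing IDs one per line, a short script can generate the exact-match rules instead of writing them by hand (a sketch in Python, using the path and range from the question and the $ caveat above):

# Print one exact-match Disallow rule per ID in the 1-500 range.
for item_id in range(1, 501):
    print(f"Disallow: /show?id_item={item_id}$")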
No, there is really no way to do this with robots.txt, other than having 500 lines, one for each number. (not a recommendation!) The closest thing is the wildcard extension "*", but this will match a string of any length, made up of any characters. There is no way to match a specific pattern of digits, which is what you would need to match a numeric range.
If your goal is to keep these pages out of the search engines, then the best way to do this is to add code to selectively block these pages with robots meta tags or x-robots-tag headers whenever the id is in the target range.
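A minimal sketch of that approach, assuming a Python/Flask handler serves /show (the framework, the placeholder body, and the range check are illustrative, not from the question):

from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/show")
def show():
    item_id = request.args.get("id_item", type=int)
    resp = make_response(f"Item {item_id}")  # placeholder page body
    # Ask search engines not to index items in the target range (1-500),
    # while still letting the pages be crawled and served normally.
    if item_id is not None and 1 <= item_id <= 500:
        resp.headers["X-Robots-Tag"] = "noindex"
    return resp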
If your goal is to prevent the pages from being crawled at all (e.g. to reduce server load) then you are out of luck. You will have to choose between blocking all of them (with Disallow: /show?id_item=) or none of them.
The Problem
I am trying to block a path that contains a certain URL parameter with robots.txt. I want to block the path regardless of where this URL parameter appears in the query string.
What I have tried
After reading several answers, I have tried:
Disallow: /*?param=
Also:
Disallow: /*?param=*
These only block the path if param is the first URL parameter, not if it appears later in the query string.
I also tried:
Disallow: /*?*param=*
While this works, it also blocks any path that has a URL parameter ending in param (such as otherparam=), so it is not an acceptable solution.
The Question
How can I block a path that contains a specific URL parameter regardless of where it appears in the query string?
If you want to block:
/path/file?param=value
/path/file?a=b&param=value
but you do not want to block:
/path/file?otherparam=value
/path/file?a=b&otherparam=value
You need to use two disallow lines, like this:
User-agent: *
Disallow: /*?param=
Disallow: /*&param=
There is no way to do this reliably with a single line, because the parameter is preceded by either ? (when it comes first in the query string) or & (when it comes later), and a robots.txt pattern cannot express both cases in one rule.
How can I prevent search engines from indexing pages with 3 or more forward slashes in their paths?
for example:
www.example.com/about.html -> ok
www.example.com/1/2/3/4/5/test.html -> no index!!!
How should I write the robots.txt? Thanks!
Try this:
User-Agent: *
Disallow: /*/*/*/*
Each /* requires one more slash, so this pattern matches URLs whose paths contain at least four slashes; drop one /* (i.e. Disallow: /*/*/*) if you want to match three or more. Also note that not all bots support the * wildcard.