I want to disallow URLs in my robots.txt based on a URL parameter and a numeric range.
I have a website with GET parameters like:
example.com/show?id_item=1
to
example.com/show?id_item=999
I want to disallow id_item 1 to 500.
Is it possible to disallow an "id_item" range in robots.txt without writing tons of lines (500 in this case)?
It depends on the range. It’s easy for your example (1 to 999, disallowing 1 to 500):
User-agent: *
Disallow: /show?id_item=1
Disallow: /show?id_item=2
Disallow: /show?id_item=3
Disallow: /show?id_item=4
Disallow: /show?id_item=500
This disallows any id_item that starts with "1", "2", "3", "4", or "500". (Note that this list still misses IDs 5-9 and 50-99, which would need additional rules of their own.)
So URLs like these will be disallowed:
https://example.com/show?id_item=1
https://example.com/show?id_item=19
https://example.com/show?id_item=150
https://example.com/show?id_item=1350
https://example.com/show?id_item=1foo
If you expect IDs higher than 999, it doesn’t work like that anymore (because IDs like "1001" would also be disallowed). You might have to make use of Allow then (but this feature isn’t part of the original robots.txt spec, so not necessarily supported by all consumers), and the list becomes longer.
Depending on the range, $ might be useful. It indicates the end of the URL (but this is also a feature that’s not part of the original robots.txt spec, so it’s not necessarily supported by all robots.txt parsers). For example, the following line would only block ID "500":
Disallow: /show?id_item=500$
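If you want to double-check which IDs a set of prefix-style rules actually covers, a short script can help. This is only a sketch: the prefixes and the 1 to 999 range are taken from the example above, and it models a Disallow line as a plain "starts with" test rather than a real robots.txt parser.

# Sketch: list which IDs in 1..999 the prefix rules above would block,
# treating each Disallow line as a simple prefix match on the id_item value.
prefixes = ["1", "2", "3", "4", "500"]
blocked = [i for i in range(1, 1000) if any(str(i).startswith(p) for p in prefixes)]
print(len(blocked), blocked[:10])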
No, there is really no way to do this with robots.txt, other than having 500 lines, one for each number. (not a recommendation!) The closest thing is the wildcard extension "*", but this will match a string of any length, made up of any characters. There is no way to match a specific pattern of digits, which is what you would need to match a numeric range.
If your goal is to keep these pages out of the search engines, then the best way to do this is to add code to selectively block these pages with robots meta tags or x-robots-tag headers whenever the id is in the target range.
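If your pages are generated by server-side code, that check can live right in the request handler. Here is a minimal sketch assuming a Python/Flask app serving /show; the framework choice and the page rendering are placeholders, while the id_item parameter and the 1 to 500 range come from the question.

from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/show")
def show():
    item_id = request.args.get("id_item", type=int)
    resp = make_response(f"Item {item_id}")  # placeholder for the real page rendering
    # Ask search engines not to index items 1-500 while still serving the page
    if item_id is not None and 1 <= item_id <= 500:
        resp.headers["X-Robots-Tag"] = "noindex"
    return resp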
If your goal is to prevent the pages from being crawled at all (e.g. to reduce server load) then you are out of luck. You will have to choose between blocking all of them (with Disallow: /show?id_item=) or none of them.
We are using a CMS that produces URLs of the format www.domain.com/home/help/contact/contact. Here the first occurrence of contact is the directory and the second occurrence is the HTML page itself. These URLs are causing issues in the SEO space.
We have implemented canonical tags but the business wants to make sure they don't come across these duplicates in both the search engines and Google analytics, and have asked us to implement a 301 solution on our web server.
My question is: we have a regex to find these matches, but I also need the part of the URL before the match. The regex we have is .*?([\w]+)\/\1+ and it returns contact for /home/help/contact/contact. How can I get the /home/help/ part as well, so I can redirect to the right page? Can someone help with this, please? I am a beginner when it comes to regex.
Since you're able to get contact using a matching group, enclose everything before that inside a matching group as well:
(.*?)(/[\w]+)\2+
I have put the / inside the matching group too, so that you won't get false positives for:
/home/some/app/page
(with the original pattern, the repeated group \1 would be just the character p here: the p at the end of app and the p at the start of page would be treated as a repetition)
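To see what the two groups capture, and how to build the redirect target from them, here is a quick Python sketch (the path is the example from the question; this is not the actual server configuration):

import re

# (.*?) captures everything before the repeated segment, (/[\w]+) the segment itself
pattern = re.compile(r"(.*?)(/[\w]+)\2+")
m = pattern.fullmatch("/home/help/contact/contact")
if m:
    print(m.group(1))               # /home/help
    print(m.group(2))               # /contact
    print(m.group(1) + m.group(2))  # /home/help/contact  <- redirect target

In a web server rewrite rule the replacement would typically be written as $1$2 (or \1\2), depending on the server.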
The Problem
I am trying to block a path that contains a certain URL parameter with robots.txt. I want to block the path regardless of where this URL parameter appears in the query string.
What I have tried
After reading several answers, I have tried:
Disallow: /*?param=
Also:
Disallow: /*?param=*
These only block the path if param is the first URL parameter, but not if it appears later in the query string.
I also tried:
Disallow: /*?*param=*
While this works, it also blocks any path that has a URL parameter whose name merely ends in param (such as otherparam=), so it is not an acceptable solution.
The Question
How can I block a path that contains a specific URL parameter regardless of where it appears in the query string?
If you want to block:
/path/file?param=value
/path/file?a=b&param=value
but you do not want to block:
/path/file?otherparam=value
/path/file?a=b&otherparam=value
You need to use two disallow lines, like this:
User-agent: *
Disallow: /*?param=
Disallow: /*&param=
There is no way to do this reliably with a single line.
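A rough way to sanity-check the two rules against the example URLs is to treat * as "any sequence of characters", which is how the major crawlers interpret it. This is only a sketch, not a real robots.txt parser:

import re

# /*?param=  and  /*&param=  rewritten as anchored regular expressions
rules = [r"^/.*\?param=", r"^/.*&param="]
urls = [
    "/path/file?param=value",           # should be blocked
    "/path/file?a=b&param=value",       # should be blocked
    "/path/file?otherparam=value",      # should stay allowed
    "/path/file?a=b&otherparam=value",  # should stay allowed
]
for url in urls:
    print("blocked" if any(re.search(r, url) for r in rules) else "allowed", url)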
I would like Google to ignore URLs like this:
http://www.mydomain.example/new-printers?dir=asc&order=price&p=3
In other words, all the URLs that have the parameters dir, order, and p should be ignored. How do I do so with robots.txt?
Here's a solution if you want to disallow query strings:
Disallow: /*?*
or, if you want to be more precise about your query string:
Disallow: /*?dir=*&order=*&p=*
(note that this pattern only matches when the parameters appear in exactly this order)
You can also add to the robots.txt which URLs to allow:
Allow: /new-printer$
The $ makes sure that only a URL ending exactly in /new-printer is matched by the Allow rule.
More info:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/
You can block those specific query string parameters with the following lines
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=
So if any URL contains dir=, order=, or p= anywhere in its query string, it will be blocked. Be aware that these patterns also catch parameters whose names merely end in dir, order, or p (for example subdir= or jsonp=), as shown in the sketch below.
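To see both the intended matches and the accidental ones, here is a small sketch that applies the three patterns (again treating * as "any characters") to a few made-up URLs; subdir=photos and jsonp=callback are only illustrations of the suffix problem, not parameters from the question:

import re

rules = {"dir=": r"^/.*\?.*dir=", "order=": r"^/.*\?.*order=", "p=": r"^/.*\?.*p="}
urls = [
    "/new-printers?dir=asc&order=price&p=3",  # intended match
    "/new-printers?subdir=photos",            # caught by the dir= rule anyway
    "/new-printers?jsonp=callback",           # caught by the p= rule anyway
]
for url in urls:
    hits = [name for name, rx in rules.items() if re.search(rx, url)]
    print(url, "->", hits if hits else "allowed")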
Register your website with Google WebMaster Tools. There you can tell Google how to deal with your parameters.
Site Configuration -> URL Parameters
You should have the pages that contain those parameters indicate that they should be excluded from indexing via the robots meta tag, e.g.:
<meta name="robots" content="noindex">
How can I prevent search engines from indexing pages with 3 or more forward slashes in their paths?
for example:
www.example.com/about.html -> ok
www.example.com/1/2/3/4/5/test.html -> no index!!!
How should I write the robots.txt? Thanks!
Try with
User-Agent: *
Disallow: /*/*/*/*
Although I'm not sure all bots support the *.
I am wondering if it is possible to prevent YQL from URL encoding a key for a datatable?
Example:
The current guardian API works with IDs like this:
item_id = "environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy"
The problem with these IDs is that they contain slashes (/) and these characters should not be URL encoded in the API call but instead stay as they are.
So if I now have this query
SELECT * FROM guardian.content.item WHERE item_id='environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy'
while using the following URL definition in my datatable
<url>http://content.guardianapis.com/{item_id}</url>
then this results in this API call
http://content.guardianapis.com/environment%2F2010%2Foct%2F29%2Fbiodiversity-talks-ministers-nagoya-strategy?format=xml&order-by=newest&show-fields=all
Instead the guardian API expects the call to look like this:
http://content.guardianapis.com/environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy?format=xml&order-by=newest&show-fields=all
So the problem is really just that the / characters get encoded as %2F, which I don't want to happen in this case.
Any ideas on how this can be achieved?
You can also check the full datatable I am using:
http://github.com/spier/yql-tables/blob/master/guardian/guardian.content.item.xml
The URI-template expansions in YQL (e.g. {item_id}) only follow the version 3 spec. With version 4 it would be possible to make a simple (only slight) change to the expansion to do what you want, but alas that is not currently possible with YQL.
So, a solution. You could bring a very, very basic <execute> block into play: one which adds the item_id value to the path as needed.
<execute><![CDATA[
// build the request path from item_id directly, so its slashes are not percent-encoded
response.object = request.path(item_id).get().response;
]]></execute>
Finally, see the diff against your table (with a few other, minor tweaks to allow the above to work).
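As an aside, the encoding behaviour itself is easy to reproduce outside of YQL. This Python sketch is not part of the table; it only illustrates the difference between encoding the slashes and leaving them alone, i.e. the two URL forms shown in the question (item_id is the example ID from the question):

from urllib.parse import quote

item_id = "environment/2010/oct/29/biodiversity-talks-ministers-nagoya-strategy"
print(quote(item_id, safe=""))   # slashes become %2F (the unwanted call the question reports)
print(quote(item_id, safe="/"))  # slashes left alone (the form the Guardian API expects)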