Simple question. I want to add:
Disallow */*details-print/
Basically, I want blocking rules of the form /foo/bar/dynamic-details-print, where foo and bar in this example can also be totally dynamic.
I thought this would be simple, but then on www.robotstxt.org there is this message:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
So we can't do that? Do search engines abide by it? But then, there's Quora.com's robots.txt file:
Disallow: /ajax/
Disallow: /*/log
Disallow: /*/rss
Disallow: /*_POST
So, who is right? Or am I misunderstanding the text on robotstxt.org?
Thanks!
The answer is, "it depends". The robots.txt "standard" as defined at robotstxt.org is the minimum that bots are expected to support. Googlebot, MSNbot, and Yahoo Slurp support some common extensions, and there's really no telling what other bots support. Some say what they support and others don't.
In general, you can expect the major search engine bots to support the wildcards you've written, and the rule you have there looks like it will work. Your best bet would be to run it past a robots.txt validator or use Google's Webmaster Tools to check it.
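For reference, here is a minimal sketch of the kind of record those crawlers should accept (the * wildcard and the end-of-URL $ anchor are extensions supported by the major crawlers mentioned above, not part of the base standard, so behaviour can vary for other bots):

User-agent: *
Disallow: /*details-print

If the pages always end in "-details-print", appending $ (Disallow: /*-details-print$) limits the rule to URLs that end there rather than any URL containing the string.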
I have disallowed certain pages using robots.txt for all crawlers. Do I have to write meta tags for those files, or will web crawlers just skip them so there is no need to do so?
If the crawler you want to limit obeys robots.txt, then you are fine; if it doesn't, you are probably out of luck either way, because chances are it will ignore the meta tags too.
All major search-engine crawlers do obey it, however, so you are probably fine.
You are good to go. All of the big search engines (Google, really) obey any entries you make in robots.txt. http://www.robotstxt.org/robotstxt.html
Also, be aware that the robots.txt file itself is viewable, so don't use this as a security measure. http://www.cre8asiteforums.com/forums/index.php?showtopic=55546
Well written bots will ignore those pages (provided that robots.txt syntax is correct).
Is this a good idea??
http://browsers.garykeith.com/stream.asp?RobotsTXT
What does abusive crawling mean? How is that bad for my site?
Not really. Most "bad bots" ignore the robots.txt file anyway.
Abusive crawling usually means scraping. These bots are showing up to harvest email addresses or, more commonly, content.
As to how you can stop them? That's really tricky and often not wise. Anti-crawl techniques have a tendency to be less than perfect and cause problems for regular humans.
Sadly, like "shrinkage" in retail, it's a cost of doing business on the web.
A user-agent (which includes crawlers) is under no obligation to honour your robots.txt. The best you can do is try to identify abusive access patterns (via web-logs, etc.), and block the corresponding IP.
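If the site runs on Apache 2.4 or later, a minimal sketch of such an IP block in the vhost config or .htaccess might look like this (the address is purely illustrative):

# Deny one abusive client while allowing everyone else
<RequireAll>
    Require all granted
    Require not ip 203.0.113.42
</RequireAll>

Older Apache 2.2 setups would use the Order/Allow/Deny directives instead, and other servers have their own equivalents.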
Okay, I know this question has been asked plenty of times already, but I haven't found any actual answer.
Considering SEO, what is the best way to construct the URL for multiple languages? One top-level domain for each language would feel unnecessary, so I'm thinking about different subdomains or subfolders. And in that case, which would be better: en.mydomain.com or english.mydomain.com? And if, e.g., the English version is viewed more than the Swedish version, how do I tell the search engines that they are actually the same page?
Pretty much everything is answered in this Google Webmasters article: Multi-regional and multilingual sites.
Here's a summary of relevance:
URL structures
Consider using a URL structure that makes it easy to geotarget parts of your site to different regions. The following table outlines your options:
ccTLDs (country-code top-level domain names)
Example: example.de
Pros:
Clear geotargeting
Server location irrelevant
Easy separation of sites
Cons:
Expensive (and may have limited availability)
Requires more infrastructure
Strict ccTLD requirements (sometimes)
Subdomains with gTLDS (generic top-level domain name)
Example: de.example.com
Pros:
Easy to set up
Can use Webmaster Tools geotargeting
Allows different server locations
Easy separation of sites
Cons:
Users might not recognize geotargeting from the URL alone (is "de" the language or country?)
Subdirectories with gTLDs
Example: example.com/de/
Pros:
Easy to set up
Can use Webmaster Tools geotargeting
Low maintenance (same host)
Cons:
Users might not recognize geotargeting from the URL alone
Single server location
Separation of sites harder
URL parameters (not recommended)
Example: example.com?loc=de
Cons:
URL-based segmentation difficult
Users might not recognize geotargeting from the URL alone
Geotargeting in Webmaster Tools is not possible
Duplicate content and international sites
Websites that provide content for different regions and in different languages sometimes create content that is the same or similar but available on different URLs. This is generally not a problem as long as the content is for different users in different countries. While we strongly recommend that you provide unique content for each different group of users, we understand that this may not always be possible. There is generally no need to "hide" the duplicates by disallowing crawling in a robots.txt file or by using a "noindex" robots meta tag. However, if you're providing the same content to the same users on different URLs (for instance, if both example.de/ and example.com/de/ show German language content for users in Germany), you should pick a preferred version and redirect (or use the rel=canonical link element) appropriately.
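As a concrete illustration of the "pick a preferred version" advice in that last quoted paragraph, the non-preferred German URL would either 301-redirect to the preferred one or carry a canonical link pointing at it, for example (URLs taken from the quoted example):

<link rel="canonical" href="http://example.com/de/" />

placed in the <head> of the pages on example.de/.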
Google's guidelines are:
Make sure the page language is obvious
Make sure each language version is easily discoverable
This point specifically references URLs as needing to be kept separate. The example they provide is:
For example, the following .ca URLs use fr as a subdomain or subdirectory to clearly indicate French content: http://example.ca/fr/vélo-de-montagne.html and http://fr.example.ca/vélo-de-montagne.html.
They also state:
It’s fine to translate words in the URL, or to use an Internationalized Domain Name (IDN). Make sure to use UTF-8 encoding in the URL (in fact, we recommend using UTF-8 wherever possible) and remember to escape the URLs properly when linking to them.
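To make the "escape the URLs properly" point concrete: in UTF-8 the character é is the two bytes 0xC3 0xA9, so a percent-encoded link to the French page from the example above would look like this (the anchor text is just a placeholder):

<a href="http://example.ca/fr/v%C3%A9lo-de-montagne.html">Vélo de montagne</a>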
Targeting the site content to a specific country
This is done through ccTLDs, geotargeting settings in Search Console, server location, and 'other signals'.
If you're worried about duplicate content, they state:
Websites that provide content for different regions and in different languages sometimes create content that is the same or similar but available on different URLs. This is generally not a problem as long as the content is for different users in different countries. While we strongly recommend that you provide unique content for each different group of users, we understand that this might not always be possible.
If you do re-use the same content across the same website (but in a different language), then:
There is generally no need to "hide" the duplicates by disallowing crawling in a robots.txt file or by using a "noindex" robots meta tag.
But!
However, if you're providing the same content to the same users on different URLs (for instance, if both example.de/ and example.com/de/ show German language content for users in Germany), you should pick a preferred version and redirect (or use the rel=canonical link element) appropriately. In addition, you should follow the guidelines on rel-alternate-hreflang to make sure that the correct language or regional URL is served to searchers.
So, be sure to declare the relationship between different languages using hreflang.
Example below:
<link rel="alternate" href="http://example.com" hreflang="en-us" />
You can use this in a number of places including your page markup, HTTP headers, or even the sitemap.
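For the English/Swedish case in the question, a minimal sketch of a reciprocal set might look like this (the URLs are placeholders; each language version should list the full set, including itself, and x-default is optional):

<link rel="alternate" href="https://example.com/en/" hreflang="en" />
<link rel="alternate" href="https://example.com/sv/" hreflang="sv" />
<link rel="alternate" href="https://example.com/" hreflang="x-default" />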
Here's a link to a hreflang generator which you might find useful.
Hope this helps.
Is there a foolproof way to restrict your content from being indexed by major search engines?
Thanks
Prady
One possible way is the robots.txt file.
User-Agent: *
Disallow: /
Here is a blog post discussing other techniques, including meta tags.
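For completeness, the meta tag version goes in the <head> of each page you want kept out of the index:

<meta name="robots" content="noindex, nofollow">

Unlike robots.txt, this only works if the crawler is allowed to fetch the page and see the instruction.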
Most search engines follow robots.txt. I've heard Yahoo Slurp! does not.
You could scan the user agent for well-known bots such as Google, Yahoo, Bing, Internet Archive, etc. and produce blank output. You will be penalised for giving alternate content to Google, but since you are blocking them, it won't be a problem.
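A rough mod_rewrite sketch of that idea, if you are on Apache (the user-agent list is illustrative and incomplete, and this returns a 403 Forbidden rather than a truly blank page):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp|ia_archiver) [NC]
RewriteRule .* - [F]

Keep in mind that user-agent strings are trivially spoofed, so this only affects crawlers that identify themselves honestly.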
The most important thing is that whatever you publish publicly can and will be accessed by bots such as search engine spiders.
Don't forget that bots have a nasty habit of turning up where you don't want them to be (which, mixed with bad coding practices, can be quite disastrous).
Foolproof? I think not. You can restrict IPs, use robots.txt and meta tags, but if a search engine really, really wants your content indexed, it will find a way.
I need to honor the web browser's list of language preferences. Supported languages are English and French. For example: http_accept_language="jp-JP;fr;en-US;en" redirects to a directory called /French/. How can I do this with rewrite rules in my .htaccess file?
I wouldn't use mod_rewrite for this but a more powerful language, because Accept-Language is a list of weighted values (see quality value), and the occurrence of one of the identifiers does not mean that it is preferred over another value (in particular, q=0 means not acceptable at all).
As already said, use a more powerful language than mod_rewrite: parse the list of values and find the best match between the preferred options and the available options.
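A minimal sketch of that approach in Python, not tied to any particular framework (note that a real browser separates Accept-Language entries with commas, e.g. "fr-CA,fr;q=0.9,en;q=0.8", whereas the header in the question uses semicolons throughout):

def pick_language(accept_language, supported=("en", "fr"), default="en"):
    """Return the best supported language for an Accept-Language header value."""
    weighted = []
    for entry in accept_language.split(","):
        entry = entry.strip()
        if not entry:
            continue
        lang, _, params = entry.partition(";")
        quality = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        weighted.append((lang.strip().lower(), quality))
    # Highest quality first; q=0 means "not acceptable at all"
    for lang, quality in sorted(weighted, key=lambda item: item[1], reverse=True):
        if quality <= 0:
            continue
        primary = lang.split("-")[0]  # e.g. "fr-ca" -> "fr"
        if primary in supported:
            return primary
    return default

print(pick_language("fr-CA,fr;q=0.9,en;q=0.8"))  # -> "fr", so redirect to /French/

Whatever language you use server-side, the idea is the same: compute the best match and then issue the redirect to /French/ or the English directory from your application code.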