I have disallowed certain pages using robots.txt for all crawlers. Do I have to write meta tags for those files, or will web crawlers just skip them so there is no need to do so?
If the crawler you want to limit obeys robots.txt then you are fine, but if it doesn't then you are probably screwed either way, because chances are it will ignore the meta tags too.
All major search-engine crawlers do obey it, however, so you are probably fine.
You are good to go. All of the big search engines (Google, really) obey any entries you make in robots.txt. http://www.robotstxt.org/robotstxt.html
Also, be aware that the robots.txt file itself is viewable, so don't use this as a security measure. http://www.cre8asiteforums.com/forums/index.php?showtopic=55546
Well-written bots will ignore those pages (provided that your robots.txt syntax is correct).
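For reference, a minimal robots.txt along the lines the question describes might look like this (the paths below are placeholders, not anything from the original question):

User-agent: *
# Hypothetical examples of pages to keep crawlers out of
Disallow: /private/
Disallow: /checkout.html

Compliant crawlers fetch /robots.txt first and skip the listed paths without ever requesting them, which is why no per-page meta tag is needed for those URLs.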
What is considered to be best practice for URL structuring these days?
For some reason I thought you only include an extension at the end of a URL once you get down to the 'lowest' part of your hierarchy, e.g.
/category/sub-category/product.html
then all category URLs would be:
/category/sub-category/
rather than including an extension at the end, because there is still further to go down the structure.
Looking forward to your thoughts.
Andy.
EDIT
Just for clarification purposes: I'm looking at this from an ecommerce perspective.
Your question is not very clear, but I'll reply as I understand it.
As to whether or not to use file extensions: according to Google's representative Matt Cutts, Google crawls .html, .php, or .asp, but you should keep away from .exe, .dll, and .bin. Those signify largely binary data, so they may be ignored by Googlebot.
Still, when designing SEO-friendly URLs, keep in mind that they should be short and descriptive, so you can use your keywords to rank higher. So, if you have good keywords in your category names, why not let them be visible in the URL?
Make sure you're using static rather than dynamic URLs; they are easier to remember, and they don't change.
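If the pages behind those URLs are generated dynamically, one common way to present static-looking URLs is a server-side rewrite. A rough Apache mod_rewrite sketch in .htaccess (the script name and parameters here are hypothetical):

RewriteEngine On
# Map /category/sub-category/product.html onto a hypothetical dynamic script
RewriteRule ^([a-z0-9-]+)/([a-z0-9-]+)/([a-z0-9-]+)\.html$ /product.php?cat=$1&sub=$2&item=$3 [L]

The crawlable URL stays short and keyword-friendly while the server still runs the dynamic code behind it.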
Is this a good idea?
http://browsers.garykeith.com/stream.asp?RobotsTXT
What does abusive crawling mean? How is that bad for my site?
Not really. Most "bad bots" ignore the robots.txt file anyway.
Abusive crawling usually means scraping. These bots show up to harvest email addresses or, more commonly, content.
As for how you can stop them? That's really tricky, and often not wise. Anti-crawl techniques tend to be less than perfect and cause problems for regular humans.
Sadly, like "shrinkage" in retail, it's a cost of doing business on the web.
A user-agent (which includes crawlers) is under no obligation to honour your robots.txt. The best you can do is try to identify abusive access patterns (via web-logs, etc.), and block the corresponding IP.
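As an illustration, blocking a misbehaving address range in Apache 2.2-style configuration could look something like this (203.0.113.0/24 is just a documentation placeholder range):

# .htaccess: allow everyone except the offending range
Order Allow,Deny
Allow from all
Deny from 203.0.113.0/24

Newer Apache versions express the same idea with Require directives instead.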
Simple question. I want to add:
Disallow */*details-print/
Basically, I want blocking rules of the form /foo/bar/dynamic-details-print, where foo and bar in this example can also be totally dynamic.
I thought this would be simple, but then on www.robotstxt.org there is this message:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
So we can't do that? Do search engines abide by it? But then, there's Quora.com's robots.txt file:
Disallow: /ajax/
Disallow: /*/log
Disallow: /*/rss
Disallow: /*_POST
So, who is right -- or am I misunderstanding the text on robotstxt.org?
Thanks!
The answer is, "it depends". The robots.txt "standard" as defined at robotstxt.org is the minimum that bots are expected to support. Googlebot, MSNbot, and Yahoo Slurp support some common extensions, and there's really no telling what other bots support. Some say what they support and others don't.
In general, you can expect the major search engine bots to support the wildcards you've written, and the rule you have there looks like it will work. Your best bet would be to run it past one or more robots.txt validators or use Google's Webmaster Tools to check it.
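As a concrete illustration, the big engines document support for * (match any sequence of characters) and $ (end of URL) in Disallow patterns, so the rule from the question could be written as follows (a sketch; there is no guarantee that every bot honours it):

User-agent: *
# Matches e.g. /foo/bar/dynamic-details-print/ for crawlers that support wildcards
Disallow: /*details-print/

A bot that only implements the bare robotstxt.org standard will treat that line as a literal path prefix, so for such bots it simply matches nothing rather than blocking the wrong thing.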
Is there a foolproof way to restrict your content from being indexed by major search engines?
Thanks
Prady
One possible way is the robots.txt file.
User-Agent: *
Disallow: /
Here is a blog post discussing other techniques, including meta tags.
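For completeness, the meta tag approach that post refers to is a noindex directive in each page's head, and roughly the same instruction can be sent as an HTTP response header for non-HTML files (a sketch):

<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex

Unlike robots.txt, which stops compliant bots from fetching a page at all, noindex lets them fetch it but asks them not to list it in results; that also means the page must not be blocked in robots.txt, or the crawler will never see the directive.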
Most search engines follow robots.txt. I've heard Yahoo! Slurp does not.
You could scan the user agent for well-known bots, such as Google, Yahoo, Bing, the Internet Archive, etc., and produce blank output. You would normally be penalised for serving alternate content to Google, but since you are blocking them anyway, it won't be a problem.
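A rough sketch of that user-agent check (the substrings below are just common bot identifiers; real crawlers can spoof or change their user-agent strings):

# Hypothetical helper: decide whether to serve blank output to a known crawler.
KNOWN_BOTS = ("googlebot", "bingbot", "slurp", "yandex", "ia_archiver")

def is_known_bot(user_agent):
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in KNOWN_BOTS)

print(is_known_bot("Mozilla/5.0 (compatible; Googlebot/2.1)"))   # True
print(is_known_bot("Mozilla/5.0 (Windows NT 10.0) Firefox/90.0"))  # False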
The most important thing is that whatever you publish publicly can and will be accessed by bots such as search engine spiders.
Don't forget that bots have a nasty habit of being where you don't want them to be (which, mixed with bad coding practices, can be quite disastrous).
Foolproof? I think not. You can restrict IPs, use robots.txt and meta tags, but if a search engine really, really wants your content indexed, it will find a way.
I have been browsing around on the internet researching effective sitemap web pages. I have encountered these two sitemaps and am questioning their effectiveness.
http://www.webanswers.com/sitemap/
http://www.answerbag.com/sitemap/
Are these sitemaps effective?
Jeff Atwood (one of the guys who made this site) wrote a great article on the importance of sitemaps.
I'm a little aggravated that we have to set up this special file for the Googlebot to do its job properly; it seems to me that web crawlers should be able to spider down our simple paging URL scheme without me giving them an explicit assist.
The good news is that since we set up our sitemaps.xml, every question on Stack Overflow is eminently findable. But when 50% of your traffic comes from one source, perhaps it's best not to ask these kinds of questions.
So yeah, effective for people, or effective for Google?
I would have thought an HTML sitemap should be useful to a human, whereas these two sites' sitemaps aren't. If you're trying to target a search engine, then a sitemap.xml file that conforms to sitemaps.org would be a better approach. While the HTML approach would work, it's easier to generate an XML file and have your robots.txt file point at it.
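For reference, a minimal sitemaps.org-style file looks something like this (the domain and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want crawlers to discover -->
  <url>
    <loc>http://www.example.com/category/sub-category/product.html</loc>
    <lastmod>2011-01-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
</urlset>

You can then point crawlers at it from robots.txt with a single line:

Sitemap: http://www.example.com/sitemap.xml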