Stop abusive bots from crawling? - seo

Is this a good idea??
http://browsers.garykeith.com/stream.asp?RobotsTXT
What does abusive crawling mean? How is that bad for my site?

Not really. Most "bad bots" ignore the robots.txt file anyway.
Abusive crawling usually means scraping: these bots show up to harvest email addresses or, more commonly, content.
As to how you can stop them? That's really tricky and often not wise. Anti-crawl techniques have a tendency to be less than perfect and cause problems for regular humans.
Sadly, like "shrinkage" in retail, it's a cost of doing business on the web.

A user-agent (which includes crawlers) is under no obligation to honour your robots.txt. The best you can do is try to identify abusive access patterns (via web-logs, etc.), and block the corresponding IP.
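A minimal sketch of that log-based approach, assuming Apache/nginx-style access logs where the client IP is the first whitespace-delimited field; the threshold is a placeholder you would tune for your own traffic levels:

```python
import re
from collections import Counter

# Client IP is assumed to be the first field of each log line.
IP_FIELD = re.compile(r'^(\S+)\s')

def abusive_ips(log_lines, threshold=1000):
    """Return IPs whose request count meets or exceeds the threshold."""
    hits = Counter()
    for line in log_lines:
        m = IP_FIELD.match(line)
        if m:
            hits[m.group(1)] += 1
    return {ip: n for ip, n in hits.items() if n >= threshold}

sample = ['203.0.113.5 - - [01/Jan/2024] "GET / HTTP/1.1" 200 123'] * 3
print(abusive_ips(sample, threshold=2))  # {'203.0.113.5': 3}
```

The IPs this surfaces can then be blocked at the firewall or web-server level; just keep the threshold high enough that a busy human behind a shared NAT doesn't trip it.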

Related

Are domains with gTLD's (.ninja, .guru, .museum, etc..) bad for SEO?

Would using one of the newer gTLD's have any adverse effect on SEO?
Good or bad are relative terms. Will the new gTLDs prevent your website from being indexed or crawled by search engines? Of course not.
Would it help or hurt you competitively? Depends what you are competing for.
If you are doing geography-specific services, then yes, they are less effective than ccTLDs.
But search engines are private companies and the owners of TLDs are private entities, so the value ultimately comes from the market.
.io domains are good for tech because it's convention.
.biz is considered less than .com because .com is considered the prime TLD target for a company name.
Without years of testing, observation and analysis you cannot possibly speculate on whether or not they will provide any extra value than any other TLD.
I don't think they will have an adverse effect on SEO, but don't expect them to have an advantage either. Big G mostly cares about content quality, speed and that kind of stuff. Sometimes ccTLDs may have an advantage over gTLDs when a local search is performed, but these are not ccTLDs, these are gTLDs.
Check this out: https://plus.google.com/+MattCutts/posts/4VaWg4TMM5F
You can also check out my post on new gTLDs: http://big.info/2015/02/disadvantages-new-gtlds.html
Cheers

How to restrict search engines from indexing my mediawiki site?

Is there a foolproof way to restrict your content from being indexed by major search engines?
Thanks
Prady
One possible way is the robots.txt file:
User-Agent: *
Disallow: /
Here is a blog post discussing other techniques, including meta tags.
Most search engines follow robots.txt. I've heard Yahoo Slurp! does not.
You could scan the user agent for well-known bots, such as Google, Yahoo, Bing, the Internet Archive, etc., and produce blank output. Serving alternate content to Google can normally get you penalised for cloaking, but since you are blocking them anyway, it won't be a problem.
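A sketch of that user-agent check. The substrings below are commonly seen crawler tokens, but they are assumptions here; verify them against each engine's current documentation, since UA strings change:

```python
# Substring match against well-known crawler user-agent tokens.
KNOWN_BOTS = ("googlebot", "slurp", "bingbot", "archive.org_bot")

def is_known_bot(user_agent):
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOTS)

def respond(user_agent, real_page):
    # Serve an empty body to recognised bots, the real page to everyone else.
    return "" if is_known_bot(user_agent) else real_page

print(respond("Mozilla/5.0 (compatible; Googlebot/2.1)", "<p>hi</p>"))  # ""
print(respond("Mozilla/5.0 (Windows NT 10.0)", "<p>hi</p>"))  # <p>hi</p>
```

Bear in mind user agents are trivially spoofed, so this only keeps out the honest bots.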
The most important thing is that whatever you publish publicly can and will be accessed by bots such as search engine spiders.
Don't forget bots have a nasty habit of being where you don't want them to be (which, mixed with bad coding practices, can be quite disastrous).
Foolproof? I think not. You can restrict IPs, use robots.txt, meta tags, but if a search engine really, really wants your content indexed, it will find a way.

Negative Captchas - help me understand spam bots better

I have to decide on a technique to prevent spam bots from registering on my site. In this question I am mainly asking about negative captchas.
I have come to know about many weaknesses of bots but want to know more. I read somewhere that the majority of bots do not render/support JavaScript. Why is that? How do I test that the visiting program can't evaluate JavaScript?
I started with this question Need suggestions/ideas for easy-to-use but secure captchas
Please answer to that question if you have some good captcha ideas.
Then I got ideas about negative captchas here
http://damienkatz.net/2007/01/negative_captch.html
But Damien has written that though this technique likely won't work on big community sites (for long), it will work just fine for most smaller sites.
So, what are the chances of somebody making site-specific bots? I assume my site will be a very popular one. How safe will this technique be, considering that?
Negative captchas using complex honeypot implementations are described here
http://nedbatchelder.com/text/stopbots.html
Does anybody know how easily can it be implemented? Are there some plugins available?
Thanks,
Sandeepan
I read somewhere that majority of bots do not render/support javascript. Why is it so?
Simplicity of implementation: you can read web page source and post forms with just a dozen lines of code in high-level languages. I've seen bots that are ridiculously bad, e.g. parsing HTML with regular expressions and getting ../ in URLs wrong. But apparently it works well enough.
However, running a JavaScript engine and implementing a DOM library is a much more complex task. You have to deal with scripts that do while(1);, that depend on timers, external resources, CSS, that sniff browsers and do lots of crazy stuff. The amount of work quickly starts to look like writing a full browser engine.
It's also computationally much more expensive, so it's probably not as profitable for spammers: they can have a dumb bot that silently spams 100 pages/second, or a fully-featured one that spams 2 pages/second and hogs the victim's computer like a typical web browser would.
There's a middle ground in implementing just a simple site-specific hack, like filling in a certain form field when a known script pattern is noticed in the page.
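To illustrate just how little code a "dumb" bot needs: no DOM, no JavaScript engine, just a couple of regexes over the raw HTML to find a form's target and its input names (the HTML below is a made-up example page, and real-world regex HTML parsing is exactly as fragile as described above):

```python
import re

# A tiny made-up page containing a comment form.
html = '''<form action="/comment" method="post">
  <input name="author"><input name="body">
</form>'''

# Extract the form target and the field names with plain regexes.
action = re.search(r'<form[^>]*action="([^"]+)"', html).group(1)
fields = re.findall(r'<input[^>]*name="([^"]+)"', html)
print(action, fields)  # /comment ['author', 'body']
```

From there a bot just POSTs junk values to the extracted action URL, which is why any field it can "see" in the raw source, it will fill.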
So, what are the chances of somebody making site-specific bots? I assume my site will be a very popular one. How much safe this technique will be considering that?
It's a cost/benefit trade-off. If you have high PageRank, lots of visitors, something of monetary value, or something useful for spamming, then some spammer might notice you and decide a workaround is worth his time. OTOH if you just have a personal blog or small forum, there are a million other unprotected sites waiting to be spammed.
How do I test that the visiting program can't evaluate javascript?
Create a hidden field with some fixed value, then write a bit of JavaScript that increments or changes it; check whether the changed value comes back when the form is submitted.
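A sketch of the server-side half of that check. The field name "js_check" and the values are illustrative: the form is served with js_check="0", and a small inline script flips it to "1" before submission, so a client that never runs JavaScript posts back the original "0":

```python
INITIAL, EXPECTED = "0", "1"  # value as served vs. value after the JS runs

def looks_like_bot(form_data):
    """True if the hidden field still holds its server-rendered value."""
    return form_data.get("js_check", INITIAL) != EXPECTED

print(looks_like_bot({"js_check": "1"}))  # False: the script ran
print(looks_like_bot({"js_check": "0"}))  # True: JavaScript never executed
```

In practice you would also vary the expected value per session, otherwise a site-specific bot can simply hard-code the "1".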

What are the common sense SEO practices that aren't dodgy or crap? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
In SEO there are a few techniques that have been flagged as needing to be avoided at all costs. These are all techniques that used to be perfectly acceptable but are now taboo:
1. Spammy guest blogging: blowing up a page with guest comments is no longer a benefit.
2. Optimized anchors: these have become counterproductive; use safe anchors instead.
3. Low-quality links: sites are often flooded with hyperlinks that take you to low-quality Q&A sites; don't do this.
4. Keyword-heavy content: try to avoid too many keywords; use longer, well-written sections more liberally.
5. Link-back overuse: backlinks can be a great way to direct readers to your site, but over-saturation will make people feel trapped.
Content, Content, CONTENT! Create worthwhile content that other people will want to link to from their sites.
Google has the best tools for webmasters, but remember that they aren't the only search engine around. You should also look into Bing and Yahoo!'s webmaster tool offerings (here are the tools for Bing; here for Yahoo). Both of them also accept sitemap.xml files, so if you're going to make one for Google, then you may as well submit it elsewhere as well.
Google Analytics is very useful for helping you tweak this sort of thing. It makes it easy to see the effect that your changes are having.
Google and Bing both have very useful SEO blogs. Here is Google's. Here is Bing's. Read through them--they have a lot of useful information.
Meta keywords and meta descriptions may or may not be useful these days. I don't see the harm in including them if they are applicable.
If your page might be reached by more than one URL (e.g., www.mysite.com/default.aspx versus mysite.com/default.aspx versus www.mysite.com/), then be aware that that sort of thing sometimes confuses search engines, and they may penalize you for what they perceive as duplicated content. Use the link rel="canonical" element to help avoid this problem.
Adjust your site's layout so that the main content comes as early as possible in the HTML source.
Understand and utilize your robots.txt and meta robots tags.
When you register your domain name, go ahead and claim it for as long of a period of time as you can. If your domain name registration is set to expire ten years from now rather than one year from now, search engines will take you more seriously.
As you probably know already, having other reputable sites that link to your site is a good thing (as long as those links are legitimate).
I'm sure there are many more tips as well. Good luck!
In addition to having quality content, content should be added/updated regularly. I believe that Google (and likely others) has some bias toward the general "freshness" of content on your site.
Also, try to make sure that the content that the crawler sees is as close as possible to what the user will see (this can be tricky for localized pages). If you're careless, your site may be blacklisted for "bait-and-switch" tactics.
Don't implement important text-based sections in Flash - Google will probably not see them and if it does, it'll screw it up.
Google can Index Flash. I don't know how well but it can. :)
A well organized, easy to navigate, hierarchical site.
There are many SEO practices that all work and that people should take into consideration. But fundamentally, I think it's important to remember that Google doesn't necessarily want people to be using SEO. More and more, Google is striving to create a search engine that is capable of ranking websites based on how good the content is, and solely on that. It wants to be able to see what good content is in ways in which we can't trick it. Think about it: at the very beginning of search engines, a site which had the same keyword repeated 200 times on the same webpage was sure to rank for that keyword, just like a site with any number of backlinks, regardless of the quality or PR of the sites they came from, was assured Google popularity. We're past that now, but SEO is still, in a certain way, tricking a search engine into believing that your site has good content, because you buy backlinks, or comments, or such things.
I'm not saying that SEO is a bad practice, far from it. But Google is taking more and more measures to make its search results independent of the regular SEO practices we use today. That is why I can't stress this enough: write good content. Content, content, content. Make it unique, make it new, add it as often as you can. A lot of it. That's what matters. Google will always rank a site if it sees that there is a lot of new content, and even more so if it sees content coming onto the site in other ways, especially through commenting.
Common sense is uncommon. Things that appear obvious to me or you wouldn't be so obvious to someone else.
SEO is the process of effectively creating and promoting valuable content or tools, ensuring either is totally accessible to people and robots (search engine robots).
The SEO process includes and is far from being limited to such uncommon sense principles as:
Improving page load time (through minification, including a trailing slash in URLs, eliminating unnecessary code or db calls, etc.)
Canonicalization and redirection of broken links (organizing information and ensuring people/robots find what they're looking for)
Coherent, semantic use of language (from inclusion and emphasis of targeted keywords where they semantically make sense [and earn a rankings boost from SE's] all the way through semantic permalink architecture)
Mining search data to determine what people are going to be searching for before they do, and preparing awesome tools/content to serve their needs
SEO matters when you want your content to be found/accessed by people -- especially for topics/industries where many players compete for attention.
SEO does not matter if you do not want your content to be found/accessed, and there are times when SEO is inappropriate. Motives for not wanting your content found -- the only instances when SEO doesn't matter -- might vary, and include:
Privacy
When you want to hide content from the general public for some reason, you have no incentive to optimize a site for search engines.
Exclusivity
If you're offering something you don't want the general public to have, you need not necessarily optimize that.
Security
For example, say, you're an SEO looking to improve your domain's page load time, so you serve static content through a cookieless domain. Although the cookieless domain is used to improve the SEO of another domain, the cookieless domain need not be optimized itself for search engines.
Testing In Isolation
Let's say you want to measure how many people link to a site within a year which is completely promoted with AdWords, and through no other medium.
When One's Business Doesn't Rely On The Web For Traffic, Nor Would They Want To
Many local businesses, or businesses which rely on point-of-sale or earn their traffic through some mechanism other than digital marketing, may not want to even consider optimizing their site for search engines, because they've already optimized it for some other system, perhaps people walking down a street after emptying out of bars or an amusement park.
When Competing Differently In A Saturated Market
Let's say you want to market entirely through social media, or internet cred & reputation here on SE. In such instances, you don't have to worry much about SEO.
Build for real users, not for robots, and you will reach success!
Thanks!

robots.txt: disallow all but a select few, why not? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I've been thinking a while about disallowing every crawler except Ask, Google, Microsoft, and Yahoo! from my site.
The reasoning behind this is that I've never seen any traffic being generated by any of the other web-crawlers out there.
My questions are:
Is there any reason not to?
Has anybody done this?
Did you notice any negative effects?
Update:
Up till now I used the blacklist approach: if I do not like the crawler, I add them to the disallow list.
I'm no fan of blacklisting however as this is a never ending story: there are always more crawlers out there.
I'm not so much worried about the really ugly misbehaving crawlers; they are detected and blocked automatically (and they typically do not ask for robots.txt anyhow :)
However, many crawlers are not really misbehaving in any way, they just do not seem to generate any value for me / my customers.
There are, for example, a couple of crawlers that power websites which claim they will be The Next Google; Only Better. I've never seen any traffic coming from them and I'm quite sceptical about them becoming better than any of the four search engines mentioned above.
Update 2:
I've been analysing the traffic to several sites for some time now, and it seems that reasonably small sites get about 100 unique human visitors a day (i.e. visitors that I cannot identify as non-human). About 52% of the generated traffic is from automated processes.
60% of all automated visitors do not read robots.txt; 40% (21% of total traffic) do request robots.txt (this includes Ask, Google, Microsoft, and Yahoo!).
So my thinking is: if I block all the well-behaved crawlers that do not seem to generate any value for me, I could reduce bandwidth use and server load by around 12% - 17%.
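The arithmetic behind that estimate can be sketched out. The 52% and 40% figures come from the measurements above; the big four's own share of total traffic is not given, so the 4%-9% range below is an assumption chosen to reproduce the stated 12%-17% conclusion:

```python
automated = 0.52        # share of total traffic that is automated
reads_robots = 0.40     # share of automated visitors that fetch robots.txt
blockable = automated * reads_robots
print(round(blockable, 3))  # 0.208, i.e. ~21% of total traffic obeys robots.txt

# Subtracting an assumed 4%-9% share for Ask/Google/Microsoft/Yahoo!:
for big_four_share in (0.04, 0.09):
    print(round(blockable - big_four_share, 2))  # 0.17, then 0.12
```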
The internet is a publishing mechanism. If you want to whitelist your site, you're against the grain, but that's fine.
Do you want to whitelist your site?
Bear in mind that badly behaved bots which ignore robots.txt aren't affected anyway (obviously), and well behaved bots are probably there for a good reason, it's just that that's opaque to you.
Whilst other sites that crawl your site might not be sending any traffic your way, it's possible that they themselves are being indexed by Google et al., and so adding to your page rank; blocking them from your site might affect this.
Is there any reason not to?
Do you want to be left out of something that could be including your site without your knowledge and indirectly bringing a lot of traffic your way?
If some strange crawlers are hammering your site and eating your bandwidth you may want to, but it is quite possible that such crawlers wouldn’t honour your robots.txt either.
Examine your log files and see what crawlers you have and what proportion of your bandwidth they are eating. There may be more direct ways to block traffic which is bombarding your site.
This is currently a bit awkward, as the original robots.txt standard has no "Allow" field (though the major engines do support one as an extension). The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory.
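In fact a whitelist can be expressed without any "Allow" field at all: crawlers pick the robots.txt record matching their own User-agent, and an empty Disallow in that record permits everything. A sketch (the bot names here are the common tokens; check each engine's documentation for the exact name it matches on):

```
# Welcome crawlers each get their own record with nothing disallowed:
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

# Everyone else is blocked:
User-agent: *
Disallow: /
```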
My only worry is that you may miss the next big thing.
There was a long period when AltaVista was the dominant search engine, possibly even more dominant than Google is now (there was no Bing or Ask, and Yahoo! was a directory rather than a search engine as such). Sites that blocked all but AltaVista back then would never have seen traffic from Google, and therefore never known how popular it was getting, unless they heard about it from another source, which might have put them at a considerable disadvantage for a while.
PageRank tends to be biased towards older sites. You don't want to appear newer than you are because you were blocking access via robots.txt for no reason. These guys: http://www.dotnetdotcom.org/ may be completely useless now, but maybe in five years' time the fact that you weren't in their index will count against you in the next big search engine.