Should I get rid of bots visiting my site? - seo

I have been noticing on my trackers that bots are visiting my site a lot. Should I change or edit my robots.txt, or change something else? I'm not sure if that's good, because they are indexing my content, or what?

Should I change or edit my robots.txt, or change something else?
Depends on the bot. Some bots will dutifully ignore robots.txt.
We had a similar problem 18 months ago with the Google ad bot, because our customer was purchasing so many ads.
Google ad bots will (as documented) ignore wildcard (*) exclusions, but they do listen to explicit exclusions.
Remember, bots that honor robots.txt will just not crawl your site. This is undesirable if you want them to get access to your data for indexing.
A better solution is to throttle or supply static content to the bots.
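To illustrate the "supply static content" idea, here is a minimal mod_rewrite sketch; the User-Agent pattern, the URL path and the cache directory are placeholders invented for this example:
RewriteEngine On
# If the client identifies itself as a bot, serve a pre-rendered static copy
# instead of hitting the expensive dynamic backend
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider) [NC]
RewriteRule ^product/(.*)$ /static-cache/product/$1.html [L]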
I'm not sure if that's good, because they are indexing my content, or what?
They could be indexing/scraping/stealing. All the same really. What I think you want is to throttle their HTTP request processing based on UserAgents. How to do this depends on your web server and app container.
As suggested in other answers, if the bot is malicious, then you'll need to find its UserAgent pattern and send it 403 Forbidden responses. Or, if the malicious bots dynamically change their user agent strings, you have two further options (see the sketch after this list):
White-list UserAgents - e.g. create a user agent filter that only accepts certain user agents. This is very imperfect.
IP banning - block the source IP of the requests (if you're behind a proxy, it will be in an HTTP header such as X-Forwarded-For). Or, if you're getting DoS'd (denial-of-service attack), then you have bigger problems.
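As a rough illustration of both options, here is a hedged .htaccess sketch using the older Apache 2.2 access-control syntax; the User-Agent pattern and the IP range are placeholders, not recommendations:
# Only let through requests whose User-Agent looks like a mainstream browser,
# and additionally ban one abusive address range
SetEnvIfNoCase User-Agent "(Mozilla|Opera)" browser_like=yes
Order Allow,Deny
Allow from env=browser_like
Deny from 203.0.113.0/24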

I really don't think changing robots.txt is going to help, because only GOOD bots abide by it. All others ignore it and parse your content as they please. Personally I use http://www.codeplex.com/urlrewriter to get rid of the undesirable robots by responding with a forbidden message if they are found.

The spam bots don't care about robots.txt. You can block them with something like ModSecurity (which is a pretty cool Apache module in its own right), or you could just ignore them.
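For example, a single ModSecurity rule can deny a known bad User-Agent with a 403; the rule id and the "somespambot" string below are placeholders for whatever shows up in your own logs:
SecRule REQUEST_HEADERS:User-Agent "@contains somespambot" \
    "id:100020,phase:1,t:lowercase,deny,status:403,msg:'Known spam bot blocked'"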

You might have to use .htaccess to deny some bots that would otherwise screw with your logs.
See here : http://spamhuntress.com/2006/02/13/another-hungry-java-bot/
I had lots of Java bots crawling my site, adding
SetEnvIfNoCase User-Agent ^Java/1. javabot=yes
SetEnvIfNoCase User-Agent ^Java1. javabot=yes
Deny from env=javabot
made them stop. Now they only get a 403 once and that's it :)

I once worked for a customer who had a number of "price comparison" bots hitting the site all of the time. The problem was that our backend resources were scarce and cost money per transaction.
We tried to fight some of these off for a while, but the bots just kept changing their recognizable characteristics. We ended up with the following strategy:
For each session on the server we determined if the user was at any point clicking too fast. After a given number of repeats, we'd set the "isRobot" flag to true and simply throttle down the response speed within that session by adding sleeps. We did not tell the user in any way, since he'd just start a new session in that case.
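The original poster did this in application code, but a roughly analogous slow-down can be sketched at the web-server layer with ModSecurity's persistent IP collections (this goes in the server config and assumes SecDataDir is set; the rule ids, the 60-requests threshold and the 3-second pause are invented for illustration):
# Count requests per client IP; the counter expires 60 seconds after the last hit
SecAction "id:100030,phase:1,pass,nolog,initcol:ip=%{REMOTE_ADDR},setvar:ip.requests=+1,expirevar:ip.requests=60"
# Clients that exceed the threshold are silently delayed rather than blocked
SecRule IP:REQUESTS "@gt 60" "id:100031,phase:1,pass,nolog,pause:3000"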

Related

How can I find out who is hotlinking my content?

I have lots of videos on my website, and I am curious to know which websites are hotlinking to them.
I am using cpanel with awstats, I have google analytics too.
The server is running Apache.
Actually, you can check the Referer header if you want to block all requests coming from outside of your domain. Here is an example for an Apache server.
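A minimal .htaccess sketch of that Referer check; example.com and the file extensions are placeholders for your own domain and media types:
RewriteEngine On
# Let through requests with an empty Referer and requests coming from our own pages
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Everything else asking for media files gets a 403
RewriteRule \.(mp4|flv|jpe?g|png|gif)$ - [F,NC]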
But this technique has two disadvantages:
It is very easy to send a faked Referer header.
In rare cases, some browsers may not send a Referer header at all.
The most common way to prevent content from being cross-linked is to generate dynamic temporary links with a limited session lifetime.

Mod Security 403 persistence requires cookie deletion

I have been searching for quite some time and finally decided to post this question on how Mod Security locks out a user from a domain.
I have a large site with a lot of legacy URLs with '$' and '%' in them. These characters were removed, but there are legacy links all over that will trip some mod_security rules.
The main issue is that once a rule is triggered, a 403 error is returned as expected on that page, but going to any other page on the domain will then throw a 403 error as well, until the cookies are cleared in the browser. This of course is not user friendly, as many people will not know about the clear-cookies fix, and if they are locked out I obviously can't let them know easily, while I also don't want to remove all the rules that cause this.
An example of such a URL:
[code]
Request: GET /phpBB2/promotions/9927-1st-deposit-bonus-125%25move-up-sun-palace-casin.html
Action Description: Access denied with code 403 (phase 2).
Justification: Invalid URL Encoding: Non-hexadecimal digits used at REQUEST_URI.
[/code]
This triggers rule 950107: URL Encoding Abuse Attack Attempt.
Also, for many errors in the Mod Security log I see simply GET / as the trigger, and that obviously is not the root cause.
My first thought is that if this rule is firing incorrectly, because such a use case is (or was) valid, then you should probably disable this rule with:
SecRuleRemoveById 950107
If your page no longer exists, then without this rule the request will presumably return a standard 404 message, which is probably more correct for the user than a 403 and better UX. It will also mean you don't get ModSecurity alerts in your logs for these false positives. OK, it could mean you miss a genuine attack this rule is designed to block, so you need to weigh up the risk of that versus the downside of upsetting some of your users. Personally I don't think this rule is protecting you from a major security problem, so I would disable it.
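If you'd rather not lose the rule site-wide, ModSecurity also lets you scope the removal to just the legacy paths; a hedged sketch, using the forum path from the example above as the placeholder location:
# In the virtual host config (LocationMatch is not allowed in .htaccess):
<LocationMatch "^/phpBB2/promotions/">
    SecRuleRemoveById 950107
</LocationMatch>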
You also need to understand exactly what the problem is here. Rule 950107 checks the URL, not the cookies, so while I understand it firing for the initial request, it must be a different rule that is firing for the subsequent blocking errors? I'm not sure how cookies are being set incorrectly, and to such a value that they cause problems in future, so more details are needed.
It is possible to remove cookies with a combination of ModSecurity and Apache, using a method documented by Ivan Ristic in his ModSecurity Handbook, but it's a bit convoluted and involves a couple of steps:
1) Create a rule which detects when you want to remove the cookies and sets an environment variable. For example, to check for a URL containing the word "value_to_check", use a rule like this:
SecRule REQUEST_URI "value_to_check" "id:12345,phase:1,setenv:DISABLE_OUTBOUND_SESSION"
2) Get Apache to ask the browser to unset the SESSIONID cookie where that environment variable is set, using mod_headers:
Header always set Set-Cookie "SESSIONID=; expires=Fri, 31-Dec-1999 00:00:00 GMT" env=DISABLE_OUTBOUND_SESSION
Note that this doesn't fix any bad incoming cookies. There are similar fixes for that, depending on the exact problem.
However, I would suggest you understand the full problem first, as there may be better solutions.

Moving website from HTTP to fully HTTPS and SEO implications

Alright, you might think that this is one of the most asked questions on the internet, and you're tired of reading the exact same answers. So let's focus on one of the most common answers, and forget about the others.
One of the most common answers is:
"The https-site and the http-site are two completely different sites;
it’s a little bit like having a www version of the site and a non-www
version. Make sure you have 301 redirects from the http URLs to the
https ones." (source:
http://www.seomoz.org/ugc/seo-for-https-with-s-like-secure)
So here's my question:
Why are people saying that https and http are two different websites? How different is https://www.mydomain.com from http://www.mydomain.com?
The URI is the same and the content is the same. Only the protocol changes.
Why would the protocol have any impact on SEO? Whether or not the content is encrypted from point A to point B, why would that matter SEO wise?
Thanks for your help!
-H
Http and https could technically be two different sites. You could configure your server to serve completely different content. They have two different URLs (the difference being that s).
That being said, almost all webmasters with both http and https serve nearly identical content whether the site is secure or not. Google recognizes this and allows you to run both at the same time without having to fear duplicate content penalties.
If you are moving from one to the other, you should treat it similarly to other URL changes.
Put 301 redirects in place so that each page gets properly redirected to the same content at its new URL (see the sketch after this list)
Register both versions in Google Webmaster Tools
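A minimal .htaccess sketch of that 301 redirect, assuming mod_rewrite is available and that the server terminates SSL itself (so the HTTPS variable is set):
RewriteEngine On
# Permanently redirect any plain-HTTP request to the same URL over HTTPS
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]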
I have not personally done this switch, but it should be doable without problems. I have made other types of sitewide URL changes without problems in the last couple of years.
The other alternative would be to run both http and https at the same time and switch users over more gradually, for example as they log in.
Update to the above answer as of August 2014: Google has just confirmed that sites secured by SSL will start getting a ranking boost. Check the official statement here: http://googlewebmastercentral.blogspot.in/2014/08/https-as-ranking-signal.html
Don't think about it in terms of protocol. Think about it in terms of what is possible from a search engine's point of view.
http://example.com and http://www.example.com can be completely different sites.
http://example.com/ and http://www.example.com/home can be completely different pages.
https://www.example.com and http://www.example.com can, again, be completely different sites.
In addition to this, https pages can have a very hard time ranking in Google and other search engines.
If your entire site is https and presents an SSL certificate for every request, Google views those pages as secure and assumes they're https for a reason. It's sometimes not very clever in this regard. If you have secure product or category pages, for instance, they simply will not rank compared to competitors. I have seen this time and again.
In recent months, it is becoming very clear Google will gently force webmasters to move to HTTPS.
Why are people saying that https and http are two different websites? How different is https://www.mydomain.com from http://www.mydomain.com?
Answer: Use the site: operator to find duplicate content. Go to a browser and type:
site:http://example-domain.com
and
site:https://example-domain.com
If you see both versions indexed in Google or other search engines, they are duplicates. You must redirect the HTTP version to the HTTPS version to avoid diluting your website's authority and a possible penalty from Google's Panda algorithm.
Why would the protocol have any impact on SEO?
Answer:
For ecommerce websites, Google will not rank them well without being secure; they do not want users to get their bank info etc. stolen.
Google will be giving ranking boosts to sites that move to HTTPS in the future. Although it is not a large ranking signal now, it could become larger.
The guys at Google Chrome have submitted a proposal to dish out warnings to users for ALL websites not using HTTPS. Yes, I know it sounds crazy, but check this out.
Info taken from this guide on how to move to HTTPS without killing your rank.
Recently, if SSL is inactive, the Firefox browser shows an error. You must enable SSL and 301-redirect the URL to HTTPS.

Should robots.txt be obeyed for HEAD requests?

I have a list of URLs and wish to find whether they redirect to some other place or not, and if it does, what is their final location. This I am doing by sending HEAD requests to these URLs.
The list contains links to certain hosts which disallow my bot (any bot in general) in robots.txt.
My question is, in order to be polite:
should I follow robots.txt for HEAD requests too, and stop requesting these hosts?
if there is a crawl delay mentioned in robots.txt, should I obey it for these HEAD requests?
is there a web service that can do this job for me and return the final URLs for a batch of input URLs?
You should always abide by robots.txt, even for HEAD requests. If you don't, not only are you violating the politeness preferences of the website, but you're also risking having your IP permanently blocked. A simple HEAD request to a restricted and non-humanly-accessible directory/page on a website can put you on the operator's ban list.
should I follow robots.txt for HEAD requests too, and stop requesting these hosts?
You should follow robots.txt, or, if you're already banned, stop requesting those hosts.
if there is a crawl delay mentioned in robots.txt, should I obey it for these HEAD requests?
Yes.
is there a web service that can do this job for me and return the final URLs for a batch of input URLs?
I don't know of any, but perhaps you can adapt an existing crawler to do that. What programming language do you prefer?

Removing Hacked URL Strings From Google

I recently suffered a hack on a number of websites which were hosted on the same server. I've identified and removed the source of the hack, and used Patrick Altoft's smart Google Alerts idea to monitor for further attempts.
I've then logged into Google webmaster tools, asked to be re-evaluated post hack, and I've also re-submitted site maps to speed up a re-crawl.
However, I would like to remove the infected URLs from Google, and was thinking the best way to speed up this process would be to use .htaccess to return a 404 error whenever a page with a specific query string variable is requested.
Is this possible with a .htaccess file, or is there a better course of action to take?
You can see the damage done here.
Thanks for any help and suggestions.
A 404 will work, but is possibly not the best solution. A better solution would be 301 (Moved Permanently) or 410 (Gone).
A 404 tells you that a page is missing, but not why. Google may keep these URLs for a while and check later whether they exist again. By using 301 or 410, you explicitly tell Google that the URL is not going to be fixed.
410 is the better option, but I'm not sure if this is possible from .htaccess, although you could 301 to a PHP file that returns a 410 header.
Addition: here's an article about responding with the '410 Gone' header from .htaccess: http://diveintomark.org/archives/2003/03/27/http_error_410_gone
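For what it's worth, mod_rewrite's [G] (Gone) flag can return a 410 directly from .htaccess; a minimal sketch, where "injected_param" stands in for whatever query string variable the hack added:
RewriteEngine On
# Any request whose query string contains the injected variable gets a "410 Gone"
RewriteCond %{QUERY_STRING} injected_param [NC]
RewriteRule ^ - [G]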
Yep, give them a 404/410/301 status code and Google will remove them in a day or two. I've done that before. It would take way too long for Google to refresh its cache if the pages kept returning a 200 status code.