How to track all website activity and filter out web robot data

I'm doing very rudimentary tracking of page views by logging URLs, referral codes, sessions, times etc., but I'm finding it's getting bombarded with robots (Google, Yahoo etc.). What is an effective way to filter out these hits, or to avoid logging them at all?
I've experimented with robot IP lists etc but this isn't foolproof.
Is there some kind of robots.txt, htaccess, PHP server-side code, javascript or other method(s) that can "trick" robots or ignore non-human interaction?

Just to add: a technique you can employ within your interface is to use JavaScript to encapsulate the actions that lead to certain user-interaction view/counter increments. For a very rudimentary example, a robot will not (cannot) follow a link like this:
<a href="#" onclick="viewItem(4); return false;">Chicken Farms</a>

function viewItem(id)
{
    // the 'http://' prefix is needed; without it the browser treats the URL as relative
    window.location.href = 'http://www.example.com/items?id=' + id + '&from=userclick';
}
To make those clicks easier to track, they might yield a request such as
www.example.com/items?id=4&from=userclick
That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks, and of course it really depends on what you're trying to achieve.
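As a minimal server-side sketch of counting those userclick hits (PHP; the items table, clicks column and $pdo connection are hypothetical names, not from the original):
<?php
// Count only hits that arrived via the JavaScript click handler.
if (isset($_GET['id'], $_GET['from']) && $_GET['from'] === 'userclick') {
    $stmt = $pdo->prepare('UPDATE items SET clicks = clicks + 1 WHERE id = ?');
    $stmt->execute(array((int) $_GET['id']));
}
// ... render the item page as usual ...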

It depends on what you want to achieve.
If you want search bots to stop visiting certain paths/pages you can include them in robots.txt. The majority of well-behaving bots will stop hitting them.
If you want bots to index these paths but you don't want to see them in your reports then you need to implement some filtering logic. E.g. all major bots have a very clear user-agent string (e.g. Googlebot/2.1). You can use these strings to filter these hits out from your reporting.
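As a rough PHP sketch of that filtering (the signature list is illustrative only, and user-agent strings can be spoofed, so treat it as a best-effort filter):
<?php
// Skip logging for requests whose user-agent matches known crawler strings.
function isKnownBot($userAgent)
{
    $signatures = array('Googlebot', 'Slurp', 'bingbot', 'msnbot', 'Baiduspider', 'YandexBot');
    foreach ($signatures as $signature) {
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!isKnownBot($userAgent)) {
    // logPageView(...);  // your existing logging call goes here
}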

Well the robots will all use a specific user-agent, so you can just disregard those requests.
But also, if you just use a robots.txt and deny them from visiting, that will work too.
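For example, a minimal robots.txt that asks all compliant bots to stay out of a (hypothetical) /items/ path:
# Block all compliant crawlers from the /items/ section
User-agent: *
Disallow: /items/
Keep in mind this only stops well-behaved crawlers; it does nothing against bots that ignore robots.txt.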

Don't rediscover the wheel!
Any current statistics tool already filters out robot requests. You can install AWStats (open source) even if you have shared hosting. If you don't want to install software on your server, you can use Google Analytics by adding just a script at the end of your pages. Both solutions are very good. That way you only have to log your errors (500, 404 and 403 are enough).

Related

Ways to keep Google from indexing Sites/Content

I have a case on my hands where I must be super duper sure that Google (or Yahoo/Bing, for that matter) does not index specific content, so the more redundant, the better.
As far as I know there are 3 ways to accomplish that; I wonder if there are more (redundancy is key here):
set the robots meta tag to noindex
disallow the affected URL structure in robots.txt
post-load the content via AJAX
So if those are all the methods, good, but it would be just dandy if someone had some idea of how to be even more sure :D
(I know that's a little bit insane, but if the content shows up in Google somehow it will get really expensive for my company :'-( )
uh, there are a lot more
a) identify Googlebot (works similarly with other bots)
http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=80553
and don't show them the content (see the sketch after this list)
b) return these pages with an HTTP 404 / HTTP 410 header instead of HTTP 200
c) only show these pages to clients with cookies / sessions
d) render the whole content as an image (and then disallow the image)
e) render the whole content as an image data URL (then a disallow is not needed)
f) use pipes | in the URL structure (works in Google, don't know about the other search engines)
g) use dynamic URLs that only work for, say, 5 minutes
and these are just a few off the top of my mind ... there are probably more
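For option a), the page linked above describes verifying Googlebot with a reverse-DNS check rather than a fixed IP list. A hedged PHP sketch of that idea (the function name is my own):
<?php
function isVerifiedGooglebot($ip)
{
    // 1) reverse-resolve the client IP
    $host = gethostbyaddr($ip);
    // 2) the host must be under googlebot.com or google.com
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // 3) forward-resolve that host and confirm it maps back to the same IP
    return gethostbyname($host) === $ip;
}

if (!isVerifiedGooglebot($_SERVER['REMOTE_ADDR'])) {
    // serve the full page to humans; otherwise omit the sensitive content
}
A similar check works for the other major bots, using their published crawler domains.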
Well, I suppose you could require some sort of registration/authentication to see the content.
We're using the post-load content via ajax method at my work and it works pretty well. You just have to be sure that you're not returning anything if that same ajax route is hit without the xhr header. (We're using it in conjunction with authorization though.)
I just don't think there's any way to be completely sure without actually locking down the data behind some sort of authentication. And if it's going to be expensive for your company if it gets out there, then you might want to seriously consider it.
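A hedged PHP sketch of that check (the X-Requested-With header is set by most AJAX libraries such as jQuery, but it is trivially forgeable, so it only helps in combination with real authorization as mentioned above):
<?php
// Only return the protected fragment for requests carrying the AJAX header.
$isAjax = isset($_SERVER['HTTP_X_REQUESTED_WITH'])
    && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest';

if (!$isAjax) {
    header('HTTP/1.1 404 Not Found');
    exit;
}
// ... output the protected content ...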
What about blocking IPs from search engines and requests with search engine user-agents in .htaccess?
It might need more maintenance of the list of IPs and user-agents but it will work.
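A rough .htaccess sketch for the user-agent half (Apache mod_setenvif; the pattern list is illustrative and, as noted, needs ongoing maintenance):
# Deny requests whose User-Agent matches major crawlers (illustrative list only)
SetEnvIfNoCase User-Agent "(Googlebot|Slurp|bingbot|msnbot)" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot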

SEO Question, and about Server.Transfer (Asp.net)

So, we're trying to up our application in the rankings in the search engines, and one way our SEO guy told us to do that was to register similar domains...for example we have something like
http://www.myapplication.com/parks.html
so..we acquired the domain parks.com (again just an example).
Now when people go to http://www.parks.com ...we want it to display the content of http://www.myapplication.com/parks.html.
I could just put a forwarding page there, but from what I've been told that makes us look bad because it's technically a permanent redirect... and we're trying to get higher in the search engine rankings, not lower.
Is this a situation where we would use the Server.Transfer method of ASP.NET?
How are situations like this handled? I've definitely seen this done by many websites.
We also don't want to cheat the system; we are showing relevant content and not spam or tricking customers in any way, so the proper way to achieve what I'm looking for would be great.
Thanks
Use your "similar" domain names to host individual and targeted landing pages that point to your master content.
It's easier to manage and you will get a higher conversion rate.
Having to create individual pages will force you to write relevant content and will increase the popularity of each page.
I also suggest you build not only landing pages, but mini sites (of a few pages).
SEO is a very demanding task.
Regarding technical aspects: Server.Transfer is what you should use. Never use Response.Redirect; it issues a temporary (302) redirect, and Google and other search engines will drop your ranking.
I used permanent URL rewriting in the past. When I changed my website, since lots of traffic was coming from other websites linking to mine, I wanted a permanent solution.
Read more about URL rewriting : http://msdn.microsoft.com/en-us/library/ms972974.aspx

Figure out if a website has restricted/password protected area

I have a big list of websites and I need to know if they have areas that are password protected.
I am thinking about doing this: downloading all of them with httrack and then writing a script that looks for keywords like "Log In" and "401 Forbidden". But the problem is that these websites are all different, some static and some dynamic (HTML, CGI, PHP, Java applets...), and most of them won't use the same keywords...
Do you have any better ideas?
Thanks a lot!
Looking for password fields will get you so far, but won't help with sites that use HTTP authentication. Looking for 401s will help with HTTP authentication, but won't get you sites that don't use it, or ones that don't return 401. Looking for links like "log in" or "username" fields will get you some more.
I don't think that you'll be able to do this entirely automatically and be sure that you're actually detecting all the password-protected areas.
You'll probably want to take a library that is good at web automation, and write a little program yourself that reads the list of target sites from a file, checks each one, and writes out one file of "these are definitely passworded" and another of "these are not". Then you might want to go manually check the ones that are not, and make modifications to your program to accommodate what you find. Using httrack is great for grabbing data, but it's not going to help with detection -- if you write your own "check for password protected area" program with a general-purpose HLL, you can do more checks, and you can avoid generating more requests per site than would be necessary to determine that a password-protected area exists.
You may need to ignore robots.txt
I recommend using the Python port of Perl's Mechanize, or whatever nice web automation library your preferred language has. Almost all modern languages will have a nice library for opening and searching through web pages, and for looking at HTTP headers.
If you are not capable of writing this yourself, you're going to have a rather difficult time using httrack or wget or similar and then searching through responses.
Look for forms with password fields.
You may need to scrape the site to find the login page. Look for links with phrases like "log in", "login", "sign in", "signin", or scrape the whole site (needless to say, be careful here).
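A hedged PHP sketch of that check, using DOMDocument on a fetched page (the function name is my own; error handling kept minimal):
<?php
// Fetch a page and report whether it contains a password input field.
function hasPasswordField($url)
{
    $html = @file_get_contents($url);
    if ($html === false) {
        return false;
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from messy real-world markup
    $xpath = new DOMXPath($doc);
    return $xpath->query("//input[@type='password']")->length > 0;
}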
I would use httrack with several limits and then search the downloaded files for password fields.
Typically, a login form could be found within two links of the home page. Almost all ecommerce sites, web apps, etc. have login forms that are accessed just by clicking on one link on the home page, but another layer or even two of depth would almost guarantee that you didn't miss any.
I would also limit the speed at which httrack downloads, tell it not to download any non-HTML files, and prevent it from downloading external links. I'd also limit the number of simultaneous connections to the site to 2 or even 1. This should work for just about all of the sites you are looking at, and it should keep you off the hosts.deny list.
You could just use wget and do something like:
wget -A html,php,jsp,htm -S -r http://www.yoursite.com 2> output_yoursite.txt
This will cause wget to download the entire site recursively, but only files with the extensions listed in the -A option; in this case we try to avoid heavy files.
The server headers (which wget prints to stderr, hence the 2> redirect) will be written to output_yoursite.txt, which you can then parse for the status code 401, which means that part of the site requires authentication. You can also go through the downloaded files according to Konrad's recommendation.
Looking for 401 codes won't reliably catch them as sites might not produce links to anything you don't have privileges for. That is, until you are logged in, it won't show you anything you need to log in for. OTOH some sites (ones with all static content for example) manage to pop a login dialog box for some pages so looking for password input tags would also miss stuff.
My advice: find a spider program that you can get the source for, add in whatever tests (plural) you plan on using, and make it stop at the first positive result. Look for a spider that can be throttled way back, can ignore non-HTML files (maybe by making HEAD requests and looking at the MIME type), and can work with more than one site independently and simultaneously.
You might try using cURL and just attempting to connect to each site in turn (possibly put them in a text file and read each line, try to connect, repeat).
You can set up one of the callbacks to check the HTTP response code and do whatever you need from there.
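A hedged PHP/cURL sketch of that loop, reading URLs from a hypothetical sites.txt (one per line) and using curl_getinfo rather than a header callback for brevity:
<?php
// Request each site and record the HTTP status code; 401 suggests HTTP authentication.
foreach (file('sites.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo anything
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo $url . ' => ' . $code . ($code == 401 ? ' (requires authentication)' : '') . PHP_EOL;
}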

SEO and Session Parameters

If we develop a site to be SEO-compatible, is it possible to use session variables?
If not, what is the alternative?
Many thanks.
Best regards.
A search engine indexes pages on your site based on their URLs. If your URLs are not dependent on the unique Session ID assigned to every request, then a spider should not have a problem indexing your site.
That said, the content of your pages also matters. If the page content relies heavily on Session variables (or Viewstate params), you might have a problem getting that page indexed. The best way is to have unique and static URLs for each section of your site.
From Google's Webmaster Guide:
"Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page."
So I don't think that it is a good idea for content that you want indexed to require a session. It depends on what your requirements are as to possible alternatives/workarounds.
You should use cookies, because they are machine-dependent. Session identifiers in the URL are very unsafe (session stealing), because anyone you send the URL to ends up in your session.
I agree with Cerebrus. Just make sure that
You have unique and static URLs. If you don't have unique URLs you will lose the links to that page.
You have the same title for all states of a page
You target the same keywords for all states of a page
Session variables can be passed in a HTTP request as a cookie, a POST variable, or in the URL.
Search engines do not support cookies or POST variables, and they try to avoid pages with session variables in the URL.
You can use cookie or POST based session tracking for your users, but be aware that requests from search engines will always appear as the start of a new session.
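If the site happens to run on PHP, for example, keeping session IDs out of URLs (so crawlers only ever see clean, stable URLs) comes down to two standard session settings; a minimal sketch:
<?php
// Keep the session ID in a cookie only and never rewrite it into URLs;
// crawlers, which don't send cookies, simply start a fresh session each visit.
ini_set('session.use_only_cookies', '1');
ini_set('session.use_trans_sid', '0');
session_start();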
If the URL does not depend on the unique session ID assigned to every request, then the web spider should not have a problem indexing your site on Google.

Possible to prevent search engine spiders from infinitely crawling paging links on search results?

Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders via allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag which prevents spiders from going beyond the first page.
<meta name="robots" content="index,nofollow">
I am concerned that if we remove the 'nofollow', the impact to our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:
1) Is there a way to remove the 'nofollow' from the meta tag, but prevent spiders from following only certain links on the page? I have read mixed opinions on rel="nofollow", is this a viable option?
<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>
2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages, then stopped.
3) Our search results pages have the standard next/previous links, which would in theory cause spiders to hit pages recursively to infinity, what is the effect of this on SEO?
I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.
Note our Search results pages and paging links are not bot-friendly, in that they are not re-written and have a ?name=value query string, but from what I've seen spiders no longer just abort when they see the '?' as the results pages ARE getting indexed with decent page rank.
To be honest, you are looking at nofollow wrong. Chances are the search spiders, especially Google, Yahoo, and MSN, are already crawling the nofollow pages, because they still have to hit those pages to see if they have a noindex.
The real problem is that nofollow doesn't actually mean "don't follow"; it just means "don't pass on my reputation to this link". So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and robot commands on links will not affect performance, because they are already hitting your site. To confirm this, just look at your HTTP server log.
So my vote is that you will not see any problem with removing the robot limits.
I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.
I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.
There is an option in webmaster tools to set the crawl rate for your site.
Google bots are pretty intelligent about not traversing an entire database of dynamically-generated pages, as long as the URLs give some hint that they are dynamic (e.g. a file extension of .asp or .jsp, and numeric ids as query parameters). If you use rewrite rules to make your URLs "friendly", then the bots have a harder time determining whether they are reading a static page or a dynamically generated one. See this Google article for more information about dynamic vs. static URLs.
You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.