How does Google track search result clicks? Is this the best way? - tracking

As the question states, I'm trying to figure out how google tracks clicks on search results. When you view the source, you find the following:
<em>Yahoo</em>!
The function rwt is, which is pretty messy:
windows.rwt=function(b,d,e,g,h,f,i,j){
var a=encodeURIComponent||escape,c=b.href.split("#");
b.href=["/url?sa=t\x26source\x3dweb",d?"&oi="+a(d):"",e?"&cad="+a(e):"","&ct=",a(g),"&cd=",a(h),"&url=",a(c[0]).replace(/\+/g,"%2B"),"&ei=7_C2SbqXBMW0-AbU4OWnCw",f?"&usg="+f:"",i,c[1]?"#"+c[1]:""].join("");
b.onmousedown="";
return true};
So it looks like Google is changing the href of the a tag to /url?... which I'm assuming is where their tracking is. From LiveHeaders in Firefox, it looks like this page is redirecting the browser to the original href of the a tag.
Is this correct and is this the best method of tracking clicks on links on your site, such as ads?

It's actually changing the href of the link rather than the window location. It's setting b.href, and b refers to the link itself. This runs in onmousedown, so when you release the mouse and the click is handled you magically get sent to that new href.
Any click tracking pretty much comes down to sending the user to some equivalent of Google's /url?... script, counting the click, and performing a 302 redirect to the real destination.
This javascript href replacement has the advantage of automatically filtering out any robots that don't run scripts. The downside is that it also filters out any real people that have javascript disabled. If, like Google, you just care which link is most popular with your real human users, this works out quite well. The clicks that you do record should be representative of real human traffic, and you can safely ignore the clicks from non-javascript users because they probably have the same preferences anyway.
Most adverts just link straight to the counting URL with no javascript replacement. This means that you definitely count every real click on the link, but you need to worry about filtering out requests from robots, since they'll now see your counting URL too.
Which you prefer really depends on why you want to track the clicks.

I think most people expect ads to click through via some sort of tracking system, so I shouldn't worry too much about following this particular javascript implementation - as much as anything that's probably there to ensure that the user sees the correct link in the browsers status bar, that various other interesting bits of info (search terms, position on the result set at the time, who you are, etc) are sent across (without you realising it) and that the links still work if JavaScript is disabled.
Generally, yes directing the user through some tracking page with the ID of the ad they have clicked on, and possibly some additional indication of where they have come from is sensible - that way you aren't relying on other mechanisms (such as JS event handlers) to track clicks on the links, it's certainly the way most ad systems I've used work.

Related

Is rel=self the correct rel tag to use for forum permalinks?

I have been building a forum from scratch with my friends just for fun, and we're starting to see bots and scrapers go by. The problem we're having is that you can load a page /post/1 with four replies, and each reply includes a little permalink to itself /reply/1#reply-1. If I am on /post/1 and navigate to /reply/1, I'll end up right back where I started, just with the anchor to the reply. But! Scrapers have no idea this is the case, so they're opening every /post link and then following every /reply link, and it's causing performance issues, so I've been looking around SEO sites to try to fix it.
I've started using rel=canonical on the /reply page, to tell the bots they're all the same, but as far as I can tell that doesn't help me until the bot has already loaded the page, and thus I wind up with tons of traffic. Would it be correct to change my
Permalink
tags to
Permalink
since they should be the same content? Or would this be misusing rel="self" and there's another, better rel tag I should be using instead?
The self link type is not defined for HTML (but for Atom), so it can’t be used in HTML5 documents.
The canonical link type is appropriate for your case (if you make sure that it always points to the correct page, in case the thread is paginated), but it doesn’t prevent bots from crawling the URLs.
If you want to prevent crawling, no link type will help (not even the nofollow link type, but it’s not appropriate for your case anyway). You’d have to use robots.txt, e.g.:
User-agent: *
Disallow: /reply/
That said, you might want to consider changing the permalink design. I think it’s not useful (neither for your users nor for bots) to have such an architecture. It’s a good practice to have exactly one URL per document, and if users want to link to a certain post, there is no reason to require a new page load if it’s actually the same document.
So I would either use the "canonical" URL and add a fragment component (/post/1#reply-1, or what might make more sense: /threads/1#post-1), or (if you think it can be useful for your users) I would create a page that only contains the reply (with a link back to the full thread).

Dynamic url shortening script for text input

We are looking for script, which automatically detects url, as you type and shorten it, in text input window, before press "submit". The shortening service used is http://yourls.org/
Have you tried implementing one yourself? Deploy the shortener to your own web site (it's written in PHP, as far as I can see from a cursory glance at the web site) and provide a simple Ajax endpoint which will dynamically perform a shortening conversion, then implement calls to that from the main page using JavaScript.
You might want to impose a reasonable delay to allow the user to finish typing, to avoid performing lots of unnecessary conversions of bogus URLs (which may require, e.g. writes to a file or database - I haven't looked at how the library referenced does things).
I'm not sure what you're trying to achieve; if you create new shortened URLs for each substring before the user has finished typing the full URL, you will just proliferate your database.
I don't see how shortening a URL before it's finished makes sense.
If you want to relieve the user from the arduous task of clicking the submit button, then initiate the submit using javascript (jQuery, or something). I'm not sure if that's what you want to do.
http://monkeytooth.net/2010/12/htaccess-php-how-to-wordpress-slugs/
simple means of implementing the concept its a lot more easier than one would think. Querying a DB or some other means of matching the slug/id with the that of which is found in the URL wouldn't be all to hard either. The linked article doesn't really go in depth as what to do next but catching and breaking the URL apart is the essential process of making it work. I have person used the method myself on several sites and it works like a charm for me and the sites it was used on.

Stop spam without captcha

I want to stop spammers from using my site. But I find CAPTCHA very annoying. I am not just talking about the "type the text" type, but anything that requires the user to waste his time to prove himself human.
What can I do here?
Requiring Javascript to post data blocks a fair amount of spam bots while not interfering with most users.
You can also use an nifty trick:
<input type="text" id="not_human" name="name" />
<input type="text" name="actual_name" />
<style>
#not_human { display: none }
</style>
Most bots will populate the first field, so you can block them.
I combine a few methods that seem quite successful so far:
Provide an input field with the name email and hide it with CSS
display: none. When the form is submitted check if this field is
empty. Bots tend to fill this with a bogus emailaddress.
Provide another hidden input field which contains the time the page
is loaded. Check if the time between loading and submitting the page
is larger the minimum time it takes to fill in the form. I use
between 5 and 10 seconds.
Then check if the number of GET parameters are as you would expect.
If your forms action is POST and the underlying URL of your
submission page is index.php?p=guestbook&sub=submit, then you
expect 2 GET parameters. Bots try to add GET parameters so this
check would fail.
And finally, check if the HTTP_USER_AGENT is set, which bots sometimes don't set,
and that the HTTP_REFERER is the URL of the page of your form. Bots
sometimes just POST to the submission page causing the HTTP_REFERER
to be something else.
I got most of my information from http://www.braemoor.co.uk/software/antispam.shtml and http://www.nogbspam.com/.
Integrate the Akismet API to automatically filter your users' posts.
If you're looking for a .NET solution, the Ajax Control Toolkit has a control named NoBot.
NoBot is a control that attempts to provide CAPTCHA-like bot/spam prevention without requiring any user interaction. NoBot has the benefit of being completely invisible. NoBot is probably most relevant for low-traffic sites where blog/comment spam is a problem and 100% effectiveness is not required.
NoBot employs a few different anti-bot techniques:
Forcing the client's browser to perform a configurable JavaScript calculation and verifying the result as part of the postback. (Ex: the calculation may be a simple numeric one, or may also involve the DOM for added assurance that a browser is involved)
Enforcing a configurable delay between when a form is requested and when it can be posted back. (Ex: a human is unlikely to complete a form in less than two seconds)
Enforcing a configurable limit to the number of acceptable requests per IP address per unit of time. (Ex: a human is unlikely to submit the same form more than five times in one minute)
More discussion and demonstration at this blogpost by Jacques-Louis Chereau on NoBot.
<ajaxToolkit:NoBot
ID="NoBot2"
runat="server"
OnGenerateChallengeAndResponse="CustomChallengeResponse"
ResponseMinimumDelaySeconds="2"
CutoffWindowSeconds="60"
CutoffMaximumInstances="5" />
I would be careful using CSS or Javascript tricks to ensure a user is a genuine real life human, as you could be introducing accessibility issues, cross browser issues, etc. Not to mention spam bots can be fairly sophisticated, so employing cute little CSS display tricks may not even work anyway.
I would look into Akismet.
Also, you can be creative in the way you validate user data. For example, let's say you have a registration form that requires a user email and address. You can be fairly hardcore in how you validate the email address, even going so far as to ensure the domain is actually set up to receive mail, and that there is a mailbox on that domain that matches what was provided. You could also use Google Maps API to try and geolocate an address and ensure it's valid.
To take this even further, you could implement "hard" and "soft" validation errors. If the mail address doesn't match a regex validation string, then that's a hard fail. Not being able to check the DNS records of the domain to ensure it accepts mail, or that the mailbox exists, is a "soft" fail. When you encounter a soft fail, you could then ask for CAPTCHA validation. This would hopefully reduce the amount of times you'd have to push for CAPTCHA verification, because if you're getting enough activity on the site, valid people should be entering valid data at least some of the time!
I realize this is a rather old post, however, I came across an interesting solution called the "honey-pot captcha" that is easy to implement and doesn't require javascript:
Provide a hidden text box!
Most spambots will gladly complete the hidden text box allowing you to politely ignore them.
Most of your users will never even know the difference.
To prevent a user with a screen reader from falling into your trap simply label the text box "If you are human, leave blank" or something to that affect.
Tada! Non-intrusive spam-blocking! Here is the article:
http://www.campaignmonitor.com/blog/post/3817/stopping-spambots-with-two-simple-captcha-alternatives
Since it is extremely hard to avoid it at 100% I recommend to read this IBM article posted 2 years ago titled 'Real Web 2.0: Battling Web spam', where visitor behavior and control workflow are analyzed well and concise
Web spam comes in many forms, including:
Spam articles and vandalized articles on wikis
Comment spam on Weblogs
Spam postings on forums, issue trackers, and other discussion sites
Referrer spam (when spam sites pretend to refer users to a target
site that lists referrers)
False user entries on social networks
Dealing with Web spam is very difficult, but a Web developer
neglects spam prevention at his or her
peril. In this article, and in a
second part to come later, I present
techniques, technologies, and services
to combat the many sorts of Web spam.
Also is linked a very interesting "...hashcash technique for minimizing spam on Wikis and such, in addition to e-mail."
How about a human readable question that tells the user to put in the first letter of the value he put in the first name field and the last letter of the last name field or something like this?
Or show some hidden fields which are filled with JavaScript with values like referer and so one. Check for equality of these fields with the ones you have stored in the session before.
If the values are empty, the user has no javascript. Then it would be no spam. But a bot will at least fill in some of them.
Surely you should select one thing Honeypot or BOTCHA.

Refresh browser via cron(or not) to a different page on remote request?

I need to display pages in a tutorial fashion. I looked in to netsupport, beamyourscreen and other possibilities but, I do not want the viewers to download anything. I cannot use gd / send screenshots due to audio / video instructions embedded in some of the pages.
Basically, I need the ability to "refresh" a users browser window to a different page via an interface on my end. Whether via a form submission, javascript or any other type of "controller" that allows me to change the page on the viewers browser. PERL preferred but, PHP / javascript whatever works and is cross browser. I set up a simple javascript page forward timer that "works" but, page load times and conversation interruptions are a huge factor.
The entire tutorial website will be developed around this ability.
I was looking in to curl / cron / wget methods but, found little information.
I have seen forum and chat scripts that basically perform a similar task but, there must be a simple(ish) solution in leau of hacking up another script to suit my needs.
I do not want others to control the pages either. The site really, only needs to be accessable during the tutorial however, It "could" remain web accessable as long as user interaction was normal unless (being controlled).
The initial site concept is based on instructing people how to properly introduce new pets into a home. Will be operated by a veteranarian that saved my pets life. I wanted to give something back.
Possible? I really appreciate simple examples etc...
You have no other way but to keep polling the server for "instructions" using javascript. No, you can't send nothing to the end user browser, neither curl nor wget.
Mainly, you'll have to set up a simple request/response protocol between the browser and the server.
If you want to go deeper, you can use something like cometd/meteord/etc. If not, a hidden iframe that reloads himself and receives pages with javascript code for the needed actions can do the trick.
Another alternative.
With javascript dopolling and single character flatfile. Have a simple one character flatfile with a single var. Write it in perl (it is faster and uses less resources than php). The parent script calls a javascript variable in a flatfile. It hits the flatfile and goes wherever the var sets it. The flatfile is written to by the controller. Done.
I guess you could also rename an empty flatfile and use that as the controller. I am usure which is faster, open and read a specific file or hit the directory and return the file name. On the controller side, opening and writing to a file vs renaming a file. Maybe they counter each other in resources and time?
This way the site can act as a normal site. When you want to have remote users see a "presentation" (automatically being shown the site pages at the controllers pace), the controller activates polling and tells the viewers to push a start button. This allows a remote instructor to load pages for the viewers at his leisure.
It is a simple solution that works with nothing really sophisticated going on. No frames are needed either. Just need javascript enabled.
Any better suggestions are welcome!
It occurred to me that what you might want to use is HTML Push technology. Check out the wiki, they have several links. I have never used it myself

How do you access browser history?

Some e-Marketing tools claim to choose which web page to display based on where you were before. That is, if you've been browsing truck sites and then go to Ford.com, your first page would be of the Ford Explorer.
I know you can get the immediate preceding page with HTTP_REFERRER, but how do you know where they were 6 sites ago?
Javascript this should get you started: http://www.dicabrio.com/javascript/steal-history.php
There are more nefarius means to: http://ha.ckers.org/blog/20070228/steal-browser-history-without-javascript/
Edit:I wanted to add that although this works it is a sleazy marketing teqnique and an invasion of privacy.
Unrelated but relevant, if you only want to look one page back and you can't get to the headers of a page, then document.referrer gives you the place a visitor came from.
You can't access the values for the entries in browser history (neither client side nor server side). All you can do is to send the browser back or forward a number of steps. The entries of the history are otherwise hidden from programmatic access.
Also note that HTTP_REFERER won't be there if the user typed the address in the URL bar instead of following a link to your page.
The browser history can't be directly accessed, but you can compare a list of sites with the user's history. This can be done because the browser attributes a different CSS style to a link that hasn't been visited and one that has.
Using this style difference you can change the content of you pages using pure CSS, but in general javascript is used. There is a good article here about using this trick to improve the user experience by displaying only the RSS aggregator or social bookmarking links that the user actually uses: http://www.niallkennedy.com/blog/2008/02/browser-history-sniff.html