What is the most conclusive way to evaluate an n-way split test where n > 2? - data-science

I have plenty of experience designing, running and evaluating two-way split tests (A/B Tests). Those are by far the most common in digital marketing, where I do most of my work.
However, I'm wondering if anything about the methodology needs to change when more variants are introduced into an experiment (creating, say, a 3-way test (A/B/C Test)).
My instinct tells me I should just run n-1 evaluations against the control group.
If I run a 3-way split test, for example, instinct says I should find significance and power twice:
Treatment A vs Control
Treatment B vs Control
So, in that case, I'm finding out which, if any, treatment performed better than the control (1-tailed test, alt: treatment - control > 0, the basic marketing hypothesis).
But, I'm doubting my instinct. It's occurred to me that running a third test contrasting Treatment A with Treatment B could yield confusing results.
For example, what if there's not enough evidence to reject a null that treatment B = treatment A?
That would lead to a goofy conclusion like this:
Treatment A = Control
Treatment B > Control
Treatment B = Treatment A
If treatments A and B are likely only different due to random chance, how could only one of them outperform the control?
And that's making me wonder if there's a more statistically sound way to evaluate split tests with more than one treatment variable. Is there?

Your instincts are correct, and you can feel less goofy by rewording your statements:
We could find no statistically significant difference between Treatment A and Control.
Treatment B is significantly better than Control.
However it remains inconclusive whether Treatment B is better than Treatment A.
This would be enough to declare Treatment B a winner, with the possible followup of retesting A vs B. But depending on your specific situation, you may have a business need to actually make sure Treatment B is better than Treatment A before moving forward and you can make no such decision with your data. You must gather more data and/or restart a new test.
What I've found is a far more common scenario is Treatment A and Treatment B both soundly beat control (as they're often closely related and have related hypotheses), but there is no statistically significant difference between Treatment A or Treatment B. This is an interesting scenario where if you are required to pick a winner, it's okay throwing significance out the window and picking the one that has the strongest effect. The reason why is that the significance level (eg. 95%) is set to avoid false positives and making unnecessary changes. There's an assumption that there are switching costs. In this case, you must pick A or B and throw out control, so in my opinion it's okay picking the best one until you have more data.

Related

Robust Measures of Algorithmic Trading - Based on Robert Pardo's Book

I am optimizing algorithmic strategies. In the process of choosing from a pool of many optimized strategies, I am in the phase of searching (evaluating) for robustness of the strategy.
Following the guidelines of Dr. Pardo's book "The Evaluation of Trading Strategies" in page 231 Dr. Pardo recomends, in the Numeral 3 to apply the following ratio to the optimized data:
" 3. The ratio of the total profit of all profitable simulations divided by the
total profit of all simulationsis significantly positive"
The Question: from the optimization results, I am not being able to properly understand what does Mr. Pardo means by stating "...all simulationsis significantly positive"; what does Mr. Pardo means by 'significantly positive?
a.) with 95% confidence level?
b.) with a certain p value?
c.) the relation of the average net profit of each simulation minus it' standard deviation
Even though the sentence might seem 'simple' I would REALLY like to understand what Mr. Pardo means by the statement and HOW to calculate it, in order to filter the most robust algorithmic strategies.
The aim of analyzing the optimization profile of an algorithmic simulation is to be able to filter robust strategies.
Therefore the ratio should help us to uncover if the simulation results are on the right track or not.
So, we would like to impose some 'penalties' to our results, so we can select the robust cases from those of doubtful (not robust) result.
I came to the following penalizing measures (found in the book of Mr. Pardo and other sources).
a.) we can use a market return (yearly value) as a benchmark, so all the simulations whose result are below such level, can be excluded from our analysis,
b.) some other benchmark to divide those 'robust' results from those more 'doubtful' (for example, deducing to each result one standard deviation)
From (a) and (b), we can create the ratio:
the total sum of all profitable simulations divided by the profitable results considered robust
The ratio should be greater or equal than 1.
If the ratio is equal to 1 then it means that our simulation result has given interesting results (we are analyzing the positive values in this ratio, but profitable results should always be compared to the negative results also).
If the ratio is greater from 1, then we have not reach the possible scenario, and the result should be compared with the other tests for optimizations.
While simulating trading algorithms, no result is absolute but partial and it's value is taken in relationship to what we expect from the algorithm.
If someone has a better explanation or idea or concept you might find interesting please share, I would gladly read it.
Best regards to all.
Remark on the subject
With all due respect to the subject ( published in 2008 ) the term robustness has its own meaning if-and-only-if the statement also clarifies in which particular respect is the robustness measured and against what phenomena is it to be exposed & tested the Model-under-review's response ( against what perturbances -- type and scale -- shall the Model-under-test hold its robust behaviour, measures of which were both defined and quantified a-priori the test ).
In any case, where such context of the robustness is not defined, the material, be it printed by any bold name, sounds -- and forgive me to speak in plain English -- just like a PR-story, an over-hyped e-zine headline or like a paid advertorial.
Serious quantitative model evaluations, the more if one strives to perform an optimisation ( with respect to some defined quantitative goal ), requires a more thorough insight into the subject than to axiomatically post a trivial "must-have" imperative of
large-average && small-HiLo-range && small StDev.
Any serious Quant-Modelling effort, if it were not to just spoil the consumed hundreds-of-thousands CPU core hours of deep parametric-spaces' scans, shall incorporate a serious parametrisation decision in either dimension of the main TruTrading Strategy sub-spaces --
{ aSelectPOLICY, aDetectPOLICY, anActPOLICY, anAllocatePOLICY, aTerminatePOLICY }
A failure to do so, either cripples the model or leads to a blind-belief, where it is hard to guess, whether the former or the latter is a greater of the both Quant-sins.
Remark on the cited hypothesis
The book states, without any effort to proof the construction, that:
The more robust trading strategywill have an optimization profile with a: 1. Largeaverageprofit 2. Small maximum-minimumrange3. Small standarddeviation
Is it correct?
Now kindly spend a few moments and review this 4D-animated view of a Model-under-test ( visualisation of which is reduced into just four dimensions for easier visual perception ), where none of the above stands true.
<aMouseRightCLICK>.openPictureOnAnotherTab to see full HiRes picture details
Based on contemporary state-of-art adaptive money-management practice, that fails to be correct, be it due to a poor parametrisation ( thus artificially leading the model into a rather "flat-profits" sub-space of aParamSetVectorSPACE )
or due to a principal mis-concept or a poor practice ( including the lack thereof ) of the implementation of the most powerful profit-booster ever -- the very money-management model sub-space.
Item 1 becomes insignificant at all.
Item 2 works right on the contrary to the stated postulate.
Item 3 cannot yield anything but the opposite due to 1 & 2 above.

Creating a testing strategy to check data consistency between two systems

With a quick search over stackoverflow was not able to find anything so here is my question.
I am trying to write down the testing strategy for a application where two applications sync with each other every day to keep a huge amount of data in sync.
As its a huge amount of data I don't really want to cross check everything. But just want to do a random check every time a data sync happens. What should be the strategy here for such system?
I am thinking of this 2 approach.
1) Get a count of all data and cross check both are same
2) Choose a random 5 data entry and verify that their proprty are in sync.
Any suggestion would be great.
What you need is known as Risk Management, in Software Testing it is called Software Risk Management.
It seems your question is not about "how to test" what you are about to test but how to describe what you do and why you do that (based on the question I assume you need this explanation for yourself too...).
Adding SRM to your Test Strategy should describe:
The risks of not fully testing all and every data in the mirrored system
A table scaling down SRM vs amount of data tested (ie probability of error if only n% of data tested versus -e.g.- 2n% tested), in other words saying -e.g.!- 5% of lost data/invalid data/data corrupption/etc if x% of data was tested with a k minute/hour execution time
Based on previous point, a break down of resources used for the different options (e.g. HW load% for n hours, manhours used is y, costs of HW/SW/HR use are z USD)
Probability -and cost- of errors/issues with automation code (ie data comparison goes wrong and results in false positive or false negative, giving an overhead to DBA, dev and/or testing)
What happens if SRM option taken (!!e.g.!! 10% of data tested giving 3% of data corruption/loss risk and 0.75% overhead risk -false positive/negative results-) results in actual failure, ie reference to Business Continuity and effects of data, integrity, etc loss
Everything else comes to your mind and you feel it applies to your *current issue* in your *current system* with your *actual preferences*.

Building ranking with genetic algorithm,

Question after BIG edition :
I need to built a ranking using genetic algorithm, I have data like this :
P(a>b)=0.9
P(b>c)=0.7
P(c>d)=0.8
P(b>d)=0.3
now, lets interpret a,b,c,d as names of football teams, and P(x>y) is probability that x wins with y. We want to build ranking of teams, we lack some observations P(a>d),P(a>c) are missing due to lack of matches between a vs d and a vs c.
Goal is to find ordering of team names, which the best describes current situation in that four team league.
If we have only 4 teams than solution is straightforward, first we compute probabilities for all 4!=24 orderings of four teams, while ignoring missing values we have :
P(abcd)=P(a>b)P(b>c)P(c>d)P(b>d)
P(abdc)=P(a>b)P(b>c)(1-P(c>d))P(b>d)
...
P(dcba)=(1-P(a>b))(1-P(b>c))(1-P(c>d))(1-P(b>d))
and we choose the ranking with highest probability. I don't want to use any other fitness function.
My question :
As numbers of permutations of n elements is n! calculation of probabilities for all
orderings is impossible for large n (my n is about 40). I want to use genetic algorithm for that problem.
Mutation operator is simple switching of places of two (or more) elements of ranking.
But how to make crossover of two orderings ?
Could P(abcd) be interpreted as cost function of path 'abcd' in assymetric TSP problem but cost of travelling from x to y is different than cost of travelling from y to x, P(x>y)=1-P(y<x) ? There are so many crossover operators for TSP problem, but I think I have to design my own crossover operator, because my problem is slightly different from TSP. Do you have any ideas for solution or frame for conceptual analysis ?
The easiest way, on conceptual and implementation level, is to use crossover operator which make exchange of suborderings between two solutions :
CrossOver(ABcD,AcDB) = AcBD
for random subset of elements (in this case 'a,b,d' in capital letters) we copy and paste first subordering - sequence of elements 'a,b,d' to second ordering.
Edition : asymetric TSP could be turned into symmetric TSP, but with forbidden suborderings, which make GA approach unsuitable.
It's definitely an interesting problem, and it seems most of the answers and comments have focused on the semantic aspects of the problem (i.e., the meaning of the fitness function, etc.).
I'll chip in some information about the syntactic elements -- how do you do crossover and/or mutation in ways that make sense. Obviously, as you noted with the parallel to the TSP, you have a permutation problem. So if you want to use a GA, the natural representation of candidate solutions is simply an ordered list of your points, careful to avoid repitition -- that is, a permutation.
TSP is one such permutation problem, and there are a number of crossover operators (e.g., Edge Assembly Crossover) that you can take from TSP algorithms and use directly. However, I think you'll have problems with that approach. Basically, the problem is this: in TSP, the important quality of solutions is adjacency. That is, abcd has the same fitness as cdab, because it's the same tour, just starting and ending at a different city. In your example, absolute position is much more important that this notion of relative position. abcd means in a sense that a is the best point -- it's important that it came first in the list.
The key thing you have to do to get an effective crossover operator is to account for what the properties are in the parents that make them good, and try to extract and combine exactly those properties. Nick Radcliffe called this "respectful recombination" (note that paper is quite old, and the theory is now understood a bit differently, but the principle is sound). Taking a TSP-designed operator and applying it to your problem will end up producing offspring that try to conserve irrelevant information from the parents.
You ideally need an operator that attempts to preserve absolute position in the string. The best one I know of offhand is known as Cycle Crossover (CX). I'm missing a good reference off the top of my head, but I can point you to some code where I implemented it as part of my graduate work. The basic idea of CX is fairly complicated to describe, and much easier to see in action. Take the following two points:
abcdefgh
cfhgedba
Pick a starting point in parent 1 at random. For simplicity, I'll just start at position 0 with the "a".
Now drop straight down into parent 2, and observe the value there (in this case, "c").
Now search for "c" in parent 1. We find it at position 2.
Now drop straight down again, and observe the "h" in parent 2, position 2.
Again, search for this "h" in parent 1, found at position 7.
Drop straight down and observe the "a" in parent 2.
At this point note that if we search for "a" in parent one, we reach a position where we've already been. Continuing past that will just cycle. In fact, we call the sequence of positions we visited (0, 2, 7) a "cycle". Note that we can simply exchange the values at these positions between the parents as a group and both parents will retain the permutation property, because we have the same three values at each position in the cycle for both parents, just in different orders.
Make the swap of the positions included in the cycle.
Note that this is only one cycle. You then repeat this process starting from a new (unvisited) position each time until all positions have been included in a cycle. After the one iteration described in the above steps, you get the following strings (where an "X" denotes a position in the cycle where the values were swapped between the parents.
cbhdefga
afcgedbh
X X X
Just keep finding and swapping cycles until you're done.
The code I linked from my github account is going to be tightly bound to my own metaheuristics framework, but I think it's a reasonably easy task to pull the basic algorithm out from the code and adapt it for your own system.
Note that you can potentially gain quite a lot from doing something more customized to your particular domain. I think something like CX will make a better black box algorithm than something based on a TSP operator, but black boxes are usually a last resort. Other people's suggestions might lead you to a better overall algorithm.
I've worked on a somewhat similar ranking problem and followed a technique similar to what I describe below. Does this work for you:
Assume the unknown value of an object diverges from your estimate via some distribution, say, the normal distribution. Interpret your ranking statements such as a > b, 0.9 as the statement "The value a lies at the 90% percentile of the distribution centered on b".
For every statement:
def realArrival = calculate a's location on a distribution centered on b
def arrivalGap = | realArrival - expectedArrival |
def fitness = Σ arrivalGap
Fitness function is MIN(fitness)
FWIW, my problem was actually a bin-packing problem, where the equivalent of your "rank" statements were user-provided rankings (1, 2, 3, etc.). So not quite TSP, but NP-Hard. OTOH, bin-packing has a pseudo-polynomial solution proportional to accepted error, which is what I eventually used. I'm not quite sure that would work with your probabilistic ranking statements.
What an interesting problem! If I understand it, what you're really asking is:
"Given a weighted, directed graph, with each edge-weight in the graph representing the probability that the arc is drawn in the correct direction, return the complete sequence of nodes with maximum probability of being a topological sort of the graph."
So if your graph has N edges, there are 2^N graphs of varying likelihood, with some orderings appearing in more than one graph.
I don't know if this will help (very brief Google searches did not enlighten me, but maybe you'll have more success with more perseverance) but my thoughts are that looking for "topological sort" in conjunction with any of "probabilistic", "random", "noise," or "error" (because the edge weights can be considered as a reliability factor) might be helpful.
I strongly question your assertion, in your example, that P(a>c) is not needed, though. You know your application space best, but it seems to me that specifying P(a>c) = 0.99 will give a different fitness for f(abc) than specifying P(a>c) = 0.01.
You might want to throw in "Bayesian" as well, since you might be able to start to infer values for (in your example) P(a>c) given your conditions and hypothetical solutions. The problem is, "topological sort" and "bayesian" is going to give you a whole bunch of hits related to markov chains and markov decision problems, which may or may not be helpful.

Rough estimate of test cases

I'm curious how many test cases others have for a site similar to mine. It's your basic CRUD with business workflow website. 3 user roles, a couple input pages, a couple search pages, a business rule engine, etc. Maybe 50k lines of .NET code (workflow and persistence altogether). DB with about 10 main tables plus about 100 supporting tables (lookups, logs, etc.). The main UI for entering data is quite big, around 100 data fields, multiple grids, about 5 action/submit type buttons.
I know this is vague and I'm only hoping for order of magnitude figures. I'm also thinking of basic test cases, not code coverage type cases. But like if I told you we had 25 test cases I'm sure you'd say way WAY not enough. So I'm just looking for ballpark figures.
TIA
I would have as many test cases as it takes to ensure a high level of confidence in the system.
The number of tables, rules, lines of code, etc is actually immaterial.
You should have the appropriate unit tests to ensure your domain objects and business rules are firing correctly. You should have tests to ensure your queries execute appropriately (this is a harder one).
You might even want to have test cases for paths through the software. In other words, click here, get this page, click there, edit a field, save the page, go back... This type is the most difficult as the tests are usually recorded and have to be rerecorded when the pages change (ie: a field is added or removed).
Generally speaking it's more about coverage than number of tests. You want your tests to cover as much of the applications funcionality as is feasible. Note that I didn't say possible. You can cover an entire application (100%) with test cases, but for every little change, bug fix, etc you'll have to recode those tests. This is more desired for a mature app. For newer apps you don't want to hamstring your developers and QA team that way as they'll spend inordinate amounts of time fixing/changing unit tests...
For any system, you could easily spend as much time developing your automated tests as you do the system itself. In some cases, even more.
As for our group, we tend to have lots of unit tests. However, for testing paths through the system we only record those once a particular area has moved into a "maintenance" type of mode. Meaning we expect little change for quite a while in that area and the path test is simply to ensure no one jacked it up.
UPDATE: the comments here led me to the following:
Going a little further: Let's examine 1 small piece of code:
Int32 AddNumbers(Int32 a, Int32 b) {
return a+b;
}
On the face of it you could get away with a single test:
Int32 result = AddNumbers(1,2);
Assert.Equals(result, 3);
However, that probably isn't enough. What happens if you do this:
Int32 result = AddNumbers(Int32.MaxValue, 1);
Assert.Equals(result, (Int32.MaxValue+1));
Now we have a failure. Here's another one:
Int32 result = AddNumbers(Int32.MinValue, -1);
Assert.Equals(result, (Int32.MinValue-1));
So, we have an extremely simple method that requires at least 3 tests. The initial to see if it can give any result, then 2 for bounds checking. That's 3 tests for essentially 2 lines of code (method definition and the one line computation).
As your code becomes more complex, things get really dicey:
Decimal DivideThis(Decimal a, Decimal b) {
result = Decimal.Divide(a,b);
}
This slight change introduces yet another exception condition beyond bounds: DivideByZero. So now we are up to 4 tests required for 2 lines of code.
Now, let's simplify it a bit:
String AppendData(String data, String toAppend) {
return String.Format("{0}{1}", data, toAppend);
}
Our test case here is:
String result = AppendData("Hello", "World");
Assert.Equals(result, "HelloWorld");
That's just one test case for the code block, with no others really needed.
What does this tell us: For starters 2 lines of code might cause us to need between 1 and 4 test cases. You mentioned 50k lines... Using that logic, you will need between 50,000 and 200,000 test cases...
Of course, life is rarely so simple. In those 50k lines of code you have, there are going to be large blocks of code that have very limited inputs. For example a mortgage interest calculator might take 3 parameters, and return 1 value (the APR). The code itself might run 100 lines or so (been awhile, just work with me). The number of test cases for this is going to be determined by edge cases along the lines of making sure you properly handle rounding.
So, let's say it's 5 cases: which brings us to 20 lines of code = 1 case. Calculating that out your 50k lines might result in 2,500 test cases. Obviously much smaller than what we expected above.
Finally, I'm going to throw another wrinkle into the mix. Some test systems can handle inputs and your assertions coming from a data file. Considering our first one we could have a data file that has a line for each parameter combination we want to test. In this scenario, we only need 1 test case to cover 3 (or more..) possible conditions.
The test case might look like (pseudo code):
read input file.
parse expected result, parameter 1, parameter 2
run method
assert method result = parsed result
repeat for each line of the file
With that capability, we are down to 1 test case per scenario. I would say 1 per method, but the reality is that most methods are rarely standalone and it's entirely possible that numerous methods are implicitly tested through explicit testing of others; therefore not requiring their own individual tests.
This leads me to this: It is impossible to determine the right number of test cases without a full understanding of your code base. 5 cases that are at the UI level might be enough for complete coverage depending on the complexity of the tests; or it might take thousands. Therefore it's much better to base it on code coverage. What percentage of the code, and branching logic, are you testing?
If you ask a car salesman for a rough price of a car and he would give me that price, I wouldn't buy my car there, because he forgot to ask me some important questions. What kind of car do you want? Which extras do you want on the car? etc.
Same for number of test cases .... If a hiring manager would ask me that question I would probably give him the following answer.
#test cases = between #Requirements*2 and #Requirements*infinite (some requirements can lead to bollions of possibilities)
I also would say that based on my experience the number would realistically be #Requirements*5 (is the number I use at the initial phase, for projects with new, changed and omitted functionality)
where the following error margin has to be taken depending on the phase I am making this estimate:
Initiation phase : error margins = 400%
...
Testing phase : error margin = 10%
By the time you start the testing phase, detailed requirements/specs are available, volatillity of requirements is stabilized, creep of requirements is almost zero, etc.
At that time I also will be able to give better estimates ...

Any mental technique to quickly deduce the required ifs/elses in a program with A LOT of conditional logic?

Often in programming, it is a very common requirement that some piece of functionality will require a lot of conditional logic, but not quite enough to warrant a rules engine.
For example, testing a number is divisible by x but also a multiple of something, a factor of something else, a square root of something, etc. As you can imagine, something along these lines will easily involve a lot of ifs/elses.
While it is possible to reduce the clutter with more modern programming techniques, how do you quickly and, in a calculated fashon, deduce the required ifs/elses?
For example, in a program to deduce the necessary quote for a car insurance prospective customer (rules engines aside btw), there would be conditional logic for age, location, driving points, what age those points are collected at, etc. Is there any mental technique to quickly deduce the redundant conditional branches? Is it just plain experience and no special mental technique? This is important because pair programming there is a lot of noise and thus difficult to actually think something through or even get enough time to implement the idea.
Thanks
I would suggest that trying to do this sort of thing in your head is asking for trouble, and trying to do it with a partner is going to make it much worse. Sometimes you have to sit and think and even make some notes on paper. If you don't like propositional logic, try decision tables.
I would add simple, readable, short methods, such as:
IsMinor(..)
IsRecordClean(..)
And then use them in conjunction to create new methods with meaningful names, such as:
IsMeetingPreReqs(..) //which checks several "simple" conditions
IsValidForInsurance(..)
(Sorry for the examples, I'm struggling with my English here, but you get the point..)
IMO that will make your code much clearer, and thus reduce the chances of being confused by distractions.
Not mental per se, but kinda..
I would say you should use propositional logic.
Suggest that:
q = age is greater than 18
p = location is within 10 miles
r = driving points are less than 3
s = age is less than 18 when points are collected
you could say...
(^ is AND)
if (q ^ p ^ r ^ s) {
//you are eligible or something!
} else {
//get outta here
}