How to identify records which have clusters or lumps in data? - sql

I have a tableau table as follows:
This data can be visualized as follows:
I'd like to flag cases that have lumps/clusters. This would flag items B, C and D because there are spikes only in certain weeks of the 13 weeks. Items A and E would not be flagged as they mostly have a 'flat' profile.
How can I create such a flag in Tableau or SQL to isolate this kind of a case?
What I have tried so far?:
I've tried a logic where for each item I calculate the MAX and MEDIAN. Items that need to be flagged will have a larger (MAX - MEDIAN) value than items that have a fairly 'flat' profile.
Please let me know if there's a better way to create this flag.
Thanks!

Agree with the other commenters that this question could be answered in many different ways and you might need a PhD in Stats to come up with an ideal answer. However, given your basic requirements this might be the easiest/simplest solution you can implement.
Here is what I did to get here:
Create a parameter to define your "spike". If it is always going to be a fixed number, you can hardcode it in your formulas. I called mine "Min Spike Value".
Create a formula for the Median Values in each bucket. {fixed [Buckets]: MEDIAN([Values])} . (A, B, ... E = "Buckets"). This gives you one value for each letter/bucket that you can compare against.
Create a formula to calculate the difference of each number against the median. abs(sum([Values])-sum([Median Values])). We use the absolute value here because a spike can either be negative or positive (again, if you want to define it that way...). I called this "Spike to Current Value abs difference"
Create a calculated field that evaluates to a boolean to see if the current value is above the threshold for a spike. [Spike to Current Value abs difference] > min([Min Spike Value])
Set up your viz to use this boolean to highlight the spikes. The beauty of the parameter is that you can change the value that defines a spike and the viz will highlight accordingly. Above, the value was 4, but if you change it to 8:
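Outside Tableau, the same median-and-threshold logic can be sketched in a few lines of Python. This is only an illustration of the steps above, not the Tableau calculation itself: the function name, the sample numbers, and the item-level "any spike" roll-up are assumptions made for the example.

```python
from statistics import median

def flag_spikes(values_by_bucket, min_spike_value):
    """Return, per bucket, one boolean per value: True where the value
    differs from the bucket's median by more than min_spike_value."""
    flags = {}
    for bucket, values in values_by_bucket.items():
        bucket_median = median(values)  # one median per bucket, like the FIXED LOD
        flags[bucket] = [abs(v - bucket_median) > min_spike_value for v in values]
    return flags

# Hypothetical weekly values, not the data from the question.
sample = {
    "A": [2, 2, 3, 2, 2],
    "B": [1, 1, 9, 1, 1],
}
spike_flags = flag_spikes(sample, min_spike_value=4)

# Roll up to an item-level flag: an item is "lumpy" if any week is a spike.
lumpy_items = [bucket for bucket, f in spike_flags.items() if any(f)]
```

Raising the threshold makes the flag stricter, which mirrors changing the parameter in the viz.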


Searching for groups of objects given a reduction function

I have a few questions about a type of search.
First, is there a name and if so what is the name of the following type of search? I want to search for subsets of objects from some collection such that a reduction and filter function applied to the subset is true. For example, say I have the following objects, each of which contains an id and a value.
[A,10]
[B,10]
[C,10]
[D,9]
[E,11]
I want to search for "all the sets of objects whose summed values equal 30" and I would expect the output to be, {{A,B,C}, {A,D,E}, {B,D,E}, {C,D,E}}.
Second, is the only strategy to perform this search brute-force? Is there some type of general-purpose algorithm for this? Or are search optimizations dependent on the reduction function?
Third, if you came across this problem, what tools would you use to solve it in a general way? Assume the reduction and filter functions could be anything and are not necessarily the sum function. Does SQL provide a good API for this type of search? What about Prolog? Any interesting tips and tricks would be appreciated.
Thanks.
I cannot comment on the problem in general, but a brute-force search is easy to write in Prolog.
w(a,10).
w(b,10).
w(c,10).
w(d,9).
w(e,11).
solve(0, [], _).
solve(N, [X], [X|_]) :- w(X, N).
solve(N, [X|Xs], [X|Bs]) :-
    w(X, W),
    W < N,
    N1 is N - W,
    solve(N1, Xs, Bs).
solve(N, [X|Xs], [_|Bs]) :- % skip element if previous clause fails
    solve(N, [X|Xs], Bs).
Which gives
| ?- solve(30, X, [a, b, c, d, e]).
X = [a,b,c] ? ;
X = [a,d,e] ? ;
X = [b,d,e] ? ;
X = [c,d,e] ? ;
(1 ms) no
SQL is TERRIBLE at this kind of problem. Until recently there was no way to get 'All Combinations' of row elements. Now you can do so with Recursive Common Table Expressions, but their limitations force you to retain all partial results alongside the final ones, so you have to filter the partial results out at the end. About the only benefit you get with SQL's recursive procedure is that you can stop evaluating possible combinations once a sub-path exceeds 30, your target total. That makes it slightly less ugly than an 'evaluate all 2^N combinations' brute force solution (unless every combination sums to less than the target total).
To solve this with SQL you would be running an algorithm that can be described as:
Seed your result set with all table entries whose value is less than or equal to your target total, with that value as the running sum.
Iteratively join your prior result with every table row that is not already used in that result and whose value, added to the running sum, is less than or equal to the target total. The running sum becomes the old running sum plus the value, and the ID is appended to the ID list. Union this new result with the old results. Iterate until no more records qualify.
Make a final pass over the result set to filter out the partial sums that do not total to your target.
Oh, and unless you make special provisions, solutions {A,B,C}, {C,B,A}, and {A,C,B} all look like different solutions (order is significant).
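The seed / extend / filter procedure described above can be sketched outside SQL as well. The following Python is a rough illustration, not a translation of any particular recursive CTE. It assumes all values are non-negative (otherwise pruning on the running sum is not valid), and it only extends a combination with rows that come later in the list, which is one possible "special provision" for avoiding permutations of the same set. The function name is made up for the example.

```python
def subsets_with_sum(items, target):
    """items is a list of (id, value) pairs with non-negative values.
    Returns every set of ids whose values sum exactly to target."""
    # Seed: single rows whose value does not exceed the target.
    frontier = [((idx,), value)
                for idx, (_, value) in enumerate(items) if value <= target]
    results = list(frontier)  # partial and final results are all retained
    while frontier:
        next_frontier = []
        for used, running in frontier:
            # Only rows after the last used row: {A,B,C} is produced once,
            # never as {C,B,A} or {A,C,B}.
            for idx in range(used[-1] + 1, len(items)):
                value = items[idx][1]
                if running + value <= target:  # prune paths that overshoot
                    next_frontier.append((used + (idx,), running + value))
        results.extend(next_frontier)
        frontier = next_frontier
    # Final pass: keep only the combinations that hit the target exactly.
    return [[items[i][0] for i in used]
            for used, running in results if running == target]

rows = [("A", 10), ("B", 10), ("C", 10), ("D", 9), ("E", 11)]
matches = subsets_with_sum(rows, 30)
```

With the five rows from the question, `matches` contains the same four sets the Prolog query returns.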

How might one most efficiently calculate contingent values?

Suppose that I have 10 values n_1, n_2, ... n_10 and that given any one of these values, the other 9 can be calculated. Let f_i(n_j) be the function that calculates the value n_i from the value of n_j (where i != j). These functions are relatively simple (i.e. contain no more than a few exponential functions or powers).
In terms of the functions used, what would be the most efficient way of creating a program to calculate the other 9 values in n_1, ..., n_10 given the 1 that is initially known?
Would the best option be to minimize the number of functions used (and thus minimize the number of lines of code), or to create a function defining every single mapping?
For example, would it be most efficient to use only the 18 functions
f_1(n_2), f_1(n_3), ..., f_1(n_10) [1]
f_2(n_1), f_3(n_1), ..., f_10(n_1) [2]
Then, for whatever input the user provides, the value of n_1 can be calculated using the relevant function in line [1], from which every other value of interest can be calculated using the functions in line [2]?
Or would it be better to define all 90 mappings, and so that only a single function (rather than 2 functions) must be called to calculate each of the 9 other values?
Edit: The specific result that I am trying to achieve is as follows...
I am currently using VBA, with a user form of the following format:
The conversion frequency is a required field (so let's just say, for example, that it is always equal to 2 and forget about it). I want to use on change events so that whenever the user changes any of the 6 fields below the conversion frequency field, the other 5 fields are auto-filled with the correct value. However, since the user need only update any one of the six fields, with the other 5 fields being calculated from it, we will require 6*5 = 30 different functions to do these calculations. We will thus end up with a lot of repetitive code.
My question regards the best practices to follow when working with a form where one of many inputs may be provided, and all other fields must be updated as a result of the input provided and its value.
Or, equivalently, is there a way to update all fields when the value of one field changes? Can this be done without the number of lines of code required increasing exponentially as the number of fields increases?
I think you are grossly overthinking this. Think of this in terms of the formulas you need; which I think are 6. 6 functions that take 5 inputs each:
calculateEIR(nominalInterestRate, ForceOfInterest, DiscountFactor, EffectiveDiscountRate, NominalDiscountRate)
calculateNIR(EffectiveInterestRate, ForceOfInterest, DiscountFactor, EffectiveDiscountRate, NominalDiscountRate)
' and so on...
The event handlers and the code that calculates the values are separate things. Your onchange event handlers simply need to call the correct methods; this is 6 event handlers calling 5 methods each, so 12 procedures in total (6 handlers plus 6 calculation functions) if you want to keep count. It's a lot of copypasta. For example:
Sub textEffectiveInterestRate_onchange()
    Me.textNominalInterestRate.value = calculateNIR(Me.textEffectiveInterestRate.value, Me.textForceOfInterest.value, etc...)
    Me.textForceOfInterest.value = calculateForceOfInterest(Me.textEffectiveInterestRate.value, Me.textNominalInterestRate.value, etc...)
    ' And every other function aside from calculateEIR()
End Sub
I am unsure about the specifics of how you are changing all the values based on a change in the others (since I don't know the formulas), but in general, you should not in any way need 30 functions...
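For what it's worth, the "convert to one value, then derive everything from it" idea from the question (the 18-function option) keeps the code linear in the number of fields. Below is a minimal Python sketch of that idea, not VBA and not the asker's actual formulas: I am assuming the six fields are related by the standard compound-interest identities, with the effective interest rate as the hub, and all the names are made up for the illustration.

```python
import math

# Each field needs only two small functions: one to reach the hub value
# (effective interest rate i) and one to be derived from it.
# p is the conversion frequency.
TO_EIR = {
    "effective_interest": lambda x, p: x,
    "nominal_interest":   lambda x, p: (1 + x / p) ** p - 1,
    "force_of_interest":  lambda x, p: math.exp(x) - 1,
    "discount_factor":    lambda x, p: 1 / x - 1,
    "effective_discount": lambda x, p: x / (1 - x),
    "nominal_discount":   lambda x, p: (1 - x / p) ** (-p) - 1,
}
FROM_EIR = {
    "effective_interest": lambda i, p: i,
    "nominal_interest":   lambda i, p: p * ((1 + i) ** (1 / p) - 1),
    "force_of_interest":  lambda i, p: math.log(1 + i),
    "discount_factor":    lambda i, p: 1 / (1 + i),
    "effective_discount": lambda i, p: i / (1 + i),
    "nominal_discount":   lambda i, p: p * (1 - (1 + i) ** (-1 / p)),
}

def update_all(changed_field, value, p=2):
    """Given the one field the user changed, return every field's value."""
    i = TO_EIR[changed_field](value, p)  # step 1: go to the hub value
    return {name: f(i, p) for name, f in FROM_EIR.items()}  # step 2: fan out

fields = update_all("effective_interest", 0.10)
```

In a form, every change handler would then be the same one-liner: call `update_all` with the name of the field that changed and write the results back to the other fields. Adding a seventh field adds two functions, not twelve.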

Clever way to check if value meets threshold in VBA

Disclaimer: Numbers below are randomly generated
What I'm trying to do is, purely in VBA, look at the ratio of [column B]/[column A] and check whether the ratio in row 10 (=1,241/468) is below the minimum or above the maximum of the ratios in rows 1 through 9, but only compared to the rows where there is a 1 in column C.
That is, compare Cell(B10)/Cell(A10) to Cell(B2)/Cell(A2), Cell(B3)/Cell(A3), etc. (only comparing against rows with a 1 in column C).
The workbook I'm working with has a lot more data and columns and I'm not allowed to explicitly edit the cells, so defining a new column is out of the question. Is there a way to do this in VBA such that it essentially returns a boolean depending whether or not the ratio in the last row violates the threshold defined above?
You can retrieve the minimum and maximum ratios (with criteria) easily with the AGGREGATE¹ function's SMALL sub-function and LARGE sub-function.
        
The formulas in D13:E13 are,
=AGGREGATE(15, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)
=AGGREGATE(14, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)
The 6 is the AGGREGATE parameter for ignoring error values. By dividing the ratio by the value in column C we are producing #DIV/0! errors for anything we do not want considered, so those rows are ignored. If the values in C were more diverse, we could divide by (C1:C9=1) to produce the same results.
Since we are using the SMALL and LARGE sub-functions, we can easily retrieve the second, third, etc. ratios by increasing the k parameter (the 1 off the back end).
I've modified some of the values in your sample slightly to demonstrate that the min and max with criteria are being picked up correctly.
These can be adapted to VBA with the WorksheetFunction object or Application.Evaluate method.
¹ The AGGREGATE function was introduced with Excel 2010. It is not available in previous versions.
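To make the logic of those two formulas explicit, here is a small Python sketch of the same check: take the ratios only for rows flagged with a 1, find their minimum and maximum, and return a boolean for the last row. It is not VBA and does not touch a worksheet. The function name and the sample rows are hypothetical; only the last-row ratio 1,241/468 comes from the question.

```python
def ratio_out_of_range(col_a, col_b, col_c, last_a, last_b):
    """True if last_b/last_a is below the minimum or above the maximum of
    col_b/col_a over the rows where col_c equals 1."""
    ratios = [b / a for a, b, c in zip(col_a, col_b, col_c)
              if c == 1 and a != 0]  # rows without a 1 are skipped, like the ignored errors
    if not ratios:
        raise ValueError("no rows with a 1 in column C to compare against")
    last_ratio = last_b / last_a
    return last_ratio < min(ratios) or last_ratio > max(ratios)

# Hypothetical comparison rows.
col_a = [100, 200, 300]
col_b = [150, 500, 900]
col_c = [1, 0, 1]
violates = ratio_out_of_range(col_a, col_b, col_c, last_a=468, last_b=1241)
```

Here the flagged ratios are 1.5 and 3.0, so a last-row ratio of about 2.65 is inside the range and `violates` is False.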

Converting from excel formula for Using forecast with times

When using forecast, you input a number and it should return a value based on the known X data and Known Y data.
However if you put in a time this does not work.
I need two things.
First of all I need the VBA equivalent of forecast. I suspect this to be application.forecast
Then how to use the date as a value for the forecast to work as it should
The formula is as follows:
=FORECAST(15:00:00,A10:A33,B10:B33)
Currently this equation flags up an error.
Any ideas to get this to work for time values?
I see two potential problem areas. The first is the time. Use the TIME function to get a precise time. Second, in D9:D12, the values are left-aligned. Typically, this means they are text, not true numbers. If you absolutely require the m suffix, use a Custom number Format of General\m in order that they retain their numeric status while displaying an m as an increment suffix. If you type the m in, they become text-that-look-like-numbers and are useless for any maths.
=FORECAST(TIME(15, 0, 0), B10:B33, A10:A33)
That returns 3.401666667 which is either 09:38 AM or 3.4 m (it's been a while since I played with the FORECAST function).
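As background on why TIME works here: Excel stores a time as a fraction of a day, so 15:00 is 0.625, and FORECAST fits a straight line through the known points and evaluates it at x. A rough Python sketch of both pieces follows. It is an illustration with made-up sample points, not the data from the question, and the function names are my own.

```python
from datetime import time

def time_to_day_fraction(t):
    """Convert a clock time to the fraction of a day Excel uses internally."""
    return (t.hour * 3600 + t.minute * 60 + t.second) / 86400

def forecast(x, known_ys, known_xs):
    """Least-squares linear prediction at x, with the same argument order
    as FORECAST(x, known_y's, known_x's)."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(known_xs, known_ys))
    slope = sxy / sxx
    return mean_y + slope * (x - mean_x)

# Hypothetical readings at 06:00, 12:00 and 18:00.
xs = [time_to_day_fraction(time(h, 0, 0)) for h in (6, 12, 18)]
ys = [1.0, 2.0, 3.0]
prediction = forecast(time_to_day_fraction(time(15, 0, 0)), ys, xs)
```

Note the argument order: the y values come before the x values, which is easy to swap by accident.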

Unbounded knapsack/coin change with optimal solution for non-standard coins

I have the following problem:
Given a target size N and some denominations of some randomly
generated coins stored in array denominations[] check using dynamic
programming if there is an optimal solution which will fill the entire
target with the least amount of coins used.
This is a typical example of the coin change problem. However, unlike real-life currency, where each denomination is carefully picked so that any target can be matched, that is not the case here.
I am familiar with the algorithm used in coin change problem and how to construct a dynamic programming array to find the solution but how do I check if there is a solution in the first place?
Let the state be denoted by DP[i][sum]: the minimum number of coins needed to form sum using only the first i coins of the array denominations.
Then the recurrence can be formulated as:
DP[i][sum]= min(DP[i-1][sum],DP[i][sum-denominations[i]]+1)
Why?
The first term, DP[i-1][sum], is the number of coins needed to form sum using the first i-1 coins only (the case in which we exclude the i-th denomination). The second term is the case in which we include the i-th coin; it applies only when sum >= denominations[i]. (Note: I have assumed we can include the coin multiple times, which is why I wrote DP[i][sum-denominations[i]].)
Now the base cases: DP[i][0] = 0, i.e. the empty set, for all i from 0 to n (the number of denominations), and DP[0][j] = +INFINITY for 1 <= j <= sum.
Once the DP table is filled, you can easily check: if DP[n][sum] is not equal to +INFINITY, a solution exists; otherwise it does not.
If you know how to construct a solution (as you said), you can construct it for this formulation too.
P.S.: To allow each coin denomination to be used only once, the recurrence changes to
DP[i][sum]= min(DP[i-1][sum],DP[i-1][sum-denominations[i]]+1)
by the same logic. I think the base cases stay the same.
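Here is a short Python sketch of the unbounded recurrence with the infinity check, in case it helps to see it filled in. It assumes every denomination is a positive integer, uses 0-based list indexing (so the i-th coin is denominations[i-1]), and returns None when no combination fills the target exactly. The function name is my own.

```python
import math

def min_coins(denominations, target):
    """Minimum number of coins (unlimited supply of each denomination) that
    sum exactly to target, or None if no exact fill exists."""
    n = len(denominations)
    # dp[i][s]: fewest coins forming s using only the first i denominations.
    dp = [[math.inf] * (target + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 0  # base case: the empty set forms sum 0
    for i in range(1, n + 1):
        coin = denominations[i - 1]
        for s in range(1, target + 1):
            dp[i][s] = dp[i - 1][s]  # exclude the i-th denomination
            if s >= coin and dp[i][s - coin] + 1 < dp[i][s]:
                dp[i][s] = dp[i][s - coin] + 1  # include it, possibly again
    return None if dp[n][target] == math.inf else dp[n][target]
```

For example, with denominations 5 and 7, a target of 24 can be filled with four coins (7 + 7 + 5 + 5), while a target of 23 cannot be filled at all, so the function returns None.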