How do I suppress individual cell values in an MDX query? - ssas

I've got an MDX query that's returning two different values: a total number of lookups and a number of failed lookups (and some other stuff, but that's the important part).
with
member [Pct Success] as 'iif(isempty([Num Lookup]) or [Num Lookup]=0, null,
100 * (coalesceempty([Num Lookup] - [Num Failed], 0) / [Num Lookup]))'
select
{
[Measures].[Pct Success],
[Measures].[Num Lookup],
[Measures].[Num Failed]
} on 0,
[Calendar].[Date].children on 1
from
[Cube]
Now what I'm trying to do is get another success measurement, but I want this one to suppress any [Num Failed] cells that are below a specific threshold. Specifically, if I have at least 4 successful lookups (Num Lookup - Num Failed > 3 && Num Lookup > 4), then I want to make that cell's [Num Failed] = 0 (or rather, make a copy of [Num Failed] that is 0, since I still need to display the original % Success measure).
The underlying facts look like this (just the pertinent subset of columns - the Line column is there for ease of reference, it's not in the actual facts):
| Line | CalendarKey | Num Failed | Num Lookup |
|------|-------------|------------|------------|
| 1    | 20130601    | 2          | 8          |
| 2    | 20130601    | 5          | 8          |
| 3    | 20130601    | 1          | 8          |
| 4    | 20130601    | 0          | 7          |
| 5    | 20130601    | 7          | 8          |
| 6    | 20130602    | 2          | 6          |
| 7    | 20130602    | 1          | 7          |
| 8    | 20130602    | 5          | 10         |
| 9    | 20130602    | 7          | 9          |
etc.
What I expect to see for results based on those facts above is:
| Date     | % Success | % Filt Success | Num Filt Failed | Num Failed | Num Lookup |
|----------|-----------|----------------|-----------------|------------|------------|
| 20130601 | 61.53     | 69.23          | 12              | 15         | 39         |
| 20130602 | 53.13     | 71.88          | 9               | 15         | 32         |
In the above fact sample, lines 1, 3, 4, 7 & 8 all would have a filtered failed value of 0, which gives us the sample output listed above.
My initial thought is to use another member in the with clause as the copy of [Num Failed] and then a Cell Calculation to do the suppression, but I can't get the Cell Calculation to work correctly - it either doesn't modify the value, or it errors out during evaluation.
Here's the non-working version that "should" return what I'm looking for, but doesn't:
with
member [Measures].[Num Threshold Failure] AS [Num Failed]
Cell Calculation [Data Filter] For '[Measures].[Num Threshold Failure]' AS 'NULL', Condition = 'CalculationPassValue((([Measures].[Num Lookup] - [Measures].[Num Failure]) < 4) AND ([Measures].[Num Lookup] > 4), 1)'
member [Pct Success] as 'iif(isempty([Num Lookup]) or [Num Lookup]=0, null, 100 * (coalesceempty([Num Lookup] - [Num Failed], 0) / [Num Lookup]))'
member [Pct Filtered Success] as 'iif(isempty([Num Lookup]) or [Num Lookup]=0, null, 100 * (coalesceempty([Num Lookup] - [Num Threshold Failure], 0) / [Num Lookup]))'
select
{
[Measures].[Pct Success],
[Measures].[Pct Filtered Success],
[Measures].[Num Threshold Failure],
[Measures].[Num Failed],
[Measures].[Num Lookup]
} on 0,
{ [Calendar].[Date].children } on 1
from
[Cube]

I do not understand your question in every detail, but as far as I understand it, the following should answer it, or at least come close:
with
member [Measures].[Pct Success] as iif([Measures].[Num Lookup] = 0,
    null,
    100 * (coalesceempty([Measures].[Num Lookup] - [Measures].[Num Failed], 0) / [Measures].[Num Lookup]))
member [Measures].[Num Filtered Failed] as iif([Measures].[Num Lookup] - [Measures].[Num Failed] > 3 and [Measures].[Num Lookup] > 4,
    0,
    [Measures].[Num Failed])
member [Measures].[Bottom Filtered Failed] as Sum(Leaves(),
    iif([Measures].[Num Lookup] - [Measures].[Num Failed] > 3 and [Measures].[Num Lookup] > 4,
        0,
        [Measures].[Num Failed]))
member [Measures].[Pct Filtered Success] as iif([Measures].[Num Lookup] = 0,
    null,
    100 * (coalesceempty([Measures].[Num Lookup] - [Measures].[Num Filtered Failed], 0) / [Measures].[Num Lookup]))
select
{
    [Measures].[Pct Success],
    [Measures].[Num Lookup],
    [Measures].[Num Failed],
    [Measures].[Num Filtered Failed],
    [Measures].[Bottom Filtered Failed],
    [Measures].[Pct Filtered Success]
} on 0,
[Calendar].[Date].children on 1
from [Cube]
BTW: you do not need to enclose the member definitions in the WITH clause in single quotes unless you are aiming for the SQL Server 2000 dialect of MDX.
And, according to a blog post by the former lead developer of the MDX processor, you can simplify the check for empty and null to just checking for zero.
EDIT:
As you stated that users want to use several different tolerances in what-if analyses: if your cube is not huge and the number of different tolerances is just a handful, you could pre-calculate the what-if cases, thus making use of the fast response times of Analysis Services for aggregated values.
To do this, you would proceed as follows: build a small dimension table, say dim_tolerance, containing e.g. the numbers 0 to 10, or the numbers 0, 1, 2, 3, 5, 8, 10, and 12, or whatever makes sense. Then build a new fact table referencing the same dimensions as the current one, plus the new one, and fill it with the single measure [num failed filtered], calculated as the dim_tolerance value dictates. You could then remove the [num failed] measure from the main fact table (as it would be the same as [num failed filtered] with tolerance 0). Make the attribute in the new dimension non-aggregatable with a default value of 0.
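For illustration, a rough relational sketch of that setup. All table and column names here are my own assumptions, and the CASE expression encodes one possible filtering rule ("suppress failures below the tolerance"), chosen so that tolerance 0 reproduces the original [num failed]:
-- hypothetical tolerance dimension
create table dim_tolerance (
    tolerance_key int primary key        -- e.g. 0 to 10, or 0, 1, 2, 3, 5, 8, 10, 12
);
-- hypothetical widened fact table: same grain as the current facts, plus the tolerance key
create table fact_lookup_filtered (
    calendar_key        int not null,
    tolerance_key       int not null references dim_tolerance (tolerance_key),
    num_lookup          int not null,
    num_failed_filtered int not null
);
-- pre-calculate every what-if case by applying each tolerance to every source row
insert into fact_lookup_filtered (calendar_key, tolerance_key, num_lookup, num_failed_filtered)
select f.calendar_key,
       t.tolerance_key,
       f.num_lookup,
       case when f.num_failed < t.tolerance_key then 0   -- suppressed under this tolerance
            else f.num_failed                             -- kept as-is
       end as num_failed_filtered
from   fact_lookup f                                      -- the current fact table (assumed name)
cross join dim_tolerance t;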

Related

Find entries with array of dates if any is currently available within a certain range in postgres

I have a postgres table with columns:
id: text
availabilities: integer[]
A certain ID can have multiple availabilities (different, non-continuous days over a range of up to a few years). Each availability is a Unix timestamp (in seconds) for a certain day.
Hours, minutes, seconds, ms are set to 0, i.e. a timestamp represents the start of a day.
Question:
How can I find, very fast, all IDs which contain at least one availability within a certain from-to range (also timestamps)?
I can also store them differently, e.g. as "days since epoch", if needed (to get 1 (day) steps instead of 86400 (second) steps).
However, if possible (and the speed is roughly the same), I want to use an array and one row per entry.
Example:
Data (0 = day-1, 86400 = day-2, ...)
| id | availabilities |
| 1 | [0 , 86400, 172800, 259200 ]
| 2 | [ 86400, 259200 ]
| 3 | [ , 345600 ]
| 4 | [ , 172800, ]
| 5 | [0, ]
Now I want to get a list of IDs which contains at least 1 availability which:
is between 86400 AND 259200 --> ID 1, 2, 4
is between 172800 AND 172800 --> ID 1, 4
is between 259200 AND (max-int) --> ID 1,2,3
In PostgreSQL, the unnest function is the best way to convert array elements to rows, and it performs well. You can use this function. Sample query:
with mytable as (
select 1 as id, '{12,2500,6000,200}'::int[] as pint
union all
select 2 as id, '{0,200,3500,150}'::int[]
union all
select 4 as id, '{20,10,8500,1100,9000,25000}'::int[]
)
select id, unnest(pint) as pt from mytable;
-- Return
1 12
1 2500
1 6000
1 200
2 0
2 200
2 3500
2 150
4 20
4 10
4 8500
4 1100
4 9000
4 25000
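To then answer the original range question, you can filter the unnested elements. A minimal sketch, assuming the real table is mytable with the id and availabilities columns from the question and the example bounds 86400 and 259200:
-- keep the IDs that have at least one availability inside the from-to range
select distinct id
from mytable, unnest(availabilities) as pt   -- lateral unnest: one row per array element
where pt between 86400 and 259200;
The distinct collapses an ID that matches on several days down to a single row.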

Average difference between values SQL

I'm trying to find the difference between values using SQL where the second value is always larger than the previous value.
Example Data:
| Car_ID | Trip_ID | Mileage |
|--------|---------|---------|
| 1      | 1       | 10,000  |
| 1      | 2       | 11,000  |
| 1      | 3       | 11,500  |
| 2      | 1       | 5,000   |
| 2      | 2       | 7,000   |
| 2      | 3       | 8,000   |
Expected calculation:
Car_ID: 1
(Trip 2 - Trip 1) = 1,000
(Trip 3 - Trip 2) = 500
Average Difference: 750
Car_ID: 2
(Trip 2 - Trip 1) = 2,000
(Trip 3 - Trip 2) = 1,000
Average Difference: 1,500
Expected Output:
| Car_ID | Average_Difference |
|--------|--------------------|
| 1      | 750                |
| 2      | 1,500              |
You can use aggregation:
select car_id,
(max(mileage) - min(mileage)) / nullif(count(*) - 1, 0)
from t
group by car_id;
That is, the average as you have defined it is the maximum minus the minimum divided by one less than the number of trips.
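If you want to compute the per-trip differences explicitly (for example to double-check the shortcut above), a window-function sketch over the same table t would be:
-- assumes the same table t(car_id, trip_id, mileage) as above
select car_id,
       avg(mileage - prev_mileage) as average_difference
from (
    select car_id,
           mileage,
           lag(mileage) over (partition by car_id order by trip_id) as prev_mileage
    from t
) diffs
where prev_mileage is not null      -- the first trip of each car has no previous reading
group by car_id;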

Custom Rolling Computation

Assume I have a model that has A(t) and B(t) governed by the following equations:
A(t) = {
WHEN B(t-1) < 10 : B(t-1)
WHEN B(t-1) >=10 : B(t-1) / 6
}
B(t) = A(t) * 2
The following table is provided as input.
SELECT * FROM model ORDER BY t;
| t | A | B |
|---|------|------|
| 0 | 0 | 9 |
| 1 | null | null |
| 2 | null | null |
| 3 | null | null |
| 4 | null | null |
I.e. we know the values of A(t=0) and B(t=0).
For each row, we want to calculate the value of A & B using the equations above.
The final table should be:
| t | A | B |
|---|---|----|
| 0 | 0 | 9 |
| 1 | 9 | 18 |
| 2 | 3 | 6 |
| 3 | 6 | 12 |
| 4 | 2 | 4 |
We've tried using lag, but because of the model's recursive-like nature, we end up only getting A & B at t=1:
CREATE TEMPORARY FUNCTION A_fn(b_prev FLOAT64) AS (
CASE
WHEN b_prev < 10 THEN b_prev
ELSE b_prev / 6.0
END
);
SELECT
t,
CASE WHEN t = 0 THEN A ELSE A_fn(LAG(B) OVER (ORDER BY t)) END AS A,
CASE WHEN t = 0 THEN B ELSE A_fn(LAG(B) OVER (ORDER BY t)) * 2 END AS B
FROM model
ORDER BY t;
Produces:
| t | A | B |
|---|------|------|
| 0 | 0 | 9 |
| 1 | 9 | 18 |
| 2 | null | null |
| 3 | null | null |
| 4 | null | null |
Each row is dependent on the row above it. It seems it should be possible to compute a single row at a time, while iterating through the rows? Or does BigQuery not support this type of windowing?
If it is not possible, what do you recommend?
Round #1 - starting point
Below is for BigQuery Standard SQL and works (for me) with up to 3M rows
#standardSQL
CREATE TEMP FUNCTION x(v FLOAT64, t INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>>
LANGUAGE js AS """
var i, result = [];
for (i = 1; i <= t; i++) {
if (v < 10) {v = 2 * v}
else {v = v / 3};
result.push({t:i, v});
};
return result
""";
SELECT 0 AS t, 0 AS A, 9 AS B UNION ALL
SELECT line.t, line.v / 2, line.v FROM UNNEST(x(9, 3000000)) line
Going above 3M rows produces Resources exceeded during query execution: UDF out of memory.
To overcome this, I think you should just implement it on the client, so no JS UDF limits apply. I think that is a reasonable "workaround", because it looks like you have no real data in BQ anyway, just one starting value (9 in this example). But even if you do have other valuable columns in the table, you can then JOIN the produced result back to the table on the t value, so it should be OK!
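A sketch of that join, assuming the CREATE TEMP FUNCTION x(...) definition above is kept in the same script and that your real table is `project.dataset.model` with a t column plus the other valuable columns (the table name is my assumption):
-- wrap the generated series in a CTE, then attach it to the existing rows on t
WITH generated AS (
  SELECT 0 AS t, 0 AS A, 9 AS B UNION ALL
  SELECT line.t, line.v / 2, line.v FROM UNNEST(x(9, 3000000)) line
)
SELECT m.* EXCEPT (A, B), g.A, g.B    -- replace the NULL A/B columns with the generated ones
FROM `project.dataset.model` m
JOIN generated g USING (t)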
Round #2 - It could be billions ... - so let's take care of scale, parallelization
Below is a little trick to avoid the JS UDF resource and/or memory errors
So, I was able to run it for 2B rows in one shot!
#standardSQL
CREATE TEMP FUNCTION anchor(seed FLOAT64, len INT64, batch INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>> LANGUAGE js AS """
var i, result = [], v = seed;
for (i = 0; i <= len; i++) {
if (v < 10) {v = 2 * v} else {v = v / 3};
if (i % batch == 0) {result.push({t:i + 1, v})};
}; return result
""";
CREATE TEMP FUNCTION x(value FLOAT64, start INT64, len INT64)
RETURNS ARRAY<STRUCT<t INT64, v FLOAT64>>
LANGUAGE js AS """
var i, result = []; result.push({t:0, v:value});
for (i = 1; i < len; i++) {
if (value < 10) {value = 2 * value} else {value = value / 3};
result.push({t:i, v:value});
}; return result
""";
CREATE OR REPLACE TABLE `project.dataset.result` AS
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
anchors AS (SELECT line.* FROM settings, UNNEST(anchor(init, len, batch)) line)
SELECT 0 AS t, 0 AS A, init AS B FROM settings UNION ALL
SELECT a.t + line.t, line.v / 2, line.v
FROM settings, anchors a, UNNEST(x(v, t, batch)) line
In the above query, you "control" the initial values in the line below:
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
In the above example, 9 is the initial value, 2,000,000,000 is the number of rows to be calculated, and 1000 is the batch size to process with (this one is important for keeping the BQ engine from throwing resource and/or memory errors - you cannot make it too big or too small - I feel I got some sense of what it needs to be, but not enough to try to formulate it).
Some stats (settings - execution time):
1M: SELECT 9 init, 1000000 len, 1000 batch - 0 min 9 sec
10M: SELECT 9 init, 10000000 len, 1000 batch - 0 min 50 sec
100M: SELECT 9 init, 100000000 len, 600 batch - 3 min 4 sec
100M: SELECT 9 init, 100000000 len, 40 batch - 2 min 56 sec
1B: SELECT 9 init, 1000000000 len, 10000 batch - 29 min 39 sec
1B: SELECT 9 init, 1000000000 len, 1000 batch - 27 min 50 sec
2B: SELECT 9 init, 2000000000 len, 1000 batch - 48 min 27 sec
Round #3 - some thoughts and comments
Obviously, as I mentioned in #1 above, this type of calculation is more suited to being implemented on the client of your choice, so it is hard for me to judge the practical value of the above - but I really had fun playing with it! In reality, I had a few more cool ideas in mind and also implemented and played with them, but the above (in #2) was the most practical/scalable one.
Note: the most interesting part of the above solution is the anchors table. It is very cheap to generate and allows you to set anchors at batch-size intervals - so having this, you can for example calculate the value of row 2,000,035 or 1,123,456,789 without actually processing all previous rows, and this will take a fraction of a second. Or you can parallelize the calculation of all rows by starting several threads/calculations using the respective anchors, etc. Quite a number of opportunities.
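For illustration, a sketch of that single-row lookup, reusing the anchor() and x() temp functions and the settings from Round #2 in the same script (2,000,035 is just the example row number from above):
-- find the closest anchor at or before the target row, then iterate only the remainder
WITH settings AS (SELECT 9 init, 2000000000 len, 1000 batch),
anchors AS (SELECT line.* FROM settings, UNNEST(anchor(init, len, batch)) line),
nearest AS (
  SELECT t, v FROM anchors WHERE t <= 2000035 ORDER BY t DESC LIMIT 1
)
SELECT n.t + line.t AS t, line.v / 2 AS A, line.v AS B
FROM nearest n, UNNEST(x(n.v, n.t, 2000035 - n.t + 1)) line
WHERE n.t + line.t = 2000035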
Finally, it really depends on your specific use case which way to go further - so I am leaving that up to you.
It seems it should be possible to compute a single row at a time, while iterating through the rows
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now.
So, conceptually your process could look like below script:
DECLARE b_prev FLOAT64 DEFAULT NULL;
DECLARE t INT64 DEFAULT 0;
DECLARE arr ARRAY<STRUCT<t INT64, a FLOAT64, b FLOAT64>> DEFAULT [STRUCT(0, 0.0, 9.0)];
SET b_prev = 9.0 / 2;
LOOP
SET (t, b_prev) = (t + 1, 2 * b_prev);
IF t >= 100 THEN LEAVE;
ELSE
SET b_prev = CASE WHEN b_prev < 10 THEN b_prev ELSE b_prev / 6.0 END;
SET arr = (SELECT ARRAY_CONCAT(arr, [(t, b_prev, 2 * b_prev)]));
END IF;
END LOOP;
SELECT * FROM UNNEST(arr);
Even though the above script is simpler, represents the logic more directly for non-technical personnel, and is easier to manage - it does not fit scenarios where you need to loop through 100 or more iterations. For example, the above script took close to 2 min, while my original solution for the same 100 rows took just 2 sec.
But still great for simple / smaller cases

Percentage of variable corresponding to percentage of other variable

I have two numerical variables, and would like to calculate the percentage of one variable corresponding to at least 50% of the other variable's sum.
For example:
| A | B  |
|---|----|
| 2 | 8  |
| 1 | 20 |
| 3 | 12 |
| 5 | 4  |
| 2 | 7  |
| 1 | 11 |
| 4 | 5  |
Here, the sum of column B is 67, so I'm looking for the rows (in B's descending order) whose cumulative sum reaches at least half of that (33.5).
In that case, they are rows 2, 3 & 6 (cumulative sum of 43). The sum of these rows' column A is 5, which I want to compare to the total sum of column A (18).
Therefore, the result I'm looking for is 5 / 18 * 100 ≈ 27.78%.
I'm looking for a way to implement this in QlikSense, or in SQL.
Here's one way you can do it - there is probably some optimisation to be done, but this gives what you want.
Source:
LOAD
*,
RowNo() as RowNo_Source
Inline [
A , B
2 , 8
1 , 20
3 , 12
5 , 4
2 , 7
1 , 11
4 , 5
];
SourceSorted:
NoConcatenate LOAD *,
RowNo() as RowNo_SourceSorted
Resident Source
Order by B asc;
drop table Source;
BTotal:
LOAD sum(B) as BTotal
Resident SourceSorted;
let BTotal=peek('BTotal',0);
SourceWithCumu:
NoConcatenate LOAD
*,
rangesum(peek('BCumu'),B) as BCumu,
$(BTotal) as BTotal,
rangesum(peek('BCumu'),B)/$(BTotal) as BCumuPct,
if(rangesum(peek('BCumu'),B)/$(BTotal)>=0.5,A,0) as AFiltered
Resident SourceSorted;
Drop Table SourceSorted;
I worked with a few debug fields that might be useful, but you could of course remove these.
Then in the front end you do your calculation of sum(AFiltered)/sum(A) to get the stat you want and format it as a percentage.
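If you would rather do this on the SQL side (the question mentions both), here is a window-function sketch, assuming a hypothetical table src with columns A and B:
-- take the rows needed, in descending B order, to reach at least half of sum(B),
-- then express their A total as a percentage of the overall A total
with ranked as (
    select A,
           B,
           sum(B) over (order by B desc rows unbounded preceding) as b_running,
           sum(B) over () as b_total,
           sum(A) over () as a_total
    from src
)
select 100.0 * sum(A) / max(a_total) as pct_of_a
from ranked
where b_running - B < b_total / 2.0;   -- a row is needed while the running total before it is still below 50%
For the sample data this returns 5 / 18 * 100 ≈ 27.78%, i.e. the same rows the load script above selects.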

Tabulate Command Stata

I don't know if Stata can do this, but I use the tabulate command a lot in order to find frequencies. For instance, I have a success variable which takes on values 0 to 1, and I would like to know the success rate for a certain group of observations, i.e. tab success if group==1. I was wondering if I can do sort of the inverse of this operation. That is, I would like to know if I can find a value of "group" for which the frequency is greater than or equal to 15%, for example.
Is there a command that does this?
Thanks
As an example
sysuse auto
gen success=mpg<29
Now I want to find the value of price such that the frequency of the success variable is greater than 75% for example.
According to @Nick:
ssc install groups
sysuse auto
count
74
return list   // optional: inspect the saved results
local nobs = r(N)   // r(N) gives the total number of observations
groups rep78, sel(f > (0.15*`nobs'))   // gives the groups for which freq > 15%
+---------------------------------+
| rep78 Freq. Percent % <= |
|---------------------------------|
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
+---------------------------------+
groups rep78, sel(f > (0.10*`nobs'))   // more than 10%
+----------------------------------+
| rep78 Freq. Percent % <= |
|----------------------------------|
| 2 8 11.59 14.49 |
| 3 30 43.48 57.97 |
| 4 18 26.09 84.06 |
| 5 11 15.94 100.00 |
+----------------------------------+
I'm not sure if I fully understand your question/situation, but I believe this might be useful. You can egen a variable that is equal to the mean of success, by group, and then see which observations have the value for mean(success) that you're looking for.
egen avgsuccess = mean(success), by(group)
tab group if avgsuccess >= 0.15
list group if avgsuccess >= 0.15
Does that accomplish what you want?