SSRS - Median by Group - Jagged Array? - vba

I'm faced with a challenge in two parts. I've been requested to replace 3 columns in a Matrix that aggregates on Name in the row group. The 3 columns are outside of the Column group.
Challenge 1 - A matrix wants to summarize the dataset. There seems to be no way to show the raw detail rows (and then hide them) so that those rows can populate an array at runtime.
Challenge 2 - I need to calculate the median BY name. This means that at runtime I either need to calculate one median per name, reset the array, and start fresh for the next name value, or I need a jagged structure where each element is itself an array corresponding to a name.
I'm also a total code monkey at VB.
Here's what I've currently borrowed from an online post about calculating Median in SSRS.
' Shared collection of values for the current group.
Dim values As System.Collections.ArrayList

' Called once per detail row to collect a value.
Function AddValue(ByVal newValue As Decimal)
    If (values Is Nothing) Then
        values = New System.Collections.ArrayList()
    End If
    values.Add(newValue)
End Function

' Returns the median of the collected values.
Function GetMedian() As Decimal
    Dim count As Integer = values.Count
    If (count > 0) Then
        values.Sort()
        If (count Mod 2 = 1) Then
            ' Odd count: the middle element (integer division).
            GetMedian = values(count \ 2)
        Else
            ' Even count: average of the two middle elements.
            GetMedian = (CDec(values(count \ 2 - 1)) + CDec(values(count \ 2))) / 2
        End If
    End If
End Function

I think I have a SQL solution that doesn't utilize a loop.
;WITH Counts AS
(
    SELECT SalesPerson, c = COUNT(*)
    FROM dbo.Sales
    GROUP BY SalesPerson
)
SELECT a.SalesPerson, Median = AVG(0. + Amount)
FROM Counts a
CROSS APPLY
(
    SELECT TOP (((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
        b.Amount, r = ROW_NUMBER() OVER (ORDER BY b.Amount)
    FROM dbo.Sales b
    WHERE a.SalesPerson = b.SalesPerson
    ORDER BY b.Amount
) p
WHERE r BETWEEN ((a.c - 1) / 2) + 1 AND (((a.c - 1) / 2) + (1 + (1 - a.c % 2)))
GROUP BY a.SalesPerson;
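For comparison, if the data source is SQL Server 2012 or later, the same per-group median can be sketched with PERCENTILE_CONT, so the value arrives pre-computed and the report only has to display it (table and column names are assumed to match the query above):
-- Sketch: median Amount per SalesPerson using PERCENTILE_CONT (SQL Server 2012+).
-- Table and column names are assumptions carried over from the query above.
SELECT DISTINCT
    SalesPerson,
    Median = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Amount)
                                  OVER (PARTITION BY SalesPerson)
FROM dbo.Sales;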

Perform loop and calculation on BigQuery Array type

My original data, where B is an array of INT64:
I want to calculate the difference B[n+1] - B[n], resulting in a new table as follows:
I figured out that I can roughly achieve this using a LOOP and an IF condition:
DECLARE x INT64 DEFAULT 0;
LOOP
  SET x = x + 1;
  IF (x < ARRAY_LENGTH(table.B)) THEN
    INSERT INTO newTable
    SELECT A, B[OFFSET(x + 1)] - B[OFFSET(x)] FROM table;
  END IF;
END LOOP;
The problem is that the above idea doesn't work across the rows of my data: I would still need to loop through each row of the table, and I can't find a way to integrate the scripting part into a normal query, where I could write
SELECT A, [calculation script] from table
Can someone point me to how I can do this, or suggest a better way to solve the problem?
Thank you.
Below actually works - BigQuery
select * replace(
  array(
    select diff
    from (
      select offset, lead(el) over(order by offset) - el as diff
      from unnest(B) el with offset
    )
    where not diff is null
    order by offset
  ) as B
)
from `project.dataset.table` t
If applied to the sample data in your question, the output is:
You can use unnest() with offset for this purpose:
select id, a,
       array_agg(b_el - prev_b_el order by n) as b_diffs
from (select t.*, b_el,
             lag(b_el) over (partition by t.id order by n) as prev_b_el
      from t cross join
           unnest(b) b_el with offset n
     ) t
where prev_b_el is not null
group by t.id, t.a

Out of range integer: infinity

I'm trying to work through a problem that's a bit hard to explain, and I can't expose any of the data I'm working with. What I'm trying to get my head around is the error below, raised when running the query that follows. I've renamed some of the tables and columns for sensitivity reasons, but the structure is the same.
"Error from Query Engine - Out of range for integer: Infinity"
WITH accounts AS (
    SELECT t.user_id
    FROM table_a t
    WHERE t.type like '%Something%'
),
CTE AS (
    SELECT
        st.x_user_id,
        ad.name as client_name,
        sum(case when st.score_type = 'Agility' then st.score_value else 0 end) as score,
        st.obs_date,
        ROW_NUMBER() OVER (PARTITION BY st.x_user_id, ad.name ORDER BY st.obs_date) AS rn
    FROM client_scores st
    LEFT JOIN account_details ad on ad.client_id = st.x_user_id
    INNER JOIN accounts on st.x_user_id = accounts.user_id
    --WHERE st.x_user_id IN (101011115,101012219)
    WHERE st.obs_date >= '2020-05-18'
    group by 1,2,4
)
SELECT
    c1.x_user_id,
    c1.client_name,
    c1.score,
    c1.obs_date,
    CAST(COALESCE (((c1.score - c2.score) * 1.0 / c2.score) * 100, 0) AS INT) AS score_diff
FROM CTE c1
LEFT JOIN CTE c2 on c1.x_user_id = c2.x_user_id and c1.client_name = c2.client_name and c1.rn = c2.rn + 2
I know the query itself works, because when I get rid of the first CTE and hard-code 2 IDs into the commented-out WHERE clause, it returns the data I want. But I also need it to run based on the first CTE, which has ~5k unique IDs.
Here is a sample output if I try with 2 IDs:
Based on the number of rows returned per ID above, I would expect it to return 5000 * 3 = 15,000 rows.
What could be causing the out of range for integer error?
This line is likely your problem:
CAST(COALESCE (((c1.score - c2.score) * 1.0 / c2.score) * 100, 0) AS INT) AS score_diff
When the value of c2.score is 0, 1.0/c2.score will be infinity and will not fit into an integer type that you’re trying to cast it into.
The reason it’s working for the two users in your example is that they don’t have a 0 value for c2.score.
You might be able to fix this by changing to:
CAST(COALESCE (((c1.score - c2.score) * 1.0 / NULLIF(c2.score, 0)) * 100, 0) AS INT) AS score_diff
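For illustration, a minimal sketch of the difference (the derived table here is made up; only the COALESCE/NULLIF pattern matters): when the previous score is 0, NULLIF turns the divisor into NULL, the whole expression becomes NULL, and COALESCE maps it to 0 instead of infinity.
-- Illustrative only: prev = 0 would otherwise divide by zero / produce Infinity;
-- with NULLIF the division yields NULL, which COALESCE replaces with 0.
SELECT
    prev,
    CAST(COALESCE(((10 - prev) * 1.0 / NULLIF(prev, 0)) * 100, 0) AS INT) AS score_diff
FROM (SELECT 0 AS prev UNION ALL SELECT 5 AS prev) AS t;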

Split float between list of numbers

I have a problem with splitting 0.00xxx float values between numbers.
Here is an example of the input data; row 0 is the sum of the float values in rows 1-3.
As the result, I want to see rounded numbers without losing the sum of rows 1-3:
IN:
0 313.726
1 216.412
2 48.659
3 48.655
OUT:
0 313.73
1 216.41
2 48.66
3 48.66
How it should work:
The idea is to split the smallest remainder (in our example, the 0.002 from the value 216.412) between the largest ones: 0.001 to 48.659 = 48.660 and 0.001 to 48.655 = 48.656. After this we can round the numbers without losing data.
After sitting on this problem yesterday, I found a solution. I think the query should look like this:
select test.*,
       sum(value - trunc(value, 2)) over (partition by case when id = 0 then 0 else 1 end) part,
       row_number() over (partition by case when id = 0 then 0 else 1 end order by value - trunc(value, 2) desc) rn,
       case when row_number() over (partition by case when id = 0 then 0 else 1 end order by value - trunc(value, 2) desc) / 100 <=
                 round(sum(value - trunc(value, 2)) over (partition by case when id = 0 then 0 else 1 end), 2)
            then trunc(value, 2) + 0.01
            else trunc(value, 2)
       end result
from test;
But it still seems strange to me to add the constant value 0.01 when computing the result.
Any ideas to improve this query?
You could use the round() SQL function when presenting results. round()'s second argument is the number of decimal places you want to round the number to. Issuing this select on the test table:
select id, round(value, 2) from test;
gives you the following result
0 313.73
1 216.41
2 48.66
3 48.65
Generally, you can use the stored numbers for summations and then use round() only when presenting the results. Here is a way to do the sum at full precision and round only the final result:
select sum(value) from test where id != 0
gives the result: 313.726
select round(sum(value), 2) from test where id != 0
gives the result: 313.73
By the way, allow me two observations:
1) The rounding you give for id = 3 confuses me: 48.654 rounds to 48.65, not 48.66, at two decimal places. Am I missing something?
2) Strictly speaking, this is not a PL/SQL issue as labeled; it is entirely in the realm of SQL. However, there is a round() function in PL/SQL as well, and the same principles apply.
select id, value,
       case when id <> max(id) over () then round(value, 2)
            else round(value, 2) - sum(round(value, 2)) over () +
                 round(first_value(value) over (order by id), 2) * 2
       end val_rnd
from test
Output:
ID VALUE VAL_RND
------ ---------- ----------
0 313.726 313.73
1 216.413 216.41
2 48.659 48.66
3 48.654 48.66
The query above works, but it moves all of the difference to the last row. This is not "honest", and maybe not what you are after in other scenarios.
The least "honest" behavior shows up with a large number of values that are all equal to 0.005.
To make a full distribution you need to:
sum all original values in the sub-rows and subtract the rounded total value from the row with id 0,
use row_number() to sort the sub-rows by the difference between the rounded value and the original value (possibly descending, depending on the sign of the difference; use sign() and abs()),
assign to each row its value increased by 0.01 (or decreased if the difference < 0) until you reach difference / 0.01 (use case when),
union the row with id = 0 containing the rounded sum,
optionally sort the results.
It's hard (but achievable) in one query; a rough sketch follows. An alternative is a PL/SQL procedure or function, which might be more readable.
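For illustration, a rough single-query sketch in that spirit, using a largest-remainder allocation over truncated values (it assumes the same test(id, value) table and, like the pragmatic query further down, only adjusts the sub-rows with id != 0):
-- Rough sketch, not an exact implementation of the steps above: the rows whose
-- truncated-away remainder is largest absorb one cent each until the rounded
-- total is matched.
with parts as (
  select id, value, trunc(value, 2) tv,
         row_number() over (order by value - trunc(value, 2) desc) rn
  from test
  where id != 0
), diff as (
  -- number of cents still missing after truncation
  select round((round(sum(value), 2) - sum(tv)) * 100) cnt
  from parts
)
select p.id, p.tv + case when p.rn <= d.cnt then 0.01 else 0 end result
from parts p, diff d
order by p.id;
For the sample data this yields 216.41, 48.66, 48.66, which preserves the rounded total of 313.73.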
If I understand you correctly, you don't want to use round alone because the rounded partial numbers don't add up to the rounded total.
In this case a simple trick applies: you use round for all but the last number, and the last value is calculated as the difference between the rounded sum and the rounded parts so far (all but the last one).
You may express this with analytic functions as follows:
WITH total AS
(SELECT id, value, ROUND(value,2) value_rounded FROM test WHERE id = 0
),
rounded AS
( SELECT id, value, ROUND(value,2) value_rounded FROM test WHERE id != 0
)
SELECT id, value_rounded FROM total
UNION ALL
SELECT id,
CASE
WHEN row_number() over (order by id) != COUNT(*) over ()
THEN
/* not the last row - regular result */
value_rounded
ELSE
/* last row - corrected result */
(select value_rounded from total) - SUM(value_rounded) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
END AS value
FROM rounded
ORDER BY id;
Note that this is the test for the last number:
row_number() over (order by id) != COUNT(*) over ()
and this is the sum of all parts from the beginning (UNBOUNDED PRECEDING) up to the one before last (1 PRECEDING):
SUM(value_rounded) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
I split your data into two sources: total (one row with the total) and rounded (the rounded parts).
UPDATE
In some cases the last corrected number shows an "ugly" large difference to the original value,
because the differences in one rounding direction are higher than in the opposite one.
The following select takes this into account and distributes the difference between the parts.
The example below illustrates this with a lot of 0.005s.
WITH nums AS
(SELECT rownum id, 0.005 value FROM dual connect by level <= 5
),
rounded AS
( SELECT id, value, ROUND(value,2) value_rounded FROM nums
),
with_diff as
(SELECT id, value, value_rounded,
-- difference so far - between the exact SUM and SUM of rounded parts
-- cut to two decimal points
floor(100* (
sum(value) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -
sum(value_rounded) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)))
/ 100 diff_so_far
FROM rounded),
delta_diff as
(select id, value, value_rounded,DIFF_SO_FAR,
DIFF_SO_FAR - LAG(DIFF_SO_FAR,1,0) over (order by ID) as diff_delta
from with_diff)
SELECT id, value,
CASE
WHEN row_number() over (order by id) != COUNT(*) over ()
THEN
/* not the last row - take the rounded value and ... */
value_rounded +
/* ... add or subtract the delta difference */
diff_delta
ELSE
/* last row - corrected result */
round(sum(value) over(),2) - SUM(value_rounded + diff_delta) over (order by id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
END AS value_rounded, diff_delta
FROM delta_diff
ORDER BY id;
ID VALUE VALUE_ROUNDED DIFF_DELTA
---------- ---------- ------------- ----------
1 ,005 0 -0,01
2 ,005 ,01 0
3 ,005 0 -0,01
4 ,005 ,01 0
5 ,005 ,01 -0,01
A pragmatic solution based on the following rules:
1) Check the difference between the rounded sum and the sum of the rounded parts.
select round(sum(value),2) - sum(round(value,2)) from test where id != 0;
2) Apply this difference:
e.g. if you get 0.01, one rounded part must be increased by 0.01;
if you get -0.02, two rounded parts must be decreased by 0.01.
The query below simply corrects the last N parts:
with diff as (
select round(sum(value),2) - sum(round(value,2)) diff from test where id != 0
), diff_values as
(select sign(diff)*.01 diff_value, abs(100*diff) corr_cnt
from diff)
select id, round(value,2)
+ case when row_number() over (order by id desc) <= corr_cnt then diff_value else 0 end result
from test, diff_values where id != 0
order by id;
ID RESULT
---------- ----------
1 216,41
2 48,66
3 48,66
If the number of corrected records is much higher than two, check the data and the rounding precision.

Median depending on other variable in SQL

I want to calculate the median conditionally, say separately for men and women given a sex variable. I would like to use a modification of one of the following two alternative queries, provided here http://sqlperformance.com/2012/08/t-sql-queries/median as the fastest methods for calculating the median.
Alternative 1
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
SELECT AVG(1.0 * val)
FROM (
    SELECT val FROM dbo.EvenRows
    ORDER BY val
    OFFSET (@c - 1) / 2 ROWS
    FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
Alternative 2
SELECT @Median = AVG(1.0 * val)
FROM
(
    SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
    FROM dbo.EvenRows AS o
    CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);
Where do I put the GROUP BY? Can someone advise me how to make something like a function with parameters MEDIAN(ColumnToCalculateMedian, ColumnContainingSex)?
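One way to sketch this, in the spirit of Alternative 2, is to partition both the row numbering and the count by the grouping column (the table and column names below, dbo.EvenRows, val and sex, are assumptions for illustration):
-- Per-group median: rank and count within each sex, then average the middle value(s).
SELECT sex, AVG(1.0 * val) AS median
FROM
(
    SELECT sex, val,
           rn = ROW_NUMBER() OVER (PARTITION BY sex ORDER BY val),
           c  = COUNT(*) OVER (PARTITION BY sex)
    FROM dbo.EvenRows
) AS x
WHERE rn IN ((c + 1) / 2, (c + 2) / 2)
GROUP BY sex;
On SQL Server 2012 and later, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER (PARTITION BY sex) is another option.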

Pearson Correlation SQL Server

I have two tables:
Table 1:
ID,YRMO,Counts
1,Dec 2013,4
1,Jan 2014,6
1,Feb 2014,7
2,Jan 2014,6
2,Feb 2014,8
Table 2:
ID,YRMO,Counts
1,Dec 2013,10
1,Jan 2014,8
1,March 2014,12
2,Jan 2014,6
2,Feb 2014,10
I want to find the Pearson correlation coefficient for each set of IDs. There are more than 200 different IDs.
Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.
More can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html
in the section on calculating correlation.
To calculate the Pearson correlation coefficient, you need to first calculate the mean, then the standard deviation, and then the correlation coefficient, as outlined below.
1. Calculate Mean
insert into tab2 (tab1_id, mean)
select ID, sum([counts]) /
(select count(*) from tab1) as mean
from tab1
group by ID;
2. Calculate standard deviation
update tab2
set stddev = (
select sqrt(
sum([counts] * [counts]) /
(select count(*) from tab1)
- mean * mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);
3. Finally Pearson Correlation Coefficient
select ID,
((sf.sum1 / (select count(*) from tab1)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
sum(r1.[counts] * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID
Which on your posted data results in
See a demo fiddle here http://sqlfiddle.com/#!3/0da20/5
EDIT:
Well, I refined it a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; I get 0.999996000000000 for ID = 1.
This could be a good starting point for you. You can refine the calculation further from here.
create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
    declare @mean numeric(16,5);
    declare @stddev numeric(16,5);
    declare @count numeric(16,5);
    declare @pcc numeric(16,12);
    declare @store numeric(16,7);

    select @count = CONVERT(numeric(16,5), count(case when Id = @id then 1 end)) from tab1;
    select @mean = convert(numeric(16,5), sum([Counts])) / @count
    from tab1 WHERE ID = @id;
    select @store = (sum(counts * counts) / @count) from tab1 WHERE ID = @id;

    set @stddev = sqrt(@store - (@mean * @mean));
    set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));

    return @pcc;
end
Call the function like
select db_name.dbo.calculate_PCC(1)
A Single-Pass Solution:
There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are simple, single-pass, and I believe, correct formulas for both:
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields correlations very near 0.5; TOP 300, around 0.33; etc. There is a place ("+ 0") to add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL - integer math can significantly impact these calcs.
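To get one coefficient per ID from the two tables in the question, a sketch along the same lines (the table names tab1 and tab2 and the join on ID and YRMO are assumptions; months missing from either table simply drop out of the join):
-- Sketch: per-ID Pearson correlation (population form) over paired monthly Counts.
-- tab1/tab2 are assumed names for the two tables in the question.
-- A group with zero variance would divide by zero; guard with NULLIF if needed.
SELECT p.ID,
       (AVG(x * y) - AVG(x) * AVG(y)) /
       (SQRT(AVG(x * x) - AVG(x) * AVG(x)) * SQRT(AVG(y * y) - AVG(y) * AVG(y)))
           AS correlation_coefficient_population
FROM (
    SELECT t1.ID,
           CAST(t1.Counts AS DECIMAL(18, 6)) AS x,
           CAST(t2.Counts AS DECIMAL(18, 6)) AS y
    FROM tab1 AS t1
    JOIN tab2 AS t2
      ON t2.ID = t1.ID AND t2.YRMO = t1.YRMO
) AS p
GROUP BY p.ID;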