TSQL - comparing grouped values within a table - sql

I need to compare grouped data to look for shifts in a calculated value. The output of my current SQL looks something like this...
Grp_ID_1 / Metric / State / Value
A Metric1 OH 50
B Metric1 OH 65
A Metric1 CA 20
B Metric1 CA 35
In the example above, I need to calculate the difference between A-Metric1-OH value of 50 and B-metric1-OH value of 65.

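You can use LEAD to calculate the difference between rows. Partition by Metric and State so that each row is compared only with the next group for the same metric and state.
SELECT Grp_ID_1, Metric, State, Value,
LEAD(Value) OVER (PARTITION BY Metric, State ORDER BY Grp_ID_1) AS NextValue,
Value - LEAD(Value) OVER (PARTITION BY Metric, State ORDER BY Grp_ID_1) AS ValueDif
FROM yourTable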
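As a minimal sketch of the windowed comparison, the snippet below loads the question's four sample rows into an in-memory SQLite database and computes, per metric and state, the current value minus the next group's value. The table name grouped_values is hypothetical, and SQLite (3.25 or later, for window functions) stands in for SQL Server here, so treat it as an illustration rather than the exact T-SQL.

```python
import sqlite3

# Hypothetical table name; the columns follow the sample output in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grouped_values (grp_id_1 TEXT, metric TEXT, state TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO grouped_values VALUES (?, ?, ?, ?)",
    [
        ("A", "Metric1", "OH", 50),
        ("B", "Metric1", "OH", 65),
        ("A", "Metric1", "CA", 20),
        ("B", "Metric1", "CA", 35),
    ],
)

# Compare each row with the next group inside the same (metric, state) partition.
query = """
SELECT metric, state, grp_id_1, value,
       value - LEAD(value) OVER (PARTITION BY metric, state ORDER BY grp_id_1) AS value_dif
FROM grouped_values
ORDER BY metric, state, grp_id_1
"""
shifts = conn.execute(query).fetchall()
for row in shifts:
    print(row)
```

The last group in each partition has no following row, so its difference is NULL (None in Python) rather than a misleading number.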

SELECT t.grp_ID_1, t.metric, t.state, t.value,
(SELECT MAX(t2.value)
FROM tablename t2
WHERE t2.metric = t.metric AND t2.state = t.state
) - t.value AS Difference
FROM tablename t
WHERE t.state = 'OH'

Related

Oracle SQL: Using LAG when the current row is missing

I have a table from which I'm trying to extract information using a LAG function.
Type  Date  Value
A     01    1
A     02    2
B     01    3
I'm trying to get lines by Type with the Value from this month and the month before that, so ideally:
Type  Date  Value M  Value M-1
A     02    2        1
B     02    0        3
SELECT
Type,
Date,
Value as Value M,
LAG (Value,1,0) over(PARTITION BY Type ORDER BY Date) as Value M-1
FROM Table
Except that, of course, because there is no line for Type B and Month 02, I don't get a line for Type B.
Do you have any suggestions?
A simple lag probably won't do the trick, because you need to construct a record for the last month if it doesn't exist for a given type. If your date is stored as an integer as in the sample data, a pattern like this is something to consider. If it's stored as a date, you'll need to have some kind of ranking baked in to the join or extract(month from date) - 1 (being careful about January), but this should give the gist.
WITH TYPE_LATEST_MONTH AS
( SELECT DISTINCT
TYPE,
(SELECT MAX(DATE) FROM TABLE) AS LATEST_MONTH
FROM TABLE
)
SELECT TLM.TYPE,
TLM.LATEST_MONTH AS DATE,
COALESCE(TBL.VALUE, 0) AS VALUE_M,
COALESCE(TBL_PREV.VALUE, 0) AS VALUE_M_MINUS_1
FROM TYPE_LATEST_MONTH TLM
LEFT
JOIN TABLE TBL
ON TLM.Type = TBL.Type
AND TLM.LATEST_MONTH = TBL.DATE
LEFT
JOIN TABLE TBL_PREV
ON TLM.Type = TBL_PREV.Type
AND TLM.LATEST_MONTH - 1 = TBL_PREV.DATE
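The pattern described in this answer (build one row per type for the latest month, then left join the current and previous month onto it) can be tried on the question's sample data. This is only a sketch: it uses SQLite instead of Oracle, assumes the month is stored as an integer, and the names readings, month, cur and prev are hypothetical stand-ins for the placeholders TABLE and DATE, which are reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names; month is assumed to be an integer.
conn.execute("CREATE TABLE readings (type TEXT, month INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("A", 1, 1), ("A", 2, 2), ("B", 1, 3)],
)

query = """
WITH type_latest_month AS (
    -- one row per type, stamped with the latest month found anywhere in the table
    SELECT DISTINCT type, (SELECT MAX(month) FROM readings) AS latest_month
    FROM readings
)
SELECT tlm.type,
       tlm.latest_month,
       COALESCE(cur.value, 0)  AS value_m,
       COALESCE(prev.value, 0) AS value_m_minus_1
FROM type_latest_month tlm
LEFT JOIN readings cur
       ON cur.type = tlm.type AND cur.month = tlm.latest_month
LEFT JOIN readings prev
       ON prev.type = tlm.type AND prev.month = tlm.latest_month - 1
ORDER BY tlm.type
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Type B has no row for month 02, yet it still appears in the output because the driving CTE, not the data rows, decides which (type, month) pairs exist.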

SQL Sum over partition "NOT" by column

I need to build analytical SQL queries in which the client can specify any metrics (sums of the values in a specific column) and dimensions (columns to group by).
Assume that I have a table with columns hour, dim_a, dim_b, metric_a, metric_b, metric_c with the values shown in the CSV below:
hour,dim_a,dim_b,metric_a,metric_b
0,A,X,4,4
0,A,Y,4,24
0,B,Y,20,24
1,B,Y,21,35
1,A,Y,4,35
1,C,Y,10,35
2,B,Y,21,30
2,C,Y,3,30
2,A,Y,6,30
Take a look at metric_b. This metric is always the same when hour and dim_b are the same, regardless of the value of dim_a. For example:
1,B,Y,21,35
1,A,Y,4,35
1,C,Y,10,35
If we select the columns hour, dim_b, metric_b and take the distinct values, the table looks like this:
hour,dim_b,metric_b
0,X,4
0,Y,24
1,Y,35
2,Y,30
All aggregations of metric_b should be computed from these values.
I would like to run analytical queries over this data, grouping by specific dimensions and aggregating the metrics, with special handling for metric_b.
When I want to group by hour, dim_a, dim_b and see metric_a and metric_b, the expected result is:
hour,dim_a,dim_b,metric_a,metric_b
0,A,X,4,4
0,A,Y,4,24
0,B,Y,20,24
1,B,Y,21,35
1,A,Y,4,35
1,C,Y,10,35
2,B,Y,21,30
2,C,Y,3,30
2,A,Y,6,30
When I want to group by dim_a, dim_b and see metric_a and metric_b, the expected result is:
dim_a,dim_b,metric_a,metric_b
A,X,4,4
A,Y,14,89
B,Y,62,89
C,Y,13,89
The value of metric_b is calculated as 89 = 24 + 35 + 30 for Y and 4 = 4 for X.
When I want to group by dim_b and see metric_a and metric_b, the expected result is:
dim_b,metric_a,metric_b
X,4,4
Y,89,89
The value of metric_b is calculated as 89 = 24 + 35 + 30 for Y and 4 = 4 for X.
And finally, when I want to group by dim_a and see metric_a and metric_b, the expected result is:
dim_a,metric_a,metric_b
A,18,93
B,62,93
C,13,93
The value of metric_b is calculated as 93 = 24 + 35 + 30 + 4.
So the aggregation of metric_b should be a sum of metric_b that ignores dim_a as a grouping column but respects every other dimension. Is there SQL syntax that could help me do this?
In addition, these queries are going to run on AWS Redshift. There are 20 metrics and 16 dimensions, so 36 columns, and there will be up to 100 billion rows.
for number 2:
SELECT *
FROM (
SELECT dim_a
,dim_b
,sum(metric_a) a
FROM dbo.Table_2 t
GROUP BY dim_a
,dim_b
) a
CROSS APPLY (
SELECT sum(metric_b) b
FROM (
SELECT DISTINCT metric_b
,hour
,dim_b
FROM dbo.Table_2
) t2
WHERE t2.dim_b = a.dim_b
) c
for number 3 :
SELECT *
FROM (
SELECT dim_b
,sum(metric_a) a
FROM dbo.Table_2 t
GROUP BY dim_b
) a
CROSS APPLY (
SELECT sum(metric_b) b
FROM (
SELECT DISTINCT metric_b
,hour
,dim_b
FROM dbo.Table_2
) t2
WHERE t2.dim_b = a.dim_b
) c
for number 4:
SELECT *
FROM (
SELECT dim_a
,sum(metric_a) a
FROM dbo.Table_2 t
GROUP BY dim_a
) a
CROSS APPLY (
SELECT sum(metric_b) b
FROM (
SELECT DISTINCT metric_b
,hour
,dim_b
FROM dbo.Table_2
) t2
) c
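The answers above all share one idea: deduplicate metric_b to one row per (hour, dim_b) before summing it, and aggregate metric_a separately. Below is a hedged sketch of case 2 (group by dim_a, dim_b) that uses a plain join instead of CROSS APPLY. It runs on SQLite with the question's sample rows; the table name facts and the CTE names are hypothetical, and nothing here has been tuned or tested for Redshift scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; columns and rows are the sample data from the question.
conn.execute(
    "CREATE TABLE facts (hour INTEGER, dim_a TEXT, dim_b TEXT, metric_a INTEGER, metric_b INTEGER)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [
        (0, "A", "X", 4, 4), (0, "A", "Y", 4, 24), (0, "B", "Y", 20, 24),
        (1, "B", "Y", 21, 35), (1, "A", "Y", 4, 35), (1, "C", "Y", 10, 35),
        (2, "B", "Y", 21, 30), (2, "C", "Y", 3, 30), (2, "A", "Y", 6, 30),
    ],
)

query = """
WITH b_per_hour AS (
    -- metric_b repeats for every dim_a, so keep one row per (hour, dim_b)
    SELECT DISTINCT hour, dim_b, metric_b FROM facts
),
b_totals AS (
    SELECT dim_b, SUM(metric_b) AS metric_b FROM b_per_hour GROUP BY dim_b
),
a_totals AS (
    SELECT dim_a, dim_b, SUM(metric_a) AS metric_a FROM facts GROUP BY dim_a, dim_b
)
SELECT a.dim_a, a.dim_b, a.metric_a, b.metric_b
FROM a_totals a
JOIN b_totals b ON b.dim_b = a.dim_b
ORDER BY a.dim_a, a.dim_b
"""
grouped = conn.execute(query).fetchall()
for row in grouped:
    print(row)
```

For the other cases only the grouping columns change: b_totals is grouped by whichever requested dimensions remain after removing dim_a, and the join uses those same columns (or becomes a cross join when none remain, as in case 4).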

SQL SUM and value conversion

I'm looking to transform data in my SUM query to acknowledge that some numeric values are negative in nature, although not represented as such.
I look for customer balance where the example dataset includes also credit transactions that are not written as negative in the database (although all records that have value C for credit in inv_type column should be treated as negative in the SQL SUM function). As an example:
INVOICES
inv_no inv_type cust_no value
1 D 25 10
2 D 35 30
3 C 25 5
4 D 25 50
5 C 35 2
My simple SUM function would not give me the correct answer:
select cust_no, sum(value) from INVOICES
group by cust_no
This query sums the balance of customer no 25 to 65 and of customer no 35 to 32, although the expected answers are 10 - 5 + 50 = 55 and 30 - 2 = 28.
Should I perhaps use the CAST function somehow? Unfortunately I don't know the underlying db engine, although there is a good chance it is of IBM origin. Most basic SQL code has worked so far, though.
You can use the case expression inside of a sum(). The simplest syntax would be:
select cust_no,
sum(case when inv_type = 'C' then - value else value end) as total
from invoices
group by cust_no;
Note that value could be a reserved word in your database, so you might need to escape the column name.
You should be able to write a projection (select) first to obtain a signed value column based on inv_type or whatever, and then do a sum over that.
Like this:
select cust_no, sum(value) from (
select cust_no
, case when inv_type='D' then [value] else -[value] end [value]
from INVOICES
) SUMS
group by cust_no
You can put an expression in the sum that calculates a negative value if the invoice is a credit:
select
cust_no,
sum
(
case inv_type
when 'C' then -[value]
else [value]
end
) as [Total]
from INVOICES
group by cust_no
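To check the conditional sum against the numbers in the question, here is a small sketch that loads the five sample invoices into an in-memory SQLite database. SQLite is only a stand-in for the unknown engine, so quoting rules for the value column may differ on the real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (inv_no INTEGER, inv_type TEXT, cust_no INTEGER, value INTEGER)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [(1, "D", 25, 10), (2, "D", 35, 30), (3, "C", 25, 5), (4, "D", 25, 50), (5, "C", 35, 2)],
)

# Credits (inv_type = 'C') are negated inside the aggregate, debits are added as-is.
query = """
SELECT cust_no,
       SUM(CASE WHEN inv_type = 'C' THEN -value ELSE value END) AS balance
FROM invoices
GROUP BY cust_no
ORDER BY cust_no
"""
balances = dict(conn.execute(query).fetchall())
print(balances)
```

Customer 25 comes out at 55 and customer 35 at 28, matching the balances expected in the question.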

Fill values "down" when pivoting

I'm doing a PIVOT command. My row label is a date field. My columns are locations like [NY], [TX], etc. Some of the values from the source data are null, but once it's pivoted I'd like to "fill down" those nulls with the last known value in date order.
That is, if column NY has a value for 1/1/2010 but null for 1/2/2010, I want to fill down the value from 1/1/2010 to 1/2/2010 and to any following null dates until another value exists. So basically I'm filling the null gaps in each column with the data from the closest earlier date that has data.
An example of my pivot query I currently have is:
SELECT ReadingDate, [NY],[TX],[WI]
FROM
(SELECT NAME As 'NodeName',
CAST(FORMAT(readingdate, 'M/d/yyyy') as Date) As 'ReadingDate',
SUM(myvalue) As 'Value'
FROM MyTable) as SourceData
PIVOT (SUM(Value) FOR NodeName IN ([NY],[TX],[WI])) as PivotTable
Order BY ReadingDate
But I'm not sure how to do this "fill down" to fill in null values
Sample source data
1/1/2010, TX, 1
1/1/2010, NY, 5
1/2/2010, NY, null
1/1/2010, WI, 3
1/3/2010, WI, 7
...
Notice that there is no WI row for 1/2 and no NY row for 1/3, which results in nulls in the pivot result. There is also a record with an explicit null value, which results in a null as well. For NY, once pivoted, 1/2 needs to be filled in with 5 because that is the last known value. 1/3 also needs to be filled in with 5: that record doesn't exist in the source, but it shows up as null after pivoting because another location has a record for that date.
This can be a pain in SQL Server. ANSI SQL defines a nice option for LAG(), called IGNORE NULLS, but SQL Server versions before 2022 don't support it. I would start with conditional aggregation (personal preference):
select cast(readingdate as date) as readingdate,
sum(case when name = 'NY' then value end) as NY,
sum(case when name = 'TX' then value end) as TX,
sum(case when name = 'WI' then value end) as WI
from mytable
group by cast(readingdate as date);
So, we have to be a bit more clever. We can assign the NULL values into groups based on the number of non-NULL values before them. Fortunately, this is easy to do using a cumulative COUNT() function. Then, we can get the one non-NULL value in this group by using MAX() (or MIN()):
with t as (
select cast(readingdate as date) as readingdate,
sum(case when name = 'NY' then value end) as NY,
sum(case when name = 'TX' then value end) as TX,
sum(case when name = 'WI' then value end) as WI
from mytable
group by cast(readingdate as date)
),
t2 as (
select t.*,
count(NY) over (order by readingdate) as NYgrp,
count(TX) over (order by readingdate) as TXgrp,
count(WI) over (order by readingdate) as WIgrp
from t
)
select readingdate,
coalesce(NY, max(NY) over (partition by NYgrp)) as NY,
coalesce(TX, max(TX) over (partition by TXgrp)) as TX,
coalesce(WI, max(WI) over (partition by WIgrp)) as WI
from t2;
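The grouping trick can be verified on the question's five sample rows. The sketch below runs the same three-step shape (pivot, cumulative COUNT per column, MAX per group) on SQLite 3.25 or later. Storing the dates as ISO text is an assumption made so that they sort correctly, and SQLite is only standing in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dates stored as ISO text (an assumption) so ORDER BY sorts them chronologically.
conn.execute("CREATE TABLE mytable (readingdate TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("2010-01-01", "TX", 1),
        ("2010-01-01", "NY", 5),
        ("2010-01-02", "NY", None),
        ("2010-01-01", "WI", 3),
        ("2010-01-03", "WI", 7),
    ],
)

query = """
WITH t AS (
    SELECT readingdate,
           SUM(CASE WHEN name = 'NY' THEN value END) AS NY,
           SUM(CASE WHEN name = 'TX' THEN value END) AS TX,
           SUM(CASE WHEN name = 'WI' THEN value END) AS WI
    FROM mytable
    GROUP BY readingdate
),
t2 AS (
    -- the running count only increases on non-NULL rows, so each NULL
    -- lands in the same group as the last non-NULL value before it
    SELECT t.*,
           COUNT(NY) OVER (ORDER BY readingdate) AS NYgrp,
           COUNT(TX) OVER (ORDER BY readingdate) AS TXgrp,
           COUNT(WI) OVER (ORDER BY readingdate) AS WIgrp
    FROM t
)
SELECT readingdate,
       COALESCE(NY, MAX(NY) OVER (PARTITION BY NYgrp)) AS NY,
       COALESCE(TX, MAX(TX) OVER (PARTITION BY TXgrp)) AS TX,
       COALESCE(WI, MAX(WI) OVER (PARTITION BY WIgrp)) AS WI
FROM t2
ORDER BY readingdate
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

NY keeps the value 5 on all three dates, TX keeps 1, and WI stays at 3 until its own new reading of 7 arrives on the third date.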

Joining next Sequential Row

I am planning an SQL statement right now and would like someone to look over my thoughts.
This is my Table:
id stat period
--- ------- --------
1 10 1/1/2008
2 25 2/1/2008
3 5 3/1/2008
4 15 4/1/2008
5 30 5/1/2008
6 9 6/1/2008
7 22 7/1/2008
8 29 8/1/2008
Create Table
CREATE TABLE tbstats
(
id INT IDENTITY(1, 1) PRIMARY KEY,
stat INT NOT NULL,
period DATETIME NOT NULL
)
go
INSERT INTO tbstats
(stat,period)
SELECT 10,CONVERT(DATETIME, '20080101')
UNION ALL
SELECT 25,CONVERT(DATETIME, '20080102')
UNION ALL
SELECT 5,CONVERT(DATETIME, '20080103')
UNION ALL
SELECT 15,CONVERT(DATETIME, '20080104')
UNION ALL
SELECT 30,CONVERT(DATETIME, '20080105')
UNION ALL
SELECT 9,CONVERT(DATETIME, '20080106')
UNION ALL
SELECT 22,CONVERT(DATETIME, '20080107')
UNION ALL
SELECT 29,CONVERT(DATETIME, '20080108')
go
I want to calculate the difference between each statistic and the next, and then calculate the mean value of the 'gaps.'
Thoughts:
I need to join each record with its subsequent row. I can do that using the ever flexible joining syntax, thanks to the fact that I know the id field is an integer sequence with no gaps.
By aliasing the table I could incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it should join on the row with id of 2 in the second aliased table. And so on.
Now I would simply subtract one from the other.
Then I would use the ABS function to ensure that I always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure.
Is there an easier way to achieve what I want?
The lead analytic function should do the trick:
SELECT period, stat, stat - LEAD(stat) OVER (ORDER BY period) AS gap
FROM tbstats
The average of the signed differences can be computed from just the first and last values, because the intermediate terms cancel out: divide the difference between the last value and the first value by one less than the number of elements. Note that this is the mean of the signed differences, not of the absolute gaps:
select sum(case when seqnum = num then stat else - stat end) * 1.0 / (max(num) - 1)
from (select period, stat, row_number() over (order by period) as seqnum,
count(*) over () as num
from tbstats
) t
where seqnum = num or seqnum = 1;
Of course, you can also do the calculation using lead(), but this will also work in SQL Server 2005 and 2008.
You can also achieve this with a join:
SELECT t1.period,
t1.stat,
t1.stat - t2.stat gap
FROM #tbstats t1
LEFT JOIN #tbstats t2
ON t1.id + 1 = t2.id
To calculate the difference between each statistic and the next, LEAD() and LAG() may be the simplest option. You provide an ORDER BY, and LEAD(something) returns the next something and LAG(something) returns the previous something in the given order.
select
x.id thisStatId,
LAG(x.id) OVER (ORDER BY x.id) lastStatId,
x.stat thisStatValue,
LAG(x.stat) OVER (ORDER BY x.id) lastStatValue,
x.stat - LAG(x.stat) OVER (ORDER BY x.id) diff
from tbStats x
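Putting the pieces together, the sketch below computes each absolute gap with LEAD() and then averages them, using the question's eight stat values in SQLite 3.25 or later. The period column is stored as ISO text here, which is an assumption made for SQLite. The sketch also computes the mean of the signed differences, to show that the first-and-last shortcut gives a different number from the mean of the absolute gaps the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbstats (id INTEGER PRIMARY KEY, stat INTEGER NOT NULL, period TEXT NOT NULL)"
)
stats = [10, 25, 5, 15, 30, 9, 22, 29]
conn.executemany(
    "INSERT INTO tbstats (stat, period) VALUES (?, ?)",
    [(stat, f"2008-01-{day:02d}") for day, stat in enumerate(stats, start=1)],
)

# Absolute gap between each row and the next; the last row has no next row, so its gap is NULL.
gap_query = """
SELECT ABS(stat - LEAD(stat) OVER (ORDER BY id)) AS gap
FROM tbstats
ORDER BY id
"""
gaps = [row[0] for row in conn.execute(gap_query)]

# AVG skips NULLs, so the trailing row does not distort either mean.
mean_query = """
WITH g AS (
    SELECT stat - LEAD(stat) OVER (ORDER BY id) AS diff FROM tbstats
)
SELECT AVG(ABS(diff)), AVG(diff) FROM g
"""
mean_abs_gap, mean_signed_diff = conn.execute(mean_query).fetchone()
print(gaps, mean_abs_gap, mean_signed_diff)
```

With this data the absolute gaps sum to 101 over 7 pairs, while the signed differences collapse to (10 - 29) / 7, so the two means answer different questions.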