In BigQuery, "where" clause with array-column of null values causing issue - google-bigquery

This is a follow-up to BigQuery - Compute 0 - 100 percentiles for multiple columns, over multiple groups, which was posted last year. The question concerns computing 0-100 percentiles for multiple columns in a table. Here's a reproducible example below. The post looks long, but it is mostly the reproducible example plus screenshots of output to help resolve the issue:
with
raw_data as (
select 24997 as competitionId, 0.9167 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7778 as ft2Pct, 0.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8125 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.5625 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6842 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7317 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8333 as ft2Pct, 0.5 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8000 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7500 as ft2Pct, null as ft3Pct, 1.0 as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6944 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7500 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.9091 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6667 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8261 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8108 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7895 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7727 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8333 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6923 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.9268 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7660 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, 0.8333 as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8636 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8036 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.9000 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8108 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct
),
-- A) Positive Percentiles
-- A1) compute quantiles: will be saved in messy arrays
positive_pctile_arrays as (
select
competitionId
,approx_quantiles(ft2Pct, 10) as ft2Pct
,approx_quantiles(ft3Pct, 10) as ft3Pct
,approx_quantiles(ftTechPct, 10) as ftTechPct
,approx_quantiles(ftFlagPct, 10) as ftFlagPct
from raw_data
group by 1
),
-- A2) and unnest arrays
positive_pctiles as (
select
competitionId
,pctile
,ft2Pct
,ft3Pct
,ftTechPct
,ftFlagPct
from positive_pctile_arrays as a
,a.ft2Pct with offset as pctile
,a.ft3Pct with offset as ft3PctPctile
,a.ftTechPct with offset as ftTechPctPctile
,a.ftFlagPct with offset as ftFlagPctPctile
where
pctile = ft3PctPctile and
pctile = ftTechPctPctile and
pctile = ftFlagPctPctile
)
-- select * from raw_data
select * from positive_pctile_arrays
-- select * from positive_pctiles
A few comments:
We are grouping by competitionId because our full data has >1 competitionId, even though the example has only 1.
We want to compute 0 - 100 percentiles for these values; however, for this example we use approx_quantiles(., 10) instead of approx_quantiles(., 100) for brevity.
In our data, all values for ftFlagPct are null. As a result, in A1 positive_pctile_arrays, the ftFlagPct column is blank.
Because of this, when we try to unnest these arrays in A2, it looks like the where clause filters all of the rows away. If you uncomment select * from positive_pctiles, this final output table will be empty.
If we comment ftFlagPct out of both A1 and A2, we mostly get the unnested table that we want.
Our desired output is this table, with an additional ftFlagPct column of all null values. It seems we need the query to detect that the ftFlagPct array-column in positive_pctile_arrays is null / empty, and then handle the join differently (like a left join)?
Edit: We are working on a solution where we identify and replace the null array with an array of dummy values (e.g., all 999999), and then replace the 999999s with nulls in the final output. Will post an answer if we can resolve this.

So, replacing
approx_quantiles(ftFlagPct, 10)
with
,case
when array_length(approx_quantiles(ftFlagPct, 10)) is null then generate_array(999990, 1000000, 1)
else approx_quantiles(ftFlagPct, 10)
end as ftFlagPct
...works to the extent that the final output table is retained (not filtered to 0 rows), with the 11 values 999990, 999991, ..., 1000000 in the ftFlagPct column. We don't love this solution by any means, but it gives us something to work with, and we can now easily replace these values with null values. Very open to a cleaner answer!
And then we can simply add ,case when ftFlagPct > 999989 then null else ftFlagPct end as ftFlagPct to the select statement of the last query...
Edit: this throws an error when the column type is float, and our data has a combo of floats and ints, so we're still working on this.
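
A cleaner pattern may be to unnest with explicit left joins instead of the comma cross joins, so that a null or empty array simply yields NULL for its column instead of filtering the row away. A minimal, untested sketch of A2 in that style, against the same CTEs as above (the *Val aliases are added here just to avoid name clashes with the array columns):
positive_pctiles as (
select
competitionId
,pctile
,ft2PctVal as ft2Pct
,ft3PctVal as ft3Pct
,ftTechPctVal as ftTechPct
,ftFlagPctVal as ftFlagPct
from positive_pctile_arrays as a
,unnest(a.ft2Pct) as ft2PctVal with offset as pctile
left join unnest(a.ft3Pct) as ft3PctVal with offset as ft3PctPctile
on pctile = ft3PctPctile
left join unnest(a.ftTechPct) as ftTechPctVal with offset as ftTechPctPctile
on pctile = ftTechPctPctile
left join unnest(a.ftFlagPct) as ftFlagPctVal with offset as ftFlagPctPctile
on pctile = ftFlagPctPctile
-- no where clause needed: rows from null/empty arrays survive with NULL values
)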

Related

How to get the sum of data from different dates

I need to get 3 data points:
1. The sum of data between n dates / days - e.g., the sum from 01-12 to 05-12, and this divided by days = 5.
2. The result of #1, but this one is for validation.
3. The difference between #1 and #2.
I also need these 3 points for every segment I have, like:
segment       1         2      3
TTT      456465    456465      0
CCC      478888    478886      2
select segment,
(SELECT var1 AS 1.-
+ ( SELECT var1 AS 1.-
from table
where data = 20221207
group by segement
)
from table
where data = 20221206
group by segement) AS 1.-,
sum(IMP_SDO_MED_CONT_ML) AS 2.-,
(1.- - 2.-) AS difference,
from table
WHERE DATA = 20221207
group by segment;
You appear to want to sum the data using conditional aggregation, and can then find #3 using subtraction:
SELECT segment,
SUM(CASE WHEN data BETWEEN 20221201 AND 20221212 THEN var1 END) AS "1",
SUM(IMP_SDO_MED_CONT_ML) AS "2",
SUM(CASE WHEN data BETWEEN 20221201 AND 20221212 THEN var1 END)
- SUM(IMP_SDO_MED_CONT_ML) AS "3"
FROM table_name
GROUP BY segment
However, without sample input data it is difficult to check what you are expecting.
It's really hard to tell what the question is without the data. Maybe it is about calculating the difference of the sums of two columns over some period, per segment. If that is the case, then the query would be:
Select
SEGMENT,
Sum(var_1) "FLD_1",
Sum(IMP_SDO_MED_CONT_ML) "FLD_2",
Sum(var_1) - Sum(IMP_SDO_MED_CONT_ML) "DIFF"
From a_tbl
Where DATE_NUMBER Between 20221201 And 20221203
Group By SEGMENT
Order By SEGMENT
... and if we invent some sample data like below
WITH
a_tbl AS
(
Select 'AAA' "SEGMENT", 1234 "VAR_1", 1233 "IMP_SDO_MED_CONT_ML", 20221201 "DATE_NUMBER" From Dual Union All
Select 'AAA' "SEGMENT", 5678 "VAR_1", 5677 "IMP_SDO_MED_CONT_ML", 20221202 "DATE_NUMBER" From Dual Union All
Select 'AAA' "SEGMENT", 9101 "VAR_1", 9103 "IMP_SDO_MED_CONT_ML", 20221203 "DATE_NUMBER" From Dual Union All
Select 'BBB' "SEGMENT", 8765 "VAR_1", 8766 "IMP_SDO_MED_CONT_ML", 20221201 "DATE_NUMBER" From Dual Union All
Select 'BBB' "SEGMENT", 6666 "VAR_1", 6665 "IMP_SDO_MED_CONT_ML", 20221202 "DATE_NUMBER" From Dual Union All
Select 'BBB' "SEGMENT", 4423 "VAR_1", 4420 "IMP_SDO_MED_CONT_ML", 20221203 "DATE_NUMBER" From Dual Union All
Select 'CCC' "SEGMENT", 1234 "VAR_1", 1233 "IMP_SDO_MED_CONT_ML", 20221201 "DATE_NUMBER" From Dual Union All
Select 'CCC' "SEGMENT", 5678 "VAR_1", 5677 "IMP_SDO_MED_CONT_ML", 20221203 "DATE_NUMBER" From Dual Union All
Select 'DDD' "SEGMENT", 1234 "VAR_1", 1233 "IMP_SDO_MED_CONT_ML", 20221201 "DATE_NUMBER" From Dual Union All
Select 'EEE' "SEGMENT", 5678 "VAR_1", 5678 "IMP_SDO_MED_CONT_ML", 20221203 "DATE_NUMBER" From Dual
)
... then the result would be
SEGMENT      FLD_1      FLD_2       DIFF
-------  ---------  ---------  ---------
AAA          16013      16013          0
BBB          19854      19851          3
CCC           6912       6910          2
DDD           1234       1233          1
EEE           5678       5678          0
OR maybe it is the difference between the sum over the period for one column and the sum on the last day of the period for the other column. In that case, the query could be:
Select
SEGMENT,
Sum(var_1) "FLD_1",
Sum(CASE WHEN DATE_NUMBER = 20221203 THEN IMP_SDO_MED_CONT_ML ELSE 0 END) "FLD_2",
Sum(var_1) - Sum(CASE WHEN DATE_NUMBER = 20221203 THEN IMP_SDO_MED_CONT_ML ELSE 0 END) "DIFF"
From a_tbl
Where DATE_NUMBER Between 20221201 And 20221203
Group By SEGMENT
Order By SEGMENT
... resulting (with same invented data) as
SEGMENT      FLD_1      FLD_2       DIFF
-------  ---------  ---------  ---------
AAA          16013       9103       6910
BBB          19854       4420      15434
CCC           6912       5677       1235
DDD           1234          0       1234
EEE           5678       5678          0

SQL: How to split data from quaterly to monthly with date

I have the data in a SQL table in quarterly format. I need to be able to split it into monthly rows, with the value split evenly (value/3) into each month. Can you please assist on how to achieve this using SQL? Thank you.
start       end         value
2022-01-01  2022-04-01   25629
2022-04-01  2022-07-01  993621
CREATE TABLE #your_tbl
("start_dt" timestamp, "end_dt" timestamp, "values" int)
;
INSERT INTO #your_tbl
("start_dt", "end_dt", "values")
VALUES
('2020-01-01 00:00:00', '2020-04-01 00:00:00', 114625),
('2020-04-01 00:00:00', '2020-07-01 00:00:00', 45216),
('2020-07-01 00:00:00', '2020-10-01 00:00:00', 513574)
DECLARE @datefrom datetime
DECLARE @dateto datetime
SET @datefrom = '2020-01-01' -- aligned with the sample rows inserted above
SET @dateto = '2020-10-01'
;WITH cte AS
(
SELECT @datefrom as MyDate
UNION ALL
SELECT DATEADD(month,1,MyDate)
FROM cte
WHERE DATEADD(month,1,MyDate)<@dateto
),
combined AS (
SELECT *
FROM #your_tbl q
JOIN cte m
ON m.MyDate >= q.start_dt
AND m.MyDate < q.end_dt
)
SELECT *, [values]/COUNT(1) OVER(PARTITION BY [start_dt], [end_dt]) as monthly_values
FROM combined
DROP TABLE #your_tbl
In Oracle you can use this script:
with mytable as (
select to_date('2022-01-01', 'YYYY-MM-DD') as startX, to_date('2022-04-01', 'YYYY-MM-DD') as endX, 25629 as valueX from dual union
select to_date('2022-04-01', 'YYYY-MM-DD') as startX, to_date('2022-07-01', 'YYYY-MM-DD') as endX, 993621 as valueX from dual union
select to_date('2022-07-01', 'YYYY-MM-DD') as startX, to_date('2022-10-01', 'YYYY-MM-DD') as endX, 21 as valueX from dual union
select to_date('2022-10-01', 'YYYY-MM-DD') as startX, to_date('2023-01-01', 'YYYY-MM-DD') as endX, 7777 as valueX from dual
),
mymonths as (
select '01' as month_n from dual union
select '02' as month_n from dual union
select '03' as month_n from dual union
select '04' as month_n from dual union
select '05' as month_n from dual union
select '06' as month_n from dual union
select '07' as month_n from dual union
select '08' as month_n from dual union
select '09' as month_n from dual union
select '10' as month_n from dual union
select '11' as month_n from dual union
select '12' as month_n from dual
)
select month_n, startX, valueX/3
from mytable, mymonths
where month_n between to_char(startX, 'MM') and to_char(endX-1, 'MM');
MONTH_N  STARTX     VALUEX/3
-------- ---------- ----------
01 01/01/2022 8543
02 01/01/2022 8543
03 01/01/2022 8543
04 01/04/2022 331207
05 01/04/2022 331207
06 01/04/2022 331207
07 01/07/2022 7
08 01/07/2022 7
09 01/07/2022 7
10 01/10/2022 2592,33333
11 01/10/2022 2592,33333
12 01/10/2022 2592,33333
Thank you.
Assuming you can figure out how to generate monthly dates, which is RDBMS-dependent, here's a solution that might work, depending on whether you can use window functions. (A sketch for generating the monthly dates follows the query.)
Note this doesn't hard-code divide-by-3, in case you're in a partial quarter.
WITH combined AS (
SELECT *
FROM your_tbl q
JOIN monthly_dates m
ON m.monthly_dt >= q.start_dt
AND m.monthly_dt < q.end_dt
)
SELECT *
, values / COUNT(1) OVER(PARTITION BY start_dt, end_dt) as monthly_values
FROM combined
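For the monthly_dates helper this answer leaves open, one option is a recursive CTE like the one in the T-SQL answer above. A minimal sketch, assuming SQL Server and a hypothetical fixed date range:
;WITH monthly_dates AS (
SELECT CAST('2022-01-01' AS date) AS monthly_dt
UNION ALL
SELECT DATEADD(month, 1, monthly_dt)
FROM monthly_dates
WHERE DATEADD(month, 1, monthly_dt) < '2023-01-01'
)
SELECT monthly_dt FROM monthly_dates;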

Unable to get multiple rows returned from a SELECT to summarize correctly by a specific column

I have an Oracle table that looks like this:
test_time test_name test_type test_location test_value
----------------- --------- --------- ------------- ----------
09/22/20 12:00:05 A RT Albany 200
09/22/20 12:00:05 A RT Chicago 500
09/22/20 12:00:05 B RT Albany 400
09/22/20 12:00:05 B RT Chicago 300
09/22/20 12:00:05 A WPL Albany 1500
09/22/20 12:00:05 A WPL Chicago 2300
09/22/20 12:00:05 B WPL Albany 2100
09/22/20 12:00:05 B WPL Chicago 1900
09/22/20 12:05:47 A RT Albany 300
09/22/20 12:05:47 A RT Chicago 400
09/22/20 12:05:47 B RT Albany 600
09/22/20 12:05:47 B RT Chicago 500
09/22/20 12:05:47 A WPL Albany 1700
09/22/20 12:05:47 A WPL Chicago 2000
09/22/20 12:05:47 B WPL Albany 1800
09/22/20 12:05:47 B WPL Chicago 2400
I want to run a SELECT against this table that will show me the average value of each location cited for a specific test_type (in this case, "RT") over the last 11 minutes, summarized by test_name. "11 minutes" is used to ensure that I will retrieve rows from at least two iterations of a script that inserts the records every five minutes.
I'd like the results of a SELECT statement against this table to look like this:
test_name albany_avg_val chicago_avg_val
--------- -------------- ---------------
A 250 450
B 500 400
(NOTE: the "albany_avg_val" for test_name "A" reflects the average value of the "test_value" values associated with the two iterations of test_name "A"/test_type "RT"/test_location "Albany" that ran at 12:00 and 12:05).
The SELECT statement I've built so far looks like this:
SELECT
test_name,
CASE test_location
WHEN 'Albany'
THEN ROUND(AVG( test_value ),0) albany_avg_val
WHEN 'Chicago'
THEN ROUND(AVG( test_value ),0) chicago_avg_val
END
FROM
test_table
WHERE
test_type = 'RT' AND test_time > sysdate - interval '11' minute;
...but it's not working as expected. Could someone help me with what I may be missing, please?
I think you want:
select
test_name,
round(avg(case when test_location = 'Albany' then test_value end)) albany_avg_val,
round(avg(case when test_location = 'Chicago' then test_value end)) chicago_avg_val
from test_table
where
test_type = 'RT'
and test_location in ('Albany', 'Chicago')
and test_time > sysdate - 11 / 24 / 60
group by test_name
That is:
use group by!
move the case expression inside the aggregate function avg()
each column should be separated - a conditional expression cannot generate two columns
And also...:
prefiltering in the where clause improves the efficiency of the query
it is safer to use "numeric" date arithmetic against sysdate (which is a date); if you want interval arithmetic, use systimestamp instead (see the sketch after this list)
0 is the default precision for round()
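A minimal illustration of the two styles (both expressions mean "11 minutes ago" in Oracle):
-- numeric date arithmetic on SYSDATE (a DATE): subtract a fraction of a day
SELECT sysdate - 11 / 24 / 60 FROM dual;
-- interval arithmetic, better paired with SYSTIMESTAMP (a TIMESTAMP)
SELECT systimestamp - INTERVAL '11' MINUTE FROM dual;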
Seems you need conditional aggregation:
SELECT
test_name,
AVG(CASE
WHEN test_location='Albany'
THEN ROUND( test_value ) END) AS albany_avg_val,
AVG(CASE WHEN test_location='Chicago'
THEN ROUND( test_value ) END) AS chicago_avg_val
FROM test_table
WHERE test_type = 'RT'
AND test_time > sysdate - interval '11' minute
GROUP BY test_name;
The second argument (0) of the ROUND() function is redundant.
Please try something like this:
SELECT
test_name,
ROUND(AVG(CASE when test_location='Albany'
THEN test_value
else null end),0) albany_avg_val,
ROUND(AVG(CASE when test_location='Chicago'
THEN test_value
else null end),0) Chicago_avg_val
FROM
test_table
WHERE
test_type = 'RT' AND test_time > sysdate - interval '11' minute
group by test_name;
The pivot clause was designed for exactly such things; the following query aggregates across all test_type values:
select *
from (select test_name, test_location, test_type, test_value from test_table)
pivot(
avg(test_value)
for test_location in ('Albany ' as Albany,'Chicago' as Chicago)
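-- note: 'Albany ' keeps the trailing space used in the sample-data values below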
);
Results:
TEST_NAME TEST_TYPE ALBANY CHICAGO
--------- --------- ---------- ----------
A RT 250 450
B RT 500 400
A WPL 1600 2150
B WPL 1950 2150
Or if you want to filter only RT:
select *
from (select test_name, test_location, test_value from test_table where test_type='RT')
pivot(
avg(test_value)
for test_location in ('Albany ' as Albany,'Chicago' as Chicago)
);
Results:
TEST_NAME ALBANY CHICAGO
--------- ---------- ----------
B 500 400
A 250 450
Full test case with sample data:
with test_table(test_time,test_name,test_type,test_location,test_value) as (
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'A', 'RT ', 'Albany ', 200 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'A', 'RT ', 'Chicago', 500 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'B', 'RT ', 'Albany ', 400 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'B', 'RT ', 'Chicago', 300 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'A', 'WPL', 'Albany ', 1500 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'A', 'WPL', 'Chicago', 2300 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'B', 'WPL', 'Albany ', 2100 from dual union all
select to_date('09/22/20 12:00:05','mm/dd/yy hh24:mi:ss'), 'B', 'WPL', 'Chicago', 1900 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'A', 'RT ', 'Albany ', 300 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'A', 'RT ', 'Chicago', 400 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'B', 'RT ', 'Albany ', 600 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'B', 'RT ', 'Chicago', 500 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'A', 'WPL', 'Albany ', 1700 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'A', 'WPL', 'Chicago', 2000 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'B', 'WPL', 'Albany ', 1800 from dual union all
select to_date('09/22/20 12:05:47','mm/dd/yy hh24:mi:ss'), 'B', 'WPL', 'Chicago', 2400 from dual
)
select *
from (select test_name, test_location, test_type, test_value from test_table)
pivot(
avg(test_value)
for test_location in ('Albany ' as Albany, 'Chicago' as Chicago)
);

Event grouping in time series

I'm trying to build groups of precipitation events in my measurement data. I have a time, a measurement value, and a flag noting whether it was raining:
00:00, 32.4, 0
00:10, 32.4, 0
00:20, 32.6, 1
00:30, 32.7, 1
00:40, 32.9, 1
00:50, 33.2, 1
01:00, 33.2, 0
01:10, 33.2, 0
01:20, 33.2, 0
01:30, 33.5, 1
01:40, 33.6, 1
01:50, 33.6, 0
02:00, 33.6, 0
...
Now I'd like to generate an event id for the precipitation events:
00:00, 32.4, 0, NULL
00:10, 32.4, 0, NULL
00:20, 32.6, 1, 1
00:30, 32.7, 1, 1
00:40, 32.9, 1, 1
00:50, 33.2, 1, 1
01:00, 33.2, 0, NULL
01:10, 33.2, 0, NULL
01:20, 33.2, 0, NULL
01:30, 33.5, 1, 2
01:40, 33.6, 1, 2
01:50, 33.6, 0, NULL
02:00, 33.6, 0, NULL
...
Then I'll be able to use grouping to summarize the events. Any hint how to do this in Oracle is much appreciated.
So far I was able to calculate the mentioned flag and the diff to the last row:
SELECT
measured_at,
station_id,
ps, -- precipitation sum
ps - lag(ps, 1, NULL) OVER (ORDER BY measured_at ASC) as p, -- precipitation delta
CASE
WHEN ps - lag(ps, 1, NULL) OVER (ORDER BY measured_at ASC) > 0 THEN 1
ELSE 0
END as rainflag
FROM measurements;
I think it must be possible to generate the required event id somehow, but can't figure it out. Thanks for your time!
Final solution, using mt0's answer:
DROP TABLE events;
CREATE TABLE events (measured_at, station_id, ps) AS
SELECT TO_DATE('2016-05-01 12:00', 'YYYY-MM-DD HH24:MI'), 'XYZ', 32.4 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 12:10', 'YYYY-MM-DD HH24:MI'), 'XYZ', 32.6 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 12:20', 'YYYY-MM-DD HH24:MI'), 'XYZ', 32.7 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 12:30', 'YYYY-MM-DD HH24:MI'), 'XYZ', 32.9 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 12:40', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.2 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 12:50', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.2 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 13:00', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.2 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 13:10', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.2 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 13:20', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.5 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 13:30', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.6 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 13:40', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.6 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 13:50', 'YYYY-MM-DD HH24:MI'), 'XYZ', 33.5 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 17:00', 'YYYY-MM-DD HH24:MI'), 'XYZ', 39.1 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 17:10', 'YYYY-MM-DD HH24:MI'), 'XYZ', 39.2 FROM DUAL UNION ALL
SELECT TO_DATE('2016-05-01 17:20', 'YYYY-MM-DD HH24:MI'), 'XYZ', 39.2 FROM DUAL;
WITH
flagged AS (
SELECT
measured_at,
station_id,
ps,
CASE
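-- 1/144 of a day = 10 minutes: only take a delta when the previous sample is exactly 10 minutes earlier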
WHEN measured_at - lag(measured_at, 1, NULL) OVER (ORDER BY measured_at) = (1/144) THEN ps - lag(ps, 1, NULL) OVER (ORDER BY measured_at)
ELSE NULL
END as delta_p,
CASE
WHEN ps - lag(ps, 1, NULL) OVER (ORDER BY measured_at) > 0 THEN 1
ELSE 0
END AS rain
FROM events
),
eventmarked AS (
SELECT
f.*,
CASE
WHEN f.delta_p >= 0 THEN f.delta_p
ELSE NULL
END AS p,
CASE rain
WHEN 1 THEN COUNT(1) OVER (ORDER BY measured_at) - SUM(rain) OVER (ORDER BY measured_at)
END as event
FROM flagged f
),
summarized AS (
SELECT
em.*,
sum(CASE p WHEN 0 THEN NULL ELSE p END) OVER (PARTITION BY event ORDER BY measured_at) as e_ps
FROM eventmarked em
)
SELECT measured_at, station_id, ps, p, e_ps FROM summarized
ORDER BY measured_at;
Oracle Setup:
CREATE TABLE events ( measured_at, station_id, ps ) AS
SELECT '00:00', 32.4, 0 FROM DUAL UNION ALL
SELECT '00:10', 32.4, 0 FROM DUAL UNION ALL
SELECT '00:20', 32.6, 1 FROM DUAL UNION ALL
SELECT '00:30', 32.7, 1 FROM DUAL UNION ALL
SELECT '00:40', 32.9, 1 FROM DUAL UNION ALL
SELECT '00:50', 33.2, 1 FROM DUAL UNION ALL
SELECT '01:00', 33.2, 0 FROM DUAL UNION ALL
SELECT '01:10', 33.2, 0 FROM DUAL UNION ALL
SELECT '01:20', 33.2, 0 FROM DUAL UNION ALL
SELECT '01:30', 33.5, 1 FROM DUAL UNION ALL
SELECT '01:40', 33.6, 1 FROM DUAL UNION ALL
SELECT '01:50', 33.6, 0 FROM DUAL UNION ALL
SELECT '02:00', 33.6, 0 FROM DUAL;
Query:
SELECT measured_at,
station_id,
ps,
CASE WHEN rainflag IS NOT NULL THEN DENSE_RANK() OVER ( ORDER BY rainflag ) END AS rainflag
FROM (
SELECT e.*,
CASE ps
WHEN 1
THEN COUNT( 1 ) OVER ( ORDER BY measured_at )
- SUM( ps ) OVER ( ORDER BY measured_at )
END AS rainflag
FROM events e
)
ORDER BY measured_at;
(In the first query, the running COUNT minus the running SUM of the flag is constant within each rain run and increases over dry rows, so each run gets a distinct value, which DENSE_RANK then compacts into 1, 2, ....)
Query 2
SELECT measured_at,
station_id,
ps,
CASE ps WHEN 1
THEN SUM( rainflag ) OVER ( ORDER BY measured_at )
END AS rainflag
FROM (
SELECT e.*,
CASE WHEN ps > LAG( ps, 1, 0 ) OVER ( ORDER BY measured_at )
THEN 1
END AS rainflag
FROM events e
);
Output:
MEASURED_AT STATION_ID PS RAINFLAG
----------- ---------- ---------- ----------
00:00 32.4 0
00:10 32.4 0
00:20 32.6 1 1
00:30 32.7 1 1
00:40 32.9 1 1
00:50 33.2 1 1
01:00 33.2 0
01:10 33.2 0
01:20 33.2 0
01:30 33.5 1 2
01:40 33.6 1 2
01:50 33.6 0
02:00 33.6 0
Alternative solution using only the LAG function.
In the subquery, the column PS2 marks the rows where rain starts. The main query simply sums this flag while ignoring the times when it is not raining.
with ev as (
select measured_at, station_id, ps,
case when ps = 1 and lag(ps,1,0) over (order by measured_at) = 0
then 1 else 0 end ps2
from events)
select measured_at, station_id, ps, ps2,
case when ps = 1 then
sum(ps2) over (order by measured_at) end rf
from ev
;
MEASURED_AT STATION_ID PS PS2 RF
----------- ---------- ---------- ---------- ----------
00:00 32,4 0 0
00:10 32,4 0 0
00:20 32,6 1 1 1
00:30 32,7 1 0 1
00:40 32,9 1 0 1
00:50 33,2 1 0 1
01:00 33,2 0 0
01:10 33,2 0 0
01:20 33,2 0 0
01:30 33,5 1 1 2
01:40 33,6 1 0 2
01:50 33,6 0 0
02:00 33,6 0 0

How to calculate price change over 3 years in SQL query

I need to calculate the price change of an item (both in cost and % change) over the last three years.
The table has four fields:
SKU_no, Date_updated, Price, Active_flag
When the Active_flag field is A, the item is active; when it is I, the item is inactive. Some items haven't changed prices in years, so they won't have three years of entries with an inactive flag.
Sample table
SKU_NO Update_date Price Active_flag
30 1/1/1999 40.8 I
33 1/1/2014 70.59 A
33 1/1/2013 67.23 I
33 1/1/2012 60.03 I
33 1/1/2011 55.08 I
33 1/1/2010 55.08 I
34 1/1/2009 51 A
36 1/1/2014 70.59 A
36 1/1/2013 67.23 I
36 1/1/2012 60.03 I
38 1/1/2002 43.32 A
38 1/1/2001 43.32 I
38 4/8/2000 43.32 I
38 1/1/1999 43.32 I
39 1/1/2014 73.08 A
39 1/1/2013 69.6 I
39 1/1/2012 62.13 I
39 1/1/2011 57 I
39 1/1/2010 57 I
39 1/1/2009 52.8 I
This is the first query I wrote; I'm not too familiar with complex calculations:
select
s.VENDOR,
s.FISCAL_YEAR,
s.FISCAL_MONTH_NO,
s.FISCAL_YEAR||'_'||FISCAL_MONTH_NO as PERIOD,
CASE WHEN S.COST_USED_FLAG IN ('CONTRACT') THEN 'CONTRACT' ELSE 'NON-CONTRACT' END AS CONTRACT_TYPE,
CASE WHEN ((s.FISCAL_YEAR = 2014 AND FISCAL_MONTH_NO <=9) OR (FISCAL_YEAR = 2013 AND FISCAL_MONTH_NO >=10)) THEN 'CP_1'
WHEN ((s.FISCAL_YEAR = 2013 AND FISCAL_MONTH_NO <= 9) OR (FISCAL_YEAR = 2012 AND FISCAL_MONTH_NO >=10)) THEN 'CP_2'
WHEN ((s.FISCAL_YEAR = 2012 AND FISCAL_MONTH_NO <= 9) OR (FISCAL_YEAR = 2011 AND FISCAL_MONTH_NO >=10)) THEN 'CP_3'
ELSE 'NULL' END CAGR_PERIODS,
CASE WHEN s.MARKET IN ('PO', 'SC', 'OC') THEN 'PC' ELSE 'EC' END AS MARKET_TYPE,
s.MARKET,
s.COST_PLUS_FLAG,
s.COST_USED_FLAG,
LPAD(S.PC_ITEM_NO,6,'0') AS NEW_ITEM_NO,
s.PC_ITEM_NO,
i.ITEM_NO,
i.VEND_CAT_NUM,
i.DESCRIPTION,
s.PC_PROD_CAT,
s.PC_PROD_SUBCAT,
i.SELL_UOM,
i.QTY_PER_SELL_UOM,
i.PRIMARY_UOM,
i.HEAD_CONV_FACT,
SUM(s.QTY_EACH) AS QUANTITY_SOLD,
SUM(s.EXT_GROSS_COGS) AS TOTAL_COGS,
SUM(s.EXT_GROSS_COGS)/ SUM(s.QTY_EACH) as NET_SALES,
SUM(s.EXT_SALES)/ SUM(s.QTY_EACH) as ASP,
SUM(s.EXT_SALES) AS TOTAL_SALES,
SUM(S.EXT_SALES) - SUM(S.EXT_GROSS_COGS) as GROSS_PROFIT
from SIXSIGMA.CIA_ALL_SALES_TREND_DATA s
INNER JOIN MGMSH.ITEM i
ON S.PC_ITEM_NO = I.ITEM_NO
WHERE S.VENDOR = 'BD' AND
(S.EXT_SALES IS NOT NULL AND S.FISCAL_YEAR IN ('2013','2012','2011'))
GROUP BY
s.VENDOR,
s.FISCAL_YEAR,
s.FISCAL_MONTH_NO,
s.FISCAL_YEAR||'_'||FISCAL_MONTH_NO,
CASE WHEN s.MARKET IN ('PO', 'SC', 'OC') THEN 'PC' ELSE 'EC' END,
CASE WHEN S.COST_USED_FLAG IN ('CONTRACT') THEN 'CONTRACT' ELSE 'NON-CONTRACT' END,
CASE WHEN ((s.FISCAL_YEAR = 2014 AND FISCAL_MONTH_NO <=9) OR (FISCAL_YEAR = 2013 AND FISCAL_MONTH_NO >=10)) THEN 'CP_1'
WHEN ((s.FISCAL_YEAR = 2013 AND FISCAL_MONTH_NO <= 9) OR (FISCAL_YEAR = 2012 AND FISCAL_MONTH_NO >=10)) THEN 'CP_2'
WHEN ((s.FISCAL_YEAR = 2012 AND FISCAL_MONTH_NO <= 9) OR (FISCAL_YEAR = 2011 AND FISCAL_MONTH_NO >=10)) THEN 'CP_3'
ELSE 'NULL' END,
s.MARKET,
s.COST_USED_FLAG,
s.COST_PLUS_FLAG,
s.PC_ITEM_NO,
s.PC_PROD_CAT,
i.SELL_UOM,
i.QTY_PER_SELL_UOM,
i.PRIMARY_UOM,
i.HEAD_CONV_FACT,
i.DESCRIPTION,
i.VEND_CAT_NUM,
s.PC_PROD_SUBCAT,
i.ITEM_NO
ORDER BY s.PC_ITEM_NO,s.FISCAL_YEAR, s.FISCAL_MONTH_NO
There are several ways to approach this, but I would recommend a windowing function such as LAG or LEAD. With these functions, you can reference neighboring rows. For example:
lead(column, offset, default) over (partition by some_column order by column)
And in the example below:
lead(price, 1, price) over (partition by sku_no order by update_date desc)
Here is a working example with sample data:
with sample_data as (
select '30' sku_no, to_date('1/1/1999','DD/MM/YYYY') update_date, 40.8 price, 'I' active_flag from dual union all
select '33', to_date('1/1/2014','DD/MM/YYYY'), 70.59, 'A' from dual union all
select '33', to_date('1/1/2013','DD/MM/YYYY'), 67.23, 'I' from dual union all
select '33', to_date('1/1/2012','DD/MM/YYYY'), 60.03, 'I' from dual union all
select '33', to_date('1/1/2011','DD/MM/YYYY'), 55.08, 'I' from dual union all
select '33', to_date('1/1/2010','DD/MM/YYYY'), 55.08, 'I' from dual union all
select '34', to_date('1/1/2009','DD/MM/YYYY'), 51 , 'A' from dual union all
select '36', to_date('1/1/2014','DD/MM/YYYY'), 70.59, 'A' from dual union all
select '36', to_date('1/1/2013','DD/MM/YYYY'), 67.23, 'I' from dual union all
select '36', to_date('1/1/2012','DD/MM/YYYY'), 60.03, 'I' from dual union all
select '38', to_date('1/1/2002','DD/MM/YYYY'), 43.32, 'A' from dual union all
select '38', to_date('1/1/2001','DD/MM/YYYY'), 43.32, 'I' from dual union all
select '38', to_date('4/8/2000','DD/MM/YYYY'), 43.32, 'I' from dual union all
select '38', to_date('1/1/1999','DD/MM/YYYY'), 43.32, 'I' from dual union all
select '39', to_date('1/1/2014','DD/MM/YYYY'), 73.08, 'A' from dual union all
select '39', to_date('1/1/2013','DD/MM/YYYY'), 69.6 , 'I' from dual union all
select '39', to_date('1/1/2012','DD/MM/YYYY'), 62.13, 'I' from dual union all
select '39', to_date('1/1/2011','DD/MM/YYYY'), 57 , 'I' from dual union all
select '39', to_date('1/1/2010','DD/MM/YYYY'), 57 , 'I' from dual union all
select '39', to_date('1/1/2009','DD/MM/YYYY'), 52.8 , 'I' from dual)
select
sku_no,
update_date,
price,
lead(price,1, price) over (partition by sku_no order by update_date desc) prior_price, -- Showing the offset
price - lead(price,1, price) over (partition by sku_no order by update_date desc) price_difference, -- Calculate the difference
round((price - lead(price,1, price) over (partition by sku_no order by update_date desc)) * 100 /price, 2) percent_change -- Calculate the percentage
from sample_data
where update_date >= add_months(trunc(sysdate,'YYYY'),-36); -- You said in the last three years
You can also use LAG with a different ORDER BY sort. If you want to calculate the difference from three years prior, I would suggest using the KEEP clause.
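A minimal sketch of that KEEP idea, reusing the sample_data CTE above (untested; one row per SKU, comparing the newest and oldest prices inside the three-year window):
select
sku_no,
max(price) keep (dense_rank last order by update_date) as latest_price,
min(price) keep (dense_rank first order by update_date) as earliest_price,
max(price) keep (dense_rank last order by update_date)
- min(price) keep (dense_rank first order by update_date) as price_difference
from sample_data
where update_date >= add_months(trunc(sysdate,'YYYY'), -36) -- same window as above
group by sku_no;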