FIRST_VALUE in Athena or Spark - sql

select id
,id2
,FIRST_VALUE(CASE WHEN app THEN date0 ELSE NULL END) IGNORE NULLS OVER (PARTITION BY id ORDER BY date0) as date_result
from (
select 1 id, 22 as id2, false app, Date'2019-03-13' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-14' as date0
union
select 1 id, 23 as id2, true app, Date'2019-03-15' as date0
)
The above query returns the following in Athena:
id | id2 | date_result
1 | 22 | (null)
1 | 23 | 2019-03-14
1 | 23 | 2019-03-14
But I was expecting the following, since we ignore nulls and partition by id for date_result:
id | id2 | date_result
1 | 22 | 2019-03-14
1 | 23 | 2019-03-14
1 | 23 | 2019-03-14
Could you please let me know what I am doing wrong with first_value? What is the best way to achieve this result in both Athena and Spark? Thanks.

Could you please let me know what I am doing wrong in first_value?
The default frame for window functions runs from UNBOUNDED PRECEDING to CURRENT ROW:
If frame_end is not specified, a default value of CURRENT ROW is used.
If no frame is specified, a default frame of RANGE UNBOUNDED PRECEDING is used.
If you want to find value across the whole partition you need to specify the frame, for example:
with dataset(id, id2, app, date0) as (
values (1, 22, false, Date'2019-03-13'),
(1, 23, true ,Date'2019-03-14'),
(1, 23, true ,Date'2019-03-15')
)
select id
, id2
, FIRST_VALUE(if(app, date0)) IGNORE NULLS
OVER (PARTITION BY id ORDER BY date0 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as date_result
from dataset;
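The effect of the frame can be reproduced in a small sandbox. The sketch below is a stand-in, not Athena or Spark: it assumes Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions). SQLite has no IGNORE NULLS, so MIN is substituted here; it skips NULLs and returns the same earliest date for this data. The aliases default_frame and full_frame are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (id INTEGER, id2 INTEGER, app INTEGER, date0 TEXT);
INSERT INTO dataset VALUES
  (1, 22, 0, '2019-03-13'),
  (1, 23, 1, '2019-03-14'),
  (1, 23, 1, '2019-03-15');
""")

query = """
SELECT id, id2,
       -- no frame given: only rows up to the current row are visible
       MIN(CASE WHEN app THEN date0 END)
         OVER (PARTITION BY id ORDER BY date0) AS default_frame,
       -- explicit frame: the whole partition is visible
       MIN(CASE WHEN app THEN date0 END)
         OVER (PARTITION BY id ORDER BY date0
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_frame
FROM dataset
ORDER BY date0
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the default frame the first row sees only itself, so its result stays NULL; with the explicit frame every row sees 2019-03-14.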

Related

SQL Window Function - Number of Rows since last Max

I am trying to create a SQL query that will pull the number of rows since the last maximum value within a window function over the last 5 rows. In the example below it would return 2 for row 8: the max value is 12, which is 2 rows from row 8.
For row 6 it would return 5 because the max value of 7 is 5 rows away.
|ID | Date | Amount
| 1 | 1/1/2019 | 7
| 2 | 1/2/2019 | 3
| 3 | 1/3/2019 | 4
| 4 | 1/4/2019 | 1
| 5 | 1/5/2019 | 1
| 6 | 1/6/2019 | 12
| 7 | 1/7/2019 | 2
| 8 | 1/8/2019 | 4
I tried the following:
SELECT ID, date, MAX(amount)
OVER (ORDER BY date ASC ROWS 5 PRECEDING) mymax
FROM tbl
This gets me to the max values but I am unable to efficiently determine how many rows away it is. I was able to get close using multiple variables within the SELECT but this did not seem efficient or scalable.
You can calculate the cumulative maximum and then use row_number() on that.
So:
select t.*,
row_number() over (partition by running_max order by date) as rows_since_last_max
from (select t.*,
max(amount) over (order by date rows between 5 preceding and current row) as running_max
from tbl t
) t;
I think this works for your sample data. It might not work if you have duplicates.
In that case, you can use date arithmetic:
select t.*,
datediff(day,
max(date) over (partition by running_max order by date),
date
) as days_since_most_recent_max5
from (select t.*,
max(amount) over (order by date rows between 5 preceding and current row) as running_max
from tbl t
) t;
EDIT:
Here is an example using row number:
select t.*,
(seqnum - max(case when amount = running_max then seqnum end) over (partition by running_max order by date)) as rows_since_most_recent_max5
from (select t.*,
max(amount) over (order by date rows between 5 preceding and current row) as running_max,
row_number() over (order by date) as seqnum
from tbl t
) t;
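The row-number variant can be tried against the question's sample data. This is only a sketch under assumptions: it runs on SQLite through Python's sqlite3 module (SQLite 3.25+), stores the dates as ISO text, and compares amount with running_max, since the inner query defines no column named running_amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INTEGER, date TEXT, amount INTEGER);
INSERT INTO tbl VALUES
  (1, '2019-01-01', 7), (2, '2019-01-02', 3), (3, '2019-01-03', 4),
  (4, '2019-01-04', 1), (5, '2019-01-05', 1), (6, '2019-01-06', 12),
  (7, '2019-01-07', 2), (8, '2019-01-08', 4);
""")

query = """
SELECT id,
       seqnum - MAX(CASE WHEN amount = running_max THEN seqnum END)
                  OVER (PARTITION BY running_max ORDER BY date) AS rows_since_max
FROM (SELECT t.*,
             MAX(amount) OVER (ORDER BY date
                               ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS running_max,
             ROW_NUMBER() OVER (ORDER BY date) AS seqnum
      FROM tbl t) AS t
ORDER BY id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the window includes the current row, a row that is itself the maximum reports 0; row 8 reports 2, as in the question.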
It would be :
select *,ID-
(
SELECT ID
FROM
(
SELECT
ID,amount,
Maxamount =q.mymax
FROM
Table_4
) AS derived
WHERE
amount = Maxamount
) as result
from (
SELECT ID, date,
MAX(amount)
OVER (ORDER BY date ASC ROWS 5 PRECEDING) mymax
FROM Table_4
)as q

How to combine 2 rows into single row in Teradata

I have a resultset in the below-mentioned form returned by a SQL:
ID Key
1 A
2 A
3 A
Now my requirement is to show the data in the below form:
Key ID1 ID2 ID3
A 1 2 3
How to build an SQL for this?
A Windowed Aggregate based solution with a single STATS-step in Explain:
SELECT
key,
-- value from 1st row = current row
ID AS ID1,
-- value from next row, similar to LEAD(ID, 1) Over (PARTITION BY Key ORDER BY ID)
Min(ID)
Over (PARTITION BY Key
ORDER BY ID
ROWS BETWEEN 1 Following AND 1 Following) AS ID2 ,
-- value from 3rd row
Min(ID)
Over (PARTITION BY Key
ORDER BY ID
ROWS BETWEEN 2 Following AND 2 Following) AS ID3
FROM mytable
QUALIFY -- only return the 1st row
Row_Number()
Over (PARTITION BY key
ORDER BY ID) = 1
As Teradata 14.10 doesn't have a PIVOT function, and assuming that for every unique key there will be no more than 3 IDs (as mentioned in the comments), you can use row_number() and an aggregate function as below to get your desired result.
SELECT
key1,
MAX(CASE WHEN rn = 1 THEN ID END) AS ID1,
MAX(CASE WHEN rn = 2 THEN ID END) AS ID2,
MAX(CASE WHEN rn = 3 THEN ID END) AS ID3
FROM
(SELECT
t.*,
ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY ID) AS rn
FROM table1 t) t
GROUP BY key1;
Result:
+------------+-----+-----+-----+
| key1 | id1 | id2 | id3 |
+------------+-----+-----+-----+
| A | 1 | 2 | 3 |
+------------+-----+-----+-----+
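The ROW_NUMBER plus conditional aggregation pattern is portable, so it can be checked outside Teradata. The sketch below assumes SQLite 3.25+ through Python's sqlite3 module; the rows for key 'B' are hypothetical additions to show what happens when a key has fewer than three IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ID INTEGER, key1 TEXT);
INSERT INTO table1 VALUES (1, 'A'), (2, 'A'), (3, 'A'),
                          (4, 'B'), (5, 'B');  -- 'B' rows are hypothetical
""")

pivoted = conn.execute("""
SELECT key1,
       MAX(CASE WHEN rn = 1 THEN ID END) AS ID1,
       MAX(CASE WHEN rn = 2 THEN ID END) AS ID2,
       MAX(CASE WHEN rn = 3 THEN ID END) AS ID3
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY key1 ORDER BY ID) AS rn
      FROM table1 t) t
GROUP BY key1
ORDER BY key1
""").fetchall()
print(pivoted)
```

A key with only two IDs gets NULL in ID3, because no row in its group has rn = 3.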

oracle dates group

How to get optimized query for this
date_one | date_two
------------------------
01.02.1999 | 31.05.2003
01.01.2004 | 01.01.2010
02.01.2010 | 10.10.2011
11.10.2011 | (null)
I need to get this
date_one | date_two | group
------------------------------------
01.02.1999 | 31.05.2003 | 1
01.01.2004 | 01.01.2010 | 2
02.01.2010 | 10.10.2011 | 2
11.10.2011 | (null) | 2
The group number is assigned as follows. Order the rows by date_one ascending. First row gets group = 1. Then for each row if date_one is the date immediately following date_two of the previous row, the group number stays the same as in the previous row, otherwise it increases by one.
You can do this using left join and a cumulative sum:
select t.*, sum(case when tprev.date_one is null then 1 else 0 end) over (order by t.date_one) as grp
from t left join
t tprev
on t.date_one = tprev.date_two + 1;
The idea is to find where the gaps begin (using the left join) and then do a cumulative sum of such beginnings to define the group.
If you want to be more inscrutable, you could write this as:
select t.*,
count(*) over (order by t.date_one) - count(tprev.date_one) over (order by t.date_one) as grp
from t left join
t tprev
on t.date_one = tprev.date_two + 1;
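The self-join idea can be sketched outside Oracle as well. The example assumes SQLite 3.25+ via Python's sqlite3 module, with dates stored as ISO text and SQLite's date(x, '+1 day') standing in for Oracle's date + 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (date_one TEXT, date_two TEXT);
INSERT INTO t VALUES
  ('1999-02-01', '2003-05-31'),
  ('2004-01-01', '2010-01-01'),
  ('2010-01-02', '2011-10-10'),
  ('2011-10-11', NULL);
""")

groups = conn.execute("""
SELECT t.date_one,
       -- a row with no predecessor ending the day before starts a new group
       SUM(CASE WHEN tprev.date_one IS NULL THEN 1 ELSE 0 END)
         OVER (ORDER BY t.date_one) AS grp
FROM t
LEFT JOIN t AS tprev
       ON t.date_one = date(tprev.date_two, '+1 day')
ORDER BY t.date_one
""").fetchall()
print(groups)
```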
One way is using window function:
select
date_one,
date_two,
sum(x) over (order by date_one) grp
from (
select
t.*,
case when
lag(date_two) over (order by date_one) + 1 =
date_one then 0 else 1 end x
from t
);
It fetches date_two from the previous row using the analytic function lag and checks whether it is in continuation with date_one of the current row (in increasing order of date_one).
How it works:
lag(date_two) over (order by date_one)
(In the below explanation, when I say first, next, previous or last row, it's based on increasing order of date_one with null values at the end)
The above produces NULL for the first row, as there is no row before it to get date_two from, and the previous row's date_two for the subsequent rows.
case when
lag(date_two)
over (order by date_one) + 1 = date_one then 0
else 1 end
Since lag produces NULL for the very first row, the comparison NULL + 1 = date_one evaluates to unknown rather than true, so the CASE falls through to ELSE and outputs 1.
For the following rows, the same check produces a new column x in the query output, which has value 1 when the previous row's date_two is not in continuation with this row's date_one.
Then finally, we can do an incremental sum on x to find the required group values. See the value of x below for understanding:
with t (date_one, date_two) as (
select to_date('01.02.1999','dd.mm.yyyy'), to_date('31.05.2003','dd.mm.yyyy') from dual union all
select to_date('01.01.2004','dd.mm.yyyy'), to_date('01.01.2010','dd.mm.yyyy') from dual union all
select to_date('02.01.2010','dd.mm.yyyy'), to_date('10.10.2011','dd.mm.yyyy') from dual union all
select to_date('11.10.2011','dd.mm.yyyy'), null from dual
)
select
date_one,
date_two,
x,
sum(x) over (order by date_one) grp
from (
select
t.*,
case when
lag(date_two) over (order by date_one) + 1 =
date_one then 0 else 1 end x
from t
);
DATE_ONE  | DATE_TWO  | X | GRP
----------|-----------|---|----
01-FEB-99 | 31-MAY-03 | 1 | 1
01-JAN-04 | 01-JAN-10 | 1 | 2
02-JAN-10 | 10-OCT-11 | 0 | 2
11-OCT-11 | (null)    | 0 | 2
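The NULL handling in the first row is the subtle part, so here is a small sketch that isolates it. It assumes SQLite 3.25+ via Python's sqlite3 module rather than Oracle, ISO text dates, and date(x, '+1 day') in place of Oracle's date + 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (date_one TEXT, date_two TEXT);
INSERT INTO t VALUES
  ('1999-02-01', '2003-05-31'),
  ('2004-01-01', '2010-01-01'),
  ('2010-01-02', '2011-10-10'),
  ('2011-10-11', NULL);
""")

# Comparing NULL with anything yields NULL (unknown), not false.
null_cmp = conn.execute("SELECT NULL = '1999-02-01'").fetchone()[0]

flags = conn.execute("""
SELECT date_one, x, SUM(x) OVER (ORDER BY date_one) AS grp
FROM (SELECT t.*,
             CASE WHEN date(LAG(date_two) OVER (ORDER BY date_one), '+1 day')
                       = date_one
                  THEN 0 ELSE 1 END AS x  -- unknown falls through to ELSE
      FROM t)
ORDER BY date_one
""").fetchall()
print(null_cmp, flags)
```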

Return items that add up to a maximum given value from a table

I'm trying to build a sql query that will return a list of IDs that have a total sum, which is less than OR greater than a given value using the least number of items.
Here's an example of the table I'll be querying.
ID Value
-----------
226 2.3
331 3.1
25 1.5
28 1.5
29 1.2
52 5.2
38 3.5
Here it is sorted by Value asc.
ID Value
----------
29 1.2
25 1.5
28 1.5
226 2.3
331 3.1
38 3.5
52 5.2
Example A :
If my value is 6, I would expect the query to return IDs 29, 25, 28 and 226.
1.2 + 1.5 + 1.5 + 2.3 = 6.5
Example B :
If my value is 19, I would expect the query to return all of the IDs (29, 25, 28, 226, 331, 38, 52).
1.2 + 1.5 + 1.5 + 2.3 + 3.1 + 3.5 + 5.2 = 18.3
I've tried the suggested answer found here:
SQL select elements where sum of field is less than N
However, that's not giving me exactly what I need since it only returns IDs that add up to LESS than the set value. Also it is assuming that the ID is ascending which isn't the case when I sort by asc value.
Is this even possible within a sql statement? or would I have to do a procedure/function to accomplish this task?
Assuming you are taking the values in ascending order and you want to stop when you are closest to the desired total, then:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE data ( ID, Value ) AS
SELECT 226, 2.3 FROM DUAL
UNION ALL SELECT 331, 3.1 FROM DUAL
UNION ALL SELECT 25, 1.5 FROM DUAL
UNION ALL SELECT 28, 1.5 FROM DUAL
UNION ALL SELECT 29, 1.2 FROM DUAL
UNION ALL SELECT 52, 5.2 FROM DUAL
UNION ALL SELECT 38, 3.5 FROM DUAL;
Query 1:
WITH differences AS (
SELECT ID,
Value,
ABS(
SUM( Value ) OVER ( ORDER BY Value ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )
- 6 -- REPLACE THIS WITH :desired_total
) AS difference
FROM data
),
min_difference AS (
SELECT MIN( Value ) KEEP ( DENSE_RANK FIRST ORDER BY difference ASC ) AS max_value,
MIN( ID ) KEEP ( DENSE_RANK FIRST ORDER BY difference ASC ) AS max_id
FROM differences
)
SELECT ID,
Value
FROM differences d
INNER JOIN
min_difference m
ON ( d.value < max_value
OR ( d.value = m.max_value AND d.id <= max_id ) )
Results:
| ID | VALUE |
|-----|-------|
| 226 | 2.3 |
| 25 | 1.5 |
| 28 | 1.5 |
| 29 | 1.2 |
Edit - Stops when running total is just greater than or equal to desired total
SQL Fiddle
Query 1:
Calculate the running totals for each row (in order of ascending value) then select all the rows where the running total is less than the desired total and also the next row (with the minimum running total of the totals greater than or equal to the desired total).
WITH running_totals AS (
SELECT ID,
Value,
SUM( Value ) OVER ( ORDER BY Value ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total
FROM data
)
SELECT ID,
Value
FROM running_totals
WHERE running_total < 6
UNION ALL
SELECT MIN( id ) KEEP ( DENSE_RANK FIRST ORDER BY running_total ),
MIN( value ) KEEP ( DENSE_RANK FIRST ORDER BY running_total )
FROM running_totals
WHERE running_total >= 6
Results:
| ID | VALUE |
|-----|-------|
| 29 | 1.2 |
| 28 | 1.5 |
| 25 | 1.5 |
| 226 | 2.3 |
EDIT 2 - An alternative method:
WITH running_totals AS (
SELECT ID,
Value,
SUM( Value ) OVER ( ORDER BY Value, ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total,
ROW_NUMBER() OVER ( ORDER BY Value, ID ) AS idx
FROM data
)
SELECT ID,
Value
FROM running_totals
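-- COALESCE covers the case where even the first row already reaches the desired total
WHERE idx <= (SELECT COALESCE(MAX(idx), 0) + 1 FROM running_totals WHERE running_total < 6 );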
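The alternative method can be exercised with different totals in a quick sandbox. This sketch assumes SQLite 3.25+ via Python's sqlite3 module instead of Oracle; the helper name pick is hypothetical, and the COALESCE guard is an addition for the case where no running total is below the target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data (ID INTEGER, Value REAL);
INSERT INTO data VALUES (226, 2.3), (331, 3.1), (25, 1.5), (28, 1.5),
                        (29, 1.2), (52, 5.2), (38, 3.5);
""")

def pick(target):
    """Return IDs in ascending value order until the running total reaches target."""
    sql = """
    WITH running_totals AS (
      SELECT ID, Value,
             SUM(Value) OVER (ORDER BY Value, ID
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
             ROW_NUMBER() OVER (ORDER BY Value, ID) AS idx
      FROM data
    )
    SELECT ID
    FROM running_totals
    WHERE idx <= (SELECT COALESCE(MAX(idx), 0) + 1
                  FROM running_totals
                  WHERE running_total < ?)
    ORDER BY idx
    """
    return [row[0] for row in conn.execute(sql, (target,))]

print(pick(6))
print(pick(19))
```

Passing the target as a bound parameter avoids editing the literal 6 in the query for each run.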

SELECT records until new value SQL

I have a table
Val | Number
08 | 1
09 | 1
10 | 1
11 | 3
12 | 0
13 | 1
14 | 1
15 | 1
I need to return the last values where Number = 1 (however many that may be) until Number changes, but do not need the first instances where Number = 1. Essentially I need to select back until Number changes to 0 (15, 14, 13)
Is there a proper way to do this in MSSQL?
Based on the following:
I need to return the last values where Number = 1
Essentially I need to select back until Number changes to 0 (15, 14, 13)
Try (Fiddle demo):
select val, number
from T
where val > (select max(val)
from T
where number<>1)
EDIT: to address all possible combinations (Fiddle demo 2)
;with cte1 as
(
select 1 id, max(val) maxOne
from T
where number=1
),
cte2 as
(
select 1 id, isnull(max(val),0) maxOther
from T
where val < (select maxOne from cte1) and number<>1
)
select val, number
from T cross join
(select maxOne, maxOther
from cte1 join cte2 on cte1.id = cte2.id
) X
where val>maxOther and val<=maxOne
I think you can use window functions, something like this:
with cte as (
-- generate two row_number to enumerate distinct groups
select
Val, Number,
row_number() over(partition by Number order by Val) as rn1,
row_number() over(order by Val) as rn2
from Table1
), cte2 as (
-- get groups with Number = 1 and last group
select
Val, Number,
rn2 - rn1 as rn1, max(rn2 - rn1) over() as rn2
from cte
where Number = 1
)
select Val, Number
from cte2
where rn1 = rn2
DEMO: http://sqlfiddle.com/#!3/e7d54/23
DDL
create table T(val int identity(8,1), number int)
insert into T values
(1),(1),(1),(3),(0),(1),(1),(1),(0),(2)
DML
; WITH last_1 AS (
SELECT Max(val) As val
FROM t
WHERE number = 1
)
, last_non_1 AS (
SELECT Coalesce(Max(val), -937) As val
FROM t
WHERE EXISTS (
SELECT val
FROM last_1
WHERE last_1.val > t.val
)
AND number <> 1
)
SELECT t.val
, t.number
FROM t
CROSS
JOIN last_1
CROSS
JOIN last_non_1
WHERE t.val <= last_1.val
AND t.val > last_non_1.val
I know it's a little verbose, but I've deliberately kept it that way to illustrate the methodology:
1. Find the highest val where number=1.
2. For all rows where val is less than the value found in step 1, find the largest val where number<>1.
3. Finally, find the rows that fall within the values we uncovered in steps 1 & 2.
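The three steps can also be run one at a time to inspect the intermediate values. This is a sketch using Python's sqlite3 module rather than SQL Server; the explicit val column replaces the identity column, and the -1 fallback is an arbitrary sentinel assumed to be below every val.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (val INTEGER, number INTEGER);
INSERT INTO t VALUES (8, 1), (9, 1), (10, 1), (11, 3), (12, 0),
                     (13, 1), (14, 1), (15, 1), (16, 0), (17, 2);
""")

# Step 1: highest val where number = 1
last_1 = conn.execute(
    "SELECT MAX(val) FROM t WHERE number = 1"
).fetchone()[0]

# Step 2: largest val below step 1 where number <> 1
last_non_1 = conn.execute(
    "SELECT COALESCE(MAX(val), -1) FROM t WHERE number <> 1 AND val < ?",
    (last_1,),
).fetchone()[0]

# Step 3: rows that fall between the two boundaries
rows = conn.execute(
    "SELECT val, number FROM t WHERE val <= ? AND val > ? ORDER BY val",
    (last_1, last_non_1),
).fetchall()
print(last_1, last_non_1, rows)
```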
select val, count (number) from
yourtable
group by val
having count(number) > 1
The having clause is the key here, giving you all the vals that have more than one value of 1.
This is a common approach for getting rows until some value changes. For your specific case use desc in proper spots.
Create sample table
select * into #tmp from
(select 1 as id, 'Alpha' as value union all
select 2 as id, 'Alpha' as value union all
select 3 as id, 'Alpha' as value union all
select 4 as id, 'Beta' as value union all
select 5 as id, 'Alpha' as value union all
select 6 as id, 'Gamma' as value union all
select 7 as id, 'Alpha' as value) t
Pull top rows until value changes:
with cte as (select * from #tmp t)
select * from
(select cte.*, ROW_NUMBER() over (order by id) rn from cte) OriginTable
inner join
(
select cte.*, ROW_NUMBER() over (order by id) rn from cte
where cte.value = (select top 1 cte.value from cte order by cte.id)
) OnlyFirstValueRecords
on OriginTable.rn = OnlyFirstValueRecords.rn and OriginTable.id = OnlyFirstValueRecords.id
On the left side we put the original table. On the right side we put only the rows whose value equals the value in the first row.
Records in both tables will be the same until the target value changes. After row 3, the row numbers are associated with different IDs because of the offset, so those rows never join with the original table:
LEFT RIGHT
ID Value RN ID Value RN
1 Alpha 1 | 1 Alpha 1
2 Alpha 2 | 2 Alpha 2
3 Alpha 3 | 3 Alpha 3
----------------------- result set ends here
4 Beta 4 | 5 Alpha 4
5 Alpha 5 | 7 Alpha 5
6 Gamma 6 |
7 Alpha 7 |
The ID must be unique, and the ordering by this ID must be the same in both ROW_NUMBER() functions.
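The join on both row number and ID can be checked with the same sample rows. This sketch assumes SQLite 3.25+ via Python's sqlite3 module; LIMIT 1 replaces SQL Server's TOP 1, a plain table named tmp replaces #tmp, and the CTE names origin and first_only are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmp (id INTEGER, value TEXT);
INSERT INTO tmp VALUES (1, 'Alpha'), (2, 'Alpha'), (3, 'Alpha'), (4, 'Beta'),
                       (5, 'Alpha'), (6, 'Gamma'), (7, 'Alpha');
""")

leading_run = conn.execute("""
WITH origin AS (
  SELECT id, value, ROW_NUMBER() OVER (ORDER BY id) AS rn
  FROM tmp
), first_only AS (
  -- only rows carrying the first row's value, renumbered from 1
  SELECT id, value, ROW_NUMBER() OVER (ORDER BY id) AS rn
  FROM tmp
  WHERE value = (SELECT value FROM tmp ORDER BY id LIMIT 1)
)
SELECT o.id, o.value
FROM origin o
JOIN first_only f ON o.rn = f.rn AND o.id = f.id
ORDER BY o.id
""").fetchall()
print(leading_run)
```

Rows 5 and 7 also hold 'Alpha', but their row numbers on the right side (4 and 5) no longer match the left side, so they drop out of the join.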