SQL - Results based partially on aggregate of particular column - sql

Thanks in advance for any assistance. I have a situation where I need a snapshot of SQL data but part of the results need to be based on the aggregate of one column. Here's a tiny subset of my data:
| A | B | last_date | next_date | C | D |
| 1 | 3 | 01/01/2000 | 01/01/2003 | 1 | 1 |
| 1 | 3 | 01/01/2001 | 01/01/2004 | 1 | 2 |
| 2 | 3 | 01/01/2002 | 01/01/2005 | 2 | 3 |
| 2 | 4 | 01/01/2003 | 01/01/2006 | 3 | 4 |
My results need to be grouped by columns A and B, with the MAX of last_date and the MIN of next_date. The kicker is that the values for columns C and D should come from the row that holds the MIN of next_date. So for the above data subset my results would be:
| A | B | last_date | next_date | C | D |
| 1 | 3 | 01/01/2001 | 01/01/2003 | 1 | 1 |
| 2 | 3 | 01/01/2002 | 01/01/2005 | 2 | 3 |
| 2 | 4 | 01/01/2003 | 01/01/2006 | 3 | 4 |
Note how the first row of results has the value of last_date from the 2nd row of the initial data, but the values for columns C and D correspond to the first row from the initial data. In the case where there is an exact duplication of columns A, B, max(last_date), and min(next_date) but the values for columns C and D don't match, then I don't care which one is returned - but I must only return one row, not multiples.

You can use row_number and get these results as below:
Select A, B, MaxLast_date, MinNext_date, C, D from (
select *, max(last_date) over(partition by A, B) as MaxLast_date, Min(next_date) over(partition by A, B) as MinNext_date,
next_rn = Row_number() over(partition by A, B order by next_date) from #yourtable
) a
Where a.next_rn = 1
Another way is to use top (1) with ties, as below:
Select top(1) with ties *, max(last_date) over(partition by A, B) as MaxLast_date, Min(next_date) over(partition by A, B) as MinNext_date
from #yourtable
Order by Row_number() over(partition by A, B order by next_date)
Output:
+---+---+--------------+--------------+---+---+
| A | B | MaxLast_date | MinNext_date | C | D |
+---+---+--------------+--------------+---+---+
| 1 | 3 | 2001-01-01 | 2003-01-01 | 1 | 1 |
| 2 | 3 | 2002-01-01 | 2005-01-01 | 2 | 3 |
| 2 | 4 | 2003-01-01 | 2006-01-01 | 3 | 4 |
+---+---+--------------+--------------+---+---+
Demo
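For readers without a SQL Server instance, here is a minimal sketch of the row_number approach run through Python's built-in sqlite3 module. It assumes an SQLite build with window function support (3.25 or later). The table name yourtable, the ISO date strings, and the aliases max_last_date, min_next_date and ranked are illustrative choices, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourtable (A INT, B INT, last_date TEXT, next_date TEXT, C INT, D INT);
INSERT INTO yourtable VALUES
  (1, 3, '2000-01-01', '2003-01-01', 1, 1),
  (1, 3, '2001-01-01', '2004-01-01', 1, 2),
  (2, 3, '2002-01-01', '2005-01-01', 2, 3),
  (2, 4, '2003-01-01', '2006-01-01', 3, 4);
""")

# The window aggregates are computed per (A, B) group; ROW_NUMBER picks
# exactly one row per group, the one with the earliest next_date.
query = """
SELECT A, B, max_last_date, min_next_date, C, D
FROM (
    SELECT t.*,
           MAX(last_date) OVER (PARTITION BY A, B) AS max_last_date,
           MIN(next_date) OVER (PARTITION BY A, B) AS min_next_date,
           ROW_NUMBER() OVER (PARTITION BY A, B ORDER BY next_date) AS next_rn
    FROM yourtable AS t
) AS ranked
WHERE next_rn = 1
ORDER BY A, B
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because ROW_NUMBER never assigns the same number twice within a partition, each (A, B) group yields a single row even when several rows tie on next_date.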

Related

Using LAG function with higher offset

Suppose we have the following input table
cat | value | position
------------------------
1 | A | 1
1 | B | 2
1 | C | 3
1 | D | 4
2 | C | 1
2 | B | 2
2 | A | 3
2 | D | 4
As you can see, the values A, B, C, D change position in each category. I want to track this change by adding a change column next to each value. The output should look like this:
cat | value | position | change
---------------------------------
1 | A | 1 | NULL
1 | B | 2 | NULL
1 | C | 3 | NULL
1 | D | 4 | NULL
2 | C | 1 | 2
2 | B | 2 | 0
2 | A | 3 | -2
2 | D | 4 | 0
For example, C was in position 3 in category 1 and moved to position 1 in category 2, and therefore has a change of 2. I tried implementing this using the LAG() function with an offset of 4 but failed. How can I write this query?
Use lag() - with the proper partition by clause:
select
t.*,
lag(position) over(partition by value order by cat) - position change
from mytable t
You can use lag and then order by to maintain original order. Here is the demo.
select
*,
lag(position) over (partition by value order by cat) - position as change
from yourTable
order by
cat, position
output:
| cat | value | position | change |
| --- | ----- | -------- | ------ |
| 1 | A | 1 | null |
| 1 | B | 2 | null |
| 1 | C | 3 | null |
| 1 | D | 4 | null |
| 2 | C | 1 | 2 |
| 2 | B | 2 | 0 |
| 2 | A | 3 | -2 |
| 2 | D | 4 | 0 |
I think you just want lag() with the right partition by:
select t.*,
(lag(position) over (partition by value order by cat) - position) as change
from t;
Here is a db<>fiddle.
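The answers above all rely on partitioning by value, so that the "previous row" is the same value in the previous category. A small sketch using Python's sqlite3 module (assuming SQLite 3.25 or later for window functions) reproduces the expected output. The table name mytable is taken from the first answer; everything else is the question's own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (cat INT, value TEXT, position INT);
INSERT INTO mytable VALUES
  (1, 'A', 1), (1, 'B', 2), (1, 'C', 3), (1, 'D', 4),
  (2, 'C', 1), (2, 'B', 2), (2, 'A', 3), (2, 'D', 4);
""")

# Within each value, LAG(position) looks one category back, so no
# offset of 4 is needed. Rows in the first category get NULL.
rows = conn.execute("""
SELECT cat, value, position,
       LAG(position) OVER (PARTITION BY value ORDER BY cat) - position AS change
FROM mytable
ORDER BY cat, position
""").fetchall()

for row in rows:
    print(row)
```

An offset of 4 only works while every category holds exactly four values in a fixed order; partitioning by value does not depend on that.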

TSQL - Referencing a changed value from previous row

I am trying to do a row calculation in which the larger value carries forward to subsequent rows until an even larger value is encountered. I do this by comparing the current value to the previous row's value using the lag() function.
Code
CREATE TABLE #TAB (id varchar(1), d1 INT, d2 INT)
INSERT INTO #TAB (id,d1,d2)
VALUES ('A',0,5)
,('A',1,2)
,('A',2,4)
,('A',3,6)
,('B',0,4)
,('B',2,3)
,('B',3,2)
,('B',4,5)
SELECT id
,d1
,d2 = CASE WHEN id <> (LAG(id,1,0) OVER (ORDER BY id,d1)) THEN d2
WHEN d2 < (LAG(d2,1,0) OVER (ORDER BY id,d1)) THEN (LAG(d2,1,0) OVER (ORDER BY id,d1))
ELSE d2 END
FROM #TAB
Output (added column od2, the original d2, for clarity)
+----+----+----+ +----+
| id | d1 | d2 | | od2|
+----+----+----+ +----+
| A | 0 | 5 | | 5 |
| A | 1 | 5 | | 2 |
| A | 2 | 4 | | 4 |
| A | 3 | 6 | | 6 |
| B | 0 | 4 | | 4 |
| B | 2 | 4 | | 3 |
| B | 3 | 3 | | 2 |
| B | 4 | 5 | | 5 |
+----+----+----+ +----+
As you can see from the output, the lag function references the original value of the previous row rather than the new value. Is there any way to achieve this?
Desired Output
+----+----+----+ +----+
| id | d1 | d2 | | od2|
+----+----+----+ +----+
| A | 0 | 5 | | 5 |
| A | 1 | 5 | | 2 |
| A | 2 | 5 | | 4 |
| A | 3 | 6 | | 6 |
| B | 0 | 4 | | 4 |
| B | 2 | 4 | | 3 |
| B | 3 | 4 | | 2 |
| B | 4 | 5 | | 5 |
+----+----+----+ +----+
Try this:
SELECT id
,d1
,d2
,MAX(d2) OVER (PARTITION BY ID ORDER BY d1)
FROM #TAB
The idea is to use the MAX to get the max value from the beginning to the current row for each partition.
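The running-maximum idea can be checked with a short sketch using Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). The table name tab and the alias running_max are illustrative; the data is the question's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab (id TEXT, d1 INT, d2 INT);
INSERT INTO tab VALUES
  ('A', 0, 5), ('A', 1, 2), ('A', 2, 4), ('A', 3, 6),
  ('B', 0, 4), ('B', 2, 3), ('B', 3, 2), ('B', 4, 5);
""")

# With ORDER BY in the window and no explicit frame, MAX covers the rows
# from the start of the partition up to the current row.
rows = conn.execute("""
SELECT id, d1, d2,
       MAX(d2) OVER (PARTITION BY id ORDER BY d1) AS running_max
FROM tab
ORDER BY id, d1
""").fetchall()

for row in rows:
    print(row)
```

The running_max column matches the d2 column of the desired output: 5, 5, 5, 6 for A and 4, 4, 4, 5 for B.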
Thanks for providing the DDL scripts and the DML.
One way of doing it is with a recursive CTE, as follows.
1. First rank all the records within each id by d1 and d2 (the cte block).
2. Seed the recursive CTE with the first row of each id (rnk = 1).
3. For each subsequent rnk, the compared_val field compares d2 with the compared_val of the previous rnk and keeps whichever is larger.
CREATE TABLE #TAB (id varchar(1), d1 INT, d2 INT)
INSERT INTO #TAB (id,d1,d2)
VALUES ('A',0,5)
,('A',1,2)
,('A',2,4)
,('A',3,6)
,('B',0,4)
,('B',2,3)
,('B',3,2)
,('B',4,5)
;with cte
as (select row_number() over(partition by id order by d1,d2) as rnk
,id,d1,d2
from #TAB
)
,data(rnk,id,d1,d2,compared_val)
as (select rnk,id,d1,d2,d2 as compared_val
from cte
where rnk=1
union all
select a.rnk,a.id,a.d1,a.d2,case when b.compared_val > a.d2 then
b.compared_val
else a.d2
end
from cte a
join data b
on a.id=b.id
and a.rnk=b.rnk+1
)
select * from data order by id,d1,d2

Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
avg(count(distinct user_id))
over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average users per day by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
avg(count_distinct_a)
from (
select
b,
count(distinct a) as count_distinct_a
from foo
group by b
)
group by b
And from that, I can tweak it a bit to achieve what I want:
select
avg(count_distinct_user_id) as average_users_per_day
from (
select
user_type,
count(distinct user_id) as count_distinct_user_id
from user_activity
group by user_type, some_date
)
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1 | a | 1 |
| 1 | a | 2 |
| 2 | a | 1 |
| 3 | a | 2 |
| 3 | a | 2 |
| 4 | b | 2 |
| 5 | b | 1 |
| 5 | b | 3 |
| 5 | b | 3 |
| 6 | c | 1 |
| 7 | c | 1 |
| 8 | c | 4 |
| 9 | c | 2 |
| 9 | c | 3 |
| 9 | c | 4 |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a | 1 | 2 |
| a | 2 | 2 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 4 | 2 |
+-----------+-----------+---------------------+
SQL is: select user_type, some_date, count(distinct user_id) from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a | 2 |
| b | 1 |
| c | 1.5 |
+-----------+---------------------+
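As a sanity check on the sample data, the two-level subquery approach described in the question can be run with Python's sqlite3 module. This is SQLite rather than Hive, so it only confirms the logic and the desired result, not Hive syntax. The alias per_day is an illustrative name, and user_type is added to the outer select so the result is labelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_activity (user_id INT, user_type TEXT, some_date INT);
INSERT INTO user_activity VALUES
  (1, 'a', 1), (1, 'a', 2), (2, 'a', 1), (3, 'a', 2), (3, 'a', 2),
  (4, 'b', 2), (5, 'b', 1), (5, 'b', 3), (5, 'b', 3),
  (6, 'c', 1), (7, 'c', 1), (8, 'c', 4), (9, 'c', 2), (9, 'c', 3), (9, 'c', 4);
""")

# Inner query: distinct users per (user_type, some_date).
# Outer query: average of those daily counts per user_type.
rows = conn.execute("""
SELECT user_type, AVG(distinct_user_count) AS average_daily_users
FROM (
    SELECT user_type, some_date, COUNT(DISTINCT user_id) AS distinct_user_count
    FROM user_activity
    GROUP BY user_type, some_date
) AS per_day
GROUP BY user_type
ORDER BY user_type
""").fetchall()

for row in rows:
    print(row)
```

The inner grouping is what produces the intermediate table shown above; the outer grouping collapses it to one row per user type.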

SQL: Subtract from consecutive rows with specific value

I have a table like the following one:
+------+-----+------+-------+
| ID | day | time | count |
+------+-----+------+-------+
| abc1 | 1 | 12 | 1 |
| abc1 | 1 | 13 | 3 |
| abc1 | 2 | 14 | 2 |
| abc2 | 2 | 18 | 4 |
| abc2 | 2 | 19 | 8 |
| abc2 | 3 | 15 | 3 |
+------+-----+------+-------+
What I want to do is subtract the current row's "count" from the next row's "count" if the ID is the same, the day is the same as in the current row, and the time is larger by a fixed value (e.g. +1).
So the new table I want to get has this layout:
+------+-----+------+-------+------------+
| ID | day | time | count | difference |
+------+-----+------+-------+------------+
| abc1 | 1 | 12 | 1 | 2 |
| abc1 | 1 | 13 | 3 | null |
| abc1 | 2 | 14 | 2 | null |
| abc2 | 2 | 18 | 4 | 4 |
| abc2 | 2 | 19 | 8 | null |
| abc2 | 3 | 15 | 3 | null |
+------+-----+------+-------+------------+
As you can see only the rows that have the same ID, day and a time difference of 1 are subtracted.
You can use the following query that makes use of LEAD window function:
SELECT ID, day, time, count,
CASE WHEN lTime - time = 1 THEN lCount - count
ELSE NULL
END as difference
FROM (
SELECT ID, day, time, count,
LEAD(time) OVER w AS lTime,
LEAD(count) OVER w AS lCount
FROM mytable
WINDOW w AS (PARTITION BY ID, day ORDER BY time) ) t
The above query uses the same window twice to fetch the time and count of the next record within the same partition. The outer query then uses these values to enforce the requirements.
Demo here
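A minimal sketch of the LEAD approach, run through Python's sqlite3 module (assuming SQLite 3.25 or later, which supports window functions and the named WINDOW clause). The aliases next_time and next_count are illustrative renames of lTime and lCount, and the count column is quoted only as a precaution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (ID TEXT, day INT, time INT, "count" INT);
INSERT INTO mytable VALUES
  ('abc1', 1, 12, 1), ('abc1', 1, 13, 3), ('abc1', 2, 14, 2),
  ('abc2', 2, 18, 4), ('abc2', 2, 19, 8), ('abc2', 3, 15, 3);
""")

# LEAD fetches the next row's time and count within the same (ID, day);
# the CASE keeps the difference only when the times are exactly 1 apart.
rows = conn.execute("""
SELECT ID, day, time, "count",
       CASE WHEN next_time - time = 1 THEN next_count - "count" END AS difference
FROM (
    SELECT ID, day, time, "count",
           LEAD(time)    OVER w AS next_time,
           LEAD("count") OVER w AS next_count
    FROM mytable
    WINDOW w AS (PARTITION BY ID, day ORDER BY time)
) AS t
ORDER BY ID, day, time
""").fetchall()

for row in rows:
    print(row)
```

Partitioning by both ID and day means a row never looks across a day boundary, so the same-day requirement is enforced by the window itself.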
After seeing your example data and expected output, I would suggest using a left join like this:
SELECT a.*,
b.count - a.count AS difference
FROM MyTable a
LEFT JOIN MyTable b
ON a.ID = b.ID
AND a.day = b.day
AND a.time = b.time - 1
AND a.count < b.count
NOTE: if two or more rows satisfy the join criteria, the query returns multiple rows.

group by top two results based on order

I have been trying to get this to work with some row_number, group by, top, sort of things, but I am missing some fundamental concept. I have a table like so:
+-------+-------+-------+
| name | ord | f_id |
+-------+-------+-------+
| a | 1 | 2 |
| b | 5 | 2 |
| c | 6 | 2 |
| d | 2 | 1 |
| e | 4 | 1 |
| a | 2 | 3 |
| c | 50 | 4 |
+-------+-------+-------+
And my desired output would be:
+-------+---------+--------+-------+
| f_id | ord_n | ord | name |
+-------+---------+--------+-------+
| 2 | 1 | 1 | a |
| 2 | 2 | 5 | b |
| 1 | 1 | 2 | d |
| 1 | 2 | 4 | e |
| 3 | 1 | 2 | a |
| 4 | 1 | 50 | c |
+-------+---------+--------+-------+
Where data is ordered by the ord value, and only up to two results per f_id. Should I be working on a Stored Procedure for this or can I just do it with SQL? I have experimented with some select TOP subqueries, but nothing has even come close.
Here are some statements to create the test table:
create table help(name varchar(255),ord tinyint,f_id tinyint);
insert into help values
('a',1,2),
('b',5,2),
('c',6,2),
('d',2,1),
('e',4,1),
('a',2,3),
('c',50,4);
You may use Rank or DENSE_RANK functions.
select A.name, A.ord_n, A.ord , A.f_id from
(
select
RANK() OVER (partition by f_id ORDER BY ord asc) AS "Rank",
ROW_NUMBER() OVER (partition by f_id ORDER BY ord asc) AS "ord_n",
help.*
from help
) A where A.rank <= 2
Sqlfiddle demo
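A sketch of the top-two-per-group pattern using Python's sqlite3 module (assuming SQLite 3.25 or later for window functions). It filters on ROW_NUMBER rather than RANK, which is a deliberate variation on the answer above: RANK can return more than two rows per f_id when ord values tie. The alias ranked is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE help (name TEXT, ord INT, f_id INT);
INSERT INTO help VALUES
  ('a', 1, 2), ('b', 5, 2), ('c', 6, 2),
  ('d', 2, 1), ('e', 4, 1),
  ('a', 2, 3), ('c', 50, 4);
""")

# Number the rows within each f_id by ascending ord, then keep the first two.
rows = conn.execute("""
SELECT f_id, ord_n, ord, name
FROM (
    SELECT h.*,
           ROW_NUMBER() OVER (PARTITION BY f_id ORDER BY ord) AS ord_n
    FROM help AS h
) AS ranked
WHERE ord_n <= 2
ORDER BY f_id, ord_n
""").fetchall()

for row in rows:
    print(row)
```

The rows come back sorted by f_id here for determinism, so the group order differs from the desired output in the question, but the rows themselves are the same six.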