SQL Windowing Aggregation Over Two Consecutive Dates

I am not an SQL expert and am finding this a bit challenging. Imagine I have the following table, but with more users:
+---------+--------+--------+-------------+
| user_id | amount | date   | sum_per_day |
+---------+--------+--------+-------------+
| user8   | 300    | 7/2/20 | 300         |
| user8   | 150    | 6/2/20 | 400         |
| user8   | 250    | 6/2/20 | 400         |
| user8   | 25     | 5/2/20 | 100         |
| user8   | 25     | 5/2/20 | 100         |
| user8   | 25     | 5/2/20 | 100         |
| user8   | 25     | 5/2/20 | 100         |
| user8   | 50     | 2/2/20 | 50          |
+---------+--------+--------+-------------+
As you see, they are grouped by user_id. Now what I'd like to do is add a column called sum_over_two_day which satisfies the following conditions:
Grouped by user_id
For each user it is grouped by the date
The sum is then calculated per two consecutive calendar days for amount (today + previous calendar day)
So the output will be this:
+---------+--------+--------+-------------+------------------+
| user_id | amount | date   | sum_per_day | sum_over_two_day |
+---------+--------+--------+-------------+------------------+
| user8   | 300    | 7/2/20 | 300         | 700              |
| user8   | 150    | 6/2/20 | 400         | 500              |
| user8   | 250    | 6/2/20 | 400         | 500              |
| user8   | 25     | 5/2/20 | 100         | 100              |
| user8   | 25     | 5/2/20 | 100         | 100              |
| user8   | 25     | 5/2/20 | 100         | 100              |
| user8   | 25     | 5/2/20 | 100         | 100              |
| user8   | 50     | 2/2/20 | 50          | 50               |
+---------+--------+--------+-------------+------------------+

The proper way is to use a window function with a RANGE clause:
SELECT user_id,
       amount,
       date,
       sum(amount) OVER (PARTITION BY user_id
                         ORDER BY date
                         RANGE BETWEEN INTERVAL '1 day' PRECEDING
                                   AND CURRENT ROW)
          AS sum_over_two_day
FROM atable
ORDER BY user_id, date;
 user_id | amount |    date    | sum_over_two_day
---------+--------+------------+------------------
 user8   |     50 | 2020-02-02 |               50
 user8   |     25 | 2020-02-05 |              100
 user8   |     25 | 2020-02-05 |              100
 user8   |     25 | 2020-02-05 |              100
 user8   |     25 | 2020-02-05 |              100
 user8   |    250 | 2020-02-06 |              500
 user8   |    150 | 2020-02-06 |              500
 user8   |    300 | 2020-02-07 |              700
(8 rows)
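If you also want the question's sum_per_day column in the same query, a second window partitioned by user and day does it; a sketch against the same atable, assuming PostgreSQL:
SELECT user_id,
       amount,
       date,
       -- per-day total: partition by user and calendar day
       sum(amount) OVER (PARTITION BY user_id, date) AS sum_per_day,
       -- two-day rolling total: today plus the previous calendar day
       sum(amount) OVER (PARTITION BY user_id
                         ORDER BY date
                         RANGE BETWEEN INTERVAL '1 day' PRECEDING
                                   AND CURRENT ROW) AS sum_over_two_day
FROM atable
ORDER BY user_id, date;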

Try this workaround for your problem:
select
    t1.user_id,
    t1.amount,
    date(t1.date_),
    (select sum(amount) from tab
     where user_id = t1.user_id and date_ = t1.date_) as sum_per_day,
    (select sum(amount) from tab
     where user_id = t1.user_id and date_ between t1.date_ - 1 and t1.date_) as sum_over_two_day
from tab t1
Or, with a window function for the first sum:
select
    t1.user_id,
    t1.amount,
    date(t1.date_),
    sum(t1.amount) over (partition by t1.user_id, t1.date_) as sum_per_day,
    (select sum(amount) from tab
     where user_id = t1.user_id and date_ between t1.date_ - 1 and t1.date_) as sum_over_two_day
from tab t1
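One caveat: t1.date_ - 1 relies on implicit date arithmetic, which works in Postgres and Oracle but not everywhere. In MySQL, for example, the rolling sub-select needs an explicit interval (same hypothetical tab table as above):
select
    t1.user_id,
    t1.amount,
    t1.date_,
    (select sum(amount) from tab
     where user_id = t1.user_id
       -- explicit interval instead of date_ - 1
       and date_ between t1.date_ - interval 1 day and t1.date_) as sum_over_two_day
from tab t1;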

Related

Query last unit price before with detail record

Sample
+---------+------------+----------+------------+
| prdt_no | order_date | quantity | unit_price |
+---------+------------+----------+------------+
| A001 | 2020-01-01 | 100 | 10 |
| A001 | 2020-01-10 | 200 | 10 |
| A001 | 2020-02-01 | 100 | 20 |
| A001 | 2020-02-05 | 100 | 20 |
| A001 | 2020-02-07 | 100 | 20 |
| A001 | 2020-02-10 | 100 | 15 |
| A002 | 2020-01-01 | 100 | 10 |
| A002 | 2020-01-10 | 200 | 10 |
| A002 | 2020-02-01 | 100 | 20 |
| A002 | 2020-02-05 | 100 | 20 |
| A002 | 2020-02-07 | 100 | 20 |
| A002 | 2020-02-10 | 100 | 15 |
+---------+------------+----------+------------+
Expected
If the query condition is order_date between 2020-02-02 and 2020-02-10, then the expected result is:
+---------+------------+----------+------------+------------------------+-----------------+-------------+-----------------------------+
| prdt_no | order_date | quantity | unit_price | last_unit_price_before | unit_price_diff | cost_reduce | last_unit_price_change_date |
+---------+------------+----------+------------+------------------------+-----------------+-------------+-----------------------------+
| A001 | 2020-02-05 | 100 | 20 | 10 | 10 | 1000 | 2020-02-01 |
| A001 | 2020-02-07 | 100 | 20 | 10 | 10 | 1000 | 2020-02-01 |
| A001 | 2020-02-10 | 100 | 15 | 20 | -5 | -500 | 2020-02-10 |
| A002 | 2020-02-05 | 100 | 20 | 10 | 10 | 1000 | 2020-02-01 |
| A002 | 2020-02-07 | 100 | 20 | 10 | 10 | 1000 | 2020-02-01 |
| A002 | 2020-02-10 | 100 | 15 | 20 | -5 | -500 | 2020-02-10 |
+---------+------------+----------+------------+------------------------+-----------------+-------------+-----------------------------+
Logic
I hope to get the same product's last unit price before the current order, then use it to calculate the price difference.
The table actually contains over 200K records.
Test demo: SQL Server 2012 (db<>fiddle)
You can use OUTER APPLY to get the last row with a different price and compute the difference:
SELECT *,
       unit_price_diff = T.[unit_price] - L.[last_unit_price_before]
FROM T
OUTER APPLY
(
    SELECT TOP 1
           last_unit_price_before = x.[unit_price],
           last_unit_price_change_date = x.[order_date]
    FROM T x
    WHERE x.[prdt_no] = T.[prdt_no]
      AND x.[order_date] < T.[order_date]
      AND x.[unit_price] <> T.[unit_price]
    ORDER BY x.[order_date] DESC
) L
WHERE T.[order_date] >= '2020-02-01'
  AND T.[order_date] <= '2020-02-10'
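The expected output also has cost_reduce; assuming cost_reduce is quantity times the price difference, which matches the sample numbers (100 x 10 = 1000, 100 x -5 = -500), the query can be extended like this (a sketch under that assumption):
SELECT T.*,
       L.last_unit_price_before,
       unit_price_diff = T.[unit_price] - L.[last_unit_price_before],
       -- assumed definition: quantity * price difference
       cost_reduce     = T.[quantity] * (T.[unit_price] - L.[last_unit_price_before]),
       L.last_unit_price_change_date
FROM T
OUTER APPLY
(
    SELECT TOP 1
           last_unit_price_before = x.[unit_price],
           last_unit_price_change_date = x.[order_date]
    FROM T x
    WHERE x.[prdt_no] = T.[prdt_no]
      AND x.[order_date] < T.[order_date]
      AND x.[unit_price] <> T.[unit_price]
    ORDER BY x.[order_date] DESC
) L
WHERE T.[order_date] >= '2020-02-01'
  AND T.[order_date] <= '2020-02-10';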

PostgreSQL: Count number of rows in table 1 for distinct rows in table 2

I am working with really big data, and at this point I have become confused; it feels like I am just repeating the same thing.
I want to count the number of trips per user from two tables, trips and session.
psql=> SELECT * FROM trips limit 10;
trip_id | session_ids | daily_user_id | seconds_start | seconds_end
---------+-----------------+---------------+---------------+-------------
400543 | {172079} | 17118 | 1575550944 | 1575551181
400542 | {172078} | 17118 | 1575541533 | 1575542171
400540 | {172077} | 17118 | 1575539001 | 1575539340
400538 | {172076} | 17117 | 1575540499 | 1575541999
400534 | {172074,172075} | 17117 | 1575537161 | 1575539711
400530 | {172073} | 17116 | 1575447043 | 1575447682
400529 | {172071} | 17115 | 1575496394 | 1575497803
400527 | {172070} | 17113 | 1575495241 | 1575496034
400525 | {172068} | 17115 | 1575485658 | 1575489378
400524 | {172067} | 17113 | 1575488721 | 1575490491
(10 rows)
psql=> SELECT * FROM session limit 10;
 session_id | user_id |           key            | start_time | daily_user_id
------------+---------+--------------------------+------------+---------------
     172079 |      43 | hLB8S7aSfp4gAFp7TykwYQ==+| 1575550921 |         17118
     172078 |      43 | YATMrL/AQ7Nu5q2dQTMT1A==+| 1575541530 |         17118
     172077 |      43 | fOLX4tqvsyFOP3DCyBZf1A==+| 1575538997 |         17118
     172076 |       7 | 88hwGj4Mqa58juy0PG/R4A==+| 1575540515 |         17117
     172075 |       7 | 1O+8X49+YbtmoEa9BlY5OQ==+| 1575538384 |         17117
     172074 |       7 | XOR7hsFCNk+soM75ZhDJyA==+| 1575537405 |         17117
     172073 |      42 | rAQWwYgqg3UMTpsBYSpIpA==+| 1575447109 |         17116
     172072 |     276 | 0xOsxRRN3Sq20VsXWjlrzQ==+| 1575511120 |         17114
     172071 |       7 | P4beN3W/ZrD+TCpZGYh23g==+| 1575496642 |         17115
     172070 |      43 | OFi30Zv9e5gmLZS5Vb+I7Q==+| 1575495238 |         17113
(10 rows)
Goal: get the distribution of trips per user
Attempt:
psql=> SELECT COUNT(distinct trip_id) as trips
, count(distinct user_id) as users
, extract(year from to_timestamp(seconds_start)) as year_date
, extract(month from to_timestamp(seconds_start)) as month_date
FROM trips
INNER JOIN session
ON session_id = ANY(session_ids)
GROUP BY year_date, month_date
ORDER BY year_date, month_date;
+-------+-------+-----------+------------+
| trips | users | year_date | month_date |
+-------+-------+-----------+------------+
| 371 | 44 | 2016 | 3 |
| 12207 | 185 | 2016 | 4 |
| 3859 | 88 | 2016 | 5 |
| 1547 | 28 | 2016 | 6 |
| 831 | 17 | 2016 | 7 |
| 427 | 4 | 2016 | 8 |
| 512 | 13 | 2016 | 9 |
| 431 | 11 | 2016 | 10 |
| 1011 | 26 | 2016 | 11 |
| 791 | 15 | 2016 | 12 |
| 217 | 8 | 2017 | 1 |
| 490 | 17 | 2017 | 2 |
| 851 | 18 | 2017 | 3 |
| 1890 | 66 | 2017 | 4 |
| 2143 | 43 | 2017 | 5 |
| . | | | |
| . | | | |
| . | | | |
+-------+-------+-----------+------------+
This result set counts users and trips per month; my intention is actually to get an analysis of trips per user, like so:
+------+-------------+
| user | no_of_trips |
+------+-------------+
| 1 | 489 |
| 2 | 400 |
| 3 | 12 |
| 4 | 102 |
| . | |
| . | |
| . | |
+------+-------------+
How do I do this, please?
You seem to just want aggregation by user_id:
SELECT s.user_id, COUNT(distinct t.trip_id) as trips
FROM trips t INNER JOIN
session s
ON s.session_id = ANY(t.session_ids)
GROUP BY s.user_id ;
If each trip maps to at most one session per user, the COUNT(DISTINCT) is unnecessary and plain COUNT(*) is cheaper. Note, however, that a trip whose session_ids array holds several sessions of the same user (like trip 400534 above) would then be counted more than once:
SELECT s.user_id, COUNT(*) as trips
FROM trips t INNER JOIN
session s
ON s.session_id = ANY(t.session_ids)
GROUP BY s.user_id ;
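And if by "distribution" you ultimately want how many users made a given number of trips, you can aggregate this result once more (a sketch building on the query above):
SELECT trips, COUNT(*) AS users
FROM (SELECT s.user_id, COUNT(DISTINCT t.trip_id) AS trips
      FROM trips t INNER JOIN
           session s
           ON s.session_id = ANY(t.session_ids)
      GROUP BY s.user_id
     ) per_user
GROUP BY trips
ORDER BY trips;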

Windowing function avg in Hive with over (order by colName)

I'm trying to understand how the windowing function avg works, and somehow it does not behave as I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated?
Since I'm not using partition by, I would expect avg(qty) to be the same for all rows.
Any ideas?
If you want the same avg(qty) for all rows, remove order by sellerid from the over clause; you will then get 19.545454545454547 for every row.
Query to get the same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
If you include order by sellerid in the over clause, a cumulative average is calculated up to and including each sellerid:
For sellerid 1 there are 2 records (2 rows so far) with qty 10, 30, so the average is (10+30)/2 = 20.0.
For sellerid 2 there are 2 records (4 rows so far) with qty 20, 20, so the average is (10+30+20+20)/4 = 20.0.
For sellerid 3 there are 5 records (9 rows so far) with qty 10, 10, 15, 20, 30, so the average is (10+30+20+20+10+10+15+20+30)/9 = 18.333.
For sellerid 4 the average covers all 11 rows: 19.545454545454547.
This is expected behavior in Hive (and standard SQL): with order by in the over clause, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes all peer rows that share the current sellerid value.
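Spelled out, that implicit frame looks like this (a standard-SQL sketch; support for explicit RANGE frames varies across Hive versions, so treat it as illustration):
select salesid, sellerid, qty,
       -- the frame below is what "over (order by sellerid)" means implicitly
       avg(qty) over (order by sellerid
                      range between unbounded preceding and current row) as avg_qty
from winsales
order by sellerid, salesid;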

Getting average salary of last 3 salaries

+------------+--------+-----------+---------------+
| paydate | salary | ninumber | payrollnumber |
+------------+--------+-----------+---------------+
| 2015-05-15 | 1000 | jh330954b | 6 |
| 2015-04-15 | 1250 | jh330954b | 5 |
| 2015-03-15 | 800 | jh330954b | 4 |
| 2015-02-15 | 894 | jh330954b | 3 |
| 2015-05-15 | 500 | ew56780e | 6 |
| 2015-04-15 | 1500 | ew56780e | 5 |
| 2015-03-15 | 2500 | ew56780e | 4 |
| 2015-02-15 | 3000 | ew56780e | 3 |
| 2015-05-15 | 400 | rt321298z | 6 |
| 2015-04-15 | 582 | rt321298z | 5 |
| 2015-03-15 | 123 | rt321298z | 4 |
| 2015-02-15 | 659 | rt321298z | 3 |
+------------+--------+-----------+---------------+
The above list is the data in my database. I need to get the average of the previous 3 salaries for each individual and output this.
I don't know where to begin with this so I cannot provide any of my working so far.
In SQL Server, you can use row_number() to get the last three salaries in a subquery. Then use avg():
select ninumber, avg(salary) as avg_salary
from (select t.*,
             row_number() over (partition by ninumber order by payrollnumber desc) as seqnum
      from table t
     ) t
where seqnum <= 3
group by ninumber;
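For the sample data this gives jh330954b = (1000+1250+800)/3, ew56780e = (500+1500+2500)/3 and rt321298z = (400+582+123)/3. Two caveats: table above is a placeholder that must be replaced with your real table name, and AVG over an integer column in SQL Server does integer division, so cast if you need decimals. A sketch with a hypothetical table name payments:
select ninumber,
       -- cast so the average keeps its decimal part
       avg(cast(salary as decimal(10, 2))) as avg_salary
from (select p.*,
             row_number() over (partition by ninumber
                                order by payrollnumber desc) as seqnum
      from payments p   -- hypothetical table name
     ) t
where seqnum <= 3
group by ninumber;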

Rolling total with no sub-select and no vendor specific extensions

What I'm trying to achieve: rolling total for quantity and amount for a given day, grouped by hour.
It's easy in most cases, but if you have some additional columns (dir and product in my case) and you don't want to group/filter on them, that's a problem.
I know there are extensions in Oracle and MSSQL specifically for that, and there are window functions (SELECT ... OVER (PARTITION BY ...)) in Postgres.
At the moment I'm working on an app prototype, and it's backed by MySQL, and I have no idea what it will be using in production, so I'm trying to avoid vendor lock-in.
The entire table:
> SELECT id, dir, product, date, hour, quantity, amount FROM sales
ORDER BY date, hour;
+------+-----+---------+------------+------+----------+--------+
| id | dir | product | date | hour | quantity | amount |
+------+-----+---------+------------+------+----------+--------+
| 2230 | 65 | ABCDEDF | 2014-09-11 | 1 | 1 | 10 |
| 2231 | 64 | ABCDEDF | 2014-09-11 | 3 | 4 | 40 |
| 2232 | 64 | ABCDEDF | 2014-09-11 | 5 | 5 | 50 |
| 2235 | 64 | ZZ | 2014-09-11 | 7 | 6 | 60 |
| 2233 | 64 | ABCDEDF | 2014-09-11 | 7 | 6 | 60 |
| 2237 | 66 | ABCDEDF | 2014-09-11 | 7 | 6 | 60 |
| 2234 | 64 | ZZ | 2014-09-18 | 3 | 1 | 11 |
| 2236 | 66 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2227 | 64 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2228 | 64 | ABCDEDF | 2014-09-18 | 5 | 2 | 200 |
| 2229 | 64 | ABCDEDF | 2014-09-18 | 7 | 3 | 300 |
+------+-----+---------+------------+------+----------+--------+
For a given date:
> SELECT id, dir, product, date, hour, quantity, amount FROM sales
WHERE date = '2014-09-18'
ORDER BY hour;
+------+-----+---------+------------+------+----------+--------+
| id | dir | product | date | hour | quantity | amount |
+------+-----+---------+------------+------+----------+--------+
| 2227 | 64 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2236 | 66 | ABCDEDF | 2014-09-18 | 3 | 1 | 100 |
| 2234 | 64 | ZZ | 2014-09-18 | 3 | 1 | 11 |
| 2228 | 64 | ABCDEDF | 2014-09-18 | 5 | 2 | 200 |
| 2229 | 64 | ABCDEDF | 2014-09-18 | 7 | 3 | 300 |
+------+-----+---------+------------+------+----------+--------+
The results that I need, using sub-select:
> SELECT date, hour, SUM(quantity),
( SELECT SUM(quantity) FROM sales s2
WHERE s2.hour <= s1.hour AND s2.date = s1.date
) AS total
FROM sales s1
WHERE s1.date = '2014-09-18'
GROUP by date, hour;
+------------+------+---------------+-------+
| date | hour | sum(quantity) | total |
+------------+------+---------------+-------+
| 2014-09-18 | 3 | 3 | 3 |
| 2014-09-18 | 5 | 2 | 5 |
| 2014-09-18 | 7 | 3 | 8 |
+------------+------+---------------+-------+
My concerns for using sub-select:
once there are around a million records in the table, the query may become too slow; I'm not sure whether it is subject to optimization, even though it has no HAVING clauses.
if I have to filter on a product or dir, I will have to put those conditions in both the main SELECT and the sub-SELECT (WHERE product = / WHERE dir =).
a sub-select can only compute a single sum, while I need two of them (sum(quantity) and sum(amount)); a two-column subquery fails with ERROR 1241 (21000): Operand should contain 1 column(s).
The closest result I was able to get uses a JOIN:
> SELECT DISTINCT(s1.hour) AS ih, s2.date, s2.hour, s2.quantity, s2.amount, s2.id
FROM sales s1
JOIN sales s2 ON s2.date = s1.date AND s2.hour <= s1.hour
WHERE s1.date = '2014-09-18'
ORDER by ih;
+----+------------+------+----------+--------+------+
| ih | date | hour | quantity | amount | id |
+----+------------+------+----------+--------+------+
| 3 | 2014-09-18 | 3 | 1 | 100 | 2236 |
| 3 | 2014-09-18 | 3 | 1 | 100 | 2227 |
| 3 | 2014-09-18 | 3 | 1 | 11 | 2234 |
| 5 | 2014-09-18 | 3 | 1 | 100 | 2236 |
| 5 | 2014-09-18 | 3 | 1 | 100 | 2227 |
| 5 | 2014-09-18 | 5 | 2 | 200 | 2228 |
| 5 | 2014-09-18 | 3 | 1 | 11 | 2234 |
| 7 | 2014-09-18 | 3 | 1 | 100 | 2236 |
| 7 | 2014-09-18 | 3 | 1 | 100 | 2227 |
| 7 | 2014-09-18 | 5 | 2 | 200 | 2228 |
| 7 | 2014-09-18 | 7 | 3 | 300 | 2229 |
| 7 | 2014-09-18 | 3 | 1 | 11 | 2234 |
+----+------------+------+----------+--------+------+
I could stop here and just use those results: group by ih (hour), calculate the sums for quantity and amount, and be happy. But something tells me that this is wrong.
If I remove DISTINCT, most rows become duplicated. Replacing the JOIN with its variants doesn't help.
Once I remove s2.id from the statement, I get a complete mess, with meaningful rows disappearing or collapsing (e.g. ids 2236/2227 get collapsed):
> SELECT DISTINCT(s1.hour) AS ih, s2.date, s2.hour, s2.quantity, s2.amount
FROM sales s1
JOIN sales s2 ON s2.date = s1.date AND s2.hour <= s1.hour
WHERE s1.date = '2014-09-18'
ORDER by ih;
+----+------------+------+----------+--------+
| ih | date | hour | quantity | amount |
+----+------------+------+----------+--------+
| 3 | 2014-09-18 | 3 | 1 | 100 |
| 3 | 2014-09-18 | 3 | 1 | 11 |
| 5 | 2014-09-18 | 3 | 1 | 100 |
| 5 | 2014-09-18 | 5 | 2 | 200 |
| 5 | 2014-09-18 | 3 | 1 | 11 |
| 7 | 2014-09-18 | 3 | 1 | 100 |
| 7 | 2014-09-18 | 5 | 2 | 200 |
| 7 | 2014-09-18 | 7 | 3 | 300 |
| 7 | 2014-09-18 | 3 | 1 | 11 |
+----+------------+------+----------+--------+
Summing doesn't help; it only adds to the mess.
The first row (hour = 3) should have SUM(s2.quantity) equal to 3, but it shows 9. The reason is that the join multiplies rows before aggregation: each of the three s1 rows at hour 3 matches all three s2 rows at hour <= 3, so the three s2 quantities are each counted three times, and SUM(s1.quantity) is inflated the same way.
> SELECT DISTINCT(s1.hour) AS hour, sum(s1.quantity), s2.date, SUM(s2.quantity)
FROM sales s1 JOIN sales s2 ON s2.date = s1.date AND s2.hour <= s1.hour
WHERE s1.date = '2014-09-18'
GROUP BY hour;
+------+------------------+------------+------------------+
| hour | sum(s1.quantity) | date | sum(s2.quantity) |
+------+------------------+------------+------------------+
| 3 | 9 | 2014-09-18 | 9 |
| 5 | 8 | 2014-09-18 | 5 |
| 7 | 15 | 2014-09-18 | 8 |
+------+------------------+------------+------------------+
Bonus points/boss level:
I also need a column that will show total_reference, the same rolling total for the same periods for a different date (e.g. 2014-09-11).
If you want a cumulative sum in MySQL, the most efficient way is to use variables:
SELECT date, hour,
       (@q := q + @q) as cumeq, (@a := a + @a) as cumea
FROM (SELECT date, hour, SUM(quantity) as q, SUM(amount) as a
      FROM sales s
      WHERE s.date = '2014-09-18'
      GROUP BY date, hour
     ) dh CROSS JOIN
     (SELECT @q := 0, @a := 0) vars
ORDER BY date, hour;
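For what it's worth, MySQL 8.0 and MariaDB 10.2 added standard window functions, so if production ends up on a recent MySQL the portable form works there too; a sketch using a window sum over the grouped sums:
SELECT date, hour,
       SUM(quantity) AS q,
       SUM(amount) AS a,
       -- running totals per day, in hour order
       SUM(SUM(quantity)) OVER (PARTITION BY date ORDER BY hour) AS cumeq,
       SUM(SUM(amount))   OVER (PARTITION BY date ORDER BY hour) AS cumea
FROM sales
WHERE date = '2014-09-18'
GROUP BY date, hour
ORDER BY date, hour;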
If you are planning on working with databases such as Oracle, SQL Server, and Postgres, then you should develop against a database that is closer in functionality and supports the ANSI-standard window functions. The right way to do this is with window functions, but MySQL did not support them before version 8.0. Postgres, SQL Server, and Oracle all have free versions that you can use for development purposes.
Also, with proper indexing, you shouldn't have a problem with the subquery approach, even on large tables.
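"Proper indexing" here would mean at least a composite index on the filter and correlation columns, something like the following (a sketch; adjust names to your schema, and include quantity and amount if you want the index to cover the rolling query entirely):
-- lets the per-day, per-hour lookups in the correlated subquery use the index
CREATE INDEX idx_sales_date_hour ON sales (date, hour);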