How to fill NULL values using min-max in SQL Server?

I want to fill the leading and trailing NULL values in each group using the group's minimum and maximum price.
Here is my sample table.
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=322aafeb7970e25f9a85d0cd2f6c00d6
For example, take the temp01 table below.
I want to fill the NULL values at the start and end of each [cat01, cat02] group in temp01.
So the leading values
price = [NULL, NULL, 13000]
become
price = [12000, 12000, 13000]
because 12000 is the minimum price in the [230, 1] group. Likewise, the trailing NULLs are filled with the maximum price in the same [cat01, cat02] group.
| SEQ | cat01 | cat02 | dt_day | price |
+-----+-------+-------+------------+-------+
| 1 | 230 | 1 | 2019-01-01 | NULL |
| 2 | 230 | 1 | 2019-01-02 | NULL |
| 3 | 230 | 1 | 2019-01-03 | 13000 |
...
| 11 | 230 | 1 | 2019-01-11 | NULL |
| 12 | 230 | 1 | 2019-01-12 | NULL |
| 1 | 230 | 2 | 2019-01-01 | NULL |
| 2 | 230 | 2 | 2019-01-02 | NULL |
| 3 | 230 | 2 | 2019-01-03 | 12000 |
...
| 12 | 230 | 2 | 2019-01-11 | NULL |
| 13 | 230 | 2 | 2019-01-12 | NULL |
[result]
| SEQ | cat01 | cat02 | dt_day | price |
+-----+-------+-------+------------+-------+
| 1 | 230 | 1 | 2019-01-01 | 12000 | --START
| 2 | 230 | 1 | 2019-01-02 | 12000 |
| 3 | 230 | 1 | 2019-01-03 | 13000 |
| 4 | 230 | 1 | 2019-01-04 | 12000 |
| 5 | 230 | 1 | 2019-01-05 | NULL |
| 6 | 230 | 1 | 2019-01-06 | NULL |
| 7 | 230 | 1 | 2019-01-07 | 19000 |
| 8 | 230 | 1 | 2019-01-08 | 20000 |
| 9 | 230 | 1 | 2019-01-09 | 21500 |
| 10 | 230 | 1 | 2019-01-10 | 21500 |
| 11 | 230 | 1 | 2019-01-11 | 21500 |
| 12 | 230 | 1 | 2019-01-12 | 21500 |
| 13 | 230 | 1 | 2019-01-13 | 21500 | --END
| 1 | 230 | 2 | 2019-01-01 | 12000 | --START
| 2 | 230 | 2 | 2019-01-02 | 12000 |
| 3 | 230 | 2 | 2019-01-03 | 12000 |
| 4 | 230 | 2 | 2019-01-04 | 17000 |
| 5 | 230 | 2 | 2019-01-05 | 22000 |
| 6 | 230 | 2 | 2019-01-06 | NULL |
| 7 | 230 | 2 | 2019-01-07 | 23000 |
| 8 | 230 | 2 | 2019-01-08 | 23200 |
| 9 | 230 | 2 | 2019-01-09 | NULL |
| 10 | 230 | 2 | 2019-01-10 | 24000 |
| 11 | 230 | 2 | 2019-01-11 | 24000 |
| 12 | 230 | 2 | 2019-01-12 | 24000 |
| 13 | 230 | 2 | 2019-01-13 | 24000 | --END
Please let me know a good way to fill the NULLs using linear relationships.

Find the min() and max() of price, grouped by cat01, cat02.
Also find the min and max seq among the rows where price is not null.
After that, simply inner join the result to your table and update the rows where price is null.
with val as
(
    select cat01, cat02,
           min_price = min(price),
           max_price = max(price),
           min_seq = min(case when price is not null then seq end),
           max_seq = max(case when price is not null then seq end)
    from temp01
    group by cat01, cat02
)
update t
set price = case when t.seq < v.min_seq then min_price
                 when t.seq > v.max_seq then max_price
            end
FROM temp01 t
inner join val v on t.cat01 = v.cat01
                and t.cat02 = v.cat02
where t.price is null
dbfiddle
EDIT : returning the price as a new column in SELECT query
with val as
(
    select cat01, cat02, min_price = min(price), max_price = max(price),
           min_seq = min(case when price is not null then seq end),
           max_seq = max(case when price is not null then seq end)
    from temp01
    group by cat01, cat02
)
select t.*,
       new_price = coalesce(t.price,
                            case when t.seq < v.min_seq then min_price
                                 when t.seq > v.max_seq then max_price
                            end)
FROM temp01 t
left join val v on t.cat01 = v.cat01
               and t.cat02 = v.cat02
Updated dbfiddle
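To make the edge-filling rule easier to check by hand, here is a minimal Python sketch of the same logic. The function name fill_edges and the dict layout are assumptions made for this illustration and are not part of the answer above; it mirrors the val CTE (min/max price and min/max seq of the non-NULL rows per group) and leaves NULLs in the middle of a group untouched.

```python
from collections import defaultdict

def fill_edges(rows):
    """rows: list of dicts with keys seq, cat01, cat02, price (None stands for NULL)."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["cat01"], r["cat02"])].append(r)

    out = []
    for grp in groups.values():
        known = [r for r in grp if r["price"] is not None]
        if not known:
            # Nothing to compare against: keep the group unchanged.
            out.extend(dict(r) for r in grp)
            continue
        min_price = min(r["price"] for r in known)
        max_price = max(r["price"] for r in known)
        min_seq = min(r["seq"] for r in known)
        max_seq = max(r["seq"] for r in known)
        for r in grp:
            new = dict(r)
            if r["price"] is None:
                if r["seq"] < min_seq:      # leading NULL
                    new["price"] = min_price
                elif r["seq"] > max_seq:    # trailing NULL
                    new["price"] = max_price
            out.append(new)
    return out

# Hypothetical sample: one group with leading, interior and trailing NULLs.
raw = [None, None, 13000, 12000, None, 21500, None]
sample = [{"seq": i + 1, "cat01": 230, "cat02": 1, "price": p}
          for i, p in enumerate(raw)]
prices = [r["price"] for r in fill_edges(sample)]
print(prices)  # [12000, 12000, 13000, 12000, None, 21500, 21500]
```

The interior None stays as it is, which matches the CASE expression having no ELSE branch.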

Related

cumulative amount to current_date

base_table
month id sales cumulative_sales
2021-01-01 33205 10 10
2021-02-01 33205 15 25
Based on the base table above, I would like to add more rows up to the current month,
even if there is no sales.
Expected table
month id sales cumulative_sales
2021-01-01 33205 10 10
2021-02-01 33205 15 25
2021-03-01 33205 0 25
2021-04-01 33205 0 25
2021-05-01 33205 0 25
.........
2021-11-01 33205 0 25
My query stops at
select month, id, sales,
sum(sales) over (partition by id
order by month
rows between unbounded preceding and current row) as cumulative_sales
from base_table
This works, assuming the month column is constrained to hold only "first of the month" dates. Use the desired hard-coded start date, or use another CTE to get the earliest date from base_table:
with base_table as (
select *
from (values
('2021-01-01'::date,33205,10)
,('2021-02-01' ,33205,15)
,('2021-01-01' ,12345,99)
,('2021-04-01' ,12345,88)
) dat("month",id,sales)
)
select cal.dt::date
,list.id
,coalesce(dat.sales,0) as sales
,coalesce(sum(dat.sales) over (partition by list.id order by cal.dt),0) as cumulative_sales
from generate_series('2020-06-01' /* use desired start date here */,current_date,'1 month') cal(dt)
cross join (select distinct id from base_table) list
left join base_table dat on dat."month" = cal.dt and dat.id = list.id
;
Results:
| dt | id | sales | cumulative_sales |
+------------+-------+-------+------------------+
| 2020-06-01 | 12345 | 0 | 0 |
| 2020-07-01 | 12345 | 0 | 0 |
| 2020-08-01 | 12345 | 0 | 0 |
| 2020-09-01 | 12345 | 0 | 0 |
| 2020-10-01 | 12345 | 0 | 0 |
| 2020-11-01 | 12345 | 0 | 0 |
| 2020-12-01 | 12345 | 0 | 0 |
| 2021-01-01 | 12345 | 99 | 99 |
| 2021-02-01 | 12345 | 0 | 99 |
| 2021-03-01 | 12345 | 0 | 99 |
| 2021-04-01 | 12345 | 88 | 187 |
| 2021-05-01 | 12345 | 0 | 187 |
| 2021-06-01 | 12345 | 0 | 187 |
| 2021-07-01 | 12345 | 0 | 187 |
| 2021-08-01 | 12345 | 0 | 187 |
| 2021-09-01 | 12345 | 0 | 187 |
| 2021-10-01 | 12345 | 0 | 187 |
| 2021-11-01 | 12345 | 0 | 187 |
| 2020-06-01 | 33205 | 0 | 0 |
| 2020-07-01 | 33205 | 0 | 0 |
| 2020-08-01 | 33205 | 0 | 0 |
| 2020-09-01 | 33205 | 0 | 0 |
| 2020-10-01 | 33205 | 0 | 0 |
| 2020-11-01 | 33205 | 0 | 0 |
| 2020-12-01 | 33205 | 0 | 0 |
| 2021-01-01 | 33205 | 10 | 10 |
| 2021-02-01 | 33205 | 15 | 25 |
| 2021-03-01 | 33205 | 0 | 25 |
| 2021-04-01 | 33205 | 0 | 25 |
| 2021-05-01 | 33205 | 0 | 25 |
| 2021-06-01 | 33205 | 0 | 25 |
| 2021-07-01 | 33205 | 0 | 25 |
| 2021-08-01 | 33205 | 0 | 25 |
| 2021-09-01 | 33205 | 0 | 25 |
| 2021-10-01 | 33205 | 0 | 25 |
| 2021-11-01 | 33205 | 0 | 25 |
The cross join pairs every date output by generate_series() with every id value from base_table.
The left join ensures that no dt+id pairs get dropped from the output when no such record exists in base_table.
The coalesce() functions ensure that the sales and cumulative_sales show 0 instead of null for dt+id combinations that don't exist in base_table. Remove them if you don't mind seeing nulls in those columns.
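The three ingredients above (calendar, cross join with ids, left join with a zero default) can also be traced in a short Python sketch. The names month_range and cumulative_by_month are hypothetical, the end date is passed in explicitly instead of using the current date so the output is reproducible, and each (month, id) pair is assumed to appear at most once in the input.

```python
from datetime import date

def month_range(start, end):
    """Yield first-of-month dates from start through end, inclusive."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield date(y, m, 1)
        m += 1
        if m > 12:
            y, m = y + 1, 1

def cumulative_by_month(base, start, end):
    """base: list of (month, id, sales) tuples."""
    sales = {(month, id_): s for month, id_, s in base}
    ids = sorted({id_ for _, id_, _ in base})
    out = []
    for id_ in ids:                            # "cross join" of ids with the calendar
        running = 0
        for dt in month_range(start, end):
            s = sales.get((dt, id_), 0)        # "left join" + coalesce(..., 0)
            running += s                       # running sum per id
            out.append((dt, id_, s, running))
    return out

base_table = [(date(2021, 1, 1), 33205, 10), (date(2021, 2, 1), 33205, 15)]
result = cumulative_by_month(base_table, date(2021, 1, 1), date(2021, 4, 1))
for row in result:
    print(row)
```

For the two rows from the question this yields cumulative totals of 10, 25, 25, 25 for January through April 2021.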

PostgreSQL: Comparing two tables and comparing the result with a third table

TABLE 2 : trip_delivery_sales_lines
+-------+---------------------+------------+----------+------------+-------------+--------+
| Sl no | Order_date | Partner_id | Route_id | Product_id | Product qty | amount |
+-------+---------------------+------------+----------+------------+-------------+--------+
| 1 | 2020-08-01 04:25:35 | 34567 | 152 | 432 | 2 | 100 |
| 2 | 2021-09-11 02:25:35 | 34572 | 130 | 312 | 4 | 150 |
| 3 | 2020-05-10 04:25:35 | 34567 | 152 | 432 | 3 | 123 |
| 4 | 2021-02-16 01:10:35 | 34572 | 130 | 432 | 5 | 123 |
| 5 | 2020-02-19 01:10:35 | 34567 | 152 | 432 | 2 | 600 |
| 6 | 2021-03-20 01:10:35 | 34569 | 152 | 123 | 1 | 123 |
| 7 | 2021-04-23 01:10:35 | 34570 | 152 | 432 | 4 | 200 |
| 8 | 2021-07-08 01:10:35 | 34567 | 152 | 432 | 3 | 32 |
| 9 | 2019-06-28 01:10:35 | 34570 | 152 | 432 | 2 | 100 |
| 10 | 2018-11-14 01:10:35 | 34570 | 152 | 432 | 5 | 20 |
+-------+---------------------+------------+----------+------------+-------------+--------+
From table 2, we have to find the partners on route 152 and, for each of them, the sum of product_qty over the last 2 sales (selected by order_date descending). The result is shown in table 3.
34567 – Serial number [ 1,8]
34570 – Serial number [ 7,9]
34569 – Serial number [6]
TABLE 3 : RESULT OBTAINED FROM TABLE 1,2
+------------+-------+
| Partner_id | count |
+------------+-------+
| 34567 | 5 |
| 34569 | 1 |
| 34570 | 6 |
+------------+-------+
From table 4, we want to find the leaf count for each of the partner_ids above.
TABLE 4 :coupon_leaf
+------------+-------+
| Partner_id | Leaf |
+------------+-------+
| 34567 | XYZ1 |
| 34569 | XYZ2 |
| 34569 | DDHC |
| 34567 | DVDV |
| 34570 | DVFDV |
| 34576 | FVFV |
| 34567 | FVV |
+------------+-------+
From that we get the following result:
34567 – 3
34569 – 2
34570 – 1
TABLE 5: result obtained from TABLE 4
+------------+-------+
| Partner_id | count |
+------------+-------+
| 34567 | 3 |
| 34569 | 2 |
| 34570 | 1 |
+------------+-------+
Now we want to compare tables 3 and 5:
If partner_id count [table 3] > partner_id count [table 5]
Print partner_id
I want a single query that does all of these operations.
The distinct partner_id values can be found from table 1 with:
SELECT DISTINCT partner_id
FROM trip_delivery_sales ts
WHERE ts.route_id='152'
GROUP BY ts.partner_id
This answers the original version of the problem.
You seem to want to compare totals after aggregating tables 2 and 3. I don't know what table1 is for. It doesn't seem to do anything.
So:
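select *
from (select tdsl.partner_id, sum(tdsl.product_qty) as sum_quantity  -- product_qty: column name as given in the question
      from (select tdsl.*,
                   -- number each partner's sales from newest to oldest
                   row_number() over (partition by tdsl.partner_id order by tdsl.order_date desc) as seqnum
            from trip_delivery_sales_lines tdsl
            where tdsl.route_id = '152'  -- only partners on route 152
           ) tdsl
      where seqnum <= 2  -- keep the last 2 sales per partner
      group by tdsl.partner_id
     ) tdsl left join
     (select cl.partner_id, count(*) as leaf_cnt
      from coupon_leaf cl
      group by cl.partner_id
     ) cl
     on cl.partner_id = tdsl.partner_id
where leaf_cnt is null or sum_quantity > leaf_cnt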
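As a way to sanity-check the expected output against the sample tables, the same steps can be sketched in Python: keep route 152, take the last 2 sales per partner by order date, sum the quantities, and compare with the leaf count. The function name partners_over_leaf_count and the tuple layout are assumptions for this illustration, and a partner with no coupon leaves is treated as having a count of 0.

```python
from collections import Counter, defaultdict

# (order_date, partner_id, route_id, product_qty), copied from table 2
sales_lines = [
    ("2020-08-01 04:25:35", 34567, 152, 2),
    ("2021-09-11 02:25:35", 34572, 130, 4),
    ("2020-05-10 04:25:35", 34567, 152, 3),
    ("2021-02-16 01:10:35", 34572, 130, 5),
    ("2020-02-19 01:10:35", 34567, 152, 2),
    ("2021-03-20 01:10:35", 34569, 152, 1),
    ("2021-04-23 01:10:35", 34570, 152, 4),
    ("2021-07-08 01:10:35", 34567, 152, 3),
    ("2019-06-28 01:10:35", 34570, 152, 2),
    ("2018-11-14 01:10:35", 34570, 152, 5),
]
# partner_id column of table 4 (coupon_leaf)
coupon_leaf = [34567, 34569, 34569, 34567, 34570, 34576, 34567]

def partners_over_leaf_count(lines, leaves, route_id=152, last_n=2):
    by_partner = defaultdict(list)
    for order_date, partner_id, route, qty in lines:
        if route == route_id:
            by_partner[partner_id].append((order_date, qty))
    leaf_cnt = Counter(leaves)
    result = []
    for partner_id, orders in sorted(by_partner.items()):
        # Timestamps in this format sort chronologically as plain strings.
        orders.sort(reverse=True)
        qty_sum = sum(qty for _, qty in orders[:last_n])
        if qty_sum > leaf_cnt.get(partner_id, 0):
            result.append(partner_id)
    return result

winners = partners_over_leaf_count(sales_lines, coupon_leaf)
print(winners)  # [34567, 34570]
```

Partner 34569 drops out because its quantity sum (1) does not exceed its leaf count (2).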

How to get interpolation value in SQL Server?

I want to get interpolation value for NULL. Interpolation is a statistical method by which related known values are used to estimate an unknown price or potential yield of a security. Interpolation is achieved by using other established values that are located in sequence with the unknown value.
Here is my sample table and code.
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=673fcd5bc250bd272e8b6da3d0eddb90
I want to get this result:
| SEQ | cat01 | cat02 | dt_day | price | coeff |
+-----+-------+-------+------------+-------+--------+
| 1 | 230 | 1 | 2019-01-01 | 16000 | 0 |
| 2 | 230 | 1 | 2019-01-02 | NULL | 1 |
| 3 | 230 | 1 | 2019-01-03 | 13000 | 0 |
| 4 | 230 | 1 | 2019-01-04 | NULL | 1 |
| 5 | 230 | 1 | 2019-01-05 | NULL | 2 |
| 6 | 230 | 1 | 2019-01-06 | NULL | 3 |
| 7 | 230 | 1 | 2019-01-07 | 19000 | 0 |
| 8 | 230 | 1 | 2019-01-08 | 20000 | 0 |
| 9 | 230 | 1 | 2019-01-09 | 21500 | 0 |
| 10 | 230 | 1 | 2019-01-10 | 21500 | 0 |
| 11 | 230 | 1 | 2019-01-11 | NULL | 1 |
| 12 | 230 | 1 | 2019-01-12 | NULL | 2 |
| 13 | 230 | 1 | 2019-01-13 | 23000 | 0 |
| 1 | 230 | 2 | 2019-01-01 | NULL | 1 |
| 2 | 230 | 2 | 2019-01-02 | NULL | 2 |
| 3 | 230 | 2 | 2019-01-03 | 12000 | 0 |
| 4 | 230 | 2 | 2019-01-04 | 17000 | 0 |
| 5 | 230 | 2 | 2019-01-05 | 22000 | 0 |
| 6 | 230 | 2 | 2019-01-06 | NULL | 1 |
| 7 | 230 | 2 | 2019-01-07 | 23000 | 0 |
| 8 | 230 | 2 | 2019-01-08 | 23200 | 0 |
| 9 | 230 | 2 | 2019-01-09 | NULL | 1 |
| 10 | 230 | 2 | 2019-01-10 | NULL | 2 |
| 11 | 230 | 2 | 2019-01-11 | NULL | 3 |
| 12 | 230 | 2 | 2019-01-12 | NULL | 4 |
| 13 | 230 | 2 | 2019-01-13 | 23000 | 0 |
I use the code below, but I think it is incorrect.
coeff is the position of each NULL within its run of consecutive NULLs.
The code is meant to implement interpolation.
I tried to take the difference between the known values around a gap and divide it by the number of rows in the gap.
But this code is incorrect.
WITH ROW_VALUE AS
(
SELECT SEQ
, dt_day
, cat01
, cat02
, price
, ROW_NUMBER() OVER (ORDER BY dt_day) AS sub_seq
FROM (
SELECT SEQ
, cat01
, cat02
, dt_day
, dt_week
, dt_month
, price
FROM temp01
WHERE price IS NOT NULL
)val
)
,STEP_CHANGE AS(
SELECT RV1.SEQ AS id_Start
, RV1.SEQ - 1 AS id_End
, RV1.cat01
, RV1.cat02
, RV1.dt_day
, RV1.price
, (RV2.price - RV1.price)/(RV2.SEQ - RV1.SEQ) AS change1
FROM ROW_VALUE RV1
LEFT JOIN ROW_VALUE RV2 ON RV1.cat01 = RV2.cat01
AND RV1.cat02 = RV2.cat02
AND RV1.SEQ = RV2.SEQ - 1
)
SELECT *
FROM STEP_CHANGE
ORDER BY cat01, cat02, dt_day
Please let me know a good way to fill the NULLs using linear relationships.
If there is another good way, please recommend it.
If I assume that you mean linear interpolation between the previous price and the next price based on the number of days that passed, then you can use the following method:
Use window functions to get the next and previous days with prices for each row.
Use window functions or joins to get the prices on those days as well.
Use arithmetic to calculate the linear interpolation.
Your SQL Fiddle uses SQL Server, so I assume that is the database you are using. The code looks like this:
select t.*,
coalesce(t.price,
(tprev.price +
(tnext.price - tprev.price) / datediff(day, prev_price_day, next_price_day) *
datediff(day, t.prev_price_day, t.dt_day)
)
) as imputed_price
from (select t.*,
max(case when price is not null then dt_day end) over (partition by cat01, cat02 order by dt_day asc) as prev_price_day,
min(case when price is not null then dt_day end) over (partition by cat01, cat02 order by dt_day desc) as next_price_day
from temp01 t
) t left join
temp01 tprev
on tprev.cat01 = t.cat01 and
tprev.cat02 = t.cat02 and
tprev.dt_day = t.prev_price_day left join
temp01 tnext
on tnext.cat01 = t.cat01 and
tnext.cat02 = t.cat02 and
tnext.dt_day = t.next_price_day
order by cat01, cat02, dt_day;
Here is a db<>fiddle.
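For readers who want to verify the arithmetic, here is a minimal Python sketch of the same linear interpolation for a single (cat01, cat02) group. The function name interpolate is hypothetical, and the sketch assumes one row per consecutive day, so the list index difference equals the day difference (as in the sample data, where SEQ and dt_day advance together). Leading and trailing gaps have no neighbour on one side and are left as None.

```python
def interpolate(prices):
    """prices: list ordered by day, one entry per consecutive day; None stands for NULL."""
    out = list(prices)
    known = [i for i, p in enumerate(prices) if p is not None]
    # Walk over each pair of neighbouring known prices and fill the gap between them.
    for a, b in zip(known, known[1:]):
        step = (prices[b] - prices[a]) / (b - a)   # price change per day
        for i in range(a + 1, b):
            out[i] = prices[a] + step * (i - a)
    return out

# First seven days of the [230, 1] group from the question.
filled = interpolate([16000, None, 13000, None, None, None, 19000])
print(filled)  # [16000, 14500.0, 13000, 14500.0, 16000.0, 17500.0, 19000]
```

The division here is floating-point. In the SQL version, if price is an integer column, the division may truncate, so casting to a decimal type first may be worth considering.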

windowing function avg in Hive with - over (order by colName)

I'm trying to understand how the windowing function avg works, and it does not seem to work the way I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated?
Since I'm not using partition by, I would expect avg(qty) to be the same for all rows.
Any ideas?
If you want the same avg(qty) for all rows, remove order by sellerid from the over clause; every row then gets the value 19.545454545454547.
Query to get the same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
If you include order by sellerid in the over clause, you get a cumulative avg calculated up to and including each sellerid. With an order by and no explicit frame, the window defaults to range between unbounded preceding and current row, so rows with the same sellerid are peers and share one value.
That is:
sellerid 1 has 2 records, 2 records in total so far, with qty 10, 30, so the avg is
(10+30)/2 = 20.0
sellerid 2 has 2 records, 4 records in total so far, with qty 20, 20, so the avg is
(10+30+20+20)/4 = 20.0
sellerid 3 has 5 records, 9 records in total so far, with qty 10, 10, 15, 20, 30, so the avg is
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4 has 2 records, 11 records in total, so the avg is 215/11 = 19.545454545454547
This is the expected behavior in Hive when the over clause includes order by.
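The per-sellerid numbers above can be reproduced with a small Python sketch that emulates the frame used when order by is present: every row up to and including the current sellerid. The function name running_avg_with_peers is an assumption for this illustration; the data is the (sellerid, qty) pairs from the winsales table.

```python
def running_avg_with_peers(rows):
    """rows: list of (sellerid, qty). Returns {sellerid: cumulative average}.

    Emulates avg(qty) over (order by sellerid): the frame for a row holds
    all rows whose sellerid is less than or equal to that row's sellerid.
    """
    result = {}
    for sellerid in sorted({s for s, _ in rows}):
        frame = [q for s, q in rows if s <= sellerid]
        result[sellerid] = sum(frame) / len(frame)
    return result

winsales = [(3, 10), (1, 10), (1, 30), (4, 40), (2, 20), (4, 10),
            (2, 20), (3, 15), (3, 20), (3, 30), (3, 10)]
avgs = running_avg_with_peers(winsales)
for sellerid, avg in avgs.items():
    print(sellerid, avg)
```

Rows that share a sellerid get the same average, which is why the query output shows only four distinct values.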

SQL Getting Running Count with SUM and OVER

In sql I have a history table for each item we have and they can have a record of in or out with a quantity for each action. I'm trying to get a running count of how many of an item we have based on whether it's an activity of out or in. Here is my final sql:
SELECT itemid,
activitydate,
activitycode,
SUM(quantity) AS quantity,
SUM(CASE WHEN activitycode = 'IN'
THEN quantity
WHEN activitycode = 'OUT'
THEN -quantity
ELSE 0 END) OVER (PARTITION BY itemid ORDER BY activitydate rows unbounded preceding) AS runningcount
FROM itemhistory
GROUP BY itemid,
activitydate,
activitycode
This results in:
+--------+-------------------------+--------------+----------+--------------+
| itemid | activitydate | activitycode | quantity | runningcount |
+--------+-------------------------+--------------+----------+--------------+
| 1 | 2017-06-08 13:58:00.000 | IN | 1 | 1 |
| 1 | 2017-06-08 16:02:00.000 | IN | 6 | 2 |
| 1 | 2017-06-15 11:43:00.000 | OUT | 3 | 1 |
| 1 | 2017-06-19 12:36:00.000 | IN | 1 | 2 |
| 2 | 2017-06-08 13:50:00.000 | IN | 5 | 1 |
| 2 | 2017-06-12 12:41:00.000 | IN | 4 | 2 |
| 2 | 2017-06-15 11:38:00.000 | OUT | 2 | 1 |
| 2 | 2017-06-20 12:54:00.000 | IN | 15 | 2 |
| 2 | 2017-06-08 13:52:00.000 | IN | 5 | 3 |
| 2 | 2017-06-12 13:09:00.000 | IN | 1 | 4 |
| 2 | 2017-06-15 11:47:00.000 | OUT | 1 | 3 |
| 2 | 2017-06-20 13:14:00.000 | IN | 1 | 4 |
+--------+-------------------------+--------------+----------+--------------+
I want the end result to look like this:
+--------+-------------------------+--------------+----------+--------------+
| itemid | activitydate | activitycode | quantity | runningcount |
+--------+-------------------------+--------------+----------+--------------+
| 1 | 2017-06-08 13:58:00.000 | IN | 1 | 1 |
| 1 | 2017-06-08 16:02:00.000 | IN | 6 | 7 |
| 1 | 2017-06-15 11:43:00.000 | OUT | 3 | 4 |
| 1 | 2017-06-19 12:36:00.000 | IN | 1 | 5 |
| 2 | 2017-06-08 13:50:00.000 | IN | 5 | 5 |
| 2 | 2017-06-12 12:41:00.000 | IN | 4 | 9 |
| 2 | 2017-06-15 11:38:00.000 | OUT | 2 | 7 |
| 2 | 2017-06-20 12:54:00.000 | IN | 15 | 22 |
| 2 | 2017-06-08 13:52:00.000 | IN | 5 | 27 |
| 2 | 2017-06-12 13:09:00.000 | IN | 1 | 28 |
| 2 | 2017-06-15 11:47:00.000 | OUT | 1 | 27 |
| 2 | 2017-06-20 13:14:00.000 | IN | 1 | 28 |
+--------+-------------------------+--------------+----------+--------------+
You want sum(sum()), because this is an aggregation query:
SELECT itemid, activitydate, activitycode,
SUM(quantity) AS quantity,
SUM(SUM(CASE WHEN activitycode = 'IN' THEN quantity
WHEN activitycode = 'OUT' THEN -quantity
ELSE 0
END)
) OVER (PARTITION BY itemid ORDER BY activitydate ) AS runningcount
FROM itemhistory
GROUP BY itemid, activitydate, activitycode
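The two nested sums do different jobs: the inner one is the GROUP BY aggregate and the outer one is the window function over the grouped rows. A minimal Python sketch of the two steps, using the itemid 1 rows from the question, may make that clearer. The function name running_count and the tuple layout are assumptions for this illustration.

```python
from collections import defaultdict

# (itemid, activitydate, activitycode, quantity)
history = [
    (1, "2017-06-08 13:58", "IN", 1),
    (1, "2017-06-08 16:02", "IN", 6),
    (1, "2017-06-15 11:43", "OUT", 3),
    (1, "2017-06-19 12:36", "IN", 1),
]

def running_count(rows):
    # Step 1, the inner SUM with GROUP BY: one signed total per group.
    grouped = defaultdict(int)
    for itemid, when, code, qty in rows:
        signed = qty if code == "IN" else -qty if code == "OUT" else 0
        grouped[(itemid, when, code)] += signed

    # Step 2, the outer SUM ... OVER: accumulate per item in date order.
    out = []
    running = defaultdict(int)
    for (itemid, when, code), signed in sorted(grouped.items()):
        running[itemid] += signed
        out.append((itemid, when, code, running[itemid]))
    return out

counts = [row[3] for row in running_count(history)]
print(counts)  # [1, 7, 4, 5]
```

This matches the desired runningcount column for itemid 1 in the question.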