I want to fill NULL prices with interpolated values. Interpolation is a statistical method by which related known values are used to estimate an unknown price or potential yield of a security. Interpolation is achieved by using other established values that are located in sequence with the unknown value.
Here is my sample table and code.
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=673fcd5bc250bd272e8b6da3d0eddb90
I want to get this result:
| SEQ | cat01 | cat02 | dt_day | price | coeff |
+-----+-------+-------+------------+-------+--------+
| 1 | 230 | 1 | 2019-01-01 | 16000 | 0 |
| 2 | 230 | 1 | 2019-01-02 | NULL | 1 |
| 3 | 230 | 1 | 2019-01-03 | 13000 | 0 |
| 4 | 230 | 1 | 2019-01-04 | NULL | 1 |
| 5 | 230 | 1 | 2019-01-05 | NULL | 2 |
| 6 | 230 | 1 | 2019-01-06 | NULL | 3 |
| 7 | 230 | 1 | 2019-01-07 | 19000 | 0 |
| 8 | 230 | 1 | 2019-01-08 | 20000 | 0 |
| 9 | 230 | 1 | 2019-01-09 | 21500 | 0 |
| 10 | 230 | 1 | 2019-01-10 | 21500 | 0 |
| 11 | 230 | 1 | 2019-01-11 | NULL | 1 |
| 12 | 230 | 1 | 2019-01-12 | NULL | 2 |
| 13 | 230 | 1 | 2019-01-13 | 23000 | 0 |
| 1 | 230 | 2 | 2019-01-01 | NULL | 1 |
| 2 | 230 | 2 | 2019-01-02 | NULL | 2 |
| 3 | 230 | 2 | 2019-01-03 | 12000 | 0 |
| 4 | 230 | 2 | 2019-01-04 | 17000 | 0 |
| 5 | 230 | 2 | 2019-01-05 | 22000 | 0 |
| 6 | 230 | 2 | 2019-01-06 | NULL | 1 |
| 7 | 230 | 2 | 2019-01-07 | 23000 | 0 |
| 8 | 230 | 2 | 2019-01-08 | 23200 | 0 |
| 9 | 230 | 2 | 2019-01-09 | NULL | 1 |
| 10 | 230 | 2 | 2019-01-10 | NULL | 2 |
| 11 | 230 | 2 | 2019-01-11 | NULL | 3 |
| 12 | 230 | 2 | 2019-01-12 | NULL | 4 |
| 13 | 230 | 2 | 2019-01-13 | 23000 | 0 |
coeff is the position of each NULL within its run of consecutive NULLs.
I use the code below to implement the interpolation.
I tried to find the gap between the known values on either side of the empty values and divide it by the number of rows in the gap.
However, I think this code is incorrect.
WITH ROW_VALUE AS
(
SELECT SEQ
, dt_day
, cat01
, cat02
, price
, ROW_NUMBER() OVER (ORDER BY dt_day) AS sub_seq
FROM (
SELECT SEQ
, cat01
, cat02
, dt_day
, dt_week
, dt_month
, price
FROM temp01
WHERE price IS NOT NULL
)val
)
,STEP_CHANGE AS(
SELECT RV1.SEQ AS id_Start
, RV1.SEQ - 1 AS id_End
, RV1.cat01
, RV1.cat02
, RV1.dt_day
, RV1.price
, (RV2.price - RV1.price)/(RV2.SEQ - RV1.SEQ) AS change1
FROM ROW_VALUE RV1
LEFT JOIN ROW_VALUE RV2 ON RV1.cat01 = RV2.cat01
AND RV1.cat02 = RV2.cat02
AND RV1.SEQ = RV2.SEQ - 1
)
SELECT *
FROM STEP_CHANGE
ORDER BY cat01, cat02, dt_day
Please let me know a good way to fill the NULLs using linear relationships.
If there is another good way, please recommend it.
If I assume that you mean linear interpolation between the previous price and the next price based on the number of days that passed, then you can use the following method:
Use window functions to get the next and previous days with prices for each row.
Use window functions or joins to get the prices on those days as well.
Use arithmetic to calculate the linear interpolation.
Your SQL Fiddle uses SQL Server, so I assume that is the database you are using. The code looks like this:
select t.*,
coalesce(t.price,
(tprev.price +
(tnext.price - tprev.price) / datediff(day, prev_price_day, next_price_day) *
datediff(day, t.prev_price_day, t.dt_day)
)
) as imputed_price
from (select t.*,
max(case when price is not null then dt_day end) over (partition by cat01, cat02 order by dt_day asc) as prev_price_day,
min(case when price is not null then dt_day end) over (partition by cat01, cat02 order by dt_day desc) as next_price_day
from temp01 t
) t left join
temp01 tprev
on tprev.cat01 = t.cat01 and
tprev.cat02 = t.cat02 and
tprev.dt_day = t.prev_price_day left join
temp01 tnext
on tnext.cat01 = t.cat01 and
tnext.cat02 = t.cat02 and
tnext.dt_day = t.next_price_day
order by cat01, cat02, dt_day;
Here is a db<>fiddle.
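The arithmetic behind the query above can be checked outside the database. The following is only a minimal sketch in Python, assuming one (cat01, cat02) group at a time and using the first seven rows of the sample data; the function name interpolate is hypothetical and not part of the original answer.

```python
from datetime import date

def interpolate(rows):
    """Fill None prices by linear interpolation over elapsed days.

    rows: list of (day, price_or_None) for ONE (cat01, cat02) group,
    sorted by day. Gaps with no known price on one side stay None.
    """
    known = [(d, p) for d, p in rows if p is not None]
    filled = []
    for day, price in rows:
        if price is None:
            # Closest known price before and after the current day.
            prev = max((k for k in known if k[0] < day),
                       key=lambda k: k[0], default=None)
            nxt = min((k for k in known if k[0] > day),
                      key=lambda k: k[0], default=None)
            if prev is not None and nxt is not None:
                span = (nxt[0] - prev[0]).days      # days between known prices
                elapsed = (day - prev[0]).days      # days since previous price
                price = prev[1] + (nxt[1] - prev[1]) * elapsed / span
        filled.append((day, price))
    return filled

sample = [
    (date(2019, 1, 1), 16000), (date(2019, 1, 2), None),
    (date(2019, 1, 3), 13000), (date(2019, 1, 4), None),
    (date(2019, 1, 5), None), (date(2019, 1, 6), None),
    (date(2019, 1, 7), 19000),
]
result = interpolate(sample)
for day, price in result:
    print(day, price)
```

Note that the sketch multiplies before dividing and uses floating-point division. If price is an integer column in SQL Server, the division in the query is integer division, so results that are not whole numbers would be truncated there.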
Related
I'm trying to figure out a way to make the rank() window function "skip" in its counting some rows with null values in a specific column. Pretty much, what I want is to count how many paid transactions there are, for each client, before each transaction/row.
I tried using case when inside the rank() and I got something similar to the results I expect, but still not quite what I need.
+-------------------------------------------------------+
| What I need |
+-------------+------+----------+-----------------------+
| CLIENT | CODE | PAYMENT | PAID_PURCHASES_SO_FAR |
| A | 341 | 17/09/21 | 0 |
| A | 342 | 18/09/21 | 1 |
| A | 343 | (null) | 2 |
| A | 344 | 18/09/21 | 2 |
| A | 345 | 19/09/21 | 3 |
| A | 346 | 19/09/21 | 4 |
| A | 347 | (null) | 5 |
| A | 348 | 24/09/21 | 5 |
| B | 855 | (null) | 0 |
| B | 856 | 20/09/21 | 0 |
| B | 857 | (null) | 1 |
+-------------+------+----------+-----------------------+
+-------------------------------------------------------+
| What I got |
+-------------+------+----------+-----------------------+
| CLIENT | CODE | PAYMENT | PAID_PURCHASES_SO_FAR |
| A | 341 | 17/09/21 | 0 |
| A | 342 | 18/09/22 | 1 |
| A | 343 | (null) | (null) |
| A | 344 | 18/09/22 | 2 |
| A | 345 | 19/09/22 | 3 |
| A | 346 | 19/09/21 | 4 |
| A | 347 | (null) | (null) |
| A | 348 | 24/09/21 | 5 |
| B | 855 | (null) | (null) |
| B | 856 | 20/09/21 | 0 |
| B | 857 | (null) | (null) |
+-------------+------+----------+-----------------------+
And here is my code:
SELECT
CLIENT
, CODE
, PAYMENT
, CASE WHEN PAYMENT IS NOT NULL THEN DENSE_RANK() OVER(PARTITION BY CLIENT, (CASE WHEN PAYMENT IS NOT NULL THEN 1 ELSE 0 END) ORDER BY CODE) - 1 END NUMBER_OF_PURCHASES_SO_FAR
FROM FOO.BAR
Note: The CODE column may be used as time reference. E.g. code = 750 came before code = 751, and so on.
Any help would be appreciated.
Thanks in advance.
You can use conditional aggregation combined with a window frame, as in:
select *, coalesce(sum(case when payment is null then 0 else 1 end)
over(partition by client order by code
rows between unbounded preceding and 1 preceding), 0)
as ppsf
from t
order by client, code
Result:
client code payment ppsf
------- ----- --------- ----
A 341 17/09/21 0
A 342 18/09/21 1
A 343 null 2
A 344 18/09/21 2
A 345 19/09/21 3
A 346 19/09/21 4
A 347 null 5
A 348 24/09/21 5
B 855 null 0
B 856 20/09/21 0
B 857 null 1
See running example at db<>fiddle.
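To try the windowed conditional sum without a database server, here is a small sketch that runs the same idea through Python's built-in sqlite3 module. SQLite is only a stand-in for the asker's database (window functions need SQLite 3.25 or later), and the table name t and the lowercase column names are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (client TEXT, code INTEGER, payment TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("A", 341, "17/09/21"), ("A", 342, "18/09/21"), ("A", 343, None),
        ("A", 344, "18/09/21"), ("A", 345, "19/09/21"), ("A", 346, "19/09/21"),
        ("A", 347, None), ("A", 348, "24/09/21"),
        ("B", 855, None), ("B", 856, "20/09/21"), ("B", 857, None),
    ],
)

# The frame ends at "1 preceding", so the current row is never counted.
# For the first row of each client the frame is empty and SUM returns
# NULL, which COALESCE turns into 0.
sql = """
SELECT client, code,
       COALESCE(SUM(CASE WHEN payment IS NULL THEN 0 ELSE 1 END)
                OVER (PARTITION BY client ORDER BY code
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS ppsf
FROM t
ORDER BY client, code
"""
rows = con.execute(sql).fetchall()
ppsf = [r[2] for r in rows]
print(ppsf)
```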
This is it:
SELECT
"CLIENT"
, "CODE"
, "PAYMENT"
, rank() over(partition by "CLIENT"
order by COALESCE("PAYMENT",'01/01/70') )
FROM Table1
http://sqlfiddle.com/#!15/f941a/14
I want to fill the leading and trailing NULL values of each group using the group's minimum and maximum.
Here is my sample table.
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=322aafeb7970e25f9a85d0cd2f6c00d6
For example, take the temp01 table.
I want to fill the NULL values at the start and end of each group in temp01.
So, I fill
price = [NULL, NULL, 13000]
to
price = [12000, 12000, 13000]
because 12000 is the minimum value in the [230, 1] group. Likewise, the NULLs at the end are filled with the maximum value in the [cat01, cat02] group.
| SEQ | cat01 | cat02 | dt_day | price |
+-----+-------+-------+------------+-------+
| 1 | 230 | 1 | 2019-01-01 | NULL |
| 2 | 230 | 1 | 2019-01-02 | NULL |
| 3 | 230 | 1 | 2019-01-03 | 13000 |
...
| 11 | 230 | 1 | 2019-01-11 | NULL |
| 12 | 230 | 1 | 2019-01-12 | NULL |
| 1 | 230 | 2 | 2019-01-01 | NULL |
| 2 | 230 | 2 | 2019-01-02 | NULL |
| 3 | 230 | 2 | 2019-01-03 | 12000 |
...
| 12 | 230 | 2 | 2019-01-11 | NULL |
| 13 | 230 | 2 | 2019-01-12 | NULL |
[result]
| SEQ | cat01 | cat02 | dt_day | price |
+-----+-------+-------+------------+-------+
| 1 | 230 | 1 | 2019-01-01 | 12000 | --START
| 2 | 230 | 1 | 2019-01-02 | 12000 |
| 3 | 230 | 1 | 2019-01-03 | 13000 |
| 4 | 230 | 1 | 2019-01-04 | 12000 |
| 5 | 230 | 1 | 2019-01-05 | NULL |
| 6 | 230 | 1 | 2019-01-06 | NULL |
| 7 | 230 | 1 | 2019-01-07 | 19000 |
| 8 | 230 | 1 | 2019-01-08 | 20000 |
| 9 | 230 | 1 | 2019-01-09 | 21500 |
| 10 | 230 | 1 | 2019-01-10 | 21500 |
| 11 | 230 | 1 | 2019-01-11 | 21500 |
| 12 | 230 | 1 | 2019-01-12 | 21500 |
| 13 | 230 | 1 | 2019-01-13 | 21500 | --END
| 1 | 230 | 2 | 2019-01-01 | 12000 | --START
| 2 | 230 | 2 | 2019-01-02 | 12000 |
| 3 | 230 | 2 | 2019-01-03 | 12000 |
| 4 | 230 | 2 | 2019-01-04 | 17000 |
| 5 | 230 | 2 | 2019-01-05 | 22000 |
| 6 | 230 | 2 | 2019-01-06 | NULL |
| 7 | 230 | 2 | 2019-01-07 | 23000 |
| 8 | 230 | 2 | 2019-01-08 | 23200 |
| 9 | 230 | 2 | 2019-01-09 | NULL |
| 10 | 230 | 2 | 2019-01-10 | 24000 |
| 11 | 230 | 2 | 2019-01-11 | 24000 |
| 12 | 230 | 2 | 2019-01-12 | 24000 |
| 13 | 230 | 2 | 2019-01-13 | 24000 | --END
Please let me know a good way to fill the NULLs using linear relationships.
Find the min() and max() of price, grouped by cat01, cat02.
Also find the min and max seq among the rows where price is not null.
After that, simply inner join to your table and update the rows where price is null.
with val as
(
select cat01, cat02,
min_price = min(price),
max_price = max(price),
min_seq = min(case when price is not null then seq end),
max_seq = max(case when price is not null then seq end)
from temp01
group by cat01, cat02
)
update t
set price = case when t.seq < v.min_seq then min_price
when t.seq > v.max_seq then max_price
end
FROM temp01 t
inner join val v on t.cat01 = v.cat01
and t.cat02 = v.cat02
where t.price is null
dbfiddle
EDIT : returning the price as a new column in SELECT query
with val as
(
select cat01, cat02, min_price = min(price), max_price = max(price),
min_seq = min(case when price is not null then seq end),
max_seq = max(case when price is not null then seq end)
from temp01
group by cat01, cat02
)
select t.*,
new_price = coalesce(t.price,
case when t.seq < v.min_seq then min_price
when t.seq > v.max_seq then max_price
end)
FROM temp01 t
left join val v on t.cat01 = v.cat01
and t.cat02 = v.cat02
Updated dbfiddle
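The SELECT variant can be tried locally with a small sketch using Python's sqlite3 module. SQLite is only a stand-in for SQL Server here, so the T-SQL `alias = expr` form is written as `expr AS alias`, and the seven rows are made-up illustration data, not the asker's table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE temp01 (seq INTEGER, cat01 INTEGER, cat02 INTEGER, price INTEGER)"
)
# Hypothetical group: two leading NULLs, one inner NULL, one trailing NULL.
con.executemany(
    "INSERT INTO temp01 VALUES (?, 230, 1, ?)",
    [(1, None), (2, None), (3, 13000), (4, None),
     (5, 12000), (6, 19000), (7, None)],
)

sql = """
WITH val AS (
    SELECT cat01, cat02,
           MIN(price) AS min_price,
           MAX(price) AS max_price,
           MIN(CASE WHEN price IS NOT NULL THEN seq END) AS min_seq,
           MAX(CASE WHEN price IS NOT NULL THEN seq END) AS max_seq
    FROM temp01
    GROUP BY cat01, cat02
)
SELECT t.seq,
       COALESCE(t.price,
                CASE WHEN t.seq < v.min_seq THEN v.min_price  -- leading NULLs
                     WHEN t.seq > v.max_seq THEN v.max_price  -- trailing NULLs
                END) AS new_price
FROM temp01 t
LEFT JOIN val v ON t.cat01 = v.cat01 AND t.cat02 = v.cat02
ORDER BY t.seq
"""
new_prices = [row[1] for row in con.execute(sql)]
print(new_prices)
```

The inner NULL at seq 4 stays NULL because it lies between min_seq and max_seq, so neither CASE branch applies.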
This is probably the clunkiest query I have ever made. I have to use a read-only account, so I can't use temp tables or anything else to make this easier. The goal is to return the MIN(RowNum) when sumPiecesScrapped = maxSum. I have tried wrapping the entire query in another subquery to return the MIN(RowNum). However, the relationship to the primary key JobNo is one-to-many, and when I join on both JobNo and StepNo I get the same result as the one below.
SELECT
JobNo,
StepNo,
sumPiecesScrapped,
maxSum,
CASE
WHEN sumPiecesScrapped = maxSum THEN ROW_NUMBER() OVER(PARTITION BY JobNo ORDER BY JobNo, StepNo)
ELSE 0
END AS RowNum
FROM
(
SELECT
JobNo,
StepNo,
sumPiecesScrapped
FROM
(
SELECT
JobNo,
StepNo,
SUM(PiecesScrapped) as sumPiecesScrapped
FROM
(
SELECT
JobNo,
StepNo,
PiecesFinished,
PiecesScrapped
FROM TimeTicketDet
) tt2
GROUP BY JobNo, StepNo
) tt3
GROUP BY JobNo, StepNo, sumPiecesScrapped
) tt4
LEFT JOIN
(
SELECT
JobNo as tt5JobNo,
MAX(PiecesScrapped) as maxSum
FROM
(
SELECT
JobNo,
PiecesScrapped
FROM TimeTicketDet
) tt5
GROUP BY JobNo
) tt5
ON tt5.tt5JobNo = tt4.JobNo
WHERE tt4.JobNo = '12345'
Result:
+-------+--------+-------------------+--------+--------+
| JobNo | StepNo | sumPiecesScrapped | maxSum | RowNum |
+-------+--------+-------------------+--------+--------+
| 12345 | 10 | 0 | 5 | 0 |
| 12345 | 20 | 1 | 5 | 0 |
| 12345 | 30 | 5 | 5 | 3 |
| 12345 | 40 | 5 | 5 | 4 |
| 12345 | 60 | 5 | 5 | 5 |
| 12345 | 70 | 5 | 5 | 6 |
+-------+--------+-------------------+--------+--------+
Desired Result:
+-------+--------+-------------------+--------+--------+
| JobNo | StepNo | sumPiecesScrapped | maxSum | RowNum |
+-------+--------+-------------------+--------+--------+
| 12345 | 10 | 0 | 5 | 0 |
| 12345 | 20 | 1 | 5 | 0 |
| 12345 | 30 | 5 | 5 | 3 |
| 12345 | 40 | 5 | 5 | 3 |
| 12345 | 60 | 5 | 5 | 3 |
| 12345 | 70 | 5 | 5 | 3 |
+-------+--------+-------------------+--------+--------+
Other Possible Result:
+-------+--------+-------------------+--------+-----------+
| JobNo | StepNo | sumPiecesScrapped | maxSum | RowNum |
+-------+--------+-------------------+--------+-----------+
| 12345 | 10 | 0 | 5 | 0 |
| 12345 | 20 | 1 | 5 | 0 |
| 12345 | 30 | 5 | 5 | Something |
| 12345 | 40 | 5 | 5 | 0 |
| 12345 | 60 | 5 | 5 | 0 |
| 12345 | 70 | 5 | 5 | 0 |
+-------+--------+-------------------+--------+-----------+
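One possible way to reach the first desired result is a windowed MIN over a conditional row number, which needs no temp tables. This is only a sketch, run through Python's sqlite3 module as a stand-in for the real database (SQLite 3.25 or later). It assumes the per-step totals are already available, so the table step_totals and the alias rn are hypothetical; in the real query the nested subqueries would take that table's place.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated rows, copied from the "Result" table.
con.execute(
    "CREATE TABLE step_totals "
    "(JobNo TEXT, StepNo INTEGER, sumPiecesScrapped INTEGER, maxSum INTEGER)"
)
con.executemany(
    "INSERT INTO step_totals VALUES ('12345', ?, ?, 5)",
    [(10, 0), (20, 1), (30, 5), (40, 5), (60, 5), (70, 5)],
)

sql = """
SELECT JobNo, StepNo, sumPiecesScrapped, maxSum,
       CASE WHEN sumPiecesScrapped = maxSum
            -- smallest row number among the rows that hit the maximum
            THEN MIN(CASE WHEN sumPiecesScrapped = maxSum THEN rn END)
                 OVER (PARTITION BY JobNo)
            ELSE 0
       END AS RowNum
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY JobNo ORDER BY StepNo) AS rn
    FROM step_totals s
) n
ORDER BY JobNo, StepNo
"""
row_nums = [row[4] for row in con.execute(sql)]
print(row_nums)
```

Because the row number is computed in the inner query first, the outer MIN can range over the whole JobNo partition and return the same value on every matching row.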
I'm trying to understand how the windowing function avg works, and it does not seem to behave as I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated?
Since I'm not using partition by, I would expect avg(qty) to be the same for all rows.
Any ideas?
If you want the same avg(qty) on every row, remove order by sellerid from the over clause; every row will then show 19.545454545454547.
Query to get the same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
If the over clause includes order by sellerid but no explicit frame, the window defaults to range between unbounded preceding and current row. Each row's average therefore covers every row up to and including its own sellerid, and rows with the same sellerid share one value. The result is a cumulative average per sellerid:
sellerid 1: 2 records (2 so far) with qty 10, 30, so avg is
(10+30)/2 = 20.0
sellerid 2: 2 records (4 so far) with qty 20, 20, so avg is
(10+30+20+20)/4 = 20.0
sellerid 3: 5 records (9 so far) with qty 10, 10, 15, 20, 30, so avg is
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4: 2 records (11 so far) with qty 40, 10, so avg is
215/11 = 19.545454545454547
This is expected behavior in Hive when the over clause contains order by.
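The difference between the two over clauses can be reproduced with a short sketch in Python's sqlite3 module. SQLite (3.25 or later) is used here only as a stand-in for Hive; both apply the same default frame when order by is present and no frame is given. The table holds just the sellerid and qty columns of the dataset.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE winsales (sellerid INTEGER, qty INTEGER)")
con.executemany(
    "INSERT INTO winsales VALUES (?, ?)",
    [(3, 10), (1, 10), (1, 30), (4, 40), (2, 20), (4, 10),
     (2, 20), (3, 15), (3, 20), (3, 30), (3, 10)],
)

sql = """
SELECT sellerid,
       AVG(qty) OVER (ORDER BY sellerid) AS running_avg,  -- default RANGE frame
       AVG(qty) OVER ()                  AS overall_avg   -- whole result set
FROM winsales
ORDER BY sellerid
"""
rows = con.execute(sql).fetchall()

# Rows with the same sellerid are peers and share one running average.
running = {sellerid: avg for sellerid, avg, _ in rows}
overall = {avg for _, _, avg in rows}
print(running)
print(overall)
```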
In SQL I have a history table for each item we have, and each item can have records of IN or OUT with a quantity for each action. I'm trying to get a running count of how many of an item we have, based on whether the activity is OUT or IN. Here is my final SQL:
SELECT itemid,
activitydate,
activitycode,
SUM(quantity) AS quantity,
SUM(CASE WHEN activitycode = 'IN'
THEN quantity
WHEN activitycode = 'OUT'
THEN -quantity
ELSE 0 END) OVER (PARTITION BY itemid ORDER BY activitydate rows unbounded preceding) AS runningcount
FROM itemhistory
GROUP BY itemid,
activitydate,
activitycode
This results in:
+--------+-------------------------+--------------+----------+--------------+
| itemid | activitydate | activitycode | quantity | runningcount |
+--------+-------------------------+--------------+----------+--------------+
| 1 | 2017-06-08 13:58:00.000 | IN | 1 | 1 |
| 1 | 2017-06-08 16:02:00.000 | IN | 6 | 2 |
| 1 | 2017-06-15 11:43:00.000 | OUT | 3 | 1 |
| 1 | 2017-06-19 12:36:00.000 | IN | 1 | 2 |
| 2 | 2017-06-08 13:50:00.000 | IN | 5 | 1 |
| 2 | 2017-06-12 12:41:00.000 | IN | 4 | 2 |
| 2 | 2017-06-15 11:38:00.000 | OUT | 2 | 1 |
| 2 | 2017-06-20 12:54:00.000 | IN | 15 | 2 |
| 2 | 2017-06-08 13:52:00.000 | IN | 5 | 3 |
| 2 | 2017-06-12 13:09:00.000 | IN | 1 | 4 |
| 2 | 2017-06-15 11:47:00.000 | OUT | 1 | 3 |
| 2 | 2017-06-20 13:14:00.000 | IN | 1 | 4 |
+--------+-------------------------+--------------+----------+--------------+
I want the end result to look like this:
+--------+-------------------------+--------------+----------+--------------+
| itemid | activitydate | activitycode | quantity | runningcount |
+--------+-------------------------+--------------+----------+--------------+
| 1 | 2017-06-08 13:58:00.000 | IN | 1 | 1 |
| 1 | 2017-06-08 16:02:00.000 | IN | 6 | 7 |
| 1 | 2017-06-15 11:43:00.000 | OUT | 3 | 4 |
| 1 | 2017-06-19 12:36:00.000 | IN | 1 | 5 |
| 2 | 2017-06-08 13:50:00.000 | IN | 5 | 5 |
| 2 | 2017-06-12 12:41:00.000 | IN | 4 | 9 |
| 2 | 2017-06-15 11:38:00.000 | OUT | 2 | 7 |
| 2 | 2017-06-20 12:54:00.000 | IN | 15 | 22 |
| 2 | 2017-06-08 13:52:00.000 | IN | 5 | 27 |
| 2 | 2017-06-12 13:09:00.000 | IN | 1 | 28 |
| 2 | 2017-06-15 11:47:00.000 | OUT | 1 | 27 |
| 2 | 2017-06-20 13:14:00.000 | IN | 1 | 28 |
+--------+-------------------------+--------------+----------+--------------+
You want sum(sum()), because this is an aggregation query:
SELECT itemid, activitydate, activitycode,
SUM(quantity) AS quantity,
SUM(SUM(CASE WHEN activitycode = 'IN' THEN quantity
WHEN activitycode = 'OUT' THEN -quantity
ELSE 0
END)
) OVER (PARTITION BY itemid ORDER BY activitydate ) AS runningcount
FROM itemhistory
GROUP BY itemid, activitydate, activitycode
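The nesting works because the inner SUM is the ordinary aggregate evaluated per GROUP BY row, and the outer SUM ... OVER then accumulates those per-group values. Here is a small sketch of that, run through Python's sqlite3 module as a stand-in for the asker's database (SQLite 3.25 or later) and using only the four rows for itemid 1 from the expected output.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE itemhistory "
    "(itemid INTEGER, activitydate TEXT, activitycode TEXT, quantity INTEGER)"
)
con.executemany(
    "INSERT INTO itemhistory VALUES (1, ?, ?, ?)",
    [
        ("2017-06-08 13:58:00.000", "IN", 1),
        ("2017-06-08 16:02:00.000", "IN", 6),
        ("2017-06-15 11:43:00.000", "OUT", 3),
        ("2017-06-19 12:36:00.000", "IN", 1),
    ],
)

sql = """
SELECT itemid, activitydate, activitycode,
       SUM(quantity) AS quantity,
       -- inner SUM: signed total per group; outer SUM: running total
       SUM(SUM(CASE WHEN activitycode = 'IN'  THEN quantity
                    WHEN activitycode = 'OUT' THEN -quantity
                    ELSE 0
               END)
          ) OVER (PARTITION BY itemid ORDER BY activitydate) AS runningcount
FROM itemhistory
GROUP BY itemid, activitydate, activitycode
ORDER BY itemid, activitydate
"""
running_counts = [row[4] for row in con.execute(sql)]
print(running_counts)
```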