Returning MIN Row_Number() SQL - sql

This is probably the clunkiest query I have ever made. I have to use a read-only account, so I can't use temp tables or anything else to make this easier. The goal is to return the MIN(RowNum) for the rows where sumPiecesScrapped = maxSum. I have tried wrapping the entire query in another subquery that returns the MIN(RowNum); however, that subquery is one-to-many against the primary key JobNo, and when I join it on both JobNo and StepNo it gives me the same result as the one below.
SELECT
JobNo,
StepNo,
sumPiecesScrapped,
maxSum,
CASE
WHEN sumPiecesScrapped = maxSum THEN ROW_NUMBER() OVER(PARTITION BY JobNo ORDER BY JobNo, StepNo)
ELSE 0
END AS RowNum
FROM
(
SELECT
JobNo,
StepNo,
sumPiecesScrapped
FROM
(
SELECT
JobNo,
StepNo,
SUM(PiecesScrapped) as sumPiecesScrapped
FROM
(
SELECT
JobNo,
StepNo,
PiecesFinished,
PiecesScrapped
FROM TimeTicketDet
) tt2
GROUP BY JobNo, StepNo
) tt3
GROUP BY JobNo, StepNo, sumPiecesScrapped
) tt4
LEFT JOIN
(
SELECT
JobNo as tt5JobNo,
MAX(PiecesScrapped) as maxSum
FROM
(
SELECT
JobNo,
PiecesScrapped
FROM TimeTicketDet
) tt5
GROUP BY JobNo
) tt5
ON tt5.tt5JobNo = tt4.JobNo
WHERE tt4.JobNo = '12345'
Result:
+-------+--------+-------------------+--------+--------+
| JobNo | StepNo | sumPiecesScrapped | maxSum | RowNum |
+-------+--------+-------------------+--------+--------+
| 12345 | 10 | 0 | 5 | 0 |
| 12345 | 20 | 1 | 5 | 0 |
| 12345 | 30 | 5 | 5 | 3 |
| 12345 | 40 | 5 | 5 | 4 |
| 12345 | 60 | 5 | 5 | 5 |
| 12345 | 70 | 5 | 5 | 6 |
+-------+--------+-------------------+--------+--------+
Desired Result:
+-------+--------+-------------------+--------+--------+
| JobNo | StepNo | sumPiecesScrapped | maxSum | RowNum |
+-------+--------+-------------------+--------+--------+
| 12345 | 10 | 0 | 5 | 0 |
| 12345 | 20 | 1 | 5 | 0 |
| 12345 | 30 | 5 | 5 | 3 |
| 12345 | 40 | 5 | 5 | 3 |
| 12345 | 60 | 5 | 5 | 3 |
| 12345 | 70 | 5 | 5 | 3 |
+-------+--------+-------------------+--------+--------+
Other Possible Result:
+-------+--------+-------------------+--------+-----------+
| JobNo | StepNo | sumPiecesScrapped | maxSum | RowNum |
+-------+--------+-------------------+--------+-----------+
| 12345 | 10 | 0 | 5 | 0 |
| 12345 | 20 | 1 | 5 | 0 |
| 12345 | 30 | 5 | 5 | Something |
| 12345 | 40 | 5 | 5 | 0 |
| 12345 | 60 | 5 | 5 | 0 |
| 12345 | 70 | 5 | 5 | 0 |
+-------+--------+-------------------+--------+-----------+
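One possible way to reach the "Desired Result" table without temp tables is sketched below. It is not taken from the thread: it assumes a database with CTEs and window functions, runs the SQL through Python's sqlite3 module only so the example is self-contained, and assumes maxSum means the largest per-step sum within the job. The sample rows and the helper name run_query are hypothetical.

```python
import sqlite3

# Hypothetical TimeTicketDet rows for job 12345; the per-step sums match the tables above.
ROWS = [
    ("12345", 10, 0),
    ("12345", 20, 1),
    ("12345", 30, 5),
    ("12345", 40, 5),
    ("12345", 60, 5),
    ("12345", 70, 5),
]

QUERY = """
WITH steps AS (
    SELECT JobNo, StepNo, SUM(PiecesScrapped) AS sumPiecesScrapped
    FROM TimeTicketDet
    GROUP BY JobNo, StepNo
),
numbered AS (
    SELECT JobNo, StepNo, sumPiecesScrapped,
           -- assumption: maxSum is the largest per-step sum within the job
           MAX(sumPiecesScrapped) OVER (PARTITION BY JobNo) AS maxSum,
           ROW_NUMBER() OVER (PARTITION BY JobNo ORDER BY StepNo) AS rn
    FROM steps
)
SELECT JobNo, StepNo, sumPiecesScrapped, maxSum,
       CASE WHEN sumPiecesScrapped = maxSum
            -- smallest row number among the rows that reach maxSum
            THEN MIN(CASE WHEN sumPiecesScrapped = maxSum THEN rn END)
                 OVER (PARTITION BY JobNo)
            ELSE 0
       END AS RowNum
FROM numbered
WHERE JobNo = '12345'
ORDER BY JobNo, StepNo
"""


def run_query(rows):
    """Load rows into an in-memory table and return the query result."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE TimeTicketDet (JobNo TEXT, StepNo INTEGER, PiecesScrapped INTEGER)"
    )
    con.executemany("INSERT INTO TimeTicketDet VALUES (?, ?, ?)", rows)
    return con.execute(QUERY).fetchall()


for row in run_query(ROWS):
    print(row)
```

The row number is computed in its own CTE because a window function cannot be nested inside another window function; the outer MIN(...) OVER then spreads the first matching row number across every matching row.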

Related

Get the count of longest streak including the break point

I am working on the problem where I have to get the count of streak with max value, but to get the exact result I have to count that point as well where the streak breaks. My table looks like this
+-----------------+--------+-------+
| customer_number | Months | Flags |
+-----------------+--------+-------+
| 1 | 12 | 1 |
| 1 | 1 | 1 |
| 1 | 2 | 1 |
| 1 | 3 | 1 |
| 1 | 4 | 1 |
| 1 | 5 | 1 |
| 1 | 8 | 1 |
| 1 | 9 | 1 |
| 1 | 10 | 1 |
| 1 | 11 | 1 |
| 6 | 12 | 1 |
| 6 | 1 | 1 |
| 6 | 2 | 1 |
| 6 | 3 | 1 |
| 6 | 4 | 1 |
| 6 | 5 | 4 |
| 6 | 9 | 1 |
| 6 | 10 | 1 |
| 6 | 11 | 1 |
| 7 | 5 | 1 |
| 8 | 9 | 1 |
| 8 | 10 | 1 |
| 8 | 11 | 1 |
| 9 | 9 | 1 |
| 9 | 10 | 1 |
| 9 | 11 | 1 |
| 10 | 11 | 1 |
+-----------------+--------+-------+
and my desired output is
+----------+--------------------+
| Customer | Consecutive streak |
+----------+--------------------+
| 1 | 10 |
| 6 | 6 |
| 7 | 1 |
| 8 | 3 |
| 9 | 3 |
| 10 | 1 |
+----------+--------------------+
the code I have
SELECT customer_number, max(streak) max_consecutive_streak FROM (
SELECT customer_number, COUNT(*) as streak
FROM
(select *,
(row_number() over (order by customer_number) -
row_number() over (order by customer_number)
) as counts
from table1
) cc
group by customer_number, counts
)
GROUP BY 1;
It works well, but for customer_number 6 it returns 5 and I want 6: the row with flag 4 should also be counted in the longest streak, since that is the point where the streak breaks. Any idea how I can achieve that?
You can use a cte with row_number:
with cte(r, id, flag) as (
select row_number() over (order by c.customer_number), c.* from customers c
),
freq(id, t, f) as (
select c2.id, c2.f, count(*) from
(select c.id, (select sum(c1.flag!=c.flag) from cte c1 where c1.id=c.id and c1.r <= c.r) f from cte c)
c2 group by c2.id, c2.f
)
select id, max(f) from freq group by id;

How to get interpolation value in SQL Server?

I want to get interpolation value for NULL. Interpolation is a statistical method by which related known values are used to estimate an unknown price or potential yield of a security. Interpolation is achieved by using other established values that are located in sequence with the unknown value.
Here is my sample table and code.
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=673fcd5bc250bd272e8b6da3d0eddb90
I want to get this result:
| SEQ | cat01 | cat02 | dt_day | price | coeff |
+-----+-------+-------+------------+-------+--------+
| 1 | 230 | 1 | 2019-01-01 | 16000 | 0 |
| 2 | 230 | 1 | 2019-01-02 | NULL | 1 |
| 3 | 230 | 1 | 2019-01-03 | 13000 | 0 |
| 4 | 230 | 1 | 2019-01-04 | NULL | 1 |
| 5 | 230 | 1 | 2019-01-05 | NULL | 2 |
| 6 | 230 | 1 | 2019-01-06 | NULL | 3 |
| 7 | 230 | 1 | 2019-01-07 | 19000 | 0 |
| 8 | 230 | 1 | 2019-01-08 | 20000 | 0 |
| 9 | 230 | 1 | 2019-01-09 | 21500 | 0 |
| 10 | 230 | 1 | 2019-01-10 | 21500 | 0 |
| 11 | 230 | 1 | 2019-01-11 | NULL | 1 |
| 12 | 230 | 1 | 2019-01-12 | NULL | 2 |
| 13 | 230 | 1 | 2019-01-13 | 23000 | 0 |
| 1 | 230 | 2 | 2019-01-01 | NULL | 1 |
| 2 | 230 | 2 | 2019-01-02 | NULL | 2 |
| 3 | 230 | 2 | 2019-01-03 | 12000 | 0 |
| 4 | 230 | 2 | 2019-01-04 | 17000 | 0 |
| 5 | 230 | 2 | 2019-01-05 | 22000 | 0 |
| 6 | 230 | 2 | 2019-01-06 | NULL | 1 |
| 7 | 230 | 2 | 2019-01-07 | 23000 | 0 |
| 8 | 230 | 2 | 2019-01-08 | 23200 | 0 |
| 9 | 230 | 2 | 2019-01-09 | NULL | 1 |
| 10 | 230 | 2 | 2019-01-10 | NULL | 2 |
| 11 | 230 | 2 | 2019-01-11 | NULL | 3 |
| 12 | 230 | 2 | 2019-01-12 | NULL | 4 |
| 13 | 230 | 2 | 2019-01-13 | 23000 | 0 |
I use the code below, but I think it is incorrect.
coeff is the position of each NULL within its run of consecutive NULLs.
The code is meant to implement interpolation.
I tried to take the difference between the known values on either side of a gap and divide it by the number of rows in the gap.
WITH ROW_VALUE AS
(
SELECT SEQ
, dt_day
, cat01
, cat02
, price
, ROW_NUMBER() OVER (ORDER BY dt_day) AS sub_seq
FROM (
SELECT SEQ
, cat01
, cat02
, dt_day
, dt_week
, dt_month
, price
FROM temp01
WHERE price IS NOT NULL
)val
)
,STEP_CHANGE AS(
SELECT RV1.SEQ AS id_Start
, RV1.SEQ - 1 AS id_End
, RV1.cat01
, RV1.cat02
, RV1.dt_day
, RV1.price
, (RV2.price - RV1.price)/(RV2.SEQ - RV1.SEQ) AS change1
FROM ROW_VALUE RV1
LEFT JOIN ROW_VALUE RV2 ON RV1.cat01 = RV2.cat01
AND RV1.cat02 = RV2.cat02
AND RV1.SEQ = RV2.SEQ - 1
)
SELECT *
FROM STEP_CHANGE
ORDER BY cat01, cat02, dt_day
Please let me know a good way to fill the NULLs using a linear relationship.
If there is another good way, please recommend it.
If I assume that you mean linear interpolation between the previous price and the next price based on the number of days that passed, then you can use the following method:
Use window functions to get the next and previous days with prices for each row.
Use window functions or joins to get the prices on those days as well.
Use arithmetic to calculate the linear interpolation.
Your SQL Fiddle uses SQL Server, so I assume that is the database you are using. The code looks like this:
select t.*,
coalesce(t.price,
(tprev.price +
(tnext.price - tprev.price) / datediff(day, prev_price_day, next_price_day) *
datediff(day, t.prev_price_day, t.dt_day)
)
) as imputed_price
from (select t.*,
max(case when price is not null then dt_day end) over (partition by cat01, cat02 order by dt_day asc) as prev_price_day,
min(case when price is not null then dt_day end) over (partition by cat01, cat02 order by dt_day desc) as next_price_day
from temp01 t
) t left join
temp01 tprev
on tprev.cat01 = t.cat01 and
tprev.cat02 = t.cat02 and
tprev.dt_day = t.prev_price_day left join
temp01 tnext
on tnext.cat01 = t.cat01 and
tnext.cat02 = t.cat02 and
tnext.dt_day = t.next_price_day
order by cat01, cat02, dt_day;
Here is a db<>fiddle.
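To make the arithmetic behind the three steps concrete, here is a small Python sketch (not SQL Server code) that applies the same linear interpolation to the first seven rows of the cat02 = 1 sample. The function name interpolate and the list layout are illustrative assumptions; rows with no known price on one side are left as None, which mirrors what the left joins above produce.

```python
from datetime import date

# (dt_day, price) for cat01 = 230, cat02 = 1, copied from the question's table.
SAMPLE = [
    (date(2019, 1, 1), 16000),
    (date(2019, 1, 2), None),
    (date(2019, 1, 3), 13000),
    (date(2019, 1, 4), None),
    (date(2019, 1, 5), None),
    (date(2019, 1, 6), None),
    (date(2019, 1, 7), 19000),
]


def interpolate(rows):
    """Fill None prices linearly between the previous and next known price."""
    known = [(d, p) for d, p in rows if p is not None]
    filled = []
    for d, p in rows:
        if p is not None:
            filled.append(p)
            continue
        # Closest known price before and after this day (None if missing).
        prev = max((k for k in known if k[0] < d), default=None)
        nxt = min((k for k in known if k[0] > d), default=None)
        if prev is None or nxt is None:
            filled.append(None)  # nothing to interpolate from on one side
            continue
        span = (nxt[0] - prev[0]).days   # days between the two known prices
        step = (d - prev[0]).days        # days since the previous known price
        filled.append(prev[1] + (nxt[1] - prev[1]) * step / span)
    return filled


print(interpolate(SAMPLE))
```

For the gap between 2019-01-03 (13000) and 2019-01-07 (19000) the price rises by 6000 over 4 days, so the three missing days come out as 14500, 16000 and 17500. The sketch multiplies before dividing; if price were an integer column in SQL Server, dividing first could truncate, so casting to a decimal type may be worth considering there.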

windowing function avg in Hive with - over (order by colName)

I'm trying to understand how the windowing function avg works, and it does not seem to behave as I expect.
Here is the dataset:
select * from winsales;
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| winsales.salesid | winsales.dateid | winsales.sellerid | winsales.buyerid | winsales.qty | winsales.qty_shipped |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
| 30001 | NULL | 3 | b | 10 | 10 |
| 10001 | NULL | 1 | c | 10 | 10 |
| 10005 | NULL | 1 | a | 30 | NULL |
| 40001 | NULL | 4 | a | 40 | NULL |
| 20001 | NULL | 2 | b | 20 | 20 |
| 40005 | NULL | 4 | a | 10 | 10 |
| 20002 | NULL | 2 | c | 20 | 20 |
| 30003 | NULL | 3 | b | 15 | NULL |
| 30004 | NULL | 3 | b | 20 | NULL |
| 30007 | NULL | 3 | c | 30 | NULL |
| 30001 | NULL | 3 | b | 10 | 10 |
+-------------------+------------------+--------------------+-------------------+---------------+-----------------------+--+
When I run the following query:
select salesid, sellerid, qty, avg(qty) over (order by sellerid) as avg_qty from winsales order by sellerid,salesid;
I get the following:
+----------+-----------+------+---------------------+--+
| salesid | sellerid | qty | avg_qty |
+----------+-----------+------+---------------------+--+
| 10001 | 1 | 10 | 20.0 |
| 10005 | 1 | 30 | 20.0 |
| 20001 | 2 | 20 | 20.0 |
| 20002 | 2 | 20 | 20.0 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30001 | 3 | 10 | 18.333333333333332 |
| 30003 | 3 | 15 | 18.333333333333332 |
| 30004 | 3 | 20 | 18.333333333333332 |
| 30007 | 3 | 30 | 18.333333333333332 |
| 40001 | 4 | 40 | 19.545454545454547 |
| 40005 | 4 | 10 | 19.545454545454547 |
+----------+-----------+------+---------------------+--+
The question is: how is avg(qty) being calculated?
Since I'm not using partition by, I would expect avg(qty) to be the same for all rows.
Any ideas?
If you want the same avg(qty) on every row, remove order by sellerid from the over clause; every row then gets 19.545454545454547.
Query to get the same avg(qty) for all rows:
hive> select salesid, sellerid, qty, avg(qty) over () as avg_qty from winsales order by sellerid,salesid;
When the over clause contains order by sellerid without an explicit frame, Hive uses the default frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so you get a cumulative average up to and including all rows with the current sellerid:
sellerid 1: 2 records (2 so far) with qty 10, 30, so the avg is
(10+30)/2 = 20.0
sellerid 2: 2 records (4 so far) with qty 20, 20, so the avg is
(10+30+20+20)/4 = 20.0
sellerid 3: 5 records (9 so far) with qty 10, 10, 15, 20, 30, so the avg is
(10+30+20+20+10+10+15+20+30)/9 = 18.333
sellerid 4: 2 records (11 so far) with qty 40, 10, so the avg is
(10+30+20+20+10+10+15+20+30+40+10)/11 = 19.545454545454547
This is expected behavior in Hive when the over clause includes order by.
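The cumulative behavior can be reproduced outside Hive. The sketch below runs the same avg(qty) window in SQLite through Python's sqlite3 module, on the assumption that SQLite's default frame with order by (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) matches Hive's; the helper avg_by_seller is hypothetical and only the three columns used by the query are loaded.

```python
import sqlite3

# (salesid, sellerid, qty) copied from the winsales dataset above.
SALES = [
    (30001, 3, 10), (10001, 1, 10), (10005, 1, 30), (40001, 4, 40),
    (20001, 2, 20), (40005, 4, 10), (20002, 2, 20), (30003, 3, 15),
    (30004, 3, 20), (30007, 3, 30), (30001, 3, 10),
]


def avg_by_seller(over_clause):
    """Return {sellerid: avg_qty} for avg(qty) OVER (<over_clause>)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE winsales (salesid INTEGER, sellerid INTEGER, qty INTEGER)")
    con.executemany("INSERT INTO winsales VALUES (?, ?, ?)", SALES)
    sql = f"SELECT sellerid, avg(qty) OVER ({over_clause}) FROM winsales"
    # Rows with the same sellerid are peers and share one value, so a dict is enough.
    return dict(con.execute(sql).fetchall())


# Default frame with ORDER BY: running average up to the current sellerid.
running = avg_by_seller("ORDER BY sellerid")
# Empty OVER (): one average over the whole table.
overall = avg_by_seller("")
# Explicit frame covering every row: same as the empty OVER ().
whole = avg_by_seller(
    "ORDER BY sellerid ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
)

print(running)
print(overall)
print(whole)
```

Spelling out the frame, as in the third call, is a way to keep order by in the window while still averaging over all rows.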

How to do this GROUP BY with the wanted result?

Basically, I have a table with all the bus stops of a route with the time_from_start value, that helps to put them in a good order.
CREATE TABLE `api_routestop` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`route_id` int(11) NOT NULL,
`station_id` varchar(10) NOT NULL,
`time_from_start` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `api_routestop_4fe3422a` (`route_id`),
KEY `api_routestop_15e3331d` (`station_id`)
)
I want to return for each stop of a line the time to go to the next stop.
I tried with this QUERY :
SELECT r1.station_id, r2.station_id, r1.route_id, COUNT(*), (r2.time_from_start - r1.time_from_start) as time
FROM api_routestop r1
LEFT JOIN api_routestop r2 ON r1.route_id = r2.route_id AND r1.id <> r2.id
GROUP BY r1.station_id
HAVING time >= 0
ORDER BY r1.route_id, r1.time_from_start, r2.time_from_start
But the group by does not seem to work, and the result looks like this:
+------------+------------+----------+----------+------+
| station_id | station_id | route_id | COUNT(*) | time |
+------------+------------+----------+----------+------+
| Rub01 | Sal01 | 1 | 16 | 1 |
| Lyc02 | Sch02 | 2 | 17 | 2 |
| Paq01 | PoB01 | 3 | 15 | 1 |
| LaT02 | Gco02 | 4 | 16 | 1 |
| Sup01 | Tur01 | 5 | 132 | 1 |
| Oeu02 | CtC02 | 6 | 20 | 2 |
| Ver02 | Elo02 | 7 | 38 | 1 |
| Can01 | Mbo01 | 8 | 70 | 1 |
| Ver01 | Elo01 | 9 | 77 | 1 |
| MCH01 | for02 | 10 | 77 | 1 |
+------------+------------+----------+----------+------+
If I do that :
SELECT r1.station_id, r2.station_id, r1.route_id, COUNT(*), (r2.time_from_start - r1.time_from_start) as time
FROM api_routestop r1
LEFT JOIN api_routestop r2 ON r1.route_id = r2.route_id AND r1.id <> r2.id
GROUP BY r1.station_id, r2.station_id, r1.route_id
HAVING time >= 0
ORDER BY r1.route_id, r1.time_from_start, r2.time_from_start
I am getting closer:
+------------+------------+----------+----------+------+
| station_id | station_id | route_id | COUNT(*) | time |
+------------+------------+----------+----------+------+
| Rub01 | Sal01 | 1 | 1 | 1 |
| Rub01 | ARM01 | 1 | 1 | 2 |
| Rub01 | MaV01 | 1 | 1 | 4 |
| Rub01 | COl01 | 1 | 1 | 5 |
| Rub01 | Str01 | 1 | 1 | 6 |
| Rub01 | Jau01 | 1 | 1 | 7 |
| Rub01 | Cdp01 | 1 | 1 | 9 |
| Rub01 | Rep01 | 1 | 1 | 11 |
| Rub01 | CoT01 | 1 | 1 | 12 |
| Rub01 | Ctr01 | 1 | 1 | 14 |
| Rub01 | FLy01 | 1 | 1 | 15 |
| Rub01 | Lib01 | 1 | 1 | 17 |
| Rub01 | Bru01 | 1 | 1 | 18 |
| Rub01 | Sch01 | 1 | 1 | 20 |
| Rub01 | Lyc01 | 1 | 1 | 22 |
| Rub01 | Res01 | 1 | 1 | 24 |
| Sal01 | ARM01 | 1 | 1 | 1 |
| Sal01 | MaV01 | 1 | 1 | 3 |
| Sal01 | COl01 | 1 | 1 | 4 |
| Sal01 | Str01 | 1 | 1 | 5 |
| Sal01 | Jau01 | 1 | 1 | 6 |
| Sal01 | Cdp01 | 1 | 1 | 8 |
| Sal01 | Rep01 | 1 | 1 | 10 |
| Sal01 | CoT01 | 1 | 1 | 11 |
| Sal01 | Ctr01 | 1 | 1 | 13 |
| Sal01 | FLy01 | 1 | 1 | 14 |
| Sal01 | Lib01 | 1 | 1 | 16 |
| Sal01 | Bru01 | 1 | 1 | 17 |
| Sal01 | Sch01 | 1 | 1 | 19 |
| Sal01 | Lyc01 | 1 | 1 | 21 |
...
3769 rows in set (0.07 sec)
But what do I have to do to get only the first result for each r1.station_id and r1.route_id?
You're getting a lot of results back because you're getting every stop joined to every other stop on the same route.
So you'll need to identify the "next" stop as the stop that has the same route ID and the smallest time_from_start that is later than the current stop's.
Update: added route_id to the next_stop subquery to deal with stations used in multiple routes.
SELECT
r1.station_id,
r2.station_id,
r1.route_id,
r2.time_from_start - r1.time_from_start as time
FROM
api_routestop r1
INNER JOIN (SELECT
r1.station_id , r2.route_id, min(r2.time_from_start) next_time_from_start
FROM
api_routestop r1
LEFT JOIN api_routestop r2 ON r1.route_id = r2.route_id AND r1.id <> r2.id
and r2.time_from_start > r1.time_from_start
GROUP BY r1.Station_id, r2.route_id) next_stop
ON r1.Station_id = next_stop.station_id
and r1.route_id = next_stop.route_id
LEFT JOIN api_routestop r2
ON r2.time_from_start = next_stop.next_time_from_start
and r1.route_id = r2.route_id
AND r2.time_from_start > r1.time_from_start
SELECT station_id, coalesce(
(SELECT time_from_start
FROM api_routestop t2
WHERE t2.time_from_start > t1.time_from_start
AND t2.time_from_start <= (SELECT time_from_start FROM api_routestop t5 WHERE t5.station_id = '4' AND t5.route_id=t1.route_id)
AND t2.route_id = t1.route_id
ORDER BY t2.time_from_start LIMIT 1), time_from_start) - time_from_start AS difference
FROM api_routestop t1
WHERE t1.route_id = 1
AND t1.time_from_start >= (SELECT time_from_start FROM api_routestop t4 WHERE t4.station_id = '2' AND t4.route_id=t1.route_id)
AND t1.time_from_start <= (SELECT time_from_start FROM api_routestop t5 WHERE t5.station_id = '4' AND t5.route_id=t1.route_id)
ORDER BY time_from_start
Are you open to changing the schema? If so, simply adding a column containing a sequential integer for all stops on a route would make this query a lot easier and more efficient.
Failing that, this will do it.
SELECT
station_id,
route_id,
time_from_start,
time_to_next
FROM
(
SELECT
station_id,route_id,time_from_start,
IF( @prev <> route_id, null, @time_from_start-time_from_start ) AS time_to_next,
@time_from_start := time_from_start,
@prev := route_id
FROM api_routestop
JOIN (SELECT @time_from_start := NULL, @prev := 0) AS r
ORDER BY route_id, time_from_start DESC
) t
ORDER BY route_id,time_from_start
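The idea shared by the answers above (the next stop is the one with the smallest later time_from_start on the same route) can also be sketched with the LEAD() window function. This is a swapped-in technique, not what the answers use: it assumes a database with window functions (for example MySQL 8.0 or later) and is run here through Python's sqlite3 module. The stop times are illustrative, loosely based on the differences shown in the question's output.

```python
import sqlite3

# (route_id, station_id, time_from_start); illustrative values only.
STOPS = [
    (1, "Rub01", 0),
    (1, "Sal01", 1),
    (1, "ARM01", 2),
    (1, "MaV01", 4),
    (2, "Lyc02", 0),
    (2, "Sch02", 2),
]

QUERY = """
SELECT route_id,
       station_id,
       LEAD(station_id) OVER w AS next_station_id,
       LEAD(time_from_start) OVER w - time_from_start AS time_to_next
FROM api_routestop
WINDOW w AS (PARTITION BY route_id ORDER BY time_from_start)
ORDER BY route_id, time_from_start
"""


def time_to_next_stop(stops):
    """Return (route_id, station_id, next_station_id, time_to_next) rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE api_routestop ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "route_id INTEGER NOT NULL, "
        "station_id TEXT NOT NULL, "
        "time_from_start INTEGER NOT NULL)"
    )
    con.executemany(
        "INSERT INTO api_routestop (route_id, station_id, time_from_start) "
        "VALUES (?, ?, ?)",
        stops,
    )
    return con.execute(QUERY).fetchall()


for row in time_to_next_stop(STOPS):
    print(row)
```

The last stop of each route has no following row, so next_station_id and time_to_next come back as NULL (None in Python) for it.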

SQL query that throws away rows that are older & satisfy a condition?

Issuing the following query:
SELECT t.seq,
t.buddyId,
t.mode,
t.type,
t.dtCreated
FROM MIM t
WHERE t.userId = 'ali'
ORDER BY t.dtCreated DESC;
...returns me 6 rows.
+-------------+------------------------+------+------+---------------------+
| seq | buddyId | mode | type | dtCreated |
+-------------+------------------------+------+------+---------------------+
| 12 | abcdefghij25@gmail.com | 2 | 1 | 2009-09-14 12:39:05 |
| 11 | abcdefghij25@gmail.com | 4 | 1 | 2009-09-14 12:39:02 |
| 10 | op_eee_81@hotmail.com | 1 | -1 | 2009-09-14 12:39:00 |
| 9 | abcdefghij25@gmail.com | 1 | -1 | 2009-09-14 12:38:59 |
| 8 | op_eee_81@hotmail.com | 2 | 1 | 2009-09-14 12:37:53 |
| 7 | abcdefghij25@gmail.com | 2 | 1 | 2009-09-14 12:37:46 |
+-------------+------------------------+------+------+---------------------+
I want to return rows based on this condition:
If there are duplicate rows with the same buddyId, only return me the latest (as specified by dtCreated).
So, the query should return me:
+-------------+------------------------+------+------+---------------------+
| seq | buddyId | mode | type | dtCreated |
+-------------+------------------------+------+------+---------------------+
| 12 | abcdefghij25@gmail.com | 2 | 1 | 2009-09-14 12:39:05 |
| 10 | op_eee_81@hotmail.com | 1 | -1 | 2009-09-14 12:39:00 |
+-------------+------------------------+------+------+---------------------+
I've tried with no success to use a UNIQUE function but it's not working.
This should return only the most recent entry for each buddyId.
SELECT a.seq
, a.buddyId
, a.mode
, a.type
, a.dtCreated
FROM MIM AS a
-- latest dtCreated per buddyId for this user
JOIN (SELECT buddyId, MAX(dtCreated) AS dtCreated
FROM MIM
WHERE userId = 'ali'
GROUP BY buddyId) AS b
ON a.dtCreated = b.dtCreated
AND a.buddyId = b.buddyId
WHERE a.userId = 'ali'
ORDER BY a.dtCreated DESC;
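The "latest row per buddyId" idea can be checked against the question's data with a small sketch. It joins each row to the per-buddy MAX(dtCreated) and runs through Python's sqlite3 module; the mode and type columns are left out for brevity, the row for user "bob" is a hypothetical addition to show why the subquery filters on userId, and latest_per_buddy is an illustrative helper name.

```python
import sqlite3

# (seq, userId, buddyId, dtCreated) from the question's output, plus one hypothetical row.
ROWS = [
    (12, "ali", "abcdefghij25@gmail.com", "2009-09-14 12:39:05"),
    (11, "ali", "abcdefghij25@gmail.com", "2009-09-14 12:39:02"),
    (10, "ali", "op_eee_81@hotmail.com", "2009-09-14 12:39:00"),
    (9, "ali", "abcdefghij25@gmail.com", "2009-09-14 12:38:59"),
    (8, "ali", "op_eee_81@hotmail.com", "2009-09-14 12:37:53"),
    (7, "ali", "abcdefghij25@gmail.com", "2009-09-14 12:37:46"),
    (99, "bob", "abcdefghij25@gmail.com", "2009-09-14 12:40:00"),  # hypothetical
]

QUERY = """
SELECT a.seq, a.buddyId, a.dtCreated
FROM MIM AS a
JOIN (SELECT buddyId, MAX(dtCreated) AS dtCreated
      FROM MIM
      WHERE userId = ?
      GROUP BY buddyId) AS b
  ON a.buddyId = b.buddyId
 AND a.dtCreated = b.dtCreated
WHERE a.userId = ?
ORDER BY a.dtCreated DESC
"""


def latest_per_buddy(rows, user_id):
    """Return the newest row per buddyId for one user."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE MIM (seq INTEGER, userId TEXT, buddyId TEXT, dtCreated TEXT)"
    )
    con.executemany("INSERT INTO MIM VALUES (?, ?, ?, ?)", rows)
    return con.execute(QUERY, (user_id, user_id)).fetchall()


for row in latest_per_buddy(ROWS, "ali"):
    print(row)
```

Timestamps are stored as ISO-formatted text here, which sorts correctly as strings; if two rows for the same buddy shared the exact same dtCreated, both would be returned.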