SQL Query/ Assigning Rank - sql

I am interested in SQL query not the PLSQL code.
We need to assign the rank based on date and id value
Input table should look like below
+------------+----+
| date | id |
+------------+----+
| 01-01-2018 | A |
| 02-01-2018 | A |
| 03-01-2018 | C |
| 04-01-2018 | B |
| 05-01-2018 | A |
| 06-01-2018 | C |
| 07-01-2018 | C |
| 08-01-2018 | B |
| 09-01-2018 | B |
| 10-01-2018 | B |
+------------+----+
output table should look like below
+------------+----+------+
| date | id | rank |
+------------+----+------+
| 01-01-2018 | A | 1 |
| 02-01-2018 | A | 2 |
| 03-01-2018 | C | 1 |
| 04-01-2018 | B | 1 |
| 05-01-2018 | A | 1 |
| 06-01-2018 | C | 1 |
| 07-01-2018 | C | 2 |
| 08-01-2018 | B | 1 |
| 09-01-2018 | B | 2 |
| 10-01-2018 | B | 3 |
+------------+----+------+

This is a type of gaps-and-islands problem. In this case, the simplest solution is probably the difference of row numbers:
select t.*,
row_number() over (partition by id, (seqnum - seqnum_i)
order by date
) as ranking
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by id order by date) as seqnum_i
from t
) t;
Why this works is a little tricky to explain. The difference of the two row numbers assigns a constant value to advance values of the same id. If you stare at the results of the subquery, you will see how this works.
Then the outer query just uses row_number() to assign the sequential number you want.

Related

Finding MAX date aggregated by order - Oracle SQL

I have a data orders that looks like this:
| Order | Step | Step Complete Date |
|:-----:|:----:|:------------------:|
| A | 1 | 11/1/2019 |
| | 2 | 11/1/2019 |
| | 3 | 11/1/2019 |
| | 4 | 11/3/2019 |
| | 5 | 11/3/2019 |
| | 6 | 11/5/2019 |
| | 7 | 11/5/2019 |
| B | 1 | 12/1/2019 |
| | 2 | 12/2/2019 |
| | 3 | |
| C | 1 | 10/21/2019 |
| | 2 | 10/23/2019 |
| | 3 | 10/25/2019 |
| | 4 | 10/25/2019 |
| | 5 | 10/25/2019 |
| | 6 | |
| | 7 | 10/27/2019 |
| | 8 | 10/28/2019 |
| | 9 | 10/29/2019 |
| | 10 | 10/30/2019 |
| D | 1 | 10/30/2019 |
| | 2 | 11/1/2019 |
| | 3 | 11/1/2019 |
| | 4 | 11/2/2019 |
| | 5 | 11/2/2019 |
What I need to accomplish is the following:
For each order, assign the 'Order_Completion_Date' field as the most recent 'Step_Complete_Date'. If ANY 'Step_Complete_Date' is NULL, then the value for 'Order_Completion_Date' should be NULL.
I set up a SQL FIDDLE with this data and my attempt, below:
SELECT
OrderNum,
MAX(Step_Complete_Date)
FROM
OrderNums
WHERE
Step_Complete_Date IS NOT NULL
GROUP BY
OrderNum
This is yielding:
ORDERNUM MAX(STEP_COMPLETE_DATE)
D 11/2/2019
A 11/5/2019
B 12/2/2019
C 10/30/2019
How can I achieve:
| OrderNum | Order_Completed_Date |
|:--------:|:--------------------:|
| A | 11/5/2019 |
| B | NULL |
| C | NULL |
| D | 11/2/2019 |
Aggregate function with KEEP can handle this
select ordernum,
max(step_complete_date)
keep (DENSE_RANK FIRST ORDER BY step_complete_date desc nulls first) res
FROM
OrderNums
GROUP BY
OrderNum
You can use a CASE expression to first count if there are any NULL values and if not then find the maximum value:
Query 1:
SELECT OrderNum,
CASE
WHEN COUNT( CASE WHEN Step_Complete_Date IS NULL THEN 1 END ) > 0
THEN NULL
ELSE MAX(Step_Complete_Date)
END AS Order_Completion_Date
FROM OrderNums
GROUP BY OrderNum
Results:
| ORDERNUM | ORDER_COMPLETION_DATE |
|----------|-----------------------|
| D | 11/2/2019 |
| A | 11/5/2019 |
| B | (null) |
| C | (null) |
First, you are representing dates as varchars in mm/dd/yyyy format (at least in fiddle). With max function it can produce incorrect result, try for example order with dates '11/10/2019' and '11/2/2019'.
Second, the most simple solution is IMHO to use fallback date for nulls and get null back when fallback date wins:
SELECT
OrderNum,
NULLIF(MAX(NVL(Step_Complete_Date,'~')),'~')
FROM
OrderNums
GROUP BY
OrderNum
(Example is still for varchars since tilde is greater than any digit. For dates, you could use 9999-12-31, for instance.)

Sorting column A based on a column B which contains previous values from column A

I'd like to sort column A based on a column B which contains previous values from column A.
This is what I have:
+----+----------+----------+
| ID | A | B |
+----+----------+----------+
| 1 | 17209061 | |
| 2 | 53199491 | 51249612 |
| 3 | 61249612 | 17209061 |
| 4 | 51249612 | 61249612 |
+----+----------+----------+
And this is what I'd like to have:
+----+----------+----------+----------+
| ID | A | B | Sort_seq |
+----+----------+----------+----------+
| 1 | 17209061 | | 1 |
| 3 | 61249612 | 17209061 | 2 |
| 4 | 51249612 | 61249612 | 3 |
| 2 | 53199491 | 51249612 | 4 |
+----+----------+----------+----------+
I'm sure there's an easy way to do this. Do you have any ideas?
Thank you!
Just use lag() in order by:
order by lag(a) over (order by id) nulls first
If you want a column, then:
select t.id, t.a, t.prev_a,
row_number() over (order by prev_a nulls first)
from (select t.*, lag(a) over (order by id) as prev_a
from t
) t;

Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
avg(count(distinct user_id))
over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average users per day by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
avg(count_distinct_a)
from (
select
b,
count(distinct a) as count_distinct_a
from foo
group by b
)
group by b
And from that, I can tweak it a bit to achieve what I want:
select
avg(count_distinct_user_id) as average_users_per_day
from (
select
user_type,
count(distinct user_id) as count_distinct_user_id
from user_activity
group by user_type, some_date
)
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1 | a | 1 |
| 1 | a | 2 |
| 2 | a | 1 |
| 3 | a | 2 |
| 3 | a | 2 |
| 4 | b | 2 |
| 5 | b | 1 |
| 5 | b | 3 |
| 5 | b | 3 |
| 6 | c | 1 |
| 7 | c | 1 |
| 8 | c | 4 |
| 9 | c | 2 |
| 9 | c | 3 |
| 9 | c | 4 |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a | 1 | 2 |
| a | 2 | 2 |
| b | 1 | 1 |
| b | 2 | 1 |
| b | 3 | 1 |
| c | 1 | 2 |
| c | 2 | 1 |
| c | 3 | 1 |
| c | 4 | 2 |
+-----------+-----------+---------------------+
SQL is: select user_type, some_date, count(distinct user_id) from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a | 2 |
| b | 1 |
| c | 1.5 |
+-----------+---------------------+

Hive Find Start and End of Group or Changing point

Here is the table:
+------+------+
| Name | Time |
+------+------+
| A | 1 |
| A | 2 |
| A | 3 |
| A | 4 |
| B | 5 |
| B | 6 |
| A | 7 |
| B | 8 |
| B | 9 |
| B | 10 |
+------+------+
I want to write a query to get:
+-------+--------+-----+
| Name | Start | End |
+-------+--------+-----+
| A | 1 | 4 |
| B | 5 | 6 |
| A | 7 | 7 |
| B | 8 | 10 |
+-------+--------+-----+
Does anyone know how to do it?
This is not the most efficient way, but it this works.
SELECT name, min(time) AS start,max(time) As end
FROM (
SELECT name,time, time- DENSE_RANK() OVER (partition by name ORDER BY
time) AS diff
FROM foo
) t
GROUP BY name,diff;
I would suggest try the following query and build a GenericUDF to identify the gaps, much more easier :)
SELECT name, sort_array(collect_list(time)) FROM foo GROUP BY name;

group by top two results based on order

I have been trying to get this to work with some row_number, group by, top, sort of things, but I am missing some fundamental concept. I have a table like so:
+-------+-------+-------+
| name | ord | f_id |
+-------+-------+-------+
| a | 1 | 2 |
| b | 5 | 2 |
| c | 6 | 2 |
| d | 2 | 1 |
| e | 4 | 1 |
| a | 2 | 3 |
| c | 50 | 4 |
+-------+-------+-------+
And my desired output would be:
+-------+---------+--------+-------+
| f_id | ord_n | ord | name |
+-------+---------+--------+-------+
| 2 | 1 | 1 | a |
| 2 | 2 | 5 | b |
| 1 | 1 | 2 | d |
| 1 | 2 | 4 | e |
| 3 | 1 | 2 | a |
| 4 | 1 | 50 | c |
+-------+---------+--------+-------+
Where data is ordered by the ord value, and only up to two results per f_id. Should I be working on a Stored Procedure for this or can I just do it with SQL? I have experimented with some select TOP subqueries, but nothing has even come close..
Here are some statements to create the test table:
create table help(name varchar(255),ord tinyint,f_id tinyint);
insert into help values
('a',1,2),
('b',5,2),
('c',6,2),
('d',2,1),
('e',4,1),
('a',2,3),
('c',50,4);
You may use Rank or DENSE_RANK functions.
select A.name, A.ord_n, A.ord , A.f_id from
(
select
RANK() OVER (partition by f_id ORDER BY ord asc) AS "Rank",
ROW_NUMBER() OVER (partition by f_id ORDER BY ord asc) AS "ord_n",
help.*
from help
) A where A.rank <= 2
Sqlfiddle demo