group by top two results based on order - sql

I have been trying to get this to work with various combinations of row_number, group by, and top, but I am missing some fundamental concept. I have a table like so:
+-------+-------+-------+
| name  | ord   | f_id  |
+-------+-------+-------+
| a     | 1     | 2     |
| b     | 5     | 2     |
| c     | 6     | 2     |
| d     | 2     | 1     |
| e     | 4     | 1     |
| a     | 2     | 3     |
| c     | 50    | 4     |
+-------+-------+-------+
And my desired output would be:
+-------+---------+--------+-------+
| f_id  | ord_n   | ord    | name  |
+-------+---------+--------+-------+
| 2     | 1       | 1      | a     |
| 2     | 2       | 5      | b     |
| 1     | 1       | 2      | d     |
| 1     | 2       | 4      | e     |
| 3     | 1       | 2      | a     |
| 4     | 1       | 50     | c     |
+-------+---------+--------+-------+
Here the data is ordered by the ord value within each f_id, and at most two results are returned per f_id. Should I be working on a stored procedure for this, or can I just do it with SQL? I have experimented with some SELECT TOP subqueries, but nothing has even come close.
Here are some statements to create the test table:
create table help(name varchar(255),ord tinyint,f_id tinyint);
insert into help values
('a',1,2),
('b',5,2),
('c',6,2),
('d',2,1),
('e',4,1),
('a',2,3),
('c',50,4);

You may use the RANK or DENSE_RANK window functions:
select A.f_id, A.ord_n, A.ord, A.name
from (
    select
        RANK() OVER (PARTITION BY f_id ORDER BY ord ASC) AS "Rank",
        ROW_NUMBER() OVER (PARTITION BY f_id ORDER BY ord ASC) AS "ord_n",
        help.*
    from help
) A
where A.rank <= 2
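A small variation worth noting (my addition, not part of the answer above): filtering on the ROW_NUMBER alias instead of the RANK alias guarantees at most two rows per f_id even when two rows tie on ord:

-- Using ROW_NUMBER caps the result at two rows per f_id even when ord values tie
select A.f_id, A.ord_n, A.ord, A.name
from (
    select
        ROW_NUMBER() OVER (PARTITION BY f_id ORDER BY ord ASC) AS ord_n,
        help.*
    from help
) A
where A.ord_n <= 2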

Related

SQL return only rows where value exists multiple times and other value is present

I have a table like this in MS SQL Server:
+------+------+
| ID   | Cust |
+------+------+
| 1    | A    |
| 1    | A    |
| 1    | B    |
| 1    | B    |
| 2    | A    |
| 2    | A    |
| 2    | A    |
| 2    | B    |
| 3    | A    |
| 3    | B    |
| 3    | B    |
| 3    | C    |
| 3    | C    |
+------+------+
I don't know the values in column "Cust" and I want to return all rows where the value of "Cust" appears multiple times and where at least one of the "ID" values is "1".
Like this:
+------+------+
| ID   | Cust |
+------+------+
| 1    | A    |
| 1    | A    |
| 1    | B    |
| 1    | B    |
| 2    | A    |
| 2    | A    |
| 2    | A    |
| 2    | B    |
| 3    | A    |
| 3    | B    |
| 3    | B    |
+------+------+
Any ideas? I can't find it.
You may use the COUNT window function as follows:
SELECT ID, Cust
FROM
(
    SELECT ID, Cust,
           COUNT(*) OVER (PARTITION BY Cust) cn,
           COUNT(CASE WHEN ID = 1 THEN 1 END) OVER (PARTITION BY Cust) cn2
    FROM table_name
) T
WHERE cn > 1 AND cn2 > 0
ORDER BY ID, Cust
COUNT(*) OVER (PARTITION BY Cust) checks whether the value of "Cust" appears multiple times.
COUNT(CASE WHEN ID=1 THEN 1 END) OVER (PARTITION BY Cust) checks that at least one of the "ID" values is "1".
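To make that concrete, the subquery produces these per-Cust counts for the sample data (the commented values are computed by hand from the table above, so treat this as an illustrative sketch):

SELECT ID, Cust,
       COUNT(*) OVER (PARTITION BY Cust) cn,
       COUNT(CASE WHEN ID = 1 THEN 1 END) OVER (PARTITION BY Cust) cn2
FROM table_name;
-- Cust A: cn = 6, cn2 = 2  -> kept  (appears 6 times, two rows with ID = 1)
-- Cust B: cn = 5, cn2 = 2  -> kept
-- Cust C: cn = 2, cn2 = 0  -> filtered out (no row with ID = 1)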

SQL Select random rows partitioned by a column

I have a dataset that looks like this:
| Country | id |
----------------
| a       | 5  |
| a       | 1  |
| a       | 2  |
| b       | 1  |
| b       | 5  |
| b       | 4  |
| b       | 7  |
| c       | 5  |
| c       | 1  |
| c       | 2  |
and I need a query which returns 2 random rows per country where Country in ('a', 'c'):
| Country | id |
----------------
| a       | 2  |  -- Two random rows from Country = 'a'
| a       | 1  |
| c       | 1  |
| c       | 5  |  -- Two random rows from Country = 'c'
This should work:
select Country, id
from (
    select Country,
           id,
           row_number() over (partition by Country order by rand()) as rn
    from table_name
) t
where Country in ('a', 'c') and rn <= 2
Replace rand() with random() if you're using Postgres, or with newid() if you're using SQL Server.
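For instance, the SQL Server flavor of the same query would look like this (just a sketch against the same hypothetical table_name):

-- SQL Server: NEWID() gives each row a random sort key within its Country partition
select Country, id
from (
    select Country,
           id,
           row_number() over (partition by Country order by newid()) as rn
    from table_name
) t
where Country in ('a', 'c') and rn <= 2;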

SQL Query/ Assigning Rank

I am interested in an SQL query, not PL/SQL code.
We need to assign a rank based on the date and id values.
The input table looks like below:
+------------+----+
| date       | id |
+------------+----+
| 01-01-2018 | A  |
| 02-01-2018 | A  |
| 03-01-2018 | C  |
| 04-01-2018 | B  |
| 05-01-2018 | A  |
| 06-01-2018 | C  |
| 07-01-2018 | C  |
| 08-01-2018 | B  |
| 09-01-2018 | B  |
| 10-01-2018 | B  |
+------------+----+
The output table should look like below:
+------------+----+------+
| date       | id | rank |
+------------+----+------+
| 01-01-2018 | A  | 1    |
| 02-01-2018 | A  | 2    |
| 03-01-2018 | C  | 1    |
| 04-01-2018 | B  | 1    |
| 05-01-2018 | A  | 1    |
| 06-01-2018 | C  | 1    |
| 07-01-2018 | C  | 2    |
| 08-01-2018 | B  | 1    |
| 09-01-2018 | B  | 2    |
| 10-01-2018 | B  | 3    |
+------------+----+------+
This is a type of gaps-and-islands problem. In this case, the simplest solution is probably the difference of row numbers:
select t.*,
       row_number() over (partition by id, (seqnum - seqnum_i)
                          order by date
                         ) as ranking
from (select t.*,
             row_number() over (order by date) as seqnum,
             row_number() over (partition by id order by date) as seqnum_i
      from t
     ) t;
Why this works is a little tricky to explain. The difference of the two row numbers assigns a constant value to adjacent rows with the same id. If you stare at the results of the subquery, you will see how this works.
Then the outer query just uses row_number() to assign the sequential number you want.
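To see what the answer describes, here is the inner subquery run on its own against the sample data; the seqnum and seqnum_i values in the comments are computed by hand from the table above, so treat this as an illustrative sketch:

-- The inner subquery alone: (seqnum - seqnum_i) is constant within each island
select t.*,
       row_number() over (order by date) as seqnum,
       row_number() over (partition by id order by date) as seqnum_i
from t;
-- date        id  seqnum  seqnum_i  seqnum - seqnum_i
-- 01-01-2018  A        1         1   0   -> island A, ranking 1, 2
-- 02-01-2018  A        2         2   0
-- 03-01-2018  C        3         1   2   -> island C, ranking 1
-- 04-01-2018  B        4         1   3   -> island B, ranking 1
-- 05-01-2018  A        5         3   2   -> island A, ranking 1
-- 06-01-2018  C        6         2   4   -> island C, ranking 1, 2
-- 07-01-2018  C        7         3   4
-- 08-01-2018  B        8         2   6   -> island B, ranking 1, 2, 3
-- 09-01-2018  B        9         3   6
-- 10-01-2018  B       10         4   6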

Hive - over (partition by ...) with a column not in group by

Is it possible to do something like:
select
avg(count(distinct user_id))
over (partition by some_date) as average_users_per_day
from user_activity
group by user_type
(notably, the partition by column, some_date, is not in the group by columns)
The idea I'm going for is something like: the average number of users per day, by user type.
I know how to do it using subqueries (see below), but I'd like to know if there is a nice way using only over (partition by ...) and group by.
Notes:
From reading this answer, my understanding (correct me if I'm wrong) is that the following query:
select
avg(count(distinct a)) over (partition by b)
from foo
group by b
can be expanded equivalently to:
select
avg(count_distinct_a)
from (
select
b,
count(distinct a) as count_distinct_a
from foo
group by b
) t
group by b
And from that, I can tweak it a bit to achieve what I want:
select
avg(count_distinct_user_id) as average_users_per_day
from (
select
user_type,
count(distinct user_id) as count_distinct_user_id
from user_activity
group by user_type, some_date
) t
group by user_type
(notably, the inner group by user_type, some_date differs from the outer group by user_type)
I'd like to be able to tell the partition by-group by interaction to use a "sub-group-by" for the windowing part. Please let me know if my understanding of partition by/group by is completely off.
EDIT: Some sample data and desired output.
Source table:
+---------+-----------+-----------+
| user_id | user_type | some_date |
+---------+-----------+-----------+
| 1       | a         | 1         |
| 1       | a         | 2         |
| 2       | a         | 1         |
| 3       | a         | 2         |
| 3       | a         | 2         |
| 4       | b         | 2         |
| 5       | b         | 1         |
| 5       | b         | 3         |
| 5       | b         | 3         |
| 6       | c         | 1         |
| 7       | c         | 1         |
| 8       | c         | 4         |
| 9       | c         | 2         |
| 9       | c         | 3         |
| 9       | c         | 4         |
+---------+-----------+-----------+
Sample intermediate table (for reasoning with):
+-----------+-----------+---------------------+
| user_type | some_date | distinct_user_count |
+-----------+-----------+---------------------+
| a         | 1         | 2                   |
| a         | 2         | 2                   |
| b         | 1         | 1                   |
| b         | 2         | 1                   |
| b         | 3         | 1                   |
| c         | 1         | 2                   |
| c         | 2         | 1                   |
| c         | 3         | 1                   |
| c         | 4         | 2                   |
+-----------+-----------+---------------------+
The SQL for this intermediate table is: select user_type, some_date, count(distinct user_id) as distinct_user_count from user_activity group by user_type, some_date.
Desired result:
+-----------+---------------------+
| user_type | average_daily_users |
+-----------+---------------------+
| a         | 2                   |
| b         | 1                   |
| c         | 1.5                 |
+-----------+---------------------+
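For reference, the subquery workaround from earlier in the question, applied to this sample data, reproduces the desired result; the per-type averages in the comments are computed by hand from the intermediate table above:

select user_type,
       avg(count_distinct_user_id) as average_daily_users
from (
    select user_type,
           some_date,
           count(distinct user_id) as count_distinct_user_id
    from user_activity
    group by user_type, some_date
) t
group by user_type;
-- a: avg(2, 2)       = 2
-- b: avg(1, 1, 1)    = 1
-- c: avg(2, 1, 1, 2) = 1.5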

Ranking: How to reset the ROW_NUMBER or RANK to 1

Using SQL Server 2014:
Consider the following table:
DECLARE @Table TABLE (
Id int NOT NULL identity(1,1),
Col_Value varchar(2)
)
INSERT INTO @Table (Col_Value)
VALUES ('A'),('A'),('B'),('B'),('B'),('A'),('A'),('B'),('B'),('B'),('A'),('B'),('B'),('A'),('A'),('B'),('C'),('C'),('A'),('A'),('B'),('B'),('C')
How can I create a query that produces the R column in the result like below?
+----+------+---+
| ID | Data | R |
+----+------+---+
| 1  | A    | 1 |
| 2  | A    | 2 |
| 3  | B    | 1 |
| 4  | B    | 2 |
| 5  | B    | 3 |
| 6  | A    | 1 |
| 7  | A    | 2 |
| 8  | B    | 1 |
| 9  | B    | 2 |
| 10 | B    | 3 |
| 11 | A    | 1 |
| 12 | B    | 1 |
| 13 | B    | 2 |
| 14 | A    | 1 |
| 15 | A    | 2 |
| 16 | B    | 1 |
| 17 | C    | 1 |
| 18 | C    | 2 |
| 19 | A    | 1 |
| 20 | A    | 2 |
| 21 | B    | 1 |
| 22 | B    | 2 |
| 23 | C    | 1 |
+----+------+---+
In the result table above, whenever the Data column changes from one row to the next, the R value resets to 1.
Update 1
Ben Thul's answer works very well.
I suggest the post below be updated with a reference to this answer.
T-sql Reset Row number on Field Change
This is known as a "gaps and islands" problem in the literature. First, my proposed solution:
with cte as (
    select *,
           [Id] - row_number() over (partition by [Col_Value] order by [Id]) as [GroupID]
    from @Table
)
select [Id], [Col_Value],
       row_number() over (partition by [GroupID], [Col_Value] order by [Id]) as [R]
from cte
order by [Id];
For exposition, note that if I enumerate all of the "A" values using row_number(), those that are contiguous have the row_number() value go up at the same rate as the Id value. Which is to say that their difference will be the same for those in that contiguous group (also known as an "island"). Once we calculate that group identifier, it's merely a matter of enumerating each member per group.
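To make the island identifier concrete, here is the CTE inspected on its own; the GroupID values in the comments are computed by hand from the sample rows (shown for Ids 1-10 only), so treat this as an illustrative sketch:

with cte as (
    select *,
           [Id] - row_number() over (partition by [Col_Value] order by [Id]) as [GroupID]
    from @Table
)
select [Id], [Col_Value], [GroupID]
from cte
order by [Id];
-- Id  Col_Value  GroupID
--  1  A          0    <- A-island (Ids 1-2):  R = 1, 2
--  2  A          0
--  3  B          2    <- B-island (Ids 3-5):  R = 1, 2, 3
--  4  B          2
--  5  B          2
--  6  A          3    <- A-island (Ids 6-7):  R = 1, 2
--  7  A          3
--  8  B          4    <- B-island (Ids 8-10): R = 1, 2, 3
--  9  B          4
-- 10  B          4
-- ... and so on through Id 23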