Rating rows by conditions - hive

Hello guys,
The existing dataset:
Student_id Book_id class_id timestamp
1101 NV5602 12 null
1101 NV5401 31 11/09/2021 16:50
1101 NV5401 12 null
Each book_id consists of two letters followed by a number. For each Student_id I would like to pick the highest book_id number (NV5602 in the example above).
If there are two book_ids with the same number for the same student_id (NV5401 in our example), then rank the row that has a timestamp as 1 and the one with a null timestamp as 2.
If all the timestamps are null for a given book_id and student_id, rank it as 1.
The output should look like this:
Student_id Book_id class_id timestamp row_number
1101 NV5602 12 null 1
1101 NV5401 31 11/09/2021 16:50 1
1101 NV5401 12 null 2

Use row_number. Demo:
with mydata as (
select 1101 Student_id, 'NV5602' Book_id, 12 class_id, null timestamp_ union all
select 1101, 'NV5401', 31, '11/09/2021 16:50' union all
select 1101, 'NV5401', 12, null
)
select Student_id, Book_id, class_id, timestamp_,
row_number() over(partition by student_id, case when timestamp_ is null then 1 else 0 end order by regexp_extract(Book_id,'[A-Z]+(\\d+)$',1) desc) as row_number
from mydata
Result:
student_id book_id class_id timestamp_ row_number
1101 NV5401 31 11/09/2021 16:50 1
1101 NV5602 12 NULL 1
1101 NV5401 12 NULL 2
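Not part of the original answer, but the same logic can be checked with Python's built-in sqlite3 module. SQLite has no regexp_extract, so the numeric part is taken with substr here, on the assumption that the prefix is always exactly two letters:

```python
import sqlite3

# In-memory database with the question's sample rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mydata (Student_id int, Book_id text, class_id int, timestamp_ text)"
)
conn.executemany(
    "insert into mydata values (?, ?, ?, ?)",
    [
        (1101, "NV5602", 12, None),
        (1101, "NV5401", 31, "11/09/2021 16:50"),
        (1101, "NV5401", 12, None),
    ],
)

# Same partitioning idea as the Hive answer: rows with a timestamp are
# ranked separately from rows without one, highest book number first.
query = """
select Student_id, Book_id, class_id, timestamp_,
       row_number() over (
           partition by Student_id,
                        case when timestamp_ is null then 1 else 0 end
           order by cast(substr(Book_id, 3) as integer) desc
       ) as rn
from mydata
"""
ranks = {(book, cls): rn for (_, book, cls, _, rn) in conn.execute(query)}
print(ranks)
```

This matches the expected output: both NV5602 (no timestamp) and NV5401 (with timestamp) get 1, and the null-timestamp NV5401 row gets 2.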

Related

Count distinct of multiple columns

I've been trying to figure a query out.
Let's say a table looks like this:
cus_id prod_category agreement_id type_id
111 10 123456 1
111 10 123456 1
111 10 123456 2
111 20 123456 2
123 20 987654 6
999 0 135790 99
999 0 246810 99
and so on...
I would like to count the distinct (agreement_id, type_id) combinations per cus_id and prod_category,
so I would like to get a result like this:
cus_id prod_category count
111 10 2
111 20 1
123 20 1
999 0 2
We can use the following two-level aggregation query:
SELECT cus_id, prod_category, COUNT(*) AS count
FROM
(
SELECT DISTINCT cus_id, prod_category, agreement_id, type_id
FROM yourTable
) t
GROUP BY cus_id, prod_category;
The inner distinct query de-duplicates the tuples, and the outer aggregation counts the number of distinct tuples per customer and product category.
You want to count distinct (agreement_id, type_id) tuples per (cus_id, prod_category) tuple.
"Per (cus_id, prod_category) tuple" translates to GROUP BY cus_id, prod_category in SQL.
And we count distinct (agreement_id, type_id) tuples with COUNT(DISTINCT agreement_id, type_id).
SELECT cus_id, prod_category, COUNT(DISTINCT agreement_id, type_id) AS distinct_count
FROM mytable
GROUP BY cus_id, prod_category
ORDER BY cus_id, prod_category;
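Note that COUNT(DISTINCT agreement_id, type_id) with multiple columns is MySQL/Hive syntax and is not portable; the two-level aggregation from the first answer works everywhere. A quick sanity check with Python's built-in sqlite3 module (table name taken from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table yourTable (cus_id int, prod_category int, agreement_id int, type_id int)"
)
conn.executemany(
    "insert into yourTable values (?, ?, ?, ?)",
    [
        (111, 10, 123456, 1),
        (111, 10, 123456, 1),
        (111, 10, 123456, 2),
        (111, 20, 123456, 2),
        (123, 20, 987654, 6),
        (999, 0, 135790, 99),
        (999, 0, 246810, 99),
    ],
)

# De-duplicate (cus_id, prod_category, agreement_id, type_id) tuples first,
# then count the surviving tuples per customer and product category.
query = """
select cus_id, prod_category, count(*) as cnt
from (
    select distinct cus_id, prod_category, agreement_id, type_id
    from yourTable
) t
group by cus_id, prod_category
order by cus_id, prod_category
"""
result = list(conn.execute(query))
print(result)  # [(111, 10, 2), (111, 20, 1), (123, 20, 1), (999, 0, 2)]
```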

Postgresql group by multiple columns and sort within each group

I have a table with the following data:
order_id  amount  rating  percent  customer_id  ref_id  doc_id
1         1000    1       0.6      112          8       5
2         2000    2       0.1      111          8       8
2         3000    3       0.2      110          8       6
3         4000    5       0.1      100          7       7
3         4000    2       0.7      124          7       9
3         5000    4       0.6      143          5       10
4         2000    6       0.4      125          4       11
4         2500    1       0.55     185          4       12
4         1000    4       0.42     168          5       13
4         1200    8       0.8      118          1       14
For each order_id I want to find the doc_id with the highest amount, then the highest rating, then the highest percent, and finally the lowest customer_id.
For a single order_id I can do it like this:
select order_id, doc_id
from orders
where order_id = 1625
order by amount desc nulls last,
rating desc nulls last,
percent desc nulls last,
customer_id asc
limit 1;
but I haven't been able to do it for all orders at once. The output should be something like this:
order_id  doc_id
1         5
2         6
3         10
4         12
I am using Postgresql.
Any idea how I should do this?
Use FIRST_VALUE() window function:
SELECT DISTINCT order_id,
FIRST_VALUE(doc_id) OVER (
PARTITION BY order_id
ORDER BY amount DESC NULLS LAST, rating DESC NULLS LAST, percent DESC NULLS LAST, customer_id
) doc_id
FROM orders;
It looks like you need to implement row numbering in a window, ordering by your desired criteria:
with t as (
select *,
Row_Number()
over(partition by order_id
order by amount desc, rating desc, percent desc, customer_id
) seq
from Yourtable
)
select order_id, doc_id
from t
where seq=1
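The row_number() variant can be verified outside Postgres; here is a sketch using Python's built-in sqlite3 module (the NULLS LAST clauses are dropped because this sample data contains no NULLs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders (order_id int, amount int, rating int,"
    " percent real, customer_id int, ref_id int, doc_id int)"
)
conn.executemany(
    "insert into orders values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1000, 1, 0.6, 112, 8, 5),
        (2, 2000, 2, 0.1, 111, 8, 8),
        (2, 3000, 3, 0.2, 110, 8, 6),
        (3, 4000, 5, 0.1, 100, 7, 7),
        (3, 4000, 2, 0.7, 124, 7, 9),
        (3, 5000, 4, 0.6, 143, 5, 10),
        (4, 2000, 6, 0.4, 125, 4, 11),
        (4, 2500, 1, 0.55, 185, 4, 12),
        (4, 1000, 4, 0.42, 168, 5, 13),
        (4, 1200, 8, 0.8, 118, 1, 14),
    ],
)

# Number the rows inside each order by the ranking criteria,
# then keep only the first row per order.
query = """
with t as (
    select *,
           row_number() over (
               partition by order_id
               order by amount desc, rating desc, percent desc, customer_id
           ) as seq
    from orders
)
select order_id, doc_id from t where seq = 1 order by order_id
"""
best = dict(conn.execute(query))
print(best)  # {1: 5, 2: 6, 3: 10, 4: 12}
```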

SQL - Order Data on a Column without including it in ranking

So I have a scenario where I need to order data on a column without including it in dense_rank(). Here is my sample data set:
This is the table:
create table temp
(
id integer,
prod_name varchar(max),
source_system integer,
source_date date,
col1 integer,
col2 integer);
This is the dataset:
insert into temp
(id,prod_name,source_system,source_date,col1,col2)
values
(1,'ABC',123,'01/01/2021',50,60),
(2,'ABC',123,'01/15/2021',50,60),
(3,'ABC',123,'01/30/2021',40,60),
(4,'ABC',123,'01/30/2021',40,70),
(5,'XYZ',456,'01/10/2021',80,30),
(6,'XYZ',456,'01/12/2021',75,30),
(7,'XYZ',456,'01/20/2021',75,30),
(8,'XYZ',456,'01/20/2021',99,30);
Now, I want to apply dense_rank() so that, for each combination of prod_name and source_system, the rank increments only when there is a change in col1 or col2, while the rows stay in ascending order of source_date.
Here is the expected result:
id  prod_name  source_system  source_date  col1  col2  Dense_Rank
1   ABC        123            01-01-21     50    60    1
2   ABC        123            15-01-21     50    60    1
3   ABC        123            30-01-21     40    60    2
4   ABC        123            30-01-21     40    70    3
5   XYZ        456            10-01-21     80    30    1
6   XYZ        456            12-01-21     75    30    2
7   XYZ        456            20-01-21     75    30    2
8   XYZ        456            20-01-21     99    30    3
As you can see above, the dates are changing but the expectation is that rank should only change if there is any change in either col1 or col2.
If I use this query
select id,prod_name,source_system,source_date,col1,col2,
dense_rank() over(partition by prod_name,source_system order by source_date,col1,col2) as rnk
from temp;
Then the result comes out as:
id  prod_name  source_system  source_date  col1  col2  rnk
1   ABC        123            01-01-21     50    60    1
2   ABC        123            15-01-21     50    60    2
3   ABC        123            30-01-21     40    60    3
4   ABC        123            30-01-21     40    70    4
5   XYZ        456            10-01-21     80    30    1
6   XYZ        456            12-01-21     75    30    2
7   XYZ        456            20-01-21     75    30    3
8   XYZ        456            20-01-21     99    30    4
And, if I exclude source_date from order by in rank function i.e.
select id,prod_name,source_system,source_date,col1,col2,
dense_rank() over(partition by prod_name,source_system order by col1,col2) as rnk
from temp;
Then my result comes out as:
id  prod_name  source_system  source_date  col1  col2  rnk
3   ABC        123            30-01-21     40    60    1
4   ABC        123            30-01-21     40    70    2
1   ABC        123            01-01-21     50    60    3
2   ABC        123            15-01-21     50    60    3
6   XYZ        456            12-01-21     75    30    1
7   XYZ        456            20-01-21     75    30    1
5   XYZ        456            10-01-21     80    30    2
8   XYZ        456            20-01-21     99    30    3
Both the results are incorrect. How can I get the expected result? Any guidance would be helpful.
WITH cte AS (
SELECT *,
LAG(col1) OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id) lag1,
LAG(col2) OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id) lag2
FROM temp
)
SELECT *,
SUM(CASE WHEN (col1, col2) = (lag1, lag2)
THEN 0
ELSE 1
END) OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id) AS `Dense_Rank`
FROM cte
ORDER BY id;
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=ac70104c7c5dfb49c75a8635c25716e6
When comparing multiple columns, I like to look at the previous values of the ordering column, rather than the individual columns. This makes it much simpler to add more and more columns.
The basic idea is to do a cumulative sum of changes for each prod/source system. In Redshift, I would phrase this as:
select t.*,
sum(case when prev_date = prev_date_2 then 0 else 1 end) over (
partition by prod_name, source_system
order by source_date
rows between unbounded preceding and current row
)
from (select t.*,
lag(source_date) over (partition by prod_name, source_system order by source_date, id) as prev_date,
lag(source_date) over (partition by prod_name, source_system, col1, col2 order by source_date, id) as prev_date_2
from temp t
) t
order by id;
I think I have the syntax right for Redshift.
Note that ties on the date can cause a problem -- regardless of the solution. This uses the id to break the ties. Perhaps id can just be used in general, but your code is using the date, so this uses the date with the id.
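The cumulative-sum-of-changes idea can be sketched with Python's built-in sqlite3 module. Two deviations from the answers above, both assumptions for portability: the dates are stored in ISO format so plain string ordering sorts them correctly, and the row-value comparison `(col1, col2) = (lag1, lag2)` is spelled out as two scalar comparisons:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table temp (id int, prod_name text, source_system int,"
    " source_date text, col1 int, col2 int)"
)
conn.executemany(
    "insert into temp values (?, ?, ?, ?, ?, ?)",
    [
        (1, "ABC", 123, "2021-01-01", 50, 60),
        (2, "ABC", 123, "2021-01-15", 50, 60),
        (3, "ABC", 123, "2021-01-30", 40, 60),
        (4, "ABC", 123, "2021-01-30", 40, 70),
        (5, "XYZ", 456, "2021-01-10", 80, 30),
        (6, "XYZ", 456, "2021-01-12", 75, 30),
        (7, "XYZ", 456, "2021-01-20", 75, 30),
        (8, "XYZ", 456, "2021-01-20", 99, 30),
    ],
)

# Flag a change whenever (col1, col2) differs from the previous row's pair
# (the first row of each partition has NULL lags, so the CASE also yields 1),
# then take a running sum of the flags in (source_date, id) order.
query = """
with cte as (
    select *,
           lag(col1) over (partition by prod_name, source_system
                           order by source_date, id) as lag1,
           lag(col2) over (partition by prod_name, source_system
                           order by source_date, id) as lag2
    from temp
)
select id,
       sum(case when col1 = lag1 and col2 = lag2 then 0 else 1 end)
           over (partition by prod_name, source_system
                 order by source_date, id) as grp_rank
from cte
order by id
"""
ranks = dict(conn.execute(query))
print(ranks)  # {1: 1, 2: 1, 3: 2, 4: 3, 5: 1, 6: 2, 7: 2, 8: 3}
```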

How to create a new sequential flag after a change in value (in SQL)

I'm trying to create a sequential numeric flag upon a change in section.
The flag should be 1 when a student joins a section and continue to be 1 until a change in section.
The flag should be 2 for the first change, 3 for the 2nd change and so forth.
Since a section can repeat after a change, I'm finding it challenging to create a desired outcome.
Any help would be greatly appreciated.
Sample data
create table dbo.cluster_test
(student_id int not null
,record_date date not null
,section varchar(30) null)
insert into cluster_test
(student_id, record_date, section)
values
(123, '2020-02-06', NULL)
,(123, '2020-05-14', 'A')
,(123, '2020-08-12', 'A')
,(123, '2020-09-01', 'B')
,(123, '2020-09-15', 'A')
,(123, '2020-09-29', 'A')
,(123, '2020-11-02', NULL)
,(123, '2020-11-30', NULL)
,(789, '2020-01-12', NULL)
,(789, '2020-04-12', 'A')
,(789, '2020-05-03', NULL)
,(789, '2020-06-13', 'A')
,(789, '2020-06-30', 'B')
,(789, '2020-07-01', 'B')
,(789, '2020-07-22', 'A')
Desired result:
student_id  record_date  section  flag
123         2020-02-06   NULL     NULL
123         2020-05-14   A        1
123         2020-08-12   A        1
123         2020-09-01   B        2
123         2020-09-15   A        3
123         2020-09-29   A        3
123         2020-11-02   NULL     NULL
123         2020-11-30   NULL     NULL
789         2020-01-12   NULL     NULL
789         2020-04-12   A        1
789         2020-05-03   NULL     NULL
789         2020-06-13   A        2
789         2020-06-30   B        3
789         2020-07-01   B        3
789         2020-07-22   A        4
Attempt:
select
student_id
,record_date
,section
,case when section is not null then row_number() over(partition by student_id, section order by record_date asc)
end row#
,case when (section is not null) and (lag(section, 1) over(partition by student_id order by record_date asc) is null) then 'start'
when (lag(section, 1) over(partition by student_id order by record_date asc) is not null) and (section != lag(section, 1) over(partition by student_id order by record_date asc)) then 'change'
end chk_txt
,case when section is not null then (case when (section is not null) and (lag(section, 1) over(partition by student_id order by record_date asc) is null) then 1
when (lag(section, 1) over(partition by student_id order by record_date asc) is not null) and (section != lag(section, 1) over(partition by student_id order by record_date asc)) then 1
else 0
end)
end chk_val2
from cluster_test
order by 1, 2
This is a gaps-and-islands problem. You can use analytic functions as follows:
select student_id, record_date, section,
       case when section is not null
            then sum(case when section is not null and (section <> lgs or lgs is null) then 1 end)
                 over (partition by student_id order by record_date)
       end as flag
from (
    select student_id, record_date, section,
           lag(section) over (partition by student_id order by record_date) as lgs
    from cluster_test t
) t
order by student_id, record_date;
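The lag-based query runs unchanged on SQLite, so it can be checked with Python's built-in sqlite3 module against the question's sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table cluster_test (student_id int, record_date text, section text)"
)
conn.executemany(
    "insert into cluster_test values (?, ?, ?)",
    [
        (123, "2020-02-06", None), (123, "2020-05-14", "A"),
        (123, "2020-08-12", "A"), (123, "2020-09-01", "B"),
        (123, "2020-09-15", "A"), (123, "2020-09-29", "A"),
        (123, "2020-11-02", None), (123, "2020-11-30", None),
        (789, "2020-01-12", None), (789, "2020-04-12", "A"),
        (789, "2020-05-03", None), (789, "2020-06-13", "A"),
        (789, "2020-06-30", "B"), (789, "2020-07-01", "B"),
        (789, "2020-07-22", "A"),
    ],
)

# A non-NULL row starts a new island when lag(section) is NULL or different;
# NULL-section rows get a NULL flag via the outer CASE.
query = """
select student_id, record_date, section,
       case when section is not null
            then sum(case when section is not null
                           and (section <> lgs or lgs is null) then 1 end)
                 over (partition by student_id order by record_date)
       end as flag
from (
    select student_id, record_date, section,
           lag(section) over (partition by student_id order by record_date) as lgs
    from cluster_test
) t
order by student_id, record_date
"""
flags = [row[3] for row in conn.execute(query)]
print(flags)  # [None, 1, 1, 2, 3, 3, None, None, None, 1, None, 2, 3, 3, 4]
```

This reproduces the desired result, including the 789 case where section A after a NULL gap gets a new flag value.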
You can also go for multiple CTEs, the second one recursive, and get the data as given below:
with cte_studentSection as (
SELECT student_id, record_Date, section
, lead(section) over(partition by student_id order by record_date) as nextSection
, row_number() over (partition by student_id order by record_date) as rnk
FROM dbo.Cluster_test
where section is not null
), cte_studentSectionFlag as (
SELECT Student_id, record_date, section, rnk, 1 as flag
from cte_studentSection as oc
where record_date = (SELECT MIN(record_Date) from cte_studentSection where student_id = oc.student_id)
union all
SELECT oc.Student_id, oc.record_date, oc.section,oc.rnk, case when oc.section = cte.section then cte.flag else cte.flag + 1 end
from cte_studentSection as oc
inner join cte_studentSectionFlag as cte on cte.rnk + 1 = oc.rnk and oc.student_id = cte.student_id
)
select student_id, record_date, section, flag
from cte_studentsectionflag
union all
select student_id, record_date, section, null as flag
from dbo.Cluster_test
where section is null
order by student_id, record_date;
student_id  record_date  section  flag
123         2020-02-06   NULL     NULL
123         2020-05-14   A        1
123         2020-08-12   A        1
123         2020-09-01   B        2
123         2020-09-15   A        3
123         2020-09-29   A        3
123         2020-11-02   NULL     NULL
123         2020-11-30   NULL     NULL
789         2020-01-12   NULL     NULL
789         2020-04-12   A        1
789         2020-05-03   NULL     NULL
789         2020-06-13   A        1
789         2020-06-30   B        2
789         2020-07-01   B        2
789         2020-07-22   A        3

T-SQL: Row_number() group by

I am using SQL Server 2008 R2 and have a structure as below:
create table #temp( deptid int, regionid int)
insert into #temp
select 15000, 50
union
select 15100, 51
union
select 15200, 50
union
select 15300, 52
union
select 15400, 50
union
select 15500, 51
union
select 15600, 52
select deptid, regionid, RANK() OVER(PARTITION BY regionid ORDER BY deptid) AS 'RANK',
ROW_NUMBER() OVER(PARTITION BY regionid ORDER BY deptid) AS 'ROW_NUMBER',
DENSE_RANK() OVER(PARTITION BY regionid ORDER BY deptid) AS 'DENSE_RANK'
from #temp
drop table #temp
And output currently is as below:
deptid regionid RANK ROW_NUMBER DENSE_RANK
--------------------------------------------------
15000 50 1 1 1
15200 50 2 2 2
15400 50 3 3 3
15100 51 1 1 1
15500 51 2 2 2
15300 52 1 1 1
15600 52 2 2 2
My requirement, however, is to number the rows over the regionid column by group rather than row by row. To explain better, below is my desired result set:
deptid regionid RN
-----------------------
15000 50 1
15200 50 1
15400 50 1
15100 51 2
15500 51 2
15300 52 3
15600 52 3
Please let me know if my question is unclear. Thanks.
Use dense_rank() over (order by regionid) to get the expected result.
select deptid, regionid,
DENSE_RANK() OVER( ORDER BY regionid) AS 'DENSE_RANK'
from #temp
A rank/row_number window function with PARTITION BY restarts its numbering within each partition, so to number the regionid groups themselves you should not partition by regionid; instead, order the whole window by regionid as above.
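A quick check of the group-level numbering with Python's built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table temp (deptid int, regionid int)")
conn.executemany(
    "insert into temp values (?, ?)",
    [(15000, 50), (15100, 51), (15200, 50), (15300, 52),
     (15400, 50), (15500, 51), (15600, 52)],
)

# dense_rank() over the whole set, ordered by regionid, gives every row
# in the same region the same number.
query = """
select deptid, regionid,
       dense_rank() over (order by regionid) as rn
from temp
order by regionid, deptid
"""
result = list(conn.execute(query))
print(result)
# [(15000, 50, 1), (15200, 50, 1), (15400, 50, 1),
#  (15100, 51, 2), (15500, 51, 2), (15300, 52, 3), (15600, 52, 3)]
```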