Grouping using row number function - sql

I have been using the row_number() function to select only the observations I need.
In my scenario, whenever there are two different names for a particular <id, entity_id, period, element>, the "National" one should be left out. If there is only one, take that one.
+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period | element       | name                      |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    | 2021   | Overall value | National Compatible - XYZ |
| 12 | ABC123    | 2021   | Overall value | Overall Estimation        |
+----+-----------+--------+---------------+---------------------------+
With cases like above, the following did the trick:
SELECT *
FROM (SELECT *,
             ROW_NUMBER() OVER (
               PARTITION BY id, entity_id, period, element
               ORDER BY name DESC) AS rn
      FROM mydata) t
WHERE t.rn = 1
Problem is that now there are other cases like the following:
+----+-----------+--------+---------------+---------------------------+
| id | entity_id | period | element       | name                      |
+----+-----------+--------+---------------+---------------------------+
| 12 | ABC123    | 2021   | Overall value | National Based - ZYX      |
| 12 | ABC123    | 2021   | Overall value | Base Estimation           |
+----+-----------+--------+---------------+---------------------------+
And with the current SQL this would not work, as I would have to change the ORDER BY from descending to ascending.
Is there any way to de-prioritize the "National..." record and take the other one in case there are multiple?
I am running the query on Hive/Impala.

If you add another derived-table layer (or use a CTE), you can add a CASE WHEN that checks whether name starts with 'National' and assigns a simple integer tag value you can use to de-prioritize those rows.
...like so:
WITH q AS (
  SELECT
    id,
    entity_id,
    period,
    element,
    name,
    CASE WHEN name LIKE 'National%' THEN 1 ELSE 2 END AS tag
  FROM mydata
),
filtered AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY id, entity_id, period, element
      ORDER BY tag DESC, name DESC
    ) AS rn
  FROM q
)
SELECT *
FROM filtered
WHERE rn = 1
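If it helps to sanity-check the approach, here is a minimal sketch of the tag-then-rank idea. I've run it through SQLite via Python's sqlite3 module purely as a harness (the question targets Hive/Impala, but the window-function logic is identical); the rows for ids 13 and 14 are invented to exercise both cases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mydata (id INT, entity_id TEXT, period INT, element TEXT, name TEXT);
INSERT INTO mydata VALUES
  (12, 'ABC123', 2021, 'Overall value', 'National Compatible - XYZ'),
  (12, 'ABC123', 2021, 'Overall value', 'Overall Estimation'),
  -- extra sample groups (not from the question) to cover both cases:
  (13, 'ABC123', 2021, 'Overall value', 'National Based - ZYX'),
  (13, 'ABC123', 2021, 'Overall value', 'Base Estimation'),
  (14, 'ABC123', 2021, 'Overall value', 'National Only Row');
""")

# Tag National rows with 1, everything else with 2, then rank tag DESC
# so the non-National row wins whenever the group has one.
rows = conn.execute("""
SELECT id, name
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY id, entity_id, period, element
           ORDER BY CASE WHEN name LIKE 'National%' THEN 1 ELSE 2 END DESC,
                    name DESC
         ) AS rn
  FROM mydata
) ranked
WHERE rn = 1
ORDER BY id
""").fetchall()

for r in rows:
    print(r)
# Non-National names win each group; id 14 still returns its only (National) row.
```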

Postgres create view with column values based on another table?

I'm implementing a view to store leaderboard data of the top 10 users that is computed using an expensive COUNT(*). I'm planning on the view to look something like this:
id SERIAL PRIMARY KEY
user_id TEXT
type TEXT
rank INTEGER
count INTEGER
-- adding an index to user_id
-- adding a two-column unique index to user_id and type
I'm having trouble with seeing how this view should be created to properly account for the rank and type. Essentially, I have a big table (~30 million rows) like this:
+----+---------+---------+----------------------------+
| id | user_id | type    | created_at                 |
+----+---------+---------+----------------------------+
|  1 |       1 | Diamond | 2021-05-11 17:35:18.399517 |
|  2 |       1 | Diamond | 2021-05-12 17:35:17.399517 |
|  3 |       1 | Diamond | 2021-05-12 17:35:18.399517 |
|  4 |       2 | Diamond | 2021-05-13 17:35:18.399517 |
|  5 |       1 | Clay    | 2021-05-14 17:35:18.399517 |
|  6 |       1 | Clay    | 2021-05-15 17:35:18.399517 |
+----+---------+---------+----------------------------+
With the table above, I'm trying to achieve something like this:
+----+---------+---------+------+-------+
| id | user_id | type    | rank | count |
+----+---------+---------+------+-------+
|  1 |       1 | Diamond |    1 |     3 |
|  2 |       2 | Diamond |    2 |     1 |
|  3 |       1 | Clay    |    1 |     2 |
|  4 |       1 | Weekly  |    1 |     5 | -- 3 diamonds + 2 clay obtained between Mon-Sun
|  5 |       2 | Weekly  |    2 |     1 |
+----+---------+---------+------+-------+
By Weekly I am counting the time from the last Sunday to the upcoming Sunday.
Is this doable using only SQL, or is some kind of script needed? If doable, how would this be done? It's worth mentioning that there are thousands of different types, so not having to manually specify type would be preferred.
If there's anything unclear, please let me know and I'll do my best to clarify. Thanks!
The "weekly" rows are produced in a different way from the "user" rows (I call these two different "categories"). To get the result you want, you can combine two queries using UNION ALL.
For example:
select 'u' as category, user_id, type,
       rank() over (partition by type order by count(*) desc) as rk,
       count(*) as cnt
from scores
group by user_id, type
union all
select 'w', user_id, 'Weekly',
       rank() over (order by count(*) desc),
       count(*) as cnt
from scores
group by user_id
order by category, type desc, rk
Result:
category  user_id  type     rk  cnt
--------  -------  -------  --  ---
u         1        Diamond  1   3
u         2        Diamond  2   1
u         1        Clay     1   2
w         1        Weekly   1   5
w         2        Weekly   2   1
See running example at DB Fiddle.
Note: For the sake of simplicity I left the filtering by timestamp out of the query. If you really needed to include only the rows of the last 7 days (or other period of time), it would be a matter of adding a WHERE clause in both subqueries.
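As a quick sanity check, here is the UNION ALL approach sketched in SQLite through Python's sqlite3 module (my choice of harness, not part of the question). The counts are hoisted into CTEs before ranking, which keeps the query portable across engines; the data comes from the question's example table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (id INT, user_id INT, type TEXT, created_at TEXT);
INSERT INTO scores VALUES
  (1, 1, 'Diamond', '2021-05-11 17:35:18'),
  (2, 1, 'Diamond', '2021-05-12 17:35:17'),
  (3, 1, 'Diamond', '2021-05-12 17:35:18'),
  (4, 2, 'Diamond', '2021-05-13 17:35:18'),
  (5, 1, 'Clay',    '2021-05-14 17:35:18'),
  (6, 1, 'Clay',    '2021-05-15 17:35:18');
""")

rows = conn.execute("""
WITH per_type AS (  -- counts per user and type ("user" category)
  SELECT user_id, type, COUNT(*) AS cnt
  FROM scores GROUP BY user_id, type
),
per_user AS (       -- counts per user over everything ("weekly" category)
  SELECT user_id, COUNT(*) AS cnt
  FROM scores GROUP BY user_id
)
SELECT 'u' AS category, user_id, type,
       RANK() OVER (PARTITION BY type ORDER BY cnt DESC) AS rk, cnt
FROM per_type
UNION ALL
SELECT 'w', user_id, 'Weekly',
       RANK() OVER (ORDER BY cnt DESC), cnt
FROM per_user
ORDER BY category, type DESC, rk
""").fetchall()

for r in rows:
    print(r)
```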
I think this is what you were talking about, right?
WITH scores_plus_weekly AS (
  SELECT id, user_id, 'Weekly' AS type, created_at
  FROM scores
  WHERE created_at BETWEEN '2021-05-10' AND '2021-05-17'
  UNION
  SELECT * FROM scores
)
SELECT
  row_number() OVER (ORDER BY CASE "type" WHEN 'Diamond' THEN 0 WHEN 'Clay' THEN 1 ELSE 2 END, count(*) DESC) AS "id",
  user_id,
  "type",
  row_number() OVER (PARTITION BY "type" ORDER BY count(*) DESC) AS "rank",
  count(*)
FROM scores_plus_weekly
GROUP BY user_id, "type"
ORDER BY "id";
I'm sure this is not the only way, but I thought the result wasn't too complex. This query first combines the original database with all scores from this week. For the sake of consistency I picked a date range that matches your entire example set. It then groups by user_id and type to get the counts for each combination. The row_numbers will give you the overall rank and the rank per type. A big part of this query consists of sorting by type, so if you're joining another table that contains the order or priority of the types, the CASE can probably be simplified.
Then, lastly, this entire query can be wrapped in a view using CREATE VIEW score_ranks AS, followed by your query.

Oracle SQL: Counting how often an attribute occurs for a given entry and choosing the attribute with the maximum number of occurrences

I have a table that has a number column and an attribute column like this:
1.
+-----+-----+
| num | att |
+-----+-----+
|  1  |  a  |
|  1  |  b  |
|  1  |  a  |
|  2  |  a  |
|  2  |  b  |
|  2  |  b  |
+-----+-----+
I want to make the number unique, and the attribute to be whichever attribute occurred most often for that number, like this (this is the end product I'm interested in):
2.
+-----+-----+
| num | att |
+-----+-----+
|  1  |  a  |
|  2  |  b  |
+-----+-----+
I have been working on this for a while and managed to write a query that looks up how many times an attribute occurs for a given number, like this:
3.
+-----+-----+-------+
| num | att | count |
+-----+-----+-------+
|  1  |  a  |   2   |
|  1  |  b  |   1   |
|  2  |  a  |   1   |
|  2  |  b  |   2   |
+-----+-----+-------+
But I can't think of a way to select only those rows from the above table where the count is the highest for each number.
So basically what I am asking is: given table 3, how do I select only the rows with the highest count for each number? (Of course, an answer providing a way to get from table 1 to table 2 directly also works :) )
You can use aggregation and window functions:
select num, att
from (
  select num, att,
         row_number() over (partition by num order by count(*) desc, att) as rn
  from mytable
  group by num, att
) t
where rn = 1
For each num, this brings the most frequent att; if there are ties, the smaller att is retained.
Oracle has an aggregation function that does this, stats_mode():
select num, stats_mode(att)
from t
group by num;
In statistics, the most common value is called the mode -- hence the name of the function.
Here is a db<>fiddle.
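Outside Oracle there is no stats_mode(), but the row_number() version runs on anything with window functions. Here is a minimal sketch in SQLite via Python's sqlite3 module (my choice of harness), with the counts hoisted into a CTE for clarity and the question's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (num INT, att TEXT);
INSERT INTO mytable VALUES (1,'a'),(1,'b'),(1,'a'),(2,'a'),(2,'b'),(2,'b');
""")

# Count each (num, att) pair, then keep the most frequent att per num
# (ties broken by the smaller att, as in the answer above).
rows = conn.execute("""
WITH counts AS (
  SELECT num, att, COUNT(*) AS cnt
  FROM mytable
  GROUP BY num, att
)
SELECT num, att
FROM (
  SELECT num, att,
         ROW_NUMBER() OVER (PARTITION BY num ORDER BY cnt DESC, att) AS rn
  FROM counts
) t
WHERE rn = 1
ORDER BY num
""").fetchall()

print(rows)
# Table 1 becomes table 2: the mode per num.
```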
You can use GROUP BY and COUNT as below (this produces table 3):
select num, att, count(att) as cnt
from mytable
group by num, att

T-SQL Remove Duplicates from Groups BUT NOT GET TOP 1 FROM EACH GROUP

I do NOT want to get the top 1 from each group! Pay attention to the explanation I have provided in the last portion of my question!
I have the following rows:
| Code | Type | SubType | Date       |
|:----:|:----:|:-------:|:----------:|
| 100  |  10  |    1    | 17.12.2019 |
| 100  |  10  |    2    | 18.12.2019 |
| 100  |  10  |    2    | 19.12.2019 |
| 100  |  10  |    1    | 20.12.2019 |
What I need is to make groups of rows based on Code, Type and SubType columns. But not only should I keep the Date column, but I have to remove duplicate rows (based on Code, Type and SubType columns) from those groups which are in the middle as follows:
| Code | Type | SubType | Date       |
|:----:|:----:|:-------:|:----------:|
| 100  |  10  |    1    | 17.12.2019 |
| 100  |  10  |    2    | 18.12.2019 |
| 100  |  10  |    1    | 20.12.2019 |
Let me explain more about the scenario which leads to this situation, and why I need to clean my data before displaying it to the end user. I have a historical table which has 4 columns (Code, Type, SubType and Date). Each row of this table shows a change that occurred in the values of the fields of that row on a specific date. For instance, in the above example, there have been 4 changes to the row on 4 different dates. At first, the row was generated with Code = 100, Type = 10 and SubType = 1 on 17.12.2019. Then SubType was changed to 2 on 18.12.2019. The next day, on 19.12.2019, SubType was changed again to 2 (which is a duplicate in my case). Finally, SubType was changed back to 1 on 20.12.2019. In fact, I don't need to show the 3rd change as it is a duplicate in my case.
I tried using Row_Number() Over(Partition By Code, Type, SubType Order By Date), but I was not successful.
You want to keep the dates where something changes. My recommendation is lag on the date:
select t.*
from (select t.*,
lag(date) over (partition by code, type, subtype order by date) as prev_cts_date,
lag(date) over (order by date) as prev_date
from t
) t
where prev_cts_date is null or prev_cts_date <> prev_date;
One alternative is a lag() on each of the columns, then checking each value for a change. Not only is that cumbersome, but the logic gets much worse if NULL values are involved.
The logic here just asks: "Is the previous date for the Code/Type/SubType combination the same as the previous date overall?" If so, discard the record.
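A minimal sketch of the lag() comparison, run through SQLite via Python's sqlite3 module (my choice of harness; ISO dates are used so the ORDER BY sorts correctly, and the table name t follows the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (code INT, type INT, subtype INT, date TEXT);
INSERT INTO t VALUES
  (100, 10, 1, '2019-12-17'),
  (100, 10, 2, '2019-12-18'),
  (100, 10, 2, '2019-12-19'),
  (100, 10, 1, '2019-12-20');
""")

# Keep a row only when the previous date within its Code/Type/SubType
# group is NOT the immediately preceding date overall.
rows = conn.execute("""
SELECT code, type, subtype, date
FROM (SELECT t.*,
             LAG(date) OVER (PARTITION BY code, type, subtype ORDER BY date) AS prev_cts_date,
             LAG(date) OVER (ORDER BY date) AS prev_date
      FROM t) t
WHERE prev_cts_date IS NULL OR prev_cts_date <> prev_date
ORDER BY date
""").fetchall()

print(rows)
# The duplicate change on 2019-12-19 is filtered out.
```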
This looks to me like a gaps-and-islands problem. Here is one approach using row_number():
select code, type, SubType, Date
from (
select
t.*,
row_number() over(partition by code, type, rn1 - rn2 order by date) rn
from (
select
t.*,
row_number() over(partition by code, type order by date) rn1,
row_number() over(partition by code, type, SubType order by date) rn2
from mytable t
) t
) t
where rn = 1
This defines groups by taking the difference of row numbers over partitions of (code, type) against partitions of (code, type, subtype). Then we select the first record per group, using row_number() again.
Demo on DB Fiddle:
code | type | SubType | Date
---: | ---: | ------: | :---------
100 | 10 | 1 | 17.12.2019
100 | 10 | 2 | 18.12.2019
100 | 10 | 1 | 20.12.2019
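The gaps-and-islands variant can be checked the same way; here it is sketched in SQLite via Python's sqlite3 module (my choice of harness), with the question's rows and ISO dates for sorting:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (code INT, type INT, subtype INT, date TEXT);
INSERT INTO mytable VALUES
  (100, 10, 1, '2019-12-17'),
  (100, 10, 2, '2019-12-18'),
  (100, 10, 2, '2019-12-19'),
  (100, 10, 1, '2019-12-20');
""")

# rn1 numbers rows per (code, type); rn2 per (code, type, subtype).
# rn1 - rn2 is constant within each island of consecutive equal subtypes,
# so ranking inside (code, type, rn1 - rn2) and keeping rn = 1 keeps
# only the first row of each island.
rows = conn.execute("""
SELECT code, type, subtype, date
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY code, type, rn1 - rn2 ORDER BY date) AS rn
  FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY code, type ORDER BY date) AS rn1,
           ROW_NUMBER() OVER (PARTITION BY code, type, subtype ORDER BY date) AS rn2
    FROM mytable t
  ) t
) t
WHERE rn = 1
ORDER BY date
""").fetchall()

print(rows)
# Same result as the lag() approach: the 2019-12-19 duplicate disappears.
```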

SQL set increasing integer where value of column is 1

I have a data set which looks like:
Id INT,
Choice VARCHAR,
Order INT
Id + Choice form the primary key.
Currently a lot of the rows have Order = 1.
What I would like to do is, for each Id, if there are multiple rows with that Id where Order = 1, set them to be 1, 2, 3, 4, etc.
I can't work out the SQL to do this.
Example data:
+----+--------+-------+
| Id | Choice | Order |
+----+--------+-------+
|  4 | hello  |     1 |
|  4 | world  |     1 |
|  4 | test   |     1 |
+----+--------+-------+
Would become:
+----+--------+-------+
| Id | Choice | Order |
+----+--------+-------+
|  4 | hello  |     1 |
|  4 | world  |     2 |
|  4 | test   |     3 |
+----+--------+-------+
We can try using ROW_NUMBER here with a partition by Id. As for the ordering in your Order column, I don't see any logic present for how you numbered things. In the absence of this, I use the Choice column to decide how to order the row numbering.
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Choice) rn
FROM yourTable
WHERE [Order] = 1
)
UPDATE cte
SET [Order] = rn;
Note: Please avoid naming your columns (tables, etc.) using reserved SQL keywords like ORDER. You will forever have to put that column name in square brackets, like this: [Order].
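T-SQL lets you UPDATE through the CTE directly, which not every engine supports. As a sanity check of the numbering itself, here is the ROW_NUMBER() part as a plain SELECT, sketched in SQLite via Python's sqlite3 module (my choice of harness; note Order must be quoted there as well):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourTable (Id INT, Choice TEXT, "Order" INT);
INSERT INTO yourTable VALUES (4, 'hello', 1), (4, 'world', 1), (4, 'test', 1);
""")

# Number the Order = 1 rows per Id, ordered by Choice as in the answer.
rows = conn.execute("""
SELECT Id, Choice,
       ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Choice) AS NewOrder
FROM yourTable
WHERE "Order" = 1
ORDER BY Id, NewOrder
""").fetchall()

print(rows)
# Alphabetical by Choice: hello -> 1, test -> 2, world -> 3.
```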

How to use DISTINCT ON (of PostgreSQL) in Firebird?

I have a TempTable with this data:
------------------------------------
| KEY_1 | KEY_2 | NAME   | VALUE   |
------------------------------------
|   1   | 0001  | NAME 2 | VALUE 1 |
|   1   | 0002  | NAME 1 | VALUE 3 |
|   1   | 0003  | NAME 3 | VALUE 2 |
|   2   | 0001  | NAME 1 | VALUE 2 |
|   2   | 0001  | NAME 2 | VALUE 1 |
------------------------------------
I want to get the following data:
------------------------------------
| KEY_1 | KEY_2 | NAME   | VALUE   |
------------------------------------
|   1   | 0001  | NAME 2 | VALUE 1 |
|   2   | 0001  | NAME 1 | VALUE 2 |
------------------------------------
In PostgreSQL, I use a query with DISTINCT ON:
SELECT DISTINCT ON (KEY_1) KEY_1, KEY_2, NAME, VALUE
FROM TempTable
ORDER BY KEY_1, KEY_2
In Firebird, how can I get the data above?
PostgreSQL's DISTINCT ON takes the first row per stated group key considering the ORDER BY clause. In other DBMS (including later versions of Firebird), you'd use ROW_NUMBER for this. You number the rows per group key in the desired order and stay with those numbered #1.
select key_1, key_2, name, value
from
(
select key_1, key_2, name, value,
row_number() over (partition by key_1 order by key_2) as rn
from temptable
) numbered
where rn = 1
order by key_1, key_2;
In your example you have a tie (key_1 = 2 / key_2 = 0001 occurs twice) and the DBMS picks one of the rows arbitrarily. (You'd have to extend the sort key in both DISTINCT ON and ROW_NUMBER to decide which one to pick.) If you want both rows, i.e. showing all tied rows, use RANK (or DENSE_RANK) instead of ROW_NUMBER, which is something DISTINCT ON is not capable of.
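A quick sketch of the numbering approach, run in SQLite via Python's sqlite3 module (my choice of harness), with the question's rows. Note the tie for key_1 = 2: which of its two rows comes back is arbitrary, exactly as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE temptable (key_1 INT, key_2 TEXT, name TEXT, value TEXT);
INSERT INTO temptable VALUES
  (1, '0001', 'NAME 2', 'VALUE 1'),
  (1, '0002', 'NAME 1', 'VALUE 3'),
  (1, '0003', 'NAME 3', 'VALUE 2'),
  (2, '0001', 'NAME 1', 'VALUE 2'),
  (2, '0001', 'NAME 2', 'VALUE 1');
""")

# Number the rows per key_1, smallest key_2 first, and keep only #1.
rows = conn.execute("""
SELECT key_1, key_2, name, value
FROM (SELECT key_1, key_2, name, value,
             ROW_NUMBER() OVER (PARTITION BY key_1 ORDER BY key_2) AS rn
      FROM temptable) numbered
WHERE rn = 1
ORDER BY key_1, key_2
""").fetchall()

print(rows)
# One row per key_1; the key_1 = 2 row is one of the two tied candidates.
```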
Firebird 3.0 supports window functions, so you can use:
select . . .
from (select t.*,
row_number() over (partition by key_1 order by key_2) as seqnum
from temptable t
) t
where seqnum = 1;
In earlier versions, you can use several methods. Here is a correlated subquery (min() matches the ascending ORDER BY key_2 of the DISTINCT ON query):
select t.*
from temptable t
where t.key_2 = (select min(t2.key_2)
                 from temptable t2
                 where t2.key_1 = t.key_1
                );
Note: This will still return duplicate values of key_1 because of the duplicates in key_2. Alas . . . getting just one row is tricky unless you have a unique identifier for each row.
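The correlated-subquery fallback for pre-3.0 Firebird can be sketched the same way, in SQLite via Python's sqlite3 module (my choice of harness). MIN(key_2) is used so the kept row matches the ascending ORDER BY key_1, key_2 of the original DISTINCT ON query; as warned, the key_2 tie means key_1 = 2 comes back twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE temptable (key_1 INT, key_2 TEXT, name TEXT, value TEXT);
INSERT INTO temptable VALUES
  (1, '0001', 'NAME 2', 'VALUE 1'),
  (1, '0002', 'NAME 1', 'VALUE 3'),
  (1, '0003', 'NAME 3', 'VALUE 2'),
  (2, '0001', 'NAME 1', 'VALUE 2'),
  (2, '0001', 'NAME 2', 'VALUE 1');
""")

# Keep rows whose key_2 equals the smallest key_2 of their key_1 group.
# ORDER BY added only to make the output deterministic.
rows = conn.execute("""
SELECT t.*
FROM temptable t
WHERE t.key_2 = (SELECT MIN(t2.key_2)
                 FROM temptable t2
                 WHERE t2.key_1 = t.key_1)
ORDER BY t.key_1, t.name
""").fetchall()

print(rows)
# key_1 = 1 keeps its single smallest-key_2 row; key_1 = 2 returns both tied rows.
```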