Trouble performing Postgres GROUP BY on non-ID column to get ID containing max value

I'm attempting to perform a GROUP BY on a join table. The join table essentially looks like:
CREATE TABLE user_foos (
    id           SERIAL PRIMARY KEY,
    user_id      INT NOT NULL,
    foo_id       INT NOT NULL,
    effective_at TIMESTAMP NOT NULL
);
ALTER TABLE user_foos
    ADD CONSTRAINT user_foos_uniqueness
    UNIQUE (user_id, foo_id, effective_at);
I'd like to query this table to find all records where effective_at is the max value for any given (user_id, foo_id) pair. I've tried the following:
SELECT "user_foos"."id",
"user_foos"."user_id",
"user_foos"."foo_id",
max("user_foos"."effective_at")
FROM "user_foos"
GROUP BY "user_foos"."user_id", "user_foos"."foo_id";
Unfortunately, this results in the error:
column "user_foos.id" must appear in the GROUP BY clause or be used in an aggregate function
I understand that the problem relates to id not being used in an aggregate function, and that the DB doesn't know what to do if it finds multiple records with differing IDs, but I know this can never happen because of my three-column unique constraint across (user_id, foo_id, effective_at).
To work around this, I also tried a number of other variants such as using the first_value window function on the id:
SELECT first_value("user_foos"."id"),
       "user_foos"."user_id",
       "user_foos"."foo_id",
       max("user_foos"."effective_at")
FROM "user_foos"
GROUP BY "user_foos"."user_id", "user_foos"."foo_id";
and:
SELECT first_value("user_foos"."id")
FROM "user_foos"
GROUP BY "user_foos"."user_id", "user_foos"."foo_id"
HAVING "user_foos"."effective_at" = max("user_foos"."effective_at")
Unfortunately, these both result in a different error:
window function call requires an OVER clause
Ideally, my goal is to fetch ALL matching ids so that I can use them in a subquery to fetch the full row data from this table for matching records. Can anyone provide insight on how I can get this working?

Postgres has a very nice feature called DISTINCT ON, which can be used in this case:
SELECT DISTINCT ON (uf."user_id", uf."foo_id") uf.*
FROM "user_foos" uf
ORDER BY uf."user_id", uf."foo_id", uf."effective_at" DESC;
It returns the first row in each group, as defined by the expressions in parentheses. The ORDER BY clause must start with those expressions and can then add further columns to determine which row comes first within each group.
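If you only need the matching ids for the subquery described in the question, the same query can simply be wrapped (a sketch):
SELECT id
FROM (
    SELECT DISTINCT ON (uf."user_id", uf."foo_id") uf.*
    FROM "user_foos" uf
    ORDER BY uf."user_id", uf."foo_id", uf."effective_at" DESC
) latest;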

Try:
SELECT *
FROM (
    SELECT t.*,
           row_number() OVER (PARTITION BY user_id, foo_id
                              ORDER BY effective_at DESC) AS x
    FROM user_foos t
) sub  -- Postgres requires an alias for a subquery in FROM
WHERE x = 1;

If you don't want to use a subquery based on a composite of all three keys, you can instead add a "dense rank" window-function field that partitions the rows by user_id and foo_id and ranks them by effective date descending. Then subquery that and take the records where rank_order = 1. Since the ranking was by effective date, you get all fields of the record with the highest effective date for each foo and user; a SQL sketch follows the tables below.
DATASET
id  user_id  foo_id  effective_at
1   1        1       01/01/2001
2   1        1       01/01/2002
3   1        1       01/01/2003
4   1        2       01/01/2001
5   2        1       01/01/2001
DATASET WITH RANK ORDER, PARTITIONED BY FOO_ID, USER_ID, ORDERED BY DATE DESC
id  rank_order  user_id  foo_id  effective_at
1   3           1        1       01/01/2001
2   2           1        1       01/01/2002
3   1           1        1       01/01/2003
4   1           1        2       01/01/2001
5   1           2        1       01/01/2001
SELECT * FROM QUERY ABOVE WHERE rank_order = 1
id  rank_order  user_id  foo_id  effective_at
3   1           1        1       01/01/2003
4   1           1        2       01/01/2001
5   1           2        1       01/01/2001
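A sketch of that approach in SQL, assuming the user_foos table from the question (rank_order is an illustrative alias; dense_rank keeps ties on the top effective date, while row_number would pick exactly one row per group):
SELECT id, user_id, foo_id, effective_at
FROM (
    SELECT uf.*,
           dense_rank() OVER (PARTITION BY user_id, foo_id
                              ORDER BY effective_at DESC) AS rank_order
    FROM user_foos uf
) ranked
WHERE rank_order = 1;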

Related

Updating column according to index within group

In our databases we have a table called conditions which references a table called attributes.
So it looks like this (ignoring some other columns that aren't relevant to the question)
id  attribute_id  execution_index
1   1000          1
2   1000          2
3   1000          1
4   2000          1
5   2000          2
6   2000          2
In theory the combination of attribute_id and execution_index should always be unique, but in practice it isn't, and the software ends up essentially using the id to decide which of two conditions with the same execution index comes first. We want to add a uniqueness constraint to the table, but before we do that we need to update the execution indexes. So essentially we want to group them by attribute_id, order them by execution_index then id, and give them new execution indexes so that it becomes:
id  attribute_id  execution_index
1   1000          1
2   1000          3
3   1000          2
4   2000          1
5   2000          2
6   2000          3
I'm not sure how to do this without just ordering by attribute_id, execution_index, id and then iterating through incrementing the execution_index by 1 each time and resetting it to be 1 whenever the attribute_id changes. (That would work but it'd be slow and someone is going to have to run this script on several dozen databases so I'd rather it didn't take more than a couple of seconds per database.)
Really I'd like to do something along the lines of
UPDATE c
SET c.execution_index = [this needs to be the index within the group somehow]
FROM conditions c
GROUP BY c.attribute_id
ORDER BY c.execution_index asc, c.id asc
But I don't know how to make that actually work.
It looks like you can use an updatable CTE:
with cte as (
    select *,
           row_number() over (partition by attribute_id
                              order by execution_index, id) as new
    from conditions
)
update cte set execution_index = new;
I would suggest adding a new column, updating that first, and checking the results are as expected.
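A sketch of that safer two-step approach (new_index is a hypothetical scratch column added only for verification):
ALTER TABLE conditions ADD new_index INT;

WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY attribute_id
                              ORDER BY execution_index, id) AS new
    FROM conditions
)
UPDATE cte SET new_index = new;

-- after checking new_index looks right, copy it over and drop the scratch column
UPDATE conditions SET execution_index = new_index;
ALTER TABLE conditions DROP COLUMN new_index;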
Example Fiddle:
WITH cte AS
(
    SELECT
        *,
        ROW_NUMBER() OVER
        (
            PARTITION BY attribute_id
            ORDER BY execution_index, id
        ) AS RowNum
    FROM conditions
)
UPDATE cte
SET execution_index = RowNum;

SQL - JOIN 2 tables with either NULL OR MAX

I have two tables in Teradata that I need to LEFT JOIN.
The first one includes clients, the second their details with the validity end date. NULL represents currently valid.
Table1
client_id
1
2
Table2
client_id  valid_end
1          31.12.2021
1          31.12.2022
2          31.12.2020
2          null
I need to left join the two tables using the most recent record for each client from Table2.
If there is a currently valid record (valid_end is NULL), it is used; if there is no NULL record, the highest date is used.
Result
client_id  valid_end
1          31.12.2022
2          null
I tried a lot using QUALIFY and MAX but never reached the requested result. Thanks for any advice.
Use ROW_NUMBER instead of MAX; NULLS FIRST sorts NULL before the highest date:
qualify
    row_number()
    over (partition by client_id
          order by valid_end desc NULLS FIRST) = 1
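In context, the full left join might look like this (a sketch assuming the table names from the question):
SELECT t1.client_id, t2.valid_end
FROM Table1 t1
LEFT JOIN Table2 t2
    ON t1.client_id = t2.client_id
QUALIFY row_number()
        over (partition by t1.client_id  -- outer key, so clients with no Table2 rows keep their single NULL row
              order by t2.valid_end desc NULLS FIRST) = 1;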

Keyset pagination with composite key

I am using an Oracle 12c database and I have a table with the following structure:
Id NUMBER
SeqNo NUMBER
Val NUMBER
Valid VARCHAR2
A composite primary key is created with the fields Id and SeqNo.
I would like to fetch the data with Valid = 'Y' and apply keyset pagination with a page size of 3. Assume I have the following data:
Id SeqNo Val Valid
1 1 10 Y
1 2 20 N
1 3 30 Y
1 4 40 Y
1 5 50 Y
2 1 100 Y
2 2 200 Y
Expected result:
----------------------------
Page 1
----------------------------
Id SeqNo Val Valid
1 1 10 Y
1 3 30 Y
1 4 40 Y
----------------------------
Page 2
----------------------------
Id SeqNo Val Valid
1 5 50 Y
2 1 100 Y
2 2 200 Y
Offset pagination can be done like this:
SELECT * FROM table ORDER BY Id, SeqNo OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY;
However, the actual db has more than 5 million records, and using OFFSET is going to slow down the query a lot. Therefore, I am looking for a keyset pagination approach (skipping records using some unique fields instead of OFFSET).
Since a composite primary key is used, I need to offset the page with information from more than 1 field.
This is a sample SQL that should work in PostgreSQL (fetch 2nd page):
SELECT * FROM table WHERE (Id, SeqNo) > (1, 4) AND Valid = 'Y' ORDER BY Id, SeqNo LIMIT 3;
How do I achieve the same in Oracle?
Use the row_number() analytic function with the ceil arithmetic function. Arithmetic functions don't have a negative impact on performance, and the row_number() over (order by ...) expression orders the data itself, regardless of insertion order, without adding an extra order by clause to the main query. So, consider:
select Id, SeqNo,
       ceil(row_number() over (order by Id, SeqNo) / 3) as page
from tab
where Valid = 'Y';
P.S. It also works for Oracle 11g, while OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY works only for Oracle 12c.
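Note that this computes a page number for every row rather than skipping by key. For true keyset pagination, the row-value comparison from the question can be expanded into plain boolean predicates, since Oracle does not support (Id, SeqNo) > (1, 4) directly (a sketch fetching the page after the last-seen key (Id, SeqNo) = (1, 4); tab is a placeholder name, and the FETCH clause needs 12c):
select *
from tab
where Valid = 'Y'
  and (Id > 1 or (Id = 1 and SeqNo > 4))
order by Id, SeqNo
fetch first 3 rows only;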
You can use order by and then fetch rows using fetch and offset like the following:
SELECT Id, SeqNo, Val, Valid
FROM tab
WHERE Valid = 'Y'
ORDER BY Id, SeqNo
--FETCH FIRST 3 ROWS ONLY -- first page
--OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY -- second page
--OFFSET 6 ROWS FETCH NEXT 3 ROWS ONLY -- third page
--Update--
You can use the row_number analytic function as follows:
SELECT Id, SeqNo, Val, Valid
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY Id, SeqNo) AS rn
    FROM tab t
    WHERE Valid = 'Y'
)
WHERE CEIL(rn / 3) = 2 -- for page no. 2
Cheers!!

Get the 2 options with min value for each student_id

I have a table named m_option:
m_option_id  m_student_id  value
1            1             5
2            1             5
3            1             6
4            1             7
5            2             1
6            2             2
7            2             3
8            2             3
9            2             4
I want to get the 2 rows with min value for each m_student_id:
m_option_id  m_student_id  value
1            1             5
2            1             5
5            2             1
6            2             2
You can use the row_number window function for that:
SELECT m_option_id, m_student_id, value
FROM (
SELECT
m_option_id, m_student_id, value,
row_number() OVER (PARTITION BY m_student_id ORDER BY value)
FROM m_option
) t
WHERE
row_number <= 2;
row_number will calculate the number of each row within its group. We then use that number to filter the top 2 rows (i.e. lowest value) from each group.
Alternatively, you could use a LATERAL subquery:
SELECT m_option_id, m_student_id, value
FROM (SELECT DISTINCT m_student_id FROM m_option) s,
LATERAL (
    SELECT m_option_id, value
    FROM m_option
    WHERE s.m_student_id = m_student_id
    ORDER BY value
    LIMIT 2
) t;
This will go through all distinct values of m_student_id and for each one of them will find the top 2 rows using a LATERAL subquery.
Assuming there can be many rows per student in table m_option, the key to performance is index usage. And that's most efficient if you have a separate student table listing all students uniquely (which you would typically have). Then:
SELECT m.m_option_id, s.student_id AS m_student_id, m.value
FROM student s
, LATERAL (
    SELECT m_option_id, value
    FROM m_option
    WHERE m_student_id = s.student_id -- PK of table student
    ORDER BY value
    LIMIT 2
) m;
A multicolumn index on m_option makes this fast:
CREATE INDEX m_option_combo_idx ON m_option (m_student_id, value);
If you can get index-only scans out of it, append the column m_option_id as last index item:
CREATE INDEX m_option_combo_idx ON m_option (m_student_id, value, m_option_id);
Index columns in this order; see: Is a composite index also good for queries on the first field?
Distilling a unique list of student_id from m_option would incur an expensive sequential scan over m_option and void any performance benefit.
This excludes students without any related rows in m_option. Use LEFT JOIN LATERAL (...) ON true to include such students in the result (extended with NULL values for the missing options); see: What is the difference between LATERAL and a subquery in PostgreSQL?
If you do not have a student table, the other fast option is a recursive CTE.
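A sketch of that recursive-CTE variant, which emulates a loose index scan over m_option_combo_idx to collect the distinct m_student_id values:
WITH RECURSIVE students AS (
    (SELECT m_student_id FROM m_option ORDER BY m_student_id LIMIT 1)
    UNION ALL
    SELECT (SELECT m.m_student_id
            FROM m_option m
            WHERE m.m_student_id > s.m_student_id
            ORDER BY m.m_student_id
            LIMIT 1)
    FROM students s
    WHERE s.m_student_id IS NOT NULL
)
SELECT o.m_option_id, s.m_student_id, o.value
FROM students s
, LATERAL (
    SELECT m_option_id, value
    FROM m_option
    WHERE m_student_id = s.m_student_id
    ORDER BY value
    LIMIT 2
) o;  -- the trailing NULL row from the CTE matches no options and drops out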
Detailed explanation for either variant: Optimize GROUP BY query to retrieve latest record per user

SQL query with grouping and MAX

I have a table that looks like the following but also has more columns that are not needed for this instance.
ID DATE Random
-- -------- ---------
1 4/12/2015 2
2 4/15/2015 2
3 3/12/2015 2
4 9/16/2015 3
5 1/12/2015 3
6 2/12/2015 3
ID is the primary key.
Random is a foreign key, but I am not actually using the table it points to.
I am trying to design a query that groups the results by Random, selects the MAX Date within the grouping, and then gives me the associated ID.
If I do the following query:
select top 100 ID, Random, MAX(Date) from DateBase group by Random, Date, ID
I get duplicate Randoms since ID is the primary key and will always be unique.
The results i need would look something like this
ID DATE Random
-- -------- ---------
2 4/15/2015 2
4 9/16/2015 3
Another question: there could be times when there are many rows with the same date. What will MAX do in that case?
You can use NOT EXISTS():
SELECT * FROM YourTable t
WHERE NOT EXISTS(SELECT 1 FROM YourTable s
                 WHERE s.random = t.random
                   AND s.date > t.date)
This will select only those rows that don't have a bigger date for the corresponding random value.
It can also be done using IN() in databases that support row-value comparisons, such as MySQL or PostgreSQL:
SELECT * FROM YourTable t
WHERE (t.random, t.date) IN (SELECT s.random, max(s.date)
                             FROM YourTable s
                             GROUP BY s.random)
Or with a join:
SELECT t.* FROM YourTable t
INNER JOIN (SELECT s.random,max(s.date) as max_date
FROM YourTable s
GROUP BY s.random) tt
ON(t.date = tt.max_date and s.random = t.random)
In SQL Server you could do something like the following:
select a.*
from DateBase a
inner join (select Random, MAX(Date) as max_date
            from DateBase
            group by Random) as x
    on a.Date = x.max_date and a.Random = x.Random
This method will work in all versions of SQL, as there are no vendor specifics (you'll need to format the dates using your vendor-specific syntax).
You can do this in two stages:
The first step is to work out the max date for each random:
SELECT MAX(DateField) AS MaxDateField, Random
FROM Example
GROUP BY Random
Now you can join back onto your table to get the max ID for each combination:
SELECT MAX(e.ID) AS ID
      ,e.DateField AS DateField
      ,e.Random
FROM Example AS e
INNER JOIN (
    SELECT MAX(DateField) AS MaxDateField, Random
    FROM Example
    GROUP BY Random
) data
    ON data.MaxDateField = e.DateField
    AND data.Random = e.Random
GROUP BY e.DateField, e.Random
To answer your second question:
If there are multiples of the same date, MAX(e.ID) will simply choose the highest ID. If you want the lowest, you can use MIN(e.ID) instead.