Suppose I have the following data:
id date another_info
1 2014-02-01 kjkj
1 2014-03-11 ajskj
1 2014-05-13 kgfd
2 2014-02-01 SADA
3 2014-02-01 sfdg
3 2014-06-12 fdsA
For each id, I want to extract the latest row:
id date another_info
1 2014-05-13 kgfd
2 2014-02-01 SADA
3 2014-06-12 fdsA
How could I manage that?
The most efficient way is to use Postgres' distinct on clause:
select distinct on (id) id, date, another_info
from the_table
order by id, date desc;
If you want a solution that works across databases (but is less efficient) you can use a window function:
select id, date, another_info
from (
select id, date, another_info,
row_number() over (partition by id order by date desc) as rn
from the_table
) t
where rn = 1
order by id;
The solution with a window function is in most cases faster than using a sub-query.
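As a rough illustration of the window-function query above, here is a minimal sketch that runs it against the question's sample rows. It uses Python's built-in sqlite3 module purely as a convenient test bed (an assumption of this example, not something the answer used), and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Sample rows copied from the question; the_table is the name used in the answer.
rows = [
    (1, "2014-02-01", "kjkj"),
    (1, "2014-03-11", "ajskj"),
    (1, "2014-05-13", "kgfd"),
    (2, "2014-02-01", "SADA"),
    (3, "2014-02-01", "sfdg"),
    (3, "2014-06-12", "fdsA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table the_table (id integer, date text, another_info text)")
conn.executemany("insert into the_table values (?, ?, ?)", rows)

# Number the rows of each id from newest to oldest, then keep row 1 per id.
latest = conn.execute(
    """
    select id, date, another_info
    from (
        select id, date, another_info,
               row_number() over (partition by id order by date desc) as rn
        from the_table
    ) t
    where rn = 1
    order by id
    """
).fetchall()

for row in latest:
    print(row)
```

The dates are stored as ISO-formatted text here, so ordering them as strings matches chronological order; with a real date column that detail does not matter.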
select *
from bar
where (id,date) in (select id,max(date) from bar group by id)
Tested in PostgreSQL and MySQL.
I found this as the fastest solution:
SELECT t1.*
FROM yourTable t1
LEFT JOIN yourTable t2 ON t2.tag_id = t1.tag_id AND t2.value_time > t1.value_time
WHERE t2.tag_id IS NULL
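To make the anti-join idea concrete, here is a small sketch that applies the same pattern to the sample rows from the first question (the_table with id and date standing in for tag_id and value_time). Running it through Python's sqlite3 module is an assumption of this example. A row survives only if no other row with the same id has a later date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table the_table (id integer, date text, another_info text)")
conn.executemany(
    "insert into the_table values (?, ?, ?)",
    [
        (1, "2014-02-01", "kjkj"),
        (1, "2014-03-11", "ajskj"),
        (1, "2014-05-13", "kgfd"),
        (2, "2014-02-01", "SADA"),
        (3, "2014-02-01", "sfdg"),
        (3, "2014-06-12", "fdsA"),
    ],
)

# t2 matches any newer row for the same id; rows with no such match
# come back with t2.id null and are therefore the latest ones.
anti_join = conn.execute(
    """
    select t1.id, t1.date, t1.another_info
    from the_table t1
    left join the_table t2
           on t2.id = t1.id and t2.date > t1.date
    where t2.id is null
    order by t1.id
    """
).fetchall()

for row in anti_join:
    print(row)
```

Note that if two rows share the same id and the same latest date, both are returned, because neither has a strictly newer partner.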
For most scenarios, the most efficient way is to use GROUP BY.
The accepted answer states that distinct on (id) is the most efficient way to solve the problem described in the question, but I believe that is far from accurate.
Sadly, I couldn't find any helpful insights in the Postgres documentation, but I did find this article, which references a few others and provides examples in which the
GROUP BY approach definitely leads to better performance.
We had a discussion about this at work and ran a small experiment on a table that holds data about tags' blinks, with 4,114,692 rows and separate indexes on tag_id and on timestamp.
Here are the queries:
1. Using distinct on:
select distinct on (tag_id) tag_id, timestamp, some_data
from blinks
order by tag_id, timestamp desc;
2. Using CTE + group by + join:
with blink_last_timestamp as (
select tag_id, max(timestamp) as max_timestamp
from blinks
group by tag_id )
select bl.tag_id, max_timestamp, some_data
from blink_last_timestamp bl
join blinks b on
b.tag_id = bl.tag_id and
b.timestamp = bl.max_timestamp;
The results were unambiguous and favored the second solution for this scenario (which is pretty generic, in my opinion):
it was about 10 times (!) faster, 1655.991 ms (00:01.656) vs 16723.346 ms (00:16.723), and of course delivered the same data.
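Group by id and use an aggregate function that meets the criterion for the last record. Every non-aggregated column in the select list must also appear in the group by, and grouping by another_info as well would return one row per (id, another_info) pair rather than one per id. So aggregate on id alone, for example:
select id, max(date) as last_date
from the_table
group by id;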
My table looks like this:
ID FROM WHEN
1 mario 24.10.19
1 robin 23.10.19
2 mario 24.10.19
3 robin 23.10.19
3 mario 22.10.19
I just want the newest records from an ID. So the result should look like this:
ID FROM WHEN
1 mario 24.10.19
2 mario 24.10.19
3 robin 23.10.19
I don't know how to get this result.
There are multiple methods. For just three columns in Oracle, I have had good luck with group by:
select id,
max("from") keep (dense_rank first order by "when" desc) as "from",
max("when") as "when"
from t
group by id;
Often a correlated subquery performs well, in this case, with an index on (id, when):
select t.*
from t
where t."when" = (select max(t2."when") from t t2 where t2.id = t.id);
And the canonical solution is to use window functions:
select t.*
from (select t.*,
row_number() over (partition by id order by "when" desc) as seqnum
from t
) t
where seqnum = 1;
Oracle has a smart optimizer, but this has to do a bit more work, because row numbers are assigned to all rows before the filtering. That can make this a wee bit slower (in some databases) than the alternatives, but it is still a very viable solution.
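For illustration, the correlated-subquery variant mentioned above can be tried on the question's sample rows with the suggested composite index in place. This sketch makes several assumptions of its own: it runs in SQLite through Python's sqlite3 module rather than in Oracle, the index name t_id_when is made up, and the dates are rewritten in ISO form (2019-10-24 instead of 24.10.19) so that text comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "from" and "when" are keywords, so they are quoted as in the answer.
conn.execute('create table t (id integer, "from" text, "when" text)')
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        (1, "mario", "2019-10-24"),
        (1, "robin", "2019-10-23"),
        (2, "mario", "2019-10-24"),
        (3, "robin", "2019-10-23"),
        (3, "mario", "2019-10-22"),
    ],
)
# Composite index on (id, when), as the answer recommends (name is hypothetical).
conn.execute('create index t_id_when on t (id, "when")')

# Keep each row whose "when" equals the newest "when" for its id.
newest = conn.execute(
    """
    select t.*
    from t
    where t."when" = (select max(t2."when") from t t2 where t2.id = t.id)
    order by t.id
    """
).fetchall()

for row in newest:
    print(row)
```

Whether the index is actually used, and how this compares with the window-function version, depends on the database and the data, so it is worth checking the execution plan on the real table.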
I have a table as shown below.
What I would like to do is get the minimum of each subject. Though I am able to do this with the row_number function, I would like to do it with a group by and min() approach, but it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just two columns (subject_id and id) are enough to group the items together; they differentiate the groups. But why am I not able to use the other columns in the select clause? If I use the other columns, I may not get the expected output, because time_1 has different values.
I expect my output to be as shown below.
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After aggregating into the array, you want the first element. The .* expands the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like this?
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the time_1 column's value, the following query will work (as far as I can see, the values in min_time and max_time are the same within each group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, if you can apply something like CAST(time_1 AS DATE) to your time column, the grouping will consider only the date part and ignore the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check whether the CAST ... AS DATE syntax in BigQuery
-- is exactly as written here or slightly different.
Below is for BigQuery Standard SQL, and it is the most efficient way for cases like the one in your question:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id and m.min_value = t.value
I've got a table from which I need to return about 14 column values, but only one row for each set of duplicates on some of the columns.
The second problem is that, among the duplicates, I need to keep the one that has the biggest int in one of the columns that is not required to be unique.
Since the table is somewhat big, I am seeking advice on doing this in the most efficient way.
Should I be doing a group by?
My table looks somewhat like this (I have simplified the number of columns):
ID(UniqueIdentifier) | ACCID(UniqueIdentifier) | DateTime(DateTime) | distance(int)|type(int)
28761188-0886-E911-822F-DD1FA635D450 1238FD8A-BD00-411A-A81C-0F6F5C026BCC 2019-06-03 14:04:41.000 2 3
41761188-0886-E911-822F-DD1FA635D450 1238FD8A-BD00-411A-A81C-0F6F5C026BCC 2019-06-03 14:04:41.000 1 3
I should only select one row per unique combination of ACCID and DateTime. The ID column is the primary key, so it will never be duplicated, and I need to keep the row with the biggest distance.
You can use the ROW_NUMBER() window function, as in:
select *
from (
select
id,
accid,
datetime,
distance,
type,
row_number() over(partition by accid, datetime order by distance desc) as rn
from t
) x
where rn = 1
If you want to show multiple "ties", then replace ROW_NUMBER() by RANK().
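The difference between ROW_NUMBER() and RANK() only shows up when two rows tie on the ordering column. The sketch below uses made-up rows (the short ids and account names are hypothetical, not from the question), orders by distance because the question asks for the biggest distance, and runs in SQLite through Python's sqlite3 module, which is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (id text, accid text, datetime text, distance integer, type integer)"
)
conn.executemany(
    "insert into t values (?, ?, ?, ?, ?)",
    [
        ("id-1", "acc-A", "2019-06-03 14:04:41", 2, 3),
        ("id-2", "acc-A", "2019-06-03 14:04:41", 1, 3),
        # acc-B has two rows tied on the biggest distance.
        ("id-3", "acc-B", "2019-06-03 15:00:00", 5, 3),
        ("id-4", "acc-B", "2019-06-03 15:00:00", 5, 3),
    ],
)


def top_rows(func):
    """Return the top-ranked rows per (accid, datetime) using the given window function."""
    sql = f"""
        select id, accid, distance
        from (
            select id, accid, distance,
                   {func}() over (partition by accid, datetime
                                  order by distance desc) as rn
            from t
        ) x
        where rn = 1
        order by id
    """
    return conn.execute(sql).fetchall()


with_row_number = top_rows("row_number")  # one row per group, tie broken arbitrarily
with_rank = top_rows("rank")              # every row tied for first place

print(with_row_number)
print(with_rank)
```

With ROW_NUMBER() exactly one row per group comes back, but which of the tied acc-B rows is chosen is not defined; adding a tie-breaker column to the ORDER BY makes the choice deterministic.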
I would suggest a correlated subquery with the right index as the fastest method:
select t.*
from t
where t.id = (select top (1) t2.id
from t t2
where t2.ACCID = t.ACCID
and t2.DateTime = t.DateTime
order by t2.distance desc
) ;
The best index is on (ACCID, DateTime, distance desc, id).
Given a table like this, what query will return the most recent calibration information for each monitor? In other words, I want to find the maximum date value for each of the monitors. Oracle-specific functionality is fine for my application.
monitor_id calibration_date value
---------- ---------------- -----
1 2011/10/22 15
1 2012/01/01 16
1 2012/01/20 17
2 2011/10/22 18
2 2012/01/02 19
The results for this example would look like this:
1 2012/01/20 17
2 2012/01/02 19
I'd tend to use analytic functions
SELECT monitor_id,
host_name,
calibration_date,
value
FROM (SELECT b.monitor_id,
b.host_name,
a.calibration_date,
a.value,
rank() over (partition by b.monitor_id order by a.calibration_date desc) rnk
FROM table_name a,
table_name2 b
WHERE a.some_key = b.some_key)
WHERE rnk = 1
You could also use correlated subqueries, though that will be less efficient:
SELECT monitor_id,
calibration_date,
value
FROM table_name a
WHERE a.calibration_date = (SELECT MAX(b.calibration_date)
FROM table_name b
WHERE a.monitor_id = b.monitor_id)
My personal preference is this:
SELECT DISTINCT
monitor_id
,MAX(calibration_date)
OVER (PARTITION BY monitor_id)
AS latest_calibration_date
,FIRST_VALUE(value)
OVER (PARTITION BY monitor_id
ORDER BY calibration_date DESC)
AS latest_value
FROM mytable;
A variation would be to use the FIRST_VALUE syntax for latest_calibration_date as well. Either way works.
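Here is a minimal sketch of the SELECT DISTINCT plus FIRST_VALUE query above, run against the sample rows from the question. Using SQLite through Python's sqlite3 module is an assumption of this example (the question targets Oracle), and it needs SQLite 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (monitor_id integer, calibration_date text, value integer)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?)",
    [
        (1, "2011/10/22", 15),
        (1, "2012/01/01", 16),
        (1, "2012/01/20", 17),
        (2, "2011/10/22", 18),
        (2, "2012/01/02", 19),
    ],
)

# Every row of a monitor gets the same two window results,
# so DISTINCT collapses them to one row per monitor.
latest_per_monitor = conn.execute(
    """
    select distinct
           monitor_id,
           max(calibration_date) over (partition by monitor_id)
               as latest_calibration_date,
           first_value(value) over (partition by monitor_id
                                    order by calibration_date desc)
               as latest_value
    from mytable
    order by monitor_id
    """
).fetchall()

for row in latest_per_monitor:
    print(row)
```

The dates are kept as text in the question's YYYY/MM/DD form, which happens to sort chronologically; a real DATE column would behave the same way.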
The window function solution should be the most efficient and result in only one table or index scan. The one I am posting here, I think, wins some points for being intuitive and easy to understand. I tested it on SQL Server and it performed second to window functions, resulting in two index scans.
SELECT T1.monitor_id, T1.calibration_date, T1.value
FROM someTable AS T1
WHERE NOT EXISTS
(
SELECT *
FROM someTable AS T2
WHERE T2.monitor_id = T1.monitor_id AND T2.calibration_date > T1.calibration_date
)
GROUP BY T1.monitor_id, T1.calibration_date, T1.value
And just for the heck of it, here's another one along the same lines, but less performant (63% cost vs 37%) than the other (again in SQL Server). This one uses a Left Outer Join in the execution plan, whereas the first one uses an Anti-Semi Merge Join:
SELECT T1.monitor_id, T1.calibration_date, T1.value
FROM someTable AS T1
LEFT JOIN someTable AS T2 ON T2.monitor_id = T1.monitor_id AND T2.calibration_date > T1.calibration_date
WHERE T2.monitor_id IS NULL
GROUP BY T1.monitor_id, T1.calibration_date, T1.value
select monitor_id, calibration_date, value
from table
where (monitor_id, calibration_date) in (
select monitor_id, max(calibration_date) as calibration_date
from table
group by monitor_id
)
I have a table with columns id, name, score, date,
and I want to run a SQL query to get the record where id = 2 with the earliest date in the data set.
Can you do this within the query, or do you need to loop after the fact?
I want to get all of the fields of that record.
If you just want the date:
SELECT MIN(date) as EarliestDate
FROM YourTable
WHERE id = 2
If you want all of the information:
SELECT TOP 1 id, name, score, date
FROM YourTable
WHERE id = 2
ORDER BY Date
Avoid loops when you can. Loops often lead to cursors, and cursors are almost never necessary and very often really inefficient.
SELECT TOP 1 ID, Name, Score, [Date]
FROM myTable
WHERE ID = 2
Order BY [Date]
While using TOP or a sub-query both work, I would break the problem into steps:
Find target record
SELECT MIN( date ) AS date, id
FROM myTable
WHERE id = 2
GROUP BY id
Join to get other fields
SELECT mt.id, mt.name, mt.score, mt.date
FROM myTable mt
INNER JOIN
(
SELECT MIN( date ) AS date, id
FROM myTable
WHERE id = 2
GROUP BY id
) x ON x.date = mt.date AND x.id = mt.id
While this solution, using derived tables, is longer, it is:
Easier to test
Self documenting
Extendable
It is easier to test, as parts of the query can be run standalone.
It is self-documenting, as the query directly reflects the requirement,
i.e. the derived table lists the row where id = 2 with the earliest date.
It is extendable: if another condition is required, it can easily be added to the derived table.
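The "easier to test" point can be sketched as follows: the derived table is first run on its own, then reused unchanged inside the join. The sample rows and names below are invented for illustration, and running the SQL in SQLite through Python's sqlite3 module is an assumption of this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table myTable (id integer, name text, score integer, date text)")
conn.executemany(
    "insert into myTable values (?, ?, ?, ?)",
    [
        # Hypothetical rows; id = 2 has three dates, the earliest is 2020-02-01.
        (1, "ann", 10, "2020-01-05"),
        (2, "bob", 20, "2020-03-01"),
        (2, "bob", 25, "2020-02-01"),
        (2, "bob", 30, "2020-04-01"),
    ],
)

# Step 1: find the target record. This runs standalone, so it can be checked first.
target_sql = """
    select min(date) as date, id
    from myTable
    where id = 2
    group by id
"""
target = conn.execute(target_sql).fetchall()
print(target)

# Step 2: join the same query back as a derived table to get the other fields.
full_row = conn.execute(
    f"""
    select mt.id, mt.name, mt.score, mt.date
    from myTable mt
    inner join ({target_sql}) x
            on x.date = mt.date and x.id = mt.id
    """
).fetchall()
print(full_row)
```

If two rows for id = 2 shared the earliest date, the join would return both, which may or may not be what is wanted.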
Try
select * from dataset
where id = 2
order by date limit 1
It's been a while since I did SQL, so this might need some tweaking.
Using "limit" and "top" will not work with all SQL databases (for example, with Oracle).
You can try a more complex query in standard SQL:
select mt1.id, mt1."name", mt1.score, mt1."date" from mytable mt1
where mt1.id=2
and mt1."date"= (select min(mt2."date") from mytable mt2 where mt2.id=2)