Getting the latest entry per day / SQL optimizing

Given the following database table, which records events (status) for different objects (id) with their timestamps:
ID | Date       | Time | Status
-------------------------------
 7 | 2016-10-10 | 8:23 | Passed
 7 | 2016-10-10 | 8:29 | Failed
 7 | 2016-10-13 | 5:23 | Passed
 8 | 2016-10-09 | 5:43 | Passed
I want to get a result table using plain SQL (MS SQL) like this:
ID | Date       | Status
------------------------
 7 | 2016-10-10 | Failed
 7 | 2016-10-13 | Passed
 8 | 2016-10-09 | Passed
where the "status" is the latest entry on a day, given that at least one event for this object has been recorded.
My current solution is using "Outer Apply" and "TOP(1)" like this:
SELECT DISTINCT rn.id,
                tmp.date,
                tmp.status
FROM run rn OUTER APPLY
    (SELECT rn2.date, tmp2.status AS 'status'
     FROM run rn2 OUTER APPLY
         (SELECT TOP(1) rn3.id, rn3.date, rn3.time, rn3.status
          FROM run rn3
          WHERE rn3.id = rn.id
            AND rn3.date = rn2.date
          ORDER BY rn3.id ASC, rn3.date + rn3.time DESC) tmp2
     WHERE tmp2.status <> '') tmp
As far as I understand, this OUTER APPLY construct works like:
For every id
    For every recorded day for this id
        Select the newest status for this day and this id
But I'm facing performance issues, so I think this solution is not adequate. Any suggestions on how to solve this problem or how to optimize the SQL?

Your code seems too complicated. Why not just do this?
SELECT r.id, r.date, r2.status
FROM run r OUTER APPLY
     (SELECT TOP 1 r2.*
      FROM run r2
      WHERE r2.id = r.id AND r2.date = r.date AND r2.status <> ''
      ORDER BY r2.time DESC
     ) r2;
For performance, I would suggest an index on run(id, date, status, time).
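A minimal sketch of that index as DDL (the index name is illustrative):
-- Illustrative name; covers the APPLY's seek columns (id, date, status) and the sort column (time)
CREATE INDEX ix_run_id_date_status_time
    ON run (id, date, status, time);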

Using a CTE will probably be the fastest:
with cte as
(
    select ID, Date, Status,
           row_number() over (partition by ID, Date order by Time desc) rn
    from run
)
select ID, Date, Status
from cte
where rn = 1

Do not SELECT from a log table; instead, write a trigger that updates a latest_run table, like:
CREATE TRIGGER tr_run_insert ON run FOR INSERT AS
BEGIN
    -- Update the existing (ID, Date) row, if any
    UPDATE lr
    SET Status = i.Status
    FROM latest_run lr
    JOIN INSERTED i ON i.ID = lr.ID AND i.Date = lr.Date;
    -- Otherwise insert a new row (assumes single-row inserts)
    IF @@ROWCOUNT = 0
        INSERT INTO latest_run (ID, Date, Status)
        SELECT ID, Date, Status FROM INSERTED;
END
Then perform reads from the much shorter latest_run table.
This adds a performance penalty on writes, because you'll need two writes instead of one, but it gives you much more stable response times on reads. And if you do not need to SELECT from the "run" table, you can avoid indexing it, so the penalty of the second write is partly compensated by less index maintenance.
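For reference, a minimal sketch of the latest_run table the trigger above maintains; the column types are guesses from the sample data:
-- Hypothetical DDL; column types are guessed from the sample data.
CREATE TABLE latest_run (
    ID     int         NOT NULL,
    [Date] date        NOT NULL,
    Status varchar(16) NOT NULL,
    CONSTRAINT PK_latest_run PRIMARY KEY (ID, [Date])
);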

Related

How to filter out conditions based on a group by in JPA?

I have a table like
| customer | profile | status | date   |
| 1        | 1       | DONE   | mmddyy |
| 1        | 1       | DONE   | mmddyy |
In this case, I want to group on the profile ID, keeping the row with the max date. Profiles can be repeated. I've ruled out Java 8 streams, as I have many conditions here.
I want to convert the following SQL into JPQL:
select customer, profile, status, max(date)
from tbl
group by profile, customer, status, date, column-k
having count(profile) > 0 and status = 'DONE';
Can someone tell me how to write this query in JPQL, assuming it is correct in SQL? If I declare columns in the SELECT, they are needed in the GROUP BY as well, and then the query results are different.
I am guessing that you want the most recent customer/profile combination that is done.
If so, the correct SQL is:
select t.*
from t
where t.date = (select max(t2.date)
                from t t2
                where t2.customer = t.customer and t2.profile = t.profile
               ) and
      t.status = 'DONE';
I don't know how to convert this to JPQL, but you might as well start with working SQL code.
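For what it's worth, here is a hedged JPQL sketch of that SQL, assuming an entity Tbl with mapped fields customer, profile, status and date (the entity and field names are illustrative, untested):
SELECT t
FROM Tbl t
WHERE t.status = 'DONE'
  AND t.date = (SELECT MAX(t2.date)
                FROM Tbl t2
                WHERE t2.customer = t.customer
                  AND t2.profile = t.profile)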
In your query, the date column is not needed in the GROUP BY, and status = 'DONE' should be moved to a WHERE clause:
select customer, profile, status, max(date)
from tbl
where status = 'DONE'
group by profile, customer, status
having count(profile) > 0

Get specific row from each group

My question is very similar to this, except I want to be able to filter by some criteria.
I have a table "DOCUMENT" which looks something like this:
| ID | CONFIG_ID | STATE     | MAJOR_REV | MODIFIED_ON | ELEMENT_ID |
+----+-----------+-----------+-----------+-------------+------------+
|  1 | 1234      | Published | 2         | 2019-04-03  | 98762      |
|  2 | 1234      | Draft     | 1         | 2019-01-02  | 98762      |
|  3 | 5678      | Draft     | 3         | 2019-01-02  | 24244      |
|  4 | 5678      | Published | 2         | 2017-10-04  | 24244      |
|  5 | 5678      | Draft     | 1         | 2015-05-04  | 24244      |
There are actually a few more columns, but I'm trying to keep this simple.
For each CONFIG_ID, I would like to select the latest (MAX(MAJOR_REV) or MAX(MODIFIED_ON)) - but I might want to filter by additional criteria, such as state (e.g., the latest published revision of a document) and/or date (the latest revision, published or not, as of a specific date; or: all documents where a revision was published/modified within a specific date interval).
To make things more interesting, there are some other tables I want to join in.
Here's what I have so far:
SELECT
    allDocs.ID,
    d.CONFIG_ID,
    d.[STATE],
    d.MAJOR_REV,
    d.MODIFIED_ON,
    d.ELEMENT_ID,
    f.ID FILE_ID,
    f.[FILENAME],
    et.COLUMN1,
    e.COLUMN2
FROM DOCUMENT allDocs -- Get all document revisions
CROSS APPLY ( -- Then for each config ID, only look at the latest revision
    SELECT TOP 1
        ID,
        MODIFIED_ON,
        CONFIG_ID,
        MAJOR_REV,
        ELEMENT_ID,
        [STATE]
    FROM DOCUMENT
    WHERE CONFIG_ID = allDocs.CONFIG_ID
    ORDER BY MAJOR_REV DESC
) as d
LEFT OUTER JOIN ELEMENT e ON e.ID = d.ELEMENT_ID
LEFT OUTER JOIN ELEMENT_TYPE et ON e.ELEMENT_TYPE_ID = et.ID
LEFT OUTER JOIN TREE t ON t.NODE_ID = d.ELEMENT_ID
OUTER APPLY ( -- This is another optional 1:1 relation, but it's wrongfully implemented as m:n
    SELECT TOP 1
        FILE_ID
    FROM DOCUMENT_FILE_RELATION
    WHERE DOCUMENT_ID = d.ID
    ORDER BY MODIFIED_ON DESC
) as df -- There should never be more than 1, but we're using TOP 1 just in case, to avoid duplicates
LEFT OUTER JOIN [FILE] f ON f.ID = df.FILE_ID
WHERE
    allDocs.CONFIG_ID = '5678' -- Just for testing purposes
    and d.[STATE] = 'Released' -- One possible filter criterion, there may be others
It looks like the results are correct, but multiple identical rows are returned.
My guess is that for documents with 4 revisions, the same values are found 4 times and returned.
A simple SELECT DISTINCT would solve this, but I'd prefer to fix my query.
This would be a classic row_number() and PARTITION BY question, I think.
;with rows as
(
    select <your-columns>,
           row_number() over (partition by config_id order by <whatever you want>) as rn
    from document
    join <anything else>
    where <whatever>
)
select * from rows where rn = 1
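For example, a sketch with the placeholders filled in for the DOCUMENT table above; the filter and the sort column are illustrative, adjust them to your criteria:
;with rows as
(
    select ID, CONFIG_ID, [STATE], MAJOR_REV, MODIFIED_ON, ELEMENT_ID,
           row_number() over (partition by CONFIG_ID order by MAJOR_REV desc) as rn
    from DOCUMENT
    where [STATE] = 'Published' -- optional filter criteria go here
)
select * from rows where rn = 1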

Get 10 distinct projects with the latest updates in related tasks

I have two tables in a PostgreSQL 9.5 database:
project
- id
- name
task
- id
- project_id
- name
- updated_at
There are ~ 1000 projects (updated very rarely) and ~ 10 million tasks (updated very often).
I want to list those 10 distinct projects that have the latest task updates.
A basic query would be:
SELECT * FROM task ORDER BY updated_at DESC LIMIT 10;
However, there can be many updated tasks per project. So I won't get 10 unique projects.
If I try to add DISTINCT(project_id) somewhere in the query, I'm getting an error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
Problem is, I can't sort (primarily) by project_id, because I need to have tasks sorted by time. Sorting by updated_at DESC, project_id ASC doesn't work either, because several tasks of the same project can be among the latest.
I can't download all records because there are millions of them.
As a workaround, I fetch 10x the needed rows (without DISTINCT) and filter them in the backend. This works for most cases, but it's obviously not reliable: sometimes I don't get 10 unique projects.
Can this be solved efficiently in Postgres 9.5?
Example
 id | name
----+-----------
  1 | Project 1
  2 | Project 2
  3 | Project 3

 id | project_id | name   | updated_at
----+------------+--------+-----------------
  1 | 1          | Task 1 | 13:12:43.361387
  2 | 1          | Task 2 | 13:12:46.369279
  3 | 2          | Task 3 | 13:12:54.680891
  4 | 3          | Task 4 | 13:13:00.472579
  5 | 3          | Task 5 | 13:13:04.384477
If I query:
SELECT project_id, updated_at FROM task ORDER BY updated_at DESC LIMIT 2
I get:
 project_id | updated_at
------------+-----------------
          3 | 13:13:04.384477
          3 | 13:13:00.472579
But I want to get 2 distinct projects with the respective latest task.updated_at, like this:
 project_id | updated_at
------------+-----------------
          3 | 13:13:04.384477
          2 | 13:12:54.680891 -- from Task 3
The simple (logically correct) solution is to aggregate tasks to get the latest update per project, and then pick the latest 10, like #Nemeros provided.
However, this incurs a sequential scan on task, which is undesirable (expensive) for big tables.
If you have relatively few projects (many task entries per project), there are faster alternatives using (bitmap) index scans.
SELECT *
FROM   project p
     , LATERAL (
   SELECT updated_at AS last_updated_at
   FROM   task
   WHERE  project_id = p.id
   ORDER  BY updated_at DESC
   LIMIT  1
   ) t
ORDER  BY t.last_updated_at DESC
LIMIT  10;
Key to performance is a matching multicolumn index:
CREATE INDEX task_project_id_updated_at ON task (project_id, updated_at DESC);
A setup with 1000 projects and 10 million tasks (like you commented) is a perfect candidate for this.
Background:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
NULL and "no row"
The above solution assumes updated_at is defined NOT NULL. Else use ORDER BY updated_at DESC NULLS LAST and ideally make the index match.
Projects without any tasks are eliminated from the result by the implicit CROSS JOIN. NULL values cannot creep in this way. This is subtly different from correlated subqueries like the one #Nemeros added to his answer: those return NULL values for "no row" (the project has no related tasks at all). The outer descending sort order then lists NULL on top unless instructed otherwise, which is most probably not what you want.
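If you do want projects without tasks in the result, a hedged variant of the query above (untested) swaps in LEFT JOIN LATERAL ... ON true and pushes the NULLs last:
SELECT *
FROM   project p
LEFT   JOIN LATERAL (
   SELECT updated_at AS last_updated_at
   FROM   task
   WHERE  project_id = p.id
   ORDER  BY updated_at DESC
   LIMIT  1
   ) t ON true
ORDER  BY t.last_updated_at DESC NULLS LAST  -- task-less projects sort last
LIMIT  10;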
Related:
PostgreSQL sort by datetime asc, null first?
What is the difference between LATERAL and a subquery in PostgreSQL?
Try a GROUP BY expression; that's what it's aimed at:
SELECT project_id, max(updated_at) as max_upd_date
FROM task t
GROUP BY project_id
ORDER BY max_upd_date DESC
LIMIT 10
Do not forget to add an index that begins with (project_id, updated_at) if you want to avoid full table scans.
Well, the only way to use the index seems to be with a correlated subquery:
select p.id,
       (select updated_at
        from task t
        where p.id = t.project_id
        order by updated_at desc
        limit 1) as max_dte
from project p
order by max_dte desc
limit 10
Try to use:
SELECT project_id,
       max(updated_at)
FROM task
GROUP BY project_id
ORDER BY max(updated_at) DESC
LIMIT 10
I believe row_number() over() can be used for this but you will still need the final order by and limit clauses:
select mt.*
from (
    SELECT *,
           row_number() over (partition by project_id order by updated_at DESC) rn
    FROM tasks
) mt
-- inner join Projects p on mt.project_id = p.id
where mt.rn = 1
order by mt.updated_at DESC
limit 2
The advantage of this approach is that it gives you access to the full row corresponding to the maximum updated_at for each project. You can optionally join the projects table as well.
result:
| id | project_id | name | updated_at | rn |
|----|------------|--------|-----------------|----|
| 5 | 3 | Task 5 | 13:13:04.384477 | 1 |
| 3 | 2 | Task 3 | 13:12:54.680891 | 1 |
see: http://sqlfiddle.com/#!15/ee039/1
How about sorting the records by the most recent update and then doing distinct on?
select distinct on (t.project_id) t.*
from tasks t
order by max(t.updated_at) over (partition by t.project_id), t.project_id;
EDIT:
I didn't realize Postgres did that check. Here is the version with a subquery:
select distinct on (maxud, t.project_id) t.*
from (select t.*,
             max(t.updated_at) over (partition by t.project_id) as maxud
      from tasks t
     ) t
order by maxud, t.project_id;
You could probably put the analytic call in the distinct on, but I think this is clearer anyway.

Selecting records from subquery found set (postgres)

I have a query on 2 tables (part, price). The simplified version of this query is:
SELECT price.*
FROM price
INNER JOIN part ON (price.code = part.code)
WHERE price.type = '01'
ORDER BY date DESC
That returns several records:
    code     | type |        date         |   price   | file
-------------+------+---------------------+-----------+---------------
 00065064705 | 01   | 2008-01-07 00:00:00 | 16.400000 | 28SEP2011.zip
 00065064705 | 01   | 2007-02-05 00:00:00 | 15.200000 | 20JUL2011.zip
 54868278900 | 01   | 2006-02-24 00:00:00 | 16.642000 | 28SEP2011.zip
As you can see, code 00065064705 is listed twice. I just need the max-date record (2008-01-07) along with the code, type, date, and price for each unique code. So basically the top record for each unique code. This is Postgres, so I can't use SELECT TOP or something like that.
I think I should be using this as a subquery inside a main query, but I'm not sure how. Something like:
SELECT *
FROM price
JOIN (insert my original query here) AS price2 ON price.code = price2.code
Any help would be greatly appreciated.
You can use the row_number() window function to do that.
select *
from (SELECT price.*,
             row_number() over (partition by price.code order by price.date desc) as rn
      FROM price
      INNER JOIN part ON (price.code = part.code)
      WHERE price.type = '01') x
where rn = 1
ORDER BY date DESC
(*) Note: I may have prefixed some of the columns incorrectly, as I'm not sure which column is in which table. I'm sure you can fix that.
In Postgres you can use DISTINCT ON:
SELECT DISTINCT ON (code) *
FROM price
INNER JOIN part ON price.code = part.code
WHERE price.type = '01'
ORDER BY code, "date" DESC
select distinct on (code)
       code, p.type, p.date, p.price, p.file
from price p
inner join part using (code)
where p.type = '01'
order by code, p.date desc
http://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT

How to efficiently get a value from the last row in bulk on SQL Server

I have a table like so
Id | Type  | Value
------------------
 0 | Big   | 2
 1 | Big   | 3
 2 | Small | 3
 3 | Small | 3
I would like to get a table like this
Type  | Last Value
------------------
Small | 3
Big   | 3
How can I do this? I understand there is a SQL Server function LAST_VALUE(...) OVER (...), but I can't get it to work with GROUP BY.
I've also tried using SELECT MAX(ID) and SELECT TOP 1, but this seems a bit inefficient, since there would be a subquery for each value. The queries take too long when the table has a few million rows in it.
Is there a way to quickly get the last value for these, perhaps using LAST_VALUE?
You can do it using row_number():
select type,
       value
from
(
    select type,
           value,
           row_number() over (partition by type order by id desc) as RN
    from likeso -- table name assumed; the question's table is unnamed
) TMP
where RN = 1
Can't test this now since SQL Fiddle doesn't seem to work, but hopefully that's ok.
The most efficient method might be not exists, which uses an anti-join for the underlying operator:
select type, value
from likeso l
where not exists (select 1 from likeso l2 where l2.type = l.type and l2.id > l.id)
For performance, you want an index on likeso(type, id).
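A minimal sketch of that index (the name is illustrative):
-- Illustrative name; supports the anti-join's (type, id) lookups
CREATE INDEX ix_likeso_type_id ON likeso (type, id);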
I really wonder if there is a more efficient solution, but I use the following query for such needs:
SELECT Id, Type, Value
FROM (SELECT *, MAX(Id) OVER (PARTITION BY Type) AS LastId
      FROM #Table) T
WHERE Id = LastId