Get specific row from each group

Get specific row from each group - sql

My question is very similar to this, except I want to be able to filter by some criteria.
I have a table "DOCUMENT" which looks something like this:
|ID|CONFIG_ID|STATE |MAJOR_REV|MODIFIED_ON|ELEMENT_ID|
+--+---------+----------+---------+-----------+----------+
| 1|1234 |Published | 2 |2019-04-03 | 98762 |
| 2|1234 |Draft | 1 |2019-01-02 | 98762 |
| 3|5678 |Draft | 3 |2019-01-02 | 24244 |
| 4|5678 |Published | 2 |2017-10-04 | 24244 |
| 5|5678 |Draft | 1 |2015-05-04 | 24244 |
It's actually a few more columns, but I'm trying to keep this simple.
For each CONFIG_ID, I would like to select the latest (MAX(MAJOR_REV) or MAX(MODIFIED_ON)) - but I might want to filter by additional criteria, such as state (e.g., the latest published revision of a document) and/or date (the latest revision, published or not, as of a specific date; or: all documents where a revision was published/modified within a specific date interval).
To make things more interesting, there are some other tables I want to join in.
Here's what I have so far:
SELECT
allDocs.ID,
d.CONFIG_ID,
d.[STATE],
d.MAJOR_REV,
d.MODIFIED_ON,
d.ELEMENT_ID,
f.ID FILE_ID,
f.[FILENAME],
et.COLUMN1,
e.COLUMN2
FROM DOCUMENT -- Get all document revisions
CROSS APPLY ( -- Then for each config ID, only look at the latest revision
SELECT TOP 1
ID,
MODIFIED_ON,
CONFIG_ID,
MAJOR_REV,
ELEMENT_ID,
[STATE]
FROM DOCUMENT
WHERE CONFIG_ID=allDocs.CONFIG_ID
ORDER BY MAJOR_REV desc
) as d
LEFT OUTER JOIN ELEMENT e ON e.ID = d.ELEMENT_ID
LEFT OUTER JOIN ELEMENT_TYPE et ON e.ELEMENT_TYPE_ID=et.ID
LEFT OUTER JOIN TREE t ON t.NODE_ID = d.ELEMENT_ID
OUTER APPLY ( -- This is another optional 1:1 relation, but it's wrongfully implemented as m:n
SELECT TOP 1
FILE_ID
FROM DOCUMENT_FILE_RELATION
WHERE DOCUMENT_ID=d.ID
ORDER BY MODIFIED_ON DESC
) as df -- There should never be more than 1, but we're using TOP 1 just in case, to avoid duplicates
LEFT OUTER JOIN [FILE] f on f.ID=df.FILE_ID
WHERE
allDocs.CONFIG_ID = '5678' -- Just for testing purposes
and d.state ='Released' -- One possible filter criterion, there may be others
It looks like the results are correct, but multiple identical rows are returned.
My guess is that for documents with 4 revisions, the same values are found 4 times and returned.
A simple SELECT DISTINCT would solve this, but I'd prefer to fix my query.

This would be a classic row_number & partition by question I think.
;with rows as
(
select <your-columns>,
row_number() over (partion by config_id order by <whatever you want>) as rn
from document
join <anything else>
where <whatever>
)
select * from rows where rn=1

Related

Slightly different greatest-n-per-group

I have read this comment which explains the greatest-n-per-group problem and its solution. Unfortunately, I am facing a slightly different approach, and I am failing to find a solution for it.
Let's suppose I have a table with some basic info regarding users. Due to implementation, this info may or may not repeat itself:
+----+-------------------+----------------+---------------+
| id | user_name | user_name_hash | address |
+----+-------------------+----------------+---------------+
| 1 | peter_jhones | 0xFF321345 | Some Av |
| 2 | sally_whiterspoon | 0x98AB5454 | Certain St |
| 3 | mark_jackobson | 0x0102AB32 | Some Av |
| 4 | mark_jackobson | 0x0102AB32 | Particular St |
+----+-------------------+----------------+---------------+
As you can see, mark_jackobson appears twice, although its address is different in each appearance.
Every now and then, an ETL process queries new user_names and fetches the most recent records of each. Aftewards, it stores the user_name_hash in a table to sign it has already imported that certain user_name
+----------------+
| user_name_hash |
+----------------+
| 0xFF321345 |
| 0x98AB5454 |
+----------------+
Everything begins with the following query:
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table
This way, I am able to select the new hashes from my table. Since I need to query the most recent occurrence of a hash, I wrap it as a sub-query:
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash
Perfect! With the ids of my new users, I can query the addresses as follows:
SELECT
address,
user_name_hash
FROM my_table
WHERE Id IN (
SELECT MAX(id)
FROM my_table
WHERE user_name_hash IN (
SELECT DISTINCT user_name_hash
FROM my_table
EXCEPT
SELECT user_name_hash
FROM my_hash_table)
GROUP BY user_name_hash)
From my perspective, the above query works, but it does not seem optimal. Reading this comment, I noticed I could query the same data, using joins. Since I am failing to write the desired query, could anyone help me out and point me to a direction?
This is the query I have attempted, without success.
SELECT
tb1.address,
tb1.user_name_hash
FROM my_table tb1
INNER JOIN my_table tb2
ON tb1.user_name_hash = tb2.user_name_hash
LEFT JOIN my_hash_table ht
ON tb1.user_name_hash = ht.user_name_hash AND tb1.id > tb2.id
WHERE ht.user_name_hash IS NULL;
Thanks in advance.
EDIT > I am working with PostgreSQL

I believe you are looking for something like this:
SELECT
address,
user_name_hash
FROM my_table t1
JOIN (
SELECT MAX(id) maxid
FROM my_table t2
WHERE NOT EXISTS (
SELECT 1
FROM my_hash_table t3
WHERE t2.user_name_hash = t3.user_name_hash
)
GROUP BY user_name_hash
) t ON t1.ID = t.maxid
I'm using NOT EXISTS instead of EXCEPT since it is more clear to the optimizer.

You can get a better performance using a left outer join (to get the newest records not already imported) and then compute the max id for these records (subquery in the HAVING clause).
SELECT t1.address,
t1.user_name_hash,
MAX(id) AS maxid
FROM my_table t1
LEFT JOIN my_hash_table th ON t1.user_name_hash = th.user_name_hash
WHERE th.user_name_hash IS NULL
GROUP BY t1.address,
t1.user_name_hash
HAVING MAX(id) = (SELECT MAX(id)
FROM my_table t1)

Delete rows where date was least updated

How can I delete rows where dateupdated was least updated ?
My table is
Name Dateupdated ID status
john 1/02/17 JHN1 A
john 1/03/17 JHN2 A
sally 1/02/17 SLLY1 A
sally 1/03/17 SLLY2 A
Mike 1/03/17 MK1 A
Mike 1/04/17 MK2 A
I want to be left with the following after the data removal:
Name Date ID status
john 1/03/17 JHN2 A
sally 1/03/17 SLLY2 A
Mike 1/04/17 MK2 A

If you really want to "delete rows where dateupdated was least updated" then a simple single-row subquery should do the trick.
DELETE MyTable
WHERE Date = (SELECT MIN(Date) From MyTable)
If on the other hand you just want to delete the row with the earliest Date per person (as identified by their ID) you could use:
DELETE MyTable
FROM MyTable a
JOIN (SELECT ID, MIN(Date) MinDate FROM MyTable GROUP BY ID) b
ON a.ID = b.ID AND a.Date = b.MinDate
The idea here is you create an aggregate query that returns rows containing the columns that would match the rows you want deleted, then join to it. Because it's an inner join, rows that do not match the criteria will be excluded.
If people are uniquely identified by something else (e.g. Name then you can just substitute that for the ID in my example above.
I am thinking though that you don't want either of these. I think you want to delete everything except for each person's latest row. If that is the case, try this:
DELETE MyTable
WHERE EXISTS (SELECT 0 FROM MyTable b WHERE b.ID = MyTable.ID AND b.Date > MyTable.Date)
The idea here is you check for existence of another data row with the same ID and a later date. If there is a later record, delete this one.
The nice thing about the last example is you can run it over and over and every person will still be left with exactly one row. The other two queries, if run over and over, will nibble away at the table until it is empty.
P.S. As these are significantly different solutions, I suggest you spend some effort learning how to articulate unambiguous requirements. This is an extremely important skill for any developer.

This deletes rows where the name is a duplicate, and deletes all but the latest row for each name. This is different from your stated question.
Using a common table expression (cte) and row_number():
;with cte as (
select *
, rn = row_number() over (
partition by Name
order by Dateupdated desc
)
from t
)
/* ------------------------------------------------
-- Remove duplicates by deleting rows
-- where the row number (rn) is greater than 1
-- leaving the first row for each partition
------------------------------------------------ */
delete
from cte
where cte.rn > 1
select * from t
rextester: http://rextester.com/HZBQ50469
returns:
+-------+-------------+-------+--------+
| Name | Dateupdated | ID | status |
+-------+-------------+-------+--------+
| john | 2017-01-03 | JHN2 | A |
| sally | 2017-01-03 | SLLY2 | A |
| Mike | 2017-01-04 | MK2 | A |
+-------+-------------+-------+--------+
Without using the cte it can be written as:
delete d
from (
select *
, rn = row_number() over (
partition by Name
order by Dateupdated desc
)
from t
) as d
where d.rn > 1

This should do the trick:
delete
from MyTable a
where not exists (
select top 1 1
from MyTable b
where b.name = a.name
and b.DateUpdated < a.DateUpdated
)
i.e. remove any entries from the table for which there is no record on the same name with a date earlier than the record to be deleted's.

Your Name column has Mike and Mik2 which is different for each other.
So, if you did not make a mistake, standard column to group by must be ID column without last digit.
I think following is more accurate if you did not mistaken.
delete a
from MyTable a
inner join
(select substring(ID, 1, len(ID) - 1) as ID, min(Dateupdated) as MinDate
from MyTable
group by substring(ID, 1, len(ID) - 1)
) b
on substring(a.ID, 1, len(a.ID) - 1) = b.ID and a.Dateupdated = b.MinDate
You can test it at SQLFiddle: http://sqlfiddle.com/#!6/9c440/1

Getting the latest entry per day / SQL Optimizing

Given the following database table, which records events (status) for different objects (id) with its timestamp:
ID | Date | Time | Status
-------------------------------
7 | 2016-10-10 | 8:23 | Passed
7 | 2016-10-10 | 8:29 | Failed
7 | 2016-10-13 | 5:23 | Passed
8 | 2016-10-09 | 5:43 | Passed
I want to get a result table using plain SQL (MS SQL) like this:
ID | Date | Status
------------------------
7 | 2016-10-10 | Failed
7 | 2016-10-13 | Passed
8 | 2016-10-09 | Passed
where the "status" is the latest entry on a day, given that at least one event for this object has been recorded.
My current solution is using "Outer Apply" and "TOP(1)" like this:
SELECT DISTINCT rn.id,
tmp.date,
tmp.status
FROM run rn OUTER apply
(SELECT rn2.date, tmp2.status AS 'status'
FROM run rn2 OUTER apply
(SELECT top(1) rn3.id, rn3.date, rn3.time, rn3.status
FROM run rn3
WHERE rn3.id = rn.id
AND rn3.date = rn2.date
ORDER BY rn3.id ASC, rn3.date + rn3.time DESC) tmp2
WHERE tmp2.status <> '' ) tmp
As far as I understand this outer apply command works like:
For every id
For every recorded day for this id
Select the newest status for this day and this id
But I'm facing performance issues, therefore I think that this solution is not adequate. Any suggestions how to solve this problem or how to optimize the sql?

Your code seems too complicated. Why not just do this?
SELECT r.id, r.date, r2.status
FROM run r OUTER APPLY
(SELECT TOP 1 r2.*
FROM run r2
WHERE r2.id = r.id AND r2.date = r.date AND r2.status <> ''
ORDER BY r2.time DESC
) r2;
For performance, I would suggest an index on run(id, date, status, time).

Using a CTE will probably be the fastest:
with cte as
(
select ID, Date, Status, row_number() over (partition by ID, Date order by Time desc) rn
from run
)
select ID, Date, Status
from cte
where rn = 1

Do not SELECT from a log table, instead, write a trigger that updates a latest_run table like:
CREATE TRIGGER tr_run_insert ON run FOR INSERT AS
BEGIN
UPDATE latest_run SET Status=INSERTED.Status WHERE ID=INSERTED.ID AND Date=INSERTED.Date
IF ##ROWCOUNT = 0
INSERT INTO latest_run (ID,Date,Status) SELECT (ID,Date,Status) FROM INSERTED
END
Then perform reads from the much shorter lastest_run table.
This will add a performance penalty on writes because you'll need two writes instead of one. But will give you much more stable response times on read. And if you do not need to SELECT from "run" table you can avoid indexing it, therefore the performance penalty of two writes is partly compensated by less indexes maintenance.

Self-referencing Query and Not Equals

Trying to pull data from a single table called tblTooling where two TlPartNo numbers are equal to different values and the TlToolNo are not equal for these TlPartNo . This is an Access DB and the following statement gets me close, but still gives too much data.
SELECT DISTINCT
tblTooling.TlToolNo,
tblTooling.TlPartNo,
tblTooling.TlOP,
tblTooling.TlQuantity
FROM tblTooling, tblTooling AS tblTooling_1
WHERE (((tblTooling.TlToolNo)<>tblTooling_1.TlToolNo)
AND ((tblTooling.TlPartNo)="10290722")
AND ((tblTooling_1.TlPartNo)="10295379"));
The included image has the tblTooling structure and Data. Plus the expected results from the query.

You seem to want exclude a ToolNo value when it occurs with both PartNo values. In that case you could group intermediate results by ToolNo, and see whether in such a group there is only one PartNo present (with having). In that case keep that record, and in the outer query, get the two other columns added to it:
SELECT DISTINCT
tblTooling.TlToolNo,
tblTooling.TlPartNo,
tblTooling.TlOP,
tblTooling.TlQuantity
FROM tblTooling
INNER JOIN (
SELECT TlToolNo,
Min(TlPartNo) AS MinTlPartNo,
Max(TlPartNo) AS MaxTlPartNo
FROM tblTooling
WHERE TlPartNo IN ("10290722", "10295379")
GROUP BY TlToolNo
HAVING Min(TlPartNo) = Max(TlPartNo)
) AS grp
ON grp.TlToolNo = tblTooling.TlToolNo
AND grp.MinTlPartNo = tblTooling.TlPartNo
Note that for your sample data this will return 4 rows:
TlToolNo | TlPartNo | TlOP | TlQuantity
----------+----------+------+-----------
T00012362 | 10290722 | OP10 | 2
T00012456 | 10290722 | OP10 | 1
T00013456 | 10290722 | OP20 | 1
T00014348 | 10295379 | OP20 | 1

I think you can do this with not exists:
select t.*
from tblTooling as t
where not exists (select 1
from tblTooling as t2
where t2.TlPartNo in ("10290722", "10295379") and
t2.TlToolNo = t.TlToolNo and
t2.tiid <> t.tiid
) and
t.TlPartNo in ("10290722", "10295379");
This saves on the select distinct, which should be a performance boost.

Select a row used for GROUP BY

I have this table:
id | owner | asset | rate
-------------------------
1 | 1 | 3 | 1
2 | 1 | 4 | 2
3 | 2 | 3 | 3
4 | 2 | 5 | 4
And i'm using
SELECT asset, max(rate)
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
ORDER BY max(rate) DESC
to get intersection of assets for specified owners with best rate.
I also need id of row used for max(rate), but i can't find a way to include it to SELECT. Any ideas?
Edit:
I need
Find all assets that belongs to both owners (1 and 2)
From the same asset i need only one with the best rate (3)
I also need other columns (owner) that belongs to the specific asset with best rate
I expect the following output:
id | asset | rate
-------------------------
3 | 3 | 3
Oops, all 3s, but basically i need id of 3rd row to query the same table again, so resulting output (after second query) will be:
id | owner | asset | rate
-------------------------
3 | 2 | 3 | 3
Let's say it's Postgres, but i'd prefer reasonably cross-DBMS solution.
Edit 2:
Guys, i know how to do this with JOINs. Sorry for misleading question, but i need to know how to get extra from existing query. I already have needed assets and rates selected, i just need one extra field among with max(rate) and given conditions if it's possible.

Another solution that might or might not be faster than a self join (depending on the DBMS' optimizer)
SELECT id,
asset,
rate,
asset_count
FROM (
SELECT id,
asset,
rate,
rank() over (partition by asset order by rate desc) as rank_rate,
count(asset) over (partition by null) as asset_count
FROM test
WHERE owner IN (1, 2)
) t
WHERE rank_rate = 1
ORDER BY rate DESC

You are dealing with two questions and trying to solve them as if they are one. With a subquery, you can better refine by filtering the list in the proper order first (max(rate)), but as soon as you group, you lose this. As such, i would set up two queries (same procedure, if you are using procedures, but two queries) and ask the questions separately. Unless ... you need some of the information in a single grid when output.
I guess the better direction to head is to have you show how you want the output to look. Once you bake the input and the output, the middle of the oreo is easier to fill.

SELECT b.id, b.asset, b.rate
from
(
SELECT asset, max(rate) maxrate
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
) a, test b
WHERE a.asset = b.asset
AND a.maxrate = b.rate
ORDER BY b.rate DESC

You don't specify what type of database you're running on, but if you have analytical functions available you can do this:
select id, asset, max_rate
from (
select ID, asset, max(rate) over (partition by asset) max_rate,
row_number() over (partition by asset order by rate desc) row_num
from test
where owner in (1,2)
) q
where row_num = 1
I'm not sure how to add in the "having count(asset) > 1" in this way though.

This first searches for rows with the maximum rate per asset. Then it takes the highest id per asset, and selects that:
select *
from test
inner join
(
select max(id) as MaxIdWithMaxRate
from test
inner join
(
select asset
, max(rate) as MaxRate
from test
group by
asset
) filter
on filter.asset = test.asset
and filter.MaxRate = test.rate
group by
asset
) filter2
on filter.MaxIdWithMaxRate = test.id
If multiple assets share the maximum rate, this will display the one with the highest id.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Get specific row from each group - sql

This would be a classic row_number & partition by question I think. ;with rows as ( select <your-columns>, row_number() over (partion by config_id order by <whatever you want>) as rn from document join <anything else> where <whatever> ) select * from rows where rn=1

Related

Slightly different greatest-n-per-group

Delete rows where date was least updated

Getting the latest entry per day / SQL Optimizing

Self-referencing Query and Not Equals

Select a row used for GROUP BY

Categories

Resources