Delete all rows except latest per user, having a certain column value - sql

I've got a table events that contains events for users, for example:
PK | user | event_type | timestamp
--------------------------------
1 | ab | DTV | 1
2 | ab | DTV | 2
3 | ab | CPVR | 3
4 | cd | DTV | 1
5 | cd | DTV | 2
6 | cd | DTV | 3
What I want to do is keep only one event per user, namely the one with the latest timestamp and event_type = 'DTV'.
After applying the delete to the example above, the table should look like this:
PK | user | event_type | timestamp
--------------------------------
2 | ab | DTV | 2
6 | cd | DTV | 3
Can any one of you come up with something that accomplishes this task?
Update: I'm using Sqlite. This is what I have so far:
delete from events
where id not in (
select id from (
select id, user, max(timestamp)
from events
where event_type = 'DTV'
group by user)
);
I'm pretty sure this can be improved upon. Any ideas?

I think you should be able to do something like this:
delete from events
where (user, timestamp) not in (
select user, max(timestamp)
from events
where event_type = 'DTV'
group by user
)
You could potentially use more sophisticated tricks, such as table or partition replacement, depending on the database you're working with.
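Since the question mentions SQLite, the row-value form above can be tried out roughly as follows. This is only a minimal sketch, assuming SQLite 3.15 or later (needed for row values) and driven from Python's built-in sqlite3 module. The column name pk is taken from the example table, and the extra event_type guard is my own addition, so that a non-DTV row sharing a timestamp with the latest DTV row is not kept by accident.

```python
import sqlite3

# In-memory database holding the example data from the question
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "pk INTEGER PRIMARY KEY, user TEXT, event_type TEXT, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "ab", "DTV", 1),
        (2, "ab", "DTV", 2),
        (3, "ab", "CPVR", 3),
        (4, "cd", "DTV", 1),
        (5, "cd", "DTV", 2),
        (6, "cd", "DTV", 3),
    ],
)

# Delete every row that is not the latest DTV event of its user.
# The event_type guard is an assumption added on top of the answer's query.
conn.execute(
    """
    DELETE FROM events
    WHERE event_type <> 'DTV'
       OR (user, timestamp) NOT IN (
            SELECT user, max(timestamp)
            FROM events
            WHERE event_type = 'DTV'
            GROUP BY user
          )
    """
)

remaining = conn.execute(
    "SELECT pk, user, event_type, timestamp FROM events ORDER BY pk"
).fetchall()
print(remaining)
```

If two DTV rows of the same user share the latest timestamp, both survive, so this still assumes that timestamps are unique per user.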

If you are using SQL Server 2005/2008, use the following SQL:
;WITH ce
AS (SELECT *,
Row_number()
OVER (
partition BY [user], event_type
ORDER BY timestamp DESC) AS rownumber
FROM events)
DELETE FROM ce
WHERE rownumber <> 1
OR event_type <> 'DTV'

Your solution doesn't seem reliable enough to me, because your subquery pulls a column (id) that is neither aggregated nor listed in GROUP BY. Admittedly, I am not an experienced SQLite user, and your solution did work when I tested it. If there is confirmation that the id column is always taken from the row holding the MAX(timestamp) value in this situation, then fine, your approach seems quite a decent one.
But if you are as unsure about your solution as I am, you could try the following:
DELETE FROM events
WHERE NOT EXISTS (
SELECT *
FROM (
SELECT MAX(timestamp) AS ts
FROM events e
WHERE event_type = 'DTV'
AND user = events.user
) s
WHERE ts = events.timestamp
);
The inner instance of events is assigned a different alias so that the events alias could be used to unambiguously reference the outer instance of the table (the one the DELETE command is actually being applied to). This solution does assume that timestamp is unique per user, though.
A working example can be run and played with on SQL Fiddle.

Related

How do I merge and delete duplicated rows in SQL using UPDATE?

For example, I have a table of:
id | code | name | type | deviceType
---+------+------+------+-----------
1 | 23 | xyz | 0 | web
2 | 23 | xyz | 0 | mobile
3 | 24 | xyzc | 0 | web
4 | 25 | xyzc | 0 | web
I want the result to be:
id | code | name | type | deviceType
---+------+------+------+-----------
1 | 23 | xyz | 0 | web&mobile
2 | 24 | xyzc | 0 | web
3 | 25 | xyzc | 0 | web
How do I do this in SQL Server using UPDATE and DELETE statements?
Any help is greatly appreciated!
I might actually suggest just leaving the original data intact, and instead creating a view here:
CREATE VIEW yourView AS
SELECT ROW_NUMBER() OVER (ORDER BY MIN(id)) AS id,
code, name, type,
STRING_AGG(deviceType, '&') WITHIN GROUP (ORDER BY id) AS deviceType
FROM yourTable
GROUP BY code, name, type;
Demo
One main reason for not actually doing the update is that every time new data comes in, you might possibly have to run that update, over and over. Instead, just keeping the original data and running the view occasionally might perform better here.
Note that I assume that you are using SQL Server 2017 or later. If not, then STRING_AGG would have to be replaced with an uglier approach, but you should consider upgrading in this case.
To do what you want, you would need two separate statements.
This updates the "first" row of each group with all the device types in the group:
update t
set t.devicetype = t1.devicetype
from mytable t
inner join (
select min(id) as id, string_agg(devicetype, '&') within group(order by id) as devicetype
from mytable
group by code, name, type
having count(*) > 1
) t1 on t1.id = t.id
This deletes everything but the first row per group:
with t as (
select row_number() over(partition by code, name, type order by id) rn
from mytable
)
delete from t where rn > 1
Demo on DB Fiddle
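To try the two-step idea (update the first row of each group, then delete the rest) without a SQL Server instance, here is a rough sketch in SQLite via Python's sqlite3 module. It is not the SQL Server syntax from the answer: group_concat stands in for STRING_AGG, and a correlated subquery replaces UPDATE ... FROM. Note that group_concat does not guarantee the order of the concatenated values, and that the surviving ids are not renumbered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable ("
    "id INTEGER PRIMARY KEY, code INTEGER, name TEXT, type INTEGER, deviceType TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, 23, "xyz", 0, "web"),
        (2, 23, "xyz", 0, "mobile"),
        (3, 24, "xyzc", 0, "web"),
        (4, 25, "xyzc", 0, "web"),
    ],
)

# Step 1: write the combined device types into the first row of each
# duplicated group (the row with the smallest id).
conn.execute(
    """
    UPDATE mytable
    SET deviceType = (
        SELECT group_concat(m.deviceType, '&')
        FROM mytable AS m
        WHERE m.code = mytable.code
          AND m.name = mytable.name
          AND m.type = mytable.type
    )
    WHERE id IN (
        SELECT min(id)
        FROM mytable
        GROUP BY code, name, type
        HAVING count(*) > 1
    )
    """
)

# Step 2: delete everything except the first row of each group.
conn.execute(
    """
    DELETE FROM mytable
    WHERE id NOT IN (
        SELECT min(id) FROM mytable GROUP BY code, name, type
    )
    """
)

rows = conn.execute(
    "SELECT id, code, name, type, deviceType FROM mytable ORDER BY id"
).fetchall()
print(rows)
```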

Is there a way to ensure WHERE clause happens after DISTINCT?

Imagine you have a table comments in your database.
The comments table has the columns id, comment_id_no, text, show, and inserted_at.
If a user enters a comment, it inserts a row into the database
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
If a user wants to update that comment it inserts a new row into the db
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
Notice it keeps the same comment_id_no. This is so we will be able to see the history of a comment.
Now the user decides that they no longer want to display their comment
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
| 3 | 1 | hey | false | 1/1/2002 |
This hides the comment from the end users.
Now a second comment is made (not an update of the first)
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
| 3 | 1 | hey | false | 1/1/2002 |
| 4 | 2 | new | true | 1/1/2003 |
What I would like to be able to do is select the latest version of each unique comment_id_no, where show is equal to true. However, I do not want the query to return id=2.
Steps the query needs to take...
select all the most recent, distinct comment_id_nos. (should return id=3 and id=4)
select where show = true (should only return id=4)
Note: I am actually writing this query in elixir using ecto and would like to be able to do this without using the subquery function. If anyone can answer this in sql I can convert the answer myself. If anyone knows how to answer this in elixir then also feel free to answer.
You can do this without using a subquery using LEFT JOIN:
SELECT c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM Comments AS c
LEFT JOIN Comments AS c2
ON c2.comment_id_no = c.comment_id_no
AND c2.inserted_at > c.inserted_at
WHERE c2.id IS NULL
AND c.show = 'true';
I think all other approaches will require a subquery of some sort. This would usually be done with a ranking function:
SELECT c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM ( SELECT c.id,
c.comment_id_no,
c.text,
c.show,
c.inserted_at,
ROW_NUMBER() OVER(PARTITION BY c.comment_id_no
ORDER BY c.inserted_at DESC) AS RowNumber
FROM Comments AS c
) AS c
WHERE c.RowNumber = 1
AND c.show = 'true';
Since you have tagged with Postgresql you could also make use of DISTINCT ON ():
SELECT *
FROM ( SELECT DISTINCT ON (c.comment_id_no)
c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM Comments AS c
ORDER By c.comment_id_no, inserted_at DESC
) x
WHERE show = 'true';
Examples on DB<>Fiddle
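The order of the two steps (rank first, filter on show afterwards) is what keeps id=2 out of the result. The following is a small sketch of that difference, assuming SQLite 3.25 or later for window functions and Python's built-in sqlite3 module; show is stored as the text 'true'/'false' and the dates as ISO strings purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, comment_id_no INTEGER, text TEXT, "
    "show TEXT, inserted_at TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "hi", "true", "2000-01-01"),
        (2, 1, "hey", "true", "2001-01-01"),
        (3, 1, "hey", "false", "2002-01-01"),
        (4, 2, "new", "true", "2003-01-01"),
    ],
)

# Rank first, filter afterwards: a hidden latest version hides the comment.
rank_then_filter = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM (
            SELECT c.*,
                   ROW_NUMBER() OVER (PARTITION BY comment_id_no
                                      ORDER BY inserted_at DESC) AS rn
            FROM comments AS c
        ) AS ranked
        WHERE rn = 1 AND show = 'true'
        ORDER BY id
        """
    )
]

# Filter first, rank afterwards: the latest *visible* version is returned,
# which is the behaviour the question wants to avoid.
filter_then_rank = [
    row[0]
    for row in conn.execute(
        """
        SELECT id
        FROM (
            SELECT c.*,
                   ROW_NUMBER() OVER (PARTITION BY comment_id_no
                                      ORDER BY inserted_at DESC) AS rn
            FROM comments AS c
            WHERE c.show = 'true'
        ) AS ranked
        WHERE rn = 1
        ORDER BY id
        """
    )
]

print(rank_then_filter, filter_then_rank)
```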
I think you want:
select c.*
from comments c
where c.inserted_at = (select max(c2.inserted_at)
from comments c2
where c2.comment_id_no = c.comment_id_no
) and
c.show = 'true';
I don't understand what this has to do with select distinct. You simply want the last version of a comment, and then to check if you can show that.
EDIT:
In Postgres, I would do:
select c.*
from (select distinct on (comment_id_no) c.*
from comments c
order by c.comment_id_no, c.inserted_at desc
) c
where c.show
distinct on usually has pretty good performance characteristics.
As I said in the comments, I don't advise polluting data tables with history/audit data.
And no: the "double versioning" suggested by #Josh_Eller in his comment isn't a
good solution either, not only because it complicates queries unnecessarily but also because
it is much more expensive in terms of processing and tablespace fragmentation.
Bear in mind that, in PostgreSQL, UPDATE operations never modify a row in place. Instead, they
write a whole new version of the row and mark the old one as deleted. That's
why vacuum processes are needed to defragment tablespaces and
recover that space.
In any case, apart from being suboptimal, that approach forces you to implement more
complex queries to read and write data, while in fact I suppose that most of the time you will only need to select, insert, update or even delete a single row, and only occasionally look up its history.
So the best solution (IMHO) is to simply implement the schema you actually need
for your main task and implement the audit separately, in its own table
maintained by a trigger.
This would be much more:
Robust and simple: because you focus on a single thing at a time (Single
Responsibility and KISS principles).
Fast: audit operations can be performed in an AFTER trigger, so
every time you perform an INSERT, UPDATE, or DELETE, any possible lock
within the transaction has already been released, because the database engine knows that its outcome won't change.
Efficient: an update will, of course, insert a new row and mark
the old one as deleted. But this is done at a low level by the database engine and, more than that, your audit data will be completely unfragmented (because you only ever insert there and never update). So the overall fragmentation will always be much lower.
That being said, how to implement it?
Suppose this simple schema:
create table comments (
text text,
mtime timestamp not null default now(),
id serial primary key
);
create table comments_audit ( -- Or audit.comments if using separate schema
text text,
mtime timestamp not null,
id integer,
rev integer not null,
primary key (id, rev)
);
...and then this function and trigger:
create or replace function fn_comments_audit()
returns trigger
language plpgsql
security definer
-- This allows you to restrict permissions on the audit table,
-- because the function is executed as the user who defined it
-- rather than the user whose statement fired the trigger.
as $$
DECLARE
BEGIN
if TG_OP = 'DELETE' then
raise exception 'FATAL: Deletion is not allowed for %', TG_TABLE_NAME;
-- If you want to allow deletion there are a few more decisions to take...
-- So here I block it for the sake of simplicity ;-)
end if;
insert into comments_audit (
text
, mtime
, id
, rev
) values (
NEW.text
, NEW.mtime
, NEW.id
, coalesce (
(select max(rev) + 1 from comments_audit where id = new.ID)
, 0
)
);
return NULL;
END;
$$;
create trigger tg_comments_audit
after insert or update or delete
on public.comments
for each row
execute procedure fn_comments_audit()
;
And that's all.
Notice that with this approach your current comment data is always present
in comments_audit as well. You could instead have used the OLD record and
defined the trigger only for UPDATE (and DELETE) operations to avoid that.
But I prefer this approach, not only because it gives us extra redundancy (after an
accidental deletion on the master table, if deletion were allowed or the trigger were accidentally
disabled, we would be able to recover all the data from
the audit table) but also because it simplifies (and optimises) querying the
history when it's needed.
Now you only need to insert, update or select (or even delete, if you develop this schema a little further, e.g. by inserting a row with nulls...) in a fully transparent manner, just as if there were no audit system at all. And when you need the history, you only need to query the audit table instead.
NOTE: Additionally, you might want to include a creation timestamp (ctime). In that case it would be worth preventing it from being modified, using a BEFORE trigger. I omitted it (for the sake of simplicity again) because you can already infer it from the mtimes in the audit table (although if you are going to use it in your application, adding it would be very advisable).
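If you are running Postgres 8.4 or higher, ROW_NUMBER() is an efficient solution. Note that the show filter has to be applied after the ranking; filtering inside the subquery would return the latest visible version (id=2) of a comment that has since been hidden:
SELECT *
FROM (
SELECT c.*, ROW_NUMBER() OVER(PARTITION BY comment_id_no ORDER BY inserted_at DESC) rn
FROM comments c
) x WHERE rn = 1 AND x.show = 'true'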
Else, this could also be achieved using a WHERE NOT EXISTS condition, that ensures that you are showing the latest comment :
SELECT c.*
FROM comments c
WHERE
c.show = 'true'
AND NOT EXISTS (
SELECT 1
FROM comments c1
WHERE c1.comment_id_no = c.comment_id_no AND c1.inserted_at > c.inserted_at
)
You have to use group by to get the latest ids and then join to the comments table to filter out the rows where show = false:
select c.*
from comments c inner join (
select comment_id_no, max(id) maxid
from comments
group by comment_id_no
) g on g.maxid = c.id
where c.show = 'true'
I assume that the column id is unique and autoincrement in comments table.
See the demo

Google Big Query SQL Aggregate Data based on Sessions

I am currently working with Google Analytics Data in Big Query, and one thing I have yet to be able to wrap my head around is how to write a query to get aggregated data over events from one session.
I have searched around to find something that might work, but couldn't get it so far.
Basically, this is how the table looks (vastly simplified):
UserID | event_name | event_timestamp
--------------------------------------
1 | login | 1543171146125000
1 | other event| 1543171155329000
1 | other event| 1543171155341001
1 | login | 1543171157796003
1 | other event| 1543171160541000
2 | login | 1543171157796003
2 | other event| 1543171177531000
What I want to do now is aggregate data over user AND session, where a session is defined as all events until another login event occurs for that user.
I'm assuming I have to come up with an additional field "session" that basically shows a new ID each time a login event_name is encountered for the UserID currently being aggregated.
So, for instance, if I want an aggregated event count, the resulting table would look something like:
UserID | session | EventCount
---------------------------
1 | 1 | 3
1 | 2 | 2
2 | 1 | 2
My assumption would be that there is some sub-query I could use to get that magical "session" field, so something like:
SELECT UserID, session, COUNT(event_name) as EventCount
FROM (Insert Magical Subquery here)
GROUP BY UserID, session
Any ideas how this might be done? It seems like a simple thing but I just can't figure it out.
Based on your example, a session seems to start with a "login". So, you can just do a cumulative count of "login" events for each userid:
select t.*,
countif(event_name = 'login') over (partition by userid order by event_timestamp) as session
from t;
You can then aggregate:
select userid, session, count(*)
from (select t.*,
countif(event_name = 'login') over (partition by userid order by event_timestamp) as session
from t
) t
group by userid, session;
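The running-count idea can be checked against the sample data without BigQuery. Below is a rough sketch in SQLite (3.25 or later for window functions) through Python's sqlite3 module. SQLite has no COUNTIF, so a windowed sum over the boolean expression event_name = 'login' stands in for it; the table name events and the column name session_no are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (userid INTEGER, event_name TEXT, event_timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "login", 1543171146125000),
        (1, "other event", 1543171155329000),
        (1, "other event", 1543171155341001),
        (1, "login", 1543171157796003),
        (1, "other event", 1543171160541000),
        (2, "login", 1543171157796003),
        (2, "other event", 1543171177531000),
    ],
)

# The inner query numbers sessions with a running count of logins per user;
# the outer query counts the events in each (user, session) pair.
session_counts = conn.execute(
    """
    SELECT userid, session_no, count(*) AS event_count
    FROM (
        SELECT userid,
               sum(event_name = 'login') OVER (
                   PARTITION BY userid ORDER BY event_timestamp
               ) AS session_no
        FROM events
    ) AS numbered
    GROUP BY userid, session_no
    ORDER BY userid, session_no
    """
).fetchall()
print(session_counts)
```

Events that occur before a user's first login would land in session 0 with this numbering.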

SQL query to get latest user to update record

I have a postgres database that contains an audit log table which holds a historical log of updates to documents. It contains which document was updated, which field was updated, which user made the change, and when the change was made. Some sample data looks like this:
doc_id | user_id | created_date | field | old_value | new_value
--------+---------+------------------------+-------------+---------------+------------
A | 1 | 2018-07-30 15:43:44-05 | Title | | War and Piece
A | 2 | 2018-07-30 15:45:13-05 | Title | War and Piece | War and Peas
A | 1 | 2018-07-30 16:05:59-05 | Title | War and Peas | War and Peace
B | 1 | 2018-07-30 15:43:44-05 | Description | test 1 | test 2
B | 2 | 2018-07-30 17:45:44-05 | Description | test 2 | test 3
You can see that the Title of document A was changed three times, first by user 1 then by user 2, then again by user 1.
Basically I need to know which user was the last one to update a field on a particular document. So for example, I need to know that User 1 was the last user to update the Title field on document A. I don't really care what time it happened, just the document, field, and user.
So sample output would be something like this:
doc_id | field | user_id
--------+-------------+---------
A | Title | 1
B | Description | 2
Seems like it should be fairly straightforward query to write but I'm having some trouble with it. I would think that group by would be in order but the problem is that if I group by doc_id I lose the user data:
select doc_id, max(created_date)
from document_history
group by doc_id;
doc_id | max
--------+------------------------
B | 2018-07-30 15:00:00-05
A | 2018-07-30 16:00:00-05
I could join this result back to the document_history table, but I would need to do so based on the doc_id and timestamp, which doesn't seem quite right. If two people edit a document at the exact same time, I would get multiple rows back for that document and field. Maybe that's so unlikely I shouldn't worry about it, but still...
Any thoughts on a way to do this in a single query?
You want to filter the records, so think where, not group by:
select dh.*
from document_history dh
where dh.created_date = (select max(dh2.created_date) from document_history dh2 where dh2.doc_id = dh.doc_id);
In most databases, this will have better performance than a group by, if you have an index on document_history(doc_id, created_date).
If your DBMS supports window functions (e.g. PostgreSQL, SQL Server; aka analytic function in Oracle) you could do something like this (SQLFiddle with Postgres, other systems might differ slightly in the syntax):
http://sqlfiddle.com/#!17/981af/4
SELECT DISTINCT
doc_id, field,
first_value(user_id) OVER (PARTITION BY doc_id, field ORDER BY created_date DESC) as last_user
FROM document_history
first_value() OVER (... ORDER BY x DESC) orders the rows within each partition in descending order and then takes the first value, which here is the user_id of the row with the latest timestamp.
I added the DISTINCT to get your expected result. The window function just adds a new column to your SELECT result, and every row within the same partition gets the same value. If you do not need the DISTINCT, remove it; you can then work with the original data plus the newly gained information.
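The DISTINCT plus first_value() combination can be reproduced on the sample data like this. It is a minimal sketch in SQLite (3.25 or later) via Python's sqlite3 module rather than the Postgres fiddle; the old_value/new_value columns are left out and the time zone offsets are dropped from the timestamps to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document_history ("
    "doc_id TEXT, user_id INTEGER, created_date TEXT, field TEXT)"
)
conn.executemany(
    "INSERT INTO document_history VALUES (?, ?, ?, ?)",
    [
        ("A", 1, "2018-07-30 15:43:44", "Title"),
        ("A", 2, "2018-07-30 15:45:13", "Title"),
        ("A", 1, "2018-07-30 16:05:59", "Title"),
        ("B", 1, "2018-07-30 15:43:44", "Description"),
        ("B", 2, "2018-07-30 17:45:44", "Description"),
    ],
)

# Every row of a (doc_id, field) partition gets the user_id of the newest
# row; DISTINCT then collapses each partition to a single result row.
last_editors = conn.execute(
    """
    SELECT DISTINCT
           doc_id,
           field,
           first_value(user_id) OVER (
               PARTITION BY doc_id, field ORDER BY created_date DESC
           ) AS last_user
    FROM document_history
    ORDER BY doc_id, field
    """
).fetchall()
print(last_editors)
```

Because each partition collapses to one row, two edits with exactly the same timestamp would not produce duplicate rows, although which of the two users is reported is then arbitrary.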

complex django or SQL query

A record can have status 'renewal_required'. If it enters this status, and the applicant indeed renews, a copy is generated, which enters status 'in_process' (But an application can have status 'in_process' for other reasons too).
Now I need to get all records that have renewal_required status, BUT, if a copy exists in status 'in_process' for a given applicant, I shall only show that one...the key is the applicant_id, being the same for copied records.
| id | status | applicant_id |
| 1 | renewal_required | 2 |
| 2 | in_process | 3 |
| 3 | renewal_required | 4 |
| 4 | in_process | 4 |
in the above example, records with id 1 and 4 would be returned...
Can this be done? Thanks for any suggestion (DB-redesign excluded, even if the design looks ridiculous - can't do anything about it right now)
Solution needs to be for django but if a SQL solution is being proposed I will happily accept it and adapt/execute directly
select a.applicant_id,COALESCE(b.status,a.status) status from
(select applicant_id,status from yourtable where status='renewal_required') a
left join
(select applicant_id,status from yourtable where status='in_process') b
on a.applicant_id = b.applicant_id;
check the DEMO
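If the record id is needed as well, the same left join can return it through COALESCE. Here is a small sketch against the sample data, run in SQLite through Python's sqlite3 module; the table name records is hypothetical, and the join condition carries the status filter instead of the two derived tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT, applicant_id INTEGER)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "renewal_required", 2),
        (2, "in_process", 3),
        (3, "renewal_required", 4),
        (4, "in_process", 4),
    ],
)

# Start from the renewal_required records and, where the same applicant
# also has an in_process copy, show that copy instead.
result = conn.execute(
    """
    SELECT COALESCE(b.id, a.id)         AS id,
           COALESCE(b.status, a.status) AS status,
           a.applicant_id
    FROM records AS a
    LEFT JOIN records AS b
           ON b.applicant_id = a.applicant_id
          AND b.status = 'in_process'
    WHERE a.status = 'renewal_required'
    ORDER BY id
    """
).fetchall()
print(result)
```

An applicant with several in_process copies would appear once per copy, so this assumes at most one such copy per applicant.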
Here, a possible solution
SELECT MAX(t1.id) as max_id, t1.status, t1.applicant_id
FROM t1
JOIN (
SELECT MIN(status) as status, applicant_id
FROM t1
WHERE status in ('renewal_required', 'in_process')
GROUP by applicant_id
-- skip applicants that have no renewal_required record at all
HAVING MAX(status) = 'renewal_required' ) tmp
ON t1.status = tmp.status
AND t1.applicant_id = tmp.applicant_id
GROUP BY t1.status, t1.applicant_id
SQL Fiddle
EDIT: On second thought, this won't work if there are more statuses than just these two, because SELECT MIN(status) relies on 'in_process' sorting before 'renewal_required'. Could you comment on that?
EDIT2: With the added WHERE status in ('renewal_required', 'in_process') filter, it might work after all.