SQL - Merge duplicate rows by the most recent

I have a big SQL database with a table like this, for example:
first_name | last_name | email             | country | created_at
-----------+-----------+-------------------+---------+-----------
john       | DOE       | johndoe@email.com | USA     | 2016-05-01
john       | DOE       | johndoe@email.com | FRANCE  | 2019-05-03
doe        | John      | johndoe@email.com | CANADA  | 2011-08-23
The previous database was built without a unique constraint on email (yes, it's horrible).
So, for users sharing the same email but with different data, I need to keep only the most recent record.
Then update the database by deleting the older rows and keeping the latest one.
Excuse me if it's not clear.

Something like this?
delete from t
where t.created_at < (select max(t2.created_at)
                      from t t2
                      where t2.email = t.email
                     );
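Note that MySQL rejects a single-table DELETE whose subquery reads the same table (error 1093); a multi-table self-join DELETE avoids that. A minimal sketch, with tablename as a placeholder for the real table:
-- MySQL variant: delete every row for which a newer row with the
-- same email exists, using a self-join instead of a subquery.
DELETE t1
FROM tablename t1
JOIN tablename t2
  ON t2.email = t1.email
 AND t2.created_at > t1.created_at;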

With EXISTS:
delete from tablename t
where exists (
    select 1 from tablename t2
    where t2.email = t.email and t2.created_at > t.created_at
);
EXISTS returns TRUE as soon as it finds one row with the same email and a greater created_at than the current row, so it does not need to scan the whole table for every row.

You mentioned that this is a big database, so I suggest that you add an index on the table before running the script by either @forpas or @Gordon Linoff, as these scripts might take a long time to complete when dealing with millions of rows.
The index could be created like this:
CREATE INDEX tablename_index ON tablename (email, created_at);
And then afterwards, if you don't need the index any more, you can drop it like this:
DROP INDEX tablename_index ON tablename;
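Before running either destructive script on a big table, it may also be worth previewing the rows that would be deleted; a minimal sketch, again with tablename as a placeholder:
-- Dry run: list the rows the EXISTS-based DELETE would remove (changes nothing).
SELECT t.email, t.created_at
FROM tablename t
WHERE EXISTS (
    SELECT 1 FROM tablename t2
    WHERE t2.email = t.email AND t2.created_at > t.created_at
)
ORDER BY t.email, t.created_at;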


Is there a way to ensure WHERE clause happens after DISTINCT?

Imagine you have a table comments in your database.
The comments table has the columns id, comment_id_no, text, show, and inserted_at.
If a user enters a comment, a row is inserted into the database:
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
If a user wants to update that comment, a new row is inserted into the db:
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
Notice it keeps the same comment_id_no. This is so we will be able to see the history of a comment.
Now the user decides that they no longer want to display their comment
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
| 3 | 1 | hey | false | 1/1/2002 |
This hides the comment from the end users.
Now a second comment is made (not an update of the first)
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
| 3 | 1 | hey | false | 1/1/2002 |
| 4 | 2 | new | true | 1/1/2003 |
What I would like to be able to do is select all the latest versions of unique comment_id_nos, where show is equal to true. However, I do not want the query to return id=2.
Steps the query needs to take...
select all the most recent, distinct comment_id_nos. (should return id=3 and id=4)
select where show = true (should only return id=4)
Note: I am actually writing this query in Elixir using Ecto and would like to be able to do this without using the subquery function. If anyone can answer this in SQL I can convert the answer myself. If anyone knows how to answer this in Elixir then also feel free to answer.
You can do this without using a subquery using LEFT JOIN:
SELECT c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM Comments AS c
LEFT JOIN Comments AS c2
ON c2.comment_id_no = c.comment_id_no
AND c2.inserted_at > c.inserted_at
WHERE c2.id IS NULL
AND c.show = 'true';
I think all other approaches will require a subquery of some sort; this would usually be done with a ranking function:
SELECT c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM ( SELECT c.id,
c.comment_id_no,
c.text,
c.show,
c.inserted_at,
ROW_NUMBER() OVER(PARTITION BY c.comment_id_no
ORDER BY c.inserted_at DESC) AS RowNumber
FROM Comments AS c
) AS c
WHERE c.RowNumber = 1
AND c.show = 'true';
Since you have tagged the question with PostgreSQL, you could also make use of DISTINCT ON ():
SELECT *
FROM ( SELECT DISTINCT ON (c.comment_id_no)
c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM Comments AS c
ORDER By c.comment_id_no, inserted_at DESC
) x
WHERE show = 'true';
I think you want:
select c.*
from comments c
where c.inserted_at = (select max(c2.inserted_at)
from comments c2
where c2.comment_id_no = c.comment_id_no
) and
c.show = 'true';
I don't understand what this has to do with select distinct. You simply want the last version of a comment, and then to check if you can show that.
EDIT:
In Postgres, I would do:
select c.*
from (select distinct on (comment_id_no) c.*
from comments c
order by c.comment_id_no, c.inserted_at desc
) c
where c.show
distinct on usually has pretty good performance characteristics.
As I said in the comments, I don't advise polluting data tables with history/audit stuff.
And no: the "double versioning" suggested by @Josh_Eller in his comment isn't a
good solution either: not only does it complicate queries unnecessarily, it is also
much more expensive in terms of processing and tablespace fragmentation.
Keep in mind that UPDATE operations never update anything in place. They instead
write a whole new version of the row and mark the old one as deleted. That's
why vacuum processes are needed to defragment tablespaces and recover that space.
In any case, apart from being suboptimal, that approach forces you to implement more
complex queries to read and write data, while in fact I suppose most of the time you will only need to select, insert, update or delete a single row, and only occasionally look its history up.
So the best solution (IMHO) is to implement just the schema you actually need
for your main task, and maintain the audit trail in a separate table
kept up to date by a trigger.
This would be much more:
Robust and simple: because you focus on a single thing at a time (Single
Responsibility and KISS principles).
Fast: audit operations can be performed in an AFTER trigger, which runs
once the database engine already knows that the statement's outcome won't change.
Efficient: an update will, of course, insert a new row and mark
the old one as deleted. But this is done at a low level by the database engine and, more than that, your audit data will stay fully unfragmented (because you only ever insert there, never update), so the overall fragmentation will always be much lower.
That being said, how to implement it?
Suppose this simple schema:
create table comments (
    text  text,
    mtime timestamp not null default now(),
    id    serial primary key
);
create table comments_audit ( -- Or audit.comments if using a separate schema
    text  text,
    mtime timestamp not null,
    id    integer,
    rev   integer not null,
    primary key (id, rev)
);
...and then this function and trigger:
create or replace function fn_comments_audit()
returns trigger
language plpgsql
security definer
-- SECURITY DEFINER allows you to restrict permissions on the audit table,
-- because the function runs as the user who defined it rather than the
-- user who executed the statement that fired the trigger.
as $$
BEGIN
    if TG_OP = 'DELETE' then
        raise exception 'FATAL: Deletion is not allowed for %', TG_TABLE_NAME;
        -- If you want to allow deletion there are a few more decisions to take...
        -- So here I block it for the sake of simplicity ;-)
    end if;
    insert into comments_audit (
          text
        , mtime
        , id
        , rev
    ) values (
          NEW.text
        , NEW.mtime
        , NEW.id
        , coalesce(
              (select max(rev) + 1 from comments_audit where id = NEW.id)
            , 0
          )
    );
    return NULL;
END;
$$;
create trigger tg_comments_audit
    after insert or update or delete
    on public.comments
    for each row
    execute procedure fn_comments_audit();
And that's all.
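A quick smoke test (a sketch, assuming the schema above and a fresh table, so the serial id starts at 1; the expected rows follow from the rev logic):
-- Insert a comment, edit it, then inspect the audit trail.
insert into comments (text) values ('first version');
update comments set text = 'second version', mtime = now() where id = 1;
select id, rev, mtime, text from comments_audit order by id, rev;
-- Expected: two rows for id 1: rev 0 ('first version') and rev 1 ('second version').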
Notice that with this approach you will always have your current comments data
in comments_audit too. You could instead have used the OLD record and only
defined the trigger on UPDATE (and DELETE) operations to avoid that.
But I prefer this approach, not only because it gives us extra redundancy (if an
accidental deletion happened on the master table, in case it were allowed or the
trigger were accidentally disabled, we would still be able to recover all the data
from the audit table), but also because it simplifies (and optimises) querying the
history when it's needed.
Now you only need to insert, update or select (or even delete, if you develop this schema a little more, e.g. by inserting a row of nulls...) in a fully transparent manner, just as if there were no audit system at all. And when you need the historical data, you query the audit table instead.
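For example, pulling the history of a single comment is then a plain select on the audit table (a sketch; the id value 42 is arbitrary):
-- Full edit history of comment 42, newest revision first.
select rev, mtime, text
from comments_audit
where id = 42
order by rev desc;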
NOTE: Additionally, you may want to include a creation timestamp (ctime). In that case it would be wise to prevent it from being modified, in a BEFORE trigger; I omitted it here (again for the sake of simplicity) because you can already infer it from the mtime values in the audit table. But if you are going to use it in your application, adding it is very advisable.
If you are running Postgres 8.4 or higher, ROW_NUMBER() is the most efficient solution. Note that the show filter must be applied after the ranking; filtering inside the subquery would let an older visible version through when the latest one is hidden (it would return id=2 in your example):
SELECT *
FROM (
    SELECT c.*, ROW_NUMBER() OVER(PARTITION BY comment_id_no ORDER BY inserted_at DESC) rn
    FROM comments c
) x
WHERE rn = 1 AND x.show = 'true'
Otherwise, this can also be achieved using a WHERE NOT EXISTS condition that ensures you are only showing the latest version of each comment:
SELECT c.*
FROM comments c
WHERE
    c.show = 'true'
    AND NOT EXISTS (
        SELECT 1
        FROM comments c1
        WHERE c1.comment_id_no = c.comment_id_no AND c1.inserted_at > c.inserted_at
    )
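Whichever variant you use, an index covering the correlation columns should keep the per-row lookup cheap; a sketch (the index name is invented):
CREATE INDEX idx_comments_cid_inserted ON comments (comment_id_no, inserted_at);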
You can use GROUP BY to get the latest ids and then join back to the comments table to filter out the rows where show = false:
select c.*
from comments c inner join (
select comment_id_no, max(id) maxid
from comments
group by comment_id_no
) g on g.maxid = c.id
where c.show = 'true'
I assume that the column id is unique and auto-incrementing in the comments table.

SQL query to get latest user to update record

I have a Postgres database that contains an audit log table which holds a historical log of updates to documents. It contains which document was updated, which field was updated, which user made the change, and when the change was made. Some sample data looks like this:
doc_id | user_id | created_date | field | old_value | new_value
--------+---------+------------------------+-------------+---------------+------------
A | 1 | 2018-07-30 15:43:44-05 | Title | | War and Piece
A | 2 | 2018-07-30 15:45:13-05 | Title | War and Piece | War and Peas
A | 1 | 2018-07-30 16:05:59-05 | Title | War and Peas | War and Peace
B | 1 | 2018-07-30 15:43:44-05 | Description | test 1 | test 2
B | 2 | 2018-07-30 17:45:44-05 | Description | test 2 | test 3
You can see that the Title of document A was changed three times, first by user 1 then by user 2, then again by user 1.
Basically I need to know which user was the last one to update a field on a particular document. So for example, I need to know that User 1 was the last user to update the Title field on document A. I don't really care what time it happened, just the document, field, and user.
So sample output would be something like this:
doc_id | field | user_id
--------+-------------+---------
A | Title | 1
B | Description | 2
Seems like it should be a fairly straightforward query to write, but I'm having some trouble with it. I would think that GROUP BY would be in order, but the problem is that if I group by doc_id I lose the user data:
select doc_id, max(created_date)
from document_history
group by doc_id;
doc_id |          max
--------+------------------------
 B      | 2018-07-30 17:45:44-05
 A      | 2018-07-30 16:05:59-05
I could join these results table back to the document_history table but I would need to do so based on the doc_id and timestamp which doesn't seem quite right. If two people editing a document at the exact same time I would get multiple rows back for that document and field. Maybe that's so unlikely I shouldn't worry about it, but still...
Any thoughts on a way to do this in a single query?
You want to filter the records, so think where, not group by:
select dh.*
from document_history dh
where dh.created_date = (select max(dh2.created_date)
                         from document_history dh2
                         where dh2.doc_id = dh.doc_id
                           and dh2.field = dh.field
                        );
Note the correlation on both doc_id and field, since you want the last editor per field of each document. In most databases, this will have better performance than a group by, if you have an index on document_history(doc_id, field, created_date).
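For reference, that supporting index could be created like this (a sketch; the index name is invented):
CREATE INDEX idx_doc_history_doc_field_created
    ON document_history (doc_id, field, created_date);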
If your DBMS supports window functions (e.g. PostgreSQL, SQL Server; aka analytic functions in Oracle), you could do something like this (SQLFiddle with Postgres; other systems might differ slightly in syntax):
http://sqlfiddle.com/#!17/981af/4
SELECT DISTINCT
doc_id, field,
first_value(user_id) OVER (PARTITION BY doc_id, field ORDER BY created_date DESC) as last_user
FROM get_last_updated
first_value() OVER (... ORDER BY x DESC) orders each window frame/partition descending and then takes the first value, which belongs to your latest timestamp.
I added the DISTINCT to get your expected result. The window function just adds a new column to your SELECT result, with the same value within each partition. If you do not need the DISTINCT, remove it, and you can work with the original data plus the newly derived column.
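Since this is Postgres, DISTINCT ON would give the same per-(doc_id, field) result without the outer DISTINCT; a minimal sketch, assuming the question's document_history table:
-- Keep only the newest row per (doc_id, field).
SELECT DISTINCT ON (doc_id, field)
       doc_id, field, user_id
FROM document_history
ORDER BY doc_id, field, created_date DESC;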

In mysql, what is an efficient way to get only rows in a table by a distinct value in one of that table's columns?

I'm relatively new to MySQL, so bear with me.
I have a table that looks a bit like this:
ID | Name | Location
0 | John | Los Angeles
1 | Joe | San Jose
2 | Jane | New York
3 | Sal | Boise
4 | Jay | New York
5 | Kate | San Jose
I need a SELECT statement that gets all rows, with the exception that if Location is repeated, that row is ignored. The result should look something like this:
0 | John | Los Angeles
1 | Joe | San Jose
2 | Jane | New York
3 | Sal | Boise
The important thing is that my table is very, very large, with hundreds of thousands of rows. Most things I've tried as ended up with select statements that take literally 30+ minutes to complete!
This is how I would do it:
SELECT mt1.ID, mt1.Name, mt1.Location
FROM mytable mt1
JOIN (SELECT MIN(id) AS ID
FROM Mytable
GROUP BY location) mt2
ON mt1.ID = mt2.ID
The derived table gets the minimum id for each location. It then joins back to the table to pull the rest of the data.
Incidentally, ID is a horrible choice for naming the id field. Please think about using tablenameID instead. It is helpful for reporting not to have the same name for the id fields in different tables, and it makes it FAR less likely that you will make an accidental join mistake and join to the ID in the wrong table. It also makes the PK/FK relationships easier to see, in my opinion.
SELECT ID, Name, Location
FROM mytable
GROUP BY Location
Note that this relies on MySQL's non-standard GROUP BY handling: the ID and Name values are taken from an arbitrary row per location, and the query is rejected outright when ONLY_FULL_GROUP_BY is enabled (the default since MySQL 5.7).
Do you have an index on Location? That should help improve the speed a lot.
you can use
SELECT * FROM tbl GROUP BY Location
(subject to the same non-standard GROUP BY caveat as above). Creating an index on Location can dramatically reduce the query time.
Also, if your table has many columns, selecting only the required columns instead of using * will improve performance further.
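For reference, the index both answers suggest could be created like this (a sketch; the index name is invented, and mytable stands in for the real table, which the question doesn't name):
CREATE INDEX idx_mytable_location ON mytable (Location);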

MySQL select a row preceding the row with ID = X

This is actually a sorting question, almost solved thanks to other posts I've found here on Stack Overflow.
Basically I'm trying to bump a row up; there's no need to insert it somewhere in the middle of the table and shift all the following rows' ids one place up. I just want to swap its id with the previous one.
I've found a solution for this in another post for when the ids are known, for example:
ID | name | surname
1 | John | doe
2 | Jane | Dane
but my table has gaps, and often it's more like:
ID | name | surname
2 | John | doe
7 | Jane | Dane
So I can't really rely on subtracting 1 from the id and replacing row 7 with row 7-1.
Is it somehow possible to switch row id X with the one before it, without knowing that other id?
Is it possible to do it in SQL only? I have an idea of doing it in PHP, but that feels a lot like a workaround rather than a solution.
This would work in SQL Server; I'm assuming it does in MySQL too:
SELECT *
FROM MyTable MT
...
WHERE MT.ID = (SELECT MAX(MT2.ID) FROM MyTable MT2 WHERE MT2.ID < MT.ID)
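A fuller sketch for a concrete id (here 7 stands in for X; the table and columns follow the question):
-- Fetch the row immediately preceding id 7, even when ids have gaps.
SELECT MT.*
FROM MyTable MT
WHERE MT.ID = (SELECT MAX(MT2.ID)
               FROM MyTable MT2
               WHERE MT2.ID < 7);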

MySQL, reading this EXPLAIN statement

I have a query which is starting to cause some concern in my application. I'm trying to read this EXPLAIN output to understand where indexes are potentially missing:
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | s | ref | client_id | client_id | 4 | const | 102 | Using temporary; Using filesort |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | www_foo_com.s.user_id | 1 | |
| 1 | SIMPLE | a | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | h | ref | email_id | email_id | 4 | www_foo_com.a.email_id | 10 | Using index |
| 1 | SIMPLE | ph | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | Using index |
| 1 | SIMPLE | em | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | pho | ref | session_id | session_id | 4 | www_foo_com.s.session_id | 1 | |
| 1 | SIMPLE | c | ALL | userfield | NULL | NULL | NULL | 1108 | |
+----+-------------+-------+--------+---------------+------------+---------+-------------------------------+------+---------------------------------+
8 rows in set (0.00 sec)
I'm trying to understand where my indexes are missing by reading this EXPLAIN statement. Is it fair to say that one can understand how to optimize this query without seeing the query at all, just by looking at the results of the EXPLAIN?
It appears that the ALL scan against the 'c' table is the Achilles heel. What's the best way to index this based on constant values, as recommended in MySQL's documentation?
Note, I also added an index to userfield in the cdr table and that hasn't done much good either.
Thanks.
--- edit ---
Here's the query, sorry -- don't know why I neglected to include it the first pass through.
SELECT s.`session_id` id,
DATE_FORMAT(s.`created`,'%m/%d/%Y') date,
u.`name`,
COUNT(DISTINCT c.id) calls,
COUNT(DISTINCT h.id) emails,
SEC_TO_TIME(MAX(DISTINCT c.duration)) duration,
(COUNT(DISTINCT em.email_id) + COUNT(DISTINCT pho.phone_id) > 0) status
FROM `fa_sessions` s
LEFT JOIN `fa_users` u ON s.`user_id`=u.`user_id`
LEFT JOIN `fa_email_aliases` a ON a.session_id = s.session_id
LEFT JOIN `fa_email_headers` h ON h.email_id = a.email_id
LEFT JOIN `fa_phones` ph ON ph.session_id = s.session_id
LEFT JOIN `fa_email_aliases` em ON em.session_id = s.session_id AND em.status = 1
LEFT JOIN `fa_phones` pho ON pho.session_id = s.session_id AND pho.status = 1
LEFT JOIN `cdr` c ON c.userfield = ph.phone_id
WHERE s.`partner_id`=1
GROUP BY s.`session_id`
I assume you've looked here to get more info about what it is telling you. Obviously the ALL type means it's scanning all rows of that table. Using temporary and Using filesort are discussed on that page; you might want to look at those.
From the page:
Using filesort
MySQL must do an extra pass to find out how to retrieve the rows in sorted order. The sort is done by going through all rows according to the join type and storing the sort key and pointer to the row for all rows that match the WHERE clause. The keys then are sorted and the rows are retrieved in sorted order. See Section 7.2.12, “ORDER BY Optimization”.
Using temporary
To resolve the query, MySQL needs to create a temporary table to hold the result. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
I agree that seeing the query might help to figure things out better.
My advice?
Break the query into 2 and use a temporary table in the middle.
Reasoning
The problem appears to be that table c is being table scanned, and that this is the last table in the query. This is probably bad: if you have a table scan, you want to do it at the start of the query, so it's only done once.
I'm not a MySQL guru, but I have spent a whole lot of time optimising queries on other DBs. It looks to me like the optimiser hasn't worked out that it should start with c and work backwards.
The other thing that strikes me is that there are probably too many tables in the join. Most optimisers struggle with more than 4 tables (because the number of possible join orders grows exponentially, so checking them all becomes impractical).
Having too many tables in a join is the root of 90% of performance problems I have seen.
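A rough sketch of that two-step rewrite, assuming MySQL and the table names from the query above (the temporary-table name is invented):
-- Step 1: materialise the small, filtered driving set once.
CREATE TEMPORARY TABLE tmp_sessions AS
SELECT s.session_id, s.user_id, s.created
FROM fa_sessions s
WHERE s.partner_id = 1;
-- Step 2: run the original joins against tmp_sessions instead of
-- fa_sessions, so each big table is joined to ~100 rows (per the
-- EXPLAIN above) rather than to the whole sessions table.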
Give it a go, and let us know how you get on. If it doesn't help, please post the SQL, table definitions and indices, and I'll take another look.
General Tips
Feel free to look at this answer I gave on general performance tips.
A great resource
MySQL Documentation for EXPLAIN
Well looking at the query would be useful, but there's at least one thing that's obviously worth looking into - the final line shows the ALL type for that part of the query, which is generally not great to see. If the suggested possible key (userfield) makes sense as an added index to table c, it might be worth adding it and seeing if that reduces the rows returned for that table in the search.
Query Plan
The query plan we might hope the optimiser would choose would be something like:
start with sessions where partner_id=1, possibly using an index on partner_id,
join sessions to users, using an index on user_id
join sessions to phones, where status=1, using an index on session_id and possibly status
join sessions to phones again using an index on session_id and phone_id **
join phones to cdr using an index on userfield
join sessions to email_aliases, where status=1 using an index on session_id and possibly status
join sessions to email_aliases again using an index on session_id and email_id **
join email_aliases to email_headers using an index on email_id
** by putting 2 fields in these indices, we enable the optimiser to join to the table using session_id, and immediately find the associated phone_id or email_id without having to read the underlying table. This technique saves us a read, and can save a lot of time.
Indices I would create:
The above query plan suggests these indeces:
fa_sessions ( partner_id, session_id )
fa_users ( user_id )
fa_email_aliases ( session_id, email_id )
fa_email_headers ( email_id )
fa_email_aliases ( session_id, status )
fa_phones ( session_id, status, phone_id )
cdr ( userfield )
Notes
You will almost certainly get acceptable performance without creating all of these.
If any of the tables are small (less than 100 rows) then it's probably not worth creating an index.
fa_email_aliases might work with ( session_id, status, email_id ), depending on how the optimiser works.
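For reference, the suggested indices as DDL (a sketch; the index names are invented, and fa_users(user_id) is omitted on the assumption that it is already the primary key):
CREATE INDEX idx_fa_sessions_partner      ON fa_sessions (partner_id, session_id);
CREATE INDEX idx_fa_email_aliases_email   ON fa_email_aliases (session_id, email_id);
CREATE INDEX idx_fa_email_aliases_status  ON fa_email_aliases (session_id, status);
CREATE INDEX idx_fa_email_headers_email   ON fa_email_headers (email_id);
CREATE INDEX idx_fa_phones_status_phone   ON fa_phones (session_id, status, phone_id);
CREATE INDEX idx_cdr_userfield            ON cdr (userfield);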