Dedupe and retain record with most recent timestamp - sql

I'm working with a wide dataset with 500+ columns. The dataset contains a customer ID field and a time-stamp field. I'd like to query the data and end up with a table with only one row per customer ID field where the row retained is the row with the most recent timestamp. The query will be run on a Netezza server if that makes a difference. It seems like I could do this with a sub-query, but I can't seem to get syntax that works.

Here is a typical way to approach this problem:
select t.*
from table t
where not exists (select 1
                  from table t2
                  where t2.customerid = t.customerid and
                        t2.timestamp > t.timestamp
                 );
This rephrases the question to: "Get me all rows from the table where there is no row with the same customer id and a larger timestamp."
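To see the NOT EXISTS pattern run end to end, here is a minimal sketch using Python's built-in sqlite3 module rather than Netezza. The table name visits, the column ts, and the sample rows are made up for illustration; only the shape of the query comes from the answer above.

```python
import sqlite3

# Hypothetical sample data: a few rows per customer with different timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customerid INTEGER, ts TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 09:00:00", "old"),
        (1, "2020-03-15 12:30:00", "new"),
        (2, "2020-02-10 08:00:00", "only"),
        (3, "2020-01-05 10:00:00", "old"),
        (3, "2020-01-06 10:00:00", "new"),
    ],
)

# Keep a row only if no other row for the same customer has a later timestamp.
latest = conn.execute(
    """
    SELECT t.*
    FROM visits t
    WHERE NOT EXISTS (SELECT 1
                      FROM visits t2
                      WHERE t2.customerid = t.customerid
                        AND t2.ts > t.ts)
    ORDER BY t.customerid
    """
).fetchall()
print(latest)
```

Note that if two rows for a customer share the same latest timestamp, both are kept, so a tie-breaker column is needed when exactly one row per customer is required.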

Related

SQL Server : verify that two columns are in same sort order

I have a table with an ID and a date column. It's possible (likely) that when a new record is created, it gets the next larger ID and the current datetime. So if I were to sort by date or I were to sort by ID, the resulting data set would be in the same order.
How do I write a SQL query to verify this?
It's also possible that an older record is modified and the date is updated. In that case, the records would not be in the same sort order. I don't think this happens.
I'm trying to move the data to another location, and if I know that there are no modified records, that makes it a lot simpler.
I'm pretty sure I only need to query those two columns: ID, RecordDate. Other links indicate I should be able to use LAG, but I'm getting an error that it isn't a built-in function name.
In other words, both https://dba.stackexchange.com/questions/42985/running-total-to-the-previous-row and Is there a way to access the "previous row" value in a SELECT statement? should help, but I'm still not able to make that work for what I want.
If you cannot use window functions, you can use a correlated subquery and EXISTS.
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.id < t1.id
                AND t2.recorddate > t1.recorddate);
It'll select all records where another record with a lower ID and a greater timestamp exists. If the result is empty you know that no such record exists and the data is like you want it to be.
Maybe you want to restrict it a bit more by using t2.recorddate >= t1.recorddate instead of t2.recorddate > t1.recorddate. I'm not sure how you want it.
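A small sketch of how this check behaves, run against SQLite through Python's sqlite3 module instead of SQL Server. The four sample rows and the simulated modification of ID 2 are assumptions made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE elbat (id INTEGER PRIMARY KEY, recorddate TEXT)")
conn.executemany(
    "INSERT INTO elbat VALUES (?, ?)",
    [(1, "2021-01-01"), (2, "2021-01-02"), (3, "2021-01-03"), (4, "2021-01-04")],
)

# Rows for which some lower ID carries a later date.
CHECK = """
SELECT *
FROM elbat t1
WHERE EXISTS (SELECT *
              FROM elbat t2
              WHERE t2.id < t1.id
                AND t2.recorddate > t1.recorddate)
ORDER BY t1.id
"""

in_order = conn.execute(CHECK).fetchall()

# Simulate an old record being modified: ID 2 now has the latest date.
conn.execute("UPDATE elbat SET recorddate = '2021-01-10' WHERE id = 2")
out_of_order = conn.execute(CHECK).fetchall()

print(in_order)
print(out_of_order)
```

The query reports the rows that come after the modified one (IDs 3 and 4 here), not the modified row itself, so a non-empty result tells you the orders differ but not directly which record was touched.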
Use this:
SELECT ID, RecordDate FROM tablename t
WHERE
(SELECT COUNT(*) FROM tablename WHERE tablename.ID < t.ID)
<>
(SELECT COUNT(*) FROM tablename WHERE tablename.RecordDate < t.RecordDate);
For each row, it counts how many rows have an ID less than that row's ID and how many rows have a RecordDate less than that row's RecordDate. If the two counts differ, the row is output. The result is all the rows that would not be in the same position after sorting by ID and after sorting by RecordDate.
One method uses window functions:
select count(*)
from (select t.*,
             row_number() over (order by id) as seqnum_id,
             row_number() over (order by date, id) as seqnum_date
      from t
     ) t
where seqnum_id <> seqnum_date;
When the count is zero, then the two columns have the same ordering. Note that the second order by includes id. Two rows could have the same date. This makes the sort stable, so the comparison is valid even when date has duplicates.
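Here is a rough, runnable version of the same idea on SQLite (window functions need SQLite 3.25 or newer), driven from Python's sqlite3 module. The column name recorddate and the sample rows, including the duplicate date on IDs 2 and 3, are assumptions chosen to exercise the tie-breaking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, recorddate TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "2021-01-01"), (2, "2021-01-02"), (3, "2021-01-02"), (4, "2021-01-03")],
)

# Number the rows twice, once by ID and once by (date, ID), and count disagreements.
MISMATCHES = """
SELECT COUNT(*)
FROM (SELECT id,
             ROW_NUMBER() OVER (ORDER BY id) AS seqnum_id,
             ROW_NUMBER() OVER (ORDER BY recorddate, id) AS seqnum_date
      FROM t) AS numbered
WHERE seqnum_id <> seqnum_date
"""

same_order = conn.execute(MISMATCHES).fetchone()[0]

# Simulate a modification: ID 1 now has the latest date.
conn.execute("UPDATE t SET recorddate = '2021-01-09' WHERE id = 1")
after_edit = conn.execute(MISMATCHES).fetchone()[0]

print(same_order, after_edit)
```

Moving one row to the end shifts every row's position in the date ordering, which is why a single modification can make the count jump well above 1.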
The above solutions are all good, but if the IDs increase in steps of exactly 1 and the dates are expected to increase along with them, this should also work:
select modifiedid = t2.id
from yourtable t1
join yourtable t2
  on t1.id = t2.id + 1 and t1.recordDate < t2.recordDate

Cross joining two tables with "using" instead of "on"

I found a SQL query in a book which I am not able to understand. From what I understand there are two tables: date, which has date_id and test_date columns, and transactions, which has date_id and obs_cnt.
select t1.test_date
      ,sum(t2.obs_cnt)
from date t1
cross join
    (transactions join date using (date_id)) as t2
where t1.test_date >= t2.test_date
group by t1.test_date
order by t1.test_date
Can someone help me understand what this code does or what the output will look like?
I understand the obs_cnt variable is being aggregated at a test_date level.
I understand the use of USING in place of ON. But what I don't get is how the date table is being referenced twice. Does it mean it is being joined twice?
"But what I don't get is how the date table is being referenced twice. Does it mean it is being joined twice?"
Yes it is, although it's probably easier to think of t2 as a whole rather than as a function of the date table: t2 is the transactions table, but with the actual date representation (test_date) rather than an ID.
I assume there's actually some context for all of this in the book, but it looks like this will produce one row of output for every row in the date table (t1), in order of test_date. For each row, it totals up the number of observations for all transactions that happened on or before that date, using our transactions-with-date table t2.
"I understand obs_cnt variable is being aggregated at a test_date level."
It's being aggregated against t1 test_date, which is the constraint we're using to select the rows in t2 that are summed.
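To make the running-total behaviour concrete, here is a minimal sketch on SQLite via Python's sqlite3 module. Two things are assumptions of mine: the date table is renamed dates to stay clear of the keyword, and the aliased parenthesised join is written as a subselect, because SQLite does not accept an alias on a bare parenthesised join. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dates (date_id INTEGER PRIMARY KEY, test_date TEXT);
    CREATE TABLE transactions (date_id INTEGER, obs_cnt INTEGER);
    INSERT INTO dates VALUES (1, '2020-01-01'), (2, '2020-01-02'), (3, '2020-01-03');
    INSERT INTO transactions VALUES (1, 5), (2, 3), (2, 2), (3, 10);
    """
)

# t2 is "transactions with their real date"; t1 supplies one output row per date.
running = conn.execute(
    """
    SELECT t1.test_date, SUM(t2.obs_cnt)
    FROM dates t1
    CROSS JOIN (SELECT * FROM transactions JOIN dates USING (date_id)) AS t2
    WHERE t1.test_date >= t2.test_date
    GROUP BY t1.test_date
    ORDER BY t1.test_date
    """
).fetchall()
print(running)
```

Each output row sums every transaction dated on or before that row's test_date, so the second column is a cumulative total.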

Is it possible to query all lines of an multi-line item with different dates, if some of the lines fall before a date you want to pull items after?

Relative newbie here. I only got serious with SQL within the last month or so for work, so I'm still pretty rusty with it.
Is it possible for a query to pull results after a date, but also ones with the same ID that fall before the date?
Sorry for the title gore, but I'm not sure if there's a better way to word it.
I'm trying to run a query that would normally return results from a table that looks like this:
(image of the table omitted)
If I wanted it to return all lines after 10/1/16, it would pull lines 1, 6, 7, 8, 9, 10, but what I want is for it to pull everything for that ID if just one of the lines is past a certain date.
The results should more or less look like this:
...if I were to enter a date of greater than 10/1/16.
Is that possible to do within SQL? For the record, I'm on SQLdbx.
Thanks a ton!
What you want takes a two-step operation:
1. Find all of the Ids that have a Date greater than the date you specify.
2. Find all rows that relate to those Ids.
This can be done in lots of ways. Here is a way using EXISTS that does both steps in one statement:
SELECT *
FROM
    Table t1
WHERE
    EXISTS (SELECT 1 FROM Table t2 WHERE t1.Id = t2.Id AND t2.Date > '2016-10-01')
Using IN:
SELECT *
FROM
    Table
WHERE
    Id IN (SELECT DISTINCT Id FROM Table WHERE Id IS NOT NULL AND Date > '2016-10-01')
Note that I have included Id IS NOT NULL to keep NULLs out of the subquery. With IN this is mostly a precaution; it is NOT IN where a single NULL in the list prevents any row from matching.
Inner self join (DISTINCT stops a row from being returned once per matching t2 row):
SELECT DISTINCT t1.*
FROM
    Table t1
    INNER JOIN Table t2
        ON t1.Id = t2.Id
        AND t2.Date > '2016-10-01'
And there are more yet.
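For a quick feel of how these variants behave, here is a small sketch on SQLite through Python's sqlite3 module. The table name items, the columns line and dt, and the six sample rows are all made up; the original sample table was not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (line INTEGER PRIMARY KEY, id TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (1, "A", "2016-09-01"),  # before the cutoff, but A has a later line
        (2, "A", "2016-10-05"),
        (3, "B", "2016-08-15"),  # B has nothing after the cutoff
        (4, "B", "2016-09-20"),
        (5, "C", "2016-10-02"),
        (6, "C", "2016-11-02"),
    ],
)
cutoff = "2016-10-01"

# EXISTS: every line of an ID that has at least one line after the cutoff.
exists_lines = [
    row[0]
    for row in conn.execute(
        """
        SELECT t1.line
        FROM items t1
        WHERE EXISTS (SELECT 1 FROM items t2
                      WHERE t2.id = t1.id AND t2.dt > ?)
        ORDER BY t1.line
        """,
        (cutoff,),
    )
]

# A self join without DISTINCT returns a row once per matching partner row.
join_lines = [
    row[0]
    for row in conn.execute(
        """
        SELECT t1.line
        FROM items t1
        INNER JOIN items t2 ON t1.id = t2.id AND t2.dt > ?
        ORDER BY t1.line
        """,
        (cutoff,),
    )
]

print(exists_lines)
print(join_lines)
```

ID C has two lines after the cutoff, so a self join without DISTINCT lists each of its lines twice, while EXISTS returns every qualifying line exactly once.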

SQL: Move duplicates to another table where condition

I am quite new to SQL and Stackoverflow, so pardon the layout of my post.
Currently, I am struggling with putting the following workflow into an executable SQL statement:
I have a table containing the following columns:
ID (not unique)
PARTYTYPE (1 or 2)
DATE column
several other, not relevant columns
Now I need to find those observations (rows) that have the same ID and same PARTYTYPE but are not the most recent, i.e. have a date in the DATE column that is less than the most recent for the given combination of PARTYTYPE and ID. The rows that satisfy this condition need to be moved to another table with the same table scheme in order to archive them.
Is there an efficient, yet simple way to accomplish this in SQL?
I have been looking for a long time, but since it involves finding duplicates with certain conditions and inserting it into a table, it is a rather specific problem.
This is what I have so far:
INSERT INTO table_history
select ID, PARTYTYPE, count(*) as count_
from table
group by ID, PARTYTYPE, DATE
having DATE = MAX(DATE)
Any help would be appreciated!
Your description maps almost exactly onto a correlated subquery:
INSERT INTO table_history( . . . )
select t.*
from table t
where date < (select max(date)
              from table t2
              where t2.id = t.id and t2.partytype = t.partytype
             );
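Since the goal is to move the rows rather than just copy them, the INSERT presumably needs a matching DELETE with the same condition. Below is a minimal sketch of that pairing on SQLite via Python's sqlite3 module; the table names parties and parties_history, the column dt, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parties (id INTEGER, partytype INTEGER, dt TEXT);
    CREATE TABLE parties_history (id INTEGER, partytype INTEGER, dt TEXT);
    INSERT INTO parties VALUES
        (1, 1, '2020-01-01'), (1, 1, '2020-02-01'),
        (1, 2, '2020-01-15'),
        (2, 1, '2020-03-01'), (2, 1, '2020-03-05');
    """
)

# True for rows older than the newest row of the same (id, partytype) pair.
OLDER = """
dt < (SELECT MAX(t2.dt)
      FROM parties t2
      WHERE t2.id = parties.id AND t2.partytype = parties.partytype)
"""

# Copy and delete inside one transaction so the archive and the table stay in step.
with conn:
    conn.execute("INSERT INTO parties_history SELECT * FROM parties WHERE " + OLDER)
    conn.execute("DELETE FROM parties WHERE " + OLDER)

archived = conn.execute("SELECT * FROM parties_history ORDER BY id, partytype, dt").fetchall()
current = conn.execute("SELECT * FROM parties ORDER BY id, partytype, dt").fetchall()
print(archived)
print(current)
```

Running both statements in one transaction means a failure halfway through leaves neither a half-filled archive nor missing rows.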

SQL Delete low counts

I have a table with this data:
Id Qty
-- ---
A 1
A 2
A 3
B 112
B 125
B 109
But I'm supposed to only have the max values for each id. Max value for A is 3 and for B is 125. How can I isolate (and delete) the other values?
The final table should look like this :
Id Qty
-- ---
A 3
B 125
Running MySQL 4.1
Oh wait. Got a simpler solution :
I'll select all the max values(group by id), export the data, flush the table, reimport only the max values.
CREATE TABLE tabletemp LIKE table;
INSERT INTO tabletemp SELECT id,MAX(qty) FROM table GROUP BY id;
DROP TABLE table;
RENAME TABLE tabletemp TO table;
Thanks to all !
Try this in SQL Server:
delete o
from tbl o
left outer join
    (select max(qty) anz, id
     from tbl i
     group by i.id) k on o.id = k.id and k.anz = o.qty
where k.id is null
Revision 2 for MySQL... Can anyone check this one?:
delete from tbl
where concat(id,qty) not in
    (select concat(id,anz) from (select max(qty) anz, id
                                 from tbl i
                                 group by i.id) k)
Explanation:
Since I was supposed to not use joins (see the comments about MySQL support for joins in DELETE/UPDATE/INSERT), I moved the subquery into an IN (a, b, c) clause.
Inside an IN clause I can use a subquery, but that query is only allowed to return one field. So in order to filter out all elements that are not the maximum, I need to concat both fields into a single one, so I can return it inside the IN clause. So basically my query inside the IN returns only the ID+QTY of the biggest quantity per ID. To compare it with the main table I also need to do a concat on the outside, so the data for both fields match.
Basically the IN clause contains:
("A3","B125")
Disclaimer: The above query is "evil!" since it uses a function (concat) on fields to compare against. This will cause any index on those fields to become almost useless. You should never formulate a query that way if it is run on a regular basis. I only wanted to try to bend it so it works on MySQL.
Example of this "bad construct":
(Get all orders from the last 2 weeks)
select ... from orders where orderday + 14 > now()
You should always do:
select ... from orders where orderday > now() - 14
The difference is subtle: version 2 only has to do the math once and is able to use the index, whereas version 1 has to do the math for every single row in the orders table, and you can forget about index usage.
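The index point can be checked with a query plan. This is a sketch on SQLite using Python's sqlite3 module, not MySQL, so the planner details differ; orderday is modelled as a plain integer day number and today = 1000 stands in for now(), both of which are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, orderday INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_orders_orderday ON orders (orderday)")
conn.executemany(
    "INSERT INTO orders (orderday, amount) VALUES (?, ?)",
    [(day, 1.0) for day in range(1000)],
)
today = 1000  # hypothetical stand-in for now()

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (today,)).fetchall()
    return " ".join(row[3] for row in rows)

COLUMN_MATH = "SELECT * FROM orders WHERE orderday + 14 > ?"    # math on the column
CONSTANT_MATH = "SELECT * FROM orders WHERE orderday > ? - 14"  # math on the constant

plan_column_math = plan(COLUMN_MATH)
plan_constant_math = plan(CONSTANT_MATH)

# Both forms return the same rows; only the access path differs.
rows_column_math = conn.execute(COLUMN_MATH, (today,)).fetchall()
rows_constant_math = conn.execute(CONSTANT_MATH, (today,)).fetchall()

print(plan_column_math)
print(plan_constant_math)
```

On SQLite the first plan is a full table scan while the second searches idx_orders_orderday, which matches the argument above even though the exact plan wording varies between versions.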
I'd try this:
delete from T
where exists (
    select * from T as T2
    where T2.Id = T.Id
      and T2.Qty > T.Qty
);
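A quick way to try this DELETE is SQLite through Python's sqlite3 module, sketched below with the sample data from the question. SQLite accepts a correlated subquery on the table being deleted from; that this also holds for the asker's MySQL 4.1 is not something this example shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (Id TEXT, Qty INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 112), ("B", 125), ("B", 109)],
)

# Delete every row for which a row with the same Id and a larger Qty exists.
conn.execute(
    """
    DELETE FROM T
    WHERE EXISTS (SELECT *
                  FROM T AS T2
                  WHERE T2.Id = T.Id
                    AND T2.Qty > T.Qty)
    """
)

remaining = conn.execute("SELECT Id, Qty FROM T ORDER BY Id").fetchall()
print(remaining)
```

The row holding the maximum is never deleted, so the outcome does not depend on the order in which the other rows are removed.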
For those who might have a similar question in the future: this might be supported some day (it already is in SQL Server 2005 and later).
It doesn't require a join, and it has advantages over the use of a temporary table if the table has dependencies:
with Tranked(Id,Qty,rk) as (
    select
        Id, Qty,
        rank() over (
            partition by Id
            order by Qty desc
        )
    from T
)
delete from Tranked
where rk > 1;
You'll have to go via another table. Among other things, what makes a single DELETE statement quite impossible here in MySQL is that you can't delete from a table and use the same table in a subquery.
BEGIN;
-- one row per id holding that id's maximum qty
create temporary table tmp_del select id, max(qty) as qty from the_tbl group by id;
-- delete every row whose qty is below its id's maximum
delete the_tbl from the_tbl, tmp_del where
    the_tbl.id = tmp_del.id and the_tbl.qty < tmp_del.qty;
drop table tmp_del;
COMMIT;
MySQL 4.0 and later supports a simple multi-table syntax for DELETE:
DELETE t1 FROM MyTable t1 JOIN MyTable t2 ON t1.id = t2.id AND t1.qty < t2.qty;
This produces a join of each row with a given id to all other rows with the same id, and deletes only the row with the lesser qty in each pairing. After this is all done, the row with the greatest qty per group of id is left not deleted.
If you only have one row with a given id, it still works because a single row is naturally the one with the greatest value.
FWIW, I just tried my solution using MySQL 5.0.75 on a Macbook Pro 2.40GHz. I inserted 1 million rows of synthetic data, with different numbers of rows per "group":
2 rows per id completes in 26.78 sec.
5 rows per id completes in 43.18 sec.
10 rows per id completes in 1 min 3.77 sec.
100 rows per id completes in 6 min 46.60 sec.
1000 rows per id didn't complete before I terminated it.
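One plausible reading of these timings is that the self join produces a number of row pairs that grows quadratically with the group size. The sketch below checks the pair count on a tiny SQLite table via Python's sqlite3 module and then extrapolates to 1 million rows. It assumes every qty within an id is distinct, which the description of the synthetic data does not confirm, and pairs_per_group is a helper name of my own.

```python
import sqlite3

def pairs_per_group(n):
    """Pairs (t1, t2) with t1.qty < t2.qty in a group of n distinct qty values."""
    return n * (n - 1) // 2

# Verify the formula on a small table: 3 ids with 4 distinct qty values each.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (id INTEGER, qty INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(group, qty) for group in range(3) for qty in range(4)],
)
joined_pairs = conn.execute(
    """
    SELECT COUNT(*)
    FROM MyTable t1
    JOIN MyTable t2 ON t1.id = t2.id AND t1.qty < t2.qty
    """
).fetchone()[0]

# Extrapolate to 1 million rows split into groups of k rows per id.
total_rows = 1_000_000
estimated_pairs = {
    k: (total_rows // k) * pairs_per_group(k) for k in (2, 5, 10, 100, 1000)
}

print(joined_pairs)
for k, pairs in estimated_pairs.items():
    print(k, pairs)
```

Under that assumption the join goes from half a million pairs at 2 rows per id to roughly half a billion at 1000 rows per id, which would be consistent with the last run not finishing.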