I have some data in a table that looks roughly like the following:
table stockData
(
tickId int not null,
timestamp datetime not null,
price decimal(18,5) not null
)
Neither tickId nor timestamp is unique on its own; however, the combination of tickId and timestamp is supposed to be unique.
I have some duplicate data in my table, and I'm attempting to remove it. However, I'm coming to the conclusion that there is not enough information in the given data to tell one row from the other, and therefore no way to delete just one of the duplicate rows. My guess is that I will need to introduce some sort of identity column, which would let me tell one row from the other.
Is this correct, or is there some magic way of deleting one but not both of the duplicate rows with a query?
EDIT: Edited to clarify that the tickId and timestamp combination should be unique, but currently isn't because of the duplicate data.
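For reference, the identity-column route described above would look roughly like this on SQL Server (rowId is just a placeholder name; treat this as a sketch, not a tested script):
-- Add a temporary surrogate key so the duplicates can be told apart
ALTER TABLE stockData ADD rowId INT IDENTITY(1,1);

-- Keep the lowest rowId per (tickId, timestamp) pair and delete the rest
DELETE s
FROM stockData s
WHERE s.rowId NOT IN (
    SELECT MIN(rowId)
    FROM stockData
    GROUP BY tickId, timestamp
);

-- The surrogate key is no longer needed once the duplicates are gone
ALTER TABLE stockData DROP COLUMN rowId;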
Here is a query that will remove duplicates and leave exactly one copy of each unique row. It will work with SQL Server 2005 or higher:
WITH Dups AS
(
SELECT tickId, timestamp, price,
ROW_NUMBER() OVER(PARTITION BY tickid, timestamp ORDER BY (SELECT 0)) AS rn
FROM stockData
)
DELETE FROM Dups WHERE rn > 1
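Note that ORDER BY (SELECT 0) means an arbitrary copy of each duplicate pair survives. If one copy should be preferred, for example the one with the lowest price, only the ORDER BY inside ROW_NUMBER() needs to change; a sketch of that variant:
WITH Dups AS
(
SELECT tickId, timestamp, price,
ROW_NUMBER() OVER(PARTITION BY tickId, timestamp ORDER BY price ASC) AS rn
FROM stockData
)
DELETE FROM Dups WHERE rn > 1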
select distinct * into temp_table from source_table  -- temp_table is created for you
delete from source_table                             -- remove everything, duplicates included
insert into source_table
select * from temp_table                              -- put back exactly one copy of each row
Maybe I'm not understanding your question correctly, but if "tickId" and "timestamp" are guaranteed to be unique then how do you have duplicate data in your table? Could you provide an example or two of what you mean?
However, if you have duplicates across all three columns inside the table, the following script may work. Please test it and make a backup of the database before running it, as I just put it together.
declare @x table
(
tickId int not null,
timestamp datetime not null,
price decimal(18,5) not null
)
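-- Capture one copy of each distinct row: groups with duplicates and groups that are already unique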
insert into @x (tickId, timestamp, price)
select tickId,
timestamp,
price
from stockData
group by tickId,
timestamp,
price
having count(*) > 1
union
select tickId,
timestamp,
price
from stockData
group by tickId,
timestamp,
price
having count(*) = 1
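-- Wipe the original table; the de-duplicated rows are safe in @x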
delete
from stockData
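-- Reload the table from the de-duplicated copy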
insert into stockData (tickId, timestamp, price)
select tickId,
timestamp,
price
from @x
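-- Add the missing primary key so the duplicates cannot come back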
alter table stockData add constraint
pk_StockData primary key clustered (tickid, timestamp)
I have a table variable defined thus
DECLARE #DatesTable TABLE
(
Id uniqueidentifier,
FooId uniqueidentifier,
Date date,
Value decimal (26, 10)
)
Id is always unique, but FooId is duplicated throughout the table. What I would like to do is select * from this table, one row per unique FooId, taking the row with the max(Date). So, if there are 20 rows with 4 unique FooIds, I'd like 4 rows back, picking for each FooId the row where the Date is the largest.
I've tried using GROUP BY, but I kept getting errors about various fields not being contained in either an aggregate function or the GROUP BY clause.
Use a common table expression with row_number():
;WITH cte AS
(
SELECT Id, FooId, Date, Value,
ROW_NUMBER() OVER(PARTITION BY FooId ORDER BY Date DESC) As rn
FROM #DatesTable
)
SELECT Id, FooId, Date, Value
FROM cte
WHERE rn = 1
Often the most efficient method is a correlated subquery:
select dt.*
from #DatesTable dt
where dt.date = (select max(dt2.date) from #DatesTable dt2 where dt2.fooid = dt.fooid);
However, for this to be efficient, you need an index on (fooid, date). In more recent versions of SQL Server, you can declare indexes on table variables; in earlier versions, you can get the index by declaring a primary key on those columns.
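For example, declaring the table variable with a primary key that leads on (FooId, Date) gives it that index (a sketch; Id is appended to the key only to guarantee uniqueness, since FooId plus Date alone may not be unique):
DECLARE @DatesTable TABLE
(
Id uniqueidentifier NOT NULL,
FooId uniqueidentifier NOT NULL,
Date date NOT NULL,
Value decimal (26, 10),
PRIMARY KEY (FooId, Date, Id)
)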
I'm writing a script for removing duplicates from a Redshift table. But since the table has a composite primary key consisting of two columns, I ran into a problem while selecting and filtering values.
Here is what I've implemented so far. It would be easy if I had just one column as the PK, but how do I achieve the same result for a composite key (sale_id, sale_date)?
Especially problematic is the second step: copying distinct rows into a new table with a WHERE condition on the composite key.
Step 1
-- Saving PKs with dupes into a TEMP TABLE
CREATE TEMP TABLE main.duplicate_sales AS
SELECT sale_id, sale_date
FROM main.sales
WHERE sale_date = '2019-05-20'
GROUP BY 1,2
HAVING COUNT(*) > 1;
Step 2
-- Copy distinct rows for the above PKs to a new table
CREATE TEMP TABLE main.sales_new(LIKE main.sales);
INSERT INTO main.sales_new
SELECT DISTINCT *
FROM main.sales
WHERE sale_id, sale_date IN(
SELECT sale_id, sale_date
FROM main.duplicate_sales
);
UPD: The table is very big, so I want to avoid selecting all records. After copying the distinct records into a new table (Step 2), I delete the duplicate rows from the original table (Step 3) and then insert the distinct records back from the new table (Step 4).
Step 3
-- Delete all rows that contain duplicates
DELETE FROM main.sales
WHERE sale_id, sale_date IN(
SELECT sale_id, sale_date
FROM main.duplicate_sales
);
Step 4
-- Insert back distinct records
INSERT INTO main.sales
SELECT *
FROM main.sales_new;
What about just taking the distinct values of sale_id and sale_date?
create table table_name_new as select distinct sale_id, sale_date
from main.sales;
I am rather confused by your question and what happens to the rest of the columns. However, EXISTS might be sufficient to replace your current second step:
INSERT INTO main.sales_new
SELECT DISTINCT s.*
FROM main.sales s
WHERE EXISTS (SELECT 1
FROM main.duplicate_sales ds
WHERE ds.sale_id = s.sale_id AND
ds.sale_date = s.sale_date
);
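The same pattern can presumably be applied to the DELETE in Step 3 (a sketch, not tested against Redshift):
DELETE FROM main.sales
WHERE EXISTS (SELECT 1
              FROM main.duplicate_sales ds
              WHERE ds.sale_id = sales.sale_id AND
                    ds.sale_date = sales.sale_date
             );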
Problem: Find the most recent record, based on the (created) column, for each (linked_id) value across multiple tables; the results should include (user_id, MAX(created), linked_id). The query must also be usable with a WHERE clause to find a single record based on the (linked_id).
There are actually several tables in question, but here are three so you can get an idea of the structure (there are several other columns in each table that have been omitted since they are not to be returned).
CREATE TABLE em._logs_adjustments
(
id serial NOT NULL,
user_id integer,
created timestamp with time zone NOT NULL DEFAULT now(),
linked_id integer,
CONSTRAINT _logs_adjustments_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE TABLE em._logs_assets
(
id serial NOT NULL,
user_id integer,
created timestamp with time zone NOT NULL DEFAULT now(),
linked_id integer,
CONSTRAINT _logs_assets_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
CREATE TABLE em._logs_condition_assessments
(
id serial NOT NULL,
user_id integer,
created timestamp with time zone NOT NULL DEFAULT now(),
linked_id integer,
CONSTRAINT _logs_condition_assessments_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
Here is the query I'm currently using, with a small hack to get around the need for user_id in the GROUP BY clause; if possible, the array_agg should be removed.
SELECT MAX(MaxDate), linked_id, (array_agg(user_id ORDER BY MaxDate DESC))[1] AS user_id FROM (
SELECT user_id, MAX(created) as MaxDate, asset_id AS linked_id FROM _logs_assets
GROUP BY asset_id, user_id
UNION ALL
SELECT user_id, MAX(created) as MaxDate, linked_id FROM _logs_adjustments
GROUP BY linked_id, user_id
UNION ALL
SELECT user_id, MAX(created) as MaxDate, linked_id FROM _logs_condition_assessments
GROUP BY linked_id, user_id
) as subQuery
GROUP BY linked_id
ORDER BY linked_id DESC
I get the desired results, but I don't believe this is the right way to do it, especially since array_agg is being used when it shouldn't be, and some tables can have upwards of 1.5 million records, making the query take 10-15+ seconds to run. Any help/steering in the right direction is much appreciated.
Use DISTINCT ON. From the PostgreSQL documentation:
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
select distinct on (linked_id) created, linked_id, user_id
from (
select user_id, created, asset_id as linked_id
from _logs_assets
union all
select user_id, created, linked_id
from _logs_adjustments
union all
select user_id, created, linked_id
from _logs_condition_assessments
) s
order by linked_id desc, created desc
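To also cover the requirement of looking up a single linked_id, the same statement can take a filter, ideally pushed into each branch of the union so only the relevant rows are scanned (a sketch; 123 is a placeholder value):
select distinct on (linked_id) created, linked_id, user_id
from (
    select user_id, created, asset_id as linked_id
    from _logs_assets
    where asset_id = 123
    union all
    select user_id, created, linked_id
    from _logs_adjustments
    where linked_id = 123
    union all
    select user_id, created, linked_id
    from _logs_condition_assessments
    where linked_id = 123
) s
order by linked_id desc, created desc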
I have this SQL query in SQL Azure:
SELECT some_field, max(primary_key) FROM table GROUP BY some_field
The table currently has over 6 million rows. An index on (some_field ASC, primary_key DESC) exists. The primary_key field is incremental. There are about 700 distinct values of some_field. This SELECT takes at least 30 seconds.
There are only inserts into this table, no updates or deletes.
I could create a separate table to store some_field and the maximal value of primary_key and write a trigger to maintain it, but I am looking for a more elegant solution. Is there one?
Don't know if this will be performant, but you can give it a shot...
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY some_field ORDER BY primary_key DESC) AS rn
FROM table
)
SELECT *
FROM cte
WHERE rn = 1
Definitely create the secondary table with "some_field" and "highestPK" columns, indexed on the "some_field" column. Build that once up front as a baseline and use that.
Then, whenever new records are inserted into your 6-million-record table, have a simple trigger update your secondary table with something as simple as...
update SecondaryTable
set highestPK = newlyInsertedPKID
where some_field = newlyInsertedSomeFieldValue
This way it stays updated with every insert: since the PK is incremental, the newly inserted PK for a given "some_field" value is always the new highest, and if there is no existing row to update, insert one into the secondary table with the new "some_field" value.
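A rough sketch of such a trigger, assuming the big table is called dbo.BigTable and keeping the column names from the question and answer above (an outline, not a drop-in script):
CREATE TRIGGER trg_BigTable_TrackMax
ON dbo.BigTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- primary_key is incremental, so any newly inserted value is the new maximum for its some_field
    UPDATE st
    SET st.highestPK = i.maxPK
    FROM SecondaryTable AS st
    JOIN (SELECT some_field, MAX(primary_key) AS maxPK
          FROM inserted
          GROUP BY some_field) AS i
        ON i.some_field = st.some_field;

    -- some_field values seen for the first time get a new row instead
    INSERT INTO SecondaryTable (some_field, highestPK)
    SELECT i.some_field, MAX(i.primary_key)
    FROM inserted AS i
    WHERE NOT EXISTS (SELECT 1 FROM SecondaryTable AS st
                      WHERE st.some_field = i.some_field)
    GROUP BY i.some_field;
END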
I have a table t1 on which I run analytic functions. Please consider that this is a Netezza database. This table is an intermediate table, so it has no keys. It is used for ETL/ELT processing before loading data into the final table t2.
Now I want to assign row_number() to each row of t1. Table t1 has a structure similar to the following.
group_id varchar(50)
file_id varchar(50)
rec_num varchar(50)
field_4 varchar(50)
field_5 varchar(50)
field_6 varchar(50)
field_7 varchar(50)
field_8 varchar(50)
Unfortunately, none of the fields listed above is unique on its own. Their combination as a whole row is unique, but individually none of them are.
I am running analytic functions on table t1 repeatedly, 7 times. If I do the following, I don't get the expected results.
create table t3 as select group_id, file_id, rec_num, field_4, dense_rank() over (order by field_4) r1, row_number() over (order by group_id) rn from t1;
create table t4 as select group_id, file_id, rec_num, field_5, dense_rank() over (order by field_5) r2, row_number() over (order by group_id) rn from t1;
In the above queries there is no guarantee that the row_number() assigned in the first query (t3) will be exactly the same row_number() assigned when creating t4.
So my question is: what is the best way to ensure that a row gets assigned exactly the same row_number no matter how many times you run the query (while the output of the other analytic functions changes)?
Hope I was able to express what I wanted to mention; if not, please comment below and I will clarify.
Thank you in advance for taking the time to read, understand and answer.
Cheers
If you want the row_number to be deterministic (assuming that the underlying data does not change, of course), you'd need to specify an ORDER BY that produces a unique ordering. Since, per the question, it takes every column in the table to make a row unique, you'd need to use every column in the ORDER BY. So something like:
row_number() over (order by group_id,
                            file_id,
                            rec_num,
                            field_4,
                            field_5,
                            field_6,
                            field_7,
                            field_8) rn
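Applied to the two statements from the question, the same ORDER BY makes the rn column line up between t3 and t4 (a sketch):
create table t3 as
select group_id, file_id, rec_num, field_4,
       dense_rank() over (order by field_4) r1,
       row_number() over (order by group_id, file_id, rec_num, field_4,
                                   field_5, field_6, field_7, field_8) rn
from t1;

create table t4 as
select group_id, file_id, rec_num, field_5,
       dense_rank() over (order by field_5) r2,
       row_number() over (order by group_id, file_id, rec_num, field_4,
                                   field_5, field_6, field_7, field_8) rn
from t1;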