How expensive is select distinct * query - sql

In SQL Server 2012, I have a table with more than 25 million rows that contains duplicates. The table doesn't have a unique index; it only has a non-clustered index. I want to eliminate the duplicates, so I'm thinking of the following:
select distinct * into #temp_table from primary_table
truncate table primary_table
insert into primary_table select * from #temp_table
I want to know how expensive a SELECT DISTINCT * query is. If the procedure above is very expensive, is there an alternative approach?

I don't know how expensive it is, but an alternative way is to create another table with a primary key, insert all the data there, and silently reject the duplicates, as described here:
http://web.archive.org/web/20180404165346/http://sqlblog.com:80/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
Basically, this uses IGNORE_DUP_KEY.
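A minimal sketch of the idea (column names and types are assumptions for illustration; the key should cover the columns that define a duplicate):
-- Column names/types are illustrative, not from the original table.
CREATE TABLE dbo.primary_table_dedup
(
    col1 int NOT NULL,
    col2 nvarchar(50) NOT NULL,
    CONSTRAINT PK_primary_table_dedup
        PRIMARY KEY (col1, col2) WITH (IGNORE_DUP_KEY = ON)
);

-- Rows with duplicate keys are silently discarded instead of raising an error.
INSERT INTO dbo.primary_table_dedup (col1, col2)
SELECT col1, col2 FROM dbo.primary_table;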

Related

What is the most efficient way to implement a Redshift merge/upsert operation

I am in the process of writing a custom upsert function for a specific use case for a Redshift table. In their docs, AWS suggests two methods which I'm drawing inspiration from. Here is what I want to accomplish:
Insert any new rows to an existing table, but only if they don't already exist.
There is never a need to delete or modify an existing row (for my use case)
I have so far come up with two separate ways to do this, but I'm wondering what the trade-offs of each could be.
Using an EXCEPT query to insert only new rows from a temp table:
insert into persisted_table (
select *
from temp_table
except
select *
from persisted_table
);
Store the results of a UNION ALL query on the temp table with the persisted table, and use that as the persisted table:
insert into new_table (
select *
from temp_table
union
select *
from persisted_table
);
alter table persisted_table rename to old_persisted_table_marked_for_deletion;
alter table new_table rename to persisted_table;
I'm aware that UNION ALL is slow and generally not recommended for bulk/large-scale operations. Apart from that, though, are there any arguments that could influence this decision?
The first advice I'd give is to remember that Redshift is a cluster. Whatever process you select, if the data is large, you will want the comparison that determines whether a row already exists to stay "on node". You will want the tables in question to be distributed by the same key.
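For illustration, a minimal sketch of same-key distribution in Redshift DDL (the table layout and the id column are assumptions, not from the post):
-- Both tables distributed on the same key so the existence check stays on-node.
CREATE TABLE persisted_table (
    id      BIGINT,
    payload VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (id);

CREATE TEMP TABLE temp_table (
    id      BIGINT,
    payload VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (id);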
Next, I would think about what the keys into the data are. The processes you laid out compare all columns. Is this needed? If a subset of columns can serve as the key, this can make things more efficient:
insert into persisted_table (
    select a.*
    from temp_table a
    left join persisted_table b on a.{key} = b.{key}
    where b.{key} is null
);
Hopefully these aspects will help your decision process.

Most efficient way to SELECT DISTINCT ColA FROM LargeTableWithFewValuesForColA

I have a large table (millions of rows).
I often have to get DISTINCT values of some columns. In my case, those columns actually have very few distinct values (a few to a few dozen).
What is the most efficient way of doing this?
Add an index on the column and then run:
select distinct column
from t;
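Using the names from the question title (the index name itself is an assumption), the index could look like this:
-- A narrow nonclustered index on the column means the engine only has to scan
-- the index, not the whole wide table, to produce the DISTINCT values.
CREATE INDEX IX_LargeTable_ColA
    ON LargeTableWithFewValuesForColA (ColA);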
To add to Gordon's answer: in large databases you could partition your data in addition to the index. Partitioning of data works like this:
Table_1 (id): select distinct records from table where id < 1000
Table_2 (id): select distinct records from table where id >= 1000
Actual table = Table_1 + Table_2 (id)
This is just a sample to illustrate the idea: a partition is not an extra copy; it's actually the same table (or database), just split up on the basis of the partitioning column.
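As a hypothetical sketch, a SQL Server range-partitioning setup on the sample id column could look like this (all object names are assumptions):
-- Rows with id <= 1000 land in partition 1, the rest in partition 2.
CREATE PARTITION FUNCTION pf_id_range (int)
    AS RANGE LEFT FOR VALUES (1000);

CREATE PARTITION SCHEME ps_id_range
    AS PARTITION pf_id_range ALL TO ([PRIMARY]);

CREATE TABLE dbo.t_partitioned
(
    id   int          NOT NULL,
    colA nvarchar(50) NULL
) ON ps_id_range (id);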

Fastest options for merging two tables in SQL Server

Consider two very large tables: Table A with 20 million rows, and Table B with 10 million rows that has a large overlap with Table A. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating those that already exist.
Both tables have the structure:
- Identifier int
- Date DateTime
- Identifier A
- Identifier B
- General decimal data (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of Table B into Table A (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out the existing records in Table B and running a large update on Table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index in place.
I've also tried doing a one-shot delete of the differing values out of Table A that exist in Table B, and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend using a bulk-logged technique to load the desired content into a new table and then perform a table swap.
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
SELECT * FROM dbo.TableB b WHERE NOT EXISTS (SELECT * FROM dbo.TableA a WHERE a.id = b.id)
UNION ALL
SELECT * FROM dbo.TableA a
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
Simple or at least Bulk-Logged recovery is highly recommended for this approach. Also, I assume it has to be done outside business hours, since plenty of objects will need to be recreated on the new table: indexes, default constraints, the primary key, etc.
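For example, switching the recovery model around the load might look like this (MyDb is a placeholder database name):
-- BULK_LOGGED lets SELECT INTO be minimally logged.
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;
-- ... run the SELECT INTO and the sp_rename swap here ...
ALTER DATABASE MyDb SET RECOVERY FULL;
-- A transaction log backup afterwards is typically recommended following bulk-logged operations.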
A MERGE is probably your best bet, if you want both inserts and updates.
MERGE TableA AS Tgt
USING (SELECT * FROM TableB) AS Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
    UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
    INSERT (Identifier, Date, ...)
    VALUES (Src.Identifier, Src.Date, ...);
Note that the MERGE statement must be terminated with a semicolon (;).

table index for DISTINCT values

In my stored procedure, I need the "unique" values of one of the columns. I am not sure if I should add an index, and if I should, what type of index I should apply to the table for better performance. Not to be very specific: the same case arises when I retrieve distinct values of multiple columns.
The column is of string (NVARCHAR) type.
e.g.
select DISTINCT Column1 FROM Table1;
OR
select DISTINCT Column1, Column2, Column3 FROM Table1;
An index on these specific columns could improve performance a bit, simply because it requires SQL Server to scan less data (just these specific columns, nothing else). Other than that, a scan will always be done. Another option is to create an indexed view if you need distinct values from that table:
CREATE VIEW Test
WITH SCHEMABINDING
AS
SELECT Column1, COUNT_BIG(*) AS UselessColumn
FROM Table1
GROUP BY Column1;
GO
CREATE UNIQUE CLUSTERED INDEX PK_Test ON Test (Column1);
GO
And then you can query it like that:
SELECT *
FROM Test WITH (NOEXPAND);
NOEXPAND is a hint that tells SQL Server not to expand the query in the view and to treat the view as a table. Note: this is needed for non-Enterprise editions of SQL Server only.
I recently had the same issue and found it could be overcome using a Columnstore index:
CREATE NONCLUSTERED COLUMNSTORE INDEX [CI_TABLE1_Column1] ON [TABLE1]
([Column1])
WITH (DROP_EXISTING = OFF, COMPRESSION_DELAY = 0)

Selecting the most optimal query

I have a table in an Oracle database called my_table, for example. It is a log-type table. It has an incremental column named "id" and a "registration_number" column which is unique per registered user. Now I want to get the latest change for each registered user, so I wrote the queries below to accomplish this:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? My second question: given that there are sometimes about 70,000 inserts into this table, but usually the number of inserted rows is between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
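If you do experiment, one candidate is a composite index covering both the partitioning and ordering columns of the analytic query (the index name is an assumption):
-- Covers registration_number (PARTITION BY) and id (ORDER BY), so the
-- ROW_NUMBER query can be answered from the index alone.
CREATE INDEX idx_my_table_reg_id ON my_table (registration_number, id);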
To determine which query is faster, check the execution plan and cost; that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They will definitely help you write faster queries.
In cases where you want to insert lots of rows, indexes slow down insertion, because the index also has to be updated with each insert (I would not recommend an index on ID). There are two solutions I can think of for this, sketched after the list:
You can drop the index before insertion and then recreate it after insertion.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit, so there is a trade-off.
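A minimal sketch of both options (index names and columns are assumptions):
-- Option 1: drop the index before the bulk insert and recreate it afterwards.
DROP INDEX idx_my_table_reg;
-- ... bulk insert into my_table here ...
CREATE INDEX idx_my_table_reg ON my_table (registration_number);

-- Option 2: a reverse key index spreads sequential values across leaf blocks,
-- reducing hot-block contention during heavy inserts.
CREATE INDEX idx_my_table_id_rev ON my_table (id) REVERSE;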
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last actions for every user with good performance:
select
  lua.registration_number,
  l.action_id
from
  last_user_action lua,
  my_log l
where
  l.rowid (+) = lua.last_action
A rowid is a physical storage identifier that directly addresses a storage block, so you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action as well, and use one or the other depending on requirements.
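A minimal sketch of that variant, assuming the extra column is called last_action_id (not in the original post); write_log would also need to populate it, e.g. via returning id into a variable:
-- Store the log id as well; unlike a rowid, it survives exports, moves, and restores.
alter table last_user_action add (last_action_id number)
/
select
  lua.registration_number,
  l.action_id
from
  last_user_action lua,
  my_log l
where
  l.id (+) = lua.last_action_id
/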