Most efficient way to SELECT DISTINCT ColA FROM LargeTableWithFewValuesForColA - sql

I have a large table (millions of rows).
I often have to get the DISTINCT values of certain columns. In my case, those columns actually have very few distinct values (from a few to a few dozen).
What is the most efficient way of doing this?

Add an index on the column and then run:
select distinct column
from t;
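To create that index, a minimal sketch (t is the answer's table; col is a placeholder column name):
create index t_col_idx on t (col);
With only a handful of distinct values, the database can often answer the distinct query from this index alone instead of scanning the whole table.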

To add to Gordon's answer: in large databases you could partition your data in addition to the index. Partitioning splits a table by ranges of a column, conceptually like this:
Table_1: rows where id < 1000
Table_2: rows where id >= 1000
Actual table = Table_1 + Table_2
This is just a sample to illustrate the idea. A partition is not an extra copy of the data; it is the same table, just split up on the basis of the partitioning column.
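A sketch of the idea in PostgreSQL's declarative partitioning syntax (PostgreSQL 11 or later; all names are hypothetical):
CREATE TABLE actual_table (
    id  integer,
    val text
) PARTITION BY RANGE (id);

CREATE TABLE table_1 PARTITION OF actual_table
    FOR VALUES FROM (MINVALUE) TO (1000);

CREATE TABLE table_2 PARTITION OF actual_table
    FOR VALUES FROM (1000) TO (MAXVALUE);
Queries that filter on id can then read only the relevant partition.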

Related

Oracle PL/SQL: SELECT DISTINCT FIELD1, FIELD2 FROM TWO ADJACENT PARTITIONS OF A PARTITIONED TABLE

I have a partitioned table with a field MY_DATE which is always and only the first day of EVERY month from year 1999 to year 2017.
For example, it contains records with 01/01/2015, 01/02/2015, ..., 01/12/2015, as well as 01/01/1999, 01/02/1999, and so on.
The field MY_DATE is the partitioning field.
I would like to copy, IN THE MOST EFFICIENT WAY, the distinct values of field2 and field3 of two adjacent partitions (month M and month M-1) to another table, in order to find the distinct pairs of (field2, field3) across the two dates.
EXCHANGE PARTITION works only if the destination table is not partitioned, but when copying the data of the second, adjacent partition, I receive the error:
"ORA-14099: all rows in table do not qualify for specified partition"
I am using the statement:
ALTER TABLE MY_USER.MY_PARTITIONED_TABLE EXCHANGE PARTITION PART_P201502 WITH TABLE MY_USER.MY_TABLE
Of course MY_PARTITIONED_TABLE and MY_TABLE have the same fields, but the first is partitioned as described above.
Please suppose that MY_PARTITIONED_TABLE is a huge table with about 500 million records.
The goal is to find the distinct pairs of (field2, field3) values across the two adjacent partitions.
My approach was: copy the data of the partition M, copy the data of the partition M-1, and then SELECT DISTINCT FIELD2, FIELD3 from DESTINATION_TABLE.
Thank you very much for considering my request.
I would like to copy, ...
Please note that EXCHANGE PARTITION performs no copy, but an exchange, i.e. the contents of the partition of the big table and of the temporary table are switched.
If you perform this twice for two different partitions and the same temp table, you get exactly the error you received.
To copy (i.e. extract the data without changing the big table) you may use:
create table tab1 as
select * from bigtable partition (partition_name1);

create table tab2 as
select * from bigtable partition (partition_name2);
Your source table stays unchanged; once you are done, simply drop the two temp tables. You only need additional space for the two partitions.
Maybe you can even perform your query without copying the data:
with tmp as (
select * from bigtable partition (partition_name1)
union all
select * from bigtable partition (partition_name2)
)
select ....
from tmp;
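For the stated goal, the final query might look like this (a sketch using the question's field names):
with tmp as (
  select * from bigtable partition (partition_name1)
  union all
  select * from bigtable partition (partition_name2)
)
select distinct field2, field3
from tmp;
This reads only the two partitions of interest and needs no staging table at all.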
Good luck!

redshift select distinct returns repeated values

I have a database where each object property is stored in a separate row. The attached query does not return distinct values in a Redshift database, but works as expected when tested in any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0
This is not a bug and the behavior is intentional, though not straightforward.
In Redshift, you can declare constraints on tables, but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The difference here is that when you run a SELECT DISTINCT query against a column that doesn't have a declared primary key, Redshift scans the whole column and builds the list of unique values, whereas if you run the same query against a column that has a primary key constraint, it just returns the output without performing the unique-list filtering. This is how duplicate entries can come back if you have inserted them.
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if constraint validity doesn't have to be checked for every row that is copied or inserted. You can still declare a primary key constraint as part of your data model, but you then need to uphold it explicitly, by removing duplicates or by designing your ETL so that none are introduced.
More information with specific examples is in the Heap blog post Redshift Pitfalls And How To Avoid Them.
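A small sketch of the pitfall (table and column names hypothetical):
CREATE TABLE t (
    id  INT PRIMARY KEY,  -- declared, but NOT enforced by Redshift
    val VARCHAR(16)
);

INSERT INTO t VALUES (1, 'a');
INSERT INTO t VALUES (1, 'b');  -- accepted despite the duplicate key

SELECT DISTINCT id FROM t;  -- may return 1 twice: the planner trusts the declared key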
Perhaps you can solve this by using appropriate joins.
For example, I had duplicate values in table1 and wanted the values of table1 by joining it to table2, with whatever logic fits your conditions for joining the two tables, so I did something like this:
select distinct table1.col1
from table1
left outer join table2 on table1.col1 = table2.col1
This worked very well for me: I got unique values from table1 and could remove the duplicates.

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but it seems this adds extra overhead if the number of distinct values is substantially less than the number of rows the index covers. In my case, only about 1 in 30 values is distinct.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
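A minimal sketch of that setup, assuming the days table from the question and PostgreSQL 9.5+ for ON CONFLICT (constraint, function, and trigger names are hypothetical):
ALTER TABLE data ADD CONSTRAINT data_day_fkey
   FOREIGN KEY (day) REFERENCES days (day);

CREATE FUNCTION days_track()
   RETURNS trigger
   LANGUAGE plpgsql AS
$$
BEGIN
   INSERT INTO days (day) VALUES (NEW.day)
   ON CONFLICT (day) DO NOTHING;  -- day already tracked
   RETURN NEW;
END
$$;

CREATE TRIGGER data_days_track
BEFORE INSERT ON data              -- BEFORE, so the FK check sees the parent row
FOR EACH ROW EXECUTE PROCEDURE days_track();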
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT day FROM data ORDER BY 1 LIMIT 1
   )
   UNION ALL
   SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
   FROM   cte c
   WHERE  c.day IS NOT NULL  -- exit condition
   )
SELECT day FROM cte
WHERE  day IS NOT NULL;  -- drop the final NULL row produced by the last recursion step
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
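For instance (index name hypothetical):
CREATE INDEX data_day_idx ON data (day);
The recursive query then performs one tiny index lookup per distinct day instead of scanning all rows.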
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index

How expensive is select distinct * query

In SQL Server 2012, I have a table with more than 25 million rows, including duplicates. The table doesn't have a unique index; it only has a non-clustered index. I want to eliminate the duplicates, so I'm thinking of the below:
select distinct * into #temp_table from primary_table;
truncate table primary_table;
insert into primary_table select * from #temp_table;
I wanted to know how expensive the select distinct * query is. If my procedure above is very expensive, I would like to know if there is an alternative way.
I don't know how expensive it is, but an alternate way is to create another table with a primary key, insert all the data there, and silently reject the duplicates, as described here:
http://web.archive.org/web/20180404165346/http://sqlblog.com:80/blogs/paul_white/archive/2013/02/01/a-creative-use-of-ignore-dup-key.aspx
Basically, this uses IGNORE_DUP_KEY.
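A sketch of that approach (column names hypothetical; the real table would list all its columns and key them as appropriate):
CREATE TABLE primary_table_dedup (
    col1 INT          NOT NULL,
    col2 VARCHAR(100) NOT NULL,
    CONSTRAINT pk_dedup PRIMARY KEY CLUSTERED (col1, col2)
        WITH (IGNORE_DUP_KEY = ON)  -- duplicates are discarded with a warning, not an error
);

INSERT INTO primary_table_dedup (col1, col2)
SELECT col1, col2 FROM primary_table;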

Delete duplicate records in oracle table : size is 389 GB

I need to delete duplicate records from the table. The table contains 33 columns; of those, only PK_NUM is a primary key column. As PK_NUM contains unique values, for each set of duplicates we can keep either the min or the max PK_NUM.
Total records in the table : 1766799022
Distinct records in the table : 69237983
Duplicate records in the table : 1697561039
Column details :
4 : Date data type
4 : Number data type
1 : Char data type
24 : Varchar2 data type
Size of table : 386 GB
DB details : Oracle Database 11g EE::11.2.0.2.0 ::64bit Production
Sample data :
col1 ,col2,col3
1,ABC,123
2,PQR,456
3,ABC,123
Expected data should contains only 2 records:
col1,col2,col3
1,ABC,123
2,PQR,456
*1 can be replaced by 3, and vice versa.
My plan here is to:
Pull the distinct records and store them in a backup table (i.e. by using insert into ... select).
Truncate the existing table and move the records from the backup table back into it.
As the data size is huge, I want to know:
What is the optimal SQL for retrieving the distinct records?
Any estimate on how long the insert into ... select and the truncate of the existing table will take?
Please do let me know if there is any other, better way to achieve this. My ultimate goal is to remove the duplicates.
One option for making this memory-efficient is to insert (nologging, append) all of the rows into a table that is hash partitioned on the list of columns on which duplicates are to be detected, or, if there is a limitation on the number of columns, then on as many as you can use (aiming for those with maximum selectivity). Use something like 1024 partitions, and each one will ideally be around 0.4 GB.
You have then isolated all of the potential duplicates for each row into the same partition, and standard methods for deduplication will run on each partition without as much memory consumption.
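A sketch of creating such a staging table (a guess at the Oracle hash-partitioned CTAS syntax; the source table name is hypothetical, and the hash key should list the real duplicate-detection columns):
CREATE TABLE temp_table
NOLOGGING
PARTITION BY HASH (col1, col2, col3)
PARTITIONS 1024
AS
SELECT * FROM original_table;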
So for each partition you can do something like ...
insert /*+ append */ into new_table
select *
from   temp_table partition (p1) t1
where  not exists (
       select null
       from   temp_table partition (p1) t2
       where  t1.col1 = t2.col1 and
              t1.col2 = t2.col2 and
              t1.col3 = t2.col3 and
              ... etc ...
              t1.rowid < t2.rowid);  -- rowid, not rownum: keeps exactly one row per duplicate group
The key to good performance here is that the hash table created to perform the anti-join in that query, which is going to be nearly as big as the partition itself, must be able to fit in memory. So if you can manage a 2 GB sort area, you need at least 389/2 = approx. 200 table partitions. Round up to the nearest power of two, so make it 256 table partitions in that case.
try this:
rename table_name to table_name_dup;
and then:
create table table_name
as
select
  min(col1) as col1
, col2
, col3
from table_name_dup
group by
  col2
, col3;
As far as I know, not much temp tablespace is used, since the whole GROUP BY takes place in the target tablespace where the new table is created. Once finished, you can just drop the table with the duplicates:
drop table table_name_dup;