redshift select distinct returns repeated values - sql

I have a database where each object property is stored in a separate row. The query below does not return distinct values in Redshift, but works as expected when tested in any MySQL-compatible database.
SELECT DISTINCT distinct_value
FROM
(
SELECT
uri,
( SELECT DISTINCT value_string
FROM `test_organization__app__testsegment` AS X
WHERE X.uri = parent.uri AND name = 'hasTestString' AND parent.value_string IS NOT NULL ) AS distinct_value
FROM `test_organization__app__testsegment` AS parent
WHERE
uri IN ( SELECT uri
FROM `test_organization__app__testsegment`
WHERE name = 'types' AND value_uri_multivalue = 'Document'
)
) AS T
WHERE distinct_value IS NOT NULL
ORDER BY distinct_value ASC
LIMIT 10000 OFFSET 0

This is not a bug; the behavior is intentional, though not obvious.
In Redshift you can declare constraints on tables, but Redshift doesn't enforce them, i.e. it allows duplicate values if you insert them. The difference shows up when you run a SELECT DISTINCT query: against a column without a declared primary key, Redshift scans the whole column and returns the unique values; against a column with a primary key constraint, it trusts the constraint and returns the output without filtering for uniqueness. This is how you can get duplicate entries if you inserted them.
Why is this done? Redshift is optimized for large datasets, and it's much faster to copy data if constraint validity doesn't have to be checked for every row that you copy or insert. You can declare a primary key constraint as part of your data model if you want, but you then need to uphold it yourself by removing duplicates or designing the ETL so that none are created.
More information with specific examples in this Heap blog post Redshift Pitfalls And How To Avoid Them

Perhaps you can solve this by using appropriate joins.
For example, say table1 contains duplicate values and I want its values joined to table2, with the join logic depending on your conditions.
I can do something like this:
select distinct table1.col1 from table1 left outer join table2 on table1.col1 = table2.col1
This worked well for me: I got unique values from table1 and the duplicates were removed.

Related

Fastest options for merging two tables in SQL Server

Consider two very large tables, Table A with 20 million rows in, and Table B which has a large overlap with TableA with 10 million rows. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A updating where they already exist.
Both table structures
- Identifier int
- Date DateTime,
- Identifier A
- Identifier B
- General decimal data.. (maybe 10 columns)
I can get the items in Table B that are new, and get the items in Table B that need to be updated in Table A very quickly, but I can't get an update or a delete insert to work quickly. What options are available to merge the contents of TableB into TableA (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out existing records in TableB and running a large update on table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one shot delete of the different values out of TableA that exist in TableB and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you are dealing with two large tables, in-place updates, inserts, or merges can be time-consuming operations. I would recommend using a minimally logged bulk operation to load the desired content into a new table and then performing a table swap:
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
SELECT * FROM dbo.TableB b WHERE NOT EXISTS (SELECT * FROM dbo.TableA a WHERE a.id = b.id)
UNION ALL
SELECT * FROM dbo.TableA a
) d
exec sp_rename 'TableA', 'BackupTableA'
exec sp_rename 'NewTableA', 'TableA'
The Simple or at least Bulk-Logged recovery model is highly recommended for this approach. I also assume it has to be done outside business hours, since plenty of objects are missing on the new table and have to be recreated: indexes, default constraints, the primary key, etc.
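A minimal sketch of the build-and-swap idea, using Python's sqlite3 as a stand-in for SQL Server (an assumption made only so the example is runnable; the column names id and val are hypothetical, and sp_rename is replaced by ALTER TABLE ... RENAME). Note that this variant takes every TableB row plus the TableA rows that TableB does not replace, so overlapping rows end up with TableB's values, which is what the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical two-column stand-ins for the real tables.
    CREATE TABLE TableA (id INTEGER, val REAL);
    CREATE TABLE TableB (id INTEGER, val REAL);
    INSERT INTO TableA VALUES (1, 10.0), (2, 20.0);
    INSERT INTO TableB VALUES (2, 99.0), (3, 30.0);

    -- Build the merged result in a new table: every TableB row,
    -- plus the TableA rows that TableB does not replace.
    CREATE TABLE NewTableA AS
    SELECT * FROM TableB
    UNION ALL
    SELECT * FROM TableA AS a
    WHERE NOT EXISTS (SELECT 1 FROM TableB AS b WHERE b.id = a.id);

    -- Swap the tables; the old data stays available as a backup.
    ALTER TABLE TableA RENAME TO BackupTableA;
    ALTER TABLE NewTableA RENAME TO TableA;
""")

merged = conn.execute("SELECT id, val FROM TableA ORDER BY id").fetchall()
print(merged)  # [(1, 10.0), (2, 99.0), (3, 30.0)]
```

As in the SQL Server version, indexes and constraints are not carried over to the new table and would have to be recreated after the swap.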
A MERGE is probably your best bet if you want to do both inserts and updates. Since the rows from TableB are being merged into TableA, TableA is the target:
MERGE #TableA AS Tgt
USING (SELECT * FROM #TableB) Src
ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
INSERT (Identifier, Date, ...)
VALUES (Src.Identifier, Src.Date, ...);
Note that the merge statement must be terminated with a ;

Counting rows where a VARBINARY column IS NOT NULL takes a lot of time

This query
SELECT COUNT(*)
FROM Table
WHERE [Column] IS NOT NULL
takes a lot of time. The table has 5000 rows, and the column is of type VARBINARY(MAX).
What can I do?
Your query needs to do a table scan on a column that can potentially be very large without any way to index it. There isn't much you can do to fix this without changing your approach.
One option is to split the table into two tables. The first table could have all the details you have now in it and the second table would have just the file. You can make this a 1-1 table to ensure data is not duplicated.
You would only add the binary data as needed into the second table. If it is not needed anymore, you simply delete the record. This will allow you to simply write a JOIN query to get the information you are looking for.
SELECT
COUNT(*)
FROM dbo.Table1
INNER JOIN dbo.Table2
ON Table1.Id = Table2.Id

Using EXCEPT clause in PostgreSQL

I am trying to use the EXCEPT clause to retrieve data from a table. I want to get all the rows from table1 except the ones that exist in table2.
As far as I understand, the following would not work:
CREATE TABLE table1(pk_id int, fk_id_tbl2 int);
CREATE TABLE table2(pk_id int);
Select fk_id_tbl2
FROM table1
Except
Select pk_id
FROM table2
The only way I can use EXCEPT seems to be to select from the same tables or select columns that have the same column name from different tables.
Can someone please explain how best to use the EXCEPT clause?
Your query seems perfectly valid:
SELECT fk_id_tbl2 AS some_name
FROM table1
EXCEPT -- you may want to use EXCEPT ALL
SELECT pk_id
FROM table2;
Column names are irrelevant to the query. Only data types must match. The output column name of your query is fk_id_tbl2, just because it's the column name in the first SELECT. You can use any alias.
What's often overlooked: the subtle differences between EXCEPT (which folds duplicates) and EXCEPT ALL - which keeps all individual unmatched rows.
More explanation and other ways to do the same, some of them much more flexible:
Select rows which are not present in other table
Details for EXCEPT in the manual.
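A small runnable sketch of the point about column names, using Python's sqlite3 instead of PostgreSQL (an assumption for portability; the sample rows are made up). SQLite has no EXCEPT ALL, so only the duplicate-folding EXCEPT is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (pk_id INTEGER, fk_id_tbl2 INTEGER);
    CREATE TABLE table2 (pk_id INTEGER);
    -- Made-up sample data: 30 appears twice in table1 and never in table2.
    INSERT INTO table1 VALUES (1, 10), (2, 20), (3, 30), (4, 30);
    INSERT INTO table2 VALUES (10), (20);
""")

# The two SELECTs use different column names; only the types have to line up.
cur = conn.execute("""
    SELECT fk_id_tbl2 AS some_name FROM table1
    EXCEPT
    SELECT pk_id FROM table2
""")
column_name = cur.description[0][0]  # taken from the first SELECT
rows = cur.fetchall()

print(column_name, rows)  # some_name [(30,)]
```

The value 30 comes back once even though table1 holds it twice, because plain EXCEPT folds duplicates.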

What's the best way to union two queries with distinct values in a single column, prioritizing the first query?

Needs to be database-agnostic between Oracle and SQL Server, although I wouldn't mind hearing SQL Server-specific examples as well.
I'm sure the title isn't clear at all, so let me explain what I'm thinking. I'm thinking of two queries. The first might pull in a bunch of data from a given table, including primary keys. The second would just pull in every primary key and leave all other columns blank.
Then I'd want to union them together in such a way that whenever a primary key is missing in the first query, the row from the second query gets pulled in. Otherwise, if the primary key exists in the first query, the row from the second query is ignored.
Quick example:
First query pulls in two columns (first is primary key):
1 1
2 1
Second query pulls in :
1 NULL
2 NULL
3 NULL
So I would want the whole query to pull in:
1 1
2 1
3 NULL
What's the best way to pull this off, performance-wise? Consider an example where there might be a very large number of rows and columns, and the first query might be pretty performance-intensive (although the second of course should always be straightforward, just pulling in primary keys from a list and filling the rest of the columns out with either NULLs or static values).
It sounds to me that you want to use a FULL OUTER JOIN on the two tables or queries:
select
coalesce(q1.col1, q2.col1) col1,
coalesce(q1.col2, q2.col2) col2
from query1 q1
full outer join query2 q2
on q1.col1 = q2.col1;
See SQL Fiddle with Demo.
This will join the two queries on your primary key column (col1 in the sample query), then you can use COALESCE on the columns to return the first non-null value for col1, col2, etc.
You can't use a plain UNION here, since SQL will consider the rows 1, 1 and 1, NULL to be distinct and keep both.
Not knowing your schema, I would try the following in pseudocode:
select *
from query_1
union all
select PK, NULL -- one NULL (or static value) per remaining column of query_1
from query_2
where query_2.PK not in (select PK from query_1)
This will only return the primary keys in query_2 that are not in query_1 and get you a clean union where the query_1 results are prioritized over query_2 results. Selecting just the primary keys for the first query should be quick and easy, but if that isn't the case let me know and I can try to come up with a more complicated query given your schema.
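Here is a rough, runnable version of that idea against the sample data from the question, using Python's sqlite3 as a neutral engine (an assumption; the table and column names query_1, query_2, pk and col2 are placeholders). It uses NOT EXISTS rather than NOT IN, because NOT IN returns no rows at all if the subquery ever yields a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Placeholder tables standing in for the two queries.
    CREATE TABLE query_1 (pk INTEGER, col2 INTEGER);
    CREATE TABLE query_2 (pk INTEGER, col2 INTEGER);
    INSERT INTO query_1 VALUES (1, 1), (2, 1);
    INSERT INTO query_2 VALUES (1, NULL), (2, NULL), (3, NULL);
""")

# Take everything from query_1, then only the query_2 rows
# whose primary key query_1 did not already return.
rows = conn.execute("""
    SELECT pk, col2 FROM query_1
    UNION ALL
    SELECT q2.pk, q2.col2 FROM query_2 AS q2
    WHERE NOT EXISTS (SELECT 1 FROM query_1 AS q1 WHERE q1.pk = q2.pk)
    ORDER BY pk
""").fetchall()

print(rows)  # [(1, 1), (2, 1), (3, None)]
```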

Check for complete duplicate rows in a large table

My original question with all the relevant context can be found here:
Adding a multi-column primary key to a table with 40 million records
I have a table with 40 million rows and no primary key. Before I add the primary key, I would like to check if the table has any duplicate entries. When I say duplicate entries, I don't just mean duplicate on particular columns. I mean duplicates on entire rows.
I was told in my last question that I can do an EXISTS query to determine duplicates. How would I do that?
I am running PostgreSQL 8.1.22. (Got this info by running select version()).
To find whether any full duplicate exists (identical on all columns), this is probably the fastest way:
SELECT EXISTS (
SELECT 1
FROM tbl t
NATURAL JOIN tbl t1
WHERE t.ctid <> t1.ctid
)
NATURAL JOIN is a very convenient shorthand for the case because (quoting the manual here):
NATURAL is shorthand for a USING list that mentions all columns in the
two tables that have the same names.
EXISTS is probably fastest, because Postgres stops searching as soon as the first duplicate is found. Since you most probably don't have an index covering the whole row and your table is huge, this will save you a lot of time.
Be aware that NULL is never considered identical to another NULL. If you have NULL values and consider them identical, you'd have to do more.
ctid is a system column that can be (ab-)used as ad-hoc primary key, but cannot replace an actual user-defined primary key in the long run.
The outdated version 8.1 seems to have no <> operator defined for a ctid. Try casting to text:
SELECT EXISTS (
SELECT 1
FROM tbl t
NATURAL JOIN tbl t1
WHERE t.ctid::text <> t1.ctid::text
)
Shouldn't something like this do the job?
SELECT ALL_COLUMNS[except unique ID],
count(*) AS Dupl
FROM table
GROUP BY ALL_COLUMNS[except unique ID]
HAVING count(*) > 1;
I'm not sure it's the most efficient way, but a count greater than 1 means you have at least two identical rows.
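A quick sketch of the GROUP BY approach with the filter placed in a HAVING clause, run through Python's sqlite3 rather than PostgreSQL 8.1 (an assumption; the two-column table tbl and its rows are made up). One thing worth noticing: GROUP BY puts NULLs into the same group, so rows that differ only by "NULL vs NULL" are reported as duplicates here, unlike with the NATURAL JOIN comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Made-up table with no primary key and some fully duplicated rows.
    CREATE TABLE tbl (a INTEGER, b TEXT);
    INSERT INTO tbl VALUES (1, 'x'), (1, 'x'), (2, NULL), (2, NULL), (3, 'z');
""")

# Group on every column; any group with more than one row is a full duplicate.
dupes = conn.execute("""
    SELECT a, b, COUNT(*) AS dupl
    FROM tbl
    GROUP BY a, b
    HAVING COUNT(*) > 1
    ORDER BY a
""").fetchall()

print(dupes)  # [(1, 'x', 2), (2, None, 2)]
```

Unlike the EXISTS query, this has to aggregate the whole table, but it also tells you which rows are duplicated and how often.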