Using distinct on in subqueries - sql

I noticed that in PostgreSQL the following two queries output different results:
select a.*
from (
select distinct on (t1.col1)
t1.*
from t1
order by t1.col1, t1.col2
) a
where a.col3 = value
;
create table temp as
select distinct on (t1.col1)
t1.*
from t1
order by t1.col1, t1.col2
;
select temp.*
from temp
where temp.col3 = value
;
I guess it has something to do with using distinct on in subqueries.
What is the correct way to use distinct on in subqueries? E.g. can I use it if I don't add a where clause?
Or in queries like
(
select distinct on (a.col1)
a.*
from a
)
union
(
select distinct on (b.col1)
b.*
from b
)

Normally, both examples should return the same result.
I suspect that you are getting different results because the order by clause of your distinct on subquery is not deterministic. That is, there may be several rows in t1 sharing the same col1 and col2.
If the columns in the order by do not uniquely identify each row, then the database has to make its own decision about which row will be retained in the resultset: as a consequence, the results are not stable, meaning that consecutive executions of the same query may yield different results.
Make sure that your order by clause is deterministic (for example by adding more columns in the clause), and this problem should not arise anymore.
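As a minimal sketch of that advice (PostgreSQL): the sample table, its data, and the unique id column are assumptions made up for illustration, since the question does not show the schema of t1. The id column is appended to the order by as a tie-breaker.

```sql
-- Hypothetical sample table; the id column is an assumption, not part of the question.
create table t1 (id int primary key, col1 int, col2 int, col3 text);
insert into t1 values
    (1, 1, 10, 'x'),
    (2, 1, 10, 'y'),   -- ties with id 1 on (col1, col2)
    (3, 2, 20, 'x');

-- With only (col1, col2) in the order by, either id 1 or id 2 may be kept
-- for col1 = 1, so the filter on col3 may or may not find that row.
-- Adding the unique id makes the choice deterministic: id 1 is always kept.
select a.*
from (
    select distinct on (t1.col1)
           t1.*
    from t1
    order by t1.col1, t1.col2, t1.id
) a
where a.col3 = 'x';
```

Any column (or set of columns) that is unique within each col1 group would do as the tie-breaker; a primary key is just the simplest choice.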

Related

Write a where clause that compares two columns to the same subquery?

I want to know if it's possible to make a where clause compare two columns to the same subquery. I know I could use a temp table or table variable, or write the same subquery twice, but I want to avoid all that if possible. The subquery is long and complex and will cause significant overhead if I have to write it twice.
Here is an example of what I am trying to do.
SELECT * FROM Table WHERE (Column1 OR Column2) IN (Select column from TABLE)
I'm looking for a simple answer and that might just be NO but if it's possible without anything too elaborate please clue me in.
I updated the select to use OR instead of AND as this clarified my question a little better.
The example you've given would probably perform best using exists, such as:
select *
from t1
where exists (
select 1 from t2
where t2.col = t1.col1 and t2.col = t1.col2
);
To prevent writing the complicated subquery twice, you can use a CTE (Common Table Expression):
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE Column1 IN (SELECT x FROM MyFirstCTE)
AND Column2 IN (SELECT x FROM MyFirstCTE)
Or using EXISTS:
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column1)
AND EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column2)
I used deliberately clumsy names, best to pick better ones.
I started it with a ; because if it's not the first command in a larger script then a ; is needed to separate the CTE from the commands before it.
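Since the question was edited to ask for OR rather than AND, here is a sketch of how the CTE approach might look in that case (SQL Server syntax). The sample tables and their data are hypothetical and only there so the snippet runs on its own; the CTE and column names follow the answer above.

```sql
-- Hypothetical sample tables, only so the sketch is self-contained.
CREATE TABLE Table1 ([column] INT);
CREATE TABLE Table2 (Column1 INT, Column2 INT);
INSERT INTO Table1 ([column]) VALUES (1), (2);
INSERT INTO Table2 (Column1, Column2) VALUES (1, 9), (8, 2), (7, 9);

-- No leading ; is needed here because the previous statement is terminated.
WITH MyFirstCTE (x) AS
(
    SELECT [column] FROM Table1
    -- the long, complicated subquery goes here, written once
)
SELECT t.*
FROM Table2 AS t
WHERE EXISTS (
    SELECT 1
    FROM MyFirstCTE
    WHERE x IN (t.Column1, t.Column2)   -- true if either column matches
);
-- With the sample data, rows (1, 9) and (8, 2) qualify; (7, 9) does not.
```

Writing the test as a single EXISTS with IN means the CTE is referenced only once in the outer query, which is about as close to the (Column1 OR Column2) IN (...) idea as standard syntax gets.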

Most efficient way to find distinct records, retaining unique ID

I have a large dataset stored in a SQL Server table, with one unique ID and many attributes. I need to select the distinct attribute combinations, along with one of the unique IDs associated with each combination.
Example dataset:
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
3|big|blue|ball
4|small|red|ball
Example Goal (2,3,4 would also have been acceptable) :
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
4|small|red|ball
I have tried a few different methods, but all of them seem to be taking very long (hours), so I was wondering if there was a more efficient approach. Failing this, my next idea is to partition the table.
I have tried:
Using Where exists, e.g.
SELECT * from Table as T1
where exists (select *
from table as T2
where
ISNULL(T1.ID,'') <> ISNULL(T2.ID,'')
AND ISNULL([T1].[Col1],'') = ISNULL([T2].[Col1],'')
AND ISNULL([T1].[Col2],'') = ISNULL([T2].[Col2],'')
)
MAX(ID) and Group By Attributes.
GROUP BY Attributes, having count > 1.
How about just using group by?
select min(id), col1, col2, col3
from t
group by col1, col2, col3;
This will probably take a while. This might be more efficient:
select t.*
from t
where t.id = (select min(t2.id)
from t t2
where t.col1 = t2.col1 and t.col2 = t2.col2 and . . .
);
This requires an index on t(col1, col2, col3, . . ., id). Given your request, that is on all columns.
In addition, this will not work for columns that are NULL. Some databases support the ANSI standard is not distinct from for null-safe comparisons. If yours does, then it should use the index for this construct as well.
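A sketch of both points, assuming exactly three attribute columns and a database that supports is not distinct from (PostgreSQL does; SQL Server does from version 2022). The table definition, sample rows, and index name are hypothetical, modeled on the question's example data.

```sql
-- Hypothetical table modeled on the question's sample; one NULL added on purpose.
create table t (id int primary key, col1 varchar(10), col2 varchar(10), col3 varchar(10));
insert into t values
    (1, 'big',   'blue', 'ball'),
    (2, 'big',   'red',  'ball'),
    (3, 'big',   'blue', 'ball'),
    (4, 'small', 'red',  null);

-- Index in the shape described above: attribute columns first, id last.
create index t_attrs_id on t (col1, col2, col3, id);

-- Null-safe version of the correlated subquery: NULL matches NULL.
select t.*
from t
where t.id = (select min(t2.id)
              from t t2
              where t.col1 is not distinct from t2.col1
                and t.col2 is not distinct from t2.col2
                and t.col3 is not distinct from t2.col3
             );
-- With the sample data this keeps ids 1, 2 and 4; with plain = the row with
-- the NULL (id 4) would be dropped, because NULL = NULL is not true.
```

The group by version does not have this problem, since group by already puts all NULLs of a column into the same group.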
SELECT Id, Col1, Col2, Col3
FROM (
    SELECT Id, Col1, Col2, Col3,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY Id) AS valid
    FROM [Table] AS T1
) t
WHERE valid = 1
Hope this helps...

SQL check top 100 rows different everytime running the SQL query?

I am new to SQL.
I have used "order by" to sort two large tables on IBM Netezza.
The table is:
col1 INT
col2 INT
col3 INT
col4 DOUBLE PRECISION
INSERT INTO mytable
SELECT *
FROM table1 AS t1
ORDER BY t1.col1 , t1.col2, t1.col3, t1.col4 ASC
After sorting, I check the top 100 rows:
SELECT *
FROM mytable
LIMIT 100;
But I got different results every time I ran the query for the top 100 rows.
When I export the table to a txt file, the same thing happens.
Why?
Thanks!
The order in which you insert data into the table is meaningless. Running a query has absolutely no guarantee on the order rows are returned, unless you explicitly specify it using the order by clause. Since there's no guarantee on the order of rows, there's no guarantee what the "top 100" are, and hence you may very well get different results each time you run the query.
If you do specify the order in your query, however, you should get consistent results, provided the sort columns determine the top 100 unambiguously (if, say, 200 rows tie on all sort columns, any 100 of them would be a valid result):
SELECT *
FROM mytable
ORDER BY col1, col2, col3, col4 ASC
LIMIT 100;
The sequence is not guaranteed, although it may match the order in which the data was inserted. If you need it in a particular sequence, use ORDER BY.
You need to add ORDER BY to the SELECT query. Writing ASC explicitly after the ORDER BY clause is not mandatory, because sorting is ascending by default:
SELECT *
FROM mytable t1
ORDER BY t1.col1 , t1.col2, t1.col3, t1.col4
LIMIT 100;
You need to apply an ORDER BY to your SELECT statement too:
SELECT *
FROM mytable t1
ORDER BY t1.col1 , t1.col2, t1.col3, t1.col4 ASC
LIMIT 100;

Wrapping SQL query into outer Select causes Order By reshuffle

I have an engine that builds queries dynamically, so the SQL is not static, which is why I had to go this way (below). It also works for both SQL Server and Oracle (for Oracle it adds a different wrapper, RowNum, etc.). I have no easy way to test Oracle, but below is the SQL Server problem, step by step.
Let's take a simple query
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
I may or may not have to union the output with another table
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
Union
Select field2 as f1, myDate dateFld From table2 t2 Where t2.field2 = 2
I need to get only N records from this Union
Select Top(N) *
From
(
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
Union
Select field2 as f1, myDate dateFld From table2 t2 Where t2.field2 = 2
) Union_Tbl_Alias
Order By dateFld Desc, f1
Remember this "Order by"
I also have Select Subqueries (and nothing I can do but have them in Select), which I moved to yet another Select wrapper
Select
    f1,
    dateFld,
    (Select field99 From table99 t99 Where t99.f1 = Outer_Tbl_Alias.f1) as f3
From
(
    Select Top(N) *
    From
    (
        Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
        Union
        Select field2 as f1, myDate dateFld From table2 t2 Where t2.field2 = 2
    ) Union_Tbl_Alias
    Order By dateFld Desc, f1
) Outer_Tbl_Alias
So the problem is that the outermost select reshuffles the records a bit. They are no longer sorted by dateFld Desc.
I don't want to speculate, but I think this is a SQL Server-only issue; I will test it in Oracle as well. Moving the "Order By" to the outermost statement fixes it for SQL Server.
But I'm wondering:
1 - why it happens?
2 - is there a hint to tell SQL server - keep the order of inner Select?
That behavior makes sense. Your outer query does not contain an ORDER BY clause, so the order of its results is arbitrary. The fact that rows were ordered in a subquery does not determine the final order (though it will often influence it in practice). Since you are building the query programmatically, it would make far more sense to add whatever ORDER BY clause you want than to try to work around the issue (and I'm not aware of a workaround that is guaranteed to work every time).
You'll have exactly the same issue when you run against an Oracle database and switch out the TOP for a couple of nested queries with rownum predicates. The only way to guarantee the order of your results is to add an ORDER BY clause. Since that is going to be necessary regardless of the database you are using, it makes even more sense to do it correctly by adding the additional ORDER BY to the outer query rather than having different database-specific workarounds.
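For the SQL Server case, the fix might look like the sketch below. The table definitions are hypothetical (the question does not show them) and exist only so the query can be run as written; 10 stands in for N.

```sql
-- Hypothetical schema, guessed from the column names used in the question.
CREATE TABLE table1  (field1 INT, field2 INT, myDate DATE);
CREATE TABLE table2  (field2 INT, myDate DATE);
CREATE TABLE table99 (f1 INT, field99 VARCHAR(20));

SELECT
    Outer_Tbl_Alias.f1,
    Outer_Tbl_Alias.dateFld,
    (SELECT t99.field99 FROM table99 t99 WHERE t99.f1 = Outer_Tbl_Alias.f1) AS f3
FROM
(
    SELECT TOP (10) *                    -- 10 stands in for N
    FROM
    (
        SELECT field1 AS f1, myDate AS dateFld FROM table1 t1 WHERE t1.field2 = 1
        UNION
        SELECT field2 AS f1, myDate AS dateFld FROM table2 t2 WHERE t2.field2 = 2
    ) Union_Tbl_Alias
    ORDER BY dateFld DESC, f1            -- decides WHICH rows TOP keeps
) Outer_Tbl_Alias
ORDER BY Outer_Tbl_Alias.dateFld DESC, Outer_Tbl_Alias.f1;   -- decides the order rows are RETURNED in
```

The two ORDER BY clauses do different jobs: the inner one only tells TOP which rows to pick, and the outer one is the only thing that governs the order of the final result.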

Combining several query results into one table, how is the results order determined?

I am returning table results for different queries, but each result will be in the same format and all of them will go into one final table. If I want the results of query 1 to be listed first, query 2 second, and so on, what is the easiest way to do it?
Does UNION append the tables, or is the order of the combination random?
The SQL standard does not guarantee an order unless one is explicitly requested with an order by clause. In practice, rows often come back in insertion order, but I would not rely on that if the order is important.
Across a union you can control the order like this...
select
this,
that
from
(
select
this,
that
from
table1
union
select
this,
that
from
table2
) u
order by
that,
this;
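UNION ALL often returns the first query's rows before the second query's rows in practice, but neither UNION nor UNION ALL guarantees that, and plain UNION also removes duplicates, which can reorder rows. To list the first query's rows first reliably, add a constant column to each branch and sort by it.
You can use: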
SELECT Col1, Col2,...
FROM (
SELECT Col1, Col2,..., 1 AS intUnionOrder
FROM ...
UNION ALL
SELECT Col1, Col2,..., 2 AS intUnionOrder
FROM ...
) AS T
ORDER BY intUnionOrder, ...
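Filled in with concrete names, the same idea could look like the sketch below. The tables, their data, and the src column name are hypothetical; the this/that columns are borrowed from the earlier answer.

```sql
-- Hypothetical tables standing in for the results of query 1 and query 2.
CREATE TABLE table1 (this INT, that INT);
CREATE TABLE table2 (this INT, that INT);
INSERT INTO table1 VALUES (5, 50), (3, 30);
INSERT INTO table2 VALUES (1, 10), (4, 40);

SELECT this, that
FROM (
    SELECT this, that, 1 AS src FROM table1   -- tag rows from query 1
    UNION ALL
    SELECT this, that, 2 AS src FROM table2   -- tag rows from query 2
) AS u
ORDER BY src, that, this;   -- all query 1 rows first, then query 2 rows
```

UNION ALL is used rather than UNION because the tag column makes rows from the two branches distinct anyway, so duplicate elimination across branches would not remove anything and only adds work.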