SQL: top 100 rows different every time the query runs?

I am new to SQL.
I used ORDER BY to sort two large tables in IBM Netezza.
The table is:
col1 INT
col2 INT
col3 INT
col4 DOUBLE PRECISION
INSERT INTO mytable
SELECT *
FROM table1 AS t1
ORDER BY t1.col1, t1.col2, t1.col3, t1.col4 ASC
After sorting, I check the top 100 rows:
SELECT *
FROM mytable
LIMIT 100;
But I get different results every time I run the query for the top 100 rows.
When I export the table to a txt file, the same thing happens.
Why?
Thanks!

The order in which you insert data into the table is meaningless. Running a query has absolutely no guarantee on the order rows are returned, unless you explicitly specify it using the order by clause. Since there's no guarantee on the order of rows, there's no guarantee what the "top 100" are, and hence you may very well get different results each time you run the query.
If you do specify the order in your query, however, you should get consistent results (assuming there is only one possible outcome for the top 100 rows, and not, e.g., 200 tied rows, any 100 of which could be considered a valid result):
SELECT *
FROM mytable
ORDER BY col1, col2, col3, col4 ASC
LIMIT 100;

The sequence is not guaranteed, although it may match the order in which the data was inserted. If you need it in a particular sequence, use ORDER BY.

You need to add ORDER BY to the SELECT query itself. It is not mandatory to write ASC explicitly after the ORDER BY clause, since sorting defaults to ascending order:
SELECT *
FROM mytable t1
ORDER BY t1.col1 , t1.col2, t1.col3, t1.col4
LIMIT 100;

You need to apply an ORDER BY to your SELECT statement.

The ORDER BY belongs on the SELECT query too:
SELECT *
FROM mytable t1
ORDER BY t1.col1 , t1.col2, t1.col3, t1.col4 ASC
LIMIT 100;
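The point of these answers can be demonstrated end to end. Below is a minimal sketch using Python's stdlib sqlite3 (standing in for Netezza; the table and column names mirror the question, the data is invented): with ORDER BY in the same statement as LIMIT, repeated runs return identical rows.

```python
import sqlite3

# Build a small table shaped like the question's (data is made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col1 INT, col2 INT, col3 INT, col4 DOUBLE PRECISION)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)",
                 [(3, 1, 1, 0.5), (1, 2, 9, 1.5), (1, 1, 7, 2.5), (2, 5, 2, 3.5)])

# Without ORDER BY, the engine may return rows in any order, so LIMIT can
# yield a different "top N" on each run. With ORDER BY next to LIMIT, the
# result is deterministic:
query = """
SELECT * FROM mytable
ORDER BY col1, col2, col3, col4
LIMIT 2
"""
first = conn.execute(query).fetchall()
second = conn.execute(query).fetchall()
assert first == second == [(1, 1, 7, 2.5), (1, 2, 9, 1.5)]
```

The INSERT-time ORDER BY from the question plays no part here; only the ORDER BY on the SELECT matters.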

Related

Using distinct on in subqueries

I noticed that in PostgreSQL the following two queries output different results:
select a.*
from (
select distinct on (t1.col1)
t1.*
from t1
order by t1.col1, t1.col2
) a
where a.col3 = value
;
create table temp as
select distinct on (t1.col1)
t1.*
from t1
order by t1.col1, t1.col2
;
select temp.*
from temp
where temp.col3 = value
;
I guess it has something to do with using distinct on in subqueries.
What is the correct way to use distinct on in subqueries? E.g. can I use it if I don't use a WHERE clause?
Or in queries like
(
select distinct on (a.col1)
a.*
from a
)
union
(
select distinct on (b.col1)
b.*
from b
)
In a normal situation, both queries should return the same result.
I suspect that you are getting different results because the order by clause of your distinct on subquery is not deterministic. That is, there may be several rows in t1 sharing the same col1 and col2.
If the columns in the order by do not uniquely identify each row, then the database has to make its own decision about which row will be retained in the resultset: as a consequence, the results are not stable, meaning that consecutive executions of the same query may yield different results.
Make sure that your order by clause is deterministic (for example by adding more columns in the clause), and this problem should not arise anymore.
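The fix can be sketched with Python's stdlib sqlite3. SQLite has no DISTINCT ON, so this emulates it with ROW_NUMBER(); the table and column names come from the question, while the data and the unique id column are invented. Adding id as a final tiebreaker makes the choice among tied rows deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INT, col1 INT, col2 INT, col3 TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)",
                 [(1, 10, 5, 'a'),
                  (2, 10, 5, 'b'),   # ties with id 1 on (col1, col2)
                  (3, 20, 1, 'c')])

# ORDER BY col1, col2 alone leaves the tie unresolved: either id 1 or id 2
# could be kept. The trailing ", id" resolves it deterministically.
rows = conn.execute("""
SELECT id, col1, col2, col3 FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col2, id) AS rn
  FROM t1
) WHERE rn = 1
ORDER BY col1
""").fetchall()
assert rows == [(1, 10, 5, 'a'), (3, 20, 1, 'c')]
```

In PostgreSQL itself the equivalent fix is simply `order by t1.col1, t1.col2, t1.id` in the DISTINCT ON subquery.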

BigQuery: Use COUNT as LIMIT

I want to select everything from mytable1 and combine that with just as many rows from mytable2. In my case mytable1 always has fewer rows than mytable2, and I want the final table to be a 50-50 mix of data from each table. While I feel like the following code expresses what I want logically, it doesn't work syntax-wise:
Syntax error: Expected "#" or integer literal or keyword CAST but got
"(" at [3:1]
(SELECT * FROM `mytable1`)
UNION ALL (
SELECT * FROM `mytable2`
LIMIT (SELECT COUNT(*) FROM `mytable1`)
)
Using standard SQL in BigQuery.
The docs state that the LIMIT clause accepts only literal or parameter values. I think you can ROW_NUMBER() the rows from the second table and limit based on that:
SELECT col1, col2, col3
FROM mytable1
UNION ALL
SELECT col1, col2, col3
FROM (
SELECT col1, col2, col3, ROW_NUMBER() OVER () AS rn
FROM mytable2
) AS x
WHERE x.rn <= (SELECT COUNT(*) FROM mytable1)
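A quick sanity check of this workaround, sketched with Python's stdlib sqlite3 rather than BigQuery (table names from the question, data invented): the scalar subquery that LIMIT rejects is perfectly legal in a WHERE filter over a numbered subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable1 (col1 INT)")
conn.execute("CREATE TABLE mytable2 (col1 INT)")
conn.executemany("INSERT INTO mytable1 VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO mytable2 VALUES (?)", [(10,), (20,), (30,), (40,)])

# Take all of mytable1, plus COUNT(mytable1) rows from mytable2.
rows = conn.execute("""
SELECT col1 FROM mytable1
UNION ALL
SELECT col1 FROM (
  SELECT col1, ROW_NUMBER() OVER () AS rn FROM mytable2
) WHERE rn <= (SELECT COUNT(*) FROM mytable1)
""").fetchall()
assert len(rows) == 4  # 2 rows from each table
```

Note that ROW_NUMBER() OVER () with no ORDER BY picks an arbitrary subset of mytable2; add an ORDER BY inside the OVER () if you care which rows survive.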
Each SELECT statement within UNION must have the same number of columns.
The columns must also have similar data types.
The columns in each SELECT statement must also be in the same order.
Since your mytable1 has fewer columns than mytable2, you have to select the same number of columns in each branch:
select col1, col2, col3, '' as col4 from mytable1 -- pad the missing column with an alias
union all
select col1,col2,col3,col4 from mytable2

Most efficient way to find distinct records, retaining unique ID

I have a large dataset stored in a SQL server table, with 1 unique ID, and many attributes. I need to select the distinct attribute records, along with one of the unique IDs associated with that unique combination.
Example dataset:
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
3|big|blue|ball
4|small|red|ball
Example goal (2, 3, 4 would also have been acceptable):
ID|Col1|Col2|Col3...
1|big|blue|ball
2|big|red|ball
4|small|red|ball
I have tried a few different methods, but all of them seem to be taking very long (hours), so I was wondering if there was a more efficient approach. Failing this, my next idea is to partition the table.
I have tried:
Using Where exists, e.g.
SELECT * from Table as T1
where exists (select *
from table as T2
where
ISNULL(T1.ID,'') <> ISNULL(T2.ID,'')
AND ISNULL([T1].[Col1],'') = ISNULL([T2].[Col1],'')
AND ISNULL([T1].[Col2],'') = ISNULL([T2].[Col2],'')
)
MAX(ID) and Group By Attributes.
GROUP BY Attributes, having count > 1.
How about just using group by?
select min(id), col1, col2, col3
from t
group by col1, col2, col3;
This will probably take a while. This might be more efficient:
select t.*
from t
where t.id = (select min(t2.id)
from t t2
where t.col1 = t2.col1 and t.col2 = t2.col2 and . . .
);
This requires an index on t(col1, col2, col3, . . ., id). Given your request, that is on all columns.
In addition, this will not work for columns that are NULL. Some databases support the ANSI standard is not distinct from for null-safe comparisons. If yours does, then it should use the index for this construct as well.
SELECT Id,Col1,Col2,Col3 FROM (
SELECT Id,Col1,Col2,Col3,ROW_NUMBER() OVER (Partition By Col1,Col2,Col3 Order By ID,Col1,Col2,Col3) valid
from Table as T1) t
WHERE valid=1
Hope this helps...
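The MIN(id)/GROUP BY approach from the first answer can be run directly on the question's sample data. A minimal sketch using Python's stdlib sqlite3 in place of SQL Server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (1, 'big',   'blue', 'ball'),
    (2, 'big',   'red',  'ball'),
    (3, 'big',   'blue', 'ball'),   # duplicate attributes of id 1
    (4, 'small', 'red',  'ball'),
])

# One row per distinct attribute combination, keeping the smallest id.
rows = conn.execute("""
SELECT MIN(id), col1, col2, col3
FROM t
GROUP BY col1, col2, col3
ORDER BY 1
""").fetchall()
assert rows == [(1, 'big', 'blue', 'ball'),
                (2, 'big', 'red',  'ball'),
                (4, 'small', 'red', 'ball')]
```

This matches the example goal in the question exactly; whether it beats the ROW_NUMBER() variant on hours-long runtimes will depend on the available indexes.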

Wrapping SQL query into outer Select causes Order By reshuffle

I have an engine that builds a query, so it is not static, which is why I had to go this way (below). Plus, it works for SQL Server and Oracle (Oracle adds a different wrapper, RowNum, etc.). I have no easy way to test Oracle, but below is the SQL Server problem, step by step.
Let's take a simple query:
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
I may, or may not, have to union the output with another table:
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
Union
Select field2 as f1, myDate dateFld From table2 t2 Where t2.field2 = 2
I need to get only N records from this Union
Select Top(N) *
From
(
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
Union
Select field2 as f1, myDate dateFld From table2 t2 Where t2.field2 = 2
) Union_Tbl_Alias
Order By dateFld Desc, f1
Remember this "Order by"
I also have SELECT subqueries (there is nothing I can do but have them in the SELECT list), which I moved to yet another SELECT wrapper:
Select
f1,
dateFld,
(Select field99 From table99 t99 Where t99.f1 = Outer_Tbl_Alias.f1) as f3
From
(
Select Top(N) *
From
(
Select field1 as f1, myDate dateFld From table1 t1 Where t1.field2 = 1
Union
Select field2 as f1, myDate dateFld From table2 t2 Where t2.field2 = 2
) Union_Tbl_Alias
Order By dateFld Desc, f1
) Outer_Tbl_Alias
So the problem is that the outer-most select reshuffles records a bit. They are no longer sorted by dateFld Desc.
I don't want to speculate; I think this is only a SQL Server issue, but I will test it in Oracle as well. Moving the "Order By" to the outer-most statement fixes it for SQL Server.
But I'm wondering:
1 - why does it happen?
2 - is there a hint to tell SQL Server to keep the order of the inner Select?
That behavior appears to make sense. Your outer query does not contain an ORDER BY clause so the order of the results is arbitrary. The fact that rows may have been ordered in a subquery is not controlling (though it undoubtedly does end up affecting the order of the results). Since you are building the query programmatically, it would make far more sense to add whatever ORDER BY clause you want than to try to work around the issue (and I'm not aware of a way to work around the issue that is guaranteed to work every time).
You'll have exactly the same issue when you run against an Oracle database and switch out the TOP for a couple of nested queries with rownum predicates. The only way to guarantee the order of your results is to add an ORDER BY clause. Since that is going to be necessary regardless of the database you are using, it makes even more sense to do it correctly by adding the additional ORDER BY to the outer query rather than having different database-specific workarounds.
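The recommended fix, repeating the ORDER BY on the outermost SELECT, can be sketched with Python's stdlib sqlite3 standing in for SQL Server (table and column names from the question, data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (field1 TEXT, myDate TEXT, field2 INT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                 [('a', '2020-01-03', 1),
                  ('b', '2020-01-01', 1),
                  ('c', '2020-01-02', 1)])

# The inner ORDER BY + LIMIT only decides WHICH rows survive (the top N);
# the outer ORDER BY is what guarantees the order they come back in.
rows = conn.execute("""
SELECT f1, dateFld
FROM (
  SELECT field1 AS f1, myDate AS dateFld
  FROM table1 WHERE field2 = 1
  ORDER BY dateFld DESC LIMIT 2
) Outer_Tbl_Alias
ORDER BY dateFld DESC, f1
""").fetchall()
assert rows == [('a', '2020-01-03'), ('c', '2020-01-02')]
```

Dropping the outer ORDER BY here would still usually return the same order, which is exactly the trap: it is an accident of the plan, not a guarantee.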

sql delete rows with 1 column duplicated

I have a Microsoft SQL Server 2005 table where the entire row is not duplicated, but one column is duplicated.
1 aaa
1 bbb
1 ccc
2 abc
2 def
How can I delete all but one of the rows that have the first column duplicated?
For clarification, I need to get rid of the second, third and fifth rows.
Try the following query in SQL Server 2005:
WITH T AS (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS rnum, * FROM dbo.Table_1)
DELETE FROM T WHERE rnum>1
Let's call these the id and the Col1 columns.
DELETE T1 FROM myTable T1
WHERE EXISTS
(SELECT * FROM myTable T2
WHERE T2.id = T1.id AND T2.Col1 > T1.Col1)
Edit: As pointed out by Andomar, the above doesn't get rid of exact duplicate cases, where both id and Col1 are the same in different rows.
These can be handled as follow:
(note: whereby the above query is generic SQL, the following applies to MSSQL 2005 and above)
It uses the Common Table Expression (CTE) feature, along with ROW_NUMBER() function to produce a distinctive row value. It is essentially the same construct as the above except that it now works with a "table" (CTEs are mostly like a table) which has a truly distinct identifier key.
Note that by removing "AND T2.Col1 = T1.Col1", we produce a query which can handle both types of duplicates (id-only duplicates and both Id and Col1 duplicates) in a single query, i.e. in a similar fashion that Hamadri's solution (the PARTITION in his/her CTE serves the same purpose as the subquery in this solution, essentially the same amount of work is done). Depending on the situation, it may be preferable, performance-wise or other, to handle the situation in two steps.
WITH T AS
(SELECT ROW_NUMBER() OVER (ORDER BY id, Col1) AS rn, id, Col1 FROM MyTable)
DELETE T1 FROM T AS T1
WHERE EXISTS
(SELECT *
FROM T AS T2
WHERE T2.id = T1.id AND T2.Col1 = T1.Col1
AND T2.rn > T1.rn
)
DELETE FROM tableName
WHERE col2 NOT IN (SELECT MIN(col2) FROM tableName AS t2 GROUP BY col1)
Make sure the sub select returns the rows you want to keep.
Try this.
DELETE FROM <TABLE_NAME_HERE> WHERE <SECOND_COLUMN_NAME_HERE> IN ('bbb', 'abc', 'def');
SQL Server is not my native SQL database, but maybe something like this? The idea is to get the duplicates and delete the ones with the larger ROW_NUMBER. This should leave only the first one. I don't know if this is what you want or if it will work, but the logic seems sound.
DELETE T1
FROM T1 T2
WHERE T1.Col1 = T2.col1
AND T1.ROW_NUMBER() > T2.ROW_NUMBER()
Please feel free to correct me if SQL Server can't handle that kind of treatment :)
Another idea using ROW_NUMBER():
Delete MyTable
Where Id IN
(
Select T.Id FROM
(
SELECT Id, ROW_NUMBER() OVER (Partition By UniqueColumn Order By Id) AS RowNumber FROM MyTable
)T
WHERE T.RowNumber > 1
)
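For a portable variant of these answers (SQL Server's DELETE-from-CTE syntax is vendor-specific), here is the NOT IN pattern run on the question's own sample data, sketched with Python's stdlib sqlite3. It keeps the row with the smallest second-column value per first-column group; note it assumes the kept values in col2 do not repeat across groups, which holds for this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INT, col2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 'aaa'), (1, 'bbb'), (1, 'ccc'),
                  (2, 'abc'), (2, 'def')])

# Delete every row whose col2 is not the per-group minimum,
# i.e. keep exactly one row per duplicated col1.
conn.execute("""
DELETE FROM t
WHERE col2 NOT IN (SELECT MIN(col2) FROM t GROUP BY col1)
""")
rows = conn.execute("SELECT * FROM t ORDER BY col1").fetchall()
assert rows == [(1, 'aaa'), (2, 'abc')]
```

Exactly the second, third and fifth rows from the question are removed.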