UNION ALL vs OR condition in sql server query - sql

I have to select some rows based on a not exists condition on a table. If I use a union all as below, it gets executed in less than 1 second.
SELECT 1 FROM dummyTable
WHERE NOT EXISTS
(
SELECT 1 FROM TABLE t
WHERE Data1 = t.Col1 AND Data2=t.Col2
UNION ALL
SELECT 1 FROM TABLE t
WHERE Data1 = t.Col2 AND Data2=t.Col1
)
but if I use an OR condition, it takes close to a minute as SQL server is doing a table lazy pool. Can someone explain it?
SELECT 1 FROM dummyTable
WHERE NOT EXISTS
(
SELECT 1 FROM TABLE t
WHERE ( (Data1 = t.Col1 AND Data2=t.Col2) OR (Data1 = t.Col2 AND Data2=t.Col1))
)

The issue is that you are specifying two conditions with OR that apply to separate tables in your query. Because of this, the nonclustered index seek has to return most or all of the rows in your big table because OR logic means they might also match the condition clause in the second table.
Look at the SQL execution plan in all three examples above, and notice the number of rows that come out of the nonclustered index seek from the big table. The ultimate result may only return 1,000 or fewer of the 800,000 rows in the table but the OR clause means that the contents of that table have to be cross-referenced with the conditional in the second table since OR means they may be needed for the final query output.
Depending on your execution plan, the index seek may pull out all 800,000 rows in big table because they may also match the conditions of the OR clause in the second table. The UNION ALL is two separate query against one table each, so the index seek only has to output the smaller result set that might match the condition for that query.
I hope this makes sense. I've run across the same situation while refactoring slow-running SQL statements.
Cheers,
Andre Ranieri

The query plan is also affected by the number of rows in your tables. How many rows are there in table t ?
You could also try:
SELECT 1 FROM dummyTable
WHERE NOT EXISTS
(
SELECT 1 FROM TABLE t
WHERE Data1 = t.Col1 AND Data2=t.Col2
)
AND NOT EXISTS
(
SELECT 1 FROM TABLE t
WHERE Data1 = t.Col2 AND Data2=t.Col1
)
or (corrected for SQL-Server) this that will use the index:
WITH tt AS <---- a temp table with 2 rows
( SELECT Data1 AS Col1, Data2 AS Col2
UNION
SELECT Data2 AS Col1, Data1 AS Col2
)
SELECT 1 FROM dummyTable
WHERE NOT EXISTS
(
SELECT 1
FROM TABLE t
JOIN tt
ON tt.Col1 = t.Col1 AND tt.Col2=t.Col2
)

The usage of the OR is probably causing the query optimizer to no longer use an index in the second query. Look at the explain for each query and that will tell you the answer.

Related

Write a where clause that compares two columns to the same subquery?

I want to know if it's possible to make a where clause compare 2 columns to the same subquery. I know I could make a temp table/ variable table or write the same subquery twice. But I want to avoid all that if possible. The Subquery is long and complex and will cause significant overhead if I have to write it twice.
Here is an example of what I am trying to do.
SELECT * FROM Table WHERE (Column1 OR Column2) IN (Select column from TABLE)
I'm looking for a simple answer and that might just be NO but if it's possible without anything too elaborate please clue me in.
I updated the select to use OR instead of AND as this clarified my question a little better.
The example you've given would probably perform best using exists, such as:
select *
from t1
where exists (
select 1 from t2
where t2.col = t1.col1 and t2.col = t1.col2
);
To prevent writing the complicated subquery twice, you can use a CTE (Common Table Expression):
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE Column1 IN (SELECT x FROM MyFirstCTE)
AND Column2 IN (SELECT x FROM MyFirstCTE)
Or using EXISTS:
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column1)
AND EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column2)
I used deliberately clumsy names, best to pick better ones.
I started it with a ; because if it's not the first command in a larger script then a ; is needed to separate the CTE from the commands before it.

SQL - Is there a way to check rows for duplicates in all columns of a table

I have a big table with over 1 million rows and 96 columns.
Using SQL I want to find rows where every value is the same. The table doesn't have any primary key so I'm not sure how to approach this. I'm not permitted to change the table structure.
I've seen people use count(*) and group by but I'm not sure if this is effective for a table with 96 columns.
Using COUNT() as an analytic function we can try:
WITH cte AS (
SELECT *, COUNT(*) OVER (PARTITION BY col1, col2, ..., col96) cnt
FROM yourTable
)
SELECT col1, col2, ..., col96
FROM cte
WHERE cnt > 1;
you can use md5 function as primary key.
select count(1),md5_col,* from (
select md5(concat_ws('',col1,col2)) as md5_col,* from db_name.table_name) tt group by md5_col;
For convenience, use BINARY_CHECKSUM:
with cte as (
select *, BINARY_CHECKSUM(*) checksum
from mytable
), cte2 as (
select checksum
from cte
group by checksum
having count(*) > 1
)
select distinct t1.*
from cte t1
join cte t2 on t1.checksum = t2.checksum
and t1.col1 = t2.col2
and t1.col2 = t2.col2
-- etc
where t1.checksum in (select checksum from cte2)
cte2 will return (almost) only truly matching rows, so join condition won't have many rows to exhaustively compare every column.
Rather than trying to boil the ocean and solve the entire problem with a single sql query (which you certainly can do...), I recommend using any indexes or statistics on the tables to filter out as many rows as you can.
Start by finding the columns with the most / fewest unique values (assuming you have statistics, that is), and smash them up against each other to rapidly exclude as many rows as possible. Take the results, dump them to a temp table, index fields as needed, and repeat.
Or you could just do this:
Declare #sql nvarchar(max);
Select #sql='select column1 from schema.table where case ' + stuff((select 'when col1!=' + quotename(name) + ' then 0 ' from sys.columns where object_id=object_id('schema.table') for xml path(''),Type).value('.','nvarchar(max)'),1,11,'') + 'else 1 end = 1';
Exec sp_executesql #sql;
If you must run that horrorshow of a query in production, please use snapshot isolation or move it to a temp table first (unless no one ever updates the table.
(Honestly, I would probably use something like that query on the temp table containing my filtered-down dataset... but anything you can do to makes sure that the comparisons aren't naïve (e.g. taking statistics into account) can improve your performance significantly. If you want to do it all at once, you could always join sys.tables to a temp table that puts your field comparisons into a thoughtful order. After all, once a case statement if found to be true, all the others will be skipped for that record. )

How to write a hive sql that insert into table2 if there's a specific row in table1

I am facing a hive problem.
I will get a 0 or 1 after from sql
"select count(*) from table1 where ..."
If the result is 1, then I will execute the sql
"Insert Into table2 partition(d) (select xxxx from table 1 where ...
group by t)"
Otherwise do nothing.
My question is how can I write these two sql together into one sql. I am only allowed to write a single long sql.
I tried to put the first sql into the where condition in sql2, but it throwed an error said it's not supported to operat on table1 in the subquery (couldn't remember clearly, something like this).
It sounds like a very easy question for experienced programmers, but I just started lerning hive for 2 days.
If select in insert overwrite table partition does not returns rows, nothing is being overwritten.
So, just calculate your count in the same dataset and filter by it, use analytics funtion if you want to aggregate data on different level before insert
Insert Into table2 partition(d)
select col1, col2, sum(col3), etc, etc, partition_col
from
(
select --some columns here,
--Assign the same count to all rows
count(case when your_boolean_condition then 1 else null end) over () as cnt
from table 1
) s
where cnt=1 --If this is not satisfied, no overwrite will happen
AND more_conditions
group by ...
Another approach possible is to use cross-join with your count:
insert Into table2 partition(d)
select xxxx ... ... ..., partition_column as d
from
(
select t.*, c.cnt
table1 t cross join (select count(*) cnt from table1 where condition) c
)s
where cnt=1 <and another_condition>

Update Table Beginning At Record One SQL Server

I am trying to update a table with records from another table. Whenever I use the insert into statement, I find that the records are simply appended. Instead, I want the records to be inserted from the top of the table. What is the easiest way to do this? I am thin king I could use a update statement, but that means I will have to join the tables. One of the tables(the one I am pulling records from) has only one column. As such, I would have to include another column to do the join.I am trying not to make it so complicated. If there is a simplier way, please let me know.
Sample:
Table One
Col1
1
2
3
4
Table 2
Col1 Col2
a
b
c
d
I want to move column 1 from table 1 to column 2 in table 2 such that table 2 will be:
Table 2
Col1 Col2
a 1
b 2
c 3
d 4
You can do the update using row_number(), but the rows will be assigned in an indeterminate order:
with toupdate as (
select t2.*, row_number() over (select NULL)) as seqnum
from table2 t2
),
t1 as (
select t1.*, row_numbrer() over (select NULL)) as seqnum
from table1 t1
)
update toupdate
set col2 = t1.col1
from toupdate join
t1
on toupdate.seqnum = t1.seqnum;
Note: if you have an ordering in mind, then use the appropriate order by in the partition clauses.
Unless you explicity define an ORDER BY clause in your SELECT statements, your result set will be completely arbitrary. This is in line with how any RDBMS should operate. You should consider including a timestamp at the time of insertion to identify the latest rows.

Combining several query results into one table, how is the results order determined?

I am retuning table results for different queries but each table will be in the same format and will all be in one final table. If I want the results for query 1 to be listed first and query2 second etc, what is the easiest way to do it?
Does UNION append the table or are is the combination random?
The SQL standard does not guarantee an order unless explicitly called for in an order by clause. In practice, this usually comes back chronologically, but I would not rely on it if the order is important.
Across a union you can control the order like this...
select
this,
that
from
(
select
this,
that
from
table1
union
select
this,
that
from
table2
)
order by
that,
this;
UNION appends the second query to the first query, so you have all the first rows first.
You can use:
SELECT Col1, Col2,...
FROM (
SELECT Col1, Col2,..., 1 AS intUnionOrder
FROM ...
) AS T1
UNION ALL (
SELECT Col1, Col2,..., 2 AS intUnionOrder
FROM ...
) AS T2
ORDER BY intUnionOrder, ...