SQL command usage of in / or

I have an SQL command similar to the one below.
select * from table1
where table1.col1 in (select columnA from table2 where table2.keyColumn=3)
or table1.col2 in (select columnA from table2 where table2.keyColumn=3)
Its performance is really bad, so how can I change this command? (Please note that the two subqueries in the parentheses are exactly the same.)

Try
select distinct t1.* from table1 t1
inner join table2 t2 ON t1.col1 = t2.columnA OR t1.col2 = t2.columnA
where t2.keyColumn = 3

This is your query:
select *
from table1
where table1.col1 in (select columnA from table2 t2 where t2.keyColumn = 3) or
table1.col2 in (select columnA from table2 t2 where t2.keyColumn = 3);
Probably the best approach is to build an index on table2(keyColumn, columnA).
It is also possible that IN has poor performance characteristics in your database, so you can try rewriting this as an EXISTS query:
select *
from table1 t1
where exists (select 1 from table2 t2 where t2.columnA = t1.col1 and t2.keyColumn = 3) or
exists (select 1 from table2 t2 where t2.columnA = t1.col2 and t2.keyColumn = 3);
In this case, the appropriate index is table2(columnA, keyColumn).
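To check that the IN and EXISTS forms really return the same rows, both can be run against a small throwaway database. This is only a sketch: it uses Python's sqlite3 module as a stand-in engine (an assumption, since the question does not name one), the sample rows are invented, and it says nothing about timing on the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 INTEGER, col2 INTEGER);
    CREATE TABLE table2 (keyColumn INTEGER, columnA INTEGER);
    -- Index suggested for the EXISTS form: lookup column first, then the filter.
    CREATE INDEX idx_table2_a_key ON table2 (columnA, keyColumn);
    INSERT INTO table1 VALUES (1, 10, 99), (2, 98, 20), (3, 97, 96), (4, 30, 30);
    INSERT INTO table2 VALUES (3, 10), (3, 20), (5, 30), (3, 10);
""")

in_query = """
    SELECT id FROM table1
    WHERE col1 IN (SELECT columnA FROM table2 WHERE keyColumn = 3)
       OR col2 IN (SELECT columnA FROM table2 WHERE keyColumn = 3)
    ORDER BY id
"""
exists_query = """
    SELECT id FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2
                  WHERE t2.columnA = t1.col1 AND t2.keyColumn = 3)
       OR EXISTS (SELECT 1 FROM table2 t2
                  WHERE t2.columnA = t1.col2 AND t2.keyColumn = 3)
    ORDER BY id
"""

in_ids = [row[0] for row in conn.execute(in_query)]
exists_ids = [row[0] for row in conn.execute(exists_query)]
print(in_ids, exists_ids)
```

Rows 1 and 2 match through col1 and col2 respectively; row 4 is excluded because its value only appears in table2 under a different keyColumn.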

Assuming you're doing this in VFP, use SYS(3054) to see how the query is being optimized and what part is not.

Are the main query and subqueries fully Rushmore-optimisable?
Since the subqueries do not appear to be correlated (i.e. they don't refer to table1), as long as everything is fully supported by indexes you should be fine.

Related

what does it mean to have an SQL FROM clause with no comma?

I noticed today that this query
select * from table1 table2 where column_from_table1 = ?;
works. It works the same as (same columns return)
select * from table1 where column_from_table1 = ?;
Shouldn't the former be a syntax error? What is it interpreting table2 as?

It appears to be interpreting it as an alias for the table. Even though table2 exists, it happily allows the rename. This also works:
select * from table1 asdf where asdf.column_from_table1 = ?;

select * from table1 table2 where column_from_table1 = ?;
table2 is working as a table alias for table1. It's not being used as the name of an object in the database at all. The fact that a table named table2 exists is wholly irrelevant to this query. Usually you'd see something like this:
select t.id, t.name from table1 t where t.column_from_table1 = ?;
Many RDBMSs also accept the optional AS keyword (Oracle is a notable exception for table aliases), so you'll also see this:
SELECT t.id, t.name FROM table1 AS t WHERE t.column_from_table1 = ?;
Table aliases are useful for making queries with multiple tables easier to write, especially if they have shared column names which need to be qualified. They're also essential for self-joins where a table is joined to itself.
Example of a join using aliases:
SELECT t1.Id,
t1.Name as t1_Name,
t2.Name as t2_Name
FROM table1 t1
JOIN table2 t2
ON t1.id = t2.id
WHERE t1.column_from_table1 = ?;
Or, for a self-join to look for duplicate Name values, for example:
SELECT t1.Name,
t1.Id,
t2.Id as Dupe_Id
FROM table1 t1
JOIN table1 t2
ON t1.Name = t2.Name
WHERE t1.Id < t2.Id;
Notice that this query is referring to table1 twice and uses the aliases of t1 and t2 to differentiate which it's referring to.
Note that a comma join, such as FROM table1, table2 WHERE table1.id = table2.id, is very old syntax that should be avoided when writing queries. The older syntax is difficult to read and maintain and doesn't support outer joins except through vendor-specific extensions. The newer syntax with the JOIN keyword was introduced in standard SQL in 1992. There's no reason to still be using comma joins.
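The alias behaviour is easy to reproduce in a scratch database. The sketch below uses Python's sqlite3 module with invented tables and rows (both are assumptions, the question does not say which engine it ran on) and compares the no-comma form, the plain form, and the comma form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (id INTEGER, other TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES (1, 'x');
""")

# "table2" here is only an alias for table1; the real table2 is never read.
aliased = conn.execute(
    "SELECT * FROM table1 table2 WHERE table2.id = 2"
).fetchall()
plain = conn.execute("SELECT * FROM table1 WHERE id = 2").fetchall()

# With a comma, both tables are read and every pairing of rows is returned.
cross = conn.execute("SELECT * FROM table1, table2").fetchall()

print(aliased, plain, len(cross))
```

The aliased and plain queries return the same single row, while the comma version returns four columns per row because it pairs table1 rows with table2 rows.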

comparing two tables to make sure they are the same row by row and column by column on SQL Server

I am comparing two tables to make sure they are the same row by row and column by column on SQL Server.
SELECT *
FROM t1, t2
WHERE t1.column1 = t2.column1 AND t1.column2 = t2.column2
AND t1.column3 = t2.column3 AND t1.column4 != t2.column4
The tables are very large, more than 100 million rows.
I got this error:
ERROR [HY000] ERROR: 9434 : Not enough memory for merge-style join
Are there better ways to do this comparison?
Thanks!

A much more efficient way of checking the row-by-row difference is to use the EXISTS operator.
Something like this:
SELECT *
FROM t1
WHERE NOT EXISTS (SELECT 1
FROM t2
WHERE t1.column1 = t2.column1
AND t1.column2 = t2.column2
AND t1.column3 = t2.column3
AND t1.column4 = t2.column4
)

You could try EXCEPT http://technet.microsoft.com/en-us/library/ms188055(v=sql.100).aspx
SELECT column1, column2, column3, column4 FROM t1
EXCEPT
SELECT column1, column2, column3, column4 FROM t2
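Note that a single EXCEPT only reports rows of t1 that are missing from t2, so a full comparison needs the query in both directions. Here is a minimal sketch of that idea, using Python's sqlite3 module as a stand-in for SQL Server (an assumption made so the example is self-contained) and invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (column1 INTEGER, column2 INTEGER, column3 INTEGER, column4 TEXT);
    CREATE TABLE t2 (column1 INTEGER, column2 INTEGER, column3 INTEGER, column4 TEXT);
    INSERT INTO t1 VALUES (1, 1, 1, 'same'), (2, 2, 2, 'old'), (3, 3, 3, 'only in t1');
    INSERT INTO t2 VALUES (1, 1, 1, 'same'), (2, 2, 2, 'new');
""")

cols = "column1, column2, column3, column4"

# Rows present in t1 but not in t2.
only_in_t1 = conn.execute(
    f"SELECT {cols} FROM t1 EXCEPT SELECT {cols} FROM t2 ORDER BY column1"
).fetchall()

# Rows present in t2 but not in t1.
only_in_t2 = conn.execute(
    f"SELECT {cols} FROM t2 EXCEPT SELECT {cols} FROM t1 ORDER BY column1"
).fetchall()

# The tables hold the same set of rows only if both differences are empty.
tables_match = not only_in_t1 and not only_in_t2
print(only_in_t1, only_in_t2, tables_match)
```

EXCEPT works on distinct rows, so this checks that the two tables contain the same set of rows; it will not notice a row that appears twice in one table and once in the other.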

What if you try an INNER JOIN (and not select all the data from both tables)?
SELECT t1.column4, t2.column4
FROM t1 INNER JOIN t2 ON t1.column1 = t2.column1 AND t1.column2 = t2.column2
AND t1.column3 = t2.column3
WHERE t1.column4 != t2.column4
Do you want to identify all the rows that are different or just identify IF there are any rows that are different?

Here's how I would do this: first, I assume you have primary keys on both tables. When you join those tables, the best way to join is using primary key fields, not all of them:
select t1.*, t2.*
from t1 join t2 on t1.id = t2.id
then you can compare those tables field by field without overloading SQL:
select t1.*, t2.*
from t1 full outer join t2 on t1.id = t2.id
where t1.field1 <> t2.field1 or t1.field2 <> t2.field2 .....
The resulting records would be mismatches.
The code I wrote here is conceptual; I haven't run it myself, so you might need to adjust it.

All of the above are good suggestions (my first try would be SELECT * FROM t1 EXCEPT SELECT * FROM t2), but you indicate they all give the same out-of-memory error. Therefore I must conclude your tables are simply too large to perform the operation all in one go. You'll have to run the query in stages, using a technique like one of those from "Equivalent of LIMIT and OFFSET for SQL Server?" I'd start with something like this:
DECLARE @offset INT = 0
SELECT TOP 50000000 *
FROM (
SELECT *,
ROW_NUMBER() over (order by column1) AS r_n_n
FROM t1
) xx
WHERE r_n_n >= @offset
EXCEPT
SELECT TOP 50000000 *
FROM (
SELECT *,
ROW_NUMBER() over (order by column1) AS r_n_n
FROM t2
) xx
WHERE r_n_n >= @offset
Then you can increment @offset by the amount of TOP n and do it again. This will likely involve some trial and error to find the limit for the TOP n clause that will run to completion rather than throw an error. I'd start with half, then try quarters, eighths, etc. as necessary.
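A related way to stage the comparison is to batch on ranges of a key column instead of on row numbers, so that each batch compares the same keys on both sides. The sketch below is a different technique from the TOP/ROW_NUMBER query above, not a translation of it. It assumes column1 is an integer key present in both tables, uses Python's sqlite3 module as a stand-in engine, and the helper name diff_in_batches and the sample data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (column1 INTEGER, column4 TEXT);
    CREATE TABLE t2 (column1 INTEGER, column4 TEXT);
""")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(i, "v") for i in range(1, 11)])
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [(i, "changed" if i == 7 else "v") for i in range(1, 11)],
)


def diff_in_batches(conn, batch_size):
    """Yield rows of t1 missing from t2, one column1 key range at a time."""
    low, high = conn.execute(
        "SELECT MIN(column1), MAX(column1) FROM t1"
    ).fetchone()
    if low is None:  # t1 is empty, nothing to compare
        return
    start = low
    while start <= high:
        end = start + batch_size - 1
        # Each EXCEPT only has to handle the rows inside one key range.
        yield from conn.execute(
            """
            SELECT column1, column4 FROM t1 WHERE column1 BETWEEN ? AND ?
            EXCEPT
            SELECT column1, column4 FROM t2 WHERE column1 BETWEEN ? AND ?
            """,
            (start, end, start, end),
        )
        start = end + 1


mismatches = list(diff_in_batches(conn, batch_size=4))
print(mismatches)
```

This only finds rows of t1 that are missing from t2; swapping the two tables in the query gives the other direction. The batch size plays the same role as the TOP n value and would need the same trial and error.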

Using subquery in an update always requires subquery in a where clause?

This is a common SQL query for me:
update table1 set col1 = (select col1 from table2 where table1.ID = table2.ID)
where exists (select 1 from table2 where table1.ID = table2.ID)
Is there any way to avoid having two nearly identical subqueries? This query is an obvious simplification, but performance suffers and the query is needlessly messy to read.

Unfortunately, Informix doesn't support a FROM clause in the UPDATE statement.
The workaround, which should also perform better, is to change the UPDATE into a MERGE statement.
This will work only if your database is version 11.50 or above:
MERGE INTO table1 as t1
USING table2 as t2
ON t1.ID = t2.ID
WHEN MATCHED THEN UPDATE set (t1.col1, t1.col2) = (t2.col1, t2.col2);
Check the IBM Informix manual for more information.

An update with an inner join can be used to avoid the subqueries (this is SQL Server syntax),
something like this:
update t1
set col1 = t2.col1
from table1 t1
inner join table2 t2
on t1.ID = t2.ID
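The join form only touches rows that have a match, which is exactly the job the WHERE EXISTS clause does in the subquery form. To see why that second subquery cannot simply be dropped, here is a small sketch using Python's sqlite3 module as a stand-in engine (an assumption, the thread is about Informix) with invented rows; the helper fresh_db is hypothetical.

```python
import sqlite3


def fresh_db():
    """Build a tiny database where only table1 row 1 has a match in table2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (ID INTEGER PRIMARY KEY, col1 TEXT);
        CREATE TABLE table2 (ID INTEGER PRIMARY KEY, col1 TEXT);
        INSERT INTO table1 VALUES (1, 'old'), (2, 'untouched');
        INSERT INTO table2 VALUES (1, 'new');
    """)
    return conn


set_clause = (
    "UPDATE table1 SET col1 = "
    "(SELECT col1 FROM table2 WHERE table1.ID = table2.ID)"
)
select_all = "SELECT ID, col1 FROM table1 ORDER BY ID"

# Without the guard, row 2 has no match, so the subquery yields NULL.
unguarded = fresh_db()
unguarded.execute(set_clause)
unguarded_rows = unguarded.execute(select_all).fetchall()

# With the guard, only rows that have a match in table2 are updated.
guarded = fresh_db()
guarded.execute(
    set_clause
    + " WHERE EXISTS (SELECT 1 FROM table2 WHERE table1.ID = table2.ID)"
)
guarded_rows = guarded.execute(select_all).fetchall()

print(unguarded_rows, guarded_rows)
```

In the unguarded run the unmatched row ends up with NULL in col1, while the guarded run leaves it alone.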

try this:
update table1 set col1 = (select col1 as newcol from table2 where table1.ID = table2.ID)
where exists (newcol)

Rewrite SQL code SELECT block to simplify logic

I am trying to rewrite this block with simpler logic if this can be done. I am using it within a larger SELECT statement and I think IF I can simplify this block, I might be able to improve performance of my query.
proj_catg_type_id, proj_catg_id and proj_id are all PKs in their tables.
select t1.proj_catg_name
from table1 t1, table2 t2, table3 t3
where t2.proj_catg_type_id = t1.proj_catg_type_id
and t2.proj_catg_type_id = 213
and t3.proj_id = t2.proj_id

Without knowing the referential integrity rules and the logic behind the tables, it is difficult to give a 100% correct answer. But just by looking at this statement, the most simplified logic would be
select t1.proj_catg_name
from table1 t1
where t1.proj_catg_type_id = 213;

select t1.proj_catg_name
from table1 t1 inner join table2 t2
on t2.proj_catg_type_id=t1.proj_catg_type_id
where t2.proj_catg_type_id=213
and t3.proj_id=t2.proj_i
maybe? is t3 used outside this subselect?

If t3 is a table outside the select you showed, then this is a correlated subquery, which you should not be using at all, ever! That turns your query into a row-by-agonizing-row cursor.
Use derived tables or joins to get the results.
You don't give me enough code to write a specific solution for your problem, but let me give you an example:
SELECT
field1
, field2
, (SELECT t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id
WHERE t4.somefield = t2.somefield)
FROM table1 t1
JOIN table4 t4 ON t1.id = t4.id
can be rewritten as:
SELECT
field1
, field2
, a.field3
FROM table1 t1
JOIN table4 t4
ON t1.id = t4.id
JOIN (SELECT t2.somefield, t3.field3
FROM table2 t2
JOIN table3 t3 ON t2.id = t3.id) a
ON t4.somefield = a.somefield
The first query runs one row at a time, which is extremely slow. The second should give the same results but runs in a set-based fashion, which is much faster. It is important to make sure the derived table has an alias. You could also use a CTE.

Best self join technique when checking for duplicates

I'm trying to optimize a production query that is taking a long time. The goal is to find duplicate records based on matching field values and then delete them. The current query uses a self join via an inner join on t1.col1 = t2.col1, then a where clause to check the values.
select * from table t1
inner join table t2 on t1.col1 = t2.col1
where t1.col2 = t2.col2 ...
What would be a better way to do this? Or is it all the same based on indexes? Maybe
select * from table t1, table t2
where t1.col1 = t2.col1 and t1.col2 = t2.col2 ...
This table has 100m+ rows.
MS SQL, SQL Server 2008 Enterprise
select distinct t2.id
from table1 t1 with (nolock)
inner join table1 t2 with (nolock) on t1.ckid=t2.ckid
left join table2 t3 on t1.cid = t3.cid and t1.typeid = t3.typeid
where
t2.id > @Max_id and
t2.timestamp > t1.timestamp and
t2.rid = 2 and
isnull(t1.col1,'') = isnull(t2.col1,'') and
isnull(t1.cid,-1) = isnull(t2.cid,-1) and
isnull(t1.rid,-1) = isnull(t2.rid,-1) and
isnull(t1.typeid,-1) = isnull(t2.typeid,-1) and
isnull(t1.cktypeid,-1) = isnull(t2.cktypeid,-1) and
isnull(t1.oid,'') = isnull(t2.oid,'') and
isnull(t1.stypeid,-1) = isnull(t2.stypeid,-1)
and (
(
t3.uniqueoid = 1
)
or
(
t3.uniqueoid is null and
isnull(t1.col1,'') = isnull(t2.col1,'') and
isnull(t1.col2,'') = isnull(t2.col2,'') and
isnull(t1.rdid,-1) = isnull(t2.rdid,-1) and
isnull(t1.stid,-1) = isnull(t2.stid,-1) and
isnull(t1.huaid,-1) = isnull(t2.huaid,-1) and
isnull(t1.lpid,-1) = isnull(t2.lpid,-1) and
isnull(t1.col3,-1) = isnull(t2.col3,-1)
)
)

Why a self join? This is an aggregation problem.
Hopefully you have an index on col1, col2, ...
--DELETE table
--WHERE KeyCol NOT IN (
select
MIN(KeyCol) AS RowToKeep,
col1, col2
from
table
GROUP BY
col1, col2
HAVING
COUNT(*) > 1
--)
However, this will take some time. Have a look at bulk delete techniques

You can use ROW_NUMBER() to find duplicate rows in one table.

The two methods you give should be equivalent. I think most SQL engines would do exactly the same thing in both cases.
And, by the way, this won't work as written. You have to require at least one field to be different, or every record will match itself.
You might want to try something more like:
select col1, col2, col3
from table
group by col1, col2, col3
having count(*)>1

For a table with 100m+ rows, using GROUP BY together with a holding table should perform better, even though it takes four queries.
STEP 1: create a holding key:
SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1
STEP 2: Push all the duplicate entries into the holddups. This is required for Step 4.
SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
STEP 3: Delete the duplicate rows from the original table.
DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2
STEP 4: Put the unique rows back in the original table. For example:
INSERT t1 SELECT * FROM holddups

To detect duplicates, you don't need to join:
SELECT col1, col2
FROM table
GROUP BY col1, col2
HAVING COUNT(*) > 1
That should be much faster.
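Since the end goal is to delete the duplicates, the same grouping can also pick the one row to keep in each group. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server (an assumption), with an invented table t and a key column named KeyCol; on a 100m+ row table the delete would still need batching, which is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (KeyCol INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    INSERT INTO t (col1, col2) VALUES
        ('a', 'x'), ('a', 'x'), ('b', 'y'), ('a', 'x'), ('b', 'z');
""")

# Detect duplicated (col1, col2) combinations without any join.
duplicates = conn.execute("""
    SELECT col1, col2, COUNT(*) FROM t
    GROUP BY col1, col2
    HAVING COUNT(*) > 1
""").fetchall()

# Keep the lowest KeyCol of every (col1, col2) group and delete the rest.
conn.execute("""
    DELETE FROM t
    WHERE KeyCol NOT IN (SELECT MIN(KeyCol) FROM t GROUP BY col1, col2)
""")
remaining = conn.execute(
    "SELECT KeyCol, col1, col2 FROM t ORDER BY KeyCol"
).fetchall()

print(duplicates, remaining)
```

GROUP BY puts NULL values into the same group, so the grouping form does not need the isnull(...) wrappers that the self join uses to make NULLs compare as equal.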

In my experience, SQL Server performance is really bad with OR conditions. Probably it is not the self join but that with table3 that causes the bad performance. But without seeing the plan, I would not be sure.
In this case, it might help to split your query into two:
One with a WHERE condition t3.uniqueoid = 1 and one with a WHERE condition for the other conditions on table3, and then use UNION ALL to append one to the other.
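As a rough illustration of the split, here is a heavily simplified version of the query (one invented filter column instead of the long isnull list) run both ways. It uses Python's sqlite3 module as a stand-in for SQL Server, which is an assumption, so it only shows that the two forms return the same rows, not that the split is faster on the real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, cid INTEGER, col2 TEXT);
    CREATE TABLE t3 (cid INTEGER, uniqueoid INTEGER);
    INSERT INTO t1 VALUES (1, 100, 'a'), (2, 200, 'a'), (3, 300, 'b'), (4, 400, 'a');
    INSERT INTO t3 VALUES (100, 1), (400, 0);
""")

# Original shape: one query with an OR across the two cases.
or_query = """
    SELECT t1.id FROM t1 LEFT JOIN t3 ON t1.cid = t3.cid
    WHERE t3.uniqueoid = 1
       OR (t3.uniqueoid IS NULL AND t1.col2 = 'a')
    ORDER BY t1.id
"""

# Split shape: one branch per case, appended with UNION ALL.
union_query = """
    SELECT id FROM (
        SELECT t1.id FROM t1 JOIN t3 ON t1.cid = t3.cid
        WHERE t3.uniqueoid = 1
        UNION ALL
        SELECT t1.id FROM t1 LEFT JOIN t3 ON t1.cid = t3.cid
        WHERE t3.uniqueoid IS NULL AND t1.col2 = 'a'
    ) ORDER BY id
"""

or_ids = [row[0] for row in conn.execute(or_query)]
union_ids = [row[0] for row in conn.execute(union_query)]
print(or_ids, union_ids)
```

UNION ALL is safe here because the two branches cannot both match the same row: uniqueoid is either 1 or NULL. If the branches could overlap, UNION ALL would return those rows twice.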
