SQL LIKE operator very slow when using a value from another table in AWS Athena

I have a SQL query in Athena that is very slow when the LIKE pattern comes from another table:
SELECT * FROM table1 t1
WHERE t1.value LIKE (
    SELECT concat('%', t2.value, '%') AS val
    FROM table2 t2
    WHERE t2.id = 1
    LIMIT 1
)
The above query is very slow.
When I use something like the query below instead, it works super fast:
SELECT * FROM table1 t1
WHERE t1.value LIKE '%somevalue%'
In my scenario the LIKE value is not fixed; it can change over time, which is why I need to take it from another table.
Please suggest the fastest way to do this.

"Slow" is a relative term, but a query that joins two tables will always be slower than a query that doesn't. A query that compares against a pattern that needs to be looked up in another table at query time will always be slower than a query that uses a static pattern.
Does that mean that the second query is slow? Perhaps, but you have to base that on what you're actually asking the query engine to do.
Let's dissect what your query is doing:
The outer query looks for all columns of all rows of the first table where one of the columns contains a particular string.
That string is dynamically looked up by scanning every row in the second table looking for a row with a particular value for the id column.
In other words, the first query scans only the first table but the second scans both tables. That's always going to be slower, because it's doing a lot more work. How much more work? That depends on the sizes of the tables. You aren't specifying the running times of any of the queries or the sizes of the tables, so it's hard to know.
You don't provide enough context in your question to answer any more precisely than this. We can only respond with generalities like: if it's slow, then don't use LIKE, that's a slow operation; don't do a correlated subquery that reads the whole second table, that's slow too.

I have found another method that does the same thing and is much faster in Athena:
SELECT * FROM table1 t1
WHERE POSITION(
    (SELECT concat('%', t2.value, '%') AS val FROM table2 t2 WHERE t2.id = 1 LIMIT 1)
    IN t1.value
) > 0
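Note that POSITION in Athena (Presto) performs a literal substring search, so the '%' characters produced by the concat are searched for as literal characters rather than acting as wildcards. If the goal is to reproduce the LIKE '%value%' behaviour, the wrappers can probably be dropped; a hedged sketch against the same tables:
SELECT * FROM table1 t1
WHERE POSITION(
    (SELECT t2.value FROM table2 t2 WHERE t2.id = 1 LIMIT 1)  -- no '%' wrappers
    IN t1.value
) > 0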

Related

Querying Table vs Querying Subquery of Same Table

I'm working with some very old legacy code, and I've seen a few queries that are structured like this
SELECT
    FieldA,
    FieldB,
    FieldC
FROM
    (
        SELECT * FROM TABLE1
    )
    LEFT JOIN TABLE2 ON...
Is there any advantage to writing a query this way?
This is in Oracle.
There would seem to be no advantage to using a subquery like this. The reason may be a historical relic of how the code evolved.
Perhaps once upon a time, there was a more complicated query there. The query was replaced by a table/view, and the author simply left the original structure.
Similarly, once upon a time, perhaps a column needed to be calculated (say for the outer query or select). This column was then included in the table/view, but the structure remained.
I'm pretty sure that Oracle is smart enough to ignore the subquery when optimizing the query. Not all databases are that smart, but you might want to clean up the code anyway. At the very least, such a subquery looks awkward.
As a basic good practice in SQL, you should not code a full scan of a table (SELECT * FROM table without a WHERE clause) unless necessary, for performance reasons.
In this case, it's not necessary: the same result can be obtained with:
SELECT
Fields
FROM
TABLE1 LEFT JOIN TABLE2 ON...

How to put more than 1 million IDs using UNION ALL [duplicate]

I have comma-delimited IDs that I want to use in a NOT IN clause.
I'm using Oracle 11g.
select * from table where ID NOT IN (1,2,3,4,...,1001,1002,...)
results in
ORA-01795: maximum number of expressions in a list is 1000
I don't want to use a temp table. I am considering doing this:
select * from table1 where ID NOT IN (1,2,3,4,…,1000) AND
ID NOT IN (1001,1002,…,2000)
Is there any other better workaround to this issue?
You said you don't want to, but: use a temporary table. That's the correct solution here.
Query parsing is expensive in Oracle, and that's what you'll get when you put thousands of identifiers into a giant blob of SQL. Also, there are ill-defined limits on query length that you're going to hit. Doing an anti-JOIN against a table, on the other hand... Oracle is good at that. Bulk loading data into a table, Oracle is good at that too. Use a temp table.
Limiting IN to a thousand entries is a sanity check. The fact that you're hitting it means you're trying to do something insane.
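For reference, a minimal sketch of the temp-table approach (the table and column names here are illustrative, not from the question):
-- One-time setup: a global temporary table to hold the IDs to exclude
CREATE GLOBAL TEMPORARY TABLE excluded_ids (id NUMBER)
    ON COMMIT PRESERVE ROWS;

-- Bulk-load the ID list (from a file, an application batch insert, etc.)
INSERT INTO excluded_ids (id) VALUES (1);
-- ... remaining IDs ...

-- Anti-join instead of a giant NOT IN list
SELECT t.*
FROM   my_table t
WHERE  NOT EXISTS (SELECT 1 FROM excluded_ids e WHERE e.id = t.id);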
Stepping outside the question: could you combine the SQL that produces those 1000+ IDs with this SQL? That would be a better way to simplify your queries.
It's insane.
But you can probably try to select from a select:
SELECT * FROM
(SELECT * FROM table WHERE ID NOT IN (1,2,3,4,...,1000))
WHERE ID NOT IN (1001,1002,…,2000)
Make as many levels as you need.
Use MINUS, the opposite of UNION:
SELECT * FROM TABLE
MINUS
SELECT T.* FROM TABLE T,TABLE2 T2 WHERE T.ID = T2.ID
This returns the rows of table T whose id is not in table2 t2.

Performance of "NOT IN" in SQL query

I'm quite new to SQL query analysis. Recently I stumbled upon a performance issue with one of the queries and I'm wondering whether my thought process is correct here and why Query Optimizer works the way it works in this case.
I'm on SQL Server 2012.
I've got a SQL query that looks like
SELECT * FROM T1
WHERE Id NOT IN
(SELECT DISTINCT T1_Id from T2);
It takes around 30 seconds to run on my test server.
While trying to understand what is taking so long I rewrote it using a temp table, like this:
SELECT DISTINCT T1_Id
INTO #temp from T2;
SELECT * FROM T1
WHERE Id NOT IN
(SELECT T1_Id from #temp);
It runs a hundred times faster than the first one.
Some info about the tables:
T2 has around 1 million rows, and there are around 1000 distinct values of T1_id there. T1 has around 1000+ rows. Initially I only had a clustered index on T2 on a column other than T1_Id, so T1_id wasn't indexed at all.
Looking at the execution plans, I saw that for the first query there were as many index scans as there are distinct T1_id values, so basically SQL Server performs about 1000 index scans in this case.
That made me realize that adding a non-clustered index on T1_id may be a good idea (the index should've been there from the start, admittedly), and adding an index indeed made the original query run much faster since now it does nonclustered index seeks.
What I'm looking for is to understand the Query optimizer behavior for the original query - does it look reasonable? Are there any ways to make it work in a way similar to the temporary table variant that I posted here rather than doing multiple scans? Am I just misunderstanding something here?
Thanks in advance for any links to the similar discussion as I haven't really found anything useful.
NOT IN is intuitive but slow. This construct will generally run quicker:
where id in
(select id from t1
except select t1_id from t2)
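Spelled out against the tables in the question, that would read something like this (a sketch, not tested against your schema):
SELECT * FROM T1
WHERE Id IN
    (SELECT Id FROM T1
     EXCEPT
     SELECT T1_Id FROM T2);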
The actual performance will likely vary from the estimates, but neither of your queries will out-perform this query, which is the de facto standard approach:
SELECT T1.* FROM T1
LEFT JOIN T2 ON T1.Id = T2.T1_Id
WHERE T2.T1_Id IS NULL
This uses a proper join, which will perform very well (assuming the foreign key column is indexed), and being a left (outer) join, the WHERE condition selects only those rows from T1 that don't join (all columns of the right-side table are NULL when the join misses).
Note also that DISTINCT is not required, since there is only ever one row returned from T1 for missed joins.
The SQL Server optimizer needs to understand the size of the tables for some of its decisions.
When doing a NOT IN with a subquery, those estimates may not be entirely accurate. When the table is actually materialized, the count would be highly accurate.
I think the first would be faster with an index on
Table2(t1_id)
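That index would look something like this (the index name is just illustrative):
CREATE NONCLUSTERED INDEX IX_T2_T1_Id ON T2 (T1_Id);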
This is just a guess, but hopefully an educated one...
The DBMS probably concluded that searching a large table a small number of times is faster than searching a small table a large number of times. That's why you had ~1000 searches on T2 instead of ~1,000,000 searches on T1.
When you added an index on T2.T1_Id, that turned ~1000 table scans (or full clustered index scans if the table is clustered) into ~1000 index seeks, which made things much faster, as you already noted.
I'm not sure why it didn't attempt a hash join (or a merge join after the index was added) - perhaps it had stale statistics and badly overestimated the number of distinct values?
One more thing: is there a FOREIGN KEY on T2.T1_Id referencing T1.Id? I know Oracle can use FKs to improve the accuracy of cost estimates (in this case, it could infer that the cardinality of T2.T1_Id cannot be greater than T1.Id). If MS SQL Server does something similar, and the FK is missing (or is untrusted), that could contribute to the MS SQL Server thinking there are more distinct values than there really are.
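If that foreign key is indeed missing, a hedged sketch of adding it (the constraint name is illustrative) would be:
-- WITH CHECK validates existing rows, so the constraint is created as trusted
ALTER TABLE T2 WITH CHECK
    ADD CONSTRAINT FK_T2_T1 FOREIGN KEY (T1_Id) REFERENCES T1 (Id);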
(BTW, it would have helped if you posted the actual query plans and the database structure.)

In which sequence are queries and sub-queries executed by the SQL engine?

Hello, I took a SQL test and I'm dubious/curious about one question:
In which sequence are queries and sub-queries executed by the SQL engine?
The answer options were:
1. primary query -> sub query -> sub sub query, and so on
2. sub sub query -> sub query -> primary query
3. the whole query is interpreted at one time
4. there is no fixed sequence of interpretation; the query parser takes a decision on the fly
I chose the last answer (just supposing that it is the most reliable compared to the others).
Now the curiosity:
Where can I read about this, and briefly, what is the mechanism behind all of this?
Thank you.
I think answer 4 is correct. There are a few considerations:
Type of subquery: is it correlated or not? Consider:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
)
Here, the subquery is not correlated to the outer query. If the number of values in t2.id is small in comparison to t1.id, it is probably most efficient to first execute the subquery, and keep the result in memory, and then scan t1 or an index on t1.id, matching against the cached values.
But if the query is:
SELECT *
FROM t1
WHERE id IN (
SELECT id
FROM t2
WHERE t2.type = t1.type
)
Here the subquery is correlated: there is no way to compute the subquery unless t1.type is known. Since the value of t1.type may vary for each row of the outer query, this subquery could be executed once for each row of the outer query.
Then again, the RDBMS may be really smart and realize there are only a few possible values for t2.type. In that case, it may still use the approach used for the uncorrelated subquery if it can guess that the cost of executing the subquery once will be cheaper than doing it for each row.
Option 4 is close.
SQL is declarative: you tell the query optimiser what you want and it works out the best (subject to time/"cost" etc) way of doing it. This may vary for outwardly identical queries and tables depending on statistics, data distribution, row counts, parallelism and god knows what else.
This means there is no fixed order, but it's not quite "on the fly" either.
Even with identical servers, schemas, queries, and data, I've seen execution plans differ.
The SQL engine tries to optimise the order in which (sub)queries are executed. The part deciding about that is called a query optimizer. The query optimizer knows how many rows are in each table, which tables have indexes and on what fields. It uses that information to decide what part to execute first.
If you want something to read up on these topics, get a copy of Inside SQL Server 2008: T-SQL Querying. It has two dedicated chapters on how queries are processed logically and physically in SQL Server.
It usually depends on your DBMS, but... I think the second answer is more plausible.
The primary query usually can't be evaluated without the subquery results.

SQL Server - Query Short-Circuiting?

Do T-SQL queries in SQL Server support short-circuiting?
For instance, I have a situation where I have two databases and I'm comparing data between two tables to match and copy some info across. In one table, the "ID" field will always have leading zeros (such as "000000001234"), and in the other table, the ID field may or may not have leading zeros (it might be "000000001234" or "1234").
So my query to match the two is something like:
select * from table1 where table1.ID LIKE '%1234'
To speed things up, I'm thinking of adding an OR before the like that just says:
table1.ID = table2.ID
to handle the case where both ID's have the padded zeros and are equal.
Will doing so speed up the query by matching items on the "=" and not evaluating the LIKE for every single row (will it short circuit and skip the LIKE)?
SQL Server does NOT short-circuit WHERE conditions.
It can't, since it's a cost-based system: How SQL Server short-circuits WHERE condition evaluation.
You could add a computed column to the table. Then, index the computed column and use that column in the join.
Ex:
Alter Table Table1 Add PaddedId As Right('000000000000' + Id, 12)
Create Index idx_WhateverIndexNameYouWant On Table1(PaddedId)
Then your query would be...
select * from table1 where table1.PaddedID ='000000001234'
This will use the index you just created to quickly return the row.
You want to make sure that at least one of the tables is using its actual data type for the IDs and that it can use an index seek if possible. Which one should be converted to the other depends on the selectivity of your query and the rate of matches, though. If you know that you have to scan through the entire first table, then you can't use a seek anyway, and you should convert that ID to the data type of the other table.
To make sure that you can use indexes, also avoid LIKE. As an example, it's much better to have:
WHERE
T1.ID = CAST(T2.ID AS VARCHAR) OR
T1.ID = RIGHT('0000000000' + CAST(T2.ID AS VARCHAR), 10)
than:
WHERE
T1.ID LIKE '%' + CAST(T2.ID AS VARCHAR)
As Steven A. Lowe mentioned, the second query might be inaccurate as well.
If you are going to be using all of the rows from T1 though (in other words a LEFT OUTER JOIN to T2) then you might be better off with:
WHERE
CAST(T1.ID AS INT) = T2.ID
Do some query plans with each method if you're not sure and see what works best.
The absolute best route to go though is as others have suggested and change the data type of the tables to match if that's at all possible. Even if you can't do it before this project is due, put it on your "to do" list for the near future.
How about:
table1WithZero.ID = REPLICATE('0', 12 - LEN(table2.ID)) + table2.ID
In this case, it should be able to use the index on table1.
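Put into a full query, that might look like this (a sketch; the 12-character width is assumed from the example ID, and the table and column names follow the question):
SELECT t1.*
FROM   table1 t1
       JOIN table2 t2
         ON t1.ID = REPLICATE('0', 12 - LEN(t2.ID)) + t2.ID;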
Just in case it's useful: as the linked page in Mladen Prajdic's answer explains, CASE clauses are short-circuit evaluated.
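A hedged sketch of exploiting that here (the column types and padding are assumptions about your schema):
SELECT t1.*
FROM   table1 t1
       JOIN table2 t2
         ON CASE
                WHEN t1.ID = t2.ID          THEN 1  -- cheap equality check first
                WHEN t1.ID LIKE '%' + t2.ID THEN 1  -- only evaluated if the equality misses
                ELSE 0
            END = 1;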
If the ID is purely numeric (as in your example), I would recommend (if possible) changing that field to a number type instead. If the database is already in use, it might be hard to change the type though.
Fix the database to be consistent.
select * from table1 where table1.ID LIKE '%1234'
will match '1234', '01234', '00000000001234', but also '999991234'. Using LIKE pretty much guarantees an index scan (assuming table1.ID is indexed!). Cleaning up the data will improve performance significantly.
if cleaning up the data is not possible, write a user-defined function (UDF) to strip off leading zeros, e.g.
select * from table1 where dbo.udfStripLeadingZeros(table1.ID) = '1234'
This may not improve performance (since the function will have to run for each row), but it will eliminate false matches and make the intent of the query more obvious.
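A minimal sketch of such a UDF (the name matches the example above; the implementation is just one possible approach):
CREATE FUNCTION dbo.udfStripLeadingZeros (@value VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
    -- PATINDEX finds the first character that is not '0'; the appended '.'
    -- guards against an input that is all zeros (returning '' in that case).
    RETURN SUBSTRING(@value, PATINDEX('%[^0]%', @value + '.'), LEN(@value));
END;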
EDIT: Tom H's suggestion to CAST to an integer would be best, if that is possible.