SQL SELECT condition performance

I have a table 'Tab' with data such as:
id | value
---------------
1 | Germany
2 | Argentina
3 | Brasil
4 | Holland
Which way of selecting is better for performance?
1. SELECT * FROM Tab WHERE value IN ('Argentina', 'Holland')
or
2. SELECT * FROM Tab WHERE id IN (2, 4)
I suppose that the second select would be faster, because an int comparison is faster than a string comparison. Is that true for MS SQL?

This is premature optimization. The comparison between integers and strings generally has a minimal impact on query performance. The drivers of query performance are more along the lines of table sizes, query plans, available memory, and competition for resources.
In general, it is a good idea to have indexes on the columns used for either comparison. The first column looks like a primary key, so it automatically gets an index. The string column should have an index built on it. In general, indexes built on an integer column will have marginally better performance compared to indexes built on variable-length string columns. However, this type of performance difference really matters only in environments with very high levels of transactions (think thousands of data modification operations per second).
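For the sample table above, a minimal sketch of the supporting index on the string column (the index name is illustrative; id presumably already gets an index via its primary key):
CREATE INDEX IX_Tab_Value ON Tab([value]);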
You should use the logic that best fits the application and worry about other aspects of the code.

To answer the simple question: yes, option 2, SELECT * FROM Tab WHERE id IN (2, 4), would be faster because, as you said, int comparison is faster than string comparison.
One way to speed it up further is to add indexes to your columns, which helps with evaluation, filtering, and the final retrieval of results.
If this table were to grow much larger, you should also not SELECT * but SELECT id, value; otherwise you may be pulling back more data than you need.
You can also speed up your queries by adding WITH (NOLOCK), as the speed of your query might be affected by other sessions accessing the tables at the same time, for example SELECT * FROM Tab WITH (NOLOCK) WHERE id IN (2, 4). As mentioned below, though, NOLOCK is not a turbo button and should only be used in appropriate situations.

Related

Is it better to use parameter or column value when copying data from one table to another?

I have a SQL statement to copy records from one table to another:
INSERT INTO [deletedItems] (
[id],
[shopId])
SELECT
[id],
[shopId]
FROM [items]
WHERE shopId = #ShopId
#ShopId is a parameter provided to the sql command when calling the db from my application code.
Will it make the statement perform better if I change it to use the provided parameter directly, so SQL Server does not have to include the shopId column from the items table in the projection?
INSERT INTO [deletedItems](
[id],
[shopId])
SELECT
[id],
#ShopId
FROM [items]
WHERE shopId = #ShopId
Intuition is telling me yes, but at the same time I would expect SQL Server to optimize the execution plan of the first query, omit the projection of the shopId column anyway (because the value will be the same for all records), and use a constant value instead.
I would expect SQL Server to optimize the execution plan of the first query and omit the projection of the shopId column anyway (because the value will be the same for all records) and use a constant value instead.
No, SQL Server does not do this. You can verify this by looking at the execution plan and the "output columns" for the operator accessing items.
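One hedged way to check this yourself, shown as a sketch only (the application parameter is replaced with a hypothetical literal so the estimated plan can be captured):
SET SHOWPLAN_XML ON;
GO
INSERT INTO [deletedItems] ([id], [shopId])
SELECT [id], [shopId]
FROM [items]
WHERE shopId = 'A123';  -- hypothetical value standing in for #ShopId
GO
SET SHOWPLAN_XML OFF;
GO
In the captured plan, the output list of the operator reading [items] should still contain shopId.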
In the general case this is not a safe transformation and can lead to lost information. For example if the source matches the rows
+--------+
| ShopId |
+--------+
| A123 |
| a123 |
+--------+
Then, under a case-insensitive collation, both rows would match the same predicate and should be inserted, yet they are different values.
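A minimal repro of that point, assuming a case-insensitive collation (the temp table name and collation are illustrative):
CREATE TABLE #Shops (ShopId VARCHAR(10) COLLATE Latin1_General_CI_AS);
INSERT INTO #Shops VALUES ('A123'), ('a123');
-- both rows match, but projecting the parameter instead of the column would store 'A123' twice
SELECT ShopId FROM #Shops WHERE ShopId = 'A123';
DROP TABLE #Shops;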
If one of the following applies:
You are using a datatype where this is not possible
You know that this is not an issue in your data - e.g. check constraints ensure all data is stored trimmed and upper case
You are happy for a canonical representation to be used for all rows if it is an issue
then it is possible to come up with convoluted scenarios where your manual optimisation makes sense, as below.
CREATE TABLE #T(X INT IDENTITY, Y CHAR(4000));
INSERT INTO #T
SELECT TOP 1000000 REPLICATE('A',4000)
FROM sys.all_objects o1, sys.all_objects o2
SELECT X, Y
FROM #T
WHERE Y = REPLICATE('A',4000)
ORDER BY X
SELECT X, REPLICATE('A',4000) AS Y
FROM #T
WHERE Y = REPLICATE('A',4000)
ORDER BY X
The size of the rows going into the sort operator is much bigger in the first case as it includes the large string column and the sort spills to tempdb. The query execution takes substantially longer as a result. The memory grant request for the second query is the same as that of the first as it does not take into account that the column is computed after the sort but there is less data to sort and it does not spill. On versions of SQL Server where adaptive memory grant feedback is available the excessive grant would be corrected if the query is executed repeatedly.
In most real-world scenarios I doubt the manual optimisation will make any practical difference, however, so you should choose whichever one does what you need and you feel is clearer, and concentrate optimisation efforts in more promising areas (for me, the second one makes it clearer that the same value will be inserted in all rows).
I don't expect any performance differences. The slow part will be finding the correct items by #ShopId, or the IO operations.
What can improve your query performance is having an index on the [ShopId] column, with ID as the primary key or as an included column.
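A sketch of such an index (the index name is illustrative; the INCLUDE is only needed if id is not already the clustered key):
CREATE INDEX IX_items_shopId ON [items] (shopId) INCLUDE (id);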
Will it make the statement perform better if I change it to use the provided parameter directly
It's the same, because the WHERE clause restricts the result to a single shopId value anyway.
INSERT INTO [deletedItems] (
[id],
[shopId])
SELECT
[id],
[shopId]
FROM [items]
WHERE shopId = #ShopId -- this condition makes the `shopId` value a single constant for all returned rows
Two important points.
The calculation of scalar expressions in the SELECT (generally) has little impact on query performance. The performance is determined by data movement.
So, selecting a "constant" versus selecting a column from a table is immaterial.
Second, if you care about performance, you need to be very careful about query plans. Either force the use of an index or be sure that the query gets recompiled periodically as the data changes in your tables.
In particular, you want to be sure that the query uses an index on items(shopId) if the table spans multiple data pages.
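Two hedged sketches of those options, reusing the statement from the question; the index name IX_items_shopId is an assumption, and #ShopId remains the application-side placeholder:
INSERT INTO [deletedItems] ([id], [shopId])
SELECT [id], [shopId]
FROM [items] WITH (INDEX (IX_items_shopId))  -- force the assumed index
WHERE shopId = #ShopId

INSERT INTO [deletedItems] ([id], [shopId])
SELECT [id], [shopId]
FROM [items]
WHERE shopId = #ShopId
OPTION (RECOMPILE)  -- or re-optimize the plan on each execution instead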

How can I improve the speed of a SQL query searching for a collection of strings

I have a table called T_TICKET with a column CallId varchar(30).
Here is an example of my data:
CallId | RelatedData
===========================================
MXZ_SQzfGMCPzUA | 0000
MXyQq6wQ7gVhzUA | 0001
MXwZN_d5krgjzUA | 0002
MXw1YXo7JOeRzUA | 0000
...
I am attempting to find records that match a collection of CallId's. Something like this:
SELECT * FROM T_TICKET WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
And I have anywhere from 200 - 300 CallId's that I am looking up at a time using this query. The query takes around 35 seconds to run. Is there anything I can do to either the table structure, the column type, the index, or the query itself to improve the performance of this query?
There are around 300,000 rows in T_TICKET currently. CallId is not unique, and RelatedData is not unique. I also have a non-clustered index on CallId.
I know the basics of SQL, but I'm not a pro. Some things I've thought of doing are:
Change the type of CallId from varchar to char.
Shorten the length of CallId (its length is 30, but in reality, right now, I am using only 15 bytes).
I have not tried any of these yet because it requires changes to live production data. And, I am not sure they would make a significant improvement.
Would either of these options make a significant improvement? Or, are there other things I could do to make this perform faster?
First, be sure that the types are the same -- either VARCHAR() or NVARCHAR(). Then, add an index:
create index idx_t_ticket_callid on t_ticket(callid);
If the types are compatible, SQL Server should make use of the index.
Your table is what we call a heap (a table without a clustered index). This kind of table is only good for data loading and/or as a staging table. I would recommend you convert your table to have a clustered key. A good clustering key should be unique, static, narrow, non-nullable, and ever-increasing (e.g. an int/bigint identity column).
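A minimal sketch of one way to do that, assuming an identity column is acceptable as the clustering key (the column and constraint names are illustrative):
ALTER TABLE T_TICKET ADD TicketId INT IDENTITY(1,1) NOT NULL;
ALTER TABLE T_TICKET ADD CONSTRAINT PK_T_TICKET PRIMARY KEY CLUSTERED (TicketId);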
Another downside of a heap is that lots of UPDATEs/DELETEs on the table will slow down your SELECTs because of forwarded records. Quoting Paul Randal on forwarded records:
If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here – it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient than an equivalent clustered index.
Lastly, make sure you list all the columns you need in your SELECT and avoid SELECT *. I'm guessing you are getting a table scan when you execute the query. What you can do is INCLUDE the columns from your SELECT list in your index, like this:
CREATE INDEX [IX_T_TICKET_CallId_INCLUDE] ON [T_TICKET] ([CallId]) INCLUDE ([RelatedData]) WITH (DROP_EXISTING=ON)
It turns out there is in fact a way to significantly optimize my query without changing any data types.
This query:
SELECT * FROM T_TICKET
WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
is using NVARCHAR literals as the input parameters (N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', ...). As I specified in my question, CallId is VARCHAR. SQL Server was converting CallId in every row of the table to NVARCHAR to do the comparison, which was taking a long time (even though I have an index on CallId).
I was able to optimize it by simply not passing the parameters as NVARCHAR (dropping the N prefix):
SELECT * FROM T_TICKET
WHERE CALLID IN('MXZInrBl1DCnzUA', 'MXZ0TWkUhHprzUA', 'MXZ_SQzfGMCPzUA', ... ,'MXyQq6wQ7gVhzUA')
Now, instead of taking over 30 seconds to run, it only takes around 0.03 seconds. Thanks for all the input.
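If these lookups are parameterized from application code, the same point applies there. A sketch, with illustrative variable names, declaring the parameters as VARCHAR so no per-row conversion of CallId is needed:
DECLARE @CallId1 VARCHAR(30) = 'MXZInrBl1DCnzUA';
DECLARE @CallId2 VARCHAR(30) = 'MXZ0TWkUhHprzUA';
SELECT CallId, RelatedData
FROM T_TICKET
WHERE CallId IN (@CallId1, @CallId2);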

Difference between WHERE column = '' and WHERE column LIKE '' in SQL [duplicate]

This question skirts around what I'm wondering, but the answers don't exactly address it.
It would seem that in general '=' is faster than 'like' when wildcards are involved. This appears to be the conventional wisdom. However, let's suppose I have a column containing a limited number of different fixed, hardcoded, varchar identifiers, and I want to select all rows matching one of them:
select * from table where value like 'abc%'
and
select * from table where value = 'abcdefghijklmn'
'Like' should only need to test the first three chars to find a match, whereas '=' must compare the entire string. In this case it would seem to me that 'like' would have an advantage, all other things being equal.
This is intended as a general, academic question, and so should not matter which DB, but it arose using SQL Server 2005.
See https://web.archive.org/web/20150209022016/http://myitforum.com/cs2/blogs/jnelson/archive/2007/11/16/108354.aspx
Quote from there:
The rules for index usage with LIKE are loosely like this:
If your filter criteria uses equals (=) and the field is indexed, then most likely it will use an INDEX/CLUSTERED INDEX SEEK.
If your filter criteria uses LIKE, with no wildcards (like if you had a parameter in a web report that COULD have a % but you instead use the full string), it is about as likely as #1 to use the index. The increased cost is almost nothing.
If your filter criteria uses LIKE, but with a wildcard at the beginning (as in Name0 LIKE '%UTER'), it's much less likely to use the index, but it still may at least perform an INDEX SCAN on a full or partial range of the index.
HOWEVER, if your filter criteria uses LIKE, but starts with a STRING FIRST and has wildcards somewhere AFTER that (as in Name0 LIKE 'COMP%ER'), then SQL may just use an INDEX SEEK to quickly find rows that have the same first starting characters, and then look through those rows for an exact match.
(Also keep in mind, the SQL engine still might not use an index the way you're expecting, depending on what else is going on in your query and what tables you're joining to. The SQL engine reserves the right to rewrite your query a little to get the data in a way that it thinks is most efficient, and that may include an INDEX SCAN instead of an INDEX SEEK.)
It's a measurable difference.
Run the following:
Create Table #TempTester (id int, col1 varchar(20), value varchar(20))
go
INSERT INTO #TempTester (id, col1, value)
VALUES
(1, 'this is #1', 'abcdefghij')
GO
INSERT INTO #TempTester (id, col1, value)
VALUES
(2, 'this is #2', 'foob'),
(3, 'this is #3', 'abdefghic'),
(4, 'this is #4', 'other'),
(5, 'this is #5', 'zyx'),
(6, 'this is #6', 'zyx'),
(7, 'this is #7', 'zyx'),
(8, 'this is #8', 'klm'),
(9, 'this is #9', 'klm'),
(10, 'this is #10', 'zyx')
GO 10000
CREATE CLUSTERED INDEX ixId ON #TempTester(id)
CREATE NONCLUSTERED INDEX ixTesting ON #TempTester(value)
Then:
SET SHOWPLAN_XML ON
Then:
SELECT * FROM #TempTester WHERE value LIKE 'abc%'
SELECT * FROM #TempTester WHERE value = 'abcdefghij'
The resulting execution plan shows you that the cost of the first operation, the LIKE comparison, is about 10 times more expensive than the = comparison.
If you can use an = comparison, please do so.
You should also keep in mind that when using like, some sql flavors will ignore indexes, and that will kill performance. This is especially true if you don't use the "starts with" pattern like your example.
You should really look at the execution plan for the query and see what it's doing, guess as little as possible.
That being said, the "starts with" pattern can be and is optimized in SQL Server; it will use the table's index. EF 4.0 switched to LIKE for StartsWith for this very reason.
If value is unindexed, both result in a table-scan. The performance difference in this scenario will be negligible.
If value is indexed, as Daniel points out in his comment, the = will result in an index lookup which is O(log N) performance. The LIKE will (most likely - depending on how selective it is) result in a partial scan of the index >= 'abc' and < 'abd' which will require more effort than the =.
Note that I'm talking SQL Server here - not all DBMSs will be nice with LIKE.
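To make the partial-scan point above concrete: a trailing-wildcard LIKE behaves roughly like a range predicate over the same index (a sketch against the #TempTester table from the earlier answer, not what the optimizer literally emits):
SELECT * FROM #TempTester WHERE value LIKE 'abc%'
-- is satisfied by roughly the same index range as:
SELECT * FROM #TempTester WHERE value >= 'abc' AND value < 'abd'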
You are asking the wrong question. In databases it is not the operator performance that matters; it is always the SARGability of the expression and the coverability of the overall query. The performance of the operator itself is largely irrelevant.
So, how do LIKE and = compare in terms of SARGability? LIKE, when used with an expression that does not start with a constant (e.g. LIKE '%something'), is by definition non-SARGable. But does that make = or LIKE 'something%' SARGable? No. As with any question about SQL performance, the answer does not lie with the text of the query, but with the schema deployed. These expressions may be SARGable if an index exists to satisfy them.
So, truth be told, there are small differences between = and LIKE. But asking whether one operator or the other is 'faster' in SQL is like asking 'What goes faster, a red car or a blue car?'. You should be asking questions about the engine size and vehicle weight, not about the colour... To approach questions about optimizing relational tables, the place to look is your indexes and your expressions in the WHERE clause (and other clauses, but it usually starts with the WHERE).
A personal example using MySQL 5.5: I had an inner join between two tables, one of 3 million rows and one of 10 thousand rows.
When using LIKE on an index as below (no wildcards), it took about 30 seconds:
where login like '12345678'
Using 'explain' I get: (EXPLAIN output screenshot not reproduced here)
When using '=' on the same query, it took about 0.1 seconds:
where login = '12345678'
Using 'explain' I get: (EXPLAIN output screenshot not reproduced here)
As you can see, the LIKE completely cancelled the index seek, so the query took 300 times longer.
= is much faster than LIKE, even without a wildcard. I tested on MySQL with 11 GB of data and more than 100 million records; the f_time column is indexed.
SELECT * FROM XXXXX WHERE f_time = '1621442261'
# took 0.00 sec and returned 330 records
SELECT * FROM XXXXX WHERE f_time LIKE '1621442261'
# took 44.71 sec and returned 330 records
Besides all the answers, there is this to consider:
'like' is case insensitive, so every character needs to be compared twice, whereas the '=' only compares once for identical characters.
This issue arises with or without indexes.
Maybe you are looking for Full-Text Search.
In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data. A LIKE query against millions of rows of text data can take minutes to return; whereas a full-text query can take only seconds or less against the same data, depending on the number of rows that are returned.
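A hedged sketch of what the full-text route can look like in SQL Server (the catalog, table, column, and key-index names here are illustrative, not taken from the question):
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Articles(Body) KEY INDEX PK_Articles;
SELECT * FROM dbo.Articles WHERE CONTAINS(Body, 'identifier');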
I was working with a huge database that has more than 400M records, and I put LIKE in the search query. Here are the final results.
There were three tables: tb1, tb2 and tb3. When I used EQUAL (=) for all tables in the query, the response time was 193 ms. When I put LIKE on one of the tables, the response time was 19.22 seconds, and with LIKE on all tables it was 112 seconds.

String Concat Vs Substring in Queries - Which has better performance?

I have 2 tables, say table1 and table2 with sample data as below:
Table1 (User_id)
--------------------
X1011
X1175
X1234
Table2 (User_id)
-----------------
1011
1175
1234
I need to write a query with a WHERE condition in which I compare these two values. Which of the following, in general, would be better/advisable, and why?
1. WHERE TABLE1.USER_ID = 'X' || TABLE2.USER_ID;
2. WHERE TABLE1.USER_ID = CONCAT ('X', TABLE2.USER_ID);
3. WHERE SUBSTR(TABLE1.USER_ID,2) = TABLE2.USER_ID;
Both columns are indexed.
The way to answer a performance question is to test the different options on your data and on your system.
I wouldn't expect the performance of these to be radically different, except for the impact on the execution plan. When you wrap a column in a function, then that affects the execution plan. First it affects the use of indexes and second it affects the statistics used for choosing various underlying algorithms. The actual execution of functions would (in all likelihood) have minimal impact.
I would suggest that you create a functional index. For instance, using the third example:
create index idx_table1_f1 on table1(substr(user_id, 2));
Or for the second example:
create index idx_table2_f1 on table2(CONCAT('X', USER_ID));
Apart from fixing your data structure so the keys really are the same thing, this is probably the best step you can take to improve performance.
Examples 1 and 2 are equivalent. Choosing between 1 and 3 depends on which table is the driving table and which is the driven table in the join (if you are going to use a join). In any case, giving the actual query you are going to use and, at least, the row counts for these tables would help in answering.
And, well, you may try to use 1 and 3 together, so the optimizer can choose the best access path according to the statistics.

What are the performance implications of Oracle IN Clause with no joins?

I have a query in this form that will on average take ~100 IN-clause elements, and at some rare times more than 1000 elements. If there are more than 1000 elements, we chunk the IN clause down to 1000 (an Oracle maximum).
The SQL is in the form of
SELECT * FROM tab WHERE PrimaryKeyID IN (1,2,3,4,5,...)
The tables I am selecting from are huge and will contain millions more rows than what is in my IN clause. My concern is that the optimizer may elect to do a table scan (our database does not have up-to-date statistics - yeah, I know...).
Is there a hint I can pass to force the use of the primary key - WITHOUT knowing the index name of the primary Key, perhaps something like ... /*+ DO_NOT_TABLE_SCAN */?
Are there any creative approaches to pulling back the data such that
We perform the least number of round-trips
We read the least number of blocks (at the logical IO level?)
Will this be faster ..
SELECT * FROM tab WHERE PrimaryKeyID = 1
UNION
SELECT * FROM tab WHERE PrimaryKeyID = 2
UNION
SELECT * FROM tab WHERE PrimaryKeyID = 3
UNION ....
If the statistics on your table are accurate, it should be very unlikely that the optimizer would choose to do a table scan rather than using the primary key index when you only have 1000 hard-coded elements in the WHERE clause. The best approach would be to gather (or set) accurate statistics on your objects since that should cause good things to happen automatically rather than trying to do a lot of gymnastics in order to work around incorrect statistics.
If we assume that the statistics are inaccurate to the degree that the optimizer would be lead to believe that a table scan would be more efficient than using the primary key index, you could potentially add in a DYNAMIC_SAMPLING hint that would force the optimizer to gather more accurate statistics before optimizing the statement or a CARDINALITY hint to override the optimizer's default cardinality estimate. Neither of those would require knowing anything about the available indexes, it would just require knowing the table alias (or name if there is no alias). DYNAMIC_SAMPLING would be the safer, more robust approach but it would add time to the parsing step.
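Hedged sketches of the two hints mentioned, assuming a table alias of t (the sampling level and the cardinality figure are illustrative):
SELECT /*+ DYNAMIC_SAMPLING(t 4) */ *
FROM tab t
WHERE PrimaryKeyID IN (1, 2, 3, 4, 5);

SELECT /*+ CARDINALITY(t 1000) */ *
FROM tab t
WHERE PrimaryKeyID IN (1, 2, 3, 4, 5);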
If you are building up a SQL statement with a variable number of hard-coded parameters in an IN clause, you're likely going to be creating performance problems for yourself by flooding your shared pool with non-sharable SQL and forcing the database to spend a lot of time hard parsing each variant separately. It would be much more efficient if you created a single sharable SQL statement that could be parsed once. Depending on where your IN clause values are coming from, that might look something like
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM global_temporary_table);
or
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM TABLE( nested_table ));
or
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM some_other_source);
If you got yourself down to a single sharable SQL statement, then in addition to avoiding the cost of constantly re-parsing the statement, you'd have a number of options for forcing a particular plan that don't involve modifying the SQL statement. Different versions of Oracle have different options for plan stability-- there are stored outlines, SQL plan management, and SQL profiles among other technologies depending on your release. You can use these to force particular plans for particular SQL statements. If you keep generating new SQL statements that have to be re-parsed, however, it becomes very difficult to use these technologies.
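For completeness, a sketch of the global-temporary-table variant from the first option above (the table and column names are illustrative):
CREATE GLOBAL TEMPORARY TABLE gtt_ids (primary_key NUMBER) ON COMMIT PRESERVE ROWS;
INSERT INTO gtt_ids VALUES (1);
INSERT INTO gtt_ids VALUES (2);
SELECT * FROM tab WHERE PrimaryKeyID IN (SELECT primary_key FROM gtt_ids);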