String Concat Vs Substring in Queries - Which has better performance?

I have 2 tables, say table1 and table2 with sample data as below:
Table1 (User_id)
--------------------
X1011
X1175
X1234
Table2 (User_id)
-----------------
1011
1175
1234
I need to write a query with a WHERE condition comparing these two values. Which of the following would, in general, be better/advisable, and why?
1. WHERE TABLE1.USER_ID = 'X' || TABLE2.USER_ID;
2. WHERE TABLE1.USER_ID = CONCAT ('X', TABLE2.USER_ID);
3. WHERE SUBSTR(TABLE1.USER_ID,2) = TABLE2.USER_ID;
Both columns are indexed.

The way to answer a performance question is to test the different options on your data and on your system.
I wouldn't expect the performance of these to be radically different, except for the impact on the execution plan. When you wrap a column in a function, that affects the execution plan: first, it affects the use of indexes, and second, it affects the statistics used for choosing among the underlying algorithms. The actual execution of the functions would (in all likelihood) have minimal impact.
I would suggest that you create a functional index. For instance, using the third example:
create index idx_table1_f1 on table1(substr(user_id, 2));
Or for the second example:
create index idx_table2_f1 on table2(CONCAT('X', USER_ID));
Apart from fixing your data structure so the keys really are the same thing, this is probably the best step you can take to improve performance.
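With such an index in place, the matching predicate can be resolved through the index. A minimal sketch (the aliases are illustrative):
-- with idx_table1_f1 in place, this join predicate can be satisfied
-- via the function-based index on table1 rather than a full scan
SELECT t1.user_id, t2.user_id
FROM table1 t1, table2 t2
WHERE SUBSTR(t1.user_id, 2) = t2.user_id;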

Examples 1 and 2 are equivalent. Choosing between 1 and 3 depends on which table drives the join and which is driven (if you are going to use a join). In any case, providing the actual query you are going to use and, at the very least, the row counts for these tables would help in giving you an answer.
You may also try using 1 and 3 together, so the optimizer can choose the best access path according to the statistics.
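As an illustration, a minimal sketch of supplying both forms of the predicate (the redundancy is intentional; names follow the question):
-- both predicates are logically equivalent, so the optimizer can pick
-- whichever one matches an available index on either table
WHERE TABLE1.USER_ID = 'X' || TABLE2.USER_ID
  AND SUBSTR(TABLE1.USER_ID, 2) = TABLE2.USER_ID;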

Related

Sybase ASE 15: Creating an index on multiple columns whose values are unequally discriminating

I have a table mytable of 5 million records and a query that looks like
select *
from mytable
where column1 = 'value1' and column2 = 'value2' and column3 = 'value3'
So I thought about creating an index based on the 3 columns, but my problem is that I have no best column to put in the first position of the index, because no column is really discriminating compared to the others.
Therefore I would like to build something similar to a hash table, with a hash code based on these 3 columns. I tried a function-based index based on the concatenation of those 3 columns, but it took so long to create that I never got it built, and I believe it's the wrong way to achieve what I want. What is the correct way to achieve this?
Just create an index with three columns:
create index idx_mytable_col1_col2_col3 on mytable(col1, col2, col3)
You have equality comparisons. The order of the columns in the index does not matter in this case.
Let the database do the work for you.
ASE's indexes are generally stored as b-trees, and while there's some hashing 'magic' that takes place during index searching, there's still a bit of traversal/searching required. If the first column of an index is not very selective, you can see some degradation in index search performance compared to an index where the more selective column(s) are listed first; the difference in performance really depends on the selectivity of the column(s) in question and the sheer size of the index (i.e. the number of index levels and pages that have to be read/processed).
If you're running ASE 15.0.3+, and you're running ASE on linux, you may want to take a look at virtually-hashed tables. In a nutshell ... ASE stores the PK index as a hash instead of the normal b-tree, with the net result being that index search times are reduced. There are quite a few requirements/restrictions on virtually-hashed tables so I suggest you take a look at your Transact-SQL User's Guide for more details.
Obviously (?) there's a good bit more to table/index design than can be discussed here; certainly not something that can be addressed by looking at a single, generic query. ("Duh, Mark!" ?)

SQL select condition performance

I have a table 'Tab' with data such as:
id | value
---------------
1 | Germany
2 | Argentina
3 | Brasil
4 | Holland
Which way of selecting is better for performance?
1. SELECT * FROM Tab WHERE value IN ('Argentina', 'Holland')
or
2. SELECT * FROM Tab WHERE id IN (2, 4)
I suppose the second select would be faster, because int comparison is faster than string comparison. Is that true for MS SQL?
This is a premature optimization. The comparison between integers and strings is generally going to have a minimal impact on query performance. The drivers of query performance are more along the lines of tables sizes, query plans, available memory, and competition for resources.
In general, it is a good idea to have indexes on the columns used in either comparison. The first column looks like a primary key, so it automatically gets an index. The string column should have an index built on it. In general, indexes built on an integer column will perform marginally better than indexes built on variable-length string columns. However, this type of performance difference really matters only in environments with very high transaction levels (think thousands of data modification operations per second).
You should use the logic that best fits the application and worry about other aspects of the code.
To answer the simple question: yes, option 2, SELECT * FROM Tab WHERE id IN (2, 4), would be faster because, as you said, int comparison is faster.
One way to speed it up is to add indexes to your columns to speed up evaluation, filtering, and the final retrieval of results.
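For example, a minimal sketch for SQL Server (the index name is hypothetical; id is presumably already indexed as the primary key):
-- index the string column so filters on value can seek rather than scan
CREATE INDEX IX_Tab_value ON Tab (value);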
If this table were to grow even more, you should also not SELECT * but SELECT id, value; otherwise you may be pulling more data than you need.
You can also speed up your queries by adding WITH(NOLOCK), as the speed of your query might be affected by other sessions accessing the tables at the same time. For example: SELECT * FROM Tab WITH(NOLOCK) WHERE id IN (2, 4). Note, though, that NOLOCK is not a turbo button and should only be used in appropriate situations.

Join Performance Issue in Oracle

We have 2 tables:
Table XYZ -> has over 189 M records.
Table ABC -> has only 1098 records.
Our join query is somewhat like:
select a.a, a.b, a.c
from xyz a , ABC r
where a.d = r.d
and a.sub not like '0%'
and ((a.eff_dat < sysdate) or (a.eff_dat is null))
This is how our query performs. Can it be optimised in any way to run faster?
Apart from removing the NOT LIKE, can you suggest any other method?
In the explain plan I have seen that it iterates over the 189 M rows, checking each against the 1098 records, which is what takes the time.
I swapped the table order after the FROM keyword, but that did not help.
I tried a LEADING hint, which also did not serve the purpose.
The a.d column is indexed, and that index is also used in the hint.
Please suggest any methods for optimisation.
When you have multiple predicates on a table such as:
a.sub not like '0%'
and ((a.eff_dat < sysdate) or (a.eff_dat is null))
... it is rather unlikely that the optimiser will accurately estimate the cardinality of the result set unless you use dynamic sampling, so check the explain plan to see whether:
Dynamic sampling is being invoked.
The estimates of the cardinalities are correct.
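A minimal sketch of checking both points (the sampling level 4 is an arbitrary example):
-- request dynamic sampling explicitly, then inspect the plan and its
-- cardinality estimates
EXPLAIN PLAN FOR
SELECT /*+ DYNAMIC_SAMPLING(a 4) */ a.a, a.b, a.c
FROM xyz a, abc r
WHERE a.d = r.d
  AND a.sub NOT LIKE '0%'
  AND (a.eff_dat < SYSDATE OR a.eff_dat IS NULL);
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);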
If the predicates are not very selective -- if they do not eliminate something in the order of 90% of the rows in the table -- then it is unlikely that an index will be of help in finding the rows, and a full scan (with partition pruning if the table is partitioned in a way that supports it) is likely to be the best access path.
I'd be reasonably sure that if there is a foreign key between the tables (i.e. that all values of a.d exist in r.d) then the best access path is going to be a full scan of XYZ with a hash join to ABC.
By the way, you mention hints but do not include them in the question. It's also unhelpful to hide the purpose of the tables with fake names, as the names often give valuable clues about the type of data and distribution of values within the data sets.
It would seem most of the cost would be in the (presumed) full table scan on the large table. I would suggest rewriting your WHERE condition as follows:
SELECT * FROM XYZ A
WHERE SUBSTR(A.SUB, 1, 1) <> '0'
AND NVL(A.EFF_DAT, TO_DATE('01-01-0001', 'MM-DD-YYYY')) < SYSDATE ;
And then create a function-based index that includes all the relevant columns:
CREATE INDEX IX_XYZ1 ON
XYZ(NVL(EFF_DAT, TO_DATE('01-01-0001', 'MM-DD-YYYY')), SUB, D);
Make sure the new index is being picked up by the cost-based optimizer, by checking the execution plan.
LIKE, NOT LIKE and the OR operator are some of the worst things you can use in a WHERE condition.

Oracle: How can I find tablespace fragmentation?

I have a JOIN between two tables. It's really, really slow and I can't find out why.
The query takes hours in a PRODUCTION environment at a very big client.
Ask me for whatever you need to understand why it doesn't work well.
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand records, because of the following condition:
f.eif_campo1 != c.fornitura AND f.field29 = 'New'
In fact it should always hold across all 18 million records.
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the relevant columns, expressed as the number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan (attached as an image in the original post).
Extra details:
I have relaxed one condition to see how many records are returned: 300 thousand.
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395.358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';
Your explain plan shows the following. The database uses an index to retrieve rows from ENI_FLUSSI_HUB where flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC'), and it then winnows the rows where flh_flag_ann = 'N'. This produces a result set which is used to access rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id = c.idde_identif_dati_ext_id. Finally those rows are filtered on the basis of the remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective column: that is, if it contains hundreds of different values, or if the values in your list are relatively rare. It might even be a good path if the flag column identifies relatively few rows with a value of 'N'. So you need to understand both the distribution of your data (how many distinct values you have) and its skew (which values appear very often or hardly at all). The overall performance suggests that the distribution and/or skew of the flh_tipo_processo_cod and flh_flag_ann columns is not good.
So what can you do? One approach is to follow Ben's suggestion and use full table scans. If you have an Enterprise Edition licence and plenty of CPU capacity, you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on eni_flussi_hub(flh_tipo_processo_cod, flh_flag_ann, idde_identif_dati_ext_id, flh_fornitura, flh_id_messaggio) would avoid the need to read that table at all. Whether this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other activity against the table. You might be able to benefit from index compression.
All the columns in the query projection are referenced in the WHERE clause, so you could also use a composite index on the other table to avoid table reads. Again, you need to understand the distribution and skew of the data, but you should probably lead with the least selective columns. Something like etl_elab_interf_flat(etl_elab_interf_flat,eif_campo200,dde_identif_dati_ext_id,eif_campo1,eif_campo198). This would probably be a new index; it's unlikely you would want to replace ETL_EIF_FK_IDX4 with it (especially if that really is an index on a foreign key constraint).
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.
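For reference, a minimal sketch of enabling the 10053 trace for a session (the resulting trace file appears in the session's trace directory):
-- turn the optimizer trace on, run the problem query, then turn it off
ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';
-- ... run the query under investigation ...
ALTER SESSION SET EVENTS '10053 trace name context off';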
The simple answer seems to be in your explain plan. You're accessing both tables by index rowid. While you cannot, to my knowledge, get faster access when selecting a single row, in your case you're selecting a lot more than a single row.
This means that for every single row, you're going into both tables one row at a time, which is not what you want to do when you're looking at a significant proportion of a table or index.
My suggestion would be to force a full scan of one or both of your tables. Try using the smaller one as the driver first:
SELECT /*+ full(c) */ c.flh_id_messaggio
, f.eif_campo1
, c.f
FROM flows c
JOIN flet f
ON f.field19 = c.flh_id_messaggio
AND f.extra_id = c.extra_id
AND f.field1 <> c.f
WHERE ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes also seem to be separate single-column indexes. If possible, I would instead have composite indexes on:
flows(id_messaggio, extra_id, f)
and on flet(field19, extra_id, field1).
This will only really matter if you do not use a full scan, or if everything you're selecting and returning is contained in one index.
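A minimal sketch of those composite indexes (the index names are hypothetical):
-- composite indexes covering the join and filter columns of each table
CREATE INDEX idx_flows_msg_extra_f ON flows (id_messaggio, extra_id, f);
CREATE INDEX idx_flet_f19_extra_f1 ON flet (field19, extra_id, field1);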

Unique index on two columns plus separate index on each one?

I don't know much about database optimization, but I'm trying to understand this case.
Say I have the following table:
cities
===========
state_id integer
name varchar(32)
slug varchar(32)
Now, say I want to perform queries like this:
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city'
SELECT * FROM cities WHERE state_id = 123
If I want the "slug" for a city to be unique within its particular state, I'd add a unique index on state_id and slug.
Is that index enough? Or should I also add another on state_id so the second query is optimized? Or does the second query automatically use the unique index?
I'm working on PostgreSQL, but I feel this case is so simple that most DBMS work similarly.
Also, I know this surely doesn't make a difference on small tables, but my example is a simple one. Think of 200k+ rows tables.
Thanks!
A single unique index on (state_id, slug) should be sufficient. To be sure, of course, you'll need to run EXPLAIN and/or ANALYZE (perhaps with the help of something like http://explain.depesz.com/), but ultimately what indexes are appropriate depends very closely on what kind of queries you will be running. Remember, indexes make SELECTs faster and INSERTs, UPDATEs, and DELETEs slower, so you ideally want only as many indexes as are actually necessary.
Also, PostgreSQL has a smart query optimizer: it will use radically different search plans for queries on small tables and huge tables. If the table is small, it will just do a sequential scan and not even bother with any indexes, since the overhead of working with them is higher than just brute-force sifting through the table. This changes to a different plan once the table size passes a threshold, and may change again if the table gets larger again, or if you change your SELECT, or....
Summary: you can't trust the results of EXPLAIN and ANALYZE on datasets much smaller or different than your actual data. Make it work, then make it fast later (if you need to).
[EDIT: Misread the question... Hopefully my answer is more relevant now!]
In your case, I'd suggest 1 index on (state_id, slug). If you ever need to search just by slug, add an index on just that column. If you have those, then adding another index on state_id is unnecessary as the first index already covers it.
An index can be used whenever an initial segment of its columns is used in a WHERE clause. So e.g. an index on columns A, B and C will optimise queries containing WHERE clauses involving A, B and C, WHERE clauses with just A and B, or WHERE clauses with just A. Note that the order in which columns appear in the index definition is very important -- this example index cannot be used for WHERE clauses involving just B and/or C.
(Of course it's up to the query optimiser whether or not a particular index actually gets used, but in your case with 200k rows, you can guarantee that a simple search by state_id or slug or both will use one of the indices.)
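To make this concrete, a small sketch in PostgreSQL (the index name is illustrative):
-- the unique index doubles as the lookup index for both queries
CREATE UNIQUE INDEX cities_state_slug_key ON cities (state_id, slug);
-- uses both columns of the index
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city';
-- uses the leading column (state_id) of the same index
SELECT * FROM cities WHERE state_id = 123;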
Any decent optimizer will see an index on three columns - say:
CREATE INDEX idx_1 ON SomeTable(Col1, Col2, Col3);
and will use that index for any of the following conditions:
WHERE Col1 = ...something...
WHERE Col1 = ...something... AND Col2 = ...otherthing...
WHERE Col3 = ....whatnot....
AND Col1 = ...something....
AND Col2 = ...otherthing...
That is, it will use the index if there are conditions applied to any contiguous leading subset of the index's columns. Although I used equality, it can also apply to range conditions, whether open (just greater than a value, for example) or closed (between two values).
To check the optimization, use EXPLAIN (http://www.postgresql.org/docs/7.4/static/sql-explain.html) and see for yourself.
But optimization is not the most important reason to create these indexes; first and foremost, the unique index is a constraint that keeps the data logically consistent.
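For instance, a quick sketch of comparing the plans:
-- compare the plan the planner chooses for each query shape
EXPLAIN ANALYZE SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city';
EXPLAIN ANALYZE SELECT * FROM cities WHERE state_id = 123;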