"Set" with "where exists" performs better then without - sql

I came across this curious case by chance.
Environment:
Oracle 12.2.2
Two tables involved.
Number of rows: 16 million
As far as I know, and as reported here (Oracle / PLSQL: EXISTS Condition), the use of WHERE EXISTS is in general less performant than the alternatives.
In my case, however, when updating a table's columns with values from another table on a join condition, the query with the EXISTS runs in about 12-13 seconds without issues (I did only some checks, as I really do not know all the content of the table):
update fdm_auftrag ou
set (ou.e_hr,ou.e_budget) = ( select b.e_hr,b.e_budget
from fdm_budget_auftrag b
where b.fk_column1 = ou.fk_column1
and b.fk_column2 = ou.fk_column2
and b.fk_col3 = ou.fk_col3 )
where exists ( select b.e_hr,b.e_budget
from fdm_budget_auftrag b
where b.fk_column1 = ou.fk_column1
and b.fk_column2 = ou.fk_column2
and b.fk_col3 = ou.fk_col3 );
whereas without the EXISTS it takes so long that I eventually interrupted it.
I am just guessing: since the condition in EXISTS is evaluated as a boolean, once the engine finds at least one row it has to touch the database less, but I am not sure about it.
Is this guess correct? Does someone have a clearer explanation?

The where clause is limiting the number of rows being updated.
Fewer updated rows means that the update query runs faster. There is a lot of overhead to updating a row, including stashing away information for roll-back purposes.
I am assuming that you are updating relatively few rows in a much larger table. If the where clause is selecting most of the rows, then there might be no performance difference.
And, finally, the two queries are not identical. Without the WHERE clause, rows with no match in fdm_budget_auftrag would have their columns set to NULL.
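For contrast, a minimal sketch of the same update without the EXISTS filter (same tables as in the question); every row of fdm_auftrag is written, and unmatched rows get NULL:

update fdm_auftrag ou
set (ou.e_hr, ou.e_budget) = ( select b.e_hr, b.e_budget
                               from   fdm_budget_auftrag b
                               where  b.fk_column1 = ou.fk_column1
                               and    b.fk_column2 = ou.fk_column2
                               and    b.fk_col3    = ou.fk_col3 );
-- Updates all 16 million rows; rows with no match are set to NULL,
-- which is both slower and semantically different.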

Related

Behaviour of SQL update using one-to-many join

Imagine I have two tables, t1 and t2. t1 has two fields, one containing unique values called a and another field called value. Table t2 has a field that does not contain unique values called b and a field also called value.
Now, if I use the following update query (this is using MS Access btw):
UPDATE t1
INNER JOIN t2 ON t1.a=t2.b
SET t1.value=t2.value
If I have the following data
t1                     t2
a   | value            b   | value
-----------            -----------
'm' | 0.0              'm' | 1.1
                       'm' | 0.2
and run the query, what value ends up in t1.value? I ran some tests but couldn't find consistent behaviour, so I'm guessing it might just be undefined. Or is this kind of update query something that just shouldn't be done? There is a long boring story about why I've had to do it this way, but it's irrelevant to the technical nature of my enquiry.
This is known as a non-deterministic query. It means exactly what you have found: you can run the query multiple times, with no changes to the query or the underlying data, and get different results.
In practice, the value will be updated with the last record encountered, so in your case it will be updated twice, and the first update will be overwritten by the last. What you have absolutely no control over is the order in which the SQL engine accesses the records: it will access them in whatever order it deems fit. That could be simply a clustered index scan from the beginning, or it could use other indexes and access the clustered index in a different order. You have no way of knowing this. It is quite likely that running the update multiple times would yield the same result, because with no changes to the data the SQL optimiser will use the same query plan. But again, there is no guarantee, so you should not rely on a non-deterministic query to get deterministic results.
EDIT
To update the value in T1 to the Maximum corresponding value in T2 you can use DMax:
UPDATE T1
SET Value = DMax("Value", "T2", "b='" & T1.a & "'");
When you execute the query as you've indicated, the value that ends up in t1 for the row 'm' will be, effectively, random, because t2 has multiple rows for the identity value 'm'.
Unless you specifically ask for the maximum (MAX function), minimum (MIN function), or some other aggregate of the collection of rows with the identity 'm', the database has no way to make a defined choice, and so it returns whatever value it first comes across, hence the inconsistent behaviour.
Hope this helps.
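For comparison, a deterministic rewrite in standard SQL, as a sketch (MAX is an arbitrary but well-defined choice, mirroring the DMax approach above; engines other than Access generally accept an aggregate in a correlated subquery like this):

UPDATE t1
SET value = (SELECT MAX(t2.value) FROM t2 WHERE t2.b = t1.a)
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.b = t1.a);
-- The WHERE EXISTS guard leaves rows of t1 with no match in t2 untouched
-- instead of overwriting them with NULL.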

How to make a join table update efficient in SQL (it's efficient from my Ruby code!!)

This is something weird I can't really understand. I'm working in PostgreSQL 9.2 ...
I have this database:
movies (id, title, votes)
infos (id, movie_id, info_type, value)
I want to update movies.votes with infos.value, joining on movies.id = infos.movie_id and only where info_type = 100 (which is the type for votes..)
I tried 2 different queries:
update movies
set votes = cast(i.value as integer)
from movies m inner join infos i on m.id = i.movie_id
where i.info_type = 100
which (using explain) predicts a running time of about 11 million seconds (too much!)
second try:
update movies m
set votes = cast(
    ( select value
      from   infos i
      where  i.info_type = 100 and i.movie_id = m.id
      limit  1 ) as integer);
this one would be "only" 20 thousand seconds.. still far too much
I don't really know how the query planner works, so I tried to do this with a Ruby script (using active_record):
Info.find_in_batches(:conditions => "info_type = 100") do |group|
  group.each { |info|
    movie = Movie.find(info.movie_id)  # one SELECT per info row
    movie.votes = info.value.to_i
    movie.save                         # one UPDATE per info row
  }
end
For those of you who don't read Ruby: this simply loops through all infos rows that meet the info_type = 100 condition, then for each one it looks up the corresponding movie and updates it.
And it was very fast! Just a few minutes, and with all the Ruby/ORM overhead!!
Now, why?? Note that movies has about 600k records, but only 200k (a third) have an info record with the number of votes. Still, this doesn't explain what is happening.
EXPLAIN
As @ruakh already explained, you probably misunderstood what EXPLAIN is telling you. If you want actual times in seconds, use EXPLAIN ANALYZE.
Be aware though, that this actually executes the statement. I quote the manual here:
Important: Keep in mind that the statement is actually executed when the ANALYZE option is used. Although EXPLAIN will discard any output that a SELECT would return, other side effects of the statement will happen as usual. If you wish to use EXPLAIN ANALYZE on an INSERT, UPDATE, DELETE, CREATE TABLE AS, or EXECUTE statement without letting the command affect your data, use this approach:
BEGIN;
EXPLAIN ANALYZE ...;
ROLLBACK;
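Applied to the update in question (using the corrected join form shown under "Proper query" below), a safe timing run might look like this:

BEGIN;
EXPLAIN ANALYZE
UPDATE movies m
SET    votes = i.value::int
FROM   infos i
WHERE  m.id = i.movie_id
AND    i.info_type = 100;
ROLLBACK;  -- EXPLAIN ANALYZE really executed the update; this discards it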
Still, the estimates on the first query are way too high and an indication of a grave problem.
What's wrong?
As to your third approach: for big tables, it will always be faster by an order of magnitude to have the database server update a whole (big) table at once than to send instructions to the server for every row - even more so if the new values come from within the database. More in this related answer. If your tests show otherwise, chances are that something is wrong with your (test) setup. And in fact, it is ...
Your first query goes wrong completely. The god-awful performance estimate is an indication of how terribly wrong it is. While you join the table movies to the table infos in the FROM clause, you forget the WHERE condition to bind the resulting rows to the rows in the UPDATE table. This leads to a CROSS JOIN, i.e. every row in movies (600k) is updated with every single vote in infos (200k), resulting in 120 000 000 000 updates. Yummy. And all wrong. Never execute this. Not even in a transaction that can be rolled back.
Your second query goes wrong, too. It runs a correlated subquery, i.e. a separate query for every row. That's 600k subqueries instead of just 1, hence terrible performance.
That's right: 600k subqueries, not 200k. You instruct Postgres to update every movie, no matter what. Those without a matching infos.value (no info_type = 100) receive a NULL value in votes, overwriting whatever was there before.
Also, I wonder what that LIMIT 1 is doing there?
Either (infos.movie_id, infos.info_type) is UNIQUE, then you don't need LIMIT.
Or it isn't UNIQUE. Then add a UNIQUE index to infos if you intend to keep the structure.
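A sketch of that index (hypothetical name); it enforces at most one info row per movie and type, which also makes the LIMIT 1 redundant:

CREATE UNIQUE INDEX infos_movie_type_uni ON infos (movie_id, info_type);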
Proper query
UPDATE movies m
SET votes = i.value::int
FROM infos i
WHERE m.id = i.movie_id
AND i.info_type = 100
AND m.votes IS DISTINCT FROM i.value::int;
This is much like your first query, just simplified and done right, plus an enhancement.
No need to join to movies a second time. You only need infos in the FROM clause.
Actually bind the row to be updated to the row carrying the new value, thereby avoiding the (unintended) CROSS JOIN:
WHERE m.id = i.movie_id
Avoid empty updates, they carry a cost for no gain. That's what the last line is for.
Should be a matter of seconds or less, not millions of seconds.
BTW, indexes will not help this query; table scans are faster for the described data distribution, since you use all (or a third) of the involved tables.
[…] which (using explain) predicts a running time of about 11 million seconds (too much!)
[…] this one would be "only" 20 thousand seconds.. still far too much
I think you are misunderstanding the output of EXPLAIN. As explained in its documentation, the "estimated statement execution cost" (i.e. "the planner's guess at how long it will take to run the statement") is not measured in seconds, but "in cost units that are arbitrary, but conventionally mean disk page fetches".
So PostgreSQL is guessing that the second statement will run about 500 times faster than the first statement, but neither one will take anywhere near as long as you think. :-)

Oracle: How can I find tablespace fragmentation?

I have a JOIN between two tables. It's really, really slow and I can't find out why.
The query takes hours in a PRODUCTION environment at a very big client.
Can you tell me what you need in order to understand why it doesn't work well?
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand records, because of the following condition:
f.eif_campo1 != c.fornitura AND f.field29 = 'New'
In fact it should hold for almost all of the 18 million records.
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the following columns, expressed as the number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan (posted as an image in the original question).
Extra details:
I have relaxed one condition to see how many records are returned: 300 thousand.
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395,358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';
Your explain plan shows the following. The database uses an index to retrieve rows from ENI_FLUSSI_HUB where flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC'). It then winnows the rows where flh_flag_ann = 'N'. This produces a result set which is used to access rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id = c.idde_identif_dati_ext_id. Finally those rows are filtered on the basis of the remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective column: that is, if it contains hundreds of different values, or if the values in your list are relatively rare. It might even be a good path if the flag column identifies relatively few rows with a value of 'N'. So you need to understand both the distribution of your data (how many distinct values you have) and its skew (which values appear very often or hardly at all). The overall performance suggests that the distribution and/or skew of the flh_tipo_processo_cod and flh_flag_ann columns is not good.
So what can you do? One approach is to follow Ben's suggestion, and use full table scans. If you have an Enterprise Edition licence and plenty of CPU capacity you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on eni_flussi_hub(flh_tipo_processo_cod, flh_flag_ann, idde_identif_dati_ext_id, flh_fornitura, flh_id_messaggio) would avoid the need to read that table. Whether this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other activity against the table. You might be able to benefit from index compression.
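A sketch of that index (hypothetical name; COMPRESS is the optional index compression mentioned above):

CREATE INDEX eni_flh_cover_ix ON eni_flussi_hub
  (flh_tipo_processo_cod, flh_flag_ann, idde_identif_dati_ext_id,
   flh_fornitura, flh_id_messaggio)
  COMPRESS 2;  -- prefix compression on the two leading low-cardinality columns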
All the columns in the query projection are referenced in the WHERE clause, so you could also use a composite index on the other table to avoid table reads. Again, you need to understand the distribution and skew of the data, but you should probably lead with the least selective columns. Something like etl_elab_interf_flat(eif_campo200, dde_identif_dati_ext_id, eif_campo1, eif_campo198). This would probably be a new index; it's unlikely you would want to replace ETL_EIF_FK_IDX4 with this (especially if that really is an index on a foreign key constraint).
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.
The simple answer seems to be in your explain plan. You're accessing both tables by index rowid. While for selecting a single row you cannot - to my knowledge - get faster, in your case you're selecting a lot more than a single row.
This means that for every single row, you're going into both tables one row at a time, which, when you're looking at a significant proportion of a table or index, is not what you want to do.
My suggestion would be to force a full scan of one or both of your tables. Try using the smaller one as the driving table first:
SELECT /*+ full(c) */ c.flh_id_messaggio
     , f.eif_campo1
     , c.f
FROM   flows c
JOIN   flet f
  ON   f.field19 = c.flh_id_messaggio
 AND   f.extra_id = c.extra_id
 AND   f.field1 <> c.f
WHERE  ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes also seem to be separate single-column indexes. For this, and if possible, I would have indexes on:
flows (id_messaggio, extra_id, f)
and on flet (field19, extra_id, field1), as sketched below.
This will only really matter if you do not use a full scan, or if everything you're returning and selecting is in one index.
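As a sketch, with hypothetical index names:

CREATE INDEX flows_join_ix ON flows (id_messaggio, extra_id, f);
CREATE INDEX flet_join_ix  ON flet  (field19, extra_id, field1);

Covering the join and filter columns this way lets the database answer from the indexes alone instead of doing one rowid lookup per row.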

The old IN vs. Exists vs. Left Join (Where ___ Is or Is Not Null); Performance

I have found myself in quite a pickle. I have tables of only one column (suppression or inclusion lists) that are more or less varchar(25), but the thing is I won't have time to index them before using them in the main query, and, depending on how important it is, I won't know how many rows are in each table. The base table at the heart of all this is some 1.4 million rows and some 50 columns.
My assumptions are as follows:
IN shouldn't be used in cases with a lot of values (rows) returned because it looks through the values serially, right? (IN on a subquery, not values passed directly.)
Joins (INNER for inclusion, LEFT with a check for NULLs for suppression) are the best for large sets of data (over 1k rows or so to match against).
EXISTS has always concerned me because it seems to be doing a subquery for every row (all 1.4 million? Yikes.)
My gut says: if feasible, get the count of the suppression table and use either IN (for under 1k rows) or an INNER/LEFT JOIN (for suppression tables above 1k rows). Note, any field I will be suppressing on will be indexed in the big base table, but the suppression table won't be. Thoughts?
Thanks in advance for any and all comments and/or advice.
Assuming TSQL to mean SQL Server, have you seen this link comparing NOT IN, NOT EXISTS, and LEFT JOIN / IS NULL? In summary, as long as the columns being compared cannot be NULL, NOT IN and NOT EXISTS are more efficient than LEFT JOIN/IS NULL...
Something to keep in mind about the difference between IN and EXISTS: EXISTS is a boolean operator and returns true the first time the criterion is satisfied. Though its syntax looks like a correlated subquery, EXISTS has performed better than IN...
Also, IN and EXISTS only check for the existence of the value comparison. This means there's no duplication of records like you find when JOINing...
It really depends, so if you're really out to find what performs best you'll have to test & compare what the query plans are doing...
It won't matter what technique you use, if there is no index on the table on which you apply a filter or join, the system will do a table scan.
RE: Exists
It is not necessarily the case that the system will do a subquery for all 1.4 million rows. SQL Server is smart enough to do the inner Exists query and then evaluate that against the main query. In some cases, Exists can perform equal to or better than a Join.
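For reference, the three patterns under discussion look like this as a sketch (hypothetical tables big_table and suppression_list, keeping rows with no match in the suppression list):

-- NOT IN: beware NULLs in the subquery; a single NULL makes it return no rows
SELECT b.* FROM big_table b
WHERE b.id NOT IN (SELECT s.id FROM suppression_list s);

-- NOT EXISTS: NULL-safe; stops probing at the first match per outer row
SELECT b.* FROM big_table b
WHERE NOT EXISTS (SELECT 1 FROM suppression_list s WHERE s.id = b.id);

-- LEFT JOIN / IS NULL: keep only the rows the join failed to match
SELECT b.* FROM big_table b
LEFT JOIN suppression_list s ON s.id = b.id
WHERE s.id IS NULL;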

What's the most efficient way to check the presence of a row in a table?

Say I want to check if a record in a MySQL table exists. I'd run a query and check the number of rows returned: if 0 rows, do this; otherwise, do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
Is a whole new question. Would the server grab all the values and then count the values (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT CASE WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END
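In MySQL specifically (keeping the question's placeholder name table), EXISTS can also be selected directly, since boolean expressions evaluate to 1 or 0:

SELECT EXISTS (SELECT 1 FROM table WHERE id = 5);
-- Returns 1 if at least one matching row exists, 0 otherwise.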
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, LIMIT 1 (or TOP 1, depending on the DB) to check for the existence of a row makes a big difference in terms of performance on large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway in case it's of any help.
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say: always specify exactly what you need and leave the rest, since bandwidth could be spent returning data that you aren't even going to do anything with.
As for counting, the previous answer will do for your result set, unless you're dealing with an interface that reports affected rows; that can sometimes be used to find out how many rows the last query returned. You'll need to look at your interface documentation for how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster, as MySQL will have your index in memory and won't have to hit disk. Adding the LIMIT 1 is also a good trick that speeds things up significantly in the early 5.0.x branches and earlier.
Try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extra column for the presence of "Using index". If it's there, then your query is coming straight from the index and is going to be much faster.