create or replace FUNCTION FUNCTION_X
(
N_STRING IN VARCHAR2
) RETURN VARCHAR2 AS
BEGIN
RETURN UPPER(translate(N_STRING, 'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'));
END FUNCTION_X;
The SELECT with the inline expression takes around 5 seconds (80k+ rows):
SELECT TABLE_A.STRING_X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_B.ID = TABLE_A.IDTB
WHERE
UPPER(UPPER(translate(TABLE_A.STRING_X,
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu')))
=
UPPER(translate(TABLE_B.N_STRING,
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'))
The version using the function takes over 3 minutes (80k+ rows):
SELECT TABLE_A.STRING_X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_B.ID = TABLE_A.IDTB
WHERE
FUNCTION_X(TABLE_A.STRING_X) = FUNCTION_X(TABLE_B.N_STRING)
I don't know what makes it so heavy.
If your first query, with the UPPER(UPPER(translate(...))) inline in the query, takes only 5 seconds and the tables are big, I would check whether you have a function-based index on those expressions on either or both tables.
An index, as you probably know, stores a sorted version of the data so that rows can be found quickly. But it is only useful if you are searching on the data that is sorted in the index. (Think of the index in a book, in which keywords are sorted alphabetically -- useful for looking up a particular word, not so useful for finding references to words ending in the letter "r".)
If there is a function-based index on UPPER(UPPER(translate(...))) that is helping your original query, you lose the benefit when your query specifies FUNCTION_X(...) instead. Oracle is not smart enough to realize they are the same expression. You would need to create function-based indexes on the expression you actually use in the query -- i.e., on FUNCTION_X(...).
Also, you can help performance by telling Oracle that your function is deterministic (i.e., it always returns the same value for the same input) and intended to be used in SQL queries. So, in addition to the function-based indexes, a better definition of your function would be:
create or replace FUNCTION FUNCTION_X
(
N_STRING IN VARCHAR2
) RETURN VARCHAR2
DETERMINISTIC -- add this
AS
PRAGMA UDF; -- add this too
BEGIN
RETURN UPPER(translate(N_STRING, 'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'));
END FUNCTION_X;
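To get the second query back to index speed, the indexes would then be built on the exact expression the WHERE clause uses. A minimal sketch, assuming the table definitions above (the index names here are made up, and FUNCTION_X must be declared DETERMINISTIC for Oracle to accept it in an index):

```sql
-- Hypothetical index names; FUNCTION_X must be DETERMINISTIC
CREATE INDEX TABLE_A_FX_IDX ON TABLE_A (FUNCTION_X(STRING_X));
CREATE INDEX TABLE_B_FX_IDX ON TABLE_B (FUNCTION_X(N_STRING));
```

With these in place, the join predicate FUNCTION_X(TABLE_A.STRING_X) = FUNCTION_X(TABLE_B.N_STRING) matches the indexed expression exactly, which is what lets the optimizer use the indexes.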
JOINs, of course, are intended to exploit index values. The problem with your second query is that it forces the SQL engine to execute the function call for each and every row. It therefore cannot do anything better than a full table scan, evaluating the function over every row -- in the worst case, over the Cartesian product of the two tables taken together.
You must find an alternative way to do that.
Related
This question may be specific to SQL Server.
When I write a query such as :
SELECT * FROM IndustryData WHERE Date='20131231'
AND ReportTypeID = CASE WHEN (fnQuarterDate('20131231')='20131231') THEN 1
WHEN (fnQuarterDate('20131231')!='20131231') THEN 4
END;
Is the function call fnQuarterDate (or any subquery) inside a CASE within a WHERE clause executed for EACH row of the table?
Would it be better to fetch the function's (or subquery's) value beforehand into a variable, like this:
DECLARE @X INT;
IF fnQuarterDate('20131231')='20131231'
    SET @X=1;
ELSE
    SET @X=0;
SELECT * FROM IndustryData WHERE Date='20131231'
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
                        WHEN (@X = 0) THEN 4
                   END;
I know that in MySQL, if there is a subquery inside IN(...) within a WHERE clause, it is executed for each row; I just wanted to find out whether the same holds for SQL Server.
...
I just populated the table with about 30K rows and found the time difference:
Query 1 = 70 ms; Query 2 = 6 ms. I think that explains it, but I still don't know the actual facts behind it.
Also, would there be any difference if instead of a UDF there was a simple subquery?
I think the solution may in theory increase performance, but it also depends on what the scalar function actually does. In this case (my guess is it formats the date to the last day of the quarter), the difference would probably be negligible.
You may want to read this page with suggested workarounds:
http://connect.microsoft.com/SQLServer/feedback/details/273443/the-scalar-expression-function-would-speed-performance-while-keeping-the-benefits-of-functions#
Because SQL Server must execute the function on every row, using any function incurs a cursor-like performance penalty.
And in Workarounds, there is a comment that
I had the same problem when I used a scalar UDF in a join column; the performance was horrible. After I replaced the UDF with a temp table containing the results of the UDF, and used that in the join clause, the performance was orders of magnitude better. The MS team should fix UDFs to be more reliable.
So it appears that yes, this may increase the performance.
Your solution is correct, but I would recommend improving the SQL to use ELSE instead; it looks cleaner to me:
AND ReportTypeID = CASE WHEN (@X = 1) THEN 1
                        ELSE 4
END;
It depends. See User-Defined Functions:
The number of times that a function specified in a query is actually executed can vary between execution plans built by the optimizer. An example is a function invoked by a subquery in a WHERE clause. The number of times the subquery and its function is executed can vary with different access paths chosen by the optimizer.
This approach uses inline MySQL variables. The derived table aliased "sqlvars" first prepares @dateBasis with the date in question, then a second variable @qtrReportType based on the function call, evaluated ONCE for the entire query. Then, via a cross join (there is no join condition between the tables, but sqlvars yields a single row anyhow), those values are used to fetch the data from your IndustryData table.
select
ID.*
from
( select
@dateBasis := '20131231',
@qtrReportType := case when fnQuarterDate(@dateBasis) = @dateBasis
then 1 else 4 end ) sqlvars,
IndustryData ID
where
ID.Date = @dateBasis
AND ID.ReportTypeID = @qtrReportType
Let's suppose I had a view, like this:
CREATE VIEW EmployeeView
AS
SELECT ID, Name, Salary(PaymentPlanID) AS Payment
FROM Employees
The user-defined function, Salary, is somewhat expensive.
If I wanted to do something like this,
SELECT *
FROM TempWorkers t
INNER JOIN EmployeeView e ON t.ID = e.ID
will Salary be executed on every row of Employees, or will it do the join first and then only be called on the rows filtered by the join? Could I expect the same behavior if EmployeeView was a subquery or a table valued function instead of a view?
The function will only be called where relevant. If your final select statement does not include that field, it's not called at all. If your final select refers to 1% of your table, it will only be called for that 1% of the table.
This is effectively the same for sub-queries/inline views. You could specify the function for a field in a sub-query, then never use that field, in which case the function never gets called.
As an aside: scalar functions are indeed notoriously expensive in many regards. You may be able to reduce their cost by rewriting them as inline table-valued functions.
SELECT
myTable.*,
myFunction.Value
FROM
myTable
CROSS APPLY
myFunction(myTable.field1, myTable.field2) as myFunction
As long as myFunction is inline (not multi-statement) and returns only one row for each set of inputs, this often scales much better than a scalar function.
This is slightly different from making the whole view a table valued function, that returns many rows.
If such a TVF is multi-statement, it WILL call the Salary function for every record. But inline functions can be expanded inline, as if they were SQL macros, and so only call Salary as required -- like the view.
As a general rule for TVFs though, don't return records that will then be discarded.
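To make the inline/multi-statement distinction concrete: an inline TVF is one whose body is a single RETURN (SELECT ...) statement, which lets the optimizer merge it into the calling query. A sketch, where the PaymentPlans table and its columns are hypothetical stand-ins for whatever Salary actually reads:

```sql
-- Inline TVF: single-statement body, so the optimizer can expand it like a macro.
-- PaymentPlans, BaseAmount, and Bonus are hypothetical names for illustration.
CREATE FUNCTION dbo.SalaryTVF (@PaymentPlanID int)
RETURNS TABLE
AS
RETURN (
    SELECT BaseAmount + Bonus AS Value
    FROM PaymentPlans
    WHERE ID = @PaymentPlanID
);
```

If the body instead declared a table variable and used BEGIN ... END with multiple statements, it would be multi-statement and opaque to the optimizer, losing the macro-style expansion.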
It should only execute the Salary function for the joined rows. But you are not filtering the tables any further. If ID is a foreign key column and not null then it will execute that function for all the rows.
The actual execution plan is a good place to see for sure.
As said above, the function will only be called for relevant rows. For your further questions, and to get a really good idea of what's happening, you need to gather performance data either through SQL Profiler, or by viewing the actual execution plan and elapsed times. Then test out a few theories and find which is best performance.
Does the use of function calls in the WHERE clause of a stored procedure slow down performance in SQL Server 2005?
SELECT * FROM Member M
WHERE LOWER(dbo.GetLookupDetailTitle(M.RoleId,'MemberRole')) != 'administrator'
AND LOWER(dbo.GetLookupDetailTitle(M.RoleId,'MemberRole')) != 'moderator'
In this query, GetLookupDetailTitle is a user-defined function and LOWER() is a built-in function; I am asking about both.
Yes.
Both of these are practices to be avoided where possible.
Applying almost any function to a column makes the expression unsargable, which means an index cannot be used; and even if the column is not indexed, it makes cardinality estimates incorrect for the rest of the plan.
Additionally your dbo.GetLookupDetailTitle scalar function looks like it does data access and this should be inlined into the query.
The query optimiser does not inline logic from scalar UDFs and your query will be performing this lookup for each row in your source data, which will effectively enforce a nested loops join irrespective of its suitability.
Additionally, this will actually happen twice per row because of the two function invocations. You should probably rewrite it as something like:
SELECT M.* /*But don't use * either, list columns explicitly... */
FROM Member M
WHERE NOT EXISTS(SELECT *
FROM MemberRoles R
WHERE R.MemberId = M.MemberId
AND R.RoleId IN (1,2)
)
Don't be tempted to replace the literal values 1,2 with variables with more descriptive names as this too can mess up cardinality estimates.
Using a function in a WHERE clause forces a table scan.
There's no way to use an index since the engine can't know what the result will be until it runs the function on every row in the table.
You can avoid both the user-defined function and the built-in function by:
- defining "magic" values for the administrator and moderator roles and comparing Member.RoleId against these scalars, or
- defining IsAdministrator and IsModerator flags on a MemberRole table and joining it with Member to filter on those flags.
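The second option might look like this; the flag columns on MemberRole are assumptions based on the suggestion above, not an existing schema:

```sql
-- Hypothetical IsAdministrator/IsModerator flag columns on MemberRole
SELECT M.*
FROM Member M
INNER JOIN MemberRole R ON R.RoleId = M.RoleId
WHERE R.IsAdministrator = 0
  AND R.IsModerator = 0;
```

Both flag comparisons are sargable, so an index on MemberRole(RoleId) (or on the flags themselves) can be used, and no per-row function call is needed.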
I'm developing an application which processes a lot of data in an Oracle database.
In some cases, I have to fetch many objects based on a given list of conditions, and I use SELECT ... FROM ... WHERE ... IN ..., but the IN expression only accepts a list of at most 1,000 items.
So I use OR expressions instead, but as far as I can observe, the query using OR is slower than IN (with the same list of conditions). Is that right? And if so, how can I improve the speed of the query?
IN is preferable to OR -- OR is a notoriously bad performer and can cause other issues that would require using parentheses in complex queries.
A better option than either IN or OR is to join to a table containing the values you want (or don't want). This comparison table can be derived, temporary, or already existing in your schema.
In this scenario I would do this:
1. Create a one-column global temporary table.
2. Populate this table with your list from the external source (and quickly -- another whole discussion).
3. Run your query by joining the temporary table to the other table (consider dynamic sampling, as the temporary table will not have good statistics).
This means you can leave the sort to the database and write a simple query.
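A minimal sketch of those steps, with hypothetical table and column names (my_table is a stand-in for the table being filtered, and the bind variable :id represents the batched population step):

```sql
-- 1. One-column global temporary table; rows vanish at commit
CREATE GLOBAL TEMPORARY TABLE tmp_ids (id NUMBER) ON COMMIT DELETE ROWS;

-- 2. Populate it from the external source (ideally with array binds)
INSERT INTO tmp_ids (id) VALUES (:id);

-- 3. Join it to the target table; dynamic sampling compensates
--    for the temporary table's missing statistics
SELECT /*+ dynamic_sampling(t 2) */ x.*
FROM my_table x
JOIN tmp_ids t ON t.id = x.id;
```

The join replaces an arbitrarily long IN/OR list with a single predicate, so there is no 1,000-item limit and no giant SQL string to parse.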
Oracle internally converts IN lists to lists of ORs anyway, so there should really be no performance difference. The only difference is that Oracle has to transform the INs, but has longer strings to parse if you supply the ORs yourself.
Here is how you test that.
CREATE TABLE my_test (id NUMBER);
SELECT 1
FROM my_test
WHERE id IN (1,2,3,4,5,6,7,8,9,10,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,
61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,
81,82,83,84,85,86,87,88,89,90,
91,92,93,94,95,96,97,98,99,100
);
SELECT sql_text, hash_value
FROM v$sql
WHERE sql_text LIKE '%my_test%';
SELECT operation, options, filter_predicates
FROM v$sql_plan
WHERE hash_value = '1181594990'; -- hash_value from previous query
SELECT STATEMENT
TABLE ACCESS FULL ("ID"=1 OR "ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5
OR "ID"=6 OR "ID"=7 OR "ID"=8 OR "ID"=9 OR "ID"=10 OR "ID"=21 OR
"ID"=22 OR "ID"=23 OR "ID"=24 OR "ID"=25 OR "ID"=26 OR "ID"=27 OR
"ID"=28 OR "ID"=29 OR "ID"=30 OR "ID"=31 OR "ID"=32 OR "ID"=33 OR
"ID"=34 OR "ID"=35 OR "ID"=36 OR "ID"=37 OR "ID"=38 OR "ID"=39 OR
"ID"=40 OR "ID"=41 OR "ID"=42 OR "ID"=43 OR "ID"=44 OR "ID"=45 OR
"ID"=46 OR "ID"=47 OR "ID"=48 OR "ID"=49 OR "ID"=50 OR "ID"=51 OR
"ID"=52 OR "ID"=53 OR "ID"=54 OR "ID"=55 OR "ID"=56 OR "ID"=57 OR
"ID"=58 OR "ID"=59 OR "ID"=60 OR "ID"=61 OR "ID"=62 OR "ID"=63 OR
"ID"=64 OR "ID"=65 OR "ID"=66 OR "ID"=67 OR "ID"=68 OR "ID"=69 OR
"ID"=70 OR "ID"=71 OR "ID"=72 OR "ID"=73 OR "ID"=74 OR "ID"=75 OR
"ID"=76 OR "ID"=77 OR "ID"=78 OR "ID"=79 OR "ID"=80 OR "ID"=81 OR
"ID"=82 OR "ID"=83 OR "ID"=84 OR "ID"=85 OR "ID"=86 OR "ID"=87 OR
"ID"=88 OR "ID"=89 OR "ID"=90 OR "ID"=91 OR "ID"=92 OR "ID"=93 OR
"ID"=94 OR "ID"=95 OR "ID"=96 OR "ID"=97 OR "ID"=98 OR "ID"=99 OR
"ID"=100)
I would question the whole approach: the client of the SP has to send 100,000 IDs. Where does the client get those IDs from? Sending such a large number of IDs as the parameter of the proc is going to cost significantly anyway.
If you create the table with a primary key:
CREATE TABLE my_test (id NUMBER,
CONSTRAINT PK PRIMARY KEY (id));
and go through the same SELECTs to run the query with the multiple IN values, followed by retrieving the execution plan via hash value, what you get is:
SELECT STATEMENT
INLIST ITERATOR
INDEX RANGE SCAN
This seems to imply that when you have an IN list and are using this with a PK column, Oracle keeps the list internally as an "INLIST" because it is more efficient to process this, rather than converting it to ORs as in the case of an un-indexed table.
I was using Oracle 10gR2 above.
I have used Postgres with my Django project for some time now, but I have never needed to use stored functions. It is very important for me to find the most efficient solution to the following problem:
I have a table, which contains the following columns:
number | last_update | growth_per_second
And I need an efficient way to update number based on last_update and the growth factor, and to set last_update to the current time. I will probably have 100k, maybe 150k rows. I need to update all rows at the same time if possible, but if that takes too long I can split it into smaller parts.
Store what you can't calculate quickly.
Are you sure you need to maintain this information? If so, can you cache it if querying it is slow? You're setting yourself up for massive table thrash by trying to keep this information consistent in the database.
First if you want to go this route, start with the PostgreSQL documentation on server programming, then come back with a question based on what you have tried. You will want to get familiar with this area anyway because depending on what you are doing....
Now, assuming your data is all inserts and no updates, I would not store this information in your database directly. If it is a smallish amount of information, you will end up with index scans anyway, and if you are returning a smallish result set, you should be able to calculate this quickly.
Instead I would do this: make your last_update column a foreign key to the same table. Suppose your table looks like this:
CREATE TABLE hits (
id bigserial primary key,
number_hits bigint not null,
last_update_id bigint references hits(id),
....
);
Then I would create the following functions. Note the caveats below.
CREATE FUNCTION last_update(hits) RETURNS hits IMMUTABLE LANGUAGE SQL AS $$
SELECT * FROM hits WHERE id = $1.last_update_id;
$$;
This function allows you, on a small result set, to traverse to the last-update record. Note that the IMMUTABLE designation here is only safe if you guarantee that there are no updates or deletions on the hits table. If you do those, you should change it to STABLE, and you lose the ability to index the output. If you make this guarantee and then must do an update, you MUST rebuild any indexes that use this function (REINDEX TABLE hits), and that may take a while....
From there, we can:
CREATE FUNCTION growth(hits) RETURNS numeric immutable language sql as $$
SELECT CASE WHEN ($1.last_update).number_hits = 0 THEN NULL
            ELSE $1.number_hits::numeric / ($1.last_update).number_hits
END;
$$;
Then we can:
SELECT h.growth -- or alternatively growth(h)
FROM hits h
WHERE h.id = 12345;
And it will automatically calculate it. If we want to search on growth, we can index the output:
CREATE INDEX hits_growth_idx ON hits (growth(hits));
This will precalculate for searching purposes. This way if you want to do a:
SELECT * FROM hits WHERE growth = 1;
It can use an index scan on predefined values.
Of course you can use the same techniques to precalculate and store the value, but this approach is more flexible: if you have to work with a large result set, you can always self-join once and calculate that way, bypassing your functions.
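For a large result set, that self-join version of the growth calculation might look like this (a sketch against the hits table defined above, bypassing the per-row function calls):

```sql
-- Compute growth in one set-based pass via a self-join on last_update_id;
-- the ::numeric cast avoids bigint integer division
SELECT h.id,
       CASE WHEN p.number_hits = 0 THEN NULL
            ELSE h.number_hits::numeric / p.number_hits
       END AS growth
FROM hits h
JOIN hits p ON p.id = h.last_update_id;
```

The planner can pick a hash or merge join here, which tends to scale much better than invoking last_update() and growth() once per row.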