WHERE clause on a column that's the result of a UDF - sql

I have a user-defined function (e.g. myUDF(a,b)) that returns an integer.
I am trying to ensure this function is called only once and that its result can be used as a condition in the WHERE clause:
SELECT col1, col2, col3,
myUDF(col1,col2) AS X
From myTable
WHERE x>0
SQL Server tries to resolve x as a column, but it's really an alias for a computed value.
How can you rewrite this query so that the filtering is done on the computed value without executing the UDF more than once per row?

With Tbl AS
(SELECT col1, col2, col3, myUDF(col1,col2) AS X
From myTable )
SELECT * FROM Tbl WHERE X > 0

If you are using SQL Server 2005 or later, you can use CROSS APPLY:
Select T.col1, T.col2, FuncResult.X
From myTable As T
Cross Apply ( Select dbo.myUdf(T.col1, T.col2) As X ) As FuncResult
Where FuncResult.X > 0

Try:
SELECT col1, col2, col3, dbo.myUDF(col1,col2) AS X
From myTable
WHERE dbo.myUDF(col1,col2) > 0
But be aware that this repeats the UDF call and will cause a scan, since the predicate is not SARGable.
Here is another way:
select * from (
SELECT col1, col2, col3, dbo.myUDF(col1,col2) AS X
From myTable ) as y
WHERE x > 0

SQL Server does not allow you to reference a column alias in the WHERE clause. You either have to write out the expression twice:
SELECT col1, col2, col3, myUDF(col1,col2) AS X
From myTable
WHERE myUDF(col1,col2) > 0
Or use a subquery:
SELECT *
FROM (
SELECT col1, col2, col3, myUDF(col1,col2) AS X
From myTable
) as subq
WHERE x > 0

Depending on the UDF and how useful or frequently used it is, you might consider adding it to the table as a computed column. You could then filter on the column as normal and not have to write out the function in queries at all.
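A minimal sketch, assuming the table and UDF from the question (to persist or index the column, SQL Server requires dbo.myUDF to be deterministic and created WITH SCHEMABINDING):
-- Add the UDF result as a computed column (PERSISTED stores it on write).
ALTER TABLE myTable ADD X AS dbo.myUDF(col1, col2) PERSISTED;
-- The filter then reads like any ordinary column:
SELECT col1, col2, col3, X
FROM myTable
WHERE X > 0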

I'm not 100% sure what you are doing, but since x isn't a column I would remove it from your SQL statement, so you have:
SELECT col1, col2, col3, myUDF(col1,col2) AS X From myTable
And then apply the condition in your code, so you only act on rows where X > 0.

Your question is best answered by the "WITH" clause (CTEs, I think, in MS SQL Server).
Really the best question is: should I store this computed value, or recalculate it for every row, each and every time I query the table?
Are there 10 rows in the table and always 10 rows?
Are rows being added constantly?
Do you have a purge strategy in place or just let it grow?
Query that table only once a month?
If this is a "long running" function (even after you've optimized the hell out of it), why do you want to execute it more than once, ever?
You asked for once, but you are really asking for once per row, per query.
Storing the answer in an index or "virtual column":
Pros:
- Calculated exactly once per row.
- Query times don't grow linearly.
Cons:
- Increases insert/update time.
Calculating every time:
Pros:
- Insert/update time optimized.
Cons:
- Query time grows with row count (not scalable).
If you're querying once a month, why do you care how bad the performance is? Go tune something that actually has a big impact on your operations (very slightly facetious).
If you're not inserting a large number of rows per second (what counts as large depends on your hardware), is spending that time up front going to make a big difference?
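In SQL Server terms, the "store it" option boils down to a persisted computed column plus an index; a sketch assuming the computed column X from the earlier answer (the index name is illustrative, and dbo.myUDF must be deterministic for this to work):
-- Index the stored result so that WHERE X > 0 can seek instead of scan.
CREATE INDEX IX_myTable_X ON myTable (X);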

Oracle WITH clause restrictions

I have a quite complex query that is based on multiple tables unioned together. At the moment we are using a view to perform operations on all the rows we need, so the view and the query look like:
CREATE VIEW
V_VIEW
(
COL1, COL2, COL3, COL4
) AS
SELECT
"COL1", "COL2", "COL3", "COL4"
FROM
TABLE1
UNION ALL
SELECT
"COL1", "COL2", "COL3", "COL4"
FROM
TABLE2;
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
V_VIEW
WHERE
COL1 like 'val%'
AND COL2 =
(
SELECT
MAX(COL3)
FROM
V_VIEW
WHERE
COL4 = 'Y' ) ) part1
UNION ALL
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
V_VIEW
WHERE
COL1 like 'sth%'
AND COL2 =
(
SELECT
MIN(COL3)
FROM
V_VIEW
WHERE
COL4 = 'N' ) ) part2;
I'm looking for a way to improve the performance of this query; unfortunately, creating a new table that holds all the rows of TABLE1 and TABLE2 is not an option for now (we are not allowed to interfere with the way rows are inserted there). I tried using a WITH clause instead of the view, so it would look a bit like:
WITH TEMP_TABLE AS (
SELECT
COL1, COL2, COL3, COL4
FROM
TABLE1
UNION ALL
SELECT
COL1, COL2, COL3, COL4
FROM
TABLE2 )
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
TEMP_TABLE
WHERE
COL1 like 'val%'
AND COL2 =
(
SELECT
MAX(COL3)
FROM
TEMP_TABLE
WHERE
COL4 = 'Y' ) ) part1
UNION ALL
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
TEMP_TABLE
WHERE
COL1 like 'sth%'
AND COL2 =
(
SELECT
MIN(COL3)
FROM
TEMP_TABLE
WHERE
COL4 = 'N' ) ) part2;
On a small data volume (TABLE1 and TABLE2 have about 20k rows) this improves performance very well. However, those tables will eventually be stuffed with millions of rows. I don't entirely understand how the WITH clause is processed, so I wonder: is there a chance that the query using the WITH clause will fail on a large data set (due to lack of memory?), where the query without it would run slowly but finish just fine?
You could try using the following:
WITH main_res AS (SELECT col1,
col2,
MIN(CASE WHEN col4 = 'N' THEN col3 END) OVER () col3_n_min,
MAX(CASE WHEN col4 = 'Y' THEN col3 END) OVER () col3_y_max
FROM v_view)
SELECT col1,
col2
FROM main_res
WHERE (col1 LIKE 'val%' AND col2 = col3_y_max)
OR (col1 LIKE 'sth%' AND col2 = col3_n_min);
This uses conditional MIN/MAX analytic functions to compute, in a single pass, the extreme col3 value for each col4 flag across all the rows (matching the scalar subqueries in the original, which are not restricted by the col1 prefix).
Once you have that information alongside each row, you can filter on it appropriately. This should reduce the number of times you're querying each table, which is usually (but not always!) faster than the original query. I advise you to test this query and work out whether it's faster than the original (and any other answers) before you choose which one to use.
A WITH clause defines a kind of view that is created on the fly, used, and never stored in the database. It does, however, consume memory for the cursor used to retrieve rows from the WITH subquery. You are right that a WITH query over tables holding huge data volumes can slow the database down.
I am not aware of:
a) whether TABLE1 and TABLE2 hold the full data set, or are incrementally updated;
b) whether there are date columns in these tables;
c) at what interval these tables are populated or updated.
Based on the answers to the above questions, and after discussing with your DBAs:
You can ask the DBAs to extract the data belonging to TRUNC(SYSDATE) or TRUNC(SYSDATE)-1 from TABLE1 and TABLE2 and populate it into a single new table with the same columns, plus two additional columns:
a) one column to hold the first three characters of the COL1 value;
b) another column to hold a status value, with DEFAULT 'Q'.
Create LIST partitions on this new table, on the COL1 prefix for the values 'val' and 'sth' and on COL4 for 'Y' and 'N' (a rough DDL sketch follows this list).
Write an anonymous block that prepares the data the way you need; a simple query on this new table should then fetch the data for you. The anonymous block can be scheduled as a job, at whatever frequency data becomes available in the source tables TABLE1 and TABLE2.
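A rough DDL sketch of such a staging table (every name and type here is an illustrative assumption about your schema; it partitions on the prefix column only, since multi-column list partitioning requires a recent Oracle version):
CREATE TABLE staging_data (
    col1        VARCHAR2(100),
    col2        NUMBER,
    col3        NUMBER,
    col4        CHAR(1),
    col1_prefix VARCHAR2(3),        -- first three characters of col1
    status      CHAR(1) DEFAULT 'Q' -- processing status
)
PARTITION BY LIST (col1_prefix)
(   PARTITION p_val   VALUES ('val'),
    PARTITION p_sth   VALUES ('sth'),
    PARTITION p_other VALUES (DEFAULT)
);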
These suggestions are based on a set of assumptions and on the amount of information you have shared.
If there is any UI or report running on this data, housekeeping of this data will also be required.
Bottom line: prepare the data required by the subsequent process(es) beforehand, rather than preparing it on the fly when it is needed. This will simplify your entire process, including the query.
Most of the time, when we encounter performance bottlenecks in a production or integration environment, we look for short-term solutions. Short-term solutions are very much required to sort out the issue at hand, but I would suggest you prepare a long-term solution as well.
Before investing too much time in rewriting, it would be helpful to ensure that the optimizer is given a fighting chance at doing a good job. Make sure the tables have good stats and appropriate indexes.
Run explain plans on your queries to see what Oracle is actually doing in each case. You may find that something unexpected is going on with those UNION ALL statements. The optimizer sometimes makes dumb decisions and you may need to help it with indexes or strategically applied hints.
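For example, with the standard EXPLAIN PLAN and DBMS_XPLAN tooling (the statement shown is just an illustrative branch of the query above):
EXPLAIN PLAN FOR
SELECT col1, col2
FROM v_view
WHERE col1 LIKE 'val%';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);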
The WITH clause is quite handy and does the same job as a standalone view or a view defined inline in the table list, with one key exception: Oracle treats standalone views, WITH-clause views, and inline views slightly differently in the optimization process.
Oracle may choose to materialize the results of a view defined in a WITH clause, while it may merge the view if it is defined inline.
The point is that changing between these three kinds of views in your query will cause odd nuances of the optimizer to start showing up.
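If you suspect the WITH-clause view is being handled badly, the undocumented but widely used MATERIALIZE and INLINE hints are one way to nudge the optimizer toward one behavior or the other. Treat this as a diagnostic sketch rather than a guaranteed API:
WITH temp_table AS (
    SELECT /*+ MATERIALIZE */ -- swap for /*+ INLINE */ to test the merged form
           col1, col2, col3, col4
    FROM table1
    UNION ALL
    SELECT col1, col2, col3, col4
    FROM table2 )
SELECT col1, col2
FROM temp_table
WHERE col1 LIKE 'val%';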
Finally, what version of Oracle are you on? The optimizer is one area where version really matters.

How to manipulate a column selected by * in SQLite?

I want a query to return all rows and all columns, with one caveat: if, in a given row, colN is null, then return the string 'FOO' instead.
Why don't I just use SELECT col1, col2, ..., COALESCE(colN, 'FOO')?
I am implementing an abstract interface, and thus I am required to use queries that SELECT * (because I cannot make assumptions about what columns there are). The only column I can assume exists is colN.
What would this give me?
I need it because this query is used in combination with a UNION, and it allows me to keep track of the origin of the data.
Any ideas on how to do this?
One thing you could do is:
SELECT *, COALESCE(colN, 'FOO') AS CoalescedColN
if it's possible to adjust the other SELECT(s) in the UNION accordingly.
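For example, a sketch with two hypothetical tables table_a and table_b (each branch of the UNION must produce the same column list, so both get the extra column):
SELECT *, COALESCE(colN, 'FOO') AS CoalescedColN FROM table_a
UNION ALL
SELECT *, COALESCE(colN, 'FOO') AS CoalescedColN FROM table_b;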
I don't know if SQLite supports this technique, but this is what I would do in most other databases:
select * from
(SELECT col1, col2, ..., COALESCE(colN, 'FOO') from table ) a

Row_number() function for Informix

Does Informix have a function similar to SQL Server's and Oracle's ROW_NUMBER()?
I have to write a query that filters ROW_NUMBER() between two values, but I don't know how.
This is my query in SQLServer:
SELECT col1, col2
FROM (SELECT col1, col2, ROW_NUMBER()
OVER (ORDER BY col1) AS ROWNUM FROM table) AS TB
WHERE TB.ROWNUM BETWEEN value1 AND value2
Some help?
If, as it appears, you are seeking to fetch rows 1-100 first, then rows 101-200, and so on, then you can use a more direct (but non-standard) syntax. Other DBMSs have analogous notations, each handled somewhat differently.
To fetch rows 101-200:
SELECT SKIP 100 FIRST 100 T.*
FROM Table AS T
WHERE ...other criteria...
You can use a host variable in place of either literal 100 (or a single prepared statement with different values for the placeholders on different iterations).
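For the first page there is nothing to skip, so the SKIP clause can simply be omitted (same non-standard Informix syntax):
-- First 100 rows; omitting SKIP starts from the first qualifying row.
SELECT FIRST 100 T.*
FROM Table AS T
WHERE ...other criteria...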

What do comma-separated integers in a GROUP BY statement accomplish?

I have a query like this:
SELECT col1, col2, col3, col4, col5, SUM(col6) AS total
FROM table_name
WHERE col1 < 99999
GROUP BY 1,2,3,4,5
What does the GROUP BY statement actually accomplish here? The query does not work properly without the comma-separated integers.
It is equivalent to writing:
SELECT col1, col2, col3, col4, col5, SUM(col6) AS total
FROM table_name
WHERE col1 < 99999
GROUP BY col1, col2, col3, col4, col5
The numbers are the values/columns in the select-list expressed by ordinal position in the list, starting with 1.
The numbers used to be mandatory; the ability to repeat the select-list expressions in the GROUP BY was added later. The expressions can get unwieldy, and not all DBMSs allow you to use 'display labels' or 'column aliases' from the select-list in the GROUP BY clause, so occasionally using the column numbers is helpful.
In your example, it would be better to use the names - they are simple. And, in general, use names rather than numbers whenever you can.
My guess is that your database product allows referencing columns in the GROUP BY by position as opposed to only by column name (i.e., 1 for the first column, 2 for the second column, etc.). If so, this is a proprietary feature and is not recommended because of portability and (arguably) readability issues, but it can admittedly be handy for a quick-and-dirty query.
I tried a similar query in MS SQL Server 2005:
select distinct host from some_table group by 1,2,3
It errors out, saying:
Each GROUP BY expression must contain at least one column that is not an outer reference.
So this indicates that SQL Server treats those 1, 2, 3 as constant expressions (outer references), not as column positions.

Is there a difference between DISTINCT colname and DISTINCT(colname)?

I've seen both versions around. On iSeries DB2 you can use either, and as far as I can tell they do the same thing. Is there a difference?
No, there is no difference because DISTINCT is a keyword and not a function call.
It's the same difference as between SOME_COLUMN and (SOME_COLUMN) (without any keyword in front)
If you have only one column in your select, then there is no difference.
However, when you use DISTINCT over multiple columns, as in:
select distinct col1, col2, col3 from table
it applies DISTINCT to the tuple (col1, col2, col3) as a whole.
Either way, there is no difference between writing SELECT DISTINCT col and SELECT DISTINCT(col).
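A quick sketch of the multi-column gotcha (the table name t is hypothetical): the parentheses around the first column change nothing, and both statements de-duplicate whole (col1, col2) tuples:
SELECT DISTINCT col1, col2 FROM t;
SELECT DISTINCT(col1), col2 FROM t; -- same result: DISTINCT is not a function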