Coalesce(), ISNULL() clarification - sql

I am currently studying for my MCSA Data Platform certification. I got the following question wrong, and I was looking for an explanation as to why my answer was wrong, as the in-test explanation did not make much sense.
The Hovercraft Wages table records salaries paid to workers at Contoso. Workers get either a daily rate or a yearly salary. The table contains the following columns:
EmpID, Daily_Rate, Yearly_Salary
Workers only get one type of income rate, and the other column in their record has a value of NULL. You want to run a query calculating each employee's total salary, based on the assumption that people work 5 days a week, 52 weeks per year.
Below are the two options: the right answer and the answer I chose.
SELECT EmpID, CAST(COALESCE(Daily_Rate*5*52, Yearly_Salary) AS money) AS 'Total Salary'
FROM Hovercraft.Wages;
SELECT EmpID, CAST(ISNULL(Daily_Rate*5*52, Yearly_Salary) AS money) AS 'Total Salary'
FROM Hovercraft.Wages;
I selected the second choice, as there were only two possible pay fields, but it was marked as incorrect in favour of the COALESCE. Can anybody clarify why ISNULL is not a valid choice in this example, as I do not want to make this mistake in the future?
Many Thanks

The biggest difference is that ISNULL is proprietary, while COALESCE is part of the SQL standard. A certification course may be teaching for maximum portability of knowledge, so when you have several choices, the course prefers the standard way of solving the problem.
The other difference that may be important in this situation is data type determination. ISNULL uses the type of its first argument, while COALESCE follows the same rules as CASE and picks the type with the higher precedence. This may be important when Daily_Rate is stored in a column with a narrower range.
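The typing difference is easy to demonstrate; here's a minimal sketch (the char(2) variable is just an illustration, not from the original question):
DECLARE @x char(2) = NULL;
SELECT ISNULL(@x, 'abc')   AS IsNullResult,    -- 'ab': result is typed char(2), so the value is truncated
       COALESCE(@x, 'abc') AS CoalesceResult;  -- 'abc': result takes the higher-precedence (wider) type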
For completeness, here is a list of other differences between the two (taken from the Microsoft SQL Server Blog):
The NULLability of the result expression is different.
Validation for ISNULL and COALESCE is different: an untyped NULL passed to ISNULL is converted to int, but COALESCE(NULL, NULL) triggers an error (see the example after this list).
ISNULL takes only two parameters, whereas COALESCE takes a variable number of parameters.
You may get different query plans for the two functions.
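A quick way to see the validation difference (a minimal sketch):
SELECT ISNULL(NULL, NULL);   -- succeeds: the untyped NULL is treated as int, and NULL is returned
SELECT COALESCE(NULL, NULL); -- fails: at least one argument must not be the untyped NULL constant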
EDIT: From the way the answer is worded, I think the authors want you to use ISNULL in situations when the second argument is guaranteed to be non-NULL, e.g. a non-nullable field or a constant. While this idea is generally sound, their choice of question to test it is not ideal: the problem guarantees that the value of the second ISNULL parameter is non-NULL precisely in the situations where it matters, making the two choices logically equivalent.

Besides all the very well known differences, not many people know that COALESCE is just shorthand for CASE, and it inherits all of CASE's pros and cons: a variable list of parameters, proper casting of the result, and so on.
But it can also be detrimental to performance for an unsuspecting developer.
To demonstrate my point, please run these three queries (on the AdventureWorks2012 database) and check their execution plans:
SELECT COALESCE((SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418), 0)
SELECT ISNULL((SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418), 0)
SELECT CASE WHEN (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418) IS NULL
THEN 0
ELSE (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418)
END
You will see that the first and the third have identical execution plans (because COALESCE is just a short form of CASE). You will also see that in the first and third queries the SalesOrderHeader table is accessed twice, as opposed to just once with ISNULL.
If you also enable SET STATISTICS IO ON for the session, you'll notice that the number of logical reads doubles for those two queries.
In this case, COALESCE executed the inner SELECT statement twice, as opposed to ISNULL, which executed it only once. This could make a huge difference.
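If you need COALESCE semantics but want to avoid the double evaluation, one workaround (my sketch, not part of the original comparison) is to evaluate the subquery into a variable first:
-- The subquery runs exactly once; the NULL fallback is applied to the variable.
DECLARE @CustomerID int = (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418);
SELECT COALESCE(@CustomerID, 0);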

Related

Adding a SUM statement increases run time way too much, is there a better method?

I have a table with invoice payments, which can be partial or full. I am comparing this calculated field to the total amount of the invoice. I have it twice in the query, once in the Select statement and again in the Where clause. Even if I remove one so it's only in either the Where or the Select, it takes more than an hour to run. If I remove the SUM entirely, it takes 10 seconds to run.
Is there a better method to get the sum? Should I use an indexed view? A temp table? Note that an invoice number is unique only to a vendor, not unique in general. The initial FROM is a view, if that makes a difference.
select distinct
    transdate,
    invoicedate,
    PAY.OrderAccount,
    v.VendorName,
    invoiceamountmst,
    (select sum(PAY1.settleamountcur)
     from [VIEW_INVOICE_PAYMENT] PAY1
     where PAY.INVOICEID = PAY1.INVOICEID
       and PAY.OrderAccount = PAY1.OrderAccount) as "InvoiceSUM",
    settleamountcur,
    Currencycodeinvoice,
    PAY.Description,
    Voucher
from VIEW_INVOICE_PAYMENT PAY
inner join INVOICE on INVOICE_DOC_NO = invoiceid
join VENDOR V on PAY.OrderAccount = v.VendorAccount
where TRANSDATE is not null
  and (select sum(PAY1.settleamountcur)
       from [VIEW_INVOICE_PAYMENT] PAY1
       where PAY.INVOICEID = PAY1.INVOICEID
         and PAY.OrderAccount = PAY1.OrderAccount) = total_cost_on_invoice
In this answer, when I refer to 'that select', I'm referring to the sub-query in the middle: select sum(PAY1.settleamountcur) ...
Note that the aliases in 'that select' look a little strange, e.g., select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] AX1. Where does the PAY1 alias come from? I may have missed something. If that's a typo in your code, it could be doing bad things (if it even runs). Assuming it's not, however...
For your broader problem, I believe it will be running that select statement once for every row returned by your overall query. Indeed, it may be doing it more often, depending on where the filtering happens in the execution plan.
Note I'm assuming SQL Server in this answer - but it should apply to other databases as well.
A few options:
1. Instead of referring to the view, bring its underlying tables into your current query and modify the query accordingly.
2. Remove the aggregation from the subquery and instead do it once over the whole data set, e.g., GROUP BY the relevant fields and SUM across them. This can be combined with option 1.
3. Move the sub-query into a CTE, or into a sub-query within the FROM component. This may make SQL Server evaluate it as a single table rather than running it many times (or it may not).
4. (Sometimes my preferred option for large tables) Get the relevant data from the view into a temporary table first, e.g.,
SELECT INVOICEId, OrderAccount, SUM(settleamountcur) AS total_settleamountcur
INTO #Temp
FROM [VIEW_INVOICE_PAYMENT]
GROUP BY INVOICEId, OrderAccount
-- Add any where/having clauses you can to filter
-- Consider creating temp table first with primary key, making joins easier for SQL Server
Then use the #Temp table instead of that select sub-query.
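For example, something like this (a sketch only; I'm reusing the column names from your query, so adjust as needed):
SELECT DISTINCT
    transdate,
    invoicedate,
    PAY.OrderAccount,
    V.VendorName,
    invoiceamountmst,
    T.total_settleamountcur AS "InvoiceSUM",
    settleamountcur,
    Currencycodeinvoice,
    PAY.Description,
    Voucher
FROM VIEW_INVOICE_PAYMENT PAY
INNER JOIN INVOICE ON INVOICE_DOC_NO = invoiceid
JOIN VENDOR V ON PAY.OrderAccount = V.VendorAccount
JOIN #Temp T ON T.INVOICEId = PAY.INVOICEID
            AND T.OrderAccount = PAY.OrderAccount
WHERE TRANSDATE IS NOT NULL
  AND T.total_settleamountcur = total_cost_on_invoice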

Why does joining on different data types produce a conversion error inconsistently?

As I try to join tables together on a value that's represented in different data types, I get really odd errors. Please consider the following:
I have two tables; let's say one is in database "CoffeeWarehouse," and the other is in database "CoffeeAnalytics":
Table 1: CoffeeWarehouse.dbo.BeanInfo
Table 2: CoffeeAnalytics.dbo.BeanOrderRecord
Now, both tables have a field called OrderNumber (although in Table 2, it's spelled as [order number]); in Table 1, it's represented as a string, and in Table 2, it's represented as a float.
I proceed to join the tables together:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber;
If I specify the order numbers I'd like by adding the following:
WHERE bni.ordernumber = '48911'
then I see the complete table I'd like- all the fields from the table I've joined are populated properly.
If I add more order numbers, it works too:
WHERE bni.ordernumber IN ('48911', '83716', '98811', ...)
Now for the problem:
Suppose I want to select everything in the table where another field, i.e. CountryOfOrigin, is not null. I'm not going to enter several thousand order numbers- I just want to use a where clause to weed out the rows with incomplete data.
So I add the following to my original query:
WHERE bor.CountryOfOrigin IS NOT NULL
When I execute, I get this error:
Msg 8114, Level 16, State 5, Line 1
Error converting data type varchar to float.
I get the same error if I even simply use this as a where clause:
WHERE bni.ordernumber IS NOT NULL
Why is this the case? When I specify the ordernumber, the join works well- when I want to select many ordernumbers, I get a conversion error.
Any help/insight?
The SQL Server query optimiser can choose different paths to get your results, even with the same query from minute to minute.
In this query, say:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber
WHERE bni.ordernumber = '48911';
The query optimiser may, for example, take one of two paths:
It may choose to use BeanInfo as the "driving" table, use an index to narrow down the rows in that table to, say, a single row with order number 48911, and then join to BeanOrderRecord using just that one order number.
It may choose to use BeanOrderRecord as the driving table, join the two tables together by order number to get a full set of results, and then filter that resultset by the order number.
Which path the query optimiser takes will depend on a variety of things, including defined indexes, the number of rows in the table, cardinality, and so on.
Now, if it just so happens that one of your order numbers isn't convertible to a float—say someone typed '!2345' by accident—the first optimiser choice may always work, and the second one may always fail. But you don't get to choose which path the optimiser takes.
This is why you're seeing what you think of as weird results. In one of your queries, all the order numbers are being analysed, and that triggers the error; in the other, only order numbers that are convertible to float are being analysed, so there's no error. But it's basically just luck that it's working out the way it is. It could just as well be the other way around, or neither query might ever work.
This is one reason it's bad to store things in inappropriate data types. Fixing that would be the obvious solution.
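If you can change the schema, a hedged sketch of that fix might look like the following (the two-step conversion and the varchar(20) length are my assumptions, and any non-numeric junk would need cleaning out first):
-- Going through bigint first avoids the scientific notation that converting
-- a large FLOAT directly to VARCHAR can produce.
ALTER TABLE CoffeeAnalytics.dbo.BeanOrderRecord ALTER COLUMN [order number] bigint;
ALTER TABLE CoffeeAnalytics.dbo.BeanOrderRecord ALTER COLUMN [order number] varchar(20);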
A dirty and terrible fix, however, might be to always cast your FLOAT to a VARCHAR when doing the order number comparison, as I believe it's always safe to cast from FLOAT to VARCHAR. Though you may need to experiment to make sure the resulting VARCHAR value is formatted the same as your order number (or cast to INTEGER first...)
You'll have to resort to some quite fiddly trickery to get any performance out of your existing setup, though. If they were both VARCHAR values you could easily make the table join very fast by indexing each order number column, but as it is the casting you'll have to do will render normal indexes unusable for a join.
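A sketch of what the cast-based join might look like (assuming whole-number order numbers, per the 'cast to INTEGER first' idea; the varchar length is arbitrary):
SELECT bni.ordernumber,
       bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor
    -- Convert the FLOAT side to a string, so no row ever needs a VARCHAR-to-FLOAT conversion
    ON CONVERT(varchar(20), CAST(bor.[order number] AS bigint)) = bni.ordernumber;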
If you're using a recent version of SQL Server (2012 or later), you can use TRY_CAST to find the problem row(s). Note that the failing conversion is VARCHAR to FLOAT (FLOAT has the higher data type precedence), so the bad rows are the string order numbers:
SELECT * FROM BeanInfo WHERE ordernumber IS NOT NULL AND TRY_CAST(ordernumber AS float) IS NULL
...will find any VARCHAR ordernumber which can't be converted to a FLOAT.

Writing a SELECT statement to determine invalid values?

I'm taking a Database Design course online this semester, and this is my first time using SQL. I graduated with a degree in communications, but I'm taking certain computer science classes to help myself. We're using Microsoft SQL Server 2008, and I'm stumped on the last problem of our exercises. The first 6 were a breeze (basic select functions, ordering the results, using aliases to rename tables, etc.), but the last one deals with null values.
It states:
Write a SELECT statement that determines whether the PaymentDate
column of the Invoices table has any invalid values. To be valid,
PaymentDate must be a null value if there's a balance due and a
non-null value if there's no balance due. Code a compound condition in
the WHERE clause that tests for these conditions.
Don't even know where to begin. Ha ha. I typically learn better in a classroom setting, but my schedule would not allow it with this course, so any explanation would help as well! Any help is appreciated!
Dave D.
So which one is correct? It's difficult to break it down when there's two different answers :) On my day off I'm gonna head to the professor's office so she can explain it to me in person anywho lol
Because there is an incorrect answer already posted, I'm going to walk through this.
This is a question of logic: the rule says that exactly one of PaymentDate or BalanceDue should be null. In SQL, you test for NULL with the expression IS NULL.
So, the where clause for this would look like:
where (PaymentDate is null and BalanceDue is not null) or -- this is the first clause
(PaymentDate is not null and BalanceDue is null) -- this is the second clause
Any other comparison with a NULL value (=, <>, <, <=, >, >=, or IN) returns an unknown (NULL) truth value, which a WHERE clause treats as FALSE.
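A quick illustration of that behaviour (a minimal sketch):
SELECT CASE WHEN NULL = NULL  THEN 'matched' ELSE 'not matched' END;  -- 'not matched': NULL = NULL is unknown
SELECT CASE WHEN NULL IS NULL THEN 'matched' ELSE 'not matched' END;  -- 'matched': IS NULL is the correct test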
Best of luck learning SQL.
The code below will select the records with an invalid PaymentDate:
SELECT *
FROM Invoices
WHERE (PaymentDate IS NOT NULL AND BalanceDue IS NOT NULL)
   OR (PaymentDate IS NULL AND BalanceDue IS NULL)

Questions about function COUNT('') and its variety

Is there any difference between COUNT(''), COUNT(*), COUNT(1) and COUNT(ColumnName)? Which approach is faster?
COUNT(ColumnName) is influenced by the values in the column. The other variants all do effectively the same thing.
COUNT(*) is slower in some databases (MySQL amongst others), because it retrieves all fields when it doesn't have to. That's why 'x' or 1 is often used instead, to be safe. SQL Server and Oracle are somewhat smarter and don't retrieve field values if they don't have to.
Note that '' equals NULL on Oracle (yes, it does!), which has an undesired effect there: COUNT('') counts no rows at all. Not a problem for SQL Server, but you can use 1 to be safe.
COUNT(''), COUNT(1) and COUNT(*) will return the same result. COUNT(ColumnName) might return a different value, because COUNT only counts non-null values.
Performance-wise they should be equivalent, at least on SQL Server.
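A small T-SQL sketch showing the counting behaviour (the table variable and values are made up):
DECLARE @t TABLE (col int NULL);
INSERT INTO @t VALUES (1), (NULL), (3);
SELECT COUNT(*)   AS CountStar,   -- 3: counts all rows
       COUNT(1)   AS CountOne,    -- 3: the constant is never NULL
       COUNT('')  AS CountEmpty,  -- 3 on SQL Server (on Oracle '' is NULL, so it would count 0)
       COUNT(col) AS CountCol     -- 2: NULL values are skipped
FROM @t;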

SQL Distinct keyword bogs down performance?

I have received a SQL query that makes use of the distinct keyword. When I tried running the query it took at least a minute to join two tables with hundreds of thousands of records and actually return something.
I then took out the DISTINCT and the query came back in 0.2 seconds. Does the DISTINCT keyword really make things that bad?
Here's the query:
SELECT DISTINCT
c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
FROM management.contacts c
JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
Yes, because DISTINCT will (sometimes, according to a comment) cause the results to be sorted, and sorting hundreds of thousands of records takes time.
Try GROUP BY on all your columns; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle, I noticed a significant performance gain).
Distinct always sets off alarm bells to me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes there is a large cost to using it.
What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)
Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)
Finally, make sure you have indexed the custID on the orders table (and maybe recDate too).
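For instance (a sketch; the index name is made up):
CREATE INDEX ix_orders_custid_recdate ON management.orders (custID, recDate);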
The purpose of DISTINCT is to prune duplicate records from the result set across all the selected columns.
If any of the selected columns is unique after the join, you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of the selected columns is unique, you can also drop DISTINCT.
Actually, with properly designed databases you rarely need DISTINCT, and in the cases where you do, it is usually obvious that you need it. The RDBMS, however, cannot leave it to chance and must actually build a sorting or hashing structure to establish uniqueness.
Normally you find DISTINCT all over the place when people are not sure about JOINs and the relationships between tables.
Also, in classes on pure relational theory, where the result should be a proper set (with no repeating elements, i.e. records), you can find it quite common for people to stick DISTINCT in to guarantee this property for the purposes of theoretical correctness. Sometimes this creeps into production systems.
You can try a GROUP BY instead, like this:
SELECT c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
FROM management.contacts c,
management.orders o
WHERE c.custID = o.custID
AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
Also verify that you have an index on o.recDate.