Writing a SELECT statement to determine invalid values?

I'm taking a Database Design course online this semester, and this is my first time using SQL. I graduated with a degree in communications, but I'm taking certain computer science classes to help myself. We're using Microsoft SQL Server 2008, and I'm stumped on the last problem of our exercises. First 6 were a breeze (basic select functions, ordering the results, using aliases to rename tables, etc), but the last one deals with null values.
It states:
Write a SELECT statement that determines whether the PaymentDate
column of the Invoices table has any invalid values. To be valid,
PaymentDate must be a null value if there's a balance due and a
non-null value if there's no balance due. Code a compound condition in
the WHERE clause that tests for these conditions.
Don't even know where to begin. Ha ha. I typically learn better in a classroom setting, but my schedule would not allow it with this course, so any explanation would help as well! Any help is appreciated!
Dave D.
So which one is correct? It's difficult to break it down when there's two different answers :) On my day off I'm gonna head to the professor's office so she can explain it to me in person anywho lol

Because there is an incorrect answer already posted, I'm going to walk through this.
This is a question of logic: the condition hinges on whether PaymentDate and BalanceDue are null. In SQL, you test for NULL with the expression IS NULL.
So, the where clause for this would look like:
where (PaymentDate is null and BalanceDue is not null) or -- this is the first clause
(PaymentDate is not null and BalanceDue is null) -- this is the second clause
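Putting that into a complete statement, a minimal sketch (BalanceDue is an assumed column name; use whatever your Invoices table actually calls the balance-due amount):
SELECT *
FROM Invoices
WHERE (PaymentDate IS NULL AND BalanceDue IS NOT NULL)   -- BalanceDue is an assumed column name
   OR (PaymentDate IS NOT NULL AND BalanceDue IS NULL);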
Any other comparison with a NULL value (=, <>, <, <=, >, >=, or IN) evaluates to UNKNOWN rather than TRUE or FALSE, and the WHERE clause filters out rows whose condition is UNKNOWN just as if it were FALSE.
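For example, a quick way to see the difference (a sketch, using only the PaymentDate column from the question):
SELECT * FROM Invoices WHERE PaymentDate = NULL;   -- returns no rows: the comparison is UNKNOWN for every row
SELECT * FROM Invoices WHERE PaymentDate IS NULL;  -- returns the rows whose PaymentDate really is NULL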
Best of luck learning SQL.

The code below will select the records with an invalid PaymentDate:
SELECT *
FROM Invoices
WHERE (PaymentDate IS NOT NULL AND BalanceDue IS NOT NULL)
   OR (PaymentDate IS NULL AND BalanceDue IS NULL)

Related

Query applying Where clause BEFORE Joins?

I'm honestly really confused here, so I'll try to keep it simple.
We have Table A:
id
Table B:
id || number
Table A is a "prefilter" to B, since B contains a lot of different objects, including A.
So here's my query, trying to get all A's with a filter:
SELECT * FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(SUBSTRING(b.number, 2, 30) AS integer) between 151843 and 151865
Since ALL instances of A start with a letter ("X******"), I just want to strip the first letter and let the filter do its work with the number range specified by the user.
At first glance, there should be absolutely no worries. But it seems I was wrong. And on something I didn't expect to be...
It seems like my WHERE clause is executed BEFORE my JOIN. Therefore, since many B's have numbers with more than one letter at the start, I get an invalid conversion, even though it could never happen if we stayed within the A's.
I always thought the where clause was executed after joins, but in this case it seems Postgres wants to prove me wrong.
Any explanations?
SQLFiddle demonstrating problem: http://sqlfiddle.com/#!15/cd7e6e/7
And even with the SubQuery, it still makes the same error...
You can use the regex form of substring to pull out just the digits: CAST(substring(B.number from '\d+') AS integer).
See working example here: http://sqlfiddle.com/#!15/cd7e6e/18
SQL is a declarative language. In a select statement, you declare the criteria that the data you are looking for must meet; you don't get to choose the execution path, and your query isn't executed procedurally.
Thus the optimizer is free to choose any execution plan it likes, as long as it returns records specified by your criteria.
I suggest you change your query to cast to string instead of to integer. Something like:
WHERE SUBSTRING(b.number, 2, 30) between CAST(151843 AS varchar) and CAST(151865 AS varchar)
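Another option, if you'd rather keep comparing numbers, is to guard the cast with a CASE expression so the integer conversion only ever sees digits, whichever order the planner evaluates things in. Just a sketch; the regex is an assumption about what the A-style numbers look like:
SELECT a.*
FROM A a
JOIN B b ON b.id = a.id
WHERE CASE WHEN b.number ~ '^[A-Za-z][0-9]+$'   -- assumed format: one letter followed by digits
           THEN CAST(substring(b.number, 2, 30) AS integer)
      END BETWEEN 151843 AND 151865;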
Do the records of A that are in B have the same id in table B as in A? If those records were inserted in a different order, this may not be the case, and the query may therefore return different records than expected.

Coalesce(), ISNULL() clarification

I am currently studying for my MCSA Data Platform certification. I got the following question wrong, and I was looking for an explanation as to why my answer was wrong, as the in-test explanation did not make much sense.
The Hovercraft Wages table records salaries paid to workers at Contoso. Workers get either a daily rate or a yearly salary. The table contains the following columns:
EmpID, Daily_Rate, Yearly_Salary
Workers only get one type of income rate, and the other column in their record has a value of NULL. You want to run a query calculating each employee's total salary, based on the assumption that people work 5 days a week, 52 weeks per year.
Below are the two options: the right answer and the answer I chose.
SELECT EmpID, CAST(COALESCE(Daily_Rate*5*52, Yearly_Salary) AS money) AS 'Total Salary'
FROM Hovercraft.Wages;
SELECT EmpID, CAST(ISNULL(Daily_Rate*5*52, Yearly_Salary) AS money) AS 'Total Salary'
FROM Hovercraft.Wages;
I selected the second choice, as there were only two possible pay fields, but it was marked as incorrect in favour of the COALESCE. Can anybody clarify why ISNULL is not a valid choice in this example? I do not want to make this mistake in the future.
Many Thanks
The biggest difference is that ISNULL is proprietary, while COALESCE is part of SQL standard. Certification course may be teaching to maximum portability of knowledge, so when you have several choices, the course prefers a standard way of solving the problem.
The other difference that may be important in this situation is the data type determination. ISNULL uses the type of the first argument, while COALESCE follows the same rules as CASE and picks the type with the higher precedence. This may matter when Daily_Rate is stored in a column with a narrower range.
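A quick sketch of that type difference (the variable and the values are just for illustration):
DECLARE @short varchar(2) = NULL;
SELECT ISNULL(@short, 'abcde');    -- result is typed varchar(2), so the fallback is silently truncated to 'ab'
SELECT COALESCE(@short, 'abcde');  -- result follows CASE/type-precedence rules (varchar(5)) and returns 'abcde'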
For completeness, here is a list of other differences between the two (taken from Microsoft SQL Server Blog):
The NULLability of the result expression is different.
Validations for ISNULL and COALESCE are different: an untyped NULL is accepted by ISNULL (it is converted to int), but COALESCE raises an error when all of its arguments are NULL constants (see the sketch after this list).
ISNULL takes only two parameters, whereas COALESCE takes a variable number of parameters.
You may get different query plans for the two functions.
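For instance, the validation difference in the second point can be reproduced like this (a sketch):
SELECT ISNULL(NULL, NULL);    -- valid: the untyped NULLs default to int, result is NULL
SELECT COALESCE(NULL, NULL);  -- error: at least one argument must not be the NULL constant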
EDIT : From the way the answer is worded I think that the authors want you to use ISNULL in situations when the second argument is guaranteed to be non-NULL, e.g. a non-nullable field, or a constant. While generally this idea is sound, their choice of question to test it is not ideal: the issue is that the problem guarantees that the value of the second ISNULL parameter is non-NULL in situations when it matters, making the two choices logically equivalent.
Besides all the very well-known differences, not many people know that COALESCE is just shorthand for CASE, and inherits all of its pros and cons: a list of parameters, proper casting of the result, and so on.
But it can also be detrimental to performance for an unsuspecting developer.
To demonstrate my point, please run these three queries (on the AdventureWorks2012 database) and check the execution plans for them:
SELECT COALESCE((SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418), 0)
SELECT ISNULL((SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418), 0)
SELECT CASE WHEN (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418) IS NULL
THEN 0
ELSE (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418)
END
You see that the first and the third have identical execution plans (because COALESCE is just a short form of CASE). You also see that in the first and third queries the SalesOrderHeader table is accessed twice, as opposed to just once with ISNULL.
If you also enable SET STATISTICS IO ON for the session, you'll notice that the number of logical reads is double for these two queries.
In this case, COALESCE executed the inner SELECT statement twice, as opposed to ISNULL, which executed it only once. This could make a huge difference.

Where clause affecting join

Dumb question time. Oracle 10g.
Is it possible for a where clause to affect a join?
I've got a query in the form of:
select * from
(select product, product_name from products p
join product_serial ps on p.id = ps.id
join product_data pd on pd.product_value = to_number(p.product_value)) product_result
where product_name like '%prototype%';
Obviously this is a contrived example. There's no real need to show the table structure as it's all imaginary, and unfortunately I can't show the real table structure or query. In this case, p.product_value is a VARCHAR2 field which in certain rows has an ID stored in it rather than text. (Yes, bad design - but something I inherited and am unable to change)
The issue is in the join. If I leave out the where clause, the query works and rows are returned. However, if I add the where clause, I get "invalid number" error on the pd.product_value = to_number(p.product_value) join condition.
Obviously, the "invalid number" error happens when rows are joined which contain non-digits in the p.product_value field. However, my question is how are those rows being selected? If the join succeeds without the outer where clause, shouldn't the outer where clause just select rows from the result of the join? It appears what is happening is the where clause is affecting what rows are joined, despite the join being in an inner query.
Is my question making sense?
It affects the plan that's generated.
The actual order in which tables are joined (and thus filtered) is not dictated by the order in which you write your query, but by the statistics on the tables.
In one version, the plan generated coincidentally means that the 'bad' rows never get processed, because the preceding joins filtered the result set down to the point where they are never joined on.
The introduction of the WHERE clause has meant that Oracle now believes a different join order is better (because filtering by the product name requires a certain index, or because it narrows the data down a lot, etc.).
This new order means that the 'bad' rows get processed before the join that filters them out.
I would endeavour to clean the data before querying it, possibly by creating a derived column where the value is already cast to a number, or left as NULL if that is not possible.
You can also use EXPLAIN PLAN to see the different plans being generated for your queries.
Short answer: yes.
Long answer: the query engine is free to rewrite your query however it wants, as long as it returns the same results. All of the query is available to it to use for the purpose of producing the most efficient query it can.
In this case, I'd guess that there is an index that covers what you want, but it doesn't cover product name. When you add that to the where clause, the index isn't used; instead there's a scan where both conditions are tested at the same time, hence your error.
Which is really an error in your join condition: you shouldn't be using to_number unless you are sure the value is a number.
I guess your to_number(p.product_value) is only valid for the rows with a matching product_name.
What happens is that your join is applied before your where clause, resulting in the failure of the to_number function.
What you need to do is include your product_name like '%prototype%' condition in the JOIN clause, like this:
select * from
(select product, product_name from products p
join product_serial ps on product.id = ps.id
join product_data pd on product_name like '%prototype%' AND
pd.product_value = to_number(p.product_value));
For more background (and a really good read), I'd suggest reading Jonathan Gennick's Subquery Madness.
Basically, the problem is that Oracle is free to evaluate predicates in any order. So it is free to push (or not push) the product_name predicate into your subquery, and it is free to evaluate the join conditions in any order. If Oracle happens to pick a query plan where it filters out the non-numeric product_value rows before it applies the to_number, the query will succeed. If it happens to pick a plan where it applies the to_number before filtering out the non-numeric product_value rows, you'll get an error. Of course, it's also possible that it will return the first N rows successfully and then give an error when you try to fetch row N+1, because row N+1 is the first time it tries to apply the to_number predicate to non-numeric data.
Other than fixing the data model, you could potentially throw some hints into the query to force Oracle to evaluate the predicate that ensures that all the non-numeric data is filtered out before the to_number predicate is applied. But in general, it's a bit challenging to fully hint a query in a way that will force the optimizer to always evaluate things in the "proper" order.
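An alternative to hinting is to make the conversion itself safe. Just a sketch, reusing the contrived names from the question (regexp_like is available in 10g): wrap to_number in a CASE so it never sees a non-numeric value, whichever plan the optimizer picks:
select *
from (select p.product, p.product_name,
             case when regexp_like(p.product_value, '^[0-9]+$')  -- only convert pure digit strings
                  then to_number(p.product_value)
             end as product_value_num
      from products p
      join product_serial ps on p.id = ps.id) pr
join product_data pd on pd.product_value = pr.product_value_num
where pr.product_name like '%prototype%';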

Conversion to datetime fails only on WHERE clause?

I'm having a problem with some SQL Server queries. Turns out that I have a table with "Attribute_Name" and "Attribute_Value" fields, whose values can be of any type, stored as varchar. (Yeah... I know.)
All the dates for a particular attribute seem to be stored in the "YYYY-MM-DD hh:mm:ss" format (not 100% sure about that, there are millions of records here), so I can execute this code without problems:
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_Name = 'SomeDate'
However, if I execute the following code:
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_Name = 'SomeDate'
and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()
I will get the following error:
Conversion failed when converting date and/or time from character string.
How come it fails on the where clause and not on the select one?
Another clue:
If instead of filtering by the Attribute_Name I use the actual Attribute_ID stored in database (PK) it will work without problem.
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_ID = 15
and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()
Update
Thanks everyone for the answers. I found it hard to actually choose a correct answer because everyone pointed out something that was useful to understanding the issue. It definitely had to do with the order of execution.
Turns out that my first query worked correctly because the WHERE clause was executed first, then the SELECT.
My second query failed because of the same reason (as the Attributes were not filtered, the conversion failed while executing the same WHERE clause).
My third query worked because the ID was part of an index (PK), so it took precedence and it drilled down results on that condition first.
Thanks!
You seem to be assuming some sort of short-circuiting evaluation or guaranteed ordering of the predicates in the WHERE clause. This is not guaranteed. When you have mixed datatypes in a column like that, the only safe way of dealing with them is with a CASE expression.
Use (e.g.)
CONVERT(DATETIME,
CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END)
Not
CONVERT(DATETIME, pa.Attribute_Value)
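Applied to the failing query, a sketch with that guard in both places (same tables and column names as in the question; the output alias is made up):
SELECT CONVERT(DATETIME,
       CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END) AS Attribute_Date
FROM ProductAttributes pa
INNER JOIN Attributes a ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
  AND CONVERT(DATETIME,
      CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END) < GETDATE();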
If the conversion is in the WHERE clause it may be evaluated for many more records (values) than it would be if it appeared only in the projection list. I have talked about this before in a different context; see T-SQL functions do no imply a certain order of execution and On SQL Server boolean operator short-circuit. Your case is even simpler, but it is similar, and ultimately the root cause is the same: do not assume an imperative execution order when dealing with a declarative language like SQL.
Your best solution, by a large margin, is to sanitize the data and change the column type to a DATETIME or DATETIME2 type. All other workarounds will have one shortcoming or another, so you may be better off just doing the right thing.
Update
After a closer look (sorry, I'm at #VLDB and only peeking at SO between sessions) I realize you have an EAV store with inherently type-free semantics (the attribute_value can be a string, a date, an int, etc.). My opinion is that your best bet is to use sql_variant in storage and all the way up to the client (i.e. project the sql_variant). You can coerce the type in the client; all client APIs have methods to extract the inner type from a sql_variant, see Using sql_variant Data (well, almost all client APIs... Using the sql_variant datatype in CLR). With sql_variant you can store multiple types without the problems of going through a string representation, you can use SQL_VARIANT_PROPERTY to inspect things like the BaseType of the stored values, and you can even add things like check constraints to enforce data type correctness.
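As a rough illustration of the SQL_VARIANT_PROPERTY idea, assuming Attribute_Value were migrated to sql_variant, you could then inspect and filter on the stored base type directly:
SELECT pa.Attribute_Value,
       SQL_VARIANT_PROPERTY(pa.Attribute_Value, 'BaseType') AS BaseType
FROM ProductAttributes pa
WHERE CONVERT(sysname, SQL_VARIANT_PROPERTY(pa.Attribute_Value, 'BaseType')) = N'datetime';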
This has to do with the order in which a SELECT query is processed. The WHERE clause is processed long before the SELECT list; it has to determine which rows to include or exclude. The clause that uses the name must use a scan that investigates all rows, some of which do not contain valid date/time data, whereas the key probably leads to a seek, and none of the invalid rows are included at that point. The CONVERT in the SELECT list is performed last, and clearly by that time it is not going to try to convert invalid rows.
Since you're mixing date/time data with other data, you may want to consider storing date or numeric data in dedicated columns with correct data types. In the meantime, you can defer the check in the following way:
SELECT /* ... */
FROM
(
SELECT /* ... */
FROM ProductAttributes AS pa
INNER JOIN dbo.Attributes AS a
ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
AND ISDATE (pa.Attribute_Value) = 1
) AS z
WHERE CONVERT(CHAR(8), Attribute_Value, 112) < CONVERT(CHAR(8), GETDATE(), 112);
But the better answer is probably to use the Attribute_ID key instead of the name if possible.
Seems like a data issue to me. Take a look at the data when you select it using the two different methods: try looking for distinct lengths, and then select the items in the different sets and eyeball them. Also check for NULLs. (I'm not sure what happens if you try converting NULL to a datetime.)
I think the problem is you have a bad date in your database (obviously).
In your first example, where you aren't checking the date in the WHERE clause, all of the dates where a.Attribute_Name = 'SomeDate' are valid, so it never attempts to convert a bad date.
In your second example, the addition to the WHERE clause is causing the query plan to actually convert all those dates, find the bad one, and only then look at the attribute name.
In your third example, changing to use Attribute_Id probably changes the query plan so that it only looks for those where id = 15 First, and then checks to see if those records have a valid date, which they do. (Perhaps Attribute_Id is indexed and Attribute_name isn't)
So, you have a bad date somewhere, but it's not on any records with Attribute_Id = 15.
You can check the execution plans. It might be that with the failing query the second criterion ( CONVERT(DATETIME, pa.Attribute_Value) < GETDATE() ) gets evaluated first over all rows, including ones with invalid (non-date) data, while in the working one a.Attribute_ID = 15 gets evaluated first, thus excluding the rows with non-date values.
By the way, the second one might also be faster, and if you don't select anything from Attributes, you can get rid of the inner join Attributes a on a.Attribute_ID = pa.Attribute_ID altogether.
On that note, it would be advisable to get rid of the EAV model while it's not too late :)

Why does my query take 2 minutes to run?

Note - There are about 2-3 million records in the db.
SELECT
route_date, stop_exception_code, unique_id_no,
customer_reference, stop_name, stop_comment,
branch_id, customer_no, stop_expected_pieces,
datetime_updated, updated_by, route_code
FROM
cops_reporting.distribution_stop_information distribution_stop_information
WHERE
(stop_exception_code <> 'null') AND
(datetime_updated >= { ts '2011-01-25 00:00:01' })
ORDER BY datetime_updated DESC
If you posted the indexes you already have on this table, or maybe a query execution plan, it would be easier to know. As it is, I'm going to guess that you could improve performance if you create a combined index that contains stop_exception_code and datetime_updated. And I can't promise this will actually work, but it might be worth a shot. I can't say much more than that without any other information...
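For what it's worth, a sketch of such an index (the index name is made up, and the exact syntax depends on your DBMS):
CREATE INDEX idx_dsi_exception_updated
    ON cops_reporting.distribution_stop_information (stop_exception_code, datetime_updated);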
Some rules of thumb:
Index on columns that JOIN.
Index on columns used in WHERE clauses.
'Not equals' is always slower than an 'equals' condition. Consider splitting the table into rows that are null and rows that are not, or hiving that column off into a joined, indexed table.
Using proper JOIN syntax, i.e. being explicit about joins by writing INNER JOIN, speeds things up on some databases (I've seen a 10-minute-plus query get down to 30 seconds on MySQL with this change alone).
Use aliases for each table and prefix each column with its alias.
Store the query as a function/procedure so it is precompiled and runs quicker.
stop_exception_code <> 'null'
Please tell me that 'null' isn't a string in your database. Standard SQL would be
stop_exception_code IS NOT NULL
or
stop_exception_code IS NULL
I'm not sure what a NULL stop_exception_code might mean to you. But if it means something like "I don't know", then using a specific value for "I don't know" might let your server use an index on that column, an index that it might not be able to use for NULL. (Or maybe you've already done that by using the string 'null'.)
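If the intent really is "exclude rows that have no exception code" (an assumption about your data), a sketch of the rewritten query would be:
SELECT route_date, stop_exception_code, unique_id_no,
       customer_reference, stop_name, stop_comment,
       branch_id, customer_no, stop_expected_pieces,
       datetime_updated, updated_by, route_code
FROM cops_reporting.distribution_stop_information
WHERE stop_exception_code IS NOT NULL
  AND datetime_updated >= { ts '2011-01-25 00:00:01' }
ORDER BY datetime_updated DESC;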
Without seeing your DDL, actual query, and execution plan, that's about all I can tell you.