I'm having a problem with some SQL Server queries. Turns out that I have a table with "Attribute_Name" and "Attribute_Value" fields, which can be of any type, stored in varchar. (Yeah... I know.)
All the dates for a particular attribute seem to be stored in the "YYYY-MM-DD hh:mm:ss" format (not 100% sure about that, there are millions of records here), so I can execute this code without problems:
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_Name = 'SomeDate'
However, if I execute the following code:
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_Name = 'SomeDate'
and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()
I will get the following error:
Conversion failed when converting date and/or time from character string.
How come it fails in the WHERE clause and not in the SELECT?
Another clue:
If instead of filtering by the Attribute_Name I use the actual Attribute_ID stored in the database (PK), it works without a problem.
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_ID = 15
and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()
Update
Thanks everyone for the answers. I found it hard to choose a correct answer because everyone pointed out something useful to understanding the issue. It definitely had to do with the order of execution.
Turns out that my first query worked correctly because the WHERE clause was executed first, then the SELECT.
My second query failed because of the same reason (as the Attributes were not filtered, the conversion failed while executing the same WHERE clause).
My third query worked because the ID was part of an index (PK), so it took precedence and the results were narrowed down on that condition first.
Thanks!
You seem to be assuming some sort of short-circuiting evaluation or guaranteed ordering of the predicates in the WHERE clause. This is not guaranteed. When you have mixed datatypes in a column like that, the only safe way of dealing with them is with a CASE expression.
Use (e.g.)
CONVERT(DATETIME,
CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END)
Not
CONVERT(DATETIME, pa.Attribute_Value)
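Applied to the full query from the question, that looks something like this sketch (the CASE guard makes the conversion safe no matter where the optimizer evaluates it):
select /*...*/ CONVERT(DATETIME,
    CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END)
from
ProductAttributes pa
inner join Attributes a on a.Attribute_ID = pa.Attribute_ID
where
a.Attribute_Name = 'SomeDate'
and CONVERT(DATETIME,
    CASE WHEN ISDATE(pa.Attribute_Value) = 1 THEN pa.Attribute_Value END) < GETDATE()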
If the conversion is in the WHERE clause it may be evaluated for many more records (values) than it would be if it appeared in the projection list. I have talked about this before in a different context; see T-SQL functions do not imply a certain order of execution and On SQL Server boolean operator short-circuit. Your case is even simpler, but it is similar, and ultimately the root cause is the same: do not assume an imperative execution order when dealing with a declarative language like SQL.
Your best solution, by far, is to sanitize the data and change the column type to a DATETIME or DATETIME2 type. All other workarounds will have one shortcoming or another, so you may be better off just doing the right thing.
Update
After a closer look (sorry, I'm at #VLDB and only peeking at SO between sessions) I realize you have an EAV store with inherent type-free semantics (the Attribute_Value can be a string, a date, an int, etc.). My opinion is that your best bet is to use sql_variant in storage and all the way up to the client (i.e. project the sql_variant). You can coerce the type in the client; all client APIs have methods to extract the inner type from a sql_variant, see Using sql_variant Data (well, almost all client APIs... Using the sql_variant datatype in CLR). With sql_variant you can store multiple types without the problems of going through a string representation, you can use SQL_VARIANT_PROPERTY to inspect things like the BaseType of the stored values, and you can even add check constraints to enforce data type correctness.
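To illustrate, a minimal sketch of that approach (the table definition and the attribute-15 check constraint are hypothetical, not from the question):
CREATE TABLE dbo.ProductAttributesVariant (
    Product_ID int NOT NULL,
    Attribute_ID int NOT NULL,
    Attribute_Value sql_variant NULL,
    -- hypothetical constraint: values stored for attribute 15 must really be datetimes
    CONSTRAINT CK_SomeDate_IsDatetime CHECK (
        Attribute_ID <> 15
        OR CAST(SQL_VARIANT_PROPERTY(Attribute_Value, 'BaseType') AS sysname) = N'datetime'
    )
);

-- Inspect the underlying type of each stored value:
SELECT Attribute_ID,
       SQL_VARIANT_PROPERTY(Attribute_Value, 'BaseType') AS BaseType
FROM dbo.ProductAttributesVariant;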
This has to do with the order in which a SELECT query is processed. The WHERE clause is processed long before the SELECT: it has to determine which rows to include/exclude. The clause that uses the name must use a scan that investigates all rows, some of which do not contain valid date/time data, whereas the key probably leads to a seek, and none of the invalid rows are included at that point. The CONVERT in the SELECT list is performed last, and clearly by this time it is not going to try to convert invalid rows. Since you're mixing date/time data with other data, you may want to consider storing date or numeric data in dedicated columns with correct data types. In the meantime, you can defer the check in the following way:
SELECT /* ... */
FROM
(
SELECT /* ... */
FROM ProductAttributes AS pa
INNER JOIN dbo.Attributes AS a
ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
AND ISDATE (pa.Attribute_Value) = 1
) AS z
WHERE CONVERT(CHAR(8), CONVERT(DATETIME, z.Attribute_Value), 112) < CONVERT(CHAR(8), GETDATE(), 112);
But the better answer is probably to use the Attribute_ID key instead of the name if possible.
Seems like a data issue to me. Take a look at the data when you select it using the two different methods: try looking for distinct lengths, and then select the items in the different sets and eyeball them. Also check for NULLs (I'm not sure what happens if you try converting NULL to a datetime).
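For example, a quick profiling query along these lines (a sketch against the question's schema; for the record, CONVERT(DATETIME, NULL) simply yields NULL):
SELECT LEN(pa.Attribute_Value) AS value_length, COUNT(*) AS occurrences
FROM ProductAttributes pa
INNER JOIN Attributes a ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
GROUP BY LEN(pa.Attribute_Value)
ORDER BY value_length;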
I think the problem is you have a bad date in your database (obviously).
In your first example, where you aren't checking the date in the WHERE clause, all of the dates where a.Attribute_Name = 'SomeDate' are valid, so it never attempts to convert a bad date.
In your second example, the addition to the WHERE clause is causing the query plan to actually convert all those dates and finding the bad one and then looking at attribute name.
In your third example, changing to use Attribute_ID probably changes the query plan so that it only looks for rows where Attribute_ID = 15 first, and then checks whether those records have a valid date, which they do. (Perhaps Attribute_ID is indexed and Attribute_Name isn't.)
So, you have a bad date somewhere, but it's not on any records with Attribute_ID = 15.
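If you want to hunt the bad value(s) down, a sketch using ISDATE (suggested elsewhere on this page):
SELECT pa.Attribute_Value
FROM ProductAttributes pa
INNER JOIN Attributes a ON a.Attribute_ID = pa.Attribute_ID
WHERE a.Attribute_Name = 'SomeDate'
  AND ISDATE(pa.Attribute_Value) = 0;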
You can check the execution plans. It might be that with the first query the second criterion ( CONVERT(DATETIME, pa.Attribute_Value) < GETDATE() ) gets evaluated first over all rows, including ones with invalid (non-date) data, while in the case of the second one, a.Attribute_ID = 15 gets evaluated first, thus excluding the rows with non-date values.
By the way, the second one might also be faster, and if you don't need anything from Attributes in the select list, you can get rid of the inner join Attributes a on a.Attribute_ID = pa.Attribute_ID altogether, as in the sketch below.
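Something like this (assuming nothing else from Attributes is needed):
select /*...*/ CONVERT(DATETIME, pa.Attribute_Value)
from ProductAttributes pa
where pa.Attribute_ID = 15
and CONVERT(DATETIME, pa.Attribute_Value) < GETDATE()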
On that note, it would be advisable to get rid of the EAV model while it's not too late :)
Related
As I try to join tables together on a value that's represented in different data types, I get really odd errors. Please consider the following:
I have two tables; let's say one is in database "CoffeeWarehouse," and the other is in database "CoffeeAnalytics":
Table 1: CoffeeWarehouse.dbo.BeanInfo
Table 2: CoffeeAnalytics.dbo.BeanOrderRecord
Now, both tables have a field called OrderNumber (although in Table 2, it's spelled as [order number]); in Table 1, it's represented as a string, and in Table 2, it's represented as a float.
I proceed to join the tables together:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber;
If I specify the order numbers I'd like by adding the following:
WHERE bni.ordernumber = '48911'
then I see the complete table I'd like- all the fields from the table I've joined are populated properly.
If I add more order numbers, it works too:
WHERE bni.ordernumber IN ('48911', '83716', '98811', ...)
Now for the problem:
Suppose I want to select everything in the table where another field, i.e. CountryOfOrigin, is not null. I'm not going to enter several thousand order numbers- I just want to use a where clause to weed out the rows with incomplete data.
So I add the following to my original query:
WHERE bor.CountryOfOrigin IS NOT NULL
When I execute, I get this error:
Msg 8114, Level 16, State 5, Line 1
Error converting data type varchar to float.
I get the same error if I even simply use this as a where clause:
WHERE bni.ordernumber IS NOT NULL
Why is this the case? When I specify the ordernumber, the join works well; when I want to select many ordernumbers, I get a conversion error.
Any help/insight?
The SQL Server query optimiser can choose different paths to get your results, even with the same query from minute to minute.
In this query, say:
SELECT ordernumber,
bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor ON bor.[order number] = bni.ordernumber
WHERE bni.ordernumber = '48911';
The query optimiser may, for example, take one of two paths:
It may choose to use BeanInfo as the "driving" table, use an index to narrow down the rows in that table to, say, a single row with order number 48911, and then join to BeanOrderRecord using just that one order number.
It may choose to use BeanOrderRecord as the driving table, join the two tables together by order number to get a full set of results, and then filter that resultset by the order number.
Which path the query optimiser takes will depend on a variety of things, including defined indexes, the number of rows in the table, cardinality, and so on.
Now, if it just so happens that one of your order numbers isn't convertible to a float—say someone typed '!2345' by accident—the first optimiser choice may always work, and the second one may always fail. But you don't get to choose which path the optimiser takes.
This is why you're seeing what you think of as weird results. In one of your queries, all the order numbers are being analysed, and that's triggering the error; in another, only order numbers that are convertible to float are being analysed, so there's no error. But it's basically just luck that it's working out the way it is. It could just as well be the other way around, or neither query might ever work.
This is one reason it's bad to store things in inappropriate data types. Fixing that would be the obvious solution.
A dirty and terrible fix, however, might be to always cast your FLOAT to a VARCHAR when doing the order number comparison, as I believe it's always safe to cast from FLOAT to VARCHAR. Though you may need to experiment to make sure the resulting VARCHAR value is formatted the same as your order number (or cast to INTEGER first...)
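For instance, something along these lines (a sketch only; it assumes every order number is a whole number that fits in a bigint, so the float renders as plain digits rather than in scientific notation):
SELECT bni.ordernumber, bor.*
FROM CoffeeWarehouse.dbo.BeanInfo AS bni
LEFT JOIN CoffeeAnalytics.dbo.BeanOrderRecord AS bor
    ON CONVERT(varchar(20), CONVERT(bigint, bor.[order number])) = bni.ordernumber
WHERE bor.CountryOfOrigin IS NOT NULL;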
You'll have to resort to some quite fiddly trickery to get any performance out of your existing setup, though. If they were both VARCHAR values you could easily make the table join very fast by indexing each order number column, but as it is the casting you'll have to do will render normal indexes unusable for a join.
If you're using a recent version of SQL Server, you can use TRY_CAST to find the problem row(s):
SELECT * FROM BeanInfo WHERE TRY_CAST(ordernumber AS FLOAT) IS NULL
...will find any VARCHAR ordernumber which can't be converted to a FLOAT. (Note it's the VARCHAR side that gets implicitly converted in the join, because FLOAT has higher data type precedence, which is why the error complains about converting varchar to float.)
I'm trying to crosswalk some code values from another developer's code using the BusinessObjects front end (I know, it's sub-optimal, but they haven't given me back-end access).
What I need to do is just pull a record from the relevant table to compare code values to display values. I'm guessing the problem has something to do with the table containing millions of records. Even when I narrow my query to one value, try only records from today, and set Max rows retrieved to 1, it's hanging forever.
The code it generated for my query is:
SELECT
CLINICAL_EVENT.EVENT_CD,
CV_EVENT.DISPLAY
FROM
CLINICAL_EVENT,
CODE_VALUE CV_EVENT
WHERE
( CLINICAL_EVENT.EVENT_CD=CV_EVENT.CODE_VALUE )
AND
(
CLINICAL_EVENT.EVENT_CD = 338743225
AND
CLINICAL_EVENT.EVENT_END_DT_TM
> '16-02-2017 00:00:00'
)
Can you by chance avoid the cross join in your query by using JOIN syntax instead of the comma notation? Perhaps the engine is optimizing to avoid the cross join, perhaps not.
SELECT
CLINICAL_EVENT.EVENT_CD,
CV_EVENT.DISPLAY
FROM
CLINICAL_EVENT
INNER JOIN CODE_VALUE CV_EVENT
on CLINICAL_EVENT.EVENT_CD=CV_EVENT.CODE_VALUE
WHERE CLINICAL_EVENT.EVENT_CD = 338743225
AND CLINICAL_EVENT.EVENT_END_DT_TM > '16-02-2017 00:00:00'
Additionally, what data type is EVENT_END_DT_TM? Explicitly casting your '16-02-2017 00:00:00' to a date or datetime, rather than leaving it to an implicit cast, might aid performance.
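For example, if the back end happens to be Oracle (the question doesn't say), the explicit version of that predicate might look like:
AND CLINICAL_EVENT.EVENT_END_DT_TM > TO_DATE('16-02-2017 00:00:00', 'DD-MM-YYYY HH24:MI:SS')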
Expanding a bit on my comment:
The code values and corresponding display values you want to examine are effectively both coming from table CODE_VALUE. The only thing you're gaining from the join is duplication of those results according to the number of times the code value appears on the CLINICAL_EVENT rows satisfying the date criterion (in a sense that encompasses suppressing all appearances if there are no matching rows).
You seem to want simply to compare the code value and corresponding description, rather than to evaluate how many times that code appears. In that case, you are incurring a lot of unneeded work -- and possibly even some unwanted work -- by joining CODE_VALUE to CLINICAL_EVENT. Instead, just select the wanted row(s) directly from CODE_VALUE alone.
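In other words, something as simple as this sketch (column names taken from the generated query above):
SELECT CODE_VALUE, DISPLAY
FROM CODE_VALUE
WHERE CODE_VALUE = 338743225;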
I need to check whether a table (in an Oracle DB) contains entries that were updated after a certain date. "Updated" in this case means any of 3 columns (DateCreated, DateModified, DateDeleted) have a value greater than the reference.
The query I have come up with so far is this:
select * from myTable
where DateCreated > :reference_date
or DateModified > :reference_date
or DateDeleted > :reference_date
;
This works and gives the desired results, but it is not what I want, because I would like to enter the value for :reference_date only once.
Any ideas on how I could write a more elegant query?
While what you have looks fine and only uses one bind variable, if for some reason you have positional rather than named binds, then you could avoid the need to supply the bind value multiple times by using an inline view or a CTE:
with cte as (select :reference_date as reference_date from dual)
select myTable.*
from cte
join myTable
on myTable.DateCreated > cte.reference_date
or myTable.DateModified > cte.reference_date
or myTable.DateDeleted > cte.reference_date
;
But again, I wouldn't consider that better than your original unless you have a really compelling reason and a problem supplying the bind value. Having to set it three times from a calling program probably wouldn't count as compelling, for example, for me anyway. And I'd check it didn't affect performance before deploying; I'd expect Oracle to optimise something like this, but the execution plan might be interesting.
I suppose you could rewrite that as:
select * from myTable
where greatest(DateCreated, DateModified, DateDeleted) > :reference_date;
if you absolutely had to, but I wouldn't. Your original query is, IMHO, much easier to understand than this one. Also, by using a function you've lost any chance of using an index, should one exist (unless you have a function-based index matching the new clause). Note too that GREATEST returns NULL if any of its arguments is NULL, so a row with, say, a NULL DateDeleted would be excluded here even though your original query would return it.
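For reference, such a function-based index would look like this (Oracle; the index name is made up):
CREATE INDEX mytable_last_change_ix
    ON myTable (GREATEST(DateCreated, DateModified, DateDeleted));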
I am currently studying for my MCSA Data Platform, I got the following question wrong and I was looking for an explanation as to why my answer was wrong as the in test explanation did not make much sense.
The Hovercraft Wages table records salaries paid to workers at Contoso. Workers get either a daily rate or a yearly salary. The table contains the following columns:
EmpID, Daily_Rate, Yearly_Salary
Workers only get one type of income rate, and the other column in their record has a value of NULL. You want to run a query calculating each employee's total salary based on the assumption that people work 5 days a week, 52 weeks per year.
Below are the two options: the correct answer and the answer I chose.
SELECT EmpID, CAST(COALESCE(Daily_Rate*5*52, Yearly_Salary) AS money) AS 'Total Salary'
FROM Hovercraft.Wages;
SELECT EmpID, CAST(ISNULL(Daily_Rate*5*52, Yearly_Salary) AS money) AS 'Total Salary'
FROM Hovercraft.Wages;
I selected the second choice, as there were only two possible pay fields, but it was marked as incorrect in favour of the COALESCE version. Can anybody clarify why ISNULL is not a valid choice in this example? I do not want to make this mistake in the future.
Many Thanks
The biggest difference is that ISNULL is proprietary, while COALESCE is part of the SQL standard. A certification course may be teaching for maximum portability of knowledge, so when you have several choices, the course prefers the standard way of solving the problem.
The other difference that may be important in this situation is data type determination. ISNULL uses the type of the first argument, while COALESCE follows the same rules as CASE and picks the type with the higher precedence. This may be important when Daily_Rate is stored in a column with a narrower range.
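A classic way to see that type difference in action (a sketch; the variable is arbitrary):
DECLARE @val varchar(2) = NULL;
SELECT ISNULL(@val, 'abcdef');   -- 'ab': the result is typed varchar(2), so the value is truncated
SELECT COALESCE(@val, 'abcdef'); -- 'abcdef': the result takes the wider varchar(6) type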
For completeness, here is a list of other differences between the two (taken from Microsoft SQL Server Blog):
The NULLability of the result expression is different,
Validations for ISNULL and COALESCE are different; for example, a NULL value for ISNULL is converted to int, but triggers an error with COALESCE,
ISNULL takes only two parameters, whereas COALESCE takes a variable number of parameters,
You may get different query plans for the two functions.
EDIT: From the way the answer is worded, I think the authors want you to use ISNULL in situations when the second argument is guaranteed to be non-NULL, e.g. a non-nullable field or a constant. While generally this idea is sound, their choice of question to test it is not ideal: the problem guarantees that the value of the second ISNULL parameter is non-NULL precisely in the situations when it matters, making the two choices logically equivalent.
Besides all the very well known differences, not many people know that COALESCE is just a shorthand for CASE, and inherits all of its pluses and minuses: a list of parameters, proper casting of the result, and so on.
But it can also be detrimental to performance for an unsuspecting developer.
To demonstrate my point, please run these three queries (against the AdventureWorks2012 database) and check the execution plans for them:
SELECT COALESCE((SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418), 0)
SELECT ISNULL((SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418), 0)
SELECT CASE WHEN (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418) IS NULL
THEN 0
ELSE (SELECT CustomerID FROM Sales.SalesOrderHeader WHERE SalesOrderId = 57418)
END
You'll see that the first and the third have identical execution plans (because COALESCE is just a short form of CASE). You'll also see that in the first and third queries the SalesOrderHeader table is accessed twice, as opposed to just once with ISNULL.
If you also enable SET STATISTICS IO ON for the session, you'll notice that the number of logical reads doubles for those two queries.
In this case, COALESCE executed the inner SELECT statement twice, as opposed to ISNULL, which executed it only once. This could make a huge difference.
Is it possible to join two tables based on the same date, not factoring in time?
Something like:
...FROM appointments LEFT JOIN sales ON
appointments.date = sales.date...
The only problem is that it is a datetime field, so I want to make sure it is only looking at the date and ignoring the time.
You can do it like this:
FROM appointments
LEFT JOIN sales ON DATE(appointments.date) = DATE(sales.date)
But I'm pretty sure it won't be able to use an index, so it will be very slow.
You might be better off adding a date column to each table.
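If maintaining that extra column by hand sounds error-prone, a stored generated column plus an index is one option (a sketch, MySQL 5.7+; the new column and index names are made up):
ALTER TABLE appointments
  ADD COLUMN appt_date DATE GENERATED ALWAYS AS (DATE(`date`)) STORED,
  ADD INDEX idx_appointments_appt_date (appt_date);

ALTER TABLE sales
  ADD COLUMN sale_date DATE GENERATED ALWAYS AS (DATE(`date`)) STORED,
  ADD INDEX idx_sales_sale_date (sale_date);

-- The join can then use the indexed columns:
-- ... FROM appointments LEFT JOIN sales ON appointments.appt_date = sales.sale_date ...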
The ON clause accepts an arbitrary conditional expression, so you can perform the date normalization on both sides before comparing, but expect it to carry a significant performance penalty.
Yes, but you have to join on a calculated expression that strips the time from the datetime column values. This makes the query non-SARGable (it cannot use an index), so it will generate a table scan...
I don't know the syntax in MySQL to strip the time from a datetime value, but whatever it is, just write:
FROM appointments a
LEFT JOIN sales s
ON StripTime(s.date) = StripTime(a.date)