"Cannot construct data type datetime" when filtering data, but all values filtered DO have valid dates - sql

I am convinced that this question is NOT a duplicate of:
Cannot construct data type datetime, some of the arguments have values which are not valid
In that case the values passed in are explicitly not valid. Whereas in this case, all of the values that the function could be expected to be called on are valid.
I know what the actual problem is, and it's not something that would help most people who find the other question. But it IS something that would be good to be findable on SO.
Please read the answer, and understand why it's different from the linked question before voting to close as dupe of that question.
I've run some SQL that's errored with the error message: Cannot construct data type datetime, some of the arguments have values which are not valid.
My SQL uses DATETIMEFROMPARTS, but it's fine evaluating that function in the select - it's only a problem when I filter on the selected value.
It's also demonstrating weird, can't-possibly-be-happening behaviour w.r.t. other changes to the query.
My query looks roughly like this:
WITH FilteredDataWithDate AS (
SELECT *, DATETIMEFROMPARTS(...some integer columns representing date data...) AS Date
FROM Table
WHERE <unrelated pre-condition filter>
)
SELECT * FROM FilteredDataWithDate
WHERE Date > '2020-01-01'
If I run that query, then it errors with the invalid data error.
But if I omit the final Date > filter, then it happily renders every result record, so clearly none of the values it's filtering on are invalid.
I've also manually examined the contents of Table WHERE <unrelated pre-condition filter> and verified that everything is a valid date.
It also has a wild collection of other behaviours:
If I replace all of ...some integer columns representing date data... with hard-coded numbers then it's fine.
If I replace some parts of that data with hardcoded values, that fixes it, but replacing other parts doesn't. I can't find any particular pattern in what does or doesn't help.
If I remove most of the * columns from the Table select, then it starts to be fine again.
Specifically, it appears to break any time I include an nvarchar(max) column in the CTE.
If I add an additional filter to the CTE that limits the results to Id values in the following ranges, then the results are:
130,000 and 140,000. Error.
130,000 and 135,000. Fine.
135,000 and 140,000. Fine!!!!
Filtering by the Date column breaks everything ... but ORDER BY Date is fine. (and confirms that all dates lie within perfectly sensible bounds.)
Adding TOP 1000000 makes it work ... even though there are only about 1000 rows.
... WTAF?!

This took me a while to decode, but it turns out that the SQL Server compiler doesn't necessarily restrict its execution of the function just to rows that are, or could be, relevant to the result set.
Depending on the execution plan it arrives at, the function could get called on any record in Table, even one that doesn't satisfy WHERE <unrelated pre-condition filter>.
This was found by another user, for another function, over here.
So the fact that it could return all the results without the filter wasn't actually proving that every input into the function was valid. And indeed there were some records in the table that weren't in the result set, but still had invalid data.
That actually means that even if you were to add an explicit WHERE filter to exclude rows containing invalid date-component data ... that isn't actually guaranteed to fix it, because the function may still get called against the 'excluded' rows.
Each of the random other things I did will have been influencing the query plan in one way or another that happened to fix/break things.
The solution is, naturally, to fix the underlying table data.
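In this case that means hunting down the rows whose integer components can't form a real date. A minimal sketch of such a check, assuming purely for illustration that the components live in integer columns named [Year], [Month] and [Day] (substitute whatever the real column names are):

-- TRY_CONVERT (SQL Server 2012+) returns NULL instead of raising an error,
-- so this is safe to run even against the bad rows; any row returned has
-- components that don't form a valid date.
SELECT *
FROM [Table]
WHERE TRY_CONVERT(date, CONCAT([Year], '-', [Month], '-', [Day])) IS NULL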

Related

How to cache return value of a function for a single query

I want to use the getdate() function 3-4 times in a single query for validation checks, but I want all 3-4 places that fetch the current datetime within a single query execution to get the same value. Now, technically, computers are fast enough that 99.9% of the time I will get the same datetime at all places in the query anyway, but theoretically it could lead to a bug. So how can I cache the getdate() return value by calling it once and use that cached value throughout the query?
To add: I want to write such a statement in a check constraint, so I can't declare local variables or anything like that.
SQL Server has the concept of run-time constant functions. The best way to describe these is that the first thing the execution engine does is pull the function references out from the query plan and execute each such function once per query.
Note that the function references appear to be column-based. So different columns can have different values, but different rows should have the same value within a column.
The two most common functions in this category are getdate() and rand(). Ironically, I find that this is a good thing for getdate(), but a bad thing for rand() (what kind of random number generator always returns the same value?).
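A quick way to see the behaviour (SomeTable is just a placeholder name):

-- GETDATE() and RAND() are each evaluated once per query execution,
-- so every row returned shows the same two values.
SELECT Id, GETDATE() AS QueryTime, RAND() AS SameValueOnEveryRow
FROM SomeTable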
For some reason, I can't find the actual documentation on run-time constant functions. But here are some respected blog posts that explain the matter:
https://sqlperformance.com/2014/06/t-sql-queries/dirty-secrets-of-the-case-expression
http://sqlblog.com/blogs/andrew_kelly/archive/2008/03/01/when-a-function-is-indeed-a-constant.aspx
https://blogs.msdn.microsoft.com/conor_cunningham_msft/2010/04/23/conor-vs-runtime-constant-functions/

'-999' used for all condition

I have a sample of a stored procedure like this (from my previous working experience):
Select * from table where (id=@id or id='-999')
Based on my understanding of this query, the '-999' is used to avoid an exception when no value is transferred from users. So far in my research, I have not found this usage on the internet or in other companies' implementations.
@id is transferred from the user.
Any help will be appreciated in providing some links related to it.
I'd like to add my two guesses on this, although please note that, to my disadvantage, I'm one of the very youngest in the field, so this is not coming from much history or experience.
Also, please note that whatever reason anybody provides you, you might not be able to confirm it 100%. Your oven might just not have any leftover evidence in and of itself.
Now, per another question I read before, extreme integers were used in some systems to denote missing values, since text and NULL weren't options in those systems. Say I'm looking for ID #84, and I cannot find it in the table:
Not Found Is Unlikely:
Perhaps in some systems it's far more likely that a record exists with a missing/incorrect ID than that it doesn't exist at all? Hence, when no match is found, designers preferred all records without valid IDs to be returned?
This however has a few problems. First, depending on the design, the user might not recognize that the results are a set of records with missing IDs, especially if only one was returned. Second, the current query poses a problem, as it will always return the missing-ID records in addition to the normal matches. Perhaps they relied on ORDERing to ease readability?
Exception Above SQL:
AFAIK, SQL is fine with a zero-row result, but maybe whatever thing that calls/used to call it wasn't as robust, and something goes wrong (hard exception, soft UI bug, etc.) when zero rows are returned? Perhaps then, this ID represented a dummy row (e.g. blanks and zeroes) to keep things running.
Then again, this also suffers from the same arguments above regarding "record is always outputted" and ORDER, with the added possibility that the SQL caller might have dedicated logic for when the -999 record is the only record returned, which I doubt was the most practical approach even in whatever era this was done.
... the more I type, the more I think this is the oven, and only the great grandmother can explain this to us.
If you want to avoid an exception when no value is transferred from the user, declare the parameter with a NULL default in your stored procedure, like @id int = null.
For instance:
CREATE PROCEDURE [dbo].[TableCheck]
@id int = null
AS
BEGIN
Select * from table where (id=@id)
END
Now you can execute it either way:
exec [dbo].[TableCheck] 2 or exec [dbo].[TableCheck]
Remember, it's a separate matter if you want to return the whole table when your input parameter is null (see the sketch below).
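If you do want that, a hypothetical variant (not part of the answer above) could look like this:

CREATE PROCEDURE [dbo].[TableCheckOrAll]
@id int = null
AS
BEGIN
-- When @id is null the first condition is true for every row, so the whole
-- table is returned; otherwise only the matching rows are.
-- ([table] is bracketed only because TABLE is a reserved word.)
Select * from [table] where (@id is null or id=@id)
END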
To answer your id = -999 condition: I tried it your way, and it doesn't prevent any exception.

T-SQL - Error converting data type - show offending row

On a simple INSERT command, I am getting an error:
Error converting data type...
The source data comes from multiple sources and, combined, makes hundreds of thousands of rows.
Can I re-write my statement to catch the error and show the offending data?
Thanks!
EDIT:
Requests for code:
insert Table_A
([ID]
,[rowVersion]
,[PluginId]
,[rawdataId]
...
...
...
)
select [ID]
,[rowVersion]
,[PluginId]
,[rawdataId]
...
...
...
FROM TABLE_B
Here are two approaches that I've taken when dealing with this problem. The issue is caused by an implicit conversion from a string to a date.
If you happen to know which field is being converted (which may be true in your example, but not always in mine), then just do:
select *
from table_B
where isdate(col) = 0 and col is not null
This may not be perfect for all data types, but it has worked well for me in practice.
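On SQL Server 2012 or later, a similar probe using try_convert works for any target type (a sketch using the same names as above):

-- try_convert returns NULL instead of raising an error when a value
-- cannot be converted, so the probe itself never blows up.
select *
from table_B
where try_convert(datetime, col) is null and col is not null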
Sometimes, when I want to find the offending row in a select statement, I run the select outputting the data as text rather than to a grid. This is one of the options in SSMS, along the row of icons beneath the menus. It will output all the rows before the error, which more or less lets you identify the row with the error. This works best when there is an order by clause, but for debugging purposes it has worked for me.
In your case, I might create a temporary table that holds strings, and then do the analysis on this temporary table, particularly if Table_B is not really a table but a more complicated query.
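As a sketch of that staging idea, with [SomeKey] and [SomeDateColumn] standing in for whatever the real columns are:

-- Land the suspect column as a plain string first...
select [SomeKey], cast([SomeDateColumn] as varchar(50)) as [SomeDateColumn]
into #staging
from TABLE_B;

-- ...then probe the staged strings for values that won't convert to the target type.
select *
from #staging
where try_convert(datetime, [SomeDateColumn]) is null
  and [SomeDateColumn] is not null;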
The query statements insert into ... select and select ... into ... from have no capability to find the offending data. Instead, you can use BCP with its max_errors (-m) and error-file (-e) options to output all the offending rows into an error file. Then you can simply analyze the error file to find the offending rows (see the MSDN documentation for BCP).
One solution is to do a binary search to find the problematic value(s). You can do that both by column and by row:
Try to insert only half the columns; if that works, try the other half of the columns.
Try to insert only half the number of rows; if that works, the other half.
Repeat until you have found the problem (see the sketch below).
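As a sketch of the row-bisection step (the [ID] ranges and the shortened column list are assumptions, purely for illustration):

-- Insert half of the rows; if this half fails, split it again,
-- otherwise test the other half ([ID] > 50000).
insert Table_A ([ID], [rowVersion], [PluginId], [rawdataId])
select [ID], [rowVersion], [PluginId], [rawdataId]
from TABLE_B
where [ID] <= 50000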

For an Oracle NUMBER datatype, LIKE operator vs BETWEEN..AND operator

Assume mytable is an Oracle table and it has a field called id. The datatype of id is NUMBER(8). Compare the following queries:
select * from mytable where id like '715%'
and
select * from mytable where id between 71500000 and 71599999
I would think the second is more efficient since I think "number comparison" would require fewer assembly-language instructions than "string comparison". I need a confirmation or correction. Please confirm/correct and add any further comments related to either operator.
UPDATE: I forgot to mention 1 important piece of info. id in this case must be an 8-digit number.
If you only want values between 71500000 and 71599999 then yes, the second one is much more efficient. The first one would also return values between 7150-7159, 71500-71599, and so forth. You would either need to sift through unnecessary results or write another couple of lines of code to filter the rest of them out. The second option is definitely more efficient for what you seem to want to do.
It seems like the execution plan on the second query is more efficient.
The first query is doing a full table scan of the id's, whereas the second query is not.
(The original answer included screenshots of the test data and of the execution plans of the two queries.)
I don't like the idea of using LIKE with a numeric column.
Also, it may not give the results you are looking for.
If you have a value of 715000000, it will show up in the query result, even though it is larger than 71599999.
Also, I do not like between on principle.
If a thing is between two other things, it should not include those two other things. But this is just a personal annoyance.
I prefer to use >= and <=. This avoids confusion when I read the query. In addition, sometimes I have to change the query to something like >= a and < c. If I started by using the between operator, I would have to rewrite it when I don't want to be inclusive.
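Applied to the question's range, that style would look like this (for integer ids it is equivalent to the BETWEEN version, and it can still use an index on id):

select * from mytable where id >= 71500000 and id < 71600000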
Harv
In addition to the other points raised, using LIKE in the manner you suggest would cause Oracle to not use any indexes on the ID column due to the implicit conversion of the data from number to character, resulting in a full table scan when using LIKE versus an index range scan when using BETWEEN. Assuming, of course, you have an index on ID. Even if you don't, however, Oracle will have to do the type conversion on each value it scans in the LIKE case, which it won't have to do in the other.
You can use a math function; otherwise you have to use the to_char function to use LIKE, but that will cause performance problems.
select * from mytable where floor(id /100000) = 715
or
select * from mytable where floor(id /100000) = TO_NUMBER('715') -- this is parametric
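For completeness, the to_char form mentioned above would be something like the following; as noted, wrapping id in a function this way stops a normal index on id from being used:

select * from mytable where to_char(id) like '715%'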

Incorrect Results when calling a Python UDF in Redshift multiple times within a single column inside a select statement

I am encountering an issue in Redshift where calling a UDF more than once per column inside a select statement is returning the same result as the first call to that UDF.
Bit of Background
I have a very simple Python UDF that calculates an md5 hash. The reason for this function is to be able to handle UTF-16/UTF-8 conversion before doing the hash so it is consistent with SQL server. Now the syntax or logic inside the function does not seem to be the issue as we have tried creating even simpler functions that produce the same behavior.
The Problem
My function is named MD5_UTF16 and is called by doing MD5_UTF16(yourvalue), and returns a hash string / hexdigest of the value you pass into the argument.
In my query I need to be able to do this (postgresql syntax):
SELECT MD5_UTF16(column1) || MD5_UTF16(column2) || MD5_UTF16(column3) AS concatenatedhash
FROM MyTable
i.e. I need to calculate each hash and concatenate them into a single column. If I calculate each of those hashes separately in its own column, the function generates the correct hash for each column. However, in my example above I have called each function and concatenated the results with the results of the other calls. In this scenario, all the calls to the function return the hash for the first call, i.e. MD5_UTF16(column1).
To clarify a bit further using example hash values. Let's pretend these are the hashes for each of the columns above:
Column 1: 275AB169CBEE4550F752C634B9335AE0
Column 2: B2214041A94F50B027FE1DEEC4C8474C
Column 3: 91050DAEFFEE20CDA2FC9914B6E4EBE9
My expected result for the concatenatedhash column would be a simple concatenation of the strings above (275AB169CBEE4550F752C634B9335AE0B2214041A94F50B027FE1DEEC4C8474C91050DAEFFEE20CDA2FC9914B6E4EBE9)
Instead, what I am getting is a concatenation of column 1's hash 3 times:
(275AB169CBEE4550F752C634B9335AE0275AB169CBEE4550F752C634B9335AE0275AB169CBEE4550F752C634B9335AE0)
In my SELECT statement if I had called the function on column 2 (instead of column 1) first, then it would be the hash for column 2 that is repeated.
Has anyone encountered this before?
NOTE: You can only replicate this behavior if you are selecting data out of a table. So doing a:
SELECT MD5_UTF16('hard-coded value 1') || MD5_UTF16('hard-coded value 2')
with no table source will not replicate this behavior.
Work-arounds I am aware of
I do know of a possible workaround but I still would have expected my method above to work, so this question is not about applying the following workaround, but more understanding why the above method is not working.
- Workaround: Calculate each hash in a separate column first, then concatenate them (sketched below). This has potential performance implications on our end, among other things.
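A rough sketch of that workaround, using the names from the example above (whether the optimizer keeps the inner columns separate may still depend on the plan it chooses):

SELECT h1 || h2 || h3 AS concatenatedhash
FROM (
    SELECT MD5_UTF16(column1) AS h1,
           MD5_UTF16(column2) AS h2,
           MD5_UTF16(column3) AS h3
    FROM MyTable
) AS hashed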
EDIT 1
Have found that the issue I've described only happens when there is a join in my query... even if none of the column data from the joined table is being used in my UDF calls, i.e.:
SELECT ...concatenated hashes..
FROM table1
JOIN table2 ...
Removing the join seems to cause the hashes to be calculated correctly. Will attempt a workaround using this new knowledge. Not sure if it has anything to do with the execution plan running the UDFs differently when a join is involved, even though none of the column data from the joined table is being used for the UDF calls.