How to handle potential data loss when performing comparisons across data types in different groups - sql

Background:
Our group is going through a Cloudera upgrade to 6.1.1 and I have been tasked with determining how to handle the loss of the implicit data type conversion across data types. See link below for the relevant Release Note details.
https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_union_all_returns_incorrect_data
Not only does this issue affect UNION ALL queries, but we also have a function that performs comparisons on columns of different data types (e.g., STRING to BIGINT).
The group has decided that we do not want to change the underlying table metadata. So the solution is to allow for potential data loss by using the CAST() function to cast the data. In the case of UNION ALL, we cast to the destination table's metadata (a sketch of this is below). But when performing comparisons, I am trying to determine the simplest way to perform them without getting erroneous results.
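For illustration, a minimal sketch of our UNION ALL casting; the table and column names are made up:
INSERT INTO TABLE dest_table
SELECT CAST(col_a AS BIGINT) FROM source_a   -- col_a is STRING in source_a
UNION ALL
SELECT CAST(col_b AS BIGINT) FROM source_b;  -- destination column is BIGINT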
Question:
Can I simply cast everything to either STRING or VARCHAR() when performing the comparison? Are there any potential problems that might create incorrect results?
Update:
If there are problems with this approach, is there a correct solution to handle this?
Note: this is my first engagement working with Hadoop/Hive, and I have learned that what I know from RDBMS land does not always apply.

It is possible that you will have problems. For instance, if comparing a string to an int, then:
'1.00' = 1 --> true, because the values are compared as numbers
But as strings:
'1.00' = '1' --> false, because the values are compared as strings
You can get similar issues with dates, I think.
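A hedged illustration that can be run in Hive to see both behaviours:
SELECT '1.00' = 1;    -- true where the string is implicitly compared as a number
SELECT '1.00' = '1';  -- false: plain string comparison
-- Casting both sides to an explicit numeric type preserves numeric semantics:
SELECT CAST('1.00' AS DOUBLE) = CAST('1' AS DOUBLE);  -- true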

Related

"Numeric value '' is not recognized" - what column?

I am trying to insert data from a staging table into the master table. The table has nearly 300 columns and is a mix of data types: Varchars, Integers, Decimals, Dates, etc.
Snowflake gives the unhelpful error message of "Numeric value '' is not recognized"
I have gone through and cut out various parts of the query to try and isolate where it is coming from. After several hours and cutting every column, it is still happening.
Does anyone know of a Snowflake diagnostic query (like Redshift has) which can tell me a specific column where the issue is occurring?
Unfortunately not at the point you're at. If you went back to the COPY INTO that loaded the data, you'd be able to use the VALIDATE() function to get better information down to the record and byte-offset level.
I would query your staging table for just the numeric fields and look for blanks, or you can wrap all of the fields destined for numeric columns with TRY_TO_NUMBER() functions. A bit tedious, but might not be too bad if you don't have a lot of numbers.
https://docs.snowflake.com/en/sql-reference/functions/try_to_decimal.html
As a note, when you stage, you should try to use the NULL_IF options to get rid of bad characters, and/or try to load into your stage table using its actual data types, so you can leverage the VALIDATE() function to make sure the data types are correct before loading into Snowflake. For example:
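A hedged sketch of the NULL_IF idea at load time; the stage, file, and table names are assumptions:
COPY INTO staging_table
FROM @my_stage/data.csv
FILE_FORMAT = (TYPE = CSV NULL_IF = ('', 'NULL'));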
Query your staging table using TRY_TO_NUMBER() and/or TRY_TO_DECIMAL() on the number and decimal fields, and then use MINUS to get the difference:
SELECT $1, $2, ... $300 FROM @stage
MINUS
SELECT $1, TRY_TO_NUMBER($2), ... $300 FROM @stage;
If any number field holds a string that cannot be converted, TRY_TO_NUMBER() returns NULL, so the MINUS returns the rows which have a problem. Once you have those rows, analyze the columns in the result set for errors.
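Alternatively, a hedged per-column diagnostic; the table and column names are hypothetical:
SELECT *
FROM staging_table
WHERE amount_raw IS NOT NULL
  AND TRY_TO_NUMBER(amount_raw) IS NULL;  -- values that cannot become numbers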

General replacing of NULL values in SQL Views

We have a fairly recent version of SQL Server that we're using to extract data into an SAP BW data warehouse. We're using views to access data in tables on the SQL Server. Some of the fields in these tables contain NULL values. These are transferred into SAP as the string value 'NULL' instead of empty, which causes us a major headache.
I understand that we can use COALESCE() in views to replace NULL values with a desired default value ('', 0, '1900-01-01', etc.), however, doing this for each NULL field that we encounter doesn't appear to be very smart.
Is there a better way of addressing this issue short of changing tables to not allow NULL values? Is it possible to include a custom global function that gets automatically applied to all fields fetched in a view without us having to call this function for each field individually?
@Jeroen Mostert's answer in the comments answers my question:
There is no global toggle, flag or setting that will magically eliminate NULLs for you. The closest thing is that COALESCE(column, '') will "work" (for some values of "work") for almost every type (including DATETIME), with the notable exception of DECIMAL. You cannot write a function to do this, as functions in T-SQL cannot return different types based on their input. By far the best "fix" is indeed to fix the processing step, if only because ending up with 1900-01-01 dates in your database is typically quite undesirable.
Therefore, the only options are:
- go through each field that can potentially hold a NULL value and cleanse it within the view (a sketch of this is below), or
- handle NULL values on the receiving end (e.g. SAP BW); this could be done through a generic function placed in the entry layer's start routine, or it could be done manually.
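A minimal sketch of the first option, with hypothetical table and column names:
CREATE VIEW dbo.v_Orders AS
SELECT OrderId,
       COALESCE(CustomerName, '')        AS CustomerName,   -- string default
       COALESCE(Quantity, 0)             AS Quantity,       -- numeric default
       COALESCE(OrderDate, '1900-01-01') AS OrderDate       -- date default
FROM dbo.Orders;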
Disallowing NULL values on table level in the source system is in this case not feasible as we cannot change the application that writes to the tables (third party vendor ERP).

How can I change a date field from String to Date or DateTime?

I am using Google BigQuery and I have a field named 'AsOfDate', which is set as a string data type. I have a bunch of data in this field, which I really want to set as DateTime or just Date; either is fine. I Googled for a solution, and I thought this would be pretty easy to do, but I can't seem to get the data type updated. I don't want to run a simple select statement; I want to permanently change the schema. Has anyone run into this and figured out how to do this kind of thing? If so, please share your insights. Thanks!
To quote directly from the official documentation: 'Changing a column's data type is not supported by the BigQuery web UI, the command-line tool, or the API.'
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
There are two ways to manually change a column's data type:
- Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
- Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
You could use either of the approaches above along with the PARSE_DATE() function to transform your string into a date field.
https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#parse_date
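A hedged sketch of the SQL-query approach; the dataset and table names are made up, and the format string assumes values like '2019-03-15' (adjust it to match your data):
CREATE OR REPLACE TABLE mydataset.mytable AS
SELECT * REPLACE (PARSE_DATE('%Y-%m-%d', AsOfDate) AS AsOfDate)
FROM mydataset.mytable;
Note that rewriting the table this way scans it in full, which is the cost trade-off mentioned above.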

SQL Server - simple select and conversion between int and string

I have a simple select statement like this:
SELECT [dok__Dokument].[dok_Id],
[dok__Dokument].[dok_WartUsNetto],
[dok__Dokument].[dok_WartUsBrutto],
[dok__Dokument].[dok_WartTwNetto],
[dok__Dokument].[dok_WartTwBrutto],
[dok__Dokument].[dok_WartNetto],
[dok__Dokument].[dok_WartVat],
[dok__Dokument].[dok_WartBrutto],
[dok__Dokument].[dok_KwWartosc]
FROM [dok__Dokument]
WHERE [dok_NrPelnyOryg] = 2753
AND [dok_PlatnikId] = 174
AND [dok_OdbiorcaId] = 174
AND [dok_PlatnikAdreshId] = 625
AND [dok_OdbiorcaAdreshId] = 624
Column dok_NrPelnyOryg is of type varchar(30), and not null.
The table contained both integer and string values in this column and this select statement was fired millions of times.
However recently this started crashing with message:
Conversion failed when converting the varchar value 'garbi czerwiec B' to data type int.
Little explanation: the table contains multiple "document" records and the mentioned column contains document original number (which comes from multiple different sources).
I know I can fix this by adding '' around the number, but I'm rather looking for an explanation of why this used to work and why, without changing anything, it now crashes.
It's possible that a plan change (due to changed statistics, a recompile, etc.) led to this data being evaluated earlier (a full scan, for example), or that this particular data was not in the table previously (maybe there wasn't bad data in there before this started happening). If it is supposed to be a number, then make it a numeric column. If it needs to allow strings as well, then stop treating it like a number. If you properly parameterize your statements and always pass a varchar, you shouldn't need to worry about whether the value is enclosed in single quotes.
All those equality comparison operations are subject to the Data Type Precedence rules of SQL Server:
When an operator combines two expressions of different data types, the rules for data type precedence specify that the data type with the lower precedence is converted to the data type with the higher precedence.
Since character types have lower precedence than int types, the query is basically the same as:
SELECT ...
FROM [dok__Dokument]
WHERE cast([dok_NrPelnyOryg] as int) = 2753
...
This has two effects:
- it makes all indexes on columns involved in the WHERE clause useless
- it can cause conversion errors.
You're not the first to have this problem, in fact several CSS cases I faced had me eventually write an article about this: On SQL Server boolean operator short-circuit.
The correct solution to your problem is that if the field value is numeric, then the column type should be numeric. Since you say the data comes from a 3rd party application you cannot change, the best solution is to abandon the vendor of this application and pick one that knows what it is doing. Short of that, you need to compare character columns against character values:
SELECT ...
FROM [dok__Dokument]
WHERE [dok_NrPelnyOryg] = '2753'
...
In .NET managed ADO.NET parlance this means you use a SqlCommand like follows:
SqlCommand cmd = new SqlCommand(@"SELECT ...
    FROM [dok__Dokument]
    WHERE [dok_NrPelnyOryg] = @nrPelnyOryg
    ...");
cmd.Parameters.Add("@nrPelnyOryg", SqlDbType.VarChar).Value = "2754";
...
Just make sure you don't fall into the easy trap of passing in an NVARCHAR parameter (Unicode) for comparison with a VARCHAR column, since the same data type precedence rules quoted before will coerce the comparison to occur on the NVARCHAR type, thus rendering indexes, again, useless. The easiest way to fall for this trap is to use the dreaded AddWithValue and pass in a string value. A sketch of the trap is below.
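A hedged T-SQL illustration of that trap; the column is VARCHAR(30), as in the question:
DECLARE @nrUnicode NVARCHAR(30) = N'2753';
SELECT dok_Id FROM dok__Dokument
WHERE dok_NrPelnyOryg = @nrUnicode;  -- column coerced to NVARCHAR: index scan

DECLARE @nrAnsi VARCHAR(30) = '2753';
SELECT dok_Id FROM dok__Dokument
WHERE dok_NrPelnyOryg = @nrAnsi;     -- types match: an index seek is possible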
Your query stopped working because someone inserted a text string into the field you are comparing against an INT. Up until that time it was possible to implicitly convert the data, but now that's no longer the case.
I'd go check your data and, more importantly, the model; as Aaron said, do you need to allow strings in that field? If not, change the data type to prevent this happening in the future.

DB Performance and data types

I'm supporting an existing application written by another developer, and I have a question as to whether the data type the developer chose to store dates is affecting the performance of certain queries.
Relevant information: The application makes heavy use of a "Business Date" field in one of our tables. The data type for this business date is nvarchar(10) rather than a datetime data type. The format of the dates is "MM/DD/YYYY", so Christmas 2007 is stored as "12/25/2007".
Long story short, we have some heavy duty queries that run once a week and are taking a very long time to execute.
I'm re-writing this application from the ground up, but since I'm looking at this, I want to know if there is a performance difference between using the datetime data type compared to storing dates as they are in the current database.
You will both save disk-space and increase performance if you use datetime instead of nvarchar(10).
If you use the date fields to do date calculations (DATEADD etc.) you will see a massive increase in query execution speed, because the fields do not need to be converted to datetime at runtime.
Operations over DATETIMEs are faster than over VARCHARs converted to DATETIMEs.
If your dates appear anywhere but in SELECT clause (like, you add them, DATEDIFF them, search for them in WHERE clause etc), then you should keep them in internal format.
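A hedged sketch of the difference; the table and column names are hypothetical:
-- DATETIME column: the predicate can seek an index on BusinessDate directly.
SELECT COUNT(*) FROM dbo.Trades
WHERE BusinessDate >= DATEADD(DAY, -7, GETDATE());

-- nvarchar(10) 'MM/DD/YYYY' column: every row must be converted before
-- comparing, so an index on the column cannot be used for a seek.
SELECT COUNT(*) FROM dbo.Trades
WHERE CONVERT(DATETIME, BusinessDateStr, 101) >= DATEADD(DAY, -7, GETDATE());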
There are a lot of reasons you should actually use DateTime rather than a varchar to store a date. Performance is one... but I would be concerned about queries like this:
SELECT *
FROM Table
WHERE DateField > '12/25/2007'
giving you the wrong results: compared as strings, '12/26/2006' sorts after '12/25/2007' even though it is the earlier date.
I cannot back this up with numbers, but the datetime-type should be a lot faster, since it can easily be compared, unlike the varchar. In my opinion, it is also worth a shot to look into UNIX timestamps as your data type.
I believe from an architectural perspective a Datetime would be a more efficient data type, as it would be stored as two 4-byte integers, whereas your nvarchar(10) will be stored as up to 22 bytes (two bytes per character entered, plus 2 bytes). Therefore potentially more than double the amount of storage space is required compared to using a Datetime.
This of course has possible implications for indexing, as the smaller the data item, the more records you can fit on an index data page. This in turn produces a smaller index which is of course quicker to traverse and therefore will return results faster.
In summary, Datetime is the way to go.
Date filtering on the nvarchar field is not really possible, as the data in the index is sorted lexicographically, which doesn't match the sort order you would expect for dates. That's the problem with the date format "MM/DD/YYYY": "12/25/2007" will come after "12/01/2008" in an nvarchar index, but that's not what you want. "YYYY/MM/DD" would have been fine.
So, you should use a date field and convert the string values to date. You will surely get a big performance boost. That's if you can change the table schema.
Yes. datetime will be far more efficient for date calculations than varchar or nvarchar (why nvarchar - there's no way you've got real unicode in there, right?). Plus strings can be invalid and misinterpreted.
If you are only using the date part, your system may have a smaller date-only version of datetime.
In addition, if you are just doing joins and certain types of operations (>/</= comparisons but not DATEDIFF), a date "id" column which is actually an int of the form yyyymmdd is commonly used in data warehouses (a sketch is below). This does allow "invalid" dates, unfortunately, but it also allows more obvious reserved, "special" dates, whereas in datetime you might use NULL or 1/1/1900 or something. Integrity is usually enforced through a foreign key constraint to a date "dimension".
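A hedged sketch of that data-warehouse pattern; all names are made up:
CREATE TABLE dbo.DimDate (
    DateId   INT PRIMARY KEY,    -- e.g. 20071225 for Christmas 2007
    FullDate DATETIME NOT NULL
);

CREATE TABLE dbo.FactSales (
    SaleId BIGINT IDENTITY PRIMARY KEY,
    DateId INT NOT NULL REFERENCES dbo.DimDate (DateId)  -- integrity via the foreign key
);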
Seeing that you tagged the question as "sql server", I'm assuming you are using some version of SQL Server, so I recommend that you look at either using datetime or smalldatetime. In addition, in SQL Server 2008, you have a date type as well as a datetime2 with a much larger range. Check out this link which gives some details
One other problem with using varchar (or any other string data type) is that the data likely contains invalid dates, as they are not automatically validated on entry. If you try to change the field to a datetime field, you may have conversion problems where people have added dates such as ASAP, Unknown, 1/32/2009, etc. You will need to check for dates that won't convert using the handy ISDATE function and either fix or NULL them out before you try to change the data type.
Likely you also have a lot of code that converts the varchar type to a date data type on the fly so that you can do date math; all that code will also need to be fixed. A sketch of the check and conversion is below.
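A hedged sketch of that check and conversion; the table and column names are assumptions, and style 101 matches 'MM/DD/YYYY' strings:
-- 1. Find values that will not convert:
SELECT BusinessDate
FROM dbo.SomeTable
WHERE ISDATE(BusinessDate) = 0;

-- 2. After fixing or NULLing them out, convert into a new column:
ALTER TABLE dbo.SomeTable ADD BusinessDateD DATETIME;
GO
UPDATE dbo.SomeTable
SET BusinessDateD = CONVERT(DATETIME, BusinessDate, 101)
WHERE ISDATE(BusinessDate) = 1;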
Chances are the datetime type is both more compact and faster, but more importantly, using DATETIMEs to store a date and time is a better architectural choice. You're less likely to run into weird problems looking for records between a certain date range, and most database libraries will map them to your language's Date type, so the code is much cleaner, which is really much more important in the long run.
Even if it were slower, you'd spend more time debugging the strings-as-dates than all your users will ever see in savings combined.