Background:
Our group is going through a Cloudera upgrade to 6.1.1 and I have been tasked with determining how to handle the loss of the implicit data type conversion across data types. See link below for the relevant Release Note details.
https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_611_incompatible_changes.html#hive_union_all_returns_incorrect_data
Not only does this issue affect UNION ALL queries, but there is a function that performs comparisons on columns of different data types (i.e, STRING to BIGINT).
The group has decided that we do not want to change the underlying table meta data. So the solution is to allow for potential data loss by using the CAST() function to cast the data. In the case of UNION ALL, we cast to the destination table's meta data. But, when performing comparisons, I am trying to determine the simplest and easiest way to perform comparisons without getting erroneous results.
Question:
Can I simply cast everything to either STRING or VARCHAR() when performing the comparison? Are there any potential problems that might create incorrect results?
Update:
If there are problems with this approach, is there a correct solution to handle this?
Note: this is my first engagement working with Hadoop/HIVE and I have learned that everything I know in RDBMS land does not always apply.
It is possible that you will have problems. For instance, if comparing a string to an int, then:
'1.00' = 1 --> true, because the values are compared as numbers
But as strings:
'1.00' = '1' --> false, because the values are compared as strings
You can get similar issues with dates, I think.
I have a table that holds drafted data. Most of the values on the table are nullable. I want to return an empty string in the case the value is null and I'm grabbing and copying the draft data onto the frontend. What would be the best way to handle this? I understand I can use ISNULL(variable, '') but don't want it to do that for every single value. Or a ternary on the front end to return a empty string if null. Would there be a better way to handle this?
You can have a trigger on your table before insert or update and populate the fields you want with an empty value for all values that match your "treat null as empty" rule.
In my opinion changing data model to satisfy UI requirements is not a great idea. I would rather have a transformer in the middle (or a custom RowMappeer) that will do the transformation on the data already pulled from the database.
Also note that for some databases (such as Oracle) an empty string and a null are interchangeable in which case my suggestion to use a trigger won't work.
I know you said you did not want to use ISNULL but this is the main reason this function exists so I would rather start using it. If you have lots of such fields you can always use a clever editor to help you build your query using ISNULL rather than typing.
You can use the nvl sql function, which allows you to change null value for any value you want.
I have a table with a column1 nvarchar(50) null. I want to insert this into a more 'tight' table with a nvarchar(30) not null. My idea was to insert a derived column task between source and destination task with this expression: Replace column1 = (DT_WSTR,30)Column1
I get the "truncation may occur error" and I am not allowed to insert the data into the new tighter table.
Also I am 100% sure that no values are over 30 characters in the column. Moreover I do not have the possibility to change the column data type in the source.
What is the best way to create the ETL process?
JotaBe recommended using a data conversion transformation. Yes, that is another way to achieve the same thing, but it will also error out if truncation occurs. Your way should work (I tried it), provided the input data really is less than 30 characters.
You could modify your derived column expression to
(DT_WSTR,30)Substring([Column1], 1, 30)
Consider changing the truncation error disposition of the Derived Column component within your Data Flow. By default, a truncation will cause the Derived Column component to fail. You can configure the component to ignore or redirect rows which are causing a truncation error.
To do this, open the Derived Column Transformation editor and click the 'Configure Error Output...' button in the bottom-left of the dialog. From here, change the setting in the 'Truncation' column for any applicable columns as required.
Be aware that any data which is truncated for columns ignoring failure will not be reported by SSIS during execution. It sounds like you've already done this, but it's important to be sure you've analysed your data as it currently stands and taken into consideration any possible future changes to the nature of the data before disabling truncation reporting.
To do so you must use a Data Conversion Transformation, which allows to change the data type from the original nvarchar(50) to the desired nvarchar(30).
You'll get a new column with the required data type.
Of course, you can decide what to do in case of error: truncation, by configuring this component.
UPDATE
As there are people who have downvoted this answer, let's add 3 more comments:
this solution is checked and works. Create a table with a nvarchar(50) column, a new table with a nvarchar(30) column, add a data flow that uses a data conversion transform and it works witout a glitch. Please, chek it, I guarantee. Besides, as the OP states "Also I am 100% sure that no values are over 30 characters in the column" in his case there will be no truncation problems. However, I recommend treating the possible errors, just in case they happen.
from MSDN: "a package can perform the following types of data conversions: ... Set the column length of string data"
from MSDN: "If the length of an output column of string data is shorter than the length of its corresponding input column, the output data is truncated."
What are the pros and cons of using NULL values in SQL as opposed to default values?
PS. Many similar questions has been asked on here but none answer my question.
I don't know why you're even trying to compare these to cases. null means that some column is empty/has no value, while default value gives a column some value when we don't set it directly in query.
Maybe some example will be better explanation. Let's say we've member table. Each member has an ID and username. Optional he might has an e-mail address (but he doesn't have to). Also each member has a postCount column (which is increased every time user write a post). So e-mail column can have a null value (because e-mail is optional), while postCount column is NOT NULL but has default value 0 (because when we create a new member he doesn't have any posts).
Null values are not ... values!
Null means 'has no value' ... beside the database aspect, one important dimension of non valued variables or fields is that it is not possible to use '=' (or '>', '<'), when comparing variables.
Writting something like (VB):
if myFirstValue = mySecondValue
will not return either True or False if one or both of the variables are non-valued. You will have to use a 'turnaround' such as:
if (isnull(myFirstValue) and isNull(mySecondValue)) or myFirstValue = mySecondValue
The 'usual' code used in such circumstances is
if Nz(myFirstValue) = Nz(mySecondValue, defaultValue)
Is not strictly correct, as non-valued variables will be considered as 'equal' to the 'defaultValue' value (usually Zero-length string).
In spite of this unpleasant behaviour, never never never turn on your default values to zero-length string (or '0's) without a valuable reason, and easing value comparison in code is not a valuable reason.
NULL values are meant to indicate that the attribute is either not applicable or unknown. There are religious wars fought over whether they're a good thing or a bad thing but I fall in the "good thing" camp.
They are often necessary to distinguish known values from unknown values in many situations and they make a sentinel value unnecessary for those attributes that don't have a suitable default value.
For example, whilst the default value for a bank balance may be zero, what is the default value for a mobile phone number. You may need to distinguish between "customer has no mobile phone" and "customer's mobile number is not (yet) known" in which case a blank column won't do (and having an extra column to decide whether that column is one or the other is not a good idea).
Default values are simply what the DBMS will put in a column if you don't explicitly specify it.
It depends on the situation, but it's really ultimately simple. Which one is closer to the truth?
A lot of people deal with data as though it's just data, and truth doesn't matter. However, whenever you talk to the stakeholders in the data, you find that truth always matters. sometimes more, sometimes less, but it always matters.
A default value is useful when you may presume that if the user (or other data source) had provided a value, the value would have been the default. If this presumption does more harm then good, then NULL is better, even though dealing with NULL is a pain in SQL.
Note that there are three different ways default values can be implemented. First, in the application, before inserting new data. The database never sees the difference between a default value provided by the user or one provided by the app!
Second, by declaring a default value for the column, and leaving the data missing in an insert.
Third, by substituting the default value at retrieval time, whenever a NULL is detected. Only a few DBMS products permit this third mode to be declared in the database.
In an ideal world, data is never missing. If you are developing for the real world, required data will eventually be missing. Your applications can either do something that makes sense or something that doesn't make sense when that happens.
As with many things, there are good and bad points to each.
Good points about default values: they give you the ability to set a column to a known value if no other value is given. For example, when creating BOOLEAN columns I commonly give the column a default value (TRUE or FALSE, whatever is appropriate) and make the column NOT NULL. In this way I can be confident that the column will have a value, and it'll be set appropriate.
Bad points about default values: not everything has a default value.
Good things about NULLs: not everything has a known value at all times. For example, when creating a new row representing a person I may not have values for all the columns - let's say I know their name but not their birth date. It's not appropriate to put in a default value for the birth date - people don't like getting birthday cards on January 1st (if that's the default) if their birthday is actually July 22nd.
Bad things about NULLs: NULLs require careful handling. In most databases built on the relational model as commonly implemented NULLs are poison - the presence of a NULL in a calculation causes the result of the calculation to be NULL. NULLs used in comparisons can also cause unexpected results because any comparison with NULL returns UNKNOWN (which is neither TRUE nor FALSE). For example, consider the following PL/SQL script:
declare
nValue NUMBER;
begin
IF nValue > 0 THEN
dbms_output.put_line('nValue > 0');
ELSE
dbms_output.put_line('nValue <= 0');
END IF;
IF nValue <= 0 THEN
dbms_output.put_line('nValue <= 0');
ELSE
dbms_output.put_line('nValue > 0');
END IF;
end;
The output of the above is:
nValue <= 0
nValue > 0
This may be a little surprising. You have a NUMBER (nValue) which is both less than or equal to zero and greater than zero, at least according to this code. The reason this happens is that nValue is actually NULL, and all comparisons with NULL result in UNKNOWN instead of TRUE or FALSE. This can result in subtle bugs which are hard to figure out.
Share and enjoy.
To me, they are somewhat orthogonal.
Default values allow you to gracefully evolve your database schema (think adding columns) without having to modify client code. Plus, they save some typing, but relying on default values for this is IMO bad.
Nulls are just that: nulls. Missing value and a huge PITA when dealing with Three-Valued Logic.
In a Data Warehouse, you would always want to have default values rather than NULLs.
Instead you would have value such as "unknown","not ready","missing"
This allows INNER JOINs to be performed efficiently on the Fact and Dimension tables as 'everything always has a value'
Nulls and default values are different things used for different purposes. If you are trying to avoid using nulls by giving everything a default value, that is a poor practice as I will explain.
Null means we do not know what the value is or will be. For instance suppose you have an enddate field. You don't know when the process being recorded will end, so null is the only appropriate value; using a default value of some fake date way out in the future will cause as much trouble to program around as handling the nulls and is more likely in my experience to create a problem with incorrect results being returned.
Now there are times when we might know what the value should be if the person inserting the record does not. For instance, if you have a date inserted field, it is appropriate to have a default value of the current date and not expect the user to fill this in. You are likely to actually have better information that way for this field.
Sometimes, it's a judgement call and depends on the business rules you have to apply. Suppose you have a speaker honoraria field (Which is the amount a speaker would get paid). A default value of 0 could be dangerous as it it might mean that speakers are hired and we intend to pay them nothing. It is also possible that there may occasionally be speakers who are donating their time for a particular project (or who are employees of the company and thus not paid extra to speak) where zero is a correct value, so you can't use zero as the value to determine that you don't know how much this speaker is to be paid. In this case Null is the only appropriate value and the code should trigger an issue if someone tries to add the speaker to a conference. In a different situation, you may know already that the minimum any speaker will be paid is 3000 and that only speakers who have negotiated a different rate will have data entered in the honoraria field. In this case, it is appropriate to put in a default value of 3000. In another cases, different clients may have different minimums, so the default should be handled differently (usually through a lookup table that automatically populates the minimum honoraria value for that client on the data entry form.
So I feel the best rule is leave the value as null if you truly cannot know at the time the data is entered what the value of the field should be. Use a default value only it is has meaning all the time for that particular situation and use some other technique to fill in the value if it could be different under different circumstances.
I so appreciate all of this discussion. I am in the midst of building a data warehouse and am using the Kimball model rather strictly. There is one very vocal user, however, who hates surrogate keys and wants NULLs all over the place. I told him that it is OK to have NULLable columns for attributes of dimensions and for any dates or numbers that are used in calculations because default values there imply incorrect data. There are, I agree, advantages to allowing NULL in certain columns but it makes cubing a lot better and more reliable if there is a surrogate key for every foreign key to a dimension, even if that surrogate is -1 or 0 for a dummy record. SQL likes integers for joins and if there is a missing dimension value and a dummy is provided as a surrogate key, then you will get the same number of records using one dimension as you would cubing on another dimension. However, calculations have to be done correctly and you have to accommodate for NULL values in those. Birthday should be NULL so that age is not calculated, for example. I believe in good data governance and making these decisions with the users forces them to think about their data in more ways than ever.
As one responder already said, NULL is not a value.
Be very ware of anything proclaimed by anyone who speaks of "the NULL value" as if it were a value.
NULL is not equal to itself. x=y yields false if both x and y are NULL. x=y yields true if both x and y are the default value.
There are almost endless consequences to this seemingly very simple difference. And most of those consequences are booby traps that bite you real bad.
Nulls NEVER save storage space in DB2 for OS/390 and z/OS. Every nullable column requires one additional byte of storage for the null indicator. So, a CHAR(10) column that is nullable will require 11 bytes of storage per row – 10 for the data and 1 for the null indicator. This is the case regardless of whether the column is set to null or not.
DB2 for Linux, Unix, and Windows has a compression option that allows columns set to null to save space. Using this option causes DB2 to eliminate the unused space from a row where columns are set to null. This option is not available on the mainframe, though.
REF: http://www.craigsmullins.com/bp7.htm
So, the best modeling practice for DB2 Z/OS is to use "NOT NULL WITH DEFAULT" as a standard for all columns. It's the same followed in some major shops I knew. Makes the life of programmers more easier not having to handle the Null Indicator and actually saves on storage by eliminating the need to use the extra byte for the NULL INDICATOR.
Two very good Access-oriented articles about Nulls by Allen Browne:
Nulls: Do I need them?
Common Errors with Null
Aspects of working with Nulls in VBA code:
Nothing? Empty? Missing? Null?
The articles are Access-oriented, but could be valuable to those using any database, particularly relative novices because of the conversational style of the writing.
Due to a weird request, I can't put null in a database if there is no value. I'm wondering what can I put in the store procedure for nothing instead of null.
For example:
insert into blah (blah1) values (null)
Is there something like nothing or empty for "blah1" instead using null?
I would push back on this bizarre request. That's exactly what NULL is for in SQL, to denote a missing or inapplicable value in a column.
Is the requester experiencing grief over SQL logic with NULL?
edit: Okay, I've read your reply with the extra detail about this job assignment (btw, generally you should edit your original question instead of posting more information in an answer).
You'll have to declare all columns as NOT NULL and designate a special value in the domain of that column's data type to signify "no value." The appropriate value to choose might be different on a case by case basis, i.e. zero may signify nothing in a person_age column, but it might have significance in an items_in_stock column.
You should document the no-value value for each column. But I suppose they don't believe in documentation either. :-(
Depends on the data type of the column. For numbers (integers, etc) it could be zero (0) but if varchar then it can be an empty string ("").
I agree with other responses that NULL is best suited for this because it transcends all data types denoting the absence of a value. Therefore, zero and empty string might serve as a workaround/hack but they are fundamentally still actual values themselves that might have business domain meaning other than "not a value".
(If only the SQL language supported a "Not Applicable" (N/A) value type that would serve as an alternative to NULL...)
Is null is a valid value for whatever you're storing?
Use a sentry value like INT32.MaxValue, empty string, or "XXXXXXXXXX" and assume it will never be a legitimate value
Add a bit column 'Exists' that you populate with true at the same time you insert.
Edit: But yeah, I'll agree with the other answers that trying to change the requirements might be better than trying to solve the problem.
If you're using a varchar or equivalent field, then use the empty string.
If you're using a numeric field such as int then you'll have to force the user to enter data, else come up with a value that means NULL.
I don't envy you your situation.
There's a difference between NULLs as assigned values (e.g. inserted into a column), and NULLs as a SQL artifact (as for a field in a missing record for an OUTER JOIN. Which might be a foreign concept to these users. Lots of people use Access, or any database, just to maintain single-table lists.) I wouldn't be surprised if naive users would prefer to use an alternative for assignments; and though repugnant, it should work ok. Just let them use whatever they want.
There is some validity to the requirement to not use NULL values. NULL values can cause a lot of headache when they are in a field that will be included in a JOIN or a WHERE clause or in a field that will be aggregated.
Some SQL implementations (such as MSSQL) disallow NULLable fields to be included in indexes.
MSSQL especially behaves in unexpected ways when NULL is evaluated for equality. Does a NULL value in a PaymentDue field mean the same as zero when we search for records that are up to date? What if we have names in a table and somebody has no middle name. It is conceivable that either an empty string or a NULL could be stored, but how do we then get a comprehensive list of people that have no middle name?
In general I prefer to avoid NULL values. If you cannot represent what you want to store using either a number (including zero) or a string (including the empty string as mentioned before) then you should probably look closer into what you are trying to store. Perhaps you are trying to communicate more than one piece of data in a single field.