Do Apache Pig UDFs ever get called with null tuples? - apache-pig

https://wiki.apache.org/pig/UDFManual
The example UDF has a null-check on the input tuple in the exec method. The various built-in methods sometimes do and sometimes don't.
Are there actually any cases where a Pig script will cause a UDF to be called with a null input tuple? Certainly an empty input tuple is normal and expected, or a tuple of one null value, but I've never the tuple itself be null.

A tuple may be null because a previous UDF returns null.
Think on log analysis system, where you 1. parse the log, 2. enrich it with external data.
LOG --(PARSER)--> PARSED_LOG --(ENRICHENER)--> ENRICHED_LOG
If one log LOG have wrong format and cannot be parsed, the UDF PARSED_LOG may returns null.
Therefore, if used directly, the ENRICHENER have to test the input.
You can FILTER those null values before too, specially if the tuples are used multiple times or STORED.

My best understanding after working with Pig for a while is that bare nulls are never passed bare nulls, always a non-null Tuple (which itself may contain nulls.)

Related

What is the purpose of using `timestamp(nullif('',''))`

Folks
I am in the process of moving a decade old back-end from DB2 9.5 to Oracle 19c.
I frequently see in SQL queries and veiw definitions bizarre timestamp(nullif('','')) constructs used instead of a plain null.
What is the point of doing so? Why would anyone in their same mind would want to do so?
Disclaimer: my SQL skills are fairly mediocre. I might well miss something obvious.
It appears to create a NULL value with a TIMESTAMP data type.
The TIMESTAMP DB2 documentation states:
TIMESTAMP scalar function
The TIMESTAMP function returns a timestamp from a value or a pair of values.
TIMESTAMP(expression1, [expression2])
expression1 and expression2
The rules for the arguments depend on whether expression2 is specified and the data type of expression2.
If only one argument is specified it must be an expression that returns a value of one of the following built-in data types: a DATE, a TIMESTAMP, or a character string that is not a CLOB.
If you try to pass an untyped NULL to the TIMESTAMP function:
TIMESTAMP(NULL)
Then you get the error:
The invocation of routine "TIMESTAMP" is ambiguous. The argument in position "1" does not have a best fit.
To invoke the function, you need to pass one of the required DATE, TIMESTAMP or a non-CLOB string to the function which means that you need to coerce the NULL to have one of those types.
This could be:
TIMESTAMP(CAST(NULL AS VARCHAR(14)))
TIMESTAMP(NULLIF('',''))
Using NULLIF is more confusing but, if I have to try to make an excuse for using it, is slightly less to type than casting a NULL to a string.
The equivalent in Oracle would be:
CAST(NULL AS TIMESTAMP)
This also works in DB2 (and is even less to type).
It is not clear why - in any SQL dialect, no matter how old - one would use an argument like nullif('',''). Regardless of the result, that is a constant that can be calculated once and for all, and given as argument to timestamp(). Very likely, it should be null in any dialect and any version. So that should be the same as timestamp(null). The code you found suggests that whoever wrote it didn't know what they were doing.
One might need to write something like that - rather than a plain null - to get null of a specific data type. Even though "theoretical" SQL says null does not have a data type, you may need something like that, for example in a view, to define the data type of the column defined by an expression like that.
In Oracle you can use the cast() function, as MT0 demonstrated already - that is by far the most common and most elegant equivalent.
If you want something much closer in spirit to what you saw in that old code, to_timestamp(null) will have the same effect. No reason to write something more complicated for null given as argument, though - along the lines of that nullif() call.

General replacing of NULL values in SQL Views

We have a fairly recent version of a SQL Server that we're using to extract data into a SAP BW Datawarehouse. We're using views to access data in tables on the SQL server. Some of the fields in these tables contain NULL values. These are transferred into SAP as String Value ('NULL') instead of empty, which causes us a major headache.
I understand that we can use COALESCE() in views to replace NULL values with a desired default value ('', 0, '1900-01-01', etc.), however, doing this for each NULL field that we encounter doesn't appear to be very smart.
Is there a better way of addressing this issue short of changing tables to not allow NULL values? Is it possible to include a custom global function that gets automatically applied to all fields fetched in a view without us having to call this function for each field individually?
#Jeroen Mostert's answer in the comments answers my question.
There is no global toggle, flag or setting that will magically
eliminate NULLs for you. The closest thing is that COALESCE(,
'') will "work" (for some values of "work") for almost every type
(including DATETIME), with the notable exception of DECIMAL. You
cannot write a function to do this, as functions in T-SQL cannot
return different types based on their input. By far the best "fix" is
indeed to fix the processing step, if only because ending up with
1900-01-01 dates in your database is typically quite undesirable.
Therefore, the only options are
go through each field that can potentially hold a NULL value and
cleanse it within the view
handle NULL values on the receiving
end (e.g. SAP BW); this could be done through a generic function to
be placed in the entry layer's start routine or it could be done
manually.
Disallowing NULL values on table level in the source system is in this case not feasible as we cannot change the application that writes to the tables (third party vendor ERP).

Why pandas.Series.values drops some scalar data exists in original Series?

I have a pandas.Series object docs of which scalar values are string.
when I tries to iterate over docs.values, for example make list(docs), some of the scalar entries are dropped, or become a NoneType.
For instance, given target_index is an index with buggy issue, when I check docs[target_index], it returns a string data. However, when I execute list(docs)[target_index], it returns None.
Since pandas.Series.values turn data into numpy.ndarray, I guess the issue has something to do with numpy data type or something, but I cannot figure out whats going wrong exactly.
Here is the buggy json file of dataframe
https://gist.github.com/goodcheer/f9c990171a57ff053b4b0539396f63f6
docs is the profile column series

SQL Server 2008 - Default column value - should i use null or empty string?

For some time i'm debating if i should leave columns which i don't know if data will be passed in and set the value to empty string ('') or just allow null.
i would like to hear what is the recommended practice here.
if it makes a difference, i'm using c# as the consuming application.
I'm afraid that...
it depends!
There is no single answer to this question.
As indicated in other responses, at the level of SQL, NULL and empty string have very different semantics, the former indicating that the value is unknown, the latter indicating that the value is this "invisible thing" (in displays and report), but none the less it a "known value". A example commonly given in this context is that of the middle name. A null value in the "middle_name" column would indicate that we do not know whether the underlying person has a middle name or not, and if so what this name is, an empty string would indicate that we "know" that this person does not have a middle name.
This said, two other kinds of factors may help you choose between these options, for a given column.
The very semantics of the underlying data, at the level of the application.
Some considerations in the way SQL works with null values
Data semantics
For example it is important to know if the empty-string is a valid value for the underlying data. If that is the case, we may loose information if we also use empty string for "unknown info". Another consideration is whether some alternate value may be used in the case when we do not have info for the column; Maybe 'n/a' or 'unspecified' or 'tbd' are better values.
SQL behavior and utilities
Considering SQL behavior, the choice of using or not using NULL, may be driven by space consideration, by the desire to create a filtered index, or also by the convenience of the COALESCE() function (which can be emulated with CASE statements, but in a more verbose fashion). Another consideration is whether any query may attempt to query multiple columns to append them (as in SELECT name + ', ' + middle_name AS LongName etc.).
Beyond the validity of the choice of NULL vs. empty string, in given situation, a general consideration it to try and be as consistent as possible, i.e. to try and stick to ONE particular way, and to only/purposely/explicitly depart from this way for good reasons and in few cases.
Don't use empty string if there is no value. If you need to know if a value is unknown, have a flag for it. But 9 times out of 10, if the information is not provided, it's unknown, and that's fine.
NULL means unknown value. An empty string means a known value - a string with length zero. These are totally different things.
empty when I want a valid default value that may or may not be changed, for example, a user's middle name.
NULL when it is an error if the ensuing code does not set the value explicitly.
However, By initializing strings with the Empty value instead of null, you can reduce the chances of a NullReferenceException occurring.
Theory aside, I tend to view:
Empty string as a known value
NULL as unknown
In this case, I'd probably use NULL.
One important thing is to be consistent: mixing NULLs and empty strings will end in tears.
On a practical implementation level, empty string takes 2 bytes in SQL Server where as NULLs are bitmapped. In some conditions and for wide/larger tables it makes a different in performance because it's more data to shift around.

TSQL: No value instead of Null

Due to a weird request, I can't put null in a database if there is no value. I'm wondering what can I put in the store procedure for nothing instead of null.
For example:
insert into blah (blah1) values (null)
Is there something like nothing or empty for "blah1" instead using null?
I would push back on this bizarre request. That's exactly what NULL is for in SQL, to denote a missing or inapplicable value in a column.
Is the requester experiencing grief over SQL logic with NULL?
edit: Okay, I've read your reply with the extra detail about this job assignment (btw, generally you should edit your original question instead of posting more information in an answer).
You'll have to declare all columns as NOT NULL and designate a special value in the domain of that column's data type to signify "no value." The appropriate value to choose might be different on a case by case basis, i.e. zero may signify nothing in a person_age column, but it might have significance in an items_in_stock column.
You should document the no-value value for each column. But I suppose they don't believe in documentation either. :-(
Depends on the data type of the column. For numbers (integers, etc) it could be zero (0) but if varchar then it can be an empty string ("").
I agree with other responses that NULL is best suited for this because it transcends all data types denoting the absence of a value. Therefore, zero and empty string might serve as a workaround/hack but they are fundamentally still actual values themselves that might have business domain meaning other than "not a value".
(If only the SQL language supported a "Not Applicable" (N/A) value type that would serve as an alternative to NULL...)
Is null is a valid value for whatever you're storing?
Use a sentry value like INT32.MaxValue, empty string, or "XXXXXXXXXX" and assume it will never be a legitimate value
Add a bit column 'Exists' that you populate with true at the same time you insert.
Edit: But yeah, I'll agree with the other answers that trying to change the requirements might be better than trying to solve the problem.
If you're using a varchar or equivalent field, then use the empty string.
If you're using a numeric field such as int then you'll have to force the user to enter data, else come up with a value that means NULL.
I don't envy you your situation.
There's a difference between NULLs as assigned values (e.g. inserted into a column), and NULLs as a SQL artifact (as for a field in a missing record for an OUTER JOIN. Which might be a foreign concept to these users. Lots of people use Access, or any database, just to maintain single-table lists.) I wouldn't be surprised if naive users would prefer to use an alternative for assignments; and though repugnant, it should work ok. Just let them use whatever they want.
There is some validity to the requirement to not use NULL values. NULL values can cause a lot of headache when they are in a field that will be included in a JOIN or a WHERE clause or in a field that will be aggregated.
Some SQL implementations (such as MSSQL) disallow NULLable fields to be included in indexes.
MSSQL especially behaves in unexpected ways when NULL is evaluated for equality. Does a NULL value in a PaymentDue field mean the same as zero when we search for records that are up to date? What if we have names in a table and somebody has no middle name. It is conceivable that either an empty string or a NULL could be stored, but how do we then get a comprehensive list of people that have no middle name?
In general I prefer to avoid NULL values. If you cannot represent what you want to store using either a number (including zero) or a string (including the empty string as mentioned before) then you should probably look closer into what you are trying to store. Perhaps you are trying to communicate more than one piece of data in a single field.