Why is SUM(null) not 0 in Oracle? - sql

I would appreciate an explanation of how Oracle's SUM function behaves internally when it encounters null values:
The result of
select sum(null) from dual;
is null
But when a null value appears within a sequence of values (such as the sum of a nullable column), the null values appear to be treated as 0
select sum(value) from
(
select case when mod(level , 2) = 0 then null else level end as value from dual
connect by level <= 10
)
is 25
This becomes more interesting when you see that the result of
select (1 + null) from dual
is null
since any operation involving null yields null (except the IS NULL operator).
==========================
An update, prompted by the comments:
create table odd_table as select sum(null) as some_name from dual;
will result in:
create table ODD_TABLE
(
some_name NUMBER
)
Why is the some_name column of type NUMBER?

If you are looking for a rationale for this behaviour, then it is to be found in the ANSI SQL standards, which dictate that aggregate operators ignore NULL values.
If you wanted to override that behaviour then you're free to:
Sum(Coalesce(<expression>,0))
... although it would make more sense with Sum() to ...
Coalesce(Sum(<expression>),0)
You might more meaningfully use:
Avg(Coalesce(<expression>,0))
... or ...
Min(Coalesce(<expression>,0))
Other ANSI aggregation quirks:
Count() never returns null (or negative, of course)
Selecting only aggregation functions without a Group By will always return a single row, even if there is no data from which to select.
So ...
Coalesce(Count(<expression>),0)
... is a waste of a good coalesce.
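For instance, a minimal sketch on dual (my own illustration, not from the answer above) showing the all-NULL case and both placements of Coalesce:

select sum(value)              as plain_sum,        -- NULL
       coalesce(sum(value), 0) as coalesce_of_sum,  -- 0
       sum(coalesce(value, 0)) as sum_of_coalesce   -- 0
from (select cast(null as number) as value from dual
      connect by level <= 3);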

SQL does not treat NULL values as zeros when calculating SUM, it ignores them:
Returns the sum of all the values, or only the DISTINCT values, in the expression. Null values are ignored.
This makes a difference only in one case - when the sequence being totalled up does not contain numeric items, only NULLs: if at least one number is present, the result is going to be numeric.
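A quick illustration on dual: one numeric value among the NULLs is enough.

select sum(value) as total   -- 7, not NULL
from (select 7 as value from dual union all
      select null from dual union all
      select null from dual);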

You're looking at this the wrong way around. SUM() operates on a column, and ignores nulls.
To quote from the documentation:
This function takes as an argument any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
A NULL has no data type, so your first example must return null, as a NULL is not numeric.
Your second example sums the numeric values in the column. The sum of 0 + null + 1 + 2 is 3; the NULL simply means that a number does not exist here.
Your third example is not an operation on a column; the answer is the same with or without SUM(), as nothingness + 1 is still nothingness. You can't cast a NULL to an empty number as you can with a string: there's no such thing as an empty number. It either exists or it doesn't.
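To make the contrast concrete, here is a small sketch of the column case:

select sum(value) as col_sum   -- 3: the NULL row is simply skipped
from (select 0 as value from dual union all
      select null from dual union all
      select 1 from dual union all
      select 2 from dual);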

Arithmetic aggregate functions ignore nulls.
SUM() ignores them
AVG() calculates the average as if the null rows didn't exist (nulls don't count in the total or the divisor)
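A small example of the AVG behaviour (and how coalescing per row changes the divisor):

select avg(value)              as avg_skipping_nulls,  -- 3 = (2 + 4) / 2
       avg(coalesce(value, 0)) as avg_counting_nulls   -- 2 = (2 + 0 + 4) / 3
from (select 2 as value from dual union all
      select null from dual union all
      select 4 from dual);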

As Bohemian has pointed out, both SUM and AVG exclude entries with NULL in them. Those entries do not go into the aggregate. If AVG treated NULL entries as zero, it would bias the result towards zero.
It may appear to the casual observer as though SUM is treating NULL entries as zero. It's really excluding them. If all the entries are excluded, the result is no value at all, which is NULL. Your example illustrates this.

This is incorrect: "The sum of 0 + null + 1 + 2 is 3"
select 0 + null + 1 + 2 total from dual;
Result is null!
Any similar expression yields null if any operand is null.
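If you do want null treated as zero in scalar arithmetic, you have to say so explicitly, for example with nvl:

select 0 + nvl(null, 0) + 1 + 2 as total from dual;  -- 3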

Here's a solution if you want to sum and NOT ignore nulls.
This solution splits the records into two groups: nulls and non-nulls. NVL2(a, 1, NULL) does this by changing all the non-nulls to 1 so they sort together identically. It then sorts those two groups to put the null group first (if there is one), then sums just the first of the two groups. If there are no nulls, there will be no null group, so that first group will contain all the rows. If, instead, there is at least one null, then that first group will only contain those nulls, and the sum of those nulls will be null.
SELECT SUM(a) AS standards_compliant_sum,
SUM(a) KEEP(DENSE_RANK FIRST ORDER BY NVL2(a, 1, NULL) DESC) AS sum_with_nulls
FROM (SELECT 41 AS a FROM DUAL UNION ALL
SELECT NULL AS a FROM DUAL UNION ALL
SELECT 42 AS a FROM DUAL UNION ALL
SELECT 43 AS a FROM DUAL);
You can optionally include NULLS FIRST to make it a little more clear about what's going on. If you're intentionally ordering for the sake of moving nulls around, I always recommend this for code clarity.
SELECT SUM(a) AS standards_compliant_sum,
SUM(a) KEEP(DENSE_RANK FIRST ORDER BY NVL2(a, 1, NULL) DESC NULLS FIRST) AS sum_with_nulls
FROM (SELECT 41 AS a FROM DUAL UNION ALL
SELECT NULL AS a FROM DUAL UNION ALL
SELECT 42 AS a FROM DUAL UNION ALL
SELECT 43 AS a FROM DUAL);
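For the sample rows above, standards_compliant_sum is 126 (the NULL row is ignored), while sum_with_nulls is NULL: NVL2 maps the NULL row into its own group, descending order puts that group first (NULLS FIRST is Oracle's default for DESC), and the sum over a group of NULLs is NULL.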

Related

Bigquery equivalent for pandas fillna(method='ffill') [duplicate]

I have a Big Query table that looks like this:
![Table](https://ibb.co/1ZXMH71)
As you can see most values are empty.
I'd like to forward-fill those empty values, meaning using the last known value ordered by time.
Apparently, there is a function for that called FILL
https://cloud.google.com/dataprep/docs/html/FILL-Function_57344752
But I have no idea how to use it.
This is the query I've tried in the web UI:
SELECT sns_6,Time
FROM TABLE_PATH
FILL sns_6,-1,0 order: Time
the error I get is:
Syntax error: Unexpected identifier "sns_6" at [3:6]
What I want is to get a new table where the column sns_6 is filled with the last known value.
As a bonus: I'd like this to happen for all columns, but because FILL only supports a single column, for now I'll have to iterate over all the columns. If anyone has an idea of how to do the iteration, that would be a great bonus.
Below is for BigQuery Standard SQL
I'd like to forward-fill those empty values, meaning using the last known value ordered by time
#standardSQL
SELECT time,
LAST_VALUE(sns_1 IGNORE NULLS) OVER(ORDER BY time) sns_1,
LAST_VALUE(sns_2 IGNORE NULLS) OVER(ORDER BY time) sns_2
FROM `project.dataset.table`
I'd like this to happen for all columns
You can add as many lines like the one below as there are columns you need to fill (obviously you need to replace sns_N with the real column's name):
LAST_VALUE(sns_N IGNORE NULLS) OVER(ORDER BY time) sns_N
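For the bonus question, a hedged sketch using BigQuery scripting: it reads the column list from INFORMATION_SCHEMA and generates one LAST_VALUE expression per column. The project.dataset.table names are the same placeholders as above; treat this as untested scaffolding, not a drop-in answer.

#standardSQL
DECLARE fill_exprs STRING;

-- Build "LAST_VALUE(col IGNORE NULLS) OVER(ORDER BY time) AS col" for each column
SET fill_exprs = (
  SELECT STRING_AGG(
           FORMAT('LAST_VALUE(%s IGNORE NULLS) OVER(ORDER BY time) AS %s',
                  column_name, column_name), ', ')
  FROM `project.dataset`.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'table'
    AND column_name != 'time'   -- keep the ordering column as-is
);

-- Run the generated forward-fill query
EXECUTE IMMEDIATE FORMAT('SELECT time, %s FROM `project.dataset.table`', fill_exprs);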
I'm not sure what your screen shot has to do with your query.
I think this will do what you want:
SELECT sns_6, Time,
LAST_VALUE(sns_6 IGNORE NULLS) OVER (ORDER BY Time) as imputed_sns_6
FROM TABLE_PATH;
EDIT:
This query works fine when I run it:
select table_path.*, last_value(sn_6 ignore nulls) over (order by time)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;

"bad double value" in Google BigQuery

I'm working in Google BigQuery (not using LegacySQL), and I'm currently trying to cast() a string as a float64. Each time I get the error "Bad double value". I've also tried safe_cast(), but it completely eliminates some of my ids (e.g. if one customer repeats 3 times for 3 different dates and has 'null' for a single "Height" entry, that customer is completely eliminated after I do safe_cast(), not just the row that had the 'null' value). I don't have any weird string values in my data, just whole or rational numbers or null entries.
Here's my current code:
select id, date,
cast(height as float64) as height,
cast(weight as float64) as weight
from (select id, date, max(height) as height, max(weight) as weight
from table
group by 1,2
)
group by 1, 2
Of course safe_cast() returns NULL values. That is because you have inappropriate values in the data.
You can find these by doing:
select height, weight
from table
where safe_cast(height as float64) is null or safe_cast(weight as float64) is null;
Once you understand what the values are, fix the values or adjust the logic of the query.
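Typical offenders are strings that merely look numeric. A quick probe (the sample values here are hypothetical):

SELECT SAFE_CAST('1.5'  AS FLOAT64) AS ok_value,      -- 1.5
       SAFE_CAST('null' AS FLOAT64) AS literal_text,  -- NULL: the text 'null', not a number
       SAFE_CAST(''     AS FLOAT64) AS empty_string;  -- NULL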
If you just want the max of the values that are properly numeric, then cast before the aggregation:
select id, date,
max(safe_cast(height as float64)) as height,
max(safe_cast(weight as float64)) as weight
from table
group by 1, 2;
A subquery doesn't seem necessary or desirable for your query.

Why are my SQL SUMs not coming back NULL when they include NULL values?

I use a CTE to calculate spans of time in a log as shown in this fiddle:
http://www.sqlfiddle.com/#!3/b99448/6
Note that one of the rows has a NULL value because that is the most recent log entry and no calculation could be made.
However, if I SUM these results the NULL is being treated as a zero:
http://www.sqlfiddle.com/#!3/b99448/4
How can I get this to stop ignoring NULL values?
I would expect the sum to be NULL since it is adding a NULL value.
The aggregation functions ignore NULL values. They are not treated as 0 -- the distinction is more important for AVG(), MIN(), and MAX(). So, SUM() only returns NULL when all values are NULL.
If you want to get NULL back, here is a simple expression:
select (case when count(*) = count(a.DateTimeChangedUtc) and
count(*) = count(b.DateTimeChangedUTC)
then SUM(DATEDIFF(SECOND, a.DateTimeChangedUtc, b.DateTimeChangedUTC))
end) AS TimeSpentSeconds
This returns NULL if either argument is ever NULL.
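A minimal check of the pattern, in the same T-SQL dialect as the fiddle (sample values made up):

-- COUNT(*) counts every row; COUNT(x) skips NULLs, so any NULL row
-- makes the two differ and the CASE falls through to NULL.
select case when count(*) = count(x) then sum(x) end as strict_sum
from (values (1), (2), (null)) as t(x);  -- NULL, because one x is NULL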

Oracle SQL Least Value Not Null

Take the following query...
select
trunc(dateX)-trunc(sysdate) daysTilX,
trunc(dateY)-trunc(sysdate) daysTilY,
least(trunc(dateX)-trunc(sysdate),trunc(dateY)-trunc(sysdate)) leastOfTheTwo
from myTable
If dateX or dateY is null then least() returns null. I need to figure out how to have the leastOfTheTwo column return null only if both dateX and dateY are null, otherwise, I want the number. Any ideas?
UPDATE: To be clear, I cannot use nvl with 0 on the dates because they represent due dates. Meaning -1 (one day late), 0 (due today), 1 (due tomorrow), null (no due date was ever set).
select
trunc(dateX)-trunc(sysdate) daysTilX,
trunc(dateY)-trunc(sysdate) daysTilY,
least(trunc(nvl(dateX, dateY))-trunc(sysdate),trunc(nvl(dateY, dateX))-trunc(sysdate)) leastOfTheTwo
from myTable
select
trunc(dateX)-trunc(sysdate) daysTilX,
trunc(dateY)-trunc(sysdate) daysTilY,
least(nvl(trunc(dateX)-trunc(sysdate),0),nvl(trunc(dateY)-trunc(sysdate),0)) leastOfTheTwo
from myTable
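As a sanity check of the nvl-swap approach in the first query, a sketch on dual covering the null combinations (sample rows fabricated): if only one date is NULL, the other is substituted, so LEAST returns NULL only when both are NULL.

select dateX, dateY,
       least(trunc(nvl(dateX, dateY)) - trunc(sysdate),
             trunc(nvl(dateY, dateX)) - trunc(sysdate)) as leastOfTheTwo
from (select sysdate + 1 as dateX, sysdate + 3 as dateY from dual  -- -> 1
      union all
      select null, sysdate + 3 from dual                           -- -> 3
      union all
      select null, null from dual);                                -- -> NULL (both null)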