Impala SQL: convert NaN values to NULL values

I have the following column in my table:
col 1
1
3
NULL
NaN
5
"Bad" aggregations return NaN instead of NULL and the variable is type DOUBLE in the end.
I want to have one type of missing values only, hence I need to convert NULL to NaN or the other way around.
My problem is that when I partition with a window function it does not recognize NaNs as equal to
NULLS and creates separate subgroups, which is something I do not want.
Any suggestions on how to convert them?
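One possible approach, sketched below, is to normalize NaN to NULL in a derived column before the window function runs; this assumes Impala's is_nan() math function, and my_table/col1 are placeholder names:
-- Map NaN to NULL first; NULL inputs pass through unchanged,
-- since is_nan(NULL) is NULL and the CASE falls through to ELSE.
SELECT CASE WHEN is_nan(col1) THEN NULL ELSE col1 END AS col1_clean
FROM my_table;
Partitioning on col1_clean instead of col1 should then put the NaN and NULL rows into the same subgroup.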

Related

Possible to represent NaN in scientific notation?

Is there a way to represent NaN in SQL with some sort of 'special value'? Or is it always necessary to cast from a string to represent it, for example:
SELECT CAST('NaN' AS FLOAT)
-- NaN
Is there any such value-construction I can use directly, such as:
SELECT 1.2345e-6789
-- NaN
You should be able to use 0.0/0.0: IEEE 754 requires the division of zero by zero to result in NaN.
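For example, a hedged sketch (engine-dependent: databases that apply IEEE 754 semantics to floating-point division return NaN here, while others raise a division-by-zero error or return NULL instead):
SELECT CAST(0.0 AS FLOAT) / CAST(0.0 AS FLOAT) AS nan_value;
-- NaN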

Filter values in a dataframe column based on null values in a different column

I've been stuck on this for a bit so hopefully someone has better guidance.
I currently have a dataframe that looks something like this (only way more rows):
|"released_date"| "status" |
+-------------+--------+
| 12/12/20 |released|
+-------------+--------+
| 10/01/20 | NaN |
+-------------+--------+
| NaN | NaN |
+-------------+--------+
| NaN. |released|
+-------------+--------+
I wanted to do df['status'].fillna('released' if df.released_date.notnull())
aka, fill any NaN value in the status column of df with "released" as long as df.released_date isn't a null value.
I keep getting various error messages when I try different variations of this. The first, for the code above, is a syntax error, which I imagine is because notnull() returns a boolean array?
I feel like there is a simple answer for this and I somehow am not seeing it. I haven't found any questions like this where someone organizes a column based on the null values in a different one, which leads me to wonder if my methodology isn't ideal in the first place. How can I filter values in a dataframe column based on null values in a different column, without using isnull() or notnull(), if those only return boolean arrays anyway? Using == Null doesn't seem to work either...
Try:
idx = df[(df['status'].isnull()) & (~df['released_date'].isnull())].index
df.loc[idx,'status'] = 'released'
First get the index of all rows where 'status' is null and 'released_date' is not null. Then use df.loc to update the status column.
Prints:
released_date status
0 12/12/20 released
1 10/01/20 released
2 NaN NaN
3 NaN released
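An equivalent variant, if you prefer a boolean mask over an intermediate index (same logic, standard pandas calls):
mask = df['status'].isnull() & df['released_date'].notnull()
df.loc[mask, 'status'] = 'released'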

How to create new columns using groupby based on logical expressions

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns, 'MT_Value', 'M_Value', and 'T_Data'. The first should hold the mean of the data grouped by year and month, which I accomplished by doing this:
data.groupby(['Year','Month']).mean()
But for M_Value I need the mean of only the values different from zero, and for T_Data I need the count of the zero values divided by the total number of values. I guess for the last one I have to divide the number of zeros by the size of each group, but honestly I am a bit lost. I looked on Google and found something about transform, but I didn't understand it very well.
Thank you.
You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor != 0),
             T_Data=data.Valor.eq(0))
     .groupby(['Year', 'Month'])
     [['Valor', 'M_Value', 'T_Data']]
     .mean()
)
Explanation: assign creates new columns with the respective names.
data.Valor.where(data.Valor != 0) replaces 0 values with NaN, which are then ignored when we call mean().
data.Valor.eq(0) yields True (1) for zeros and False (0) for everything else, so taking mean() computes count(Valor == 0) / total_count.
Output:
Valor M_Value T_Data
Year Month
1970 1 2.306452 6.500000 0.645161
2 1.507143 4.688889 0.678571
3 2.064516 7.111111 0.709677
4 11.816667 13.634615 0.133333
5 7.974194 11.236364 0.290323
... ... ... ...
1997 10 3.745161 7.740000 0.516129
11 11.626667 21.800000 0.466667
12 0.564516 4.375000 0.870968
1998 1 2.000000 15.500000 0.870968
2 1.545455 5.666667 0.727273
[331 rows x 3 columns]
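A tiny self-contained check of the trick on made-up numbers (hypothetical data, not the linked CSV):
import pandas as pd

# Four rows in a single (Year, Month) group; two of the values are zero.
data = pd.DataFrame({'Year': [1970] * 4, 'Month': [1] * 4,
                     'Valor': [0.0, 4.0, 0.0, 8.0]})

out = (data.assign(M_Value=data.Valor.where(data.Valor != 0),
                   T_Data=data.Valor.eq(0))
           .groupby(['Year', 'Month'])[['Valor', 'M_Value', 'T_Data']]
           .mean())
print(out)
# Valor   = 3.0  (mean of all four values)
# M_Value = 6.0  (mean of the non-zero values 4 and 8)
# T_Data  = 0.5  (two zeros out of four values)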

Pandas duplicated() returns some non-duplicate values?

I am trying to remove duplicates from a dataset.
Before using df.drop_duplicates(), I run df[df.duplicated()] to check which values are treated as duplicates. Values that I don't consider to be duplicates are returned; see the example below. All columns are checked.
How to get accurate duplicate results and drop real duplicates?
city price year manufacturer cylinders fuel odometer
whistler 26880 2016.0 chrysler NaN gas 49000.0
whistler 17990 2010.0 toyota NaN hybrid 117000.0
whistler 15890 2010.0 audi NaN gas 188000.0
whistler 8800 2007.0 nissan NaN gas 163000.0
Encountered the same problem.
At first, it looks like
df.duplicated(subset='my_column_of_interest')
returns results which actually have unique values in the my_column_of_interest field.
This is not the case, though. The documentation shows that duplicated uses the keep parameter to decide which occurrences to mark as duplicates: all of them, all but the first, or all but the last. Its default value is 'first'.
This means that if a value is present twice in this column, running
df.duplicated(subset='my_column_of_interest') will return results that contain this value only once (since the first occurrence is not flagged as a duplicate).
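To inspect every member of each duplicate group, rather than only the occurrences after the first, pass keep=False:
# keep=False marks all occurrences of a duplicated value as True,
# so the complete duplicate groups are returned for inspection.
df[df.duplicated(subset='my_column_of_interest', keep=False)]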

When/why does Oracle add NaN to a row in a database table

I know that NaN stands for Not a Number. But I have trouble understanding when and why Oracle adds it to a row.
Is it when it encounters a value less than 0, like a negative number, or when it's a garbage value?
From the documentation:
The Oracle Database numeric data types store positive and negative fixed and floating-point numbers, zero, infinity, and values that are the undefined result of an operation—"not a number" or NAN.
As far as I'm aware you can only get NaN in a binary_float or binary_double column; those data types have their own literals for NaN, there's an IS NAN condition for them, and the nanvl() function to manipulate them.
An example of a way to get such a value is to divide a zero float/double value by zero:
select 0f/0 from dual;
0F/0
----
NaN
... so if you're seeing NaNs your application logic or underlying data might be broken. (Note you can't get this with a 'normal' number type; you get ORA-01476: divisor is equal to zero unless the numerator is float or double).
You won't get NaN for zero or negative numbers though. It's also possible you have a string column and an application is putting the word 'NaN' in, but storing numbers as strings is a bad idea on many levels, so hopefully that is not the case.
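As a quick illustration of the NaN literal and the nanvl() function mentioned above (a sketch against DUAL; the second argument is whatever fallback you want NaN mapped to):
select binary_double_nan from dual;  -- Oracle's NaN literal
select nanvl(0f/0, 0) from dual;     -- the NaN result is replaced by the fallback 0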
Nope, <= 0 is still a number, so not quite. NaN (and infinity) are special values that the DB uses to keep its sanity when dealing with non-computable numbers (±∞, or simply something that is not a number).
Here's some code:
DECLARE
  l_bd_test  binary_double;
  l_int_test INTEGER;
BEGIN
  -- The string 'NAN' converts to the binary_double NaN value.
  l_bd_test  := 'NAN';
  l_int_test := 0;

  -- IS NAN is only ever true for binary_float/binary_double values.
  IF l_bd_test IS NAN THEN
    DBMS_OUTPUT.PUT_LINE(l_bd_test || ' IS NAN');
  ELSE
    DBMS_OUTPUT.PUT_LINE(l_bd_test || ' IS A #');
  END IF;

  -- A plain INTEGER can never hold NaN, so this branch prints 'IS A #'.
  IF l_int_test IS NAN THEN
    DBMS_OUTPUT.PUT_LINE(l_int_test || ' IS NAN');
  ELSE
    DBMS_OUTPUT.PUT_LINE(l_int_test || ' IS A #');
  END IF;
END;
/
Substitute NAN for INFINITY or even negate it and see the results.