PERCENTILE_DISC - Exclude Nulls - sql

I'm having trouble with PERCENTILE_DISC where the below query is returning the "Median" as 310, but the actual Median is 365. It's returning 310 as it's including the NULL value.
Is there a way to have PERCENTILE_DISC exclude NULLs? Gordon Linoff mentioned in this post that NULLs are excluded from PERCENTILE_DISC, but this doesn't seem to be the case.
Here's the simple example showing the problem:
SELECT DISTINCT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Numbers) OVER (PARTITION BY Category)
FROM
(
    SELECT 1 AS Category, 420 AS Numbers
    UNION ALL SELECT 1, 425
    UNION ALL SELECT 1, NULL
    UNION ALL SELECT 1, 310
    UNION ALL SELECT 1, 300
) t1

According to the documentation for PERCENTILE_DISC, the result is always equal to a specific column value.
PERCENTILE_DISC does ignore NULLs: if you ask for PERCENTILE_DISC(0) you get 300, not NULL.
To get the value you want (365, which is the average of 310 and 420), you need to use PERCENTILE_CONT. See the first example in the documentation mentioned above, where it highlights the difference between PERCENTILE_CONT and PERCENTILE_DISC.
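As a minimal sketch against the same sample data, swapping in PERCENTILE_CONT returns 365, since it interpolates between 310 and 420 and also ignores the NULL row (the Median alias is only for illustration):
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Numbers) OVER (PARTITION BY Category) AS Median
FROM
(
    SELECT 1 AS Category, 420 AS Numbers
    UNION ALL SELECT 1, 425
    UNION ALL SELECT 1, NULL
    UNION ALL SELECT 1, 310
    UNION ALL SELECT 1, 300
) t1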

Related

Bigquery equivalent for pandas fillna(method='ffill') [duplicate]

I have a Big Query table that looks like this:
![Table](https://ibb.co/1ZXMH71)
As you can see most values are empty.
I'd like to forward-fill those empty values, meaning using the last known value ordered by time.
Apparently, there is a function for that called FILL
https://cloud.google.com/dataprep/docs/html/FILL-Function_57344752
But I have no idea how to use it.
This is the Query I've tried posting on the web UI:
SELECT sns_6,Time
FROM TABLE_PATH
FILL sns_6,-1,0 order: Time
the error I get is:
Syntax error: Unexpected identifier "sns_6" at [3:6]
What I want is to get a new table where the column sns_6 is filled with the last known value.
As a bonus: I'd like this to happen for all columns, but because FILL only supports a single column, for now I'll have to iterate over all the columns. If anyone has an idea of how to do the iteration, that would be a great bonus.
Below is for BigQuery Standard SQL
I'd like to forward-fill those empty values, meaning using the last known value ordered by time
#standardSQL
SELECT time,
LAST_VALUE(sns_1 IGNORE NULLS) OVER(ORDER BY time) sns_1,
LAST_VALUE(sns_2 IGNORE NULLS) OVER(ORDER BY time) sns_2
FROM `project.dataset.table`
I'd like this to happen for all columns
You can add as many of the lines below as you have columns to fill (obviously, replace sns_N with the real column names):
LAST_VALUE(sns_N IGNORE NULLS) OVER(ORDER BY time) sns_N
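For example, a complete query covering the columns mentioned so far might look like the sketch below (sns_1, sns_2 and sns_6 stand in for whatever your table actually contains):
#standardSQL
SELECT time,
  LAST_VALUE(sns_1 IGNORE NULLS) OVER(ORDER BY time) sns_1,
  LAST_VALUE(sns_2 IGNORE NULLS) OVER(ORDER BY time) sns_2,
  LAST_VALUE(sns_6 IGNORE NULLS) OVER(ORDER BY time) sns_6
FROM `project.dataset.table`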
I'm not sure what your screen shot has to do with your query.
I think this will do what you want:
SELECT sns_6, Time,
       LAST_VALUE(sns_6 IGNORE NULLS) OVER (ORDER BY Time) as imputed_sns_6
FROM TABLE_PATH;
EDIT:
This query works fine when I run it:
select table_path.*, last_value(sn_6 ignore nulls) over (order by time)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;

get the latest records

I am currently still on my SQL educational journey and need some help!
The query I have is as below:
SELECT
Audit_Non_Conformance_Records.kf_ID_Client_Reference_Number,
Audit_Non_Conformance_Records.TimeStamp_Creation,
Audit_Non_Conformance_Records.Clause,
Audit_Non_Conformance_Records.NC_type,
Audit_Non_Conformance_Records.NC_Rect_Received,
Audit_Non_Conformance_Records.Audit_Num
FROM Audit_Non_Conformance_Records
I am trying to tweak this to show only the most recent results based on Audit_Non_Conformance_Records.TimeStamp_Creation
I have tried using MAX() but all this does is show the latest date for all records.
Basically, the results of the above give me this:
But I only need the result with the date 02/10/2019 as this is the latest result. There may be multiple results, however. So, for example, if 02/10/2019 had never happened, I would need all of the individual records from the 14/10/2019 ones.
Does that make any sense at all?
You can filter with a subquery:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where TimeStamp_Creation = (
select max(TimeStamp_Creation)
from Audit_Non_Conformance_Records
)
This will give you all records whose TimeStamp_Creation is equal to the greatest value available in the table.
If you want all records that have the greatest day (excluding the time), then you can do:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where cast(TimeStamp_Creation as date) = (
select cast(max(TimeStamp_Creation) as date)
from Audit_Non_Conformance_Records
)
Edit
If you want the latest record per refNumber, then you can correlate the subquery, like so:
SELECT
kf_ID_Client_Reference_Number,
TimeStamp_Creation,
Clause,
NC_type,
NC_Rect_Received,
Audit_Num
FROM Audit_Non_Conformance_Records a
where TimeStamp_Creation = (
select max(TimeStamp_Creation)
from Audit_Non_Conformance_Records a1
where a1.refNumber = a.refNumber
)
For performance, you want an index on (refNumber, TimeStamp_Creation).
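For instance (the index name is arbitrary, and refNumber stands in for whatever the reference-number column is really called, e.g. kf_ID_Client_Reference_Number):
CREATE INDEX ix_ancr_ref_ts
    ON Audit_Non_Conformance_Records (refNumber, TimeStamp_Creation);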
If you want the latest date in SQL Server, you can express this as:
SELECT TOP (1) WITH TIES ancr.kf_ID_Client_Reference_Number,
ancr.TimeStamp_Creation,
ancr.Clause,
ancr.NC_type,
ancr.NC_Rect_Received,
ancr.Audit_Num
FROM Audit_Non_Conformance_Records ancr
ORDER BY CONVERT(date, ancr.TimeStamp_Creation) DESC;
SQL Server is pretty good about handling dates with conversions, so I would not be surprised if this used an index on TimeStamp_Creation.


Subqueries: What am I doing fundamentally wrong?

I thought that selecting values from a subquery in SQL would only yield values from that subset, until I found a very nasty bug in my code. Here is an example of my problem.
I'm selecting the rows that contain the latest(max) function by date. This correctly returns 4 rows with the latest check in of each function.
select *, max(date) from cm where file_id == 5933 group by function_id;
file_id  function_id  date        value  max(date)
5933     64807        1407941297  1      1407941297
5933     64808        1407941297  11     1407941297
5933     895175       1306072348         1306072348
5933     895178       1363182349         1363182349
When selecting only the value from the subset above, it returns function values from previous dates, i.e. rows that don't belong in the subset above. You can see the result below where the dates are older than in the first subset.
select temp.function_id, temp.date, temp.value
from (select *, max(date)
      from cm
      where file_id == 5933
      group by function_id) as temp;
function_id  date        value
64807        1306072348  1      <- outdated row, not in first subset
64808        1306072348  17     <- outdated row, not in first subset
895175       1306072348
895178       1363182349
What am I doing fundamentally wrong? Shouldn't selects performed on subqueries only return possible results from those subqueries?
SQLite allows you to use MAX() to select the row to be returned by a GROUP BY, but this works only if the MAX() is actually computed.
When you throw the max(date) column away, this no longer works.
In this case, you actually want to use the date value, so you can just keep the MAX():
SELECT function_id,
max(date) AS date,
value
FROM cm
WHERE file_id = 5933
GROUP BY function_id
You seem to be missing the fact that your subquery is returning ALL rows for the given file_id. If you want to restrict your subquery to recs with the most recent date, then you need to restrict it with a WHERE NOT EXISTS clause to check that no more recent records exist for the given condition.
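A sketch of that approach against the cm table from the question (the aliases are just for illustration):
select c.function_id, c.date, c.value
from cm c
where c.file_id = 5933
  and not exists (select 1
                  from cm newer
                  where newer.file_id = c.file_id
                    and newer.function_id = c.function_id
                    and newer.date > c.date);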
Perhaps my question was not formulated correctly, but this post had the solutions I was essentially looking for:
https://stackoverflow.com/a/123481/2966951
https://stackoverflow.com/a/121435/2966951
Filtering out the most recent row was my problem. I was surprised that selecting from a subquery with a max value could yield anything other than that value.
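One common way to express this (a sketch, not necessarily exactly what those answers do) is to join back to the per-group MAX(date) on the cm table from the question:
select cm.function_id, cm.date, cm.value
from cm
join (select function_id, max(date) as max_date
      from cm
      where file_id = 5933
      group by function_id) latest
  on latest.function_id = cm.function_id
 and latest.max_date = cm.date
where cm.file_id = 5933;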

Why SUM(null) is not 0 in Oracle?

I would appreciate an explanation of the internal behaviour of the SUM function in Oracle when it encounters null values.
The result of
select sum(null) from dual;
is null
But when a null value appears in a sequence of values (like the sum of a nullable column), the null value appears to be treated as 0:
select sum(value) from
(
select case when mod(level , 2) = 0 then null else level end as value from dual
connect by level <= 10
)
is 25
This is even more interesting when you see that the result of
select (1 + null) from dual
is null,
as any operation involving null results in null (except the IS NULL operator).
==========================
Some update due to comments:
create table odd_table as select sum(null) as some_name from dual;
Will result:
create table ODD_TABLE
(
some_name NUMBER
)
Why is the some_name column of type NUMBER?
If you are looking for a rationale for this behaviour, then it is to be found in the ANSI SQL standards which dictate that aggregate operators ignore NULL values.
If you wanted to override that behaviour then you're free to:
Sum(Coalesce(<expression>,0))
... although it would make more sense with Sum() to ...
Coalesce(Sum(<expression>),0)
You might more meaningfully use:
Avg(Coalesce(<expression>,0))
... or ...
Min(Coalesce(<expression>,0))
Other ANSI aggregation quirks:
Count() never returns null (or negative, of course)
Selecting only aggregation functions without a Group By will always return a single row, even if there is no data from which to select.
So ...
Coalesce(Count(<expression>),0)
... is a waste of a good coalesce.
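A quick illustration of those two quirks, assuming a query that matches no rows at all (the inline aliases are only for demonstration):
select count(x)            as cnt,       -- 0: COUNT never returns null
       sum(x)              as raw_sum,   -- null: no values to total
       coalesce(sum(x), 0) as safe_sum   -- 0
from (select 1 as x from dual where 1 = 0);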
SQL does not treat NULL values as zeros when calculating SUM; it ignores them:
Returns the sum of all the values, or only the DISTINCT values, in the expression. Null values are ignored.
This makes a difference only in one case - when the sequence being totalled up does not contain numeric items, only NULLs: if at least one number is present, the result is going to be numeric.
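For example (using DUAL, as in the question, with a cast only to give the all-null column a type), summing only nulls gives null, while a single number makes the result numeric:
select sum(a) as all_null_sum        -- null: only nulls to total
from (select cast(null as number) as a from dual
      union all
      select null from dual);

select sum(a) as mixed_sum           -- 7: one number is enough to get a number back
from (select cast(null as number) as a from dual
      union all
      select 7 from dual);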
You're looking at this the wrong way around. SUM() operates on a column, and ignores nulls.
To quote from the documentation:
This function takes as an argument any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
A NULL has no data-type, and so your first example must return null; as a NULL is not numeric.
Your second example sums the numeric values in the column. The sum of 0 + null + 1 + 2 is 3; the NULL simply means that a number does not exist here.
Your third example is not an operation on a column; remove the SUM() and the answer will be the same as nothingness + 1 is still nothingness. You can't cast a NULL to an empty number as you can with a string as there's no such thing as an empty number. It either exists or it doesn't.
Arithmetic aggregate functions ignore nulls.
SUM() ignores them
AVG() calculates the average as if the null rows didn't exist (nulls don't count in the total or the divisor)
As Bohemian has pointed out, both SUM and AVG exclude entries with NULL in them. Those entries do not go into the aggregate. If AVG treated NULL entries as zero, it would bias the result towards zero.
It may appear to the casual observer as though SUM is treating NULL entries as zero. It's really excluding them. If all the entries are excluded, the result is no value at all, which is NULL. Your example illustrates this.
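A small example of that difference, assuming a three-row set with one null:
select sum(v)              as total,       -- 6: the null is ignored
       avg(v)              as average,     -- 3: 6 divided by 2 non-null rows, not 3
       avg(coalesce(v, 0)) as biased_avg   -- 2: what you'd get if null counted as zero
from (select 2 as v from dual union all
      select null from dual union all
      select 4 from dual);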
This is incorrect: "The sum of 0 + null + 1 + 2 is 3."
select 0 + null + 1 + 2 total from dual;
Result is null!
Similar statements return null if any operand is null.
Here's a solution if you want to sum and NOT ignore nulls.
This solution splits the records into two groups: nulls and non-nulls. NVL2(a, 1, NULL) does this by changing all the non-nulls to 1 so they sort together identically. It then sorts those two groups to put the null group first (if there is one), then sums just the first of the two groups. If there are no nulls, there will be no null group, so that first group will contain all the rows. If, instead, there is at least one null, then that first group will only contain those nulls, and the sum of those nulls will be null.
SELECT SUM(a) AS standards_compliant_sum,
SUM(a) KEEP(DENSE_RANK FIRST ORDER BY NVL2(a, 1, NULL) DESC) AS sum_with_nulls
FROM (SELECT 41 AS a FROM DUAL UNION ALL
SELECT NULL AS a FROM DUAL UNION ALL
SELECT 42 AS a FROM DUAL UNION ALL
SELECT 43 AS a FROM DUAL);
You can optionally include NULLS FIRST to make it a little more clear about what's going on. If you're intentionally ordering for the sake of moving nulls around, I always recommend this for code clarity.
SELECT SUM(a) AS standards_compliant_sum,
SUM(a) KEEP(DENSE_RANK FIRST ORDER BY NVL2(a, 1, NULL) DESC NULLS FIRST) AS sum_with_nulls
FROM (SELECT 41 AS a FROM DUAL UNION ALL
SELECT NULL AS a FROM DUAL UNION ALL
SELECT 42 AS a FROM DUAL UNION ALL
SELECT 43 AS a FROM DUAL);