BigQuery equivalent for pandas fillna(method='ffill') [duplicate] - google-bigquery

I have a Big Query table that looks like this:
![Table](https://ibb.co/1ZXMH71)
As you can see, most values are empty.
I'd like to forward-fill those empty values, meaning using the last known value ordered by time.
Apparently, there is a function for that called FILL
https://cloud.google.com/dataprep/docs/html/FILL-Function_57344752
But I have no idea how to use it.
This is the query I tried in the web UI:
SELECT sns_6,Time
FROM TABLE_PATH
FILL sns_6,-1,0 order: Time
The error I get is:
Syntax error: Unexpected identifier "sns_6" at [3:6]
What I want is to get a new table where the column sns_6 is filled with the last known value.
As a bonus: I'd like this to happen for all columns, but because FILL only supports a single column, for now I'll have to iterate over all the columns. If anyone has an idea of how to do that iteration, it would be a great bonus.

Below is for BigQuery Standard SQL
I'd like to forward-fill those empty values, meaning using the last known value ordered by time
#standardSQL
SELECT time,
LAST_VALUE(sns_1 IGNORE NULLS) OVER(ORDER BY time) sns_1,
LAST_VALUE(sns_2 IGNORE NULLS) OVER(ORDER BY time) sns_2
FROM `project.dataset.table`
I'd like this to happen for all columns
You can add as many of the lines below as there are columns to fill (obviously you need to replace sns_N with the real column name):
LAST_VALUE(sns_N IGNORE NULLS) OVER(ORDER BY time) sns_N
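For the bonus part, here is a hedged sketch of one way to avoid editing the query for every column, using BigQuery scripting to build the SELECT list from INFORMATION_SCHEMA (the sns_ prefix filter and the project.dataset.table names are assumptions; adjust them to your schema):
DECLARE cols STRING;
SET cols = (
  SELECT STRING_AGG(FORMAT('LAST_VALUE(%s IGNORE NULLS) OVER(ORDER BY time) AS %s', column_name, column_name), ', ')
  FROM `project.dataset`.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'table' AND column_name LIKE 'sns_%'
);
-- run the generated query; it contains one LAST_VALUE expression per matching column
EXECUTE IMMEDIATE FORMAT('SELECT time, %s FROM `project.dataset.table`', cols);
This way, adding a new sns_ column later does not require touching the SQL.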

I'm not sure what your screen shot has to do with your query.
I think this will do what you want:
SELECT sns_6, Time,
LAST_VALUE(sns_6 IGNORE NULLS) OVER (ORDER BY Time) as imputed_sns_6
FROM TABLE_PATH;
EDIT:
This query works fine when I run it:
select table_path.*, last_value(sn_6 ignore nulls) over (order by time)
from (select 1 as time, null as sn_6 union all
select 2, 1 union all
select 3, null union all
select 4, null union all
select 5, null union all
select 6, 0 union all
select 7, null union all
select 8, null
) table_path;
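With that sample data the filled column comes out as NULL, 1, 1, 1, 1, 0, 0, 0: each NULL is replaced by the last non-NULL value at or before its row, and the leading NULL stays because there is nothing earlier to carry forward.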

Related

PERCENTILE_DISC - Exclude Nulls

I'm having trouble with PERCENTILE_DISC: the query below returns the "median" as 310, but the actual median is 365. It returns 310 because it is including the NULL value.
Is there a way to have PERCENTILE_DISC exclude NULLs? Gordon Linoff mentioned in this post that NULLs are excluded from PERCENTILE_DISC, but this doesn't seem to be the case.
Here's the simple example showing the problem:
SELECT
DISTINCT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Numbers) OVER (PARTITION BY Category)
from
(
select 1 as Category,420 as Numbers
union all
select 1,425
union all
select 1,NULL
union all
select 1,310
union all
select 1,300
) t1
According to the documentation for PERCENTILE_DISC, the result is always equal to a specific column value.
PERCENTILE_DISC does ignore the NULLs: if you ask for PERCENTILE_DISC(0) you would get 300, not NULL.
To get the value you want (365, which is the average of 310 and 420), you need to use PERCENTILE_CONT. See the first example in the documentation mentioned above, which highlights the difference between PERCENTILE_CONT and PERCENTILE_DISC.
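As a quick sketch against the same sample data (only the function name changes), PERCENTILE_CONT interpolates between the two middle non-NULL values instead of picking an existing one:
SELECT
DISTINCT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Numbers) OVER (PARTITION BY Category)
from
(
select 1 as Category,420 as Numbers
union all
select 1,425
union all
select 1,NULL
union all
select 1,310
union all
select 1,300
) t1
-- returns 365, the midpoint of 310 and 420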

SQL Grabbing unique counts per category

I'm pretty new to SQL and Redshift, but I'm running into a weird problem.
My data looks like the below. Ignore the actual id and date_time values... I just put in random info, but it's the same format.
id  date_time (varchar(255))
1 2019-01-11T05:01:59
1 2019-01-11T05:01:59
2 2019-01-11T05:01:59
3 2019-01-11T05:01:59
1 2019-02-11T05:01:59
2 2019-02-11T05:01:59
I'm trying to get the count of unique IDs per month.
I've tried the command below. Given the amount of data, I just ran a demo on the first 10 rows of my table...
SELECT COUNT(DISTINCT id),
LEFT(date_time,7)
FROM ( SELECT top 10*
FROM myTable.ME )
GROUP BY LEFT(date_time, 7), id
I expect something like below.
count left
3 2019-01
2 2019-02
But I'm instead getting results similar to what's below.
I then tried the command below, which seems correct.
SELECT COUNT(DISTINCT id),
LEFT(date_time,7)
FROM ( SELECT top 1000000*
FROM myTable.ME )
GROUP BY LEFT(date_time, 7)
However, if you remove the DISTINCT portion, you get the results below. It seems like it is only looking at a certain month (2019-01), rather than other months.
If anyone can tell me what is wrong with the commands I'm using or can give me the correct command, I'll be very grateful. Thank you.
EDIT: Could it possibly be that my data isn't clean?
Why are you using a string for the date? That is simply wrong. There are built-in types. But assuming you have some reason or cannot change it, use string functions:
select left(date_time, 7) as yyyymm,
count(distinct id)
from t
group by yyyymm
order by yyyymm;
In your first query, you have id in the GROUP BY, which does not do what you want.
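If you can treat the column as a real timestamp, here's a minimal sketch of the same count with proper types (assuming every date_time string parses cleanly; date_trunc and the ::timestamp cast are standard Redshift):
select date_trunc('month', date_time::timestamp) as month,
       count(distinct id)
from myTable.ME
group by month
order by month;
-- date_trunc collapses every timestamp in a month to the month's first instant
This also keeps working if you later need day or week granularity; just change the first argument to date_trunc.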

Why SUM(null) is not 0 in Oracle?

I would appreciate an explanation of the internal behaviour of the SUM function in Oracle when it encounters null values:
The result of
select sum(null) from dual;
is null
But when a null value is part of a sequence of values (like the sum of a nullable column), the null is effectively treated as 0:
select sum(value) from
(
select case when mod(level , 2) = 0 then null else level end as value from dual
connect by level <= 10
)
is 25
This is more interesting when you see that the result of
select (1 + null) from dual
is null,
since any operation involving null results in null (except the IS NULL operator).
==========================
An update based on the comments:
create table odd_table as select sum(null) as some_name from dual;
will result in:
create table ODD_TABLE
(
some_name NUMBER
)
Why is the some_name column of type NUMBER?
If you are looking for a rationale for this behaviour, then it is to be found in the ANSI SQL standards which dictate that aggregate operators ignore NULL values.
If you wanted to override that behaviour then you're free to:
Sum(Coalesce(<expression>,0))
... although it would make more sense with Sum() to ...
Coalesce(Sum(<expression>),0)
You might more meaningfully use:
Avg(Coalesce(<expression>,0))
... or ...
Min(Coalesce(<expression>,0))
Other ANSI aggregation quirks:
Count() never returns null (or negative, of course)
Selecting only aggregation functions without a Group By will always return a single row, even if there is no data from which to select.
So ...
Coalesce(Count(<expression>),0)
... is a waste of a good coalesce.
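Both quirks are easy to check; here is a hedged sketch against DUAL, where the where 1 = 0 filter stands in for "no matching rows":
select coalesce(sum(value), 0) as safe_sum,
       count(value) as cnt
from (select cast(null as number) as value from dual where 1 = 0);
-- safe_sum = 0 (the inner SUM is null over zero rows), cnt = 0 (COUNT is never null)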
SQL does not treat NULL values as zeros when calculating SUM; it ignores them:
Returns the sum of all the values, or only the DISTINCT values, in the expression. Null values are ignored.
This makes a difference only in one case - when the sequence being totalled up does not contain numeric items, only NULLs: if at least one number is present, the result is going to be numeric.
You're looking at this the wrong way around. SUM() operates on a column, and ignores nulls.
To quote from the documentation:
This function takes as an argument any numeric data type or any nonnumeric data type that can be implicitly converted to a numeric data type. The function returns the same data type as the numeric data type of the argument.
A NULL has no data type, and so your first example must return null, as a NULL is not numeric.
Your second example sums the numeric values in the column. The sum of 0 + null + 1 + 2 is 3; the NULL simply means that a number does not exist here.
Your third example is not an operation on a column; remove the SUM() and the answer will be the same, as nothingness + 1 is still nothingness. You can't cast a NULL to an empty number as you can with a string, since there's no such thing as an empty number. It either exists or it doesn't.
Arithmetic aggregate functions ignore nulls.
SUM() ignores them
AVG() calculates the average as if the null rows didn't exist (nulls don't count in the total or the divisor)
As Bohemian has pointed out, both SUM and AVG exclude entries with NULL in them. Those entries do not go into the aggregate. If AVG treated NULL entries as zero, it would bias the result towards zero.
It may appear to the casual observer as though SUM is treating NULL entries as zero. It's really excluding them. If all the entries are excluded, the result is no value at all, which is NULL. Your example illustrates this.
This is incorrect: The sum of 0 + null + 1 + 2 is 3;
select 0 + null + 1 + 2 total from dual;
Result is null!
Similar expressions evaluate to null if any operand is null.
Here's a solution if you want to sum and NOT ignore nulls.
This solution splits the records into two groups: nulls and non-nulls. NVL2(a, 1, NULL) does this by changing all the non-nulls to 1 so they sort together identically. It then sorts those two groups to put the null group first (if there is one), then sums just the first of the two groups. If there are no nulls, there will be no null group, so that first group will contain all the rows. If, instead, there is at least one null, then that first group will only contain those nulls, and the sum of those nulls will be null.
SELECT SUM(a) AS standards_compliant_sum,
SUM(a) KEEP(DENSE_RANK FIRST ORDER BY NVL2(a, 1, NULL) DESC) AS sum_with_nulls
FROM (SELECT 41 AS a FROM DUAL UNION ALL
SELECT NULL AS a FROM DUAL UNION ALL
SELECT 42 AS a FROM DUAL UNION ALL
SELECT 43 AS a FROM DUAL);
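With this sample data, standards_compliant_sum comes back as 126 (41 + 42 + 43, with the NULL ignored), while sum_with_nulls comes back as NULL, because the NULL group sorts first under the descending NVL2 ordering and the sum of that group is null.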
You can optionally include NULLS FIRST to make it a little more clear about what's going on. If you're intentionally ordering for the sake of moving nulls around, I always recommend this for code clarity.
SELECT SUM(a) AS standards_compliant_sum,
SUM(a) KEEP(DENSE_RANK FIRST ORDER BY NVL2(a, 1, NULL) DESC NULLS FIRST) AS sum_with_nulls
FROM (SELECT 41 AS a FROM DUAL UNION ALL
SELECT NULL AS a FROM DUAL UNION ALL
SELECT 42 AS a FROM DUAL UNION ALL
SELECT 43 AS a FROM DUAL);

SQL Difference between using two dates

My database has a table called 'clientordermas'. Two of its columns are as follows:
execid and filledqty.
Three records of those two fields are as follows.
E02011/03/12-05:57_24384 : 1000
E02011/03/12-05:57_24384 : 800
E02011/03/09-05:57_24384 : 600
What I need to do is get the filledqty difference between the latest date and the date before it, which is 400 (1000 - 600).
I have extracted the date from the execid as follows:
SUBSTR(execid, 3, 10)
I tried hard but I was unable to write the SQL query to get 400. Can someone please help me do this?
P.S. I need to select the maximum filled quantity from the same date: that is 1000, not 800.
You can use window functions to access "nearby" rows, so if you first clean up the data in a subquery and then use window functions to access the next row, you should get the right results. But unless you have an index on substr(execid, 3, 10), this is going to be slow.
WITH datevalues AS
(
SELECT max(filledqty) maxfilledqty, substr(execid, 3, 10) execiddate
FROM clientordermas
GROUP BY substr(execid, 3, 10)
)
SELECT
execiddate,
maxfilledqty -
last_value(maxfilledqty) over(ORDER BY execiddate DESC ROWS BETWEEN 0 PRECEDING AND 1 FOLLOWING)
FROM datevalues
ORDER BY execiddate DESC;
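For what it's worth, the same difference can also be written with LEAD, which reads a little more naturally than a reversed LAST_VALUE window (a sketch, using the same cleanup subquery):
WITH datevalues AS
(
SELECT max(filledqty) maxfilledqty, substr(execid, 3, 10) execiddate
FROM clientordermas
GROUP BY substr(execid, 3, 10)
)
SELECT
execiddate,
-- LEAD looks one row ahead in descending date order, i.e. at the previous date
maxfilledqty - LEAD(maxfilledqty) OVER (ORDER BY execiddate DESC) AS diff
FROM datevalues
ORDER BY execiddate DESC;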
WITH maxqtys AS (
SELECT substr(a.execid,3,10) AS execdate, MAX(a.filledqty) AS maxqty -- "date" is a reserved word in Oracle, so the alias is execdate
FROM clientordermas a
GROUP BY substr(a.execid,3,10)
)
SELECT diff
FROM (SELECT a.maxqty - b.maxqty AS diff
      FROM maxqtys a, maxqtys b
      WHERE a.execdate <> b.execdate
      ORDER BY a.execdate DESC, b.execdate DESC)
WHERE ROWNUM = 1 -- ROWNUM is assigned before ORDER BY, so the sort must happen in the inner query
This first creates a subquery (maxqtys) which contains the max filledqty for each unique date, then cross joins this subquery to itself, excluding pairs with matching dates. This results in a table containing pairs of distinct dates.
Sorting these pairs by date descending and then taking the first row (ROWNUM is applied in the outer query, after the sort) leaves the latest and second-to-latest dates, with the appropriate quantities.