How to create a calculated column in Google BigQuery? - sql

I have data in Google BigQuery like this:
id yearmonth value
00007BR0011 201705 8.0
00007BR0011 201701 3.0
and I need to create a table that shows, per id, the difference between the values, something like this:
id value
00007BR0011 5
The value 5 is the value in 201705 minus the value in 201701.
I am using standard SQL, but I don't know how to create the column with the calculation.
Sorry in advance if this is too basic, but I haven't found anything useful yet.

Perhaps a single table/result set would work for your purposes:
select id,
(max(case when yearmonth = 201705 then value end) -
max(case when yearmonth = 201701 then value end)
) as value
from t
where yearmonth in (201705, 201701)
group by id;
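As a quick local check of the conditional-aggregation idea, here is a minimal sketch that runs the same query shape against the sample rows from the question. It uses Python's built-in sqlite3 module rather than BigQuery, so the table name t and the column types are assumptions made only for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, yearmonth INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("00007BR0011", 201705, 8.0), ("00007BR0011", 201701, 3.0)],
)

# Each CASE picks out the value for one month; MAX collapses it per id.
rows = conn.execute(
    """
    SELECT id,
           MAX(CASE WHEN yearmonth = 201705 THEN value END)
         - MAX(CASE WHEN yearmonth = 201701 THEN value END) AS value
    FROM t
    WHERE yearmonth IN (201705, 201701)
    GROUP BY id
    """
).fetchall()

print(rows)  # [('00007BR0011', 5.0)]
```

If one of the two months is missing for an id, the corresponding CASE yields NULL and so does the difference.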

It's difficult to answer this based on the current level of detail, but if the smaller value is always subtracted from the larger (and both are never null), you could handle it this way using GROUP BY:
SELECT
id,
MAX(value) - MIN(value) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id
From here, you could save these results as a new table, or save this query as a view definition (which would be similar to having it calculated on the fly if the underlying data is changing).
Another option is to add a column under the table schema, then run an UPDATE query to populate it.
If the smaller value isn't always subtracted from the larger, but rather the date order is what matters (and there are always two rows per id), another way is to use analytic (window) functions to subtract the value at the earliest yearmonth from the value at the latest:
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
LAST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) AS new_value
FROM
`your-project.your_dataset.your_table`
Because analytic functions operate on the source rows, DISTINCT is needed to eliminate the duplicate rows.
If there could be more than two rows and you need all the prior values subtracted from the latest value, you could handle it this way (which would also be safe against NULLs or only having one row):
SELECT
DISTINCT
id,
(
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
(
SUM(value) OVER(PARTITION BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-
FIRST_VALUE(value) OVER(PARTITION BY id ORDER BY yearmonth DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
You could technically do the same thing with grouping and ARRAY_AGG with dereferencing, although this method will be significantly slower on larger data sets:
SELECT
id,
(
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
-
(
SUM(value)
-
ARRAY_AGG(value ORDER BY yearmonth DESC)[OFFSET(0)]
)
) AS new_value
FROM
`your-project.your_dataset.your_table`
GROUP BY
id

Related

Min and max value per group keeping order

I have a small problem in Redshift with grouping. I have a table like the following:
INPUT
VALUE CREATED UPDATED
------------------------------------
1 '2020-09-10' '2020-09-11'
1 '2020-09-11' '2020-09-13'
2 '2020-09-15' '2020-09-16'
1 '2020-09-17' '2020-09-18'
I want to obtain this output:
VALUE CREATED UPDATED
------------------------------------
1 '2020-09-10' '2020-09-13'
2 '2020-09-15' '2020-09-16'
1 '2020-09-17' '2020-09-18'
If I do a simple Min and Max date grouping by the value, it doesn't work.
This is an example of a gap-and-islands problem. If there are no time gaps in the data, then a difference of row numbers is a simple solution:
select value, min(created), max(updated)
from (select t.*,
row_number() over (order by created) as seqnum,
row_number() over (partition by value order by created) as seqnum_2
from t
) t
group by value, (seqnum - seqnum_2)
order by min(created);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you will see how the difference between the row numbers identifies adjacent rows with the same value.
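To make the row-number difference visible, the sketch below loads the four rows from the question into an in-memory SQLite database (used here only because it runs locally; the question itself is about Redshift) and prints both the subquery result and the final islands. The extra column diff is added purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER, created TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-09-10", "2020-09-11"),
        (1, "2020-09-11", "2020-09-13"),
        (2, "2020-09-15", "2020-09-16"),
        (1, "2020-09-17", "2020-09-18"),
    ],
)

# Subquery only: seqnum counts all rows, seqnum_2 counts rows per value.
# Their difference stays constant within a run of adjacent equal values.
inner = conn.execute(
    """
    SELECT value, created, seqnum, seqnum_2, seqnum - seqnum_2 AS diff
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (ORDER BY created) AS seqnum,
                 ROW_NUMBER() OVER (PARTITION BY value ORDER BY created) AS seqnum_2
          FROM t) AS s
    ORDER BY created
    """
).fetchall()
for row in inner:
    print(row)
# (1, '2020-09-10', 1, 1, 0)
# (1, '2020-09-11', 2, 2, 0)
# (2, '2020-09-15', 3, 1, 2)
# (1, '2020-09-17', 4, 3, 1)

# Grouping by (value, diff) yields one row per island.
islands = conn.execute(
    """
    SELECT value, MIN(created), MAX(updated)
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (ORDER BY created) AS seqnum,
                 ROW_NUMBER() OVER (PARTITION BY value ORDER BY created) AS seqnum_2
          FROM t) AS s
    GROUP BY value, (seqnum - seqnum_2)
    ORDER BY MIN(created)
    """
).fetchall()
print(islands)
```

The first two rows share value 1 and diff 0, so they merge; the last row also has value 1 but diff 1, so it forms its own island.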

Can Db2 LAG function refer to itself?

I'm trying to add a GROUP ID column by replicating this Excel formula:
IF(OR(A2<>A1,AND(B2<>"000",B1="000")),D1+1,D1)
This formula is entered in cell D2, so it refers to the previous row's value of the very column being computed in order to generate the current value.
I'd like to do this with Db2 SQL, but I'm not sure how, because I would need to apply the LAG function to the column I'm adding and refer to its own values.
Kindly advise if there is a better way to do this.
Thanks.
You need nested OLAP-functions, assuming ORDER BY SERIAL_NUMBER, EVENT_TIMESTAMP returns the order shown in Excel:
with cte as
(
select ...
case --IF(OR(A2<>A1,AND(B2<>"000",B1="000"))
when (lag(OPERATION)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) = '000'
and OPERATION <> '000')
or lag(SERIAL_NUMBER,1,'')
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP) <> SERIAL_NUMBER
then 1
else 0
end as flag -- start of new group
from tab
)
select ...
sum(flag)
over (order by SERIAL_NUMBER, EVENT_TIMESTAMP
rows unbounded preceding) as GROUP_ID
from cte
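The flag-then-running-sum pattern above can be tried out locally. The sketch below is only an illustration: it uses SQLite through Python's sqlite3 module instead of Db2, the rows are invented sample data (the original spreadsheet is not shown), and integer timestamps are an assumption made to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tab (serial_number TEXT, operation TEXT, event_timestamp INTEGER)"
)
# Hypothetical sample rows, not taken from the question.
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?)",
    [
        ("A", "000", 1),
        ("A", "010", 2),
        ("A", "020", 3),
        ("A", "000", 4),
        ("A", "010", 5),
        ("B", "000", 6),
        ("B", "010", 7),
    ],
)

rows = conn.execute(
    """
    WITH cte AS (
        SELECT serial_number, operation, event_timestamp,
               CASE
                   -- previous operation was '000' and the current one is not
                   WHEN (LAG(operation) OVER (ORDER BY serial_number, event_timestamp) = '000'
                         AND operation <> '000')
                   -- or the serial number changed (default '' marks the first row)
                     OR LAG(serial_number, 1, '') OVER (ORDER BY serial_number, event_timestamp)
                        <> serial_number
                   THEN 1 ELSE 0
               END AS flag
        FROM tab
    )
    SELECT serial_number, operation,
           SUM(flag) OVER (ORDER BY serial_number, event_timestamp
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_id
    FROM cte
    ORDER BY serial_number, event_timestamp
    """
).fetchall()

group_ids = [r[2] for r in rows]
print(group_ids)  # [1, 2, 2, 2, 3, 4, 5]
```

The self-reference in the Excel formula (D1+1 or D1) disappears because the running sum of the flag plays the role of the previous row's group ID.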
Your code is counting the number of "breaks" in your data, where a "break" is defined as 000 or the value in the first column changing.
In SQL, you can do this as a cumulative sum:
select t.*,
sum(case when prev_serial_number = serial_number or operation <> '000'
then 0 else 1
end) over (order by event_timestamp rows between unbounded preceding and current row) as column_d
from (select t.*,
lag(serial_number) over (order by event_timestamp) as prev_serial_number
from t
) t

SQL query on redshift to get the first and the last value

I have a data set like this.
I need to write a query which gives me the below output
For every SessionID and VisitID, it should sort by the date_time column and return the first Category and the last Category.
I have used the following code
rank() OVER( PARTITION BY SessionID
, VisitID
ORDER by
date_Time DESC ) as click_rank_last
where click_rank_last = 1
to get the last Category. But what I need is to get the first and the last in a single query with minimal load on the database, as the data is huge and querying is costly.
Need the most optimum query!
One way would be:
select distinct
sessionid,
visitid,
first_value(category) over (
partition by sessionid, visitid
order by date_time
rows between unbounded preceding and unbounded following),
last_value(category) over (
partition by sessionid, visitid
order by date_time
rows between unbounded preceding and unbounded following)
from tbl
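For a quick sanity check of this query shape, the sketch below runs it in SQLite via Python's sqlite3 module (not Redshift). The rows are invented sample data, since the original data set is not shown, and the aliases first_category and last_category are added only for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (sessionid TEXT, visitid TEXT, date_time TEXT, category TEXT)"
)
# Hypothetical clickstream rows, not taken from the question.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        ("s1", "v1", "2020-01-01 10:00:00", "home"),
        ("s1", "v1", "2020-01-01 10:05:00", "search"),
        ("s1", "v1", "2020-01-01 10:10:00", "cart"),
        ("s1", "v2", "2020-01-01 11:00:00", "home"),
        ("s1", "v2", "2020-01-01 11:30:00", "checkout"),
    ],
)

# The full-partition frame makes LAST_VALUE see the whole visit,
# not just the rows up to the current one.
rows = conn.execute(
    """
    SELECT DISTINCT
           sessionid,
           visitid,
           FIRST_VALUE(category) OVER (
               PARTITION BY sessionid, visitid
               ORDER BY date_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_category,
           LAST_VALUE(category) OVER (
               PARTITION BY sessionid, visitid
               ORDER BY date_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_category
    FROM tbl
    ORDER BY sessionid, visitid
    """
).fetchall()

print(rows)  # [('s1', 'v1', 'home', 'cart'), ('s1', 'v2', 'home', 'checkout')]
```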

SQL query for backfilling register read values

I have a table with ID, timestamp, and register reads for a day. The register reads are running totals that start at 12:00 midnight and end at 11:00 at night.
The problem is that the cumulative reads may be missing for some random time intervals, and I need to backfill those.
The picture below gives a snapshot of the problem. KWH_RDNG is the difference between two cumulative reads divided by 1000, but the value 5.851 in the 4th column is actually the accumulation of 3 missing hours along with the 4th hour's value. It's fine if I simply divide 5.851 by 4 and distribute it.
The challenge is that the gaps can happen at random intervals and can differ between meters (1st column). I am using SQL Server 2016.
Please help!
This is a gaps and islands problem -- sort of. You need to group each run of NULL values together with the non-NULL value that follows it. One method is to use a cumulative count of the non-NULL values on or after each row. This defines the groups.
Then, you need the count and the reading. So, this should do the calculation:
select t.*,
(max_kwh_rding / cnt) as new_kwh_rding
from (select t.*, count(*) over (partition by meter_serial, grp) as cnt,
max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
from (select t.*,
count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
from t
) t
) t
where cnt > 1;
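The grouping can be checked on a small example. The sketch below runs the same nested query in SQLite through Python's sqlite3 module instead of SQL Server; the meter name, timestamps, and all readings except 5.851 are invented for illustration, and the subquery aliases g and s are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (meter_serial TEXT, read_utc TEXT, kwh_rding REAL)")
# Hypothetical readings: three missing hours followed by an accumulated value.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("M1", "2020-01-01 00:00", 1.0),
        ("M1", "2020-01-01 01:00", None),
        ("M1", "2020-01-01 02:00", None),
        ("M1", "2020-01-01 03:00", None),
        ("M1", "2020-01-01 04:00", 5.851),
        ("M1", "2020-01-01 05:00", 1.2),
    ],
)

rows = conn.execute(
    """
    SELECT read_utc, kwh_rding, max_kwh_rding / cnt AS new_kwh_rding
    FROM (SELECT g.*,
                 COUNT(*) OVER (PARTITION BY meter_serial, grp) AS cnt,
                 MAX(kwh_rding) OVER (PARTITION BY meter_serial, grp) AS max_kwh_rding
          FROM (SELECT t.*,
                       -- counts non-NULL readings at or after this row,
                       -- so NULL rows share a group with the next reading
                       COUNT(kwh_rding) OVER (
                           PARTITION BY meter_serial
                           ORDER BY read_utc DESC
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
                FROM t) AS g
         ) AS s
    WHERE cnt > 1
    ORDER BY read_utc
    """
).fetchall()

for row in rows:
    print(row)  # four rows (01:00 to 04:00), each with new_kwh_rding = 5.851 / 4
```

Rows that already have their own reading form groups of size 1 and are filtered out by cnt > 1.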
You can incorporate this into an update:
with toupdate as (
select t.*,
(max_kwh_rding / cnt) as new_kwh_rding
from (select t.*, count(*) over (partition by meter_serial, grp) as cnt,
max(kwh_rding) over (partition by meter_serial, grp) as max_kwh_rding
from (select t.*,
count(kwh_rding) over (partition by meter_serial order by read_utc desc rows between unbounded preceding and current row) as grp
from t
) t
) t
where cnt > 1
)
update toupdate
set kwh_rding = new_kwh_rding;

SQL - Window function to get values from previous row where value is not null

I am using Exasol. In other DBMSs it was possible to use analytic functions such as LAST_VALUE() and specify some condition (a windowing clause) after the ORDER BY within the OVER() clause, like:
select ...
LAST_VALUE(customer)
OVER (PARTITION BY ID ORDER BY date_x DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING ) as the_last
Unfortunately I get the following error:
ERROR: [0A000] Feature not supported: windowing clause (Session:
1606983630649130920)
The same does not happen if I use CURRENT ROW instead of 1 PRECEDING.
Basically, what I want is the last value according to the ORDER BY that is NOT the current row. In this example it would be the customer of the previous row.
I know that I could use LAG(customer, 1) OVER (...), but the problem is that I want the previous customer that is NOT null, so the offset is not always 1.
How can I do that?
Many thanks!
Does this work?
select lag(customer) over (partition by id
order by (case when customer is not null then 1 else 0 end),
date
)
You can do this with two steps:
select t.*,
max(customer) over (partition by id, max_date) as max_customer
from (select t.*,
max(case when customer is not null then date end) over (partition by id order by date) as max_date
from t
) t;
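Here is a small runnable sketch of the two-step approach, using SQLite through Python's sqlite3 module rather than Exasol. The rows are invented sample data, and the date column is named date_x as in the question (the answer calls it date). Note that, as written, a row whose own customer is not NULL gets its own customer back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, date_x TEXT, customer TEXT)")
# Hypothetical rows, not taken from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", "alice"),
        (1, "2020-01-02", None),
        (1, "2020-01-03", None),
        (1, "2020-01-04", "bob"),
        (1, "2020-01-05", None),
    ],
)

rows = conn.execute(
    """
    SELECT date_x, customer,
           -- step 2: within each (id, max_date) group there is one non-NULL customer
           MAX(customer) OVER (PARTITION BY id, max_date) AS max_customer
    FROM (SELECT t.*,
                 -- step 1: running max of the dates that have a non-NULL customer
                 MAX(CASE WHEN customer IS NOT NULL THEN date_x END)
                     OVER (PARTITION BY id ORDER BY date_x) AS max_date
          FROM t) AS s
    ORDER BY date_x
    """
).fetchall()

filled = [r[2] for r in rows]
print(filled)  # ['alice', 'alice', 'alice', 'bob', 'bob']
```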