running sum with window function - sql

I have the following data in a table:
col1
---
1
2
5
9
10
I want to update col2 in the table with the running sum of the difference between col1 and the previous value of col1, minus 1:
col2 = col2.prev + col1 - col1.prev - 1
The result would be:
col1 | col2
-----+-----
   1 |    0
   2 |    0
   5 |    2
   9 |    5
  10 |    5
I tried using a window function:
SELECT sum(col1 - lag(col1) OVER (ORDER BY col1) - 1) AS col2 FROM table1
But this is not allowed - ERROR: aggregate function calls cannot contain window function calls
Is there another way I can accomplish this? I know I could easily write a function to loop through the rows, but I get the impression from what I've read that this method is inefficient and discouraged in most cases. Please correct me if I have the wrong impression.

ERROR: aggregate function calls cannot contain window function calls
This error is raised because an aggregate function call may not contain a window function call, and window function calls cannot be nested either. The solution is simply to wrap the windowed expression in a CTE and apply the second windowed expression in a subsequent SELECT statement:
WITH mytable(col1) AS (
    VALUES (1), (2), (5), (9), (10)
)
, lagdiff AS (
    SELECT
        col1
      , COALESCE(col1 - lag(col1) OVER (ORDER BY col1) - 1, 0) AS col2_
    FROM mytable
)
SELECT
    col1
  , SUM(col2_) OVER (ORDER BY col1) AS col2
FROM lagdiff;
Produces output:
 col1 | col2
------+------
    1 |    0
    2 |    0
    5 |    2
    9 |    5
   10 |    5
(5 rows)
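To actually update col2 as the question asks, the same query can feed an UPDATE ... FROM (a sketch, assuming the table is named table1 and col1 is unique; the inner query is layered in two steps because window function calls cannot be nested either):
UPDATE table1 t
SET col2 = s.col2
FROM (
    -- running sum of the per-row gaps to the previous col1 value
    SELECT col1, SUM(col2_) OVER (ORDER BY col1) AS col2
    FROM (
        SELECT col1,
               COALESCE(col1 - lag(col1) OVER (ORDER BY col1) - 1, 0) AS col2_
        FROM table1
    ) d
) s
WHERE t.col1 = s.col1;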


BigQuery - replicate rows with modified values

The title of the post might not accurately represent what I want to do. I have a BigQuery table with a userId column and a bunch of feature columns. Let's say the table is like this.
| userId | col1 | col2 | col3 |
|--------|------|------|------|
| u1     | 0.3  | 0.0  | 0.0  |
| u2     | 0.0  | 0.1  | 0.6  |
Each row has a userId (userIds may or may not be distinct across rows), followed by some feature values. Most of those are 0 except a few.
Now, for each of the rows, I want to create additional rows where only one non-zero feature is substituted with 0. With the example above, the resulting table would look like this.
| userId | col1 | col2 | col3 |
|--------|------|------|------|
| u1     | 0.3  | 0.0  | 0.0  |
| u1     | 0.0* | 0.0  | 0.0  |
| u2     | 0.0  | 0.1  | 0.6  |
| u2     | 0.0  | 0.0* | 0.6  |
| u2     | 0.0  | 0.1  | 0.0* |
Values with an asterisk mark the column whose non-zero value was set to 0. Since u1 had one non-zero feature, only one additional row was added, with its col1 value set to 0. u2 had two non-zero columns (col2 and col3), so two additional rows were added: one with col2 set to 0 and the other with col3 set to 0.
The table has around 2000 columns and more than 20 million rows.
Normally, I post the crude attempts I could come up with. However, in this case, I don't even know where to start. I did have one bizarre idea of joining this table with an unpivoted version of it, but I don't know how to unpivot a BigQuery table.
Below is for BigQuery Standard SQL.
It is generic enough: you don't need to specify column names or repeat the same chunk of code 2,000 times!
Assuming that your initial data is in the table project.dataset.table:
#standardSQL
create temp table flatten as
with temp as (
  -- unpivot: one row per (userid, column, value), via the JSON representation of each row
  select userid, offset,
    split(col_kv, ':')[offset(0)] as col,
    cast(split(col_kv, ':')[offset(1)] as float64) as val
  from `project.dataset.table` t,
  unnest(split(translate(to_json_string(t), '{}"', ''))) col_kv with offset
  where split(col_kv, ':')[offset(0)] != 'userid'
), numbers as (
  -- one group number per feature column position
  select * from unnest((
    select generate_array(1, max(offset))
    from temp)) as grp
), targets as (
  -- (userid, column position) pairs whose non-zero value should be zeroed out
  select userid, grp from temp, numbers
  where grp = offset and val != 0
), flatten_result as (
  -- original rows (grp = 0) plus, per target, a copy with that column zeroed
  select *, 0 as grp from temp union all
  select userid, offset, col, if(offset = grp, 0, val) as val, grp
  from temp left join targets using(userid)
)
select * from flatten_result;
-- build the pivot query dynamically: one max(if(...)) expression per distinct column
execute immediate '''create temp table pivot as
select userid, ''' || (
  select string_agg(distinct "max(if(col = '" || col || "', val, null)) as " || col)
  from flatten
) || ''' from flatten group by userid, grp''';
select * from pivot order by userid;
Your final output is in the temp table pivot. Applied to the sample data from your question, the script produces the expected result shown above.
One method is brute force:
select userid, col1, col2, col3
from t
union all
select userid, 0 as col1, col2, col3
from t
where col1 <> 0
union all
select userid, col1, 0 as col2, col3
from t
where col2 <> 0
union all
select userid, col1, col2, 0 as col3
from t
where col3 <> 0;
This is verbose -- and convoluted with hundreds of columns. I can't readily think of a simpler method.
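If writing the branches out by hand is impractical at 2,000 columns, a hedged alternative is to generate the same brute-force query as a string and run it with EXECUTE IMMEDIATE. This is a sketch only; it assumes the table is `project.dataset.table` and that every column except userid is a numeric feature:
declare cols array<string>;
declare query string;

-- feature column names, in table order
set cols = (
  select array_agg(column_name order by ordinal_position)
  from `project.dataset`.INFORMATION_SCHEMA.COLUMNS
  where table_name = 'table'
    and column_name != 'userid'
);

-- original rows, plus one union all branch per feature column that zeroes it
set query = (
  select
    'select userid, ' || array_to_string(cols, ', ') ||
    ' from `project.dataset.table`' ||
    string_agg(
      ' union all select userid, ' ||
      (select string_agg(if(c2 = c, '0 as ' || c2, c2), ', ' order by off)
       from unnest(cols) c2 with offset off) ||
      ' from `project.dataset.table` where ' || c || ' != 0',
      '' order by c
    )
  from unnest(cols) c
);

execute immediate query;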

Comparing with LAG Analytic function

I am using Oracle PL/SQL.
I am trying to compare column values with LAG function.
Following is the statement:
decode(LAG(col1,1) OVER (ORDER BY col3),col1,'No Change','Change_Occured') Changes
For the first row, LAG compares with a non-existent previous row, so in my query the first row of the Changes column always shows Change_Occured when in fact no change has happened. Is there any way to handle this scenario?
Assume this table:
| col1 | col2 |
|------|------|
|    2 |    3 |
|    2 |    6 |
|    2 |    7 |
|    2 |    9 |
Each row of col1 is compared with the previous value, so the result will be:
| col1 | col2 | Changes        |
|------|------|----------------|
|    2 |    3 | Change_Occured |
|    2 |    6 | No Change      |
|    2 |    7 | No Change      |
|    2 |    9 | No Change      |
So how should I handle the first row of the Changes column?
The syntax for LAG Analytic function is:
LAG (value_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause)
default - The value returned if the offset is outside the scope of the window. The default value is NULL.
SQL> WITH sample_data AS(
2 SELECT 2 col1, 3 col2 FROM dual UNION ALL
3 SELECT 2 col1, 6 col2 FROM dual UNION ALL
4 SELECT 2 col1, 7 col2 FROM dual UNION ALL
5 SELECT 2 col1, 9 col2 FROM dual
6 )
7 -- end of sample_data mimicking real table
8 SELECT col1, LAG(col1,1) OVER (ORDER BY col2) changes FROM sample_data;
      COL1    CHANGES
---------- ----------
         2
         2          2
         2          2
         2          2
Therefore, in the DECODE expression you are comparing a NULL value with a real value, and it evaluates to 'Change_Occured'.
You could use the default value as the column value itself:
DECODE(LAG(col1,1, col1) OVER (ORDER BY col2),col1,'No Change','Change_Occured') Changes
For example,
SQL> WITH sample_data AS(
2 SELECT 2 col1, 3 col2 FROM dual UNION ALL
3 SELECT 2 col1, 6 col2 FROM dual UNION ALL
4 SELECT 2 col1, 7 col2 FROM dual UNION ALL
5 SELECT 2 col1, 9 col2 FROM dual
6 )
7 -- end of sample_data mimicking real table
8 SELECT col1,
9 DECODE(
10 LAG(col1,1, col1) OVER (ORDER BY col2),
11 col1,
12 'No Change',
13 'Change_Occured'
14 ) Changes
15 FROM sample_data;
      COL1 CHANGES
---------- --------------
         2 No Change
         2 No Change
         2 No Change
         2 No Change
SQL>
Maybe:
decode(LAG(col1,1, col1) OVER (ORDER BY col3),col1,'No Change','Change_Occured') Changes
The optional default value is returned if the offset goes beyond the scope of the window. If you do not specify default, then its default is null.
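An equivalent sketch of the same idea, if you prefer to leave LAG's default as NULL: fold the NULL back in with NVL before comparing (your_table stands in for the real table name):
SELECT col1,
       DECODE(NVL(LAG(col1, 1) OVER (ORDER BY col3), col1),
              col1, 'No Change',
              'Change_Occured') AS changes
FROM your_table;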

Group Concat in Redshift

I have a table like this:
| Col1 | Col2 |
|------|------|
| 1    | a;b; |
| 1    | b;c; |
| 2    | c;d; |
| 2    | d;e; |
I want the result to be something like this:
| Col1 | Col2   |
|------|--------|
| 1    | a;b;c; |
| 2    | c;d;e; |
Is there some way to write a set function which adds unique values in a column into an array and then displays them? I am using the Redshift database, which mostly uses PostgreSQL with the following difference:
Unsupported PostgreSQL Functions
Have a look at Redshift's listagg() function, which is similar to MySQL's group_concat. You would need to split the items first and then use listagg() to give you a list of values. Do take note, though, that, as the documentation states:
LISTAGG does not support DISTINCT expressions
(Edit: As of 11th October 2018, DISTINCT is now supported. See the docs.)
So you will have to take care of that yourself. Assuming you have the following table set up:
create table _test (col1 int, col2 varchar(10));
insert into _test values (1, 'a;b;'), (1, 'b;c;'), (2, 'c;d;'), (2, 'd;e;');
Fixed number of items in Col2
Perform as many split_part() operations as there are items in Col2:
select
    col1
  , listagg(col2, ';') within group (order by col2)
from (
    select col1, split_part(col2, ';', 1) as col2 from _test
    union
    select col1, split_part(col2, ';', 2) as col2 from _test
) t
group by col1
;
Varying number of items in Col2
You would need a helper here. If there are more rows in the table than items in Col2, a workaround with row_number() could work (but is expensive for large tables):
with _helper as (
select
(row_number() over())::int as part_number
from
_test
),
_values as (
select distinct
col1
, split_part(col2, ';', part_number) as col2
from
_test, _helper
where
length(split_part(col2, ';', part_number)) > 0
)
select
col1
, listagg(col2, ';') within group (order by col2) as col2
from
_values
group by
col1
;
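Given the 2018 edit above, on a current Redshift you can let LISTAGG deduplicate directly and drop the DISTINCT from the inner query. A sketch against the same _test table:
with _helper as (
    select (row_number() over())::int as part_number
    from _test
),
_values as (
    select
        col1
      , split_part(col2, ';', part_number) as col2
    from _test, _helper
    where length(split_part(col2, ';', part_number)) > 0
)
select
    col1
  , listagg(distinct col2, ';') within group (order by col2) as col2
from _values
group by col1;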

SQL Server : duplicate rows while changing a columns value

I have a table T, say
1 | a
2 | a
I want to duplicate its rows while changing the value of the second column to b, so as to have
1 | a
2 | a
1 | b
2 | b
I came up with
INSERT INTO T(col1, col2)
(SELECT col1, 'b'
FROM T)
but I get an error:
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
Remove those extra parentheses in the SELECT :
INSERT INTO T(col1, col2)
SELECT col1, 'b' AS col2 FROM T;
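For completeness, a quick end-to-end check (a sketch; the table definition is invented to match the question):
CREATE TABLE T (col1 int, col2 varchar(1));
INSERT INTO T(col1, col2) VALUES (1, 'a'), (2, 'a');

-- duplicate every row, changing col2 to 'b'
INSERT INTO T(col1, col2)
SELECT col1, 'b' AS col2 FROM T;

SELECT col1, col2 FROM T;
-- returns (1, a), (2, a), (1, b), (2, b)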

Oracle column number increase

I would like to know how to increase the number in COL1 by 1 whenever the COL2 value changes, in Oracle.
What I am looking for is to achieve this :
COL1 | COL2    | COL3
-----|---------|------
   1 | 2000    | xx
   1 | 2000    | xy
   1 | 2000    | xyz
   2 | 3020    | x
   2 | 3020    | xiii
   3 | 5666666 | ueueu
Any idea?
I think you are looking for a window function, specifically dense_rank(), which assigns the same number to every row of a col2 group and increments when col2 changes:
select dense_rank() over (order by col2) as col1,
       col2,
       col3
from the_table;
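If col1 actually has to be stored rather than computed on the fly, one option (a sketch; it assumes the pair (col2, col3) uniquely identifies a row in the_table) is a MERGE against the same ranking query:
MERGE INTO the_table t
USING (
    SELECT col2, col3,
           DENSE_RANK() OVER (ORDER BY col2) AS rn
    FROM the_table
) s
ON (t.col2 = s.col2 AND t.col3 = s.col3)
WHEN MATCHED THEN UPDATE SET t.col1 = s.rn;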
If you want col1 to increase whenever col2 is updated on table t_, you can use a trigger. Note that a row-level trigger cannot itself UPDATE t_ (that raises ORA-04091, the mutating-table error), so assign the new value directly in a BEFORE UPDATE trigger:
CREATE OR REPLACE TRIGGER upcol1
BEFORE UPDATE ON t_
FOR EACH ROW
WHEN (old.col2 != new.col2)
BEGIN
  :new.col1 := :old.col1 + 1;
END;
/