Cumulative sum over a table - sql

What is the best way to perform a cumulative sum over a table in Postgres, in a way that gives the best performance and stays flexible if more fields / columns are added to the table?
Table
    a   b   d
1   59  15  181
2       16  268
3           219
4           102
Cumulative
    a   b   d
1   59  15  181
2       31  449
3           668
4           770

You can use window functions, but you need additional logic to avoid values where there are NULLs:
SELECT id,
(case when a is not null then sum(a) OVER (ORDER BY id) end) as a,
(case when b is not null then sum(b) OVER (ORDER BY id) end) as b,
(case when d is not null then sum(d) OVER (ORDER BY id) end) as d
FROM table;
This assumes that the first column that specifies the ordering is called id.
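As a rough check of the CASE-plus-window approach, here is a minimal sketch that loads the sample data from the question and runs the same query. It assumes an in-memory SQLite database (3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for Postgres, and a table named tbl with an id column; both names are assumptions, not something the question specifies.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [(1, 59, 15, 181), (2, None, 16, 268), (3, None, None, 219), (4, None, None, 102)],
)

# The running sum is computed for every row; CASE blanks it where the source is NULL.
rows = conn.execute("""
    SELECT id,
           CASE WHEN a IS NOT NULL THEN sum(a) OVER (ORDER BY id) END AS a,
           CASE WHEN b IS NOT NULL THEN sum(b) OVER (ORDER BY id) END AS b,
           CASE WHEN d IS NOT NULL THEN sum(d) OVER (ORDER BY id) END AS d
    FROM tbl
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

The rows that come back match the Cumulative table in the question, with None where the source column is NULL.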

Window functions for running sum.
SELECT sum(a) OVER (ORDER BY d) as "a",
sum(b) OVER (ORDER BY d) as "b",
sum(d) OVER (ORDER BY d) as "d"
FROM table;
If you have more than one running sum, make sure the orders are the same.
http://www.postgresql.org/docs/8.4/static/tutorial-window.html
http://www.postgresonline.com/journal/archives/119-Running-totals-and-sums-using-PostgreSQL-8.4-Windowing-functions.html
It's important to note that if you want your columns to appear as the aggregate table in your question (each field uniquely ordered), it'd be a little more involved.
Update: I've modified the query to do the required sorting, without a given common field.
SQL Fiddle: (1) Only Aggregates, or (2) Source Data Beside Running Sum
WITH
rcd AS (
select row_number() OVER() as num,a,b,d
from tbl
),
sorted_a AS (
select row_number() OVER(w1) as num, sum(a) over(w2) a
from tbl
window w1 as (order by a nulls last),
w2 as (order by a nulls first)
),
sorted_b AS (
select row_number() OVER(w1) as num, sum(b) over(w2) b
from tbl
window w1 as (order by b nulls last),
w2 as (order by b nulls first)
),
sorted_d AS (
select row_number() OVER(w1) as num, sum(d) over(w2) d
from tbl
window w1 as (order by d nulls last),
w2 as (order by d nulls first)
)
SELECT sorted_a.a, sorted_b.b, sorted_d.d
FROM rcd
JOIN sorted_a USING(num)
JOIN sorted_b USING(num)
JOIN sorted_d USING(num)
ORDER BY num;

I think what you are really looking for is this:
SELECT id
, sum(a) OVER (PARTITION BY a_grp ORDER BY id) as a
, sum(b) OVER (PARTITION BY b_grp ORDER BY id) as b
, sum(d) OVER (PARTITION BY d_grp ORDER BY id) as d
FROM (
SELECT *
, count(a IS NULL OR NULL) OVER (ORDER BY id) as a_grp
, count(b IS NULL OR NULL) OVER (ORDER BY id) as b_grp
, count(d IS NULL OR NULL) OVER (ORDER BY id) as d_grp
FROM tbl
) sub
ORDER BY id;
The expression count(col IS NULL OR NULL) OVER (ORDER BY id) forms groups of consecutive non-null rows for a, b and d in the subquery sub.
In the outer query we run cumulative sums per group. A NULL row is always the first row of its group, so its running sum stays NULL automatically. No additional CASE expression is necessary.
SQL Fiddle (with some added values for column a to demonstrate the effect).
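To see how the grouping behaves when non-null values reappear after a NULL, here is a small sketch of the same idea for a single column. It assumes SQLite (3.25+) through Python's sqlite3 module rather than Postgres, and the sample values are made up for illustration. In SQLite, a IS NULL OR NULL also evaluates to 1 for NULL rows and to NULL otherwise, so count() behaves as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, a INTEGER)")
# Hypothetical data: values come back after each NULL.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [(1, 59), (2, None), (3, 10), (4, 5), (5, None), (6, 7)],
)

rows = conn.execute("""
    SELECT id, a_grp, sum(a) OVER (PARTITION BY a_grp ORDER BY id) AS a
    FROM (
        SELECT id, a,
               count(a IS NULL OR NULL) OVER (ORDER BY id) AS a_grp  -- NULLs seen so far
        FROM tbl
    ) AS sub
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

Each NULL bumps the group number, so the running sum restarts after it: ids 3 and 4 give 10 and 15 rather than continuing from 59.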

Related

Teradata how can I use count(distinct)?

I saw some similar questions but still couldn't figure it out.
There are many columns, but in short, I want to count the distinct C1 values within each G1 group.
I thought I could just use COUNT(DISTINCT)...?
Please help me with this problem.
Thank you so much in advance.
G1  C1  expected results
1   A   2
1   A   2
1   B   2
2   A   3
2   B   3
2   A   3
2   C   3
3   A   1
3   A   1
3   A   1
3   A   1
Many databases do not support using COUNT(DISTINCT ...) as an analytic function. So in this case, I would suggest just joining to a subquery which finds the distinct counts:
SELECT t1.G1, t1.C1, t2.cnt
FROM yourTable t1
INNER JOIN
(
SELECT G1, COUNT(DISTINCT C1) AS cnt
FROM yourTable
GROUP BY G1
) t2
ON t2.G1 = t1.G1
ORDER BY
t1.G1, t1.C1;
You can get a windowed count distinct, but you always need two functions, one possible way is:
SELECT G1, C1, max(dr) over (partition by G1) as cnt
FROM
(
SELECT G1, C1,
dense_rank()
over (partition by G1
order by C1) AS dr
FROM yourTable
) as dt
Depending on your data and actual query this might perform better than Tim's query :-)
Of course, this can be modified for NULLable columns by flagging the first occurrence of each value:
SELECT G1, C1, count(flag) over (partition by G1) as cnt -- count, not sum: flag holds C1 values
FROM
(
SELECT G1, C1,
case
when lag(C1)
over (partition by G1
order by C1) = C1
then null
else C1
end as flag
FROM yourTable
) as dt
You can use the query below, which returns one row per G1 rather than one row per input row:
SELECT G1, count(DISTINCT C1) AS cnt from TableName Group by G1
If your database doesn't support count(distinct) as a window function, you can use the handy trick of the sum of dense_rank():
select t.*,
(-1 +
dense_rank() over (partition by g1 order by c1 asc) +
dense_rank() over (partition by g1 order by c1 desc)
) as expected
from t;
Given that count(distinct) as a window function is easily implemented this way, I am surprised that many databases do not support this functionality directly.
One nuance: This counts NULL as a valid value. You don't have NULL values in your sample data so I don't think this affects you. But, if you wanted an exact equivalent:
select t.*,
( (case when count(*) over (partition by g1) = count(c1) over (partition by g1)
then -1 else -2
end) +
dense_rank() over (partition by g1 order by c1 asc) +
dense_rank() over (partition by g1 order by c1 desc)
) as expected
from t;
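The dense_rank() trick can be checked against a plain COUNT(DISTINCT) join on the sample data from the question. This is only a sketch under assumptions: it runs on SQLite (3.25+) via Python's sqlite3 module rather than Teradata, and the table name t and lowercase column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (g1 INTEGER, c1 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "A"), (1, "A"), (1, "B"),
     (2, "A"), (2, "B"), (2, "A"), (2, "C"),
     (3, "A"), (3, "A"), (3, "A"), (3, "A")],
)

# Ascending rank + descending rank - 1 equals the number of distinct values per partition.
windowed = conn.execute("""
    SELECT g1, c1,
           dense_rank() OVER (PARTITION BY g1 ORDER BY c1 ASC)
         + dense_rank() OVER (PARTITION BY g1 ORDER BY c1 DESC)
         - 1 AS cnt
    FROM t
    ORDER BY g1, c1
""").fetchall()

# Reference result: join each row to a grouped COUNT(DISTINCT).
joined = conn.execute("""
    SELECT t1.g1, t1.c1, t2.cnt
    FROM t AS t1
    JOIN (SELECT g1, count(DISTINCT c1) AS cnt FROM t GROUP BY g1) AS t2
      ON t2.g1 = t1.g1
    ORDER BY t1.g1, t1.c1
""").fetchall()

print(windowed == joined)
```

On this data both queries give 2 for group 1, 3 for group 2 and 1 for group 3, which matches the expected results column.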

How to combine multiple rows data until next row value is not null in SQL Server

I have data like this and there is no id column to group, for example:
A
B
C
NULL
F
D
R
NULL
R
T
G
Expected output:
ABC
FDR
RTG
This is a gaps-and-islands problem. One option uses a cumulative sum to define the groups, then aggregation. You need a column that defines the ordering of the rows, though; I assumed id.
select string_agg(val, '') within group (order by id) vals
from (
select
id, -- needed by the outer within group (order by id)
val,
sum(case when val is null then 1 else 0 end) over(order by id) grp
from mytable
) t
group by grp
order by grp
If there may be consecutive nulls, then you need a where clause in the outer query:
select string_agg(val, '') within group (order by id) vals
from (
select
id, -- needed by the outer within group (order by id)
val,
sum(case when val is null then 1 else 0 end) over(order by id) grp
from mytable
) t
where val is not null
group by grp
order by grp
You could also use window counts to build the groups:
select string_agg(val, '') within group (order by id) vals
from (
select
id, -- needed by the outer within group (order by id)
val,
count(*) over(order by id) cnt1,
count(val) over(order by id) cnt2
from mytable
) t
group by cnt1 - cnt2
order by cnt1 - cnt2
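The group numbering is the core of these queries, so here is a sketch that runs just that part on the sample data and does the concatenation in Python. It assumes SQLite (3.25+) via Python's sqlite3 module instead of SQL Server, and an id column that gives the row order, as the answer above assumes. The concatenation is done client-side because SQLite's string aggregate differs from string_agg ... within group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, val TEXT)")
values = ["A", "B", "C", None, "F", "D", "R", None, "R", "T", "G"]
conn.executemany("INSERT INTO mytable (val) VALUES (?)", [(v,) for v in values])

# Running count of NULLs: every NULL row starts a new island.
rows = conn.execute("""
    SELECT id, val,
           sum(CASE WHEN val IS NULL THEN 1 ELSE 0 END) OVER (ORDER BY id) AS grp
    FROM mytable
    ORDER BY id
""").fetchall()

# Concatenate the non-null values of each island in id order.
islands = {}
for _id, val, grp in rows:
    if val is not None:
        islands.setdefault(grp, []).append(val)
result = ["".join(vals) for _, vals in sorted(islands.items())]
print(result)
```

Skipping NULL rows before concatenating plays the same role as the where val is not null clause, so consecutive NULLs do not produce empty groups.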

SQL: histograms for multiple columns

Given the following table:
Column A Column B
east red
west blue
east green
I want to find out the distinct values of each column and how many times each value is present in the table. Given the input above, the result should look like:
A values  A value counts  B values  B value counts
east      2               red       1
west      1               blue      1
                          green     1
This is achievable by running SELECT colX, count(colX) From Table GROUP BY colX for each column. That does not scale if there is a complex WHERE condition, since the condition has to be evaluated again for every column's query.
An alternative is to execute the complex where query once and compute the aggregations in the server code. But is there a single SQL query that can compute that?
You can use window functions:
select cola, count(*) over (partition by cola) as a_count,
colb, count(*) over (partition by colb) as b_count
from t -- t is a placeholder for your table
This returns every row with the count of its cola value and the count of its colb value alongside the values themselves.
You can use subqueries to aggregate, then union all and aggregate again to combine the results:
select max(a) as a, max(a_cnt) as a_cnt, max(b) as b, max(b_cnt) as b_cnt
from ((select a, count(*) as a_cnt, null as b, null as b_cnt,
row_number() over (order by count(*) desc) as seqnum
from t
group by a
) union all
(select null, null, b, count(*),
row_number() over (order by count(*) desc) as seqnum
from t
group by b
)
) ab
group by seqnum
order by seqnum;
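Here is a sketch of the union-all approach on the three sample rows. It assumes SQLite (3.25+) via Python's sqlite3 module, a table named t with columns a and b as in the query above, and it adds the value itself as a tie-breaker in the row_number() ordering so the output is deterministic. The tie-breaker is my addition, and it means rows are paired by rank, not in the order shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("east", "red"), ("west", "blue"), ("east", "green")],
)

rows = conn.execute("""
    WITH a_hist AS (
        SELECT a, cnt, row_number() OVER (ORDER BY cnt DESC, a) AS seqnum
        FROM (SELECT a, count(*) AS cnt FROM t GROUP BY a) AS ga
    ),
    b_hist AS (
        SELECT b, cnt, row_number() OVER (ORDER BY cnt DESC, b) AS seqnum
        FROM (SELECT b, count(*) AS cnt FROM t GROUP BY b) AS gb
    )
    SELECT max(a) AS a, max(a_cnt) AS a_cnt, max(b) AS b, max(b_cnt) AS b_cnt
    FROM (
        SELECT a, cnt AS a_cnt, NULL AS b, NULL AS b_cnt, seqnum FROM a_hist
        UNION ALL
        SELECT NULL, NULL, b, cnt, seqnum FROM b_hist
    ) AS ab
    GROUP BY seqnum   -- pair up the n-th value of each histogram
    ORDER BY seqnum
""").fetchall()

for row in rows:
    print(row)
```

The shorter histogram is padded with NULLs, which is why the last row has no entry for column a.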
If you are using Oracle you can use user_tab_cols to generate the SQL for all columns in your table
SELECT 'SELECT '
|| Listagg(column_name
||',count(1) over (partition by '
||column_name
||') as '
||column_name
||'_cnt', ',')
within GROUP (ORDER BY column_id)
||' FROM '
||'TEST_DATA'
FROM user_tab_cols
WHERE table_name = 'TEST_DATA'
Sample output is below
SELECT ID,count(1) over (partition by ID) as ID_cnt,VALUE,count(1) over (partition by
VALUE) as VALUE_cnt FROM TEST_DATA

How to create "subsets" as a result from a column in SQL

Let's suppose that one query returns the following set of values in a single column:
Value
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J
Now, I would like to see this information in a different arrangement, with a limit on the number of values in each subset. Suppose I choose 3 as the limit; the information would then be presented like this (one column for all the subsets):
Values
1 A, B, C
2 D, E, F
3 G, H, I
4 J,
Obviously, the last row will contain the remaining values when their number is smaller than the limit established.
Is it possible to perform a query like this in SQL?
What if the limit is dynamic? It could be chosen at random.
create table dee_t (id int identity(1,1),name varchar(10))
insert into dee_t values ('A'),('B'),('c'),('D'),('E'),('F'),('g'),('H'),('I'),('J')
;with cte as
(
select (id-1)/3 +1 rno ,* from dee_t
) select rno ,
(select name+',' from cte where rno = c.rno for xml path('') )
from cte c group by rno
You can do this with a few calculations on row_number, like this:
select
GRP,
max(case when RN = 1 then Value end),
max(case when RN = 2 then Value end),
max(case when RN = 0 then Value end)
from (
select
row_number() over (order by Value) % 3 as RN,
(row_number() over (order by Value)+2) / 3 as GRP,
Value
from Table1
) X
group by GRP
The first row_number creates the column numbers (1,2,0,1,2,0...) and the second one creates the row numbers (1,1,1,2...). These are then used to put each value in the correct place using case; you can also use pivot instead if you prefer.
If you want them in the same column, just concatenate the case expressions instead of selecting them as separate columns, but beware of nulls.
Example in SQL Fiddle
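For the dynamic-limit part of the question, one possible sketch is to bind the subset size as a query parameter and derive the subset number from row_number(). It assumes SQLite (3.25+) via Python's sqlite3 module rather than SQL Server, and the table name items and column name val are hypothetical. The final concatenation is done in Python to keep the SQL portable.

```python
import sqlite3

limit = 3  # subset size; bound as a parameter so it can change per call

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (val TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [(c,) for c in "ABCDEFGHIJ"])

# Integer division of the zero-based row number by the limit gives the subset number.
rows = conn.execute("""
    SELECT (row_number() OVER (ORDER BY val) - 1) / ? + 1 AS grp, val
    FROM items
    ORDER BY val
""", (limit,)).fetchall()

subsets = {}
for grp, val in rows:
    subsets.setdefault(grp, []).append(val)
result = [", ".join(vals) for _, vals in sorted(subsets.items())]
print(result)
```

With limit = 3 this yields four subsets, the last one holding only J; changing limit changes the grouping without touching the SQL text.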
Thanks a lot for all your replies. I finally got a solution with the help of Rajen Singh.
This is the code that can be used:
WITH CTE_0 AS
(
SELECT DISTINCT column_A_VALUE AS id
FROM Table
WHERE column_A_VALUE IS NOT NULL
), CTE_1 AS
(
SELECT ROW_NUMBER() OVER (ORDER BY id) RN, id
FROM CTE_0
), CTE_2 AS
(
SELECT RN%30 GRP, ID -- GROUP is a reserved word, so the alias is GRP
FROM CTE_1
)SELECT STUFF(( SELECT ','''+CAST(ID AS NVARCHAR(20))+''''
FROM CTE_2
WHERE GRP = A.GRP
FOR XML PATH('')),1,1,'') IDS
FROM CTE_2 A
GROUP BY GRP

Order by newly selected column

I have a query like:
SELECT
R.*
FROM
(SELECT A, B,
(SELECT smth from another table) as C,
ROW_NUMBER() OVER (ORDER BY C DESC) AS RowNumber
FROM SomeTable) R
WHERE
RowNumber BETWEEN 10 AND 20
This gives me an error on ORDER BY C DESC.
I understand why this error occurs, so I thought of adding another SELECT with the ORDER BY and only then selecting rows 10 to 20. But I don't think it's good to have 3 nested SELECT commands.
How else is it possible to select these rows?
A column expression cannot refer to an alias defined at the same level; you have to put it in a derived table first, or use a CTE.
SELECT
R.* , ROW_NUMBER() OVER (ORDER BY C DESC) AS RowNumber
FROM
(SELECT A, B, (SELECT smth from another table) as C
FROM SomeTable) R
-- WHERE
-- but you still cannot do this
-- RowNumber BETWEEN 10 AND 20
You need to do this:
select S.*
from
(
SELECT
R.* , ROW_NUMBER() OVER (ORDER BY C DESC) AS RowNumber
FROM
(SELECT A, B,
(SELECT smth from another table) as C
FROM SomeTable) R
) as s
where s.RowNumber between 10 and 20
To avoid deep nesting and to make it at least look pleasant, use CTEs:
with R as
(
SELECT A, B, (SELECT smth from another table) as C
FROM SomeTable
)
,S AS
(
SELECT R.*, ROW_NUMBER() OVER (ORDER BY C DESC) AS RowNumber
FROM R
)
SELECT S.*
FROM S
WHERE S.RowNumber BETWEEN 10 AND 20
You cannot use aliased columns in the same SELECT, but you can wrap it into another select to make it work:
SELECT R.*
FROM (SELECT ABC.A, ABC.B, ABC.C, ROW_NUMBER() OVER (ORDER BY C DESC) AS RowNumber
FROM (SELECT A, B, (SELECT smth from another table) as C FROM SomeTable) ABC
) R
WHERE R.RowNumber BETWEEN 10 AND 20
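The layering can be tried end to end with a tiny data set. This is a sketch under assumptions: it runs on SQLite (3.25+) via Python's sqlite3 module, the computed column C is a made-up expression (A * B) standing in for the subquery in the question, and the row range is 2 to 4 because the sample has only five rows. SQLite is more lenient about aliases than some other databases, so this shows the shape of the query rather than reproducing the original error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SomeTable (A INTEGER, B INTEGER)")
conn.executemany(
    "INSERT INTO SomeTable VALUES (?, ?)",
    [(1, 10), (2, 3), (3, 7), (4, 1), (5, 4)],
)

page = conn.execute("""
    WITH R AS (
        -- level 1: compute the derived column
        SELECT A, B, A * B AS C FROM SomeTable
    ),
    S AS (
        -- level 2: C is now a real column, so it can be used in ORDER BY
        SELECT R.*, row_number() OVER (ORDER BY C DESC) AS RowNumber FROM R
    )
    -- level 3: RowNumber is now a real column, so it can be filtered
    SELECT A, B, C, RowNumber
    FROM S
    WHERE RowNumber BETWEEN 2 AND 4
    ORDER BY RowNumber
""").fetchall()

for row in page:
    print(row)
```

Each level only refers to columns produced by the level before it, which is the point of the CTE and derived-table answers above.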