Calculating Cumulative Sum in PostgreSQL - sql

I want to find the cumulative (running) total of a field and insert it from a staging table into a target table. My staging table looks like this:
ea_month   id     amount  ea_year  circle_id
April      92570  1000    2014     1
April      92571  3000    2014     2
April      92572  2000    2014     3
March      92573  3000    2014     1
March      92574  2500    2014     2
March      92575  3750    2014     3
February   92576  2000    2014     1
February   92577  2500    2014     2
February   92578  1450    2014     3
I want my target table to look something like this:
ea_month   id     amount  ea_year  circle_id  cum_amt
February   92576  2000    2014     1          2000
March      92573  3000    2014     1          5000
April      92570  1000    2014     1          6000
February   92577  2500    2014     2          2500
March      92574  2500    2014     2          5000
April      92571  3000    2014     2          8000
February   92578  1450    2014     3          1450
March      92575  3750    2014     3          5200
April      92572  2000    2014     3          7200
I am confused about how to go about achieving this result in PostgreSQL.
Can anyone suggest how to produce this result set?

Basically, you need a window function. That's a standard feature nowadays. In addition to genuine window functions, you can use any aggregate function as a window function in Postgres by appending an OVER clause.
The special difficulty here is to get partitions and sort order right:
SELECT ea_month, id, amount, ea_year, circle_id
, sum(amount) OVER (PARTITION BY circle_id
ORDER BY ea_year, ea_month) AS cum_amt
FROM tbl
ORDER BY circle_id, ea_year, ea_month;
And no GROUP BY.
The sum for each row is calculated from the first row in the partition to the current row - or quoting the manual to be precise:
The default framing option is RANGE UNBOUNDED PRECEDING, which is
the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With
ORDER BY, this sets the frame to be all rows from the partition
start up through the current row's last ORDER BY peer.
This is the cumulative (or "running") sum you are after.
In default RANGE mode, rows with the same rank in the sort order are "peers" - same (circle_id, ea_year, ea_month) in this query. All of those show the same running sum with all peers added to the sum. But I assume your table is UNIQUE on (circle_id, ea_year, ea_month), then the sort order is deterministic and no row has peers. (And you might as well use the cheaper ROWS mode.)
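As a runnable sketch of the running sum above, here is the same query against SQLite, whose window functions follow the same SQL-standard semantics as Postgres (the substitution is purely for portability; it is not the asker's database). The numeric `mon` column is an assumption added so that ORDER BY sorts chronologically, sidestepping the month-name problem discussed below. Only circle_id 1 from the sample data is loaded:

```python
# Sketch using SQLite (>= 3.25) in place of PostgreSQL; same standard
# window-function semantics. The numeric "mon" column is a hypothetical
# addition so ORDER BY sorts months chronologically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tbl (
    ea_month TEXT, id INTEGER, amount INTEGER,
    ea_year INTEGER, circle_id INTEGER, mon INTEGER
)""")
conn.executemany("INSERT INTO tbl VALUES (?,?,?,?,?,?)", [
    ("February", 92576, 2000, 2014, 1, 2),
    ("March",    92573, 3000, 2014, 1, 3),
    ("April",    92570, 1000, 2014, 1, 4),
])

cum = conn.execute("""
    SELECT id, SUM(amount) OVER (PARTITION BY circle_id
                                 ORDER BY ea_year, mon) AS cum_amt
    FROM tbl
    ORDER BY circle_id, ea_year, mon
""").fetchall()
print(cum)  # running sum within the circle_id partition
```

Because (circle_id, ea_year, mon) is unique here, no row has peers, so RANGE and ROWS mode give the same result.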
Postgres 11 added tools to include / exclude peers with the new frame_exclusion options. See:
Aggregating all values not in the same group
Now, ORDER BY ... ea_month won't work with strings for month names. Postgres would sort alphabetically according to the locale setting.
If you have actual date values stored in your table, you can sort properly. If not, I suggest replacing ea_year and ea_month with a single column the_date of type date in your table.
Transform what you have with to_date():
to_date(ea_year || ea_month , 'YYYYMonth') AS the_date
For display, you can get original strings with to_char():
to_char(the_date, 'Month') AS ea_month
to_char(the_date, 'YYYY') AS ea_year
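The to_date() / to_char() round trip can be checked with Python's strptime/strftime, where %B plays the role of the 'Month' pattern (this assumes an English locale for the month names; it is an illustration, not Postgres itself):

```python
# Sketch of the to_date()/to_char() round trip. %Y%B parses a 4-digit year
# followed by a full month name; the day defaults to the 1st, matching
# to_date('2014April', 'YYYYMonth') in Postgres.
from datetime import datetime

the_date = datetime.strptime("2014" + "April", "%Y%B").date()
print(the_date)                  # first day of that month
print(the_date.strftime("%B"))   # back to the month name
print(the_date.strftime("%Y"))   # back to the year string
```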
While stuck with the unfortunate design, this will work:
SELECT ea_month, id, amount, ea_year, circle_id
, sum(amount) OVER (PARTITION BY circle_id ORDER BY the_date) AS cum_amt
FROM (SELECT *, to_date(ea_year || ea_month, 'YYYYMonth') AS the_date FROM tbl) sub
ORDER BY circle_id, the_date;

Related

Hive QL to populate a sequence of numbers between limits

Not sure how to put this in a straightforward manner, but I'm trying to make something work in Hive SQL: I need to create a sequence of numbers from a lower limit to an upper limit.
Ex:
select min(year) from table
Let's assume it results in 2010
select max(year) from table
Let's assume it results in 2015
I need to publish each year from 2010 to 2015 in a select query.
And I'm trying to put the min calculation & max calculation inside the same SQL which will/should create sequential years in the output.
Any ideas?
Well, I have an idea, but to use it you will have to define the smallest and largest possible values for the years that might be present in your table.
Let's say the smallest possible year is 1900 and the largest possible year is 2200.
Since the largest possible difference in this case is 2200-1900=300, you will have to use the following string: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ... 298 299 300.
In the query, you split this string using space as a delimiter thus getting an array, and then you explode that array.
Have a look:
SELECT
minval + delta
FROM
(
SELECT
min(year) minval,
max(year) maxval,
split('0 1 2 3 4 5 6 7 8 9 10 11 12 13 ... ... ... 298 299 300', ' ') delta_list
FROM
table
) t
LATERAL VIEW explode(delta_list) dlist AS delta
WHERE (maxval-minval) >= delta
;
So you end up with 301 rows, but you only need the rows with delta values not exceeding the difference between the max year and the min year, which is reflected in the WHERE clause.
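The same idea can be sketched in plain Python to see the mechanics: a fixed 0..300 delta list, filtered down to the min/max span (the `years` values are hypothetical table contents):

```python
# Sketch of the delta-list trick: precompute 0..300, then keep only
# deltas not exceeding max(year) - min(year).
years = [2012, 2010, 2015, 2011]      # hypothetical contents of the year column
minval, maxval = min(years), max(years)

delta_list = list(range(0, 301))      # the "0 1 2 ... 300" string, exploded
result = [minval + d for d in delta_list if d <= maxval - minval]
print(result)  # one row per year from min to max
```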
Alternatively, you can generate the sequence directly with posexplode() over space(), without a hard-coded delta string:
set hivevar:end_year=2019;
set hivevar:start_year=2010;
select ${hivevar:start_year}+i as year
from
(
select posexplode(split(space((${hivevar:end_year}-${hivevar:start_year})),' ')) as (i,x)
)s;
Result:
year
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
Have a look also at this answer about generating missing dates.
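Why the space()/posexplode() trick yields exactly end_year − start_year + 1 rows can be seen in a few lines of Python (an illustration of the string mechanics, not Hive itself): splitting a string of n spaces on ' ' produces n + 1 empty fields, and posexplode numbers them 0..n.

```python
# split(space(n), ' ') yields n+1 fields; posexplode numbers them 0..n,
# so start_year + i covers the whole range inclusive.
start_year, end_year = 2010, 2019

fields = (" " * (end_year - start_year)).split(" ")
years = [start_year + i for i, _ in enumerate(fields)]
print(years)
```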

Finding a lagged cumulative total from a table containing both cumulative and delta entries

I have a SQL table with a schema where a value is either a cumulative value for a particular category, or a delta on top of the previous value. While I appreciate this is not a particularly great design, it comes from an external source and thus I can't change it in any way.
The table looks something like the following:
Date Category AmountSoldType AmountSold
-----------------------------------------------------
Jan 1 Apples Cumulative 100
Jan 1 Bananas Cumulative 50
Jan 2 Apples Delta 20
Jan 2 Bananas Delta 10
Jan 3 Apples Delta 25
Jan 3 Bananas Cumulative 75
For this example, I want to produce the total cumulative number of fruits sold by item at the beginning of each day:
Date Category AmountSold
--------------------------------
Jan 1 Apples 0
Jan 1 Bananas 0
Jan 2 Apples 100
Jan 2 Bananas 50
Jan 3 Apples 120
Jan 3 Bananas 60
Jan 4 Apples 145
Jan 4 Bananas 75
Intuitively, I want to take the most recent cumulative total, and add any deltas that have appeared since that entry.
I imagine something akin to
SELECT Date, Category,
LEAD((subquery??), 1) OVER (PARTITION BY Category ORDER BY Date) AS Amt
FROM Fruits
GROUP BY Date, Category
ORDER BY Date ASC
is what I want, but I'm having trouble putting the right subquery together. Any suggestions?
You seem to want to add the deltas to the most recent cumulative -- all before the current date.
If so, I think this logic does what you want:
select f.*,
       (max(case when date = date_cumulative then amountsold end)
            over (partition by category order by date) +
        sum(case when date > date_cumulative then amountsold else 0 end)
            over (partition by category order by date
                  rows between unbounded preceding and 1 preceding)
       ) as amt
from (select f.*,
             max(case when AmountSoldType = 'Cumulative' then date end)
                 over (partition by category order by date
                       rows between unbounded preceding and current row) as date_cumulative
      from fruits f
     ) f
I'm a bit confused by this data set. I assume the raw data states end-of-day figures, so, for example, 20 apples were sold on Jan 2 (because there is a delta of 20 reported for that day).
In your example results, it does not appear valid to say that zero apples were sold on Jan 1. It isn't actually possible to say how many were sold on that day, because it is not clear whether the 100 cumulative apples were accrued during Jan 1 (and thus should be excluded from the start-of-day figure you seek) or whether they were accrued on previous days (and should be included), or some mix of the two. That day's data should thus be null.
It is also not clear whether all data sets must begin with a cumulative, or whether data sets can begin with a delta (which might require working backwards from a subsequent cumulative), and whether you potentially have access to multiple data sets from your external source which form a continuous consistent sequence, or whether "cumulatives" relate purely to a single data set received. I'm going to assume at least that all data sets begin with a cumulative.
All that said, this problem is a simple case of firstly converting all rows into either all deltas, or all cumulatives. Assuming we go for all cumulatives, then recursing through each row in order, it is a case of either selecting the AmountSold as-is (if the row is a cumulative), or adding the AmountSold to the result of the previous step (if it is a delta).
Once pre-processed like this, then for a start-of-day cumulative, it is all just a question of looking at the previous day's cumulative (which was an end-of-day cumulative, if my initial assumption was correct that all raw data relates to end-of-day figures).
Using the LAG function in this final step to get the previous day's cumulative, will also neatly produce a null for the first row.
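The two-step approach described above can be sketched in plain Python as a stand-in for the SQL (the integer "dates" and in-memory rows are illustrative assumptions; column names follow the question): first convert every row to a cumulative by carrying a running total per category, then shift by one day so end-of-day cumulatives become start-of-day figures.

```python
# Step 1: convert mixed cumulative/delta rows into all-cumulatives.
# Step 2: lag by one day to get start-of-day figures (day 1 has no
# previous day, matching the LAG-produces-null point above).
rows = [  # (date, category, type, amount), ordered by date
    (1, "Apples",  "Cumulative", 100),
    (1, "Bananas", "Cumulative",  50),
    (2, "Apples",  "Delta",       20),
    (2, "Bananas", "Delta",       10),
    (3, "Apples",  "Delta",       25),
    (3, "Bananas", "Cumulative",  75),
]

running = {}
cumulative = []  # (date, category, end-of-day cumulative)
for date, cat, typ, amt in rows:
    running[cat] = amt if typ == "Cumulative" else running.get(cat, 0) + amt
    cumulative.append((date, cat, running[cat]))

# Start-of-day on day d+1 is the end-of-day cumulative of day d.
start_of_day = {(d + 1, c): v for d, c, v in cumulative}
print(sorted(start_of_day.items()))
```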

SQL store results table with month name

I have several CSVs stored to query against. Each CSV represents a month of data. I would like to count all the records in each CSV and save that count as a row in a table. For instance, the rows for May and June should look like this. The data starts in Feb 2018 and continues to Feb 2019, so a year value is needed as well.
Month Results
----------------
May 18 1170
June 18 1167
I want to run the same query against all the tables for efficiency. I also want the query to keep working with all future updates, e.g. when a March 19 table gets added.
So far, I have this query.
SELECT COUNT(*)
FROM `months_data.*`
I am querying in Google Big Query using Standard SQL.
It sounds like you just want an aggregation that counts rows for each month:
SELECT
DATE_TRUNC(DATE(timestamp), MONTH) AS Month,
COUNT(*) AS Results
FROM `dataset.*`
GROUP BY month
ORDER BY month
You can use the FORMAT_DATE function if you want to control the formatting.
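What DATE_TRUNC plus COUNT(*) computes can be sketched in a few lines of Python (the timestamps are hypothetical stand-ins for the table's data): truncate each date to the first of its month, then count per bucket.

```python
# Sketch of DATE_TRUNC(date, MONTH) + COUNT(*) GROUP BY month:
# replace(day=1) truncates a date to its month, Counter does the grouping.
from collections import Counter
from datetime import date

timestamps = [date(2018, 5, 3), date(2018, 5, 20), date(2018, 6, 1)]
results = Counter(d.replace(day=1) for d in timestamps)
print(sorted(results.items()))  # (month, count) pairs
```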
You seem to need union all:
select 2018 as yyyy, 2 as mm, count(*) as num
from feb2018
union all
select 2018 as yyyy, 3 as mm, count(*)
from mar2018
union all
. . .
Note that you have a poor data model. You should be storing all the data in a single table with a date column.

SQL/HQL - Partition By

Trying to understand Partition By and getting super confused, I have the following data:
Name ID Product Date Amount
Jason 1 Car Jan 2017 $10
Jason 1 Car Feb 2017 $5
Jason 2 Car Jan 2017 $50
Jason 2 Car Feb 2017 $60
Jason 3 House Jan 2017 $20
Jason 3 House Feb 2017 $30
Would doing:
SELECT Name, ID, Product, Date, Amount,
       LAG(Amount, 1) OVER (PARTITION BY Name ORDER BY Date)
FROM table
give me Jason's correct previous month amount for the appropriate Product and ID number? So, for example at Feb 2017: Jason, ID 1 and Product Car's should give me the amount $5.
OR
Would I need to modify the Partition by to include the Product and ID, such as:
SELECT Name, ID, Product, Date, Amount,
       LAG(Amount, 1) OVER (PARTITION BY Name, ID, Product ORDER BY Date)
FROM table
Thanks!
I myself also came here in search of some understanding of the PARTITION BY clause. To answer your question: you need the second form. LAG(Amount, 1) returns the previous row within its window partition, so with PARTITION BY Name alone, all of Jason's rows across every ID and Product land in one partition, and the "previous" row may belong to a different ID or Product. Partitioning by Name, ID, Product restarts the window for each combination, so the lagged value is that combination's previous month (e.g. $5 for ID 1 / Car at Feb 2017).
Essentially, you get your existing 5 columns plus one new column containing the previous row's Amount within each (Name, ID, Product) group.
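A small check with SQLite's LAG (same SQL-standard window semantics; a numeric month column stands in for Date, both assumptions for the sketch) shows the per-group lag on the sample data:

```python
# LAG over PARTITION BY name, id, product: the previous value is NULL in
# month 1 and last month's amount in month 2, per (id, product) group.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, id INT, product TEXT, mon INT, amount INT)")
conn.executemany("INSERT INTO t VALUES (?,?,?,?,?)", [
    ("Jason", 1, "Car",   1, 10), ("Jason", 1, "Car",   2,  5),
    ("Jason", 2, "Car",   1, 50), ("Jason", 2, "Car",   2, 60),
    ("Jason", 3, "House", 1, 20), ("Jason", 3, "House", 2, 30),
])

rows = conn.execute("""
    SELECT id, mon,
           LAG(amount, 1) OVER (PARTITION BY name, id, product
                                ORDER BY mon) AS prev
    FROM t ORDER BY id, mon
""").fetchall()
print(rows)
```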

group yearmonth field by quarter in sql server

I have an int field in my database which represents year and month, e.g. 201501 stands for 2015 Jan.
I need to group by the reporting_date field and show the quarterly data. The table is in the following format: reporting_date is an int rather than a datetime, and interest_payment is a float.
reporting_date interest_payment
200401 5
200402 10
200403 25
200404 15
200406 5
200407 20
200408 25
200410 10
the output of the query should like this
reporting_date interest_payment
Q1 -2004 40
Q2 -2004 20
Q3 -2004 45
Q4 -2004 10
I tried using a normal GROUP BY statement:
select reporting_date , sum(interest_payment) as interest_payment from testTable
group by reporting_date
but got a different result. Any help would be appreciated.
Thanks!
Before grouping, you need to calculate report_quarter, which (with integer division) is equal to
(reporting_date%100-1)/3
Then do the select:
select report_year, 'Q'+cast(report_quarter+1 as varchar(1)), SUM (interest_payment)
from
(
select
*,
(reporting_date%100 - 1)/3 as report_quarter,
reporting_date/100 as report_year
from #x
) T
group by report_year, report_quarter
order by report_year, report_quarter
I see two problems here:
You need to convert reporting_date into a quarter.
You need to SUM() the values in interest_payment for each quarter.
You seem to have the right idea for (2) already, so I'll just help with (1).
If the numbers are all 6 digits (see my comment above) you can just do some numeric manipulation to turn them into quarters.
First, extract the month as the remainder of dividing by 100: MOD(reporting_date, 100).
Then convert that into a quarter (integer division): (MOD(reporting_date, 100) - 1) / 3 + 1.
Add a Q and the year (reporting_date / 100) if desired.
Finally, use that value in your GROUP BY.
You didn't specify which DBMS you are using, so you may have to convert the functions yourself.
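The integer arithmetic both answers rely on can be verified in a few lines of Python on the question's data (the label format mimics the expected output):

```python
# month = n % 100, quarter = (month - 1) // 3 + 1, year = n // 100,
# then sum interest_payment per quarter.
from collections import defaultdict

data = [(200401, 5), (200402, 10), (200403, 25), (200404, 15),
        (200406, 5), (200407, 20), (200408, 25), (200410, 10)]

totals = defaultdict(int)
for n, pay in data:
    year, month = divmod(n, 100)
    quarter = (month - 1) // 3 + 1
    totals[f"Q{quarter} -{year}"] += pay
print(dict(totals))
```

Note that on this data Q3 2004 sums to 45 (20 + 25), not 40.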