SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields: time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
  MAX(DATE_TRUNC('DAY', time)) AS last_day,
  SUM(balance) FILTER (WHERE is_spent_column IS NULL) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I need an equivalent query based on window functions that reports the total value of the balance column for all the days up to and including each given date.
Here is what I've tried so far, but without any valid result:
SELECT
  DATE_TRUNC('DAY', time) AS daily,
  SUM(SUM(balance) FILTER (WHERE is_spent_column IS NULL)) OVER (ORDER BY DATE_TRUNC('DAY', time)) AS total_value_per_day
FROM tbl
GROUP BY 1
ORDER BY 1 DESC
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above to the entire dataset up to and including the desired day. For example, for day 2009-01-31 the result is 97.13522530000000000000; for day 2009-01-15, filtering time as time < '2009-01-16 00:00:00', it returns 24.446144000000000000.
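For reference, that per-day manual computation is just the first query with the time filter appended:
SELECT
  MAX(DATE_TRUNC('DAY', time)) AS last_day,
  SUM(balance) FILTER (WHERE is_spent_column IS NULL) AS value_at_last_day
FROM tbl
WHERE time < '2009-01-16 00:00:00'  -- returns 24.446144000000000000 for day 2009-01-15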
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The differences between the queries' result sets were caused by the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.

Consider calculating the running total with a window function after aggregating the data to the day level. And since you aggregate with a single condition, the FILTER clause can be converted to a plain WHERE:
SELECT daily,
  SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
  SELECT
    DATE_TRUNC('DAY', time) AS daily,
    SUM(balance) AS total_balance
  FROM tbl
  WHERE is_spent_column IS NULL
  GROUP BY 1
) AS daily_agg
ORDER BY daily

Related

Grouping by last day of each month—inefficient running

I am attempting to pull month end balances from all accounts a customer has for every month. Here is what I've written. This runs correctly and gives me what I want—but it also runs extremely slowly. How would you recommend speeding it up?
SELECT DISTINCT
[AccountNo]
,SourceDate
,[AccountType]
,[AccountSub]
,[Balance]
FROM [DW].[dbo].[dwFactAccount] AS fact
WHERE SourceDate IN (
SELECT MAX(SOURCEDATE)
FROM [DW].[dbo].[dwFactAccount]
GROUP BY MONTH(SOURCEDATE),
YEAR (SOURCEDATE)
)
ORDER BY SourceDate DESC
I'd try a window function:
SELECT * FROM (
  SELECT
    [AccountNo]
    ,[SourceDate]
    ,[AccountType]
    ,[AccountSub]
    ,[Balance]
    ,ROW_NUMBER() OVER(
      PARTITION BY accountno, EOMONTH(sourcedate)
      ORDER BY sourcedate DESC
    ) AS rn
  FROM [DW].[dbo].[dwFactAccount]
) x
WHERE x.rn = 1
The row number establishes an incrementing counter in order of sourcedate descending. The counter restarts from 1 when the month in sourcedate changes (or the account number changes), because the EOMONTH function quantises any date in a given month to the last date of that month (2020-03-09 12:34:56 becomes 2020-03-31, as do all other datetimes in March). Any similar tactic that quantises to a fixed date in the month would also work, such as partitioning by YEAR(sourcedate), MONTH(sourcedate).
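For illustration, a minimal sketch of that alternative, partitioning by YEAR and MONTH instead of EOMONTH (same table and columns as above):
SELECT * FROM (
  SELECT
    [AccountNo]
    ,[SourceDate]
    ,[Balance]
    ,ROW_NUMBER() OVER(
      PARTITION BY accountno, YEAR(sourcedate), MONTH(sourcedate)  -- same quantising effect as EOMONTH
      ORDER BY sourcedate DESC
    ) AS rn
  FROM [DW].[dbo].[dwFactAccount]
) x
WHERE x.rn = 1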
You need to build a date dimension table with the date as its primary key, and have SourceDate in the fact table reference that date dimension table.
The date dimension table can have month, year, week, is_weekend, is_holiday, etc. columns. Join your fact table with the date dimension table and you can group the data using any columns in the date table you want.
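As a rough sketch of that design; the DimDate table name and its is_month_end column below are assumptions for illustration, not an existing schema:
SELECT fact.[AccountNo], fact.[SourceDate], fact.[AccountType], fact.[AccountSub], fact.[Balance]
FROM [DW].[dbo].[dwFactAccount] AS fact
JOIN [DW].[dbo].[DimDate] AS d                -- hypothetical date dimension table
  ON d.[Date] = CAST(fact.[SourceDate] AS date)
WHERE d.[is_month_end] = 1                    -- hypothetical flag marking the last day of each month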
Your absolute first step should be to view the execution plan for the query and determine why the query is slow.
The following explains how to see a graphical execution plan:
Display an Actual Execution Plan
The steps to interpreting the plan and optimizing the query are too much for an SO answer, but you should be able to find some good articles on the topic by Googling. You could also post the plan in an edit to your question and get some real feedback on what steps to take to improve query performance.
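If it helps, one common first measurement is to enable I/O and timing statistics for the session before running the query (standard SQL Server options):
SET STATISTICS IO ON;    -- reports logical/physical reads per table
SET STATISTICS TIME ON;  -- reports parse/compile and execution times
-- ... run the slow query here and inspect the Messages output ...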

Returning single records per month

I have a use case that needs to return only a single row for every end of month.
I tried using SELECT DISTINCT, but it shows multiple records for the same end of month:
SELECT DISTINCT
  CASE
    WHEN eff_interest_balance < 0.01 THEN trial_balance_date
    WHEN date_paid < trial_balance_date THEN date_paid
  END AS A
  , period
FROM dbo.Intpayments
WHERE loan_number = 60023
ORDER BY period ASC
Each row should return a single date for each month.
Distinct is returning unique rows, not grouping them. You are looking to aggregate rows. This means using some combination of aggregate functions and group by.
What your current query is missing is some sort of logic for aggregating the rows that are in the same period. Do you want to compare the sum of these values? The min, the max?
In any case, the basic idea of aggregating and grouping would look like this. The SUM is probably not what you want, but the query shows the pattern:
SELECT
  period
  , SUM(eff_interest_balance) AS SumOfBalance
FROM dbo.Intpayments
WHERE loan_number = 60023
GROUP BY period
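If the goal is a single date per month as described, one possibility is to wrap the original CASE expression in an aggregate; this is a sketch, and whether MAX is the right aggregate depends on your business logic:
SELECT
  period
  , MAX(CASE
          WHEN eff_interest_balance < 0.01 THEN trial_balance_date
          WHEN date_paid < trial_balance_date THEN date_paid
        END) AS A      -- latest qualifying date per period; use MIN for the earliest
FROM dbo.Intpayments
WHERE loan_number = 60023
GROUP BY period
ORDER BY period ASC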

How to group timestamps into islands (based on arbitrary gap)?

Consider this list of dates as timestamptz:
I grouped the dates by hand using colors: every group is separated from the next by a gap of at least 2 minutes.
I'm trying to measure how much a given user studied by looking at when they performed an action (the data records when they finished studying a sentence). E.g., in the yellow block I'd consider that the user studied in one sitting, from 14:24 till 14:27, or roughly 3 minutes in a row.
I see how I could group these dates with a programming language by going through all of the dates and looking for the gap between two rows.
My question is: how would one go about grouping dates in this way with Postgres?
(Looking for 'gaps' on Google or SO brings too many irrelevant results; I think I'm missing the vocabulary for what I'm trying to do here.)
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
  SELECT done
       , lag(done) OVER (ORDER BY done) <= done - interval '2 min' AS step
  FROM tbl
) sub
ORDER BY done;
The subquery sub returns step = true if the previous row is at least 2 min away, sorted by the timestamp column done itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp), combining the aggregate FILTER clause with another window function.
fiddle
Related:
Query to find all timestamps more than a certain interval apart
How to label groups in postgresql when group belonging depends on the preceding line?
Select longest continuous sequence
Grouping or Window
About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
Conditional lead/lag function PostgreSQL?
Building on Erwin's answer, here is the full query for tallying up the amount of time people spent on those sessions/islands:
My data only shows when people finished reviewing something, not when they started, which means we don't know when a session truly started; and some islands contain only one timestamp (leading to a 0-duration estimate). I account for both by calculating the average review time and adding it to the total duration of the islands.
This is likely very idiosyncratic to my use case, but I learned a thing or two in the process, so maybe this will help someone down the line.
-- Returns estimated total study time and average time per review, both in seconds
SELECT (EXTRACT(EPOCH FROM logged) + countofislands * avgreviewtime) AS totalstudytime, avgreviewtime -- add total logged time to estimate for first-review-in-island and 1-review islands
FROM (
  SELECT -- get the three key values that will let us calculate total time spent
    sum(duration) AS logged
    , count(island) AS countofislands
    , EXTRACT(EPOCH FROM sum(duration) FILTER (WHERE duration != '00:00:00'::interval))
      / (sum(reviews) FILTER (WHERE duration != '00:00:00'::interval)
         - count(reviews) FILTER (WHERE duration != '00:00:00'::interval)) AS avgreviewtime
  FROM (
    SELECT island, age(max(done), min(done)) AS duration, count(island) AS reviews -- calculate the duration of islands
    FROM (
      SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS island -- give a unique number to each island
      FROM (
        SELECT -- detect the beginning of islands
          done,
          (lag(done) OVER (ORDER BY done) <= done - interval '2 min') AS step
        FROM review
        WHERE clicker_id = 71 AND "done" > '2015-05-13' AND "done" < '2015-05-13 15:00:00' -- keep the queries small and fast for now
      ) sub
      ORDER BY done
    ) grouped
    GROUP BY island
  ) sessions
) summary

How can I find the Max/Min value in SQL Server as well as the time it occurred at?

I have a table that is updated every 10 minutes with flow-rate data. At the end of the day I need a query that will generate the average flow, the minimum flow value and the time it occurred (same with the max), and then the total flow for the day. I have this so far:
SELECT *
FROM [WS6].[dbo].[MasterData]
Where [_Datetime] between '2015-07-06' and '2015-07-06 23:59:59'
This gets me the whole day's data, but how can I filter for the values I need? Thank you
In SQL Server 2012, you can use first_value():
SELECT DISTINCT
  avg(flow) OVER () AS avgFlow,
  min(flow) OVER () AS minFlow,
  max(flow) OVER () AS maxFlow,
  sum(flow) OVER () AS totalFlow,
  first_value(_DateTime) OVER (ORDER BY flow ASC) AS minTime,   -- time of the minimum flow
  first_value(_DateTime) OVER (ORDER BY flow DESC) AS maxTime   -- time of the maximum flow
FROM [WS6].[dbo].[MasterData]
WHERE [_Datetime] >= '2015-07-06' AND [_Datetime] < '2015-07-07';
In earlier versions you can use window functions or cross apply.
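For earlier versions, here is a hedged sketch of the CROSS APPLY route, using the question's table and date range; TOP (1) ... ORDER BY picks the row holding each extreme:
SELECT agg.avgFlow, agg.minFlow, agg.maxFlow, agg.totalFlow, tmin.minTime, tmax.maxTime
FROM (
  SELECT AVG(flow) AS avgFlow, MIN(flow) AS minFlow, MAX(flow) AS maxFlow, SUM(flow) AS totalFlow
  FROM [WS6].[dbo].[MasterData]
  WHERE [_Datetime] >= '2015-07-06' AND [_Datetime] < '2015-07-07'
) agg
CROSS APPLY (
  SELECT TOP (1) [_DateTime] AS minTime        -- time of the lowest flow
  FROM [WS6].[dbo].[MasterData]
  WHERE [_Datetime] >= '2015-07-06' AND [_Datetime] < '2015-07-07'
  ORDER BY flow ASC
) tmin
CROSS APPLY (
  SELECT TOP (1) [_DateTime] AS maxTime        -- time of the highest flow
  FROM [WS6].[dbo].[MasterData]
  WHERE [_Datetime] >= '2015-07-06' AND [_Datetime] < '2015-07-07'
  ORDER BY flow DESC
) tmax;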
It depends on what you are trying to get the min and max value of. For example, if you are trying to get the min and max of a field between those dates, you would do something like this, where _field is the field you want the value for (will return 1 or no rows):
SELECT MIN(_field) AS 'MIN', MAX(_field) AS 'MAX'
FROM [WS6].[dbo].[MasterData]
Where [_Datetime] between '2015-07-06' and '2015-07-06 23:59:59'
If you are trying to get the min and max of a field grouped by another field, you can try something like this (it may return many rows, one row, or none):
SELECT Group_by_field, MIN(_field) AS 'MIN', MAX(_field) AS 'MAX'
FROM [WS6].[dbo].[MasterData]
Where [_Datetime] between '2015-07-06' and '2015-07-06 23:59:59'
Group by Group_by_field
It would be helpful if you provided a sample table and sample output of what you want to do.

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [Datastore.PerformanceDatum]
  WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming GROUP EACH BY is here not parallelizable.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [io_sensor_data.moscone_io13]
  WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT data value, dtimestamp / 1000 as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [io_sensor_data.moscone_io13]
  WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to reproduce the same error, and I was able to get it by doubling the amount of initial data. An easy "hack" to double the data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional query to only analyze a fraction of the days (for example, only last month)?
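For illustration, such a filter could be added to the inner query of the original statement; the timestamp bound below is a placeholder (legacy BigQuery SQL, with timestamp in microseconds since the epoch, as UTC_USEC_TO_DAY expects):
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [Datastore.PerformanceDatum]
  WHERE type = "MemoryPerf"
    AND timestamp >= 1370044800000000  -- placeholder: only analyze days from 2013-06-01 onward
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;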