Calculate the monthly average including the date where data is missing - sql

I want to calculate the monthly average of some data using SQL query where the data resides in redshift DB.
The data is present in the following format in the table.
s_date | sales
------------+-------
2020-08-04 | 10
2020-08-05 | 20
---- | --
---- | --
The data may not be present for all the date in a month. If the data is not present for a day, it should be considered as 0.
Following query using AVG() function "group by" month as gives the average of based on the data on available date.
select trunc(date_trunc('MONTH', s_date)::timestamp) as month, avg(sales) from sales group by month;
However it does not consider the data for missing dates as 0. What should be the right query to calculate the monthly average as expected?
One more expectation is that, for the current month, the average should be calculated based on the data till today. So it should not consider entire month (like 30 or 31 days).
Regards,
Paul

Using a calendar table might be the easiest way to go here:
WITH dates AS (
SELECT date_trunc('day', t)::date AS dt
FROM generate_series('2020-01-01'::timestamp, '2020-12-31'::timestamp, '1 day'::interval) t
),
cte AS (
SELECT t.dt, COALESCE(SUM(s.sales), 0) AS sales
FROM dates t
LEFT JOIN sales s ON t.dt = s.s_date
GROUP BY t.dt
)
SELECT
LEFT(dt::text, 7) AS ym,
AVG(sales) AS avg_sales
FROM cte
GROUP BY
LEFT(dt::text, 7);
The logic here is to first generate an intermediate table in the second CTE which has one record for each data in your data set, along with the total sales for that date. Then, we aggregate by year/month, and report the average sales.

Related

Generate sales fact data from a date range

I have a "placements" table with the following fields:
id,
name (string),
start_date (date),
end_date (date),
hourly_rate (float)
My goal is to run reports to see billing forecasts based on this placement data, where 8 hours are billed each workday (M-F.) The billing forecasts need to be able to be run weekly or monthly.
For example, a new placement is made for Jan 1st 2021 to Jan 31st 2021 with a hourly_rate of $50 and I am now able to run a report for an arbitrary time period (weekly, monthy, daily) and see the additional billings for that time period.
To do this, I wanted to use the placement data above to generate another set of data in a fact table that would look like this:
id,
placement_id,
date (date) -- or could be a reference to a date dimension table
amount (float) -- placement.hourly_rate x 8 hours
Is there a way to use Postgres to generate these new rows, or is there a better way to accomplish my goal?
You can use generate_series() and a lateral join:
select p.id, d.date, p.hourly_rate * 8 as amount
from placements p
cross join lateral generate_series(start_date, end_date, '1 day') d(date)

Count distinct customers, active within a year, for every week of the year

I am working with an existing E-commerce database. Actually, this process is usually done in Excel, but we want to try it directly with a query in PostgreSQL (version 10.6).
We define as an active customer a person who has bought at least once within 1 year. This means, if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
I want the output for each week of the year (2020). Basically what I need is ...
select
email,
orderdate,
id
from
orders_table
where
paid = true;
|---------------------|-------------------|-----------------|
| email | orderdate | id |
|---------------------|-------------------|-----------------|
| email1#email.com |2020-06-02 05:04:32| Order-2736 |
|---------------------|-------------------|-----------------|
I can't create new tables. And I would like to see the output like this:
Year| Week | Active customers
2020| 25 | 6978
2020| 24 | 3948
depending on whether there is a year and week column you can use a OVER (PARTITION BY ...) with extract:
SELECT
extract(year from orderdate),
extract(week from orderdate),
sum(1) as customer_count_in_week,
OVER (PARTITION BY extract(YEAR FROM TIMESTAMP orderdate),
extract(WEEK FROM TIMESTAMP orderdate))
FROM ordertable
WHERE paid=true;
Which should bucket all orders by year and week, thus showing the total count per week in a year where paid is true.
references:
https://www.postgresql.org/docs/9.1/tutorial-window.html
https://www.postgresql.org/docs/8.1/functions-datetime.html
if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
Problems on your side
This method has some corner case ambiguities / issues:
Do you include or exclude "week 22 in 2020"? (I exclude it below to stay closer to "a year".)
A year can have 52 or 53 full weeks. Depending on the current date, the calculation is based on 52 or 53 weeks, causing a possible bias of almost 2 %!
If you start the time range on "the same date last year", then the margin of error is only 1 / 365 or ~ 0.3 %, due to leap years.
A fixed "period of 365 days" (or 366) would eliminate the bias altogether.
Problems on the SQL side
Unfortunately, window functions do not currently allow the DISTINCT key word (for good reasons). So something of the form:
SELECT count(DISTINCT email) OVER (ORDER BY year, week
GROUPS BETWEEN 52 PRECEDING AND 1 PRECEDING)
FROM ...
.. triggers:
ERROR: DISTINCT is not implemented for window functions
The GROUPS keyword has only been added in Postgres 10 and would otherwise be just what we need.
What's more, your odd frame definition wouldn't even work exactly, since the number of weeks to consider is not always 52, as discussed above.
So we have to roll our own.
Solution
The following simply generates all weeks of interest, and computes the distinct count of customers for each. Simple, except that date math is never entirely simple. But, depending on details of your setup, there may be faster solutions. (I had several other ideas.)
The time range for which to report may change. Here is an auxiliary function to generate weeks of a given year:
CREATE OR REPLACE FUNCTION f_weeks_of_year(_year int)
RETURNS TABLE(year int, week int, week_start timestamp)
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
ROWS 52 COST 10 AS
$func$
SELECT _year, d.week::int, d.week_start
FROM generate_series(date_trunc('week', make_date(_year, 01, 04)::timestamp) -- first day of first week
, LEAST(date_trunc('week', localtimestamp), make_date(_year, 12, 28)::timestamp) -- latest possible start of week
, interval '1 week') WITH ORDINALITY d(week_start, week)
$func$;
Call:
SELECT * FROM f_weeks_of_year(2020);
It returns 1 row per week, but stops at the current week for the current year. (Empty set for future years.)
The calculation is based on these facts:
The first ISO week of the year always contains January 04.
The last ISO week cannot start after December 28.
Actual week numbers are computed on the fly using WITH ORDINALITY. See:
PostgreSQL unnest() with element number
Aside, I stick to timestamp and avoid timestamptz for this purpose. See:
Generating time series between two dates in PostgreSQL
The function also returns the timestamp of the start of the week (week_start), which we don't need for the problem at hand. But I left it in to make the function more useful in general.
Makes the main query simpler:
WITH weekly_customer AS (
SELECT DISTINCT
EXTRACT(YEAR FROM orderdate)::int AS year
, EXTRACT(WEEK FROM orderdate)::int AS week
, email
FROM orders_table
WHERE paid
AND orderdate >= date_trunc('week', timestamp '2019-01-04') -- max range for 2020!
ORDER BY 1, 2, 3 -- optional, might improve performance
)
SELECT d.year, d.week
, (SELECT count(DISTINCT email)
FROM weekly_customer w
WHERE (w.year, w.week) >= (d.year - 1, d.week) -- row values, see below
AND (w.year, w.week) < (d.year , d.week) -- exclude current week
) AS active_customers
FROM f_weeks_of_year(2020) d; -- (year int, week int, week_start timestamp)
db<>fiddle here
The CTE weekly_customer folds to unique customers per calendar week once, as duplicate entries are just noise for our calculation. It's used many times in the main query. The cut-off condition is based on Jan 04 once more. Adjust to your actual reporting period.
The actual count is done with a lowly correlated subquery. Could be a LEFT JOIN LATERAL ... ON true instead. See:
What is the difference between LATERAL and a subquery in PostgreSQL?
Using row value comparison to make the range definition simple. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'

Access query to pull previous month's data

I have a table in Access 2013 (Table1) that contains the following columns:
ID (pk), ReportDate, Amount
The most current data is 30-50 days old. For example, today (6/22/16) the most recent data would be the 5/1/16 row, as the 6/1/16 data won't be entered until mid-July. (All dates in the ReportDate column are the 1st of the month, i.e.: 4/1/16, 5/1/16, etc.)
I need to write a query that will do a 6-month lookback, but exclude the most current month's data.
So, for example, if I ran the query today (6/22/16), I would only get the rows that correspond to the following months:
12/1/2015
1/1/2016
2/1/2016
3/1/2016
4/1/2016
The data for 5/1/16 should be excluded, as it's the most recent month.
I can pull the previous 6 months worth of data with setting the criteria (in QBE) for ReportDate to>=DateAdd("m",-6,Date()), but I can't seem to figure out how to exclude the most recent month.
This should give you the start date of the most recent month in your table:
SELECT Max(ReportDate) AS MaxOfReportDate
FROM Table1;
If that is the month you want to exclude, use that query as a subquery which you cross join back to the table. Then you can use a WHERE clause with a BETWEEN condition whose end points are determined by DateAdd() expressions based on MaxOfReportDate:
SELECT t.ID, t.ReportDate, t.Amount
FROM
Table1 AS t,
(
SELECT Max(ReportDate) AS MaxOfReportDate
FROM Table1
) AS sub
WHERE
t.ReportDate BETWEEN DateAdd('m', -6, sub.MaxOfReportDate)
AND DateAdd('m', -1, sub.MaxOfReportDate);

Sum of shifting range in SQL Query

I am trying to write an efficient query to get the sum of the previous 7 days worth of values from a relational DB table, and record each total against the final date in the 7 day period (e.g. the 'WeeklyTotals Table' in the example below). For example, in my WeeklyTotals query, I would like the value for February 15th to be 333, since that is the total sum of users from Feb 9th - Feb 15th, and so on:
I have a base query which gets me my previous weeks users for today's date (simplified for the sake of the example):
SELECT Date, Sum("Total Users")
FROM "UserRecords"
WHERE (dateadd(hour, -8, "UserRecords"."Date") BETWEEN
dateadd(hour, -8, sysdate) - INTERVAL '7 DAY' AND dateadd(hour, -8, sysdate);
The problem is, this only get's me the total for today's date. I need a query which will get me this information for the previous seven days.
I know I can make a view for each date (since I only need the previous seven entries) and join them all together, but that seems really inefficient (I'll have to create/update 7 views, and then do all the inner join operations). I am wondering if there's a more efficient way to achieve this.
Provided there are no gaps, you can use a running total with SUM OVER including the six previous rows. Use ROW_NUMBER to exclude the first six records, as their totals don't represent complete weeks.
select log_date, week_total
from
(
select
log_date,
sum(total_users) over (order by log_date rows 6 preceding) as week_total,
row_number() over (order by log_date) as rn
from mytable
where log_date > 0
)
where rn >= 7
order by log_date;
UPDATE: In case there are gaps, it should be
sum(total_users) over (order by log_date range interval '6' day preceding)
but I don't know whether PostgreSQL supports this already. (Moreover the ROW_NUMBER exclusion wouldn't work then and would have to be replaced by something else.)
Here's a a query that self joins to the previous 6 days and sums the value to get the weekly totals:
select u1.date, sum(u2.total_users) as weekly_users
from UserRecords u1
join UserRecords u2
on u1.date - u2.date < 7
and u1.date >= u2.date
group by u1.date
order by u1.date
You can use the SUM over Window function, with the expression using Date Part, of week.
Self joins are much slower than Window functions.

Bigquery SQL for sliding window aggregate

Hi I have a table that looks like this
Date Customer Pageviews
2014/03/01 abc 5
2014/03/02 xyz 8
2014/03/03 abc 6
I want to get page view aggregates grouped by week but showing aggregates for past 30 days - (sliding window aggregates with window-size of 30 days for every week)
I am using google bigquery
EDIT: Gordon - re your comment about "Customer", Actually what I need is slightly more complicated thats why I included customer in the table above. I am looking to get the number of customers who had >n pageviews in a 30day window every week. something like this
Date Customers>10 pageviews in 30day window
2014/02/01 10
2014/02/08 5
2014/02/15 6
2014/02/22 15
However to keep it simple, I will work my way if I could just get a sliding window aggregate of pageviews ignoring customers altogether. something like this
Date count of pageviews in 30day window
2014/02/01 50
2014/02/08 55
2014/02/15 65
2014/02/22 75
How about this:
SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
SELECT changes,
LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
login,
week
FROM (
SELECT SUM(payload_pull_request_changed_files) changes,
UTC_USEC_TO_WEEK(created_at, 1) week,
actor_attributes_login login,
FROM [publicdata:samples.github_timeline]
WHERE payload_pull_request_changed_files > 0
GROUP BY week, login
))
HAVING changes28days > 0
For each user it counts how many changes they have submitted per week. Then with LAG() we can peek into the next row, how many changes they submitted the -1, -2, and -3 week. Then we just add those 4 weeks to see how many changes were submitted on the last 28 days.
Now you can wrap everything in a new query to filter users with changes>X, and count them.
I have created the following "Times" table:
Table Details: Dim_Periods
Schema
Date TIMESTAMP
Year INTEGER
Month INTEGER
day INTEGER
QUARTER INTEGER
DAYOFWEEK INTEGER
MonthStart TIMESTAMP
MonthEnd TIMESTAMP
WeekStart TIMESTAMP
WeekEnd TIMESTAMP
Back30Days TIMESTAMP -- the date 30 days before "Date"
Back7Days TIMESTAMP -- the date 7 days before "Date"
and I use such query to handle "running sums"
SELECT Date,Count(*) as MovingCNT
FROM
(SELECT Date,
Back7Days
FROM DWH.Dim_Periods
where Date < timestamp(current_date()) AND
Date >= (DATE_ADD (CURRENT_TIMESTAMP(), -5, 'month'))
)P
CROSS JOIN EACH
(SELECT repository_url,repository_created_at
FROM publicdata:samples.github_timeline
) L
WHERE timestamp(repository_created_at)>= Back7Days
AND timestamp(repository_created_at)<= Date
GROUP EACH BY Date
Note that it can be used for "Month to date", Week to Date" "30 days back" etc. aggregations as well.
However, performance is not the best and the query can take a while on larger data sets due to the Cartesian join.
Hope this helps