How do I calculate daily averages for multi-year data in SQL? - sql

I have a daily weather data SQL table with columns including date, type (temperature, rain, wind etc measurement) and value. The dataset spans 20 years of data.
How can I calculate daily averages for each day and measurement type, averaging values for the given date from data for all the 20 years in question? So e.g. I want to see the average temperature for 1 Jan (average of temperatures for 1 Jan 2020, 1 Jan 2019, etc)
Given there's a total of 750 million rows of data, should I create a materialised view of the calculations or what's the best way to cache the answers?

You need to extract the month and day from the date. The standard SQL function uses extract():
select extract(month from date) as month, extract(day from date) as day,
avg(temperature), avg(rain), . . .
from t
group by extract(month from date), extract(day from date);
Not all databases support these standard functions so you may need to use the functions specific to your (unspecified) database.

it would depend on which sql server you use, but in general, you should extract the day and the month from the date (on Microsoft SQL Server it is the DATEPART function) and then group by that and calculate the averages.
SELECT DATEPART(month, date_col) AS Month,
DATEPART(day, date_col) AS Day,
AVG(temp) AS Temp,
AVG(rain) AS Rain,
...
FROM table
GROUP BY DATEPART(month, date_col), DATEPART(day, date_col)

There is an extension to postgresql called timescaledb that makes it easier to query this type of data. Beware that it does make changes to the postgresql-database that requires changes to backup-routines. And if the current database is partitioned it will require a dump and restore.
A query can look like this:
-- By month
select
extract(year from created_at) as year,
extract(month from time_bucket('1 day', created_at)) as month,
min(temp) as temp,
from
readings
where
created_at > '2019-01-01' and created_at < '2020-01-01'
group by
year,
month
order by
year,
month;

750 Mio rows. You need an efficient index. Consider this function and the index based on it.
Assuming a table weather with a date column date:
CREATE FUNCTION f_mmdd(date) -- or timestamp input?
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';
CREATE INDEX weather_mmdd_idx ON weather(f_mmdd(date));
This index helps to quickly identify all rows for a particular day of the year.
The manual about EXTRACT.
The above expression proved fastest for various reasons. Just re-ran some performance tests in Postgres 13, and nothing changed.
Details in this closely related answer:
How do you do date math that ignores the year?
There is also EXTRACT(doy FROM date) to extract the day of the year (1–365/366), which is even faster. But, obviously, there is an off-by-one error for dates past Feb 29 in leap years in the Gregorian calendar.
Then the query for Jan 01 can be:
SELECT date_trunc('day', date) -- if it's a timestamp column
-- date -- if it's really a date column (which I find hard to believe)
, avg(temperature) AS avg_temperature
, avg(rain) AS avg_rain
-- , ...
FROM weather
WHERE f_mmdd(date) = f_mmdd('2000-01-01') -- or just 101 for Jan 01
GROUP BY 1;
The year in f_mmdd('2000-01-01') is arbitrary. Or just use the integer 101 for Jan 01.
You might be able to optimize further with multicolumn indexes for particular dimensions (temperature, rain, ...). But that depends on undisclosed details.
Sounds like the dataset isn't going to change. So a MATERIALIZED VIEW with readily computed aggregates per day might be a better alternative in the long run.
A word of warning: Computed averages are only correct if the measurements are spread out evenly across each day. Else, computed numbers are just averages of the given numbers, not actual average values for each day.

Related

Count distinct customers, active within a year, for every week of the year

I am working with an existing E-commerce database. Actually, this process is usually done in Excel, but we want to try it directly with a query in PostgreSQL (version 10.6).
We define as an active customer a person who has bought at least once within 1 year. This means, if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
I want the output for each week of the year (2020). Basically what I need is ...
select
email,
orderdate,
id
from
orders_table
where
paid = true;
|---------------------|-------------------|-----------------|
| email | orderdate | id |
|---------------------|-------------------|-----------------|
| email1#email.com |2020-06-02 05:04:32| Order-2736 |
|---------------------|-------------------|-----------------|
I can't create new tables. And I would like to see the output like this:
Year| Week | Active customers
2020| 25 | 6978
2020| 24 | 3948
depending on whether there is a year and week column you can use a OVER (PARTITION BY ...) with extract:
SELECT
extract(year from orderdate),
extract(week from orderdate),
sum(1) as customer_count_in_week,
OVER (PARTITION BY extract(YEAR FROM TIMESTAMP orderdate),
extract(WEEK FROM TIMESTAMP orderdate))
FROM ordertable
WHERE paid=true;
Which should bucket all orders by year and week, thus showing the total count per week in a year where paid is true.
references:
https://www.postgresql.org/docs/9.1/tutorial-window.html
https://www.postgresql.org/docs/8.1/functions-datetime.html
if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
Problems on your side
This method has some corner case ambiguities / issues:
Do you include or exclude "week 22 in 2020"? (I exclude it below to stay closer to "a year".)
A year can have 52 or 53 full weeks. Depending on the current date, the calculation is based on 52 or 53 weeks, causing a possible bias of almost 2 %!
If you start the time range on "the same date last year", then the margin of error is only 1 / 365 or ~ 0.3 %, due to leap years.
A fixed "period of 365 days" (or 366) would eliminate the bias altogether.
Problems on the SQL side
Unfortunately, window functions do not currently allow the DISTINCT key word (for good reasons). So something of the form:
SELECT count(DISTINCT email) OVER (ORDER BY year, week
GROUPS BETWEEN 52 PRECEDING AND 1 PRECEDING)
FROM ...
.. triggers:
ERROR: DISTINCT is not implemented for window functions
The GROUPS keyword has only been added in Postgres 10 and would otherwise be just what we need.
What's more, your odd frame definition wouldn't even work exactly, since the number of weeks to consider is not always 52, as discussed above.
So we have to roll our own.
Solution
The following simply generates all weeks of interest, and computes the distinct count of customers for each. Simple, except that date math is never entirely simple. But, depending on details of your setup, there may be faster solutions. (I had several other ideas.)
The time range for which to report may change. Here is an auxiliary function to generate weeks of a given year:
CREATE OR REPLACE FUNCTION f_weeks_of_year(_year int)
RETURNS TABLE(year int, week int, week_start timestamp)
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
ROWS 52 COST 10 AS
$func$
SELECT _year, d.week::int, d.week_start
FROM generate_series(date_trunc('week', make_date(_year, 01, 04)::timestamp) -- first day of first week
, LEAST(date_trunc('week', localtimestamp), make_date(_year, 12, 28)::timestamp) -- latest possible start of week
, interval '1 week') WITH ORDINALITY d(week_start, week)
$func$;
Call:
SELECT * FROM f_weeks_of_year(2020);
It returns 1 row per week, but stops at the current week for the current year. (Empty set for future years.)
The calculation is based on these facts:
The first ISO week of the year always contains January 04.
The last ISO week cannot start after December 28.
Actual week numbers are computed on the fly using WITH ORDINALITY. See:
PostgreSQL unnest() with element number
Aside, I stick to timestamp and avoid timestamptz for this purpose. See:
Generating time series between two dates in PostgreSQL
The function also returns the timestamp of the start of the week (week_start), which we don't need for the problem at hand. But I left it in to make the function more useful in general.
Makes the main query simpler:
WITH weekly_customer AS (
SELECT DISTINCT
EXTRACT(YEAR FROM orderdate)::int AS year
, EXTRACT(WEEK FROM orderdate)::int AS week
, email
FROM orders_table
WHERE paid
AND orderdate >= date_trunc('week', timestamp '2019-01-04') -- max range for 2020!
ORDER BY 1, 2, 3 -- optional, might improve performance
)
SELECT d.year, d.week
, (SELECT count(DISTINCT email)
FROM weekly_customer w
WHERE (w.year, w.week) >= (d.year - 1, d.week) -- row values, see below
AND (w.year, w.week) < (d.year , d.week) -- exclude current week
) AS active_customers
FROM f_weeks_of_year(2020) d; -- (year int, week int, week_start timestamp)
db<>fiddle here
The CTE weekly_customer folds to unique customers per calendar week once, as duplicate entries are just noise for our calculation. It's used many times in the main query. The cut-off condition is based on Jan 04 once more. Adjust to your actual reporting period.
The actual count is done with a lowly correlated subquery. Could be a LEFT JOIN LATERAL ... ON true instead. See:
What is the difference between LATERAL and a subquery in PostgreSQL?
Using row value comparison to make the range definition simple. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'

Rounding dates in SQL

I'd like to figure out the age of a person based on two dates: their birthday and the date they were created in a database.
The age is being calculated in days instead of years, though. Here's my query:
SELECT date_of_birth as birthday, created_at, (created_at - date_of_birth) as Age
FROM public.users
WHERE date_of_birth IS NOT NULL
The date_of_birth field is a date w/o a timestamp, but the created_at field is a date with a timestamp (e.g. 2017-05-06 01:27:40).
And my output looks like this:
0 years 0 mons 9645 days 1 hours 27 mins 40.86485 secs
Any idea how can I round/calculate the ages by the nearest year?
Using PostgreSQL.
If you are using MS SQLServer than you could
CONVERT(DATE, created_at)
and than calculate difference in months like
DATEDIFF(month, created_at, GETDATE())/12
means you can use reminder in months to add or substract one year.
In PostgreSQL, dates are handled very differently to MSSQL & MySQL. In fact it follows the SQL standard very well, even if it’s not always intuitive.
To actually calculate the age of something, you can use age():
SELECT age(date1,date1)
Like all of PostgreSQL’s functions, there are variations of data type, and you may need to do something like this:
SELECT age(date1::date,date1::date)
or, more formally:
SELECT age(cast(date1 as date),cast(date1 as date))
The result will be an interval, which displays as a string :
SELECT age(current_date::date,'1981-01-17'::date);
-- 36 years 3 mons 22 days
If you just want the age in years, you can use extract:
SELECT extract('year' from age(current_date::date,'1981-01-17'::date));
Finally, if you want it correct to the nearest year, you can apply the old trick of adding half an interval:
extract('year' from age(current_date::date,'1981-01-17'::date)+interval '.5 year');
It’s not as simple as some of the other DBMS products, but it’s much more flexible, if you can get your head around it.
Here are some references:
https://www.postgresql.org/docs/current/static/functions-datetime.html
http://www.sqlines.com/postgresql/how-to/datediff

Sql daily comparison by month

I am writing a report for work which requires that I compare the amount of students dropped on a daily basis, what I mean is the report needs to show that on today the 5th of august X amounts of students dropped from the 1st to the 5th compared to X amount which dropped within the 1st to the 5th of July and so on for each Month of the year. Can anyone please help me by providing me with a query which I can use to have that info? thanks.
You want to compare the first number of days from a month. The following query gives you an example:
select yr, mon, count(*)
from (select extract(year from date) as yr, extract(month from date) as mon,
extract(day from date) as day
from t
) t
where day <= extract(day from now())
group by yr, mon
However, the exact syntax may depend on your database. For instance, the current date may be now(), getdate(), system_date or something else. Some databases don't support the extract, but most have a way to get the month and day of month.

PostgreSQL - Getting statistical data

I need to collect some statistical information in my application.
I have a table of users (tb_user)
Every time a new user accesses the application, it adds a new record in this table, ie, one line for each user. The main field are id and date_hour (timestamp for the first time user accessed the application).
tb_user
id (bigint) | date_time (timestamp with time zone)
1 | 2012-01-29 11:29:50.359-03
2 | 2012-01-31 14:27:10.359-03
I need get:
amount average users by day, week and month
Example:
by day: 55.45
by week : XX.XX
month: XX.XX
EDIT:
My best solution was:
WITH daily_count AS (SELECT COUNT(id) AS user_count FROM tb_user)
SELECT user_count, tbaux2.days, (user_count/tbaux2.days) FROM daily_count,
(SELECT EXTRACT(DAY FROM (t2.diff) ) + 1 AS days
FROM
(with tbaux AS(SELECT min(date_time) AS min FROM tb_user)
SELECT (now() - min) AS diff
FROM tbaux) AS t2) AS tbaux2
GROUP BY user_count, tbaux2.days
But this solution only worked with EXTRACT (DAY ... With weeks and month did not work
Any help is welcome.
Alternatively:
SELECT user_count, tbaux2.days, (user_count/tbaux2.days) AS userPerDay, ((user_count/tbaux2.days) * 7) AS userPerWeek, ((user_count/tbaux2.days) * 30) AS userPerMonth
EDIT 2:
Based on responses from #Bruno, there are some considerations:
When I asked the question, in really I requested a way to select data by day, month and year. I believe that the search that I posted and #Bruno refined, should be interpreted as average of "a day, every 7 days and every 30 days" and not by days, weeks and months. I believe that if it is interpreted in this way, there not will be problems of gender-quoted in example (10% drop). I believe this approach of "every" is answer I need in moment, so will sign this answer.
I suggest as an improvement of post:
Consider only closed day in result (not collect users of the current day, and not counting the current day in division)
The result is two numeric digits.
New research considering a data really per week and per month.
Thanks.
You should look into aggregate functions (min, max, count, avg), which go hand in hand with GROUP BY. For date-based aggregations, date_trunc is also useful.
For example, this will return the number of rows per day:
SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('day', date_time);
You can then do the daily average using something like this (with a CTE):
WITH daily_count AS (SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count FROM tb_user
GROUP BY date_trunc('day', date_time))
SELECT AVG(user_count) FROM daily_count;
Use 'week' instead of day for the weekly counts, and so on (see date_trunc documentation).
EDIT: (Following comment: average up to and including 5/1/2012, i.e. before the 6th.)
WITH daily_count AS (SELECT date_trunc('day', date_time) AS day_start,
COUNT(id) AS user_count
FROM tb_user
WHERE date_time >= DATE('2012-01-01') AND date_time < DATE('2012-01-06')
GROUP BY date_trunc('day', date_time))
SELECT SUM(user_count)/(DATE('2012-01-06') - DATE('2012-01-01')) FROM daily_count;
What's above is over-complicated, in this case. This should give you the same result:
SELECT COUNT(id)/(DATE('2012-01-06') - DATE('2012-01-01'))
FROM tb_user
WHERE date_time >= DATE('2012-01-01') AND date_time < DATE('2012-01-06');
EDIT 2: After your edit, I guess what you're after is just a single global average for the entire period of existence of your database, rather than groups by month/week/day.
This should give you the average number of rows per day:
WITH total_min_max AS (SELECT
COUNT(id) AS total_visits,
MIN(date_time) AS first_date_time,
MAX(date_time) AS last_date_time,
FROM tb_user)
SELECT total_visits/((last_date_time::date-first_date_time::date)+1) AS users_per_day
FROM total_min_max
(I would replace last_date_time with NOW() to make the average over the time until now, rather than until the last visit, if there's no recent visit.)
Then, for daily, weekly, and "monthly":
WITH daily_avg AS (
WITH total_min_max AS (SELECT
COUNT(id) AS total_visits,
MIN(date_time) AS first_date_time,
MAX(date_time) AS last_date_time,
FROM tb_user)
SELECT total_visits/((last_date_time::date-first_date_time::date)+1) AS users_per_day
FROM total_min_max)
SELECT
users_per_day,
(users_per_day * 7) AS users_per_week,
(users_per_month * 30) AS users_per_month
FROM daily_avg
This being said, conclusions you draw from such statistics might not be great, especially if you want to see how it changes.
I would also normalise the data per day rather than assuming 30 days in a month (if not per hour, because not all days have 24 hours). Say you have 10 visits per day in Jan 2011 and 10 visits per day in Feb 2011. That gives you 310 visits in Jan and 280 visits in Feb. If you don't pay attention, you could think you've had a almost a 10% drop in terms of number of visitors, so something went wrong in Feb, when really, this isn't the case.

SQL queries with date types

I am wanting to do some queries on a sales table and a purchases table, things like showing the total costs, or total sales of a particular item etc.
Both of these tables also have a date field, which stores dates in sortable format(just the default I suppose?)
I am wondering how I would do things with this date field as part of my query to use date ranges such as:
The last year, from any given date of the year
The last 30 days, from any given day
To show set months, such as January, Febuary etc.
Are these types of queries possible just using a DATE field, or would it be easier to store months and years as separate tex fields?
If a given DATE field MY_DATE, you can perform those 3 operation using various date functions:
1. Select last years records
SELECT * FROM MY_TABLE
WHERE YEAR(my_date) = YEAR(CURDATE()) - 1
2. Last 30 Days
SELECT * FROM MY_TABLE
WHERE DATE_SUB(CURDATE(), INTERVAL 30 DAY) < MY_DATE
3. Show the month name
SELECT MONTHNAME(MY_DATE), * FROM MY_TABLE
I have always found it advantageous to store dates as Unix timestamps. They're extremely easy to sort by and query by range, and MySQL has built-in features that help (like UNIX_TIMESTAMP() and FROM_UNIXTIME()).
You can store them in INT(11) columns; when you program with them you learn quickly that a day is 86400 seconds, and you can get more complex ranges by multiplying that by a number of days (e.g. a month is close enough to 86400 * 30, and programming languages usually have excellent facilities for converting to and from them built into standard libraries).