I am gathering analytics for my app. For each metric I track, I allow it to be viewed over an interval of 7, 30 or 90 days, along with grouping by date, by weekday or by time of day.
What's the best approach to handle this?
Is it possible to avoid having to perform 6 different queries for each metric (one for each interval, one for each grouping)?
Example
Median conversation response time (group by day of week) Analytic(mon, tue, wed..)
Median conversation response time (group by time of day) Analytic(1 am, 2 am, 3 am..)
New conversations (group by day of week) Analytic(mon, tue, wed..)
New conversations (group by date) Analytic(20 aug, 19 aug, 18 aug etc...)
Sorry for mixing a little PostgreSQL with SQL Server. The @... names (e.g. @Param) are SQL Server syntax for variables. I don't know the PostgreSQL syntax for them, but hopefully this will still illustrate the technique.
SELECT
CASE
WHEN @Param = 'date' THEN date
WHEN @Param = 'dow' THEN EXTRACT(DOW FROM date)
....
END AS ParamGroup
,SUM(amount) AS TotalAmount
FROM
Table
WHERE
date BETWEEN @LowDateParam AND @HighDateParam
GROUP BY
CASE
WHEN @Param = 'date' THEN date
WHEN @Param = 'dow' THEN EXTRACT(DOW FROM date)
....
END
I typed this before you got your example out, so I hope it is still clear. Anyway, one way you can do it is to use a CASE expression and then group by the same CASE expression. You may have to cast the THEN/ELSE branches of the CASE expression to VARCHAR or similar to ensure a consistent data type if it is ever possible for it to be grouped by multiple levels, but it should get you there.
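As a rough, runnable illustration of the same idea, here is a minimal Python sketch using the standard library's sqlite3 module. The table name `sales`, the sample rows and the helper `totals` are made up for the example, and SQLite's strftime('%w', ...) stands in for EXTRACT(DOW ...). The grouping name is passed as a bound parameter into the CASE expression, and the 7/30/90-day interval is just the date range.

```python
import sqlite3

# One query serves every grouping: the grouping name is a bound parameter
# that selects a branch of the CASE expression.
QUERY = """
SELECT CASE
         WHEN :grp = 'date' THEN date
         WHEN :grp = 'dow'  THEN strftime('%w', date)  -- '0' = Sunday ... '6' = Saturday
       END AS param_group,
       SUM(amount) AS total_amount
FROM sales
WHERE date BETWEEN :low AND :high
GROUP BY param_group
ORDER BY param_group
"""

def totals(conn, grp, low, high):
    """Sum `amount` per group for rows whose date lies in [low, high]."""
    return conn.execute(QUERY, {"grp": grp, "low": low, "high": high}).fetchall()

# Hypothetical sample data: two Mondays and one Tuesday in August 2023.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2023-08-14", 10.0), ("2023-08-15", 5.0), ("2023-08-21", 7.0)],
)

by_dow = totals(conn, "dow", "2023-08-01", "2023-08-31")
by_date = totals(conn, "date", "2023-08-01", "2023-08-31")
```

Both branches return text here, which sidesteps the data type concern mentioned above.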
Related
I have a daily weather data SQL table with columns including date, type (temperature, rain, wind etc measurement) and value. The dataset spans 20 years of data.
How can I calculate daily averages for each day and measurement type, averaging values for the given date from data for all the 20 years in question? So e.g. I want to see the average temperature for 1 Jan (average of temperatures for 1 Jan 2020, 1 Jan 2019, etc)
Given there's a total of 750 million rows of data, should I create a materialised view of the calculations or what's the best way to cache the answers?
You need to extract the month and day from the date. Standard SQL provides extract() for this:
select extract(month from date) as month, extract(day from date) as day,
avg(temperature), avg(rain), . . .
from t
group by extract(month from date), extract(day from date);
Not all databases support these standard functions so you may need to use the functions specific to your (unspecified) database.
It depends on which SQL database you use, but in general you should extract the day and the month from the date (on Microsoft SQL Server, with the DATEPART function), then group by those and calculate the averages.
SELECT DATEPART(month, date_col) AS Month,
DATEPART(day, date_col) AS Day,
AVG(temp) AS Temp,
AVG(rain) AS Rain,
...
FROM table
GROUP BY DATEPART(month, date_col), DATEPART(day, date_col)
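For a quick way to try the month/day grouping, here is a small Python sketch with sqlite3. It assumes the long layout described in the question (a hypothetical table `weather` with columns date, type and value), so it also groups by type. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, type TEXT, value REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("2019-01-01", "temperature", 2.0),
        ("2020-01-01", "temperature", 4.0),
        ("2020-01-01", "rain", 1.5),
        ("2020-01-02", "temperature", 6.0),
    ],
)

# Group by calendar month and day (ignoring the year) and by measurement type.
daily_avg = conn.execute(
    """
    SELECT CAST(strftime('%m', date) AS INTEGER) AS month,
           CAST(strftime('%d', date) AS INTEGER) AS day,
           type,
           AVG(value) AS avg_value
    FROM weather
    GROUP BY month, day, type
    ORDER BY month, day, type
    """
).fetchall()
```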
There is an extension to PostgreSQL called TimescaleDB that makes it easier to query this type of data. Beware that it makes changes to the PostgreSQL database that require changes to backup routines. And if the current database is partitioned, it will require a dump and restore.
A query can look like this:
-- By month
select
extract(year from created_at) as year,
extract(month from time_bucket('1 day', created_at)) as month,
min(temp) as temp
from
readings
where
created_at > '2019-01-01' and created_at < '2020-01-01'
group by
year,
month
order by
year,
month;
750 million rows. You need an efficient index. Consider this function and the index based on it.
Assuming a table weather with a date column date:
CREATE FUNCTION f_mmdd(date) -- or timestamp input?
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
'SELECT (EXTRACT(month FROM $1) * 100 + EXTRACT(day FROM $1))::int';
CREATE INDEX weather_mmdd_idx ON weather(f_mmdd(date));
This index helps to quickly identify all rows for a particular day of the year.
The manual about EXTRACT.
The above expression proved fastest for various reasons. Just re-ran some performance tests in Postgres 13, and nothing changed.
Details in this closely related answer:
How do you do date math that ignores the year?
There is also EXTRACT(doy FROM date) to extract the day of the year (1–365/366), which is even faster. But, obviously, there is an off-by-one error for dates past Feb 29 in leap years in the Gregorian calendar.
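The difference between the two keys is easy to check outside the database. This is a small Python sketch, not the Postgres function itself: `mmdd` mirrors the month * 100 + day expression and `doy` mirrors EXTRACT(doy ...).

```python
from datetime import date

def mmdd(d):
    """Month * 100 + day, e.g. 301 for March 1, independent of the year."""
    return d.month * 100 + d.day

def doy(d):
    """Day of the year (1-365/366)."""
    return d.timetuple().tm_yday

leap_year_day = date(2020, 3, 1)    # 2020 is a leap year
common_year_day = date(2021, 3, 1)

# The mmdd key matches across years; the day of the year shifts by one
# after Feb 29 in a leap year.
same_key = mmdd(leap_year_day) == mmdd(common_year_day)
doy_shift = doy(leap_year_day) - doy(common_year_day)
```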
Then the query for Jan 01 can be:
SELECT date_trunc('day', date) -- if it's a timestamp column
-- date -- if it's really a date column (which I find hard to believe)
, avg(temperature) AS avg_temperature
, avg(rain) AS avg_rain
-- , ...
FROM weather
WHERE f_mmdd(date) = f_mmdd('2000-01-01') -- or just 101 for Jan 01
GROUP BY 1;
The year in f_mmdd('2000-01-01') is arbitrary. Or just use the integer 101 for Jan 01.
You might be able to optimize further with multicolumn indexes for particular dimensions (temperature, rain, ...). But that depends on undisclosed details.
Sounds like the dataset isn't going to change. So a MATERIALIZED VIEW with readily computed aggregates per day might be a better alternative in the long run.
A word of warning: Computed averages are only correct if the measurements are spread out evenly across each day. Else, computed numbers are just averages of the given numbers, not actual average values for each day.
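A tiny worked example of that warning, in Python with invented readings: three measurements taken around midnight and one at noon. The plain average is pulled towards the densely sampled hour, while averaging per hour first gives a different figure. Neither is claimed to be the true daily mean; the point is only that uneven sampling changes the result.

```python
from statistics import mean

# Hypothetical readings for one day: (hour of day, temperature).
readings = [(0, 10.0), (0, 10.0), (0, 10.0), (12, 20.0)]

# Plain average over all rows, as AVG() would compute it.
plain_mean = mean(value for _, value in readings)

# Average within each hour first, then across hours.
by_hour = {}
for hour, value in readings:
    by_hour.setdefault(hour, []).append(value)
bucketed_mean = mean(mean(values) for values in by_hour.values())
```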
I am trying to create a variable that sums sales in the 3 months after each customer's first purchase in the time series. The code below errors and says I'm missing a parenthesis.
sum(case
when merch.trans_dt between min(merch.trans_dt)
and add_date(min(merch.trans_dt), interval 3 month)
then merch.rdswrit_rps_netnet_pur_amt
end) as spend_next3
You can do this by simply using addition, rather than a function: min(merch.trans_dt) + interval '3' month.
However, this may not give you the answer you want. In many cases, such as to_date('1/31/2015','mm/dd/yyyy') + interval '3' month, this will result in ORA-01839: date not valid for month specified.
You're better off using add_months as indicated previously in the comments: add_months(min(merch.trans_dt),3).
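To see why month arithmetic needs an end-of-month rule, here is a minimal Python sketch of month addition that clamps the day to the last day of the target month. The function name `add_months_clamped` is made up, and this only illustrates the clamping idea: Oracle's add_months additionally maps the last day of a month to the last day of the resulting month, which this sketch does not do.

```python
import calendar
from datetime import date

def add_months_clamped(d, months):
    """Add `months` to `d`, clamping the day if the target month is shorter."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

# Jan 31 + 3 months has no "Apr 31", so the day is clamped to Apr 30.
result = add_months_clamped(date(2015, 1, 31), 3)
```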
Let's say I have a large table that consists of just three columns.
Integer id,
timestamp ts,
double value
If I wanted to get the values given a complicated date expression, what is the best way to achieve that?
For example, if I wanted to get all the values at any time on weekend days, only between 18:00 and 8:00 on weekdays, and at any time on school holidays for the year 2014.
Obviously some of these times are variable, so the solution should be dynamic. I was thinking of storing a series of date intervals for things like school holidays in another table to check against. However, I would like to create a custom Postgres function to hide some of the complexity.
Does anyone know of similar code or have suggestions?
Especially for dealing with cases like the "times above, except on weekends" logic?
Thanks
With a holiday table
select *
from
t
left join
holiday on date_trunc('day', t.ts) = holiday.day
where
extract(dow from ts) in (0, 6) -- Weekend
or
(extract(hour from ts) >= 18 or extract(hour from ts) < 8)
or
holiday.day is not null -- Holiday
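The three conditions from the question can also be written as a plain predicate, which is handy for checking edge cases before putting the logic into a query or function. This is a hedged Python sketch: `is_selected` and the sample holiday set are hypothetical, and the off-hours window is read as 18:00 up to, but not including, 08:00.

```python
from datetime import date, datetime

def is_selected(ts, holidays):
    """True on weekends, outside 08:00-18:00 on weekdays, or on a holiday."""
    if ts.weekday() >= 5:              # 5 = Saturday, 6 = Sunday
        return True
    if ts.hour >= 18 or ts.hour < 8:   # evening or early morning
        return True
    return ts.date() in holidays       # any time on a listed holiday

# Hypothetical holiday table with a single entry.
holidays = {date(2014, 8, 19)}

saturday_noon = is_selected(datetime(2014, 8, 16, 12, 0), holidays)
monday_noon = is_selected(datetime(2014, 8, 18, 12, 0), holidays)
monday_evening = is_selected(datetime(2014, 8, 18, 19, 0), holidays)
holiday_noon = is_selected(datetime(2014, 8, 19, 12, 0), holidays)
```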
I want to run some queries on a sales table and a purchases table, for things like showing the total costs or the total sales of a particular item.
Both of these tables also have a date field, which stores dates in a sortable format (just the default, I suppose?).
I am wondering how I would use this date field in my queries to work with date ranges such as:
The last year, from any given date of the year
The last 30 days, from any given day
To show set months, such as January, February etc.
Are these types of queries possible using just a DATE field, or would it be easier to store months and years as separate text fields?
Given a DATE field MY_DATE, you can perform those 3 operations using various date functions:
1. Select last year's records
SELECT * FROM MY_TABLE
WHERE YEAR(my_date) = YEAR(CURDATE()) - 1
2. Last 30 Days
SELECT * FROM MY_TABLE
WHERE DATE_SUB(CURDATE(), INTERVAL 30 DAY) < MY_DATE
3. Show the month name
SELECT MONTHNAME(MY_DATE), MY_TABLE.* FROM MY_TABLE
I have always found it advantageous to store dates as Unix timestamps. They're extremely easy to sort by and query by range, and MySQL has built-in features that help (like UNIX_TIMESTAMP() and FROM_UNIXTIME()).
You can store them in INT(11) columns; when you program with them you learn quickly that a day is 86400 seconds, and you can get more complex ranges by multiplying that by a number of days (e.g. a month is close enough to 86400 * 30, and programming languages usually have excellent facilities for converting to and from them built into standard libraries).
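A short Python sketch of that arithmetic, with an invented helper `last_n_days`: a range of N days is just a subtraction of N * 86400 seconds, and the standard library converts between timestamps and dates. UTC is used here to keep the example deterministic.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400

def last_n_days(now_ts, days):
    """Return the (start, end) Unix-timestamp range covering the last `days` days."""
    return now_ts - days * SECONDS_PER_DAY, now_ts

# Convert a date to a timestamp, compute the range, and convert back.
now_ts = int(datetime(2020, 1, 31, tzinfo=timezone.utc).timestamp())
start_ts, end_ts = last_n_days(now_ts, 30)
start = datetime.fromtimestamp(start_ts, tz=timezone.utc)
```

Note that 30 * 86400 seconds is only an approximation of a calendar month, as the text above says.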
How can I create a stored procedure that accepts a start and end date (e.g. April 1 - April 30)?
1.) Get the business days including Saturdays x (a value) +
2.) Get holidays x (a value)
and return the total.
I'm new to this; I guess it would be a T-SQL function.
Any help would be appreciated.
Thanks
The simplest solution to this problem is to create a Calendar table that contains a value for every day you might want to consider. You could then add columns that indicate whether it is a business day or a holiday. With that, the problem becomes trivial:
Select ..
From Calendar
Where IsBusinessDay = 1
And Calendar.[Date] Between '2010-04-01' And '2010-04-30'
If you wanted the count of days, you could then do:
Select Sum( Case When IsBusinessDay = 1 Then 1 Else 0 End ) As BusinessDayCount
, Sum( Case When IsHoliday = 1 Then 1 Else 0 End ) As HolidayCount
From Calendar
Where Calendar.[Date] Between '2010-04-01' And '2010-04-30'
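The same calendar idea can be prototyped in a few lines of Python to check the counts. Everything here is an assumption for illustration: the helper names, the single sample holiday, the two rates, and the rule that business days are Monday to Saturday excluding holidays.

```python
from datetime import date, timedelta

def build_calendar(start, end, holidays):
    """One row per day with is_business_day / is_holiday flags."""
    rows = []
    day = start
    while day <= end:
        is_holiday = day in holidays
        # Assumption: Monday-Saturday are business days unless a holiday.
        is_business_day = day.weekday() != 6 and not is_holiday
        rows.append(
            {"date": day, "is_business_day": is_business_day, "is_holiday": is_holiday}
        )
        day += timedelta(days=1)
    return rows

def total(rows, business_rate, holiday_rate):
    """Business days x one value plus holidays x another value."""
    business_days = sum(r["is_business_day"] for r in rows)
    holiday_days = sum(r["is_holiday"] for r in rows)
    return business_days * business_rate + holiday_days * holiday_rate

# April 2010 with one hypothetical holiday on a Friday.
rows = build_calendar(date(2010, 4, 1), date(2010, 4, 30), {date(2010, 4, 2)})
business_day_count = sum(r["is_business_day"] for r in rows)
holiday_count = sum(r["is_holiday"] for r in rows)
```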
http://classicasp.aspfaq.com/date-time-routines-manipulation/how-do-i-count-the-number-of-business-days-between-two-dates.html
First, you will need to store all of the holidays in an independent table (Christmas, Easter, New Year's Day, etc., with their respective dates, normally timed at midnight).
Second, you will have to generate, perhaps into a temporary table, the dates of the office days, excluding the dates contained in the Holidays table.
Third, you may set the office hours for these dates depending on what day it is, if you have different working hours on different days.
That is the algorithm; it is up to you to find the appropriate code implementation.
Let me know if this helps!