Break down a date range into daily rows with high efficiency? - sql

The data has start_date and end_date columns, like 2020-09-18 and 2020-09-28. I need to break each range down into daily rows, which here is 11 days, including 2020-09-18.
My solution is to create a date table with every single day.
with cte as (
    select
        d.fulldate,
        -- rows per partition = number of daily rows generated for this group
        count(1) over (partition by a, b, metric_c, metric_d) as day_count,
        a, b,
        metric_c, metric_d
    from a
    join dim_date d
        on d.fulldate between a.start_date and a.end_date
)
select
    fulldate,
    a, b,
    metric_c / day_count as metric_c, -- maybe some cast or convert in here
    metric_d / day_count as metric_d
from cte
This is what I'm using currently. But is there a more efficient way? If the table has 1,000,000 rows and maybe 10 metrics, how can I get better performance?
Thanks in advance anyway. Maybe there's a method that doesn't need an extra date table (which requires maintenance whenever it runs out of dates) and still performs brilliantly on millions of rows. If not, I'll keep using my method.

I would keep the dim_date data model you have, since it materializes the rows between start_date and end_date instead of generating them at query time.
The DIM_DATE table is an example of a conformed dimension: it can be reused across any other subject areas in your reporting application that need a date dimension.
I would check that DIM_DATE has an index on the column being looked up (fulldate).
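For example, a minimal sketch of that index (assuming SQL Server; the index name is illustrative):
create index ix_dim_date_fulldate on dim_date (fulldate); -- speeds up the between lookup on the join key
If fulldate is already the primary key of DIM_DATE, it is indexed automatically and nothing more is needed.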

I wouldn't be surprised if a recursive CTE had better performance if you have lots of rows and relatively short periods:
with cte as (
    select start_date, end_date,
        metric_a / (datediff(day, start_date, end_date) + 1) as metric_a,
        metric_b / (datediff(day, start_date, end_date) + 1) as metric_b
    from a
    union all
    select dateadd(day, 1, start_date), end_date, metric_a, metric_b
    from cte
    where start_date < end_date
)
select *
from cte;
You can just add more metrics into the CTE as needed.
If any of the periods exceeds 100 days, then you need to add option (maxrecursion 0).
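The hint goes on the outer statement, not inside the CTE definition; for example:
select *
from cte
option (maxrecursion 0); -- lifts SQL Server's default limit of 100 recursion levels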

Related

PL/SQL query to calculate customers per period from start and stop dates

I have a PL/SQL table whose structure is described below.
I have customers (customer_number) with insurance cover start and stop dates (cover_start_date and cover_stop_date). I also have dates of accidents for those customers (accident_date). These customers may have more than one row in the table if they have had more than one accident. They may also have no accidents. And they may also have a blank entry for the cover stop date if their cover is ongoing. Sorry I did not design the data format, but I am stuck with it.
I am looking to calculate the number of accidents (num_accidents) and number of customers (num_customers) in a given time period (period_start), and from that the number of accidents-per-customer (which will be easy once I've got those two pieces of information).
Any ideas on how to design a PL/SQL function to do this in a simple way? Ideally the time periods would not be fixed to monthly (for example, weekly or fortnightly should work too). Ideally I will end up with a summary table of period_start, num_customers, num_accidents and accidents-per-customer.
Many thanks for any pointers...
You seem to need a list of dates. You can generate one in the query and then use correlated subqueries to calculate the columns you want:
select d.*,
       (select count(distinct customer_number)
        from t
        where t.cover_start_date <= d.dte and
              (t.cover_stop_date > d.dte + interval '1' month or t.cover_stop_date is null)
       ) as num_customers,
       (select count(*)
        from t
        where t.accident_date >= d.dte and
              t.accident_date < d.dte + interval '1' month
       ) as num_accidents,
       (select count(distinct customer_number)
        from t
        where t.accident_date >= d.dte and
              t.accident_date < d.dte + interval '1' month
       ) as num_customers_with_accident
from (select date '2020-01-01' as dte from dual union all
      select date '2020-02-01' as dte from dual union all
      . . .
     ) d;
If you want to do arithmetic on the columns, you can use this as a subquery or CTE.
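For instance, a sketch of the accidents-per-customer calculation (period_stats is just the query above wrapped in a CTE; nullif guards against dividing by zero):
with period_stats as (
    -- the full query above goes here
    . . .
)
select dte,
       num_customers,
       num_accidents,
       num_accidents / nullif(num_customers, 0) as accidents_per_customer
from period_stats;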

Adding a row for missing data

For the date range 2017-02-01 - 2017-02-10, I'm calculating a running balance.
I have days with missing data; how would I include these missing dates with the previous day's balance?
Example data:
We are missing data for 2017-02-04, 2017-02-05 and 2017-02-06. How would I add rows to the query carrying the previous day's balance?
The date range is a parameter, so it could change.
Can I use something like the LAG function?
I would be inclined to use a recursive CTE and then fill in the values. Here is one approach using outer apply:
with dates as (
    select mind as dte, mind, maxd
    from (select min(date) as mind, max(date) as maxd from t) t
    union all
    select dateadd(day, 1, dte), mind, maxd
    from dates
    where dte < maxd
)
select d.dte, t.balance
from dates d outer apply
     (select top 1 t.*
      from t
      where t.date <= d.dte
      order by t.date desc
     ) t;
You can generate the dates using a tally table as below:
Declare @d1 date = '2017-02-01'
Declare @d2 date = '2017-02-10'

;with cte_dates as (
    Select top (datediff(D, @d1, @d2) + 1)
        Dates = Dateadd(day, Row_Number() over (order by (Select NULL)) - 1, @d1)
    from master..spt_values s1, master..spt_values s2
)
Select * from cte_dates left join ....
Then do a left join to your table to get the running total.
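A sketch of what that lookup might look like (t(date, balance) is a hypothetical balances table; an outer apply, as in the answer above, carries the most recent balance forward over the missing days):
;with cte_dates as (
    Select top (datediff(D, @d1, @d2) + 1)
        Dates = Dateadd(day, Row_Number() over (order by (Select NULL)) - 1, @d1)
    from master..spt_values s1, master..spt_values s2
)
Select d.Dates, b.balance
from cte_dates d
outer apply (select top 1 t.balance
             from t -- hypothetical table with columns date, balance
             where t.date <= d.Dates
             order by t.date desc) b;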
Adding to the date-range and CTE solutions: I have created Date Dimension tables in numerous databases and just left join to them.
There are free scripts online to create date dimension tables for SQL Server. I highly recommend them. Plus, a date dimension makes aggregation by other time periods (e.g. quarter, month, year) much more efficient.
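For illustration, a minimal sketch of such a table for SQL Server (the column set and names are illustrative; the free scripts referenced above are far more complete):
create table dim_date (
    fulldate date primary key,
    [year] int not null,
    [quarter] int not null,
    [month] int not null,
    [weekday] int not null
);

-- populate roughly 20 years of days with a recursive CTE
with d as (
    select cast('2015-01-01' as date) as dt
    union all
    select dateadd(day, 1, dt) from d where dt < '2034-12-31'
)
insert into dim_date
select dt, year(dt), datepart(quarter, dt), month(dt), datepart(weekday, dt)
from d
option (maxrecursion 0);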

How to select the Average per WORK Day/Week/Month/60 Days/90 Days when not all days are in table?

OK, so after a lot of research I am still stuck. I get the simple basics:
I can select the average of all values using a GROUP BY statement, grouping where the dates are within my range and by the period I want to average over.
Select AVG(Qty)
From (
Select Sum(ItemQty) As Qty
From ReceivedInfo
Where DateUsed >= '2014-04-01'
AND DateUsed < '2014-04-30'
Group By Date(DateUsed)
)
The issue here is that I need to include days that aren't in the table too: all workdays should be accounted for, even days that don't have an entry. I am not sure how to get an average of Qty per day (or month, or 60 days, and so on) that includes all, and only, workdays from a date range, excluding weekends.
Would it be enough to do something like:
Select Sum(Qty) / ((julianday('2014-04-30') - julianday('2014-04-01'))
                   - ((julianday('2014-04-30') - julianday('2014-04-01')) / 7) * 2) As Average
From (
    Select Sum(ItemQty) As Qty
    From ReceivedInfo
    Where DateUsed >= '2014-04-01'
      AND DateUsed < '2014-04-30'
)
Or Do I have to determine the number of work days outside of the query?
A while back I saw someone do something rather interesting with a CTE that solved a similar problem: perhaps this would work for you?
WITH DateRange(Date) AS
(
    SELECT '2014-04-01' Date
    UNION ALL
    SELECT DATEADD(day, 1, Date) Date
    FROM DateRange
    WHERE Date < '2014-04-30'
)
SELECT CONVERT(VARCHAR(10), Date, 121) -- either SELECT from or JOIN to this table
FROM DateRange
WHERE DATEPART(dw, Date) NOT IN (1, 7) -- this could limit it to Mon-Fri only
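One caveat: DATEPART(dw, ...) depends on the session's DATEFIRST setting, so (1, 7) only means Sunday and Saturday under the US English default. A setting-independent filter looks like this:
-- (DATEPART(dw, Date) + @@DATEFIRST) % 7 maps Saturday to 0 and Sunday to 1
-- regardless of the DATEFIRST setting:
WHERE (DATEPART(dw, Date) + @@DATEFIRST) % 7 NOT IN (0, 1)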
EDIT: I think that the following would be the linkage for you to include this in your resultset:
Select AVG(Qty)
From (
    Select Coalesce(Sum(ItemQty), 0) As Qty -- days with no rows count as zero
    From DateRange
    LEFT JOIN ReceivedInfo
        ON DateRange.Date = ReceivedInfo.DateUsed
    Where DateRange.Date >= '2014-04-01'
      AND DateRange.Date < '2014-04-30'
    Group By DateRange.Date
) As Daily -- T-SQL requires an alias on a derived table
EDIT 2: removed variables as per commenter below. Thanks for the reminder!
Your final query should work fine if you want the missing values to be treated as 0. You can, however, simplify it. So, to a close approximation:
Select Sum(ItemQty) / (5.0 / 7 * (julianday('2014-04-30') - julianday('2014-04-01')))
From ReceivedInfo
Where DateUsed >= '2014-04-01' AND DateUsed < '2014-04-30';
Of course, this is just an approximation of the number of work days (which is similar to your proposed solution). A full solution would use cumbersome day-of-week arithmetic.
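For the exact count, a sketch in SQLite (assumes SQLite 3.8.3+ for recursive CTEs; strftime('%w') returns '0' for Sunday through '6' for Saturday):
with recursive d(day) as (
    select date('2014-04-01')
    union all
    select date(day, '+1 day') from d where day < date('2014-04-29')
)
select count(*) as workdays
from d
where strftime('%w', day) not in ('0', '6'); -- exclude Sunday and Saturday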

in sql, calculating date parts versus date lookup table in group queries

Many queries are by week, month or quarter when the base table's date column is either a date or a timestamp.
In general, in GROUP BY queries, does it matter whether you use
- functions on the date, or
- a day table that has the extractions pre-calculated?
Note: similar question to DATE lookup table (1990/01/01:2041/12/31).
For example, in PostgreSQL:
create table sale(
    tran_id serial primary key,
    tran_dt date not null default current_date,
    sale_amt decimal(8,2) not null,
    ...
);

create table days(
    day date primary key,
    week date not null,
    month date not null,
    quarter date not null
);
-- week query 1: group using funcs
select
date_trunc('week',tran_dt)::date - 1 as week,
count(1) as sale_ct,
sum(sale_amt) as sale_amt
from sale
where date_trunc('week',tran_dt)::date - 1 between '2012-1-1' and '2011-12-31'
group by date_trunc('week',tran_dt)::date - 1
order by 1;
-- query 2: group using days
select
days.week,
count(1) as sale_ct,
sum(sale_amt) as sale_amt
from sale
join days on( days.day = sale.tran_dt )
where week between '2011-1-1'::date and '2011-12-31'::date
group by week
order by week;
To me, the date_trunc() function seems more organic, but the days table is easier to use.
Is there anything here more than a matter of taste?
-- query 3: group using instant "immediate" calendar table
WITH calendar AS (
SELECT ser::date AS dd
, date_trunc('week', ser)::date AS wk
-- , date_trunc('month', ser)::date AS mon
-- , date_trunc('quarter', ser)::date AS qq
FROM generate_series( '2012-1-1' , '2012-12-31', '1 day'::interval) ser
)
SELECT
cal.wk
, count(1) as sale_ct
, sum(sa.sale_amt) as sale_amt
FROM sale sa
JOIN calendar cal ON cal.dd = sa.tran_dt
-- WHERE week between '2012-1-1' and '2011-12-31'
GROUP BY cal.wk
ORDER BY cal.wk
;
Note: I fixed an apparent typo in the BETWEEN range.
UPDATE: I used Erwin's recursive CTE to squeeze out the duplicated date_trunc(). Nested CTE galore:
WITH calendar AS (
WITH RECURSIVE montag AS (
SELECT '2011-01-01'::date AS dd
UNION ALL
SELECT dd + 1 AS dd
FROM montag
WHERE dd < '2012-1-1'::date
)
SELECT mo.dd, date_trunc('week', mo.dd + 1)::date AS wk
FROM montag mo
)
SELECT
cal.wk
, count(1) as sale_ct
, sum(sa.sale_amt) as sale_amt
FROM sale sa
JOIN calendar cal ON cal.dd = sa.tran_dt
-- WHERE week between '2012-1-1' and '2011-12-31'
GROUP BY cal.wk
ORDER BY cal.wk
;
Yes, it is more than a matter of taste. The performance of the query depends on the method.
As a first approximation, the functions should be faster. They don't require joins, doing the read in a single table scan.
However, a good optimizer could make effective use of a lookup table. It would know the distribution of the target values. And an in-memory join could be quite fast.
As a database design, I think having a calendar table is very useful. Some information such as holidays just isn't going to work as a function. However, for most ad hoc queries the date functions are fine.
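For instance, a sketch of the holiday case (the is_holiday column is an illustrative extension, not part of the question's days table):
-- a holiday flag is pure data; no date function can derive it
alter table days add column is_holiday boolean not null default false;

update days
set is_holiday = true
where day in (date '2012-01-01', date '2012-12-25'); -- example holidays

select count(*) as business_days
from days
where day between date '2012-01-01' and date '2012-03-31'
  and not is_holiday;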
1. Your expression:
... between '2012-1-1' and '2011-12-31'
doesn't work. Basic BETWEEN requires the left argument to be less than or equal to the right argument. Would have to be:
... BETWEEN SYMMETRIC '2012-1-1' and '2011-12-31'
Or it's just a typo and you mean something like:
... BETWEEN '2011-1-1' and '2011-12-31'
It's unclear to me, what your queries are supposed to retrieve. I'll assume you want all weeks (Monday to Sunday) that start in 2011 for the rest of this answer. This expression generates exactly that in less than a microsecond on modern hardware (works for any year):
SELECT generate_series(
date_trunc('week','2010-12-31'::date) + interval '7d'
,date_trunc('week','2011-12-31'::date) + interval '6d'
, '1d')::date
Note that the ISO 8601 definition of the "first week of a year" is slightly different.
2. Your second query does not work at all. No GROUP BY?
3. The question you link to did not deal with PostgreSQL, which has outstanding date / timestamp support. And it has generate_series() which can obviate the need for a separate "days" table in most cases - as demonstrated above. Your query would look like this:
In the meantime @wildplasser provided an example query that was supposed to go here.
By popular* demand, a recursive CTE version - which is actually not that far from being a serious alternative!
* and by "popular" I mean #wildplasser's very serious request.
WITH RECURSIVE days AS (
SELECT '2011-01-01'::date AS dd
,date_trunc('week', '2011-01-01'::date )::date AS wk
UNION ALL
SELECT dd + 1
,date_trunc('week', dd + 1)::date AS wk
FROM days
WHERE dd < '2011-12-31'::date
)
SELECT d.wk
,count(*) AS sale_ct
,sum(s.sale_amt) AS sale_amt
FROM days d
JOIN sale s ON s.tran_dt = d.dd
-- WHERE d.wk between '2011-01-01' and '2011-12-31'
GROUP BY 1
ORDER BY 1;
Could also be written as (compare to @wildplasser's version):
WITH RECURSIVE d AS (
SELECT '2011-01-01'::date AS dd
UNION ALL
SELECT dd + 1 FROM d WHERE dd < '2011-12-31'::date
), days AS (
SELECT dd, date_trunc('week', dd + 1)::date AS wk
FROM d
)
SELECT ...
4. If performance is of the essence, just make sure that you do not apply functions or calculations to the values of your table. This prohibits the use of indexes and is generally very slow, because every row has to be processed. That's why your first query is going to suck with a big table. Whenever possible, apply calculations to the values you filter with instead.
Indexes on expressions are one way around this. If you had an index like
CREATE INDEX sale_tran_dt_week_idx ON sale ((date_trunc('week', tran_dt)::date));
... your first query could be very fast again - at some cost to write operations for index maintenance (note the extra parentheses PostgreSQL requires around an expression in an index definition).
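Alternatively, a sketch of the sargable rewrite: filter the raw tran_dt (so a plain index on tran_dt is usable) and truncate only for grouping. The date bounds are illustrative:
select date_trunc('week', tran_dt)::date - 1 as week
     , count(1) as sale_ct
     , sum(sale_amt) as sale_amt
from sale
where tran_dt >= date '2011-01-03' -- first Monday of 2011 (illustrative)
  and tran_dt < date '2012-01-02' -- first Monday of 2012 (illustrative)
group by 1
order by 1;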

SQL for counting events by date

I feel like I've seen this question asked before, but neither the SO search nor Google is helping me... maybe I just don't know how to phrase the question. I need to count the number of events (in this case, logins) per day over a given time span so that I can make a graph of website usage. The query I have so far is this:
select
count(userid) as numlogins,
count(distinct userid) as numusers,
convert(varchar, entryts, 101) as date
from
usagelog
group by
convert(varchar, entryts, 101)
This does most of what I need (I get a row per date as the output containing the total number of logins and the number of unique users on that date). The problem is that if no one logs in on a given date, there will not be a row in the dataset for that date. I want it to add in rows indicating zero logins for those dates. There are two approaches I can think of for solving this, and neither strikes me as very elegant.
1. Add a column to the result set that lists the number of days between the start of the period and the date of the current row. When I'm building my chart output, I'll keep track of this value and if the next row is not equal to the current row plus one, insert zeros into the chart for each of the missing days.
2. Create a "date" table that has all the dates in the period of interest and outer join against it. Sadly, the system I'm working on already has a table for this purpose that contains a row for every date far into the future... I don't like that, and I'd prefer to avoid using it, especially since that table is intended for another module of the system and would thus introduce a dependency on what I'm developing currently.
Any better solutions or hints at better search terms for google? Thanks.
Frankly, I'd do this programmatically when building the final output. You're essentially trying to read something from the database which is not there (data for days that have no data). SQL isn't really meant for that sort of thing.
If you really want to do that, though, a "date" table seems your best option. To make it a bit nicer, you could generate it on the fly, using, e.g., your DB's date functions and a derived table.
I had to do exactly the same thing recently. This is how I did it in T-SQL (YMMV on speed, but I've found it performant enough over a couple million rows of event data):
DECLARE @DaysTable TABLE ( [Year] INT, [Day] INT )
DECLARE @StartDate DATETIME
SET @StartDate = whatever

WHILE (@StartDate <= GETDATE())
BEGIN
    INSERT INTO @DaysTable ( [Year], [Day] )
    SELECT DATEPART(YEAR, @StartDate), DATEPART(DAYOFYEAR, @StartDate)
    SELECT @StartDate = DATEADD(DAY, 1, @StartDate)
END
-- This gives me a table of all days since whenever
-- (you could select @StartDate as the minimum date of your usage log)

SELECT days.Year, days.Day, ISNULL(events.NumEvents, 0) AS NumEvents -- show zero-event days as 0
FROM @DaysTable AS days
LEFT JOIN (
    SELECT
        COUNT(*) AS NumEvents,
        DATEPART(YEAR, LogDate) AS [Year],
        DATEPART(DAYOFYEAR, LogDate) AS [Day]
    FROM LogData
    GROUP BY
        DATEPART(YEAR, LogDate),
        DATEPART(DAYOFYEAR, LogDate)
) AS events ON days.Year = events.Year AND days.Day = events.Day
Create a memory table (a table variable) where you insert your date ranges, then outer join the logins table against it. Group by your start date, then you can perform your aggregations and calculations.
The strategy I normally use is to UNION with the opposite of the query, generally a query that retrieves data for rows that don't exist.
If I wanted to get the average mark for a course, but some courses weren't taken by any students, I'd need to UNION with those not taken by anyone to display a row for every class:
SELECT AVG(mark), course FROM `marks`
UNION
SELECT NULL, course FROM courses WHERE course NOT IN
(SELECT course FROM marks)
Your query will be more complex, but the same principle should apply. You may indeed need a table of dates for your second approach.
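For illustration, the same principle applied to the login problem might look like this (assumes a dates table d(dte) holding the dates of interest; the name is hypothetical):
select convert(varchar, entryts, 101) as date,
       count(userid) as numlogins,
       count(distinct userid) as numusers
from usagelog
group by convert(varchar, entryts, 101)
union
select convert(varchar, dte, 101), 0, 0 -- zero-login rows for the missing dates
from d
where convert(varchar, dte, 101) not in
      (select convert(varchar, entryts, 101) from usagelog);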
Option 1
You can create a temp table, insert the dates covering the range, and do a left outer join with usagelog.
Option 2
You can programmatically insert the missing dates while evaluating the result set to produce the final output.
WITH q(n) AS
(
    SELECT 0
    UNION ALL
    SELECT n + 1
    FROM q
    WHERE n < 99
),
qq(n) AS
(
    SELECT 0
    UNION ALL
    SELECT n + 1
    FROM qq
    WHERE n < 99
),
dates AS
(
    SELECT q.n * 100 + qq.n AS ndate
    FROM q, qq
)
SELECT COUNT(userid) as numlogins,
       COUNT(DISTINCT userid) as numusers,
       CAST('2000-01-01' AS DATETIME) + ndate as date
FROM dates
LEFT JOIN usagelog
    ON entryts >= CAST('2000-01-01' AS DATETIME) + ndate
   AND entryts < CAST('2000-01-01' AS DATETIME) + ndate + 1
GROUP BY ndate
This will select up to 10,000 dates constructed on the fly, which should be enough for about 27 years.
SQL Server limits CTE recursion to 100 levels by default, which is why each of the inner queries returns at most 100 rows.
If you need more than 10,000, just add a third CTE qqq(n) and cross-join with it in dates.
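A sketch of that extension (only the dates CTE changes; qqq is declared exactly like qq):
-- with q, qq and qqq each yielding 0..99, this produces 1,000,000 values
qqq(n) AS
(
    SELECT 0
    UNION ALL
    SELECT n + 1
    FROM qqq
    WHERE n < 99
),
dates AS
(
    SELECT q.n * 10000 + qq.n * 100 + qqq.n AS ndate
    FROM q, qq, qqq
)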