How does DATEADD work when joining the same table with itself? - sql

I have a table with monthly production values.
Example:
Outdate | Prod Value | ID
2/28/19 | 110 | 4180
3/31/19 | 100 | 4180
4/30/19 | 90 | 4180
I also have a table that has monthly forecast values.
Example:
Forecast End Date | Forecast Value | ID
2/28/19 | 120 | 4180
3/31/19 | 105 | 4180
4/30/19 | 80 | 4180
I want to create a table that has a row that contains the ID, the Prod Value, the current month (example: March) forecast, the previous month forecast, the next month forecast.
What I want:
ID | Prod Value | Outdate | Current Forecast | Previous Forecast | Next Forecast
4180 | 100 | 3/31/19 | 105 | 120 | 80
The problem is that when I used DATEADD to bring in the specific value from the Forecast table for the previous month, random months are missing from my final values.
I've tried adding in another LEFT JOIN / INNER JOIN with the DateDimension table when adding in the Next Month and Previous Month forecast, but that either does not solve the problem or adds in too many rows.
My DateDimension table has these columns: DateKey, Date, Day, DaySuffix, Weekday, WeekDayName, IsWeekend, IsHoliday, DOWInMonth, DayOfYear, WeekOfMonth, WeekOfYear, ISOWeekOfYear, Month, MonthName, Quarter, QuarterName, Year, MMYYYY, MonthYear, FirstDayOfMonth, LastDayOfMonth, FirstDayOfQuarter, LastDayOfQuarter, FirstDayOfYear, LastDayOfYear, FirstDayOfNextMonth, FirstDayOfNextYear
My query is along these lines (abbreviated for simplicity)
SELECT A.ArchiveKey, BH.ID, d.[Date], BH.Outdate, BH.ProdValue, BH.Forecast, BHP.Forecast, BHN.Forecast
FROM dbo.BudgetHistory bh
INNER JOIN dbo.DateDimension d ON bh.outdate = d.lastdayofmonth
INNER JOIN dbo.Archive a ON bh.ArchiveKey = a.ArchiveKey
LEFT JOIN dbo.BudgetHistory bhp ON bh.ID = bhp.ID AND bhp.outdate = DATEADD(m, - 1, bh.Outdate)
LEFT JOIN dbo.BudgetHistory bhn ON bh.ID = bhn.ID AND bhn.outdate = DATEADD(m, 1, bh.Outdate)
WHERE bh.ID IS NOT NULL
I get something like this:
+------+------------+---------+------------------+-------------------+---------------+
| ID | Prod Value | Outdate | Current Forecast | Previous Forecast | Next Forecast |
+------+------------+---------+------------------+-------------------+---------------+
| 4180 | 110 | 2/28/19 | 120 | NULL | NULL |
| 4180 | 100 | 3/31/19 | 105 | 120 | 80 |
| 4180 | 90 | 4/30/19 | 80 | NULL | NULL |
+------+------------+---------+------------------+-------------------+---------------+
And the pattern doesn't seem to follow anything reasonable.
I want the values to be filled in for each row.

You could join the tables, then use window functions LEAD() and LAG() to recover the next and previous forecast values:
SELECT
p.ID,
p.ProdValue,
p.Outdate,
f.ForecastValue,
LAG(f.ForecastValue) OVER(PARTITION BY f.ID ORDER BY f.ForecastEndDate) PreviousForecast,
LEAD(f.ForecastValue) OVER(PARTITION BY f.ID ORDER BY f.ForecastEndDate) NextForecast
FROM prod p
INNER JOIN forecast f ON p.ID = f.ID AND p.Outdate = f.ForecastEndDate
This demo on DB Fiddle with your sample data returns:
ID | ProdValue | Outdate | ForecastValue | PreviousForecast | NextForecast
---: | --------: | :------------------ | ------------: | ---------------: | -----------:
4180 | 110 | 28/02/2019 00:00:00 | 120 | null | 105
4180 | 100 | 31/03/2019 00:00:00 | 105 | 120 | 80
4180 | 90 | 30/04/2019 00:00:00 | 80 | 105 | null
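To reproduce that result locally, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite (3.25 or later, which is needed for window functions) is only a stand-in for SQL Server here. The dates are stored as ISO strings so that they sort correctly, and an ORDER BY is added so the row order is deterministic. The table names prod and forecast follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prod (Outdate TEXT, ProdValue INTEGER, ID INTEGER);
CREATE TABLE forecast (ForecastEndDate TEXT, ForecastValue INTEGER, ID INTEGER);
INSERT INTO prod VALUES
    ('2019-02-28', 110, 4180), ('2019-03-31', 100, 4180), ('2019-04-30', 90, 4180);
INSERT INTO forecast VALUES
    ('2019-02-28', 120, 4180), ('2019-03-31', 105, 4180), ('2019-04-30', 80, 4180);
""")

# Same join and window functions as the answer; LAG/LEAD look at the
# neighbouring rows per ID, so no date arithmetic is involved at all.
rows = conn.execute("""
SELECT p.ID, p.ProdValue, p.Outdate, f.ForecastValue,
       LAG(f.ForecastValue)  OVER (PARTITION BY f.ID ORDER BY f.ForecastEndDate) AS PreviousForecast,
       LEAD(f.ForecastValue) OVER (PARTITION BY f.ID ORDER BY f.ForecastEndDate) AS NextForecast
FROM prod p
INNER JOIN forecast f ON p.ID = f.ID AND p.Outdate = f.ForecastEndDate
ORDER BY p.Outdate
""").fetchall()

for row in rows:
    print(row)
```

Note that LAG and LEAD return the adjacent row, not strictly the adjacent calendar month, so a gap in the forecast data would silently shift the values.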

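DATEADD only makes an end-of-month adjustment when the newly calculated value isn't a valid date. So DATEADD(month,-1,'20190331') produces 28th February, but DATEADD(month,-1,'20190228') produces 28th January, not the 31st. Starting from the last day of a shorter month therefore never lands on the last day of a longer neighbouring month, which is why rows such as 2/28 and 4/30 find no match in your self-join.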
I would probably go with GMB's answer. If you want to do something DATEADD based though, you can use:
bhp.outdate = DATEADD(month, DATEDIFF(month,'20010131', bh.Outdate) ,'20001231')
This always works out the last day of the previous month from bh.OutDate, but it does it by computing it as an offset from a fixed date, and then applying that offset to a different fixed date.
You can just reverse the places of 20010131 and 20001231 to compute the month after rather than the month before. There's no significance about them other than them both having 31 days and having the "one month apart" relationship we're wishing to apply.
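To see the difference between the plain month shift and the fixed-anchor trick, here is a small Python sketch that emulates the two T-SQL functions. The helpers add_months, month_diff, prev_month_end and next_month_end are hypothetical names written for this illustration; they only mimic the documented behaviour of DATEADD(month, ...) and DATEDIFF(month, ...) and are not part of any library.

```python
import calendar
from datetime import date

def add_months(d, n):
    """Emulate DATEADD(month, n, d): shift the month, clamp the day only if it would be invalid."""
    year, month0 = divmod(d.year * 12 + (d.month - 1) + n, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))

def month_diff(start, end):
    """Emulate DATEDIFF(month, start, end): number of month boundaries crossed."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def prev_month_end(d):
    # DATEADD(month, DATEDIFF(month, '20010131', d), '20001231')
    return add_months(date(2000, 12, 31), month_diff(date(2001, 1, 31), d))

def next_month_end(d):
    # Anchors swapped: DATEADD(month, DATEDIFF(month, '20001231', d), '20010131')
    return add_months(date(2001, 1, 31), month_diff(date(2000, 12, 31), d))

# Plain month shift: fine from a long month, wrong from a short one.
print(add_months(date(2019, 3, 31), -1))   # 2019-02-28
print(add_months(date(2019, 2, 28), -1))   # 2019-01-28, not the month end

# Anchor trick: always the last day of the neighbouring month.
print(prev_month_end(date(2019, 2, 28)))   # 2019-01-31
print(next_month_end(date(2019, 4, 30)))   # 2019-05-31
```

Because both anchors fall on the 31st, the clamping step always pulls the result back to whatever the real month end is.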

Related

Dealing with required duplicates in table records

Here's the situation. My team forecasts sales and revenue numbers at a monthly resolution but would like all reporting to be at a daily resolution. So what I am doing is ingesting these numbers and dividing the monthly targets by number of days and saving it in a table.
So I start off with something like this:
| date | forecasted_units | forecasted_revenue |
|---------|------------------|--------------------|
| 2020-01 | 372 | 9300 |
| 2020-02 | 435 | 9280 |
...
My target table now looks like this:
| date | forecasted_units | forecasted_revenue |
|------------|------------------|--------------------|
| 2020-01-01 | 12 | 300 |
| 2020-01-02 | 12 | 300 |
| 2020-01-03 | 12 | 300 |
...
| date | forecasted_units | forecasted_revenue |
|------------|------------------|--------------------|
| 2020-02-01 | 15 | 320 |
| 2020-02-02 | 15 | 320 |
| 2020-02-03 | 15 | 320 |
...
My real table is a lot wider than the one above, and every column repeats the same value for each day of a month, so there's a lot of data redundancy. My question is: is there a more efficient way to store the same resolution of data in one table?
My immediate thought is to reshape the table to include a start date and end date to look like this:
| start_date | end_date | forecasted_units | forecasted_revenue |
|------------|------------|------------------|--------------------|
| 2020-01-01 | 2020-01-31 | 12 | 300 |
| 2020-02-01 | 2020-02-29 | 15 | 320 |
But that would offload all the computation to the instance generating all the reports because it would have to generate the data for each day in between the start and end date.
Is there a better way to do this?
Unfortunately, Redshift does not support the handy Postgres function generate_series(), which would have largely simplified the task here.
Typical alternative solutions involve a calendar table: basically, a table that enumerates all possible dates. If you have a table with a sufficient number of rows, you can generate such a dataset on the fly with row_number() and dateadd():
select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') dt
from my_large_table;
You can store the results in another table (using the create table ... as select ... syntax), or use the query result directly. In both cases, you would then join it with your actual table. To count the number of days in the month, we use a window count:
select
d.dt,
t.forecasted_units / count(*) over(partition by t.date) forecasted_units,
t.forecasted_revenue / count(*) over(partition by t.date) forecasted_revenue
from (
select dateadd(day, row_number() over(order by 1) - 1, '2020-01-01') dt
from my_large_table
) d
inner join mytable t on t.date = date_trunc('month', d.dt)
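As a quick way to check the arithmetic, the calendar-join idea can be sketched with Python's built-in sqlite3 module. This is only an illustration under a few assumptions: SQLite (3.25 or later) stands in for Redshift, a recursive CTE replaces the row_number() over a large table, strftime('%Y-%m', ...) replaces date_trunc, and the table name monthly and its month column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly (month TEXT, forecasted_units REAL, forecasted_revenue REAL);
INSERT INTO monthly VALUES ('2020-01', 372, 9300), ('2020-02', 435, 9280);
""")

rows = conn.execute("""
WITH RECURSIVE calendar(dt) AS (
    -- one row per day, 2020-01-01 through 2020-02-29
    SELECT '2020-01-01'
    UNION ALL
    SELECT date(dt, '+1 day') FROM calendar WHERE dt < '2020-02-29'
)
SELECT c.dt,
       -- the window count is the number of calendar days joined to each month
       m.forecasted_units   / COUNT(*) OVER (PARTITION BY m.month) AS forecasted_units,
       m.forecasted_revenue / COUNT(*) OVER (PARTITION BY m.month) AS forecasted_revenue
FROM calendar c
INNER JOIN monthly m ON m.month = strftime('%Y-%m', c.dt)
ORDER BY c.dt
""").fetchall()

print(rows[0])    # first day of January
print(rows[31])   # first day of February
```

The window count only equals the number of days in the month if the calendar covers every month completely, so the calendar range should start and end on month boundaries.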

SQLite: generating customer counts for a date range (months) using a normalized table

I have a sales funnel dataset in SQLite and each row represents a movement through the funnel. As there are quite a few ways a potential customer can move through the funnel (and possibly even go backwards), I wasn't planning on flattening/denormalizing the table. How could I calculate "the number of customers per month up to today"?
customer | opp_value | status_old | status_new | current_status | status_change_date | current_lead_status | lead_created_date
cust_8 | 22 | confirmed | paying | paying | 2020-01-01 | Customer | 2020-01-01
cust_9 | 23 | confirmed | paying | churned | 2020-01-03 | Customer | 2020-01-02
cust_9 | 23 | paying | churned | churned | 2020-03-24 | Customer | 2020-02-25
cust_13 | 30 | negotiation | lost | paying | 2020-04-03 | Lost | 2020-03-20
cust_14 | 45 | qualified | confirmed | paying | 2020-03-03 | Customer | 2020-02-28
cust_14 | 45 | confirmed | paying | paying | 2020-04-03 | Customer | 2020-02-28
... | ... | ... | ... | ... | ... | ... | ...
We're assuming we use end-of-month as the definition for whether a customer is still with us.
The result, with the above data should be:
month | customers
Jan-2020 | 2 (cust_8, cust_9)
Feb-2020 | 2 (cust_8, cust_9)
Mar-2020 | 1 (cust_8) # cust_9 churned
Apr-2020 | 2 (cust_8, cust_14)
May-2020 | 2 (cust_8, cust_14)
The part I'd really like to understand is how to create the month column, as I can't rely on the dates of status_change_date as there might be missing records. Would one have to manually generate that column? I know I can generate dates manually using:
WITH RECURSIVE cnt (
x
) AS (
SELECT 0
UNION ALL
SELECT x + 1
FROM cnt
LIMIT (
SELECT
ROUND(((julianday ('2020-05-01') - julianday ('2020-01-01')) / 30) + 1))
)
SELECT
date(julianday ('2020-01-01'), '+' || x || ' month') AS month
FROM cnt
but wondering if there is a better way? Would it possibly be easier to create a snapshot table and generate the current state of each customer for each date?
If you have the dates, you can use a brute-force method. This determines the most recent status for each customer for each date:
select dc.date,
sum(as_of_status = 'paying') as customers
from (select distinct d.date, t.customer,
first_value(status_new) over (partition by d.date, t.customer order by t.status_change_date desc) as as_of_status
from dates d join
t
on t.status_change_date <= d.date
) dc
group by dc.date
order by dc.date;
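Putting the two pieces together, here is a sketch that generates the month-end dates with a recursive CTE and then applies the as-of-status query, run through Python's sqlite3 module (SQLite 3.25 or later for the window function). The table name funnel, the column name month_end and the reduced set of columns are assumptions made for the example; the rows are the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE funnel (customer TEXT, status_old TEXT, status_new TEXT, status_change_date TEXT);
INSERT INTO funnel VALUES
    ('cust_8',  'confirmed',   'paying',    '2020-01-01'),
    ('cust_9',  'confirmed',   'paying',    '2020-01-03'),
    ('cust_9',  'paying',      'churned',   '2020-03-24'),
    ('cust_13', 'negotiation', 'lost',      '2020-04-03'),
    ('cust_14', 'qualified',   'confirmed', '2020-03-03'),
    ('cust_14', 'confirmed',   'paying',    '2020-04-03');
""")

rows = conn.execute("""
WITH RECURSIVE cnt(x) AS (
    SELECT 0 UNION ALL SELECT x + 1 FROM cnt WHERE x < 4
),
dates(month_end) AS (
    -- first day of the following month minus one day = last day of the month
    SELECT date('2020-01-01', '+' || (x + 1) || ' month', '-1 day') FROM cnt
),
dc AS (
    -- latest status of each customer as of each month end
    SELECT DISTINCT d.month_end, t.customer,
           first_value(t.status_new) OVER (
               PARTITION BY d.month_end, t.customer
               ORDER BY t.status_change_date DESC
           ) AS as_of_status
    FROM dates d
    JOIN funnel t ON t.status_change_date <= d.month_end
)
SELECT month_end, SUM(as_of_status = 'paying') AS customers
FROM dc
GROUP BY month_end
ORDER BY month_end
""").fetchall()

for row in rows:
    print(row)
```

With this data, cust_8 and cust_9 are both still paying at the end of February, cust_9 drops out in March, and cust_14 starts paying in April.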

Slicing account balance data in BigQuery to generate a debit report

I have a collection of account balances over time:
+-----------------+------------+-------------+-----------------------+
| account_balance | department | customer_id | timestamp |
+-----------------+------------+-------------+-----------------------+
| 5 | A | 1 | 2019-02-12T00:00:00 |
| -10 | A | 1 | 2019-02-13T00:00:00 |
| -35 | A | 1 | 2019-02-14T00:00:00 |
| 20 | A | 1 | 2019-02-15T00:00:00 |
+-----------------+------------+-------------+-----------------------+
Each record shows the total account balance of a customer at a specified timestamp. The account balance increases, e.g. from -35 to 20, when a customer tops up his account with 55. As a customer uses a service, his account balance decreases, e.g. from 5 to -10.
I want to aggregate this data in two ways:
1) Get the debit, credit and balance (credit-debit) of a department per month and year. The results from April should be a summary of all previous months:
+---------+--------+-------+------------+-------+--------+
| balance | credit | debit | department | month | year |
+---------+--------+-------+------------+-------+--------+
| 5 | 10 | -5 | A | 1 | 2019 |
| 20 | 32 | -12 | A | 2 | 2019 |
| 35 | 52 | -17 | A | 3 | 2019 |
| 51 | 70 | -19 | A | 4 | 2019 |
+---------+--------+-------+------------+-------+--------+
A customer's account balance might not change every month. There might be account balance records of customer 1 in February, but not March.
Notes towards the solution:
use EXTRACT(MONTH from timestamp) month
use EXTRACT(YEAR from timestamp) year
GROUP BY month, year, department
2) Get the change of debit, credit and balance of a department by date.
+---------+--------+-------+------------+-------------+
| balance | credit | debit | department | date |
+---------+--------+-------+------------+-------------+
| 5 | 10 | -5 | A | 2019-01-15 |
| 15 | 22 | -7 | A | 2019-02-15 |
| 15 | 20 | -5 | A | 2019-03-15 |
| 16 | 18 | -2 | A | 2019-04-15 |
+---------+--------+-------+------------+-------------+
Sum of the deltas: balance 51, credit 70, debit -19
When I create a SUM of the deltas, I should get the same values as the last row from results in 1).
Notes towards the solution:
use account_balance - LAG(account_balance) OVER(PARTITION BY department ORDER BY timestamp ASC) delta to compute deltas
Your question is unclear, but it sounds like you want to get the outstanding balance at any given point in time.
The following query does this for 1 point in time.
with calendar as (
select cast('2019-06-01' as timestamp) as balance_calc_ts
),
most_recent_balance as (
select customer_id, balance_calc_ts,max(timestamp) as most_recent_balance_ts
from <table>
cross join calendar
where timestamp < balance_calc_ts -- or <=
group by 1,2
)
select t.customer_id, t.account_balance, mrb.balance_calc_ts
from <table> t
inner join most_recent_balance mrb on t.customer_id = mrb.customer_id and t.timestamp = mrb.most_recent_balance_ts
If you need to calculate it at a series of points in time, you will need to modify the calendar CTE to return more dates. This is the beauty of CROSS JOINS in BQ!
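As a rough illustration of the multi-date variant, here is a sketch using Python's sqlite3 module as a stand-in for BigQuery. The table name balances, the column name ts and the two calendar timestamps are assumptions chosen for the example; the rows are the sample data from the question. The calendar CTE returns two points in time, and the final join goes back to the source table on the most recent timestamp found for each point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE balances (account_balance INTEGER, department TEXT, customer_id INTEGER, ts TEXT);
INSERT INTO balances VALUES
    (5,   'A', 1, '2019-02-12T00:00:00'),
    (-10, 'A', 1, '2019-02-13T00:00:00'),
    (-35, 'A', 1, '2019-02-14T00:00:00'),
    (20,  'A', 1, '2019-02-15T00:00:00');
""")

rows = conn.execute("""
WITH calendar(balance_calc_ts) AS (
    VALUES ('2019-02-14T12:00:00'), ('2019-06-01T00:00:00')
),
most_recent_balance AS (
    -- latest record per customer strictly before each calculation timestamp
    SELECT b.customer_id, c.balance_calc_ts, MAX(b.ts) AS most_recent_balance_ts
    FROM balances b
    CROSS JOIN calendar c
    WHERE b.ts < c.balance_calc_ts
    GROUP BY b.customer_id, c.balance_calc_ts
)
SELECT mrb.balance_calc_ts, b.customer_id, b.account_balance
FROM balances b
INNER JOIN most_recent_balance mrb
    ON b.customer_id = mrb.customer_id AND b.ts = mrb.most_recent_balance_ts
ORDER BY mrb.balance_calc_ts
""").fetchall()

for row in rows:
    print(row)
```

ISO-formatted timestamp strings compare correctly as text, which is what makes the plain string comparison work in this sketch; in BigQuery the column would be a real TIMESTAMP.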

SQL: Get an aggregate (SUM) of a calculation of two fields (DATEDIFF) that has conditional logic (CASE WHEN)

I have a dataset that includes a bunch of stay data (at a hotel). Each row contains a start date and an end date, but no duration field. I need to get a sum of the durations.
Sample Data:
| Stay ID | Client ID | Start Date | End Date |
| 1 | 38 | 01/01/2018 | 01/31/2019 |
| 2 | 16 | 01/03/2019 | 01/07/2019 |
| 3 | 27 | 01/10/2019 | 01/12/2019 |
| 4 | 27 | 05/15/2019 | NULL |
| 5 | 38 | 05/17/2019 | NULL |
There are some added complications:
I am using Crystal Reports and this is a SQL Expression, which obeys slightly different rules. Basically, it returns a single scalar value. Here is some more info: http://www.cogniza.com/wordpress/2005/11/07/crystal-reports-using-sql-expression-fields/
Sometimes, the end date field is blank (they haven't booked out yet). If blank, I would like to replace it with the current timestamp.
I only want to count nights that have occurred in the past year. If the start date of a given stay is more than a year ago, I need to adjust it.
I need to get a sum by Client ID
I'm not actually any good at SQL so all I have is guesswork.
The proper syntax for a Crystal Reports SQL Expression is something like this:
(
SELECT (CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END)
)
And that's giving me the correct value for a single row, if I wanted to do this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 210 | // only days since June 4 2018 are counted
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 2 |
| 4 | 27 | 05/15/2019 | NULL | 21 |
| 5 | 38 | 05/17/2019 | NULL | 19 |
But I want to get the SUM of Duration per client, so I want this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 229 | // 210+19
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 23 | // 2+21
| 4 | 27 | 05/15/2019 | NULL | 23 |
| 5 | 38 | 05/17/2019 | NULL | 229 |
I've tried to just wrap a SUM() around my CASE but that doesn't work:
(
SELECT SUM(CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END)
)
It gives me an error that the StayDateEnd is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. But I don't even know what that means, so I'm not sure how to troubleshoot, or where to go from here. And then the next step is to get the SUM by Client ID.
Any help would be greatly appreciated!
Although the explanation and data set are almost impossible to match, I think this is an approximation to what you want.
declare @your_data table (StayId int, ClientId int, StartDate date, EndDate date)
insert into @your_data values
(1,38,'2018-01-01','2019-01-31'),
(2,16,'2019-01-03','2019-01-07'),
(3,27,'2019-01-10','2019-01-12'),
(4,27,'2019-05-15',NULL),
(5,38,'2019-05-17',NULL)
;with data as (
select *,
datediff(day,
case
when datediff(day,StartDate,getdate())>365 then dateadd(year,-1,getdate())
else StartDate
end,
isnull(EndDate,getdate())
) days
from @your_data
)
select *,
sum(days) over (partition by ClientId) total_days
from data
https://rextester.com/HCKOR53440
You need a subquery that computes the sum grouped by client_id, and a join between your table and that subquery, e.g.:
select Stay_id, client_id, Start_date, End_date, t.sum_duration
from your_table
inner join (
select Client_id,
SUM(CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END) sum_duration
from your_table
group by Client_id
) t on t.Client_id = your_table.client_id

SQL Query question (count dates for each month)

Hope you can help me with the following. I have the following view available (dates are DD/MM/YYYY):
ENTITY | StartDate | EndDate | CodeA | CodeB | Revenue | Currency
AZERT | 01/01/2011 | 02/01/2011 | SU | BOLD | 100 | EUR
AZERT | 28/01/2011 | 02/02/2011 | SU | BOLD | 500 | EUR
Can someone help with a query to pull the data so that I get the following summed?
ENTITY | YYYY.MM | CodeA | CodeB | DAYS | TIMES | Revenue | Currency
AZERT | 2011.01 | SU | BOLD | 5 | 2 | 500 | EUR
AZERT | 2011.02 | SU | BOLD | 1 | 0 | 100 | EUR
Where YYYY.MM is derived from the months covered between StartDate and EndDate.
DAYS is the number of days between the start and end date that fall in that month.
TIMES is the number of times a StartDate occurs in that month.
Revenue is split according to how many days fall in each month.
Assuming you are using SQL Server 2005 or later, would this work?
SELECT
DATEADD(dd, -DAY(EndDate) + 1, EndDate) -- Get first day of month for EndDate
,CodeA
,CodeB
,SUM(DATEDIFF(dd,StartDate, EndDate)) AS 'DAYS'
FROM
TABLE1
GROUP BY
DATEADD(dd, -DAY(EndDate) + 1, EndDate)
,CodeA
,CodeB
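To make clear what that grouping does, here is a small Python sketch of the first-day-of-month expression applied to the two sample rows. The helper first_of_month is a hypothetical name for this illustration, and the dates are read as DD/MM/YYYY as stated in the question.

```python
from collections import defaultdict
from datetime import date, timedelta

def first_of_month(d):
    """Same idea as subtracting DAY(d) - 1 days from d: step back to day 1."""
    return d - timedelta(days=d.day - 1)

# (StartDate, EndDate) for the two AZERT rows in the question
stays = [
    (date(2011, 1, 1), date(2011, 1, 2)),
    (date(2011, 1, 28), date(2011, 2, 2)),
]

# Group by the month of EndDate and sum the day differences,
# mirroring the GROUP BY and SUM(DATEDIFF(...)) in the query.
days_by_month = defaultdict(int)
for start, end in stays:
    days_by_month[first_of_month(end)] += (end - start).days

for month, days in sorted(days_by_month.items()):
    print(month, days)
```

This attributes every stay entirely to the month of its EndDate (1 day for January, 5 for February), so it does not yet split a stay that crosses a month boundary the way the desired output does.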