Join grouping by periods, full join not working as intended - sql

I have a sales table:
SALES
|---------|-------------|-------------|
| order | ammount | date |
|---------|-------------|-------------|
| 001 | $2,000 | 2018-01-01 |
| 002 | $3,000 | 2018-01-01 |
| 003 | $1,500 | 2018-01-03 |
| 004 | $1,700 | 2018-01-04 |
| 005 | $1,800 | 2018-01-09 |
| 006 | $4,200 | 2018-01-11 |
|---------|-------------|-------------|
Additionally, I have a table that groups said sales according to arbitrary time periods:
BUDGET PERIODS
|---------|-------------|--------------|
| ID | start_date | end_date |
|---------|-------------|--------------|
| 1 | 2018-01-01 | 2018-01-02 | <- notice this is a 2 day period...
| 2 | 2018-01-03 | 2018-01-05 | <-- but this is 3 days
|---------|-------------|--------------|
So, my query result looked like this:
GROUPED SALES
|--------------|---------------|-----------------|
| start_date | end_date | ammount |
|--------------|---------------|-----------------|
| 2018-01-01 | 2018-01-02 | $5,000 |
| 2018-01-03 | 2018-01-05 | $3,200 |
|--------------|---------------|-----------------|
I accomplished it by a query as such:
SELECT
bp.start_date,
bp.end_date,
SUM(s.ammount)
FROM
budget_periods bp
LEFT JOIN
sales s ON s.date >= bp.start_date AND s.date <= bp.end_date
GROUP BY
start_date,
end_date
Everything is awesome, then. BUT I noticed that, of course, some sales are not included because they fall outside every budget period. Hence, I want to include them "somewhere". I decided that "somewhere" would be the week of the sale (using the week truncate function in Postgres). My grouped sales should now look like this:
GROUPED SALES
|--------------|---------------|-----------------|
| start_date | end_date | ammount |
|--------------|---------------|-----------------|
| 2018-01-01 | 2018-01-02 | $5,000 |
| 2018-01-03 | 2018-01-05 | $3,200 |
| 2018-01-08 | 2018-01-14 | $6,000 |
|--------------|---------------|-----------------|
Notice that if you truncate-to-week both 2018-01-09 and 2018-01-11, it shows 2018-01-08. To calculate my end_date, the budget period is "defaulted" to seven days, so it's six days later than the start_date.
So, I modified the query into a FULL JOIN like this:
SELECT
COALESCE(bp.start_date, DATE_TRUNC('WEEK', s.date)) AS new_start_date,
COALESCE(bp.end_date, DATE_TRUNC('WEEK', s.date) + INTERVAL '6 DAY') AS new_end_date,
SUM(s.ammount)
FROM
budget_periods bp
FULL JOIN
sales s ON s.date >= bp.start_date AND s.date <= bp.end_date
GROUP BY
new_start_date,
new_end_date
But then, the result table is the same as when I had a LEFT JOIN. How should I approach this?
Thank you for taking the time to read such a long explanation.

If you want all rows in sales then make that the first table in a LEFT JOIN. However, I think the FULL JOIN should work, as should this LEFT JOIN:
SELECT COALESCE(bp.start_date, DATE_TRUNC('WEEK', s.date)) as new_start_date,
COALESCE(bp.end_date, DATE_TRUNC('WEEK', s.date) + interval '6 day') as new_end_date,
SUM(s.ammount)
FROM sales s LEFT JOIN
budget_periods bp
ON s.date >= bp.start_date AND s.date <= bp.end_date
GROUP BY new_start_date, new_end_date;
The only reason things would be filtered out of the FULL JOIN is through a WHERE clause, but you don't have one.
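To check the sales-first LEFT JOIN against the sample data, here is a minimal sketch that runs it in an in-memory SQLite database from Python. Several things are assumptions made for the sketch and are not part of the question: SQLite stands in for Postgres, the `order` column is renamed `order_id`, amounts are stored as plain integers, and `date(s.date, 'weekday 0', '-6 days')` is used as a stand-in for `DATE_TRUNC('WEEK', s.date)` (it moves forward to Sunday, then back six days to Monday).

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (order_id TEXT, ammount INTEGER, date TEXT);
    CREATE TABLE budget_periods (id INTEGER, start_date TEXT, end_date TEXT);
    INSERT INTO sales VALUES
        ('001', 2000, '2018-01-01'), ('002', 3000, '2018-01-01'),
        ('003', 1500, '2018-01-03'), ('004', 1700, '2018-01-04'),
        ('005', 1800, '2018-01-09'), ('006', 4200, '2018-01-11');
    INSERT INTO budget_periods VALUES
        (1, '2018-01-01', '2018-01-02'),
        (2, '2018-01-03', '2018-01-05');
""")

# sales is the driving table, so every sale survives the join.
# Sales without a budget period fall back to their Monday-to-Sunday week.
grouped_sales = conn.execute("""
    SELECT COALESCE(bp.start_date, date(s.date, 'weekday 0', '-6 days')) AS new_start_date,
           COALESCE(bp.end_date,   date(s.date, 'weekday 0'))            AS new_end_date,
           SUM(s.ammount)                                                AS total
    FROM sales s
    LEFT JOIN budget_periods bp
           ON s.date >= bp.start_date AND s.date <= bp.end_date
    GROUP BY new_start_date, new_end_date
    ORDER BY new_start_date
""").fetchall()

for row in grouped_sales:
    print(row)
```

The two sales from 2018-01-09 and 2018-01-11 match no budget period, so both COALESCE calls fall through to the week of 2018-01-08.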

Related

Generating counts of open tickets over time, given opened and closed dates

I have a set of data for some tickets, with datetime of when they were opened and closed (or NULL if they are still open).
+------------------+------------------+
| opened_on | closed_on |
+------------------+------------------+
| 2019-09-01 17:00 | 2020-01-01 13:37 |
| 2020-04-14 11:00 | 2020-05-14 14:19 |
| 2020-03-09 10:00 | NULL |
+------------------+------------------+
We would like to generate a table of data showing the total count of tickets that were open through time, grouped by date. Something like the following:
+------------------+------------------+
| date | num_open |
+------------------+------------------+
| 2019-09-01 00:00 | 1 |
| 2019-09-02 00:00 | 1 |
| etc... | |
| 2020-01-01 00:00 | 0 |
| 2020-01-02 00:00 | 0 |
| etc... | |
| 2020-03-08 00:00 | 0 |
| 2020-03-09 00:00 | 1 |
| etc... | |
| 2020-04-14 00:00 | 2 |
+------------------+------------------+
Note that I am not sure about how the num_open is considered for a given date - should it be considered from the point of view of the end of the date or the start of it i.e. if one opened and closed on the same date, should that count as 0?
This is in Postgres, so I thought about using window functions for this, but trying to truncate by the date is making it complex. I have tried using a generate_series function to create the date series to join onto, but when I use the aggregate functions, I've "lost" access to the individual ticket datetimes.
You can use generate_series() to build the list of dates, and then a left join on inequality conditions to bring in the table:
select s.dt, count(t.opened_on) num_open
from generate_series(date '2019-09-01', date '2020-09-01', '1 day') s(dt)
left join mytable t
on s.dt >= t.opened_on and s.dt < coalesce(t.closed_on, 'infinity')
group by s.dt
Actually, this seems a bit closer to what you want:
select s.dt, count(t.opened_on) num_open
from generate_series(date '2019-09-01', date '2020-09-01', '1 day') s(dt)
left join mytable t
on s.dt >= t.opened_on::date and s.dt < coalesce(t.closed_on::date, 'infinity')
group by s.dt
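As a rough way to try the date-truncated variant on the three sample tickets, the sketch below runs an equivalent query in an in-memory SQLite database from Python. The substitutions are assumptions of the sketch: a recursive CTE replaces generate_series(), `date(...)` replaces the `::date` casts, and the far-future literal `'9999-12-31'` replaces `'infinity'`. The series is cut off at 2020-04-14 to match the question's sample output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (opened_on TEXT, closed_on TEXT);
    INSERT INTO mytable VALUES
        ('2019-09-01 17:00', '2020-01-01 13:37'),
        ('2020-04-14 11:00', '2020-05-14 14:19'),
        ('2020-03-09 10:00', NULL);
""")

rows = conn.execute("""
    WITH RECURSIVE s(dt) AS (          -- one row per calendar day
        SELECT '2019-09-01'
        UNION ALL
        SELECT date(dt, '+1 day') FROM s WHERE dt < '2020-04-14'
    )
    SELECT s.dt, COUNT(t.opened_on) AS num_open
    FROM s
    LEFT JOIN mytable t
           ON s.dt >= date(t.opened_on)
          AND s.dt <  COALESCE(date(t.closed_on), '9999-12-31')
    GROUP BY s.dt
    ORDER BY s.dt
""").fetchall()

num_open = dict(rows)
print(num_open['2019-09-01'], num_open['2020-01-01'], num_open['2020-04-14'])
```

With these conditions a ticket counts as open on the day it was opened and as closed on the day it was closed, so a ticket opened and closed on the same date contributes 0.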

SQLite: generating customer counts for a date range (months) using a normalized table

I have a sales funnel dataset in SQLite and each row represents a movement through the funnel. As there are quite a few ways a potential customer can move through the funnel (and possibly even go backwards), I wasn't planning on flattening/denormalizing the table. How could I calculate "the number of customers per month up to today"?
customer | opp_value | status_old | status_new | current_status | status_change_date | current_lead_status | lead_created_date
cust_8 | 22 | confirmed | paying | paying | 2020-01-01 | Customer | 2020-01-01
cust_9 | 23 | confirmed | paying | churned | 2020-01-03 | Customer | 2020-01-02
cust_9 | 23 | paying | churned | churned | 2020-03-24 | Customer | 2020-02-25
cust_13 | 30 | negotiation | lost | paying | 2020-04-03 | Lost | 2020-03-20
cust_14 | 45 | qualified | confirmed | paying | 2020-03-03 | Customer | 2020-02-28
cust_14 | 45 | confirmed | paying | paying | 2020-04-03 | Customer | 2020-02-28
... | ... | ... | ... | ... | ... | ... | ...
We're assuming we use end-of-month as definition for whether a customer is still with us.
The result, with the above data should be:
month | customers
Jan-2020 | 2 (cust_8, cust_9)
Feb-2020 | 2 (cust_8, cust_9)
Mar-2020 | 1 (cust_8) # cust_9 churned
Apr-2020 | 2 (cust_8, cust_14)
May-2020 | 2 (cust_8, cust_14)
The part I'd really like to understand is how to create the month column, as I can't rely on the dates of status_change_date as there might be missing records. Would one have to manually generate that column? I know I can generate dates manually using:
WITH RECURSIVE cnt (
x
) AS (
SELECT 0
UNION ALL
SELECT x + 1
FROM cnt
LIMIT (
SELECT
ROUND(((julianday ('2020-05-01') - julianday ('2020-01-01')) / 30) + 1))
)
SELECT
date(julianday ('2020-01-01'), '+' || x || ' month') AS month
FROM cnt
but I'm wondering if there is a better way. Would it possibly be easier to create a snapshot table and generate the current state of each customer for each date?
If you have the dates, you can use a brute-force method. This determines the most recent status for each customer for each date:
select dc.date,
sum(dc.as_of_status = 'paying')
from (select distinct d.date, t.customer,
first_value(t.status_new) over (partition by d.date, t.customer order by t.status_change_date desc) as as_of_status
from dates d join
t
on t.status_change_date <= d.date
) dc
group by dc.date
order by dc.date;
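The brute-force idea can be tried end to end with a small sketch in Python's `sqlite3` module. The table name `funnel`, the CTE names `months` and `month_ends`, and the hard-coded range January to May 2020 are assumptions for illustration. Only the three columns the query needs are loaded, and the window function requires SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE funnel (customer TEXT, status_new TEXT, status_change_date TEXT);
    INSERT INTO funnel VALUES
        ('cust_8',  'paying',    '2020-01-01'),
        ('cust_9',  'paying',    '2020-01-03'),
        ('cust_9',  'churned',   '2020-03-24'),
        ('cust_13', 'lost',      '2020-04-03'),
        ('cust_14', 'confirmed', '2020-03-03'),
        ('cust_14', 'paying',    '2020-04-03');
""")

customers_per_month = conn.execute("""
    WITH RECURSIVE months(m) AS (        -- first day of each month
        SELECT '2020-01-01'
        UNION ALL
        SELECT date(m, '+1 month') FROM months WHERE m < '2020-05-01'
    ),
    month_ends(month_end) AS (           -- last day of each month
        SELECT date(m, '+1 month', '-1 day') FROM months
    )
    SELECT dc.month_end, SUM(dc.as_of_status = 'paying') AS customers
    FROM (SELECT DISTINCT d.month_end, t.customer,
                 first_value(t.status_new) OVER (
                     PARTITION BY d.month_end, t.customer
                     ORDER BY t.status_change_date DESC) AS as_of_status
          FROM month_ends d
          JOIN funnel t ON t.status_change_date <= d.month_end) dc
    GROUP BY dc.month_end
    ORDER BY dc.month_end
""").fetchall()

for row in customers_per_month:
    print(row)
```

Generating month starts and deriving each month end from them avoids the day-clamping problems that come from adding months to an end-of-month date directly.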

How does DATEADD work when joining the same table with itself?

I have a table with monthly production values.
Example:
Outdate | Prod Value | ID
2/28/19 | 110 | 4180
3/31/19 | 100 | 4180
4/30/19 | 90 | 4180
I also have a table that has monthly forecast values.
Example:
Forecast End Date | Forecast Value | ID
2/28/19 | 120 | 4180
3/31/19 | 105 | 4180
4/30/19 | 80 | 4180
I want to create a table that has a row that contains the ID, the Prod Value, the current month (example: March) forecast, the previous month forecast, the next month forecast.
What I want:
ID | Prod Value | Outdate | Current Forecast | Previous Forecast | Next Forecast
4180 | 100 | 3/31/19 | 105 | 120 | 80
The problem is that when I used DATEADD to bring in the specific value from the Forecast table for the previous month, random months are missing from my final values.
I've tried adding in another LEFT JOIN / INNER JOIN with the DateDimension table when adding in the Next Month and Previous Month forecast, but that either does not solve the problem or adds in too many rows.
My DateDimension table has these columns: DateKey, Date, Day, DaySuffix, Weekday, WeekDayName, IsWeekend, IsHoliday, DOWInMonth, DayOfYear, WeekOfMonth, WeekOfYear, ISOWeekOfYear, Month, MonthName, Quarter, QuarterName, Year, MMYYYY, MonthYear, FirstDayOfMonth, LastDayOfMonth, FirstDayOfQuarter, LastDayOfQuarter, FirstDayOfYear, LastDayOfYear, FirstDayOfNextMonth, FirstDayOfNextYear
My query is along these lines (abbreviated for simplicity):
SELECT A.ArchiveKey, BH.ID, d.[Date], BH.Outdate, BH.ProdValue, BH.Forecast, BHP.Forecast, BHN.Forecast
FROM dbo.BudgetHistory bh
INNER JOIN dbo.DateDimension d ON bh.outdate = d.lastdayofmonth
INNER JOIN dbo.Archive a ON bh.ArchiveKey = a.ArchiveKey
LEFT JOIN dbo.BudgetHistory bhp ON bh.ID = bhp.ID AND bhp.outdate = DATEADD(m, - 1, bh.Outdate)
LEFT JOIN dbo.BudgetHistory bhn ON bh.ID = bhn.ID AND bhn.outdate = DATEADD(m, 1, bh.Outdate)
WHERE bh.ID IS NOT NULL
I get something like this:
+------+------------+---------+------------------+-------------------+---------------+
| ID | Prod Value | Outdate | Current Forecast | Previous Forecast | Next Forecast |
+------+------------+---------+------------------+-------------------+---------------+
| 4180 | 110 | 2/28/19 | 120 | NULL | NULL |
| 4180 | 100 | 3/31/19 | 105 | 120 | 80 |
| 4180 | 90 | 4/30/19 | 80 | NULL | NULL |
+------+------------+---------+------------------+-------------------+---------------+
And the pattern doesn't seem to follow anything reasonable.
I want the values to be filled in for each row.
You could join the tables, then use window functions LEAD() and LAG() to recover the next and previous forecast values:
SELECT
p.ID,
p.ProdValue,
p.Outdate,
f.ForecastValue,
LAG(f.ForecastValue) OVER(PARTITION BY f.ID ORDER BY f.ForecastEndDate) PreviousForecast,
LEAD(f.ForecastValue) OVER(PARTITION BY f.ID ORDER BY f.ForecastEndDate) NextForecast
FROM prod p
INNER JOIN forecast f ON p.ID = f.ID AND p.Outdate = f.ForecastEndDate
This demo on DB Fiddle with your sample data returns:
ID | ProdValue | Outdate | ForecastValue | PreviousForecast | NextForecast
---: | --------: | :------------------ | ------------: | ---------------: | -----------:
4180 | 110 | 28/02/2019 00:00:00 | 120 | null | 105
4180 | 100 | 31/03/2019 00:00:00 | 105 | 120 | 80
4180 | 90 | 30/04/2019 00:00:00 | 80 | 105 | null
DATEADD only does end of month adjustments if the newly calculated value isn't a valid date. So DATEADD(month,-1,'20190331') produces 28th February. But DATEADD(month,-1,'20190228') produces 28th January, not the 31st.
I would probably go with GMB's answer. If you want to do something DATEADD based though, you can use:
bhp.outdate = DATEADD(month, DATEDIFF(month,'20010131', bh.Outdate) ,'20001231')
This always works out the last day of the previous month from bh.OutDate, but it does it by computing it as an offset from a fixed date, and then applying that offset to a different fixed date.
You can just reverse the places of 20010131 and 20001231 to compute the month after rather than the month before. There's no significance about them other than them both having 31 days and having the "one month apart" relationship we're wishing to apply.
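The clamping behaviour and the fixed-anchor trick can be checked with a small Python model. The helpers `dateadd_month` and `datediff_month` are hypothetical stand-ins written for this sketch. They only imitate what DATEADD(month, ...) and DATEDIFF(month, ...) do for whole dates and are not SQL Server APIs.

```python
import calendar
from datetime import date

def dateadd_month(n: int, d: date) -> date:
    """Imitate DATEADD(month, n, d): shift the month, clamp the day only if it is invalid."""
    months = d.year * 12 + (d.month - 1) + n
    year, month_index = divmod(months, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def datediff_month(start: date, end: date) -> int:
    """Imitate DATEDIFF(month, start, end): number of month boundaries crossed."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def previous_month_end(d: date) -> date:
    # Offset from 2001-01-31, applied to 2000-12-31 (one month earlier).
    return dateadd_month(datediff_month(date(2001, 1, 31), d), date(2000, 12, 31))

def next_month_end(d: date) -> date:
    # Same idea with the two anchor dates swapped.
    return dateadd_month(datediff_month(date(2000, 12, 31), d), date(2001, 1, 31))

print(dateadd_month(-1, date(2019, 3, 31)))   # clamped to the end of February
print(dateadd_month(-1, date(2019, 2, 28)))   # stays on the 28th, so the join misses
print(previous_month_end(date(2019, 2, 28)))  # anchored version lands on the month end
```

Because both anchors are on day 31, the result is always clamped down to the real last day of the target month, which is why the anchored form matches month-end rows that the plain DATEADD(m, -1, ...) misses.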

Balance for each month by type

I have a table, in SQL-Server, with several records of input and output values with columns for type and date.
Something like that:
DATE |INPUT |OUTPUT |TYPE
2018-01-10 | 256.35| |A
2018-02-05 | | 35.00|B
2018-02-15 | 65.30| |A
2018-03-20 | 158.00| |B
2018-04-02 | | 63.32|B
2018-05-12 | | 128.12|A
2018-06-20 | | 7.35|B
I need help writing a query that returns the sum of inputs and outputs (as balance), per type, but it should return that sum at the end of each month, that is:
YEAR|MONTH|TYPE|BALANCE
2018| 1|A | 256.35
2018| 1|B | 0.00
2018| 2|A | 321.65
2018| 2|B | -35.00
2018| 3|A | 321.65
2018| 3|B | 123.00
2018| 4|A | 321.65
2018| 4|B | 59.68
2018| 5|A | 193.53
2018| 5|B | 59.68
2018| 6|A | 193.53
2018| 6|B | 52.33
2018| 7|A | 193.53
2018| 7|B | 52.33
Don't forget that the balance of each month is affected by the balance of the previous month, or in other words, the balance of each month is not only the movements of that month but of all the previous months also.
It should also be noted that it should include a record for each month of the year/type (up to the current date), even if a given month/type has no movements, starting at the first month/year of the oldest movement and ending at the current date (in this case July 2018).
Result achieved, there you go:
declare @min_month datetime=(select dateadd(month,datediff(month,0,min([DATE])),0) from _yourtable)
declare @max_month datetime=(select dateadd(month,datediff(month,0,max([DATE])),0) from _yourtable)
;WITH months(d) AS (
select @min_month
UNION ALL
SELECT dateadd(month,1,d) -- Recursion
FROM months
where dateadd(month,1,d)<=getdate()
)
select distinct
year(m.d) as YEAR,
month(m.d) as MONTH,
types.v as [TYPE]
,sum(isnull(t.[INPUT],0)-isnull(t.[OUTPUT],0)) over (partition by types.v order by m.d)
from months m
cross join (select distinct type from _yourtable)types(v)
left join _yourtable t on dateadd(month,datediff(month,0,t.[DATE]),0)=m.d and types.v=t.TYPE
order by YEAR, MONTH, [TYPE] -- with SELECT DISTINCT, ORDER BY items must appear in the select list
option(maxrecursion 0)
You can use the LAG function; the code below might help:
select year(date), month(date), type
, sum(input-output) + isnull(lag(sum(input-output),1,0) over(order by year(date), month(date), type), 0)
from test group by year(date), month(date), type
Assuming your source data is in the structure you have initially provided (ie: this is not the result of another query), this is a fairly straightforward transformation using a table of dates and a running total via an ordered sum.
If you already have a dates table, you can remove the first 2 ctes in this script:
declare @t table(DateValue date,InputAmount decimal(8,2),OutputAmount decimal(8,2),ProdType nvarchar(1));
insert into @t values
('2018-01-10',256.35,null,'A')
,('2018-02-05',null, 35.00,'B')
,('2018-02-15', 65.30,null,'A')
,('2018-03-20',158.00,null,'B')
,('2018-04-02',null, 63.32,'B')
,('2018-05-12',null,128.12,'A')
,('2018-06-20',null, 7.35,'B')
;
-- Min date can just be min date in the source table, but the max date should be the month end of the max date in the source table
declare @MinDate date = (select min(DateValue) from @t);
declare @MaxDate date = (select max(dateadd(day,-1,dateadd(month,datediff(month,0,DateValue)+1,0))) from @t);
with n(n) as (select * from (values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(t)) -- Using a tally table, build a table of dates
,d(d) as (select top(select datediff(day,@MinDate,@MaxDate)+1) dateadd(day,row_number() over (order by (select null))-1,@MinDate) from n n1,n n2,n n3, n n4)
,m as (select p.ProdType -- Then join to the source data to create a date value for each possible day for each product type
,d.d
,dateadd(day,-1,dateadd(month,datediff(month,0,d)+1,0)) as m -- And calculate a running total using a windowed aggregate
,sum(isnull(t.InputAmount,0) - isnull(t.OutputAmount,0)) over (partition by p.ProdType order by d.d) as RunningTotal
from d
cross join (select distinct ProdType
from @t
) as p
left join @t as t
on d.d = t.DateValue
and p.ProdType = t.ProdType
)
select m
,ProdType
,RunningTotal as Balance
from m
where m = d
order by m.d
,m.ProdType;
Output:
+-------------------------+----------+---------+
| m | ProdType | Balance |
+-------------------------+----------+---------+
| 2018-01-31 00:00:00.000 | A | 256.35 |
| 2018-01-31 00:00:00.000 | B | 0.00 |
| 2018-02-28 00:00:00.000 | A | 321.65 |
| 2018-02-28 00:00:00.000 | B | -35.00 |
| 2018-03-31 00:00:00.000 | A | 321.65 |
| 2018-03-31 00:00:00.000 | B | 123.00 |
| 2018-04-30 00:00:00.000 | A | 321.65 |
| 2018-04-30 00:00:00.000 | B | 59.68 |
| 2018-05-31 00:00:00.000 | A | 193.53 |
| 2018-05-31 00:00:00.000 | B | 59.68 |
| 2018-06-30 00:00:00.000 | A | 193.53 |
| 2018-06-30 00:00:00.000 | B | 52.33 |
+-------------------------+----------+---------+
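The month-by-type running balance can also be reproduced outside SQL Server as a sanity check. The sketch below uses Python's `sqlite3` module under several assumptions made only for illustration: the table and column names (`movements`, `input_cents`, `output_cents`) are invented, amounts are stored as integer cents to avoid floating-point noise, and the last month is hard-coded to July 2018 where the original uses getdate(). It needs SQLite 3.25 or newer for the window function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movements (date TEXT, input_cents INTEGER, output_cents INTEGER, type TEXT);
    INSERT INTO movements VALUES
        ('2018-01-10', 25635, NULL, 'A'),
        ('2018-02-05', NULL,  3500, 'B'),
        ('2018-02-15',  6530, NULL, 'A'),
        ('2018-03-20', 15800, NULL, 'B'),
        ('2018-04-02', NULL,  6332, 'B'),
        ('2018-05-12', NULL, 12812, 'A'),
        ('2018-06-20', NULL,   735, 'B');
""")

balances = conn.execute("""
    WITH RECURSIVE months(m) AS (        -- every month from the oldest movement on
        SELECT date(MIN(date), 'start of month') FROM movements
        UNION ALL
        SELECT date(m, '+1 month') FROM months WHERE m < '2018-07-01'
    ),
    monthly AS (                         -- net movement per month and type, 0 if none
        SELECT months.m AS m, types.type AS type,
               SUM(COALESCE(mv.input_cents, 0) - COALESCE(mv.output_cents, 0)) AS net
        FROM months
        CROSS JOIN (SELECT DISTINCT type FROM movements) AS types
        LEFT JOIN movements mv
               ON date(mv.date, 'start of month') = months.m
              AND mv.type = types.type
        GROUP BY months.m, types.type
    )
    SELECT m, type, SUM(net) OVER (PARTITION BY type ORDER BY m) AS balance_cents
    FROM monthly
    ORDER BY m, type
""").fetchall()

by_key = {(m, t): cents for m, t, cents in balances}
for m, t, cents in balances:
    print(m[:7], t, cents / 100)
```

Aggregating to one row per month and type before applying the windowed sum keeps the running total correct even when a month has several movements of the same type.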

How to group by week in postgresql

I've a database table commits with the following columns:
id | author_name | author_email | author_date (timestamp) | total_lines
Sample contents are:
1 | abc | abc@xyz.com | 2013-03-24 15:32:49 | 1234
2 | abc | abc@xyz.com | 2013-03-27 15:32:49 | 534
3 | abc | abc@xyz.com | 2014-05-24 15:32:49 | 2344
4 | abc | abc@xyz.com | 2014-05-28 15:32:49 | 7623
I want to get a result as follows:
id | name | week | commits
1 | abc | 1 | 2
2 | abc | 2 | 0
I searched online for similar solutions but couldn't find any helpful ones.
I tried this query:
SELECT date_part('week', author_date::date) AS weekly,
COUNT(author_email)
FROM commits
GROUP BY weekly
ORDER BY weekly
But it's not the right result.
If you have multiple years, you should take the year into account as well. One way is:
SELECT date_part('year', author_date::date) as year,
date_part('week', author_date::date) AS weekly,
COUNT(author_email)
FROM commits
GROUP BY year, weekly
ORDER BY year, weekly;
A more natural way to write this uses date_trunc():
SELECT date_trunc('week', author_date::date) AS weekly,
COUNT(author_email)
FROM commits
GROUP BY weekly
ORDER BY weekly;
If you also want a count for the intermediate weeks that have no commits/records, you can get it by providing a start_date and end_date to the generate_series() function:
SELECT t1.year_week week,
t2.commit_count
FROM (SELECT week,
To_char(week, 'IYYY-IW') year_week
FROM generate_series('2020-02-01 06:06:51.25+00'::DATE,
'2020-04-05 12:12:33.25+00'::
DATE, '1 week'::interval) AS week) t1
LEFT OUTER JOIN (SELECT To_char(author_date, 'IYYY-IW') year_week,
COUNT(author_email) commit_count
FROM commits
GROUP BY year_week) t2
ON t1.year_week = t2.year_week;
The output will be:
week | commit_count
----------+-------------
2020-05 | 2
2020-06 | NULL
2020-07 | 1
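For a quick check of the zero-filling idea against the first two sample commits from the question, here is a minimal sketch using Python's `sqlite3` module. It is not the Postgres query above. A recursive CTE stands in for generate_series(), `date(author_date, 'weekday 0', '-6 days')` stands in for truncating to the Monday of the week, and the week range is a hard-coded assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE commits (id INTEGER, author_name TEXT, author_email TEXT,
                          author_date TEXT, total_lines INTEGER);
    INSERT INTO commits VALUES
        (1, 'abc', 'abc@xyz.com', '2013-03-24 15:32:49', 1234),
        (2, 'abc', 'abc@xyz.com', '2013-03-27 15:32:49', 534);
""")

weekly = conn.execute("""
    WITH RECURSIVE weeks(week_start) AS (   -- one row per Monday in the range
        SELECT '2013-03-18'
        UNION ALL
        SELECT date(week_start, '+7 days') FROM weeks WHERE week_start < '2013-04-01'
    )
    SELECT w.week_start, COUNT(c.id) AS commit_count
    FROM weeks w
    LEFT JOIN commits c
           ON date(c.author_date, 'weekday 0', '-6 days') = w.week_start
    GROUP BY w.week_start
    ORDER BY w.week_start
""").fetchall()

for row in weekly:
    print(row)
```

Counting `c.id` inside the outer query yields 0 for empty weeks, whereas joining to a pre-aggregated subquery, as in the query above, yields NULL unless it is wrapped in COALESCE. Note that 2013-03-24 is a Sunday, so the two commits land in different Monday-based weeks.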