Distinct count for entire dataset, grouped by month


I am dealing with a sales order table (ORDER) that looks roughly like this (updated 2018/12/20 to be closer to my actual data set):
SOID SOLINEID INVOICEDATE SALESAMOUNT AC
5 1 2018-11-30 100.00 01
5 2 2018-12-05 50.00 02
4 1 2018-12-12 25.00 17
3 1 2017-12-31 75.00 03
3 2 2018-01-03 25.00 05
2 1 2017-11-25 100.00 17
2 2 2017-11-27 35.00 03
1 1 2017-11-20 15.00 08
1 2 2018-03-15 30.00 17
1 3 2018-04-03 200.00 05
I'm able to calculate the average sales by SOID and SOLINEID:
SELECT SUM(SALESAMOUNT) / COUNT(DISTINCT SOID) AS 'Total Sales per Order ($)',
SUM(SALESAMOUNT) / COUNT(SOLINEID) AS 'Total Sales per Line ($)'
FROM ORDER
This seems to provide a perfectly good answer, but I was then given an additional constraint, that this count be done by year and month. I thought I could simply add
GROUP BY YEAR(INVOICEDATE), MONTH(INVOICEDATE)
But this groups the rows by month first and then performs the COUNT(DISTINCT SOID) within each group. This becomes a problem with SOIDs that appear across multiple months, which is fairly common since we invoice upon shipment.
I want to get something like this:
Year Month Total Sales Per Order Total Sales Per Line
2018 11 0.00
The sore thumb sticking out is that I need some way of defining in which month and year an SOID will be aggregated if it spans across multiple ones; for that purpose, I'd use MAX(INVOICEDATE).
From there, however, I'm just not sure how to tackle this. WITH? A subquery? Something else? I would appreciate any help, even if it's just pointing in the right direction.

You should select YEAR() and MONTH() of INVOICEDATE and group by them:
SELECT YEAR(INVOICEDATE) year
, MONTH(INVOICEDATE) month
, SUM(SALESAMOUNT) / COUNT(DISTINCT SOID) AS 'Total Sales per Order ($)'
, SUM(SALESAMOUNT) / COUNT(SOLINEID) AS 'Total Sales per Line ($)'
FROM ORDER
GROUP BY YEAR(INVOICEDATE), MONTH(INVOICEDATE)

Here are the results, but the data sample does not have enough rows to show months...
SELECT
mDateYYYY,
mDateMM,
SUM(SALESAMOUNT) / COUNT(DISTINCT t1.SOID) AS 'Total Sales per Order ($)',
SUM(SALESAMOUNT) / COUNT(SOLINEID) AS 'Total Sales per Line ($)'
FROM DCORDER as t1
left join
(Select
SOID
,Year(max(INVOICEDATE)) as mDateYYYY
,Month(max(INVOICEDATE)) as mDateMM
From DCOrder
Group By SOID
) as t2
On t1.SOID = t2.SOID
Group by mDateYYYY, mDateMM
mDateYYYY mDateMM Total Sales per Order ($) Total Sales per Line ($)
2018 12 87.50 58.33
With the updated 12/20 data, I used a newer version of the query (not shown above, but still based on MAX(INVOICEDATE)) and excluded AC = 17:
YYYY MM Total Sales per Order ($) Total Sales per Line ($)
2017 11 35.00 35.00
2018 1 100.00 50.00
2018 4 215.00 107.50
2018 12 150.00 75.00
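The idea of assigning each SOID to the month of its MAX(INVOICEDATE) can be sketched end to end as follows. This is only a sketch under some assumptions: it runs on SQLite through Python's sqlite3 module rather than on the asker's engine, so strftime stands in for YEAR() and MONTH(); the table is named orders (a hypothetical name, since ORDER is a reserved word); and AC = 17 is excluded to line up with the figures posted above.

```python
import sqlite3

# Sample rows from the question: (SOID, SOLINEID, INVOICEDATE, SALESAMOUNT, AC)
rows = [
    (5, 1, "2018-11-30", 100.00, "01"),
    (5, 2, "2018-12-05", 50.00, "02"),
    (4, 1, "2018-12-12", 25.00, "17"),
    (3, 1, "2017-12-31", 75.00, "03"),
    (3, 2, "2018-01-03", 25.00, "05"),
    (2, 1, "2017-11-25", 100.00, "17"),
    (2, 2, "2017-11-27", 35.00, "03"),
    (1, 1, "2017-11-20", 15.00, "08"),
    (1, 2, "2018-03-15", 30.00, "17"),
    (1, 3, "2018-04-03", 200.00, "05"),
]

conn = sqlite3.connect(":memory:")
# "orders" is an assumed table name; the question calls the table ORDER.
conn.execute(
    "CREATE TABLE orders (SOID INTEGER, SOLINEID INTEGER, "
    "INVOICEDATE TEXT, SALESAMOUNT REAL, AC TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", rows)

query = """
WITH kept AS (
    -- Drop the excluded lines once, so both steps see the same rows.
    SELECT * FROM orders WHERE AC <> '17'
),
order_month AS (
    -- One row per order: the year and month of its latest invoice date.
    SELECT SOID,
           CAST(strftime('%Y', MAX(INVOICEDATE)) AS INTEGER) AS yyyy,
           CAST(strftime('%m', MAX(INVOICEDATE)) AS INTEGER) AS mm
    FROM kept
    GROUP BY SOID
)
SELECT m.yyyy, m.mm,
       ROUND(SUM(k.SALESAMOUNT) / COUNT(DISTINCT k.SOID), 2) AS per_order,
       ROUND(SUM(k.SALESAMOUNT) / COUNT(k.SOLINEID), 2) AS per_line
FROM kept AS k
JOIN order_month AS m ON m.SOID = k.SOID
GROUP BY m.yyyy, m.mm
ORDER BY m.yyyy, m.mm
"""
monthly_sales = conn.execute(query).fetchall()
for row in monthly_sales:
    print(row)
```

Because every line of an order is joined to that order's single (year, month) pair, COUNT(DISTINCT SOID) can no longer count the same order in two different months.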

Related

Showing Two Fields With Different Timeline in the Same Date Structure

In the project I am currently working on in my company, I would like to show sales related KPIs together with Customer Score metric on SQL / Tableau / BigQuery
The primary key is order id in both tables. However, the order date and the date we measure Customer Score may be different. For example, the sales information for an order that is released in Feb 2020 will be aggregated in Feb 2020; however, if the customer survey is made in March 2020, the Customer Score metric must be aggregated in March 2020. What I would like to achieve in the relational database is as follows:
Sales:
Order ID | Order Date(m/d/yyyy) | Sales ($)
1000 | 1/1/2021 | 1000
1001 | 2/1/2021 | 2000
1002 | 3/1/2021 | 1500
1003 | 4/1/2021 | 1700
1004 | 5/1/2021 | 1800
1005 | 6/1/2021 | 900
1006 | 7/1/2021 | 1600
1007 | 8/1/2021 | 1900
Customer Score Table:
Order ID | Customer Survey Date(m/d/yyyy) | Customer Score
1000 | 3/1/2021 | 8
1001 | 3/1/2021 | 7
1002 | 4/1/2021 | 3
1003 | 6/1/2021 | 6
1004 | 6/1/2021 | 5
1005 | 7/1/2021 | 3
1006 | 9/1/2021 | 1
1007 | 8/1/2021 | 7
Expected Output:
KPI | Jan-21 | Feb-21 | Mar-21 | Apr-21 | May-21 | June-21 | July-21 | Aug-21 | Sep-21
Sales($) | 1000 | 2000 | 1500 | 1700 | 1800 | 900 | 1600 | 1900 |
AVG Customer Score | | | 7.5 | 3 | | 5.5 | 3 | 7 | 1
I couldn't find a way to do this, because order date and survey date may/may not be the same.

I think what you want to do is aggregate your results to the month (KPI) first before joining, as opposed to joining on the ORDER_ID
For example:
with order_month as (
select date_trunc(order_date, MONTH) as KPI, sum(sales) as sales
from `testing.sales`
group by 1
),
customer_score_month as (
select date_trunc(customer_survey_date, MONTH) as KPI, avg(customer_score) as avg_customer_score
from `testing.customer_score`
group by 1
)
select coalesce(order_month.KPI,customer_score_month.KPI) as KPI, sales, avg_customer_score
from order_month
full outer join customer_score_month
on order_month.KPI = customer_score_month.KPI
order by 1 asc
Here, we aggregate the total sales for each month based on the order date, then we aggregate the average customer score for each month based on the date the score was submitted. Now we can join these two on the month value.
This results in a table like this:
KPI | sales | avg_customer_score
2021-01-01 | 1000 | null
2021-02-01 | 2000 | null
2021-03-01 | 1500 | 7.5
2021-04-01 | 1700 | 3.0
2021-05-01 | 1800 | null
2021-06-01 | 900 | 5.5
2021-07-01 | 1600 | 3.0
2021-08-01 | 1900 | 7.0
2021-09-01 | null | 1.0
You can pivot the results of this table in Tableau, or leverage a case statement to pull out each month into its own column - I can elaborate more if that will be helpful
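For anyone who wants to try the aggregate-first idea locally, here is a rough sketch on SQLite via Python's sqlite3 module. It is not the BigQuery query itself: strftime('%Y-%m', ...) stands in for date_trunc, the column names (order_date, survey_date, score) are assumed, the dates are stored as ISO strings, and the full outer join is emulated with UNION ALL plus a GROUP BY because older SQLite versions do not support FULL OUTER JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, order_date TEXT, sales INTEGER)")
conn.execute(
    "CREATE TABLE customer_score (order_id INTEGER, survey_date TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1000, "2021-01-01", 1000), (1001, "2021-02-01", 2000),
    (1002, "2021-03-01", 1500), (1003, "2021-04-01", 1700),
    (1004, "2021-05-01", 1800), (1005, "2021-06-01", 900),
    (1006, "2021-07-01", 1600), (1007, "2021-08-01", 1900),
])
conn.executemany("INSERT INTO customer_score VALUES (?, ?, ?)", [
    (1000, "2021-03-01", 8), (1001, "2021-03-01", 7),
    (1002, "2021-04-01", 3), (1003, "2021-06-01", 6),
    (1004, "2021-06-01", 5), (1005, "2021-07-01", 3),
    (1006, "2021-09-01", 1), (1007, "2021-08-01", 7),
])

query = """
WITH order_month AS (
    -- Sales are bucketed by the month of the order date.
    SELECT strftime('%Y-%m', order_date) AS kpi, SUM(sales) AS sales
    FROM sales
    GROUP BY kpi
),
score_month AS (
    -- Scores are bucketed by the month of the survey date.
    SELECT strftime('%Y-%m', survey_date) AS kpi, AVG(score) AS avg_score
    FROM customer_score
    GROUP BY kpi
),
combined AS (
    -- Stand-in for FULL OUTER JOIN: stack both results, then collapse by month.
    SELECT kpi, sales, NULL AS avg_score FROM order_month
    UNION ALL
    SELECT kpi, NULL, avg_score FROM score_month
)
SELECT kpi, MAX(sales) AS sales, MAX(avg_score) AS avg_score
FROM combined
GROUP BY kpi
ORDER BY kpi
"""
kpi_rows = conn.execute(query).fetchall()
for row in kpi_rows:
    print(row)
```

Each month appears once, with None where only one of the two metrics exists for that month (for example, September has a score but no sales).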

Include "0" results in COUNT(*) aggregate

Good morning. I've searched the forum for an answer to a question I have, but the results I found didn't give me a solution.
I have two tables.
CARS:
Id Model
1 Seat
2 Audi
3 Mercedes
4 Ford
BREAKDOWNS:
IdBd Description Date Price IdCar
1 Engine 01/01/2020 500 € 3
2 Battery 05/01/2020 0 € 1
3 Wheel's change 10/02/2020 110,25 € 4
4 Electronic system 15/03/2020 100 € 2
5 Brake failure 20/05/2020 0 € 4
6 Engine 25/05/2020 400 € 1
I want to write a query that shows, for each month, the number of breakdowns with a cost of 0 €.
I have this query:
SELECT Year(breakdowns.[Date]) AS YEAR, StrConv(MonthName(Month(breakdowns.[Date])),3) AS MONTH, Count(*) AS [BREAKDOWNS]
FROM cars LEFT JOIN breakdowns ON (cars.Id = breakdowns.IdCar AND breakdowns.[Price]=0)
GROUP BY breakdowns.[Price], Year(breakdowns.[Date]), Month(breakdowns.[Date]), MonthName(Month(breakdowns.[Date]))
HAVING ((Year([breakdowns].[Date]))=[Insert a year:])
ORDER BY Year(breakdowns.[Date]), Month(breakdowns.[Date]);
And the result is (if I put year '2020'):
YEAR MONTH BREAKDOWNS
2020 January 1
2020 May 1
And I want:
YEAR MONTH BREAKDOWNS
2020 January 1
2020 February 0
2020 March 0
2020 May 1
Thanks!

The HAVING condition should be in WHERE (otherwise it changes the Outer to an Inner join). But as long as you don't use columns from cars there's no need to join it.
To get rows for months without a zero price you should switch to conditional aggregation. Access doesn't support Standard SQL CASE, so use IIf instead. Price must also be removed from the GROUP BY, otherwise each month is split into one row per price:
SELECT Year(breakdowns.[Date]) AS YEAR,
StrConv(MonthName(Month(breakdowns.[Date])),3) AS MONTH,
Sum(IIf(breakdowns.[Price]=0, 1, 0)) AS [BREAKDOWNS]
FROM breakdowns
INNER JOIN cars
ON (cars.Id = breakdowns.IdCar)
WHERE ((Year([breakdowns].[Date]))=[Insert a year:])
GROUP BY Year(breakdowns.[Date]), Month(breakdowns.[Date]), MonthName(Month(breakdowns.[Date]))
ORDER BY Year(breakdowns.[Date]), Month(breakdowns.[Date]);
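To see conditional aggregation produce the zero rows, here is a small sketch on SQLite through Python's sqlite3 module. It is not Access SQL: SQLite does support CASE, strftime stands in for Year() and Month(), and the column names (bd_date, price, id_car) and ISO date strings are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE breakdowns (id_bd INTEGER, description TEXT, "
    "bd_date TEXT, price REAL, id_car INTEGER)"
)
# Rows from the question, with dd/mm/yyyy dates rewritten as ISO strings.
conn.executemany("INSERT INTO breakdowns VALUES (?, ?, ?, ?, ?)", [
    (1, "Engine", "2020-01-01", 500.0, 3),
    (2, "Battery", "2020-01-05", 0.0, 1),
    (3, "Wheel's change", "2020-02-10", 110.25, 4),
    (4, "Electronic system", "2020-03-15", 100.0, 2),
    (5, "Brake failure", "2020-05-20", 0.0, 4),
    (6, "Engine", "2020-05-25", 400.0, 1),
])

query = """
SELECT strftime('%Y', bd_date) AS yr,
       strftime('%m', bd_date) AS mon,
       -- Every row of the month is kept; only zero-price rows add 1.
       SUM(CASE WHEN price = 0 THEN 1 ELSE 0 END) AS free_breakdowns
FROM breakdowns
WHERE strftime('%Y', bd_date) = ?
GROUP BY yr, mon
ORDER BY yr, mon
"""
free_by_month = conn.execute(query, ("2020",)).fetchall()
for row in free_by_month:
    print(row)
```

February and March show up with a count of 0 because the filter on price moved out of the join condition and into the aggregate. Months with no breakdowns at all (April here) still do not appear; that would need a separate calendar table.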

Calculate Churn by aggregating by date range in SQL

I am trying to calculate the churn rate from data that has customer_id, group, and date. The aggregation is going to be by id, group and date. The churn formula is (customers in previous cohort - customers in last cohort) / customers in previous cohort, where:
"customers in previous cohort" refers to the cohort in the 28 days before the last 28 days
"customers in last cohort" refers to the cohort in the last 28 days
I am not sure how to aggregate them by date range to calculate the churn.
Here is sample data that I copied from SQL Group by Date Range:
Date Group Customer_id
2014-03-01 A 1
2014-04-02 A 2
2014-04-03 A 3
2014-05-04 A 3
2014-05-05 A 6
2015-08-06 A 1
2015-08-07 A 2
2014-08-29 XXXX 2
2014-08-09 XXXX 3
2014-08-10 BB 4
2014-08-11 CCC 3
2015-08-12 CCC 2
2015-03-13 CCC 3
2014-04-14 CCC 5
2014-04-19 CCC 4
2014-08-16 CCC 5
2014-08-17 CCC 3
2014-08-18 XXXX 2
2015-01-10 XXXX 3
2015-01-20 XXXX 4
2014-08-21 XXXX 5
2014-08-22 XXXX 2
2014-01-23 XXXX 3
2014-08-24 XXXX 2
2014-02-25 XXXX 3
2014-08-26 XXXX 2
2014-06-27 XXXX 4
2014-08-28 XXXX 1
2014-08-29 XXXX 1
2015-08-30 XXXX 2
2015-09-31 XXXX 3
The goal is to calculate the churn rate every 28 days in between 2014 and 2015 by the formula given above. So, it is going to be aggregating the data by rolling it by 28 days and calculating the churn by the formula.
Here is what I tried to aggregate the data by date range:
SELECT COUNT(distinct customer_id) AS count_ids, Group,
DATE_SUB(CAST(Date AS DATE), INTERVAL 56 DAY) AS Date_min,
DATE_SUB(CURRENT_DATE, INTERVAL 28 DAY) AS Date_max
FROM churn_agg
GROUP BY count_ids, Group, Date_min, Date_max
I hope someone can help me with the aggregation and the churn calculation. I simply want to subtract each aggregated count_ids from the aggregated count_ids of the adjacent 28-day period, so this is a successive subtraction over the same column (count_ids). I am not sure whether I have to use a rolling window or a simple aggregation to find the churn.

As corrected by @jarlh, the date should be 2015-09-30, not 2015-09-31.
You can use this to create a 28-day calendar:
create table daysby28 (i int, _Date date);
insert into daysby28 (i, _Date)
SELECT i, cast('2014-01-01' as date) + i * INTERVAL '28 day'
from generate_series(0,50) i
order by 1;
After creating the churn_agg table as in @jarlh's fiddle, this query gives you what you want:
with cte as
(
select count(Customer) as TotalCustomer, Cohort, CohortDateStart From
(
select distinct a.Customer_id as Customer, b.i as Cohort, b._Date as CohortDateStart
from churn_agg a left join daysby28 b on a._Date >= b._Date and a._Date < b._Date + INTERVAL '28 day'
) a
group by Cohort, CohortDateStart
)
select a.CohortDateStart,
1.0*(b.TotalCustomer - a.TotalCustomer)/(1.0*b.TotalCustomer) as Churn from cte a
left join cte b on a.cohort > b.cohort
and not exists(select 1 from cte c where c.cohort > b.cohort and c.cohort < a.cohort)
order by 1
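A compact variant of the same idea, sketched on SQLite (3.25 or newer, for window functions) through Python's sqlite3 module: instead of a calendar table, each row gets a 28-day cohort number computed from its distance to a start date, and LAG fetches the previous cohort's count. The data below is made up for illustration, the column name event_date and the start date 2014-01-01 are assumptions, and the Group column is ignored, as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn_agg (event_date TEXT, customer_id INTEGER)")
# Hypothetical rows: 4 distinct customers, then 2, then 3.
conn.executemany("INSERT INTO churn_agg VALUES (?, ?)", [
    ("2014-01-01", 1), ("2014-01-05", 2), ("2014-01-10", 3),
    ("2014-01-15", 1), ("2014-01-28", 4),
    ("2014-01-29", 1), ("2014-02-10", 2), ("2014-02-25", 1),
    ("2014-02-26", 1), ("2014-03-01", 5), ("2014-03-25", 6),
])

query = """
WITH cohorts AS (
    -- Cohort 0 covers days 0-27 after the start date, cohort 1 days 28-55, ...
    SELECT CAST((julianday(event_date) - julianday('2014-01-01')) / 28
                AS INTEGER) AS cohort,
           COUNT(DISTINCT customer_id) AS customers
    FROM churn_agg
    GROUP BY cohort
)
SELECT date('2014-01-01', '+' || (cohort * 28) || ' days') AS cohort_start,
       customers,
       -- (previous - current) / previous; NULL for the first cohort.
       1.0 * (LAG(customers) OVER (ORDER BY cohort) - customers)
           / LAG(customers) OVER (ORDER BY cohort) AS churn
FROM cohorts
ORDER BY cohort
"""
churn_rows = conn.execute(query).fetchall()
for row in churn_rows:
    print(row)
```

A negative churn value means the cohort grew compared with the previous one. Like the NOT EXISTS condition above, LAG compares against the nearest earlier cohort that has data, so empty 28-day periods are skipped rather than treated as zero.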

Last 3 months average next to current month value in hive

I have a table which has the monthly sales values for each of the items. I need last 3 months average sales value next to the current month sales for each item.
Need to perform this operation in hive.
The sample input table looks like below
Item_ID Sales Month
A 4295 Dec-2018
A 245 Nov-2018
A 1337 Oct-2018
A 3290 Sep-2018
A 2000 Aug-2018
B 856 Dec-2018
B 1694 Nov-2018
B 4286 Oct-2018
B 2780 Sep-2018
B 3100 Aug-2018
The result table should look like this
Item_ID Sales_Current_Month Month Sales_Last_3_months_average
A 4295 Dec-2018 1624
A 245 Nov-2018 2209
B 856 Dec-2018 2920
B 1694 Nov-2018 3388.67

Assuming no months are missing from the data, you can use the avg window function to do this.
select t.*
,avg(sales) over(partition by item_id order by month rows between 3 preceding and 1 preceding) as avg_sales_prev_3_months
from tbl t
If month column is in a format different from yyyyMM, use an appropriate conversion so the ordering works as expected.
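The window frame can be checked against the sample data with a quick sketch. This runs on SQLite (3.25 or newer) through Python's sqlite3 module rather than on Hive, and it assumes the month has already been converted to a sortable yyyyMM string; the table name monthly_sales is made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (item_id TEXT, sales INTEGER, month TEXT)")
# Sample rows from the question, with Month rewritten as yyyyMM.
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", [
    ("A", 4295, "201812"), ("A", 245, "201811"), ("A", 1337, "201810"),
    ("A", 3290, "201809"), ("A", 2000, "201808"),
    ("B", 856, "201812"), ("B", 1694, "201811"), ("B", 4286, "201810"),
    ("B", 2780, "201809"), ("B", 3100, "201808"),
])

query = """
SELECT item_id, month, sales,
       -- Frame = the three rows before the current one, current row excluded.
       ROUND(AVG(sales) OVER (
           PARTITION BY item_id
           ORDER BY month
           ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
       ), 2) AS avg_prev_3
FROM monthly_sales
ORDER BY item_id, month DESC
"""
avg_by_item_month = {
    (item_id, month): avg_prev_3
    for item_id, month, _sales, avg_prev_3 in conn.execute(query)
}
print(avg_by_item_month[("A", "201812")])
print(avg_by_item_month[("B", "201811")])
```

Rows with fewer than three earlier months still get a value averaged over whatever is available, and the earliest month of each item gets NULL, so filter those out if only full three-month averages are wanted.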

SQL Creating a cumulative sum column in a table by a specific order

I apologize for the confusing title. I am dealing with an issue this morning that I thought I solved with everyone's help here but I can't do what I originally had hoped with just the master_line_num. Once again, below is a small subset of the data I am working with:
ID Proj_Id Year Quarter Value **Cumu_Value** Master_Line_Num
1 "C102" 2017 1 200.00 **200.00** 1
2 "C102" 2017 2 200.00 **400.00** 2
3 "C102" 2017 3 200.00 **600.00** 3
4 "C102" 2017 4 200.00 **800.00** 4
5 "C102" 2018 1 400.00 **1200.00** 5
6 "C102" 2018 2 400.00 **1600.00** 6
7 "C102" 2018 3 400.00 **2000.00** 7
8 "C102" 2018 4 400.00 **2400.00** 8
9 "B123" 2017 1 100.00 **100.00** 1
10 "B123" 2017 2 100.00 **200.00** 2
11 "B123" 2017 3 100.00 **300.00** 3
12 "B123" 2017 4 100.00 **400.00** 4
13 "B123" 2018 1 200.00 **600.00** 5
14 "B123" 2018 2 200.00 **800.00** 6
15 "B123" 2018 3 200.00 **1000.00** 7
16 "B123" 2018 4 200.00 **1200.00** 8
The desired values I am trying to get is the "Cumu_Value" column. I am trying to get those values by adding up the "value" column by year, by quarter for a specific "Proj_Id". I originally just tried to multiply the "value" column by the master_line_num column after getting that but then realized that it doesn't work due to the "value" column changing between years.
Is it possible to calculate this with T-SQL or do I need to do something more extravagant?

SQL supports the cumulative sum as a window function, so this is easy to express:
select . . . ,
sum(value) over (partition by proj_id order by year, quarter) as cumulative_sum

You need a Windowed Aggregate, this will return a Cumulative Sum:
sum(value)
over (partition by proj_id
order by Year, Quarter
rows unbounded preceding)
Caution: don't use (partition by proj_id order by Year, Quarter) without the ROWS, as it defaults to RANGE, which might return a different result and has much more overhead. RANGE includes all rows with the same ORDER BY value as the current row. If the ORDER BY were not unique within a project, for example ORDER BY Year alone, it would return:
800
800
800
800
2400
2400
2400
2400
Edit:
After checking your other question I noticed that you don't have a Master_Line_Num in your data, so you better use ORDER BY Year, Quarter instead.
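The difference between the two frame types only shows up when the ORDER BY has ties. Here is a small sketch of that, run on SQLite (3.25 or newer) through Python's sqlite3 module rather than on SQL Server; the table name proj_values is made up, and only the C102 rows from the question are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proj_values (proj_id TEXT, year INTEGER, quarter INTEGER, value REAL)"
)
conn.executemany("INSERT INTO proj_values VALUES (?, ?, ?, ?)", [
    ("C102", 2017, 1, 200.0), ("C102", 2017, 2, 200.0),
    ("C102", 2017, 3, 200.0), ("C102", 2017, 4, 200.0),
    ("C102", 2018, 1, 400.0), ("C102", 2018, 2, 400.0),
    ("C102", 2018, 3, 400.0), ("C102", 2018, 4, 400.0),
])

query = """
SELECT year, quarter,
       -- Explicit ROWS frame: a true row-by-row running total.
       SUM(value) OVER (
           PARTITION BY proj_id ORDER BY year, quarter
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS cumu_rows,
       -- No frame given: defaults to RANGE, so all rows of the same
       -- year are peers and receive the same total.
       SUM(value) OVER (
           PARTITION BY proj_id ORDER BY year
       ) AS cumu_range_by_year
FROM proj_values
ORDER BY proj_id, year, quarter
"""
cumulative = conn.execute(query).fetchall()
for row in cumulative:
    print(row)
```

With the unique ordering (year, quarter) the running total climbs one row at a time, while the year-only ordering with the default frame repeats 800 and 2400 for every quarter of the year.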

You can try something like this:
select t1.id, t1.proj_ID, t1.Year, t1.Value, SUM(t2.Value) as Cumu_sum, t1.Master_Line_Num
from #tablename t1
inner join #tablename t2 on t1.proj_ID = t2.proj_ID and t1.id >= t2.id
group by t1.id, t1.proj_ID, t1.Year, t1.Value, t1.Master_Line_Num
order by t1.id
