Unify two tables with different number of rows and no common id - sql

I want to merge these two tables into one. The idea is to create a table by month and show: month, number of unique customers who bought, number of invoices, number of products, total income, total income from product A
I'm having trouble adding the total income from product A per month since the table has two rows while the other results have four.
Example of table:
CustomerID InvoiceID ProductId Date Income
1 101 A 1/11/2016 600
2 103 B 12/10/2015 300
My query so far:
SELECT
MONTH(date) AS month,
COUNT (DISTINCT customerId) AS numOfCustomers,
SUM(income) AS sumOfIncome,
COUNT(invoiceId) AS numOfInvoice,
COUNT(productId) AS numOfProduct
FROM
x
WHERE
YEAR(date) = 2016
GROUP BY
MONTH(date)
SELECT
MONTH(date) AS month,
SUM(income) AS sumOfIncomeA
FROM
x
WHERE
(productId) = 'A'
AND YEAR(date) = 2016
GROUP BY
MONTH(date)

Here's a solution that first creates a big list of months. You can modify the "months" CTE to go back as far as you need. By default, this query will go back 83 years from today. After you have a good list of months, then you can join your data to it so that you are guaranteed to have all the months, and only sales data if present.
--First CTE "digits" is used to create a sequence of 10 numbers.
--(It must not be named "x", or it would shadow the data table "x" joined in the main select.)
WITH digits as (
SELECT * FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) as d(a)
)
--Second CTE "y" creates a sequence of 1000 numbers.
, y as (
SELECT ROW_NUMBER() OVER(ORDER BY hundreds.a, tens.a, ones.a) as row_num
FROM digits as ones, digits as tens, digits as hundreds
)
--Third CTE "months" creates a sequence of months going back in time from today.
--To go farther back than 1000 months, modify the "y" CTE to have a "thousands" (or more) table(s).
, months as (
SELECT
YEAR(DATEADD(month, -1 * y.row_num, GETDATE())) as [year]
, MONTH(DATEADD(month, -1 * y.row_num, GETDATE())) as [month]
, CAST(YEAR(DATEADD(month, -1 * y.row_num, GETDATE())) as nvarchar(6))
+ RIGHT('00' + CAST(MONTH(DATEADD(month, -1 * y.row_num, GETDATE())) as nvarchar(6)),2) as YEAR_MONTH
FROM y
)
--Main select.
--First FROM is a list of months so that we know for a fact we have all the months in the year.
--Then do a LEFT OUTER JOIN to your main data. All months will be returned.
--If there is no match in the data table, then the value will be null.
--You can use an ISNULL(SUM(x.income),0) to convert nulls to 0.
SELECT
m.[month] AS month,
COUNT (DISTINCT x.customerId) AS numOfCustomers,
SUM(x.income) AS sumOfIncome,
COUNT(x.invoiceId) AS numOfInvoice,
COUNT(x.productId) AS numOfProduct
FROM months as m
LEFT OUTER JOIN x
ON YEAR(x.[date]) = m.[year]
AND MONTH(x.[date]) = m.[month]
--Filter on the months list, not on x, so the LEFT JOIN still returns months with no data.
WHERE
m.[year] = 2016
GROUP BY
m.[month]

As I wrote in the comment, either result set may lack a month that the other has, so you need a FULL OUTER JOIN.
The COALESCE on month is needed because either side's month can be NULL.
SELECT
COALESCE(t1.month,t2.month) AS month,
t1.numOfCustomers,t1.sumOfIncome,t1.numOfInvoice,t1.numOfProduct
,t2.sumOfIncomeA
FROM
(SELECT
MONTH(date) AS month,
COUNT (DISTINCT customerId) AS numOfCustomers,
SUM(income) AS sumOfIncome,
COUNT(invoiceId) AS numOfInvoice,
COUNT(productId) AS numOfProduct
FROM
x
WHERE
YEAR(date) = 2016
GROUP BY
MONTH(date)) t1
FULL OUTER JOIN
(SELECT
MONTH(date) AS month,
SUM(income) AS sumOfIncomeA
FROM
x
WHERE
(productId) = 'A'
AND YEAR(date) = 2016
GROUP BY
MONTH(date)) t2 ON t1.month = t2.month
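Since both result sets are aggregated from the same table x, conditional aggregation may avoid the join altogether. The following is only a sketch in T-SQL; the inline VALUES rows are made-up sample data standing in for x, not the asker's real data.

```sql
-- Sketch only: hypothetical sample rows stand in for the real table x.
WITH x AS (
    SELECT *
    FROM (VALUES
        (1, 101, 'A', CAST('20160111' AS date), 600),
        (2, 103, 'B', CAST('20160115' AS date), 300),
        (3, 104, 'B', CAST('20160203' AS date), 150)
    ) AS v (customerId, invoiceId, productId, [date], income)
)
SELECT
    MONTH([date])               AS [month],
    COUNT(DISTINCT customerId)  AS numOfCustomers,
    SUM(income)                 AS sumOfIncome,
    COUNT(invoiceId)            AS numOfInvoice,
    COUNT(productId)            AS numOfProduct,
    -- Only product A rows contribute; a month without product A sums to 0.
    SUM(CASE WHEN productId = 'A' THEN income ELSE 0 END) AS sumOfIncomeA
FROM x
WHERE YEAR([date]) = 2016
GROUP BY MONTH([date])
ORDER BY [month];
```

With these sample rows, month 1 should show sumOfIncomeA = 600 and month 2 should show 0, so every month that has any sales gets exactly one row.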

Related

Detect if a month is missing and insert them automatically with a select statement (MSSQL)

I am trying to write a select statement which detects if a month is not existent and automatically inserts that month with a value 0. It should insert all missing months from the first entry to the last entry.
Example:
My table looks like this:
After the statement it should look like this:
You need a recursive CTE to get all the years in the table (and the missing ones if any) and another one to get all the month numbers 1-12.
A CROSS JOIN of these CTEs is LEFT JOINed to the table, and the result is filtered so that rows before the first year/month and after the last year/month are left out:
WITH
limits AS (
SELECT MIN(year) min_year, -- min year in the table
MAX(year) max_year, -- max year in the table
MIN(DATEFROMPARTS(year, monthnum, 1)) min_date, -- min date in the table
MAX(DATEFROMPARTS(year, monthnum, 1)) max_date -- max date in the table
FROM tablename
),
years(year) AS ( -- recursive CTE to get all the years of the table (and the missing ones if any)
SELECT min_year FROM limits
UNION ALL
SELECT year + 1
FROM years
WHERE year < (SELECT max_year FROM limits)
),
months(monthnum) AS ( -- recursive CTE to get all the month numbers 1-12
SELECT 1
UNION ALL
SELECT monthnum + 1
FROM months
WHERE monthnum < 12
)
SELECT y.year, m.monthnum,
DATENAME(MONTH, DATEFROMPARTS(y.year, m.monthnum, 1)) month,
COALESCE(value, 0) value
FROM months m CROSS JOIN years y
LEFT JOIN tablename t
ON t.year = y.year AND t.monthnum = m.monthnum
WHERE DATEFROMPARTS(y.year, m.monthnum, 1)
BETWEEN (SELECT min_date FROM limits) AND (SELECT max_date FROM limits)
ORDER BY y.year, m.monthnum
See the demo.
You should not be storing date components in two separate columns; instead, you should have just one column, with a proper date-like datatype.
One approach is to use a recursive query to generate all starts of month between the earliest and latest date in the table, then bring in the table with a left join.
In SQL Server:
with cte as (
select min(datefromparts(year, monthnum, 1)) as dt,
max(datefromparts(year, monthnum, 1)) as dt_max
from mytable
union all
select dateadd(month, 1, dt), dt_max -- the recursive member must return the same columns as the anchor
from cte
where dt < dt_max
)
select c.dt, coalesce(t.value, 0) as value
from cte c
left join mytable t on datefromparts(t.year, t.monthnum, 1) = c.dt
If your data spreads over more than 100 months, you need to add option(maxrecursion 0) at the end of the query.
You can extract the date components in the final select if you like:
select
year(c.dt) as yr,
month(c.dt) as monthnum,
datename(month, c.dt) as monthname,
coalesce(t.value, 0) as value
from ...
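To show where the maxrecursion hint goes, here is a minimal standalone T-SQL sketch. The start date and the counter column n are made up for illustration; the query generates 150 month starts, which is past SQL Server's default recursion limit of 100.

```sql
-- Sketch: 150 consecutive month starts from a hypothetical start date.
WITH cte AS (
    SELECT CAST('20100101' AS date) AS dt, 1 AS n
    UNION ALL
    SELECT DATEADD(month, 1, dt), n + 1
    FROM cte
    WHERE n < 150
)
SELECT dt
FROM cte
OPTION (MAXRECURSION 0);  -- 0 removes the limit; without the hint the default of 100 applies
```

The hint belongs to the outermost statement, after the final select, not inside the CTE definition.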

sql user retention calculation

I have a table records like this in Athena, one user one row in a month:
month, id
2020-05 1
2020-05 2
2020-05 5
2020-06 1
2020-06 5
2020-06 6
I need to calculate percentage = (users present in both the prior month and the current month) / (total users in the prior month).
In the example above, users 1 and 5 appear in both May and June, and May has 3 users in total, so the result should be 2/3*100.
with monthly_mau AS
(SELECT month as mauMonth,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth,
count(distinct userid) AS monthly_mau
FROM records
GROUP BY month
ORDER BY month),
retention_mau AS
(SELECT
month,
count(distinct useridLeft) AS retention_mau
FROM (
(SELECT
userid as useridLeft,month as monthLeft,
date_format(date_add('month',1,cast(concat(month,'-01') AS date)), '%Y-%m') AS nextMonth
FROM records ) AS prior
INNER JOIN
(SELECT
month ,
userid
FROM records ) AS current
ON
prior.useridLeft = current.userid
AND prior.nextMonth = current.month )
WHERE userid is not null
GROUP BY month
ORDER BY month )
SELECT *, cast(retention_mau AS double)/cast(monthly_mau AS double)*100 AS retention_mau_percentage
FROM monthly_mau as m
INNER JOIN retention_mau AS r
ON m.nextMonth = r.month
order by r.month
This gives me percentage as 100 which is not right. Any idea?
Hmmm . . . assuming you have one row per user per month, you can use window functions and conditional aggregation:
select month, count(*) as num_users,
sum(case when prev_month_date = date_add('month', -1, month_date) then 1 else 0 end) as both_months
from (select r.*,
cast(concat(month, '-01') AS date) as month_date,
lag(cast(concat(month, '-01') AS date)) over (partition by id order by month) as prev_month_date
from records r
) r
group by month;
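To get the percentage relative to the prior month's total, one option is to look forward with lead() instead of backward with lag(), so that each row is counted in the month it is retained from. This is a sketch in Athena/Presto syntax under the same one-row-per-user-per-month assumption; the inline VALUES rows copy the sample data from the question, and the output column names are made up.

```sql
-- Sketch: retention of each month's users into the following month.
WITH records AS (
    SELECT *
    FROM (VALUES
        ('2020-05', 1), ('2020-05', 2), ('2020-05', 5),
        ('2020-06', 1), ('2020-06', 5), ('2020-06', 6)
    ) AS t (month, id)
),
flagged AS (
    SELECT id,
           cast(concat(month, '-01') AS date) AS month_date,
           -- the next month in which this user appears, if any
           lead(cast(concat(month, '-01') AS date))
               OVER (PARTITION BY id ORDER BY month) AS next_month_date
    FROM records
)
SELECT date_format(month_date, '%Y-%m') AS prior_month,
       count(*) AS prior_users,
       sum(CASE WHEN next_month_date = date_add('month', 1, month_date)
                THEN 1 ELSE 0 END) AS retained_users,
       cast(sum(CASE WHEN next_month_date = date_add('month', 1, month_date)
                     THEN 1 ELSE 0 END) AS double) / count(*) * 100 AS retention_pct
FROM flagged
GROUP BY month_date
ORDER BY month_date;
```

For the sample data, the 2020-05 row should report 2 retained users out of 3. The last month in the table always shows 0 retained, since there is no following month to compare against.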

Group By - select by a criteria that is met every month

The below query returns all USERS that have SUM(AMOUNT) > 10 in a given month. It includes Users in a month even if they don't meet the criteria in other months.
But I'd like to transform this query to return all USERS who must meet the criteria SUM(AMOUNT) > 10 every single month (i.e., from the first month in the table to the last one) across the entire data.
Put another way, exclude users who don't meet SUM(AMOUNT) > 10 every single month.
select USERS, to_char(transaction_date, 'YYYY-MM') as month
from Table
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10;
One approach uses a generated calendar table representing all months in your data set. We can left join this calendar table to your current query, and then aggregate over all months by user:
WITH months AS (
SELECT DISTINCT TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
),
cte AS (
SELECT USERS, TO_CHAR(transaction_date, 'YYYY-MM') AS month
FROM yourTable
GROUP BY USERS, month
HAVING SUM(AMOUNT) > 10
)
SELECT
t.USERS
FROM months m
LEFT JOIN cte t
ON m.month = t.month
GROUP BY
t.USERS
HAVING
COUNT(t.USERS) = (SELECT COUNT(*) FROM months);
The HAVING clause above asserts that the number of months to which a user matches is in fact the total number of months. This would imply that the user meets the sum criteria for every month.
Perhaps you could use a correlated subquery, such as:
select t.*
from (select distinct table.users from table) t
where not exists
(
select to_char(u.transaction_date, 'YYYY-MM') as month
from table u
where u.users = t.users
group by month
having sum(u.amount) <= 10
)
One option would be using sign(amount-10) vs. sign(amount) logic as
SELECT q.users
FROM
(
with tab(users, transaction_date,amount) as
(
select 1,date'2018-11-24',8 union all
select 1,date'2018-11-24',18 union all
select 2,date'2018-10-24',13 union all
select 3,date'2018-11-24',18 union all
select 3,date'2018-10-24',28 union all
select 3,date'2018-09-24', 3 union all
select 4,date'2018-10-24',28
)
SELECT users, to_char(transaction_date, 'YYYY-MM') as month,
sum(sign(amount-10)) as cnt1,
sum(sign(amount)) as cnt2
FROM tab t
GROUP BY users, month
) q
GROUP BY q.users
HAVING sum(q.cnt1) = sum(q.cnt2)
ORDER BY q.users
users
-----
2
4
Rextester Demo
You need to compare the number of months > 10 to the number of months between the min and the max date:
SELECT users, Count(flag) AS months, Min(mth), Max(mth)
FROM
(
SELECT users, date_trunc('month',transaction_date) AS mth,
CASE WHEN Sum(amount) > 10 THEN 1 end AS flag
FROM tab t
GROUP BY users, mth
) AS dt
GROUP BY users
HAVING -- adding the number of months > 10 to the min date and compare to max
Min(mth) + (INTERVAL '1' MONTH * (Count(flag)-1)) = Max(mth)
If missing months don't count it would be a simple count(flag) = count(*)
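The simpler variant, where missing months do not count against a user, could look like this. It is a sketch assuming PostgreSQL; the inline rows in tab are made up for illustration.

```sql
-- Sketch: keep users whose every recorded month sums to more than 10.
WITH tab (users, transaction_date, amount) AS (
    VALUES (1, DATE '2018-10-05', 12),
           (1, DATE '2018-11-24', 18),
           (2, DATE '2018-10-24', 13),
           (2, DATE '2018-11-02', 4)
)
SELECT users
FROM (
    SELECT users,
           date_trunc('month', transaction_date) AS mth,
           -- flag is 1 for a qualifying month and NULL otherwise
           CASE WHEN sum(amount) > 10 THEN 1 END AS flag
    FROM tab
    GROUP BY users, mth
) AS dt
GROUP BY users
HAVING count(flag) = count(*);  -- count(flag) skips NULLs, count(*) does not
```

Here user 1 qualifies in both months and is returned, while user 2 is dropped because the 2018-11 total is only 4.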

Get valid orders at the starting day of each year

In a table containing Order information (call it Order) we have the following fields:
OrderId int
OrderDate Date
BindingTime int
Binding time is in months.
An order is called "Active" between its OrderDate and DATEADD(mm, BindingTime, OrderDate).
What I'd like to do is to group the orders by year so that if an order is "active" on the first day of a year it would be taken into account. The aim is to calculate each year's inbound and outbound orders. So the query result will be COUNT of orders and the year. And by year we mean the number of orders which were active on the first day of that year.
Mind that, we would like to have all the years between two given numbers in our results. E.g. If there was no active order on the first day of 2016 we would still like to to have a row for (0, 2016).
I've used a recursive CTE to generate a range of years, so that a 'zero' year will not be omitted
declare @YEAR1 as date = '20110101';
declare @YEAR2 as date = '20190101';
WITH YEARS AS (SELECT @YEAR1 y
UNION ALL
SELECT dateadd(year,1,y) FROM YEARS WHERE y < @YEAR2)
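-- LEFT JOIN from YEARS so that a year with no active order still returns a row with count 0
SELECT YEARS.y, count(t.OrderId) YearStartActiveOrders
FROM YEARS
LEFT JOIN YourTable t
ON YEARS.y BETWEEN CAST(t.OrderDate as date)
AND CAST(DATEADD(mm, t.BindingTime, t.OrderDate) as date)
GROUP BY YEARS.y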
Seems like what you need is a Date table (having a list of all days per year) and left joining that table with your grouped data (active order count, per day of the year).
You can use this date table from Aaron Bertrand. I generated the #dim table with the following params, to only generate two years data (2015, 2016):
DECLARE @StartDate DATE = '20150101', @NumberOfYears INT = 2;
Then you can do the following:
with ordertable as
(
select 1 as orderid, '20160101' as orderdate, 2 as bindingtime union all
select 2, '20160305', 3 union all
select 3, '20160305', 5 union all
select 4, '20150305', 5
)
select d.year, isnull(count(orderid), 0) nrActiveOrdersFirstDayOfYear
from #dim d
left join ordertable g on d.year = year(g.orderdate)
and g.orderdate = d.date
and d.FirstOfYear between g.orderdate and DATEADD(mm, g.bindingtime, OrderDate)
group by d.year
With the sample data I took as an example, you would get the result:
year nrActiveOrdersFirstDayOfYear
2015 0
2016 1
Working demo here.

Pad out an SQL table with data for Graphing Purposes

SQL Server 2005
I have an SQL Function (ftn_GetExampleTable) which returns a table with multiple result rows
EXAMPLE
ID MemberID MemberGroupID Result1 Result2 Result3 Year Week
1 1 1 High Risk 2 xx 2011 22
2 11 4 Low Risk 1 yy 2011 21
3 12 5 Med Risk 3 zz 2011 25
etc.
Now I do a count and group by on a table above this for Result 2 for instance so I get
SELECT MemberGroupID, Result2, Count(*) AS ExampleCount, Year, Week
FROM ftn_GetExampleTable
GROUP BY MemberGroupID, Result2, Year, Week
MemberGroupID Result2 ExampleCount Year Week
1 2 4 2011 22
4 1 2 2011 21
5 3 1 2011 25
Now imagine when I go to graph this new table between Weeks 20 and 23 of Year 2011, you'll see that it won't graph 20 or 23 or certain groups or even certain results in this example as they are not in the included data, so I need "false data" inserted into this table which has all the possibilities so they at least show on a graph even if the count is 0, does this make sense?
I am wondering on the easiest and kind of most dynamic way as it could be Result1 or Result3 I want to Graph on (different column types).
Thanks in advance
It looks like your dimensions are: MemberGroupID,Result2, and week (Year,Week).
One approach to solving this is to generate a list of all values you want for all the dimensions, and produce a cartesian product of them. As an example,
SELECT m.MemberGroupID, n.Result2, w.Year, w.Week
FROM (SELECT MemberGroupID FROM ftn_GetExampleTable GROUP BY MemberGroupID) m
CROSS
JOIN (SELECT Result2 FROM ftn_GetExampleTable GROUP BY Result2 ) n
CROSS
JOIN (SELECT Year, Week FROM myCalendar WHERE ... ) w
You don't necessarily need a table named myCalendar. (That approach does seem to be the popular one.) You just need a row source from which you can derive a list of (Year, Week) tuples. (There are answers elsewhere on Stack Overflow to the question of how to generate a list of dates.)
And the list of MemberGroupID and Result2 values doesn't have to come from the ftn_GetExampleTable rowsource, you could substitute another query.
With a cartesian product of those dimensions, you've got a complete "grid". Now you can LEFT JOIN your original result set to that.
Any place you don't have a matching row from the "gappy" result query, you'll get a NULL returned. You can leave the NULL, or replace it with a 0, which is probably what you want if it's a "count" you are returning.
SELECT d.MemberGroupID
, d.Result2
, d.Year
, d.Week
, COALESCE(r.ExampleCount,0) as ExampleCount
FROM ( <dimension query from above> ) d
LEFT
JOIN ( <original ExampleCount query> ) r
ON r.MemberGroupID = d.MemberGroupID
AND r.Result2 = d.Result2
AND r.Year = d.Year
AND r.Week = d.Week
That query can be refactored to make use of Common Table Expressions, which makes the query a little easier to read, especially if you are including multiple measures.
; WITH d AS ( /* <dimension query with no gaps (example above)> */
)
, r AS ( /* <original query with gaps> */
SELECT MemberGroupID, Result2, Count(*) AS ExampleCount, Year, Week
FROM ftn_GetExampleTable
GROUP BY MemberGroupID, Result2, Year, Week
)
SELECT d.*
, COALESCE(r.ExampleCount,0) AS ExampleCount
FROM d
LEFT
JOIN r
ON r.Year=d.Year AND r.Week=d.Week AND r.MemberGroupID = d.MemberGroupID
AND r.Result2 = d.Result2
This isn't a complete working solution to your problem, but it outlines an approach you can use.
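As a rough end-to-end illustration of the grid approach, here is a self-contained T-SQL sketch. The rows in r and the week list in w are made-up stand-ins for the aggregated ftn_GetExampleTable result and for a calendar source; UNION ALL is used instead of a VALUES constructor so it should also run on SQL Server 2005.

```sql
-- Sketch: build the full dimension grid, then LEFT JOIN the gappy result to it.
WITH r AS (   -- hypothetical "gappy" aggregated result
    SELECT 1 AS MemberGroupID, 2 AS Result2, 2011 AS [Year], 22 AS [Week], 4 AS ExampleCount
    UNION ALL SELECT 4, 1, 2011, 21, 2
),
w AS (        -- hypothetical week list: weeks 20 to 23 of 2011
    SELECT 2011 AS [Year], 20 AS [Week]
    UNION ALL SELECT 2011, 21
    UNION ALL SELECT 2011, 22
    UNION ALL SELECT 2011, 23
),
d AS (        -- every MemberGroupID x Result2 x week combination
    SELECT m.MemberGroupID, n.Result2, w.[Year], w.[Week]
    FROM (SELECT DISTINCT MemberGroupID FROM r) m
    CROSS JOIN (SELECT DISTINCT Result2 FROM r) n
    CROSS JOIN w
)
SELECT d.MemberGroupID, d.Result2, d.[Year], d.[Week],
       COALESCE(r.ExampleCount, 0) AS ExampleCount
FROM d
LEFT JOIN r
  ON  r.MemberGroupID = d.MemberGroupID
  AND r.Result2       = d.Result2
  AND r.[Year]        = d.[Year]
  AND r.[Week]        = d.[Week]
ORDER BY d.MemberGroupID, d.Result2, d.[Year], d.[Week];
```

With two groups, two result values and four weeks, the grid has 16 rows; only the two combinations present in r carry a non-zero count.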
Whenever I need to generate a sequence within SQL Server I use the sys.all_objects table along with the ROW_NUMBER function, then manipulate it as required:
SELECT ROW_NUMBER() OVER(ORDER BY Object_ID) AS Sequence
FROM Sys.All_Objects
So for the list of year and week numbers I would use:
DECLARE @StartDate DATETIME,
@EndDate DATETIME
SET @StartDate = '20110101'
SET @EndDate = '20120601'
SELECT DATEPART(YEAR, Date) AS YEAR,
DATEPART(WEEK, Date) AS WeekNum
FROM ( SELECT DATEADD(WEEK, ROW_NUMBER() OVER(ORDER BY Object_ID) - 1, @StartDate) AS Date
FROM Sys.All_Objects
) Dates
WHERE Date < @EndDate
Where the dates subquery provides a list of dates at one week intervals between your start and end dates.
So in your example the end result would be something like:
DECLARE @StartDate DATETIME,
@EndDate DATETIME
SET @StartDate = '20110101'
SET @EndDate = '20120601'
;WITH Data AS
( SELECT MemberGroupID,
Result2,
Count(*) AS ExampleCount,
Year,
Week
FROM ftn_GetExampleTable
GROUP BY MemberGroupID, Result2, Year, Week
), Dates AS
( SELECT DATEPART(YEAR, Date) AS YearNum,
DATEPART(WEEK, Date) AS WeekNum
FROM ( SELECT DATEADD(WEEK, ROW_NUMBER() OVER(ORDER BY Object_ID) - 1, @StartDate) AS Date
FROM Sys.All_Objects
) Dates
WHERE Date < @EndDate
)
SELECT YearNum,
WeekNum,
MemberGroupID,
Result2,
COALESCE(ExampleCount, 0) AS ExampleCount
FROM Dates
LEFT JOIN Data
ON YearNum = Data.Year
AND WeekNum = Data.Week