Cohort Analysis using SQL (Snowflake)

I am doing a cohort analysis using the table TRANSACTIONS. Below is the table schema,
USER_ID NUMBER,
PAYMENT_DATE_UTC DATE,
IS_PAYMENT_ADDED BOOLEAN
Below is a quick query to see how USER_ID 12345 (an example) goes through the different cohorts based on the date filter provided,
WITH RESULT AS (
SELECT
USER_ID,
TO_DATE(PAYMENT_DATE_UTC) AS PAYMENT_DATE,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY 1,2
HAVING PAYMENT_ADDED_COUNT>=1
ORDER BY 2
)
SELECT
COUNT(DISTINCT r.USER_ID),
SUM(r.PAYMENT_ADDED_COUNT)
FROM RESULT r
WHERE r.USER_ID=12345
AND (r.PAYMENT_DATE>='2021-02-01' AND r.PAYMENT_DATE<'2021-02-15')
The result for this query with the time frame (two weeks) would be
| 1 | 55 |
and this USER_ID would be classified as a Regular User Cohort (one who has made more than 10 payments) for the provided date filter
If the same query is run with the time frame as just one day say '2021-02-07', the result would be
| 1 | 10 |
and this USER_ID would be classified as an Occasional User Cohort (one who has made between 1 and 10 payments) for the provided date filter
I have this below query to bucket the USER_ID's into the two different cohorts based on the sum of the payments added,
WITH
ALL_USER_COHORT AS
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
),
OCASSIONAL_USER_COHORT AS
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
HAVING (PAYMENT_ADDED_COUNT>=1 AND PAYMENT_ADDED_COUNT<=10)
),
REGULAR_USER_COHORT AS
(SELECT
USER_ID,
SUM(CASE WHEN IS_PAYMENT_ADDED=TRUE THEN 1 ELSE 0 END ) AS PAYMENT_ADDED_COUNT
FROM TRANSACTIONS
GROUP BY USER_ID
HAVING PAYMENT_ADDED_COUNT>10
)
SELECT
COUNT(DISTINCT ou.USER_ID) AS "OCCASIONAL USERS",
COUNT(DISTINCT ru.USER_ID) AS "REGULAR USERS"
FROM ALL_USER_COHORT au
LEFT JOIN OCASSIONAL_USER_COHORT ou ON au.USER_ID=ou.USER_ID
LEFT JOIN REGULAR_USER_COHORT ru ON au.USER_ID=ru.USER_ID
LEFT JOIN TRANSACTIONS t ON au.USER_ID=t.USER_ID
WHERE au.USER_ID=12345
AND TO_DATE(t.PAYMENT_DATE_UTC)>='2021-02-07'
Ideally the USER_ID 12345 should be bucketed as "OCCASIONAL USERS" as per the provided date filter but the query buckets it as "REGULAR USERS" instead.

For starters, your CTEs could have the redundancy removed like so:
WITH all_user_cohort AS (
SELECT
USER_ID,
SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
FROM transactions
GROUP BY user_id
), ocassional_user_cohort AS (
SELECT * FROM all_user_cohort
WHERE PAYMENT_ADDED_COUNT between 1 AND 10
), regular_user_cohort AS (
SELECT * FROM all_user_cohort
WHERE PAYMENT_ADDED_COUNT > 10
)
SELECT
COUNT(DISTINCT ou.user_id) AS "OCCASIONAL USERS",
COUNT(DISTINCT ru.user_id) AS "REGULAR USERS"
FROM all_user_cohort AS au
LEFT JOIN ocassional_user_cohort ou ON au.user_id=ou.user_id
LEFT JOIN regular_user_cohort ru ON au.user_id=ru.user_id
LEFT JOIN transactions t ON au.user_id=t.user_id
WHERE au.user_id=12345
AND TO_DATE(t.payment_date_utc)>='2021-03-01'
But the reason you are getting this problem is that the cohort CTEs count each user's payments across all time, so the cohort is decided before the date filter on the joined transactions table is applied.
What you want is to move the date filter into all_user_cohort. You also do not need separate cohort CTEs when you can just sum the number of rows that meet each condition.
WITH all_user_cohort AS (
SELECT
USER_ID,
SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
FROM transactions
WHERE TO_DATE(payment_date_utc)>='2021-03-01'
GROUP BY user_id
)
SELECT
SUM(IFF(payment_added_count between 1 AND 10, 1,0)) AS "OCCASIONAL USERS",
SUM(IFF(payment_added_count > 10, 1,0)) AS "REGULAR USERS"
FROM all_user_cohort AS au
WHERE au.user_id=12345
This can also be done differently, if that is closer to what you're looking for, for other reasons.
WITH all_user_cohort AS (
SELECT
USER_ID,
SUM(IFF(is_payment_added=TRUE, 1,0)) AS payment_added_count
FROM transactions
WHERE TO_DATE(payment_date_utc)>='2021-03-01'
GROUP BY user_id
), classify_users AS (
SELECT user_id
,CASE
WHEN payment_added_count between 1 AND 10 THEN 'OCCASIONAL USERS'
WHEN payment_added_count > 10 THEN 'REGULAR USERS'
ELSE 'users with zero payments'
END AS classified
FROM all_user_cohort
)
SELECT classified
,count(*)
FROM classify_users
WHERE user_id=12345
GROUP BY 1
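The effect of moving the date filter into the aggregation can be checked on a toy data set. The following is a minimal sketch, assuming SQLite (through Python's built-in sqlite3 module) as a stand-in for Snowflake, with CASE in place of IFF and dates stored as ISO text; the eleven sample rows are invented for illustration and are not the asker's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "user_id INTEGER, payment_date_utc TEXT, is_payment_added INTEGER)"
)
# Hypothetical data: 8 payments on 2021-02-01 and 3 on 2021-02-07.
rows = [(12345, "2021-02-01", 1)] * 8 + [(12345, "2021-02-07", 1)] * 3
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# The date filter sits inside the CTE, so it limits what gets counted.
COHORT_SQL = """
WITH all_user_cohort AS (
    SELECT user_id, SUM(is_payment_added) AS payment_added_count
    FROM transactions
    WHERE payment_date_utc >= ?
    GROUP BY user_id
)
SELECT
    SUM(CASE WHEN payment_added_count BETWEEN 1 AND 10 THEN 1 ELSE 0 END),
    SUM(CASE WHEN payment_added_count > 10 THEN 1 ELSE 0 END)
FROM all_user_cohort
WHERE user_id = 12345
"""

# (occasional users, regular users) for two different start dates.
all_time = conn.execute(COHORT_SQL, ("0001-01-01",)).fetchone()
from_feb_7 = conn.execute(COHORT_SQL, ("2021-02-07",)).fetchone()
print("all time:", all_time)
print("from 2021-02-07:", from_feb_7)
```

With these rows the user has 11 payments overall but only 3 from 2021-02-07 onwards, so the same user moves from the regular bucket to the occasional bucket once the filter is applied before counting.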

Related

I am having trouble joining these two results in one query

Banking Transactions Transcription
A client requested a query for a dashboard in their online banking web service. It should return a list of all the customer accounts and get their transactions for the current month.
The result should have the following columns: iban/transactions/total
-iban: Account iban
-transactions: list of transaction amount record for a specific account iban:
Record is a transaction amount
Records are separated by a '+' sign
Records are sorted in ascending order of dt
-total: total amount of transactions
The result should be sorted in descending order by total number of transactions, then in descending order by total.
Note:
-Only transactions in the current month should be included in the result.
-The current month is September.
-The ID is INT primary key, Iban is varchar, account_id is INT foreign key (id), dt is datetime, amount is varchar
I am new to this so here is the example I can put in the best for reference:
Accounts
ID iban
1 GT92 GJH2 AYZM
2 MT82 GWLY FWMY
3 GI36 YOPG Y6NQ
Transactions
account_id dt amount
1 2022-08-25 13:59:30 $42.87
1 2022-08-26 19:12:32 $24.04
1 2022-09-05 17:35:29 $70.07
1 2022-09-10 13:09:40 $26.15
1 2022-09-13 16:28:55 $10.15
2 2022-08-26 05:05:38 $82.83
2 2022-09-03 05:12:33 $34.14
2 2022-09-03 17:19:27 $94.94
2 2022-09-04 10:36:07 $69.31
2 2022-09-12 05:15:22 $90.06
2 2022-09-18 14:30:52 $54.85
3 2022-09-25 04:28:37 $45.99
3 2022-08-22 21:12:42 $65.98
3 2022-08-29 04:45:23 $10.99
3 2022-09-02 09:32:25 $98.36
3 2022-09-02 14:58:25 $25.45
3 2022-09-06 21:15:47 $57.98
3 2022-09-10 10:25:26 $37.90
I tried STUFF and XML PATH with money convert to get sum for particular IDs but can not get the results in single query
select iban,
STUFF((select '+' +amount from transactions where account_id = id
for XML PATH('')),1,1,'')[transactions]
from Accounts
order by id;
select
SUM((case when isnumeric([amount])=1 then convert(money,[amount]) else 0 end)) as Transactions from transactions
group by account_id;

Nest the aggregate query and JOIN on ID and account_id fields. This means SELECTing account_id in aggregate query. Also, include record Count in the aggregate query. Include ORDER BY dt in the STUFF() SQL.
SELECT iban, TransTotal,
STUFF((SELECT '+' +amount FROM transactions WHERE account_id = id ORDER BY dt
FOR XML PATH('')),1,1,'')[transactions]
FROM Accounts INNER JOIN (SELECT account_id, Count(account_id) AS TransCount,
SUM((CASE WHEN isnumeric([amount])=1 THEN convert(money,[amount]) ELSE 0 END)) AS TransTotal
FROM transactions
GROUP BY account_id) AS T
ON Accounts.id = T.account_id
ORDER BY TransCount DESC, TransTotal DESC;
Add filter criteria for year/month in both STUFF and aggregate query. Format() function is one way.
Format(dt, 'yyyyMM') = Format(GetDate(), 'yyyyMM')
Instead of GetDate(), you could use a static parameter or user input.
Another approach:
SELECT iban, Transactions, TransTotal FROM Accounts INNER JOIN
(SELECT account_id, Count(account_id) AS TransCount,
STRING_AGG(amount, '+') WITHIN GROUP (ORDER BY dt) AS Transactions,
SUM((CASE WHEN isnumeric([amount])=1 THEN convert(money,[amount]) ELSE 0 END)) AS TransTotal
FROM transactions
WHERE Format(dt, 'yyyyMM')=Format(GetDate(), 'yyyyMM')
GROUP BY account_id) AS T
ON Accounts.id = T.account_id
ORDER BY TransCount DESC, TransTotal DESC
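To know what either query should return for the sample rows in the question, the expected report can be worked out independently. The following is a plain-Python cross-check, not a translation of the SQL: it assumes the current month is September 2022 and that every amount is a '$'-prefixed decimal string, as in the sample data.

```python
from collections import defaultdict
from decimal import Decimal

accounts = {1: "GT92 GJH2 AYZM", 2: "MT82 GWLY FWMY", 3: "GI36 YOPG Y6NQ"}
transactions = [
    (1, "2022-08-25 13:59:30", "$42.87"),
    (1, "2022-08-26 19:12:32", "$24.04"),
    (1, "2022-09-05 17:35:29", "$70.07"),
    (1, "2022-09-10 13:09:40", "$26.15"),
    (1, "2022-09-13 16:28:55", "$10.15"),
    (2, "2022-08-26 05:05:38", "$82.83"),
    (2, "2022-09-03 05:12:33", "$34.14"),
    (2, "2022-09-03 17:19:27", "$94.94"),
    (2, "2022-09-04 10:36:07", "$69.31"),
    (2, "2022-09-12 05:15:22", "$90.06"),
    (2, "2022-09-18 14:30:52", "$54.85"),
    (3, "2022-09-25 04:28:37", "$45.99"),
    (3, "2022-08-22 21:12:42", "$65.98"),
    (3, "2022-08-29 04:45:23", "$10.99"),
    (3, "2022-09-02 09:32:25", "$98.36"),
    (3, "2022-09-02 14:58:25", "$25.45"),
    (3, "2022-09-06 21:15:47", "$57.98"),
    (3, "2022-09-10 10:25:26", "$37.90"),
]

# Keep September rows only, visiting them in ascending dt order.
by_account = defaultdict(list)
for account_id, dt, amount in sorted(transactions, key=lambda t: t[1]):
    if dt.startswith("2022-09"):
        by_account[account_id].append(amount)

# One row per account: iban, '+'-joined amounts, count, total.
report = []
for account_id, amounts in by_account.items():
    total = sum(Decimal(a.lstrip("$")) for a in amounts)
    report.append((accounts[account_id], "+".join(amounts), len(amounts), total))

# Sort by transaction count, then by total, both descending.
report.sort(key=lambda r: (r[2], r[3]), reverse=True)
for iban, listing, _count, total in report:
    print(iban, listing, total)
```

Decimal is used instead of float so the totals stay exact to the cent, which mirrors the intent of converting to money in the SQL.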

How to count the ages of people who have a log record in another table in SQL?

I want to count how many 18-year-olds are recorded in the logs table, counting each person only once. Right now, if the same person has logged in 2 times, my query reports 2 people aged 18. I can't make that person appear only once. How do I do this?
My logs table and people table are connected by card_id.
My logs table has the login date and card_id.
While my members' table has the birthdate and card_id columns.
HERE is the query I made
select
card_id, sum("18") as "18"
from
( select logs.login, members.card_id,
count(distinct (case when 0 <= age and age <= 18 then age end)) as "18",
count( (case when 19 <= age and age <= 30 then age end)) as "30",
count ( (case when 31 <= age and age <= 50 then age end)) as "50"
from
(select login, date_part('year', age(birthdate)) as age, members.card_id as card_id,
logs.login
from members
left join logs on logs.card_id=members.card_id
) as members
left join logs on logs.card_id=members.card_id
group by logs.login, members.card_id
) as members
where login <= '20221029' group by card_id;
I want to create a table like this:
18 | 30 | 50 |
---------------
2 | 0 | 0

Count the distinct card_id-s.
select count(distinct card_id)
from members join logs using (card_id)
where extract('year' from age(birthdate)) = 18
and login <= '20221029';
Unrelated but it seems that you are storing login as text. This is not a good idea. Use type date instead.
Addition after the question update
select count(*) filter (where user_age = 18) as age_18,
count(*) filter (where user_age between 19 and 30) as age_30,
count(*) filter (where user_age between 31 and 50) as age_50
from
(
select distinct on (card_id)
extract('year' from age(birthdate)) user_age
from members inner join logs using (card_id)
where login <= '20221029'
order by card_id, login desc -- pick the latest login
) AS t;
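The difference between counting log rows and counting people can be seen on a few rows. This is a minimal sketch, assuming SQLite through Python's sqlite3 module rather than PostgreSQL; to keep it portable, members carries a precomputed age column (an assumption, since the original derives age from birthdate), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (card_id INTEGER PRIMARY KEY, age INTEGER);
CREATE TABLE logs (card_id INTEGER, login TEXT);
-- Hypothetical data: card 1 logs in twice, card 3 only after the cutoff.
INSERT INTO members VALUES (1, 18), (2, 18), (3, 25);
INSERT INTO logs VALUES
    (1, '2022-10-01'), (1, '2022-10-15'), (2, '2022-10-20'), (3, '2022-11-05');
""")

# Counting joined rows counts card 1 once per login.
per_log_row = conn.execute("""
    SELECT COUNT(*)
    FROM members JOIN logs USING (card_id)
    WHERE age = 18 AND login <= '2022-10-29'
""").fetchone()[0]

# Counting distinct card_id values counts each person once.
per_person = conn.execute("""
    SELECT COUNT(DISTINCT card_id)
    FROM members JOIN logs USING (card_id)
    WHERE age = 18 AND login <= '2022-10-29'
""").fetchone()[0]

# Bucketed version: deduplicate people first, then count per age range.
buckets = conn.execute("""
    SELECT SUM(CASE WHEN age = 18 THEN 1 ELSE 0 END),
           SUM(CASE WHEN age BETWEEN 19 AND 30 THEN 1 ELSE 0 END),
           SUM(CASE WHEN age BETWEEN 31 AND 50 THEN 1 ELSE 0 END)
    FROM (SELECT DISTINCT card_id, age
          FROM members JOIN logs USING (card_id)
          WHERE login <= '2022-10-29') AS t
""").fetchone()
print(per_log_row, per_person, buckets)
```

The deduplicating subquery plays the role that DISTINCT ON (card_id) plays in the PostgreSQL answer: each person reaches the outer aggregate exactly once.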

oracle sql get transactions between the period

I have 3 tables in oracle sql namely investor, share and transaction.
I am trying to get new investors invested in any shares for a certain period. As they are the new investor, there should not be a transaction in the transaction table for that investor against that share prior to the search period.
For the transaction table with the following records:
Id TranDt InvCode ShareCode
1 2020-01-01 00:00:00.000 inv1 S1
2 2019-04-01 00:00:00.000 inv1 S1
3 2020-04-01 00:00:00.000 inv1 S1
4 2021-03-06 11:50:20.560 inv2 S2
5 2020-04-01 00:00:00.000 inv3 S1
For the search period between 2020-01-01 and 2020-05-01, I should get the output as
5 2020-04-01 00:00:00.000 inv3 S1
Though there are transactions for inv1 in the table for that period, there is also a transaction prior to the search period, so that shouldn't be included as it's not considered as new investor within the search period.
Below query is working but it's really taking ages to return the results calling from c# code leading to timeout issues. Is there anything we can do to refine to get the results quicker?
WITH
INVESTORS AS
(
SELECT I.INVCODE FROM INVESTOR I WHERE I.CLOSED IS NULL
),
SHARES AS
(
SELECT S.SHARECODE FROM SHARE S WHERE S.DORMANT IS NULL
),
SHARES_IN_PERIOD AS
(
SELECT DISTINCT
T.INVCODE,
T.SHARECODE,
T.TYPE
FROM TRANSACTION T
JOIN INVESTORS I ON T.INVCODE = I.INVCODE
JOIN SHARES S ON T.SHARECODE = S.SHARECODE
WHERE T.TRANDT >= :startDate AND T.TRANDT <= :endDate
),
PREVIOUS_SHARES AS
(
SELECT DISTINCT
T.INVCODE,
T.SHARECODE,
T.TYPE
FROM TRANSACTION T
JOIN INVESTORS I ON T.INVCODE = I.INVCODE
JOIN SHARES S ON T.SHARECODE = S.SHARECODE
WHERE T.TRANDT < :startDate
)
SELECT
DISTINCT
SP.INVCODE AS InvestorCode,
SP.SHARECODE AS ShareCode,
SP.TYPE AS ShareType
FROM SHARES_IN_PERIOD SP
WHERE (SP.INVCODE, SP.SHARECODE, SP.TYPE) NOT IN
(
SELECT
PS.INVCODE,
PS.SHARECODE,
PS.TYPE
FROM PREVIOUS_SHARES PS
)
Following the suggestion given by @Gordon Linoff, I tried the following options (for all the shares I need), but they take a long time too. The transaction table has over 32 million rows.
1.
WITH
SHARES AS
(
SELECT S.SHARECODE FROM SHARE S WHERE S.DORMANT IS NULL
)
select t.invcode, t.sharecode, t.type
from (select t.*,
row_number() over (partition by invcode, sharecode, type order by trandt)
as seqnum
from transactions t
) t
join shares s on s.sharecode = t.sharecode
where seqnum = 1 and
t.trandt >= date '2020-01-01' and
t.trandt < date '2020-05-01';
2.
WITH
INVESTORS AS
(
SELECT I.INVCODE FROM INVESTOR I WHERE I.CLOSED IS NULL
),
SHARES AS
(
SELECT S.SHARECODE FROM SHARE S WHERE S.DORMANT IS NULL
)
select t.invcode, t.sharecode, t.type
from (select t.*,
row_number() over (partition by invcode, sharecode, type order by trandt)
as seqnum
from transactions t
) t
join investors i on i.invcode = t.invcode
join shares s on s.sharecode = t.sharecode
where seqnum = 1 and
t.trandt >= date '2020-01-01' and
t.trandt < date '2020-05-01';
3.
select t.invcode, t.sharecode, t.type
from (select t.*,
row_number() over (partition by invcode, sharecode, type order by trandt)
as seqnum
from transactions t
) t
where seqnum = 1 and
t.sharecode IN (SELECT S.SHARECODE FROM SHARE S WHERE S.DORMANT IS NULL)
and
t.trandt >= date '2020-01-01' and
t.trandt < date '2020-05-01';

If you want to know if the first record in transactions for a share is during a period, you can use window functions:
select t.*
from (select t.*,
row_number() over (partition by invcode, sharecode order by trandt) as seqnum
from transactions t
) t
where seqnum = 1 and
t.sharecode = :sharecode and
t.trandt >= date '2020-01-01' and
t.trandt < date '2020-05-01';
For this code to perform well, you want an index on transactions(invcode, sharecode, trandt).
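The "first transaction falls inside the period" rule can be tried against the sample rows from the question. This is a minimal sketch, assuming SQLite through Python's sqlite3 module instead of Oracle; it swaps row_number() for an equivalent MIN(trandt) with GROUP BY and HAVING, and it stores the dates as ISO text truncated to the day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER, trandt TEXT, invcode TEXT, sharecode TEXT)"
)
# Sample rows from the question, with timestamps truncated to dates.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (1, "2020-01-01", "inv1", "S1"),
        (2, "2019-04-01", "inv1", "S1"),
        (3, "2020-04-01", "inv1", "S1"),
        (4, "2021-03-06", "inv2", "S2"),
        (5, "2020-04-01", "inv3", "S1"),
    ],
)

# An investor is new for a share when the earliest transaction for that
# (investor, share) pair falls inside the half-open search period.
new_investors = conn.execute(
    """
    SELECT invcode, sharecode, MIN(trandt) AS first_trandt
    FROM transactions
    GROUP BY invcode, sharecode
    HAVING MIN(trandt) >= ? AND MIN(trandt) < ?
    """,
    ("2020-01-01", "2020-05-01"),
).fetchall()
print(new_investors)
```

inv1 is excluded because its earliest S1 transaction (2019-04-01) predates the period, even though it also traded inside the period.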

I'm creating an average retention curve for a business. However, the way I've written it

I'm creating an average retention curve for a business. However, right now the denominator in every month counts every customer that had a transaction in Month 0. The problem is that after Month 0, not every customer has been a customer long enough to have a transaction in the subsequent months. How would I change this query so the denominator only counts customers who have been around long enough to have a transaction in that month? For instance, all customers that had their first transaction in July would be removed from the Month 1 to Month 29 denominators.
Below is the query I'm currently using:
SELECT
cohorts.user_count
,CAST(cohorts.user_count AS DECIMAL(36,4)) / first_cohort.user_count AS retention_pct
,cohorts.transaction_count
,cohorts.Cohort
from
(
select
count(distinct t.owner) AS user_count
,count(t.id) AS transaction_count
,datediff(month,[f.first_transaction_date:month],[t.createdon:month]) as Cohort
from [transaction_cache as t]
JOIN
(
SELECT
owner
,MIN(createdon) AS first_transaction_date
FROM [transaction_cache]
WHERE [createdon:year] > ['2017-01-01':date:year]
GROUP BY 1
) AS f
ON f.owner = t.owner
where [t.createdon:year] > ['2017-01-01':date:year]
and t.status = 'successful'
and t.type = 'savings'
group by 3
) AS cohorts
JOIN
(
select
count(distinct t.owner) AS user_count
,count(t.id) AS transaction_count
,datediff(month,[f.first_transaction_date:month],[t.createdon:month]) as Cohort
from [transaction_cache as t]
JOIN
(
SELECT
owner
,MIN(createdon) AS first_transaction_date
FROM [transaction_cache]
WHERE [createdon:year] > ['2017-01-01':date:year]
GROUP BY 1
) AS f
ON f.owner = t.owner
where [t.createdon:year] > ['2017-01-01':date:year]
and t.status = 'successful'
and t.type = 'savings'
and datediff(month,[f.first_transaction_date:month],[t.createdon:month]) = 0
group by 3
) AS first_cohort
ON 1=1
order by 4 asc
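The denominator being asked for can be stated precisely outside SQL. The following is a hedged sketch in plain Python, not a rewrite of the query above: it assumes each transaction is reduced to an (owner, month index) pair, where the month index is any integer that increases by one per calendar month, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def retention_curve(transactions):
    """Return {month_offset: retained_share} with an age-aware denominator.

    transactions: iterable of (owner, month_index) pairs.
    """
    transactions = list(transactions)

    # First active month per owner defines the owner's cohort.
    first_month = {}
    for owner, month in transactions:
        first_month[owner] = min(month, first_month.get(owner, month))

    latest = max(month for _, month in transactions)

    # Owners active at each offset from their own first month.
    active = defaultdict(set)
    for owner, month in transactions:
        active[month - first_month[owner]].add(owner)

    curve = {}
    max_offset = latest - min(first_month.values())
    for offset in range(max_offset + 1):
        # Only owners old enough to have reached this offset are counted.
        eligible = [o for o, f in first_month.items() if f + offset <= latest]
        curve[offset] = len(active[offset]) / len(eligible)
    return curve


# A starts in month 0 and returns twice; B starts in month 1; C in month 2.
sample = [("A", 0), ("A", 1), ("A", 2), ("B", 1), ("C", 2)]
curve = retention_curve(sample)
print(curve)
```

In the sample, offset 1 divides by 2 rather than 3 because C joined in the latest month and has not yet had a chance to reach Month 1. In SQL terms this corresponds to computing the denominator per offset from the first-transaction subquery instead of joining every row to the single Month 0 count.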

Full Outer Join, Coalesce, and Group By (Oh My!)

I'm going to ask this in two parts, because my logic may be way off, and if so, the syntax doesn't really matter.
I have 10 queries. Each query returns month, supplier, and count(some metric). The queries use various tables, joins, etc. Not all month/supplier combinations exist in the output for each query. I would like to combine these into a single data set that can be exported and pivoted on in Excel.
I'd like the output to look like this:
Month | Supplier | Metric1 |Metric2 |..| Metric 10
2018-01 | Supp1 | _value_ | _value_ |...| _value_ |
2018-01 | Supp2 | NULL | _value_ |...| NULL
What is the best / easiest / most efficient way to accomplish this?
I've tried various methods to accomplish the above, but I can't seem to get the syntax quite right. I wanted to make a very simple test case and build upon it, but I only have select privileges on the db, so I am unable to test it out. I was able to create a query that at least doesn't result in any squiggly red error lines, but applying the same logic to the bigger problem doesn't work.
This is what I've got:
create table test1(name varchar(20),credit int);
insert into test1 (name, credit) values ('Ed',1),('Ann',1),('Jim',1),('Ed',1),('Ann',1);
create table test2 (name varchar(10), debit int);
insert into test2 (name, debit) values ('Ann',1),('Sue',1),('Sue',1),('Sue',1);
select
coalesce(a.name, b.name) as name,
cred,
deb
from
(select name, count(credit) as cred
from test1
group by name) a
full outer join
(select name, count(debit) as deb
from test2
group by name) b on
a.name =b.name;
Am I headed down the right path?
UPDATE: Based on Gordon's input, I tried this on the first two queries:
select Month, Supp,
sum(case when which = 1 then metric end) as Exceptions,
sum(case when which = 2 then metric end) as BackOrders
from (
(
select Month, Supp, metric, 1 as which
from (
select (convert(char(4),E.PostDateTime,120)+'-'+convert(char(2),E.PostDateTime,101)) as Month, E.TradingPartner as Supp, count(distinct(E.excNum)) as metric
from db..TrexcMangr E
where (E.DSHERep in ('AVR','BTB') OR E.ReleasedBy in ('AVR','BTB')) AND year(E.PostDateTime) >= '2018'
) a
)
union all
(
select Month, Supp, metric, 2 as which
from (
select (convert(char(4),T.UpdatedDateTime,120)+'-'+convert(char(2),T.UpdatedDateTime,101)) as Month, P.Supplier as Supp, count(*) as metric
from db1..trordertext T
inner join mdid_Tran..trOrderPO P on P.PONum = T.RefNum
where T.TextType = 'BO' AND (T.CreatedBy in ('AVR','BTB') OR T.UpdatedBy in ('AVR','BTB')) AND year(UpdatedDateTime) >=2018
) b
)
) q
group by Month, Supp
... but I'm getting a group by error.

One method uses union all and group by:
select month, supplier,
sum(case when which = 1 then metric end) as metric_01,
sum(case when which = 2 then metric end) as metric_02,
. . .
from ((select Month, Supplier, Metric, 1 as which
from (<query1>) q
. . .
) union all
(select Month, Supplier, Metric, 2 as which
from (<query2>) q
. . .
) union all
. . .
) q
group by month, supplier;
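The same pattern can be run against the small test1/test2 tables from the question. This is a minimal sketch, assuming SQLite through Python's sqlite3 module as a stand-in for the asker's database; each branch aggregates on its own before the UNION ALL, and the column aliases cred and deb are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test1 (name TEXT, credit INTEGER);
INSERT INTO test1 VALUES ('Ed',1),('Ann',1),('Jim',1),('Ed',1),('Ann',1);
CREATE TABLE test2 (name TEXT, debit INTEGER);
INSERT INTO test2 VALUES ('Ann',1),('Sue',1),('Sue',1),('Sue',1);
""")

# Tag each aggregated branch with a constant, stack the branches, then
# pivot the tags into columns with conditional sums.
combined = conn.execute("""
    SELECT name,
           SUM(CASE WHEN which = 1 THEN metric END) AS cred,
           SUM(CASE WHEN which = 2 THEN metric END) AS deb
    FROM (
        SELECT name, COUNT(credit) AS metric, 1 AS which
        FROM test1 GROUP BY name
        UNION ALL
        SELECT name, COUNT(debit) AS metric, 2 AS which
        FROM test2 GROUP BY name
    ) AS q
    GROUP BY name
    ORDER BY name
""").fetchall()
for row in combined:
    print(row)
```

Names that appear in only one branch come back with NULL (None in Python) for the other metric, which is the same shape a full outer join with COALESCE would produce, and the approach extends to ten branches without chaining ten joins.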

SELECT
CalendarMonthStart,
Supp,
SUM(CASE WHEN metric_id = 1 THEN metric END) as Exceptions,
SUM(CASE WHEN metric_id = 2 THEN metric END) as BackOrders
FROM
(
SELECT
DATEADD(month, DATEDIFF(month, 0, E.PostDateTime), 0) AS CalendarMonthStart,
E.TradingPartner AS Supp,
COUNT(DISTINCT(E.excNum)) AS metric,
1 AS metric_id
FROM
db..TrexcMangr E
WHERE
( E.DSHERep in ('AVR','BTB')
OR E.ReleasedBy in ('AVR','BTB')
)
AND E.PostDateTime >= '2018-01-01'
GROUP BY
DATEADD(month, DATEDIFF(month, 0, E.PostDateTime), 0),
E.TradingPartner
UNION ALL
SELECT
DATEADD(month, DATEDIFF(month, 0, T.UpdatedDateTime), 0) AS CalendarMonthStart,
P.Supplier AS Supp,
COUNT(*) AS metric,
2 AS metric_id
FROM
db1..trordertext T
INNER JOIN
mdid_Tran..trOrderPO P
ON P.PONum = T.RefNum
WHERE
( T.CreatedBy in ('AVR','BTB')
OR T.UpdatedBy in ('AVR','BTB')
)
AND T.TextType = 'BO'
AND T.UpdatedDateTime >= '2018-01-01'
GROUP BY
DATEADD(month, DATEDIFF(month, 0, T.UpdatedDateTime), 0),
P.Supplier
)
combined
GROUP BY
CalendarMonthStart,
Supp
