How to write a concise SQL query to get the subscription rate by month?
Formula: subscription rate = subscription count / trial count
NOTE: The tricky part is that the subscription event should be attributed to the month in which the company started the trial.
| id | date | type |
|-------|------------|-------|
| 10001 | 2019-01-01 | Trial |
| 10001 | 2019-01-15 | Sub |
| 10002 | 2019-01-20 | Trial |
| 10002 | 2019-02-10 | Sub |
| 10003 | 2019-01-01 | Trial |
| 10004 | 2019-02-10 | Trial |
Based on the above table, the output should be:
2019-01-01 2/3
2019-02-01 0/1
One option is a self-join to identify whether each trial eventually subscribed, followed by aggregation and arithmetic:
select
date_trunc('month', t.date) date_month,
1.0 * count(s.id) / count(t.id) rate
from mytable t
left join mytable s on s.id = t.id and s.type = 'Sub'
where t.type = 'Trial'
group by date_trunc('month', t.date)
The syntax to truncate a date to the beginning of the month widely varies across databases. The above would work in Postgres. Alternatives are available in other databases, such as:
date_format(t.date, '%Y-%m-01') -- MySQL
trunc(t.date, 'mm') -- Oracle
datefromparts(year(t.date), month(t.date), 1) -- SQL Server
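To try the self-join approach without a Postgres server, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. It assumes SQLite's strftime('%Y-%m-01', ...) as the month-truncation expression (SQLite has no date_trunc). The function name and SAMPLE_ROWS are made up for illustration; the rows are the sample data from the question.

```python
import sqlite3

# Sample data from the question: (id, date, type)
SAMPLE_ROWS = [
    (10001, "2019-01-01", "Trial"),
    (10001, "2019-01-15", "Sub"),
    (10002, "2019-01-20", "Trial"),
    (10002, "2019-02-10", "Sub"),
    (10003, "2019-01-01", "Trial"),
    (10004, "2019-02-10", "Trial"),
]

def subscription_rate_by_month(rows):
    """Return (trial_month, rate) pairs, attributing each sub to its trial month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, date TEXT, type TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    query = """
        SELECT strftime('%Y-%m-01', t.date) AS date_month,  -- assumption: SQLite syntax
               1.0 * COUNT(s.id) / COUNT(t.id) AS rate
        FROM mytable t
        LEFT JOIN mytable s ON s.id = t.id AND s.type = 'Sub'
        WHERE t.type = 'Trial'
        GROUP BY strftime('%Y-%m-01', t.date)
        ORDER BY date_month
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    for month, rate in subscription_rate_by_month(SAMPLE_ROWS):
        print(month, round(rate, 3))
```

COUNT(s.id) only counts trials whose left join found a 'Sub' row, which is why the February subscription of id 10002 ends up in the January rate.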
You can do this with window functions. Assuming that there are no duplicate trials/subs:
select date_trunc('month', date) as yyyymm,
count(*) filter (where num_subs > 0) * 1.0 / count(*)
from (select t.*,
count(*) filter (where type = 'Sub') over (partition by id) as num_subs
from t
) t
where type = 'Trial'
group by yyyymm;
If an id can have duplicate trials or subs, then I suggest that you ask a new question with more detail about the duplicates.
You can also do this with two levels of aggregation:
select date_trunc('month', trial_date) as trial_month,
count(sub_date) * 1.0 / count(*)
from (select id, min(date) filter (where type = 'Trial') as trial_date,
min(date) filter (where type = 'Sub') as sub_date
from t
group by id
) id
group by date_trunc('month', trial_date);
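The filter clause is not available in every database. As a hedged alternative, the sketch below expresses the same two-level aggregation with CASE expressions inside min(), run against in-memory SQLite from Python. The strftime call, the per_id alias, the function name, and SAMPLE_ROWS are assumptions for illustration, not part of the answer above.

```python
import sqlite3

# Sample data from the question: (id, date, type)
SAMPLE_ROWS = [
    (10001, "2019-01-01", "Trial"),
    (10001, "2019-01-15", "Sub"),
    (10002, "2019-01-20", "Trial"),
    (10002, "2019-02-10", "Sub"),
    (10003, "2019-01-01", "Trial"),
    (10004, "2019-02-10", "Trial"),
]

def rate_by_trial_month(rows):
    """First collapse to one row per id, then aggregate per trial month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, date TEXT, type TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = """
        SELECT strftime('%Y-%m-01', trial_date) AS trial_month,
               COUNT(sub_date) * 1.0 / COUNT(*) AS rate
        FROM (
            -- Level 1: one row per id with its first trial and first sub date.
            SELECT id,
                   MIN(CASE WHEN type = 'Trial' THEN date END) AS trial_date,
                   MIN(CASE WHEN type = 'Sub' THEN date END) AS sub_date
            FROM t
            GROUP BY id
        ) per_id
        WHERE trial_date IS NOT NULL
        -- Level 2: one row per trial month.
        GROUP BY trial_month
        ORDER BY trial_month
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    print(rate_by_trial_month(SAMPLE_ROWS))
```

Because the inner query keeps only the earliest trial and earliest sub per id, duplicate events for the same id do not inflate the counts.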
Related
I have the following data set:
| EMAIL | SIGNUP_DATE |
| A#ABC.COM | 1/1/2021 |
| B#ABC.COM | 1/2/2021 |
| C#ABC.COM | 1/3/2021 |
In order to find the running total of email signups as of a certain day, I ran the following sql query:
select
signup_date,
count(email) OVER (order by signup_date ASC) as running_total_signups
I got the following results:
| SIGNUP_DATE | RUNNING_TOTAL_SIGNUPS |
| 1/1/21 | 1 |
| 1/2/21 | 2 |
| 1/3/21 | 3 |
However for my next step, I want to be able to see not just the running total signups, but the actual signup names themselves. Therefore I want to run the same window function (count(email) OVER (order by signup_date ASC)) but instead of a count(email) just a select distinct email. This would hopefully result in the following output:
| SIGNUP_DATE | RUNNING_TOTAL_SIGNUPS |
| 1/1/21 | a#abc.com |
| 1/2/21 | a#abc.com |
| 1/2/21 | b#abc.com |
| 1/3/21 | a#abc.com |
| 1/3/21 | b#abc.com |
| 1/3/21 | c#abc.com |
How would I do this? I'm getting an error on this code:
select
signup_date,
distinct email OVER (order by signup_date ASC) as running_total_signups
One way would be to cross-join the running-total results with themselves and keep only the rows whose total is less than or equal to the running total:
with counts as (
select *,
Count(*) over (order by SIGNUP_DATE asc) as tot
from t
)
select c2.EMAIL, c1.SIGNUP_DATE
from counts c1
cross join counts c2
where c2.tot <= c1.tot
I want to run the same window function (count(email) OVER (order by
signup_date ASC)) but instead of a count(email) just a select distinct
email
Why do you want COUNT() window function?
It has nothing to do with your requirement.
All you need is a simple self join:
SELECT t1.SIGNUP_DATE, t2.EMAIL
FROM tablename t1 INNER JOIN tablename t2
ON t2.SIGNUP_DATE <= t1.SIGNUP_DATE
ORDER BY t1.SIGNUP_DATE, t2.EMAIL;
which will work for your sample data, but in case there is more than one row for a day in your table, you should use:
SELECT t1.SIGNUP_DATE, t2.EMAIL
FROM (SELECT DISTINCT SIGNUP_DATE FROM tablename) t1 INNER JOIN tablename t2
ON t2.SIGNUP_DATE <= t1.SIGNUP_DATE
ORDER BY t1.SIGNUP_DATE, t2.EMAIL;
It's actually slightly simpler than Stu proposed:
select
x2.signup_date,
x1.email
from
signups x1
INNER JOIN signups x2 ON x1.signup_date <= x2.signup_date
order by x2.signup_date, x1.email
If you join the table to itself on any date that is less than or equal to the current row's date, you get a half Cartesian explosion. The lowest-dated row matches only itself. The next one matches itself and the earlier one, so one of the table aliases has its data repeated. This continues, adding more rows to the explosion as the dates increase.
In the result set, we want the emails from x1 and the dates from x2.
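The half Cartesian join is easy to check on the sample data. This is a minimal sketch using in-memory SQLite from Python; the signups table name follows the answer above, the dates are rewritten in ISO format so that string comparison orders them correctly (an assumption about storage), and the function name is made up for illustration.

```python
import sqlite3

# Sample data from the question, with dates as ISO strings.
SIGNUPS = [
    ("a#abc.com", "2021-01-01"),
    ("b#abc.com", "2021-01-02"),
    ("c#abc.com", "2021-01-03"),
]

def running_email_list(signups):
    """For each signup date, list every email that signed up on or before it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signups (email TEXT, signup_date TEXT)")
    conn.executemany("INSERT INTO signups VALUES (?, ?)", signups)
    query = """
        SELECT x2.signup_date, x1.email
        FROM (SELECT DISTINCT signup_date FROM signups) x2   -- one row per date
        INNER JOIN signups x1 ON x1.signup_date <= x2.signup_date
        ORDER BY x2.signup_date, x1.email
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    for signup_date, email in running_email_list(SIGNUPS):
        print(signup_date, email)
```

The DISTINCT subquery keeps the output free of repeated blocks when several people sign up on the same day.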
I'm trying to build a monthly tally of active equipment, grouped by service area from a database log table. I think I'm 90% of the way there; I have a list of months, along with the total number of items that existed, and grouped by region.
However, I also need to know the state of each item as it was on the first of each month, and this is the part I'm stuck on. For instance, Item 1 is in region A in January, but moves to Region B in February. Item 2 is marked as 'inactive' in February, so shouldn't be counted. My existing query will always count item 1 in region A, and item 2 as 'active'.
I can correctly show that Item 3 is deleted in March, and Item 4 doesn't show up until the April count. I realize that I'm getting the first values because my query is specifying the min date, I'm just not sure how I need to change it to get what I want.
I think I'm looking for a way to group by Max(OperationDate) for each Month.
The Table looks like this:
| EQUIPID | EQUIPNAME | EQUIPACTIVE | DISTRICT | REGION | OPERATIONDATE | OPERATION |
|---------|-----------|-------------|----------|--------|----------------------|-----------|
| 1 | Item 1 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 3 | Item 3 | 1 | 1 | A | 2015-01-01T00:00:00Z | INS |
| 2 | Item 2 | 0 | 1 | A | 2015-02-10T00:00:00Z | UPD |
| 1 | Item 1 | 1 | 1 | B | 2015-02-15T00:00:00Z | UPD |
| 3 | (null) | (null) | (null) | (null) | 2015-02-21T00:00:00Z | DEL |
| 1 | Item 1 | 1 | 1 | A | 2015-03-01T00:00:00Z | UPD |
| 4 | Item 4 | 1 | 1 | B | 2015-03-10T00:00:00Z | INS |
There is also a subtable that holds attributes that I care about. Its structure is similar. Unfortunately, due to previous design decisions, there is no correlation to operations between the two tables. Any joins will need to be done using the EquipmentID, and have the overlapping states matched up for each date.
Current query:
--cte to build date list
WITH calendar (dt) AS
(SELECT &fromdate from dual
UNION ALL
SELECT Add_Months(dt,1)
FROM calendar
WHERE dt < &todate)
SELECT dt, a.district, a.region, count(*)
FROM
(SELECT EQUIPID, DISTRICT, REGION, OPERATION, MIN(OPERATIONDATE ) AS FirstOp, deleted.deldate
FROM Equipment_Log
LEFT JOIN
(SELECT EQUIPID,MAX(OPERATIONDATE) as DelDate
FROM Equipment_Log
WHERE OPERATION = 'DEL'
GROUP BY EQUIPID
) Deleted
ON Equipment_Log.EQUIPID = Deleted.EQUIPID
WHERE OPERATION <> 'DEL' --AND additional unimportant filters
GROUP BY EQUIPID,DISTRICT, REGION , OPERATION, deldate
) a
INNER JOIN calendar
ON (calendar.dt >= FirstOp AND calendar.dt < deldate)
OR (calendar.dt >= FirstOp AND deldate is null)
LEFT JOIN
( SELECT EQUIPID, MAX(OPERATIONDATE) as latestop
FROM SpecialEquip_Table_Log
--where SpecialEquip filters
group by EQUIPID
) SpecialEquip
ON a.EQUIPID = SpecialEquip.EQUIPID and calendar.dt >= SpecialEquip.latestop
GROUP BY dt, district, region
ORDER BY dt, district, region
Take only the last operation for each id in each month. This is what row_number() and where rn = 1 do.
We have a calendar and the data, so make a partitioned outer join.
I assumed that you need to fill in values for months where an id has no entries. That is why nvl(lag() ignore nulls) is needed: if something appeared in January, it still exists in February and March, and we need the district and region values from the last non-empty row.
Now you have everything needed for the count. The part where you mentioned SpecialEquip_Table_Log is up to you: you left-joined this table but never used it afterwards, so what is it for? Join it if you need it; you have the id.
with
calendar(mth) as (
select date '2015-01-01' from dual union all
select add_months(mth, 1) from calendar where mth < date '2015-05-01'),
data as (
select id, dis, reg, dt, op, act
from (
select equipid id, district dis, region reg,
to_char(operationdate, 'yyyy-mm') dt,
row_number()
over (partition by equipid, trunc(operationdate, 'month')
order by operationdate desc) rn,
operation op, nvl(equipactive, 0) act
from t)
where rn = 1 )
select mth, dis, reg, sum(act) cnt
from (
select id, mth,
nvl(dis, lag(dis) ignore nulls over (partition by id order by mth)) dis,
nvl(reg, lag(reg) ignore nulls over (partition by id order by mth)) reg,
nvl(act, lag(act) ignore nulls over (partition by id order by mth)) act
from calendar
left join data partition by (id) on dt = to_char(mth, 'yyyy-mm') )
group by mth, dis, reg
having sum(act) > 0
order by mth, dis, reg
It may seem complicated, so please run subqueries separately at first to see what is going on. And test :) Hope this helps.
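If it helps to see the logic outside of Oracle, the three steps can be restated in plain Python. This is only a sketch of the same idea, not a translation of the query: it keeps the last operation per id and month, carries the most recent non-empty district and region forward, and sums the active flags. Like the query, it reflects the last operation within each month. The names LOG and monthly_active_counts are hypothetical; the rows come from the sample table in the question.

```python
from collections import defaultdict

# (equip_id, active, district, region, operation_date, operation)
LOG = [
    (1, 1, 1, "A", "2015-01-01", "INS"),
    (2, 1, 1, "A", "2015-01-01", "INS"),
    (3, 1, 1, "A", "2015-01-01", "INS"),
    (2, 0, 1, "A", "2015-02-10", "UPD"),
    (1, 1, 1, "B", "2015-02-15", "UPD"),
    (3, None, None, None, "2015-02-21", "DEL"),
    (1, 1, 1, "A", "2015-03-01", "UPD"),
    (4, 1, 1, "B", "2015-03-10", "INS"),
]

def monthly_active_counts(log, months):
    """Count active items per (month, district, region)."""
    # Step 1: last operation per (id, month); later dates overwrite earlier ones.
    last_in_month = {}
    for equip_id, active, district, region, op_date, _op in sorted(log, key=lambda r: r[4]):
        last_in_month[(equip_id, op_date[:7])] = (district, region, active or 0)

    counts = defaultdict(int)
    for equip_id in sorted({row[0] for row in log}):
        district = region = active = None
        # Steps 2 and 3: walk the calendar and carry the last known state forward.
        for month in months:
            row = last_in_month.get((equip_id, month))
            if row is not None:
                new_district, new_region, active = row
                district = new_district if new_district is not None else district
                region = new_region if new_region is not None else region
            if active:
                counts[(month, district, region)] += active
    return dict(sorted(counts.items()))

if __name__ == "__main__":
    months = ["2015-01", "2015-02", "2015-03", "2015-04"]
    for key, cnt in monthly_active_counts(LOG, months).items():
        print(key, cnt)
```

Deleted and inactive items contribute zero, so groups whose total is zero simply never appear, which mirrors the having sum(act) > 0 filter.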
I am trying to find the number of orders I got in the month of April. I have 3 orders but my query gets the result 0. What could be the problem?
Here's the table:
id | first | middle | last | product_name | numberOut | Date
1 | Muhammad | Sameer | Khan | Macbook | 1 | 2020-04-01
2 | Chand | Shah | Khurram | Dell Optiplex | 1 | 2020-04-02
3 | Sultan | | Chohan | HP EliteBook | 1 | 2020-03-31
4 | Express | Eva | Plant | Dell Optiplex | 1 | 2020-03-11
5 | Rana | Faryad | Ali | HP EliteBook | 1 | 2020-04-02
And here's the query:
SELECT SUM(CASE WHEN strftime('%m', oDate) = '04' THEN 'id' END) FROM orders;
If you want all Aprils, then you can just look at the month. For April 2020 specifically, I would recommend:
select count(*)
from orders o
where o.date >= '2020-04-01' and o.date < '2020-05-01';
Note that this compares the date column directly to valid date literals in the where clause.
The problem with your code is this:
THEN 'id'
You are using the aggregate function SUM() and you sum over a string literal like 'id' which is implicitly converted to 0 (because it can't be converted to a number) so the result is 0.
Even if you remove the single quotes you will not get the result that you want because you will get the sum of the ids.
But if you used:
THEN 1 ELSE 0
then you would get the correct result.
But with SQLite you can write it more simply:
SELECT SUM(strftime('%m', oDate) = '04') FROM orders;
without the CASE expression.
Or since you just want to count the orders then COUNT() will do it:
SELECT COUNT(*) FROM orders WHERE strftime('%m', oDate) = '04';
Edit.
If you want to count the orders for all the months then group by month:
SELECT strftime('%Y-%m', oDate) AS month,
COUNT(*) AS number_of_orders
FROM orders
GROUP BY month;
SELECT SUM(CASE WHEN strftime('%m', oDate) = '04' THEN 1 ELSE 0 END) FROM orders;
if you need to use SUM
There is a problem with your query. You do not need to do that aggregation operation.
SELECT COUNT(*) from table_name WHERE strftime('%m', Date) = '04';
I would use explicit date comparisons rather than date functions. This makes the query SARGable, i.e. it may benefit from an existing index.
The most efficient approach, with a filter in the where clause:
select count(*) cnt
from orders
where oDate >= '2020-04-01' and oDate < '2020-05-01'
Alternatively, if you want a result of 0 even when there are no orders in April you can do conditional aggregation, as you originally intended:
select sum(case when oDate >= '2020-04-01' and oDate < '2020-05-01' then 1 else 0 end) cnt
from orders
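The variants discussed in the answers above can be compared side by side. Below is a minimal sketch that loads the five sample orders into in-memory SQLite from Python and runs each query. The two-column orders table (id, oDate) and the function name are simplifications assumed for illustration; the other columns from the question are left out.

```python
import sqlite3

# (id, oDate) from the sample table in the question.
ORDERS = [
    (1, "2020-04-01"),
    (2, "2020-04-02"),
    (3, "2020-03-31"),
    (4, "2020-03-11"),
    (5, "2020-04-02"),
]

def april_counts(orders):
    """Run the original query and the proposed fixes on the same data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, oDate TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)

    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    results = {
        # Original query: sums the string literal 'id', which counts as zero.
        "sum_of_string": scalar(
            "SELECT SUM(CASE WHEN strftime('%m', oDate) = '04' THEN 'id' END) FROM orders"),
        # Conditional aggregation with numeric values.
        "sum_of_ones": scalar(
            "SELECT SUM(CASE WHEN strftime('%m', oDate) = '04' THEN 1 ELSE 0 END) FROM orders"),
        # Filtering on the month number (matches April of any year).
        "count_by_month": scalar(
            "SELECT COUNT(*) FROM orders WHERE strftime('%m', oDate) = '04'"),
        # Half-open date range (April 2020 only, can use an index on oDate).
        "count_by_range": scalar(
            "SELECT COUNT(*) FROM orders WHERE oDate >= '2020-04-01' AND oDate < '2020-05-01'"),
    }
    conn.close()
    return results

if __name__ == "__main__":
    for name, value in april_counts(ORDERS).items():
        print(name, value)
```

On this data the month filter and the date range agree; they would differ as soon as the table contained orders from April of another year.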
I'm using standard SQL in Google BigQuery. So I have some data about metrics in this format:
Date | metric_name | metric_level
01/02/2019 | metric_one | 1
02/03/2019 | metric_one | 2
14/02/2019 | metric_two | 6
17/02/2019 | metric_two | 4
01/03/2019 | metric_three | 2
10/03/2019 | metric_three | 7
I want to get it in this format, date history going back one year, and then each metric filled in for each date. If a metric has no data for a particular date then it uses the most recent data point:
Date | metric_one | metric_two | metric_three
..........
01/02/2019 | 1 | null | null
02/02/2019 | 1 | null | null
03/02/2019 | 1 | null | null
...........
...........
13/02/2019 | 1 | null | null
14/02/2019 | 1 | 6 | null
15/02/2019 | 1 | 6 | null
...........
...........
09/03/2019 | 2 | 4 | 2
10/03/2019 | 2 | 4 | 7
11/03/2019 | 2 | 4 | 7
...........
and so on.
I've managed to write some code that does this, but I want to know if there's a more efficient way of doing it. There are actually a lot more than 3 metrics, so if I can make it more efficient in any way then it will save a lot of resources in the long run.
This is my code
WITH date_arr AS(
SELECT
date
FROM UNNEST(
GENERATE_DATE_ARRAY(
DATE_SUB(CURRENT_DATE(),INTERVAL 365 DAY),
CURRENT_DATE(),
INTERVAL 1 day
)
) AS date
),
metric_one_raw AS (
SELECT
date,
metric_level
FROM database
WHERE metric_name = 'metric_one'
),
metric_one_gapless AS (
SELECT
d.date AS date,
IFNULL(metric_level, LAST_VALUE(metric_level IGNORE NULLS) OVER(window_latest)) AS metric_one
FROM date_arr d
LEFT JOIN metric_one_raw i
ON d.date = i.date
WINDOW window_latest AS (ORDER BY d.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
),
metric_two_raw AS (
SELECT
date,
metric_level
FROM database
WHERE metric_name = 'metric_two'
),
metric_two_gapless AS (
SELECT
d.date AS date,
IFNULL(metric_level, LAST_VALUE(metric_level IGNORE NULLS) OVER(window_latest)) AS metric_two
FROM date_arr d
LEFT JOIN metric_two_raw i
ON d.date = i.date
WINDOW window_latest AS (ORDER BY d.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
),
metric_three_raw AS (
SELECT
date,
metric_level
FROM database
WHERE metric_name = 'metric_three'
),
metric_three_gapless AS (
SELECT
d.date AS date,
IFNULL(metric_level, LAST_VALUE(metric_level IGNORE NULLS) OVER(window_latest)) AS metric_three
FROM date_arr d
LEFT JOIN metric_three_raw i
ON d.date = i.date
WINDOW window_latest AS (ORDER BY d.date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
)
SELECT
*
FROM metric_one_gapless
LEFT JOIN metric_two_gapless USING(date)
LEFT JOIN metric_three_gapless USING(date)
Hope that makes sense. Thanks in advance!
You can do the following:
Generate the dates
Use a cross join to get all the rows
Use a left join to bring in your data
Use last_value() to fill in NULL values.
In other databases, I would prefer lag(ignore nulls), but BigQuery does not support that.
So:
select d, m.metric,
coalesce(mm.metric_level,
last_value(mm.metric_level ignore nulls) over (partition by m.metric order by d)
) as metric_level
from (select distinct metric from metrics) m cross join
unnest(generate_date_array(date_sub(current_date(), interval 1 year), current_date(), interval 1 day)) d left join
metrics mm
on mm.metric = m.metric and mm.date = d;
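To make the fill-forward step concrete, here is a small plain-Python sketch of the same idea: walk the calendar day by day and, for every metric, keep the most recent non-null value seen so far. It is not BigQuery code, and it returns one column per metric (the layout asked for in the question) rather than one row per metric and date. The names POINTS and forward_fill are hypothetical; the data points are the ones from the question, read as day/month/year.

```python
from datetime import date, timedelta

# (date, metric_name, metric_level) from the question.
POINTS = [
    (date(2019, 2, 1), "metric_one", 1),
    (date(2019, 3, 2), "metric_one", 2),
    (date(2019, 2, 14), "metric_two", 6),
    (date(2019, 2, 17), "metric_two", 4),
    (date(2019, 3, 1), "metric_three", 2),
    (date(2019, 3, 10), "metric_three", 7),
]

def forward_fill(points, start, end):
    """Return (metric_names, rows) with one row per day and gaps filled forward."""
    by_key = {(day, name): level for day, name, level in points}
    # Keep metrics in order of first appearance.
    metrics = list(dict.fromkeys(name for _, name, _ in points))
    last_seen = {name: None for name in metrics}
    rows = []
    day = start
    while day <= end:
        for name in metrics:
            level = by_key.get((day, name))
            if level is not None:          # ignore nulls, keep the last real value
                last_seen[name] = level
        rows.append((day, *(last_seen[name] for name in metrics)))
        day += timedelta(days=1)
    return metrics, rows

if __name__ == "__main__":
    names, rows = forward_fill(POINTS, date(2019, 2, 1), date(2019, 3, 11))
    print("date", *names)
    for row in rows:
        print(*row)
```

The loop touches each (day, metric) pair once, which is the same amount of work the cross join plus window function describes.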
After doing some research, I came up with something. Because you are using left joins, and there would be more than one of them, or even a variable number, and because you cannot use DECLARE in the BigQuery web UI, you would probably be better off using the BigQuery REST API through one of its client libraries (C#, Go, Java, Node.js, PHP, Python, or Ruby). That would let you hold the list of metrics in a variable. So I recommend first running a SELECT DISTINCT to find out which metrics exist, saving them in a variable, and then looping over them to build and execute the left joins you want.
I hope this information helps you, and I'm here if you need more information.
I'm building a query that aims to show the number of occurrences of two date variables per month. I was able to assemble the two separate queries: I count the number of occurrences and group per month, but I have no idea how to join these two queries, since they are from the same table, and still show the count with only one column of month.
Thanks for your help, guys!
Format: YYYY-MM-DD
|---------------------|------------------|
| onboard_date | offboard_date |
|---------------------|------------------|
| 2019/01/15 | - |
|---------------------|------------------|
| 2019/01/25 | 2019/02/15 |
|---------------------|------------------|
| 2019/02/13 | 2019/02/20 |
|---------------------|------------------|
| 2019/02/18 | - |
|---------------------|------------------|
| 2019/03/09 | - |
|---------------------|------------------|
What I have tried and worked:
SELECT DATE_TRUNC('month', onboard_date) AS onboard_month,
COUNT(*) as onboards
FROM lukla.trn_users trn
WHERE trn.company_name = 'amaro'
GROUP BY DATE_TRUNC('month', onboard_date)
ORDER BY DATE_TRUNC('month', onboard_date)
and
SELECT DATE_TRUNC('month', offboard_date) AS offboard_month,
COUNT(*) as offboards
FROM lukla.trn_users trn
WHERE trn.company_name = 'amaro' AND offboard_date IS NOT NULL
GROUP BY DATE_TRUNC('month', offboard_date)
ORDER BY DATE_TRUNC('month', offboard_date)
The result that I want:
|--------------|------------|------------|
| month | onboards | offboards |
|--------------|------------|------------|
| 01 | 2 | 0 |
|--------------|------------|------------|
| 02 | 2 | 2 |
|--------------|------------|------------|
| 03 | 1 | 0 |
|--------------|------------|------------|
A lateral join makes this pretty simple:
select date_trunc('month', v.dte) as month, sum(v.is_onboard) as onboards, sum(v.is_offboard) as offboards
from trn_users t cross join lateral
(values (t.onboard_date, (t.onboard_date is not null)::int, 0),
(t.offboard_date, 0, (t.offboard_date is not null)::int)
) v(dte, is_onboard, is_offboard)
where v.dte is not null
group by month
order by month;
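Where lateral joins are not available, the same unpivot can be written with UNION ALL. The sketch below uses that swapped-in technique, not the lateral join itself, and runs it against in-memory SQLite from Python on the sample rows from the question. The strftime call, the USERS list, and the function name are assumptions for illustration, and the company_name filter is omitted.

```python
import sqlite3

# (onboard_date, offboard_date) from the question; None means no offboard yet.
USERS = [
    ("2019-01-15", None),
    ("2019-01-25", "2019-02-15"),
    ("2019-02-13", "2019-02-20"),
    ("2019-02-18", None),
    ("2019-03-09", None),
]

def boards_by_month(users):
    """Return (month, onboards, offboards) with a single month column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trn_users (onboard_date TEXT, offboard_date TEXT)")
    conn.executemany("INSERT INTO trn_users VALUES (?, ?)", users)
    query = """
        SELECT strftime('%Y-%m', dte) AS month,
               SUM(is_onboard) AS onboards,
               SUM(is_offboard) AS offboards
        FROM (
            -- Turn each user row into up to two event rows.
            SELECT onboard_date AS dte, 1 AS is_onboard, 0 AS is_offboard FROM trn_users
            UNION ALL
            SELECT offboard_date, 0, 1 FROM trn_users
        ) v
        WHERE dte IS NOT NULL
        GROUP BY month
        ORDER BY month
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

if __name__ == "__main__":
    for row in boards_by_month(USERS):
        print(row)
```

Each event row carries a 1 in exactly one of the two flag columns, so summing the flags per month yields both counts in one pass.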
You can try to full join two derived tables, one getting the count for on boards, the other the count for off boards.
SELECT coalesce(x.month, y.month) month,
coalesce(x.count, 0) onboards,
coalesce(y.count, 0) offboards
FROM (SELECT date_trunc('month', trn.onboard_date) month,
count(*) count
FROM lukla.trn_users trn
WHERE trn.company_name = 'amaro'
AND trn.onboard_date IS NOT NULL
GROUP BY date_trunc('month', trn.onboard_date)) x
FULL JOIN (SELECT date_trunc('month', trn.offboard_date) month,
count(*) count
FROM lukla.trn_users trn
WHERE trn.company_name = 'amaro'
AND trn.offboard_date IS NOT NULL
GROUP BY date_trunc('month', trn.offboard_date)) y
ON y.month = x.month
ORDER BY coalesce(x.month, y.month);