Create a funnel in SQL with 30 days delay - sql

I have a table like this, with hundreds of records: month_signup, nb_signups, month_purchase and nb_purchases.
month_signup  nb_signups  month_purchase  nb_purchases
------------------------------------------------------
01            100         01              10
02            200         02              20
03            150         03              10
Let's say I want to calculate the signup-to-purchase ratio month after month.
Normally I could just divide nb_purchases/nb_signups*100, but not here.
I want to calculate the signup-to-purchase ratio with a 1-month (or 30-day) delay.
To give the signups time to purchase, I want to divide the nb_purchases of month 2 by the nb_signups of month 1. So 20/100, for example, in my table.
I tried this, but I'm really not sure about it:
SELECT
month_signup
,SAFE_DIVIDE(CASE WHEN purchase_month BETWEEN signups_month AND DATE_ADD(signups_month, INTERVAL 30 DAY) THEN nb_purchases ELSE NULL END, nb_signups)*100 AS sign_up_to_purchase_ratio
FROM table
ORDER BY 1

You can use the LEAD() function to get the next row's value relative to the current row. I'll provide MySQL query syntax for this:
with cte as (
  select month_signup,
         nb_signups,
         lead(nb_purchases) over (order by month_signup) as nextPr
  from MyData
)
select cte.month_signup,
       (nextPr / cte.nb_signups) * 100 as per
from cte
where (nextPr / cte.nb_signups) * 100 is not null;
You may replace (nextPr/cte.nb_signups) with the SAFE_DIVIDE function.
See the demo from db-fiddle.
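For reference, here is a minimal sketch of the same idea in BigQuery syntax, since the question used SAFE_DIVIDE; the table name MyData is carried over from the MySQL answer and is an assumption:

-- Sketch: 1-month-delayed conversion ratio (BigQuery syntax).
-- LEAD(nb_purchases) pulls the next month's purchases onto the current row;
-- SAFE_DIVIDE returns NULL instead of erroring when nb_signups is 0.
WITH cte AS (
  SELECT month_signup,
         nb_signups,
         LEAD(nb_purchases) OVER (ORDER BY month_signup) AS next_purchases
  FROM MyData
)
SELECT month_signup,
       SAFE_DIVIDE(next_purchases, nb_signups) * 100 AS sign_up_to_purchase_ratio
FROM cte
WHERE next_purchases IS NOT NULL
ORDER BY month_signup;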

Related

Finding a lagged cumulative total from a table containing both cumulative and delta entries

I have a SQL table with a schema where a value is either a cumulative value for a particular category, or a delta on top of the previous value. While I appreciate this is not a particularly great design, it comes from an external source and thus I can't change it in any way.
The table looks something like the following:
Date    Category  AmountSoldType  AmountSold
---------------------------------------------
Jan 1   Apples    Cumulative      100
Jan 1   Bananas   Cumulative      50
Jan 2   Apples    Delta           20
Jan 2   Bananas   Delta           10
Jan 3   Apples    Delta           25
Jan 3   Bananas   Cumulative      75
For this example, I want to produce the total cumulative number of fruits sold by item at the beginning of each day:
Date    Category  AmountSold
-----------------------------
Jan 1   Apples    0
Jan 1   Bananas   0
Jan 2   Apples    100
Jan 2   Bananas   50
Jan 3   Apples    170
Jan 3   Bananas   60
Jan 4   Apples    195
Jan 4   Bananas   75
Intuitively, I want to take the most recent cumulative total, and add any deltas that have appeared since that entry.
I imagine something akin to
SELECT Date, Category,
       LEAD((subquery??), 1) OVER (PARTITION BY Category ORDER BY Date) AS Amt
FROM Fruits
GROUP BY Date, Category
ORDER BY Date ASC
is what I want, but I'm having trouble putting the right subquery together. Any suggestions?
You seem to want to add the deltas to the most recent cumulative -- all before the current date.
If so, I think this logic does what you want:
select f.*,
       (max(case when date = date_cumulative then amountsold else 0 end) over
            (partition by category) +
        sum(case when date > date_cumulative then amountsold else 0 end) over
            (partition by category
             order by date
             rows between unbounded preceding and 1 preceding)
       ) as amt
from (select f.*,
             -- latest date so far that carried a cumulative figure
             max(case when AmountSoldType = 'Cumulative' then date end) over
                 (partition by category
                  order by date
                  rows between unbounded preceding and current row) as date_cumulative
      from fruits f
     ) f
I'm a bit confused by this data set (notwithstanding the mistake in adding up the apples). I assume the raw data states end-of-day figures, so for example 20 apples were sold on Jan 2 (because there is a delta of 20 reported for that day).
In your example results, it does not appear valid to say that zero apples were sold on Jan 1. It isn't actually possible to say how many were sold on that day, because it is not clear whether the 100 cumulative apples were accrued during Jan 1 (and thus should be excluded from the start-of-day figure you seek) or whether they were accrued on previous days (and should be included), or some mix of the two. That day's data should thus be null.
It is also not clear whether all data sets must begin with a cumulative, or whether data sets can begin with a delta (which might require working backwards from a subsequent cumulative), and whether you potentially have access to multiple data sets from your external source which form a continuous consistent sequence, or whether "cumulatives" relate purely to a single data set received. I'm going to assume at least that all data sets begin with a cumulative.
All that said, this problem is a simple case of firstly converting all rows into either all deltas, or all cumulatives. Assuming we go for all cumulatives, then recursing through each row in order, it is a case of either selecting the AmountSold as-is (if the row is a cumulative), or adding the AmountSold to the result of the previous step (if it is a delta).
Once pre-processed like this, then for a start-of-day cumulative, it is all just a question of looking at the previous day's cumulative (which was an end-of-day cumulative, if my initial assumption was correct that all raw data relates to end-of-day figures).
Using the LAG function in this final step to get the previous day's cumulative, will also neatly produce a null for the first row.
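To make that concrete, here is a sketch of the two steps in ANSI-ish SQL (a recursive CTE plus LAG). The table and column names come from the question; the row-numbering is my own scaffolding, and it assumes one row per category per date, with each category starting on a cumulative row. It produces start-of-day figures only for the dates actually present in the table.

-- Step 1: turn every row into a running end-of-day cumulative.
with ordered as (
  select Date, Category, AmountSoldType, AmountSold,
         row_number() over (partition by Category order by Date) as rn
  from Fruits
),
running (Category, Date, rn, cum) as (
  -- anchor: each category is assumed to begin with a cumulative row
  select Category, Date, rn, AmountSold
  from ordered
  where rn = 1
  union all
  -- a cumulative row resets the running total; a delta adds to it
  select o.Category, o.Date, o.rn,
         case when o.AmountSoldType = 'Cumulative' then o.AmountSold
              else r.cum + o.AmountSold
         end
  from running r
  join ordered o
    on o.Category = r.Category and o.rn = r.rn + 1
)
-- Step 2: the previous day's end-of-day cumulative is today's
-- start-of-day figure; LAG yields NULL for the first row, as noted above.
select Category, Date,
       lag(cum) over (partition by Category order by Date) as AmountSold
from running;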

Retain values till there is a change in value in Teradata

There is a transaction history table in Teradata where the balance changes only when there is a transaction.
Data as below:
Cust_id  Balance  Txn_dt
123      1000     27MAY2018
123      350      31MAY2018
For example, for customer 123 we have a balance of 1000 on May 27, and on May 31 the customer makes a transaction, so the balance becomes 350. No records are kept for May 28 to May 30 with the same balance as on May 27. I want data for those days to be there as well (with the same balance retained and the date incremented). The same record has to be repeated for the remaining days until the balance is changed by a transaction. How can I do this in Teradata?
Expected output:
Cust_id  Balance  Txn_dt
123      1000     27MAY2018
123      1000     28MAY2018
123      1000     29MAY2018
123      1000     30MAY2018
123      350      31MAY2018
Thanks
Sandy
Hi Dnoeth. It seems to work, but can you let me know how to expand until a certain day, e.g. until 30JUN2018?
There are several ways to get this result, the simplest in Teradata utilizes Time Series Expansion for Periods:
WITH cte AS
(
SELECT Cust_id, Balance, Txn_dt,
-- return the next row's date
Coalesce(Min(Txn_dt)
Over (PARTITION BY Cust_id
ORDER BY Txn_dt
ROWS BETWEEN 1 Following AND 1 Following)
,Txn_dt+1) AS next_Txn_dt
FROM tab
)
SELECT Cust_id, Balance
,Last(pd) -- last day of the period
FROM cte
-- make a period of the current and next row's date
-- and return one row per day
EXPAND ON PERIOD(Txn_dt, next_Txn_dt) AS pd
If you run TD16.10+ you can replace the MIN OVER with a simplified LEAD:
Lead(Txn_dt)
Over (PARTITION BY Cust_id
ORDER BY Txn_dt)
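Regarding the follow-up about expanding up to a fixed date such as 30JUN2018 (my own addition, not part of the original answer): the expansion is bounded by next_Txn_dt, so the fallback for each customer's last row can be the target date instead of Txn_dt+1, e.g.

-- Sketch: extend each customer's last balance to a fixed cut-off.
-- DATE '2018-07-01' is used because the period end is exclusive,
-- so this yields rows through 30JUN2018.
Coalesce(Min(Txn_dt)
         Over (PARTITION BY Cust_id
               ORDER BY Txn_dt
               ROWS BETWEEN 1 Following AND 1 Following)
        ,DATE '2018-07-01') AS next_Txn_dt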

From SQL to Neo4j: trying to group query results

I have the following query in SQL (Oracle DB 11g XE).
Just for context: this query finds the sensor with the biggest power factor, in a range between 0.90 and 0.99, for each month.
with abc as (
  select extract(month from peak_time) as Month,
         max(total_power_factor) as Max_Power_Factor
  from sensors
  group by extract(month from peak_time)
  order by Month desc
)
select abc.Month, Max_Power_Factor, meter_id as "Made by"
from abc
join sensors on sensors.total_power_factor = abc.Max_Power_Factor
where Max_Power_Factor between 0.90 and 0.99
order by Max_Power_Factor;
SQL Developer shows me the correct result, only ONE line for each month, without duplicates; for example:
Month  Max_Power_Factor  Scored by
6      0.981046427565    b492b271760a
1      0.945921825336    db71ffead179
3      0.943302142482    a9c471b03587
8      0.9383185638      410bd58c8396
7      0.930911694091    fe5954a46888
5      0.912872055549    ee3c8ec29155
My problem is trying to replicate the same query in Neo4j (3.2.1 CE, on Windows 10): I don't know exactly how to group the data in order to get the same results. (As you can see, I'm using APOC to manage dates.)
match(a:Sensor) with a, a.peak_time as peak_time
where (a.total_power_factor > 0.90 and a.total_power_factor <0.99 )
RETURN distinct a.meterid, max(peak_time),apoc.date.format(peak_time,'s','MM') as month
order by month desc
These are my Cypher results and, as you can see, there are multiple rows for some months.
Month  Max_Power_Factor  Scored by
06     0.981046427565    b492b271760a
01     0.945921825336    db71ffead179
03     0.943302142482    a9c471b03587
08     0.9383185638      410bd58c8396
08     0.93451098613     dfd6b67cc6d6
07     0.930911694091    fe5954a46888
02     0.916440282713    649956b34e87
05     0.912872055549    ee3c8ec29155
08     0.907059974935    a3e8df8a0ba8
So my question is: how can I group the data in order to get the same output as Oracle DB? (If it's possible, of course.)
Thanks in advance for your help.
The fields in the output you show do not correspond to the query (for example, what exactly is "Scored By" ?) but the trick to aggregating in Neo4j is understanding that the aggregation keys are implicit.
So if you have
RETURN distinct a.meterid, max(peak_time),apoc.date.format(peak_time,'s','MM') as month
you are grouping on meterid and month.
If you want to group on month only it should be
RETURN max(peak_time),apoc.date.format(peak_time,'s','MM') as month
Hope this helps !
Regards,
Tom
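For completeness, here is one Cypher pattern (a sketch, not tested against your data) that also recovers which sensor scored the monthly maximum, which grouping on month alone cannot do: sort the rows by factor before aggregating, collect them per month, and keep the head of each list. The property names are taken from your query:

// Sketch: one row per month, keeping the sensor with the highest factor.
// Rows arrive at the aggregation sorted by factor descending, so head()
// of each collected list is that month's maximum.
MATCH (a:Sensor)
WHERE a.total_power_factor > 0.90 AND a.total_power_factor < 0.99
WITH a ORDER BY a.total_power_factor DESC
WITH apoc.date.format(a.peak_time, 's', 'MM') AS month,
     collect({meter: a.meterid, pf: a.total_power_factor}) AS rows
RETURN month, head(rows).pf AS Max_Power_Factor, head(rows).meter AS meter_id
ORDER BY Max_Power_Factor DESC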

oracle sql: efficient way to calculate business days in a month

I have a pretty huge table with columns date, account, amount, etc., e.g.
date      account  amount
4/1/2014  XXXXX1   80
4/1/2014  XXXXX1   20
4/2/2014  XXXXX1   840
4/3/2014  XXXXX1   120
4/1/2014  XXXXX2   130
4/3/2014  XXXXX2   300
...........
(I have 40 months' worth of daily data and multiple accounts.)
The final output I want is the average amount for each account each month. Since there may or may not be a record for a given account on a given day, and I have a separate table of holidays from 2011 to 2014, I am summing up the amount for each account within a month and dividing it by the number of business days in that month. Note that there are very likely to be record(s) on weekends/holidays, so I need to exclude them from the calculation. Also, I want to have a record for each of the dates available in the original table, e.g.
date      account  amount
4/1/2014  XXXXX1   48   ((80+20+840+120)/22)
4/2/2014  XXXXX1   48
4/3/2014  XXXXX1   48
4/1/2014  XXXXX2   19   ((130+300)/22)
4/3/2014  XXXXX2   19
...........
(Suppose the above is the only data I have for Apr-2014.)
I am able to do this in a hacky and slow way, but as I need to join this process with other subqueries, I really need to optimize this query. My current code looks like:
select date,
       account,
       sum(amount / days_mon) over (partition by last_day(date))
from (select date,
             -- there are more calculations to get the account numbers,
             -- so this subquery is necessary
             account,
             amount,
             -- this is a list of month-end dates for which the number of
             -- business days in that month is 19; similar below.
             case when last_day(date) in ('','',...,'') then 19
                  when last_day(date) in ('','',...,'') then 20
                  when last_day(date) in ('','',...,'') then 21
                  when last_day(date) in ('','',...,'') then 22
                  when last_day(date) in ('','',...,'') then 23
             end as days_mon
      from mytable tb
      inner join lookup_businessday_list busi
        on tb.date = busi.date)
So how can I perform the above purpose efficiently? Thank you!
This approach uses sub-query factoring, which other RDBMS flavours call common table expressions (CTEs). The attraction here is that we can pass the output from one CTE as input to another.
The first CTE generates a list of dates in a given month (you can extend this over any range you like).
The second CTE uses an anti-join on the first to filter out dates which are holidays, and also dates which aren't weekdays. Note that the day number varies according to the NLS_TERRITORY setting; in my realm the weekend is days 6 and 7, but SQL Fiddle is American, so there it is days 1 and 7.
with dates as ( select date '2014-04-01' + ( level - 1) as d
from dual
connect by level <= 30 )
, bdays as ( select d
, count(d) over () tot_d
from dates
left join holidays
on dates.d = holidays.hol_date
where holidays.hol_date is null
and to_number(to_char(dates.d, 'D')) between 2 and 6
)
select yt.account
, yt.txn_date
, sum(yt.amount) over (partition by yt.account, trunc(yt.txn_date,'MM'))
/tot_d as avg_amt
from your_table yt
join bdays
on bdays.d = yt.txn_date
order by yt.account
, yt.txn_date
/
I haven't rounded the average amount.
You have 40 months of data, and this data should be very stable.
I will assume that you have a cold body (a big, stable, easily definable range of data) and a hot tail (a small, active part).
Next, I would like to define a minimal period. It is the smallest date range that is interesting to the business.
It might be a year, month, day, hour, etc. Do you expect to get questions like "what was the average for that account between 19:00 and 12am yesterday?"
I will assume that the answer is DAY.
Then:
I will calculate sum(amount) and count(*) for every account for every DAY of the cold body.
I will not create dummy records if a particular account had no activity on some day.
And I will save the day, account, total amount, and count in a TABLE.
If there are later modifications to the cold body, you delete and reload the affected days in that table.
For the hot tail there are several possible strategies:
1. Do the same as above (same process, clear to support).
2. Always calculate on the fly.
3. Use a materialized view as a compromise between 1 and 2.
The cold-body table totalc could also be implemented as a materialized view, but if the data never changes there is no need to rebuild it.
With this you go from (number of accounts) x (number of transactions per day) x (number of days) records down to (number of accounts) x (number of active days) records.
That should speed up all subsequent calculations.
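A minimal sketch of that cold-body aggregate, assuming Oracle; daily_account_totals and the cut-off date are invented for illustration, while mytable and its columns come from the question:

-- Sketch: pre-aggregate the stable history once, per account per day.
create table daily_account_totals as
select trunc(date)  as d
     , account
     , sum(amount)  as total_amount
     , count(*)     as tx_count
from mytable
where date < date '2014-01-01'  -- hypothetical cold/hot boundary
group by trunc(date), account;

Monthly averages can then read this small summary table (plus the hot tail) instead of the raw transactions, e.g. joined against the business-day counts from the answer above.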

sql DB calculation moving summary

I would like to calculate a moving summary:
Total amount: 100
First receipt: 20
Second receipt: 10
The first row in the calculation column is the difference between the total amount and the first receipt: 100-20=80.
The second row in the calculation column is the difference between the first calculated row and the second receipt: 80-10=70.
The output is supposed to show receipt_amount and balance:
receipt_amount | balance
20             | 80
10             | 70
I'll be glad to use your help
Thanks :-)
You didn't really give us much information about your tables and how they are structured.
I'm assuming that there is an orders table that contains the total_amount and a receipt_table that contains each receipt (as a positive value):
As you also didn't specify your DBMS, this is ANSI SQL:
select amount,
       sum(amount) over (order by receipt_nr) as running_sum
from (
  -- the order total comes first (receipt_nr 0); receipts follow in order
  select 0 as receipt_nr, total_amount as amount
  from orders
  where order_no = 1
  union all
  select receipt_nr, -1 * receipt_amount
  from the_receipt_table
  where order_no = 1
) t;
First of all, thanks for your response.
I work with Cache DB, which accepts both SQL and Oracle syntax.
Basically, the data is located in two different tables, but I have them in one join query.
There are a couple of rows with different receipt amounts, and each row (receipt) carries the same total amount.
For example:
Receipt_no  Receipt_amount  Total_amount  Balance
1           20              100           80
1           10              100           70
1           30              100           40
2           20              50            30
2           10              50            20
So the calculation is supposed to work in such a way that the first receipt is subtracted from the total_amount, and every following receipt (within the same receipt_no) is subtracted from the running balance.
Thanks!
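Given that layout, here is a window-function sketch of the requested balance. It assumes the joined result is available as receipts, and that some column (here receipt_date) defines the order of receipts within a receipt_no; both of those names are stand-ins:

-- Sketch: running balance per receipt_no.
-- The balance after each receipt is the shared total_amount minus the
-- running sum of receipt amounts up to and including the current row.
select receipt_no,
       receipt_amount,
       total_amount,
       total_amount
         - sum(receipt_amount) over (partition by receipt_no
                                     order by receipt_date
                                     rows unbounded preceding) as balance
from receipts;

Against the sample data this yields 80, 70, 40 for receipt_no 1 and 30, 20 for receipt_no 2.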