I have a data set that contains two columns [date, cust_id].
date cust_id
2019-12-08 123
2019-12-08 321
2019-12-09 123
2019-12-09 456
There is a high churn rate for my customers and I am trying to create two additional columns [new_cust, left_cust] by counting the numbers of cust_id that are new and have left by day respectively.
If I had two tables broken out by day, I would have no issues querying:
count of new customers
SELECT DISTINCT cust_id
FROM 2019-12-09
WHERE cust_id NOT IN (SELECT DISTINCT cust_id FROM 2019-12-08)
count of customers who churned
SELECT DISTINCT cust_id
FROM 2019-12-08
WHERE cust_id NOT IN (SELECT DISTINCT cust_id FROM 2019-12-09)
I'm not sure how I would query a single table and compare these values by date. What would be the best approach to getting the correct results? I am using AWS Athena.
Expected results:
date new_cust cust_left
2019-12-08 2 0
2019-12-09 1 1
Explanation: Assuming 2019-12-08 is the very first date, I have 2 new customers and 0 customers who have churned. 2019-12-09, I have gained 1 new customer "456", but have 1 customer "321" who has churned. I would have to apply this to a longer range of dates and cust_id.
Hmmm. I think you want:
select date,
sum(case when prev_date is null then 1 else 0 end) as new_cust,
sum(case when next_date = date + interval '1' day then 0 else 1 end) as left_cust
from (select t.*,
lag(date) over (partition by cust_id order by date) as prev_date,
lead(date) over (partition by cust_id order by date) as next_date
from t
) t
group by date;
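To see how the lag/lead idea behaves on the four sample rows, here is a minimal sketch using Python's built-in sqlite3 module rather than Athena (that swap is an assumption made only so the example can run anywhere; SQLite writes the one-day step as date(date, '+1 day')). Note that the query above counts a customer as gone on the last day they were seen, so this sketch adds a LAG over the daily totals to shift that count onto the following day, which matches the expected table. The shift assumes every calendar day has at least one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date TEXT, cust_id INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2019-12-08", 123), ("2019-12-08", 321),
     ("2019-12-09", 123), ("2019-12-09", 456)],
)

query = """
WITH flagged AS (
    -- previous and next visit date for each customer
    SELECT date, cust_id,
           LAG(date)  OVER (PARTITION BY cust_id ORDER BY date) AS prev_date,
           LEAD(date) OVER (PARTITION BY cust_id ORDER BY date) AS next_date
    FROM t
),
daily AS (
    SELECT date,
           -- no earlier row at all: first time we see this customer
           SUM(CASE WHEN prev_date IS NULL THEN 1 ELSE 0 END) AS new_cust,
           -- not present on the following day: last seen here
           SUM(CASE WHEN next_date = date(date, '+1 day') THEN 0 ELSE 1 END) AS last_seen
    FROM flagged
    GROUP BY date
)
-- report the churn on the day after the customer was last seen
SELECT date, new_cust,
       LAG(last_seen, 1, 0) OVER (ORDER BY date) AS left_cust
FROM daily
ORDER BY date
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Without the final LAG, the same data gives left_cust = 1 for 2019-12-08 and 2 for 2019-12-09, because every customer on the most recent date has no following row yet.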
I am new to Postgres and I want to set a value to Y if an order in the order table is a first month order, that is, if it also appears in the first month order table.
The first month order table is shown below. It only contains the first order a user placed in the month:
customer_id | order_date | order_id
--------------------------------------------------
a1 | December 6, 2015, 8:30 PM | orderA1
The order table is shown below. It contains all the order records:
customer_id | order_date | order_id
-----------------------------------------------------
a1 | December 6, 2020, 8:30 PM | orderA1
a1 | December 7, 2020, 8:30 PM | orderA2
a2 | December 11, 2020, 8:30 PM | orderA3
To get the first month order column in the order table, I tried using case as below, but it fails with the error "more than one row returned by a subquery".
SELECT DISTINCT ON (order_id) order_id, customer_id,
(CASE when (select distinct order_id from first_month_order_table) = order_id then 'Y' else 'N'
END)
FROM order_table
ORDER BY order_id;
I also tried using count, but I understand that this is quite inefficient and, I think, overworks the database.
SELECT DISTINCT ON (order_id) order_id, customer_id,
(CASE when (select count order_id from first_month_order_table) then 'Y' else 'N'
END)
FROM order_table
ORDER BY order_id;
How can I determine if the order is first month order and set the value as Y for every order in the order table efficiently?
Use the left join as follows:
SELECT o.order_id, o.customer_id,
CASE when f.order_id is not null then 'Y' else 'N' END as flag
FROM order_table o left join first_month_order_table f
on f.order_id = o.order_id
ORDER BY o.order_id;
If you have all orders in the orders table, you don't need the second table. Just use window functions. The following returns a boolean, which I find much more convenient than a character flag:
select o.*,
(row_number() over (partition by customer_id, date_trunc('month', order_date) order by order_date) = 1) as flag
from orders o;
If you want a character flag, then you need case:
select o.*,
(case when row_number() over (partition by customer_id, date_trunc('month', order_date) order by order_date) = 1
then 'Y' else 'N'
end) as flag
from orders o;
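As a quick check of the window-function version, here is a small sketch run through Python's sqlite3 module instead of Postgres (an assumption for portability). SQLite has no date_trunc, so strftime('%Y-%m', order_date) stands in for the month bucket, and the dates are stored as ISO strings rather than the display format used in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id TEXT, order_date TEXT, order_id TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a1", "2020-12-06 20:30:00", "orderA1"),
     ("a1", "2020-12-07 20:30:00", "orderA2"),
     ("a2", "2020-12-11 20:30:00", "orderA3")],
)

query = """
SELECT order_id, customer_id,
       -- the earliest order per customer and calendar month gets 'Y'
       CASE WHEN ROW_NUMBER() OVER (
                    PARTITION BY customer_id, strftime('%Y-%m', order_date)
                    ORDER BY order_date) = 1
            THEN 'Y' ELSE 'N' END AS flag
FROM orders
ORDER BY order_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only orderA2 gets N here, since it is the second order by a1 in December 2020.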
Account balance collection, that shows the account balance of a customer at a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice, that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
Applied to the sample data from your question, it looks like the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
result is
Row customer_id value timestamp
1 1 -500 2019-10-12
2 2 200 2019-10-10
3 1 600 2019-09-02
4 2 200 2019-09-10
5 1 null null
6 2 100 2019-08-11
7 1 null null
8 2 50 2019-07-12
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
I did not order results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance of 'future months'. To make this a bit prettier, consider changing '2019-12-01' to 'current_date()' and your query will always return to the end of the current month.
Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles to use timestamp logic if your underlying fields are actual timestamps.
In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10', that seems arbitrary as customer 2 has no 2nd balance record.
I purposefully split the logic into several CTEs so I could comment on each step more easily. You could definitely perform several steps in the same code block for a more condensed query.
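The month-end approach can be tried outside BigQuery as well. Below is a rough sketch with Python's sqlite3 module, under a few assumptions: the table is called balances, the timestamp column is renamed ts, and the month range (September to October 2019) is hard-coded in a recursive CTE because SQLite has no generate_date_array. It uses the six rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (customer_id INTEGER, value INTEGER, ts TEXT)")
conn.executemany(
    "INSERT INTO balances VALUES (?, ?, ?)",
    [(1, -500, "2019-10-12"), (1, -300, "2019-10-11"), (1, -200, "2019-10-10"),
     (1, 0, "2019-10-09"), (2, 200, "2019-09-10"), (1, 600, "2019-09-02")],
)

query = """
WITH RECURSIVE month_begins(dt) AS (
    -- first day of each month in the (assumed) reporting range
    SELECT '2019-09-01'
    UNION ALL
    SELECT date(dt, '+1 month') FROM month_begins WHERE dt < '2019-10-01'
),
month_ends AS (
    SELECT date(dt, '+1 month', '-1 day') AS month_end_date FROM month_begins
),
ordered AS (
    -- every balance on or before each month end, newest first
    SELECT b.customer_id, b.value, b.ts, m.month_end_date,
           ROW_NUMBER() OVER (PARTITION BY b.customer_id, m.month_end_date
                              ORDER BY b.ts DESC) AS rn
    FROM balances b
    JOIN month_ends m ON b.ts <= m.month_end_date
)
SELECT customer_id, value, ts, month_end_date
FROM ordered
WHERE rn = 1
ORDER BY customer_id, month_end_date DESC
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Customer 2 gets a row for the October month end that carries the September balance forward, while ts still shows when that balance was actually recorded.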
I thought I had this solved, but apparently not. I am working with some trading data and need to average the stock price over trading days only. I used the query below for a 3-day average, but recently found out there can be dividends on a trading holiday, so for those days the fact table has a row for the stockcode and closeprice is either zero or null.
Please help me improve my query so that the average over the 3 preceding trading days ignores zeros and nulls.
select StockCode, datekey, ClosePrice,
AVG(ClosePrice) OVER (partition by StockCode order by datekey
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) Avg3Days
from Fact
You can partition by StockCode AND sign(NullIf([ClosePrice],0)) rather than having to know the trading days.
Example
Create Table #YourTable ([datekey] date,[StockCode] varchar(50),[ClosePrice] money)
Insert Into #YourTable Values
('2019-06-15','xyx',5)
,('2019-06-16','xyx',10)
,('2019-06-17','xyx',NULL)
,('2019-06-18','xyx',0)
,('2019-06-19','xyx',15)
,('2019-06-20','xyx',20)
Select *
,AvgPrice = AVG(ClosePrice) over (partition by StockCode,sign(NullIf([ClosePrice],0)) order By datekey rows between 3 preceding and 1 preceding )
from #YourTable
Order By datekey
Returns
datekey StockCode ClosePrice AvgPrice
2019-06-15 xyx 5.00 NULL
2019-06-16 xyx 10.00 5.00
2019-06-17 xyx NULL NULL
2019-06-18 xyx 0.00 NULL
2019-06-19 xyx 15.00 7.50
2019-06-20 xyx 20.00 10.00
Update
A little uglier, but perhaps something like this
Select *
,AvgPrice = case when sum(1) over (partition by StockCode,sign(NullIf([ClosePrice],0)) order By datekey rows between 3 preceding and 1 preceding ) = 3
then avg(ClosePrice) over (partition by StockCode,sign(NullIf([ClosePrice],0)) order By datekey rows between 3 preceding and 1 preceding )
else null end
from #YourTable
Order By datekey
Returns
Assuming you have a flag that indicates trading days, you can do something like this:
SELECT StockCode, datekey, ClosePrice,
(CASE WHEN isTradingDay = 1
THEN AVG(ClosePrice) OVER (PARTITION BY StockCode, isTradingDay
ORDER BY datekey
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
)
END) as Avg3Days
FROM Fact;
This takes the average of the previous three trading days. The value is NULL on non-trading days.
If the ClosePrice is NULL, it will not be included in the average anyway. If the only indicator is the closePrice, then one method is:
SELECT f.StockCode, f.datekey, f.ClosePrice,
(CASE WHEN v.isTradingDay = 1
THEN AVG(f.ClosePrice) OVER (PARTITION BY f.StockCode, v.isTradingDay
ORDER BY f.datekey
ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
)
END) as Avg3Days
FROM Fact f CROSS APPLY
(VALUES (CASE WHEN f.ClosePrice > 0 THEN 1 ELSE 0 END)
) v(isTradingDay);
Personally, I would prefer to have an explicit trading day indicator rather than relying on special values of the close price. For instance, trading on a single stock might be suspended for some company-specific reason.
You may want to also have WHERE f.StockCode <> '' to filter out invalid stock codes.
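For anyone who wants to try the derived trading-day flag without SQL Server, here is a minimal sketch using Python's sqlite3 module. It is an adaptation, not the query above verbatim: CROSS APPLY is replaced by a subquery that computes is_trading (a name assumed here), and the rows are the six sample rows from the earlier answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fact (datekey TEXT, StockCode TEXT, ClosePrice REAL)")
conn.executemany(
    "INSERT INTO Fact VALUES (?, ?, ?)",
    [("2019-06-15", "xyx", 5), ("2019-06-16", "xyx", 10),
     ("2019-06-17", "xyx", None), ("2019-06-18", "xyx", 0),
     ("2019-06-19", "xyx", 15), ("2019-06-20", "xyx", 20)],
)

query = """
SELECT datekey,
       -- average only within the trading-day partition; NULL on other days
       CASE WHEN is_trading = 1
            THEN AVG(ClosePrice) OVER (PARTITION BY StockCode, is_trading
                                       ORDER BY datekey
                                       ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING)
       END AS Avg3Days
FROM (SELECT f.*,
             -- zero or NULL close price is treated as a non-trading day
             CASE WHEN ClosePrice > 0 THEN 1 ELSE 0 END AS is_trading
      FROM Fact f) x
ORDER BY datekey
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The 2019-06-19 row averages 5 and 10, skipping the NULL and zero rows in between, because those rows fall into a different partition.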
Once the customer is registered, between date_registered and the current date: if the customer has made at least one transaction every month, then flag them as active, or else flag them as inactive.
Note: Every customer has different date_registered
I tried the following, but it doesn't work because a few of the customers were onboarded in the middle of the year.
Eg -
-------------------------------------
txn_id | txn_date | name | amount
-------------------------------------
101 2018-05-01 ABC 100
102 2018-05-02 ABC 200
-------------------------------------
select name,
(case when count(distinct case when txn_date >= '2018-05-01' and txn_date < '2019-06-01' then last_day(txn_date) end) = 13
then 'active' else 'inactive'
end) as flag
from t
group by name;
Final output
----------------
name | flag
----------------
ABC active
BCF inactive
You can use filtering on an aggregation query:
select customer,
count(distinct last_day(txn_date)) as num_months
from (select t.*, min(date_registered) over (partition by customer) as min_dr
from t
) t
group by customer, min_dr
having count(distinct last_day(txn_date)) = months_between(last_day(current_date), last_day(min_dr)) + 1;
Note: This may give unexpected results toward the beginning of a month, if customers do not all have transactions on the first day of the month.
EDIT:
If you want a flag, just move the HAVING logic to the SELECT:
select customer,
(case when count(distinct last_day(txn_date)) = months_between(last_day(current_date), last_day(min_dr)) + 1
then 'Active' else 'Inactive'
end) as active_flag
from (select t.*, min(date_registered) over (partition by customer) as min_dr
from t
) t
group by customer, min_dr;
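The same "months with a transaction = months since registration" comparison can be sketched with Python's sqlite3 module. Several things here are assumptions: the rows are made up for illustration, the table carries date_registered on every row, months_between is replaced by year and month arithmetic via strftime, and a fixed :as_of parameter stands in for current_date so the output is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (txn_id INTEGER, txn_date TEXT, name TEXT, "
    "amount INTEGER, date_registered TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [(101, "2018-05-01", "ABC", 100, "2018-05-01"),
     (102, "2018-06-02", "ABC", 200, "2018-05-01"),
     (103, "2018-07-03", "ABC", 50, "2018-05-01"),
     (104, "2018-06-15", "BCF", 75, "2018-06-10")],
)

query = """
SELECT name,
       CASE WHEN COUNT(DISTINCT strftime('%Y-%m', txn_date)) = (
                 -- calendar months from registration month to as-of month, inclusive
                 (CAST(strftime('%Y', :as_of) AS INTEGER)
                  - CAST(strftime('%Y', MIN(date_registered)) AS INTEGER)) * 12
                 + CAST(strftime('%m', :as_of) AS INTEGER)
                 - CAST(strftime('%m', MIN(date_registered)) AS INTEGER)
                 + 1)
            THEN 'active' ELSE 'inactive' END AS flag
FROM t
GROUP BY name
ORDER BY name
"""

rows = conn.execute(query, {"as_of": "2018-07-15"}).fetchall()
for row in rows:
    print(row)
```

ABC has transactions in all three months from May to July 2018, while BCF registered in June and has nothing in July, so only ABC comes out active.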
I want to be able to "book" within range of dates, but you can't book across gaps of days. So booking across multiple rates is fine as long as they are contiguous.
I am happy to change data structure/index, if there are better ways of storing start/end ranges.
So far I have a "rates" table which contains Start/End Periods of time with a daily rate.
e.g. Rates Table.
ID Price From To
1 75.00 2015-04-12 2016-04-15
2 100.00 2016-04-16 2016-04-17
3 50.00 2016-04-18 2016-04-30
For the above data I would want to return:
From To
2015-04-12 2016-04-30
For simplicity's sake, it is safe to assume that dates are consecutive. For contiguous ranges, To is always 1 day before the next row's From.
For the case there is only 1 row, I would want it to return the From/To of that single row.
Also to clarify if I had the following data:
ID Price From To
1 75.00 2015-04-12 2016-04-15
2 100.00 2016-04-17 2016-04-18
3 50.00 2016-04-19 2016-04-30
4 50.00 2016-05-01 2016-05-21
Meaning where there is a gap >= 1 day it would count as a separate range.
In which case I would expect the following:
From To
2015-04-12 2016-04-15
2016-04-17 2016-05-21
Edit 1
After playing around I have come up with the following SQL which seems to work. Although I'm not sure if there are better ways/issues with it?
WITH grouped_rates AS
(SELECT
from_date,
to_date,
SUM(grp_start) OVER (ORDER BY from_date, to_date) grp
FROM (SELECT
from_date,
to_date,
CASE WHEN (from_date - INTERVAL '1 DAY') = lag(to_date)
OVER (ORDER BY from_date, to_date)
THEN 0
ELSE 1
END grp_start
FROM rates
GROUP BY from_date, to_date) AS start_groups)
SELECT
min(from_date) from_date,
max(to_date) to_date
FROM grouped_rates
GROUP BY grp;
This is identifying contiguous overlapping groups in the data. One approach is to find where each group begins and then do a cumulative sum. The following query adds a flag indicating if a row starts a group:
select r.*,
(case when not exists (select 1
from rates r2
where (r2.from_date < r.from_date and r2.to_date >= r.to_date) or
(r2.from_date = r.from_date and r2.id < r.id)
)
then 1 else 0 end) as StartFlag
from rates r;
The or in the correlation condition is to handle the situation where intervals that define a group overlap on the start date for the interval.
You can then do a cumulative sum on this flag and aggregate by that sum:
with r as (
select r.*,
(case when not exists (select 1
from rates r2
where (r2.from_date < r.from_date and r2.to_date >= r.to_date) or
(r2.from_date = r.from_date and r2.id < r.id)
)
then 1 else 0 end) as StartFlag
from rates r
)
select min(from_date), max(to_date)
from (select r.*,
sum(r.StartFlag) over (order by r.from_date) as grp
from r
) r
group by grp;
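The flag-then-cumulative-sum pattern can also be written with LAG, as in the question's own attempt. Here is a small sketch with Python's sqlite3 module, assuming non-overlapping ranges stored as ISO date strings in columns named from_date and to_date; it uses the four-row example from the question, where the second range starts after a one-day gap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rates (id INTEGER, price REAL, from_date TEXT, to_date TEXT)"
)
conn.executemany(
    "INSERT INTO rates VALUES (?, ?, ?, ?)",
    [(1, 75.00, "2015-04-12", "2016-04-15"),
     (2, 100.00, "2016-04-17", "2016-04-18"),
     (3, 50.00, "2016-04-19", "2016-04-30"),
     (4, 50.00, "2016-05-01", "2016-05-21")],
)

query = """
WITH flagged AS (
    -- a row starts a new group unless it begins the day after the previous row ends
    SELECT from_date, to_date,
           CASE WHEN date(from_date, '-1 day') = LAG(to_date) OVER (ORDER BY from_date)
                THEN 0 ELSE 1 END AS grp_start
    FROM rates
),
grouped AS (
    -- the running total of start flags numbers the groups
    SELECT from_date, to_date,
           SUM(grp_start) OVER (ORDER BY from_date) AS grp
    FROM flagged
)
SELECT MIN(from_date) AS from_date, MAX(to_date) AS to_date
FROM grouped
GROUP BY grp
ORDER BY from_date
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first row always starts a group because LAG returns NULL there, which also covers the single-row case.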
CREATE TABLE prices( id INTEGER NOT NULL PRIMARY KEY
, price MONEY
, date_from DATE NOT NULL
, date_upto DATE NOT NULL
);
-- some data (upper limit is EXCLUSIVE)
INSERT INTO prices(id, price, date_from, date_upto) VALUES
( 1, 75.00, '2015-04-12', '2016-04-16' )
,( 2, 100.00, '2016-04-17', '2016-04-19' )
,( 3, 50.00, '2016-04-19', '2016-05-01' )
,( 4, 50.00, '2016-05-01', '2016-05-22' )
;
-- SELECT * FROM prices;
-- Recursive query to "connect the dots"
WITH RECURSIVE rrr AS (
SELECT date_from, date_upto
, 1 AS nperiod
FROM prices p0
WHERE NOT EXISTS (SELECT * FROM prices nx WHERE nx.date_upto = p0.date_from) -- no preceding segment
UNION ALL
SELECT r.date_from, p1.date_upto
, 1+r.nperiod AS nperiod
FROM prices p1
JOIN rrr r ON p1.date_from = r.date_upto
)
SELECT * FROM rrr r
WHERE NOT EXISTS (SELECT * FROM prices nx WHERE nx.date_from = r.date_upto) -- no following segment
;
Result:
date_from | date_upto | nperiod
------------+------------+---------
2015-04-12 | 2016-04-16 | 1
2016-04-17 | 2016-05-22 | 3
(2 rows)