Split monthly fix value to days and countries in Redshift - sql

DB-Fiddle
CREATE TABLE sales (
id SERIAL PRIMARY KEY,
country VARCHAR(255),
sales_date DATE,
sales_volume DECIMAL,
fix_costs DECIMAL
);
INSERT INTO sales
(country, sales_date, sales_volume, fix_costs
)
VALUES
('DE', '2020-01-03', '500', '2000'),
('NL', '2020-01-03', '320', '2000'),
('FR', '2020-01-03', '350', '2000'),
('None', '2020-01-31', '0', '2000'),
('DE', '2020-02-15', '0', '5000'),
('NL', '2020-02-15', '0', '5000'),
('FR', '2020-02-15', '0', '5000'),
('None', '2020-02-29', '0', '5000'),
('DE', '2020-03-27', '180', '4000'),
('NL', '2020-03-27', '670', '4000'),
('FR', '2020-03-27', '970', '4000'),
('None', '2020-03-31', '0', '4000');
Expected Result:
sales_date | country | sales_volume | used_fix_costs
-------------|--------------|------------------|------------------------------------------
2020-01-03 | DE | 500 | 37.95 (= 2000/31 = 64.5 x 0.59)
2020-01-03 | FR | 350 | 26.57 (= 2000/31 = 64.5 x 0.41)
2020-01-03 | NL | 320 | 0.00
-------------|--------------|------------------|------------------------------------------
2020-02-15 | DE | 0 | 86.21 (= 5000/29 = 172.4 x 0.50)
2020-02-15 | FR | 0 | 86.21 (= 5000/29 = 172.4 x 0.50)
2020-02-15 | NL | 0 | 0.00
-------------|--------------|------------------|------------------------------------------
2020-03-27 | DE | 180 | 20.20 (= 4000/31 = 129.0 x 0.16)
2020-03-27 | FR | 970 | 108.84 (= 4000/31 = 129.0 x 0.84)
2020-03-27 | NL | 670 | 0.00
-------------|--------------|------------------|-------------------------------------------
The column used_fix_costs in the expected result is calculated as follows:
Step 1) Exclude country NL from the next steps, but it should still appear in the results with value 0.
Step 2) Get the daily rate of the fix_costs per month (2000/31 = 64.5; 5000/29 = 172.4; 4000/31 = 129.0).
Step 3) Split the daily value between the countries DE and FR based on their share of the sales_volume (500/850 = 0.59; 350/850 = 0.41; 180/1150 = 0.16; 970/1150 = 0.84).
Step 4) In case the sales_volume is 0, the daily rate is split 50/50 between DE and FR, as you can see for 2020-02-15.
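As a sanity check, the four steps can be replayed in plain Python against the sample data; the rounded results match the expected table above (the data layout here is just an illustration, not part of the original schema):

```python
from calendar import monthrange

# (year, month, monthly fix_costs, per-country sales_volume on the sales day)
days = [
    (2020, 1, 2000, {"DE": 500, "FR": 350, "NL": 320}),
    (2020, 2, 5000, {"DE": 0,   "FR": 0,   "NL": 0}),
    (2020, 3, 4000, {"DE": 180, "FR": 970, "NL": 670}),
]

used_fix_costs = {}
for year, month, fix, volumes in days:
    daily = fix / monthrange(year, month)[1]                    # step 2: daily rate
    eligible = {c: v for c, v in volumes.items() if c != "NL"}  # step 1: drop NL
    total = sum(eligible.values())
    for country, vol in volumes.items():
        if country == "NL":
            cost = 0.0                                          # NL stays at 0
        elif total > 0:
            cost = daily * vol / total                          # step 3: volume share
        else:
            cost = daily / len(eligible)                        # step 4: 50/50 split
        used_fix_costs[(month, country)] = round(cost, 2)

print(used_fix_costs)
```

Note that `monthrange(2020, 2)[1]` returns 29 (leap year), which is where the 172.4 daily rate for February comes from.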
I am currently using this query to get the expected results:
SELECT
s.sales_date,
s.country,
s.sales_volume,
s.fix_costs,
(CASE WHEN country = 'NL' THEN 0
/* Exclude NL from fixed_costs calculation */
WHEN SUM(CASE WHEN country <> 'NL' THEN sales_volume ELSE 0 END) OVER (PARTITION BY sales_date) > 0
THEN ((s.fix_costs/ extract(day FROM (date_trunc('month', sales_date + INTERVAL '1 month') - INTERVAL '1 day'))) *
sales_volume /
NULLIF(SUM(s.sales_volume) FILTER (WHERE s.country != 'NL') OVER (PARTITION BY s.sales_date), 0)
)
/* Divide fixed_costs equally among countries in case of no sale */
ELSE (s.fix_costs / extract(day FROM (date_trunc('month', sales_date + INTERVAL '1 month') - INTERVAL '1 day')))
/ SUM(CASE WHEN country <> 'NL' THEN 1 ELSE 0 END) OVER (PARTITION by sales_date)
END) AS imputed_fix_costs
FROM sales s
WHERE country NOT IN ('None')
GROUP BY 1,2,3,4
ORDER BY 1;
This query works in the DB-Fiddle.
However, when I run it on Amazon Redshift I get this error message for the line
FILTER (WHERE s.country != 'NL').
Do you have any idea how I can replace/adjust this part of the query to also make it work in Amazon Redshift?

If I understand correctly, you want to define apportioned fixed costs per day for all countries other than NL:
select s.*,
(case when country = 'NL' then 0
when sum(sales_volume) over (partition by sales_date) = 0
then (fix_costs / datepart(day, last_day(sales_date))) * 1.0 / sum(case when country <> 'NL' then 1 else 0 end) over (partition by sales_date)
else (fix_costs / datepart(day, last_day(sales_date))) * (sales_volume / sum(case when country <> 'NL' then sales_volume end) over (partition by sales_date))
end) as apportioned_fix_costs
from sales s
where country <> 'None';
Note: You don't seem to want None in your results, so that is just filtered out. The rest of the data all seems to be on one date in the month. If it can actually be on multiple dates, use date_trunc() in the partition by clause.
For reference, Postgres doesn't support last_day(). You can use the expression:
select extract(day from date_trunc('month', sales_date) + interval '1 month' - interval '1 day')
DB-Fiddle

Related

Use SQL to get monthly churn count and churn rate

Currently using Postgres 9.5
I want to calculate monthly churn_count and churn_rate of the search function.
churn_count: number of users who used the search function last month but not this month
churn_rate: churn_count/total_users_last_month
My dummy data is:
CREATE TABLE yammer_events (
occurred_at TIMESTAMP,
user_id INT,
event_name VARCHAR(50)
);
INSERT INTO yammer_events (occurred_at, user_id, event_name) VALUES
('2014-06-01 00:00:01', 1, 'search_autocomplete'),
('2014-06-01 00:00:01', 2, 'search_autocomplete'),
('2014-07-01 00:00:01', 1, 'search_run'),
('2014-07-01 00:00:02', 1, 'search_run'),
('2014-07-01 00:00:01', 2, 'search_run'),
('2014-07-01 00:00:01', 3, 'search_run'),
('2014-08-01 00:00:01', 1, 'search_run'),
('2014-08-01 00:00:01', 4, 'search_run');
Ideal output should be:
|month |churn_count|churn_rate_percentage|
|--- |--- |--- |
|2014-07-01|0 |0
|2014-08-01|2 |66.6 |
In June: user 1, 2 (2 users)
In July: user 1, 2, 3 (3 users)
In August: user 1, 4 (2 users)
In July, we didn't lose any customers. In August, we lost customers 2 and 3, so the churn_count is 2 and the rate is 2/3*100 = 66.6.
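Restated in set terms, the churn for each month is simply the set difference between the previous month's users and the current month's users; a quick Python sketch of that logic:

```python
# Users who used the search function, per month (from the sample data)
monthly_users = {
    "2014-06": {1, 2},
    "2014-07": {1, 2, 3},
    "2014-08": {1, 4},
}

months = sorted(monthly_users)
churn = {}
for prev, cur in zip(months, months[1:]):
    # churned = used search last month but not this month
    lost = monthly_users[prev] - monthly_users[cur]
    rate = len(lost) / len(monthly_users[prev]) * 100
    churn[cur] = (len(lost), round(rate, 1))

print(churn)
```

This is the same set difference that the intarray-based answer expresses with `LAG(users) OVER w - users`.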
I tried the following query to calculate churn_count, but the result is really weird.
WITH monthly_activity AS (
SELECT distinct DATE_TRUNC('month', occurred_at) AS month,
user_id
FROM yammer_events
WHERE event_name LIKE 'search%'
)
SELECT last_month.month+INTERVAL '1 month', COUNT(DISTINCT last_month.user_id)
FROM monthly_activity last_month
LEFT JOIN monthly_activity this_month
ON last_month.user_id = this_month.user_id
AND this_month.month = last_month.month + INTERVAL '1 month'
AND this_month.user_id IS NULL
GROUP BY 1
db<>fiddle
Thank you in advance!
An easy way to do it would be to aggregate the users in an array, and from there extract and count the intersection between the current month and the previous one using the window function LAG(), e.g.
WITH j AS (
SELECT date_trunc('month',occurred_at::date) AS month,
array_agg(distinct user_id) AS users,
count(distinct user_id) AS total_users
FROM yammer_events
GROUP BY 1
ORDER BY 1
)
SELECT month::date,
cardinality(LAG(users) OVER w - users) AS churn_count,
(cardinality(LAG(users) OVER w - users)::numeric /
LAG(total_users) OVER w::numeric) * 100 AS churn_rate_percentage
FROM j
WINDOW w AS (ORDER BY month
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW);
month | churn_count | churn_rate_percentage
------------+-------------+-------------------------
2014-06-01 | |
2014-07-01 | 0 | 0.00000000000000000000
2014-08-01 | 2 | 66.66666666666666666700
(3 rows)
Note: this query relies on the extension intarray. In case you don't have it in your system, just hit:
CREATE EXTENSION intarray;
Demo: db<>fiddle
WITH monthly_activity AS (
SELECT distinct DATE_TRUNC('month', occurred_at) AS month,
user_id
FROM yammer_events
WHERE event_name LIKE 'search%'
)
SELECT
last_month.month+INTERVAL '1 month',
SUM(CASE WHEN this_month.month IS NULL THEN 1 ELSE 0 END) AS churn_count,
SUM(CASE WHEN this_month.month IS NULL THEN 1 ELSE 0 END)*1.00/COUNT(DISTINCT last_month.user_id)*100 AS churn_rate_percentage
FROM monthly_activity last_month
LEFT JOIN monthly_activity this_month
ON last_month.month + INTERVAL '1 month' = this_month.month
AND last_month.user_id = this_month.user_id
GROUP BY 1
ORDER BY 1
LIMIT 2
I think my way is more circuitous but easier for beginners to understand. Just for your reference.

Round timestamp to 30 minutes interval to group by

Problem
select currency,
MAX (CASE WHEN type = 'Bank A' THEN rate ELSE null END) as bank_a_rate,
MAX (CASE WHEN type = 'Bank B' THEN rate ELSE null END) as bank_b_rate
from rates
group by currency, created
I want to group my data by currency and timestamp and show all the type values side by side, like a comparison table, at 30-minute intervals. Right now my created times differ by a minute or less, so grouping by created still returns 4 rows because of the differing timestamps. Is there a way to round the timestamp?
Data Source
| Type   | Currency | Rate | Created             |
|--------|----------|------|---------------------|
| Bank A | USD      | 3.4  | 2020-01-01 12:29:15 |
| Bank B | USD      | 3.34 | 2020-01-01 12:30:11 |
| Bank A | EUR      | 4.92 | 2020-01-01 12:31:01 |
| Bank B | EUR      | 5.03 | 2020-01-01 12:31:14 |
Expected Result
| Timestamp           | Currency | Bank A Rate | Bank B Rate |
|---------------------|----------|-------------|-------------|
| 2020-01-01 12:30:00 | USD      | 3.4         | 3.34        |
| 2020-01-01 12:30:00 | EUR      | 4.92        | 5.03        |
Truncate/round created to 30 minutes (the ts expression) and group by it. Your query with this amendment:
select date_trunc('hour', created) +
interval '1 minute' * (extract(minute from created)::integer/30)*30 AS ts,
currency,
MAX (CASE WHEN "type" = 'Bank A' THEN rate ELSE null END) as bank_a_rate,
MAX (CASE WHEN "type" = 'Bank B' THEN rate ELSE null END) as bank_b_rate
from rates
group by currency, ts;
SQL Fiddle
'Inherit' previous rate
select ts, currency,
coalesce(bank_a_rate, lag(bank_a_rate) over w) bank_a_rate,
coalesce(bank_b_rate, lag(bank_b_rate) over w) bank_b_rate
from
(
select date_trunc('hour', created) +
interval '1 minute' * (extract(minute from created)::integer/30)*30 ts,
currency,
MAX (CASE WHEN "type" = 'Bank A' THEN rate ELSE null END) as bank_a_rate,
MAX (CASE WHEN "type" = 'Bank B' THEN rate ELSE null END) as bank_b_rate
from rates
group by currency, ts
) t
window w as (partition by currency order by ts);
SQL Fiddle
One method uses epoch, converting the seconds to 30-minute (1800-second) intervals with arithmetic:
select '1970-01-01'::timestamp + floor(extract(epoch from created) / (60 * 30)) * (60 * 30) * interval '1 second' as timestamp,
currency,
MAX (CASE WHEN type = 'Bank A' THEN rate ELSE null END) as bank_a_rate,
MAX (CASE WHEN type = 'Bank B' THEN rate ELSE null END) as bank_b_rate
from rates
group by timestamp, currency;
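One subtlety worth noting: both answers truncate downward to the half-hour boundary, while the expected table in the question (12:29:15 landing in the 12:30 bucket) implies rounding to the nearest half-hour. The two variants, sketched in Python for clarity:

```python
from datetime import datetime, timedelta

HALF_HOUR = timedelta(minutes=30)

def floor_30min(ts: datetime) -> datetime:
    """Truncate down to the half-hour, like the date_trunc arithmetic above."""
    return ts.replace(minute=ts.minute - ts.minute % 30, second=0, microsecond=0)

def nearest_30min(ts: datetime) -> datetime:
    """Round to the nearest half-hour, matching the question's expected buckets."""
    floored = floor_30min(ts)
    return floored + HALF_HOUR if ts - floored >= HALF_HOUR / 2 else floored

print(floor_30min(datetime(2020, 1, 1, 12, 29, 15)))    # 12:00:00
print(nearest_30min(datetime(2020, 1, 1, 12, 29, 15)))  # 12:30:00
print(nearest_30min(datetime(2020, 1, 1, 12, 31, 14)))  # 12:30:00
```

In SQL, rounding to the nearest bucket amounts to adding 15 minutes to created before truncating.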

Postgres Bank Account Transaction Balance

Here's an example "transactions" table where each row is a record of an amount and the date of the transaction.
+--------+------------+
| amount | date |
+--------+------------+
| 1000 | 2020-01-06 |
| -10 | 2020-01-14 |
| -75 | 2020-01-20 |
| -5 | 2020-01-25 |
| -4 | 2020-01-29 |
| 2000 | 2020-03-10 |
| -75 | 2020-03-12 |
| -20 | 2020-03-15 |
| 40 | 2020-03-15 |
| -50 | 2020-03-17 |
| 200 | 2020-10-10 |
| -200 | 2020-10-10 |
+--------+------------+
The goal is to return one column "balance" with the balance of all transactions.

The only catch is a monthly fee of $5 for each month in which there are not at least THREE payment transactions (represented by a negative value in the amount column) totalling at least $100. In the example, the only month without the $5 fee is March, because there were 3 payments (negative amounts) totalling $145.

So the final balance would be $2,746: the sum of the amounts is $2,801, minus the $55 in monthly fees (11 months x $5).

I'm not a Postgres expert by any means, so if anyone has pointers on how to get started on this problem, or on which parts of the Postgres documentation would help most, that would be much appreciated.
The expected output would be:
+---------+
| balance |
+---------+
| 2746 |
+---------+
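A plain Python rendition of the fee rule makes the expected number easy to verify (data copied from the question; the fee-free condition is at least three payments totalling at least $100 in a month):

```python
from collections import defaultdict

transactions = [
    (1000, "2020-01-06"), (-10, "2020-01-14"), (-75, "2020-01-20"),
    (-5, "2020-01-25"), (-4, "2020-01-29"), (2000, "2020-03-10"),
    (-75, "2020-03-12"), (-20, "2020-03-15"), (40, "2020-03-15"),
    (-50, "2020-03-17"), (200, "2020-10-10"), (-200, "2020-10-10"),
]

# Collect payments (negative amounts) per month, keyed by "YYYY-MM"
payments = defaultdict(list)
for amount, day in transactions:
    if amount < 0:
        payments[day[:7]].append(-amount)

# A month is fee-free with at least 3 payments totalling at least $100
fee_free = {m for m, p in payments.items() if len(p) >= 3 and sum(p) >= 100}

# 12 months in the year, $5 fee for every month that is not fee-free
balance = sum(a for a, _ in transactions) - 5 * (12 - len(fee_free))
print(balance)  # 2746
```

Only 2020-03 qualifies (January has four payments but they total only $94), so 11 fee months remain.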
This is rather complicated. You can calculate the total span of months and then subtract out the months where the fee is waived:
select amount, (extract(year from age) * 12 + extract(month from age)), cnt,
amount - 5 *( extract(year from age) * 12 + extract(month from age) + 1 - cnt) as balance
from (select sum(amount) as amount,
age(max(date), min(date)) as age
from transactions t
) t cross join
(select count(*) as cnt
from (select date_trunc('month', date) as yyyymm, count(*) as cnt, sum(amount) as amount
from transactions t
where amount < 0
group by yyyymm
having count(*) >= 3 and sum(amount) < -100
) tt
) tt;
Here is a db<>fiddle.
This calculates 2756, which appears to follow your rules. If you want the full year, you can just use 12 instead of calculating the span with age().
I would first left join with a generate_series that represents the months you are interested in (in this case, all in the year 2020). That adds the missing months with a balance of 0.
Then I aggregate these values per month and add the negative balance per month and the number of negative balances.
Finally, I calculate the grand total and subtract the fee for each month that does not meet the criteria.
SELECT sum(amount_per_month) -
sum(5) FILTER (WHERE negative_per_month > -100 OR negative_count < 3)
FROM (SELECT sum(amount) AS amount_per_month,
sum(amount) FILTER (WHERE amount < 0) AS negative_per_month,
month_start,
count(*) FILTER (WHERE amount < 0) AS negative_count
FROM (SELECT coalesce(t.amount, 0) AS amount,
coalesce(date_trunc('month', CAST (t.date AS timestamp)), dates.d) AS month_start
FROM generate_series(
TIMESTAMP '2020-01-01',
TIMESTAMP '2020-12-01',
INTERVAL '1 month'
) AS dates (d)
LEFT JOIN transactions AS t
ON dates.d = date_trunc('month', CAST (t.date AS timestamp))
) AS gaps_filled
GROUP BY month_start
) AS sums_per_month;
This would be my solution by simply using cte.
DB fiddle here.
balance
2746
Code:
WITH monthly_credited_transactions
AS (SELECT Date_part('month', date) AS cred_month,
Sum(CASE
WHEN amount < 0 THEN Abs(amount)
ELSE 0
END) AS credited_amount,
Sum(CASE
WHEN amount < 0 THEN 1
ELSE 0
END) AS credited_cnt
FROM transactions
GROUP BY 1),
credit_fee
AS (SELECT ( 12 - Count(1) ) * 5 AS fee,
1 AS id
FROM monthly_credited_transactions
WHERE credited_amount >= 100
AND credited_cnt >= 3),
trans
AS (SELECT Sum(amount) AS amount,
1 AS id
FROM transactions)
SELECT amount - fee AS balance
FROM trans a
LEFT JOIN credit_fee b
ON a.id = b.id
For me the query below worked (adapted from @GordonLinoff's answer):
select CAST(totalamount - 5 *(12 - extract(month from firstt) + 1 - nofeemonths) AS int) as balance
from (select sum(amount) as totalamount, min(date) as firstt
from transactions t
) t cross join
(select count(*) as nofeemonths
from (select date_trunc('month', date) as months, count(*) as nofeemonths, sum(amount) as totalamount
from transactions t
where amount < 0
group by months
having count(*) >= 3 and sum(amount) < -100
) tt
) tt;
Here firstt is the date of the first transaction in that year, and 12 - extract(month from firstt) + 1 - nofeemonths is the number of months for which the credit card fee of 5 will be charged.

Assign total value of month to each day of month

DB-Fiddle
CREATE TABLE sales (
id SERIAL PRIMARY KEY,
country VARCHAR(255),
sales_date DATE,
sales_volume DECIMAL,
fix_costs DECIMAL
);
INSERT INTO sales
(country, sales_date, sales_volume, fix_costs
)
VALUES
('DE', '2020-01-03', '500', '0'),
('FR', '2020-01-03', '350', '0'),
('None', '2020-01-31', '0', '2000'),
('DE', '2020-02-15', '0', '0'),
('FR', '2020-02-15', '0', '0'),
('None', '2020-02-29', '0', '5000'),
('DE', '2020-03-27', '180', '0'),
('FR', '2020-03-27', '970', '0'),
('None', '2020-03-31', '0', '4000');
Expected Result:
sales_date | country | sales_volume | fix_costs
--------------|-------------|-------------------|-----------------
2020-01-03 | DE | 500 | 2000
2020-01-03 | FR | 350 | 2000
2020-02-15 | DE | 0 | 5000
2020-02-15 | FR | 0 | 5000
2020-03-27 | DE | 180 | 4000
2020-03-27 | FR | 970 | 4000
As you can see in my table I have a total of fix_costs assigned to the last day of each month.
In my results I want to assign this total of fix_costs to each day of the month.
Therefore, I tried to go with this query:
SELECT
s.sales_date,
s.country,
s.sales_volume,
f.fix_costs
FROM sales s
JOIN
(SELECT
((date_trunc('MONTH', sales_date) + INTERVAL '1 MONTH - 1 DAY')::date) AS month_ld,
SUM(fix_costs) AS fix_costs
FROM sales
WHERE country = 'None'
GROUP BY month_ld) f ON f.month_ld = LAST_DAY(s.sales_date)
WHERE country <> 'None'
GROUP BY 1,2,3;
For this query I get an error on LAST_DAY(s.sales_date), since this function does not exist in PostgreSQL.
However, I have no clue how I can replace it correctly in order to get the expected result.
Can you help me?
(MariaDB Fiddle as comparison)
Demos: db<>fiddle
SELECT
s1.sales_date,
s1.country,
s1.sales_volume,
s2.fix_costs
FROM sales s1
JOIN sales s2 ON s1.country <> 'None' AND s2.country = 'None'
AND date_trunc('month', s1.sales_date) = date_trunc('month', s2.sales_date)
You need a self-join. The join conditions are:
First table without None records (s1.country <> 'None')
Second table only None records (s2.country = 'None')
Date: Only consider the year and month part, ignoring days. This can be achieved by normalizing the dates of both tables to the first of the month using date_trunc(). So, e.g., '2020-02-15' results in '2020-02-01' and '2020-02-29' results in '2020-02-01' too, which works well as a comparison and join condition.
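The same lookup logic, restated in plain Python to make the month-normalization step concrete (the row tuples here are just an illustration of the sample data):

```python
from datetime import date

# (country, sales_date, sales_volume, fix_costs) from the question
rows = [
    ("DE", date(2020, 1, 3), 500, 0), ("FR", date(2020, 1, 3), 350, 0),
    ("None", date(2020, 1, 31), 0, 2000),
    ("DE", date(2020, 2, 15), 0, 0), ("FR", date(2020, 2, 15), 0, 0),
    ("None", date(2020, 2, 29), 0, 5000),
    ("DE", date(2020, 3, 27), 180, 0), ("FR", date(2020, 3, 27), 970, 0),
    ("None", date(2020, 3, 31), 0, 4000),
]

# date_trunc('month', ...) amounts to keying on (year, month)
month_costs = {(d.year, d.month): f for c, d, v, f in rows if c == "None"}

# Attach each month's total to every non-None row of that month
result = [(d, c, v, month_costs[(d.year, d.month)])
          for c, d, v, f in rows if c != "None"]
print(result)
```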
Alternatively:
SELECT
*
FROM (
SELECT
sales_date,
country,
sales_volume,
SUM(fix_costs) OVER (PARTITION BY date_trunc('month', sales_date)) as fix_costs
FROM sales
) s
WHERE country <> 'None'
You can use the SUM() window function over the date_trunc() group as described above. Then you need to filter out the None records afterwards.
If I understand correctly, use window functions:
select s.*,
sum(fix_costs) over (partition by date_trunc('month', sales_date)) as month_fixed_costs
from sales s;
Note that this assumes that fixed costs are NULL or 0 on other days -- which is true for the data in the question.

Sum results on constant timeframe range on each date in table

I'm using a Postgres DB.
I have a table that contains test names, their results and reported time:
|test_name|result |report_time|
| A |error |29/11/2020 |
| A |failure|28/12/2020 |
| A |error |29/12/2020 |
| B |passed |30/12/2020 |
| C |failure|31/12/2020 |
| A |error |31/12/2020 |
I'd like to sum how many tests have failed or errored in the last 30 days, per date (and limit it to be 5 days back from the current date), so the final result will be:
| date | sum | (notes)
| 29/11/2020 | 1 | 1 failed/errored test in range (29/11 -> 29/10)
| 28/12/2020 | 2 | 2 failed/errored tests in range (28/12 -> 28/11)
| 29/12/2020 | 3 | 3 failed/errored tests in range (29/12 -> 29/11)
| 30/12/2020 | 2 | 2 failed/errored tests in range (30/12 -> 30/11)
| 31/12/2020 | 4 | 4 failed/errored tests in range (31/12 -> 01/12)
I know how to sum the results per date (i.e, how many failures/errors were on a specific date):
SELECT report_time::date AS "Report Time",
count(case when result in ('failure', 'error') then 1 else null end)
FROM tests
WHERE report_time::date = now()::date
GROUP BY report_time::date
But I'm struggling to sum each date 30 days back.
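The desired rolling window, restated in plain Python over the sample rows; an inclusive 30-days-back cutoff reproduces the expected counts:

```python
from datetime import date, timedelta

# (report_time, result) rows from the question
events = [
    (date(2020, 11, 29), "error"),
    (date(2020, 12, 28), "failure"),
    (date(2020, 12, 29), "error"),
    (date(2020, 12, 30), "passed"),
    (date(2020, 12, 31), "failure"),
    (date(2020, 12, 31), "error"),
]

bad_days = [d for d, result in events if result in ("failure", "error")]

rolling = {}
for day in sorted({d for d, _ in events}):
    # count failed/errored tests in the inclusive window [day - 30 days, day]
    rolling[day] = sum(1 for d in bad_days if day - timedelta(days=30) <= d <= day)

print(rolling)
```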
You can generate the dates and then use window functions:
select gs.dte, num_failed_error, num_failed_error_30
from generate_series(current_date - interval '5 day', current_date, interval '1 day') gs(dte) left join
(select t.report_time, count(*) as num_failed_error,
sum(count(*)) over (order by report_time range between interval '30 day' preceding and current row) as num_failed_error_30
from tests t
where t.result in ('failure', 'error') and
t.report_time >= current_date - interval '35 day'
group by t.report_time
) t
on t.report_time = gs.dte;
Note: This assumes that report_time is only the date with no time component. If it has a time component, use report_time::date.
If you have data on each day, then this can be simplified to:
select t.report_time, count(*) as num_failed_error,
sum(count(*)) over (order by report_time range between interval '30 day' preceding and current row) as num_failed_error_30
from tests t
where t.result in ('failure', 'error') and
t.report_time >= current_date - interval '35 day'
group by t.report_time
order by report_time desc
limit 5;
Since I'm using PostgreSQL 10.12 and upgrading is currently not an option, I took a different approach: I generate the dates of the last 30 days and, for each date, calculate the distinct count of failing tests over the preceding 30 days:
SELECT days_range::date, SUM(number_of_tests)
FROM generate_series (now() - interval '30 day', now()::timestamp, '1 day'::interval) days_range
CROSS JOIN LATERAL (
SELECT COUNT(DISTINCT test_name) AS number_of_tests FROM tests
WHERE report_time > days_range - interval '30 day'
AND report_time <= days_range
GROUP BY report_time::date
HAVING COUNT(case when result in ('failure', 'error') then 1 else null end) > 0
) AS lateral_query
GROUP BY days_range
ORDER BY days_range desc
It is definitely not the best-optimized query; it takes about a minute to compute.