SQL: Calculating a new column in Postgres that refers to its own rows

Let's say I have the following table, where percent_leave is the percentage of people from the previous period who leave in the current period:
| Period | percent_leave |
|--------|---------------|
| 1      | 0.05          |
| 2      | 0.05          |
| 3      | 0.05          |
| 4      | 0.05          |
I want to calculate a new column that will contain the percentage of people left at the end of that period. For example, if we start with 100 people, 5 people leave in the first period, so we are left with 95. 5% of 95 would leave in the second period, which leaves us with 90.25 people, and so forth. Then the table would look like:
| Period | percent_leave | percent_remaining |
|--------|---------------|-------------------|
| 1      | 0.05          | 0.95              |
| 2      | 0.05          | 0.9025            |
| 3      | 0.05          | 0.857375          |
| 4      | 0.05          | 0.81450625        |
As you can see, the calculation of a new row in percent_remaining refers to the previous row in percent_remaining. Normally I would export this raw data into Excel and do the calculation there, but I would like to automate this task in SQL, so I need to figure out how to do it in Postgres.
Any ideas?

You can do this with a cumulative sum ... and some arithmetic:
select t.*,
       exp(sum(ln(1 - percent_leave)) over (order by period)) as percent_remaining
from t;
This is essentially implementing product() as a window function, using the identity exp(sum(ln(x))) = product(x). It assumes that percent_leave is always less than 1 and greater than or equal to 0.
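A minimal, runnable sketch of the above (the table definition and data are my assumptions, built to match the question's example):

-- hypothetical sample table matching the question
create table t (period int, percent_leave numeric);
insert into t values (1, 0.05), (2, 0.05), (3, 0.05), (4, 0.05);

-- running product via exp/ln
select period,
       percent_leave,
       exp(sum(ln(1 - percent_leave)) over (order by period)) as percent_remaining
from t
order by period;
-- expected (up to rounding): 0.95, 0.9025, 0.857375, 0.81450625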

You can also simply use the pow function; this works here because percent_leave is the same in every row:
select period, percent_leave, pow(1 - percent_leave, period) as percent_remaining
from t;
If period values are not consecutive integers starting at 1, use row_number() as the second argument to pow:
select period, percent_leave,
       pow(1 - percent_leave, row_number() over (order by period)) as percent_remaining
from t;
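Note that if percent_leave varies by row, the pow shortcut no longer equals the true running product, while the exp/ln approach above still does. A quick illustration with made-up rates (hypothetical table t2, my own example):

create table t2 (period int, percent_leave numeric);
insert into t2 values (1, 0.05), (2, 0.10), (3, 0.05);

select period,
       exp(sum(ln(1 - percent_leave)) over (order by period)) as running_product,
       pow(1 - percent_leave, period) as pow_shortcut  -- diverges here
from t2;
-- running_product: 0.95, 0.855, 0.81225
-- pow_shortcut:    0.95, 0.81,  0.857375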

Related

BigQuery SQL: Identify records after n days recursively

While implementing a simpler piece of business logic, I asked myself what I would do if the business requirements looked like the following:
Business Requirement
Let's say there is a customer id and a visit timestamp in a table. Now business wants to send a special promotion to the customer after their very first visit, and again at the first visit that comes at least n days (let's set n to 7 for this example) after the visit that qualified the customer for their last promotion.
Example Data
+--------+-------------+---------------------------+
| row_no | customer_id | visit_ts                  |
+--------+-------------+---------------------------+
| 1      | A           | 2020-01-01 07:00:00.00000 |
| 2      | A           | 2020-01-01 09:00:00.00000 |
| 3      | A           | 2020-01-02 17:00:00.00000 |
| 4      | A           | 2020-01-08 20:00:00.00000 |
| 5      | A           | 2020-01-11 11:30:00.00000 |
| 6      | A           | 2020-01-16 08:00:00.00000 |
| 7      | B           | 2020-01-11 10:00:00.00000 |
| 8      | B           | 2020-01-16 10:00:00.00000 |
| 9      | B           | 2020-01-18 11:00:00.00000 |
| 10     | B           | 2020-01-20 11:00:00.00000 |
| 11     | B           | 2020-01-27 09:00:00.00000 |
+--------+-------------+---------------------------+
Desired Result
Customer A: row_no 1,4,6
Customer B: row_no 7,9,11
What I tried so far
It's pretty straightforward to find the previous visit_ts for each record using the LAG function or window framing. In the next step, we can calculate the difference between visit_ts and prev_visit_ts with the help of the TIMESTAMP_DIFF function, which gives us the number of seconds between these two timestamps.
Finally, we can calculate a running sum on this value by using window function once more:
SELECT
  *,
  # Step 3: Calc running sum of seconds since last visit - how to reset after reaching a threshold?
  SUM(sec_since_last_visit) OVER (PARTITION BY customer_id ORDER BY visit_ts ROWS UNBOUNDED PRECEDING) AS running_sum
FROM (
  SELECT
    *,
    # Step 2: Calc the number of seconds between this and the previous visit
    TIMESTAMP_DIFF(visit_ts, prev_visit_ts, SECOND) AS sec_since_last_visit
  FROM (
    SELECT
      *,
      # Step 1: Calc visit_ts of the previous visit
      LAG(visit_ts) OVER (PARTITION BY customer_id ORDER BY visit_ts) AS prev_visit_ts
    FROM
      table ) )
What does not work?
What I cannot find a solution for: I need to somehow reset the running sum once it reaches the threshold of 7 days. There is no RESET WHEN clause in BigQuery like there is in Teradata, nor can one nest analytic functions or use recursion in BigQuery for this. The option of using a logical RANGE around the current row only allows static values, which doesn't help either, I guess.
Is there actually a way of solving this in BigQuery (without stored procedures)?
Of course, this problem can be solved pretty easily using an iterative or recursive approach in a programming language like Python or Java. However, I am specifically interested in whether there is a way of solving it in BigQuery Standard SQL and what that solution looks like.

Calculating percentage change of a metric for a group in PostgreSQL

I have a sample table as follows:
id | timestamp | agentid | input_interface | sourceipv4address | totalbytes_sum
-------+------------+-----------------+-----------------------------+-------------------+----------------
10733 | 1593648000 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.10 | 3857
10734 | 1593648000 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.101 | 45960
10731 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.10 | 20579
10736 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.101 | 21384
10737 | 1593648600 | 203.121.214.129 | 203.121.214.129 interface 1 | 10.10.10.107 | 2094
This table is populated by taking samples from a network every 10 minutes. Basically I am trying to build a view to calculate the percentage change on totalbytes_sum for each group (agentid,input_interface,sourceipv4address) and show it as:
timestamp | agentid | input_interface | sourceipv4address | totalbytes_sum | percent
The calculation needs to happen based on the current 10-minute sample and the previous 10-minute sample. I can guarantee that there will be only one record for a particular (agentid, input_interface, sourceipv4address) combination within the same 10 minutes.
If a combination did not have a record within the previous 10 minutes, the percentage will be +100%.
I was trying to apply partition/order logic but had no luck. The lag/offset approach seems promising too, but I am pretty much stuck.
Can someone please assist me?
Thanks
Your timestamps are all the same. I assume they would be ~600 seconds apart in your actual data.
Please try something like this and let me know in comments if it does not work for you or if you need explanation for any of it:
select timestamp, agentid, input_interface,
       sourceipv4address, totalbytes_sum,
       timestamp - lag(timestamp) over w as elapsed_time,        -- illustration column
       lag(totalbytes_sum) over w as last_totalbytes_sum,        -- illustration column
       case
         when coalesce(lag(timestamp) over w, 0) = 0 then 100.0
         when timestamp - lag(timestamp) over w > 600 then 100.0
         else 100.0 * (totalbytes_sum - lag(totalbytes_sum) over w) /
              (lag(totalbytes_sum) over w)
       end as percent
from sample_table
window w as (partition by agentid, input_interface, sourceipv4address
             order by timestamp);
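One edge case the CASE expression does not cover: if the previous sample's totalbytes_sum can be 0, the else branch divides by zero. A NULLIF guard (my addition, not part of the answer above) returns NULL for those rows instead of erroring; a tiny standalone demonstration with made-up values:

-- hypothetical (current, previous) pairs; nullif turns a 0 divisor into NULL
select cur, prev,
       100.0 * (cur - prev) / nullif(prev, 0) as percent
from (values (20579, 3857), (21384, 0), (2094, null::numeric)) as s(cur, prev);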

PostgreSQL: Melt table and calculate percentages for different groups

I am trying to create a funnel chart, but my data is in a wide format right now. It has a couple of groups that I want to compare (e.g., A and B in the example below), and they are on different scales, so I want to use proportions as well as the raw values.
I have a starting table that looks like this:
| group | One | Two | Three |
|-------|-----|-----|-------|
| A | 100 | 75 | 50 |
| B | 10 | 7 | 6 |
|-------|-----|-----|-------|
I need to get the table to look like this:
| group | stage | count | proportion of stage One |
|-------|-------|-------|-------------------------|
| A | One | 100 | 1 |
| A | Two | 75 | 0.75 |
| A | Three | 50 | 0.5 |
| B | One | 10 | 1 |
| B | Two | 7 | 0.7 |
| B | Three | 6 | 0.6 |
|-------|-------|-------|-------------------------|
The proportion is calculated as each row's value divided by the maximum value for that group. Stage One is always going to be 100%; then each subsequent stage is that row's count divided by the group's maximum count (which is the Stage One count).
The best I could do was connect to the database in Python and use pandas to melt the table, but I would really like to keep everything in a SQL script.
I've been fumbling around and making zero progress for too long. Any help is much appreciated.
You can do this with a UNION query, selecting first the values of One, then Two and Three with the appropriate division to get the proportion:
SELECT "group", 'One' AS stage, One, 1 AS proportion
FROM data
UNION ALL
SELECT "group", 'Two', Two, ROUND(1.0*Two/One, 2)
FROM data
UNION ALL
SELECT "group", 'Three', Three, ROUND(1.0*Three/One, 2)
FROM data
ORDER BY "group"
Output:
group  stage  one  proportion
A      One    100  1
A      Two    75   0.75
A      Three  50   0.50
B      One    10   1
B      Two    7    0.70
B      Three  6    0.60
Demo on dbfiddle
I would recommend a lateral join:
SELECT t."group", v.stage, v.count, v.count * 1.0 / t.one AS proportion
FROM t CROSS JOIN LATERAL
     (VALUES ('One', one),
             ('Two', two),
             ('Three', three)
     ) v(stage, count);
A lateral join should be a little faster than a union all even on a small amount of data. As the data gets bigger, scanning the table only once becomes a bigger win. The biggest win, however, is when the "table" is really a more complex query; then the lateral join can perform significantly better.
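A small end-to-end sketch of the lateral approach (the table name t and lowercase column names are assumptions based on the question):

create table t ("group" text, one int, two int, three int);
insert into t values ('A', 100, 75, 50), ('B', 10, 7, 6);

select t."group", v.stage, v.count,
       round(v.count * 1.0 / t.one, 2) as proportion
from t cross join lateral
     (values ('One', one), ('Two', two), ('Three', three)) v(stage, count)
order by t."group";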

Find average for records with aggregated count

I am trying to find the average in a table that includes a count in each record.
I need to find the average as though there were individual records for each count listed in the record.
For example:
+-------+------------------+-------------------+
| Color | Value_to_Average | Number_of_Records |
+-------+------------------+-------------------+
| Red | 3 | 2 |
| Red | 2 | 3 |
| Green | 5 | 2 |
| Blue | 1 | 2 |
+-------+------------------+-------------------+
When I average the values individually, the result is 2.66667. How can I get this same result from the records with the counts?
SQL Fiddle
You want a weighted average:
select sum(Value_to_Average * Number_of_Records) * 1.0 / sum(Number_of_Records)
from Color_Avg t;
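As a sanity check with the sample data: (3*2 + 2*3 + 5*2 + 1*2) / (2 + 3 + 2 + 2) = 24 / 9 ≈ 2.66667, matching the result of averaging the expanded individual records. A self-contained version you can run directly (the * 1.0 avoids integer division when the columns are integers):

select sum(value_to_average * number_of_records) * 1.0
       / sum(number_of_records) as weighted_avg
from (values ('Red', 3, 2), ('Red', 2, 3), ('Green', 5, 2), ('Blue', 1, 2))
     as t(color, value_to_average, number_of_records);
-- returns approximately 2.66667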
I think you're looking for something like this:
select sum(value_to_average * number_of_records) / cast(sum(number_of_records) as double precision)
from Color_Avg

Find percentage of first row within the same column

I have a table that looks like this:
Day | Count
----+------
1 | 59547
2 | 40448
3 | 36707
4 | 34492
And I want to query a result set like this:
Day | Count | Percentage of 1st row
----+-------+----------------------
1 | 59547 | 1
2 | 40448 | .6793
3 | 36707 | .6164
4 | 34492 | .5792
I've tried using window functions, but can only seem to get a percentage of the total, which looks like this:
Day | Count | Percentage of 1st row
----+-------+----------------------
1 | 59547 | 0.347833452
2 | 40448 | 0.236269963
3 | 36707 | 0.214417561
4 | 34492 | 0.201479024
But I want a percentage of the first row. I know I could use a cross join that queries just for "Day 1", but that seems to take a long time. I was wondering if there is a way to write a window function to do this.
Judging from your numbers, you may be looking for this:
SELECT *, round(ct::numeric / first_value(ct) OVER (ORDER BY day), 4) AS pct
FROM tbl;
"A percentage for each row, calculated as ct divided by ct of the first row as defined by the smallest day number."
The key is the window function first_value().
-> SQLfiddle demo.
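A runnable sketch of the first_value() approach (assuming Postgres and a table tbl(day, ct) as in the answer; the data is copied from the question):

create table tbl (day int, ct int);
insert into tbl values (1, 59547), (2, 40448), (3, 36707), (4, 34492);

select day, ct,
       round(ct::numeric / first_value(ct) over (order by day), 4) as pct
from tbl;
-- pct: 1.0000, 0.6793, 0.6164, 0.5792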