Postgres alternating group query - sql

Given I have a table:
my_table
id INT
bool BOOLEAN
And some data:
1, true
2, true
3, false
4, true
5, true
6, false
7, false
8, false
9, true
...
How can I SELECT such that I find only the rows where there has been a change in the bool value between the current row's id and the previous row's id?
In this case, I would want the results to look like so:
1, true
3, false
4, true
6, false
9, true
...

You can use lag():
select t.*
from (select t.*, lag(bool) over (order by id) as prev_bool
from my_table t
) t
where t.bool <> prev_bool;
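Note that lag() returns NULL for the first row, so bool <> prev_bool evaluates to NULL there and id 1 is filtered out, even though the expected output includes it. If the first row should also be returned, a small variant (sketch) is to compare with IS DISTINCT FROM, which treats NULL as a distinct value:
select t.*
from (select t.*, lag(bool) over (order by id) as prev_bool
from my_table t
) t
where t.bool is distinct from prev_bool;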

Related

Complex Clustering of Rows in Large Table - SQL in Google BigQuery

I am trying to find common clusters in a large table. My data is in Google BigQuery.
My data consists of many transactions (tx) from different user groups. User groups can have multiple ids, and I am trying to assign every id to its respective user group by analyzing their transactions.
I identified four rules that mark ids as belonging to the same user group. In brackets I have named each cluster rule so that you can map it to the SQL below:
all ids that belong to the same tx where is_type_0 = TRUE belong to the same user group (="cluster_is_type_0")
all ids that belong to the same tx where is_type_1 = FALSE belong to the same user group (="cluster_is_type_1")
all ids that belong to the same tx where is_type_1 = TRUE, and that did not occur in earlier rows (row_nr), belong to the same user group (="cluster_id_occurence")
all ids with the same id belong to the same user group (="cluster_id")
Here is some example data:
row_nr | tx | id | is_type_1 | is_type_0 | expected_cluster
1 | z0 | a1 | true | true | 1
2 | z1 | b1 | true | true | 2
3 | z1 | b2 | true | true | 2
4 | z2 | c1 | true | true | 3
5 | z2 | c2 | true | true | 3
6 | z3 | d1 | true | true | 4
7 | z | a1 | false | false | 1
8 | z | b1 | true | false | 2
9 | z | a2 | true | false | 1
10 | y | b1 | false | false | 2
11 | y | b2 | false | false | 2
12 | y | a2 | true | false | 1
13 | x | c1 | false | false | 3
14 | x | c2 | false | false | 3
15 | x | b1 | true | false | 2
16 | x | c3 | true | false | 3
17 | w | a2 | false | false | 1
18 | w | c1 | true | false | 3
19 | w | a3 | true | false | 1
20 | v | b1 | false | false | 2
21 | v | b2 | false | false | 2
22 | v | a2 | true | false | 1
This is what I already tried:
WITH data AS (
SELECT *
FROM UNNEST([
STRUCT
(1 as row_nr, 'z0' as tx, 'a1' as id, TRUE as is_type_1, TRUE as is_type_0, 1 as expected_cluster),
(2, 'z1', 'b1', TRUE, TRUE, 2),
(3, 'z1', 'b2', TRUE, TRUE, 2),
(4, 'z2', 'c1', TRUE, TRUE, 3),
(5, 'z2', 'c2', TRUE, TRUE, 3),
(6, 'z3', 'd1', TRUE, TRUE, 4),
(7, 'z', 'a1', FALSE, FALSE, 1),
(8, 'z', 'b1', TRUE, FALSE, 2),
(9, 'z', 'a2', TRUE, FALSE, 1),
(10, 'y', 'b1', FALSE, FALSE, 2),
(11, 'y', 'b2', FALSE, FALSE, 2),
(12, 'y', 'a2', TRUE, FALSE, 1),
(13, 'x', 'c1', FALSE, FALSE, 3),
(14, 'x', 'c2', FALSE, FALSE, 3),
(15, 'x', 'b1', TRUE, FALSE, 2),
(16, 'x', 'c3', TRUE, FALSE, 3),
(17, 'w', 'a2', FALSE, FALSE, 1),
(18, 'w', 'c1', TRUE, FALSE, 3),
(19, 'w', 'a3', TRUE, FALSE, 1),
(20, 'v', 'b1', FALSE, FALSE, 2),
(21, 'v', 'b2', FALSE, FALSE, 2),
(22, 'v', 'a2', TRUE, FALSE, 1)
])
)
, first_cluster as (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY id ORDER BY row_nr) as id_occurence
, CASE WHEN NOT is_type_1 THEN DENSE_RANK() OVER (ORDER BY tx) END AS cluster_is_type_1
, CASE WHEN is_type_0 THEN DENSE_RANK() OVER (ORDER BY tx) END AS cluster_is_type_0
, DENSE_RANK() OVER (ORDER BY id) AS cluster_id
FROM data
ORDER BY row_nr
)
, second_cluster AS (
SELECT *
, CASE WHEN id_occurence = 1 THEN MIN(cluster_is_type_1) OVER (PARTITION BY tx) END AS cluster_id_occurence
FROM first_cluster
ORDER BY row_nr
)
, third_cluster AS (
SELECT *
, COALESCE(cluster_is_type_1, cluster_id_occurence, cluster_is_type_0, cluster_id) AS combined_cluster
FROM second_cluster
ORDER BY row_nr
)
SELECT *
-- , ARRAY_AGG(combined_cluster) OVER (PARTITION BY id) AS combined_cluster_agg
, MIN(combined_cluster) OVER (PARTITION BY id) AS result_cluster
FROM third_cluster
ORDER BY id
But the result is not as expected: ids a1, a2 and a3 are not assigned to the same cluster. Also, COALESCE(cluster_is_type_1, cluster_id_occurence, cluster_is_type_0, cluster_id) AS combined_cluster can lead to unwanted behavior, because each DENSE_RANK starts at 1, so combining them this way can put ids into the same cluster that do not belong together.
I appreciate any help!

Using if statement in string_agg function - PostgreSQL

The query is as follows
WITH notes AS (
SELECT 891090 Order_ID, False customer_billing, false commander, true agent
UNION ALL
SELECT 891091, false, true, true
UNION ALL
SELECT 891091, true, false, false)
SELECT
n.order_id,
string_Agg(distinct CASE
WHEN n.customer_billing = TRUE THEN 'AR (Customer Billing)'
WHEN n.commander = TRUE THEN 'AP (Commander)'
WHEN n.agent = TRUE THEN 'AP (Agent)'
ELSE NULL
END,', ') AS finance
FROM notes n
WHERE
n.order_id = 891091 AND (n.customer_billing = TRUE or n.commander = TRUE or n.agent = TRUE)
GROUP BY ORDER_ID
As you can see, there are two records with order_id 891091.
The first 891091 record has commander and agent set to true.
The second 891091 record has customer_billing set to true.
Since a CASE expression stops at the first true condition, it returns Commander for the first record and does not consider Agent.
So the output becomes
order_id finance
891091 AP (Commander), AR (Customer Billing)
dbfiddle.uk Example
I need all the true values in the record to be considered so that the output becomes
order_id finance
891091 AP (Commander), AP (Agent), AR (Customer Billing)
My initial thought is that using an if statement instead of a case statement may fix this, but I am not sure how to do that inside the string_agg function.
How can I achieve this?
EDIT 1:
The answer specified below almost works, but the issue is that the comma-separated values are not distinct.
Here is the updated fiddle
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=9647d92870e3944516172eda83a8ac6e
You can consider splitting your CASE into separate expressions and using an array to collect the values. Then you can use array_to_string to format the result:
WITH notes AS (
SELECT 891090 Order_ID, False customer_billing, false commander, true agent UNION ALL
SELECT 891091, false, true, true UNION ALL
SELECT 891091, true, true, false),
tmp as (
SELECT
n.order_id id,
array_agg(
ARRAY[
CASE WHEN n.customer_billing = TRUE THEN 'AR (Customer Billing)' END,
CASE WHEN n.commander = TRUE THEN 'AP (Commander)' END,
CASE WHEN n.agent = TRUE THEN 'AP (Agent)' END
]) AS finance_array
FROM notes n
WHERE
n.order_id = 891091 AND (n.customer_billing = TRUE or n.commander = TRUE or n.agent = TRUE)
GROUP BY ORDER_ID )
select id, array_to_string(array(select distinct e from unnest(finance_array) as a(e)), ', ')
from tmp;
Here is db_fiddle.
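Another option, sketched here as an untested alternative rather than part of the answer above: turn the three flags into rows with a LATERAL VALUES list, so string_agg(DISTINCT ...) can handle the de-duplication directly and no NULL array elements are collected:
WITH notes AS (
SELECT 891090 Order_ID, false customer_billing, false commander, true agent UNION ALL
SELECT 891091, false, true, true UNION ALL
SELECT 891091, true, false, false)
SELECT
n.order_id,
string_agg(DISTINCT v.label, ', ') AS finance
FROM notes n
CROSS JOIN LATERAL (
VALUES ('AR (Customer Billing)', n.customer_billing),
('AP (Commander)', n.commander),
('AP (Agent)', n.agent)
) AS v(label, flag)
WHERE n.order_id = 891091 AND v.flag
GROUP BY n.order_id;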

Select cases from long table by key variable grouping

I have a table like below:
caseid | ncode | code | test
1 | 1 | ABC | TRUE
1 | 2 | DEF | TRUE
2 | 1 | ABC | TRUE
3 | 1 | DEF | TRUE
3 | 2 | HIJ | FALSE
caseid represents an individual case. Each case can have multiple codes associated with it (ncode and code). test is just a variable that tracks a boolean value of interest.
I have specific requirements for my query:
I need all cases where code = ABC and ncode = 1 and test = TRUE. This criterion has the highest priority.
Of those cases from #1, I need to create an additional column called hasdef that is a boolean that indicates if that specific caseid has any other rows where code = DEF and test = TRUE. It should be TRUE if so, otherwise FALSE.
So from the above table, what should return is:
caseid | ncode | code | test | hasdef
1 | 1 | ABC | TRUE | TRUE
2 | 1 | ABC | TRUE | FALSE
caseid = 1 returns because code = ABC, ncode = 1, and test = TRUE. hasdef = TRUE because in the second row, caseid = 1, code = DEF and test = TRUE.
caseid = 2 returns because code = ABC, ncode = 1, and test = TRUE. hasdef = FALSE because there is no other row with caseid = 2 where code = DEF.
caseid = 3 does not return. Even though there is a row where code = DEF and test = TRUE, the first criteria (code = ABC and ncode = 1) is not first satisfied.
This is what I have so far, but I am not confident it is working as desired:
select tab1.*, tab2.code is not null as hasdef from
(select * from mytable
where code = 'ABC' and ncode = 1) as tab1
left join (
select caseid, any_value(code) code, any_value(test) test
from mytable
group by caseid
having code = 'DEF' and test is true
) as tab2
using(caseid)
order by caseid
Below is for BigQuery Standard SQL
#standardSQL
select * from (
select *,
0 < countif(code = 'DEF' and test is true) over(partition by caseid) as hasdef
from `project.dataset.table`
)
where code = 'ABC' and ncode = 1 and test is true
Applied to the sample data from your question, this produces the expected output.
Note: you can replace test is true with just test, as below
select * from (
select *,
0 < countif(code = 'DEF' and test) over(partition by caseid) as hasdef
from `project.dataset.table`
)
where code = 'ABC' and ncode = 1 and test
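For completeness, a hedged alternative formulation (not from the answer above) that pre-aggregates the DEF check per caseid with LOGICAL_OR and joins it back; it uses the same placeholder table name as the answer:
select t.*, ifnull(d.hasdef, false) as hasdef
from `project.dataset.table` t
left join (
select caseid, logical_or(code = 'DEF' and test) as hasdef
from `project.dataset.table`
group by caseid
) d using (caseid)
where t.code = 'ABC' and t.ncode = 1 and t.test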

How to vectorize in Pandas when values depend on prior values

I'd like to use Pandas to implement a function that keeps a running balance, but I'm not sure it can be vectorized for speed.
In short, the problem I'm trying to solve is to keep track of consumption, generation, and the "bank" of over-generation.
"consumption" means how much is used in a given time period.
"generation" is how much is generated.
When generation is greater than consumption, the homeowner can "bank" the extra generation to be applied in subsequent time periods. They can apply it in a later month if their consumption exceeds their generation.
This will be for many entities, hence the "id" field. The time sequence is defined by "order".
Very basic example:
Month 1 generates 13, consumes 8 -> therefore banks 5
Month 2 generates 8, consumes 10 -> therefore uses 2 from the bank, and still has 3 left over
Month 3 generates 7, consumes 20 -> exhausts the remaining 3 from the bank, and has no bank left over.
Code
import numpy as np
import pandas as pd
id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 100, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)),
columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,91,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,8,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)),
columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])
def bank(df):
# deposit all excess when generation exceeds consumption
deposit = (df['Generate'] > df['Consume']) * (df['Generate'] - df['Consume'])
df['end_bal'] = 0
# beginning balance = prior period ending balance
df = df.sort_values(by=['id', 'Order'])
df['begin_bal'] = df['end_bal'].shift(periods=1)
df.loc[df['Order']==1, 'begin_bal'] = 0 # set first month beginning balance of each customer to 0
# calculate withdrawal
df['Withdraw'] = 0
ok_to_withdraw = df['Consume'] > df['Generate']
df.loc[ok_to_withdraw,'Withdraw'] = np.minimum(df.loc[ok_to_withdraw, 'begin_bal'],
df.loc[ok_to_withdraw, 'Consume'] -
df.loc[ok_to_withdraw, 'Generate'] -
deposit[ok_to_withdraw])
# ending balance = beginning balance + deposit - withdraw
df['end_bal'] = df['begin_bal'] + deposit - df['Withdraw']
return df
df = bank(df)
df.head()
id Order Consume Generate end_bal begin_bal Withdraw
0 1 1 10 20 10.0 0.0 0.0
1 1 2 17 16 0.0 0.0 0.0
2 1 3 20 17 0.0 0.0 0.0
3 1 4 11 21 10.0 0.0 0.0
4 1 5 17 9 0.0 0.0 0.0
df_solution.head()
id Order Consume Generate begin_bal end_bal Withdraw
0 1 1 10 20 0 10 0
1 1 2 17 16 10 9 1
2 1 3 20 17 9 6 3
3 1 4 11 21 6 16 0
4 1 5 17 9 16 8 9
I tried to implement this with various combinations of cumsum and shift, but the fact remains that the value of each row seems to need recalculating based on the prior row, and I'm not sure this is possible to vectorize.
Code to generate some test datasets:
import random

def generate_testdata():
random.seed(42*42)
np.random.seed(42*42)
numids = 10
numorders = 12
id = []
order = []
for i in range(numids):
id = id + [i]*numorders
order = order + list(range(1,numorders+1))
consume = np.random.uniform(low = 10, high = 40, size = numids*numorders)
generate = np.random.uniform(low = 10, high = 40, size = numids*numorders)
df = pd.DataFrame(list(zip(id, order, consume, generate)),
columns =['id','Order','Consume', 'Generate'])
return df
Here is a numpy-ish approach, mostly because I'm not that familiar with pandas:
The idea is to first compute the free cumsum and then to subtract the cumulative minimum if it is negative.
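As a tiny illustration of that trick on the first id's data alone (a sketch added here for clarity, not part of the original answer):
import numpy as np

# Generate - Consume for id 1, first ten periods of the example data
delta = np.array([10, -1, -3, 10, -8, -6, -10, 6, 2, -9])
acc = delta.cumsum()                                       # "free" running balance, may go negative
end_bal = acc - np.minimum(0, np.minimum.accumulate(acc))  # floor the balance at zero
print(end_bal)  # [10  9  6 16  8  2  0  6  8  0] -- matches the expected end_bal
The full answer below additionally offsets the cumulative sums at the id boundaries so the correction cannot leak across groups: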
import numpy as np
import pandas as pd
id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 8, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)),
columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,9,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)),
columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])
def f(df):
# find block boundaries
ids = df["id"].values
bnds, = np.where(np.diff(ids, prepend=ids[0]-1, append=ids[-1]+1))
# find raw balance change
delta = (df["Generate"] - df["Consume"]).values
# find offset, so cumulative min does not interfere across ids
safe_total = (np.minimum(delta.min(), 0)-1) * np.diff(bnds[:-1])
# must apply offset just before group switch, so it aligns the first
# begin_bal, not end_bal, of the next group
# also keep a copy of original values at switches
delta_orig = delta[bnds[1:-1]-1]
delta[bnds[1:-1]-1] += safe_total - np.add.reduceat(delta, bnds[:-2])
# form free cumsum
acc = delta.cumsum()
# correct
acc -= np.minimum(0, np.minimum.accumulate(acc))
# write solution back to df
shft = np.empty_like(acc)
shft[1:] = acc[:-1]
shft[0] = 0
# reinstate last end_bal of each group
acc[bnds[1:-1]-1] = np.maximum(0, shft[bnds[1:-1]-1] + delta_orig)
df["begin_bal"] = shft
df["end_bal"] = acc
df["Withdraw"] = np.maximum(0, df["begin_bal"] - df["end_bal"])
Test:
f(df)
df == df_solution
Prints:
id Order Consume Generate begin_bal end_bal Withdraw
0 True True True True True True True
1 True True True True True True True
2 True True True True True True True
3 True True True True True True True
4 True True True True True True False
5 True True True True True True True
6 True True True True True True True
7 True True True True True True True
8 True True True True True True True
9 True True True True True True True
10 True True True True True True True
11 True True True True True True True
12 True True True True True True True
13 True True True True True True True
14 True True True True True True True
15 True True True True True True True
16 True True True True True True True
17 True True True True True True True
18 True True True True True True True
19 True True True True True True True
20 True True True True True True True
21 True True True True True True True
22 True True True True True True True
23 True True True True True True True
24 True True True True True True True
25 True True True True True True True
There is one False but that appears to be a typo in the expected output provided.
Using @PaulPanzer's logic, here is a pandas version.
def CalcEB(x):
delta = x['Generate'] - x['Consume']
return delta.cumsum() - delta.cumsum().cummin().clip(-np.inf,0)
df['end_bal'] = df.groupby('id', as_index=False).apply(CalcEB).values
df['begin_bal'] = df.groupby('id')['end_bal'].shift().fillna(0)
df['Withdraw'] = (df['begin_bal'] - df['end_bal']).clip(0,np.inf)
df_pandas = df.copy()
#Note the typo mentioned by Paul Panzer
df_pandas.reindex(df_solution.columns, axis=1) == df_solution
Output (check dataframes)
id Order Consume Generate begin_bal end_bal Withdraw
0 True True True True True True True
1 True True True True True True True
2 True True True True True True True
3 True True True True True True True
4 True True True True True True False
5 True True True True True True True
6 True True True True True True True
7 True True True True True True True
8 True True True True True True True
9 True True True True True True True
10 True True True True True True True
11 True True True True True True True
12 True True True True True True True
13 True True True True True True True
14 True True True True True True True
15 True True True True True True True
16 True True True True True True True
17 True True True True True True True
18 True True True True True True True
19 True True True True True True True
20 True True True True True True True
21 True True True True True True True
22 True True True True True True True
23 True True True True True True True
24 True True True True True True True
25 True True True True True True True
I am not sure I understood your question fully, but I am going to have a go at answering it.
I will re-phrase what I understood...
1. Source data
There is source data, which is a DataFrame with four columns:
id - ID number of an entity
order - indicates the sequence of periods
consume - how much was consumed during the period
generate - how much was generated during the period
2. Calculations
For each id, we want to calculate:
diff which is the difference between generate and consume for each period
opening balance which is the closing balance from the previous order
closing balance which is the cumulative sum of the diff
3. Code
I will try to solve this with groupby, cumsum and shift.
# Make sure the df is sorted
df = df.sort_values(['id', 'Order'])
df['diff'] = df['Generate'] - df['Consume']
df['closing_balance'] = df.groupby('id')['diff'].cumsum()
# Opening balance equals the closing balance from the previous period
df['opening_balance'] = df.groupby('id')['closing_balance'].shift(1)
I may well have misunderstood something; feel free to correct me and I will try to come up with a better answer.
In particular, I wasn't sure how to handle the closing_balance going into negative numbers. Should it show negative balance? Should it nullify the "debts"?
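If the balance should never go negative (as in the asker's expected output), one hedged tweak to the sketch above is to floor the per-id running balance at zero, mirroring the clipping trick from the earlier answers (using the 'diff' column defined above):
# floor the cumulative balance at zero within each id
df['closing_balance'] = df.groupby('id')['diff'].transform(lambda s: s.cumsum() - s.cumsum().cummin().clip(upper=0))
df['opening_balance'] = df.groupby('id')['closing_balance'].shift(1).fillna(0)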

Pandas Dataframe of dates with missing data selection acting strangely

When there is missing data in a Pandas DataFrame, the indexing is not working as I would expect it to.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a' : [datetime(2011, 1, 1), datetime(2013, 1, 1)],
'b' : [datetime(2010, 1, 1), datetime(2014, 1, 1)]})
df > datetime(2012, 1, 1)
works as expected:
a b
0 False False
1 True True
but if there is a missing value
none_df = pd.DataFrame({'a' : [datetime(2011, 1, 1), datetime(2013, 1, 1)],
'b' : [datetime(2010, 1, 1), None]})
none_df > datetime(2012, 1, 1)
the selection returns all True
a b
0 True True
1 True True
Am I doing something wrong? Is this desired behavior?
Python 3.5 64bit, Pandas 0.18.0, Windows 10
I agree that the behavior is unusual.
This is a work-around solution:
>>> df.apply(lambda col: col > datetime(2012, 1, 1))
a b
0 False False
1 True False
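A related, hedged note: if both columns are coerced to real datetime64[ns] columns first, the missing value becomes NaT, which compares as False against the cutoff (behavior checked on recent pandas; the question was reported against 0.18.0):
import pandas as pd
from datetime import datetime

none_df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
'b': [datetime(2010, 1, 1), None]})

# coerce every column to datetime64[ns]; None becomes NaT
coerced = none_df.apply(pd.to_datetime)
print(coerced > pd.Timestamp('2012-01-01'))
# a b
# 0 False False
# 1 True False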