Complex Clustering of Rows in Large Table - SQL in Google BigQuery

I am trying to find common clusters in a large table. My data is in Google BigQuery.
The data consists of many transactions (tx) from different user groups. A user group can have multiple ids, and I want to assign every id to its respective user group by analyzing the transactions.
I identified four rules that tell me which ids belong to the same user group. The name in brackets maps each rule to the SQL below:
all ids that belong to the same tx where is_type_0 = TRUE belong to the same user group (="cluster_is_type_0")
all ids that belong to the same tx where is_type_1 = FALSE belong to the same user group (="cluster_is_type_1")
all ids that belong to the same tx where is_type_1 = TRUE and that have not appeared in an earlier row (lower row_nr) belong to the same user group (="cluster_id_occurence")
all ids with the same id belong to the same user group (="cluster_id")
Here is some example data:
row_nr  tx  id  is_type_1  is_type_0  expected_cluster
     1  z0  a1  true       true       1
     2  z1  b1  true       true       2
     3  z1  b2  true       true       2
     4  z2  c1  true       true       3
     5  z2  c2  true       true       3
     6  z3  d1  true       true       4
     7  z   a1  false      false      1
     8  z   b1  true       false      2
     9  z   a2  true       false      1
    10  y   b1  false      false      2
    11  y   b2  false      false      2
    12  y   a2  true       false      1
    13  x   c1  false      false      3
    14  x   c2  false      false      3
    15  x   b1  true       false      2
    16  x   c3  true       false      3
    17  w   a2  false      false      1
    18  w   c1  true       false      3
    19  w   a3  true       false      1
    20  v   b1  false      false      2
    21  v   b2  false      false      2
    22  v   a2  true       false      1
This is what I already tried:
WITH data AS (
  SELECT *
  FROM UNNEST([
    STRUCT(1 as row_nr, 'z0' as tx, 'a1' as id, TRUE as is_type_1, TRUE as is_type_0, 1 as expected_cluster),
    (2, 'z1', 'b1', TRUE, TRUE, 2),
    (3, 'z1', 'b2', TRUE, TRUE, 2),
    (4, 'z2', 'c1', TRUE, TRUE, 3),
    (5, 'z2', 'c2', TRUE, TRUE, 3),
    (6, 'z3', 'd1', TRUE, TRUE, 4),
    (7, 'z', 'a1', FALSE, FALSE, 1),
    (8, 'z', 'b1', TRUE, FALSE, 2),
    (9, 'z', 'a2', TRUE, FALSE, 1),
    (10, 'y', 'b1', FALSE, FALSE, 2),
    (11, 'y', 'b2', FALSE, FALSE, 2),
    (12, 'y', 'a2', TRUE, FALSE, 1),
    (13, 'x', 'c1', FALSE, FALSE, 3),
    (14, 'x', 'c2', FALSE, FALSE, 3),
    (15, 'x', 'b1', TRUE, FALSE, 2),
    (16, 'x', 'c3', TRUE, FALSE, 3),
    (17, 'w', 'a2', FALSE, FALSE, 1),
    (18, 'w', 'c1', TRUE, FALSE, 3),
    (19, 'w', 'a3', TRUE, FALSE, 1),
    (20, 'v', 'b1', FALSE, FALSE, 2),
    (21, 'v', 'b2', FALSE, FALSE, 2),
    (22, 'v', 'a2', TRUE, FALSE, 1)
  ])
)
, first_cluster as (
  SELECT *
    , ROW_NUMBER() OVER (PARTITION BY id ORDER BY row_nr) as id_occurence
    , CASE WHEN NOT is_type_1 THEN DENSE_RANK() OVER (ORDER BY tx) END AS cluster_is_type_1
    , CASE WHEN is_type_0 THEN DENSE_RANK() OVER (ORDER BY tx) END AS cluster_is_type_0
    , DENSE_RANK() OVER (ORDER BY id) AS cluster_id
  FROM data
  ORDER BY row_nr
)
, second_cluster AS (
  SELECT *
    , CASE WHEN id_occurence = 1 THEN MIN(cluster_is_type_1) OVER (PARTITION BY tx) END AS cluster_id_occurence
  FROM first_cluster
  ORDER BY row_nr
)
, third_cluster AS (
  SELECT *
    , COALESCE(cluster_is_type_1, cluster_id_occurence, cluster_is_type_0, cluster_id) AS combined_cluster
  FROM second_cluster
  ORDER BY row_nr
)
SELECT *
  -- , ARRAY_AGG(combined_cluster) OVER (PARTITION BY id) AS combined_cluster_agg
  , MIN(combined_cluster) OVER (PARTITION BY id) AS result_cluster
FROM third_cluster
ORDER BY id
But the result is not as expected: the ids a1, a2 and a3 are not assigned to the same cluster. In addition, COALESCE(cluster_is_type_1, cluster_id_occurence, cluster_is_type_0, cluster_id) AS combined_cluster can lead to unwanted behavior, because each of the DENSE_RANK based cluster numbers starts at 1, so combining them this way can put ids into the same cluster even though they do not belong together.
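The way I currently think about it, the four rules define links between ids (and rule 4 means the same id is linked to itself across transactions), so assigning ids to user groups is essentially a connected-components problem. Below is only a rough sketch of that direction using BigQuery scripting and iterative label propagation. The table name `my_dataset.tx_data` is a placeholder and the rule-specific edge conditions (in particular the row_nr based rule) are not filled in, so this is a direction rather than a working solution:

-- 1) Edge list: pairs of ids that one of the rules ties together.
--    Placeholder table name; the join on tx alone is too coarse, the
--    rule-specific filters (is_type_0 / is_type_1 / first occurrence by
--    row_nr) still have to be encoded here.
CREATE TEMP TABLE edges AS
SELECT a.id AS id_a, b.id AS id_b
FROM `my_dataset.tx_data` a
JOIN `my_dataset.tx_data` b
  ON a.tx = b.tx;

-- 2) Every id starts as its own cluster label (this also covers the
--    cluster_id rule, because labels are keyed by id).
CREATE TEMP TABLE labels AS
SELECT DISTINCT id, id AS cluster_key
FROM `my_dataset.tx_data`;

-- 3) Repeatedly pull the smallest label across the edges; labels can only
--    shrink, so the loop stops once nothing changes any more.
LOOP
  CREATE OR REPLACE TEMP TABLE new_labels AS
  SELECT id, MIN(cluster_key) AS cluster_key
  FROM (
    SELECT id, cluster_key FROM labels
    UNION ALL
    SELECT e.id_a AS id, l.cluster_key
    FROM edges e
    JOIN labels l ON l.id = e.id_b
  ) candidates
  GROUP BY id;

  IF (SELECT LOGICAL_AND(l.cluster_key = n.cluster_key)
      FROM labels l
      JOIN new_labels n USING (id)) THEN
    LEAVE;
  END IF;

  CREATE OR REPLACE TEMP TABLE labels AS
  SELECT * FROM new_labels;
END LOOP;

SELECT id, DENSE_RANK() OVER (ORDER BY cluster_key) AS result_cluster
FROM labels
ORDER BY result_cluster, id;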
I appreciate any help!

Related

Getting boolean columns based on value presence in other columns

I have this table:
import pandas as pd

df1 = pd.DataFrame(data={'col1': ['a', 'e', 'a', 'e'],
                         'col2': ['e', 'a', 'c', 'b'],
                         'col3': ['c', 'b', 'b', 'a']},
                   index=pd.Series([1, 2, 3, 4], name='index'))
index  col1  col2  col3
1      a     e     c
2      e     a     b
3      a     c     b
4      e     b     a
and this list:
all_vals = ['a', 'b', 'c', 'd', 'e', 'f']
How do I make boolean columns from df1 such that it includes all columns from the all_vals list, even if the value is not in df1?
index  a     b      c      d      e      f
1      TRUE  FALSE  TRUE   FALSE  TRUE   FALSE
2      TRUE  TRUE   FALSE  FALSE  TRUE   FALSE
3      TRUE  TRUE   TRUE   FALSE  FALSE  FALSE
4      TRUE  TRUE   FALSE  FALSE  TRUE   FALSE
You can iterate over all_vals, check whether each value appears in any column of the row, and create a new boolean column:
for val in all_vals:
    df1[val] = (df1 == val).any(axis=1)
Use get_dummies, aggregate with max per column level, and DataFrame.reindex:
df1 = (pd.get_dummies(df1, dtype=bool, prefix='', prefix_sep='')
         .groupby(axis=1, level=0).max()
         .reindex(all_vals, axis=1, fill_value=False))
print(df1)
a b c d e f
index
1 True False True False True False
2 True True False False True False
3 True True True False False False
4 True True False False True False

how do I select rows from pandas df without returning False values?

I have a df and I need to select rows based on some conditions in multiple columns.
Here is what I have
import pandas as pd
dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns = ['a', 'b', 'c'])
df_dat = df[(df[['a','b']].isin(['k','p','q','j']) & df['c'] > 3)] | df[(~df[['a','b']].isin(['k','p','q','j']) & df['c'] > 2 )]
Expected result = [('p','q', 5), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3)]
The result I am getting is an all-False dataframe.
When you have a complicated condition like this, I recommend building the conditions outside the slice:
cond1 = df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3)
cond2 = (~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2)
out = df.loc[cond1 | cond2]
Out[305]:
a b c
0 p q 5
2 p - 5
3 - p 4
4 q pkjq 3

Get (index,column) pair where the value is True

Say I have a dataframe df
a b c
0 False True True
1 False False False
2 True True False
3 False False False
I would like all (index, column) pairs, e.g. (0,"b"), (0,"c"), (2,"a"), (2,"b"), where the value is True.
Is there a way to do that, without looping over either the index or columns?
Assuming booleans in the input, you can use:
df.where(df).stack().index.to_list()
output:
[(0, 'b'), (0, 'c'), (2, 'a'), (2, 'b')]
Let us try np.where:
r, c = np.where(df)
[*zip(df.index[r], df.columns[c])]
[(0, 'b'), (0, 'c'), (2, 'a'), (2, 'b')]
df.stack().to_frame('col1').query('col1').index.tolist()
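For completeness, here is a self-contained version of the np.where approach above; the DataFrame construction below simply rebuilds the frame shown in the question:
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'a': [False, False, True, False],
                   'b': [True, False, True, False],
                   'c': [True, False, False, False]})

# np.where returns the row/column positions of the True cells;
# map the positions back to index and column labels and pair them up.
r, c = np.where(df)
print([*zip(df.index[r], df.columns[c])])
# [(0, 'b'), (0, 'c'), (2, 'a'), (2, 'b')]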

How do I apply multiple filter criteria based on conditions to copy values from other columns into new columns in a pandas dataframe

Suppose I have cols: A, B, C, D, E, F.
i.e. if col A == '', make new col G = col C, new col H = col D, new col I = col E;
if col A != '' and col B == 'some-value', make col G = 0, col H = 0, col I = 0.
I tried using np.where, but it only supports two conditions. Any idea?
def change(dfr):
    if (dfr['A'] == ''):
        dfr['G'] = dfr['A']
        dfr['H'] = dfr['B']
        dfr['I'] = dfr['C']
    if ((dfr['A'] != '') & (dfr['B'] == 'some-value')):
        dfr['G'] = dfr['A']
        dfr['H'] = dfr['B']
        dfr['I'] = dfr['C']
    if ((dfr['A'] != '') & (dfr['B'] == 'value')):
        dfr['G'] = 0
        dfr['H'] = 0
        dfr['I'] = 0
I'm not sure you need the if statements. You can use .loc to accomplish this. Here is a toy dataframe:
data = pd.DataFrame({"A": ['a', '', 'f', '4', '', 'z'],
                     "B": ['f', 'y', 't', 'u', 'o', '1'],
                     "C": ['a', 'b', 'c', 'd', 'e', 'f'],
                     "G": [1, 1, 1, 1, 1, 1],
                     "H": [6, 6, 6, 6, 6, 6],
                     "I": ['q', 'q', 'q', 'q', 'q', 'q']})
data
   A  B  C  G  H  I
0  a  f  a  1  6  q
1     y  b  1  6  q
2  f  t  c  1  6  q
3  4  u  d  1  6  q
4     o  e  1  6  q
5  z  1  f  1  6  q
It probably makes sense to build in a couple of arguments for the values you'd like to check for in columns B:
def change(dfr, b_firstvalue, b_secondvalue):
    new_df = dfr.copy()
    new_df.loc[new_df['A'] == '', 'G'] = new_df['A']
    new_df.loc[new_df['A'] == '', 'H'] = new_df['B']
    new_df.loc[new_df['A'] == '', 'I'] = new_df['C']
    new_df.loc[((new_df['A'] != '') & (new_df['B'] == b_firstvalue)), 'G'] = new_df['A']
    new_df.loc[((new_df['A'] != '') & (new_df['B'] == b_firstvalue)), 'H'] = new_df['B']
    new_df.loc[((new_df['A'] != '') & (new_df['B'] == b_firstvalue)), 'I'] = new_df['C']
    new_df.loc[((new_df['A'] != '') & (new_df['B'] == b_secondvalue)), 'G'] = 0
    new_df.loc[((new_df['A'] != '') & (new_df['B'] == b_secondvalue)), 'H'] = 0
    new_df.loc[((new_df['A'] != '') & (new_df['B'] == b_secondvalue)), 'I'] = 0
    return new_df
data2 = change(data, '1', 'f')
data2
   A  B  C  G  H  I
0  a  f  a  0  0  0
1     y  b     y  b
2  f  t  c  1  6  q
3  4  u  d  1  6  q
4     o  e     o  e
5  z  1  f  z  1  f
Obviously, the function will depend on exactly how many columns you want to deal with. This was just a solution for the example problem. If you have many more columns you'd like to replace values with, there may be more efficient ways of handling that.
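If you would rather stay close to np.where but need more than two branches, np.select is one possible alternative. This is only a sketch against the toy frame above, using the same branch order as change(data, '1', 'f'); the first matching condition wins:
import numpy as np

conditions = [
    data['A'] == '',
    (data['A'] != '') & (data['B'] == '1'),   # b_firstvalue
    (data['A'] != '') & (data['B'] == 'f'),   # b_secondvalue
]

data2 = data.copy()
# One np.select call per target column; `default` keeps the existing value.
data2['G'] = np.select(conditions, [data['A'], data['A'], 0], default=data['G'])
data2['H'] = np.select(conditions, [data['B'], data['B'], 0], default=data['H'])
data2['I'] = np.select(conditions, [data['C'], data['C'], 0], default=data['I'])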

Postgres alternating group query

Given I have a table:
my_table (
  id   INT,
  bool BOOLEAN
)
And some data:
1, true
2, true
3, false
4, true
5, true
6, false
7, false
8, false
9, true
...
How can I SELECT such that I find only the rows where there has been a change in the bool value between the current row's id and the previous row's id?
In this case, I would want the results to look like so:
1, true
3, false
4, true
6, false
9, true
...
You can use lag():
select t.*
from (select t.*,
             lag(bool) over (order by id) as prev_bool
      from my_table t
     ) t
where t.bool is distinct from prev_bool;  -- "is distinct from" also keeps the first row, whose prev_bool is null