Get (index, column) pairs where the value is True - pandas

Say I have a dataframe df
       a      b      c
0  False   True   True
1  False  False  False
2   True   True  False
3  False  False  False
I would like all (index, column) pairs, e.g. (0,"b"), (0,"c"), (2,"a"), (2,"b"), where the value is True.
Is there a way to do that, without looping over either the index or columns?

Assuming booleans in the input, you can use:
df.where(df).stack().index.to_list()
output:
[(0, 'b'), (0, 'c'), (2, 'a'), (2, 'b')]

Let us try np.where (this needs import numpy as np):
r, c = np.where(df)
[*zip(df.index[r], df.columns[c])]
[(0, 'b'), (0, 'c'), (2, 'a'), (2, 'b')]

df.stack().to_frame('col1').query('col1').index.tolist()
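For reference, a minimal, self-contained sketch that reproduces the example and the first two approaches (the frame below is reconstructed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [False, False, True, False],
                   'b': [True, False, True, False],
                   'c': [True, False, False, False]})

# where(df) masks the False cells to NaN; stack() drops them,
# leaving a MultiIndex of the surviving (index, column) pairs.
print(df.where(df).stack().index.to_list())
# [(0, 'b'), (0, 'c'), (2, 'a'), (2, 'b')]

# Equivalent with numpy: positional indices mapped back to labels.
r, c = np.where(df)
print([*zip(df.index[r], df.columns[c])])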

Getting boolean columns based on value presence in other columns

I have this table:
import pandas as pd

df1 = pd.DataFrame(data={'col1': ['a', 'e', 'a', 'e'],
                         'col2': ['e', 'a', 'c', 'b'],
                         'col3': ['c', 'b', 'b', 'a']},
                   index=pd.Series([1, 2, 3, 4], name='index'))
      col1 col2 col3
index
1        a    e    c
2        e    a    b
3        a    c    b
4        e    b    a
and this list:
all_vals = ['a', 'b', 'c', 'd', 'e', 'f']
How do I make boolean columns from df1 such that it includes all columns from the all_vals list, even if the value is not in df1?
index   a      b      c      d      e      f
1       TRUE   FALSE  TRUE   FALSE  TRUE   FALSE
2       TRUE   TRUE   FALSE  FALSE  TRUE   FALSE
3       TRUE   TRUE   TRUE   FALSE  FALSE  FALSE
4       TRUE   TRUE   FALSE  FALSE  TRUE   FALSE
You can iterate over all_vals, check whether each value appears in any column of the row, and create a new column (the boolean columns added along the way never compare equal to a string, so the loop stays correct):
for val in all_vals:
    df1[val] = (df1 == val).any(axis=1)
Use get_dummies, aggregate max per column level, and DataFrame.reindex:
df1 = (pd.get_dummies(df1, dtype=bool, prefix='', prefix_sep='')
         .groupby(axis=1, level=0).max()
         .reindex(all_vals, axis=1, fill_value=False))
print(df1)
          a      b      c      d      e      f
index
1      True  False   True  False   True  False
2      True   True  False  False   True  False
3      True   True   True  False  False  False
4      True   True  False  False   True  False
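A third option along the same lines (a sketch, not from the original answers): build each boolean column directly with a comprehension, which avoids mutating df1 in place:

import pandas as pd

df1 = pd.DataFrame(data={'col1': ['a', 'e', 'a', 'e'],
                         'col2': ['e', 'a', 'c', 'b'],
                         'col3': ['c', 'b', 'b', 'a']},
                   index=pd.Series([1, 2, 3, 4], name='index'))
all_vals = ['a', 'b', 'c', 'd', 'e', 'f']

# One boolean column per candidate value; values absent from df1
# (here 'd' and 'f') simply come out all False.
out = pd.DataFrame({v: df1.eq(v).any(axis=1) for v in all_vals})
print(out)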

Complex Clustering of Rows in Large Table - SQL in Google BigQuery

I am trying to find common clusters in a large table. My data is in Google BigQuery.
The data consists of many transactions (tx) from different user groups. A user group can have multiple ids, and I am trying to assign every id to its user group by analyzing the transactions.
I identified four rules that mark ids as belonging to the same user group. In parentheses I have named each cluster rule so that you can map it to the SQL below:
all ids that belong to the same tx where is_type_0 = TRUE belong to the same user group (="cluster_is_type_0")
all ids that belong to the same tx where is_type_1 = FALSE belong to the same user group (="cluster_is_type_1")
all ids that belong to the same tx where is_type_1 = TRUE, and that did not occur in any earlier row (row_nr), belong to the same user group (="cluster_id_occurence")
all ids with the same id belong to the same user group (="cluster_id")
Here is some example data:
row_nr  tx  id  is_type_1  is_type_0  expected_cluster
1       z0  a1  true       true       1
2       z1  b1  true       true       2
3       z1  b2  true       true       2
4       z2  c1  true       true       3
5       z2  c2  true       true       3
6       z3  d1  true       true       4
7       z   a1  false      false      1
8       z   b1  true       false      2
9       z   a2  true       false      1
10      y   b1  false      false      2
11      y   b2  false      false      2
12      y   a2  true       false      1
13      x   c1  false      false      3
14      x   c2  false      false      3
15      x   b1  true       false      2
16      x   c3  true       false      3
17      w   a2  false      false      1
18      w   c1  true       false      3
19      w   a3  true       false      1
20      v   b1  false      false      2
21      v   b2  false      false      2
22      v   a2  true       false      1
This is what I already tried:
WITH data AS (
SELECT *
FROM UNNEST([
STRUCT
(1 as row_nr, 'z0' as tx, 'a1' as id, TRUE as is_type_1, TRUE as is_type_0, 1 as expected_cluster),
(2, 'z1', 'b1', TRUE, TRUE, 2),
(3, 'z1', 'b2', TRUE, TRUE, 2),
(4, 'z2', 'c1', TRUE, TRUE, 3),
(5, 'z2', 'c2', TRUE, TRUE, 3),
(6, 'z3', 'd1', TRUE, TRUE, 4),
(7, 'z', 'a1', FALSE, FALSE, 1),
(8, 'z', 'b1', TRUE, FALSE, 2),
(9, 'z', 'a2', TRUE, FALSE, 1),
(10, 'y', 'b1', FALSE, FALSE, 2),
(11, 'y', 'b2', FALSE, FALSE, 2),
(12, 'y', 'a2', TRUE, FALSE, 1),
(13, 'x', 'c1', FALSE, FALSE, 3),
(14, 'x', 'c2', FALSE, FALSE, 3),
(15, 'x', 'b1', TRUE, FALSE, 2),
(16, 'x', 'c3', TRUE, FALSE, 3),
(17, 'w', 'a2', FALSE, FALSE, 1),
(18, 'w', 'c1', TRUE, FALSE, 3),
(19, 'w', 'a3', TRUE, FALSE, 1),
(20, 'v', 'b1', FALSE, FALSE, 2),
(21, 'v', 'b2', FALSE, FALSE, 2),
(22, 'v', 'a2', TRUE, FALSE, 1)
])
)
, first_cluster as (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY id ORDER BY row_nr) as id_occurence
, CASE WHEN NOT is_type_1 THEN DENSE_RANK() OVER (ORDER BY tx) END AS cluster_is_type_1
, CASE WHEN is_type_0 THEN DENSE_RANK() OVER (ORDER BY tx) END AS cluster_is_type_0
, DENSE_RANK() OVER (ORDER BY id) AS cluster_id
FROM data
ORDER BY row_nr
)
, second_cluster AS (
SELECT *
, CASE WHEN id_occurence = 1 THEN MIN(cluster_is_type_1) OVER (PARTITION BY tx) END AS cluster_id_occurence
FROM first_cluster
ORDER BY row_nr
)
, third_cluster AS (
SELECT *
, COALESCE(cluster_is_type_1, cluster_id_occurence, cluster_is_type_0, cluster_id) AS combined_cluster
FROM second_cluster
ORDER BY row_nr
)
SELECT *
-- , ARRAY_AGG(combined_cluster) OVER (PARTITION BY id) AS combined_cluster_agg
, MIN(combined_cluster) OVER (PARTITION BY id) AS result_cluster
FROM third_cluster
ORDER BY id
But the result is not as expected: ids a1, a2 and a3 are not assigned to the same cluster. Also, COALESCE(cluster_is_type_1, cluster_id_occurence, cluster_is_type_0, cluster_id) AS combined_cluster can lead to unwanted behavior: each of the individual clusterings starts at 1 with DENSE_RANK, so combining them this way can put ids in the same cluster that do not belong together.
I appreciate any help!
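For readers puzzling over the rules: they effectively define connected components over the ids. Below is a minimal Python sketch (not a BigQuery solution) that models the four rules with a union-find and reproduces the expected_cluster column from the sample data, which at least confirms the rules are internally consistent:

# (row_nr, tx, id, is_type_1, is_type_0) from the sample data above
rows = [
    (1, 'z0', 'a1', True, True), (2, 'z1', 'b1', True, True),
    (3, 'z1', 'b2', True, True), (4, 'z2', 'c1', True, True),
    (5, 'z2', 'c2', True, True), (6, 'z3', 'd1', True, True),
    (7, 'z', 'a1', False, False), (8, 'z', 'b1', True, False),
    (9, 'z', 'a2', True, False), (10, 'y', 'b1', False, False),
    (11, 'y', 'b2', False, False), (12, 'y', 'a2', True, False),
    (13, 'x', 'c1', False, False), (14, 'x', 'c2', False, False),
    (15, 'x', 'b1', True, False), (16, 'x', 'c3', True, False),
    (17, 'w', 'a2', False, False), (18, 'w', 'c1', True, False),
    (19, 'w', 'a3', True, False), (20, 'v', 'b1', False, False),
    (21, 'v', 'b2', False, False), (22, 'v', 'a2', True, False),
]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

seen, tx_anchor = set(), {}
for row_nr, tx, id_, is_type_1, is_type_0 in rows:
    # Rules 1 and 2 always link an id to its tx; rule 3 links a
    # type_1 id only on its first occurrence. Rule 4 (same id =>
    # same group) is implicit because we union ids, not rows.
    links = is_type_0 or not is_type_1 or id_ not in seen
    seen.add(id_)
    if links:
        union(id_, tx_anchor.setdefault(tx, id_))

# Number the clusters in order of first appearance.
labels = {}
for _, _, id_, *_ in rows:
    labels.setdefault(find(id_), len(labels) + 1)
print({id_: labels[find(id_)] for _, _, id_, *_ in rows})
# a1/a2/a3 -> 1, b1/b2 -> 2, c1/c2/c3 -> 3, d1 -> 4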

How do I select rows from a pandas df without returning False values?

I have a df and I need to select rows based on some conditions in multiple columns.
Here is what I have
import pandas as pd
dat = [('p','q', 5), ('k','j', 2), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3), ('pkjq','q', 2)]
df = pd.DataFrame(dat, columns = ['a', 'b', 'c'])
df_dat = df[(df[['a','b']].isin(['k','p','q','j']) & df['c'] > 3)] | df[(~df[['a','b']].isin(['k','p','q','j']) & df['c'] > 2 )]
Expected result = [('p','q', 5), ('p','-', 5), ('-','p', 4), ('q','pkjq', 3)]
The result I am getting is an all-False dataframe.
When you have a complicated condition like this, I recommend building the conditions outside the slice:
cond1 = df[['a','b']].isin(['k','p','q','j']).any(axis=1) & df['c'].gt(3)
cond2 = (~df[['a','b']].isin(['k','p','q','j'])).any(axis=1) & df['c'].gt(2)
out = df.loc[cond1 | cond2]
Out[305]:
   a     b  c
0  p     q  5
2  p     -  5
3  -     p  4
4  q  pkjq  3
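For what it's worth, part of what goes wrong in the original attempt is operator precedence: & binds more tightly than >, so each comparison like df['c'] > 3 must be parenthesized, and the two-column isin mask has to be collapsed to one value per row before combining. A self-contained sketch of the corrected selection:

import pandas as pd

dat = [('p', 'q', 5), ('k', 'j', 2), ('p', '-', 5),
       ('-', 'p', 4), ('q', 'pkjq', 3), ('pkjq', 'q', 2)]
df = pd.DataFrame(dat, columns=['a', 'b', 'c'])

keys = ['k', 'p', 'q', 'j']
# Parenthesize each comparison and collapse the column-wise mask
# with any(axis=1) before combining row-level conditions.
cond1 = df[['a', 'b']].isin(keys).any(axis=1) & (df['c'] > 3)
cond2 = (~df[['a', 'b']].isin(keys)).any(axis=1) & (df['c'] > 2)
print(df[cond1 | cond2])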

Performing a mod function on time data column pandas python

Hello, I want to apply a mod function (hour % 24) to the hour of the time column.
I believe the time column is in string format, and I was wondering how I should go about performing the operation.
sales_id,date,time,shopping_cart,price,parcel_size,Customer_lat,Customer_long,isLoyaltyProgram,nearest_storehouse_id,nearest_storehouse,dist_to_nearest_storehouse,delivery_cost
ORD0056604,24/03/2021,45:13:45,"[('bed', 3), ('Chair', 1), ('wardrobe', 4), ('side_table', 2), ('Dining_table', 2), ('mattress', 1)]",3152.77,medium,-38.246,145.61984,1,4,Sunshine,78.43,5.8725000000000005
ORD0096594,13/12/2018,54:22:20,"[('Study_table', 4), ('wardrobe', 4), ('side_table', 1), ('Dining_table', 2), ('sofa', 4), ('Chair', 3), ('mattress', 1)]",3781.38,large,-38.15718,145.05072,1,4,Sunshine,40.09,5.8725000000000005
ORD0046310,16/02/2018,17:23:36,"[('mattress', 2), ('wardrobe', 1), ('side_table', 2), ('sofa', 1), ('Chair', 3), ('Study_table', 4)]",2219.09,medium,144.69623,-38.00731,0,2,Footscray,34.2,16.9875
ORD0031675,25/06/2018,17:38:48,"[('bed', 4), ('side_table', 1), ('Chair', 1), ('mattress', 3), ('Dining_table', 2), ('sofa', 2), ('wardrobe', 2)]",4542.1,large,144.65506,-38.40669,1,2,Footscray,72.72,18.274500000000003
ORD0019799,05/01/2021,18:37:16,"[('wardrobe', 1), ('Study_table', 3), ('sofa', 4), ('side_table', 2), ('Chair', 4), ('Dining_table', 4), ('bed', 1)]",3132.71,L,-37.66022,144.94286,1,0,Clayton,17.77,14.931
ORD0041462,25/12/2018,07:29:33,"[('Chair', 3), ('bed', 1), ('mattress', 3), ('side_table', 3), ('wardrobe', 3), ('sofa', 4)]",4416.42,medium,-38.39154,145.87448,0,6,Sunshine,105.91,6.151500000000001
ORD0047848,30/07/2021,34:18:01,"[('Chair', 3), ('bed', 3), ('wardrobe', 4)]",2541.04,small,-37.4654,144.45832,1,2,Footscray,60.85,18.4635
Convert the values to timedeltas with to_timedelta and then remove the days by indexing, selecting the last 8 characters:
print (df)
sales_id date time
0 ORD0056604 24/03/2021 45:13:45
1 ORD0096594 13/12/2018 54:22:20
print (pd.to_timedelta(df['time']))
0 1 days 21:13:45
1 2 days 06:22:20
Name: time, dtype: timedelta64[ns]
df['time'] = pd.to_timedelta(df['time']).astype(str).str[-8:]
print (df)
sales_id date time
0 ORD0056604 24/03/2021 21:13:45
1 ORD0096594 13/12/2018 06:22:20
If you also need to add the days to the date column, add the timedeltas to the dates and then extract the values with Series.dt.strftime:
dates = pd.to_datetime(df['date'], dayfirst=True) + pd.to_timedelta(df['time'])
df['time'] = dates.dt.strftime('%H:%M:%S')
df['date'] = dates.dt.strftime('%d/%m/%Y')
print (df)
sales_id date time
0 ORD0056604 25/03/2021 21:13:45
1 ORD0096594 15/12/2018 06:22:20
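If only hour % 24 is needed and the day carry-over can be ignored, a string-level sketch also works (this assumes the time column is always H+:MM:SS, which is an assumption about the data):

import pandas as pd

df = pd.DataFrame({'time': ['45:13:45', '54:22:20', '17:23:36']})

# Take the hour field, wrap it with % 24, and keep ':MM:SS' as-is.
hours = df['time'].str.split(':').str[0].astype(int) % 24
df['time'] = hours.map('{:02d}'.format) + df['time'].str[-6:]
print(df)
#        time
# 0  21:13:45
# 1  06:22:20
# 2  17:23:36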

Pandas Dataframe of dates with missing data selection acting strangely

When there is missing data in a Pandas DataFrame, indexing does not work as I would expect.
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                   'b': [datetime(2010, 1, 1), datetime(2014, 1, 1)]})
df > datetime(2012, 1, 1)
works as expected:
       a      b
0  False  False
1   True   True
but if there is a missing value
none_df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                        'b': [datetime(2010, 1, 1), None]})
none_df > datetime(2012, 1, 1)
the selection returns all True
      a     b
0  True  True
1  True  True
Am I doing something wrong? Is this desired behavior?
Python 3.5 64bit, Pandas 0.18.0, Windows 10
I agree that the behavior is unusual.
This is a work-around solution:
>>> none_df.apply(lambda col: col > datetime(2012, 1, 1))
       a      b
0  False  False
1   True  False
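Another workaround (a sketch): coerce the columns with pd.to_datetime before comparing, so the missing value becomes NaT, and NaT compares as False against any timestamp:

import pandas as pd
from datetime import datetime

none_df = pd.DataFrame({'a': [datetime(2011, 1, 1), datetime(2013, 1, 1)],
                        'b': [datetime(2010, 1, 1), None]})

# Force both columns to datetime64[ns]; None -> NaT.
coerced = none_df.apply(pd.to_datetime)
print(coerced > datetime(2012, 1, 1))
#        a      b
# 0  False  False
# 1   True  False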