Optimized way to apply multiple conditions on pandas dataframe column values - pandas

I am applying multiple filters on a dataframe at the same time.
data_df[(data_df['1']!=0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']==0)]
I need to know: is there any optimized way to do this? I want to compare one column's value as !=0 and the others as ==0, repeated for each column, and there could be more than 5 columns. So, all the operations will be:
data_df[(data_df['1']==0) & (data_df['2']!=0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']!=0) & (data_df['4']==0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']!=0) & (data_df['5']==0)]
data_df[(data_df['1']==0) & (data_df['2']==0) & (data_df['3']==0) & (data_df['4']==0) & (data_df['5']!=0)]
Looking for a short and optimized method.

Based on the statements below:
Looking for a short and optimized method
and
I want to compare one column's value as !=0 and others value as =0
You can use ne and eq together with drop on axis=1 to drop column '1':
data_df[data_df['1'].ne(0) & data_df.drop('1', axis=1).eq(0).all(axis=1)]
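If you need the same filter for every column in turn, here is a small sketch (assuming string column labels as in the question) that builds the ==0 comparison once and reuses it per column:
import numpy as np
import pandas as pd

def one_hot_filters(df, cols=('1', '2', '3', '4', '5')):
    # compare against zero once, then reuse the boolean frame for each column
    eq0 = df[list(cols)].eq(0)
    return {c: df[~eq0[c] & eq0.drop(columns=c).all(axis=1)] for c in cols}

np.random.seed(2020)
data_df = pd.DataFrame(np.random.randint(2, size=(10, 5)), columns=list('12345'))
print(one_hot_filters(data_df)['2'])  # rows where only column '2' is nonzero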

We can first compute a boolean dataframe, so that for the given columns we only test once whether each value equals zero:
df_bool = df[['1', '2', '3', '4', '5']] == 0
Next we can use this as a mask:
df[~df_bool['1'] & df_bool[['2', '3', '4', '5']].all(axis=1)]

One idea is to compare against a numpy array of ones and zeros and test whether all values match with numpy.all (note this matches ==1/==0 exactly, so it assumes the data only contains 0s and 1s, as in the sample below):
#test list - all 0, first 1
L = [1,0,0,0,0]
df = data_df[np.all(data_df == np.array(L), axis=1)]
Or use DataFrame.merge by one row DataFrame:
df = data_df.merge(pd.DataFrame([L], columns=data_df.columns))
Sample:
np.random.seed(2020)
data_df = pd.DataFrame(np.random.randint(2, size=(100, 5)), columns=list('12345'))
#print (data_df)
df = data_df[np.all(data_df == np.array(L), axis=1)]
print (df)
    1  2  3  4  5
2   1  0  0  0  0
13  1  0  0  0  0
44  1  0  0  0  0
58  1  0  0  0  0
70  1  0  0  0  0
89  1  0  0  0  0
Or:
L = [1,0,0,0,0]
df = data_df.merge(pd.DataFrame([L], columns=data_df.columns))
print (df)
   1  2  3  4  5
0  1  0  0  0  0
1  1  0  0  0  0
2  1  0  0  0  0
3  1  0  0  0  0
4  1  0  0  0  0
5  1  0  0  0  0
The merge solution is best used with a helper DataFrame that contains all the combinations:
df1 = pd.DataFrame(0, index=data_df.columns, columns=data_df.columns)
np.fill_diagonal(df1.to_numpy(), 1)
print (df1)
   1  2  3  4  5
1  1  0  0  0  0
2  0  1  0  0  0
3  0  0  1  0  0
4  0  0  0  1  0
5  0  0  0  0  1
df = data_df.merge(df1.loc[['1']])
print (df)
   1  2  3  4  5
0  1  0  0  0  0
1  1  0  0  0  0
2  1  0  0  0  0
3  1  0  0  0  0
4  1  0  0  0  0
5  1  0  0  0  0
df = data_df.merge(df1.loc[['2']])
print (df)
   1  2  3  4  5
0  0  1  0  0  0
1  0  1  0  0  0
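If you need all the subsets at once, a short sketch can loop the merge over every row of the helper df1 above (like the merge approach itself, this assumes the data only contains 0s and 1s):
# one merge per identity row; each key is the column that must be 1
parts = {c: data_df.merge(df1.loc[[c]]) for c in df1.index}
print(parts['3'])  # rows of data_df matching the pattern 0, 0, 1, 0, 0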

Related

Is there a way to loop through a dataframe and, based on a single column's values, mark a value into multiple new columns in Pandas?

The dataframe would look something similar to this:
start = [0,2,4,5,1]
end = [3,5,5,5,2]
df = pd.DataFrame({'start': start,'end': end})
The result I want looks something like this:
Basically I am marking a value from start to finish across multiple columns. So if a row starts on 0 and ends on 3, I want to mark the new columns 0 through 3 with a value (1) and the rest with 0.
start = [0,2,4,5,1]
end = [3,5,5,5,2]
diff = [3,3,1,0,1]
col_0 = [1,0,0,0,0]
col_1=[1,0,0,0,1]
col_2 = [1,1,0,0,1]
col_3=[1,1,0,0,0]
col_4=[0,1,1,0,0]
col_5=[0,1,1,1,0]
df = pd.DataFrame({'start': start,'end': end, 'col_0':col_0, 'col_1': col_1, 'col_2': col_2, 'col_3':col_3, 'col_4': col_4, 'col_5': col_5})
start  end  col_0  col_1  col_2  col_3  col_4  col_5
    0    3      1      1      1      1      0      0
    2    5      0      0      1      1      1      1
    4    5      0      0      0      0      1      1
    5    5      0      0      0      0      0      1
    1    2      0      1      1      0      0      0
Use dict.fromkeys in a list comprehension for each row of the DataFrame and pass the result to the DataFrame constructor if performance is important:
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('col_').fillna(0).astype(int))
print (df)
   start  end  col_0  col_1  col_2  col_3  col_4  col_5
0      0    3      1      1      1      1      0      0
1      2    5      0      0      1      1      1      1
2      4    5      0      0      0      0      1      1
3      5    5      0      0      0      0      0      1
4      1    2      0      1      1      0      0      0
If it is possible that some range value is missing, like column 6 in the changed sample data below, add DataFrame.reindex:
#missing column 6
start = [0,2,4,7,1]
end = [3,5,5,8,2]
df = pd.DataFrame({'start': start,'end': end})
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df1 = (pd.DataFrame(L, index=df.index)
         .reindex(columns=range(df['start'].min(), df['end'].max() + 1), fill_value=0)
         .add_prefix('col_')
         .fillna(0)
         .astype(int))
df = df.join(df1)
print (df)
   start  end  col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7  col_8
0      0    3      1      1      1      1      0      0      0      0      0
1      2    5      0      0      1      1      1      1      0      0      0
2      4    5      0      0      0      0      1      1      0      0      0
3      7    8      0      0      0      0      0      0      0      1      1
4      1    2      0      1      1      0      0      0      0      0      0
EDIT: To count hours, use:
start = pd.to_datetime([0,2,4,5,1], format='%H')
end = pd.to_datetime([3,5,5,5,2], format='%H')
df = pd.DataFrame({'start': start,'end': end})
df.loc[[0,1], 'end'] += pd.Timedelta(1, 'day')
#list for hours datetimes
L = [dict.fromkeys(pd.date_range(s, e, freq='H'), 1) for s, e in zip(df['start'], df['end'])]
df1 = pd.DataFrame(L, index=df.index)
#aggregate sum by hours in columns
df1 = df1.groupby(df1.columns.hour, axis=1).sum().astype(int)
print (df1)
   0  1  2  3  4  5  6  7  8  9  ...  14  15  16  17  18  19  20  \
0  2  2  2  2  1  1  1  1  1  1  ...   1   1   1   1   1   1   1
1  1  1  2  2  2  2  1  1  1  1  ...   1   1   1   1   1   1   1
2  0  0  0  0  1  1  0  0  0  0  ...   0   0   0   0   0   0   0
3  0  0  0  0  0  1  0  0  0  0  ...   0   0   0   0   0   0   0
4  0  1  1  0  0  0  0  0  0  0  ...   0   0   0   0   0   0   0

   21  22  23
0   1   1   1
1   1   1   1
2   0   0   0
3   0   0   0
4   0   0   0

[5 rows x 24 columns]
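Note that groupby(..., axis=1) is deprecated in recent pandas; an equivalent sketch transposes the frame, groups the rows by hour, and transposes back:
# same hourly aggregation without axis=1
df1 = pd.DataFrame(L, index=df.index)
df1 = df1.T.groupby(df1.columns.hour).sum().T.astype(int)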
Convert your range from start to stop to a list of indices then explode it. Finally, use indexing to set values to 1:
import numpy as np
range_to_ind = lambda x: range(x['start'], x['end']+1)
(i, j) = df.apply(range_to_ind, axis=1).explode().astype(int).reset_index().values.T
a = np.zeros((df.shape[0], max(df['end'])+1), dtype=int)
a[i, j] = 1
df = df.join(pd.DataFrame(a).add_prefix('col_'))
Output:
>>> df
   start  end  col_0  col_1  col_2  col_3  col_4  col_5
0      0    3      1      1      1      1      0      0
1      2    5      0      0      1      1      1      1
2      4    5      0      0      0      0      1      1
3      5    5      0      0      0      0      0      1
4      1    2      0      1      1      0      0      0
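A pure-numpy broadcasting sketch (assuming integer start/end values as in the sample) produces the same 0/1 matrix without building index pairs:
import numpy as np
import pandas as pd

start = np.array([0, 2, 4, 5, 1])
end = np.array([3, 5, 5, 5, 2])
cols = np.arange(end.max() + 1)
# cell (r, c) is 1 exactly when c lies inside [start[r], end[r]]
mark = ((cols >= start[:, None]) & (cols <= end[:, None])).astype(int)
df = pd.DataFrame({'start': start, 'end': end}).join(
        pd.DataFrame(mark).add_prefix('col_'))
print(df)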

Pandas iloc and conditional sum

This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
   0  1  2  3  4
0  0  1  0  1  1
1  1  0  1  0  1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
   0  1  2  3  4
0  0  1  0  1  2
1  1  0  1  0  2
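If you prefer to stay in pandas, a sketch with Series.mask applied column-wise does the same thing (mask replaces values where the condition is True):
# replace every nonzero entry with its column's sum; zeros stay zero
data.iloc[:, 1:] = data.iloc[:, 1:].apply(lambda s: s.mask(s != 0, s.sum()))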

How to remove duplicate columns generated after using pd.get_dummies using their variance as cutoff

I have a dataframe which is being generated using pd.get_dummies as below:
df_target = pd.get_dummies(df_column[column], dummy_na=True,prefix=column)
where column is a column name and df_column is the dataframe from which each column is being pulled to do some operations.
rev_grp_m2_> 225  rev_grp_m2_nan  rev_grp_m2_nan
               0               0               0
               0               0               0
               0               0               0
               0               0               0
               0               0               0
               0               0               0
               0               0               0
               1               0               0
               0               0               0
               0               0               0
               0               0               0
               0               0               0
Now I do a check of variance for each column generated and skip those with zero variance.
for target_column in list(df_target.columns):
    # If variance of the dummy created is zero: append it to a list and print to log file.
    if np.var(df_target_attribute[[target_column]])[0] != 0:
        df_final[target_column] = df_target[target_column]
Here, due to the two columns having the same name, I get a KeyError on the np.var line.
There are two values of variance for the nan column:
rev_grp_m2_nan    0.000819
rev_grp_m2_nan    0.000000
Ideally I would like to take the one with non-zero variance and drop/skip the one with 0 var.
Can someone please help me do this?
Use DataFrame.var:
print (df.var())
rev_grp_m2_> 225    0.083333
rev_grp_m2_nan      0.000000
rev_grp_m2_nan      0.000000
Then filter with boolean indexing:
out = df.loc[:, df.var()!= 0]
print (out)
    rev_grp_m2_> 225
0                  0
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0
7                  1
8                  0
9                  0
10                 0
11                 0
EDIT: You can get the indices of the columns with non-zero variance and then select them with iloc:
cols = [i for i in np.arange(len(df.columns)) if np.var(df.iloc[:, i]) != 0]
print (cols)
[0]
df = df.iloc[:, cols]
print (df)
    rev_grp_m2_> 225
0                  0
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0
7                  1
8                  0
9                  0
10                 0
11                 0
Another idea is to filter out columns where all values are 0:
cols = [i for i in np.arange(len(df.columns)) if (df.iloc[:, i] != 0).any()]
out = df.iloc[:, cols]
Or:
out = df.loc[:, (df != 0).any()]
print (out)
    rev_grp_m2_> 225
0                  0
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0
7                  1
8                  0
9                  0
10                 0
11                 0
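To also resolve the duplicated rev_grp_m2_nan label itself, a sketch can filter by variance positionally (avoiding label alignment on duplicated columns) and then keep only the first occurrence of each remaining label:
# a boolean numpy mask is applied by position, so duplicate labels are safe
mask = (df.var() != 0).to_numpy()
out = df.loc[:, mask]
out = out.loc[:, ~out.columns.duplicated()]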

How to turn a list of events into a matrix to display in Pandas

I have a list of events and I want to display on a graph how many happen per hour on each day of the week, as shown below:
[Example of the graph I want]
(each line is a day, the x axis is the time of day, the y axis is the number of events)
As I am new to Pandas I am not sure of the best way to do it, but here is my approach:
x = [(rts[k].getDay(), rts[k].getHour(), 1) for k in rts]
df = pd.DataFrame(x[:30]) # Subset of 30 events
dfGrouped = df.groupby([0, 1]).sum() # Group them by day and hour
#Format to display
pd.DataFrame(np.random.randn(24, 7), index=range(0,24), columns=['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su'])
The question is: how can I go from my grouped dataframe to the 24x7 matrix required for display? I tried as_matrix, but that gives me only a one-dimensional array, while I want the index of my dataframe to be the index of my matrix.
print(df)
       2
0 1
0 19   1
  23   1
1 10   2
  18   3
  22   1
2 17   1
3 8    2
  9    3
  11   3
  13   1
  19   1
4 7    1
  9    1
  14   1
  15   1
  18   1
5 1    2
  7    1
  13   1
  19   1
6 12   1
Thanks for your help :)
Antoine
I think you need unstack to reshape the data, then rename the column names with a dict, and if necessary add missing hours to the index with reindex_axis (deprecated in modern pandas; use reindex instead):
df1 = df.groupby([0, 1])[2].sum().unstack(0, fill_value=0)
Or set the column names first:
df = pd.DataFrame(x[:30], columns=['days', 'hours', 'val'])
d = {0: 'Mo', 1: 'Tu', 2: 'We', 3: 'Th', 4: 'Fr', 5: 'Sa', 6: 'Su'}
df1 = df.groupby(['days', 'hours'])['val'].sum().unstack(0, fill_value=0)
df1 = df1.rename(columns=d).reindex_axis(range(24), fill_value=0)
print (df1)
days   Mo  Tu  We  Th  Fr  Sa  Su
hours
0       0   0   0   0   0   0   0
1       0   0   0   0   0   2   0
2       0   0   0   0   0   0   0
3       0   0   0   0   0   0   0
4       0   0   0   0   0   0   0
5       0   0   0   0   0   0   0
6       0   0   0   0   0   0   0
7       0   0   0   0   1   1   0
8       0   0   0   2   0   0   0
9       0   0   0   3   1   0   0
10      0   2   0   0   0   0   0
11      0   0   0   3   0   0   0
12      0   0   0   0   0   0   1
13      0   0   0   1   0   1   0
14      0   0   0   0   1   0   0
15      0   0   0   0   1   0   0
16      0   0   0   0   0   0   0
17      0   0   1   0   0   0   0
18      0   3   0   0   1   0   0
19      1   0   0   1   0   1   0
20      0   0   0   0   0   0   0
21      0   0   0   0   0   0   0
22      0   1   0   0   0   0   0
23      1   0   0   0   0   0   0
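An alternative sketch with pivot_table (on the named-columns frame from above) collapses the groupby/unstack pair into one call, and the result can be plotted directly:
# rows = hours, columns = days, cells = summed event counts
df1 = (df.pivot_table(index='hours', columns='days', values='val',
                      aggfunc='sum', fill_value=0)
         .rename(columns=d)
         .reindex(range(24), fill_value=0))
df1.plot()  # one line per day, hour of day on the x axis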

How to create dummy variables on Ordinal columns in Python

I am new to Python. I have created dummy columns on a categorical column using pandas get_dummies. How do I create dummy columns on an ordinal column (say a column Rating with values 1, 2, 3, ..., 10)?
Consider the dataframe df
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
  Cats  Ords
0    a     3
1    b     2
2    c     1
3    d     0
4    c     1
5    b     2
6    a     3
pd.get_dummies works the same on either column
with df.Cats
pd.get_dummies(df.Cats)
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  0  0  1  0
5  0  1  0  0
6  1  0  0  0
with df.Ords
pd.get_dummies(df.Ords)
   0  1  2  3
0  0  0  0  1
1  0  0  1  0
2  0  1  0  0
3  1  0  0  0
4  0  1  0  0
5  0  0  1  0
6  0  0  0  1
with both
pd.get_dummies(df)
   Ords  Cats_a  Cats_b  Cats_c  Cats_d
0     3       1       0       0       0
1     2       0       1       0       0
2     1       0       0       1       0
3     0       0       0       0       1
4     1       0       0       1       0
5     2       0       1       0       0
6     3       1       0       0       0
Notice that it split out Cats but not Ords
Let's expand on this by adding another Cats2 column and calling pd.get_dummies
pd.get_dummies(df.assign(Cats2=df.Cats))
   Ords  Cats_a  Cats_b  Cats_c  Cats_d  Cats2_a  Cats2_b  Cats2_c  Cats2_d
0     3       1       0       0       0        1        0        0        0
1     2       0       1       0       0        0        1        0        0
2     1       0       0       1       0        0        0        1        0
3     0       0       0       0       1        0        0        0        1
4     1       0       0       1       0        0        0        1        0
5     2       0       1       0       0        0        1        0        0
6     3       1       0       0       0        1        0        0        0
Interesting, it splits both object columns but not the numeric one.
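So pd.get_dummies only auto-encodes the object columns. To split a numeric ordinal column as well, pass it explicitly via the columns parameter (or cast it to str or category first):
# force dummies for the numeric Ords column too
pd.get_dummies(df, columns=['Cats', 'Ords'])
This yields Cats_a through Cats_d together with Ords_0 through Ords_3, treating the ordinal values as categories.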