How do I take an elementwise OR of several matrices in Julia?

I have several Boolean matrices, and I want a resulting matrix that indicates whether any of the elements in a given position across those matrices is true. Is there a single function in Julia that computes an elementwise OR over an arbitrary number of matrices?
# My data
a = Bool[1 0; 1 1]
b = Bool[0 0; 1 1]
c = Bool[0 0; 0 0]
d = Bool[0 0; 1 1]
# Arrays of Bool Arrays
z1 = [a]
z2 = [a, b]
z3 = [b, c, d]
z4 = [a, b, c, d]
z100 = [rand(Bool, 2, 2) for i in 1:100]
# Expected
julia> some_function(z1)
2×2 BitMatrix:
1 0
1 1
julia> some_function(z2)
2×2 BitMatrix:
1 0
1 1
julia> some_function(z3)
2×2 BitMatrix:
0 0
1 1
julia> some_function(z4)
2×2 BitMatrix:
1 0
1 1
julia> some_function(z100)
2×2 BitMatrix:
1 1
1 1
This question was originally asked on Julia Slack.

The straightforward approach to this in Julia uses broadcasting. The OR operator in Julia is |. To broadcast an operator like this, we can prefix it with a dot as follows.
julia> .|(a)
2×2 BitMatrix:
1 0
1 1
julia> .|(a,b)
2×2 BitMatrix:
1 0
1 1
julia> .|(a,b,c)
2×2 BitMatrix:
1 0
1 1
julia> .|(a,b,c,d)
2×2 BitMatrix:
1 0
1 1
Indicating each matrix manually is tedious. To avoid this, we can use the splat operator (...), which expands the elements of an iterator into distinct arguments for the function being called.
julia> .|(z1...)
2×2 BitMatrix:
1 0
1 1
julia> .|(z2...)
2×2 BitMatrix:
1 0
1 1
julia> .|(z3...)
2×2 BitMatrix:
0 0
1 1
julia> .|(z4...)
2×2 BitMatrix:
1 0
1 1
julia> .|(z100...)
2×2 BitMatrix:
1 1
1 1
Note that broadcasting expands arguments as needed, so the arguments do not all have to be the same shape.
julia> .|(z4..., [1 0])
2×2 Matrix{Int64}:
1 0
1 1
julia> .|(z4..., [0 1])
2×2 Matrix{Int64}:
1 1
1 1
julia> .|(z4..., [0, 1])
2×2 Matrix{Int64}:
1 0
1 1
julia> .|(z4..., [1, 0])
2×2 Matrix{Int64}:
1 1
1 1
julia> .|(z4..., 0)
2×2 Matrix{Int64}:
1 0
1 1
julia> .|(z4..., 1)
2×2 Matrix{Int64}:
1 1
1 1
Since the above solutions use broadcasting, they are quite general. If we constrain the problem so that all the Boolean matrices must be the same size, then we can take advantage of short-circuit evaluation: once we find a 1 (true) value in some position, we do not need to check that position in subsequent matrices. To implement this we will use the any function along with an array comprehension.
julia> short_circuit_or(z...) = short_circuit_or(z)
short_circuit_or (generic function with 1 method)
julia> short_circuit_or(z::Union{Tuple, AbstractVector}) = [
any(x->x[ind], z) for ind in CartesianIndices(first(z))
]
short_circuit_or (generic function with 2 methods)
julia> short_circuit_or(a,b,c)
2×2 Matrix{Bool}:
1 0
1 1
julia> short_circuit_or(z4...)
2×2 Matrix{Bool}:
1 0
1 1
julia> short_circuit_or(z1)
2×2 Matrix{Bool}:
1 0
1 1
julia> short_circuit_or(z2)
2×2 Matrix{Bool}:
1 0
1 1
julia> short_circuit_or(z3)
2×2 Matrix{Bool}:
0 0
1 1
julia> short_circuit_or(z4)
2×2 Matrix{Bool}:
1 0
1 1
julia> short_circuit_or(z100)
2×2 Matrix{Bool}:
1 1
1 1
As demonstrated by these benchmarks, short-circuit evaluation can save time.
julia> using BenchmarkTools
julia> @btime .|($z100...)
3.032 ms (24099 allocations: 1.91 MiB)
2×2 BitMatrix:
1 1
1 1
julia> @btime short_circuit_or($z100)
76.413 ns (1 allocation: 96 bytes)
2×2 Matrix{Bool}:
1 1
1 1

Related

Split list column in a dataframe into separate 1/0 entry columns [duplicate]

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:
index groups
0 ['a','b','c']
1 ['c']
2 ['b','c','e']
3 ['a','c']
4 ['b','e']
And what I would like to do is create a series of dummy columns to identify which groups each user belongs to in order to run some analyses
index a b c d e
0 1 1 1 0 0
1 0 0 1 0 0
2 0 1 1 0 1
3 1 0 1 0 0
4 0 1 0 0 0
pd.get_dummies(df['groups'])
won't work because that just returns a column for each different list in my column.
The solution needs to be efficient as the dataframe will contain 500,000+ rows.
Using s for your df['groups']:
In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })
In [22]: s
Out[22]:
0 [a, b, c]
1 [c]
2 [b, c, e]
3 [a, c]
4 [b, e]
dtype: object
This is a possible solution:
In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
The logic of this is (see the sketch after this list):
.apply(pd.Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies() creates the dummies
.sum(level=0) re-merges the rows that belong to the same original row (by summing over the second level, keeping only the original level, level=0)
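For illustration, here is a small self-contained sketch of those intermediate steps (outputs abridged in the comments). The last step is spelled with groupby(level=0), which is equivalent to the .sum(level=0) above and also works on newer pandas versions where .sum(level=...) was removed.
import pandas as pd
s = pd.Series({0: ['a', 'b', 'c'], 1: ['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e']})
expanded = s.apply(pd.Series)   # one column per list position, NaN where a list is shorter:
#    0    1    2
# 0  a    b    c
# 1  c  NaN  NaN
# ...
stacked = expanded.stack()      # back to a single column, with a (row, position) MultiIndex:
# 0  0    a
#    1    b
#    2    c
# 1  0    c
# ...
dummies = pd.get_dummies(stacked)        # one 0/1 column per group label
result = dummies.groupby(level=0).sum()  # one row per original row, as in Out[23]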
A slightly different but equivalent spelling is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
Whether this will be efficient enough, I don't know, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.
Very fast solution in case you have a large dataframe
Using sklearn.preprocessing.MultiLabelBinarizer
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
         ['c'],
         ['b','c','e'],
         ['a','c'],
         ['b','e']]
    }, columns=['groups'])
s = df['groups']
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)
Result:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
This worked for me, and was also suggested here and here
This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)
Using .explode() instead of .apply(pd.Series).stack()
Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
# m1 m2 m3
# ms 5.586517 3.821662 2.547167
Even though this question was already answered, I have a faster solution:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
And, in case you have empty groups or NaN, you could just:
df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
How it works
Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:
In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]:
a 1
b 1
c 1
dtype: int64
When all the pd.Series come together, they form a pd.DataFrame whose columns are the union of their indexes; a missing index value becomes a column filled with NaN, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]:
a b c d
0 1.0 1.0 1.0 NaN
1 1.0 1.0 NaN 1.0
Now fillna fills those NaN with 0:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]:
a b c d
0 1.0 1.0 1.0 0.0
1 1.0 1.0 0.0 1.0
And downcast='infer' downcasts the result from float to int:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]:
a b c d
0 1 1 1 0
1 1 1 0 1
P.S.: The use of .fillna(0, downcast='infer') is not strictly required.
You can use explode and crosstab:
s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])
s = s.explode()
pd.crosstab(s.index, s)
Output:
col_0 a b c e
row_0
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
You can use str.join to join all elements of each list in the series into a string, and then use str.get_dummies:
out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)
groups a b c e
0 [a, b, c] 1 1 1 0
1 [c] 0 0 1 0
2 [b, c, e] 0 1 1 1
3 [a, c] 1 0 1 0
4 [b, e] 0 1 0 1

How to compare 2 dataframes columns and add a value to a new dataframe based on the result

I have 2 dataframes of the same length, and I'd like to compare a specific column between them. Whichever dataframe has the bigger value in the first column, I'd like to take its value in the second column and assign it to a new dataframe.
See example. The first dataframe:
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
The second dataframe:
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
The new dataframe should look like:
class
0 0
1 0
2 1
3 1
4 1
Use numpy.where with DataFrame constructor:
df = pd.DataFrame({'class': np.where(df1[0] > df2[0], df1['class'], df2['class'])})
Or DataFrame.where:
df = df1[['class']].where(df1[0] > df2[0], df2[['class']])
print (df)
class
0 0
1 0
2 1
3 1
4 1
EDIT:
If there is another condition use numpy.select and if necessary numpy.isclose
print (df2)
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 1.9 1
masks = [df1[0] == df2[0], df1[0] > df2[0]]
#if need compare floats in some accuracy
#masks = [np.isclose(df1[0], df2[0]), df1[0] > df2[0]]
vals = ['not_determined', df1['class']]
df = pd.DataFrame({'class': np.select(masks, vals, df2['class'])})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Or:
masks = [df1[0] == df2[0], df1[0] > df2[0]]
vals = ['not_determined', 1]
df = pd.DataFrame({'class': np.select(masks, vals, 1)})
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
An out-of-the-box solution:
df = np.sign(df1[0].sub(df2[0])).map({1:0, -1:1, 0:'not_determined'}).to_frame('class')
print (df)
class
0 0
1 0
2 1
3 1
4 not_determined
Since class is 0 and 1, you could try:
df1[0].lt(df2[0]).astype(int)
For generic solutions, check jezrael's answer.
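The expression above returns a Series; if you want a one-column dataframe like the expected output, a minimal sketch (reconstructing the example frames from the question) might look like this:
import pandas as pd
# the two frames from the question
df1 = pd.DataFrame({0: [1.9, 9.8, 4.5, 8.1, 1.9], 'class': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame({0: [1.4, 7.8, 8.5, 9.1, 3.9], 'class': [1, 1, 1, 1, 1]})
# 1 where df2 has the bigger value in column 0, else 0, wrapped into a 'class' column
df3 = df1[0].lt(df2[0]).astype(int).to_frame('class')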
Try this one:
>>> import numpy as np
>>> import pandas as pd
>>> df_1
0 class
0 1.9 0
1 9.8 0
2 4.5 0
3 8.1 0
4 1.9 0
>>> df_2
0 class
0 1.4 1
1 7.8 1
2 8.5 1
3 9.1 1
4 3.9 1
>>> df_3=pd.DataFrame()
>>> df_3["class"]=np.where(df_1["0"]>df_2["0"], df_1["class"], df_2["class"])
>>> df_3
class
0 0
1 0
2 1
3 1
4 1

Strange behavior of pandas DataFrame.agg

The Lambdas in the following code return the same Series, but the aggregation results are different. Why?
import pandas as pd
df=pd.DataFrame([1, 2])
print(df)
print(df.agg({0: lambda x: x.cumsum()}))
print(df.agg({0: lambda x: pd.Series([1, 3], name=0)}))
Which gives:
   0
0  1
1  2

   0
0  1
1  3

   0
   0  1
0  1  3
1  1  3

How to groupby and count on different conditions?

is_correct, question_id
t 1
t 1
f 1
f 1
t 2
t 2
Desired results:
correct_count, incorrect_count, question_id
2 2 1
2 0 2
This is what I have, but I can only get a correct count
df[df["is_correct"]].groupby("question_id")["question_id"].count()
You can use the pivot_table function for that:
In [28]: data = """\
....: is_correct question_id
....: t 1
....: t 1
....: f 1
....: f 1
....: t 2
....: t 2
....: """
In [29]: df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
In [30]: df['count'] = 0
In [31]:
In [31]: df
Out[31]:
is_correct question_id count
0 t 1 0
1 t 1 0
2 f 1 0
3 f 1 0
4 t 2 0
5 t 2 0
In [32]:
In [32]: df.pivot_table(index='question_id', columns='is_correct',
....: values='count', aggfunc='count', fill_value=0)\
....: .reset_index()
Out[32]:
is_correct question_id f t
0 1 2 2
1 2 0 2
You can use a groupby after creating another column that you will use for counting:
df = pd.DataFrame({'is_correct':['t','t','f','f','t','t'],'question_id':[1,1,1,1,2,2]})
df['to_sum_up']=1
is_correct question_id to_sum_up
t 1 1
t 1 1
f 1 1
f 1 1
t 2 1
t 2 1
df2 = df.groupby(['question_id','is_correct'],as_index = False).sum()
Once you've made your groupby, you just have to rearrange your data so that it fits the columns you want:
df2['correct_count'] = df2.loc[df2['is_correct']=='t','to_sum_up']
df2['incorrect_count'] = df2.loc[df2['is_correct']=='f','to_sum_up']
Then in order to have a nice dataframe as output:
df2.loc[df2['correct_count'].isnull(),'correct_count'] = 0
df2.loc[df2['incorrect_count'].isnull(),'incorrect_count'] = 0
df2 = df2.groupby('question_id',as_index = False).max()
df2 = df2.drop(['to_sum_up','is_correct'], axis=1)
question_id correct_count incorrect_count
0 1 2 2
1 2 2 0

populate a dense dataframe given a key-value dataframe

I have a key-value dataframe:
pd.DataFrame(columns=['X','Y','val'],data= [['a','z',5],['b','g',3],['b','y',6],['e','r',9]])
> X Y val
0 a z 5
1 b g 3
2 b y 6
3 e r 9
Which I'd like to convert into a denser dataframe:
X z g y r
0 a 5 0 0 0
1 b 0 3 6 0
2 e 0 0 0 9
Before I resort to a pure-python I was wondering if there was a simple way to do this with pandas.
You can use get_dummies:
In [11]: dummies = pd.get_dummies(df['Y'])
In [12]: dummies
Out[12]:
g r y z
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
and then multiply by the val column:
In [13]: res = dummies.mul(df['val'], axis=0)
In [14]: res
Out[14]:
g r y z
0 0 0 0 5
1 3 0 0 0
2 0 0 6 0
3 0 9 0 0
To fix the index, you could add X into the index; to do this, first apply set_index:
In [21]: df1 = df.set_index('X', append=True)
In [22]: df1
Out[22]:
Y val
X
0 a z 5
1 b g 3
2 b y 6
3 e r 9
In [23]: dummies = pd.get_dummies(df1['Y'])
In [24]: dummies.mul(df1['val'], axis=0)
Out[24]:
g r y z
X
0 a 0 0 0 5
1 b 3 0 0 0
2 b 0 0 6 0
3 e 0 9 0 0
If you wanted to do this pivot (you can also use pivot_table):
In [31]: df.pivot('X', 'Y').fillna(0)
Out[31]:
val
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
Perhaps you want to reset_index, to make X a column (I'm not sure whether that makes sense):
In [32]: df.pivot('X', 'Y').fillna(0).reset_index()
Out[32]:
X val
Y g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
For completeness, the pivot_table:
In [33]: df.pivot_table('val', 'X', 'Y', fill_value=0)
Out[33]:
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
In [34]: df.pivot_table('val', 'X', 'Y', fill_value=0).reset_index()
Out[34]:
Y X g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
Note: the columns are named Y after resetting the index; I'm not sure if this makes sense (and it's easy to rectify via res.columns.name = None).
If you want something that feels more direct, something akin to DataFrame.lookup but for np.put might make sense.
def lookup_index(self, row_labels, col_labels):
    values = self.values
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise ValueError('One or more row labels was not found')
    if (cidx == -1).any():
        raise ValueError('One or more column labels was not found')
    flat_index = ridx * len(self.columns) + cidx
    return flat_index
flat_index = lookup_index(df, vals.X, vals.Y)
np.put(df.values, flat_index, vals.val.values)
This assumes that df has the appropriate columns and index to hold the X/Y values. Here's an ipython notebook http://nbviewer.ipython.org/6454120
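For completeness, here is a minimal sketch of the setup this assumes, using the lookup_index function above. The zero-filled target frame is my assumption about the intended shape, and writing through df.values relies on it being a writable view of the underlying data, which newer pandas versions (with copy-on-write) may not guarantee.
import numpy as np
import pandas as pd

# the key-value data from the question
vals = pd.DataFrame(columns=['X', 'Y', 'val'],
                    data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# assumed setup: a zero-filled frame whose index/columns hold the X/Y values
df = pd.DataFrame(0, index=vals['X'].unique(), columns=vals['Y'].unique())

flat_index = lookup_index(df, vals.X, vals.Y)   # flat positions into df.values
np.put(df.values, flat_index, vals.val.values)  # write the values in place
# df should now look roughly like:
#    z  g  y  r
# a  5  0  0  0
# b  0  3  6  0
# e  0  0  0  9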