Setting values in a matrix in bulk - indexing

The question is about bulk-changing values in a matrix based on data contained in a vector.
Suppose I have a 5x4 matrix of zeros.
octave> Z = zeros(5,4)
Z =
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
And a column vector y whose length equals the number of rows in Z, that is, 5. The rows of the vector y correspond to rows of the matrix Z.
octave> y = [1; 3; 2; 1; 3]
y =
1
3
2
1
3
What I want is to set 1's in the matrix Z: in each row, at the column whose index is the value in the corresponding row of the vector y. Namely, I'd like the Z matrix to look like this:
Z = # y =
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
0 1 0 0 # <-- 2 nd column
1 0 0 0 # <-- 1 st column
0 0 1 0 # <-- 3 rd column
Is there a concise way of doing this? I know I can implement it using a loop over y, but I have a feeling Octave has a more laconic way. I am new to Octave.

Since Octave has automatic broadcasting (you'll need Octave 3.6.0 or later), the easiest way I can think of is to combine it with a comparison. Here's how:
octave> 1:5 == [1 3 2 1 3]'
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
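One detail not shown in the answer above: comparing against 1:5 yields a 5x5 result. To match the 5x4 Z from the question, compare against 1:4 (in general, 1:columns(Z)):
octave> Z = (1:4) == y
Z =
1 0 0 0
0 0 1 0
0 1 0 0
1 0 0 0
0 0 1 0
The comparison returns a logical matrix; wrap it in double(...) if you need numeric values.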
Broadcasting is explained in the Octave manual, and SciPy's documentation also has a good explanation of it, with nice pictures.

Found another solution that does not use broadcasting. It does not need a matrix of zeros either:
octave> y = [1; 3; 2; 1; 3]
octave> eye(5)(y,:)
ans =
1 0 0 0 0
0 0 1 0 0
0 1 0 0 0
1 0 0 0 0
0 0 1 0 0
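The same shape note applies here (my addition, not in the original answer): eye(5)(y,:) is 5x5. Since the class indices in y only go up to 3, taking rows of a 4x4 identity gives the 5x4 version:
octave> eye(4)(y,:)
ans =
1 0 0 0
0 0 1 0
0 1 0 0
1 0 0 0
0 0 1 0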
Relevant reading here:
http://www.gnu.org/software/octave/doc/interpreter/Creating-Permutation-Matrices.html

How to select row value from given columns based on comparison of other column values in Pandas data frame?

I have the following Pandas DataFrame:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1
0 0 0.628205 0.371795 1 0.491648 0.508352
0 0 0.564113 0.435887 1 0.474973 0.525027
0 1 0.463897 0.536103 0 0.660307 0.339693
0 1 0.454559 0.545441 0 0.512349 0.487651
0 0 0.608345 0.391655 1 0.499531 0.500469
0 0 0.816127 0.183873 1 0.456669 0.543331
0 1 0.442693 0.557307 0 0.573354 0.426646
1 0 0.653497 0.346503 1 0.487212 0.512788
0 1 0.392380 0.607620 0 0.627419 0.372581
0 1 0.375816 0.624184 0 0.631532 0.368468
This is a collection of disagreeing ML model predictions with labels and label probabilities of two models (m1, m2) and the actual label (true_y).
Per row, I would like to take the hard label prediction (m1_labels or m2_labels) from whichever model assigns the higher probability to its own predicted class. So for row #1 I expect 0, since the m1 model has a higher probability for its prediction 0 than the m2 model has for its prediction 1. Basically, this is intended to be a manual voting ensemble of the two models.
How can I get this vector with a Pandas query?
You can use the apply function for this:
df.apply(
    lambda x: x["m1_labels"]
    if max(x["m1_probs_0"], x["m1_probs_1"]) > max(x["m2_probs_0"], x["m2_probs_1"])
    else x["m2_labels"],
    axis=1,
)
This selects the first model's label if the probability of its predicted class is higher than the probability of the second model's predicted class. Otherwise, it selects the label from the second model.
You can use:
# get max probability for m1
p1 = df.filter(like='m1_probs').max(axis=1)
# get max probability for m2
p2 = df.filter(like='m2_probs').max(axis=1)
# m1_label if it has a greater probability, else m2_label
df['best'] = df['m1_labels'].where(p1.gt(p2), df['m2_labels'])
output:
true_y m1_labels m1_probs_0 m1_probs_1 m2_labels m2_probs_0 m2_probs_1 best
0 0 0 0.628205 0.371795 1 0.491648 0.508352 0
1 0 0 0.564113 0.435887 1 0.474973 0.525027 0
2 0 1 0.463897 0.536103 0 0.660307 0.339693 0
3 0 1 0.454559 0.545441 0 0.512349 0.487651 1
4 0 0 0.608345 0.391655 1 0.499531 0.500469 0
5 0 0 0.816127 0.183873 1 0.456669 0.543331 0
6 0 1 0.442693 0.557307 0 0.573354 0.426646 0
7 1 0 0.653497 0.346503 1 0.487212 0.512788 0
8 0 1 0.392380 0.607620 0 0.627419 0.372581 0
9 0 1 0.375816 0.624184 0 0.631532 0.368468 0
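As a quick sanity check (my addition, not part of the original answer), the ensemble column can be scored against the actual labels:
# fraction of rows where the ensemble pick matches the true label
accuracy = df['best'].eq(df['true_y']).mean()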

A problem about the difference in SHA-1 logical functions between Wikipedia and FIPS 180-4

When calculating SHA-1, we need a sequence of logical functions, f0, f1, …, f79.
I noticed that the function definitions on Wikipedia and in the standard (FIPS 180-4) are different.
Oddly, when I used the ones from the standard, the SHA-1 result came out wrong.
I tried online SHA-1 calculators and found that everyone uses the functions as written on Wikipedia.
Why?
Here are the truth tables for both versions of 'choose' (rounds 0..19) and 'majority' (rounds 40..59); for 'parity' (rounds 20..39 and 60..79) both sources use XOR. In the tables, ∧ is AND, ¬ is NOT, 'ior' is the result of combining the terms with inclusive OR, and 'xor' with exclusive OR. Please identify the rows where the ior result differs from the xor result; those are the cases where the two formulas would produce different results.
Choose (Ch), rounds 0..19:

x y z | x∧y  ¬x∧z | ior  xor
0 0 0 |  0    0   |  0    0
0 0 1 |  0    1   |  1    1
0 1 0 |  0    0   |  0    0
0 1 1 |  0    1   |  1    1
1 0 0 |  0    0   |  0    0
1 0 1 |  0    0   |  0    0
1 1 0 |  1    0   |  1    1
1 1 1 |  1    0   |  1    1

Majority (Maj), rounds 40..59:

x y z | x∧y  x∧z  y∧z | ior  xor
0 0 0 |  0    0    0  |  0    0
0 0 1 |  0    0    0  |  0    0
0 1 0 |  0    0    0  |  0    0
0 1 1 |  0    0    1  |  1    1
1 0 0 |  0    0    0  |  0    0
1 0 1 |  0    1    0  |  1    1
1 1 0 |  1    0    0  |  1    1
1 1 1 |  1    1    1  |  1    1
Hint: there are no differences. The results are always the same, and it doesn't matter which formula you use, as long as you do it correctly you get the correct result.
In fact, on checking my copy of 180-4 this is even stated in section 4.1, immediately above the section you quoted:
... Each of the algorithms [for SHA-1, SHA-256 group, and SHA-512 group] include Ch(x, y, z)
and Maj(x, y, z) functions; the exclusive-OR operation (⊕) in these functions may be
replaced by a bitwise OR operation (∨) and produce identical results.
If something you did 'went wrong', it's because you did something wrong, but nobody here is psychic, so we have no idea what that was.
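For the skeptical, here is a quick exhaustive check (my addition, in Python) that the XOR and OR forms of Ch and Maj agree on every bit pattern:
from itertools import product

for x, y, z in product((0, 1), repeat=3):
    nx = x ^ 1  # NOT x for a single bit
    # Ch: (x AND y) combined with (NOT x AND z) by XOR vs. OR
    assert ((x & y) ^ (nx & z)) == ((x & y) | (nx & z))
    # Maj: the three pairwise ANDs combined by XOR vs. OR
    assert ((x & y) ^ (x & z) ^ (y & z)) == ((x & y) | (x & z) | (y & z))
print("XOR and OR forms agree for all inputs")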

How to define incomplete sets in GAMS?

There is an incomplete graph (e.g. with 5 vertices). The adjacency matrix "a" is available. I want to define a set which includes all the edges but excludes any other pair of vertices. That is, a pair of vertices belongs to the set of edges iff the corresponding element of matrix "a" is positive.
The last line of the following code does not work!
sets i "Set of vertices" /1*5/ ;
alias(i,j);
set a(i,j) "Adjacency matrix" ;
Table a(i,j)
1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j);
edges(i,j) = a(i,j)$(a(i,j)>0);
If you want the edges, define the parameter and the set like this:
sets i "Set of vertices" /1*5/ ;
alias(i,j);
Table a(i,j) "Adjacency matrix"
  1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j);
edges(i,j)$a(i,j) = yes;
You can simplify your last line to
edges(i,j) = a(i,j);
This automatically acts as if you wrote something like $(a<>0). However, since you declared your symbol a as a set already and not as a parameter, I think you actually do not have to do anything: a already is what you are looking for. Just do
display a;
and look at the result in the lst file.
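Putting both answers together, a minimal consolidated sketch (an assumption on my part: a is meant to hold numeric adjacency data, so it is declared once as a Table, i.e. a parameter, rather than also as a set):
sets i "Set of vertices" /1*5/ ;
alias(i,j);
Table a(i,j) "Adjacency matrix"
  1 2 3 4 5
1 0 1 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 1 0 0 0 1
5 1 0 0 1 0;
Set edges(i,j) "Edges of the graph";
edges(i,j) = a(i,j);
display edges;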

Undersampling for multilabel imbalanced datasets in pandas

I'm working on a roll-your-own undersampling function, since imblearn does not work neatly with multi-label classification (e.g. it only accepts a one-dimensional y).
I want to iterate through X and y, removing one of every few rows that belong to the majority class. The goal is a quick and dirty way to reduce the number of rows in the majority class.
def undersample(X, y):
    counter = 0
    for index, row in y.iterrows():
        if row['rectangle_here'] == 0:
            counter += 1
            if counter > 3:
                counter = 0
                X.drop(index, inplace=True)
                y.drop(index, inplace=True)
    return X, y
But it crashes my kernel on even a small number of rows (~30,000).
y looks something like this, where anytime f2 or f3 is present, f1 is also present.
So, let's count the number of times 0 occurs in f1 and delete a 0 row every 3rd time:
f1 f2 f3
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 1
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
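A vectorized sketch of what the loop is trying to do (my addition; it assumes X and y share the same index, and that all-zero label rows, like those above, form the majority class). Building a boolean mask once and slicing avoids the repeated inplace drops that make the loop blow up:
def undersample(X, y, drop_every=4):
    # majority rows: every label column is 0 (an assumption based on the sample y)
    majority = y.eq(0).all(axis=1)
    # number the majority rows 1, 2, 3, ... in order of appearance
    nth = majority.cumsum()
    # drop every drop_every-th majority row, keep everything else
    drop = majority & (nth % drop_every == 0)
    # assumes X and y are aligned on the same index
    return X.loc[~drop], y.loc[~drop]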

Pandas iterate max value of a variable length slice in a series

Let's assume I have a Pandas DataFrame as follows:
import pandas as pd

idx = ['2003-01-02', '2003-01-03', '2003-01-06', '2003-01-07',
       '2003-01-08', '2003-01-09', '2003-01-10', '2003-01-13',
       '2003-01-14', '2003-01-15', '2003-01-16', '2003-01-17',
       '2003-01-21', '2003-01-22', '2003-01-23', '2003-01-24',
       '2003-01-27']
a = pd.DataFrame([1, 2, 0, 0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 4, 5, 0, 1],
                 columns=['original'], index=pd.to_datetime(idx))
I am trying to get the max of each slice of that DataFrame between two zeros.
In this example I would get:
a['result'] = [0,2,0,0,0,0,3,0,0,0,0,0,0,0,5,0,1]
that is:
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
1. find the zeros
2. cumsum to make groups
3. mask the zeros into their own group, -1
4. find the max location in each group with idxmax
5. get rid of the entry for group -1, that was for the zeros anyway
6. get a.original for the found max locations, reindex, and fill with zeros
m = a.original.eq(0)                            # 1. the zeros
g = a.original.groupby(m.cumsum().mask(m, -1))  # 2./3. groups between zeros; zeros go to group -1
i = g.idxmax().drop(-1)                         # 4./5. max location per group, drop group -1
a.assign(result=a.loc[i, 'original'].reindex(a.index, fill_value=0))  # 6. fill the rest with 0
original result
2003-01-02 1 0
2003-01-03 2 2
2003-01-06 0 0
2003-01-07 0 0
2003-01-08 1 0
2003-01-09 2 0
2003-01-10 3 3
2003-01-13 0 0
2003-01-14 0 0
2003-01-15 0 0
2003-01-16 1 0
2003-01-17 2 0
2003-01-21 3 0
2003-01-22 4 0
2003-01-23 5 5
2003-01-24 0 0
2003-01-27 1 1
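An equivalent formulation with groupby.transform (my sketch, not from the original answer; note that unlike idxmax it would mark every occurrence of the group max, should a group contain ties):
grp = a.original.eq(0).cumsum()                      # a new group starts at each zero
gmax = a.original.groupby(grp).transform('max')      # group max, broadcast back to every row
a['result'] = a.original.where(a.original.eq(gmax) & a.original.ne(0), 0)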