NumPy, SciPy - how to calculate the z score for subsets of an array?

Using the array a below as an example, I am looking for a scalable way to calculate the z score of the last 2 columns a[:, 3:] separately for each value in the third column a[:,2].
In [52]: import numpy as np; from scipy import stats
In [53]: a = np.array([[0., 0., 0., 1., 2.], [0., 0., 1., 3., 4.],
    ...:               [1., 0., 0., 5., 6.], [1., 0., 1., 7., 8.],
    ...:               [2., 0., 0., 9., 6.], [2., 0., 1., 8., 9.],
    ...:               [3., np.nan, np.nan, np.nan, np.nan]])
In [54]: a
Out[54]:
array([[ 0.,  0.,  0.,  1.,  2.],
       [ 0.,  0.,  1.,  3.,  4.],
       [ 1.,  0.,  0.,  5.,  6.],
       [ 1.,  0.,  1.,  7.,  8.],
       [ 2.,  0.,  0.,  9.,  6.],
       [ 2.,  0.,  1.,  8.,  9.],
       [ 3., nan, nan, nan, nan]])
For the case where the third column is 0 (a[:,2] == 0), I can calculate it with:
In [48]: np.fromfunction(lambda i, j: stats.zscore(a[a[:,2] == 0][:,3:]), (1, 1))
Out[48]:
array([[-1.22474487, -1.41421356],
       [ 0.        ,  0.70710678],
       [ 1.22474487,  0.70710678]])
and for the case where the third column is 1 (a[:,2] == 1), I can calculate it with:
In [49]: np.fromfunction(lambda i, j: stats.zscore(a[a[:,2] == 1][:,3:]), (1, 1))
Out[49]:
array([[-1.38873015, -1.38873015],
       [ 0.46291005,  0.46291005],
       [ 0.9258201 ,  0.9258201 ]])
How can I augment my original array with these results, regardless of the number of rows and values in the third column, to create something like the following?
Out[62]:
array([[ 0.        ,  0.        ,  0.        ,  1.        ,  2.        ,
        -1.22474487, -1.41421356],
       [ 0.        ,  0.        ,  1.        ,  3.        ,  4.        ,
        -1.38873015, -1.38873015],
       [ 1.        ,  0.        ,  0.        ,  5.        ,  6.        ,
         0.        ,  0.70710678],
       [ 1.        ,  0.        ,  1.        ,  7.        ,  8.        ,
         0.46291005,  0.46291005],
       [ 2.        ,  0.        ,  0.        ,  9.        ,  6.        ,
         1.22474487,  0.70710678],
       [ 2.        ,  0.        ,  1.        ,  8.        ,  9.        ,
         0.9258201 ,  0.9258201 ],
       [ 3.        ,         nan,         nan,         nan,         nan,
                nan,         nan]])

You need to create a z-score array with the same number of rows as a (filled with nan), place each group's z scores into the matching rows, and use np.column_stack to combine them:
# np.fromfunction is not actually needed here;
# stats.zscore(a[a[:,2] == 0][:,3:]) alone returns the same result.
z1 = np.fromfunction(lambda i, j: stats.zscore(a[a[:,2] == 0][:,3:]), (1, 1))
z2 = np.fromfunction(lambda i, j: stats.zscore(a[a[:,2] == 1][:,3:]), (1, 1))
z = np.full((a.shape[0], z1.shape[1]), np.nan)  # one nan-filled row per row of a
z[::2][:z1.shape[0]] = z1   # rows where a[:,2] == 0 (every other row here)
z[1::2][:z2.shape[0]] = z2  # rows where a[:,2] == 1
arr1 = np.column_stack((a, z))
arr1
array([[ 0.        ,  0.        ,  0.        ,  1.        ,  2.        ,
        -1.22474487, -1.41421356],
       [ 0.        ,  0.        ,  1.        ,  3.        ,  4.        ,
        -1.38873015, -1.38873015],
       [ 1.        ,  0.        ,  0.        ,  5.        ,  6.        ,
         0.        ,  0.70710678],
       [ 1.        ,  0.        ,  1.        ,  7.        ,  8.        ,
         0.46291005,  0.46291005],
       [ 2.        ,  0.        ,  0.        ,  9.        ,  6.        ,
         1.22474487,  0.70710678],
       [ 2.        ,  0.        ,  1.        ,  8.        ,  9.        ,
         0.9258201 ,  0.9258201 ],
       [ 3.        ,         nan,         nan,         nan,         nan,
                nan,         nan]])
For n unique values in a[:,2], the same idea generalizes (the stride must be the number of groups, not a hard-coded 2):
N = np.unique(a[:,2])
N = N[~np.isnan(N)]                       # drop the nan group
zTemp = [stats.zscore(a[a[:,2] == k][:,3:]) for k in N]
z = np.full((a.shape[0], zTemp[0].shape[1]), np.nan)
for i in range(len(zTemp)):
    # stride by the number of groups, since the rows cycle through the group values
    z[i::len(zTemp)][:zTemp[i].shape[0]] = zTemp[i]
arr1 = np.column_stack((a, z))
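If the rows are not guaranteed to cycle through the group values in order, a mask-based assignment avoids the stride assumption entirely. A minimal sketch, using a hypothetical helper name zscore_by_group:
import numpy as np
from scipy import stats

def zscore_by_group(a, group_col=2, value_cols=slice(3, None)):
    # Append per-group z scores of value_cols, grouped by group_col.
    z = np.full((a.shape[0], a[:, value_cols].shape[1]), np.nan)
    groups = np.unique(a[:, group_col])
    for k in groups[~np.isnan(groups)]:
        mask = a[:, group_col] == k                   # rows belonging to group k
        z[mask] = stats.zscore(a[mask][:, value_cols])
    return np.column_stack((a, z))

arr1 = zscore_by_group(a)   # same arr1 as above, regardless of row order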

Related

`sklearn.preprocessing.normalize` (L2 norm) equivalent in Tensorflow or TFX

How can I do the L2 norm in TensorFlow? I'm looking for the equivalent of sklearn.preprocessing.normalize in TensorFlow or in tfx.
You can use tensorflow.keras.utils.normalize for the L2 norm as follows.
Using sklearn.preprocessing.normalize:
import sklearn.preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = sklearn.preprocessing.normalize(X, norm='l2')
X_normalized
Output:
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
Using tf.keras.utils.normalize gives the same output as above:
import tensorflow as tf
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
tf.keras.utils.normalize(X, order=2)
Output:
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
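tf.math.l2_normalize should give the same row-wise result while keeping the computation inside TensorFlow (a sketch, assuming TF 2.x eager execution for the .numpy() call):
import tensorflow as tf

X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
# Row-wise L2 normalization; returns a tf.Tensor rather than a NumPy array.
X_normalized = tf.math.l2_normalize(X, axis=1)
print(X_normalized.numpy())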

What does tensorflow.keras.preprocessing.text.Tokenizer.texts_to_matrix do?

Can someone explain what tokenizer.texts_to_matrix does and what the result is?
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")
text = 'The fool doth think he is wise, but the wise man knows himself to be a fool.'
sentences = [text]
print(sentences)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)
matrix = tokenizer.texts_to_matrix(sentences)
print(word_index)
print(sequences)
print(matrix)
---
['The fool doth think he is wise, but the wise man knows himself to be a fool.']
# word_index
{'<OOV>': 1, 'the': 2, 'fool': 3, 'wise': 4, 'doth': 5, 'think': 6, 'he': 7, 'is': 8, 'but': 9, 'man': 10, 'knows': 11, 'himself': 12, 'to': 13, 'be': 14, 'a': 15}
# sequences
[[2, 3, 5, 6, 7, 8, 4, 9, 2, 4, 10, 11, 12, 13, 14, 15, 3]]
# matrix
[[0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
In binary mode (the default), the matrix indicates which words from the learnt vocabulary occur in the input texts. You have trained your tokenizer on
['The fool doth think he is wise, but the wise man knows himself to be a fool.']
so when you convert the same text to a matrix, every vocabulary word is marked with 1 except <OOV>: all words are known, so position 1 of the result vector is 0 (see word_index). Position 0 is always 0, since word indices start at 1.
Some examples:
tokenizer.texts_to_matrix(['foo'])
# only OOV in this one text
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
tokenizer.texts_to_matrix(['he he'])
# known word, twice (does not matter how often)
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
tokenizer.texts_to_matrix(['the fool'])
array([[0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
Other modes
The other modes are more straightforward.
count - how many times each vocabulary word occurred in the text
tokenizer.texts_to_matrix(['He, he the fool'], mode="count")
array([[0., 0., 1., 1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 0., 0.,
        0.]])
freq - counts normalized so the row sums to 1.0
tokenizer.texts_to_matrix(['he he the fool'], mode="freq")
array([[0.  , 0.  , 0.25, 0.25, 0.  , 0.  , 0.  , 0.5 , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.  , 0.  , 0.  ]])
tfidf - TF-IDF weights (term counts scaled by inverse document frequency)
tokenizer.texts_to_matrix(['he he the fool'], mode="tfidf")
array([[0.        , 0.        , 0.84729786, 0.84729786, 0.        ,
        0.        , 0.        , 1.43459998, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ]])
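In every mode the row length is len(tokenizer.word_index) + 1, because index 0 is reserved. A quick sketch of mapping the nonzero columns back to words, reusing the tokenizer fitted above:
matrix = tokenizer.texts_to_matrix(['he he the fool'], mode='count')
assert matrix.shape == (1, len(tokenizer.word_index) + 1)

# Map nonzero columns back to words (index 0 has no word).
index_word = {i: w for w, i in tokenizer.word_index.items()}
for col in matrix[0].nonzero()[0]:
    print(index_word[col], matrix[0, col])  # the 1.0, fool 1.0, he 2.0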

numpy array concatenate with extra column to each array

I am trying to split a NumPy array into roughly equal parts and merge them back together with an extra value, but I'm confused about how to do this. I have a list: [0., 2.25, 4., 4., 4., 4., 4., 4., 4., 2.25], which after an np.array_split and a concatenation with an extra value should end up like: [0., 2.25, 4., 8., 4., 4., 4., 8., 4., 4., 8., 4., 2.25]
The steps I took:
>>> import numpy as np
>>> list = [0., 2.25, 4., 4., 4., 4., 4., 4., 4., 2.25]
>>> x = np.array(list)
>>> print(x)
[0. 2.25 4. 4. 4. 4. 4. 4. 4. 2.25]
>>> x = np.array_split(list, 4)
>>> print(x)
[array([0.  , 2.25, 4.  ]), array([4., 4., 4.]), array([4., 4.]),
 array([4.  , 2.25])]
>>> x = np.concatenate([x, 8])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: all the input arrays must have same number of dimensions
When I keep the arrays in the same shape, the extra value just gets added to the end of the list (with y = [8]):
>>> y = [8]
>>> x = np.concatenate([[x, y]])
>>> print(x)
[list([array([0.  , 2.25, 4.  ]), array([4., 4., 4.]), array([4., 4.]),
       array([4.  , 2.25])])
 list([8])]
I understand that this might be easier if I knew the shape of each individual array, so I could append a single extra value of 8 to each, but array_split doesn't produce equal sizes, as seen above.
Am I missing a step here, and is what I want to achieve even possible?
In [71]: alist = [0., 2.25, 4., 4., 4., 4., 4., 4., 4., 2.25]
In [72]: x = np.array(alist)
In [73]: xs = np.array_split(x, 4)
In [75]: xs
Out[75]:
[array([0.  , 2.25, 4.  ]),
 array([4., 4., 4.]),
 array([4., 4.]),
 array([4.  , 2.25])]
xs is a list of arrays of differing size; concatenate can join them along their single dimension, recreating x.
In [76]: np.concatenate(xs)
Out[76]: array([0. , 2.25, 4. , 4. , 4. , 4. , 4. , 4. , 4. , 2.25])
Notice what happens when I try to create an array from xs:
In [77]: np.array(xs)
Out[77]:
array([array([0.  , 2.25, 4.  ]), array([4., 4., 4.]), array([4., 4.]),
       array([4.  , 2.25])], dtype=object)
The result is a 1d object dtype array containing those arrays. But if the split had produced equal size arrays, the result would be 2d:
In [79]: np.array_split(x,5)
Out[79]:
[array([0.  , 2.25]),
 array([4., 4.]),
 array([4., 4.]),
 array([4., 4.]),
 array([4.  , 2.25])]
In [80]: np.array(np.array_split(x,5))
Out[80]:
array([[0.  , 2.25],
       [4.  , 4.  ],
       [4.  , 4.  ],
       [4.  , 4.  ],
       [4.  , 2.25]])
np.concatenate([xs, 8]) is really np.concatenate([np.array(xs), np.array(8)]), joining a 1d object array with a 0d integer array. Hence the dimension error.
To produce the array you want, you need to add the 8 to the desired arrays, and then concatenate.
In [84]: for i,v in enumerate(xs[:-1]):
    ...:     xs[i] = np.concatenate([v, [8]])
    ...:
In [85]: xs
Out[85]:
[array([0.  , 2.25, 4.  , 8.  ]),
 array([4., 4., 4., 8.]),
 array([4., 4., 8.]),
 array([4.  , 2.25])]
In [86]: np.concatenate(xs)
Out[86]:
array([0.  , 2.25, 4.  , 8.  , 4.  , 4.  , 4.  , 8.  , 4.  , 4.  , 8.  ,
       4.  , 2.25])
Or add 8 to all, and drop the last 8 after concatenation, as sketched below.
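A minimal sketch of that add-to-all variant, assuming x is the original array from In [72]:
xs = np.array_split(x, 4)
# Append 8 to every chunk, join, then drop the trailing 8.
out = np.concatenate([np.concatenate([v, [8]]) for v in xs])[:-1]
# array([0.  , 2.25, 4.  , 8.  , 4.  , 4.  , 4.  , 8.  , 4.  , 4.  ,
#        8.  , 4.  , 2.25])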
Could you not instead just do
np.insert(list, [3, 6, 8], [8])
array([0.  , 2.25, 4.  , 8.  , 4.  , 4.  , 4.  , 8.  , 4.  , 4.  ,
       8.  , 4.  , 2.25])
np.array_split produces a list of split arrays, so to get your desired result you would have to do
[np.concatenate((i, [8])) for i in x]
[array([0.  , 2.25, 4.  , 8.  ]),
 array([4., 4., 4., 8.]),
 array([4., 4., 8.]),
 array([4.  , 2.25, 8.  ])]
which also appends an 8 to the last chunk; drop it (or skip the last chunk) before the final concatenate.

Index variable range in numpy

I have a numpy zero matrix A of the shape (2, 5).
A = [[ 0.,  0.,  0.,  0.,  0.],
     [ 0.,  0.,  0.,  0.,  0.]]
I have another array seq of size 2, the same as the first axis of A.
seq = [2, 3]
I want to create another matrix B which looks like this:
B = [[ 1.,  1.,  0.,  0.,  0.],
     [ 1.,  1.,  1.,  0.,  0.]]
B is constructed by setting the first seq[i] elements of the ith row of A to 1.
This is a toy example; A and seq can be large, so efficiency is required. I would be extra thankful if someone knows how to do this in TensorFlow.
You can do this in TensorFlow (and with some analogous code in NumPy) as follows:
seq = [2, 3]
b = tf.expand_dims(tf.range(5), 0) # A 1 x 5 matrix.
seq_matrix = tf.expand_dims(seq, 1) # A 2 x 1 matrix.
b_bool = tf.greater(seq_matrix, b) # A 2 x 5 bool matrix.
B = tf.to_int32(b_bool) # A 2 x 5 int matrix.
Example output:
In [7]: b = tf.expand_dims(tf.range(5), 0)
[[0 1 2 3 4]]
In [21]: b_bool = tf.greater(seq_matrix, b)
In [22]: op = sess.run(b_bool)
In [23]: print(op)
[[ True  True False False False]
 [ True  True  True False False]]
In [24]: bint = tf.to_int32(b_bool)
In [25]: op = sess.run(bint)
In [26]: print(op)
[[1 1 0 0 0]
 [1 1 1 0 0]]
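This answer uses TF 1.x APIs (tf.to_int32, sess.run). In TF 2.x, tf.sequence_mask builds the same boolean matrix directly; a minimal sketch:
import tensorflow as tf

seq = [2, 3]
# True for the first seq[i] positions of row i, False elsewhere.
mask = tf.sequence_mask(seq, maxlen=5)
B = tf.cast(mask, tf.float32)
print(B.numpy())
# [[1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]]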
This is @mrry's solution, expressed a little differently:
In [667]: [[2],[3]]>np.arange(5)
Out[667]:
array([[ True,  True, False, False, False],
       [ True,  True,  True, False, False]], dtype=bool)
In [668]: ([[2],[3]]>np.arange(5)).astype(int)
Out[668]:
array([[1, 1, 0, 0, 0],
       [1, 1, 1, 0, 0]])
The idea is to compare [2,3] with [0,1,2,3,4] in an 'outer' broadcasting sense. The result is boolean, which is easily converted to 0/1 integers.
Another approach would be to use cumsum (or another ufunc.accumulate function):
In [669]: A=np.zeros((2,5))
In [670]: A[range(2),[2,3]]=1
In [671]: A
Out[671]:
array([[ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])
In [672]: A.cumsum(axis=1)
Out[672]:
array([[ 0.,  0.,  1.,  1.,  1.],
       [ 0.,  0.,  0.,  1.,  1.]])
In [673]: 1-A.cumsum(axis=1)
Out[673]:
array([[ 1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  0.,  0.]])
Or a variation starting with 1's:
In [681]: A=np.ones((2,5))
In [682]: A[range(2),[2,3]]=0
In [683]: A
Out[683]:
array([[ 1.,  1.,  0.,  1.,  1.],
       [ 1.,  1.,  1.,  0.,  1.]])
In [684]: np.minimum.accumulate(A,axis=1)
Out[684]:
array([[ 1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  1.,  0.,  0.]])

turn around sparse matrix

I have a sparse matrix like this:
>>> import numpy as np
>>> from scipy.sparse import *
>>> A = csr_matrix(np.identity(3))
>>> print A
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
For better understanding A is something like this:
>>> print A.todense()
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
And I would like to have an operator (let us call it op1(n)) doing this:
>>> A.op1(1)
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
=> it makes the last n columns the first n ones, so
>>> A == A.op1(3)
True
Is there some built-in solution (EDIT: one that returns a sparse matrix again)?
The solution with roll:
X = np.roll(X.todense(), -tau, axis=0)
print X.__class__
returns
<class 'numpy.matrixlib.defmatrix.matrix'>
scipy.sparse doesn't have roll, but you can simulate it with hstack:
from scipy.sparse import *
A = eye(3, 3, format='csr')
hstack((A[:, 1:], A[:, :1]), format='csr') # roll left
hstack((A[:, -1:], A[:, :-1]), format='csr') # roll right
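A quick check that the result stays sparse and matches np.roll on the dense form; rolling rows (as in the question's op1 output) works the same way with vstack and row slices (a sketch):
import numpy as np
from scipy.sparse import eye, hstack, vstack

A = eye(3, 3, format='csr')
rolled = hstack((A[:, 1:], A[:, :1]), format='csr')   # roll columns left by 1
print(rolled.__class__)                               # still a sparse matrix
assert (rolled.toarray() == np.roll(A.toarray(), -1, axis=1)).all()

row_rolled = vstack((A[1:, :], A[:1, :]), format='csr')
assert (row_rolled.toarray() == np.roll(A.toarray(), -1, axis=0)).all()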
>>> a = np.identity(3)
>>> a
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
>>> np.roll(a, -1, axis=0)
array([[ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 1.,  0.,  0.]])
>>> a == np.roll(a, 3, axis=0)
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)