How can I mask a portion of an array using NumPy?

What I want to do is "mask" a subset of an array of j elements, from range 0 to k. E.g., for this array:
[0.2, 0.1, 0.3, 0.4, 0.5]
Masking the first 2 elements it becomes
[NaN, NaN, 0.3, 0.4, 0.5]
Does masked_array support this operation?

In [51]: arr=np.ma.array([0.2, 0.1, 0.3, 0.4, 0.5],mask=[True,True,False,False,False])
In [52]: print(arr)
[-- -- 0.3 0.4 0.5]
Or, if you already have a numpy array, you could use np.ma.masked_less_equal (the numpy.ma documentation lists a variety of other functions for masking particular elements):
In [53]: arr=np.array([0.2, 0.1, 0.3, 0.4, 0.5])
In [56]: np.ma.masked_less_equal(arr,0.2)
Out[56]:
masked_array(data = [-- -- 0.3 0.4 0.5],
mask = [ True True False False False],
fill_value = 1e+20)
Or, if you wish to mask the first two elements:
In [67]: arr=np.array([0.2, 0.1, 0.3, 0.4, 0.5])
In [68]: arr=np.ma.array(arr,mask=False)
In [69]: arr.mask[:2]=True
In [70]: arr
Out[70]:
masked_array(data = [-- -- 0.3 0.4 0.5],
mask = [ True True False False False],
fill_value = 1e+20)

I found this:
np.ma.array([1,2,3,4], mask=[1,1,0,0])
masked_array(data = [-- -- 3 4],
mask = [ True True False False],
fill_value = 999999)
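Since the question asks for NaN in the masked positions, here is a minimal sketch of getting from a masked array back to a plain array with NaN fill. The cutoff `k` and the variable names are assumptions made for illustration, not part of the answers above.

```python
import numpy as np

arr = np.array([0.2, 0.1, 0.3, 0.4, 0.5])
k = 2  # hypothetical cutoff: mask the elements at indices 0 .. k-1

# Wrap the data in a masked array and mask the leading slice.
masked = np.ma.masked_array(arr, mask=False)
masked[:k] = np.ma.masked

# Masked entries are ignored by reductions such as mean().
mean_of_unmasked = masked.mean()

# filled() returns a plain ndarray with the masked slots replaced;
# NaN is only a valid fill value for floating-point data.
as_nan = masked.filled(np.nan)
print(masked)
print(as_nan)
```

The masked array keeps the original values underneath the mask, so the mask can be cleared later, whereas overwriting with NaN directly would discard them.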

Related

Tensorflow: Reshape a tensor according to a boolean mask

I have a 1D tensor of values:
a = tf.constant([0.1, 0.2, 0.3, 0.4])
and a nD boolean mask:
b = tf.constant([[1, 1, 0], [0, 1, 1]])
The total number of 1's in b matches the length of a.
How can I get [[0.1, 0.2, 0.0], [0.0, 0.3, 0.4]] from a and b?
import tensorflow as tf
a = tf.constant([0.1, 0.2, 0.3, 0.4])
b = tf.constant([[1, 1, 0], [0, 1, 1]])
# reshape b to a 1D vector
b_res = tf.reshape(b, [-1])
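# Get the indices to gather using cumsum; clamp at 0 so that a mask starting
# with 0 does not yield an index of -1, which tf.gather does not accept
b_cum = tf.maximum(tf.cumsum(b_res) - 1, 0)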
# Gather the elements, multiply by b_res to zero out the unwanted values and reshape back
c = tf.reshape(tf.gather(a, b_cum) * tf.cast(b_res, 'float32'), [-1, 3])
print(c)
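For comparison, the same cumsum-and-gather idea can be sketched in NumPy, which makes it easy to check the intermediate values. This is an illustrative translation, not TensorFlow code, and the boolean-assignment variant at the end is an alternative that only applies to NumPy arrays.

```python
import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4])
b = np.array([[1, 1, 0], [0, 1, 1]])

# Flatten the mask and turn it into gather indices with a running count.
flat = b.reshape(-1)
idx = np.maximum(np.cumsum(flat) - 1, 0)

# Gather, zero out the positions where the mask is 0, restore the shape.
c = (a[idx] * flat).reshape(b.shape)

# NumPy-only alternative: assign straight into the True positions.
c_alt = np.zeros(b.shape, dtype=a.dtype)
c_alt[b.astype(bool)] = a
print(c)
```

Both variants assume, as the question states, that the number of 1's in the mask equals the length of `a`.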

How to remove repeated samples from a time series in Pandas?

Currently I'm working with time series data in Pandas. The series are the 3D positions of several markers, so my Dataframe looks as follows:
[A.x, A.y, A.z, B.x, B.y, B.z, C.x, C.y, C.z ... etc.]
Now sometimes the system loses track of one of the markers, so the position stays the same over several frames. I want to set these values to NaN (to interpolate them later), but I can't figure out how to do this. So:
 A.x  A.y  A.z           A.x  A.y  A.z
[0.1, 0.2, 0.2]         [0.1, 0.2, 0.2]
[0.1, 0.2, 0.2]         [NaN, NaN, NaN]
[0.1, 0.2, 0.2]   ->    [NaN, NaN, NaN]
[0.3, 0.2, 0.2]         [0.3, 0.2, 0.2]  <- Kept because at least one position was different
[0.2, 0.2, 0.2]         [0.2, 0.2, 0.2]
[0.3, 0.2, 0.2]         [0.1, 0.2, 0.2]  <- Kept as it was not the same as the immediately preceding frame
Dropping duplicates doesn't work, as it does not look for "repeated" values but for duplicates in general. I think a solution that looks at 3 columns (so 1 point) at a time would be best?
Simple version below.
Generic Version:
import numpy as np
import pandas as pd
df = pd.DataFrame(
[
[0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
[0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
[0.1, 0.2, 0.2, 0.2, 0.2, 0.2],
[0.3, 0.2, 0.2, 0.3, 0.2, 0.2],
[0.2, 0.2, 0.2, 0.1, 0.2, 0.2],
[0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
[0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
],
columns="A.x A.y A.z B.x B.y B.z".split(),
)
# A.x A.y A.z B.x B.y B.z
# 0 0.1 0.2 0.2 0.3 0.2 0.2
# 1 0.1 0.2 0.2 0.3 0.2 0.2
# 2 0.1 0.2 0.2 0.2 0.2 0.2
# 3 0.3 0.2 0.2 0.3 0.2 0.2
# 4 0.2 0.2 0.2 0.1 0.2 0.2
# 5 0.3 0.2 0.2 0.1 0.2 0.2
# 6 0.3 0.2 0.2 0.1 0.2 0.2
# identify repeating data
diff = (df.values[:-1] == df.values[1:])
# [[ True, True, True, True, True, True],
# [ True, True, True, False, True, True],
# [False, True, True, False, True, True],
# [False, True, True, False, True, True],
# [False, True, True, True, True, True],
# [ True, True, True, True, True, True]]
allfalse = np.full((1, diff.shape[1]), False)
# [[False, False, False, False, False, False]]
# add allfalse as first row
diff2 = np.concatenate((allfalse, diff), axis=0)
# grouped into 3s
grouped = diff2.reshape(diff2.shape[0], diff2.shape[1] // 3, 3)
# [[[False, False, False], [False, False, False]],
# [[ True, True, True], [ True, True, True]],
# [[ True, True, True], [False, True, True]],
# [[False, True, True], [False, True, True]],
# [[False, True, True], [False, True, True]],
# [[False, True, True], [ True, True, True]],
# [[ True, True, True], [ True, True, True]]]
# mask for triplets
mask = np.all(grouped, axis=2)
# [[False, False],
# [ True, True],
# [ True, False],
# [False, False],
# [False, False],
# [False, True],
# [ True, True]]
grouped[~mask] = False
# [[[False, False, False], [False, False, False]],
# [[ True, True, True], [ True, True, True]],
# [[ True, True, True], [False, False, False]],
# [[False, False, False], [False, False, False]],
# [[False, False, False], [False, False, False]],
# [[False, False, False], [ True, True, True]],
# [[ True, True, True], [ True, True, True]]]
# finally reshape back into original shape
repeated = grouped.reshape(diff2.shape[0], diff2.shape[1])
# [[False, False, False, False, False, False],
# [ True, True, True, True, True, True],
# [ True, True, True, False, False, False],
# [False, False, False, False, False, False],
# [False, False, False, False, False, False],
# [False, False, False, True, True, True],
# [ True, True, True, True, True, True]]
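# set repeating values to NaN (df.mask returns a new frame; writing through
# df.values is unreliable because .values may be a copy or read-only)
df = df.mask(repeated)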
# A.x A.y A.z B.x B.y B.z
# 0 0.1 0.2 0.2 0.3 0.2 0.2
# 1 NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN 0.2 0.2 0.2
# 3 0.3 0.2 0.2 0.3 0.2 0.2
# 4 0.2 0.2 0.2 0.1 0.2 0.2
# 5 0.3 0.2 0.2 NaN NaN NaN
# 6 NaN NaN NaN NaN NaN NaN
Simple(r) Version:
import numpy as np
import pandas as pd
df = pd.DataFrame(
[
[0.1, 0.2, 0.2],
[0.1, 0.2, 0.2],
[0.1, 0.2, 0.2],
[0.3, 0.2, 0.2],
[0.2, 0.2, 0.2],
[0.3, 0.2, 0.2],
[0.3, 0.2, 0.2],
],
columns="A.x A.y A.z".split(),
)
# A.x A.y A.z
# 0 0.1 0.2 0.2
# 1 0.1 0.2 0.2
# 2 0.1 0.2 0.2
# 3 0.3 0.2 0.2
# 4 0.2 0.2 0.2
# 5 0.3 0.2 0.2
# 6 0.3 0.2 0.2
# compare consecutive rows for equality
diff = (df.values[:-1] == df.values[1:])
# [[ True, True, True],
# [ True, True, True],
# [False, True, True],
# [False, True, True],
# [False, True, True],
# [ True, True, True]]
# collapse rows into single value np.all(..., axis=1)
# make array len == number of rows in original DF
repeated = np.insert(np.all(diff, axis=1), 0, False)
# [False, True, True, False, False, False, True]
# modify df in-place (use .loc rather than df.values, which may be a copy or read-only)
df.loc[repeated] = np.nan
# A.x A.y A.z
# 0 0.1 0.2 0.2
# 1 NaN NaN NaN
# 2 NaN NaN NaN
# 3 0.3 0.2 0.2
# 4 0.2 0.2 0.2
# 5 0.3 0.2 0.2
# 6 NaN NaN NaN
I'm certain this can be done more elegantly and efficiently, but this is step 2 :)
I'll think about B.x... C.x part... will post update.
Enjoy!
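The same logic can also be sketched with pandas operations only, comparing each row with the previous one via shift(). The function name `mask_repeated_markers` and its `width` parameter are hypothetical, and the sketch assumes each marker occupies `width` adjacent columns, as in the example frames above.

```python
import numpy as np
import pandas as pd

def mask_repeated_markers(df, width=3):
    """Set a marker's columns to NaN where they all equal the previous row."""
    # Row 0 is compared against NaN, so it is never flagged as repeated.
    same = df.eq(df.shift())
    out = df.copy()
    for start in range(0, df.shape[1], width):
        cols = df.columns[start:start + width]
        # A marker is "frozen" only if every coordinate is unchanged.
        frozen = same[cols].all(axis=1)
        out.loc[frozen, cols] = np.nan
    return out

df = pd.DataFrame(
    [
        [0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
        [0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
        [0.1, 0.2, 0.2, 0.2, 0.2, 0.2],
        [0.3, 0.2, 0.2, 0.3, 0.2, 0.2],
        [0.2, 0.2, 0.2, 0.1, 0.2, 0.2],
        [0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
        [0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
    ],
    columns="A.x A.y A.z B.x B.y B.z".split(),
)
result = mask_repeated_markers(df)
print(result)
```

The comparison is made against the original previous row, not the already-masked one, so a run of several identical frames is masked in full after its first row. Exact equality is used, which suits frozen tracker output but not noisy data.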

TensorFlow: how do I get an output the same size as the input tensor after a segment sum?

I'm using the tf.unsorted_segment_sum method of TensorFlow and it works.
For example:
tf.unsorted_segment_sum(tf.constant([0.2, 0.1, 0.5, 0.7, 0.8]),
tf.constant([0, 0, 1, 2, 2]), 3)
Gives the right result:
array([ 0.3, 0.5 , 1.5 ], dtype=float32)
I want to get:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)
I've solved it.
data = tf.constant([0.2, 0.1, 0.5, 0.7, 0.8])
gr_idx = tf.constant([0, 0, 1, 2, 2])
y, idx, count = tf.unique_with_counts(gr_idx)
group_sum = tf.segment_sum(data, gr_idx)
group_sup = tf.gather(group_sum, idx)
answer:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)
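The pattern here is "sum per group, then index the sums with the group ids". As a rough NumPy sketch of that pattern (not TensorFlow code), np.bincount with weights can play the role of the segment sum, which is handy for checking expected values:

```python
import numpy as np

data = np.array([0.2, 0.1, 0.5, 0.7, 0.8])
gr_idx = np.array([0, 0, 1, 2, 2])

# Per-group sums; bincount requires non-negative integer group ids.
group_sum = np.bincount(gr_idx, weights=data)

# Index the sums with the group ids to get one value per input element.
per_element = group_sum[gr_idx]
print(per_element)
```

Indexing with `gr_idx` directly works because the group ids are already consecutive integers starting at 0; arbitrary labels would first need to be mapped to such ids.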

Add an extra column to ndarray in python

I have a ndarray as follows.
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
I have a position ndarray as follows.
position = [10, 20, 30]
Now I want to add the position value at the beginning of the feature_matrix as follows.
[[10, 0.1, 0.3], [20, 0.7, 0.8], [30, 0.8, 0.8]]
I tried the answers in this: How to add an extra column to an numpy array
E.g.,
feature_matrix = np.concatenate((feature_matrix, position), axis=1)
However, I get the following error:
ValueError: all the input arrays must have same number of dimensions
Please help me to resolve this problem.
This solved my problem. I used np.column_stack.
import numpy as np
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
position = [10, 20, 30]
feature_matrix = np.column_stack((position, feature_matrix))
The problem is the shape of the position array: it is 1-dimensional, while feature_matrix is 2-dimensional, and np.concatenate requires both inputs to have the same number of dimensions.
>>> feature_matrix
array([[ 0.1, 0.3],
[ 0.7, 0.8],
[ 0.8, 0.8]])
>>> position
array([10, 20, 30])
>>> position.reshape((3,1))
array([[10],
[20],
[30]])
The solution is (with np.concatenate):
>>> np.concatenate((position.reshape((3,1)), feature_matrix), axis=1)
array([[ 10. , 0.1, 0.3],
[ 20. , 0.7, 0.8],
[ 30. , 0.8, 0.8]])
But np.column_stack is clearly the simpler choice in your case!
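A short sketch that puts the two approaches side by side and shows the dimension mismatch behind the ValueError. The names `via_concat` and `via_stack` are only for illustration.

```python
import numpy as np

feature_matrix = np.array([[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]])
position = np.array([10, 20, 30])

# The mismatch behind the error: 2 dimensions versus 1.
print(feature_matrix.ndim, position.ndim)

# Option 1: give position a second axis so it becomes a (3, 1) column.
via_concat = np.concatenate((position[:, np.newaxis], feature_matrix), axis=1)

# Option 2: column_stack turns 1-D inputs into columns automatically.
via_stack = np.column_stack((position, feature_matrix))
print(via_stack)
```

Either way the integer positions are promoted to float, because a NumPy array holds a single dtype.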

numpy - multiply each element in array by a scaling factor

I have a numpy array of values, and a list of scaling factors by which I want to scale each value in the array, down each column:
values = [[ 0, 1, 2, 3 ],
[ 1, 1, 4, 3 ],
[ 2, 1, 6, 3 ],
[ 3, 1, 8, 3 ]]
ls_alloc = [ 0.1, 0.4, 0.3, 0.2]
# convert values into numpy array
import numpy as np
na_values = np.array(values, dtype=float)
Edit: To clarify:
na_values is a 2-dimensional array of stock cumulative returns (i.e., normalised to day 1), where each row represents a date and each column a stock. The data is returned as an array for each date.
I now want to scale each stock's cumulative return by its allocation in the portfolio. So for each date (i.e., each row of the ndarray values), apply the respective element from ls_alloc to the array element-wise.
# scale each value by its allocation
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
This does what I want, but I can't help but feel there must be a way to have numpy do this for me automatically?
That is, I feel:
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
# display na_components
na_components
[array([ 0. , 0.1, 0.2, 0.3]),
array([ 0.4, 0.4, 0.4, 0.4]),
array([ 0.6, 1.2, 1.8, 2.4]),
array([ 0.6, 0.6, 0.6, 0.6])]
should be able to be expressed as something like:
tmp = np.multiply(na_values, ls_alloc)
# display tmp
tmp
array([[ 0. , 0.4, 0.6, 0.6],
[ 0.1, 0.4, 1.2, 0.6],
[ 0.2, 0.4, 1.8, 0.6],
[ 0.3, 0.4, 2.4, 0.6]])
Is there a numpy function which will achieve what I want elegantly and succinctly?
Edit:
I see that my first solution has transposed my data, such that I am returned a list of ndarrays. na_components[0] now gives an ndarray of the stock values for the first stock, 1 element per date.
The next step that I perform with na_components is to calculate the total cumulative return for the portfolio by summing the individual components:
na_pfo_cum_ret = np.sum(na_components, axis=0)
This works with the list of individual stock return ndarrays.
That order seems a little odd to me, but IIUC, all you need to do is to transpose the result of multiplying na_values by array(ls_alloc):
>>> v
array([[ 0., 1., 2., 3.],
[ 1., 1., 4., 3.],
[ 2., 1., 6., 3.],
[ 3., 1., 8., 3.]])
>>> a
array([ 0.1, 0.4, 0.3, 0.2])
>>> (v*a).T
array([[ 0. , 0.1, 0.2, 0.3],
[ 0.4, 0.4, 0.4, 0.4],
[ 0.6, 1.2, 1.8, 2.4],
[ 0.6, 0.6, 0.6, 0.6]])
It's not completely clear to me what you want to do, but the answer is probably in Broadcasting rules. I think you want:
values = np.array( [[ 0, 1, 2, 3 ],
[ 1, 1, 4, 3 ],
[ 2, 1, 6, 3 ],
[ 3, 1, 8, 3 ]] )
ls_alloc = np.array([ 0.1, 0.4, 0.3, 0.2])
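and either (scales column j by ls_alloc[j], which matches your np.multiply example):
na_components = values * ls_alloc
or (scales row i by ls_alloc[i]):
na_components = values * ls_alloc[:,np.newaxis]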
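To make the difference between the two broadcasts concrete, here is a small sketch using the question's data. It also checks the column-wise product against the original list comprehension and computes the per-date portfolio total; the names `by_column`, `by_row` and `loop_components` are just illustrative.

```python
import numpy as np

na_values = np.array([[0, 1, 2, 3],
                      [1, 1, 4, 3],
                      [2, 1, 6, 3],
                      [3, 1, 8, 3]], dtype=float)
ls_alloc = np.array([0.1, 0.4, 0.3, 0.2])

# Shape (4,) lines up with the last axis: column j is scaled by ls_alloc[j].
by_column = na_values * ls_alloc

# Shape (4, 1) lines up with the first axis: row i is scaled by ls_alloc[i].
by_row = na_values * ls_alloc[:, np.newaxis]

# The question's loop builds one array per stock, i.e. the transpose.
loop_components = [ls_alloc[i] * na_values[:, i] for i in range(len(ls_alloc))]

# Portfolio cumulative return per date: sum the scaled stocks across each row.
na_pfo_cum_ret = by_column.sum(axis=1)
print(na_pfo_cum_ret)
```

With one allocation per stock and stocks in columns, the column-wise form is the one that matches the question. The per-date total is also the matrix-vector product `na_values @ ls_alloc`.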