How to remove repeated samples from a time series in Pandas? - pandas

Currently I'm working with time series data in Pandas. The series are the 3D positions of several markers, so my DataFrame looks as follows:
[A.x, A.y, A.z, B.x, B.y, B.z, C.x, C.y, C.z ... etc.]
Now sometimes the system loses track of one of the markers, so the position stays the same over several frames. I want to set these values to NaN (to interpolate them later), but I can't figure out how to do this. So:
 A.x  A.y  A.z           A.x  A.y  A.z
[0.1, 0.2, 0.2]         [0.1, 0.2, 0.2]
[0.1, 0.2, 0.2]         [NaN, NaN, NaN]
[0.1, 0.2, 0.2]   ->    [NaN, NaN, NaN]
[0.3, 0.2, 0.2]         [0.3, 0.2, 0.2]  <- Kept because at least one coordinate was different
[0.2, 0.2, 0.2]         [0.2, 0.2, 0.2]
[0.3, 0.2, 0.2]         [0.3, 0.2, 0.2]  <- Kept as it was not the same as the immediately preceding frame
Dropping duplicates doesn't work, because drop_duplicates looks for duplicates anywhere in the data rather than for consecutive repeats. I think a solution that looks at 3 columns (i.e. one marker) at a time would be best?

A generic version first; a simpler single-marker version follows below it.
Generic Version:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    [
        [0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
        [0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
        [0.1, 0.2, 0.2, 0.2, 0.2, 0.2],
        [0.3, 0.2, 0.2, 0.3, 0.2, 0.2],
        [0.2, 0.2, 0.2, 0.1, 0.2, 0.2],
        [0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
        [0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
    ],
    columns="A.x A.y A.z B.x B.y B.z".split(),
)
# A.x A.y A.z B.x B.y B.z
# 0 0.1 0.2 0.2 0.3 0.2 0.2
# 1 0.1 0.2 0.2 0.3 0.2 0.2
# 2 0.1 0.2 0.2 0.2 0.2 0.2
# 3 0.3 0.2 0.2 0.3 0.2 0.2
# 4 0.2 0.2 0.2 0.1 0.2 0.2
# 5 0.3 0.2 0.2 0.1 0.2 0.2
# 6 0.3 0.2 0.2 0.1 0.2 0.2
# identify repeating data
diff = (df.values[:-1] == df.values[1:])
# [[ True, True, True, True, True, True],
# [ True, True, True, False, True, True],
# [False, True, True, False, True, True],
# [False, True, True, False, True, True],
# [False, True, True, True, True, True],
# [ True, True, True, True, True, True]]
allfalse = np.full((1, diff.shape[1]), False)
# [[False, False, False, False, False, False]]
# add allfalse as first row
diff2 = np.concatenate((allfalse, diff), axis=0)
# grouped into 3s
grouped = diff2.reshape(diff2.shape[0], diff2.shape[1] // 3, 3)
# [[[False, False, False], [False, False, False]],
# [[ True, True, True], [ True, True, True]],
# [[ True, True, True], [False, True, True]],
# [[False, True, True], [False, True, True]],
# [[False, True, True], [False, True, True]],
# [[False, True, True], [ True, True, True]],
# [[ True, True, True], [ True, True, True]]]
# mask for triplets
mask = np.all(grouped, axis=2)
# [[False, False],
# [ True, True],
# [ True, False],
# [False, False],
# [False, False],
# [False, True],
# [ True, True]]
grouped[~mask] = False
# [[[False, False, False], [False, False, False]],
# [[ True, True, True], [ True, True, True]],
# [[ True, True, True], [False, False, False]],
# [[False, False, False], [False, False, False]],
# [[False, False, False], [False, False, False]],
# [[False, False, False], [ True, True, True]],
# [[ True, True, True], [ True, True, True]]]
# finally reshape back into original shape
repeated = grouped.reshape(diff2.shape[0], diff2.shape[1])
# [[False, False, False, False, False, False],
# [ True, True, True, True, True, True],
# [ True, True, True, False, False, False],
# [False, False, False, False, False, False],
# [False, False, False, False, False, False],
# [False, False, False, True, True, True],
# [ True, True, True, True, True, True]]
# set repeating values to NaN
# note: assigning through df.values relies on the frame being a single float block;
# with newer pandas (copy-on-write) this may not modify df in place
df.values[repeated] = np.nan
# A.x A.y A.z B.x B.y B.z
# 0 0.1 0.2 0.2 0.3 0.2 0.2
# 1 NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN 0.2 0.2 0.2
# 3 0.3 0.2 0.2 0.3 0.2 0.2
# 4 0.2 0.2 0.2 0.1 0.2 0.2
# 5 0.3 0.2 0.2 NaN NaN NaN
# 6 NaN NaN NaN NaN NaN NaN
Simple(r) Version:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    [
        [0.1, 0.2, 0.2],
        [0.1, 0.2, 0.2],
        [0.1, 0.2, 0.2],
        [0.3, 0.2, 0.2],
        [0.2, 0.2, 0.2],
        [0.3, 0.2, 0.2],
        [0.3, 0.2, 0.2],
    ],
    columns="A.x A.y A.z".split(),
)
# A.x A.y A.z
# 0 0.1 0.2 0.2
# 1 0.1 0.2 0.2
# 2 0.1 0.2 0.2
# 3 0.3 0.2 0.2
# 4 0.2 0.2 0.2
# 5 0.3 0.2 0.2
# 6 0.3 0.2 0.2
# element-wise equality between consecutive rows
diff = (df.values[:-1] == df.values[1:])
# [[ True, True, True],
# [ True, True, True],
# [False, True, True],
# [False, True, True],
# [False, True, True],
# [ True, True, True]]
# collapse each row into a single value with np.all(..., axis=1),
# then prepend False so the length matches the number of rows in the original df
repeated = np.insert(np.all(diff, axis=1), 0, False)
# [False, True, True, False, False, False, True]
# modify df in-place
df.values[repeated] = [np.nan, np.nan, np.nan]
# A.x A.y A.z
# 0 0.1 0.2 0.2
# 1 NaN NaN NaN
# 2 NaN NaN NaN
# 3 0.3 0.2 0.2
# 4 0.2 0.2 0.2
# 5 0.3 0.2 0.2
# 6 NaN NaN NaN
I'm certain this can be done prettier and more efficient, but this is step 2 :)
I'll think about B.x... C.x part... will post update.
Enjoy!
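For reference, a more pandas-idiomatic sketch of the same idea that also covers the B.x ... C.x part. This is an untested sketch: it assumes every marker has exactly the three columns <name>.x, <name>.y, <name>.z, and the helper name mask_repeated_markers is made up for illustration:
import numpy as np
import pandas as pd

def mask_repeated_markers(df):
    """Set a marker's x/y/z to NaN wherever all three equal the previous frame."""
    out = df.copy()
    markers = sorted({c.split('.')[0] for c in df.columns})
    for m in markers:
        cols = [f"{m}.x", f"{m}.y", f"{m}.z"]
        # True where the whole triplet is identical to the preceding row
        repeated = (df[cols] == df[cols].shift()).all(axis=1)
        out.loc[repeated, cols] = np.nan
    return out

# df_masked = mask_repeated_markers(df)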

Related

How to understand the self-attention mask implementation in google transformer tutorial

I am reading Google's transformer tutorial, and the part where the attention_mask for multi-head attention is built via mask1 & mask2 was unclear to me. Any help would be great!
def call(self, x, training, mask):
    # A boolean mask.
    if mask is not None:
        mask1 = mask[:, :, None]
        mask2 = mask[:, None, :]
        attention_mask = mask1 & mask2  # <= here
    else:
        attention_mask = None
    # Multi-head self-attention output (`tf.keras.layers.MultiHeadAttention`).
    attn_output = self.mha(
        query=x,  # Query Q tensor.
        value=x,  # Value V tensor.
        key=x,  # Key K tensor.
        attention_mask=attention_mask,  # A boolean mask that prevents attention to certain positions.
        training=training,  # A boolean indicating whether the layer should behave in training mode.
    )
toy example breakdown
input = tf.constant([
    [[1, 0, 3, 0], [1, 2, 0, 0]]
])
mask = tf.keras.layers.Embedding(2,2, mask_zero=True).compute_mask(input)
print(mask)
mask1 = mask[:, :, None] # same as tf.expand_dims(mask, axis = 2)
print(mask1)
mask2 = mask[:, None, :]
print(mask2)
print(mask1 & mask2)
>
tf.Tensor(
[[[ True False True False]
[ True True False False]]], shape=(1, 2, 4), dtype=bool)
tf.Tensor(
[[[[ True False True False]]
[[ True True False False]]]], shape=(1, 2, 1, 4), dtype=bool)
tf.Tensor(
[[[[ True False True False]
[ True True False False]]]], shape=(1, 1, 2, 4), dtype=bool)
<tf.Tensor: shape=(1, 2, 2, 4), dtype=bool, numpy= # <= why built mask like this?
array([[[[ True, False, True, False],
[ True, False, False, False]],
[[ True, False, False, False],
[ True, True, False, False]]]])>
The following is my understanding. Correct me if I'm wrong.
I think the key to understanding the computation of the attention mask is the difference between the attention_mask for multi-head attention and the embedding mask generated by the embedding layer.
tf.keras.layers.Embedding is a mask-generating layer.
With input shape of (batch_size, input_length), tf.keras.layers.Embedding generates the embedding mask with the same shape (batch_size, input_length), (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding#input-shape);
tf.keras.layers.MultiHeadAttention is a mask-consuming layer.
When the output tensor of tf.keras.layers.Embedding is passed to tf.keras.layers.MultiHeadAttention, the embedding mask also needs to be passed to that layer. But tf.keras.layers.MultiHeadAttention expects an "attention_mask", which is different from the embedding mask: "attention_mask" is a boolean mask of shape (B, T, S) (https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention#call-arguments_1), where B is batch_size, T is Target (Query) and S is Source (Key).
To compute the attention mask for self-attention, we basically need an outer product (https://en.wikipedia.org/wiki/Outer_product). That is, for a row token mask $X$ we compute $X^T X$; the result is a matrix with one entry per query/key pair, and the attention mask takes exactly this shape.
The & operator in mask1 & mask2 is tf.math.logical_and.
A basic example to understand the attention mask in tf.keras.layers.MultiHeadAttention
sequence_a = "This is a very long sequence"
sequence_b = "This is short"
text = (sequence_a + ' ' + sequence_b).split(' ')
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(text)
print(le.classes_)
['This' 'a' 'is' 'long' 'sequence' 'short' 'very']
_tokens_a = le.transform(sequence_a.split(' ')) + 1 # 1-based
# print(_tokens_a)
_tokens_b = le.transform(sequence_b.split(' ')) + 1
# print(_tokens_b)
pad_b = tf.constant([[0,_tokens_a.size - _tokens_b.size]])
tokens_b = tf.pad(_tokens_b, pad_b)
tokens_a = tf.constant(_tokens_a)
print(tokens_a)
tf.Tensor([1 3 2 7 4 5], shape=(6,), dtype=int64)
print(tokens_b)
tf.Tensor([1 3 6 0 0 0], shape=(6,), dtype=int64)
padded_batch = tf.concat([tokens_a[None,:], tokens_b[None,:]], axis=0)
padded_batch # Shape `(batch_size, input_seq_len)`.
Tokenization result:
<tf.Tensor: shape=(2, 6), dtype=int64, numpy=
array([[1, 3, 2, 7, 4, 5],
[1, 3, 6, 0, 0, 0]])>
Embedding mask and attention mask:
embedding = tf.keras.layers.Embedding(10, 4, mask_zero=True)
embedding_batch = embedding(padded_batch)
embedding_batch
<tf.Tensor: shape=(2, 6, 4), dtype=float32, numpy=
array([[[-0.0395105 , 0.02781621, -0.02362361, 0.01861998],
[ 0.02881015, 0.03395045, -0.0079098 , -0.002824 ],
[ 0.02268535, -0.02632991, 0.03217204, -0.03376112],
[ 0.04794324, 0.01584867, 0.02413819, 0.01202248],
[-0.03509659, 0.04907972, -0.00174795, -0.01215838],
[-0.03295932, 0.02424154, -0.04788723, -0.03202241]],
[[-0.0395105 , 0.02781621, -0.02362361, 0.01861998],
[ 0.02881015, 0.03395045, -0.0079098 , -0.002824 ],
[-0.02425164, -0.04932282, 0.0186419 , -0.01743554],
[-0.00052293, 0.01411307, -0.01286217, 0.00627784],
[-0.00052293, 0.01411307, -0.01286217, 0.00627784],
[-0.00052293, 0.01411307, -0.01286217, 0.00627784]]],
dtype=float32)>
embedding_mask = embedding_batch._keras_mask # embedding.compute_mask(padded_batch)
embedding_mask
<tf.Tensor: shape=(2, 6), dtype=bool, numpy=
array([[ True, True, True, True, True, True],
[ True, True, True, False, False, False]])>
#This is self attention, thus Q and K are the same
my_mask1 = embedding_mask[:, :, None] # eq: td[:,:,tf.newaxis]
my_mask1
<tf.Tensor: shape=(2, 6, 1), dtype=bool, numpy=
array([[[ True],
[ True],
[ True],
[ True],
[ True],
[ True]],
[[ True],
[ True],
[ True],
[False],
[False],
[False]]])>
#This is self attention, thus Q and K are the same
my_mask2 = embedding_mask[:, None, :]
my_mask2
<tf.Tensor: shape=(2, 1, 6), dtype=bool, numpy=
array([[[ True, True, True, True, True, True]],
[[ True, True, True, False, False, False]]])>
#According to the `attention_mask` argument of `tf.keras.layers.MultiHeadAttention` (https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention#call-arguments_1), this is the attention_mask which is a boolean mask of shape (B, T, S)
my_attention_mask = my_mask1 & my_mask2
my_attention_mask #[batch_size, input_seq_len, input_seq_len]
<tf.Tensor: shape=(2, 6, 6), dtype=bool, numpy=
array([[[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True]],
[[ True, True, True, False, False, False],
[ True, True, True, False, False, False],
[ True, True, True, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False],
[False, False, False, False, False, False]]])>

XGBoostClassifier for multiclass and RandomizedSearchCV give nan scores and same probabilities for all classes

X is a DataFrame with some categorical and continuous variables; y is a multi-class variable with 5 classes. No variables are of 'object' dtype and there are no NaNs in the DataFrame.
Code is as follows.
params = {'max_depth': np.arange(3, 20, 1),
          'learning_rate': np.arange(0.01, 0.5, 0.01),
          'subsample': np.arange(0.5, 1.0, 0.1),
          'colsample_bytree': np.arange(0.4, 1.0, 0.1),
          'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
          'gamma': np.arange(0, 10, 0.05),
          'reg_alpha': np.arange(0, 80, 1),
          'reg_lambda': np.arange(0, 1, 0.1)}
scoring = {'f1': make_scorer(f1_score, needs_proba=True, multi_class="ovr")}
#I plan to eventually add more score metrics
xgbc = xgb.XGBClassifier(seed = 20, eval_metric='mlogloss')
clf = RandomizedSearchCV(estimator=xgbc, param_distributions=params, scoring=scoring,
                         n_iter=10, verbose=False, n_jobs=1, refit='f1', return_train_score=False,
                         random_state=30, cv=10)
clf.fit(X, y)
print(clf.cv_results_)
print(clf.predict_proba(X))
#Need estimated probabilities for each class for each row of X
print("Best parameters:", clf.best_params_) ```
The output is:
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]
[0.2 0.2 0.2 0.2 0.2]]
Best parameters: {'subsample': 0.7, 'reg_lambda': 0.30000000000000004, 'reg_alpha': 58, 'n_estimators': 1000, 'max_depth': 13, 'learning_rate': 0.17, 'gamma': 7.1000000000000005, 'colsample_bytree': 0.4, 'colsample_bylevel': 0.6}
'split8_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'split9_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
'mean_test_f1': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]),
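The all-nan scores usually mean the scorer raised an exception in every CV fold (passing error_score='raise' to RandomizedSearchCV will surface it): f1_score has no multi_class argument and works on predicted labels, not probabilities. A hedged sketch of a scorer setup that should work for multiclass F1 (macro averaging is an assumption here, pick whichever average fits your problem):
from sklearn.metrics import f1_score, make_scorer
# F1 is computed from predicted labels, so no needs_proba / multi_class here
scoring = {'f1': make_scorer(f1_score, average='macro')}
# If a probability-based metric is wanted instead, roc_auc_score is the one
# that accepts multi_class="ovr":
# from sklearn.metrics import roc_auc_score
# scoring = {'auc': make_scorer(roc_auc_score, needs_proba=True, multi_class='ovr')}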

Create range from 0 to 1 with step 0.05 in Numpy

I want to create a list from 0 to 1 with step 0.05; the result should look like this: [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]
I tried the following code, but the output doesn't look correct. Could anyone help? Thanks.
print(np.arange(0, 1, 0.05).tolist())
Output:
[0.0, 0.05, 0.1, 0.15000000000000002, 0.2, 0.25, 0.30000000000000004, 0.35000000000000003, 0.4, 0.45, 0.5, 0.55, 0.6000000000000001, 0.65, 0.7000000000000001, 0.75, 0.8, 0.8500000000000001, 0.9, 0.9500000000000001]
You want np.linspace()
np.linspace(0, 1, 21)
Out[]:
array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
It's not necessary to use .tolist().
Try this:
a = np.arange(0, 1, 0.05)
print (a)
Output:
[0. 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65
0.7 0.75 0.8 0.85 0.9 0.95]
This works:
print(np.arange(0, 1, 0.05).round(2).tolist())
Output:
[0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
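If both the exact step and the inclusive endpoint matter, one way to combine the answers above (a small sketch; the rounding is only for display) is to derive the point count for np.linspace from the step:
import numpy as np
start, stop, step = 0, 1, 0.05
num = int(round((stop - start) / step)) + 1   # 21 points, both endpoints included
print(np.round(np.linspace(start, stop, num), 2).tolist())
# 0.0 through 1.0 in clean steps of 0.05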

numpy - multiply each element in array by a scaling factor

I have a numpy array of values and a list of scaling factors; I want to scale each value in the array by the corresponding factor, down each column:
values = [[ 0, 1, 2, 3 ],
          [ 1, 1, 4, 3 ],
          [ 2, 1, 6, 3 ],
          [ 3, 1, 8, 3 ]]
ls_alloc = [ 0.1, 0.4, 0.3, 0.2]
# convert values into numpy array
import numpy as np
na_values = np.array(values, dtype=float)
Edit: To clarify:
na_values is a 2-dimensional array of stock cumulative returns (i.e. normalised to day 1), where each row represents a date and each column a stock. The data is returned as an array for each date.
I now want to scale each stock's cumulative return by its allocation in the portfolio. So for each date (i.e. each row of the ndarray), apply the respective element of ls_alloc to that row element-wise.
# scale each value by its allocation
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
This does what I want, but I can't help but feel there must be a way to have numpy do this for me automatically?
That is, I feel:
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
# display na_components
na_components
[array([ 0. , 0.1, 0.2, 0.3]),
 array([ 0.4, 0.4, 0.4, 0.4]),
 array([ 0.6, 1.2, 1.8, 2.4]),
 array([ 0.6, 0.6, 0.6, 0.6])]
should be able to be expressed as something like:
tmp = np.multiply(na_values, ls_alloc)
# display tmp
tmp
array([[ 0. , 0.4, 0.6, 0.6],
[ 0.1, 0.4, 1.2, 0.6],
[ 0.2, 0.4, 1.8, 0.6],
[ 0.3, 0.4, 2.4, 0.6]])
Is there a numpy function which will achieve what I want elegantly and succinctly?
Edit:
I see that my first solution has transposed my data, such that I am returned a list of ndarrays. na_components[0] now gives an ndarray of the stock values for the first stock, 1 element per date.
The next step that I perform with na_components is to calculate the total cumulative return for the portfolio by summing each individual component
na_pfo_cum_ret = np.sum(na_components, axis=0)
This works with the list of individual stock return ndarrays.
That order seems a little odd to me, but IIUC, all you need to do is to transpose the result of multiplying na_values by array(ls_alloc):
>>> v
array([[ 0., 1., 2., 3.],
[ 1., 1., 4., 3.],
[ 2., 1., 6., 3.],
[ 3., 1., 8., 3.]])
>>> a
array([ 0.1, 0.4, 0.3, 0.2])
>>> (v*a).T
array([[ 0. , 0.1, 0.2, 0.3],
[ 0.4, 0.4, 0.4, 0.4],
[ 0.6, 1.2, 1.8, 2.4],
[ 0.6, 0.6, 0.6, 0.6]])
It's not completely clear to me what you want to do, but the answer is probably in Broadcasting rules. I think you want:
values = np.array( [[ 0, 1, 2, 3 ],
[ 1, 1, 4, 3 ],
[ 2, 1, 6, 3 ],
[ 3, 1, 8, 3 ]] )
ls_alloc = np.array([ 0.1, 0.4, 0.3, 0.2])
and either:
na_components = values * ls_alloc
or:
na_components = values * ls_alloc[:,np.newaxis]
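Following up on the edit about summing the components: with broadcasting the transpose is not needed at all. A small sketch reusing the question's variable names (values as given in the question):
import numpy as np
na_values = np.array([[0, 1, 2, 3],
                      [1, 1, 4, 3],
                      [2, 1, 6, 3],
                      [3, 1, 8, 3]], dtype=float)
ls_alloc = np.array([0.1, 0.4, 0.3, 0.2])
# scale each column (stock) by its allocation ...
na_components = na_values * ls_alloc
# ... and sum across columns to get the portfolio return per date (row)
na_pfo_cum_ret = na_components.sum(axis=1)
print(na_pfo_cum_ret)   # [1.6 2.3 3.  3.7]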

How can I mask a portion of an array using Numpy?

What I want to do is "mask" a subset of an array of j elements, from index 0 to k. E.g. for this array:
[0.2, 0.1, 0.3, 0.4, 0.5]
Masking the first 2 elements it becomes
[NaN, NaN, 0.3, 0.4, 0.5]
Does masked_array support this operation?
In [51]: arr=np.ma.array([0.2, 0.1, 0.3, 0.4, 0.5],mask=[True,True,False,False,False])
In [52]: print(arr)
[-- -- 0.3 0.4 0.5]
Or, if you already have a numpy array, you could use np.ma.masked_less_equal (see the link for a variety of other operations for masking particular elements):
In [53]: arr=np.array([0.2, 0.1, 0.3, 0.4, 0.5])
In [56]: np.ma.masked_less_equal(arr,0.2)
Out[57]:
masked_array(data = [-- -- 0.3 0.4 0.5],
mask = [ True True False False False],
fill_value = 1e+20)
Or, if you wish to mask the first two elements:
In [67]: arr=np.array([0.2, 0.1, 0.3, 0.4, 0.5])
In [68]: arr=np.ma.array(arr,mask=False)
In [69]: arr.mask[:2]=True
In [70]: arr
Out[70]:
masked_array(data = [-- -- 0.3 0.4 0.5],
mask = [ True True False False False],
fill_value = 1e+20)
I found this:
ma.array([1,2,3,4], mask=[1,1,0,0])
masked_array(data = [-- -- 3 4],
mask = [ True True False False],
fill_value = 999999)
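If the goal is literally the NaN output shown in the question rather than a masked array, one option (assuming a float array, since NaN is float-only) is to fill the masked entries with NaN:
import numpy as np
arr = np.ma.array([0.2, 0.1, 0.3, 0.4, 0.5],
                  mask=[True, True, False, False, False])
# .filled() replaces masked entries and returns a plain ndarray
print(arr.filled(np.nan))
# [nan nan 0.3 0.4 0.5]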