Currently I'm working with time series data in pandas. The series are the 3D positions of several markers, so my DataFrame looks as follows:
[A.x, A.y, A.z, B.x, B.y, B.z, C.x, C.y, C.z ... etc.]
Now sometimes the system loses track of one of the markers, so the position stays the same over several frames. I want to set these values to NaN (to interpolate them later), but I can't figure out how to do this. So:
 A.x  A.y  A.z          A.x  A.y  A.z
[0.1, 0.2, 0.2]        [0.1, 0.2, 0.2]
[0.1, 0.2, 0.2]        [NaN, NaN, NaN]
[0.1, 0.2, 0.2]   ->   [NaN, NaN, NaN]
[0.3, 0.2, 0.2]        [0.3, 0.2, 0.2]  <- Kept because at least one coordinate was different
[0.2, 0.2, 0.2]        [0.2, 0.2, 0.2]
[0.3, 0.2, 0.2]        [0.3, 0.2, 0.2]  <- Kept as it was not the same as the immediately preceding frame
Dropping duplicates doesn't work, as it does not look for consecutively repeated values but for duplicates in general. I think a solution that looks at 3 columns (so 1 point) at the same time would be best?
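To illustrate the difference, here is a minimal sketch on a single marker with made-up sample values. `DataFrame.duplicated` flags any row that has appeared anywhere earlier, whereas comparing each row with `df.shift()` flags only rows equal to the immediately preceding frame. The variable names are illustrative.

```python
import pandas as pd

# made-up positions for a single marker (A.x, A.y, A.z)
df = pd.DataFrame(
    [
        [0.1, 0.2, 0.2],
        [0.1, 0.2, 0.2],
        [0.3, 0.2, 0.2],
        [0.1, 0.2, 0.2],
    ],
    columns=["A.x", "A.y", "A.z"],
)

# duplicates in general: row 3 is flagged because it matches row 0
dup_anywhere = df.duplicated()

# repeats only: a row is flagged when every coordinate equals the previous row
repeat_of_previous = df.eq(df.shift()).all(axis=1)

print(dup_anywhere.tolist())        # [False, True, False, True]
print(repeat_of_previous.tolist())  # [False, True, False, False]
```

Row 3 returns to an earlier position after the marker has moved, so it should be kept; only the frame-to-frame comparison treats it that way.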
Simple version below.
Generic Version:
import numpy as np
import pandas as pd
df = pd.DataFrame(
[
[0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
[0.1, 0.2, 0.2, 0.3, 0.2, 0.2],
[0.1, 0.2, 0.2, 0.2, 0.2, 0.2],
[0.3, 0.2, 0.2, 0.3, 0.2, 0.2],
[0.2, 0.2, 0.2, 0.1, 0.2, 0.2],
[0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
[0.3, 0.2, 0.2, 0.1, 0.2, 0.2],
],
columns="A.x A.y A.z B.x B.y B.z".split(),
)
# A.x A.y A.z B.x B.y B.z
# 0 0.1 0.2 0.2 0.3 0.2 0.2
# 1 0.1 0.2 0.2 0.3 0.2 0.2
# 2 0.1 0.2 0.2 0.2 0.2 0.2
# 3 0.3 0.2 0.2 0.3 0.2 0.2
# 4 0.2 0.2 0.2 0.1 0.2 0.2
# 5 0.3 0.2 0.2 0.1 0.2 0.2
# 6 0.3 0.2 0.2 0.1 0.2 0.2
# identify repeating data: True where a value equals the value in the next row
diff = (df.values[:-1] == df.values[1:])
# [[ True, True, True, True, True, True],
# [ True, True, True, False, True, True],
# [False, True, True, False, True, True],
# [False, True, True, False, True, True],
# [False, True, True, True, True, True],
# [ True, True, True, True, True, True]]
allfalse = np.full((1, diff.shape[1]), False)
# [[False, False, False, False, False, False]]
# add allfalse as first row
diff2 = np.concatenate((allfalse, diff), axis=0)
# grouped into 3s
grouped = diff2.reshape(diff2.shape[0], diff2.shape[1] // 3, 3)
# [[[False, False, False], [False, False, False]],
# [[ True, True, True], [ True, True, True]],
# [[ True, True, True], [False, True, True]],
# [[False, True, True], [False, True, True]],
# [[False, True, True], [False, True, True]],
# [[False, True, True], [ True, True, True]],
# [[ True, True, True], [ True, True, True]]]
# mask for triplets
mask = np.all(grouped, axis=2)
# [[False, False],
# [ True, True],
# [ True, False],
# [False, False],
# [False, False],
# [False, True],
# [ True, True]]
grouped[~mask] = False
# [[[False, False, False], [False, False, False]],
# [[ True, True, True], [ True, True, True]],
# [[ True, True, True], [False, False, False]],
# [[False, False, False], [False, False, False]],
# [[False, False, False], [False, False, False]],
# [[False, False, False], [ True, True, True]],
# [[ True, True, True], [ True, True, True]]]
# finally reshape back into original shape
repeated = grouped.reshape(diff2.shape[0], diff2.shape[1])
# [[False, False, False, False, False, False],
# [ True, True, True, True, True, True],
# [ True, True, True, False, False, False],
# [False, False, False, False, False, False],
# [False, False, False, False, False, False],
# [False, False, False, True, True, True],
# [ True, True, True, True, True, True]]
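# set repeating values to NaN
# (DataFrame.mask is used because writing through df.values only works
# when it happens to return a writable view of the data)
df = df.mask(repeated)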
# A.x A.y A.z B.x B.y B.z
# 0 0.1 0.2 0.2 0.3 0.2 0.2
# 1 NaN NaN NaN NaN NaN NaN
# 2 NaN NaN NaN 0.2 0.2 0.2
# 3 0.3 0.2 0.2 0.3 0.2 0.2
# 4 0.2 0.2 0.2 0.1 0.2 0.2
# 5 0.3 0.2 0.2 NaN NaN NaN
# 6 NaN NaN NaN NaN NaN NaN
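The steps of the generic version can be wrapped into one function. This is only a sketch: `mask_repeated_markers` and its `dims` parameter are hypothetical names, and it assumes the columns come in consecutive groups of `dims` coordinates (A.x, A.y, A.z, B.x, ...).

```python
import numpy as np
import pandas as pd


def mask_repeated_markers(df, dims=3):
    """Return a copy of df with stuck marker positions replaced by NaN.

    Hypothetical helper; assumes the columns come in consecutive groups
    of `dims` coordinates (A.x, A.y, A.z, B.x, ...).
    """
    values = df.to_numpy()
    n_rows, n_cols = values.shape
    # True where a value equals the value in the previous row
    same = np.zeros((n_rows, n_cols), dtype=bool)
    same[1:] = values[1:] == values[:-1]
    # a marker counts as repeated only if all of its coordinates are unchanged
    stuck = same.reshape(n_rows, n_cols // dims, dims).all(axis=2)
    # expand the per-marker flags back to one flag per column
    return df.mask(np.repeat(stuck, dims, axis=1))
```

Calling `mask_repeated_markers(df)` on the example frame should reproduce the NaN pattern shown above while leaving the original `df` untouched.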
Simple(r) Version:
import numpy as np
import pandas as pd
df = pd.DataFrame(
[
[0.1, 0.2, 0.2],
[0.1, 0.2, 0.2],
[0.1, 0.2, 0.2],
[0.3, 0.2, 0.2],
[0.2, 0.2, 0.2],
[0.3, 0.2, 0.2],
[0.3, 0.2, 0.2],
],
columns="A.x A.y A.z".split(),
)
# A.x A.y A.z
# 0 0.1 0.2 0.2
# 1 0.1 0.2 0.2
# 2 0.1 0.2 0.2
# 3 0.3 0.2 0.2
# 4 0.2 0.2 0.2
# 5 0.3 0.2 0.2
# 6 0.3 0.2 0.2
# compare consecutive rows: True where a value equals the one in the next row
diff = (df.values[:-1] == df.values[1:])
# [[ True, True, True],
# [ True, True, True],
# [False, True, True],
# [False, True, True],
# [False, True, True],
# [ True, True, True]]
# collapse rows into single value np.all(..., axis=1)
# make array len == number of rows in original DF
repeated = np.insert(np.all(diff, axis=1), 0, False)
# [False, True, True, False, False, False, True]
# modify df in-place (via .loc, since df.values is not guaranteed to be a writable view)
df.loc[repeated] = np.nan
# A.x A.y A.z
# 0 0.1 0.2 0.2
# 1 NaN NaN NaN
# 2 NaN NaN NaN
# 3 0.3 0.2 0.2
# 4 0.2 0.2 0.2
# 5 0.3 0.2 0.2
# 6 NaN NaN NaN
I'm certain this can be made prettier and more efficient, but that is step 2 :)
I'll think about B.x... C.x part... will post update.
Enjoy!
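For the interpolation step mentioned in the question, here is a minimal sketch, assuming linear interpolation between the surrounding valid frames is acceptable. The series is the A.x column from the simple version after masking; `limit_area="inside"` leaves the trailing gap as NaN instead of padding it with the last valid value.

```python
import numpy as np
import pandas as pd

# A.x after the repeated frames were set to NaN (taken from the simple version)
ax = pd.Series([0.1, np.nan, np.nan, 0.3, 0.2, 0.3, np.nan], name="A.x")

# fill only gaps that have a valid value on both sides
filled = ax.interpolate(method="linear", limit_area="inside")
print(filled)
```

The same call works on a whole DataFrame, where each column is interpolated independently.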
I have an ndarray as follows.
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
I have a position ndarray as follows.
position = [10, 20, 30]
Now I want to add the position value at the beginning of the feature_matrix as follows.
[[10, 0.1, 0.3], [20, 0.7, 0.8], [30, 0.8, 0.8]]
I tried the answers in this: How to add an extra column to an numpy array
E.g.,
feature_matrix = np.concatenate((feature_matrix, position), axis=1)
However, I get the following error:
ValueError: all the input arrays must have same number of dimensions
Please help me to resolve this problem.
This solved my problem: I used np.column_stack.
import numpy as np
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
position = [10, 20, 30]
# stack position as the first column, followed by the feature columns
feature_matrix = np.column_stack((position, feature_matrix))
The problem is the shape of the position array: it is 1-dimensional, whereas feature_matrix is 2-dimensional, so np.concatenate cannot join them along axis=1.
>>> feature_matrix
array([[ 0.1, 0.3],
       [ 0.7, 0.8],
       [ 0.8, 0.8]])
>>> position
array([10, 20, 30])
>>> position.reshape((3,1))
array([[10],
       [20],
       [30]])
The solution is (with np.concatenate):
>>> np.concatenate((position.reshape((3,1)), feature_matrix), axis=1)
array([[ 10. , 0.1, 0.3],
       [ 20. , 0.7, 0.8],
       [ 30. , 0.8, 0.8]])
But np.column_stack is clearly the better fit in your case!
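If the number of rows is not known in advance, the hard-coded `(3, 1)` can be avoided. A short sketch, using the arrays from the question, of two equivalent ways to turn `position` into a column before joining (the `col_*` and `result_*` names are illustrative):

```python
import numpy as np

feature_matrix = np.array([[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]])
position = np.array([10, 20, 30])

# -1 lets NumPy infer the number of rows
col_a = position.reshape(-1, 1)
# indexing with np.newaxis adds a second axis of length 1
col_b = position[:, np.newaxis]

result_a = np.concatenate((col_a, feature_matrix), axis=1)
result_b = np.hstack((col_b, feature_matrix))

# both match the np.column_stack result
expected = np.column_stack((position, feature_matrix))
print(np.array_equal(result_a, expected), np.array_equal(result_b, expected))  # True True
```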
I have a numpy array of values, and a list of scaling factors by which I want to scale each value in the array, column by column:
values = [[ 0, 1, 2, 3 ],
          [ 1, 1, 4, 3 ],
          [ 2, 1, 6, 3 ],
          [ 3, 1, 8, 3 ]]
ls_alloc = [ 0.1, 0.4, 0.3, 0.2]
# convert values into numpy array
import numpy as np
na_values = np.array(values, dtype=float)
Edit: To clarify:
na_values is a 2-dimensional array of stock cumulative returns (i.e. normalised to day 1), where each row represents a date and each column a stock. The data is returned as an array for each date.
I now want to scale each stock's cumulative return by its allocation in the portfolio. So for each date (i.e. each row of the ndarray values), apply the respective element from ls_alloc to the row element-wise:
# scale each value by its allocation
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
This does what I want, but I can't help but feel there must be a way to have numpy do this for me automatically?
That is, I feel:
na_components = [ ls_alloc[i] * na_values[:,i] for i in range(len(ls_alloc)) ]
# display na_components
na_components
[array([ 0. , 0.1, 0.2, 0.3]),
 array([ 0.4, 0.4, 0.4, 0.4]),
 array([ 0.6, 1.2, 1.8, 2.4]),
 array([ 0.6, 0.6, 0.6, 0.6])]
should be able to be expressed as something like:
tmp = np.multiply(na_values, ls_alloc)
# display tmp
tmp
array([[ 0. , 0.4, 0.6, 0.6],
       [ 0.1, 0.4, 1.2, 0.6],
       [ 0.2, 0.4, 1.8, 0.6],
       [ 0.3, 0.4, 2.4, 0.6]])
Is there a numpy function which will achieve what I want elegantly and succinctly?
Edit:
I see that my first solution has transposed my data, such that I am returned a list of ndarrays. na_components[0] now gives an ndarray of the stock values for the first stock, 1 element per date.
The next step that I perform with na_components is to calculate the total cumulative return for the portfolio by summing the individual components:
na_pfo_cum_ret = np.sum(na_components, axis=0)
This works with the list of individual stock return ndarrays.
That order seems a little odd to me, but IIUC, all you need to do is to transpose the result of multiplying na_values by array(ls_alloc):
>>> v
array([[ 0., 1., 2., 3.],
       [ 1., 1., 4., 3.],
       [ 2., 1., 6., 3.],
       [ 3., 1., 8., 3.]])
>>> a
array([ 0.1, 0.4, 0.3, 0.2])
>>> (v*a).T
array([[ 0. , 0.1, 0.2, 0.3],
       [ 0.4, 0.4, 0.4, 0.4],
       [ 0.6, 1.2, 1.8, 2.4],
       [ 0.6, 0.6, 0.6, 0.6]])
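Since the next step in the question is to sum the components into a portfolio return, the scaling and the sum can also be combined. A sketch using the question's data, assuming only the total per date is needed: summing `v * a` along `axis=1` is the same as the matrix-vector product `v @ a`. The `by_*` names are illustrative.

```python
import numpy as np

v = np.array([[0, 1, 2, 3],
              [1, 1, 4, 3],
              [2, 1, 6, 3],
              [3, 1, 8, 3]], dtype=float)
a = np.array([0.1, 0.4, 0.3, 0.2])

# list-of-arrays approach from the question: one array per stock
na_components = [a[i] * v[:, i] for i in range(len(a))]
by_loop = np.sum(na_components, axis=0)

# broadcasting: scale each column, then sum across stocks for each date
by_broadcast = (v * a).sum(axis=1)

# the matrix-vector product does the scaling and the sum in one step
by_dot = v @ a

print(by_dot)  # approximately [1.6, 2.3, 3.0, 3.7]
```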
It's not completely clear to me what you want to do, but the answer probably lies in NumPy's broadcasting rules. I think you want:
import numpy as np
values = np.array([[0, 1, 2, 3],
                   [1, 1, 4, 3],
                   [2, 1, 6, 3],
                   [3, 1, 8, 3]])
ls_alloc = np.array([0.1, 0.4, 0.3, 0.2])
and either (one factor per column):
na_components = values * ls_alloc
or (one factor per row):
na_components = values * ls_alloc[:, np.newaxis]
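Which of the two applies depends on the axis the factors belong to. A small sketch of how the shapes broadcast, using the same arrays: a shape `(4,)` array lines up with the last axis, while adding `np.newaxis` gives shape `(4, 1)`, which lines up with the first axis. The names `per_column` and `per_row` are illustrative.

```python
import numpy as np

values = np.array([[0, 1, 2, 3],
                   [1, 1, 4, 3],
                   [2, 1, 6, 3],
                   [3, 1, 8, 3]])
ls_alloc = np.array([0.1, 0.4, 0.3, 0.2])

# (4, 4) * (4,): factor j scales column j (one factor per stock)
per_column = values * ls_alloc

# (4, 4) * (4, 1): factor i scales row i (one factor per date)
per_row = values * ls_alloc[:, np.newaxis]

print(ls_alloc.shape, ls_alloc[:, np.newaxis].shape)  # (4,) (4, 1)
```

For the portfolio case in the question, where each column is a stock, the first form is the one that matches the list comprehension (up to a transpose).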