How to use gather on 1D vector? - cntk

Have been trying to use cntk.ops.gather on 1D vectors. Here is a snippet illustrating what doesn't work:
import cntk
import numpy as np
def main():
xx = cntk.input_variable(shape=(1))
yy = cntk.input_variable(shape=(1))
zz = cntk.sequence.gather(xx, yy)
xx_value = np.arange(15, dtype=np.float64)
yy_value = np.array([1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1], dtype=np.float64)
aa = zz.eval({xx: xx_value.reshape(-1, 1), yy: yy_value.reshape(-1, 1)})
print(aa)
if __name__ == "__main__":
main()

The reason for this is that cntk expects a batch of examples to be provided.
When it sees, a (15,1) array it converts it to a batch of 15 examples each of length 1.
Then when gather is applied cntk is unhappy because some examples in the minibatch produce empty sequences (those for which there is a 0 in yy_value).
You can solve your problem by specifying the fact that you only have one example in the minibatch in a couple different ways.
you can provide the values in lists like this
aa = zz.eval({xx: [xx_value.reshape(-1, 1)], yy: [yy_value.reshape(-1, 1)]})
you can provide the values in a tensor of shape (1,15,1) like this:
aa = zz.eval({xx: xx_value.reshape(1, -1, 1), yy: yy_value.reshape(1, -1, 1)})
The latter works only if all the sequences in a minibatch have the same length.

Related

numpy fill 3D mask array from 2D k-index boundary array

I want to use a 2D array which contains k-index values to quickly fill a 3D array with different mask values above/below each k-index. Only non-zero boundary indices will be used to fill.
Initialize 2D k-index array and extract valid i-j index arrays:
import numpy as np
boundary_indices = np.array([[0, 1, 2], [1, 2, 1], [0, 2, 0]])
ii, jj = np.where(boundary_indices > 0) # determine desired indices
kk = boundary_indices[ii, jj] # align boundary indices with valid indices
Yields:
boundary_indices = array([[0, 1, 2],
[1, 2, 1],
[0, 2, 0]])
ii = array([0, 0, 1, 1, 1, 2])
jj = array([1, 2, 0, 1, 2, 1])
kk = array([1, 2, 1, 2, 1, 2])
Loop through the indices and populate the output array:
output = np.zeros((3, 3, 3), dtype=np.int64)
for i, j, k in zip(ii, jj, kk):
output[i, j, :k] = 7 # fill region above
output[i, j, k:] = 8 # fill region below
While this does yield the correct results, it becomes quite slow once the size of the array increases significantly:
output[:, :, 0] = [[0, 7, 7],
[7, 7, 7],
[0, 7, 0]]
output[:, :, 1] = [[0, 8, 7],
[8, 7, 8],
[0, 7, 0]]
output[:, :, 2] = [[0, 8, 8],
[8, 8, 8],
[0, 8, 0]]
Is there a more efficient way to do this?
Tried output[ii, jj, kk] = 8 but that only imprints the boundary on the output array and not the regions above/below.
I was hoping that there would be some fancy-indexing magic and that something like this would work:
output[ii, jj, :kk] = 7
output[ii, jj, kk:] = 8
But it generates a TypeError: TypeError: only integer scalar arrays can be converted to a scalar index
For such kind of operation, Numba and Cython can be used to produce an efficient code. Here is an example with Numba:
import numba as nb
# `parallel=True` can be added here for large arrays
#nb.njit('int64[:,:,::1](int64[:], int64[:], int64[:])')
def compute(ii, jj, kk):
output = np.zeros((3, 3, 3), dtype=np.int64)
n = output.shape[2]
# `for idx in prange(ii.size)` can be used here for large array
for i, j, k in zip(ii, jj, kk):
# `i, j, k = ii[idx], jj[idx], kk[idx]` can be used here for large array
for l in range(k): # fill region above
output[i, j, l] = 7
for l in range(k, n): # fill region below
output[i, j, l] = 8
return output
# Either kk needs to be converted to an int64-based array with kk.astype(np.int64)
# or boundary_indices needs to be an int64-based array in the first place.
output = compute(ii, jj, kk)
Note that the Numba function can be faster if ii and jj are contiguous. However, they are surprisingly not contiguous when retrieved from np.where. Besides I assume that kk is a 64-bit array. You can change the signature (string in the Numba jit decorator) so to support 32-bit array. Also please note that Numba can lazily compile the function based on the provided type at runtime but this introduce a significant overhead during the first function call. This code is significantly faster, especially for large arrays thanks to the the just-in-time compilation of Numba. The Numba loop can be parallelized using prange and the parallel=True decorator flag although the current code should already be pretty good. Finally, note that you can do the operation np.where(boundary_indices > 0) directly in the Numba loop on the fly so to avoid creating possibly-expensive temporary arrays.

How do I input a Time Series in spmvg nfoursid

I want to use this algorithm for n4sid model estimation. However, in the Documentation, there is an input DataFrame generated from Random Samples, where I want to input a Time Series Dataframe. Calling the nfoursid method leads to an Type Error or Value Error.
Documentation:
https://github.com/spmvg/nfoursid/blob/master/examples/Overview.ipynb
Imported libs:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from nfoursid.kalman import Kalman
from nfoursid.nfoursid import NFourSID
from nfoursid.state_space import StateSpace
import time
import datetime
import math
import scipy as sp
My input Time Series as Data Frame (flawless):
import yfinance as yfin
yfin.pdr_override()
spy = pdr.get_data_yahoo('AAPL',start='2022-08-23',end='2022-10-24')
spy['Log Return'] = np.log(spy['Adj Close']/spy['Adj Close'].shift(1))
AAPL=pd.DataFrame((spy['Log Return']))
The input DataFrame as proposed in the documentation:
state_space = StateSpace(A, B, C, D)
for _ in range(NUM_TRAINING_DATAPOINTS):
input_state = np.random.standard_normal((INPUT_DIM, 1))
noise = np.random.standard_normal((OUTPUT_DIM, 1)) * NOISE_AMPLITUDE
state_space.step(input_state, noise)
The call using the input proposed in the documentation:
#---->libs already imported
pd.set_option('display.max_columns', None)
np.random.seed(0) # reproducible results
NUM_TRAINING_DATAPOINTS = 1000
# create a training-set by simulating a state-space model with this many datapoints
NUM_TEST_DATAPOINTS = 20 # same for the test-set
INPUT_DIM = 3 #---->this probably needs to adapted to the AAPL dimensions
OUTPUT_DIM = 2
INTERNAL_STATE_DIM = 4 # actual order of the state-space model in the training- and test-set
NOISE_AMPLITUDE = .1 # add noise to the training- and test-set
FIGSIZE = 8
# define system matrices for the state-space model of the training- and test-set
A = np.array([
[1, .01, 0, 0],
[0, 1, .01, 0],
[0, 0, 1, .02],
[0, -.01, 0, 1],
]) / 1.01
B = np.array([
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[0, 1, 1],
]
) / 3
C = np.array([
[1, 0, 1, 1],
[0, 0, 1, -1],
])
D = np.array([
[1, 0, 1],
[0, 1, 0]
]) / 10
)
#---->maybe I have to input the DataFrame already here at the state-space model:
state_space = StateSpace(A, B, C, D)
for _ in range(NUM_TRAINING_DATAPOINTS):
input_state = np.random.standard_normal((INPUT_DIM, 1))
noise = np.random.standard_normal((OUTPUT_DIM, 1)) * NOISE_AMPLITUDE
state_space.step(input_state, noise)
#----
#---->This is the method with the input DF, in this case the random state-space model
nfoursid = NFourSID(
state_space.to_dataframe(), # the state-space model can summarize inputs and outputs as a dataframe
output_columns=state_space.y_column_names,
input_columns=state_space.u_column_names,
num_block_rows=10
)
nfoursid.subspace_identification()
Pasting my DF at the call of the method nfoursid which leads to an error:
df2 = pd.DataFrame()
nfoursid = NFourSID(
output_columns=df2,
input_columns=AAPL,
num_block_rows=10
)
TypeError: NFourSID.init() missing 1 required positional argument: 'dataframe'
Pasting DF in the state_space led to:
ValueError: Dimensions of u (43, 1) are inconsistent. Expected (3, 1).
and
TypeError: 'DataFrame' object is not callable

how to take numpy array as an input in logistic regression?

Currently i'm working on a video recommendation system which will predicts a video in a form of 0 (Negative) and 1 (positive). I successfully scrape data set from YouTube and also find sentiments of YouTube comments in the form of 0 (Negative) and 1 (positive).I encode text data of my csv using one hot encoder and get output in the form of numpy array. Now My question is how to give the numpy array as an input (X) in logistic regression ? Below are my code, output and csv(1874 X 2).
Target variable is Comments_Sentiments
#OneHotEncoding
import numpy as np
import pandas as pd
from sklearn import preprocessing
X = pd.read_csv("C:/Users/Shahnawaz Irfan/Desktop/USIrancrisis/demo.csv")
#X.head(5)
X = X.select_dtypes(include=[object])
#X.head(5)
#X.shape
#X.columns
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)
X_2.head()
enc = preprocessing.OneHotEncoder()
enc.fit(X_2)
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape
onehotlabels
Output is:
array([[1.],
[1.],
[1.],
...,
[1.],
[1.],
[1.]])
Can any one resolve this query by taking this numpy array as an input in logistic regression?
you can use the inverse functionenc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]]) enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

How to convert a list of indices into a cell list (numpy array of lists) in numpy with vectorized implementation?

Cell list is a data structure that maintains lists of data points in an N-D meshgrid. For example, the following list of 2d indices:
ind = [(0, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1)]
is converted to the following 2x2 cell list:
cell = [[[3, 4, 5], [0, 2]],
[[1, ], [6, ]]
]
using an O(n) algorithm:
# create an empty 2x2 cell list
cell = [[[] for _ in range(2)] for _ in range(2)]
for i in range(len(ind)):
cell[ind[i][0], ind[i][1]].append(i)
Is there a vectorized way in numpy that can convert the list of indices (ind) into the cell structure described above?
I don't think there is a good pure numpy but you can either use pythran or---if you don't want to touch a compiler---scipy.sparse cf. this Q&A which is essentially a 1D version of your problem.
[stb_pthr.py] simplified from Most efficient way to sort an array into bins specified by an index array?
import numpy as np
#pythran export sort_to_bins(int[:], int)
def sort_to_bins(idx, mx=-1):
if mx==-1:
mx = idx.max() + 1
cnts = np.zeros(mx + 1, int)
for i in range(idx.size):
cnts[idx[i] + 1] += 1
for i in range(1, cnts.size):
cnts[i] += cnts[i-1]
res = np.empty_like(idx)
for i in range(idx.size):
res[cnts[idx[i]]] = i
cnts[idx[i]] += 1
return res, cnts[:-1]
Compile: pythran stb_pthr.py
Main script:
import numpy as np
try:
from stb_pthr import sort_to_bins
HAVE_PYTHRAN = True
except:
HAVE_PYTHRAN = False
from scipy import sparse
def fallback(flat, maxind):
res = sparse.csr_matrix((np.zeros_like(flat),flat,np.arange(len(flat)+1)),
(len(flat), maxind)).tocsc()
return res.indices, res.indptr[1:-1]
if not HAVE_PYTHRAN:
sort_to_bins = fallback
def to_cell(data, shape=None):
data = np.asanyarray(data)
if shape is None:
*shape, = (1 + c.max() for c in data.T)
flat = np.ravel_multi_index((*data.T,), shape)
reord, bnds = sort_to_bins(flat, np.prod(shape))
return np.frompyfunc(np.split(reord, bnds).__getitem__, 1, 1)(
np.arange(np.prod(shape)).reshape(shape))
ind = [(0, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1)]
print(to_cell(ind))
from timeit import timeit
ind = np.transpose(np.unravel_index(np.random.randint(0, 100, (1_000_000)), (10, 10)))
if HAVE_PYTHRAN:
print(timeit(lambda: to_cell(ind), number=10)*100)
sort_to_bins = fallback # !!! MUST REMOVE THIS LINE AFTER TESTING
print(timeit(lambda: to_cell(ind), number=10)*100)
Sample run, output is answer to OP's toy example and timings (in ms) for the pythran and scipy solutions on a 1,000,000 points example:
[[array([3, 4, 5]) array([0, 2])]
[array([1]) array([6])]]
11.411489499732852
29.700406698975712

TensorFlow: numpy.repeat() alternative

I want to compare the predicted values yp from my neural network in a pairwise fashion, and so I was using (back in my old numpy implementation):
idx = np.repeat(np.arange(len(yp)), len(yp))
jdx = np.tile(np.arange(len(yp)), len(yp))
s = yp[[idx]] - yp[[jdx]]
This basically create a indexing mesh which I then use. idx=[0,0,0,1,1,1,...] while jdx=[0,1,2,0,1,2...]. I do not know if there is a simpler manner of doing it...
Anyhow, TensorFlow has a tf.tile(), but it seems to be lacking a tf.repeat().
idx = np.repeat(np.arange(n), n)
v2 = v[idx]
And I get the error:
TypeError: Bad slice index [ 0 0 0 ..., 215 215 215] of type <type 'numpy.ndarray'>
It also does not work to use a TensorFlow constant for the indexing:
idx = tf.constant(np.repeat(np.arange(n), n))
v2 = v[idx]
-
TypeError: Bad slice index Tensor("Const:0", shape=TensorShape([Dimension(46656)]), dtype=int64) of type <class 'tensorflow.python.framework.ops.Tensor'>
The idea is to convert my RankNet implementation to TensorFlow.
You can achieve the effect of np.repeat() using a combination of tf.tile() and tf.reshape():
idx = tf.range(len(yp))
idx = tf.reshape(idx, [-1, 1]) # Convert to a len(yp) x 1 matrix.
idx = tf.tile(idx, [1, len(yp)]) # Create multiple columns.
idx = tf.reshape(idx, [-1]) # Convert back to a vector.
You can simply compute jdx using tf.tile():
jdx = tf.range(len(yp))
jdx = tf.tile(jdx, [len(yp)])
For the indexing, you could try using tf.gather() to extract non-contiguous slices from the yp tensor:
s = tf.gather(yp, idx) - tf.gather(yp, jdx)
According to tf api document, tf.keras.backend.repeat_elements() does the same work with np.repeat() . For example,
x = tf.constant([1, 3, 3, 1], dtype=tf.float32)
rep_x = tf.keras.backend.repeat_elements(x, 5, axis=0)
# result: [1. 1. 1. 1. 1. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 1. 1. 1. 1. 1.]
Just for 1-d tensors, I've made this function
def tf_repeat(y,repeat_num):
return tf.reshape(tf.tile(tf.expand_dims(y,axis=-1),[1,repeat_num]),[-1])
It looks like your question is so popular that people refer it on TF tracker. Sadly the same function is not still implemented in TF.
You can implement it by combining tf.tile, tf.reshape, tf.squeeze. Here is a way to convert examples from np.repeat:
import numpy as np
import tensorflow as tf
x = [[1,2],[3,4]]
print np.repeat(3, 4)
print np.repeat(x, 2)
print np.repeat(x, 3, axis=1)
x = tf.constant([[1,2],[3,4]])
with tf.Session() as sess:
print sess.run(tf.tile([3], [4]))
print sess.run(tf.squeeze(tf.reshape(tf.tile(tf.reshape(x, (-1, 1)), (1, 2)), (1, -1))))
print sess.run(tf.reshape(tf.tile(tf.reshape(x, (-1, 1)), (1, 3)), (2, -1)))
In the last case where repeats are different for each element you most probably will need loops.
Just in case anybody is interested for a 2D method to copy the matrices. I think this could work:
TF_obj = tf.zeros([128, 128])
tf.tile(tf.expand_dims(TF_obj, 2), [1, 1, 2])
import numpy as np
import tensorflow as tf
import itertools
x = np.arange(6).reshape(3,2)
x = tf.convert_to_tensor(x)
N = 3 # number of repetition
K = x.shape[0] # for here 3
order = list(range(0, N*K, K))
order = [[x+i for x in order] for i in range(K)]
order = list(itertools.chain.from_iterable(order))
x_rep = tf.gather(tf.tile(x, [N, 1]), order)
Results from:
[0, 1],
[2, 3],
[4, 5]]
To:
[[0, 1],
[0, 1],
[0, 1],
[2, 3],
[2, 3],
[2, 3],
[4, 5],
[4, 5],
[4, 5]]
If you want:
[[0, 1],
[2, 3],
[4, 5],
[0, 1],
[2, 3],
[4, 5],
[0, 1],
[2, 3],
[4, 5]]
Simply use tf.tile(x, [N, 1])
So I have found that tensorflow has one such method to repeat the elements of an array. The method tf.keras.backend.repeat_elements is what you are looking for. Anyone who comes at a later point of time can save lot of their efforts. This link offers an explanation to the method and specifically says
Repeats the elements of a tensor along an axis, like np.repeat
I have included a very short example which proves that the elements are copied in the exact way as np.repeat would do.
import numpy as np
import tensorflow as tf
x = np.random.rand(2,2)
# print(x) # uncomment this line to see the array's elements
y = tf.convert_to_tensor(x)
y = tf.keras.backend.repeat_elements(x, rep=3, axis=0)
# print(y) # uncomment this line to see the results
You can simulate missing tf.repeat by tf.stacking the value with itself:
value = np.arange(len(yp)) # what to repeat
repeat_count = len(yp) # how many times
repeated = tf.stack ([value for i in range(repeat_count)], axis=1)
I advice using this only on small repeat counts.
Though many clean and working solutions have been given, they seem to all be based on producing the set of indices from scratch each iteration.
While the cost to produce these node's isn't typically significant during training, it may be significant if using your model for inference.
Repeating tf.range (like your example) has come up a few times so I built the following function creator. Given the maximum number of times something will be repeated and the maximum number of things that will need repeating, it returns a function which produces the same values as np.repeat(np.arange(len(multiples)), multiples).
import tensorflow as tf
import numpy as np
def numpy_style_repeat_1d_creator(max_multiple=100, max_to_repeat=10000):
board_num_lookup_ary = np.repeat(
np.arange(max_to_repeat),
np.full([max_to_repeat], max_multiple))
board_num_lookup_ary = board_num_lookup_ary.reshape(max_to_repeat, max_multiple)
def fn_to_return(multiples):
board_num_lookup_tensor = tf.constant(board_num_lookup_ary, dtype=tf.int32)
casted_multiples = tf.cast(multiples, dtype=tf.int32)
padded_multiples = tf.pad(
casted_multiples,
[[0, max_to_repeat - tf.shape(multiples)[0]]])
return tf.boolean_mask(
board_num_lookup_tensor,
tf.sequence_mask(padded_multiples, maxlen=max_multiple))
return fn_to_return
#Here's an example of how it can be used
with tf.Session() as sess:
repeater = numpy_style_repeat_1d_creator(5,4)
multiples = tf.constant([4,1,3])
repeated_values = repeater(multiples)
print(sess.run(repeated_values))
The general idea is to store a repeated tensor and then mask it, but it may help to see it visually (this is for the example given above):
In the example above the following Tensor is produced:
[[0,0,0,0,0],
[1,1,1,1,1],
[2,2,2,2,2],
[3,3,3,3,3]]
For multiples [4,1,3] it will collect the non-X values:
[[0,0,0,0,X],
[1,X,X,X,X],
[2,2,2,X,X],
[X,X,X,X,X]]
resulting in:
[0,0,0,0,1,2,2,2]
tl;dr: To avoid producing the indices each time (can be costly), pre-repeat everything and then mask that tensor each time
A relatively fast implementation was recently added with RaggedTensor utilities from 1.13, but it's not a part of the officially exported API. You can still use it, but there's a chance it might disappear.
from tensorflow.python.ops.ragged.ragged_util import repeat
From the source code:
# This op is intended to exactly match the semantics of numpy.repeat, with
# one exception: numpy.repeat has special (and somewhat non-intuitive) behavior
# when axis is not specified. Rather than implement that special behavior, we
# simply make `axis` be a required argument.
Tensorflow 2.10 has implemented np.repeat feature.
tf.repeat([1, 2, 3], repeats=[3, 1, 2], axis=0)
<tf.Tensor: shape=(6,), dtype=int32, numpy=array([1, 1, 1, 2, 3, 3], dtype=int32)>