How to change dtypes of numpy array for tensorflow - pandas

I am creating a neural network in tensorflow and I have created the placeholders like this:
input_tensor = tf.placeholder(tf.float32, shape = (None,n_input), name = "input_tensor")
output_tensor = tf.placeholder(tf.float32, shape = (None,n_classes), name = "output_tensor")
During the training process, I was getting the following error:
Traceback (most recent call last):
  File "try.py", line 150, in <module>
    sess.run(optimizer, feed_dict={X: x_train[i: i + 1], Y: y_train[i: i + 1]})
TypeError: unhashable type: 'numpy.ndarray'
I identified that this is because the datatypes of my x_train and y_train differ from the datatypes of the placeholders.
My x_train looks somewhat like this:
array([[array([[ 1.,  0.,  0.],
               [ 0.,  1.,  0.]])],
       [array([[ 0.,  1.,  0.],
               [ 1.,  0.,  0.]])],
       [array([[ 0.,  0.,  1.],
               [ 0.,  1.,  0.]])]], dtype=object)
It was initially a dataframe like this:
0 [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
1 [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
2 [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
I did x_train = train_x.values to get the numpy array
And y_train looks like this:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
x_train has dtype object and y_train has dtype float64.
What I want to know is how I can change the datatypes of my training data so that they work well with the TensorFlow placeholders. Or please point out if I am missing something.

It is a little hard to guess what shape you want your data to be in, but I am guessing it is one of the two combinations below. I will also try to simulate your data in a Pandas dataframe.
import numpy as np
import pandas as pd

df = pd.DataFrame([[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]],
                   [[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]],
                   [[[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]]], columns=['Mydata'])
print(df)
x = df.Mydata.values
print(x.shape)
print(x)
print(x.dtype)
Output:
Mydata
0 [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
1 [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
2 [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
(3,)
[list([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
list([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
list([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])]
object
Combination 1
y = [item for sub_list in x for item in sub_list]
y = np.array(y, dtype = np.float32)
print(y.dtype, y.shape)
print(y)
Output:
float32 (6, 3)
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
Combination 2
y = [sub_list for sub_list in x]
y = np.array(y, dtype = np.float32)
print(y.dtype, y.shape)
print(y)
Output:
float32 (3, 2, 3)
[[[ 1. 0. 0.]
[ 0. 1. 0.]]
[[ 0. 1. 0.]
[ 1. 0. 0.]]
[[ 0. 0. 1.]
[ 0. 1. 0.]]]
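Either combination can then be fed to a float32 placeholder. If your placeholder expects flat rows of shape (None, n_input), a minimal sketch, assuming n_input = 6 for these (2, 3) sub-arrays:
y = y.reshape(y.shape[0], -1)  # (3, 2, 3) from Combination 2 -> (3, 6)
print(y.shape)                 # each sample is now one row of length 6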

Your x_train is a nested object array containing arrays, so you have to unpack it and reshape it. Here's a general-purpose hack:
def unpack(a, aggregate=None):
    # Use None instead of a mutable default list, which would persist
    # across calls; isinstance also catches numpy float64 scalars.
    if aggregate is None:
        aggregate = []
    for x in a:
        if isinstance(x, float):
            aggregate.append(x)
        else:
            unpack(x, aggregate=aggregate)
    return np.array(aggregate)

x_train = unpack(x_train).reshape(x_train.shape[0], -1)
Once you've got a dense array (y_train is already dense), you can use a function like the following:
def cast(placeholder, array):
    dtype = placeholder.dtype.as_numpy_dtype
    return array.astype(dtype)

x_train, y_train = cast(X, x_train), cast(Y, y_train)
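If every nested array has the same shape, a simpler alternative is np.stack; a minimal sketch, assuming x_train is the original (3, 1) object array from the question:
import numpy as np

# Stack the per-row (2, 3) arrays into one dense (3, 2, 3) float array,
# then flatten each sample to match a (None, n_input) placeholder.
dense = np.stack(x_train.ravel()).astype(np.float32)
x_train = dense.reshape(dense.shape[0], -1)  # shape (3, 6)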

Related

What's the right way to one-hot-encode a categorical tuple in Tensorflow/Keras?

I want to create a neural network that takes a categorical tuple as input and passes its one-hot-encoded value to its layers.
For example, assuming that the tuple value limits were (2, 2, 3), I need a preprocessing layer that transforms the following three-dimensional list of values:
[
(1, 0, 0),
(0, 0, 1),
(1, 1, 2),
]
Into the following one-dimensional tensor:
[
0.0, 1.0, 0.0, 0.0, 0.0, 0.0,
1.0, 0.0, 0.0, 0.0, 0.0, 1.0,
]
Does such a function exist?
I assume that this custom layer operates on a batch having a varying number of tuples per sample. For example, an input batch may be
[[(1, 0, 0), (0, 0, 1), (1, 1, 2)],
[(1, 0, 0), (1, 1, 2)]]
and the desired output tensors would be
[[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
Since the samples can be of uneven sizes, the batch needs to be converted to a tf.RaggedTensor (instead of a normal Tensor) before being fed to the layer. However, the following solution works with both tf.Tensor and tf.RaggedTensor as input.
class FillOneLayer(tf.keras.layers.Layer):
    def __init__(self, shape, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.shape = shape

    def call(self, inputs, training=None):
        num_samples = inputs.nrows() if isinstance(inputs, tf.RaggedTensor) else tf.shape(inputs)[0]
        num_samples = tf.cast(num_samples, tf.int32)
        ret = tf.TensorArray(tf.float32, size=num_samples, dynamic_size=False)
        for i in range(num_samples):
            sample = inputs[i]
            sample = sample.to_tensor() if isinstance(sample, tf.RaggedTensor) else sample
            # Each row of `sample` is an index tuple; set those positions to 1.
            updates_shape = tf.shape(sample)[:-1]
            tmp = tf.zeros(self.shape)
            tmp = tf.tensor_scatter_nd_update(tmp, sample, tf.ones(updates_shape))
            # Flatten the filled tensor into the one-dimensional output row.
            ret = ret.write(i, tf.reshape(tmp, (-1,)))
        return ret.stack()
Output for normal input tensor
>>> a = tf.constant([[(1, 0, 0), (0, 0, 1), (1, 1, 2)],
[(1, 0, 0), (0, 0, 1), (1, 0, 2)]])
>>> FillOneLayer((2,2,3))(a)
<tf.Tensor: shape=(2, 12), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.]], dtype=float32)>
Output for ragged tensor
>>> a = tf.ragged.constant([[(1, 0, 0), (0, 0, 1), (1, 1, 2)],
[(1, 0, 0), (0, 0, 1)]])
>>> FillOneLayer((2,2,3))(a)
<tf.Tensor: shape=(2, 12), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>
The solution also works when you decorate call() with tf.function, which is usually what happens when you call fit on a model of which this layer is a member. In that case, to avoid graph retracing, you should ensure that all batches are of the same type, i.e., either all RaggedTensor or all Tensor.
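For illustration, a minimal sketch (assuming the FillOneLayer class above) of calling the layer through tf.function:
import tensorflow as tf

layer = FillOneLayer((2, 2, 3))

@tf.function
def encode(batch):
    return layer(batch)

# Retracing happens once per input signature, so keep batch types consistent.
ragged = tf.ragged.constant([[(1, 0, 0), (0, 0, 1)],
                             [(1, 1, 2)]])
print(encode(ragged))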

What does myarray[0][:,0] mean?

This is an excerpt from some documentation.
lambda ind, r: 1.0 + any(np.array(points_2d)[ind][:,0] == 0.0)
But I don't understand np.array(points_2d)[ind][:,0].
It seems equivalent to myarray[0][:,0], which doesn't make sense to me.
Can anyone help to explain?
With points_2d from earlier in the doc:
In [38]: points_2d = [(0., 0.), (0., 1.), (1., 1.), (1., 0.),
...: (0.5, 0.25), (0.5, 0.75), (0.25, 0.5), (0.75, 0.5)]
In [39]: np.array(points_2d)
Out[39]:
array([[0. , 0. ],
[0. , 1. ],
[1. , 1. ],
[1. , 0. ],
[0.5 , 0.25],
[0.5 , 0.75],
[0.25, 0.5 ],
[0.75, 0.5 ]])
Indexing with a scalar gives a 1d array, which can't be further indexed with [:,0].
In [40]: np.array(points_2d)[0]
Out[40]: array([0., 0.])
But with a list or slice:
In [41]: np.array(points_2d)[[0,1,2]]
Out[41]:
array([[0., 0.],
[0., 1.],
[1., 1.]])
In [42]: np.array(points_2d)[[0,1,2]][:,0]
Out[42]: array([0., 0., 1.])
So this selects the first column of a subset of rows.
In [43]: np.array(points_2d)[[0,1,2]][:,0]==0.0
Out[43]: array([ True, True, False])
In [44]: any(np.array(points_2d)[[0,1,2]][:,0]==0.0)
Out[44]: True
I think they could have used:
In [45]: np.array(points_2d)[[0,1,2],0]
Out[45]: array([0., 0., 1.])
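For completeness, a slice behaves the same way as a list of row indices here; a small sketch, assuming points_2d as above:
import numpy as np

points_2d = [(0., 0.), (0., 1.), (1., 1.), (1., 0.),
             (0.5, 0.25), (0.5, 0.75), (0.25, 0.5), (0.75, 0.5)]
arr = np.array(points_2d)

print(arr[0:3][:, 0])     # slice keeps the array 2-D: [0. 0. 1.]
print(arr[[0, 1, 2], 0])  # equivalent single-step fancy indexing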

How to write a custom kernel in GPflow for the covariance matrix RBF plus noise only on the main block diagonal?

The required covariance matrix is [equation image not reproduced here], where t is 1D time and k ∈ {0, 1}. A sample from the kernel should look like [figure not reproduced here], with the orange sequence corresponding to k=0 and the blue one to k=1.
It sounds like you're looking for a kernel for different discrete outputs. You can achieve this in GPflow for example with the Coregion kernel, for which there is a tutorial notebook.
To construct a coregionalization kernel that is block-diagonal (all off-diagonal entries are zero), you can set rank=0. Note that you need to explicitly specify which kernel should act on which dimensions:
import gpflow
k_time = gpflow.kernels.SquaredExponential(active_dims=[0])
k_coreg = gpflow.kernels.Coregion(output_dim=2, rank=0, active_dims=[1])
You can combine them with * as in the notebook, or with + as specified in the question:
k = k_time + k_coreg
You can see that the k_coreg term is block-diagonal, as you specified: evaluating
test_inputs = np.array([
    [0.1, 0.0],
    [0.5, 0.0],
    [0.7, 1.0],
    [0.1, 1.0],
])
k_coreg(test_inputs)
returns
<tf.Tensor: shape=(4, 4), dtype=float64, numpy=
array([[1., 1., 0., 0.],
[1., 1., 0., 0.],
[0., 0., 1., 1.],
[0., 0., 1., 1.]])>
And you can get samples as in the graph in the question by running
import numpy as np
import matplotlib.pyplot as plt

num_inputs = 51
num_outputs = 2
X = np.linspace(0, 5, num_inputs)
Q = np.arange(num_outputs)
XX, QQ = np.meshgrid(X, Q, indexing='ij')
pts = np.c_[XX.flatten(), QQ.flatten()]

K = k(pts)
# Add a small jitter to the diagonal so the Cholesky factorization is stable:
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(K)))
num_samples = 3
v = np.random.randn(len(L), num_samples)
f = L @ v

for i in range(num_samples):
    plt.plot(X, f[:, i].reshape(num_inputs, num_outputs))
In GPflow, you can construct this kernel using a sum kernel consisting of a Squared Exponential (RBF) and a White kernel.
import gpflow
kernel = gpflow.kernels.SquaredExponential() + gpflow.kernels.White()
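As a quick check, a minimal sketch (assuming GPflow 2.x) evaluating this sum kernel on a few points; the White term contributes its noise variance only on the diagonal:
import numpy as np
import gpflow

kernel = gpflow.kernels.SquaredExponential() + gpflow.kernels.White()
X = np.linspace(0, 1, 4).reshape(-1, 1)
print(kernel(X).numpy())  # RBF matrix with noise variance added on the diagonal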

Collect Spark dataframe into Numpy matrix

I've used spark to compute the PCA on a large dataset, now I have a spark dataframe with the following structure:
Row(pcaFeatures=DenseVector([elem1, elem2, ...]))
where elem1, ..., elemN are double numbers. I would like to transform it into a NumPy matrix. Right now I'm using the following code:
numpymatrix = datapca.toPandas().as_matrix()
but I get an array with elements of type object instead of a numeric matrix. Is there a way to get the matrix I need?
Your request makes sense only if the resulting data can fit into your main memory (i.e. you can safely use collect()); on the other hand, if this is the case, admittedly you have absolutely no reason to use Spark at all.
Anyway, given this assumption, here is a general way to convert a single-column features Spark dataframe (Rows of DenseVector) to a NumPy array using toy data:
spark.version
# u'2.2.0'
from pyspark.ml.linalg import Vectors
import numpy as np
# toy data:
df = spark.createDataFrame([(Vectors.dense([0, 45, 63, 0, 0, 0, 0]),),
                            (Vectors.dense([0, 0, 0, 85, 0, 69, 0]),),
                            (Vectors.dense([0, 89, 56, 0, 0, 0, 0]),)],
                           ['features'])
dd = df.collect()
dd
# [Row(features=DenseVector([0.0, 45.0, 63.0, 0.0, 0.0, 0.0, 0.0])),
# Row(features=DenseVector([0.0, 0.0, 0.0, 85.0, 0.0, 69.0, 0.0])),
# Row(features=DenseVector([0.0, 89.0, 56.0, 0.0, 0.0, 0.0, 0.0]))]
np.asarray([x[0] for x in dd])
# array([[ 0., 45., 63., 0., 0., 0., 0.],
# [ 0., 0., 0., 85., 0., 69., 0.],
# [ 0., 89., 56., 0., 0., 0., 0.]])
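Alternatively, since DenseVector has a toArray() method, you can convert each vector explicitly before stacking; a minimal sketch using the same toy df:
import numpy as np

# Collect to the driver, convert each DenseVector, then stack into a matrix:
mat = np.array([row.features.toArray() for row in df.select('features').collect()])
print(mat.shape)  # (3, 7)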

Argmax on a tensor and ceiling in Tensorflow

Suppose I have a tensor in TensorFlow whose values are like:
A = [[0.7, 0.2, 0.1],[0.1, 0.4, 0.5]]
How can I change this tensor into the following:
B = [[1, 0, 0],[0, 0, 1]]
In other words, I want to keep only the maximum in each row, replacing it with 1 and everything else with 0.
Any help would be appreciated.
I think that you can solve it with a one-liner:
import tensorflow as tf
import numpy as np
x_data = [[0.7, 0.2, 0.1],[0.1, 0.4, 0.5]]
# I am using hard-coded dimensions for simplicity
x = tf.placeholder(dtype=tf.float32, name="x", shape=(2,3))
session = tf.InteractiveSession()
session.run(tf.one_hot(tf.argmax(x, 1), 3), {x: x_data})
The result is the one that you expect:
Out[6]:
array([[ 1., 0., 0.],
[ 0., 0., 1.]], dtype=float32)
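For reference, in TensorFlow 2.x (eager execution, no placeholders or sessions) the same one-liner would be, as a minimal sketch:
import tensorflow as tf

A = tf.constant([[0.7, 0.2, 0.1], [0.1, 0.4, 0.5]])
B = tf.one_hot(tf.argmax(A, axis=1), depth=3)
print(B.numpy())
# [[1. 0. 0.]
#  [0. 0. 1.]]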