With TensorFlow, it is straightforward to learn from training examples whose data contains numeric values. For example:
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
However, does it also work with string category values? For example:
x_train = ["sunny", "rainy", "sunny", "cloudy"]
y_train = ["go outside", "stay inside", "go outside", "go outside"]
If it does not, I have to assume that TensorFlow has some methodology for working with categorical values, perhaps a clever trick such as converting them to numeric values in a systematic way.
Yes, TensorFlow does support datasets with categorical features. Perhaps the easiest way to work with them is to use the Feature Column API, which provides methods such as tf.feature_column.categorical_column_with_vocabulary_list() (for dealing with small, known sets of categories) and tf.feature_column.categorical_column_with_hash_bucket() (for dealing with large and potentially unbounded sets of categories).
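For example, here is a minimal sketch of the vocabulary-list approach, assuming TF 2.x; the feature name "weather" and its vocabulary are made up from the example above:
import tensorflow as tf
import numpy as np

# Hypothetical feature dict built from the x_train example above
features = {"weather": np.array(["sunny", "rainy", "sunny", "cloudy"])}

# Map the small, known vocabulary to integer ids
weather = tf.feature_column.categorical_column_with_vocabulary_list(
    key="weather", vocabulary_list=["sunny", "rainy", "cloudy"])

# One-hot encode the categorical column so a model can consume it
weather_one_hot = tf.feature_column.indicator_column(weather)

# DenseFeatures turns the feature dict into a dense numeric tensor
dense_layer = tf.keras.layers.DenseFeatures([weather_one_hot])
print(dense_layer(features))  # shape (4, 3): one one-hot row per example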
What I mean by the title is that I sometimes come across code that requires numpy operations (for example sum or average) along a specified axis. For example:
np.sum([[0, 1], [0, 5]], axis=1)
I can grasp this concept, but do we actually ever perform these operations along higher dimensions as well? Or is that not a thing? And if so, how do you gain intuition for high-dimensional datasets, and how do you make sure you are working along the right dimension/axis?
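A concrete sketch of what I mean on a rank-3 array (made-up numbers, just to illustrate the axis argument):
import numpy as np

x = np.arange(24).reshape(2, 3, 4)  # a rank-3 array

# axis=0 collapses the first dimension: result shape (3, 4)
print(np.sum(x, axis=0).shape)   # (3, 4)

# axis=-1 collapses the last dimension: result shape (2, 3)
print(np.sum(x, axis=-1).shape)  # (2, 3)

# np.mean works the same way: average over axis 1 within each of the 2 blocks
print(np.mean(x, axis=1).shape)  # (2, 4)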
I find myself doing the following quite frequently and am wondering if there's a "canonical" way of doing it.
I have an ndarray, say of shape (100, 4, 6), and I want to reduce it to (100, 24) by concatenating the 4 vectors of length 6 into one vector.
I can use reshape to do this, but I've been manually computing the new shape, i.e.
np.reshape(x, (x.shape[0], x.shape[1] * x.shape[2]))
Ideally I'd simply supply the dimension I want to reduce on:
np.concatenate(x,dim=-1)
but np.concatenate operates on a sequence of ndarrays. I've wondered whether it's possible to supply an iterator over an ndarray axis, but haven't looked further. What is the usual pattern here?
You can avoid calculating one dimension by using -1, like:
x.reshape(x.shape[0], -1)
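For example, assuming x has the (100, 4, 6) shape from the question:
import numpy as np

x = np.random.rand(100, 4, 6)

# -1 tells reshape to infer the remaining dimension (here 4 * 6 = 24)
y = x.reshape(x.shape[0], -1)
print(y.shape)  # (100, 24)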
I have some pretty large arrays to deal with. By large, I mean on the scale of (514, 514, 374). I want to randomly get an index based on its pixel value. For example, I need the 3-D index of a pixel with value equal to 1. So I list all the possibilities with
indices = np.asarray(np.where(img_arr == 1)).T
This works perfectly, except that it runs very slowly, to an intolerable extent, since the array is so big. So my question is: is there a better way to do this? It would be nicer if I could input a list of pixel values and get back a list of corresponding indices. For example, I want to sample the indices of the pixel values [0, 1, 2] and get back a list of indices like [[1, 2, 3], [53, 215, 11], [223, 42, 113]].
Since I am working with medical images, solutions using SimpleITK are also welcome. Feel free to share your opinions, thanks.
import numpy as np
value = 1
# value_list = [1, 3, 5] you can also use a list of values -> *
n_samples = 3
n_subset = 500
# Create an example array
img_arr = np.random.randint(low=0, high=5, size=(10, 30, 20))
# Choose randomly indices for the array
idx_subset = np.array([np.random.randint(low=0, high=s, size=n_subset) for s in img_arr.shape]).T
# Get the values at the sampled positions
values_subset = img_arr[tuple(idx_subset[:, i] for i in range(img_arr.ndim))]
# Check which values match
idx_subset_matching_temp = np.where(values_subset == value)[0]
# idx_subset_matching_temp = np.argwhere(np.isin(values_subset, value_list)).ravel() -> *
# Get all the indices of the subset with the correct value(s)
idx_subset_matching = idx_subset[idx_subset_matching_temp, :]
# Shuffle the array of indices
np.random.shuffle(idx_subset_matching)
# Only keep as much as you need
idx_subset_matching = idx_subset_matching[:n_samples, :]
This gives you the desired samples. The distribution of those samples should be the same as if you were using your method of looking at all matches in the array. In both cases you get a uniform distribution over all the positions with matching values.
You have to be careful when choosing the size of the subset and the number of samples you want. The subset must be large enough that there are enough matches for the values, otherwise it won't work.
A similar problem occurs if the values you want to sample are very sparse: then the size of the subset needs to be very large (in the edge case, the whole array) and you gain nothing.
If you are sampling often from the same array, it may also be a good idea to store the indices for each value
indices_i = np.asarray(np.where(img_arr == i)).T
and use those for your further computations.
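A sketch of that precomputation, assuming the values of interest are known up front (array size shrunk here for illustration):
import numpy as np

img_arr = np.random.randint(0, 5, size=(50, 50, 40))  # small stand-in array

# Precompute the (x, y, z) indices once per value of interest
indices_by_value = {v: np.asarray(np.where(img_arr == v)).T for v in [0, 1, 2]}

# Later, sampling random indices for a value is cheap
n_samples = 3
idx = indices_by_value[1]
samples = idx[np.random.choice(len(idx), size=n_samples, replace=False)]
print(samples)  # three (x, y, z) indices where img_arr == 1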
I'm confused by the dimension of a tensor created with tf.zeros(n). For instance, if I write tf.zeros(6).eval().shape, this returns (6,). What dimension is this? Is this a matrix of 6 rows and an arbitrary number of columns? Or is this a matrix of 6 columns with an arbitrary number of rows?
weights = tf.random_uniform([3, 6], minval=-1, maxval=1, seed=1) - this is a 3x6 matrix
b = tf.zeros(6).eval() - I'm not sure what dimension this is.
Why am I able to add the two like weights + b? If I understand correctly, in order for the two to be added, b needs to have dimension 3x1.
Why am I able to add the two like weights + b?
The + operator is the same as using tf.add() (<obj>.__add__() calls tf.add(), i.e. tf.math.add()), and if you read the documentation it says:
NOTE: math.add supports broadcasting. AddN does not. More about broadcasting here
Now I'm quoting from numpy broadcasting rules (which are the same for tensorflow):
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
they are equal, or
one of them is 1
So you're able to add two tensors with different shapes because they have the same trailing dimensions. If you change the dimension of your weights tensor to, let's say [3, 5], you will get InvalidArgumentError exception because trailing dimensions differ.
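A quick sketch of the shapes involved (assuming TF 2.x eager execution, where tf.random_uniform is spelled tf.random.uniform):
import tensorflow as tf

weights = tf.random.uniform([3, 6], minval=-1, maxval=1, seed=1)
b = tf.zeros(6)              # shape (6,)

print((weights + b).shape)   # (3, 6): b is broadcast across the 3 rows

bad = tf.zeros(5)
# weights + bad              # raises InvalidArgumentError: trailing dims 6 vs 5 differ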
(6,) is Python syntax for a tuple with 6 as its single element. Hence the shape here is a one-dimensional vector of length 6.
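For example:
shape = (6,)
print(type(shape), len(shape))  # <class 'tuple'> 1 -> one dimension, of size 6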
I have two tensors, a of rank 4 and b of rank 1. I'd like to produce aprime, of rank 3, by "contracting" the last axis of a away, by replacing it with its dot product against b. In numpy, this is as easy as np.tensordot(a, b, 1). However, I can't figure out a way to do this in Tensorflow.
How can I replace the last axis of a tensor with a value equal to that axis's dot product against another tensor (of course, of the same shape)?
UPDATE:
I see in Wikipedia that this is called the "Tensor Inner Product" https://en.wikipedia.org/wiki/Dot_product#Tensors aka tensor contraction. It seems like a common operation; I'm surprised that there's no explicit support for it in Tensorflow.
I believe that this may be possible via tf.einsum; however, I have not been able to find a generalized way to do this that works for tensors of any rank (this is probably because I do not understand einsum and have been reduced to trial and error)
Aren't you just using tensor in the sense of a multidimensional array? Or in some disciplines a tensor is 3-d (vector 1-d, matrix 2-d, etc.). I haven't used tensorflow, but I don't think it has much to do with tensors in that linear algebra sense. They talk about data flow graphs. I'm not sure where the tensor part of the name comes from.
I assume you are talking about an expression like:
In [293]: A=np.tensordot(np.ones((5,4,3,2)),np.arange(2),1)
resulting in a (5,4,3) shape array. The einsum equivalent is
In [294]: B=np.einsum('ijkl,l->ijk',np.ones((5,4,3,2)),np.arange(2))
np.einsum implements Einstein notation, as discussed here: https://en.wikipedia.org/wiki/Einstein_notation. I got this link from https://en.wikipedia.org/wiki/Tensor_contraction
You seem to be talking about straight forward numpy operations, not something special in tensorflow.
I would first add 3 dimensions of size 1 to b so that it can be broadcast against the 4th dimension of a.
b = tf.reshape(b, (1, 1, 1, -1))
Then you can multiply b and a and it will broadcast b along all of the other dimensions.
a_prime = a * b
Finally, sum along the 4th dimension to get rid of that dimension and replace it with the dot product.
a_prime = tf.reduce_sum(a_prime, [3])
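Putting the three steps together (a sketch assuming TF 2.x eager execution, with shapes chosen to mirror the numpy example above):
import numpy as np
import tensorflow as tf

a = tf.ones([5, 4, 3, 2])
b = tf.range(2, dtype=tf.float32)

b_r = tf.reshape(b, (1, 1, 1, -1))          # shape (1, 1, 1, 2), broadcastable against a
a_prime = tf.reduce_sum(a * b_r, axis=[3])  # shape (5, 4, 3)

# Should agree with numpy's contraction
print(np.allclose(a_prime.numpy(), np.tensordot(a.numpy(), b.numpy(), 1)))  # True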
This seems like it would work (for the first tensor being of any rank):
tf.einsum('...i,i->...', x, y)
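For example, assuming x has rank 4 and y has rank 1 with a matching last dimension:
import tensorflow as tf

x = tf.ones([5, 4, 3, 2])
y = tf.range(2, dtype=tf.float32)

a_prime = tf.einsum('...i,i->...', x, y)
print(a_prime.shape)  # (5, 4, 3): the last axis has been contracted away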