Best way to convert TensorProto to TensorFlow tensor - tensorflow

As far as I can tell, there are at least two different ways to recover a Tensor from a TensorProto in Tensorflow 2.3. Say, for the sake of example, that we have
tensor = tf.range(10)
tproto = tf.make_tensor_proto(tensor)
Then:
You can use tf.make_ndarray like so
tf.constant(tf.make_ndarray(tproto))
Or you can use tf.io.parse_tensor like so
tf.io.parse_tensor(tproto.SerializeToString(), out_type=tf.int32)
I feel both of these are a bit artificial, since in the former you end up with an intermediate numpy array, and in the latter you have to serialize the TensorProto to a string and parse it back. Additionally, parse_tensor won't automatically recover the correct data type from the TensorProto. So:
Is there a function to do the conversion in a single step? I'd like to see something like tf.from_tensor_proto doing the conversion all at once optimizing for speed and memory allocation (or, if tf.constant(tf.make_ndarray(tproto)) is the best you can do, just wrapping this up).
Otherwise, which of the two options above should be preferred (in terms of efficiency, memory usage, etc.)?

Related

label encoding in dask_cudf dataframe

I am trying to use dask_cudf to preprocess a very large dataset (150,000,000+ records) for multi-class xgboost training and am having trouble encoding the class column (dtype is string). I tried using the 'replace' function, but the error message said the two dtypes must match. I tried using dask_ml.LabelEncoder, but it said string arrays aren't supported in cudf. I tried using compute() in various ways, but i kept running into out-of-memory errors (i'm assuming because operations on cudf dataframe require a smaller dataset). I also tried pulling the class column out, encoding, and then merging it back with the dataframe, but the partitions do not line up. I tried manually lining them up, but dask_cudf seemingly does not support repartioning using 'divisions' parameter (got error saying something like 'old and new partitions do not match'). Any help on how to do this would be much appreciated.
Strings aren't supported on xgboost. Not having seen your data, here are a few ways quick and dirty ways I've modified string columns to train, as generally strings may not matter:
If the strings were actually numeric (like dates), converting to int (int8 int16, int32)
I did this by hashmapping the strings and then running xgboost (basically creating a reversible conversion between string and integer as long as you don't change the integer) and train on your current, now hashed as an integer, column.
if the strings are classes, manually naming class numbers (0,1,2,...,n) in a new column and train on that one.
There are definitely other, better ways. As for the second part of your question, left a comment.
Now, your XGBoost model and your dask-cudf dataframe per-GPU allocation must fit on a single GPU, or you will get memory errors. If your model will be considering a large amount of data, please train on the largest GPU memory sized cluster you can. A100s can have 40GB and 80GB. Some older compute GPUs, V100 and GV100 have 32GB. A6000 and RTX8000 have 48GB. then it goes to 24, 16, and lower from there. Please size your GPUs accordingly

numpy-thonic way to store data on already-allocated data

I have a code where I have to iterate creation of several GB of data with np.tile and np.repeat.
After few iterations the code goes out of memory. Since each tile and repeat is used only within inside an iteration, I am thinking on how to save memory.
Ideally, in order to reuse memory, I would like to do something like this:
large_matrix = np.zeros(N*M)
for data in generator:
np.repeat(data, M, out = large_matrix)
[...] #here I will use large matrix
Unfortunately there is no such keyword out on np.repeat and I had create my own njit(parallel=True) numba functions to replicate numpy repeat function.
However, before I start rewriting many other numpy functions in numba, my question is: what is the numpy-thonic way to store numpy results on already existing arrays so to keep memory usage under control?
Numpy's in-place assignment is large_matrix[:]=np.repeat(data,M)
better - encapsulate the inside of your for-loop as a function (e.g. def process(data):). This way, all matrices except for the returned outputs are freed when the iteration is done. If the outputs are big, write them down to disk instead of accumulating them on RAM.
It's very rare that tile or repeat can't be replaced with smart broadcasting.

Dealing with both categorical and numerical variables in a Multiple Linear Regression Python

So I have already performed a multiple linear regression in Python using LinearRegression from sklearn.
My independant variables were all numerical (and so was my dependant one)
But now I'd like to perform a multiple linear regression combining numerical and non numerical independant variables.
Therefore I have several questions:
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
If yes, do I have to change some parameters?
If not, how should I perform the Linear Regression?
One thing that bother me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because it shouldn't be encoded the same way in my opinion)
Problem is: Even if I want to encode diffently nominal and ordinal variables,
it seems impossible for Python to tell the difference between both of them?
This stuff might be easy for you but right now as you could tell I'm a little confused so I could really use your help !
Thanks in advance,
Alex
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
In fact the model has to be fed exclusively with numerical data, thus you must use OneHot vectors for the categorical data in your input features. For that you can take a look at Scikit-Learn's LabelEncoder and OneHotEncoder.
One thing that bother me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because it shouldn't be encoded the same way in my opinion)
Yes. As you mention one-hot methods don't deal with ordinal variables. One way to work with ordinal features is to create a scale map, and map those features to that scale. Ordinal is a very useful tool for these cases. You can feed it a mapping dictionary according to a predifined scale mapping as mentioned. Otherwise, obviously it randomly assigns integers to the different categories as it has no knowledge to infer any order. From the documentation:
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in, in this case we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
Hope this helps.

Update submatrix in Tensorflow

Quite simply, what I want to do is the following
A = np.ones((3,3)) #arbitrary matrix
B = np.ones((2,2)) #arbitrary matrix
A[1:,1:] = A[1:,1:] + B
except in Tensorflow (where the matrices can be arbitrarily complicated tensor expressions). Neither A nor B is a Tensorflow Variable, but just a run-of-the-mill tensor.
What I have gathered so far: tensors are immutable, so I cannot assign to a submatrix. tf.scatter_nd is the current option for sub-assignment, but does not appear to support sub-matrices, only slices.
Methods that should work, but are perhaps not ideal:
I could pad B with zeros, but I'm sure this leads to instantiation of
an unnecessarily large B - can it be made sparse, maybe?
I could use the padding idea, but write it as a low-rank decomposition, e.g. in Numpy: A+U.dot(B).U.T where U is a stacked zero and identity matrix. I'm not sure this is actually advantageous.
I could split A into submatrices, and stack them back together. Might be the most efficient, but sounds like the code would be convoluted.
Ideally, I want to do this operation N times for progressively smaller matrices, resulting in one large final result, but this is tangential.
I'll use one of the hacks for now, but I'm hoping someone can tell me what the idiomatic version is!

Understanding Numpy internals for profiling purposes

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg] and it's getting more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could not use matrices as much and just use 2d numpy arrays. I typically only use matrices for a short-time to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
But, to your questions:
1) There really is no short-cut to __getitem__ unless defmatrix over-rides __getslice__ which it could do but doesn't yet. There are the .item and .itemset methods which are optimized for integer getting and setting (and return Python objects rather than NumPy's array-scalars)
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C-function that every array-creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of sub-classes defined purely in Python, it is calling back into the Python interpreter from C which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
Question 1:
Since array slices can sometimes require a copy of the underlying data structure (holding the pointers to the data in memory) they can be quite expensive. If you're really bottlenecked by this in your above example, you can perform mean operations by actually iterating over the i to i+10 elements and manually creating the mean. For some operations this won't give any performance improvement, but avoiding creating new data structures will generally speed up the process.
Another note, if you're not using native types inside numpy you will get a Very large performance penalty to manipulating a numpy array. Say you're array has dtype=float64 and your native machine float size is float32 -- this will cost a lot of extra computation power for numpy and performance overall will drop. Sometimes this is fine and you can just take the hit for maintaining a data type. Other times it's arbitrary what type the float or int is stored as internally. In these cases try dtype=float instead of dtype=float64. Numpy should default to your native type. I've had 3x+ speedups on numpy intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to SciPy. Thus this is a result described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.