Running cupy.histogram on given axis - cupy

I want to run cupy.histogram() in parallel on a 3D tensor of size (1000,10), where the histogram is computed over the features ( ,10) for each instance. I want to avoid using a for-loop.
Is there any way to do this? Any help would be appreciated.
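One loop-free approach that is sometimes used for this kind of per-row histogram is to compute a bin index for every element, offset it by the row number, and count everything with a single bincount call. The sketch below is written against NumPy so it runs anywhere; it assumes equal-width bins shared by all rows, and it assumes that replacing the import with `import cupy as np` works because only linspace, searchsorted, where, arange and bincount are used. The function name `rowwise_histogram` is made up for illustration.

```python
import numpy as np  # assumption: `import cupy as np` provides the same calls on the GPU


def rowwise_histogram(x, bins, value_range):
    """Histogram every row of a 2-D array with shared equal-width bins, without a loop."""
    n_rows = x.shape[0]
    lo, hi = value_range
    edges = np.linspace(lo, hi, bins + 1)

    # Bin index of each element: edges[idx] <= value < edges[idx + 1]
    idx = np.searchsorted(edges, x, side="right") - 1
    # Values equal to the upper limit belong to the last bin (same rule as histogram)
    idx = np.where(x == hi, bins - 1, idx)
    # Elements outside [lo, hi] are ignored
    valid = (idx >= 0) & (idx < bins)

    # Shift each row's indices into its own block of `bins` slots, then count once
    rows = np.arange(n_rows)[:, None]
    flat = (rows * bins + idx)[valid]
    counts = np.bincount(flat, minlength=n_rows * bins)
    return counts.reshape(n_rows, bins), edges
```

For an input of shape (1000, 10) and `bins=5`, the result has shape (1000, 5), one histogram per instance.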


Predict a nonlinear array based on 2 features with scalar values using XGBoost or equivalent

So I have been looking at XGBoost as a place to start with this; however, I am not sure of the best way to accomplish what I want.
My data is set up something like this
Every value, whether input or output, is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method, however I am only really seeing it used to predict around 2 outputs per 1 input, whereas my data may have upwards of 50 output points needing to be predicted with only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking at mainly these two demos (they are the same idea just one is scikit and the other xgboost)
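One idea that may be worth trying, offered only as a sketch and not as something taken from the demos mentioned above: instead of predicting all ~50 outputs at once, treat the position along the output array as an extra input feature, so that every (inputs, position) pair maps to a single scalar target. Any single-output regressor could then be trained on the reshaped data. The helper name `to_long_format` and the array shapes are assumptions made for illustration.

```python
import numpy as np


def to_long_format(X, Y):
    """Turn (n_samples, n_features) inputs and (n_samples, n_outputs) targets
    into one row per (sample, output position) pair."""
    n_samples = X.shape[0]
    n_outputs = Y.shape[1]

    # Repeat each input row once for every output position
    X_rep = np.repeat(X, n_outputs, axis=0)
    # Output position 0..n_outputs-1, cycling once per sample
    pos = np.tile(np.arange(n_outputs), n_samples)[:, None]

    X_long = np.hstack([X_rep, pos])  # original features + position column
    y_long = Y.ravel()                # row-major order matches the repeat/tile layout
    return X_long, y_long
```

With 3 scalar features and 50 outputs per sample, each sample becomes 50 training rows with 4 features. Whether this works better than a true multi-output model depends on how smoothly the outputs vary with position.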

Implement CVAE for a single image

I have a multi-dimensional, hyperspectral image (channels, width, height = 15, 2500, 2500). I want to compress its 15 channels into 5 channels, so the output would be (channels, width, height = 5, 2500, 2500). One simple way to do this is to apply PCA. However, its performance is not very good, so I want to use a Variational AutoEncoder (VAE).
The available examples for TensorFlow or Keras show a Convolutional Variational AutoEncoder (CVAE) being used to cluster a whole set of images.
However, I have a single image. What is the best practice for implementing a CVAE in this case? Is it to generate sample images with a moving-window approach?
One way of doing it would be to have a CVAE that takes as input (and output) the values of all the spectral features for each spatial coordinate (the stacks circled in red in the picture). In the case of your image, you would have 2500*2500 = 6250000 input data samples, each a vector of length 15, and the middle layer would be a vector of length 5. Instead of the 2D convolutions normally used along the spatial domain of images, it would make sense here to use 1D convolutions over the spectral domain (since the values of neighbouring wavelengths are also correlated). But I think using only fully-connected layers would also make sense.
As a disclaimer, I haven't seen CVAEs used in this way before, but this way you would also get many data samples, which is needed for the learning to generalise well.
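The per-pixel view described above amounts to a reshape. Here is a minimal NumPy sketch of going from a (channels, width, height) image to one spectrum per row and back; the function names are invented for illustration, and the encoder that would turn each length-15 vector into a length-5 code is not shown.

```python
import numpy as np


def image_to_spectra(img):
    """(channels, width, height) -> (width * height, channels): one spectrum per pixel."""
    c, w, h = img.shape
    return img.reshape(c, w * h).T


def codes_to_image(codes, width, height):
    """(width * height, n_codes) -> (n_codes, width, height): inverse of the layout above."""
    return codes.T.reshape(codes.shape[1], width, height)
```

For the image in the question, `image_to_spectra` would return an array of shape (6250000, 15), and the length-5 codes produced by the encoder would be passed to `codes_to_image` to obtain the (5, 2500, 2500) result.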
Another option would indeed be what you suggested: to generate the samples (patches) using a moving window (maybe with a stride of half the patch size). Even though you wouldn't necessarily get enough data samples for the CVAE to generalise well to all HSI images, I guess it doesn't matter if it overfits, since you want to use it on that same image.
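The moving-window option can be sketched in a few lines of NumPy. The function name `extract_patches` and the default stride are assumptions for illustration; edge regions that do not fit a whole patch are simply dropped here.

```python
import numpy as np


def extract_patches(img, patch, stride=None):
    """Cut a (channels, width, height) image into square patches.

    Returns an array of shape (n_patches, channels, patch, patch).
    """
    if stride is None:
        stride = patch // 2  # half the patch size, so neighbouring patches overlap by half
    _, w, h = img.shape
    patches = [
        img[:, i:i + patch, j:j + patch]
        for i in range(0, w - patch + 1, stride)
        for j in range(0, h - patch + 1, stride)
    ]
    return np.stack(patches)
```

Smaller strides give more (but more strongly correlated) training samples from the same single image.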

Tensorboard: Filter out certain time steps

Is there a way to filter out the first X timesteps when visualizing the histograms and scalars, only showing steps > X?
Much like the zooming feature for scalars, but something which updates as time progresses, and something which works for histograms too?

Fitting Large Matrix Calculations into Memory when using Tensorflow

I am attempting to build a model which has two phases.
The first takes an input image and passes it through a conv-deconv network. The resulting Tensor has entries corresponding to pixels in a desired output image (same size as the input image).
To calculate the final output image I want to take the value generated at each pixel location from the first phase and use it as an additional input to a reduction function that is applied over the entire input image. This second step has no trainable variables, but it does have computation/memory costs that grow quadratically with the number of pixels in the input (each output pixel is a function of all input pixels).
I'm currently using tf.map_fn to calculate the output image, mapping the output pixel calculation function onto the results from the first phase. My hope was that TensorFlow would allocate the memory to store the intermediate tensors needed for each pixel calculation and then free that memory before moving on to the next pixel calculation. Instead, it seems never to free the intermediate calculations, causing OOM errors.
Is there some way to tell TensorFlow (either explicitly or implicitly) that it should free the memory allocated to hold the data of a Tensor that is no longer needed in the calculation?
TensorFlow deallocates memory for the tensor as soon as the tensor is no longer needed for any future calculations. You can verify this by looking at memory deallocation messages as shown in this notebook.
It's possible you are running out of memory because TensorFlow executes nodes in a memory-inefficient order.
As an example, consider the following computation:
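import tensorflow as tf  # TensorFlow 1.x API

n = 20  # number of chained matmuls (arbitrary example value)
k = 2000
a = tf.random_uniform(shape=(k, k))
for i in range(n):
    a = tf.matmul(a, tf.random_uniform(shape=(k, k)))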
The order in which it is evaluated is shown below.
All the circle (tf.random_uniform) nodes are evaluated first, followed by the squares (tf.matmul). This has an O(n) memory requirement, compared to O(1) for the optimal order.
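To make the O(n) versus O(1) claim concrete, here is a toy model in plain Python that counts how many tensors are alive at once under the two execution orders. It is only an illustration under simplifying assumptions (every tensor has the same size, and a tensor is freed right after its single consumer runs); it does not reproduce TensorFlow's actual scheduler, and all names in it are made up.

```python
def peak_live_tensors(schedule):
    """Return the maximum number of tensors alive at once for a given execution order.

    Each step is ("rand", name) or ("matmul", out, lhs, rhs).
    """
    live, peak = set(), 0
    for step in schedule:
        if step[0] == "rand":
            live.add(step[1])
            peak = max(peak, len(live))
        else:
            _, out, lhs, rhs = step
            live.add(out)                # output exists while both inputs are still held
            peak = max(peak, len(live))
            live.discard(lhs)            # inputs have no other consumer, so free them
            live.discard(rhs)
    return peak


def build_schedules(n):
    """Build the 'all random tensors first' order and the interleaved order."""
    rands = [("rand", "r%d" % i) for i in range(n + 1)]
    matmuls, prev = [], "r0"
    for i in range(1, n + 1):
        out = "a%d" % i
        matmuls.append(("matmul", out, prev, "r%d" % i))
        prev = out
    all_rand_first = rands + matmuls
    interleaved = [rands[0]]
    for i in range(1, n + 1):
        interleaved += [rands[i], matmuls[i - 1]]
    return all_rand_first, interleaved
```

In this model, the first order keeps a number of tensors alive that grows linearly with n, while the interleaved order never needs more than three at a time regardless of n.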
You can use control dependencies to force a specific execution order, for example with a helper function like the one below:
import tensorflow.contrib.graph_editor as ge

def run_after(a_tensor, b_tensor):
    """Force a to run after b"""
    ge.reroute.add_control_inputs(a_tensor.op, [b_tensor.op])

LabVIEW cos fitting

I am working on a program that needs to fit numerous cosine waves in order to determine one of the parameters of the function. The equation that I am using is y = y_0 + A*cos((4*pi*L)/x + pi), where L is the value that I am trying to obtain from the best fit line.
I know that it is possible to do this correctly by hand for each set of data, but what is the best way to automate this process? I am currently reading in the data from text files and running a loop with the initial parameters changing until I have an array of parameter values that have an amplitude similar to the data. Then I check the percent difference between points on the center peak and two end peaks to try to pick the best one. It is consistently picking lower values than what I get when fitting by hand (almost exactly one phase off). So is there a way to improve this method, or another method that works better?
Edit: My LabVIEW version has a cos fitting VI, which is what I am using. The problem is that when I try to automate the fitting by changing the initial parameters in a loop, I can't figure out how to get the program to pick the same best fit line as a human would pick.
Why not just use a Fast Fourier Transform? This should be much faster than fitting a cosine. In the result vector of complex numbers, look for the largest peak in the magnitudes. Its position in the FFT result vector gives you the frequency, and the complex value there gives the amplitude and phase.
You can evaluate the goodness of the fit by computing the difference between the fitted curve and your data. There is a VI for this in the "Advanced curve fitting" palette. Then all you have to do is pick the best fit.
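Since LabVIEW is graphical, the two ideas above are sketched here in Python with NumPy purely to show the arithmetic. One assumption that goes beyond the answer: the model y = y_0 + A*cos(4*pi*L/x + pi) is periodic in u = 1/x (with frequency 2L), not in x, so the sketch assumes the data has been resampled onto a uniform grid in u before the FFT. The function names are invented, and the FFT estimate is only as fine as the frequency resolution of the window, so it may be best treated as an initial guess for the fitting VI.

```python
import numpy as np


def estimate_L_fft(u, y):
    """Estimate L from samples y taken on a uniform grid u = 1/x.

    In terms of u the model is cos(2*pi*(2L)*u + pi), so L is half the peak frequency.
    """
    y = y - y.mean()                           # remove the offset y_0
    spectrum = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(u), d=u[1] - u[0])
    k = np.argmax(np.abs(spectrum[1:])) + 1    # largest peak, skipping the DC bin
    return freqs[k] / 2.0


def residual_sum_of_squares(x, y, y0, A, L):
    """Goodness of fit: sum of squared differences between model and data."""
    model = y0 + A * np.cos(4 * np.pi * L / x + np.pi)
    return np.sum((y - model) ** 2)
```

Candidate fits started from different initial parameters can then be ranked by `residual_sum_of_squares`, keeping the one with the smallest value.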