Partially off/loading slices of a tensor to/from disk with jax - lazy-loading

I'm working with really big tensors, but I only use small rectangular regions at a time. I don't want to split the tensor apart, since my window is constantly sliding forward. I'd like to offload the parts of the tensor that are not in my window to disk, since there's no way it's going to fit in memory. Is there a simple way to do this?
Bigger Picture
My code looks like this:
def act(self, obs):
    t = obs.t
    self.vision[FRAME_RATE*t: FRAME_RATE*(t+INTERVAL)] = obs.video
    self.hearing[AUDIO_SAMPLE_RATE*t: AUDIO_SAMPLE_RATE*(t+INTERVAL)] = obs.sound
    self.action[t] = self.model(self.vision[-VIDEO_WINDOW:], self.hearing[-AUDIO_WINDOW:])
    return self.action[t]
which requires working with really long tensors, and it seems simpler to just write or find a library that offloads to the disk behind the scenes instead of chunking it myself.
I found the lazytensors library, but I'm looking for a JAX solution.
I considered manually slicing my tensor, but allocating, copying, and releasing a slice on every window move seems wasteful.
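To make it concrete, something like this is what I have in mind (just a rough sketch using numpy.memmap with made-up shapes and constants, since I don't know of a JAX-native equivalent; the slicing still copies data to the device, but only the current window):

import numpy as np
import jax.numpy as jnp

# Made-up constants standing in for the real ones above.
FRAME_RATE, VIDEO_WINDOW = 30, 300
TOTAL_FRAMES = FRAME_RATE * 3600  # stands in for something far bigger than RAM

# Disk-backed buffer: only the pages that are actually touched get read or
# written, so the full array never has to live in memory at once.
vision = np.memmap("vision.dat", dtype=np.float32, mode="w+",
                   shape=(TOTAL_FRAMES, 64, 64, 3))

def store(t, frames):
    # Write the new observation into its slot; it ends up on disk.
    vision[FRAME_RATE * t : FRAME_RATE * t + len(frames)] = frames

def current_window(end):
    # Copy just the sliding window onto the device as a jax array
    # (assumes end >= VIDEO_WINDOW).
    return jnp.asarray(vision[end - VIDEO_WINDOW : end])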
Thanks for any help you're able to provide!

Related

Implement data generator in federated training

(I have posted this question at https://github.com/tensorflow/federated/issues/793 as well as here.)
I have adapted my own data and model to the federated interfaces and the training converges. But I am confused about one issue: in an image-classification task the whole dataset is extremely large, and it can neither be stored in a single federated_train_data nor loaded into memory in one go. So I need to load the dataset from disk into memory in batches at training time, the way one would use Keras model.fit_generator instead of model.fit to deal with large data.
I suppose that in the iterative_process shown in the image-classification tutorial, the model is fitted on a fixed set of data. Is there any way to adjust the code so that it fits on a data generator? I have looked into the source code but am still quite confused. I would be incredibly grateful for any hints.
Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.
In fact, when writing TFF, there are generally three levels at which one may be writing:
1. TensorFlow, defining local processing (i.e., processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only at a single placement).
2. Native TFF, defining the way data is communicated across placements. For example, writing tff.federated_sum inside of a tff.federated_computation decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".
3. Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.
If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data in to a federated computation becomes relatively simple; it is just done at the Python level.
One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then instantiate a new list of tf.data.Datasets and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop) would be here, in the code associated with this paper.
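To make the Python-driver pattern concrete, a minimal sketch of such a loop might look like the following (client_data and model_fn are assumed to be defined as in the tutorial, the round and client counts are placeholders, and the exact constructor arguments may differ across TFF versions):

import random
import tensorflow_federated as tff

# Assumed to already exist: `client_data` (a tff.simulation ClientData) and
# `model_fn` (as in the image-classification tutorial).
iterative_process = tff.learning.build_federated_averaging_process(model_fn)
state = iterative_process.initialize()

NUM_ROUNDS = 10         # placeholder values
CLIENTS_PER_ROUND = 5

for round_num in range(NUM_ROUNDS):
    # Python "driver loop": pick this round's clients...
    sampled_ids = random.sample(client_data.client_ids, CLIENTS_PER_ROUND)
    # ...and build their datasets lazily; nothing is pulled from disk here.
    federated_train_data = [
        client_data.create_tf_dataset_for_client(cid).batch(32)
        for cid in sampled_ids
    ]
    # The datasets are only iterated (and hence loaded) inside `next`.
    state, metrics = iterative_process.next(state, federated_train_data)
    print(round_num, metrics)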
One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded in when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of tf.data.Dataset more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource for this construct; hopefully that may help the mental model on what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't really load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.
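A tiny, self-contained illustration of the "recipe" point (not TFF-specific; the generator here just stands in for expensive disk reads):

import tensorflow as tf

def gen():
    print("reading from disk...")   # stands in for expensive I/O
    for i in range(1000):
        yield i

# Constructing the dataset runs none of the generator code and loads nothing;
# it is only a recipe for producing the data.
ds = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))

# Iteration is what actually executes the recipe and pulls data in.
for x in ds.take(3):
    print(x.numpy())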

Best way to load 24-bit BGR image from memory to ID2D1Bitmap1

I have a BGR 24-bit image in memory as a continuous buffer (represented by cv::Mat, in case that is of any help). I would like to load it into an ID2D1Bitmap1 bitmap for 2D rendering. I have the following working code (shown here as pseudo-code):
IWICImagingFactory::CreateBitmapFromMemory(GUID_WICPixelFormat24bppBGR);
IWICFormatConverter::Initialize(GUID_WICPixelFormat32bppRGB);
ID2D1DeviceContext::CreateBitmapFromWicBitmap;
This works fine, the main issue being the time it takes: 20-40 milliseconds, which is too long for my application. I am looking for ways to optimize the process.
I can probably save the creation time of the ID2D1Bitmap1 by doing that once and then loading the converted image from memory using CopyFromMemory, but the conversion itself still takes a large amount of time. One way could be to load the raw BGR buffer into GPU memory and convert it to the native RGBA format on the GPU itself, but I have no idea how to start with that.
Your second idea is exactly the direction you should go. Create the ID2D1Bitmap(s) once, convert the buffer (more on that below), and use CopyFromMemory. I do something very similar in an app which provides a preview of a connected webcam (which may have one of various formats). Some cameras will deliver the images in MJPEG, YUY2, etc.
That app uses MediaFoundation, and an IMFTransform to convert the buffer. The IMFTransform in this case is an instance of CLSID_CColorConvertDMO (which uses SIMD registers/instructions when possible). However, prior to completing that, I tested with my own conversion code (which was CPU bound, and performed so-so), and another solution with HLSL and DirectCompute (performed well, but handled only one format). In the end I chose CLSID_CColorConvertDMO to handle the various types of input, but if you only have one known type, you may choose to use the HLSL solution (although it will cause you to have to write the conversion code, and setup the 'views' of the data).
However, if you choose the MediaFoundation route, you can use an IMFTransform without the rest of the graph (source, sink, etc). After creating the CLSID_CColorConvertDMO instance and setting the input and output types (format, frame size, etc), create an IMFSample (using MFCreateSample), create an IMFMediaBuffer (using MFCreateMemoryBuffer), and add it to the sample (using IMFSample::AddBuffer); then all that is necessary is to call ProcessInput and ProcessOutput to convert the buffer (create all items upfront). This may sound like a lot, but if done correctly, your CPU utilization will not take a noticeable hit, and you will achieve the performance you are looking for, even for capture cards that deliver large frames at 60+ FPS (having used DataPath and Blackmagic capture cards in the past).
Good luck. I am certain you will crush it.

Suggestions for optimizing a fractal visualization method

I've written up a variation on Melinda Green's Buddhabrot method for visualizing the Mandelbrot set. Here it is:
http://pastebin.com/RH6dD77F
To create an animation I rendered hundreds of the individual images with slight variations. The variation is a transformation of the coefficients of the generating function as if they were an abstract vector in a space of coefficients. All of that produced incredible structures in the video...
http://www.youtube.com/watch?v=S2uMAvL_5Fo
The problem? As you can tell, the quality on each image is rather low because it takes forever using the method I came up with (the copies I have on my computer are a little better quality, but still look like old reel-to-reel movies). I'm hoping to find a few methods for increasing quality or lowering output time.
Thanks for any suggestions. I would really like to produce more detailed versions of these. Obviously there is much more structure in the graininess of these images.
You can try something like box counting: http://imagej.nih.gov/ij/plugins/fraclac/FLHelp/BoxCounting.htm. If the Buddhabrot is some sort of Mandelbrot, you can skip some empty boxes. You can also use a kd-tree, as in lightmap packing, to subdivide the surface.
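For the "skip empty boxes" part, here is a rough sketch of what I mean (my own reading of the idea, with arbitrary grid and iteration numbers): do a cheap coarse pass that flags cells of the c-plane containing escaping points, since only escaping orbits contribute to a Buddhabrot, then spend the expensive sampling inside those cells.

import numpy as np

def escapes(c, max_iter=50):
    # Standard Mandelbrot escape test; only escaping points contribute to a Buddhabrot.
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return True
    return False

def interesting_cells(n=64, re_range=(-2.0, 1.0), im_range=(-1.5, 1.5)):
    # Coarse pass: keep cells whose centre escapes, skipping boxes that are
    # likely inside the set and would only burn full iteration budgets.
    cells = []
    for re in np.linspace(*re_range, n):
        for im in np.linspace(*im_range, n):
            if escapes(complex(re, im)):
                cells.append((re, im))
    return cells

cells = interesting_cells()
print(f"sampling only {len(cells)} of {64 * 64} coarse cells")
# The expensive pass would then draw random c values only from these cells
# (plus a small border, since the boundary is where the interesting orbits live).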

How to create a 'vector movie' out of data?

What file formats and software could I use to represent vector images over time as an animation, without compromising the advantages of the vector format?
Say I generate data that is best represented as a single point in the plane, moving over time. I would like to make an animation showing the motion of this point. One way to do this is to make a sequence of 2D bitmap images and string these together into an AVI file. But this produces either huge files (orders of magnitude larger than the underlying dataset) or very low quality animations. A stack of raster images is a very inefficient representation of the data.
A much better representation would be a sequence of 2D vector images. Vector images combine very high fidelity with small file size. But is it possible to string such images into an animation? What kind of software could be used to do so, starting from the underlying dataset?
I imagine a tool such as Adobe Flash could be used here, but this seems akin to making scatterplots from scratch in Illustrator: sure, it can be done and will look nice, but this is not how you make scatterplots. You use R, Excel or MATLAB, and then perhaps retouch the plot in a graphics program. I'm looking for a similarly efficient solution, but for making dynamic visualizations rather than plots.

CUDA: optimize latency due to iterative process

I have an iterative computation that involves a Fourier transform in each iteration.
At a high level it looks like this:
// executed on the host, calling functions that run on the device
B = image
L = 100
while (L--) {
    A = FFT_2D(B)
    A = SOME_PER_PIXEL_CALCULATION(A)
    B = INVERSE_FFT_2D(A)
    B = SOME_PER_PIXEL_CALCULATION(B)
}
I am using "cufft" library to do the transforms.
now the problem is that I am always working with global memory,
basically if there was a way of doing some of the work with shared memory it would be great,
but it seems like using FFT won't allow me to bypass this, given "cufft" library functions can only be called from the host, and stores input and output in global memory.
how should I tackle this?
thanks.
EDIT:
Since there IS a data dependency, it seems I can't do much besides optimize the 'per pixel' calculations.
The bottleneck is still the fact that the kernels pass the data via global memory, which seems unavoidable in this case.
So basically, having to do the transform and its inverse is what keeps me from sharing intermediate computation data.
Currently I am exploring ways of doing most of the calculation in frequency space (more of a math problem).
Does anyone have a good idea on how to approximate F{max(0, f(x,y))} given F{f(x,y)}?
EDIT:
Note that f(x,y) is in the time domain and therefore real-valued.
f(x,y) is also processed before the pointwise max(0, f(x,y)) is calculated, so negative values can indeed appear.
Concerning the FFT/IFFT, I think you are wrongly assuming that the cuFFT routines do not internally use shared memory. Typical FFT algorithms split the entire FFT into smaller ones that fit in one thread block, so they probably already exploit shared memory internally; see for example the paper.
Concerning the PER_PIXEL_CALCULATIONS, shared memory is typically used to make threads within a thread block cooperate with each other. My question is: are the PER_PIXEL_CALCULATIONS independent of each other? If so, thread cooperation is perhaps not needed; you would not need shared memory either, and you could arrange the calculations to use only registers.
Anyway, to be more specific on the latter point, you should provide more information on what you actually need (by editing your original post). Is your code related to an implementation of the Gerchberg-Saxton algorithm?