How to make an embedding column directly from a feature? - tensorflow

I'm learning the wide & deep model for CTR prediction. My data has a feature user_id which has more than 2**26 values. How can I get an embedding column for this feature? I used
user_id = tf.feature_column.categorical_column_with_hash_bucket('user_id', hash_bucket_size=2**26)
user_id_emb = tf.feature_column.embedding_column(user_id, dimension=95)
but it runs out of memory.

So, 2**26 is about 67 million (64 Mi) buckets. You want 95 embedding dimensions. Each will be a float32 by default, which is 4 bytes. That is 4 * 95 = 380 bytes per user_id. So you need about 67M * 380 ~= 25.5 GB of memory just to store the embedding.
Make sure you can allocate that much on your system. It should all be in RAM (swap will make everything much slower). If you place this on a GPU it will most likely not work, since most GPUs don't have that much memory available. An embedding of only 20 dimensions would use about 5 GB, which is more likely to fit in memory.
The easiest thing is to lower the number of embedding dimensions.
If you have multiple systems available you can shard the embedding (see the partitioner parameter of the variable-related functions).
Another thing you can do is cluster some user_ids together (lower the hash_bucket_size), or replace user_ids by a combination of other features that describes the user sufficiently for your model.
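The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name embedding_bytes is made up for illustration, and it counts the embedding weights alone (no gradients or optimizer state, which would add to the total).

```python
def embedding_bytes(num_buckets, dimension, bytes_per_value=4):
    """Size in bytes of a dense embedding table (float32 by default)."""
    return num_buckets * dimension * bytes_per_value

GB = 10**9

full = embedding_bytes(2**26, 95)            # the configuration from the question
small = embedding_bytes(2**26, 20)           # fewer embedding dimensions
fewer_buckets = embedding_bytes(2**22, 95)   # smaller hash_bucket_size

for name, size in [("full", full), ("small", small), ("fewer_buckets", fewer_buckets)]:
    print(f"{name}: {size / GB:.1f} GB")
```

Lowering either the dimension or the hash_bucket_size shrinks the table linearly, so the two knobs can be traded against each other.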

Does deeper LSTM need more units?

I'm applying an LSTM to time series forecasting with 20 lags. Suppose we have two cases: the first uses just five lags and the second (like my case) uses 20 lags. Is it correct that the second case needs more units than the first? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation on increasing the number of units here.
It is very difficult to give an exact answer, as the relationship between timesteps and number of hidden units is not an exact science. For example, the following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively little memory (i.e. it only requires remembering a few time steps), you wouldn't get much benefit from adding more neurons while increasing the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU) you might get different results (this is not always true but can happen for certain problems).
I'm sure there are other factors out there, but these are a few that came to my mind.
Proving that more units give better results when you have more time steps (if true)
That should be relatively easy, as you can try a few different options:
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
If you get better performance (e.g. lower MSE) on the 20-lag problem than on the 5-lag problem when you use 50 units, then you have made your point. You can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs).
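The experiment grid above could be laid out roughly like this. It is a sketch in plain Python that only builds the lagged training windows and enumerates the configurations; the series is placeholder data, make_windows is a made-up helper name, and the actual model training and scoring are left out.

```python
from itertools import product

def make_windows(series, n_lags):
    """Turn a 1-D series into (inputs, target) pairs with n_lags past values each."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t])
        targets.append(series[t])
    return inputs, targets

series = [float(i) for i in range(2000)]  # placeholder, stands in for the real data
lag_options = [5, 20]
unit_options = [10, 20, 50]

# One experiment per (lags, units) pair; train and score one model for each.
experiments = list(product(lag_options, unit_options))
for n_lags, n_units in experiments:
    X, y = make_windows(series, n_lags)
    print(f"lags={n_lags} units={n_units} training windows={len(X)}")
```

Note that using more lags leaves slightly fewer training windows from the same series, which matters when only 2000 samples are available.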

Explanation of parallel arguments of tf.while_loop in TensorFlow

I want to implement an algorithm which allows a parallel implementation in TensorFlow. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do and what their appropriate values are depending on the situation. Specifically, the documentation on TensorFlow's site https://www.tensorflow.org/api_docs/python/tf/while_loop says that parallel_iterations is the number of iterations allowed to run in parallel. Is this number the number of threads? When should someone allow CPU-GPU memory swapping, and for what reason? What are the advantages and disadvantages of this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to free up extra memory on the GPU device. Usually when you are training a model, some activations are saved in GPU memory for later use. With swap_memory, you can store those activations in CPU memory and use the GPU memory to fit e.g. larger batch sizes. This is the advantage. You would choose this if you need big batch sizes or have long sequences and want to avoid OOM exceptions. The disadvantage is computation time, since you need to transfer the data between CPU memory and GPU memory.
maximum_iterations works something like this:
num_iter = 0
while num_iter < 100 and <some condition>:
    do something
    num_iter += 1
So it is useful when you check a condition, but also want to have an upper bound (one example is to check if your model converges. If it doesn't you still want to end after k iterations.)
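As for parallel_iterations, I am not sure. The documentation describes it as the maximum number of loop iterations that are allowed to run in parallel, which is not necessarily the same thing as a number of threads. You can try it and see the effect in a sample script.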
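To make the role of maximum_iterations concrete, here is a plain-Python model of the behaviour described above. It is not TensorFlow code and not how tf.while_loop is implemented; bounded_while is a hypothetical helper that only mimics the "condition plus upper bound" semantics.

```python
def bounded_while(cond, body, state, maximum_iterations=None):
    """Run body while cond(state) holds, stopping after maximum_iterations at most."""
    num_iter = 0
    while cond(state) and (maximum_iterations is None or num_iter < maximum_iterations):
        state = body(state)
        num_iter += 1
    return state, num_iter

def not_converged(x):
    return x >= 1e-3

def halve(x):
    return x / 2

# Halve x until it drops below 1e-3, but give up after 5 iterations.
capped_x, capped_steps = bounded_while(not_converged, halve, 1.0, maximum_iterations=5)

# Same loop without the cap runs until the condition itself fails.
free_x, free_steps = bounded_while(not_converged, halve, 1.0)

print(capped_x, capped_steps)
print(free_x, free_steps)
```

The cap and the condition are independent: whichever stops first ends the loop.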

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter optimization script, for a convNN using Tensorflow.
As you may know, TF's handling of GPU memory isn't that fancy (I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batch_size = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.nn.conv2d function will need the following amount of memory:
image_width * image_height * number_of_channels * batch_size * filter_height * filter_width * filter_depth * 32 bit
since we use the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128 x 128 x 3 x 20 x 4 x 4 x 32 x 32 = 16106127360 (bits), which is about 16 Gbit, or roughly 2 GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net — it reports 40MB rather than 44MB, probably because it excludes the input — and you also have all the tensor sizes I mentioned earlier.
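The sum above can be reproduced in a few lines. This is a sketch with a made-up helper, tensor_mb; it assumes the output keeps the 128x128 spatial size (stride 1 with 'SAME' padding), as in the shapes listed above, and it ignores any workspace memory the convolution implementation may allocate.

```python
def tensor_mb(shape, bytes_per_value=4):
    """Size of a dense tensor in MB (1024**2 bytes), float32 by default."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value / 1024**2

input_mb = tensor_mb((20, 128, 128, 3))     # batch of input images
kernel_mb = tensor_mb((4, 4, 3, 32))        # convolution kernel
output_mb = tensor_mb((20, 128, 128, 32))   # feature maps
total_mb = input_mb + kernel_mb + output_mb

print(f"input {input_mb:.2f} MB, kernel {kernel_mb:.3f} MB, output {output_mb:.2f} MB")
print(f"total {total_mb:.1f} MB")
```

The output tensor dominates, and it scales linearly with the batch size and the number of filters, so those are the two values to watch during a hyperparameter search.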

headache for clEnqueueNDRangeKernel local work size

For OpenCL optimization, my idea is to match
1 work-group (kernel code) to 1 compute unit (GPU hardware)
1 work-item (kernel code) to 1 processing element (GPU hardware)
(Maybe my idea is not correct, please correct me.)
For example:
1. I have a global work size of 4000 by 3000.
2. My GPU OpenCL device has a maximum work-group size of 8192.
3. I call clEnqueueNDRangeKernel with the desired local-work-size (along with all other necessary parameters)
4. With the function calls:
a. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), (void*)&workGroupSizeUsed, NULL);
b. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), (void*)&workGroupSizeUsed, NULL);
Both a and b return 8192.
The maximum work-group size, CL_KERNEL_WORK_GROUP_SIZE and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE are all 8192.
I have no idea what I should follow to define my local work size...
(Q1)Any good idea for setting the local work size? (10x10? 40x30, X by Y )
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_item_size, local_work_item_size, 0, NULL, NULL);
It is a real headache to define this "local_work_item_size" for the clEnqueueNDRangeKernel function.
(Q2)
Could someone explain the difference between setting local work size = 1,1 and
local work size = 4000,3000 ?
Thank you in advance!
(Q1)Any good idea for setting the local work size? (10x10? 40x30, X by Y )
As pmdj pointed out, this highly depends on your application. Since it is unclear how you selected your global_work_size and it is also linked to the local_work_size I would like to explain that one first. Usually what you would want to do is to map the size of the data you want to process to the global_work_size. E.g. if you have an array with 1024 values you would also pick a global_work_size of 1024 because then you can easily use the global id as an index in your OpenCL program:
int index = get_global_id(0);
input_array[index]++; // your data processing
However, the global_work_size cannot exceed what a size_t can hold on the device (2^32 - 1 on a device with 32-bit addressing). If you have more data to process than that, you can pass your global_work_size and data size as parameters and use a loop like the following one:
int index = get_global_id(0);
for (int i = index; i < data_size; i += global_work_size) {
input_array[i]++; // your data processing
}
The last important fact about the global_work_size is that it needs to be divisible by the local_work_size. This can result in your global_work_size being bigger than your data size, e.g. you could have 1000 values while your local_work_size is 32. Then you would make your global_work_size 1024 and ensure through a condition like the one above (i < data_size) that the redundant work items do not do anything weird, like accessing unallocated memory.
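On the host side, the rounding described here is a one-liner. The sketch below is plain Python for illustration (the helper name round_up_global_size is made up); the same integer arithmetic carries over to C host code.

```python
def round_up_global_size(data_size, local_size):
    """Smallest multiple of local_size that is >= data_size."""
    return ((data_size + local_size - 1) // local_size) * local_size

data_size = 1000
local_size = 32

global_size = round_up_global_size(data_size, local_size)
idle_items = global_size - data_size  # work items that must be masked out in the kernel

print(global_size, idle_items)
```

The idle work items are the ones the bounds check in the kernel has to skip.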
The local_work_size depends on your platform. As a starting point, pick a local_work_size which is a multiple of 32 for NVIDIA GPUs or a multiple of 64 for AMD GPUs. This is the number of work items that the hardware schedules together anyway (a warp on NVIDIA, a wavefront on AMD). If you use a different number, the GPU will have idle threads which do nothing but decrease your performance.
Not only the manufacturer but also the specific model of your GPU has to be considered to find the optimal local_work_size. The global_work_size divided by the local_work_size is the number of work groups. Each work group is executed on a single compute unit of your CPU/GPU, so you need enough work groups to keep the whole device busy. If you use OpenCL to run your application on powerful hardware, you want to make sure that it runs as parallel as possible. E.g. if you use an Intel i7 with 8 threads, you would want at least 8 work groups (global_work_size / local_work_size >= 8). If you use an NVIDIA GeForce GTX 1060 with 1280 CUDA cores, a rough rule of thumb is to have at least 1280 work groups. But never at the cost of a local_work_size of less than 32, which is more important!
If you have more work groups than your hardware can run at once, that does not matter: the remaining ones are processed afterwards. Hence for most applications you can simply set your local_work_size to 32/64. The only exception is if you require synchronization among more work items than that. Barriers only work inside a work group, not across different work groups. An example: if you need to sum up chunks of 1024 values before being able to proceed with your algorithm, you would need to set your local_work_size to 1024 for the barrier to work as desired.
(Q2) Could someone explain the difference between setting local work size = 1,1 and local work size = 4000,3000 ?
Both the global_work_size and the local_work_size can have more than one dimension. Whether this is used or not depends solely on the preference of the programmer. All algorithms can be implemented in one dimension as well, and the number of work groups is calculated by multiplying the dimensions, e.g. if your global_work_size is 20*20 and your local_work_size is 10*10 you would run the program with (20*20) / (10*10) = 4 work groups.
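The work group count for any number of dimensions can be computed as below. This is an illustrative Python sketch, not an OpenCL API; num_work_groups is a made-up helper, and it assumes each global dimension is divisible by the matching local dimension.

```python
def num_work_groups(global_size, local_size):
    """Number of work groups for an N-dimensional range."""
    groups = 1
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError(f"global size {g} is not divisible by local size {l}")
        groups *= g // l
    return groups

small_example = num_work_groups((20, 20), (10, 10))
one_item_groups = num_work_groups((4000, 3000), (1, 1))
tiled = num_work_groups((4000, 3000), (40, 30))

print(small_example, one_item_groups, tiled)
```

For the 4000 by 3000 range in the question, a local size of 1,1 gives 12,000,000 groups of a single work item each, while a local size of 4000,3000 would mean one group of 12,000,000 work items, far above the 8192 limit quoted in the question.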
I personally like to use the dimensions if I am processing data which has multiple dimensions. Imagine your input is a two-dimensional image, you could simply use its width and height as global_work_size (e.g. 1024 * 1024) and the local_work_size accordingly (e.g. 32 * 32). In your code you could then use the following indices:
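int x = get_global_id(0);
int y = get_global_id(1);
int width = get_global_size(0); // row length, assuming global size == image size
input_array[y * width + x]++; // your data processing (buffers are flat, so compute a 1D offset)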

How tensorflow deals with large Variables which can not be stored in one box

I want to train a DNN model on training data with more than one billion feature dimensions. So the shape of the first layer's weight matrix will be (1,000,000,000, 512). This weight matrix is too large to be stored on one box.
So far, is there any solution for dealing with such large variables, for example partitioning the large weight matrix across multiple boxes?
Update:
Thanks Olivier and Keveman. let me add more detail about my problem.
The examples are very sparse and all features are binary values: 0 or 1. The weight parameter looks like tf.Variable(tf.truncated_normal([1000000000, 512], stddev=0.1))
The solutions Keveman gave seem reasonable, and I will update with results after trying.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each example in your data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
In both these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and locate the shards in different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
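The idea of looking up rows in a sharded matrix can be illustrated without TensorFlow. The sketch below is plain Python with made-up helper names (shard_rows, sharded_lookup); it assumes a "mod" layout, where row i lives in shard i % num_shards, and uses in-memory lists where a real setup would have one shard per machine.

```python
def shard_rows(table, num_shards):
    """Split rows across shards so that row i lives in shard i % num_shards."""
    return [table[s::num_shards] for s in range(num_shards)]

def sharded_lookup(shards, ids):
    """Fetch rows by global id from a mod-partitioned table."""
    num_shards = len(shards)
    return [shards[i % num_shards][i // num_shards] for i in ids]

# Toy 10 x 2 weight matrix standing in for the (1,000,000,000, 512) one.
table = [[float(row), float(row) * 10] for row in range(10)]
shards = shard_rows(table, num_shards=3)

# Only the rows for the features present in the mini-batch are fetched.
rows = sharded_lookup(shards, [7, 0, 4])
print(rows)
```

Because each example touches only a handful of rows, no machine ever needs the whole matrix in memory.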
If you really want to do some dense operation on the weight matrix, say, multiply the matrix with another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle that. You would still shard your weight matrix across machines. But then, you have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix, and combine the results.
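As a rough picture of the manual approach for dense operations, the product x @ W can be built from partial products over row blocks of W. This is a plain-Python sketch with made-up helpers (matmul, blocked_matmul) and tiny matrices; it assumes W is split into consecutive row blocks, and in a distributed setting each partial product would be computed where its block is stored.

```python
def matmul(a, b):
    """Plain dense matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def blocked_matmul(x, w_blocks):
    """Compute x @ W where W is stored as consecutive row blocks."""
    n_out = len(w_blocks[0][0])
    result = [[0.0] * n_out for _ in x]
    start = 0
    for block in w_blocks:
        stop = start + len(block)
        x_slice = [row[start:stop] for row in x]  # columns of x that match this block
        partial = matmul(x_slice, block)          # partial product for this block
        result = [[r + p for r, p in zip(r_row, p_row)]
                  for r_row, p_row in zip(result, partial)]
        start = stop
    return result

w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
x = [[1.0, 0.0, 1.0, 1.0]]   # one sparse binary example
w_blocks = [w[:2], w[2:]]    # two row blocks, e.g. one per machine

print(blocked_matmul(x, w_blocks))
```

Only the small partial results need to be combined across blocks, never the blocks themselves.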