How to extract specific rows in MXNet?

Here are my data (batch size 2) and a batch index:
import mxnet as mx
data=mx.nd.array(range(24)).reshape(2,3,4)
index=mx.nd.array([[0,1],[1,2]])
How do I get the selected data? I tried the pick and take functions, but don't know how to do it.

It seems gather_nd works
mx.nd.gather_nd(data,mx.nd.array([[0,0,1,1],[0,1,1,2]])).reshape(2,2,4)
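A more general way to build the gather_nd indices from the (batch, row) index array is something like the following (a sketch only; not tuned, and it assumes the batch size of 2 from above):
batch_idx = mx.nd.broadcast_to(mx.nd.arange(2).reshape(2, 1), (2, 2))  # [[0, 0], [1, 1]]
full_index = mx.nd.stack(batch_idx.reshape(-1), index.reshape(-1))     # shape (2, 4): batch ids, then row ids
selected = mx.nd.gather_nd(data, full_index).reshape(2, 2, 4)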

Related

Gensim word2vec saves numpy arrays?

I am running the Word2Vec implementation from gensim twice, and I have a problem with the save function:
model_ = gensim.models.Word2Vec(all_doc, size=int(config['MODEL']['embed_size']),
                                window=int(config['MODEL']['window']),
                                workers=multiprocessing.cpu_count(),
                                sg=1, iter=int(config['MODEL']['iteration']),
                                negative=int(config['MODEL']['negative']),
                                min_count=int(config['MODEL']['min_count']),
                                seed=int(config['MODEL']['seed']))
model_.save(config['BASIC']['embedding_dir'])
I obtain different outputs for each time I run it. The first time it gives an "output_embedding", an "output_embedding.trainables.syn1neg.npy" and an "output_embedding.wv.vectors.npy". But the second time it does not give the two npy files, it just generates "output_embedding".
The only thing I change from the first to the second time is the sentences I use as input (all_doc).
Why does it not generate the 3 files?
Gensim only creates the separate files when the size of the internal numpy arrays is over a certain threshold – so I suspect your all_doc corpus has a very small vocabulary in one case, and a more typically large vocabulary in the other.
When it does generate multiple files, be sure to keep them all together for later loads to work.
(If for some urgent reason you needed to change that behavior, the inherited .save() method takes an optional sep_limit argument to change the threshold - but I'd recommend against mucking with this.)
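A minimal sketch of what that would look like (the sep_limit value here is arbitrary; as said above, changing it is rarely a good idea):
# raise the byte threshold so the arrays stay inside the main file (the default is 10 MiB)
model_.save(config['BASIC']['embedding_dir'], sep_limit=2 * 1024 ** 3)
# whichever way it was saved, load it back with any companion .npy files kept alongside
model_ = gensim.models.Word2Vec.load(config['BASIC']['embedding_dir'])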
Separately: that your file names have .trainables. in them suggests you're using a pre-4.0.0 version of Gensim. There've been some improvements to Word2Vec & related algorithms in the latest Gensim, and some older code will need small changes to keep working, so you may want to upgrade to the latest version before building any more functionality on an older base.

Tensorflow Shuffle Batch Non Deterministic

I am trying to get deterministic behaviour from tf.train.shuffle_batch(). I could, instead, use tf.train.batch() which works fine (always the same order of elements), but I need to get examples from multiple tf-records and so I am stuck with shuffle_batch().
I am using:
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)
data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1, capacity=512,
    seed=57, min_after_dequeue=32)
But every time I restart my script I get slightly different results (not completely different, but about 20% of the elements are in the wrong order).
Is there anything I am missing?
Edit: Solved it! See my answer below!
Maybe I misunderstood something, but you can collect multiple tf-records in a queue with tf.train.string_input_producer(), then read the examples into tensors and finally use tf.train.batch().
Take a look at CIFAR-10 input.
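A rough sketch of that suggestion (TF1 queue API; the file names and the parse step are placeholders to replace with your own):
filename_queue = tf.train.string_input_producer(
    ["a.tfrecords", "b.tfrecords"], shuffle=False)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
data = parse_example(serialized_example)  # e.g. your own tf.parse_single_example(...) logic
data_entries = tf.train.batch([data], batch_size=batch_size,
                              num_threads=1, capacity=512)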
Answering my own question:
First, the reason shuffle_batch is non-deterministic:
The time until I request a batch is inherently random.
In that time, a random number of tensors are available.
TensorFlow calls a shuffle operation that is seeded, but depending on the number of items available, it will return a different order.
So no matter the seeding, the order is always different unless the number of elements is constant. The solution is therefore to keep the number of elements constant, but how do we do that?
By setting capacity=min_after_dequeue+batch_size. This will force Tensorflow to fill up the queue until it reaches full capacity before dequeuing an item. Therefore, at the time of the shuffle operation, we have capacity many items which is a constant number.
So why are we doing this? Because one tf.record contains many examples but we want examples from multiple tf.records. With a normal batch we would first get all the examples of one record and then of the next one. This also means we should set min_after_dequeue to something larger than the number of items in one tf.record. In my example, I have 50 examples in one file so I set min_after_dequeue=2048.
Alternatively, we can also shuffle the examples before creating the tf.records, but this was not possible for me because I read tf.records from multiple directories (each with their own dataset).
Last note: you should also use a batch size of 1 to be extra safe.
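Putting it together, the deterministic configuration described above looks roughly like this (values taken from the setup above; adjust min_after_dequeue to your record sizes):
batch_size = 1
min_after_dequeue = 2048
data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1,
    capacity=min_after_dequeue + batch_size,  # queue must fill up completely before any dequeue
    seed=57, min_after_dequeue=min_after_dequeue)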

Why are embedding_lookup_sparse and string_to_hash_bucket in TensorFlow slow with a large number of embedding rows?

In TensorFlow, embedding_lookup_sparse looks up rows of the embeddings according to sp_ids. I think this is similar to random access. However, when the embedding table is large, e.g. 10M rows, inference takes more time than when the table has only about 1M rows. As I see it, the lookup phase is similar to random access, and the hash function takes constant time, both of which are fast and not very sensitive to the table size. Is there anything wrong with my reasoning? Is there any way to optimize this so that inference can be faster? Thank you!
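For reference, the kind of lookup being described is roughly the following (a sketch only; the table shape, ids and combiner are made up):
embeddings = tf.get_variable("emb", shape=[10000000, 64])
sp_ids = tf.SparseTensor(indices=[[0, 0], [1, 0], [1, 1]],
                         values=tf.constant([12, 7, 42], dtype=tf.int64),
                         dense_shape=[2, 2])
# sp_weights=None means every id gets weight 1
looked_up = tf.nn.embedding_lookup_sparse(embeddings, sp_ids, None, combiner="mean")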
Are you sure it is caused by embedding_lookup? In my case I also have millions of rows to look up. It is very fast if I use the GradientDescent optimizer, and very slow if I use Adam or the others. Probably it is not the embedding_lookup op that slows down your training, but other ops that depend on the total number of parameters.
It is true that "embedding_lookup" works slowly when there are many rows in the table.
You may figure out why by reading its source code. Here is the relevant part of "embedding_lookup":
[image of the source code: the variable "np" is the length of the table]
[image of the source code: the loop over "np"]
As you can see, a loop with a time complexity of O(table length) appears here. In fact, "embedding_lookup" uses dynamic partitioning to separate the input ids into several partitions, and then uses this loop to look up the word vectors for each partition of ids. In my opinion, this trick fixes the time complexity at O(table length) no matter how big the input data is.
So I think the best way for you to increase training speed is to input more samples in each batch.

A tricky graph evaluation order in TensorFlow

As shown below, I built a graph with two big variables and two input placeholders.
Each step, I want to use the current values of the variables (partial values) and the input placeholders to calculate delta values. The delta values are then applied to the variables using scatter_add.
Problem: the two computation paths are not the same; one needs more computation. The TensorFlow execution engine seems to pick one of the paths arbitrarily: it evaluates one path first, then the other. For example, TF may update variable 0 first, and then use this new variable 0 to evaluate the other path (updating variable 1). This is not what I need.
so, any idea?
[image: the TensorFlow graph]
I found the solution: using tf.control_dependencies() solves this problem.
https://www.tensorflow.org/api_docs/python/tf/control_dependencies
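A rough sketch of the pattern (the compute_delta* functions, placeholders and index tensors are stand-ins for your own ops): both deltas are forced to be computed from the current variable values before either scatter_add runs.
delta0 = compute_delta0(var0, var1, ph0)
delta1 = compute_delta1(var0, var1, ph1)
with tf.control_dependencies([delta0, delta1]):
    update0 = tf.scatter_add(var0, idx0, delta0)
    update1 = tf.scatter_add(var1, idx1, delta1)
train_op = tf.group(update0, update1)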

I want to map a function to each element of a vector in Theano, can I do it without using scan?

Say, a function that counts the appearances of ones up to and including each index of an array:
import theano
import theano.tensor as T
A = T.vector("A")
idx_range = T.arange(A.shape[0])
result, updates = theano.scan(fn=lambda idx: T.sum(A[:idx+1]), sequences=idx_range)
count_ones = theano.function(inputs=[A], outputs=result)
print count_ones([0,0,1,0,0,1,1,1])
# gives [ 0. 0. 1. 1. 1. 2. 3. 4.]
As said here, using scan may not be efficient. Plus, theano.scan always produces a "RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility" warning (raised during "from scan_perform.scan_perform import *") on my machine.
So I was wondering is there a better way of mapping functions in Theano?
Thanks in advance.
Edit:
I just realized it is a terrible example; apparently there's a more efficient way of just looping over the vector once, like:
import numpy as np
# note: scan passes the current sequence element first, then the previous output
result, updates = theano.scan(fn=lambda a, prior_result: prior_result + a,
                              outputs_info=T.alloc(np.int32(0), 1),
                              sequences=A,
                              n_steps=A.shape[0])
However, according to @Daniel Renshaw's answer, since "the computation in one step is dependent on the same computation at some earlier step", I actually can't avoid using scan in this case, right?
Edit:
I thought of a way of vectorizing it:
import numpy
A = T.vector()
in_size = 8
# a matrix with ones at and below the main diagonal and zeros elsewhere
mask = theano.shared(numpy.tri(in_size))
result = T.dot(mask, A)
count_ones = theano.function(inputs=[A], outputs=result)
print count_ones(numpy.asarray([0,0,1,0,0,1,1,1]))
But in this case I have to know the size of the input in advance (unless I can craft numpy.tri-like matrices on the fly?).
Any suggestions would be welcome. :)
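One possible way around that (a sketch, not benchmarked here) is to build the triangular mask symbolically from the input's own length:
n = A.shape[0]
rows = T.arange(n).dimshuffle(0, 'x')   # column vector of row indices
cols = T.arange(n).dimshuffle('x', 0)   # row vector of column indices
mask = T.cast(cols <= rows, A.dtype)    # lower-triangular ones, built on the fly
result = T.dot(mask, A)
count_ones = theano.function(inputs=[A], outputs=result)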
Edit:
I benchmarked the three methods using a 512D input array and 10000 iterations, and got the following results:
map a sum function to each element: CPU 16 s, GPU 140 s
loop over the array using scan: CPU 13 s, GPU 32 s
vectorization: CPU 0.8 s, GPU 0.8 s (actually I don't think Theano engaged the GPU for this one)
In the most general case, if no assumptions are made about the function, then scan would have to be used. However, many (maybe most?) useful functions can be vectorized such that scan is not needed. As is pointed out in the question edit, the example function can certainly be applied to the input without using scan.
Deciding whether scan is needed depends on the function that needs to be applied. Cases that certainly require scan are those where the computation at one step depends on the result of the same computation at some earlier step.
P.S. The warning about binary incompatibility can be safely ignored.