I am using Faiss to index some sentences, and new sentences are added by users every day, so I need to update the index file daily. I load the trained index using faiss.read_index(file), use indexer.add to add the new embeddings incrementally, and finally use write_index to save the index file. But it does not seem to work. Can someone give me some advice?
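Roughly, a minimal sketch of the daily update loop I have in mind looks like this (the file name and the random embeddings are just placeholders):

import faiss
import numpy as np

INDEX_PATH = "sentences.index"  # placeholder file name

# Load yesterday's index, append today's new embeddings, and write it back.
index = faiss.read_index(INDEX_PATH)
print("before:", index.ntotal)

# Placeholder for the new sentence embeddings; the dimension must match the index.
new_embeddings = np.random.rand(100, index.d).astype("float32")
index.add(new_embeddings)

faiss.write_index(index, INDEX_PATH)
print("after:", index.ntotal)  # should grow by the number of vectors added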
I am quite new to TensorFlow, and have never worked with TFRecords before.
I have downloaded a dataset of images from online and the download format was TFRecord.
This is the file structure in the downloaded dataset:
(screenshots of the folder structure omitted, e.g. the contents of the "test" folder)
What I want to do is load in the training, validation and testing data into TensorFlow in a similar way to what happens when you load a built-in dataset, e.g. you might load in the MNIST dataset like this, and get arrays containing pixel data and arrays containing the corresponding image labels.
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
However, I have no idea how to do so.
I know that I can use dataset = tf.data.TFRecordDataset(filename) somehow to open the dataset, but would this act on the entire dataset folder, one of the subfolders, or the actual files? If it is the actual files, would it be the .tfrecord file? And how do I use the .pbtxt file which contains a label map?
And even after opening the dataset, how can I extract the data and create the necessary arrays which I can then feed into a TensorFlow model?
It's mostly archaeology, plus a few tricks.
First, I'd read the README.dataset and README.roboflow files. Can you show us what's in them?
Second, .pbtxt files are text-formatted, so we may be able to understand what that file is if you just open it with a text editor. Can you show us what's in it?
The thing to remember about a TFRecord file is that it's nothing but a sequence of binary records. tf.data.TFRecordDataset('balls.tfrecord') will give you a dataset that yields those records in order.
Third is the hard part, because here you'll have binary blobs of data, but we don't have any clues yet about how they're encoded.
It's common for TFRecord files to contain serialized tf.train.Example protos.
So it would be worth a shot to try and decode it as a tf.train.Example to see if that tells us what's inside.
import tensorflow as tf

# Grab the first raw record from the file.
for record in tf.data.TFRecordDataset('balls.tfrecord'):
    break

# Try to parse it as a tf.train.Example and print the result.
example = tf.train.Example()
example.ParseFromString(record.numpy())
print(example)
The Example object is just a representation of a dict. If you get something other than an error there, look at the dict keys and see if you can make sense of them.
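For instance, just listing the keys is often enough to see what's stored (continuing from the snippet above):

# An Example holds a map from feature name to Feature, so the keys tell you what fields exist.
print(list(example.features.feature.keys()))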
Then to make a dataset that decodes them you'll want something like:
# key_dtypes maps each feature key found above to its dtype, e.g. {'image/encoded': tf.string}.
def decode(record):
    return tf.io.parse_example(record, {key: tf.io.RaggedFeature(dtype) for key, dtype in key_dtypes.items()})

ds = ds.map(decode)
I am running the Word2Vec implementation from gensim twice, and I have a problem with the save function:
model_ = gensim.models.Word2Vec(all_doc, size=int(config['MODEL']['embed_size']),
                                window=int(config['MODEL']['window']),
                                workers=multiprocessing.cpu_count(),
                                sg=1, iter=int(config['MODEL']['iteration']),
                                negative=int(config['MODEL']['negative']),
                                min_count=int(config['MODEL']['min_count']),
                                seed=int(config['MODEL']['seed']))
model_.save(config['BASIC']['embedding_dir'])
I obtain different outputs for each time I run it. The first time it gives an "output_embedding", an "output_embedding.trainables.syn1neg.npy" and an "output_embedding.wv.vectors.npy". But the second time it does not give the two npy files, it just generates "output_embedding".
The only thing I change from the first to the second time is the sentences I use as input (all_doc).
Why does it not generate the 3 files?
Gensim only creates the separate files when the size of the internal numpy arrays is over a certain threshold – so I suspect your all_doc corpus has a very small vocabulary in one case, and a more typically large vocabulary in the other.
When it does generate multiple files, be sure to keep them all together for later loads to work.
(If for some urgent reason you needed to change that behavior, the inherited .save() method takes an optional sep_limit argument to change the threshold - but I'd recommend against mucking with this.)
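For illustration only, a sketch of what that would look like, continuing from the snippet in the question (the 1 GiB value is arbitrary):

# model_ is the trained Word2Vec model from the question's code.
# sep_limit is the byte threshold above which gensim stores large numpy arrays
# as separate .npy files next to the main save file (the default is about 10 MB).
model_.save(config['BASIC']['embedding_dir'], sep_limit=1024 * 1024 * 1024)

# When loading, keep any companion .npy files in the same directory as the main file.
model_ = gensim.models.Word2Vec.load(config['BASIC']['embedding_dir'])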
Separately: that your file names have .trainables. in them suggests you're using a pre-4.0.0 version of Gensim. There've been some improvements to Word2Vec & related algorithms in the latest Gensim, and some older code will need small changes to keep working, so you may want to upgrade to the latest version before building any more functionality on an older base.
These are the data (batch size 2) and the batch index:
import mxnet as mx
data=mx.nd.array(range(24)).reshape(2,3,4)
index=mx.nd.array([[0,1],[1,2]])
How do I get the selected data? I tried the pick and take functions, but I don't know how to do it.
It seems gather_nd works
mx.nd.gather_nd(data,mx.nd.array([[0,0,1,1],[0,1,1,2]])).reshape(2,2,4)
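Here is a sketch of how those gather_nd coordinates could be built from index instead of being hard-coded (the intermediate variable names are mine):

import mxnet as mx

data = mx.nd.array(range(24)).reshape(2, 3, 4)
index = mx.nd.array([[0, 1], [1, 2]])

# gather_nd expects a (2, N) coordinate array: row 0 holds the batch ids,
# row 1 holds the positions to pick within each batch element.
batch_size, k = index.shape
rows = mx.nd.repeat(mx.nd.arange(batch_size), repeats=k)  # [0, 0, 1, 1]
cols = index.reshape(-1)                                  # [0, 1, 1, 2]
coords = mx.nd.stack(rows, cols)                          # shape (2, 4)

out = mx.nd.gather_nd(data, coords).reshape(batch_size, k, -1)
print(out)  # shape (2, 2, 4)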
I am using this code to train a word2vec model. I am trying to train it incrementally, using saver.restore(). I am using new data after restoring the model. Since the vocabulary sizes for the old data and the new data are not the same, I got an exception like this:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [28908,200] rhs shape= [71291,200]
Here 71291 is vocabulary size for the old data and 28908 is for new data.
It gets the vocabulary words from the train_data file here, and constructs the network model using the size of the vocabulary. I thought that if I could make the vocabulary size the same for my old data and new data, I could solve this problem.
So, my question is: can I do that in this code? As far as I understand, I cannot reach the skipgram_word2vec() function.
Or is there any other way of solving this issue in this code besides what I thought? If it is not possible using this code, I will try other ways for my purpose.
Any help is appreciated.
Having taken a look at the source of word2vec_optimized.py, I'd say you will need to change the code there. It operates by opening a text file right up front as "training data". For your purposes, you would have to change the build_graph method so that it can take an option to set all of that data (words, counts, words_per_epoch, current_epoch, total_words_processed, examples, labels, opts.vocab_words, opts.vocab_counts, opts.words_per_epoch) at initialization, rather than reading it from a text file.
Then you need to merge the two text files and load them once to produce the vocabulary. Then save all the data above, and use that to restore the network on each subsequent run.
If you use more than 2 texts, however, you need to include all the text you plan to use in the first corpus that produces the vocabulary.
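A rough sketch of that merging step, with made-up file names:

# Merge the old and new training text files so the vocabulary is built once
# over all the text you ever plan to train on (file names are placeholders).
corpus_files = ["old_train.txt", "new_train.txt"]
with open("merged_train.txt", "w") as out:
    for path in corpus_files:
        with open(path) as f:
            out.write(f.read())
            out.write("\n")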
I am trying to do a Deep Learning project by using Tensorflow.
Each of my data samples consists of 2 files (a PNG image file + a TXT vector file), which are put in different folders as follows:
./data/image/   # folder containing images of different sizes
./data/vector/  # folder containing the vector of the corresponding image
                # for example: apple.png + apple.txt
The example content of a vector file is as follows:
10.0,2.5,5,13
And since the image sizes are different, resizing and some transformations applied to the vectors are required. It is important that I can do this processing while TensorFlow is running. Is there any good way to manage this kind of dataset?
I have referred to a lot of basic tutorials; however, most of them don't give many details about arranging customized data input and output. Please give me some advice!
I recommend you take a look at TFRecords and queues. Basically the idea is the following: you resize all your images to the same format and store them together with your txt vectors in one TFRecord file. This is done separately, before you run your model.
When you create your model, you create a queue which reads data from the TFRecord file and feeds it to your model.
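As a rough sketch of both steps (here using the tf.data API, which has replaced the old queue runners; the file names, feature keys, target size and vector length are assumptions):

import tensorflow as tf

IMAGE_SIZE = (224, 224)   # assumed target size after resizing
VECTOR_LEN = 4            # matches the 4-value example vector above

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# Step 1 (done once, offline): write each (image, vector) pair into one TFRecord file.
def write_tfrecord(pairs, out_path):
    # pairs: list of (png_path, vector) tuples, e.g. ("./data/image/apple.png", [10.0, 2.5, 5, 13])
    with tf.io.TFRecordWriter(out_path) as writer:
        for png_path, vector in pairs:
            png_bytes = tf.io.read_file(png_path).numpy()
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": _bytes_feature(png_bytes),
                "vector": _float_feature(vector),
            }))
            writer.write(example.SerializeToString())

# Step 2 (at training time): read the TFRecord back, decoding and resizing on the fly.
def decode(record):
    parsed = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "vector": tf.io.FixedLenFeature([VECTOR_LEN], tf.float32),
    })
    image = tf.image.resize(tf.io.decode_png(parsed["image"], channels=3), IMAGE_SIZE)
    return image, parsed["vector"]

ds = tf.data.TFRecordDataset("data.tfrecord").map(decode).batch(32)

Any extra transformations on the vectors can go inside decode as well, so they run as part of the input pipeline while the model trains.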