I am trying to convert review text to vectors using BERT. I am using a pretrained BERT from TensorFlow Hub. While encoding on a TPU in Google Colab it runs out of memory and crashes. I have seen some articles where most people define BERT layers to create embeddings.
My code file is:
https://colab.research.google.com/drive/1hfV1V9aMKIfWruH59H6xCCqNxwX6amub?usp=sharing
I am currently passing 400 reviews and creating their embeddings, then storing them to a separate file, but this takes very long.
Any sort of help would be appreciated.
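A rough sketch of how the encoding could be done in batches and written to disk, assuming a TF2/Keras BERT encoder from TF Hub (the module handles, batch size, and file name below are assumptions, not the ones from the notebook); batching keeps only one chunk of reviews in memory at a time, whatever device is used:

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the preprocessing module

# Example TF Hub handles; any matching preprocess/encoder pair works.
preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

def encode_in_batches(texts, batch_size=32):
    # Encode batch by batch so the whole review set never sits in memory at once.
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = tf.constant(texts[start:start + batch_size])
        outputs = encoder(preprocess(batch))
        chunks.append(outputs["pooled_output"].numpy())
    return np.concatenate(chunks, axis=0)

reviews = ["great product", "terrible service"]  # placeholder reviews
embeddings = encode_in_batches(reviews)
np.save("review_embeddings.npy", embeddings)  # store to a separate file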
I'm training a convolutional neural network (CNN) model for a binary classification task in TensorFlow 2.1.0.
The feature of each instance is a 4-dimensional numpy array with shape (50, 50, 50, 2), where each element is a float32.
The label of each instance is 1 or 0.
My largest training dataset can contain up to ~100 million instances.
To train the model efficiently, is it best to serialize my training data and store it in a set of files in TFRecord format, then load them with tf.data.TFRecordDataset() and parse them with tf.data.Dataset.map()?
If so, could you show me an example of how to serialize the feature-label pairs and store them in TFRecord files, and then how to load and parse them?
I did not find an appropriate example on the TensorFlow website.
Or is there a better way to store and load such huge datasets? Thanks very much.
There are many ways to efficiently build a data pipeline without TFRecord; click this link, it was very useful.
To extract images from a directory efficiently, click this link.
Hope this helped you.
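If you do go the TFRecord route described in the question, a minimal sketch might look like the following; the shapes and dtypes come from the question, while the file name and batch size are assumptions:

import numpy as np
import tensorflow as tf

def serialize_example(feature, label):
    # Pack one (feature, label) pair into a tf.train.Example proto.
    feature_bytes = tf.io.serialize_tensor(tf.constant(feature)).numpy()
    example = tf.train.Example(features=tf.train.Features(feature={
        "feature": tf.train.Feature(bytes_list=tf.train.BytesList(value=[feature_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }))
    return example.SerializeToString()

# Write one shard (two dummy instances here; a large dataset would be sharded across many files).
with tf.io.TFRecordWriter("train-0000.tfrecord") as writer:
    for _ in range(2):
        feature = np.random.rand(50, 50, 50, 2).astype(np.float32)
        label = np.random.randint(0, 2)
        writer.write(serialize_example(feature, label))

def parse_example(serialized):
    # Inverse of serialize_example: recover the (feature, label) tensors.
    parsed = tf.io.parse_single_example(serialized, {
        "feature": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    feature = tf.reshape(tf.io.parse_tensor(parsed["feature"], out_type=tf.float32),
                         (50, 50, 50, 2))
    return feature, parsed["label"]

dataset = (tf.data.TFRecordDataset(["train-0000.tfrecord"])
           .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .batch(8)
           .prefetch(tf.data.experimental.AUTOTUNE))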
I am trying to generate a custom TensorFlow model (.tf/.tflite file) that I want to use in my mobile application.
I have gone through a few machine learning and TensorFlow blogs, and from there I started to generate a simple ML model.
https://www.datacamp.com/community/tutorials/tensorflow-tutorial
https://www.edureka.co/blog/tensorflow-object-detection-tutorial/
https://blog.metaflow.fr/tensorflow-how-to-freeze-a-model-and-serve-it-with-a-python-api-d4f3596b3adc
https://www.youtube.com/watch?v=ICY4Lvhyobk
All of these are really nice, and they guided me through the steps below:
i) Install all necessary tools (TensorFlow, Python, Jupyter, etc.).
ii) Load the training and testing data.
iii) Run the TensorFlow session to train and evaluate the results.
iv) Steps to increase the accuracy.
But I am not able to generate the .tf/.tflite files.
I tried the following code, but it generates an empty file.
converter = tf.contrib.lite.TFLiteConverter.from_session(sess, [], [])  # input and output tensor lists are empty here
model = converter.convert()
file = open('model.tflite', 'wb')
file.write(model)
I have checked a few answers on Stack Overflow, and according to my understanding, in order to generate the .tf files we need to create the .pb file, freeze the .pb file, and then generate the .tf files.
But how can we achieve this?
TensorFlow provides the TFLite converter to convert a saved model to a TFLite model. For more details, see here.
tf.lite.TFLiteConverter.from_saved_model() (recommended): Converts a SavedModel.
tf.lite.TFLiteConverter.from_keras_model(): Converts a Keras model.
tf.lite.TFLiteConverter.from_concrete_functions(): Converts concrete functions.
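For example, a minimal sketch of the recommended SavedModel route, assuming TF 2.x and a model already exported with tf.saved_model.save() or model.save() (the path is a placeholder):

import tensorflow as tf

# Path to the directory produced by tf.saved_model.save() / model.save(); placeholder here.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)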
I'm using TensorFlow 1.3's Estimator API to perform some image classification. Since I have a considerable amount of data, I gave TFRecords a go. I saved the file and can read the examples into a Dataset using a parser function inside the input_fn of the estimator model. So far so good.
The issue is when I want to do some image augmentation (rotating and shearing in this case).
1) I tried using tf.contrib.keras.preprocessing.image.random_shear and the like. It turns out Keras doesn't like the format of TF's shape ('Dimension'), and I can't cast it to a list because its arguments are the axis indexes, not the actual values.
2) Then I tried using tf.contrib.image.rotate and tf.contrib.image.transform with random values in my chosen range. This time I get the error NotFoundError: Op type not registered 'ImageProjectiveTransform' in binary running on MYPC. Make sure the Op and Kernel are registered in the binary running in this process., which is an open issue (https://github.com/tensorflow/tensorflow/issues/9672). At the moment I can't move away from Windows, so I would be very interested in possible alternatives.
3) I searched for a way to read the TFRecords, convert them to numpy arrays, and do the augmentation with other tools, but I can't find a way to do this from within the input_fn, where I can't access the session.
Thanks!
Have you tried using the function from the answer to this question: tensorflow: how to rotate an image for data augmentation?
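As an alternative that stays inside the input_fn and avoids the unregistered ImageProjectiveTransform op, you could wrap a scipy rotation in tf.py_func (TF 1.x); a rough sketch, with the rotation range as an assumption:

import numpy as np
import scipy.ndimage
import tensorflow as tf

def random_rotate(image):
    # Rotate a decoded HWC image by a random angle on the CPU via scipy,
    # so no ImageProjectiveTransform kernel is needed.
    def _rotate(img):
        angle = np.random.uniform(-15.0, 15.0)  # assumed rotation range in degrees
        return scipy.ndimage.rotate(img, angle, reshape=False, mode="nearest")
    rotated = tf.py_func(_rotate, [image], image.dtype)
    rotated.set_shape(image.get_shape())  # py_func drops static shape information
    return rotated

# Inside the parser function mapped over the TFRecordDataset:
# image = random_rotate(image)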
I know what embeddings are and how they are trained. While going through TensorFlow's documentation, I came across two different articles. I wish to know what exactly the difference between them is.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they explicitly train embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later save the learnt embeddings as a numpy object and use the tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article, however, I couldn't understand what is happening.
word_embeddings = tf.get_variable("word_embeddings",
                                  [vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training embeddings section. My doubt is: does the gather function train the embeddings automatically? I am not sure, since this op ran very fast on my PC.
Generally: what is the right way to convert words into vectors in TensorFlow (link 1 or link 2) for training a seq2seq model? Also, how do I train the embeddings for a seq2seq dataset, since for my task the data comes as separate sequences, unlike the continuous sequence of words in the link 1 dataset?
Alright! Anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through exactly how the embeddings are learnt.
In practical cases, such as training seq2seq or any other encoder-decoder models, we use the second approach, where the embedding matrix gets tuned appropriately while the model trains.
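To make the second approach concrete, here is a rough TF 1.x sketch (the sizes and the averaging classifier head are made-up for illustration): the gather/lookup itself trains nothing, but because word_embeddings is a trainable variable, minimizing any loss built on top of the looked-up rows also updates the embedding matrix.

import tensorflow as tf

vocabulary_size = 10000   # assumed for illustration
embedding_size = 128
num_classes = 2

word_ids = tf.placeholder(tf.int32, shape=[None, None])  # batch of token-id sequences
labels = tf.placeholder(tf.int32, shape=[None])

# The embedding matrix is just another trainable variable.
word_embeddings = tf.get_variable("word_embeddings", [vocabulary_size, embedding_size])

# The lookup (gather) is differentiable with respect to the rows it selects.
embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)

sentence_repr = tf.reduce_mean(embedded_word_ids, axis=1)
logits = tf.layers.dense(sentence_repr, num_classes)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Minimizing the loss trains the embeddings together with the dense layer.
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)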
I am trying to cluster some sentences using similarity (maybe cosine) and then maybe use a classifier to put text in predefined classes.
My idea is to use TensorFlow to generate the word embeddings, then average them for each sentence. Next, use a clustering/classification algorithm.
Does TensorFlow provide a ready-to-use word2vec generation algorithm?
Would a bag of words model generate a good output?
No, TensorFlow does not provide a ready-to-use word2vec implementation, but it does have a tutorial on word2vec.
Yes, a bag-of-words model can generate surprisingly good output (though not state-of-the-art), and has the benefit of being amazingly fast. I have a small amount of data (tens of thousands of sentences) and have achieved F1 scores of >0.90 for classification.
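To illustrate the averaging idea from the question, a toy sketch with made-up word vectors (real vectors would come from word2vec/GloVe or a trained embedding matrix):

import numpy as np

# Hypothetical pre-trained word vectors; in practice load them from word2vec/GloVe.
word_vectors = {
    "good": np.array([0.8, 0.1, 0.3]),
    "bad": np.array([-0.7, 0.2, 0.1]),
    "service": np.array([0.1, 0.9, 0.4]),
}

def sentence_vector(sentence, dim=3):
    # Average the vectors of the known tokens; zeros if none are known.
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Sentences sharing words/semantics score higher.
print(cosine_similarity(sentence_vector("good service"), sentence_vector("bad service")))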