Can I use a pre-trained BERT model to cluster phrases? - tensorflow

I tried to use Gensim with the pre-trained GoogleNews model to cluster phrases like the following, but it failed:
knitting
knit loom
loom knitting
weaving loom
rainbow loom
home decoration accessories
loom knit/knitting loom
...
I was told that the GoogleNews model doesn't contain these phrases. My phrases are a little too specific for the GoogleNews model, and I don't have a corpus to train a new model; I have only the phrases. I am now considering turning to BERT. Could BERT cluster the phrases the way I expected above? Thank you.

You can feed a phrase into the pretrained BERT model and get an embedding, i.e. a fixed-dimension vector. So BERT can embed your phrases in a space. Then you can use a clustering algorithm (such as k-means) to cluster the phrases. The phrases do not need to occur in the training corpus of BERT, as long as the words they consist of are in the vocabulary. You will have to try to see if the embeddings give you relevant results.
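The embed-then-cluster idea above can be sketched roughly as follows. This is a minimal illustration, not a tuned pipeline: it assumes the Hugging Face `transformers` and `torch` packages, the `bert-base-uncased` checkpoint, and mean pooling over the last hidden layer, none of which the answer prescribes. The small k-means routine is written in plain Python so the clustering step can be tried with any embedding function.

```python
def embed_with_bert(phrases, model_name="bert-base-uncased"):
    """Return one vector per phrase by mean-pooling BERT's last hidden layer.

    Assumption: `transformers` and `torch` are installed. The checkpoint
    name and the pooling strategy are illustrative choices.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(phrases, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (n, tokens, width)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # 0 for padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average real tokens only
    return pooled.tolist()


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns one cluster label per vector.

    Centroids start at the first vector and then at the vectors farthest
    from the centroids chosen so far, which keeps the run deterministic.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_phrases(phrases, k, embed=embed_with_bert):
    """Group phrases by cluster label; `embed` maps a list of phrases to vectors."""
    labels = kmeans(embed(phrases), k)
    groups = {}
    for phrase, label in zip(phrases, labels):
        groups.setdefault(label, []).append(phrase)
    return groups
```

Calling `cluster_phrases(["knitting", "knit loom", "weaving loom", "home decoration accessories"], k=2)` would download the checkpoint on first use. Whether the resulting groups are meaningful for such short phrases is exactly what has to be checked by hand, and in practice a library implementation of k-means would replace the hand-written one.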

Related

How to fine-tune spaCy's word vectors

I am predicting similarities of documents using the pre-trained spaCy word embeddings. Because I have a lot of domain-specific words, I want to fine-tune my vectors on a rather small data set containing my domain-specific vocabulary.
My idea was to just train the spaCy model again with my data. But since the word vectors in spaCy are built in, I am not sure how to do that. Is there a way to train the spaCy model again with my data?
After some research, I found out that I can train my own vectors using Gensim. For that I would have to download a pre-trained model, for example the Google News model, and then train it again on my data set. Is this the only way? Or is there a way to proceed with my spaCy model?
Any help is greatly appreciated.
Update: the right term here was "incremental training", and that's not possible with the pre-trained spaCy models.
It is, however, possible to perform incremental training on a gensim model. I started from another pre-trained vector set (I went with the fastText model) and then trained that gensim model, initialized with the fastText vectors, again on my own corpus. This worked pretty well.
If you pre-trained word embeddings with fastText in your domain and would like to use them with spaCy, you can extend or replace the tokens of an existing spaCy model with your new fastText vocabulary and vectors using something similar to this:
https://github.com/explosion/spaCy/issues/2538#issuecomment-404888091
or from scratch:
https://spacy.io/usage/vectors-similarity#converting
The advantage of this approach is that (1) you can keep using spaCy and (2) if some tokens were present in the pre-trained spaCy model but not in your corpus, you will still be able to use them.
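As a rough sketch of the "extend/replace" step, the helper below copies word vectors into a spaCy-style vocabulary through `set_vector`, which is the method `nlp.vocab` exposes for this. The function names, the file name in the usage note, and the choice of the word2vec text format are assumptions made for illustration. The sketch also assumes the new vectors have the same width as the vectors already in the model.

```python
def load_domain_vectors(path):
    """Read vectors stored in word2vec text format (fastText .vec files use it).

    Assumption: gensim 4.x is installed. Returns a word -> numpy array dict.
    """
    from gensim.models import KeyedVectors

    keyed = KeyedVectors.load_word2vec_format(path, binary=False)
    return {word: keyed[word] for word in keyed.index_to_key}


def add_vectors(vocab, vectors):
    """Copy word -> vector pairs into a vocabulary that has `set_vector`.

    With a real spaCy model, pass `nlp.vocab` and 1-D float32 numpy arrays.
    Words already known to the vocabulary get their vector replaced, new
    words are added, and words missing from `vectors` are left untouched.
    """
    for word, vector in vectors.items():
        vocab.set_vector(word, vector)
    return len(vectors)
```

With a loaded model this could be used as `add_vectors(nlp.vocab, load_domain_vectors("domain.vec"))`, where `domain.vec` is a hypothetical file name. Tokens that only the original spaCy model knew keep their vectors, which is point (2) above.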

General usefulness of Dense layers for different identification tasks

I'd like to ask: is it practical to use embeddings and similarity metrics for any form of identification task? If I had a neural network trained to find different objects in a photo, would extracting the outputs of the fully connected (Dense) layers and clustering them be useful?
I've recently found that there is an embedding projector tool from TensorFlow that is very cool and useful. I know that there has been some work on word embeddings and how similar words cluster together. This is the case for faces as well.
Having said that, I want to apply the same methods to analyzing geological sites: can I train a model to create embeddings of the features of a site and use clustering methods to classify them?
Yes, you can do that. You can use embeddings for images and visualize the embeddings in TensorBoard.
You can adapt the Fashion-MNIST embedding example found here to your use case.

How to use the Google Inception model to classify DNA or protein sequence data sets?

I am trying to classify proteins into their families using their sequences. Can I use deep convolutional models for this purpose even though they take the three RGB input matrices of an image? Is there a specific way to convert data sets other than images so that they can be classified with these models? I'm new to artificial neural networks; your suggestions are highly appreciated.
First, you need to understand that the models you have in mind are tasked with a very difficult problem, object recognition in color images, and are therefore very big.
Then you need to know that the purpose of using CNNs is to extract as many features as possible from color images in order to perform detection.
With that in mind, I think classifying proteins by their sequences is achievable with a much smaller convolutional model. You may need at most 10 convolutional layers. In short, you should not need a CNN as complex as the Google Inception model.
About your data: there is no rule saying that CNNs can only be used on RGB pictures. These pictures are only arrays. If you have any kind of numeric data that can be used in arithmetic operations, you can definitely use CNNs for feature extraction. I recommend you take a look at this example.
I also recommend you take a look at the following libraries: scikit-learn, Keras and PyTorch. These libraries are very beginner-friendly and have excellent documentation.
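To make the "pictures are only arrays" point concrete, here is a minimal sketch of turning a protein sequence into a numeric array and of a small 1D CNN that could consume it. One-hot encoding over the 20 standard amino acids, the fixed length with zero padding, and every layer size are assumptions chosen for illustration; the Keras part additionally assumes TensorFlow is installed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {letter: i for i, letter in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, length):
    """Return a `length` x 20 matrix (list of rows) for one protein sequence.

    Longer sequences are truncated and shorter ones are padded with all-zero
    rows. Letters outside the 20 standard ones (e.g. X) also become zero rows.
    """
    rows = []
    for position in range(length):
        row = [0.0] * len(AMINO_ACIDS)
        if position < len(sequence):
            column = INDEX.get(sequence[position].upper())
            if column is not None:
                row[column] = 1.0
        rows.append(row)
    return rows


def build_small_cnn(length, n_families):
    """A deliberately small 1D CNN over the encoded sequence.

    Assumption: TensorFlow/Keras is installed. Layer sizes are illustrative.
    """
    from tensorflow import keras

    return keras.Sequential([
        keras.Input(shape=(length, len(AMINO_ACIDS))),
        keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(n_families, activation="softmax"),
    ])
```

Here the 20 amino-acid columns play the role that the 3 color channels play for an image, and the convolution slides along the sequence instead of across height and width.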
Best of luck.

Pre-Trained LeNet Model for License Plate Recognition

I have implemented a form of the LeNet model in TensorFlow and Python for a car number plate recognition system. My model was trained solely on my training data and tested on the test data. My dataset contains segmented images in which every image contains only one character. This is what my data looks like. My model does not perform very well, so I'm now looking for models that I can use via transfer learning. Since most models are already trained on a huge dataset, I looked at a few such as AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data that I need, namely letters and digits.
Question: Should I still go forward with one of these models and train it on my dataset, or are there better models which would help? For such models, would Keras be a better option since it is more high-level than TensorFlow?
Question: I'd prefer to work with the LeNet model itself, since training the other models would definitely take a long time due to the insufficient specs of my laptop. So is there any implementation of the model that was trained on machine-printed character images, so that I could then train only its final layers on my data?
To get good results you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to recognize the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
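To illustrate how the per-position character scores become a string, here is a minimal sketch of CTC best-path (greedy) decoding: take the best class at each position, merge repeated labels, then drop the blank. The convention that class 0 is the blank and class i maps to `alphabet[i - 1]` is an assumption of this sketch; real implementations differ on where they place the blank.

```python
def ctc_greedy_decode(scores, alphabet):
    """Decode per-position class scores with CTC best-path decoding.

    `scores` is a list of positions, each a list of class scores.
    Assumed convention: class 0 is the CTC blank, class i is alphabet[i - 1].
    """
    # Most likely class at every position.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in scores]

    characters = []
    previous = None
    for label in best_path:
        # Emit a character only when the label changes and is not the blank,
        # so "AA" needs a blank between the two A's to survive.
        if label != previous and label != 0:
            characters.append(alphabet[label - 1])
        previous = label
    return "".join(characters)
```

For a number plate, `alphabet` would hold the letters and digits that can appear, and `scores` would be the network output for one cropped plate image.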
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on github.
You can use any framework you like (e.g. TensorFlow or CNTK) as long as it features convolutional and recurrent NNs and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi et al.: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

spaCy Document Vectors with Custom Number of Dimensions

This question may be a duplicate, but I could not find the answer on StackOverflow.
Is there a way to generate document vectors with another number of dimensions such as 25 instead of 300? I also checked the spacy documentation but could not find the answer.
Thanks!
The document and word vectors in spaCy are not generated by spaCy itself; they are pre-trained embeddings built from a large corpus. For more details on the embeddings you can check Word Vectors and Semantic Similarity in the docs.
If you wanted to use your own 25-dimensional embeddings, you could follow the instructions here. spaCy won't train new embeddings for you; for that I'd recommend gensim.
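A minimal sketch of that route, under stated assumptions: gensim 4.x trains Word2Vec vectors of width 25, and a document vector is then formed as the plain average of the vectors of the known tokens. The function names are hypothetical. Averaging only over known tokens is a simplification and is not guaranteed to match spaCy's own `Doc.vector` for documents with unknown tokens.

```python
def train_word_vectors(sentences, dim=25):
    """Train Word2Vec vectors of width `dim` with gensim.

    Assumption: gensim 4.x is installed. `sentences` is a list of token
    lists. Returns a word -> vector (list of floats) dictionary.
    """
    from gensim.models import Word2Vec

    model = Word2Vec(sentences=sentences, vector_size=dim, min_count=1)
    return {word: model.wv[word].tolist() for word in model.wv.index_to_key}


def document_vector(tokens, word_vectors, dim=25):
    """Average the vectors of the tokens that have one.

    Returns a zero vector of width `dim` if no token is known.
    """
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]
```

The dimensionality is fixed when the embeddings are trained, which is why it cannot simply be changed on an existing 300-dimensional spaCy model.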