Want to extract embeddings of programming language tokens from TransCoder (Facebook) - tokenize

I am trying to extract embeddings for the tokens in various tokenized (.tok) files. I have preprocessed the datasets using the preprocessing pipeline suggested in TransCoder (https://github.com/facebookresearch/TransCoder). I have also trained a TransCoder model, and I can alternatively use the pretrained TransCoder model, to extract the embedding matrix and the embedding vectors of the tokens in these files.
The authors of this work plotted a t-SNE visualization of cross-lingual token embeddings, which they obtained by encoding programming language tokens into TransCoder's lookup table.
Can the authors explain how they did that? I also want to extract the embeddings of these tokens.
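For readers unfamiliar with the term, "encoding tokens into a lookup table" usually means mapping each token to an integer index and reading the matching row of the embedding matrix. The sketch below illustrates only that idea with a made-up vocabulary and a made-up 3-dimensional matrix. The names `vocab`, `embedding_matrix`, and `embed_tokens` are hypothetical and not part of TransCoder; in a real setup the vocabulary and matrix would come from the trained model, and the tokens would have to be in the same form the vocabulary uses.

```python
# Toy stand-ins (assumptions): a real vocabulary and matrix would be loaded
# from the trained model's dictionary and embedding layer.
vocab = {"<unk>": 0, "def": 1, "return": 2, "public": 3, "static": 4}
embedding_matrix = [
    [0.0, 0.0, 0.0],  # <unk>
    [0.9, 0.1, 0.0],  # def
    [0.1, 0.8, 0.1],  # return
    [0.8, 0.2, 0.1],  # public
    [0.7, 0.3, 0.2],  # static
]


def embed_tokens(line, vocab, matrix, unk="<unk>"):
    """Map each whitespace-separated token of a line to its embedding row."""
    tokens = line.split()
    # Unknown tokens fall back to the <unk> row.
    ids = [vocab.get(tok, vocab[unk]) for tok in tokens]
    return [(tok, matrix[i]) for tok, i in zip(tokens, ids)]


pairs = embed_tokens("public static foo", vocab, embedding_matrix)
for token, vector in pairs:
    print(token, vector)
```

The resulting vectors are what a tool such as t-SNE would take as input for a 2-D plot.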

Related

How to fine-tune spaCy's word vectors

I am predicting similarities of documents using the pre-trained spaCy word embeddings. Because I have a lot of domain-specific words, I want to fine-tune my vectors on a rather small data set containing my domain-specific vocabulary.
My idea was to just train the spaCy model again with my data. But since the word vectors in spaCy are built in, I am not sure how to do that. Is there a way to train the spaCy model again with my data?
After some research, I found out that I can train my own vectors using Gensim. For that I would have to download a pre-trained model, for example the Google News dataset model, and afterwards I could train it again with my data set. Is this the only way? Or is there a way to proceed with my spaCy model?
Any help is greatly appreciated.
Update: the right term here was "incremental training", and that's not possible with the pre-trained spaCy models.
It is, however, possible to perform incremental training on a gensim model. I did that with the help of another pretrained vector set (I went with the fastText model): I loaded the fastText vectors into a gensim model and then trained that model again on my own corpus. This worked pretty well.
If you have pre-trained word embeddings with fastText on your domain and would like to use them with spaCy, you can extend or replace the tokens of an existing spaCy model with your new fastText vocabulary and vectors using something similar to this:
https://github.com/explosion/spaCy/issues/2538#issuecomment-404888091
or from scratch:
https://spacy.io/usage/vectors-similarity#converting
The advantage of this approach is that (1) you can keep using spaCy and (2) if some tokens were present in the pre-trained spaCy model but not in your corpus, you will still be able to use them.
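The extend-or-replace idea can be pictured with plain dictionaries. This is only a sketch of the merge logic under the assumption that both vector sets have the same dimensionality; `merge_vectors` and the toy tables are hypothetical and do not use the spaCy API (the linked comment shows how to write vectors into an actual spaCy model).

```python
def merge_vectors(pretrained, domain):
    """Return a new table: every pre-trained entry, with domain entries added or overriding."""
    dims = {len(v) for v in pretrained.values()} | {len(v) for v in domain.values()}
    if len(dims) > 1:
        raise ValueError(f"vector dimensions differ: {sorted(dims)}")
    merged = dict(pretrained)  # keep tokens that only the pre-trained model knows
    merged.update(domain)      # add new tokens, replace overlapping ones
    return merged


# Made-up 2-dimensional vectors for illustration only.
pretrained = {"loom": [0.1, 0.2], "house": [0.5, 0.5]}
domain = {"loom": [0.3, 0.1], "heddle": [0.4, 0.0]}

merged = merge_vectors(pretrained, domain)
print(sorted(merged))
```

Here "house" survives from the pre-trained table, "heddle" is new, and "loom" takes the domain-specific vector.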

What is vggish_model.ckpt and vggish_pca_params.npz

I am trying to understand some aspects of audio classification and came by "vggish_model.ckpt" and "vggish_pca_params.npz". I am trying to have a good understanding of these two. Are they part of tensorflow or google audio set? Why do I need to use them when building audio features? I couldn't see any documentation about them!
The precalculated features released with AudioSet are "embeddings" from a deep net that was trained to predict video-level tags from soundtracks (see https://arxiv.org/abs/1609.09430). The embedding layer is further processed via PCA to reduce dimensionality; this processing is included to make the features compatible with the ones released in https://research.google.com/youtube8m/ . So, vggish_model.ckpt gives the weights of the VGG-like deep CNN used to calculate the embedding from mel-spectrogram patches, and vggish_pca_params.npz gives the bases for the PCA transformation.
The only content released as part of AudioSet are these precalculated embedding features. If you train a model based on these features, then want to use it to classify new inputs, you must convert the new input to the same domain, and thus you have to use vggish_model and vggish_pca_params.
If AudioSet had included waveforms, none of this would be needed. But YouTube terms of service do not allow download and redistribution of its users' content.
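To make the PCA step concrete, here is a minimal sketch of what applying stored PCA parameters to one embedding looks like: subtract the stored mean, then multiply by the stored basis matrix. The numbers are toy values and `apply_pca` is a hypothetical helper, not the released code; the real parameters would be arrays loaded from vggish_pca_params.npz, and the released postprocessing may include further steps that are not shown here.

```python
def apply_pca(embedding, means, eigen_vectors):
    """Project one embedding: subtract the mean, then multiply by the PCA matrix."""
    centered = [x - m for x, m in zip(embedding, means)]
    # Each row of eigen_vectors yields one output dimension.
    return [sum(w * c for w, c in zip(row, centered)) for row in eigen_vectors]


# Toy parameters (assumptions): a 3-d embedding reduced to 2 dimensions.
means = [1.0, 2.0, 3.0]
eigen_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]

raw = [2.0, 4.0, 5.0]
projected = apply_pca(raw, means, eigen_vectors)
print(projected)
```

A new input has to go through the same network and the same projection before a model trained on the released features can use it.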

General usefulness of Dense layers for different identification tasks

I'd like to ask: is it practical to use embeddings and similarity metrics for any form of identification task? If I had a neural network trained to find different objects in a photo, would extracting the outputs of the fully-connected (Dense) layers and clustering them be useful?
I've recently found that there is an embeddings projector tool from TensorFlow that is very cool and useful. I know that there has been some work on word embeddings and how similar words cluster together. This is the case for faces as well.
Having said that, I want to apply the same methods to analyzing geological sites; can I train a model to create embeddings of the features of a site and use clustering methods to classify them?
Yes, we can do that. We can use embeddings for images and visualize the embeddings in TensorBoard.
You can adapt the Fashion-MNIST embedding example found here to your use case.
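As a sketch of the "embeddings plus similarity metric" idea from the question, the snippet below identifies a query by finding the labelled reference embedding with the highest cosine similarity. The labels and 3-dimensional vectors are invented for illustration; real embeddings would come from a Dense layer of a trained network. It assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(query, references):
    """Return the label of the reference embedding most similar to the query."""
    return max(references, key=lambda label: cosine_similarity(query, references[label]))


# Hypothetical site embeddings (made-up values).
references = {
    "sandstone_outcrop": [0.9, 0.1, 0.0],
    "basalt_flow": [0.1, 0.9, 0.2],
}

label = identify([0.8, 0.2, 0.1], references)
print(label)
```

Whether this works for a given domain depends on how well the network's embeddings separate the classes, which has to be checked empirically.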

Could I use BERT to cluster phrases with a pre-trained model?

I tried using Gensim with the GoogleNews pre-trained model to cluster phrases like the following, but it failed:
knitting
knit loom
loom knitting
weaving loom
rainbow loom
home decoration accessories
loom knit/knitting loom
...
I am told that the GoogleNews model doesn't have these phrases in it. My phrases are too specific for the GoogleNews model, and I don't have a corpus to train a new model; I have only the phrases. So now I am considering turning to BERT. But could BERT do what I expect, as described above? Thank you.
You can feed a phrase into the pretrained BERT model and get an embedding, i.e. a fixed-dimension vector. So BERT can embed your phrases in a space. Then you can use a clustering algorithm (such as k-means) to cluster the phrases. The phrases do not need to occur in the training corpus of BERT, as long as the words they consist of are in the vocabulary. You will have to try to see if the embeddings give you relevant results.
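The clustering half of this answer can be sketched independently of BERT. Below is a plain k-means over made-up 2-dimensional vectors that stand in for phrase embeddings; the vectors, the `kmeans` helper, and the choice to seed centroids with the first k points are assumptions for illustration. Real BERT embeddings would have hundreds of dimensions, and a library implementation of k-means would normally be used.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignments


phrases = ["knit loom", "home decoration accessories", "loom knitting", "weaving loom"]
# Made-up stand-ins for phrase embeddings.
vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.85, 0.1]]

labels = kmeans(vectors, k=2)
for phrase, cluster in zip(phrases, labels):
    print(cluster, phrase)
```

With these toy vectors the three loom phrases end up in one cluster and the decoration phrase in the other; with real embeddings the grouping has to be inspected to see whether it is meaningful.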

Spacy Document Vectors with Custom Number of Dimensions

This question may be a duplicate, but I could not find the answer on StackOverflow.
Is there a way to generate document vectors with another number of dimensions such as 25 instead of 300? I also checked the spacy documentation but could not find the answer.
Thanks!
The document and word vectors in spaCy are not generated by spaCy; they are pre-trained embeddings built from a large corpus. For more details on the embeddings, you can check Word Vectors and Semantic Similarity in the docs.
If you wanted to use your own embeddings that were 25-dimensional, you could follow the instructions here. spaCy won't train new embeddings for you; for that I'd recommend gensim.
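One common way to get a document vector from word vectors is to average them, so the document vector has whatever dimensionality the word vectors have. The sketch below shows that averaging with a made-up 3-dimensional table; a 25-dimensional table would work the same way. `document_vector` is a hypothetical helper, and skipping unknown tokens is a choice made here that may differ from how spaCy treats out-of-vocabulary tokens.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


# Made-up custom embeddings (3 dimensions for brevity).
word_vectors = {"rock": [1.0, 0.0, 2.0], "layer": [3.0, 2.0, 0.0]}

doc = document_vector(["rock", "layer", "unseen"], word_vectors, dim=3)
print(doc)
```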