How to do decimal encoding of DNA sequences (dataset)? - k-means

I need to perform k-means clustering and hierarchical clustering of DNA (nucleotide) sequences which I have downloaded in FASTA format. Before performing the clustering I need to do decimal encoding of the bases (A, T, C, G). How can I do that, so that I can take this input in matrix form in MATLAB?

Use the nt2int function. Documentation on it below:
http://www.mathworks.com/help/bioinfo/ref/nt2int.html
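If you only need the four standard bases, the encoding itself is just a lookup table. Here is a minimal sketch in Python, assuming the nt2int-style convention A=1, C=2, G=3, T=4 (check the documentation above for how ambiguity codes and gaps are handled):

```python
# Minimal sketch: decimal encoding of a nucleotide sequence.
# Assumes the nt2int-style mapping A=1, C=2, G=3, T=4; anything else maps to 0.
MAPPING = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(sequence):
    """Return one integer per base, so sequences can be stacked into a matrix."""
    return [MAPPING.get(base, 0) for base in sequence.upper()]

print(encode("atcg"))  # [1, 4, 2, 3]
```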

Related

Pre_tokenization/tokenization of DNA data using HuggingFace

I am struggling with transformers on DNA data for a supervised binary classification problem. I have very long DNA sequences (the mean is 6E7 characters) and, to be able to pass longer sequences as input to the neural network, I am trying different tokenization algorithms so that I can work with longer tokens rather than only single-character (C, G, A, T) ones.
At the moment I am using HuggingFace to implement the BPE, WordPiece, and Unigram algorithms. However, before training those models I have to apply a pre-tokenizer to my data. All of the available pre-tokenizers are based on "classic" language structure, like Whitespace(), but in my case I only have a list of DNA sequences like this (small chunk):
['CCAGCAGCTCGGTGCGCTTGCCGCTCCAGTCGCCCAGCAGCTCGGTGCGCTTGCCGCCCCAGTCGC']
My intention is to group those characters so I can work with bigger tokens than a single character. However, when I use for example Whitespace(), my model does not learn...
Could you recommend a pre-tokenizer that passes only characters as input to BPE, WordPiece, and Unigram?
Also, would you recommend padding the sequences before or after the tokenization process?
Thank you very much
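A minimal sketch of one way to set this up with the HuggingFace tokenizers library, assuming you simply skip the pre-tokenizer so that BPE builds merges starting from single characters (the vocabulary size and special tokens below are placeholder choices):

```python
# Minimal sketch: train a BPE tokenizer directly on raw DNA strings with the
# HuggingFace `tokenizers` library, with no whitespace pre-tokenizer.
# Vocabulary size and special tokens are placeholder choices.
from tokenizers import Tokenizer, models, trainers

dna_sequences = ["CCAGCAGCTCGGTGCGCTTGCCGCTCCAGTCGCCCAGCAGCTCGGTGCGCTTGCCGCCCCAGTCGC"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# No pre_tokenizer is set, so each sequence is treated as one long "word"
# and BPE learns merges starting from the single characters C, G, A, T.
trainer = trainers.BpeTrainer(vocab_size=256, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(dna_sequences, trainer=trainer)

encoding = tokenizer.encode("CCAGCAGCTCGGTGCG")
print(encoding.tokens)  # multi-character tokens instead of single bases
```

Whether skipping the pre-tokenizer scales to 6E7-character sequences is something you would still need to test on your data.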

how to find closeness between two keras pad_sequences?

I am writing a small proof of concept where I turn a catalog into a JSON that has a URL and a label that describes the web page. I read this JSON in Python, tokenize it, and create padded sequences with pad_sequences.
I then need to compare some free-flow texts to find which index of the padded sequences has the most words from the free-flow text.
I am generating a pad_sequences() output from that text too, but I am not sure whether I can somehow compare the two sequences for closeness?
Please help.
You can use cosine similarity or euclidean distance to compare two vectors.
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/CosineSimilarity
https://www.tutorialexample.com/calculate-euclidean-distance-in-tensorflow-a-step-guide-tensorflow-tutorial/
For sequences, you can first embed them into vectors of the same length.
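A minimal sketch of that idea, assuming mean-pooled embeddings as the fixed-length vectors (the vocabulary size, embedding dimension, and token ids below are arbitrary placeholders):

```python
# Minimal sketch: map two padded token-id sequences to fixed-length vectors
# (mean-pooled embeddings, used here purely as a placeholder) and compare them
# with the Keras cosine-similarity metric.
import tensorflow as tf

vocab_size, embed_dim = 1000, 16
embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)

seq_a = tf.constant([[5, 12, 7, 0, 0]])   # padded sequence from the catalog
seq_b = tf.constant([[5, 12, 9, 0, 0]])   # padded sequence from the free-flow text

# Mean-pool the per-token embeddings into one vector per sequence.
vec_a = tf.reduce_mean(embedding(seq_a), axis=1)
vec_b = tf.reduce_mean(embedding(seq_b), axis=1)

metric = tf.keras.metrics.CosineSimilarity(axis=-1)
metric.update_state(vec_a, vec_b)
print(metric.result().numpy())  # closer to 1.0 means more similar
```

With an untrained embedding the number itself is not meaningful; the point is only the mechanics of pooling and comparing.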

one_hot Vs Tokenizer for Word representation

I have seen in many blogs people using one_hot (from tf.keras.preprocessing.text.one_hot) to convert a string of words into an array of numbers which represent indices. This does not ensure uniqueness, whereas the Tokenizer class (tf.keras.preprocessing.text.Tokenizer) does.
Then why is one_hot preferred over Tokenizer?
Update: I have learned that hashing is used in one_hot to convert words into numbers, but I don't see its importance, as we can use the Tokenizer class to do the same thing more accurately.
I am not sure what you mean by unicity. I expect it has to do with the sequential relationship between the words, which of course is lost with one-hot encoding. However, one-hot encoding is typically used when the number of words is limited. If, say, you have 10 words in the vocabulary, you will create 10 new features, which is fine for most neural networks to process. If you have other features in your data set besides the word sequences, say numeric ordinal parameters, you can still create a single-input model.

However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a lot of processing. So in the case of a large vocabulary it is best to use a "dense" encoding rather than the sparse encoding generated by one-hot encoding. You can use the results of the Tokenizer encoding as input to a Keras Embedding layer, which will encode the words into an N-dimensional space where N is a value you specify. If you have additional ordinal features, then to process the data your model will need multiple inputs. Perhaps that is why some people prefer to one-hot encode the words.
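To make the collision point concrete, here is a small sketch (the sentence and the hashing range n=5 are arbitrary): one_hot hashes words into a fixed range, so distinct words can land on the same index, while Tokenizer assigns each distinct word its own index.

```python
# Sketch: one_hot uses the hashing trick, so distinct words may collide on the
# same index; Tokenizer builds a vocabulary with one unique index per word.
from tensorflow.keras.preprocessing.text import Tokenizer, one_hot

text = "the quick brown fox jumps over the lazy dog"

# Hashing into a small range: collisions between different words are possible.
print(one_hot(text, n=5))

# Tokenizer: every distinct word gets its own index, so no collisions.
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
print(tokenizer.texts_to_sequences([text]))
print(tokenizer.word_index)
```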

How is hashing implemented in SGNN (Self-Governing Neural Networks)?

So I've read the paper named Self-Governing Neural Networks for On-Device Short Text Classification which presents an embedding-free approach to projecting words into a neural representation. To quote them:
The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters. [...] our method is a truly embedding-free approach unlike majority of the widely-used state-of-the-art deep learning techniques in NLP
Basically, from what I understand, they proceed as follows:
You'd first need to compute n-grams (side question: are those skip-grams in the old sense, or skip-grams in the word2vec sense? I assume the former for what follows) on the words' characters to obtain a featurized representation of the words in a text. For example, with 4-grams you could get a 1M-dimensional sparse feature vector per word. It's sparse, so memory needn't be fully allocated for it, because it's almost one-hot (or count-vectorized, or tf-idf-vectorized n-grams with lots of zeros).
Then you'd need to hash those sparse n-gram vectors using locality-sensitive hashing (LSH). They seem to use random projection, from what I've understood. Also, instead of n-gram vectors, they use tuples of (n-gram feature index, value) for the non-zero n-gram features (which is also by definition a "sparse matrix" computed on the fly, e.g. from a default dictionary of non-zero features instead of a full vector).
I found an implementation of random projection in scikit-learn. From my tests, it doesn't seem to yield a binary output, although the whole thing uses sparse on-the-fly computations within scikit-learn's sparse matrices, as expected for a memory-efficient (non-zero, dictionary-like features) implementation, I guess.
What doesn't work in all of this, and where my question lies, is how they could end up with binary features from the sparse projection (the hashing). They seem to be saying that the hashing is done at the same time as computing the features, which is confusing; I would have expected the hashing to come in the 1-2-3 order I wrote above, but their steps 1 and 2 seem to be somehow merged.
My confusion arises mostly from the paragraph starting with the phrase "On-the-fly Computation." on page 888 (PDF page 2) of the paper, in the right column.
I'd like to bring my school project to a successful conclusion (I am trying to mix BERT with SGNNs instead of using word embeddings). So, how would you demystify this? More precisely, how could a similar random hashing projection be achieved with scikit-learn, TensorFlow, or PyTorch? Trying to connect the dots here, I've done a lot of research, but their paper doesn't give implementation details, which is what I'd like to reproduce. I at least know that the SGNN uses 80 fourteen-dimensional LSH hashes on character-level n-grams of words (is my understanding right in the first place?).
Thanks!
EDIT: after starting to code, I realized that the output of scikit-learn's SparseRandomProjection() looks like this:
[0.7278244729081154,
-0.7278244729081154,
0.0,
0.0,
0.7278244729081154,
0.0,
...
]
For now, this looks fine: it's closer to binary, but it could still be cast to an integer instead of a float by using the right scaling ratio in the first place. I still wonder about the skip-gram question; I assume character n-grams of words for now, but that is probably wrong. Will post code soon to GitHub.
EDIT #2: I coded something here, but with n-grams instead of skip-grams: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
More discussion threads on this here: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues?q=is%3Aissue
First of all, thanks for your implementation of the projection layer, it helped me get started with my own.
I read your discussion with #thinline72, and I agree with him that the features are calculated on the whole line of text, character by character, not word by word. I am not sure this difference in features is too relevant, though.
Answering your question: I interpret that they do steps 1 and 2 separately, as you suggested and did. Right, in the article excerpt that you include, they talk about hashing both in the feature construction and in the projection, but I think those are two different hashes. And I interpret that the first hashing (feature construction) is done automatically by the CountVectorizer method.
Feel free to take a look at my implementation of the paper, where I built the end-to-end network and trained it on the SwDA dataset, with the split used in the SGNN paper. I obtain a maximum of 71% accuracy, which is somewhat lower than the paper claims. I also used the binary hasher that #thinline72 recommended, and nltk's implementation of skipgrams (I am quite certain the SGNN paper is talking about "old" skip-grams, not word2vec skip-grams).
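For reference, a minimal sketch of the pipeline as discussed above: character n-gram features via CountVectorizer, a sparse random projection, then binarization by sign. The n-gram range and projection dimensionality are small placeholders, not the paper's 80 x 14-dimensional configuration.

```python
# Sketch of the projection discussed above: character n-gram features
# (CountVectorizer, as in the answer), a sparse random projection, then
# binarization by sign. Dimensions are small placeholders, not the paper's
# 80 projections of 14 dimensions each.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

texts = ["how are you doing today", "the weather is nice outside"]

# Step 1: featurize each whole line of text into sparse char n-gram counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4))
features = vectorizer.fit_transform(texts)

# Step 2: random projection down to a small dense space.
projector = SparseRandomProjection(n_components=14, random_state=0)
projected = projector.fit_transform(features).toarray()

# Step 3: binarize by sign to obtain the 0/1 "hash bits".
bits = (projected > 0).astype(int)
print(bits)  # one 14-bit row per input line
```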

How are quantized DCT coefficients serialised in JPEG?

I've read in dozens of articles, scientific papers, and toy implementations that the steps in JPEG compression are roughly as follows:
Take 8x8 DCT
Divide by quantization matrix
Round to integers
Run-length & Huffman coding
And then the inverse is pretty much the same. What is left out in everything on the topic I've found so far is the magnitude of the data and the corresponding serialization.
It appears implicitly assumed that all the coefficients are stored as unsigned bytes. However, as I understand it, the DC coefficient is in the range 0-255, while the AC coefficients can be negative. Are the AC coefficients in the range ±255, or ±127, or something else?
What is the common way to store these coefficients in a compact way?
The first-hand source to read is of course the ITU-T T.81 standard document.
Looks like the first Google link leads to a paywall, but it's on the w3 site: https://www.w3.org/Graphics/JPEG/itu-t81.pdf
Take 8-bit input samples (0..255)
Subtract 128 (-128..127)
Do N*N fDCT, where N=8
Output can have log2(N)+8 bits = 11 bits (-1024..1023)
DC coefficients are stored as a difference, so they can have 12 bits.
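As a small sketch of the range arithmetic above (the flat quantization table here is only a placeholder, not the standard JPEG luminance table):

```python
# Sketch of the range arithmetic above: level-shift 8-bit samples, take the
# 8x8 DCT, and quantize. The flat quantization table is a placeholder, not
# the standard JPEG luminance table.
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8))       # 8-bit input samples (0..255)

shifted = block.astype(np.int32) - 128          # level shift to -128..127
coeffs = dctn(shifted, norm="ortho")            # 2-D DCT of the shifted block

quant = np.full((8, 8), 16)                     # placeholder quantization table
quantized = np.round(coeffs / quant).astype(int)

print(coeffs.min(), coeffs.max())  # stays within the roughly -1024..1023 range above
print(quantized)
```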
The encoding process depends upon whether you have a sequential scan or a progressive scan. The details of the encoding process are too complicated to fit within an answer here.
I highly recommend this book:
https://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_2?ie=UTF8&qid=1531091178&sr=8-2&keywords=JPEG&dpID=5168QFRTslL&preST=_SX258_BO1,204,203,200_QL70_&dpSrc=srch
It is the only source I know of that explains JPEG end-to-end in plain English.