HISAT2 graph-based indexing

Thanks for your time.
Regarding HISAT2's graph-based indexing:
I was trying to understand the indexing step in different aligners. I understood the Bowtie2 and BWA versions, which are built with the BWT, suffix array, and FM index.
I have tried hard to understand the graph-based indexing but could not get to the point of understanding the GFM index. I need some suggestions and help to understand how they build the graph and what is meant by "prefix-sorted graph - doubling and pruning", i.e. how this graph is constructed. I did understand the LF-mapping part once the graph is prefix-sorted.
I'll be very thankful for your suggestions.
In short: how should I understand the prefix-sorted graph, which is said to be built by "doubling and pruning"?
Best regards,
Kiran.
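As background (this is only an analogy, not HISAT2's actual construction): the "doubling" part is easiest to see first on a plain string, where prefix doubling sorts all suffixes by comparing prefixes whose length doubles each round; the graph index generalizes this to paths in the graph, and "pruning" (roughly speaking) stops extending paths once their prefixes already distinguish them uniquely. A minimal string-only sketch of the doubling idea:

def prefix_doubling_suffix_array(s):
    # Manber-Myers style prefix doubling: each round sorts suffixes by their
    # first 2^k characters, reusing the ranks computed in the previous round.
    n = len(s)
    rank = [ord(c) for c in s]        # start: suffixes ranked by their first character
    sa = list(range(n))
    k = 1
    while k < n:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for idx in range(1, n):
            new_rank[sa[idx]] = new_rank[sa[idx - 1]] + (key(sa[idx]) != key(sa[idx - 1]))
        rank, k = new_rank, k * 2     # the prefix length handled so far doubles
    return sa

print(prefix_doubling_suffix_array("GATTACA"))   # -> [6, 4, 1, 5, 0, 3, 2]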

Related

tensorflow datasets: more efficient to vectorize with unbatching (batch -> map -> unbatch) or just map?

TensorFlow recommends batching of datasets before transformations with map in order to vectorize the transformation and reduce overhead: https://www.tensorflow.org/guide/data_performance#vectorizing_mapping
However, there are cases where you want to perform transformations on the dataset and then do something (e.g., shuffle) on the UNBATCHED dataset.
I haven't been able to find anything to indicate which is more efficient:
1) dataset.map(my_transformations)
2) dataset.batch(batch_size).map(my_transformations).unbatch()
(2) has reduced map overhead from having vectorized with batch, but has additional overhead from having to unbatch the dataset after.
I could also see there not being a universal rule. Short of testing every time I try a new dataset or transformation (or hardware!), is there a good rule of thumb here? I have seen several examples online use (2) without explanation, but I have no intuition on this subject.
Thanks in advance!
EDIT: I have since found that in at least some cases, (2) is MUCH less efficient than (1). For example, on our image dataset, applying random flips and rotations (with .map and the built-in TF functions tf.image.random_flip_left_right, tf.image.random_flip_up_down, and tf.image.rot90) per epoch for data augmentation takes 50% longer with (2). I still have no idea when to expect this to be the case, or not, but the tutorials' suggested approach is at least sometimes wrong.
The answer is (1). https://github.com/tensorflow/tensorflow/issues/40386
TF is modifying the documentation to reflect that the overhead from unbatch will usually (always?) be higher than the savings from vectorized transformations.
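For anyone who wants to check this on their own data, here is a minimal timing sketch of the two pipelines on a toy in-memory image dataset; the dataset size, image shape, and batch size are placeholders, not values from the question:

import time
import tensorflow as tf

# Stand-in dataset: 1,000 random 64x64 RGB images held in memory.
images = tf.random.uniform((1000, 64, 64, 3))
dataset = tf.data.Dataset.from_tensor_slices(images)

def augment(img):
    # The augmentations mentioned above; they accept both single images (3-D)
    # and batches (4-D), so the same function works for pipelines (1) and (2).
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_flip_up_down(img)
    return tf.image.rot90(img)

def time_pipeline(ds):
    start = time.perf_counter()
    for _ in ds:  # force the whole pipeline to run
        pass
    return time.perf_counter() - start

# (1) element-wise map
t1 = time_pipeline(dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE))

# (2) batch -> map -> unbatch
t2 = time_pipeline(
    dataset.batch(32).map(augment, num_parallel_calls=tf.data.AUTOTUNE).unbatch())

print(f"(1) map only:          {t1:.3f} s")
print(f"(2) batch/map/unbatch: {t2:.3f} s")

Note that, depending on the TF version, the batched random flips may draw one random decision per batch rather than per image, so (1) and (2) are not guaranteed to produce identical augmentations; the sketch only compares throughput.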

How is hashing implemented in SGNN (Self-Governing Neural Networks)?

So I've read the paper named Self-Governing Neural Networks for On-Device Short Text Classification which presents an embedding-free approach to projecting words into a neural representation. To quote them:
The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters. [...] our method is a truly embedding-free approach unlike majority of the widely-used state-of-the-art deep learning techniques in NLP
Basically, from what I understand, they proceed as follows:
You'd first need to compute n-grams (side question: is that skip-gram as in the "old" skip-grams, or skip-gram as in word2vec? I assume it's the former for what remains) on the characters of words to obtain a featurized representation of the words in a text. As an example, with 4-grams you could get a 1M-dimensional sparse feature vector per word. It is sparse, so memory needn't be fully allocated for it, because it's almost one-hot (or count-vectorized, or tf-idf vectorized n-grams with lots of zeros).
Then you'd need to hash those sparse n-gram vectors using Locality-Sensitive Hashing (LSH). They seem to use Random Projection from what I've understood. Also, instead of full n-gram vectors, they use tuples of (n-gram feature index, value) for the non-zero n-gram features (which is also, by definition, a "sparse matrix" computed on the fly, e.g. from a default dictionary of non-zero features instead of a full vector).
I found an implementation of Random Projection in scikit-learn. From my tests, it doesn't seem to yield a binary output, although the whole thing uses sparse on-the-fly computations within scikit-learn's sparse matrices, as expected for a memory-efficient (non-zero, dictionary-like features) implementation, I guess.
What doesn't work in all of this, and where my question lies, is how they could end up with binary features from the sparse projection (the hashing). They seem to be saying that the hashing is done at the same time as computing the features, which is confusing: I would have expected the hashing to come in the order I wrote above, as steps 1-2-3, but their steps 1 and 2 seem to be somehow merged.
My confusion arises mostly from the paragraphs starting with the phrase "On-the-fly Computation." on page 888 (the PDF's page 2) of the paper, in the right column.
I'd like to bring my school project to a successful conclusion (trying to mix BERT with SGNNs instead of using word embeddings). So, how would you demystify this? More precisely, how could a similar random hashing projection be achieved with scikit-learn, TensorFlow, or PyTorch? Trying to connect the dots here, I've done a lot of research, but the paper doesn't give implementation details, which is what I'd like to reproduce. I at least know that the SGNN uses 80 fourteen-dimensional LSHs on character-level n-grams of words (is my understanding right in the first place?).
Thanks!
EDIT: after starting to code, I realized that the output of scikit-learn's SparseRandomProjection() looks like this:
[0.7278244729081154,
-0.7278244729081154,
0.0,
0.0,
0.7278244729081154,
0.0,
...
]
For now, this looks fine; it's closer to binary, but it could still be cast to integers instead of floats by using the right scaling factor in the first place. I still wonder about the skip-gram question; I assume character n-grams of words for now, but that's probably wrong. Will post code soon to GitHub.
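As a side note, one simple way to get binary features out of such a projection (not necessarily what the paper does) is to keep only the sign of each projected component, in the style of sign-random-projection / SimHash. A minimal sketch with made-up input sizes:

import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(5, 1000)  # stand-in for sparse n-gram count vectors (5 words, 1000 features)

proj = SparseRandomProjection(n_components=14, random_state=0)
Y = proj.fit_transform(X)       # float output, with values like the ones listed above

bits = (Y > 0).astype(np.int8)  # 1 where the projected value is positive, else 0
print(bits)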
EDIT #2: I coded something here, but with n-grams instead of skip-grams: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
More discussion threads on this here: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues?q=is%3Aissue
First of all, thanks for your implementation of the projection layer; it helped me get started with my own.
I read your discussion with @thinline72, and I agree with him that the features are calculated on the whole line of text, character by character, not word by word. I am not sure this difference in features is too relevant, though.
Answering your question: my interpretation is that they do steps 1 and 2 separately, as you suggested and did. Right, in the article excerpt that you include, they talk about hashing both in feature construction and in projection, but I think those are two different hashes. And my interpretation is that the first hashing (feature construction) is automatically done by the CountVectorizer method.
Feel free to take a look at my implementation of the paper, where I built the end-to-end network and trained it on the SwDA dataset, with the same split as in the SGNN paper. I obtain a maximum of 71% accuracy, which is somewhat lower than what the paper claims. I also used the binary hasher that @thinline72 recommended, and nltk's implementation of skipgrams (I am quite certain the SGNN paper is talking about "old" skipgrams, not word2vec skipgrams).
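To make that interpretation concrete, here is a rough sketch of the two steps with scikit-learn: character n-grams over the whole line via CountVectorizer, followed by T sign-binarized random projections. The T = 80 projections of d = 14 bits come from the paper as discussed above; everything else (the toy sentences, the char n-gram range, using SparseRandomProjection as the LSH) is an assumption, not the authors' code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.random_projection import SparseRandomProjection

texts = ["hello how are you", "fine thanks and you"]  # toy input lines

# Step 1: character n-gram counts over the whole line (under this interpretation,
# the "feature construction" hash is CountVectorizer's internal vocabulary lookup).
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 4))
X = vectorizer.fit_transform(texts)  # sparse count matrix, shape (n_lines, n_ngrams)

# Step 2: T independent random projections, each binarized to d bits by sign.
T, d = 80, 14
projections = [SparseRandomProjection(n_components=d, dense_output=True, random_state=t)
               for t in range(T)]
features = np.hstack([(p.fit_transform(X) > 0).astype(np.float32) for p in projections])

print(features.shape)  # (2, 80 * 14) = (2, 1120) binary features per line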

pandas: finding relationships between data in a large dataset

I am new to data science and I want to explore the relationships in my data. I have a very large dataset of 556784 rows by 60 columns. There are some unwanted variables that should be ignored before feeding the data to a neural network. Linear regression and multiple regression can help us find the relationship between the X variables and the Y label, but does running a regression technique on such a huge dataset really help? Or are there other ways to find which data is really important to the problem and which is not?
I know this is a theory question, but answering it would really help me proceed.
Thanks!
I'm also a noob in DS, but I think I can give you some ideas:
The way you treat your data depends on what kind of data you are working with (numbers, text, or some kind of time series).
It is a good idea to explore it yourself by making some plots.
You can use a reasonably small part of your data to reduce computation time.
Do you really need a neural network? It gives results that are hard to interpret and takes time to train; maybe you should start with "classic" models first and do some good feature engineering (see the sketch below).
Finally, you can check the sklearn manual (which I find really good), in particular the data preprocessing chapter; I think it will give you some ideas to try:
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
I hope some of this will be helpful.
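A minimal sketch of those suggestions (sample the data, look at simple correlations with the target, then try a classic baseline model before a neural network). The DataFrame here is synthetic stand-in data, not the asker's table, and the column names are made up:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in for the real 556784 x 60 table: random features plus a target that
# actually depends on only a few of them.
rng = np.random.RandomState(0)
X = rng.rand(10_000, 59)
y = 3 * X[:, 0] - 2 * X[:, 5] + 0.1 * rng.randn(10_000)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(59)])
df["target"] = y

sample = df.sample(n=5_000, random_state=0)   # work on a small sample first

# Rank features by absolute correlation with the target.
corr = sample.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(corr.head(10))                          # x0 and x5 should dominate here

# Quick linear baseline on the top features before trying a neural network.
top = corr.head(10).index
X_train, X_test, y_train, y_test = train_test_split(
    sample[top], sample["target"], test_size=0.2, random_state=0)
print("R^2:", LinearRegression().fit(X_train, y_train).score(X_test, y_test))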

TSP and Lin-Kernighan algorithm from a Prim MST graph

I'm trying to code a solution to the TSP. I already have the minimum-weight spanning tree thanks to Prim's algorithm, and I also read that a Lin-Kernighan tour could be constructed from this graph, but I can't see how to do it.
Could anyone explain to me how to perform that?
Thanks
You need to construct an Eulerian circuit from your minimum spanning tree (double its edges so every vertex has even degree, then shortcut repeated vertices to get a tour), and then you can remove overlapping paths (crossings between two edges) with Lin-Kernighan.
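A rough sketch of that idea in Python, using only plain 2-opt moves (the simplest moves of the Lin-Kernighan family, not the full algorithm); the points and distances are made up:

import itertools
import numpy as np

def mst_prim(dist):
    # Prim's algorithm; returns an adjacency list of the minimum spanning tree.
    n = len(dist)
    in_tree, adj = {0}, {i: [] for i in range(n)}
    while len(in_tree) < n:
        u, v = min(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: dist[e[0]][e[1]])
        adj[u].append(v); adj[v].append(u)
        in_tree.add(v)
    return adj

def tour_from_mst(adj):
    # Preorder walk of the MST = Eulerian walk on the doubled tree with
    # repeated vertices shortcut, giving an initial (non-optimal) tour.
    tour, seen, stack = [], set(), [0]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u); tour.append(u)
            stack.extend(reversed(adj[u]))
    return tour

def two_opt(tour, dist):
    # Keep reversing segments while that shortens the tour (removes crossings).
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(1, len(tour)), 2):
            a, b = tour[i - 1], tour[i]
            c, d = tour[j], tour[(j + 1) % len(tour)]
            if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                tour[i:j + 1] = reversed(tour[i:j + 1])
                improved = True
    return tour

# Toy example with random points and Euclidean distances.
pts = np.random.RandomState(0).rand(8, 2)
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(two_opt(tour_from_mst(mst_prim(dist)), dist))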

Lucene: how to build a term-doc matrix

I need to build that matrix, but I can't find a way to compute the normalized tf-idf for each cell.
The normalization I would perform is cosine normalization, that is, dividing each tf-idf value (computed using DefaultSimilarity) by sqrt(sum of squared tf-idf values in the column).
Does anyone know a way to perform that?
Thanks in advance
Antonio
One way, not using Lucene, is described in Sujit Pal's blog. Alternatively, you can build a Lucene index that stores term vectors per field, iterate over the terms to get the idf, then iterate over each term's documents to get the tf.
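For the normalization itself (independent of how the raw tf-idf values are obtained from Lucene), here is a small numpy sketch of the cosine normalization described in the question, on a made-up term-document matrix:

import numpy as np

# Toy term-document matrix of raw tf-idf weights: rows = terms, columns = docs.
tfidf = np.array([
    [0.5, 0.0, 1.2],
    [0.0, 0.8, 0.4],
    [1.0, 0.3, 0.0],
])

col_norms = np.sqrt((tfidf ** 2).sum(axis=0))   # sqrt of sum of squares per column
normalized = tfidf / col_norms                  # divide every cell by its column's norm

print(np.linalg.norm(normalized, axis=0))       # each document column now has norm 1.0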