Scikit Naive Bayes Classification for text - numpy

I am trying to use scikit-learn for Naive Bayes classification. I have a couple of questions (I am also new to scikit-learn).
1) scikit-learn algorithms want the input as a numpy array and the labels as an array. For text classification, should I map each word to a number (id) by maintaining a hash of the vocabulary words with a unique id for each? Is this the standard practice in scikit-learn?
2) If the same text can be assigned to more than one class, how should I proceed? One obvious way is to replicate each training example, once for each associated label. Does a better representation exist?
3) Similarly, for the test data, how will I get more than one class associated with a test example?
I am using http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
as my base.

1) Yes. Use DictVectorizer or HashingVectorizer from the feature_extraction module; they do the word-to-id mapping for you.
2) This is a multilabel problem. Consider OneVsRestClassifier from the multiclass module; it trains a separate classifier for each class.
3) Using a multilabel classifier / one classifier per class will do that (see the sketch after the links below).
Take a look at http://scikit-learn.org/dev/auto_examples/grid_search_text_feature_extraction.html
and http://scikit-learn.org/dev/auto_examples/plot_multilabel.html
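A minimal sketch tying these pieces together, assuming a toy corpus and made-up labels (the data and variable names here are only for illustration; CountVectorizer is used for brevity, and DictVectorizer or HashingVectorizer would slot in similarly):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus; each document may carry several labels.
docs = ["cheap meds online", "meeting at noon", "cheap tickets, meeting later"]
labels = [["spam"], ["work"], ["spam", "work"]]

# Turn the label sets into a binary indicator matrix for multilabel training.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# CountVectorizer does the word-to-id mapping; OneVsRest trains one NB per class.
clf = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
clf.fit(docs, Y)

# predict() returns an indicator matrix; map it back to the label names.
pred = clf.predict(["cheap meeting"])
print(mlb.inverse_transform(pred))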

Related

Is there any difference between keras.utils.to_categorical and pd.get_dummies?

I think sklearn.OneHotEncoder, pandas.get_dummies, and keras.to_categorical all serve the same purpose, but I don't know the difference.
Apart from the difference in input/output types there is no difference; they all achieve the same result.
There are, however, some practical differences:
Keras is the simplest: you give it the target vector and it one-hot encodes it. Use Keras if you only need to encode a label vector.
Pandas is the most elaborate: it creates a new column for every class in the data. The good part is that it works on DataFrames where you want to one-hot encode only some of the columns (so you could say it is more of a multi-purpose method, but not the preferable option if you just need to train a NN).
Sklearn lets you one-hot encode multiple features in the same transformer and is a bit more flexible than what Keras offers. If the method from Keras is too simple, try sklearn; if Keras is enough, stick with it.
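A quick side-by-side sketch of the three on a toy label vector (the data here is made up for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from keras.utils import to_categorical

y = np.array([0, 2, 1, 2])

# Keras: integer labels in, dense one-hot matrix out.
print(to_categorical(y, num_classes=3))

# Pandas: works on a Series/DataFrame and returns a DataFrame with one
# indicator column per class, handy when only some columns need encoding.
print(pd.get_dummies(pd.Series(y)))

# sklearn: a fit/transform transformer that can encode several feature
# columns at once and be reused inside a Pipeline.
enc = OneHotEncoder()
print(enc.fit_transform(y.reshape(-1, 1)).toarray())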

Loss function for ordinal multi classification in pytorch

I am a beginner with DNNs and PyTorch.
I am dealing with a multi-class classification problem where my labels are encoded as one-hot vectors, say of dimension D.
To this end, I am using CrossEntropyLoss. However, I now want to modify or replace this criterion so that it penalizes predictions that are far from the actual value: classifying 4 instead of 5 is better than classifying 2 instead of 5.
Is there a built-in function in PyTorch that implements this behavior? Otherwise, how can I modify CrossEntropyLoss to achieve it?
This could help you. It is a PyTorch implementation of ordinal regression:
https://www.ethanrosenthal.com/2018/12/06/spacecutter-ordinal-regression/
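If you would rather stay close to CrossEntropyLoss, one simple sketch (not the spacecutter approach, just an illustration assuming the class indices are ordered and the targets are integer indices rather than one-hot vectors) is to add a distance penalty on top of the usual cross-entropy:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceAwareCrossEntropy(nn.Module):
    """Cross-entropy plus a penalty proportional to the expected distance
    between the predicted class index and the true class index."""
    def __init__(self, num_classes, alpha=1.0):
        super().__init__()
        self.alpha = alpha
        self.register_buffer("idx", torch.arange(num_classes, dtype=torch.float32))

    def forward(self, logits, target):
        # target holds integer class indices of shape (batch,)
        ce = F.cross_entropy(logits, target)
        probs = F.softmax(logits, dim=1)
        # expected |predicted index - true index| under the predicted distribution
        dist = (probs * (self.idx.unsqueeze(0) - target.float().unsqueeze(1)).abs()).sum(dim=1)
        return ce + self.alpha * dist.mean()

# usage: criterion = DistanceAwareCrossEntropy(num_classes=D)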

Tensorflow: pattern training and generation

Imagine I have hundreds of rectangular patterns that look like the following:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
Say the only things that vary between patterns are the dimensions (width by height in characters) and the value at each cell within the rectangle, with possible characters _ y x z 0. So another pattern might look like this:
yx0x_x
xz_x0_
_yy0x_
zyy0__
and another like this:
xx0z00yy_z0x000
zzx_0000_xzzyxx
_yxy0y__yx0yy_z
_xz0z__0_y_xz0z
y__x0_0_y__x000
xz_x0_z0z__0_x0
These simplified examples were randomly generated, but imagine there is a deeper structure and relation between dimensions and layout of characters.
I want to train on this dataset in an unsupervised fashion (no labels) in order to generate similar output. Assuming I have created my dataset appropriately with tf.data.Dataset and categorical identity columns:
what is a good general purpose model for unsupervised training (no labels)?
is there a Tensorflow premade estimator that would represent such a model well enough?
once I've trained the model, what is a general approach to using it to generate patterns based on what it has learned? I have in mind Google Magenta, which can be trained on a dataset of musical melodies in order to generate similar ones from a kind of seed/primer melody.
I'm not looking for a full implementation (that's the fun part!), just some suggested tutorials and next steps to follow. Thanks!

How to train a reverse embedding, like vec2word?

How do you train a neural network to map from a vector representation to one-hot vectors? The example I'm interested in is where the vector representation is the output of a word2vec embedding, and I'd like to map back onto the individual words that were in the language used to train the embedding, so I guess this is vec2word?
In a bit more detail: if I understand correctly, a cluster of points in the embedding space represents similar words. Thus if you sample points from that cluster and use them as input to vec2word, the output should map to similar individual words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?
There's this TensorFlow tutorial on how to train word2vec, but I can't find anything on doing the reverse. I'm happy to use any deep learning library, and it's OK for the approach to be sampling-based/probabilistic.
Thanks a lot for your help, Ajay.
The easiest thing you can do is use the nearest-neighbor word. Given a query feature fq for an unknown word, and a reference feature set of known words R = {fr}, you can find the nearest fr* to fq and use the word corresponding to fr* as fq's word.
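A minimal sketch of that nearest-neighbor lookup, assuming you already have an embedding matrix and a vocabulary list (the names below are placeholders):

import numpy as np

def nearest_word(query_vec, embeddings, vocab):
    """Return the vocabulary word whose embedding is closest, by cosine
    similarity, to query_vec. embeddings has shape (vocab_size, dim) and
    vocab lists the words in the same row order."""
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = emb_norm @ q_norm          # cosine similarity to every known word
    return vocab[int(np.argmax(sims))]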

how to find similar words for a certain word in tensorflow_word2vec like using model.most_similar in gensim?

I've been using TensorFlow to build a word2vec model, reference here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L118
My question is: how can I find the top n similar words for a certain word? I know that in gensim I can save and load a word2vec model and then use model.most_similar to find what I want, but how do I do that in TensorFlow? Moreover, is there any way to save the model in TensorFlow, since what I get seems to be only an embedding matrix, is that right?
As long as you have computed the weight vector for each token, you can manipulate all the tokens in the vector space. Simply calculate the cosine similarity between the query vector and every other vector, then sort by score. For reference, you can look at the source code of the most_similar method implemented in gensim's word2vec model. Hope this helps.
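For illustration, a sketch of that cosine-similarity ranking over the learned embedding matrix (final_embeddings and reverse_dictionary are the names used in the linked word2vec_basic.py, if memory serves; adapt as needed):

import numpy as np

def most_similar(word_id, final_embeddings, reverse_dictionary, topn=8):
    """Rank all words by cosine similarity to the word with the given id."""
    norm = final_embeddings / np.linalg.norm(final_embeddings, axis=1, keepdims=True)
    sims = norm @ norm[word_id]                 # cosine similarity to every word
    best = np.argsort(-sims)[1:topn + 1]        # skip the query word itself at rank 0
    return [(reverse_dictionary[i], float(sims[i])) for i in best]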