Given a word to topic distribution, how do I calculate perplexity? - text-mining

I have a black-box LDA model which outputs a word-to-topic distribution like the following:
word_1 topic_b:freq_of_word_1 topic_a:freq_of_word_1 ....
word_2 topic_a:freq_of_word_2 topic_d:freq_of_word_2 ....
.
.
How do I proceed to calculate perplexity? Looking at the Gensim LDA code, I found that it uses variational lower-bound methods, but it has access to the internal topic parameters to compute them. With my black-box model, all I have access to are these dumps and the initial parameters alpha and beta.
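For reference, perplexity over a held-out corpus is commonly defined as the exponential of the negative average per-word log-likelihood. Below is a minimal sketch of that computation; it assumes (both are assumptions, since the black-box model does not expose them directly) that you can normalize the dumped frequencies into per-topic word distributions and obtain some estimate of the per-document topic weights:

import numpy as np

def perplexity(docs, p_w_given_z, theta):
    # docs: list of documents, each a list of word ids
    # p_w_given_z: (num_topics, vocab_size) array, rows sum to 1
    #              (e.g. the dumped frequencies, normalized per topic)
    # theta: (num_docs, num_topics) array, rows sum to 1
    log_likelihood, num_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            # p(w | d) = sum_z p(z | d) * p(w | z)
            log_likelihood += np.log(theta[d] @ p_w_given_z[:, w])
            num_words += 1
    return np.exp(-log_likelihood / num_words)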

Related

Learning parameters of each simulated device

Does tensorflow-federated support assigning different hyper-parameters (like batch size or learning rate) to different simulated devices?
Currently, you may find this a bit unnatural, but yes, such a thing is possible.
One approach to doing this that is supported today is to have each client take its local learning rate as a top-level parameter and use it in training. A dummy example (eliding the model parameter in the computations below) would be something along the lines of:
@tff.tf_computation(tff.SequenceType(...), tf.float32)
def train_with_learning_rate(ds, lr):
  # run training with `tf.data.Dataset` ds and learning rate lr
  ...

@tff.federated_computation(
    tff.FederatedType([tff.SequenceType(...), tf.float32], tff.CLIENTS))
def run_one_round(datasets_and_lrs):
  return tff.federated_mean(
      tff.federated_map(train_with_learning_rate, datasets_and_lrs))
Invoking the federated computation here with a list of tuples, where the first element of each tuple is the client's data and the second element is that client's learning rate, would give what you want.
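A hypothetical invocation might look as follows (make_client_dataset is a placeholder for however you construct the tf.data.Dataset matching the declared tff.SequenceType):

client_data = [
    (make_client_dataset(0), 0.1),   # client 0 trains with lr 0.1
    (make_client_dataset(1), 0.01),  # client 1 trains with lr 0.01
]
averaged_result = run_one_round(client_data)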
Such a thing requires writing custom federated computations, and in particular likely defining your own IterativeProcess. A similar iterative process definition was recently open sourced here; the link goes to the relevant local client function definition, which allows for learning rate scheduling on the clients by taking an extra integer parameter representing the round number. It is likely a good place to look.

Binomial And Multinomial Classification in ML

I got a project in which my task is to build a network intrusion detection system to detect anomalies and attacks in the network.
There are two problems.
1. Binomial Classification: Activity is normal or attack
2. Multinomial classification: Activity is normal or DOS or PROBE or R2L or U2R
But before that, I'm confused about the terms Binomial/Multinomial classification.
Help me understand them and, if possible, please share some short code, which would help me even more.
I tried to search for these terms on Google/YouTube but couldn't find a proper definition with some code.
So far my code only does the following:
clean/transform/outlier detection/missing value treatment
model selection/accuracy test
So my next step is the Binomial/Multinomial classification itself.
Thanks for the help...
First, do not hesitate to post on https://datascience.stackexchange.com/ for this kind of question, which is more data science than a coding issue.
Second, the answer is as simple as this:
Binary (not Binomial) classification means only 2 targets to find.
=> In your case: Normal vs Attack.
Multiclass / Multinomial classification means more than 2 targets to find (multilabel is slightly different: several labels can apply to the same sample at once).
=> In your case: Normal, DOS, PROBE, R2L & U2R.
You can find examples at https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
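Since you asked for a short code example, here is a minimal sketch with scikit-learn; the data is a random stand-in for your cleaned features, so only the shape of the workflow matters:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Dummy stand-in for your cleaned feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
# Binary target ("normal" vs "attack"); for the multiclass case use
# labels like "normal", "DOS", "PROBE", "R2L", "U2R" instead.
y = rng.choice(["normal", "attack"], size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=1000)  # same estimator handles both cases
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))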

Machine Learning Algorithm for multiple output features

I am looking for a machine learning algorithm where I have multiple variables as output. It is something like a vector [A, ..., X], each element of which can have the value 0 or 1. I have data to train the model with the required input features.
Which algorithm should I use for such a case? With my limited knowledge I know that multilabel classification can solve the problem where one output variable can take multiple values, like color. But this case is multiple output variables each taking 0 or 1. Please let me know.
It is difficult to say which algorithm is best without more information.
A perceptron, i.e. a neural network with an output layer of multiple binary (threshold-function) neurons, could be a good candidate.
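A minimal sketch of that idea with Keras, on random stand-in data (the layer sizes are arbitrary; one sigmoid unit per output variable plays the role of the threshold neurons):

import numpy as np
import tensorflow as tf

# Dummy stand-in data: 8 input features, 5 binary output labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype("float32")
y = rng.integers(0, 2, size=(1000, 5)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    # One sigmoid unit per output variable; each predicts its own 0/1 label.
    tf.keras.layers.Dense(5, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=5, batch_size=32)

# Threshold the sigmoid outputs to get the 0/1 vector.
print((model.predict(X[:3]) > 0.5).astype(int))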

Tensorflow: pattern training and generation

Imagine I have hundreds of rectangular patterns that look like the following:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
Say the only variables across the different patterns are the dimensions (width by height in characters) and the value at each cell within the rectangle, with possible characters _ y x z 0. So another pattern might look like this:
yx0x_x
xz_x0_
_yy0x_
zyy0__
and another like this:
xx0z00yy_z0x000
zzx_0000_xzzyxx
_yxy0y__yx0yy_z
_xz0z__0_y_xz0z
y__x0_0_y__x000
xz_x0_z0z__0_x0
These simplified examples were randomly generated, but imagine there is a deeper structure and relation between dimensions and layout of characters.
I want to train on this dataset in an unsupervised fashion (no labels) in order to generate similar output. Assuming I have created my dataset appropriately with tf.data.Dataset and categorical identity columns:
what is a good general purpose model for unsupervised training (no labels)?
is there a Tensorflow premade estimator that would represent such a model well enough?
once I've trained the model, what is a general approach to using it to generate patterns based on what it has learned? I have in mind Google Magenta, which can be trained on a dataset of musical melodies in order to generate similar ones from a kind of seed/primer melody
I'm not looking for a full implementation (that's the fun part!), just some suggested tutorials and next steps to follow. Thanks!

how to find similar words for a certain word in tensorflow_word2vec like using model.most_similar in gensim?

I've been using TensorFlow to build a word2vec model; reference here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L118
My question is: how can I find the top n similar words for a given word? I know that in gensim I can save and load a word2vec model and then use model.most_similar to find what I want, but how do I do this in TensorFlow? And moreover, is there any way to save the model in TensorFlow, since all I get is an embedding vector? Is that right?
I think as long as you have computed the weight vector for each token, you can manipulate all the tokens in the vector space. You can simply calculate the cosine similarity between the vectors and then sort by score. For reference, you can look at the source code of the most_similar method implemented in gensim's word2vec model. Hope this helps.
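A minimal sketch of that, assuming the `final_embeddings` matrix (already L2-normalized in the linked tutorial) and the tutorial's `dictionary` / `reverse_dictionary` word-id mappings:

import numpy as np

def most_similar(word, final_embeddings, dictionary, reverse_dictionary, top_n=10):
    # Rows of final_embeddings are assumed L2-normalized, as in the
    # tutorial, so a dot product equals the cosine similarity.
    vec = final_embeddings[dictionary[word]]
    sims = final_embeddings @ vec
    nearest = np.argsort(-sims)[1:top_n + 1]  # index 0 is the word itself
    return [(reverse_dictionary[i], float(sims[i])) for i in nearest]

# The embedding matrix can be persisted with plain numpy:
# np.save("embeddings.npy", final_embeddings)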