How to find good observations for reinforcement learning? - sequence

I am starting my study of RL and was wondering how one would approach observation features that are not able to fully represent the (hidden) state.
Is there a systematic approach, or are there guidelines, for what the feature vector should look like? Discrete vs. continuous, dimensionality, Markov properties, embedding quality...?
I would like to process machine-operation data streams; I have a lot of direct measurements as well as many high-dimensional feature vectors (also streamed).
Thank you very much for your input.

Related

Conditional GANs to causal GANs?

Can we use conditional GANs to show causality in our data?
I tried a conditional GAN and want to know how I can convert it into a causal one.
Finding causal relationships is very difficult and depends on both the model and the data.
Generally speaking, there is no quick fix that can just make any complex ML model into a causal one (this applies to GANs as much as to anything else). It all depends on what data you have and what causal relationships you hope to find or estimate.
For example, if you have data with a lot of interventions (e.g. data collected through many controlled experiments), you may be able to leverage the difference in outcomes between the experiments to estimate causal effects. If you have only an observational dataset, as is the standard for many vanilla machine learning tasks, finding causal relationships is extremely difficult.
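The simplest instance of leveraging that difference in outcomes is estimating an average treatment effect from a single randomized experiment. A minimal sketch, with entirely made-up numbers and array names:

```python
import numpy as np

# Hypothetical outcomes from one randomized experiment: y_treated /
# y_control hold the outcome for units that did / did not receive the
# intervention (values and names are invented for illustration).
y_treated = np.array([3.1, 2.8, 3.6, 3.3, 2.9])
y_control = np.array([2.4, 2.6, 2.2, 2.7, 2.5])

# Under random assignment, the difference in mean outcomes is an
# unbiased estimate of the average treatment effect (ATE).
ate = y_treated.mean() - y_control.mean()
print(f"estimated ATE: {ate:.2f}")
```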

Is topic coherence (gensim CoherenceModel) calculated based exclusively on my corpus or external data as well?

I'm topic modeling a corpus of English 20th century correspondence using LDA and I've been using topic coherence (as well as silhouette scores) to evaluate my topics. I use gensim's CoherenceModel with c_v coherence and the highest I've ever gotten was a 0.35 score in all the models I've tested, even in the topics that make the most sense to me in qualitative evaluation, even after extensive pre-processing and hyperparameter comparison.
So I basically accepted that that's the best I'd get, but in order to write about it now, I've been reading up on topic coherence and I've understood that it's a pipeline and that it models human judgement. One thing I can't seem to find clear info on, though: is it based exclusively on calculations made on my corpus, or is it based on some external data as well, like being trained on external corpora that might have nothing to do with my domain? Should I use u_mass instead?
Yes; apart from u_mass, they all use external reference datasets. However, that may not be a bad thing, as those reference datasets provide richer information.
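For reference, a minimal sketch of computing both measures with gensim (the texts here are tiny placeholders for your own pre-processed correspondence):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Tokenised documents -- stand-ins for the pre-processed corpus.
texts = [
    ["letter", "arrived", "london", "letter", "friend"],
    ["news", "war", "letter", "london", "news"],
    ["friend", "arrived", "war", "news", "letter"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# c_v is computed from the tokenised texts.
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence="c_v").get_coherence()

# u_mass is computed from the bag-of-words corpus.
umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                       coherence="u_mass").get_coherence()
print(cv, umass)
```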

Encoding invariance for deep neural network

I have a set of data in the form of 2D matrices (like greyscale pictures), and I use a CNN as the classifier.
I would like to know whether there is any study or experience on the accuracy impact of changing from the traditional encoding.
I suppose there is an impact; the question is rather which transformations of the encoding leave the accuracy invariant and which ones degrade it.
To clarify, this mainly concerns the quantization of the raw data into input data.
EDIT:
Quantizing the raw data into input data is already a pre-processing step, adding or removing some (even minor) features. The impact of this quantization process on accuracy in real DNN computations does not seem very clear.
Maybe some research is available.
I'm not aware of any research specifically dealing with quantization of input data, but you may want to check out some related work on quantization of CNN parameters: http://arxiv.org/pdf/1512.06473v2.pdf. Depending on what your end goal is, the "Q-CNN" approach may be useful for you.
My own experience with using various quantizations of the input data for CNNs has been that there's a heavy dependency between the degree of quantization and the model itself. For example, I've played around with using various interpolation methods to reduce image sizes and reducing the color palette size, and in the end, I discovered that each variant required a different tuning of hyper-parameters to achieve optimal results. Generally, I found that minor quantization of data had a negligible impact, but there was a knee in the curve where throwing away additional information dramatically impacted the achievable accuracy. Unfortunately, I'm not aware of any way to determine what degree of quantization will be optimal without experimentation, and even deciding what's optimal involves a trade-off between efficiency and accuracy which doesn't necessarily have a one-size-fits-all answer.
On a theoretical note, keep in mind that CNNs need to be able to find useful, spatially-local features, so it's probably reasonable to assume that any encoding that disrupts the basic "structure" of the input would have a significantly detrimental effect on the accuracy achievable.
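To make that concrete, here is a small hypothetical sketch of one such input quantization -- reducing the bit depth of 8-bit greyscale images -- where the number of bits is the knob you would sweep experimentally while re-tuning the model at each setting:

```python
import numpy as np

def quantize_grey(images, bits):
    """Reduce 8-bit greyscale images to `bits` bits per pixel.

    The spatial structure is preserved; only fine intensity detail is
    discarded, which is the kind of "minor" quantization discussed above.
    """
    levels = 2 ** bits
    step = 256 // levels
    return (images // step) * step + step // 2  # snap pixels to bin centres

# Hypothetical sweep: retrain / re-tune the CNN at each setting and look
# for the knee where accuracy starts to drop sharply.
images = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)
for bits in (8, 6, 4, 2, 1):
    q = quantize_grey(images, bits)
    print(bits, "bits ->", len(np.unique(q)), "grey levels in this batch")
```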
In usual practice -- a discrete classification task in a classic implementation -- it will have no effect. However, the critical point is in the initial computations for back-propagation. The classic definition depends only on strict equality of the predicted and ground-truth classes: a simple right/wrong evaluation. Changing the class coding has no effect on whether or not a prediction is equal to the training class.
However, this function can be altered. If you change the code to have something other than a right/wrong scoring, something that depends on the encoding choice, then encoding changes can most definitely have an effect. For instance, if you're rating movies on a 1-5 scale, you likely want 1 vs 5 to contribute a higher loss than 4 vs 5.
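A minimal sketch of that idea (the weights and numbers are illustrative, not taken from any particular framework): a flat right/wrong score versus a score that grows with the distance between the predicted and true rating.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # Classic right/wrong scoring: relabelling the classes changes nothing.
    return float(np.mean(y_true != y_pred))

def ordinal_loss(y_true, y_pred, max_dist=4):
    # Encoding-aware scoring for 1-5 ratings: predicting 1 when the truth
    # is 5 costs more than predicting 4 when the truth is 5.
    return float(np.mean(np.abs(y_true - y_pred) / max_dist))

y_true = np.array([5, 5, 3, 1])
y_pred = np.array([4, 1, 3, 2])
print(zero_one_loss(y_true, y_pred))  # 0.75 -- every miss weighted equally
print(ordinal_loss(y_true, y_pred))   # 0.375 -- near misses cost less
```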
Does this reasonably deal with your concerns?
I see now. My answer above is useful ... but not for what you're asking. I had my eye on the classification encoding; you're wondering about the input.
Please note that asking for off-site resources is a classic off-topic question category. I am unaware of any such research -- for what little that is worth.
Obviously, there should be some effect, as you're altering the input data. The effect would be dependent on the particular quantization transformation, as well as the individual application.
I do have some limited-scope observations from general big-data analytics.
In our typical environment, where the data were scattered with some inherent organization within their natural space (F dimensions, where F is the number of features), we often use two simple quantization steps: (1) Scale all feature values to a convenient integer range, such as 0-100; (2) Identify natural micro-clusters, and represent all clustered values (typically no more than 1% of the input) by the cluster's centroid.
This speeds up analytic processing somewhat. Given the fine-grained clustering, it has little effect on the classification output. In fact, it sometimes improves the accuracy minutely, as the clustering provides wider gaps among the data points.
Take with a grain of salt, as this is not the main thrust of our efforts.
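For what it's worth, a rough sketch of those two steps (the ranges and the tolerance are made up; this is an illustration of the idea, not our production pipeline):

```python
import numpy as np

def scale_to_int_range(X, lo=0, hi=100):
    """Step 1: min-max scale every feature to a convenient integer range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    spans = np.where(maxs > mins, maxs - mins, 1.0)
    return np.rint((X - mins) / spans * (hi - lo) + lo).astype(int)

def merge_micro_clusters(col, tol=1):
    """Step 2, per feature: snap values lying within `tol` of their
    neighbours to the mean of that little group (its centroid), which
    widens the gaps between genuinely distinct values."""
    col = col.astype(float).copy()
    order = np.argsort(col)
    start = 0
    for i in range(1, len(order) + 1):
        if i == len(order) or col[order[i]] - col[order[i - 1]] > tol:
            group = order[start:i]
            col[group] = col[group].mean()
            start = i
    return col

X = np.array([[0.12, 3.4], [0.13, 9.9], [0.50, 3.5], [0.90, 7.0]])
Xq = scale_to_int_range(X)
Xq = np.column_stack([merge_micro_clusters(Xq[:, j]) for j in range(Xq.shape[1])])
print(Xq)
```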

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
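Mahout itself is Java, but the idea is language-agnostic. Here is a hedged sketch in Python/scikit-learn (used only as a stand-in, not the Mahout API) of injecting extra attributes such as URL tokens next to word and bigram counts; the pages, URLs and labels are invented for illustration:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: page text, the URL, and a broad category.
pages = ["breaking news world politics election", "latest football scores results"]
urls = ["http://cnn.com/world", "http://espn.com/scores"]
labels = ["news", "sport"]

# Word and bigram counts from the page body.
body_vec = CountVectorizer(ngram_range=(1, 2))
X_body = body_vec.fit_transform(pages)

# Extra attributes: treat URL components as just more tokens.
url_vec = CountVectorizer(token_pattern=r"[a-z]+")
X_url = url_vec.fit_transform(urls)

# The classifier only ever sees one combined feature vector per page.
X = hstack([X_body, X_url])
clf = MultinomialNB().fit(X, labels)
print(clf.predict(hstack([body_vec.transform(["world news update"]),
                          url_vec.transform(["http://bbc.co.uk/news"])])))
```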

Hidden Markov models with multiple time-independent streams

I'm trying to figure out if there is a good way to merge two HMMs into one, when the underlying states are the same, but the observations aren't temporally linked.
I have two independent observation streams describing the same hidden state space. The underlying order of each observation stream remains the same, but they are not emitted at the same time.
For instance, say I have audio recordings of two separate speakers reading aloud the same passage of text, where the hidden state space becomes the letters in the text, while the stream of phonemes from each audio recording comprises the observation space. Each speaker records the audio separately and uses a different cadence when reading.
I can clearly make a prediction of the text using each speaker independently, and try and reconcile the results after the fact... but I sense that combining the observation streams into a single HMM may produce a better result.
Does anyone know a good way to reconcile this?
Merging the states would require aligning these streams first... i.e. some kind of log-likelihood optimization.
But it's possible to use statistics from multiple streams to predict the "observations" -- modern data compressors basically do just that.
E.g. see http://www.mattmahoney.net/dc/dce.html#Section_432
I am not sure if there are methods to merge two HMMs after they have each been fitted to different observation sequences.
But there exists an algorithm to train one hidden Markov model on multiple independent observation sequences.
It is covered, for example, in the paper
"A tutorial on hidden Markov models and selected applications in speech recognition"
by Rabiner.
Unfortunately, I haven't yet found an implementation of this algorithm.
Here is my corresponding question on stackexchange: https://stats.stackexchange.com/questions/53256/two-sequences-one-hmm
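In case a sketch helps: one way such multi-sequence training is commonly exposed is to concatenate the sequences and pass per-sequence lengths, so Baum-Welch accumulates statistics over all streams without linking them in time. Below is a minimal illustration assuming hmmlearn's current API (whether this is exactly the variant from the Rabiner tutorial is my assumption, and the data are made up):

```python
import numpy as np
from hmmlearn import hmm

# Two independent integer-coded phoneme streams believed to share the
# same hidden state space; the values are invented for illustration.
stream_a = np.array([[0], [1], [1], [2], [3]])
stream_b = np.array([[0], [1], [2], [2], [3], [3]])

# hmmlearn's convention: concatenate the sequences and pass their lengths.
# (CategoricalHMM in recent hmmlearn releases; older versions exposed the
# same discrete-observation model as MultinomialHMM.)
X = np.concatenate([stream_a, stream_b])
lengths = [len(stream_a), len(stream_b)]

model = hmm.CategoricalHMM(n_components=4, n_iter=100, random_state=0)
model.fit(X, lengths)
print(model.transmat_.round(2))
```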