Graphlab - How to set the number of trees parameter in random forest classifier - graphlab

I am training a random forest classifier :
model = gl.random_forest_classifier.create(train, target = 'label',row_subsample = 0.5, column_subsample = 0.75, validation_set=validation, metric="auc", max_iterations=10, max_depth = 15)
How can I set the number of trees parameter? It's a binary classification problem and the documentation says:
max_iterations : The maximum number of iterations to perform. For multi-class classification with K classes, each iteration will create K-1 trees.

"The num_trees keyword argument is deprecated. Please use the max_iterations argument instead. Any value provided for num_trees will be used in place of max_iterations."
So, adjust the max_iterations also means adjust the the number of trees parameter.

Related

What is an effective way to pad a variable length dataset for batching in Tensorflow that does not have have exact

I am trying to integrate the Dataset API into my input pipeline. Before this integration, the program used tf.train.batch_join(), which had dynamic padding enabled. Hence, this would batch elements and pad them according to the largest one in the mini-batch.
image, width, label, length, text, filename = tf.train.batch_join(
data_tuples,
batch_size=batch_size,
capacity=queue_capacity,
allow_smaller_final_batch=final_batch,
dynamic_pad=True)
For dataset, however, I was unable to find the exact alternative to this. I cannot use padded batch, since the dimensions of the images does not have a set threshold. The image width could be anything. My partner and I were able to come up with a work around for this using tf.contrib.data.bucket_by_sequence(). Here is an excerpt:
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length
(element_length_func=_element_length_fn,
bucket_batch_sizes=np.full(len([0]) + 1, batch_size),
bucket_boundaries=[0]))
What this does is basically dumps all the elements into the overflow bucket since the boundary is set to 0. Then, it batches it from that bucket since bucketing pads the elements according to the largest one.
Is there a better way to achieve this functionality?
I meet exactly the same problem. Now I know how to solve this. If your input_data only has one dimension that is of variable length, try to use tf.contrib.data.bucket_by_sequence_length to dataset.apply() function, make bucket_batch_sizes = [batch_size] * (len(buckets) + 1). And there is another way to do so just as #mrry has said in comments.
iterator = dataset.make_one_shot_iterator()
item = iterator.get_next()
padded_shapes = []
for i in item:
padded_shapes.append(i.get_shape())
padded_shapes = tf.contrib.framework.nest.pack_sequence_as(item, padded_shapes)
dataset = dataset.padded_batch(batch_size, padded_shapes)
If one dimension in the shapes of a tensor is None or -1, then padded_batch will pad the tensor on that dimension to max length of the batch.
My training data has two features of varibale length, And this method works fine.

pymc python change point detection for small probabilities. ZeroProbability Error

I am trying to use pymc to find a change point in a time-series. The value I am looking at over time is probability to "convert" which is very small, 0.009 on average with a range of 0.001-0.016.
I give the two probabilities a uniform distribution as a prior between zero and the max observation.
alpha = df.cnvrs.max() # Set upper uniform
center_1_c = pm.Uniform("center_1_c", 0, alpha)
center_2_c = pm.Uniform("center_2_c", 0, alpha)
day_c = pm.DiscreteUniform("day_c", lower=1, upper=n_days)
#pm.deterministic
def lambda_(day_c=day_c, center_1_c=center_1_c, center_2_c=center_2_c):
out = np.zeros(n_days)
out[:day_c] = center_1_c
out[day_c:] = center_2_c
return out
observation = pm.Uniform("obs", lambda_, value=df.cnvrs.values, observed=True)
When I run this code I get:
ZeroProbability: Stochastic obs's value is outside its support,
or it forbids its parents' current values.
I'm pretty new to pymc so not sure if I'm missing something obvious. My guess is I might not have appropriate distributions for modelling small probabilities.
It's impossible to tell where you've introduced this bug—and programming is off-topic here, in any case—without more of your output. But there is a statistical issue here: You've somehow constructed a model that cannot produce either the observed variables or the current sample of latent ones.
To give a simple example, say you have a dataset with negative values, and you've assumed it to be gamma distributed; this will produce an error, because the data has zero probability under a gamma. Similarly, an error will be thrown if an impossible value is sampled during an MCMC chain.

Tensorflow: What exact formula is applied in `tf.nn.sparse_softmax_cross_entropy_with_logits`?

I tried to manually recompute the outputs of this function so I created a minimal example:
logits = tf.pack(np.array([[[[0,1,2]]]],dtype=np.float32)) # img of shape (1, 1, 1, 3)
labels = tf.pack(np.array([[[1]]],dtype=np.int32)) # gt of shape (1, 1, 1)
softmaxCrossEntropie = tf.nn.sparse_softmax_cross_entropy_with_logits(logits,labels)
softmaxCrossEntropie.eval() # --> output is [1.41]
Now according to my own calculation I only get [1.23]
When manually calculating, I'm simply applying softmax
and cross-entropy:
where q(x) = sigma(x_j) or (1-sigma(x_j)) depending whether j is the correct ground truth class or not and p(x) = labels which are then one-hot-encoded
I'm not sure where the difference might originate from. I cannot really imagine that some epsilon causes such a big difference. Does someone know where I can lookup, which exact formula is used by tensorflow?
Is the source code of that exact part available?
I could only find nn_ops.py, but it only uses another function called gen_nn_ops._sparse_softmax_cross_entropy_with_logits which I couldn't find on github...
Well, usually p(x) in cross-entropy equation is true distribution, while q(x) is the distribution obtained from softmax. So, if p(x) is one-hot (and this is so, otherwise sparse cross-entropy could not be applied), cross entropy is just negative log for probability of true category.
In your example, softmax(logits) is a vector with values [0.09003057, 0.24472847, 0.66524096], so the loss is -log(0.24472847) = 1.4076059 which is exactly what you got as output.

word2vec - get nearest words

Reading the tensorflow word2vec model output how can I output the words related to a specific word ?
Reading the src : https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py can view how the image is plotted.
But is there a data structure (e.g dictionary) created as part of training the model that allows to access nearest n words closest to given word ?
For example if word2vec generated image :
image src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html
In this image the words 'to , he , it' are contained in same cluster, is there a function which takes as input 'to' and outputs 'he , it' (in this case n=2) ?
This approach apply to word2vec in general. If you can save the word2vec in text/binary file like google/GloVe word vector. Then what you need is just the gensim.
To install:
Via github
Python code:
from gensim.models import Word2Vec
gmodel=Word2Vec.load_word2vec_format(fname)
ms=gmodel.most_similar('good',10)
for x in ms:
print x[0],x[1]
However this will search all the words to give the results, there are approximate nearest neighbor (ANN) which will give you the result faster but with a trade off in accuracy.
In the latest gensim, annoy is used to perform the ANN, see this notebooks for more information.
Flann is another library for Approximate Nearest Neighbors.
I will assume that you don't want to use gensim, and would prefer to stick with tensorflow. In that case, I'll offer two options
Option 1 - Tensorboard:
If you are just trying to do this from an exploratory standpoint, I would suggest using Tensorboard's embedding visualizer to search for the closest embeddings. It provides a cool interface and you can use both cosine and euclidian distances with a set number of neighbors.
Link to Tensorflow documentation
Option 2 - Direct Calculation
Within the word2vec_basic.py file, there is an example of how they are calculating closest words, and you could go ahead and use that if you mess with the function a little bit. The following is found in the graph itself:
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
normalized_embeddings, valid_dataset)
similarity = tf.matmul(
valid_embeddings, normalized_embeddings, transpose_b=True)
Then, during training (every 10000 steps) they run this next bit of code (while the session is active). When they call similarity.eval() it is getting the literal numpy array evaluation of the similarity tensor in the graph.
# Note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
sim = similarity.eval()
for i in xrange(valid_size):
valid_word = reverse_dictionary[valid_examples[i]]
top_k = 8 # number of nearest neighbors
nearest = (-sim[i, :]).argsort()[1:top_k+1]
log_str = "Nearest to %s:" % valid_word
for k in xrange(top_k):
close_word = reverse_dictionary[nearest[k]]
log_str = "%s %s," % (log_str, close_word)
print(log_str)
If you want to adapt this for yourself, you will have to do some finessing with changing reverse_dictionary[valid_examples[i]] to be the word/words idxs that you want to get the k-closest words for.
Get gensim and use similar_by_word method on gensim.models.Word2Vec model.
similar_by_word takes 3 parameters,
The input word
n - for top n similar words (optional, default=10)
restrict_vocab (optional, default=None)
Example
import gensim, nltk
class FileToSent(object):
"""A class to load a text file efficiently """
def __init__(self, filename):
self.filename = filename
# To remove stop words (optional)
self.stop = set(nltk.corpus.stopwords.words('english'))
def __iter__(self):
for line in open(self.filename, 'r'):
ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
yield ll
Then depending on your input sentences (sentence_file.txt),
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print model.similar_by_word('hack', 2) # Get two most similar words to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (Output specific to my dataset)

Tensorflow Linear Regression: Getting values for Adjusted R Square, Coefficients, P-value

There are few key parameters associated with Linear Regression e.g. Adjusted R Square, Coefficients, P-value, R square, Multiple R etc. While using google Tensorflow API to implement Linear Regression how are these parameter mapped? Is there any way we can get the value of these parameters after/during model execution
From my experience, if you want to have these values while your model runs then you have to hand code them using tensorflow functions. If you want them after the model has run you can use scipy or other implementations. Below are some examples of how you might go about coding R^2, MAPE, RMSE...
total_error = tf.reduce_sum(tf.square(tf.sub(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.sub(y, prediction)))
R_squared = tf.sub(tf.div(total_error, unexplained_error),1.0)
R = tf.mul(tf.sign(R_squared),tf.sqrt(tf.abs(unexplained_error)))
MAPE = tf.reduce_mean(tf.abs(tf.div(tf.sub(y, prediction), y)))
RMSE = tf.sqrt(tf.reduce_mean(tf.square(tf.sub(y, prediction))))
I believe the formula for R2 should be the following. Note that it would go negative when the network is so bad that it does a worse job than the mere average as a predictor:
total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, pred)))
R_squared = tf.subtract(1.0, tf.divide(unexplained_error, total_error))
Adjusted_R_squared = 1 - [ (1-R_squared)*(n-1)/(n-k-1) ]
whereas n is the number of observations and k is the number of features.
You should not use a formula for R Squared. This exists in Tensorflow Addons. You will only need to extend it to Adjusted R Squared.
I would strongly recommend against using a recipe to calculate r-squared itself! The examples I've found do not produce consistent results, especially with just one target variable. This gave me enormous headaches!
The correct thing to do is to use tensorflow_addons.metrics.RQsquare(). Tensorflow Add Ons is on PyPi here and the documentation is a part of Tensorflow here. All you have to do is set y_shape to the shape of your output, often it is (1,) for a single output variable.
Then you can use what RSquare() returns in your own metric that handled the adjustments.