What does the cross-validation rate in grid.py indicate? - libsvm

I know what cross-validation is and what grid.py does.
I know that the parameters c and g are supposed to be used during training, but I have no idea what the third value, the rate, is.
I get a cross-validation rate of 95.32%. What does this signify?
Is it good or bad?

That cross-validation rate is the percentage of samples that were correctly classified during the cross-validation step (with the best c and g parameters found), so a 95% success rate is a great result. The parameters of grid.py are the following:
-log2c begin,end,step: range of the c (cost) regularization parameter, on a log2 scale
-log2g begin,end,step: range of gamma in the kernel function exp(-gamma*|u-v|^2), on a log2 scale
-v n: n-fold cross validation
-svmtrain pathname: set svm executable path and name
-gnuplot pathname: set gnuplot executable path and name
-out pathname: set output file path and name
-png pathname: set graphic output file path and name (default dataset.png)
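To make the meaning of the rate concrete, here is a minimal sketch in plain Python of how an n-fold cross-validation rate can be computed. It is not LIBSVM's implementation: the contiguous folds, the toy one-dimensional data, and the threshold classifier (`train_threshold`, `predict_threshold`) are all assumptions made for illustration, whereas LIBSVM trains an SVM on each split and assigns samples to folds randomly.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, nearly equal folds."""
    folds, start = [], 0
    for f in range(n_folds):
        size = n_samples // n_folds + (1 if f < n_samples % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_rate(xs, ys, n_folds, train, predict):
    """Percentage of samples classified correctly while they were held out."""
    correct = 0
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = train(train_x, train_y)
        correct += sum(predict(model, xs[i]) == ys[i] for i in held_out)
    return 100.0 * correct / len(xs)


# Hypothetical stand-in for an SVM: a threshold halfway between class means.
def train_threshold(train_x, train_y):
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else -1


# Made-up, well-separated toy data (not from the question).
xs = [0.0, 1.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.2, 0.8]
ys = [-1, 1, -1, 1, -1, 1, -1, 1, -1, 1]
rate = cross_validation_rate(xs, ys, 5, train_threshold, predict_threshold)
print("Cross Validation rate = {:.2f}%".format(rate))
```

Every sample is predicted exactly once, by a model that never saw it during training, which is why the rate is a more honest estimate than accuracy on the training set itself.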

Rjags jags model compiling error when using for loop

I am using the rjags package to run MCMC. I have a binomial dataset and I tried to use for loops to estimate parameters for multiple datasets from different authors within one combined dataset.
I specified the JAGS model and uninformative priors for each parameter whose posterior I want, but I keep getting an error message (shown after the code below):
jcode <- "model{
  for (i in 1:3){
    n.pos[i] ~ dbinom(seropos_est[i],N[i]) #fit to binomial data
    seropos_est[i] = 1-exp(-lambdaS1*age[i]) #catalytic model
  }
  for (i in 4:7) {
    n.pos[i] ~ dbinom(seropos_est[i],N[i]) #fit to binomial data
    seropos_est[i] = 1-exp(-lambdaS2*age[i]) #catalytic model
  }
  for (i in 8:11) {
    n.pos[i] ~ dbinom(seropos_est[i],N[i]) #fit to binomial data
    seropos_est[i] = 1-exp(-lambdaS3*age[i]) #catalytic model
  }
  #priors
  lambdaS1 ~ dnorm(0,1) #uninformative prior
  lambdaS2 ~ dnorm(0,1) #uninformative prior
  lambdaS3 ~ dnorm(0,1) #uninformative prior
}"
Parameter vector:
paramVector <- c("lambdaS1", "lambdaS2", "lambdaS3")
mcmc.length = 50000
jdat = list(n.pos = df_chik$N.pos,
            N = df_chik$N,
            age = df_chik$agemid)
jmod = jags.model(textConnection(jcode), data=jdat, n.chains=4, n.adapt=15000)
jpos = coda.samples(jmod, paramVector, n.iter=mcmc.length)
Error message:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 11
Unobserved stochastic nodes: 3
Total graph size: 74
Initializing model
Deleting model
This is the message that I keep getting. I would appreciate it if anyone could help me out with this!
The text you show under “Error message” i.e. this text:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 11
Unobserved stochastic nodes: 3
Total graph size: 74
Initializing model
Deleting model
... is not an error, but the expected output of rjags. But I suspect that you have not copied the real error message, which is probably something along the lines of "invalid parent values for node n.pos[1]". The reason for that is that for seropos_est[] you have relationships of the form:
seropos_est[i] = 1-exp(-lambdaS1*age[i])
Where lambdaS1 is an unconstrained variable. Therefore, the result of exp(-lambdaS1*age[i]) can be above 1, which means that seropos_est[i] can be negative, which is invalid for a probability parameter. In fact, given the normal prior for lambdaS1 the model will initialise that variable with a value of zero, meaning that seropos_est[i] is initialised to zero, which is also invalid if any of your n.pos are greater than zero. You therefore need to re-specify your model to constrain seropos_est to valid parameter space, possibly by changing the prior for lambdaS1 etc (presuming that age is positive).
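The argument above can be checked numerically. The following is a small Python sketch (not JAGS code) of the catalytic relationship; the function names and the example values of lambda and age are made up for illustration. It shows that a negative rate gives a negative "probability", a rate of zero gives exactly zero, and only a positive rate (with positive age) lands strictly between 0 and 1.

```python
import math


def seropositive_probability(lam, age):
    """Catalytic model from the question: 1 - exp(-lambda * age)."""
    return 1.0 - math.exp(-lam * age)


def usable_for_mixed_counts(p):
    """A binomial with some positives and some negatives needs 0 < p < 1."""
    return 0.0 < p < 1.0


# Hypothetical values: a negative, a zero and a positive rate at age 10.
for lam in (-0.5, 0.0, 0.1):
    p = seropositive_probability(lam, 10.0)
    print(lam, p, usable_for_mixed_counts(p))
```

This is why a prior restricted to positive values (for example a gamma prior, or a normal prior truncated at zero) is one plausible way to keep seropos_est in valid parameter space.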
Also, you have in your code:
lambdaS1 ~ dnorm(0,1) #uninformative prior
But this is certainly not uninformative. In any case, there is really no such thing as an 'uninformative prior' - all priors contain some information, by definition. The best you can do is a 'minimally informative prior' or 'non-informative prior', which is why this terminology is generally recommended rather than the misleading word 'uninformative'.
In future, it would help us to help you if your question contained a minimal reproducible example, so that we can run the model and see exactly what you see. In this case all that is really missing is access to the data.
Hope that helps,
Matt

TF object detection: return subset of inference payload

Problem
I'm working on training and deploying an instance segmentation model using TF's object detection API. I'm able to successfully train the model, package it into a TF Serving Docker image (latest tag as of Oct 2020), and process inference requests via the REST interface. However, the amount of data returned from an inference request is very large (hundreds of MB). This is a big problem when the inference request and processing don't happen on the same machine, because all that returned data has to go over the network.
Is there a way to trim down the number of outputs (either during model export or within the TF Serving image) to allow faster round-trip times during inference?
Details
I'm using the TF OD API (with TF2) to train a Mask R-CNN model, using a modified version of this config. I believe the full list of outputs is described in code here. The list of items I get during inference is also pasted below. For a model with 100 object proposals, that information is ~270 MB if I just write the returned inference as JSON to disk.
inference_payload['outputs'].keys()
dict_keys(['detection_masks', 'rpn_features_to_crop', 'detection_anchor_indices', 'refined_box_encodings', 'final_anchors', 'mask_predictions', 'detection_classes', 'num_detections', 'rpn_box_predictor_features', 'class_predictions_with_background', 'proposal_boxes', 'raw_detection_boxes', 'rpn_box_encodings', 'box_classifier_features', 'raw_detection_scores', 'proposal_boxes_normalized', 'detection_multiclass_scores', 'anchors', 'num_proposals', 'detection_boxes', 'image_shape', 'rpn_objectness_predictions_with_background', 'detection_scores'])
I already encode the images within my inference requests as base64, so the request payload is not too large when going over the network. It's just that the inference response is gigantic in comparison. I only need 4 or 5 of the items out of this response, so it'd be great to exclude the rest and avoid passing such a large package of bits over the network.
Things I've tried
I've tried setting the score_threshold to a higher value during the export (code example here) to reduce the number of outputs. However, this seems to just threshold the detection_scores. All the extraneous inference information is still returned.
I also tried just manually excluding some of these inference outputs by adding the names of keys to remove here. That also didn't seem to have any effect, and I'm worried this is a bad idea because some of those keys might be needed during scoring/evaluation.
I also searched here and on tensorflow/models repo, but I wasn't able to find anything.
I was able to find a hacky workaround. In the export process (here), some of the components of the prediction dict are deleted. I added additional items to the non_tensor_predictions list, which contains all keys that will get removed during the postprocess step. Augmenting this list cut down my inference outputs from ~200MB to ~12MB.
Full code for the if self._number_of_stages == 3 block:
if self._number_of_stages == 3:
    non_tensor_predictions = [
        k for k, v in prediction_dict.items() if not isinstance(v, tf.Tensor)]
    # Add additional keys to delete during postprocessing
    non_tensor_predictions = non_tensor_predictions + [
        'raw_detection_scores', 'detection_multiclass_scores', 'anchors',
        'rpn_objectness_predictions_with_background', 'detection_anchor_indices',
        'refined_box_encodings', 'class_predictions_with_background',
        'raw_detection_boxes', 'final_anchors', 'rpn_box_encodings',
        'box_classifier_features']
    for k in non_tensor_predictions:
        tf.logging.info('Removing {0} from prediction_dict'.format(k))
        prediction_dict.pop(k)
    return prediction_dict
I think there's a more "proper" way to deal with this using signature definitions during the creation of the TF Serving image, but this worked for a quick and dirty fix.
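As an alternative to listing the keys to delete, one could list the keys to keep. The sketch below shows that idea on a plain Python dict, with no TensorFlow involved; `keep_outputs`, `json_size_bytes` and the tiny `example_outputs` payload are hypothetical names and made-up data. To actually reduce network traffic the same filtering would have to happen on the serving side (for instance where the prediction dict is built during export); filtering on the client only reduces what gets stored.

```python
import json


def keep_outputs(outputs, wanted_keys):
    """Return a new dict holding only the wanted keys that are present."""
    return {k: v for k, v in outputs.items() if k in wanted_keys}


def json_size_bytes(payload):
    """Size of the payload when serialized as UTF-8 encoded JSON."""
    return len(json.dumps(payload).encode("utf-8"))


# Made-up, tiny stand-in for an inference response (values are not real).
example_outputs = {
    "detection_boxes": [[0.1, 0.2, 0.5, 0.6]],
    "detection_scores": [0.98],
    "detection_classes": [1.0],
    "num_detections": 1.0,
    "raw_detection_boxes": [[0.0, 0.0, 1.0, 1.0]] * 1000,
    "anchors": [[0.0, 0.0, 1.0, 1.0]] * 1000,
}
wanted = {"detection_boxes", "detection_scores",
          "detection_classes", "num_detections"}

trimmed = keep_outputs(example_outputs, wanted)
print(json_size_bytes(example_outputs), "->", json_size_bytes(trimmed))
```

An allow-list has the advantage that new outputs added by a later version of the model do not silently reappear in the response.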
I've run into the same problem. In the exporter_main_v2 code it is stated that the outputs should be:
and the following output nodes returned by the model.postprocess(..):
  * `num_detections`: Outputs float32 tensors of the form [batch]
      that specifies the number of valid boxes per image in the batch.
  * `detection_boxes`: Outputs float32 tensors of the form
      [batch, num_boxes, 4] containing detected boxes.
  * `detection_scores`: Outputs float32 tensors of the form
      [batch, num_boxes] containing class scores for the detections.
  * `detection_classes`: Outputs float32 tensors of the form
      [batch, num_boxes] containing classes for the detections.
I've submitted an issue on the TensorFlow object detection GitHub repo; I hope we will get feedback from the TensorFlow dev team.
The GitHub issue can be found here
If you are using the exporter_main_v2.py file to export your model, you can try this hack to solve the problem.
Just add the following code to the function _run_inference_on_images in the exporter_lib_v2.py file:
detections[classes_field] = (
    tf.cast(detections[classes_field], tf.float32) + label_id_offset)
############# START ##########
ignored_model_output_names = ["raw_detection_boxes", "raw_detection_scores"]
for key in ignored_model_output_names:
    if key in detections.keys():
        del detections[key]
############# END ##########
for key, val in detections.items():
    detections[key] = tf.cast(val, tf.float32)
With this change, the exported model will not output the keys listed in ignored_model_output_names.
Please let me know if this can solve your problem.
Another approach would be to alter the signatures of the saved model:
from os import path

import tensorflow as tf

model = tf.saved_model.load(path.join("models", "efficientdet_d7_coco17_tpu-32", "saved_model"))
infer = model.signatures["serving_default"]
outputs = infer.structured_outputs
# Drop the outputs that should not be part of the serving signature
for o in ["raw_detection_boxes", "raw_detection_scores"]:
    outputs.pop(o)
tf.saved_model.save(
    model,
    export_dir="export",
    signatures={"serving_default": infer},
    options=None
)

Increasing number of predictions in Inception for Tensorflow

I am going through the tutorial on retraining Inception's final layer, having installed TensorFlow for Ubuntu with regular CPU support. I successfully made the flower examples work. However, after switching to a new set of categories with ten sub-folders, I cannot make Inception produce ten scores for each input image rather than the default five. My current command line to run a test image looks like this, working with headers labelled 0-9:
bazel build tensorflow/examples/label_image:label_image && \
bazel-bin/tensorflow/examples/label_image/label_image \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--output_layer=final_result --input_layer=Mul \
--image=$HOME/Input/Example.jpg
This produces the following result:
5 (4): 0.642959
3 (2): 0.243444
9 (8): 0.0513504
4 (5): 0.0231318
6 (7): 0.0180509
However, I cannot find anything in the programs that Inception runs that configures how many output scores are produced, so that all ten of my categories have scores rather than just five. How do I change this?
I tried with 8 categories and was able to get results for all of them.
If your code has the line below:
top_k = predictions[0].argsort()[-5:][::-1]
change it to
top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
If the code contains predictions = np.squeeze(predictions), then use predictions instead of predictions[0].
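To see what the slice change does, here is a small sketch in plain Python that mirrors the `argsort()[-k:][::-1]` idiom without requiring NumPy. The helper `top_k_indices` and the score list are made up for illustration; they are not part of label_image.py.

```python
def top_k_indices(scores, k=None):
    """Indices of the k highest scores, best first (all of them if k is None).

    Mirrors scores.argsort()[-k:][::-1] from the NumPy version.
    """
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    if k is None:
        k = len(scores)
    return ascending[-k:][::-1] if k > 0 else []


# Hypothetical scores for four categories (not real model output).
scores = [0.10, 0.60, 0.05, 0.25]
print(top_k_indices(scores, 2))  # only the two best categories
print(top_k_indices(scores))     # every category, best first
```

Replacing the hard-coded 5 with the length of the prediction vector is what makes the script report every category.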
I ran this using the following command instead of bazel, which I found easier:
python /path_to_file/label_image.py /path_to_image/image.jpeg
First make sure that the graph was created when you ran retrain.py and that it is at the correct location (the default is inside /tmp/).

What is the model file in svm-train command-line syntax?

I have used grid.py in LIBSVM and found the best parameters for my dataset:
C -8.0 g -0.0625 CV- 63.82
Then I tried svm-train, but I don't understand the syntax of the svm-train command:
svm-train [options] training_set_file [model_file]
A model_file is needed, but grid.py only gave me a .out file. When I used that, it showed an error.
My question is:
Could you explain what the model file is, preferably using an example?
I am using LIBSVM on Debian (using the command-line).
You want command lines like:
svm-train -c 8.0 -g 0.0625 training.data svm.model
svm-predict testing.data svm.model predict.out
The model file (svm.model) is just a place to store the model parameters learned by svm-train so that they can later be used for prediction. The model file is created by svm-train (it is not produced by grid.py) and is the input to svm-predict. You can therefore give svm-train any file name you like, so long as you give the same name to svm-predict. I often call the file something like model-C8.0-g0.0625 so I can later tell what it is.
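The naming convention and the way the model file links the two commands can be sketched in Python as follows. The helpers `model_name` and `build_commands` are hypothetical; they only build argument lists and do not run anything.

```python
def model_name(c, g):
    """File name that records the parameters, e.g. model-C8.0-g0.0625."""
    return "model-C{}-g{}".format(c, g)


def build_commands(c, g, train_file, test_file, predict_file):
    """Return (train_cmd, predict_cmd) sharing one model file name."""
    model = model_name(c, g)
    train_cmd = ["svm-train", "-c", str(c), "-g", str(g), train_file, model]
    predict_cmd = ["svm-predict", test_file, model, predict_file]
    return train_cmd, predict_cmd


train_cmd, predict_cmd = build_commands(
    8.0, 0.0625, "training.data", "testing.data", "predict.out")
print(" ".join(train_cmd))
print(" ".join(predict_cmd))
```

Assuming the LIBSVM binaries are on your PATH, each list could then be passed to `subprocess.run(cmd, check=True)`, training first and predicting second.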
A model file will look like this:
svm_type c_svc
kernel_type rbf
gamma 0.5
nr_class 2
total_sv 6164
rho -2.4768
label 1 -1
nr_sv 3098 3066
SV
2 1:-0.452773 2:-0.455573 3:-0.485312 4:-0.436805 ...
If you need to know more about the model file, see the LIBSVM FAQ
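Because the header of a model file is plain text, it can be inspected with a few lines of Python. This is a minimal sketch that assumes the layout shown in the example above (one "key value..." pair per line, ending at the line "SV"); `parse_model_header` is a hypothetical helper, not part of LIBSVM.

```python
def parse_model_header(text):
    """Map each header key to its list of string values, stopping at 'SV'."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "SV":  # support vectors start after this line
            break
        key, _, values = line.partition(" ")
        header[key] = values.split()
    return header


# The example model file from the answer above.
sample = """svm_type c_svc
kernel_type rbf
gamma 0.5
nr_class 2
total_sv 6164
rho -2.4768
label 1 -1
nr_sv 3098 3066
SV
2 1:-0.452773 2:-0.455573 3:-0.485312 4:-0.436805 ...
"""

header = parse_model_header(sample)
print(header["kernel_type"], header["gamma"], header["nr_sv"])
```

In this example the per-class counts in nr_sv add up to total_sv, which is a quick sanity check that the file was read correctly.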

mahout lucene document clustering howto?

I've read that I can create Mahout vectors from a Lucene index, which can then be used with the Mahout clustering algorithms.
http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
I would like to apply the k-means clustering algorithm to the documents in my Lucene index, but it is not clear how I can apply this algorithm (or hierarchical clustering) to extract meaningful clusters from these documents.
This page, http://cwiki.apache.org/confluence/display/MAHOUT/k-Means,
says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How can I "declare" that these are my documents (or their vectors), and simply take them and do the clustering?
Sorry in advance for my poor grammar.
Thank you
If you have vectors, you can run KMeansDriver. Here is its help output:
Usage:
 [--input <input> --clusters <clusters> --output <output> --distance <distance>
  --convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
  --vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input                The Path for input Vectors. Must be a
                                    SequenceFile of Writable, Vector
  --clusters (-c) clusters          The input centroids, as Vectors. Must be a
                                    SequenceFile of Writable, Cluster/Canopy.
                                    If k is also specified, then a random set
                                    of vectors will be selected and written out
                                    to this path first
  --output (-o) output              The Path to put the output in
  --distance (-m) distance          The Distance Measure to use. Default is
                                    SquaredEuclidean
  --convergence (-d) convergence    The threshold below which the clusters are
                                    considered to be converged. Default is 0.5
  --max (-x) max                    The maximum number of iterations to
                                    perform. Default is 20
  --numReduce (-r) numReduce        The number of reduce tasks
  --k (-k) k                        The k in k-Means. If specified, then a
                                    random selection of k Vectors will be
                                    chosen as the Centroid and written to the
                                    clusters output path.
  --vectorClass (-v) vectorClass    The Vector implementation class name.
                                    Default is SparseVector.class
  --overwrite (-w)                  If set, overwrite the output directory
  --help (-h)                       Print out help
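To show how the input vectors, the initial clusters, the distance measure, the convergence threshold and the iteration limit fit together, here is a minimal single-machine k-means sketch in Python. It is not Mahout's implementation: the function names and toy points are made up, the initial centroids are passed in explicitly, and treating "converged" as "no centroid moved by more than the threshold" is an assumption about what that option means.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, convergence=0.5, max_iterations=20):
    """Return (centroids, assignments) after at most max_iterations rounds."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        moved = max(squared_euclidean(o, n)
                    for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < convergence:
            break
    return centroids, assignments


# Made-up 2-D "document vectors" forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = k_means(points, [(0, 0), (10, 10)])
print(centroids, assignments)
```

In this picture the data points are the document vectors built from the Lucene index, and the initial clusters are simply the starting centroids, which Mahout can also pick at random when k is given.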
Update: Copy the result directory from HDFS to the local filesystem. Then use the ClusterDumper utility to get the clusters and the list of documents in each cluster.
A pretty good howto is here:
integrating apache mahout with apache lucene
# maiky
You can read more about reading the output and using the clusterdump utility on this page: https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper