prediction using libsvm in java

I'm using the libsvm (3.11) tool to implement SVM classification in my project (text classification using multi-agent). But every time I predict, it assigns the same label to all the test documents, i.e., either +1 or -1, even though I'm using different kinds of data.
I'm using the following procedure to run libsvm classification on plain text documents:
-> There is a set of training text documents
-> I convert these text documents into libsvm's input format using TF-IDF weights (I have two folders representing two classes; documents in the first folder get label -1 and documents in the second get +1, followed by the TF-IDF values for that document; see the sample format after this list)
-> After that I collect the bag of words into one plain text document, and then use those words to generate the test document vector with some label (I'm taking only one test document, so IDF is always 1 and there is only one vector; I assume the label doesn't matter)
-> After that I apply the libsvm functions svm_train and svm_predict with the default options
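For reference, libsvm's sparse input format puts one document per line: a class label followed by index:value pairs for the non-zero features (the indices and TF-IDF values below are made up):

-1 1:0.12 5:0.33 9:0.05
+1 2:0.40 5:0.10 12:0.22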
Am I following the correct procedure? If anything in it is wrong, please feel free to tell me; it would really help.
And why does libsvm always return only one label? Is it a fault in my procedure, or a problem with the tool?
Thanks in advance.

Why are you using a new criterion to build the test documents? The testing and training document sets should both be derived from your original set of "training text documents". I put these in quotes because you could take a subset of them and use it for testing. Ultimately, make sure your training and testing document sets are distinct but both come from the original set.
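In practice that also means the test document must be vectorized with exactly the same feature indices and IDF statistics as the training set; building a separate bag of words with IDF = 1 for the test document breaks this. A minimal sketch of the idea in Python with scikit-learn (the question uses Java libsvm, but the workflow is the same; the document strings and labels are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Placeholder documents for the two class folders and a held-out test document.
train_texts = ["text of a class -1 document", "text of a class +1 document"]
train_labels = [-1, +1]
test_texts = ["text of an unseen document"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit vocabulary and IDF on training data only
X_test = vectorizer.transform(test_texts)        # reuse the SAME fitted vectorizer for test data

clf = SVC(kernel="linear")
clf.fit(X_train, train_labels)
print(clf.predict(X_test))  # prediction for the test document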

Related

Linear regression output is given as separated and NAN values

I'm trying to create the best linear regression model, and my code is:
Daugialype2 <- lm(TNFBL ~ `IL-4_konc_BL` + `MCP-1_konc_BL` + `IL-8_konc_BL` + `TGF-β1_konc_BL`)
summary(Daugialype2)  # this code works; I get a normal output
BUT
Then I want to introduce more variables to the model, e.g.
Daugialype2 <- lm(TNFBL ~ `IL-4_konc_BL` + `MCP-1_konc_BL` + `IL-8_konc_BL` + `TGF-β1_konc_BL` + MiR_181_BL)
For unknown reasons, my output now looks like this, even though without the MiR_181_BL variable the output was fine:
[screenshot: summary() output with separated coefficients and NaN values]
I don't know where the problem is, and I don't get any error message. Could it be in the variable itself?
My variable looks like this (while the others have fewer digits after the decimal comma):
[screenshot: the MiR_181_BL values]
It's my very first model. Thank you for your answers!

hyperopt with manual data source

I would like to optimize my parameters with hyperopt.
As I understand it, I can specify a range for each parameter; hyperopt selects test values within this range and learns from the return value, for example by minimizing the result.
The problem is that I cannot let hyperopt start the test program itself. I need to start the program manually with the test parameters and collect the result value in each iteration. Maybe a CSV file could serve as an external data source for hyperopt, where one line contains testvalue1,testvalue2,result.
Is this kind of manual saving and data import possible with hyperopt?
Can someone provide an example? That would be very helpful.
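hyperopt drives the loop by calling your objective function, so one workable pattern is to make the objective itself the manual step: print the suggested parameters, run your program by hand, type the measured result back in, and log each trial to a CSV as you go. A rough sketch (the parameter names and ranges are made up):

import csv
from hyperopt import fmin, tpe, hp, Trials

# Made-up search space with two parameters.
space = {
    "testvalue1": hp.uniform("testvalue1", 0.0, 1.0),
    "testvalue2": hp.quniform("testvalue2", 1, 100, 1),
}

def objective(params):
    # Manual step: run the test program with these parameters by hand,
    # then type the measured result in at the prompt.
    print("run your program with:", params)
    result = float(input("measured result: "))
    # Keep an external record, one line per trial: testvalue1,testvalue2,result
    with open("results.csv", "a", newline="") as f:
        csv.writer(f).writerow([params["testvalue1"], params["testvalue2"], result])
    return result  # hyperopt minimizes this value

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=trials)
print(best)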

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text that I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to turn these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression over the token text, then the best way to do this is a custom pipeline component, added before the tagger, that sets tags for certain words.
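For example, something along these lines; the regex and model name are placeholders for whatever your setup uses, and this relies on the tagger leaving pre-set tags alone:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

SKIP_RE = re.compile(r"XQJ-\d+")  # placeholder pattern for the scanner's tokens

def preset_propn(doc):
    # Pre-set TAG/POS on matching tokens; the tagger won't overwrite existing tags.
    for token in doc:
        if SKIP_RE.fullmatch(token.text):
            token.tag_ = "NNP"
            token.pos_ = "PROPN"
    return doc

nlp.add_pipe(preset_propn, before="tagger")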
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
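A sketch of that, assuming all merged tokens share a recognizable placeholder text like UNKNOWN_SKIPPED:

import spacy

nlp = spacy.load("en_core_web_sm")

def isolate_skipped(doc):
    # Force sentence breaks around each skipped token so the parser
    # puts it in a sentence of its own.
    for i, token in enumerate(doc):
        if token.text == "UNKNOWN_SKIPPED":  # placeholder token text
            if i > 0:
                token.is_sent_start = True
            if i + 1 < len(doc):
                doc[i + 1].is_sent_start = True
    return doc

nlp.add_pipe(isolate_skipped, before="parser")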
The parser does use the token.norm feature, so if you set the NORM attribute in the retokenizer to something extremely PROPN-like, you have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, pick a word that would have been a frequent, similar proper noun in American newspaper text from 20 years ago: if the skipped token should behave like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it can help.
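Concretely, in the retokenizer that might look like this (the sentence and span indices are made up; "Bush" follows the last-name example above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Report filed by XQJ - 37 yesterday.")  # made-up text; tokens 3-5 form the span to merge

with doc.retokenize() as retokenizer:
    # Merge the span into a single token with PROPN-like TAG, POS, and NORM.
    retokenizer.merge(doc[3:6], attrs={"TAG": "NNP", "POS": "PROPN", "NORM": "Bush"})

print([(t.text, t.pos_, t.norm_) for t in doc])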
If you're using a model with vectors like en_core_web_lg, you can also set the vector for the skipped token to be the same as that of a similar word (check that the similar word actually has a vector first). This tells the model to use the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this is tricky to do well without hurting the parser's accuracy on other cases.
The NORM and vector modifications will also influence the tagger, so if you choose them well, you might get pretty close to the results you want.

Extracting Data from an AREA file

I am trying to extract information at a specific location (lat, lon) from different satellite images. These images were given to me in the AREA format, and I cooked up a simple Jython script to extract temperature values.
The script works; here is a small snippet from it that prints the data value at a point.
from edu.wisc.ssec.mcidas import AreaFile as af

# ADDE request for a 1x1 box of TEMP values at the given LATLON
url = "adde://localhost/imagedata?&PORT=8113&COMPRESS=gzip&USER=idv&PROJ=0&VERSION=1&DEBUG=false&TRACE=0&GROUP=FL&DESCRIPTOR=8712C574&BAND=2&LATLON=29.7276 -85.0274 E&PLACE=ULEFT&SIZE=1 1&UNIT=TEMP&MAG=1 1&SPAC=4&NAV=X&AUX=YES&DOC=X&DAY=2012002 2012002&TIME=&POS=0&TRACK=0"
a = af(url)
value = a.getData()
print value
array([[I, [array([I, [array('i', [2826, 2833, 2841, 2853])])])
So what does this mean?
Please excuse me if the question seems trivial; while I am comfortable with Python, I am really new to dealing with scientific data.
Note
Here is a link to the entire script.
After asking around, I found out that the AreaFile object returns data in multiples of four, so the very first value is the one I am looking for.
Grabbing the value is as simple as:
value[0][0][0]
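For the sample output above, that gives:

first = value[0][0][0]  # first of the four values returned
print first             # prints 2826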

usage of language model file while creating a dictionary

I created a speech-to-text recognition app. For that purpose I built a dictionary using the CMULanguage tool. To create the dictionary for my project, I added two files to the Language folder under Groups and Files; the files have the extensions .lm (language model) and .dic.
These files were supplied to me by the CMULanguage tool when I uploaded my corpus. I want to know: what is this .lm file used for? If anyone knows, please let me know.
Thanks in advance,
Christy
The dictionary and the language model are two separate items -- you cannot convert one into the other, and you can't just delete or omit one of them; both are needed!
The dictionary is used to tell the search algorithm what the valid words are and how they map to phonemes (the phonetic transcription).
The language model is used during the recognition of an utterance: it supplies the probability of a unigram, bigram, or higher-order n-gram when the search algorithm is considering a word transition.
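For illustration, a .dic file is plain text mapping each word to its phoneme sequence, and a .lm file is typically in the ARPA n-gram format, listing log10 probabilities (and back-off weights) per n-gram. Tiny made-up examples of each:

HELLO  HH AH L OW
WORLD  W ER L D

\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.7782 <s> -0.3010
-1.0792 </s>
-0.7782 HELLO -0.3010
-1.0792 WORLD

\2-grams:
-0.3010 <s> HELLO
-0.3010 HELLO WORLD

\end\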