How to make training and testing data files for LIBSVM and/or TinySVM - libsvm

When I open the sample training data files that ship with LIBSVM, I can't understand the file structure. Can someone please show me how to build one?
Below is my training data for predicting the songwriter of a song (as an example):
Feature 1: Number of occurrences of the word "love" in the lyrics
Feature 2: Number of occurrences of the word "friend" in the lyrics
Feature 3: Number of occurrences of the word "zone" in the lyrics
Training data:
Song A (3, 0, 0), song writer is David
Song B (0, 3, 1), song writer is Peter
Song C (1, 3, 1), song writer is Tom
Testing data:
Song D (3, 0, 1)
Thank you very much.

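The LIBSVM README file describes the format.
Each line of the training file must look like this:
label index1:value1 index2:value2 ...
The indices are integers in ascending order. The file itself needs no terminator; a closing node with index -1 is only required when you build svm_node arrays through the C API.
LIBSVM also has a class called svm_node that represents the same index:value pairs in code.
Sample code in Java: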
for (int k = 0; k < dataCount; k++) {
    // One svm_node per feature of sample k
    prob.x[k] = new svm_node[features.length];
    for (int j = 0; j < features.length; j++) {
        svm_node node = new svm_node();
        node.index = featuresIndex[j]; // feature number
        node.value = features[j];      // feature value
        prob.x[k][j] = node;
    }
    prob.y[k] = label; // class label of sample k
}

In this classification problem the whole dataset has three classes, David, Peter and Tom, and we assign them the values 0, 1 and 2 respectively.
The format of the data set is:
[label] [feature number]:[number of times that feature occurs] ...
Our training data file will look like this:
0 1:3 2:0 3:0
1 1:0 2:3 3:1
2 1:1 2:3 3:1
This file can be used to train the model. It has 3 rows and four columns: the first column is the class label, and each remaining column is a feature number followed by the number of times that feature occurs.
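The testing data uses the same format. The label column is still expected; when the true label is unknown, any number can stand in for it, since it is only used to compute accuracy:
0 1:3 2:0 3:1
This line is passed to the trained SVM model, which then produces the prediction.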
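As a rough sketch of how the songs from the question could be turned into lines of this format, the Python snippet below builds the training lines and one test line. The helper name to_libsvm_line and the placeholder label 0 on the test line are assumptions made for illustration; they are not part of LIBSVM.

```python
# Hypothetical helper: format one sample as "label index:value ...".
def to_libsvm_line(label, features):
    # Feature numbers start at 1 and appear in ascending order.
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"

# Class numbering taken from the answer: David=0, Peter=1, Tom=2.
writers = {"David": 0, "Peter": 1, "Tom": 2}

# (songwriter, (love, friend, zone)) for songs A, B and C.
training = [
    ("David", (3, 0, 0)),
    ("Peter", (0, 3, 1)),
    ("Tom", (1, 3, 1)),
]

train_lines = [to_libsvm_line(writers[w], f) for w, f in training]

# Song D has no known songwriter, so 0 is used as a placeholder label (assumption).
test_line = to_libsvm_line(0, (3, 0, 1))

print("\n".join(train_lines))
print(test_line)
```

Writing train_lines to a text file, one line per sample, gives a file in the layout described above.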

Related

Understanding Pandas Series Data Structure

I am trying to get my head around the Pandas module and started learning about the Series data structure.
I have created the following Series in Spyder:
songs = pd.Series(data = [145,142,38,13], name = "Count")
I can obtain information about the Series index using the code:
songs.index
The output of the above code is as follows:
RangeIndex(start=0, stop=4, step=1)
My question is: where it states start=0 and stop=4, what are these referring to?
I have interpreted start=0 as meaning that the first element in the Series is in row 0.
But I am not sure what the stop value refers to, as there is no element in row 4 of the Series.
Can someone explain?
Thank you.
As already explained in the comments, the stop value is exclusive: valid positions run from start up to stop - 1, so a Series with 4 items has positions 0 to 3. The same convention is prevalent in many places.
For instance, take the list data structure:
z = songs.to_list()
[145, 142, 38, 13]
len(z)
4 # length is four
# however indexing stops at i-1 position 'i' being the length/count of items in the list.
z[4] # this will raise an IndexError
# you will have to start at index 0 going till only index 3 (i.e. 4 items)
z[0], z[1], z[2], z[-1] # notice how -1 can be used to directly access the last element
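Python's built-in range follows the same start/stop convention, so it makes a small stand-in for the index of the Series. This is only an illustration using plain Python, with the start and stop values taken from the question:

```python
# Same start and stop as the index reported for the Series in the question.
index = range(0, 4)

positions = list(index)
print(positions)   # [0, 1, 2, 3]
print(len(index))  # 4: stop - start gives the number of items
print(4 in index)  # False: the stop value itself is never a valid position
```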

Convert string features to numeric features in sklearn and pandas

I'm currently working with sklearn (I'm a beginner) and I want to train and test a very naive classifier.
The structure of my training and testing data is the following:
----|----|----|----|----|----|------|----|----|----|-------
f1 | f2 | f3 | c1 | c2 | c3 | word | c4 | c5 | c6 | label
----|----|----|----|----|----|------|----|----|----|-------
Where:
f1: feature 1, binary numerical type like 0
f2: feature 2, binary numerical type like 1
f3: feature 3, binary numerical type like 0
c1: context 1, string type like "from"
c2: context 2, string type like "this"
c3: context 3, string type like "website"
word: central word (string) of the context like "http://.."
c4: context 4, string type
c5: context 5, string type
c6: context 6, string type
label: this is the label (string) that the classifier has to train and predict like: "URL" (I have only three types of label: REF,IRR,DATA)
What I want to do is convert my context string features into numerical features. Every string field contains at most one word.
The main goal is to assign a numeric value to every context and word string so that the classifier can work with them.
My idea is to define a vocabulary like:
{ from, website, to, ... }
and provide this vocabulary to the DictVectorizer, but I don't know how to do this.
What I really want is to generate a huge number of binary features: the word "from" immediately preceding the word in question is one feature; the word "available" two positions after the word is another one. But I don't know how.
This is what I tried to do:
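import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# I tried to read the train csv:
train = pd.read_csv('train.csv')
# Drop the label field to keep only the features:
train_X = train.drop(['label'], axis=1)
# Take the labels:
train_y = train.label.values
# Then I convert the pandas DataFrame into a list of dictionaries, one per row
# (recent pandas versions need the full name 'records' instead of 'r'):
train_X = train_X.to_dict('records')
# And I tried to vectorize everything:
vec = DictVectorizer()
train_X = vec.fit_transform(train_X).toarray()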
Obviously this didn't work, because the context and word fields can hold a very long string such as a URL.
Any suggestions? I am open to all kinds of solutions.
Thank you very much.
If the set of unique words is finite, you can do something like this using pandas:
mapping_dict = {'word1': 0,
                'word2': 1,
                'word3': 2}
df[col] = df[col].map(mapping_dict)
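The binary features described in the question (a given word at a given context position) can also be built by hand. The sketch below uses plain Python and made-up rows to show the idea: every distinct "column=word" pair becomes one 0/1 feature. The sample rows and the names feature_index and encode are illustrative assumptions, and DictVectorizer applies a similar one-hot scheme to string values.

```python
# Illustrative rows (made-up values): one dict per sample, column name -> word.
rows = [
    {"c3": "from", "c4": "available"},
    {"c3": "website", "c4": "available"},
    {"c3": "from", "c4": "to"},
]

# Vocabulary of "column=word" features, sorted so the order is reproducible.
feature_names = sorted({f"{col}={word}" for row in rows for col, word in row.items()})
feature_index = {name: i for i, name in enumerate(feature_names)}

def encode(row):
    """Return a 0/1 vector with a 1 for every column=word pair present in row."""
    vector = [0] * len(feature_index)
    for col, word in row.items():
        position = feature_index.get(f"{col}={word}")
        if position is not None:  # pairs not seen when building the vocabulary are ignored
            vector[position] = 1
    return vector

encoded = [encode(row) for row in rows]
print(feature_names)
print(encoded)
```

Because the position is part of the feature name, "from" in c3 and "from" in c4 would count as two different features, which matches the request in the question.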

How to make a new variable based on 30 other variables

I have 30 variables on family history of cancer, e.g. breast cancer in the father, breast cancer in the mother, breast cancer in a sister, etc. I would like to make a new variable and give it the value 1 if any of these columns contains a 1.
That is:
The 30 variables take the answers 1 to 3, where 1 is yes, 2 is no, and 3 is unknown. If any of the 30 variables is 1, I would like my new variable to take the value 1.
Does someone know how I can do this?
You can store the answers in a list instead of 30 separate variables and then search the list to compute the new variable. This makes the solution more dynamic.
// This will be the cancer history for a single family
var cancerHistory = [];
// Add dummy data
cancerHistory.push('yes');
cancerHistory.push('no');
cancerHistory.push('unknown');
cancerHistory.push('no');
// Check if at least one of them is "yes"
var hasHistoryOfCancer = cancerHistory.indexOf('yes') > -1;
alert(hasHistoryOfCancer); // true
You can use a for loop. You did not mention the language, so I am writing the code in Python, which is easy to understand. The same approach can be applied in other languages.
import pandas as pd

df = pd.read_csv("DataFile.csv")  # Export the data file to CSV and put its name here.
# Placeholder column names; replace them with the names of the 30 real columns.
columns = [f"column{i}" for i in range(1, 31)]
new_var = []
for i in range(len(df)):
    # Collect the 30 answers of row i
    x = [df[c][i] for c in columns]
    if 1 in x:
        new_var.append(1)
    else:
        new_var.append(0)
df['new_var'] = new_var
df.to_csv('NewDataFile.csv', sep=',', encoding='utf-8')

How to organize an SQL database to store book text analysis data

I have analysed 3 books using the Stanford NLP library. I run my analysis page by page, and for every book this is the output I get:
// An array of length P, where P is the total number of pages in the book
// so that pageSentiment[0] represents the sentiment of the page 1.
float[] pageSentiment
// An array of length P, where P is the total number of pages in the book
// so that pageWords[0] represents the number of words in the page 1.
int[] pageWords
// An array of length W, where W is the number of unique words in the book
// where, for example, bookWords[0] has the following values
// word = "then"
// data[0] = {1, 1, 2} => the word "then" occurs 2 times in page 1 (associated to chapter 1)
// data[1] = {1, 2, 1} => the word "then" occurs 1 times in page 2 (associated to chapter 1)
// data[2] = {1, 3, 0} => the word "then" occurs 0 times in page 3 (associated to chapter 1)
// data[3] = {1, 4, 0} => the word "then" occurs 0 times in page 4 (associated to chapter 1)
// data[4] = {2, 5, 3} => the word "then" occurs 3 times in page 5 (associated to chapter 2)
// data[5] = ...
struct WordData { string word; int[,,] data; }
WordData[] bookWords
Now I have to store all those results in an SQL database so that I can access them to plot graphs and statistical tables within a web page. What I'm trying to figure out is the proper way to store all those values flexibly, so that I can easily send different queries to the database and obtain whatever output I currently need. For example, I need to be able to:
plot a histogram of the word counts (pageWords) in which each column can be either a page or a chapter (in the latter case I need to aggregate page values);
see the frequency of a word by page or by chapter;
print global book values for every book;
etc.
Any suggestions about the structure of my SQL tables, please?
Just 3 tables
book
---
book_id
title
...
word
---
word_id
text
...
and many-to-many table with results
word_2_book
---
word_id
book_id
page_no
chapter_no
word_count
Then just
select *
from word_2_book wb
where wb.book_id=? and wb.word_id=?
and you can apply any aggregate function to the result, for example SUM(word_count) grouped by chapter_no.
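To see the three-table layout in action, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The column types, the book title and the ID values are assumptions made for the example; the counts for the word "then" are the ones listed in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE word (word_id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE word_2_book (
    word_id    INTEGER REFERENCES word(word_id),
    book_id    INTEGER REFERENCES book(book_id),
    page_no    INTEGER,
    chapter_no INTEGER,
    word_count INTEGER
);
""")

conn.execute("INSERT INTO book VALUES (1, 'Example book')")  # hypothetical title
conn.execute("INSERT INTO word VALUES (1, 'then')")

# (chapter, page, count) triples for the word "then", as listed in the question.
data = [(1, 1, 2), (1, 2, 1), (1, 3, 0), (1, 4, 0), (2, 5, 3)]
conn.executemany(
    "INSERT INTO word_2_book (word_id, book_id, page_no, chapter_no, word_count) "
    "VALUES (1, 1, ?, ?, ?)",
    [(page, chapter, count) for chapter, page, count in data],
)

# Frequency of the word by chapter: aggregate the per-page rows.
per_chapter = conn.execute(
    "SELECT chapter_no, SUM(word_count) FROM word_2_book "
    "WHERE book_id = ? AND word_id = ? "
    "GROUP BY chapter_no ORDER BY chapter_no",
    (1, 1),
).fetchall()
print(per_chapter)  # [(1, 3), (2, 3)]
```

Dropping the GROUP BY and selecting page_no and word_count gives the per-page view from the same table.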

Convert cell text in progressive number

I have written this SQL query in a PostgreSQL environment:
SELECT
ST_X("Position4326") AS lon,
ST_Y("Position4326") AS lat,
"Values"[4] AS ppe,
"Values"[5] AS speed,
"Date" AS "timestamp",
"SourceId" AS smartphone,
"Track" as session
FROM
"SingleData"
WHERE
"OsmLineId" = 44792088
AND
array_length("Values", 1) > 4
AND
"Values"[5] > 0
ORDER BY smartphone, session;
I have imported the result into Matlab, where I now have six vectors and one cell array (because the text of the UUIDs was converted to a cell), all of size 5710x1.
I would like to convert the text in the cell array into a progressive number, like 1, 2, 3... for each different session code.
In Excel this is easy with VLOOKUP(obj, matrix, col), but I do not know how to do it in Matlab.
I have a big cell array with a lot of codes like:
ff95465f-0593-43cb-b400-7d32942023e1
I would like to convert this cell array into an array of numbers where the first code to occur, e.g.
ff95465f-0593-43cb-b400-7d32942023e1 -> 1
gets 1, the next different code gets 2, and so on.
OK, I have solved it.
I put the unique session codes in a second cell array C.
Then, with a for loop, I look up the position of each code in C:
%% Converting of the UUIDs into integer
C = unique(session);
N = length(session);
session2 = zeros(N, 1);
for i = 1:N
session2(i) = find(strcmp(C, session(i)));
end
Thanks to all!
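One detail worth noting: Matlab's unique returns its values sorted, so the loop above numbers the codes in sorted order rather than in order of first appearance. For comparison, here is a small Python sketch of first-occurrence numbering as described in the question. The function name number_sessions and the short example codes are made up for illustration.

```python
def number_sessions(codes):
    """Map each distinct code to 1, 2, 3... in order of first appearance."""
    first_seen = {}
    numbers = []
    for code in codes:
        if code not in first_seen:
            first_seen[code] = len(first_seen) + 1
        numbers.append(first_seen[code])
    return numbers

# The first code is the one from the question; the others are made-up placeholders.
session = [
    "ff95465f-0593-43cb-b400-7d32942023e1",
    "example-code-b",
    "ff95465f-0593-43cb-b400-7d32942023e1",
    "example-code-a",
    "example-code-b",
]
session2 = number_sessions(session)
print(session2)  # [1, 2, 1, 3, 2]
```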