I want to use a learned model - data-science

I have a decision tree that is already trained, but now I want to use this decision tree to predict the class of new data whose class is unknown.
I have:
Tid : 1 2 3 4 5 6 7 8 9 10
Refund : Yes No No Yes No No Yes No No No
MaritalStatus : Single Maried Single Maried Divorced Maried Divorced Single Maried Single
TexableIncome : 125K 100K 70K 120K 95K 60K 220K 85K 75K 90K
Cheat (this attribute is class) : No No No No Yes No No Yes No Yes
After I use the data above to train a decision tree I get a tree, but then I have data whose class I don't know, and I want to use the tree to predict the class:
Tid : 11 12 13 14 15
Attrib1 : No Yes Yes No No
Attrib2 : Small Medium Large Small Large
Attrib3 : 55k 80k 110k 95k 67k
Class : ? ? ? ? ? (I want to predict this with my first tree)

A decision tree model is trained on all available attributes of the training set. All decisions leading to a label (class) prediction are based on these specific attributes. Your pre-trained model can therefore only be applied to example sets that contain at least all attributes of the training set (be aware that attributes with a special role, like 'id', are ignored by the learning algorithm).
In your example, the model relies on Refund, MaritalStatus and TexableIncome (spelling?), but the second data set, for which you want to predict a label, has the attributes Attrib1, Attrib2 and Attrib3. Not even a simple rename would work, since the range of possible values differs between MaritalStatus (Divorced, Single, Maried; again, spelling?) and Attrib2 (Small, Medium, Large).
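To make this concrete, here is a minimal sketch (assuming scikit-learn and a simple one-hot encoding; not from the original post) of training on the first table and predicting on new rows. The point is that predict only works when the new data carries the same attributes the tree was trained on:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training data: the same attributes as the table above
train = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Maried", "Single", "Maried", "Divorced",
                      "Maried", "Divorced", "Single", "Maried", "Single"],
    "TexableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
X_train = pd.get_dummies(train.drop(columns="Cheat"))   # one-hot encode the categoricals
y_train = train["Cheat"]
tree = DecisionTreeClassifier().fit(X_train, y_train)

# New, unlabeled data must use the same attributes (not Attrib1/2/3)
new = pd.DataFrame({
    "Refund":        ["No", "Yes"],
    "MaritalStatus": ["Single", "Maried"],
    "TexableIncome": [55, 110],
})
X_new = pd.get_dummies(new).reindex(columns=X_train.columns, fill_value=0)
print(tree.predict(X_new))   # predicted Cheat labels for the new rows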

Related

Data selecting in data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I cannot just select all the data available because this will result in overfitting. I am currently investigating house prices in the capital of Denmark for the past 10 years, and I wanted to know which type of housing I should select for my data:
Owner-occupied flats and houses (This gives a dataset of 50000 elements)
Or just owner-occupied flats (this gives a dataset of 43000 elements)
So as you can see there are a lot more owner-occupied flats sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats because then I have the "same" kind of elements in my data and still have 43000 elements.
Also, there are a lot higher taxes involved if you own a house rather than owning an owner-occupied flat. This might affect the price of the house and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses are selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to insufficient data.
Let me explain the selection process to resolve this kind of problem. I will use the example of credit card fraud and then relate it to your question, or to any future prediction problem with labelled data.
In the real world credit card fraud is not that common, so if you look at real data you will find that only about 2% of it is fraud. If you train a model on such a dataset it will be biased, because the classes are not evenly distributed (fraud and non-fraud transactions; in your case owner-occupied flats and houses). There are four ways to tackle this issue.
Let's suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under-sampling the majority class
Here we select just 10 data points from the 90 and train the model on 10:10, so the distribution is balanced (in your case, using only 7000 of the 43000 flats). This is not an ideal approach, as we would be throwing away a huge amount of data.
2. Over-sampling the minority class by duplication
Here we duplicate the 10 data points until we have 90, so the distribution is balanced (in your case, duplicating the 7000 house records to reach 43000, i.e. equal to the number of flats). While this works, there is a better way.
3. Over-sampling the minority class with SMOTE (recommended)
The Synthetic Minority Over-sampling Technique uses the k-nearest-neighbours algorithm to generate synthetic examples of the minority class, in your case the house data. There is a module named imbalanced-learn (here) which can be used to implement this; see the sketch after this list.
4. Ensemble method
Here you divide your data into multiple balanced datasets, for example splitting the 90 non-fraud points into 9 sets so that each set has 10 fraud and 10 non-fraud data points (in your case, dividing the 43000 flats into batches of 7000). You then train a model on each set separately and use a majority-vote mechanism to predict.
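As a rough illustration of options 1-3, here is a minimal sketch using imbalanced-learn (not the answerer's code; X and y stand in for your feature matrix and the flat/house label):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy stand-in for the real data: 90 majority and 10 minority points
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# 1. Under-sample the majority class down to 10:10
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)

# 2. Over-sample the minority class by duplication up to 90:90
X_d, y_d = RandomOverSampler(random_state=0).fit_resample(X, y)

# 3. Over-sample the minority class with SMOTE (synthetic nearest-neighbour points)
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)

print(np.bincount(y_u), np.bincount(y_d), np.bincount(y_s))   # (10,10) (90,90) (90,90)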
So now I have created the following diagram. The green line shows the price per square metre of owner-occupied flats and the red line shows the price per square metre of houses (all prices in DKK). I was wondering: is this an imbalanced classification problem? The maximum deviation of the prices is at most 10% (see for example 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

Combine multiple source sets to make a decision

I'm working on a project in which I am using an OCR engine and TensorFlow to identify the vehicle number plate and the vehicle model respectively. I also have a database which contains vehicle information (e.g. owner, number plate, vehicle brand, color, etc.).
Simple flow:
Image input
Number plate recognition using OCR
Vehicle model (e.g. Hyundai, Toyota, Honda, etc.) using TensorFlow
Query (2. and 3.) in database to find the owner
Now, the fact is that the OCR engine is not 100% accurate; let's consider INDXXXX0007 as the best result from the engine.
When I query this result in database I get
Set 1,
Owner1 - INDXXXX0004 (95% match)
Owner2 - INDXXXX0009 (95% match)
In such cases, I use tensorflow data to make a decision
Set 2, where vehicle model shows:
Hyundai (95.00%)
Honda (90.00%)
Here comes my main problem: TensorFlow sometimes gives me false-positive values. For example, the actual vehicle is a Honda but the model shows more confidence for Hyundai (ref. Set 2).
What would be a possible way to avoid such problems, or how can I combine both sets to make a decision?
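One possible reading of "combine both sets" (only a hedged sketch of one scheme, not something stated in the post): since the database already stores each candidate owner's vehicle brand, every candidate returned by the plate query can be scored by combining its plate-match score with the classifier's confidence for that candidate's brand, for example by multiplying the two:

# Hypothetical candidates from the database query, with plate-match scores
candidates = [
    {"owner": "Owner1", "plate": "INDXXXX0004", "brand": "Hyundai", "plate_score": 0.95},
    {"owner": "Owner2", "plate": "INDXXXX0009", "brand": "Honda",   "plate_score": 0.95},
]

# Hypothetical brand confidences from the TensorFlow classifier
brand_scores = {"Hyundai": 0.95, "Honda": 0.90}

# Combine the two sources; a simple product is used here, but weighted or
# calibrated probabilities could be substituted.
best = max(candidates,
           key=lambda c: c["plate_score"] * brand_scores.get(c["brand"], 0.0))
print(best["owner"])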

tensorflow crossed column feature with vocabulary list for cross terms

How would I make a crossed_column with a vocabulary list for the crossed terms? That is, suppose that I have two categorical columns
animal [dog, cat, puma, other]
food [pizza, salad, quinoa, other]
and now I want to make the crossed column, animal x food - but I've done some frequency counts of the training data (in spark, before exporting tfrecords for training tensorflow models), and puma x quinoa only showed up once, and cat x quinoa never showed up. So I don't want to generate features for them; I don't think I have enough training examples to learn what their weights should be. What I'd like is for both of them to get absorbed into the "other x other" feature, the thought being that I'll learn some kind of average weight for a feature that covers all the infrequent terms.
It doesn't look like I can do that with tf.feature_column.crossed_column -- any idea how I would do this kind of thing in tensorflow?
Or should I not worry about it? If I crossed all the features I'd get 20, but there are only 18 that I think are important - so maybe set the hash map size to 18, or less, causing collisions? Then include the first-order columns, animal and food, so the model can figure out what it is looking at? This is the approach I'm getting from reading the docs. I like it because it is simpler, but I am concerned about model accuracy.
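For reference, a minimal sketch of that hashed-cross-plus-first-order-columns setup (assuming the tf.feature_column API; the column names are the ones from the question):

import tensorflow as tf

animal = tf.feature_column.categorical_column_with_vocabulary_list(
    "animal", ["dog", "cat", "puma", "other"])
food = tf.feature_column.categorical_column_with_vocabulary_list(
    "food", ["pizza", "salad", "quinoa", "other"])

# The cross is hashed into a small number of buckets, so infrequent pairs
# collide with other pairs instead of getting their own weight.
animal_x_food = tf.feature_column.crossed_column(["animal", "food"],
                                                 hash_bucket_size=18)

columns = [
    tf.feature_column.indicator_column(animal),         # first-order terms
    tf.feature_column.indicator_column(food),
    tf.feature_column.indicator_column(animal_x_food),  # hashed cross
]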
I think what I really want is some kind of sparse table lookup, rather than hashing the cross -- imagine I have
column A - integer Ids, 1 to 10,000
column B - integer Ids, 1 to 10,000
column C - integer Ids, 1 to 10,000
and there are only 1 million of the 1 trillion possible crosses between A, B, C that I want to make features for -- all the rest would go into one extra "other x other x other" feature (the 1,000,001st). How would I do that in tensorflow?
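One way to get that behaviour (a sketch of one possible approach, not from the post): precompute the cross as a single string feature in the preprocessing step, then hand categorical_column_with_vocabulary_list the explicit list of crosses you care about and let everything else fall into a single out-of-vocabulary bucket:

import tensorflow as tf

# Hypothetical list of the crosses frequent enough to deserve their own weight
important_crosses = ["dog_x_pizza", "cat_x_salad", "puma_x_pizza"]  # etc.

# The input pipeline is assumed to provide a precomputed string feature
# "animal_x_food" (e.g. built in spark before writing the tfrecords).
cross = tf.feature_column.categorical_column_with_vocabulary_list(
    "animal_x_food",
    vocabulary_list=important_crosses,
    num_oov_buckets=1)   # every infrequent pair lands in this single bucket

cross_col = tf.feature_column.indicator_column(cross)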

LSTM (long short-term memory) on Penn Tree Bank data

The Penn Tree Bank data seems difficult to understand. Below are two links:
https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
https://www.tensorflow.org/tutorials/recurrent
My concern is as follows. The reader gives me around a million occurrences of 10000 words. I have written code to convert this data set into a one-hot encoding. Thus, I have a million vectors of 10000 dimensions, each vector having a 1 at a single location. Now, I want to train an LSTM (long short-term memory) model on this for prediction.
For simplicity let us assume that there are 30 occurrences (and not a million), and that the sequence length is equal to 10 for the LSTM (the number of timesteps that it unrolls). Let us denote these occurrences by
X1,X2,....,X10,X11,...,X20,X21,...X30
Now, my concern is: should I use 3 data samples for training,
X1,..X10 and X11,..,X20, and X21,..X30
or should I use 20 data samples for training
X1,..X10 and X2,...,X11, and X3,..,X12, and so on until X21,..,X30
In case I go with the latter, am I not breaking the i.i.d. assumption of training-data sequence generation?
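To make the two options concrete, here is a small sketch (plain NumPy, not from the post) of cutting the 30 tokens into length-10 samples, first with non-overlapping windows and then with stride-1 sliding windows:

import numpy as np

tokens = np.arange(1, 31)   # stand-in for the word indices X1..X30
seq_len = 10

# Option 1: non-overlapping windows (3 samples)
non_overlapping = tokens.reshape(-1, seq_len)

# Option 2: sliding windows with stride 1 (one sample per start position)
sliding = np.stack([tokens[i:i + seq_len]
                    for i in range(len(tokens) - seq_len + 1)])

print(non_overlapping.shape, sliding.shape)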

Simple voice recognition when whispering

I'm trying to do simple voice-to-text mapping using pocketsphinx. The grammar is very simple, such as:
public <grammar> = (Matt, Anna, Tom, Christine)+ (One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero)+ ;
e.g:
Tom Anna Three Three
yields
Tom Anna 33
I adapted the acoustic model (to take into account my foreign accent) and after that I got decent performance (~94% accuracy). I used a training dataset of ~3 minutes.
Right now I'm trying to do the same but by whispering into the microphone. The accuracy dropped significantly, to ~50% without training. With training for my accent I got ~60%. I tried other things, including denoising and boosting the volume. I read the whole docs but was wondering if anyone could answer some questions so I can better understand in which direction I should go to improve performance.
1) In the tutorial you adapt the hub4wsj_sc_8k acoustic model. I guess "8k" is a sampling parameter. When using sphinx_fe you use "-samprate 16000". Was it deliberate to adapt an 8k model using data with a 16k sampling rate? Why wasn't data with 8k sampling used? Does it influence performance?
2) In sphinx 4.1 (in comparison to pocketsphinx) there are different acoustic models, e.g. WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar. Can those models be used with pocketsphinx? Will an acoustic model with 16k sampling typically perform better on data with a 16k sampling rate?
3) When using data for training, should I use recordings in normal speaking mode (to adapt only to my accent) or in whispering mode (to adapt to whispering and my accent)? I think I tried both scenarios and didn't notice any difference that would let me draw a conclusion, but I don't know pocketsphinx internals so I might be doing something wrong.
4) I used the following script to record the adaptation (training) and testing data from the tutorial:
for i in `seq 1 20`; do
    fn=`printf arctic_%04d $i`;                                       # arctic_0001 .. arctic_0020
    read sent; echo $sent;                                            # show the sentence to read aloud
    rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null;    # record until Ctrl-C
done < arctic20.txt
I noticed that each time I hit Ctrl-C the keypress is audible in the recorded audio, which led to errors. Trimming the audio sometimes helped to correct that, or sometimes led to other errors instead. Is there any requirement that each recording has a few seconds of quiet before and after speaking?
5) When accumulating observation counts, are there any settings I can tinker with to improve performance?
6) What's the difference between a semi-continuous and a continuous model? Can pocketsphinx use a continuous model?
7) I noticed that the 'mixture_weights' file from sphinx4 is much smaller compared to the one you get in pocketsphinx-extra. Does it make any difference?
8) I tried different combinations of removing white noise (using the 'sox' toolkit, e.g. sox noisy.wav filtered.wav noisered profile.nfo 0.1). Depending on the last parameter it sometimes improved things a little bit (~3%) and sometimes made them worse. Is it good to remove noise, or is that something pocketsphinx does as well? My environment is quiet; there is only white noise, which I guess can have more impact when the audio is recorded whispering.
9) I noticed that boosting the volume (gain) alone most of the time only made the performance a little bit worse, even though for humans it was easier to distinguish the words. Should I avoid it?
10) Overall I tried different combinations and the best result I got is ~65% when only removing noise, so only a slight (5%) improvement. Below are some stats:
//ORIGINAL UNPROCESSED TESTING FILES
TOTAL Words: 111 Correct: 72 Errors: 43
TOTAL Percent correct = 64.86% Error = 38.74% Accuracy = 61.26%
TOTAL Insertions: 4 Deletions: 13 Substitutions: 26
//DENOISED + VOLUME UP
TOTAL Words: 111 Correct: 76 Errors: 42
TOTAL Percent correct = 68.47% Error = 37.84% Accuracy = 62.16%
TOTAL Insertions: 7 Deletions: 4 Substitutions: 31
//VOLUME UP
TOTAL Words: 111 Correct: 69 Errors: 47
TOTAL Percent correct = 62.16% Error = 42.34% Accuracy = 57.66%
TOTAL Insertions: 5 Deletions: 12 Substitutions: 30
//DENOISE, threshold 0.1
TOTAL Words: 111 Correct: 77 Errors: 41
TOTAL Percent correct = 69.37% Error = 36.94% Accuracy = 63.06%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 31
//DENOISE, threshold 0.21
TOTAL Words: 111 Correct: 80 Errors: 38
TOTAL Percent correct = 72.07% Error = 34.23% Accuracy = 65.77%
TOTAL Insertions: 7 Deletions: 3 Substitutions: 28
That processing was done only on the testing data. Should the training data be processed in the same way? I think I tried that but there was barely any difference.
11) In all those tests I used an ARPA language model. When using JSGF the results were usually much worse (I have the latest pocketsphinx branch). Why is that?
12) Because in each sentence the maximum number would be '999' and there are no more than 3 names, I modified the JSGF and replaced the repetition sign '+' by repeating the content in the parentheses manually. This time the results were much closer to ARPA. Is there any way in the grammar to specify a maximum number of repetitions, as in regular expressions?
13) When using the ARPA model I generated it from all possible combinations (since the dictionary is fixed and really small: ~15 words), but during testing I was still sometimes receiving illegal results, e.g. Tom Anna (without any required number). Is there any way to enforce some structure using an ARPA model?
14) Should the dictionary be limited to only those ~15 words, or will a full dictionary only affect speed but not performance?
15) Is modifying the dictionary (phonemes) the way to go to improve recognition when whispering? (I'm not an expert, but when we whisper I guess some words might sound different?)
16) Any other tips on how to improve accuracy would be really helpful!
Regarding whispering: when you whisper, the sound waves don't have meaningful periodic parts (the vibrations that result from your vocal cords resonating during normal speech are absent when whispering). You can try this by putting your finger on your throat while loudly saying 'aaaaaa', and then just whispering it.
AFAIR acoustic modeling relies a lot on taking the frequency spectrum of the sound to detect peaks (formants) and relate them to phones (like vowels).
Educated guess: when whispering, the spectrum is mostly white noise, slightly shaped by the oral position (tongue, openness of the mouth, etc.), which is enough for humans but not nearly enough to make the peaks distinguishable by a computer.
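If you want to see this for yourself, here is a rough sketch (not from the answer; the file names are hypothetical stand-ins for a voiced and a whispered recording of the same vowel) that compares the strongest spectral peaks of the two; the voiced recording should show clear harmonic peaks, the whispered one a much flatter, noise-like spectrum:

import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def strongest_peaks(path, n=5):
    rate, audio = wavfile.read(path)                  # mono 16-bit WAV assumed
    freqs, power = welch(audio.astype(float), fs=rate, nperseg=1024)
    return np.sort(freqs[np.argsort(power)[-n:]])     # frequencies of the n strongest peaks

for name in ("voiced_a.wav", "whispered_a.wav"):      # hypothetical recordings
    print(name, strongest_peaks(name))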