Combine multiple source sets to make a decision - tensorflow

I'm working on a project in which I am using an OCR engine and TensorFlow to identify the vehicle number plate and the vehicle model respectively. I also have a database that contains vehicle information (e.g. owner, number plate, vehicle brand, color, etc.).
Simple flow:
1. Image input
2. Number plate recognition using OCR
3. Vehicle model (e.g. Hyundai, Toyota, Honda, etc.) using TensorFlow
4. Query the results of 2. and 3. in the database to find the owner
Now, the OCR engine is not 100% accurate; let's consider INDXXXX0007 as the best result from the engine.
When I query this result in database I get
Set 1,
Owner1 - INDXXXX0004 (95% match)
Owner2 - INDXXXX0009 (95% match)
In such cases, I use the TensorFlow output to make a decision.
Set 2, where the vehicle model classifier shows:
Hyundai (95.00%)
Honda (90.00%)
Here comes my main problem: TensorFlow sometimes gives me false-positive values. For example, the actual vehicle is a Honda, but the model shows more confidence for Hyundai (see Set 2).
What would be a possible way to avoid such problems, or how can I combine both sets to make a decision?
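One simple way to combine the two sets is to score each database candidate by both its plate-match score and the classifier's confidence for the brand registered to that candidate, then pick the highest joint score. Below is a minimal sketch of that idea; the data structures, field names and the equal weighting are illustrative assumptions, not a fixed recipe.

```python
# A minimal sketch of one way to fuse the two evidence sources:
# weight each database candidate by (plate-match score) x (classifier
# confidence for that candidate's registered brand). All names and the
# 0.5/0.5 weighting below are illustrative assumptions.

# Candidates returned by the fuzzy plate lookup (Set 1),
# with the brand registered for each plate in the database.
plate_candidates = [
    {"owner": "Owner1", "plate": "INDXXXX0004", "plate_score": 0.95, "brand": "Hyundai"},
    {"owner": "Owner2", "plate": "INDXXXX0009", "plate_score": 0.95, "brand": "Honda"},
]

# Brand confidences from the TensorFlow classifier (Set 2).
brand_confidence = {"Hyundai": 0.95, "Honda": 0.90}

def joint_score(candidate, w_plate=0.5, w_brand=0.5):
    """Weighted geometric mean of the plate-match and brand confidences."""
    p = candidate["plate_score"] ** w_plate
    b = brand_confidence.get(candidate["brand"], 0.0) ** w_brand
    return p * b

best = max(plate_candidates, key=joint_score)
print(best["owner"], joint_score(best))
```

If the two leading joint scores differ by less than some margin, it is probably safer to flag the record for manual review than to decide automatically, which also limits the damage from classifier false positives.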

Related

Data selecting in data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I cannot just select all the data available because this will result in overfitting. I am currently investigating house prices in the capital of Denmark for the past 10 years and I wanted to know which type of houses I should select in my data:
Owner-occupied flats and houses (This gives a dataset of 50000 elements)
Or just owner-occupied flats (this gives a dataset of 43000 elements)
So as you can see there are a lot more owner-occupied flats sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats because then I have the "same" kind of elements in my data and still have 43000 elements.
Also, much higher taxes are involved if you own a house rather than an owner-occupied flat. This might affect the price and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses are selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to a lack of data, i.e. insufficient data.
Let me explain the selection process used to resolve this kind of problem. I will explain using the example of credit card fraud and then relate it to your question, or to any future prediction problem with class-labelled data.
In the real world, credit card fraud is not that common, so if you look at real data you will find that only about 2% of transactions are fraudulent. If you train a model on such a dataset it will be biased, because the classes are not evenly distributed (fraud and non-fraud transactions; in your case, owner-occupied flats and houses). There are four ways to tackle this issue.
Suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under sampling majority class
Here we select only 10 data points from the 90 and train the model with a 10:10 split so that the distribution is balanced (in your case, using only 7000 of the 43000 flats). This is not ideal, as we would be throwing away a huge amount of data.
2. Over sampling minority class by duplication
Here we duplicate the 10 data points to make 90, so the distribution is balanced (in your case, duplicating the 7000 house records until they match the 43000 flats). While this works, there is a better way.
3. Over sampling minority class by SMOTE (recommended)
Synthetic Minority Over-sampling Technique uses a k-nearest-neighbours algorithm to generate synthetic minority-class samples, in your case the housing data. The imbalanced-learn module (here) can be used to implement this (see the short sketch after this list).
4. Ensemble Method
In this method you divide your data into multiple balanced datasets, for example dividing the 90 into 9 sets so that each set has 10 fraud and 10 non-fraud data points (in your case, dividing the 43000 flats into batches of 7000). You then train a separate model on each set and use a majority-vote mechanism to predict.
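As a minimal illustration of option 3, here is a short, hedged sketch using imbalanced-learn's SMOTE on a toy 90/10 dataset shaped like the fraud example above; the feature matrix is random and only there to make the snippet runnable.

```python
# Minimal SMOTE sketch with imbalanced-learn (pip install imbalanced-learn).
# The toy data mirrors the 90 non-fraud / 10 fraud example; in the housing
# case the minority class would be the houses.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 samples, 4 made-up features
y = np.array([0] * 90 + [1] * 10)    # 90 majority, 10 minority

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))            # -> [90 90]: minority synthesised up to 90
```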
So now I have created the following diagram. The green line shows the price per square meter of owner-occupied flats and the red line shows the price per square meter of houses (all prices in DKK). I was wondering whether this is an imbalanced classification problem. The maximum deviation of the prices is at most 10% (see for example 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

Can you forecast with multiple trajectories?

I am new to time-series machine learning and have a, perhaps, trivial question.
I would like to forecast the temperature for a particular region. I could train a model using the hourly data points from the first 6 days of the week and then evaluate its performance on the final day. The training set would therefore have 144 data points (6*24) and the test set would have 24 data points (24*1). Likewise, I could train a new model for each of regions B-Z and evaluate their individual performances. My question is: can you train a SINGLE model for the predictions across multiple different regions? The region label would of course have to be an input, since it will affect the temperature evolution.
Can you train a single model that forecasts for multiple trajectories rather than just one? Also, what might be a good metric for evaluating its performance? I was going to use mean absolute error but maybe a correlation is better?
Yes, you can train with multiple series of data from different regions. What you are asking for is the ultimate goal of deep learning: one model that does everything and predicts every region correctly. However, if you want to generalize a model that much, you normally need a really huge model, on the order of 100M+ parameters, and to train it you also need tons of data, perhaps a couple of terabytes or petabytes, so you also need a very powerful computer, something like a Google data center. As for the metric, simple RMSE or mean absolute error will work fine.
Here is what you need to focus on: training data. There is no super model that takes garbage and turns it into gold; garbage in, garbage out. You need a good dataset that represents the whole environment of the problem you are trying to solve. For example, suppose you want a model to predict whether a glass will break when you hammer it, and you have maybe 10 samples for each type of glass, all of which break when hammered. You train the model and it simply predicts "break" every single time; then you try it on bulletproof glass, which does not break, so your model is wrong. You need data covering the full range of glass types before the model can predict correctly. Compare this to your 144 data points; I'm fairly sure that won't be enough for your case.
Therefore, I would say yes, you can build that one-model-fits-all, but there is a huge price to pay.
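To make the "one model, region as an input" idea concrete, here is a small, hedged sketch (not the answerer's code): the region label is one-hot encoded next to a simple one-hour lag feature, a single regressor is fit on all regions at once, and the last day is scored with mean absolute error. The column names, the lag choice, and the synthetic data are all assumptions.

```python
# Sketch: one model across regions via a one-hot region feature plus a
# lagged temperature feature, evaluated with mean absolute error.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Toy hourly data for two regions over 7 days (replace with real readings).
rng = np.random.default_rng(0)
rows = []
for region in ["A", "B"]:
    temps = 15 + 5 * np.sin(np.arange(7 * 24) * 2 * np.pi / 24) + rng.normal(0, 1, 7 * 24)
    for hour, t in enumerate(temps):
        rows.append({"region": region, "hour": hour, "temp": t})
df = pd.DataFrame(rows)

# Lag feature: the temperature one hour earlier, computed per region.
df["temp_lag1"] = df.groupby("region")["temp"].shift(1)
df = df.dropna()

X = pd.get_dummies(df[["temp_lag1", "region"]], columns=["region"])
y = df["temp"]

# Last 24 hours of each region as the test set, the rest as training.
test_mask = df["hour"] >= 6 * 24
model = Ridge().fit(X[~test_mask], y[~test_mask])
print("MAE:", mean_absolute_error(y[test_mask], model.predict(X[test_mask])))
```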

Multiple trained models vs Multple features and one model

I'm trying to build a regression-based ML model using TensorFlow.
I am trying to estimate an object's ETA based on the following:
distance from target
distance from target (X component)
distance from target (Y component)
speed
The object travels on specific journeys. This could be represented as from A->B or from A->C or from D->F (POINT 1 -> POINT 2). There are 500 specific journeys (between a set of points).
These journeys aren't completely straight lines, and every journey is different (ie. the shape of the route taken).
I have two ways of getting around this problem:
I can have 500 different models, each with 4 features and one label (the training ETA data).
I can have 1 model with 5 features and one label.
My dilemma is that option 1 adds complexity, but it will be more accurate because every model is specific to one journey.
If I use option 2, the setup is much simpler, but I don't know if it would work properly. The new features I would add are originCode + destinationCode. Unfortunately these are not quantifiable in a way that makes numerical sense or reveals a pattern; they're just text that defines the journey (for journey A->B, the feature would be 'AB').
Is there some way that I can use one model and categorize the features so that one feature is just a 'grouping' feature (in order to separate the training data with respect to the journey)?
In ML, I believe that option 2 is generally the better option. We prefer general models rather than tailoring many models to specific tasks, as that gets dangerously close to hardcoding, which is what we're trying to get away from by using ML!
I think that, depending on the training data you have available and the model size, a one-hot vector could be used to describe the start/end points for the model. E.g., say we have 5 points (ABCDE) and we are going from position B to position C; this could be represented by the vector:
0100000100
as in, the first five values correspond to the origin spot whereas the second five are the destination. It is also possible to combine these if you want to reduce your input feature space to:
01100
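For what it's worth, here is a tiny sketch of both encodings described above for the five points A-E; the helper names are just illustrative.

```python
# Sketch of the two encodings above for points A-E.
# Journey B -> C gives 0100000100 (10 bits) or 01100 (5 bits).
points = ["A", "B", "C", "D", "E"]

def one_hot(point):
    return [1 if p == point else 0 for p in points]

def encode_journey(origin, destination, combined=False):
    if combined:
        # One shared vector with both the origin and destination bits set.
        return [o | d for o, d in zip(one_hot(origin), one_hot(destination))]
    # Origin one-hot followed by destination one-hot.
    return one_hot(origin) + one_hot(destination)

print(encode_journey("B", "C"))                 # [0,1,0,0,0, 0,0,1,0,0]
print(encode_journey("B", "C", combined=True))  # [0,1,1,0,0]
```

Note that the combined 5-bit form loses direction: B->C and C->B produce the same vector, so use it only if direction doesn't matter for your ETA.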
There are other things to consider, as Scott has said in the comments:
How much data do you have? Maybe the feature space will be too big this way; I can't be sure. If you have enough data, the model will implicitly learn the general distances between points (not explicitly, but from patterns intrinsic to the data).
If you have enough data, you might even be able to accurately predict between two points you don't have data for!
If it does come down to not having enough data, then finding representative features of the journey will come into use, ie. length of journey, shape of the journey, elevation travelled etc. Also a metric for distance travelled from the origin could be useful.
Best of luck!
I would be inclined to lean toward individual models. This is because, for a given position along a given route and a constant speed, the ETA is a deterministic function of time. If one moves monotonically closer to the target along the route, it is also a deterministic function of distance to target. Thus, there is no information to transfer from one route to the next, i.e. "lumping" their parameters offers no a priori benefit. This is assuming, of course, that you have several "trips" worth of data along each route (i.e. (distance, speed) collected once per minute, or some such). If you have only, say, one datum per route then lumping the parameters is a must. However, in such a low-data scenario, I believe that including a dummy variable for "which route" would ultimately be fruitless, since that would introduce a number of parameters that rivals the size of your dataset.
As a side note, NEITHER of the models you describe could handle new routes. I would be inclined to build an individual model per route, data quantity permitting, and a single model neglecting the route identity entirely just for handling new routes, until sufficient data is available to build a model for that route.

Is multiple regression the best approach for optimization?

I am being asked to take a look at a scenario where a company has many projects that they wish to complete, but, as with any company, budget comes into play. There is a Y value of a predefined score, with multiple X inputs. There are also 3 main constraints: capital cost, expense cost, and time to completion in months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the 3 constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression, though I have looked into different approaches in detail. I would like to ask the wider community: has anyone dealt with a similar problem, and what approaches did you use?
The first thing we should understand is that a conclusion about something is not based on a single argument.
This comes from communication theory: every human builds a frame of knowledge (an understood conclusion), where the frame is constructed from many pieces of knowledge/information.
The consequence is that we cannot use simple linear regression to build an ML/DL system.
At the very least we should use two different variables to make a sub-conclusion. If we insist on using a single variable with simple linear regression (y = mx + c), it is like forcing the computer to predict something with low accuracy. Whatever optimization method you pick, the accuracy will still be low, because linear regression applied to real life amounts to predicting a 'habit' from data rather than calculating the real condition.
That means we should use multiple linear regression (y = m1*x1 + m2*x2 + ... + c) so the computer can understand, draw a conclusion, and build a regression model. But it is not that simple: because the computer tries to draw a conclusion from data with multiple characteristics/variances, you must classify both the data and the conclusions.
For example, try to make the computer understand Pythagoras.
We know the Pythagorean formula is c = ((a^2)+(b^2))^(1/2), and we want our computer to predict the hypotenuse (c) from two input values (a and b). To do that, we build a model, i.e. a multiple linear regression formula, for Pythagoras.
Step 1: of course, we need a multi-variable dataset for Pythagoras.
This is an example:
a    b    c
3    4    5
8    6    10
3    14   ... (try to put in 10 to 20 data rows)
Try to fit a multiple regression formula to predict c based on the a and b values.
You will find that the fit is accurate for some values (higher than 98%) and not very accurate for others (under 90%); for example, a=3 with b=14 or b=15 will give a low-accuracy result (under 90%).
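To make the exercise concrete outside of Excel, here is a small, hedged sketch that fits a multiple linear regression c ~ a, b on a handful of illustrative Pythagorean triples and prints the per-point error, so you can see which points fit poorly; the specific rows are made up.

```python
# Sketch of the Excel exercise: fit c ~ a, b with multiple linear
# regression on a few Pythagorean triples and inspect per-point error.
import numpy as np
from sklearn.linear_model import LinearRegression

data = np.array([
    [3, 4, 5],
    [6, 8, 10],
    [5, 12, 13],
    [8, 15, 17],
    [7, 24, 25],
    [9, 12, 15],
    [12, 16, 20],
    [20, 21, 29],
])
X, y = data[:, :2], data[:, 2]

model = LinearRegression().fit(X, y)
pred = model.predict(X)
for (a, b), c, p in zip(X, y, pred):
    print(f"a={int(a)} b={int(b)}  true c={int(c)}  predicted c={p:.2f}")
# The linear fit is only an approximation of the non-linear formula
# c = sqrt(a^2 + b^2), which is why some rows fit much worse than others.
```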
So you must optimize. But how?
I know many optimization methods, but I found that doing it manually works: if I exclude the data points that give low-accuracy results, put them in a separate group, and then recalculate a regression for that excluded group, I get a significantly better result. Repeat until you reach the accuracy target you want.
Each data group that gets its own new regression is a new class.
This means I end up with several multiple regressions based on the data I input (one regression per group of data / class), and the accuracy is really high, 99% to 99.99%.
With these classes, each regression functions as the 'label' of its class; this is what happens in the background of the automated computation. With many modules, the user appears to supply a 'string' object as the label, but in truth that string object is bound to a regression constructed as the label.
With some conditional parameters you can get a good ML model with a minimal amount of training data.
Try it in Excel / LibreOffice before going any further.
Try to follow the tutorial from this video
and implement it on simple data that is easy to construct in Excel, like Pythagoras.
So the answer is yes: multiple regression is the best approach for optimization.

How to generate data that fits the normal distribution within each class?

Using numpy, I need to produce training and test data for a machine learning problem. The model is able to predict three different classes (X,Y,Z). The classes represent the types of patients in multiple clinical trials, and the model should be able to predict the type of patient based on data gathered about the patient (such as blood analysis and blood pressure, previous history etc.)
From a previous study we know that, in total, the classes are represented with the following distribution, in terms of a percentage of the total patient count per trial:
X - u=7.2, s=5.3
Y - u=83.7, s=15.2
Z - u=9.1, s=2.3
The u/s describe the distribution in N(u, s) for each class (so, for all trials studied, class X had mean 7.2 and variance 5.3). Unfortunately the data set for the study is not available.
How can I recreate a dataset that follows the same distribution over all classes, and within each class, subject to the constraint of X+Y+Z=100 for each record?
It is easy to generate a dataset that follows the overall distribution (the u values), but how do I get a dataset that has the same distribution per each class?
The problem you have stated is to sample from a mixture distribution. A mixture distribution is just a number of component distributions, each with a weight, such that the weights are nonnegative and sum to 1. Your mixture has 3 components. Each is a Gaussian distribution with the mean and sd you gave. It is reasonable to assume the mixing weights are the proportion of each class in the population. To sample from a mixture, first select a component using the weights as probabilities for a discrete distribution. Then sample from the component. I assume you know how to sample from a Gaussian distribution.
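A minimal numpy sketch of that procedure might look like the following; treating the three means divided by 100 as the mixing weights and the second parameter of N(u, s) as a standard deviation are both assumptions about the numbers in the question.

```python
# Sketch of mixture sampling: pick a component using the weights,
# then draw from that component's Gaussian.
import numpy as np

rng = np.random.default_rng(42)

components = {
    "X": {"mean": 7.2,  "sd": 5.3},
    "Y": {"mean": 83.7, "sd": 15.2},
    "Z": {"mean": 9.1,  "sd": 2.3},
}
labels = list(components)
# Assumption: mixing weights are the class means as fractions of 100.
weights = np.array([components[k]["mean"] for k in labels]) / 100.0

def sample_mixture(n):
    # Step 1: choose a component for each draw, using the weights.
    chosen = rng.choice(labels, size=n, p=weights)
    # Step 2: draw from the chosen component's Gaussian.
    return [(k, rng.normal(components[k]["mean"], components[k]["sd"])) for k in chosen]

print(sample_mixture(5))
```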