Including group membership in a model - pandas

If you had data like this (prices and market caps are not real):
Date       Stock  Close  Market-cap  GDP
15.4.2010  Apple   7.74        1.03  ...
15.4.2010  VW     50.03        0.80  ...
15.5.2010  Apple   7.80        1.04  ...
15.5.2010  VW     52.04        0.82  ...
where Close is the y you want to predict and Market-cap and GDP are your x-variables, would you also include Stock in your model as another independent variable? It could be, for example, that price formation works differently for Apple than for VW.
If yes, how would you do it? My idea is to assign 0 to Apple and 1 to VW in the Stock column.

You first need to identify what exactly you are trying to predict. As it stands, you have longitudinal data: multiple measurements from the same company over a period of time.
Are you trying to predict the close price based on market cap + GDP?
Or are you trying to predict the future close price based on previous close price measurements?
You could stratify based on company name, but it really depends on what you are trying to achieve. What is the question you are trying to answer?
You may also want to take the following considerations into account:
Close prices measured at different times for the same company are correlated with each other.
Correlations between two measurements taken soon after each other will be stronger than correlations between measurements far apart in time.
There are four assumptions associated with a linear regression model:
Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of the residuals is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.
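If you do include Stock, one-hot (dummy) encoding is the standard way to turn it into a numeric regressor. A minimal sketch with pandas, using the example rows from the question; with `drop_first=True` this reduces to exactly the 0-for-Apple / 1-for-VW idea:

```python
import pandas as pd

# Example rows from the question (prices and market caps are not real)
df = pd.DataFrame({
    "Date": ["15.4.2010", "15.4.2010", "15.5.2010", "15.5.2010"],
    "Stock": ["Apple", "VW", "Apple", "VW"],
    "Close": [7.74, 50.03, 7.80, 52.04],
    "Market-cap": [1.03, 0.80, 1.04, 0.82],
})

# One-hot encode Stock; drop_first=True keeps a single indicator column,
# which is equivalent to coding Apple as 0 and VW as 1
X = pd.get_dummies(df[["Market-cap", "Stock"]], columns=["Stock"], drop_first=True)
y = df["Close"]
print(X.columns.tolist())  # ['Market-cap', 'Stock_VW']
```

If price formation really does differ by company, also consider an interaction term (e.g. Stock_VW times Market-cap) or fitting a separate model per stock.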

Related

Data selecting in data science project about predicting house prices

This is my first data science project and I need to select some data. Of course, I know that I cannot just select all the available data, because this will result in overfitting. I am currently investigating house prices in the capital of Denmark for the past 10 years, and I wanted to know which type of homes I should include in my data:
Owner-occupied flats and houses (This gives a dataset of 50000 elements)
Or just owner-occupied flats (this gives a dataset of 43000 elements)
So as you can see there are a lot more owner-occupied flats sold in the capital of Denmark. My opinion is that I should select just the owner-occupied flats because then I have the "same" kind of elements in my data and still have 43000 elements.
Also, there are a lot higher taxes involved if you own a house rather than owning an owner-occupied flat. This might affect the price of the house and skew the data a little bit.
I have seen a few projects where both owner-occupied flats and houses are selected for the data and the conclusion was overfitting, so that is what I am looking to avoid.
This is a classic example of over-fitting due to insufficient data.
Let me explain the selection process used to resolve this kind of problem. I will use the example of credit card fraud and then relate it to your question (and to any future prediction problem with labelled classes).
In the real world, credit card fraud is not that common, so if you look at real data you will find that only about 2% of transactions are fraudulent. If you train a model on such a dataset it will be biased, because the classes are not balanced (fraud vs. non-fraud transactions; in your case, owner-occupied flats vs. houses). There are four ways to tackle this issue.
Let's suppose the dataset has 90 non-fraud data points and 10 fraud data points.
1. Under sampling majority class
Here we select only 10 of the 90 majority-class points and train the model on a 10:10 split, so the distribution is balanced (in your case, using only 7000 of the 43000 flats). This is not ideal, as we would be throwing away a huge amount of data.
2. Over sampling minority class by duplication
Here we duplicate the 10 minority points until there are 90, so the distribution is balanced (in your case, duplicating the 7000 house records until they match the 43000 flats). While this works, there is a better way.
3. Over sampling minority class by SMOTE (recommended)
Synthetic Minority Over-sampling Technique (SMOTE) uses the k-nearest-neighbours algorithm to generate synthetic minority-class samples, in your case the housing data. The imbalanced-learn module (here) can be used to implement this.
4. Ensemble Method
In this method you divide the majority class into multiple datasets to balance them, for example splitting the 90 non-fraud points into 9 sets so that each set pairs 10 fraud points with 10 non-fraud points (in your case, dividing the 43000 flats into batches of 7000). You then train a model on each set separately and use a majority-vote mechanism to predict.
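Options 1 and 2 above can be sketched with plain numpy on the toy 90/10 fraud example (for option 3, imbalanced-learn's SMOTE class does the neighbour-based generation for you):

```python
import numpy as np

rng = np.random.default_rng(0)
majority = np.zeros(90)  # 90 non-fraud points, label 0
minority = np.ones(10)   # 10 fraud points, label 1

# 1. Under-sample the majority class down to the minority size (10:10)
under = np.concatenate([rng.choice(majority, size=10, replace=False), minority])

# 2. Over-sample the minority class by duplication up to the majority size (90:90)
over = np.concatenate([majority, rng.choice(minority, size=90, replace=True)])

print(len(under), len(over))  # 20 180
```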
So now I have created the following diagram. The green line shows the price per square metre of owner-occupied flats and the red line shows the price per square metre of houses (all prices in DKK). I was wondering whether this is imbalanced classification. The maximum deviation between the prices is at most 10% (see, for example, 2018). Is 10% enough to say that the data is biased and therefore imbalanced?

Understand price optimization using PuLP and formulate problem

I am trying to write a small price optimization engine that optimizes revenues given a list of articles.
I have a list of articles and, for each of them, its price elasticity of demand. My constraints are currently not defined; however, there will definitely be something putting a ceiling on the maximum price and a floor on the minimum price.
Currently, I am stuck on how to express in the model the relationship between price and price elasticity; more precisely, the model should have a constraint encoding the fact that if an item is very elastic, changing its price will strongly affect the quantity sold.
Moreover, I am actually not sure which kind of data I really need as input variables. Do I need something like a list of prices and quantity sold at different price points?
I am afraid elasticities introduce nonlinearities in the model:
log(Q) = C + Elasticity * log(P)
where C is a constant. Or stated differently:
Q = K * P^Elasticity
where K = Exp(C) is again a constant.
These types of nonlinearities are typical in many economic models. They are often solved with non-linear solvers. PuLP is for linear models only, so if you want to use that you may want to use a linear approximation (i.e. a linear demand function). You probably should discuss this with your teacher or supervisor.
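To see the nonlinearity concretely, here is a small sketch that grid-searches revenue R(P) = P * Q(P) for a single article under the constant-elasticity demand Q = K * P^Elasticity (K, the elasticity value, and the price bounds are all hypothetical numbers):

```python
import numpy as np

K = 1000.0         # hypothetical demand scale, K = exp(C)
elasticity = -1.8  # hypothetical own-price elasticity (an elastic article)

# The price bounds play the role of the "roof" and floor constraints
prices = np.linspace(5.0, 50.0, 200)

quantity = K * prices**elasticity  # Q = K * P^elasticity
revenue = prices * quantity        # R = P * Q

best_price = prices[np.argmax(revenue)]
print(best_price)  # 5.0 -- for elasticity < -1, revenue falls as price rises
```

This also shows why the bounds matter: with elasticity below -1 the revenue optimum sits on the price floor. Once multiple articles and shared constraints are involved, a nonlinear solver (or a piecewise-linear approximation of the demand curve inside PuLP) is needed.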

tensorflow crossed column feature with vocabulary list for cross terms

How would I make a crossed_column with a vocabulary list for the crossed terms? That is, suppose I have two categorical columns
animal [dog, cat, puma, other]
food [pizza, salad, quinoa, other]
and now I want to make the crossed column, animal x food. But I've done some frequency counts of the training data (in Spark, before exporting tfrecords for training TensorFlow models), and puma x quinoa showed up only once, while cat x quinoa never showed up. So I don't want to generate features for them; I don't think I have enough training examples to learn what their weights should be. What I'd like is for both of them to be absorbed into the "other x other" feature, the thought being that I'll learn some kind of average weight for a feature that covers all the infrequent terms.
It doesn't look like I can do that with tf.feature_column.crossed_column -- any idea how I would do this kind of thing in tensorflow?
Or should I not worry about it? If I crossed all the features I'd get 20, but there are only 18 that I think are important, so maybe set the hash bucket size to 18 or less, causing collisions? Then include the first-order columns, animal and food, so the model can figure out what it is looking at? This is the approach I'm getting from reading the docs. I like it because it is simpler, but I am concerned about model accuracy.
I think what I really want is some kind of sparse table lookup rather than hashing the cross. Imagine I have
column A - integer Ids, 1 to 10,000
column B - integer Ids, 1 to 10,000
column C - integer Ids, 1 to 10,000
and there are only 1 million of the 1 trillion possible crosses between A, B, and C that I want to make features for; all the rest should go into the one extra "other x other x other" feature. How would I do that in tensorflow?
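One way to prototype that sparse lookup outside of tf.feature_column is a plain dictionary from the frequent crosses to feature ids, with everything else falling through to one shared "other" bucket; the vocabulary and ids below are hypothetical. In TensorFlow itself, a static lookup table keyed on the concatenated crossed values may serve the same role:

```python
# Frequent crosses kept after frequency counting (hypothetical vocabulary)
frequent = [("dog", "pizza"), ("cat", "salad"), ("puma", "other")]

# Frequent crosses get ids 0..n-1; every unseen cross maps to one extra id
cross_to_id = {cross: i for i, cross in enumerate(frequent)}
OTHER_ID = len(cross_to_id)

def cross_feature(animal: str, food: str) -> int:
    """Map an (animal, food) pair to a feature id, absorbing all
    infrequent crosses into a single shared 'other x other' bucket."""
    return cross_to_id.get((animal, food), OTHER_ID)

print(cross_feature("dog", "pizza"))   # 0
print(cross_feature("cat", "quinoa"))  # 3 (never seen, falls into the bucket)
```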

Is multiple regression the best approach for optimization?

I am being asked to look at a scenario where a company has many projects they wish to complete, but, as with any company, budget comes into play. There is a Y value of a predefined score, with multiple X inputs. There are also three main constraints: capital cost, expense cost, and time to completion in months.
The ask is whether an algorithmic approach could be used to optimize which projects should be done for the year given the three constraints. The approach should also give different results if the constraint values change. The suggested method is multiple regression, though I have looked into other approaches in detail. I would like to ask the wider community: has anyone dealt with a similar problem, and what approaches have you used?
The first thing we should understand is that a conclusion is never based on a single argument.
This comes from communication theory: every human builds a frame of knowledge (an understood conclusion), and that frame is constructed from many pieces of knowledge and information.
The consequence is that we cannot use simple linear regression alone to create an ML/DL system.
At the very least we should use two different variables to form a sub-conclusion. If we insist on a single variable with simple linear regression (y = mx + c), we force the computer to predict with low accuracy, and whatever optimization method you pick, the accuracy stays low. This is because linear regression on real-life data predicts a 'habit' based on the data rather than calculating the real condition.
That means we should use multiple linear regression (y = m1*x1 + m2*x2 + ... + c) so the computer can build an understanding, draw conclusions, and create a regression model. But it is not quite that simple: because the computer tries to draw a conclusion from data with multiple characteristics and variances, you must classify both the data and the conclusions.
As an example, try to make the computer understand the Pythagorean theorem.
We know the formula is c = ((a^2)+(b^2))^(1/2), and we want the computer to predict the hypotenuse c from two input values (a and b). To do that, we should build a model, a multiple-linear-regression formula for the Pythagorean relation.
Step 1: of course, we should build a varied dataset of Pythagorean values.
Here is an example:
a  b   c
3  4   5
8  6   10
3  14  etc. (try putting in 10 to 20 rows)
Try to derive a regression formula with multiple regression to predict c based on the a and b values.
You will find that the fit is highly accurate (above 98%) for some values and less accurate (below 90%) for others; for example, a=3 with b=14 or b=15 gives low-accuracy results (under 90%).
So you must optimize, but how?
I know many optimization methods, but I found a manual one: if I exclude the data points that give low-accuracy results, put them in a separate group, and then recalculate a regression on that excluded group, I get a significantly better result. Repeat this until you reach the accuracy target you want.
Each data group, with its new regression, is a new class.
This means I end up with several multiple regressions based on the input data (one regression per group/class), and the accuracy is really high, 99%-99.99%.
With several classes, each regression functions as a 'label' for its class; this is what happens in the background of the automated computation. With many modules, the user seems to assign a 'string' object as a label, but in truth that string object is bound to a regression constructed as the label.
With some conditional parameters you can get a good ML model from a minimal amount of training data.
Try it in Excel / LibreOffice before going any further.
Try to follow the tutorial from this video
and implement it on simple data that is easy to construct in Excel, like the Pythagorean example.
So the answer is yes: multiple regression is the best approach for optimization.
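As a sanity check of the experiment described above, here is a minimal numpy sketch that fits a multiple linear regression c ~ m1*a + m2*b + k to Pythagorean data. Because the true relation is nonlinear, the fit is good on average but worse for some (a, b) combinations, which is exactly the behaviour described; the data ranges are arbitrary:

```python
import numpy as np

# Build Pythagorean data: c = sqrt(a^2 + b^2)
rng = np.random.default_rng(1)
a = rng.uniform(1.0, 15.0, 200)
b = rng.uniform(1.0, 15.0, 200)
c = np.hypot(a, b)

# Least-squares fit of the plane c ~ m1*a + m2*b + k
X = np.column_stack([a, b, np.ones_like(a)])
coef, *_ = np.linalg.lstsq(X, c, rcond=None)
pred = X @ coef

rel_err = np.abs(pred - c) / c
print("coefficients:", coef)
print(f"mean relative error {rel_err.mean():.1%}, worst {rel_err.max():.1%}")
```

Splitting the worst-fitting points into their own group and refitting, as the answer suggests, amounts to a piecewise-linear approximation of the curved surface.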

How to generate data that fits the normal distribution within each class?

Using numpy, I need to produce training and test data for a machine learning problem. The model is able to predict three different classes (X,Y,Z). The classes represent the types of patients in multiple clinical trials, and the model should be able to predict the type of patient based on data gathered about the patient (such as blood analysis and blood pressure, previous history etc.)
From a previous study we know that, in total, the classes are represented with the following distribution, in terms of a percentage of the total patient count per trial:
X - u=7.2, s=5.3
Y - u=83.7, s=15.2
Z - u=9.1, s=2.3
The u/s values describe the distribution N(u, s) for each class (so, across all trials studied, class X had mean 7.2 and standard deviation 5.3). Unfortunately the dataset for the study is not available.
How can I recreate a dataset that follows the same distribution over all classes, and within each class, subject to the constraint that X+Y+Z=100 for each record?
It is easy to generate a dataset that follows the overall distribution (the u values), but how do I get a dataset that has the same distribution per each class?
The problem you have stated is to sample from a mixture distribution. A mixture distribution is just a number of component distributions, each with a weight, such that the weights are nonnegative and sum to 1. Your mixture has 3 components. Each is a Gaussian distribution with the mean and sd you gave. It is reasonable to assume the mixing weights are the proportion of each class in the population. To sample from a mixture, first select a component using the weights as probabilities for a discrete distribution. Then sample from the component. I assume you know how to sample from a Gaussian distribution.
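A minimal numpy sketch of that two-step procedure, assuming the mixing weights are taken to be the mean class proportions from the study (that choice of weights is an assumption, not something given in the question):

```python
import numpy as np

rng = np.random.default_rng(42)

means = np.array([7.2, 83.7, 9.1])  # class means for (X, Y, Z)
sds = np.array([5.3, 15.2, 2.3])    # class standard deviations
weights = means / means.sum()       # assumed mixing weights (sum to 1)

n = 10_000
# Step 1: pick a mixture component for every draw, using the weights
comp = rng.choice(3, size=n, p=weights)
# Step 2: sample from the Gaussian of the chosen component
samples = rng.normal(loc=means[comp], scale=sds[comp])
```

To get per-record (X, Y, Z) triples that sum to 100, one simple (if approximate) fix is to draw one value per class for each record and rescale the triple by 100 / (X + Y + Z).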