How to adjust Python linear regression y axis - matplotlib

I have been having problems with the price column: every time I plot a graph with it, the y-axis shows the prices as scaled decimals (scientific notation with an offset) instead of their actual values, and all my graphs have this problem. I want the axis to show the actual prices.
Example of a linear graph
This is the dataframe containing the information of the dataset. train is the name of the dataframe, and columns contains the selected columns:
import matplotlib.pyplot as plt

columns = ['Id', 'year', 'distance_travelled(kms)', 'brand_rank', 'car_age']
for i in columns:
    plt.scatter(train[i], y, label='Actual')
    plt.xlabel(i)
    plt.ylabel('price')
    plt.show()
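One way to show the actual prices on the y axis is to replace the default scientific-notation formatter with a plain-number formatter, e.g. matplotlib's StrMethodFormatter. A minimal sketch, assuming y holds the price column from train:

import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

for i in columns:
    fig, ax = plt.subplots()
    ax.scatter(train[i], y, label='Actual')
    ax.set_xlabel(i)
    ax.set_ylabel('price')
    # Show plain numbers with thousands separators instead of
    # an offset/scientific notation on the y axis.
    ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
    plt.show()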

Related

Bar plot from two different datasets with different data range

I have the following datasets:
df1 = {'lower':[3.99,4.99,5.99,1700], 'percentile':[1,2,5,10,50,100]}
df2 = {'lower':[2.99,4.50,5,1850], 'percentile':[2,4,7,15,55,100]}
The data:
The percentile refers to the percentage of the data that corresponds to a particular price, e.g. 3.99 would represent 1% of the data, while all values under 5.99 would represent 5% of the data.
Both datasets run up to the 100th percentile, but the price at each percentile varies between the two datasets.
What I have done so far:
What I need help with:
As you can see in the third graph, I can plot the two datasets overlaid, which is what I need, but I have been unsuccessful at changing the legend and the odd x-axis tick values on that graph. It is not showing the percentile (or any other metric I might use on the x axis).
Any help?
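One way to control both the legend and the x-axis ticks is to draw the bars at integer positions and set the tick labels yourself. A minimal sketch with made-up aligned values (the lists in the question have unequal lengths, so the prices below are illustrative only):

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical aligned data: one lower-bound price per percentile.
percentiles = [1, 2, 5, 10, 50, 100]
prices_1 = [3.99, 4.99, 5.99, 8.50, 25.00, 1700.00]
prices_2 = [2.99, 4.50, 5.00, 9.00, 30.00, 1850.00]

x = np.arange(len(percentiles))  # evenly spaced bar positions
width = 0.4

fig, ax = plt.subplots()
ax.bar(x - width / 2, prices_1, width, label='df1')
ax.bar(x + width / 2, prices_2, width, label='df2')
ax.set_xticks(x)
ax.set_xticklabels(percentiles)  # show percentiles, not raw prices
ax.set_xlabel('percentile')
ax.set_ylabel('price (lower bound)')
ax.legend()
plt.show()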

Compare one-hot-encoded column header and predicted labels

I have 3 one-hot-encoded columns whose header names are the labels, and one prediction column preds containing the predicted labels (see image). I want to calculate the performance of my predictions by comparing the label in preds with the one-hot-encoded column headers.
In this example I only have 20% predicted correctly.
Is there a quick way of doing this in pandas?
IIUC, DataFrame.lookup and np.mean
df[['Type_1','Type_2','Type_3']].lookup(df.index, df['preds']).mean() * 100
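For context, a self-contained sketch with made-up data (the 20% figure in the question comes from its own example, not this one). Note that DataFrame.lookup was removed in pandas 2.0, so a NumPy-based equivalent is included:

import numpy as np
import pandas as pd

# Hypothetical data in the question's layout.
df = pd.DataFrame({
    'Type_1': [1, 0, 0, 1, 0],
    'Type_2': [0, 1, 0, 0, 1],
    'Type_3': [0, 0, 1, 0, 0],
    'preds':  ['Type_2', 'Type_2', 'Type_1', 'Type_3', 'Type_2'],
})

cols = df[['Type_1', 'Type_2', 'Type_3']]

# pandas < 2.0: lookup picks, per row, the value in the predicted column.
# accuracy = cols.lookup(df.index, df['preds']).mean() * 100

# pandas >= 2.0 equivalent using integer indexing:
accuracy = cols.to_numpy()[np.arange(len(df)),
                           cols.columns.get_indexer(df['preds'])].mean() * 100
print(accuracy)  # 40.0 for this made-up data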

Train-test split of scikit-learn resulting in features having only one unique value in train data

I am trying to train a multivariate linear regression model. I have a data set named 'main'. There are a few categorical variables in this dataset, which I dummified. Let's say the columns obtained after dummification are A, B, C, D and so on. Now, when I run a train-test split on this main dataset, the resulting train dataset has only the value 0 in some of these columns. How can I overcome this problem?
The code I am using for the train-test split is:
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(0)
df_train, df_test = train_test_split(main, train_size=0.7, test_size=0.3, random_state=100)
On running the below code on the full dataset:
main.columns[main.nunique() == 1]
The result is: Index([], dtype='object')
And when running the same check on the train data:
df_train.columns[df_train.nunique() == 1]
The result is: Index(['A', 'D', 'S'], dtype='object')
I want the resulting train set to contain every feature with all of its values represented; however, this split is giving me only one value in some features.
Edit: I checked the unique values in these columns, and they are highly imbalanced, with only one row for the positive case. I tried stratify, but it needs at least two rows of the positive class, and this is the situation for many columns. So I cannot place these columns in the train dataset separately, as that would require writing code for each column. I want this to be done automatically.
Have you tried changing the random_state value?
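Building on that suggestion, one crude way to automate it is to retry the split with different seeds until no training column is constant. This is only a sketch of the idea, not a statistically principled fix; stratifying on the rare columns would be preferable once they have at least two positive rows:

from sklearn.model_selection import train_test_split

for seed in range(1000):
    df_train, df_test = train_test_split(main, train_size=0.7, test_size=0.3,
                                         random_state=seed)
    # Stop at the first split in which no train column is constant.
    if df_train.columns[df_train.nunique() == 1].empty:
        break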

Selecting the columns of pandas dataframes that minimize a model's score

I have n dataframes in a list:
df = [df_1, df_2, df_3, ...., df_n]
where each df_n is a pandas (Python) dataframe and one input variable of my Keras model.
X_train = [df_1_1, df_2_1, ..., df_n_1]
Where:
df_1_1 is the first column of the first dataframe in the list (the first variable); each dataframe has m columns.
Each column of a dataframe is the same variable with a different type of smoothing or filter applied.
I have 100 columns in each dataframe, and I want to select the combination of columns (one from each dataframe) whose X_train gives the minimum score for my model.
score = model.evaluate(X_test,Y_test)
X_test and Y_test are the last n occurrences of the selected columns.
Is there a library for selecting these columns (neural networks, genetic algorithms, ant colony optimization, ...)?
How can I implement it?
What is your prediction task? Do you need a neural network at all? You are essentially looking at a feature selection problem here. You could use simpler models such as the lasso, which selects columns via L1 regularization, or an ensembling technique such as random forests, using the relative feature importances to select your columns. Perhaps have a look at scikit-learn.
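To illustrate the lasso suggestion, here is a minimal sketch with synthetic stand-in data; in practice X would be the candidate columns concatenated into one matrix and y the prediction target:

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic stand-in for 100 smoothed/filtered variants of the inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
y = 2.0 * X[:, 0] - 1.5 * X[:, 5] + rng.normal(scale=0.1, size=500)

# L1 regularization drives most coefficients to zero, keeping only
# the columns that actually help predict y.
selector = SelectFromModel(LassoCV(cv=5)).fit(X, y)
print('kept columns:', np.flatnonzero(selector.get_support()))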

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo"):
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
Index([u'Repo', u'AllTests', u'Restricted'], dtype='object')
So the first column is the string label, and the second and third columns are data values. We want one series per row, corresponding to the Galactian and Forecast-MLlib repos.
It would seem this is a common task and there would be a straightforward way to simply plot the DataFrame. However, the following related question does not provide any simple way: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series, one that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining columns as series data points?
Update: here is a self-contained snippet.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

npa = np.array  # shorthand helpers used below
p = ps = print

runtimes = npa([1860., 410., 140., 47.])
runtimes.shape = (2, 2)
labels = npa(['Galactian', 'Forecast-MLlib'])
labels.shape = (2, 1)
rtlabels = np.concatenate((labels, runtimes), axis=1)
rtlabels.shape = (2, 3)
colnames = ['Repo', 'AllTests', 'Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is the output:
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
And with piRSquared's help it looks like this.
So the data is showing now, but the Series and Labels are swapped. I will look further to try to line them up properly.
Another update
By flipping the columns/labels, the series come out as desired. The change was to:
labels = npa(['AllTests','Restricted'])
..
colnames = ['Repo','Galactian','Forecast-MLlib']
So the updated code is:
runtimes = npa([1860., 410., 140., 47.])
runtimes.shape = (2, 2)
labels = npa(['AllTests', 'Restricted'])
labels.shape = (2, 1)
rtlabels = np.concatenate((labels, runtimes), axis=1)
rtlabels.shape = (2, 3)
colnames = ['Repo', 'Galactian', 'Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or
df.set_index('Repo').T.astype(float).plot()
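Putting the answer together, here is a minimal self-contained sketch; the transposed form gives one plotted series per repo, with the test categories on the x axis:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Repo': ['Galactian', 'Forecast-MLlib'],
                   'AllTests': [1860.0, 140.0],
                   'Restricted': [410.0, 47.0]})

# Transpose so each repo becomes a column (one plotted series per repo)
# and the test categories become the x-axis tick labels.
df.set_index('Repo').T.plot()
plt.show()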