Pandas groupby in combination with sklearn preprocessing continued - pandas

Continuing from this post:
Pandas groupby in combination with sklearn preprocessing
I need to do preprocessing by scaling the data within groups defined by two columns, but I somehow get an error with the second method:
import pandas as pd
import numpy as np
from sklearn.preprocessing import robust_scale, minmax_scale

df = pd.DataFrame(dict(id=list('AAAAABBBBB'),
                       loc=(10, 20, 10, 20, 10, 20, 10, 20, 10, 20),
                       value=(0, 10, 10, 20, 100, 100, 200, 30, 40, 100)))

df['new'] = df.groupby(['id', 'loc']).value.transform(lambda x: minmax_scale(x.astype(float)))
df['new'] = df.groupby(['id', 'loc']).value.transform(lambda x: robust_scale(x))
The second one gives me an error like this:
ValueError: Expected 2D array, got 1D array instead:
array=[  0.  10. 100.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
If I use reshape, I get an error like this:
Exception: Data must be 1-dimensional
If I print out the grouped data, g['value'] is a pandas Series:
for n, g in df.groupby(['id', 'loc']):
    print(type(g['value']))
Do you know what might cause it?
Thanks.

Based on the error message, you should reshape the values to 2D before scaling, then concatenate the result back to 1D:
df.groupby(['id', 'loc']).value.transform(lambda x: np.concatenate(robust_scale(x.values.reshape(-1, 1))))
Out[606]:
0   -0.2
1   -1.0
2    0.0
3    1.0
4    1.8
5    0.0
6    1.0
7   -2.0
8   -1.0
9    0.0
Name: value, dtype: float64
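An equivalent, slightly shorter variant (a sketch using the same sklearn API, not from the original answer) flattens the 2D result with .ravel() instead of np.concatenate:
df.groupby(['id', 'loc']).value.transform(lambda x: robust_scale(x.values.reshape(-1, 1)).ravel())
Both work because robust_scale returns an (n, 1) array here, and transform needs a 1D result of the same length.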

Related

ValueError for sklearn, problem maybe caused by float32/float64 dtypes?

So I want to check the feature importance in a dataset, but I get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I checked the dataset and fair enough there were nan values. So I added a line to drop all nan rows. Now there are no nan values. I re-ran the code and still the same error. I checked the .dtypes and fair enough, it was all float64. So I added .astype(np.float32) to the columns that I pass to sklearn. But now I still have the same error. I scrolled through the entire dataframe manually and also used data.describe() and all values are between 0 and 5, so far away from infinity or large values.
What is causing the error here?
Here is my code:
import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt

data = pd.read_csv("data.csv")
data.dropna(inplace=True)  # dropping all nan values

X = data.iloc[:, 8:42]
X = X.astype(np.float32)  # converting data from float64 to float32
y = data.iloc[:, 4]
y = y.astype(np.float32)  # converting data from float64 to float32

# feature importance
model = ExtraTreesClassifier()
model.fit(X, y)
print(model.feature_importances_)
You are in the third case (a value too large for the dtype), which becomes the second case (infinity) after the downcast:
Demo:
import numpy as np

a = np.array(np.finfo(np.float64).max)
# array(1.79769313e+308)
b = a.astype('float32')
# array(inf, dtype=float32)
How to debug? Suppose the following array:
a = np.array([np.finfo(np.float32).max, np.finfo(np.float64).max])
# array([3.40282347e+038, 1.79769313e+308])
a[a > np.finfo(np.float32).max]
# array([1.79769313e+308])
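Applied to the question's DataFrame, a quick check along these lines (a sketch; X is the feature frame from the code above) flags the offending columns before the downcast:
import numpy as np
# columns containing values that would overflow float32 and become inf
overflow = (X.abs() > np.finfo(np.float32).max).any()
print(overflow[overflow].index.tolist())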

Pandas interpolation type when method='index'?

The pandas documentation indicates that when method='index', the numerical values of the index are used. However, I haven't found any indication of the underlying interpolation method employed. It looks like it uses linear interpolation. Can anyone confirm this definitively or point me to where this is stated in the documentation?
So it turns out the documentation is a bit misleading: anyone reading
‘index’, ‘values’: use the actual numerical values of the index.
will likely take it to mean fill the NaN values with the numerical values of the index, which is not correct. It should be read as linearly interpolate, using the actual numerical values of the index as the x-coordinates.
The difference between method='linear' and method='index' in the source code of pandas.DataFrame.interpolate lies mainly in the following code:
if method == "linear":
    # prior default
    index = np.arange(len(obj.index))
    index = Index(index)
else:
    index = obj.index
So if you are using the default RangeIndex as the index of the DataFrame, the results of method='linear' and method='index' will be the same; however, if you specify a different index, they will not. The following example shows the difference clearly:
import pandas as pd
import numpy as np

d = {'val': [1, np.nan, 3]}
df0 = pd.DataFrame(d)
df1 = pd.DataFrame(d, index=[0, 1, 6])

print("df0:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(
    df0.interpolate(method='index'), df0.interpolate(method='linear')))
print("df1:\nmethod_index:\n{}\nmethod_linear:\n{}\n".format(
    df1.interpolate(method='index'), df1.interpolate(method='linear')))
Outputs:
df0:
method_index:
   val
0  1.0
1  2.0
2  3.0
method_linear:
   val
0  1.0
1  2.0
2  3.0

df1:
method_index:
        val
0  1.000000
1  1.333333
6  3.000000
method_linear:
   val
0  1.0
1  2.0
6  3.0
As you can see, with index=[0, 1, 6] and val=[1.0, NaN, 3.0], the value interpolated at index 1 by method='index' is 1.0 + (3.0 - 1.0) * (1 - 0) / (6 - 0) = 1.333333.
Tracing the runtime of the pandas source code (generic.py -> managers.py -> blocks.py -> missing.py), we can find the implementation that linearly interpolates using the actual numerical values of the index:
NP_METHODS = ["linear", "time", "index", "values"]

if method in NP_METHODS:
    # np.interp requires sorted X values, #21037
    indexer = np.argsort(inds[valid])
    result[invalid] = np.interp(
        inds[invalid], inds[valid][indexer], yvalues[valid][indexer]
    )
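To make the mechanics concrete, here is a minimal sketch (not from the original answer) that reproduces df1's method='index' result by calling np.interp directly, with the index labels as x-coordinates:
import numpy as np
x_known = np.array([0.0, 6.0])  # index labels of the non-NaN rows
y_known = np.array([1.0, 3.0])  # their values
print(np.interp(1.0, x_known, y_known))  # 1.3333..., matching method='index'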

Binarize a continuous feature with NaNs Python

I have a pandas dataframe of 4000 rows and 35 features, in which some of the continuous features contain missing values (NaNs). For example, one of them (with 46 missing values) has a very left-skewed distribution, and I would like to binarize it with a threshold of 1.5: values below 1.5 become class 0, and values greater than or equal to 1.5 become class 1.
Like: X_original = [0.01,2.80,-1.74,1.34,1.55], X_bin = [0, 1, 0, 0, 1].
I tried doing: dataframe["bin"] = (dataframe["original"] > 1.5).astype(int).
However, I noticed that the missing values (NaNs) disappeared: they were encoded as class 0.
How could I solve this problem?
To the best of my knowledge there is no way to keep the missing values through a comparison, but you can do the following:
import pandas as pd
import numpy as np

X_original = pd.Series([0.01, 2.80, -1.74, np.nan, 1.55])
X_bin = (X_original > 1.5).astype(float)
X_bin[X_original.isna()] = np.nan
print(X_bin)
Output
0    0.0
1    1.0
2    0.0
3    NaN
4    1.0
dtype: float64
To keep the column as Integer (and also nullable), do:
X_bin = X_bin.astype(pd.Int8Dtype())
print(X_bin)
Output
0       0
1       1
2       0
3    <NA>
4       1
dtype: Int8
The best way I found to handle this issue was to use a list comprehension:
dataframe["Bin"] = [0 if el < 1.5 else 1 if el >= 1.5 else np.nan for el in dataframe["Original"]]
Then I convert the float numbers to strings, leaving the np.nan values as they are:
dataframe["Bin"] = dataframe["Bin"].replace([0.0, 1.0], ["0", "1"])

How to show multiple timeseries plots using seaborn

I'm trying to generate 4 plots from a DataFrame using Seaborn
Date A B C D
2019-04-05 330.665 161.975 168.69 0
2019-04-06 322.782 150.243 172.539 0
2019-04-07 322.782 150.243 172.539 0
2019-04-08 295.918 127.801 168.117 0
2019-04-09 282.674 126.894 155.78 0
2019-04-10 293.818 133.413 160.405 0
I have cast the dates using pd.to_datetime and the numbers using pd.to_numeric. Here is the df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 460 to 465
Data columns (total 5 columns):
Date 6 non-null datetime64[ns]
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 6 non-null float64
dtypes: datetime64[ns](1), float64(4)
memory usage: 288.0 bytes
I can do a wide column plot by just calling .plot() on df.
However:
1. The legend of the plot is covering the plot itself.
2. I would instead like to have 4 separate plots in 1 diagram, and I have tried using lmplot to achieve this.
3. I would like to add labels to the plot, like so:
[image: example plot with labels]
I first melted the data:
df = pd.melt(df, id_vars='Date', var_name='Var', value_name='Unit')
And then tried lmplot:
sns.lmplot(x = df['Date'], y='Unit', col='Var', data=df)
However, I get the traceback:
TypeError: Invalid comparison between dtype=datetime64[ns] and str
I have also tried df.set_index('Date') and replotting using x=df.index, and that gave me the same error.
The data can be plotted using Google Sheets but I am trying to automate a workflow where the chart can be generated and sent via Slack to selected recipients.
I hope I have expressed myself clearly enough as I am rather new to Python and Seaborn and hope to get some help from the experts here.
Regarding the legend, you can just use .legend(loc="upper left", bbox_to_anchor=(1, 1)), as in this example:
%matplotlib inline
import pandas as pd
import numpy as np

data = np.random.rand(10, 4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])
df.plot().legend(loc="upper left", bbox_to_anchor=(1, 1));
While for the second point, IIUC, you can start from:
df.plot(subplots=True, layout=(2, 2));
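If you specifically want Seaborn facets from the melted frame, a sketch along these lines (assuming seaborn 0.9+, where relplot supports kind='line'; column names match the melt above) sidesteps the lmplot datetime issue:
import seaborn as sns
# df is the melted frame with columns Date, Var, Unit
g = sns.relplot(data=df, x='Date', y='Unit', col='Var', col_wrap=2, kind='line')
g.set_xticklabels(rotation=45)
Unlike lmplot, which fits a regression and needs numeric x values, relplot's line plots accept datetime x values directly.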

Convert strings with time suffixes to numbers in numpy

I have a numpy array with values like "1.0s", "100ms", etc. I can't plot this (with pandas, after putting the array into a Series), because pandas doesn't recognize that these are numbers. How can I have numpy or pandas convert these into numbers while paying attention to the suffixes?
- see the question How do I get at the pandas.offsets object given an offset string
- use pandas.tseries.frequencies.to_offset
- convert to timedeltas
- get total seconds
import pandas as pd
from pandas.tseries.frequencies import to_offset

s = pd.Series(['1.0s', '100ms', '10s', '0.5T'])
pd.to_timedelta(s.apply(to_offset)).dt.total_seconds()
0     1.0
1     0.1
2    10.0
3    30.0
dtype: float64
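Note that pd.to_timedelta can parse plain unit strings such as '100ms' on its own; the to_offset step is what makes offset aliases like '0.5T' (half a minute) work as well:
pd.to_timedelta('100ms').total_seconds()
# 0.1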
This code could solve your problem:
import pandas as pd

# Test data
se = pd.Series(['10s', '100ms', '1.0s'])
# Pattern to match a number (integer or float) followed by a unit (ms or s)
pat = r"([0-9]*\.?[0-9]+)(ms|s)"
# Extract the number and the unit into separate columns
df = se.str.extract(pat, flags=0, expand=True)
# Rename the columns
df.columns = ['value', 'unit']
# Convert the extracted strings to numbers
df['value'] = pd.to_numeric(df['value'])
# Convert everything to the same unit (ms)
mask = df['unit'] == 's'
df.loc[mask, 'value'] = df.loc[mask, 'value'] * 1000
df.loc[mask, 'unit'] = 'ms'
# Now you are ready to plot!
print(df['value'])
# 0    10000.0
# 1      100.0
# 2     1000.0
# Name: value, dtype: float64
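Since the original goal was plotting, a minimal follow-up sketch (assuming matplotlib is installed) would be:
import matplotlib.pyplot as plt
df['value'].plot(kind='bar', title='duration (ms)')
plt.show()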