Binarize a continuous feature with NaNs Python - pandas

I have a pandas dataframe of 4000 rows and 35 features, in which some of the continuous features contain missing values (NaNs). For example, one of them (with 46 missing values) has a very left-skewed distribution, and I would like to binarize it with a threshold of 1.5: values below 1.5 become class 0, and values greater than or equal to 1.5 become class 1.
Like: X_original = [0.01,2.80,-1.74,1.34,1.55], X_bin = [0, 1, 0, 0, 1].
I tried doing: dataframe["bin"] = (dataframe["original"] > 1.5).astype(int).
However, I noticed that the missing values (NaNs) disappeared and were encoded as class 0.
How could I solve this problem?

To the best of my knowledge there is no way to keep the missing values through the comparison itself, but you can do the following:
import pandas as pd
import numpy as np

X_original = pd.Series([0.01, 2.80, -1.74, np.nan, 1.55])
X_bin = X_original > 1.5              # boolean comparison; NaN compares as False
X_bin[X_original.isna()] = np.NaN     # put the missing values back
print(X_bin)
Output
0 0.0
1 1.0
2 0.0
3 NaN
4 1.0
dtype: float64
To keep the column as an integer (and also nullable), do:
X_bin = X_bin.astype(pd.Int8Dtype())
print(X_bin)
Output
0 0
1 1
2 0
3 <NA>
4 1
dtype: Int8
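Applied to the asker's frame, the two steps can be chained into one assignment. This is just a sketch: the column names "original" and "bin" are taken from the question, and >= 1.5 follows the question's "above or equal" rule:
dataframe["bin"] = (
    (dataframe["original"] >= 1.5)          # boolean comparison; NaN compares as False
    .astype(pd.Int8Dtype())                 # nullable 0/1 integers
    .mask(dataframe["original"].isna())     # restore the missing values as <NA>
)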

The best way I found to handle this issue was to use a list comprehension:
dataframe["Bin"] = [0 if el < 1.5 else 1 if el >= 1.5 else np.NaN for el in dataframe["Original"]]
Then I convert the float values to strings, leaving the np.NaN untouched:
dataframe["Bin"] = dataframe["Bin"].replace([0.0, 1.0], ["0", "1"])

Related

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a pandas Series using a user-defined function, and to write the output into a new column. I figured out the individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x = 'd7e'
y = '8e0d'
s = 0
for i in y:
    b = str(i)
    if b not in x:
        s += 0
    else:
        s += 1
print(s)
The right result for these particular strings is 2.
Note: when I wrap this in def func(x,y):, something happens to the s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use the df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
    lambda x: np.NaN
    if pd.isna(x['temp'])
    else sum(i in str(x['temp']) for i in str(x['Code'])),
    axis=1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0
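For comparison, here is a sketch of the asker's original plan with a standalone function: the counter lives inside the function, so it resets on every call (the function name func and the column names follow the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})

def func(current, previous):
    # Count how many characters of `current` also appear in `previous`.
    if pd.isna(previous):
        return np.nan
    return sum(ch in previous for ch in current)

df["R1SB"] = [func(c, p) for c, p in zip(df["Code"], df["Code"].shift(1))]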

Creating a base 100 Index from time series that begins with a number of NaNs

I have the following dataframe (time-series of returns truncated for succinctness):
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])})
I'm trying to start the index (i.e., "base-100") at the last NaN before the first return, while keeping the NaNs that precede the 100 value in place (thinking in terms of appending to the existing dataframe and for graphing purposes).
I only have found a way to create said index when there are no NaNs in the return vector:
df['index'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
Any ideas - thx in advance!
If your initial array is
zz = np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])
Then you can obtain your desired output like this (although there's probably a more optimized way to do it):
np.concatenate((zz[:np.argmax(np.isfinite(zz))],
                100*np.exp(np.cumsum(zz[np.isfinite(zz)]))))
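If the result should sit next to the original data (e.g. for graphing), the same expression can be assigned back into a column. A minimal sketch, reusing the 'index' column name from the question's own attempt:
import numpy as np
import pandas as pd

df = pd.DataFrame({'return': [np.nan, np.nan, np.nan, 0.015, -0.024,
                              0.033, 0.021, 0.014, -0.092]})
zz = df['return'].to_numpy()
first_valid = np.argmax(np.isfinite(zz))            # position of the first non-NaN return
df['index'] = np.concatenate((zz[:first_valid],     # leading NaNs kept as-is
                              100*np.exp(np.cumsum(zz[np.isfinite(zz)]))))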
Use Series.isna, reverse the order by indexing, and get the index of the last NaN with Series.idxmax:
idx = df['return'].isna().iloc[::-1].idxmax()
Pass it to DataFrame.loc, replace the missing value, and take the cumulative sum:
df['return'] = df.loc[idx:, 'return'].fillna(100).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
You can use Series.isna with Series.cumsum and compare against the maximum, then replace the last NaN with Series.fillna and finally take the cumulative sum:
s = df['return'].isna().cumsum()
df['return'] = df['return'].mask(s.eq(s.max()), df['return'].fillna(100)).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967

Pandas groupby in combination with sklearn preprocessing continued

Continuing from this post:
Pandas groupby in combination with sklearn preprocessing
I need to preprocess the data by scaling it within groups defined by two columns, but I somehow get an error with the second method:
import pandas as pd
import numpy as np
from sklearn.preprocessing import robust_scale,minmax_scale
df = pd.DataFrame(dict(id=list('AAAAABBBBB'),
                       loc=(10, 20, 10, 20, 10, 20, 10, 20, 10, 20),
                       value=(0, 10, 10, 20, 100, 100, 200, 30, 40, 100)))
df['new'] = df.groupby(['id','loc']).value.transform(lambda x: minmax_scale(x.astype(float)))
df['new'] = df.groupby(['id','loc']).value.transform(lambda x: robust_scale(x))
The second one gives me an error like this:
ValueError: Expected 2D array, got 1D array instead: array=[  0.  10. 100.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
If I use reshape, I get an error like this:
Exception: Data must be 1-dimensional
If I print out the grouped data, g['value'] is a pandas Series:
for n, g in df.groupby(['id','loc']):
    print(type(g['value']))
Do you know what might cause it?
Thanks.
Based on the error message, you should add a reshape and then concatenate:
df.groupby(['id','loc']).value.transform(lambda x:np.concatenate(robust_scale(x.values.reshape(-1,1))))
Out[606]:
0 -0.2
1 -1.0
2 0.0
3 1.0
4 1.8
5 0.0
6 1.0
7 -2.0
8 -1.0
9 0.0
Name: value, dtype: float64
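An equivalent sketch that flattens with ravel() instead of np.concatenate (same data as the question; assumes sklearn is installed):
import pandas as pd
from sklearn.preprocessing import robust_scale

df = pd.DataFrame(dict(id=list('AAAAABBBBB'),
                       loc=(10, 20, 10, 20, 10, 20, 10, 20, 10, 20),
                       value=(0, 10, 10, 20, 100, 100, 200, 30, 40, 100)))
df['new'] = df.groupby(['id','loc']).value.transform(
    lambda x: robust_scale(x.values.reshape(-1, 1)).ravel())    # 2-D in, 1-D out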

pandas using qcut on series with fewer values than quantiles

I have thousands of series (rows of a DataFrame) that I need to apply qcut to. Periodically there will be a series (row) that has fewer values than the desired number of quantiles (say, 1 value vs 2 quantiles):
>>> s = pd.Series([5, np.nan, np.nan])
When I apply .quantile() to it, it has no problem breaking it into 2 quantiles (with the same boundary value):
>>> s.quantile([0.5, 1])
0.5 5.0
1.0 5.0
dtype: float64
But when I apply .qcut() with an integer number of quantiles, an error is thrown:
>>> pd.qcut(s, 2)
...
ValueError: Bin edges must be unique: array([ 5., 5., 5.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
Even after I set the duplicates argument, it still fails:
>>> pd.qcut(s, 2, duplicates='drop')
....
IndexError: index 0 is out of bounds for axis 0 with size 0
How do I make this work? (And equivalently, pd.qcut(s, [0, 0.5, 1], duplicates='drop') also doesn't work.)
The desired output is to have the 5.0 assigned to a single bin and the NaNs preserved:
0 (4.999, 5.000]
1 NaN
2 NaN
Ok, this is a workaround which might work for you.
pd.qcut(s, len(s.dropna()), duplicates='drop')
Out[655]:
0 (4.999, 5.0]
1 NaN
2 NaN
dtype: category
Categories (1, interval[float64]): [(4.999, 5.0]]
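If this has to run over thousands of rows, the workaround can be wrapped in a small helper so the requested number of quantiles never exceeds the number of non-NaN values. The helper name qcut_or_fewer is hypothetical, just for illustration:
import numpy as np
import pandas as pd

def qcut_or_fewer(s, q):
    # Never request more quantiles than there are non-NaN values.
    n = s.count()
    return pd.qcut(s, min(q, n), duplicates='drop') if n else s

s = pd.Series([5, np.nan, np.nan])
print(qcut_or_fewer(s, 2))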
You can try filling your object/numeric columns with an appropriate fill value ('null' for strings and 0 for numerics):
#fill numeric cols with 0
numeric_columns = df.select_dtypes(include=['number']).columns
df[numeric_columns] = df[numeric_columns].fillna(0)
#fill object cols with null
string_columns = df.select_dtypes(include=['object']).columns
df[string_columns] = df[string_columns].fillna('null')
Use Python 3.5 instead of Python 2.7.
This worked for me.

cannot convert nan to int (but there are no nans)

I have a dataframe with a column of floats that I want to convert to int:
> df['VEHICLE_ID'].head()
0 8659366.0
1 8659368.0
2 8652175.0
3 8652174.0
4 8651488.0
In theory I should just be able to use:
> df['VEHICLE_ID'] = df['VEHICLE_ID'].astype(int)
But I get:
Output: ValueError: Cannot convert NA to integer
But I am pretty sure that there are no NaNs in this series:
> df['VEHICLE_ID'].fillna(999,inplace=True)
> df[df['VEHICLE_ID'] == 999]
> Output: Empty DataFrame
Columns: [VEHICLE_ID]
Index: []
What's going on?
Basically the error is telling you that you have NaN values, and I will show why your attempts didn't reveal this:
In [7]:
# setup some data
df = pd.DataFrame({'a':[1.0, np.NaN, 3.0, 4.0]})
df
Out[7]:
a
0 1.0
1 NaN
2 3.0
3 4.0
now try to cast:
df['a'].astype(int)
this raises:
ValueError: Cannot convert NA to integer
but then you tried something like this:
In [5]:
for index, row in df['a'].iteritems():
    if row == np.NaN:
        print('index:', index, 'isnull')
This printed nothing, because NaN cannot be tested with equality like this; in fact, it has the special property of returning False when compared against itself:
In [6]:
for index, row in df['a'].iteritems():
    if row != row:
        print('index:', index, 'isnull')
index: 1 isnull
Now it prints the row. You should use pd.isnull for readability:
In [9]:
for index, row in df['a'].iteritems():
    if pd.isnull(row):
        print('index:', index, 'isnull')
index: 1 isnull
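For completeness, a vectorized check (a sketch, using the same toy frame as above) avoids looping at all:
print(df['a'].isna().any())    # True -> at least one missing value
print(df['a'].isna().sum())    # 1    -> how many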
So what to do? We can drop the rows: df.dropna(subset='a'), or we can replace using fillna:
In [8]:
df['a'].fillna(0).astype(int)
Out[8]:
0 1
1 0
2 3
3 4
Name: a, dtype: int32
When your series contains floats and NaNs and you want to convert to integers, converting to a numpy integer will raise an error because of the NA values.
DON'T DO:
df['VEHICLE_ID'] = df['VEHICLE_ID'].astype(int)
From pandas >= 0.24 there is a built-in nullable pandas integer dtype, which does allow integer NAs. Notice the capital 'I' in 'Int64': this is the pandas integer dtype, not the numpy one.
SO, DO THIS:
df['VEHICLE_ID'] = df['VEHICLE_ID'].astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
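For reference, a minimal sketch of the nullable dtype on the toy frame from the first answer (illustrative only):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0]})
print(df['a'].astype('Int64'))
# 0       1
# 1    <NA>
# 2       3
# 3       4
# Name: a, dtype: Int64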