Remove nan from pandas binner - pandas

I have created the following pandas dataframe called train:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
ds = {
'matchKey' : [621062, 622750, 623508, 626451, 626611, 626796, 627114, 630055, 630225],
'og_max_last_dpd' : [10, 10, -99999, 10, 10, 10, 10, 10, 10],
'og_min_last_dpd' : [10, 10, -99999, 10, 10, 10, 10, 10, 10],
'og_max_max_dpd' : [0, 0, -99999, 1, 0, 5, 0, 4, 0],
'Target':[1,0,1,0,0,1,1,1,0]
}
train = pd.DataFrame(data=ds)
The dataframe looks like this:
print(train)
matchKey og_max_last_dpd og_min_last_dpd og_max_max_dpd Target
0 621062 10 10 0 1
1 622750 10 10 0 0
2 623508 -99999 -99999 -99999 1
3 626451 10 10 1 0
4 626611 10 10 0 0
5 626796 10 10 5 1
6 627114 10 10 0 1
7 630055 10 10 4 1
8 630225 10 10 0 0
I have then binned the column called og_max_max_dpd using this code:
def mono_bin(Y, X, char, n=20):
    X2 = X.fillna(-99999)
    r = 0
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X2, "Y": Y, "Bucket": pd.qcut(X2, n, duplicates="drop")})#,include_lowest=True
        d2 = d1.groupby("Bucket", as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.min().X, columns=["min_" + X.name])
    d3["max_" + X.name] = d2.max().X
    d3[Y.name] = d2.sum().Y
    d3["total"] = d2.count().Y
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
    pinf = float("+inf")
    array = []
    for i in range(len(d4) - 1):
        array.append(d4["max_" + char].iloc[i])
    return [ninf] + array + [pinf]
binner = mono_bin(train['Target'], train['og_max_max_dpd'], 'og_max_max_dpd')
I have printed out the binner which looks like this:
print(binner)
[-inf, -99999.0, nan, 0.0, nan, nan, 1.0, nan, nan, 4.0, nan, inf]
I want to remove the nan from that list so that the binner looks like this:
[-inf, -99999.0, 0.0, 1.0, 4.0, inf]
Does anyone know how to remove the nan?

You can simply use dropna to remove it from d4:
    ...
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    d4.dropna(inplace=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
    ...
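As a side note, the NaN entries most likely come from quantile buckets that contain no rows: grouping by a categorical column keeps empty categories unless told otherwise, and aggregations such as max return NaN for them. The sketch below uses toy data (not the data from the question) to show that effect, how `observed=True` avoids it, and, as a fallback, how to filter NaN out of the finished list. Adding `observed=True` to the `groupby` call inside `mono_bin` is an assumption about what would work there, not something tested against the original function.

```python
import math

import pandas as pd

# Toy data (not the asker's): the (2, 3] bucket below receives no values.
s = pd.Series([0, 0, 1, 4, 5])
buckets = pd.cut(s, bins=[-1, 0, 2, 3, 5])

# observed=False keeps every category, so the empty bucket aggregates to NaN.
max_all = s.groupby(buckets, observed=False).max()

# observed=True keeps only the buckets that actually contain data.
max_obs = s.groupby(buckets, observed=True).max()

# Fallback: filter NaN out of the finished list of bin edges.
nan = float("nan")
binner = [float("-inf"), -99999.0, nan, 0.0, nan, nan, 1.0, nan, nan, 4.0, nan, float("inf")]
clean = [edge for edge in binner if not math.isnan(edge)]
print(clean)
```

Filtering the list leaves the function untouched, while `observed=True` stops the empty buckets from being created in the first place.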

Related

Pandas : How to Apply a Condition on Every Values of a Dataframe, Based on a Second Symmetrical Dataframe

I have a dictionary with 2 DF : "quantity variation in %" and "prices". They are both symmetrical DF.
Let's say I want to set the price = 0 if the quantity variation in percentage is greater than 100 %
import numpy as np; import pandas as pd
d = {'qty_pct': pd.DataFrame({ '2020': [200, 0.5, 0.4],
'2021': [0.9, 0.5, 500],
'2022': [0.9, 300, 0.4]}),
'price': pd.DataFrame({ '2020': [-6, -2, -9],
'2021': [ 2, 3, 4],
'2022': [ 4, 6, 8]})}
# I had something like that in mind ...
df = d['price'].applymap(lambda x: 0 if x[d['qty_pct']] >=1 else x)
P.S. If by any chance there is a way to do this on asymmetrical DF, I would be curious to see how it's done.
Thanks,
I want to obtain this DF :
price = pd.DataFrame({'2020': [ 0, -2, -9],
'2021': [ 2, 3, 0],
'2022': [ 4, 0, 8]})
Assume price and qty_pct always have the same dimension, then you can just do:
d['price'][d['qty_pct'] >= 1] = 0
d['price']
2020 2021 2022
0 0 2 4
1 -2 3 0
2 -9 0 8
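A non-mutating variant of the same idea is `DataFrame.mask`, which returns a new frame instead of assigning in place. For the P.S. about frames of different shapes, one possible approach (an assumption about what "asymmetrical" means here) is to reindex the boolean condition to the shape of the price frame and treat missing cells as "do not replace". The names `qty_small`, `cond`, `result` and `result_small` are made up for this sketch.

```python
import pandas as pd

qty_pct = pd.DataFrame({"2020": [200, 0.5, 0.4],
                        "2021": [0.9, 0.5, 500],
                        "2022": [0.9, 300, 0.4]})
price = pd.DataFrame({"2020": [-6, -2, -9],
                      "2021": [2, 3, 4],
                      "2022": [4, 6, 8]})

# Same shape: replace prices wherever the quantity variation is >= 1 (100 %).
result = price.mask(qty_pct >= 1, 0)

# Different shape (hypothetical): qty_small covers only part of the price frame.
qty_small = qty_pct.iloc[:2, :2]
cond = (qty_small >= 1).reindex(index=price.index,
                                columns=price.columns,
                                fill_value=False)
result_small = price.mask(cond, 0)

print(result)
print(result_small)
```

Cells that have no counterpart in the smaller frame keep their original price because the reindexed condition is False there.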

How to create a multiIndex (hierarchical index) dataframe object from another df's column's unique values?

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.
Is there an easier way to have this information summarized besides creating this dataframe?
Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...
Original Dataframe:
data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
'B': [2001, 2002, 2003, 2004, 2005],
'C': ['F', 'M', 'F', 'F', 'M'],
'D': [0, 0, 0, 1, 0],
'E': [1, 0, 1, 0, 1],
'F': [1, 1, 0, 0, 0]})
A B C D E F
0 case 2001 F 0 1 1
1 case 2002 M 0 0 1
2 case 2003 F 0 1 0
3 case 2004 F 1 0 0
4 case 2005 M 0 1 0
Desired outcome:
unique percent
A case 100
B 2001 20
2002 20
2003 20
2004 20
2005 20
C F 60
M 40
D 0 80
1 20
E 0 40
1 60
F 0 60
1 40
My failed for loop attempt:
def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df
unique_values(data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
11
12
---> 13 unique_values(data)
<ipython-input-84-a341284fb859> in unique_values(df)
5 for col in data:
6 df2 = data[col].value_counts(normalize=True)*100
----> 7 values = values.update(df2.to_dict)
8 columns = columns.append(col*len(df2))
9 return df
TypeError: 'method' object is not iterable
Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.
This is a fairly straightforward application of .melt:
data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
output
index
variable value
A case 1.0
B 2001 0.2
2002 0.2
2003 0.2
2004 0.2
2005 0.2
C F 0.6
M 0.4
D 0 0.8
1 0.2
E 0 0.4
1 0.6
F 0 0.6
1 0.4
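For reference, the loop in the question fails because `df2.to_dict` is passed without parentheses (so a method object is handed to `update`), and `dict.update` and `list.append` both return `None`, so reassigning their results would break the next iteration anyway. A sketch that stays closer to the attempted loop and yields percentages is to collect one `value_counts` result per column and let `pd.concat` build the two-level index from the dictionary keys. The level names `column` and `unique` and the series name `percent` are choices made for this example.

```python
import pandas as pd

data = pd.DataFrame({"A": ["case", "case", "case", "case", "case"],
                     "B": [2001, 2002, 2003, 2004, 2005],
                     "C": ["F", "M", "F", "F", "M"],
                     "D": [0, 0, 0, 1, 0],
                     "E": [1, 0, 1, 0, 1],
                     "F": [1, 1, 0, 0, 0]})

# One Series of percentages per column, keyed by column name.
per_column = {col: data[col].value_counts(normalize=True) * 100 for col in data}

# Concatenating a dict of Series uses the keys as the outer index level.
summary = pd.concat(per_column).rename("percent")
summary.index.names = ["column", "unique"]

print(summary)
```

Within each column the rows are ordered by frequency, which is how `value_counts` sorts by default.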
I'm sorry! I've written an answer, but it's in JavaScript. I came here thinking I had clicked on the javascript tag and started coding, and only noticed when posting that you're coding in Python.
I will post it anyway; maybe it will help you. Python is not that much different from JavaScript ;-)
const data = {
A: ["case", "case", "case", "case", "case"],
B: [2001, 2002, 2003, 2004, 2005],
C: ["F", "M", "F", "F", "M"],
D: [0, 0, 0, 1, 0],
E: [1, 0, 1, 0, 1],
F: [1, 1, 0, 0, 0]
};
const getUniqueStats = (_data) => {
  const results = [];
  for (let row in _data) {
    // create list of unique values
    const s = [...new Set(_data[row])];
    // filter for unique values and count them for percentage, then push
    results.push({ index: row, values: s.map((x) => ({ unique: x, percentage: (_data[row].filter((y) => y === x).length / _data[row].length) * 100 })) });
  }
  return results;
};
const results = getUniqueStats(data);
results.forEach((row) =>
row.values.forEach((value) =>
console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
)
);

Group Pandas dataframe Age column by Age groups [duplicate]

I have a data frame column with numeric values:
df['percentage'].head()
46.5
44.2
100.0
42.12
I want to see the column as bin counts:
bins = [0, 1, 5, 10, 25, 50, 100]
How can I get the result as bins with their value counts?
[0, 1] bin amount
[1, 5] etc
[5, 10] etc
...
You can use pandas.cut:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = pd.cut(df['percentage'], bins)
print (df)
percentage binned
0 46.50 (25, 50]
1 44.20 (25, 50]
2 100.00 (50, 100]
3 42.12 (25, 50]
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Or numpy.searchsorted:
bins = [0, 1, 5, 10, 25, 50, 100]
df['binned'] = np.searchsorted(bins, df['percentage'].values)
print (df)
percentage binned
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
...and then value_counts or groupby and aggregate size:
s = pd.cut(df['percentage'], bins=bins).value_counts()
print (s)
(25, 50] 3
(50, 100] 1
(10, 25] 0
(5, 10] 0
(1, 5] 0
(0, 1] 0
Name: percentage, dtype: int64
s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
print (s)
percentage
(0, 1] 0
(1, 5] 0
(5, 10] 0
(10, 25] 0
(25, 50] 3
(50, 100] 1
dtype: int64
By default cut returns categorical.
Series methods such as Series.value_counts() use all categories, even those that are not present in the data (see "operations" in the pandas documentation on categorical data).
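A small sketch of that behaviour, using the four sample values from the question: the empty bins still appear with a count of zero, and sorting by index puts the bins back in their natural order. The variable name `counts` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"percentage": [46.5, 44.2, 100.0, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

# value_counts sorts by frequency; sort_index restores the bin order.
counts = pd.cut(df["percentage"], bins=bins).value_counts().sort_index()
print(counts)
```

Note that the intervals are right-closed by default, so a value of exactly 0 would fall outside the first bin unless `include_lowest=True` is passed to `pd.cut`.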
Using the Numba module for speed up.
On big datasets (more than 500k rows), pd.cut can be quite slow for binning data.
I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:
from numba import njit
@njit
def cut(arr):
    # Bins here are left-closed, [lo, hi), unlike pd.cut's default right-closed (lo, hi]
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 5):
            bins[idx] = 2
        elif (x >= 5) & (x < 10):
            bins[idx] = 3
        elif (x >= 10) & (x < 25):
            bins[idx] = 4
        elif (x >= 25) & (x < 50):
            bins[idx] = 5
        elif (x >= 50) & (x < 100):
            bins[idx] = 6
        else:
            bins[idx] = 7
    return bins
cut(df['percentage'].to_numpy())
# array([5., 5., 7., 5.])
Optional: you can also map it to bins as strings:
a = cut(df['percentage'].to_numpy())
conversion_dict = {1: 'bin1',
2: 'bin2',
3: 'bin3',
4: 'bin4',
5: 'bin5',
6: 'bin6',
7: 'bin7'}
bins = list(map(conversion_dict.get, a))
# ['bin5', 'bin5', 'bin7', 'bin5']
Speed comparison:
# Create a dataframe of 8 million rows for testing
dfbig = pd.concat([df]*2000000, ignore_index=True)
dfbig.shape
# (8000000, 1)
%%timeit
cut(dfbig['percentage'].to_numpy())
# 38 ms ± 616 µs per loop (mean ± standard deviation of 7 runs, 10 loops each)
%%timeit
bins = [0, 1, 5, 10, 25, 50, 100]
labels = [1,2,3,4,5,6]
pd.cut(dfbig['percentage'], bins=bins, labels=labels)
# 215 ms ± 9.76 ms per loop (mean ± standard deviation of 7 runs, 10 loops each)
We could also use np.select:
bins = [0, 1, 5, 10, 25, 50, 100]
df['groups'] = (np.select([df['percentage'].between(i, j, inclusive='right')
for i,j in zip(bins, bins[1:])],
[1, 2, 3, 4, 5, 6]))
Output:
percentage groups
0 46.50 5
1 44.20 5
2 100.00 6
3 42.12 5
Convenient and fast version using Numpy
np.digitize is a convenient and fast option:
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,3,4,5]})
df['y'] = np.digitize(df['x'], bins=[3,5])
print(df)
returns
x y
0 1 0
1 2 0
2 3 1
3 4 1
4 5 2
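By default `np.digitize` treats bins as left-closed, which is why 3 and 5 above land in the next bin up. To reproduce the right-closed labels that `pd.cut` gave earlier for the `percentage` column, `right=True` can be passed. The sketch below rebuilds the four sample values from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"percentage": [46.5, 44.2, 100.0, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

# right=True assigns index i when bins[i-1] < x <= bins[i],
# matching the (lo, hi] intervals used by pd.cut.
df["binned"] = np.digitize(df["percentage"], bins=bins, right=True)
print(df)
```

With `right=True` the value 100.0 falls in bin 6 rather than overflowing into a seventh bin.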

Assign conditional values to columns in Dask

I am trying to conditionally assign values to the rows of a specific column, target. I did some research, and the answer seemed to be given in "How to do row processing and item assignment in dask".
Here is a reproduction of what I need, using a mock data set:
x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))
The look of mock is:
In [4]: mock.head(7)
Out [4]:
speed target
0 200 3
1 300 0
2 400 3
3 215 4
4 219 0
5 360 0
6 280 0
Having this Pandas DataFrame, I convert it into a Dask DataFrame:
mock_dask = dd.from_pandas(mock, npartitions = 2)
I apply my conditional rule: every value in target above 0 must become 1, and all others 0 (that is, binarize target). Following the thread mentioned above, it should be:
result = mock_dask.target.where(mock_dask.target > 0, 1)
I have a look at the result dataset and it is not working as expected:
In [7]: result.head(7)
Out [7]:
0 3
1 1
2 3
3 4
4 1
5 1
6 1
Name: target, dtype: object
As we can see, the target column in result is not what I expected. My code seems to convert all the original 0 values to 1, instead of converting the values greater than 0 to 1 (the conditional rule).
Dask newbie here, Thanks in advance for your help.
OK, the Dask DataFrame API documentation is pretty clear. Thanks to @MRocklin's feedback I have realized my mistake. In the documentation, the where function (the last one in the list) is described with the following syntax:
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Thus, the correct code line would be:
result = mock_dask.target.where(mock_dask.target <= 0, 1)
This will output:
In [7]: result.head(7)
Out [7]:
0 1
1 0
2 1
3 1
4 0
5 0
6 0
Name: target, dtype: int64
Which is the expected output.
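The rule is easy to get backwards because `where` keeps the values for which the condition is True and replaces the rest. The sketch below uses plain pandas so that it runs without a Dask cluster; the assumption is that the Dask Series methods of the same names behave the same way, which matches the pandas/Dask comparison posted in this thread for `where`. It shows three equivalent ways to binarize the column.

```python
import pandas as pd

x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target=x, speed=y))

# where: keep values where the condition holds, replace the others with 1.
via_where = mock["target"].where(mock["target"] <= 0, 1)

# mask: the mirror image, replace values where the condition holds.
via_mask = mock["target"].mask(mock["target"] > 0, 1)

# Casting the boolean condition states the intent most directly.
via_cast = (mock["target"] > 0).astype("int64")

print(via_cast.head(7).tolist())
```

The cast version also behaves sensibly if negative values ever appear in the column, whereas the `where` version would leave them unchanged.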
They seem to be the same to me
In [1]: import pandas as pd
In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
...: mock = pd.DataFrame(dict(target = x, speed = y))
...:
In [3]: import dask.dataframe as dd
In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)
In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64

Numpy.piecewise not working as intended

In[2]: from numpy import *
In[3]: alpha = lambda x: piecewise(x,[x <= 4, 4 < x <= 24, x > 24], [10, 20, 50])
In[4]: print(alpha(5))
0
In[5]: print(alpha(3))
10
In[6]: print(alpha(26))
0
Why isn't this working? There are 3 conditions and 3 functions.
I found out that select does what I want:
In[2]: from numpy import *
In[3]: alpha = lambda x: select([x <= 4, (4 < x) & (x <= 24), x > 24], [10, 20, 50])
In[4]: print(alpha(5))
20
In[5]: print(alpha(3))
10
In[6]: print(alpha(26))
50
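One likely culprit in the original attempt is the chained comparison `4 < x <= 24`, which Python evaluates as `(4 < x) and (x <= 24)` and which does not work elementwise on arrays. As a sketch, assuming the input is converted to a NumPy array first, `np.piecewise` gives the same answers as `np.select` once the middle condition is written with `&`. The function names `alpha_piecewise` and `alpha_select` are made up for this example.

```python
import numpy as np


def alpha_piecewise(x):
    # Work on an array so that each condition is a boolean array.
    x = np.asarray(x)
    conditions = [x <= 4, (x > 4) & (x <= 24), x > 24]
    return np.piecewise(x, conditions, [10, 20, 50])


def alpha_select(x):
    x = np.asarray(x)
    conditions = [x <= 4, (x > 4) & (x <= 24), x > 24]
    return np.select(conditions, [10, 20, 50])


values = np.array([3, 5, 26])
print(alpha_piecewise(values))
print(alpha_select(values))
```

Note that `np.piecewise` returns an array with the same dtype as its input, so integer inputs produce integer outputs.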