How to create a MultiIndex (hierarchical index) DataFrame from the unique values of another DataFrame's columns? - pandas

I'm trying to create a pandas multiIndexed dataframe that is a summary of the unique values in each column.
Is there an easier way to have this information summarized besides creating this dataframe?
Either way, it would be nice to know how to complete this code challenge. Thanks for your help! Here is the toy dataframe and the solution I attempted using a for loop with a dictionary and a value_counts dataframe. Not sure if it's possible to incorporate MultiIndex.from_frame or .from_product here somehow...
Original Dataframe:
data = pd.DataFrame({'A': ['case', 'case', 'case', 'case', 'case'],
                     'B': [2001, 2002, 2003, 2004, 2005],
                     'C': ['F', 'M', 'F', 'F', 'M'],
                     'D': [0, 0, 0, 1, 0],
                     'E': [1, 0, 1, 0, 1],
                     'F': [1, 1, 0, 0, 0]})
A B C D E F
0 case 2001 F 0 1 1
1 case 2002 M 0 0 1
2 case 2003 F 0 1 0
3 case 2004 F 1 0 0
4 case 2005 M 1 1 0
Desired outcome:
unique percent
A case 100
B 2001 20
2002 20
2003 20
2004 20
2005 20
C F 60
M 40
D 0 80
1 20
E 0 40
1 60
F 0 60
1 40
My failed for loop attempt:
def unique_values(df):
    values = {}
    columns = []
    df = pd.DataFrame(values, columns=columns)
    for col in data:
        df2 = data[col].value_counts(normalize=True)*100
        values = values.update(df2.to_dict)
        columns = columns.append(col*len(df2))
    return df

unique_values(data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-84-a341284fb859> in <module>
     11
     12
---> 13 unique_values(data)

<ipython-input-84-a341284fb859> in unique_values(df)
      5     for col in data:
      6         df2 = data[col].value_counts(normalize=True)*100
----> 7         values = values.update(df2.to_dict)
      8         columns = columns.append(col*len(df2))
      9     return df

TypeError: 'method' object is not iterable
Let me know if there's something obvious I'm missing! Still relatively new to EDA and pandas, any pointers appreciated.

This is a fairly straightforward application of .melt:
data.melt().reset_index().groupby(['variable', 'value']).count()/len(data)
output
index
variable value
A case 1.0
B 2001 0.2
2002 0.2
2003 0.2
2004 0.2
2005 0.2
C F 0.6
M 0.4
D 0 0.8
1 0.2
E 0 0.4
1 0.6
F 0 0.6
1 0.4
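To match the layout in the question (percentages rather than fractions and a percent column), the same idea can be adjusted slightly. A sketch, where the column name and chaining are my own choice rather than anything mandated by pandas:
out = (data.melt()
           .groupby(['variable', 'value'])
           .size()
           .div(len(data))
           .mul(100)
           .rename('percent')
           .to_frame())
print(out)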

I'm sorry! I've written an answer, but it's in javascript. I came here after I thought I'd clicked on a javascript question and started coding, but on posting I saw that you're coding in Python.
I will post it anyway, maybe it will help you. Python is not that much different from javascript ;-)
const data = {
  A: ["case", "case", "case", "case", "case"],
  B: [2001, 2002, 2003, 2004, 2005],
  C: ["F", "M", "F", "F", "M"],
  D: [0, 0, 0, 1, 0],
  E: [1, 0, 1, 0, 1],
  F: [1, 1, 0, 0, 0]
};

const getUniqueStats = (_data) => {
  const results = [];
  for (let row in _data) {
    // create list of unique values
    const s = [...new Set(_data[row])];
    // filter for unique values and count them for percentage, then push
    results.push({
      index: row,
      values: s.map((x) => ({
        unique: x,
        percentage: (_data[row].filter((y) => y === x).length / _data[row].length) * 100
      }))
    });
  }
  return results;
};

const results = getUniqueStats(data);
results.forEach((row) =>
  row.values.forEach((value) =>
    console.log(`${row.index}\t${value.unique}\t${value.percentage}%`)
  )
);
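For reference, here is a rough Python version of the same idea, a sketch built on value_counts (the function name unique_stats is just illustrative):
def unique_stats(df):
    # for each column, list its unique values together with their percentage share
    results = []
    for col in df.columns:
        counts = df[col].value_counts(normalize=True) * 100
        results.append({'index': col,
                        'values': [{'unique': val, 'percentage': pct}
                                   for val, pct in counts.items()]})
    return results

for row in unique_stats(data):
    for value in row['values']:
        print(f"{row['index']}\t{value['unique']}\t{value['percentage']}%")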

Related

Remove nan from pandas binner

I have created the following pandas dataframe called train:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
ds = {
    'matchKey': [621062, 622750, 623508, 626451, 626611, 626796, 627114, 630055, 630225],
    'og_max_last_dpd': [10, 10, -99999, 10, 10, 10, 10, 10, 10],
    'og_min_last_dpd': [10, 10, -99999, 10, 10, 10, 10, 10, 10],
    'og_max_max_dpd': [0, 0, -99999, 1, 0, 5, 0, 4, 0],
    'Target': [1, 0, 1, 0, 0, 1, 1, 1, 0]
}
train = pd.DataFrame(data=ds)
The dataframe looks like this:
print(train)
matchKey og_max_last_dpd og_min_last_dpd og_max_max_dpd Target
0 621062 10 10 0 1
1 622750 10 10 0 0
2 623508 -99999 -99999 -99999 1
3 626451 10 10 1 0
4 626611 10 10 0 0
5 626796 10 10 5 1
6 627114 10 10 0 1
7 630055 10 10 4 1
8 630225 10 10 0 0
I have then binned the column called og_max_max_dpd using this code:
def mono_bin(Y, X, char, n=20):
    X2 = X.fillna(-99999)
    r = 0
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X2, "Y": Y, "Bucket": pd.qcut(X2, n, duplicates="drop")})  # ,include_lowest=True
        d2 = d1.groupby("Bucket", as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.min().X, columns=["min_" + X.name])
    d3["max_" + X.name] = d2.max().X
    d3[Y.name] = d2.sum().Y
    d3["total"] = d2.count().Y
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
    pinf = float("+inf")
    array = []
    for i in range(len(d4) - 1):
        array.append(d4["max_" + char].iloc[i])
    return [ninf] + array + [pinf]
binner = mono_bin(train['Target'], train['og_max_max_dpd'], 'og_max_max_dpd')
I have printed out the binner which looks like this:
print(binner)
[-inf, -99999.0, nan, 0.0, nan, nan, 1.0, nan, nan, 4.0, nan, inf]
I want to remove the nan from that list so that the binner looks like this:
[-inf, -99999.0, 0.0, 1.0, 4.0, inf]
Does anyone know how to remove the nan?
You can simply use dropna to remove it from d4:
    ...
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    d4.dropna(inplace=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
    ...
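If you would rather not touch mono_bin, another option (a sketch, not part of the original code) is to filter the NaNs out of the returned list afterwards:
import math

binner = [x for x in binner if not (isinstance(x, float) and math.isnan(x))]
print(binner)
# [-inf, -99999.0, 0.0, 1.0, 4.0, inf]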

Timeseries: Groupby and calculate variance

I have the following dataframe with timeseries data:
df = pd.DataFrame(columns = ['id', 'value'])
df['value'] =[9, 16, 10, 12, 11, 14]
df['id'] = [1, 1, 1, 2, 2, 2]
For each timeseries (defined by the column 'id') I want to calculate the variance, in order to find timeseries that do not change at all or only very little.
The final dataframe should look like this:
df_end = pd.DataFrame(columns = ['id','value', 'var'])
df_end['value'] =[9, 16, 10, 12, 11, 14]
df_end['id'] = [1, 1, 1, 2, 2, 2]
df_end['var'] = [21, 21, 21, 2.3, 2.3, 2.3]
I tried:
df.groupby(df['id']).var()
which gives me the values, but I couldn't put them into the df in the right form. I am sure there is a handy function for this that I don't know about yet!
Thanks for helping out!
Use GroupBy.transform, specifying the column value:
df['var'] = df.groupby('id')['value'].transform('var')
print (df)
id value var
0 1 9 14.333333
1 1 16 14.333333
2 1 10 14.333333
3 2 12 2.333333
4 2 11 2.333333
5 2 14 2.333333
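Since the goal is to spot timeseries that barely change, the new column can then be used as a filter. A small sketch with a hypothetical cut-off value:
threshold = 5  # hypothetical cut-off for 'almost no change'
flat_ids = df.loc[df['var'] < threshold, 'id'].unique()
print(flat_ids)
# [2]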

Generating boolean dataframe based on contents in series and dataframe

I have:
df = pd.DataFrame(
    [
        [22, 33, 44],
        [55, 11, 22],
        [33, 55, 11],
    ],
    index=["abc", "def", "ghi"],
    columns=list("abc")
)  # size (3, 3)
and:
unique = pd.Series([11, 22, 33, 44, 55]) # size(1,5)
then I create a new df based on unique and df, so that:
df_new = pd.DataFrame(index=unique, columns=df.columns) # size(5,3)
From this newly created df, I'd like to create a new boolean df based on unique and df, so that the end result is:
df_new = pd.DataFrame(
    [
        [0, 1, 1],
        [1, 0, 1],
        [1, 1, 0],
        [0, 0, 1],
        [1, 1, 0],
    ],
    index=unique,
    columns=df.columns
)
This new df is either true or false depending on whether the value is present in the original dataframe or not. For example, the first column has three values: [22, 55, 33]. In a df with dimensions (5, 3), this first column would be [0, 1, 1, 0, 1], i.e. [0, 22, 33, 0, 55].
I tried filter2 = unique.isin(df) but this doesn't work, and notnull didn't help either. I tried applying a filter but the dimensions returned were incorrect. How can I do this?
Use DataFrame.stack with DataFrame.reset_index and DataFrame.pivot, then check for non-missing values with DataFrame.notna, cast to integers (True -> 1, False -> 0) and finally remove the index and columns names with DataFrame.rename_axis:
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print(df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
The helper Series is not necessary, but if there are more values, or you need the order given by the helper Series, add DataFrame.reindex:
# added 66
unique = pd.Series([11, 22, 33, 44, 55, 66])
df_new = (df.stack()
            .reset_index(name='v')
            .pivot(index='v', columns='level_1', values='level_0')
            .reindex(unique)
            .notna()
            .astype(int)
            .rename_axis(index=None, columns=None))
print(df_new)
a b c
11 0 1 1
22 1 0 1
33 1 1 0
44 0 0 1
55 1 1 0
66 0 0 0
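A shorter alternative (a sketch reusing the helper Series from the question, not something the answer above proposed) is to check membership of each value of unique in every column:
df_new = df.apply(lambda col: unique.isin(col)).astype(int)
df_new.index = unique.values
print(df_new)
This should give the same table, including an all-zero row for any value (such as 66) that never appears in df.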

Assign conditional values to columns in Dask

I am trying to do a conditional assignment to the rows of a specific column: target. I have done some research, and it seemed that the answer was given here: "How to do row processing and item assignment in dask".
I will reproduce my use case with a mock data set:
x = [3, 0, 3, 4, 0, 0, 0, 2, 0, 0, 0, 6, 9]
y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
mock = pd.DataFrame(dict(target = x, speed = y))
The look of mock is:
In [4]: mock.head(7)
Out [4]:
speed target
0 200 3
1 300 0
2 400 3
3 215 4
4 219 0
5 360 0
6 280 0
Having this Pandas DataFrame, I convert it into a Dask DataFrame:
mock_dask = dd.from_pandas(mock, npartitions = 2)
I apply my conditional rule: all values in target above 0 must become 1, all others 0 (binarize target). Following the thread mentioned above, it should be:
result = mock_dask.target.where(mock_dask.target > 0, 1)
I have a look at the result dataset and it is not working as expected:
In [7]: result.head(7)
Out [7]:
0 3
1 1
2 3
3 4
4 1
5 1
6 1
Name: target, dtype: object
As we can see, the target column in result is not what I expected. It seems that my code is converting all the original 0 values to 1, instead of converting the values greater than 0 to 1 (the conditional rule).
Dask newbie here, Thanks in advance for your help.
OK, the documentation in the Dask DataFrame API is pretty clear. Thanks to @MRocklin's feedback I have realized my mistake. In the documentation, the where function (the last one in the list) is described with the following signature:
DataFrame.where(cond[, other]) Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Thus, the correct code line would be:
result = mock_dask.target.where(mock_dask.target <= 0, 1)
This will output:
In [7]: result.head(7)
Out [7]:
0 1
1 0
2 1
3 1
4 0
5 0
6 0
Name: target, dtype: int64
Which is the expected output.
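As a side note, a simpler way to get the same binarization (a sketch, not taken from the documentation example) is to compare and cast, which behaves the same in pandas and Dask:
result = (mock_dask.target > 0).astype(int)
print(result.head(7))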
They seem to be the same to me
In [1]: import pandas as pd
In [2]: x = [1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0, 6, 9]
...: y = [200, 300, 400, 215, 219, 360, 280, 396, 145, 276, 190, 554, 355]
...: mock = pd.DataFrame(dict(target = x, speed = y))
...:
In [3]: import dask.dataframe as dd
In [4]: mock_dask = dd.from_pandas(mock, npartitions = 2)
In [5]: mock.target.where(mock.target > 0, 1).head(5)
Out[5]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64
In [6]: mock_dask.target.where(mock_dask.target > 0, 1).head(5)
Out[6]:
0 1
1 1
2 1
3 1
4 1
Name: target, dtype: int64

How to merge columns using mask

I am trying to merge two columns (Phone 1 and 2)
Here is my fake data:
import pandas as pd
employee = {'EmployeeID': [0, 1, 2, 3, 4, 5, 6, 7],
            'LastName': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
            'Name': ['w', 'x', 'y', 'z', None, None, None, None],
            'phone1': [1, 1, 2, 2, 4, 5, 6, 6],
            'phone2': [None, None, 3, 3, None, None, 7, 7],
            'level_15': [0, 1, 0, 1, 0, 0, 0, 1]}
df2 = pd.DataFrame(employee)
and I want the 'phone' column to be
'phone' : [1, 2, 3, 4, 5, 7, 9, 10]
At the beginning of my code, I split the names on '/', and the code below creates a column of 0s and 1s that I use as a mask for other tasks throughout my code.
df2 = (df2.set_index(cols)['name'].str.split('/',expand=True).stack().reset_index(name='Name'))
m = df2['level_15'].eq(0)
print (m)
# remove column level_15
df2 = df2.drop(['level_15'], axis=1)
# add last name by selecting the first letters where the condition holds, replace NaNs by forward fill
df2['last_name'] = df2['name'].str[:2].where(m).ffill()
df2['name'] = df2['name'].mask(m, df2['name'].str[2:])
I feel like there is a way to merge phone1 and phone2 using the 0s and 1s, but I can't figure out. Thank you.
First, start by filling in NaNs:
df2['phone2'] = df2.phone2.fillna(df2.phone1)
# Alternatively, based on your latest update
# df2['phone2'] = df2.phone2.mask(df2.phone2.eq(0)).fillna(df2.phone1)
You can just use np.where to merge columns on odd/even indices:
import numpy as np

df2['phone'] = np.where(np.arange(len(df2)) % 2 == 0, df2.phone1, df2.phone2)
df2 = df2.drop(['phone1', 'phone2'], axis=1)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
Or, with Series.where/mask:
df2['phone'] = df2.pop('phone1').where(
    np.arange(len(df2)) % 2 == 0, df2.pop('phone2')
)
Or,
df2['phone'] = df2.pop('phone1').mask(
    np.arange(len(df2)) % 2 != 0, df2.pop('phone2')
)
df2
EmployeeID LastName Name phone
0 0 a w 1
1 1 b x 2
2 2 c y 3
3 3 d z 4
4 4 e None 5
5 5 f None 6
6 6 g None 7
7 7 h None 8
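If the requirement is simply "take phone2 when it is present, otherwise fall back to phone1", a plain combine_first would also work. This is a sketch applied to the original df2 from the question (before any columns are popped), and it is only equivalent if that gap-filling interpretation is what you actually want:
df2['phone'] = df2['phone2'].combine_first(df2['phone1'])
df2 = df2.drop(['phone1', 'phone2'], axis=1)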