Filling values based on column name - pandas

I have this simple data frame:
import numpy as np
import pandas as pd

data = {'Name': ['Karan', 'Rohit', 'Sahil', 'Aryan'], 'Age': [23, 22, 21, 23]}
df = pd.DataFrame(data)
I would like to create a new column for each unique value of the Age column and insert 1 where the column name matches the value in Age, like this:
    Name  Age    21    22    23
0  Karan   23  None  None     1
1  Rohit   22  None     1  None
2  Sahil   21     1  None  None
3  Aryan   23  None  None     1
I have tried:
def data_categorical_check(df, column_cat):
    unique_val = np.unique(np.array(df.iloc[:, [column_cat]]))
    x = None
    for i in range(len(unique_val)):
        x = str(unique_val[i])
        df[x] = None
        df[x] = [int(i == unique_val[i]) for i in df["Age"]]
    return df
This creates the columns fine, but I am not able to insert the values correctly.
I am looking for a general solution: the column to check should be selectable through the 'column_cat' argument.

Simple: encode the values using get_dummies, then mask the zeros and join back with the original dataframe:
s = pd.get_dummies(df['Age'])
df.join(s[s != 0])
    Name  Age   21   22   23
0  Karan   23  NaN  NaN  1.0
1  Rohit   22  NaN  1.0  NaN
2  Sahil   21  1.0  NaN  NaN
3  Aryan   23  NaN  NaN  1.0
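Since the question asks for a general solution with the column passed as an argument, the same idea wraps naturally in a small helper; a minimal sketch (the function name is just illustrative):
import pandas as pd

def encode_matches(df, column_cat):
    # One indicator column per unique value of `column_cat`:
    # 1 where the row's value matches, NaN elsewhere.
    s = pd.get_dummies(df[column_cat])
    return df.join(s[s != 0])

encode_matches(df, 'Age')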

Use pd.crosstab:
>>> pd.concat([df, pd.crosstab(df.index, df.Age)], axis=1)
    Name  Age  21  22  23
0  Karan   23   0   0   1
1  Rohit   22   0   1   0
2  Sahil   21   1   0   0
3  Aryan   23   0   0   1
# OR
>>> pd.concat([df, pd.crosstab(df.index, df.Age).mask(lambda x: x==0)], axis=1)
    Name  Age   21   22   23
0  Karan   23  NaN  NaN  1.0
1  Rohit   22  NaN  1.0  NaN
2  Sahil   21  1.0  NaN  NaN
3  Aryan   23  NaN  NaN  1.0
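A side note: masking the zeros forces the dtype to float, hence the 1.0 values. If integer 1s alongside missing values are preferred, casting to pandas' nullable Int64 dtype should keep them integral; a hedged variant:
pd.concat([df,
           pd.crosstab(df.index, df.Age).mask(lambda x: x == 0).astype('Int64')],
          axis=1)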

You can do it by creating a function that returns the row with the new column added:
def data_categorical_check(row):
    row[str(row["Age"])] = 1
    return row
and applying it with the apply method:
df.apply(data_categorical_check, axis=1)
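Note that apply returns a new frame rather than modifying df in place, and the cells for the non-matching age columns come back as NaN, so the result needs to be captured:
out = df.apply(data_categorical_check, axis=1)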

Related

python rolling product on non-adjacent row

I would like to calculate a rolling product over non-adjacent rows, e.g. the product of the values in every fifth row, as shown in the photo (the result in a blue cell is the product of the data in the blue cells, etc.).
The best I can do right now is the following:
temp = pd.DataFrame([range(20)]).transpose()
df = temp.copy()
df['shift1'] = temp.shift(5)
df['shift2'] = temp.shift(10)
df['shift3'] = temp.shift(15)
result = df.product(axis=1)
However, this is cumbersome because I want to change the row step dynamically.
Can anyone tell me a better way to do this?
Thank you
You can use groupby.cumprod/groupby.prod with the row position modulo 5 as the grouper:
import numpy as np

# use a Series so that .duplicated() is available below
m = pd.Series(np.arange(len(df)) % 5, index=df.index)

# option 1
df['result'] = df.groupby(m)['data'].cumprod()

# option 2: keep only the final product of each group
df.loc[~m.duplicated(keep='last'), 'result2'] = df.groupby(m)['data'].cumprod()
# or
# df.loc[~m.duplicated(keep='last'),
#        'result2'] = df.groupby(m)['data'].prod().to_numpy()
Output:
    data  result  result2
0      0       0      NaN
1      1       1      NaN
2      2       2      NaN
3      3       3      NaN
4      4       4      NaN
5      5       0      NaN
6      6       6      NaN
7      7      14      NaN
8      8      24      NaN
9      9      36      NaN
10    10       0      NaN
11    11      66      NaN
12    12     168      NaN
13    13     312      NaN
14    14     504      NaN
15    15       0      0.0
16    16    1056   1056.0
17    17    2856   2856.0
18    18    5616   5616.0
19    19    9576   9576.0
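Since the question asks for the row step to change dynamically, the modulo grouper wraps naturally in a function; a minimal sketch (strided_cumprod is an illustrative name):
import numpy as np
import pandas as pd

def strided_cumprod(df, col, step):
    # Cumulative product over every `step`-th row of df[col].
    m = pd.Series(np.arange(len(df)) % step, index=df.index)
    return df.groupby(m)[col].cumprod()

df['result'] = strided_cumprod(df, 'data', step=5)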

How do I make the pandas index of a pivot table part of the column names?

I'm trying to pivot two columns out by another flag column, without multi-indexing. I would like the column names to be part of the indicator itself. Take for example:
import pandas as pd
df_dict = {'fire_indicator': [0, 0, 1, 0, 1],
           'cost': [200, 300, 354, 456, 444],
           'value': [1, 1, 2, 1, 1],
           'id': ['a', 'b', 'c', 'd', 'e']}
df = pd.DataFrame(df_dict)
If I do the following:
df.pivot_table(index = 'id', columns = 'fire_indicator', values = ['cost','value'])
I get the following:
                 cost         value
fire_indicator      0      1      0    1
id
a               200.0    NaN    1.0  NaN
b               300.0    NaN    1.0  NaN
c                 NaN  354.0    NaN  2.0
d               456.0    NaN    1.0  NaN
e                 NaN  444.0    NaN  1.0
What I'm trying to do is the following:
id  fire_indicator_0_cost  fire_indicator_1_cost  fire_indicator_0_value  fire_indicator_1_value
a                     200                      0                       1                       0
b                     300                      0                       1                       0
c                       0                    354                       0                       2
d                     456                      0                       1                       0
e                       0                    444                       0                       1
I know there is a way in SAS. Is there a way in python pandas?
Just flatten the column names and reset the index:
out = df.pivot_table(index = 'id', columns = 'fire_indicator', values = ['cost','value'])
out.columns = [f'fire_indicator_{y}_{x}' for x,y in out.columns]
# not necessary if you want `id` to remain the index
out = out.reset_index()
Output:
id fire_indicator_0_cost fire_indicator_1_cost fire_indicator_0_value fire_indicator_1_value
-- ---- ----------------------- ----------------------- ------------------------ ------------------------
0 a 200 nan 1 nan
1 b 300 nan 1 nan
2 c nan 354 nan 2
3 d 456 nan 1 nan
4 e nan 444 nan 1
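To get the zeros from the desired output instead of nan, pivot_table accepts a fill_value argument:
out = df.pivot_table(index='id', columns='fire_indicator',
                     values=['cost', 'value'], fill_value=0)
out.columns = [f'fire_indicator_{y}_{x}' for x, y in out.columns]
out = out.reset_index()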

Using scalar values in series as variables in user defined function

I want to define a function that is applied elementwise for each row in a dataframe, comparing each element to a scalar value in a separate series. I started with the function below:
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0, and I need it applied along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
since 1, 2, and 3 are all >= 1, and none of the remaining values is greater than or equal to 1000.
Your best bet may be to do some transposes (no copies are made, if that's a concern)
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
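A transpose-free variant: the comparison operators have method forms that accept an axis, so df.ge(s, axis=0) should broadcast s down the index instead of across the columns (this assumes s aligns with the frame's default RangeIndex):
df.ge(pd.Series(s), axis=0).sum(axis=1)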

easy multidimensional numpy ndarray to pandas dataframe method?

Having a 4-D numpy.ndarray, e.g.
myarr = np.random.rand(10, 4, 3, 2)
dims = {'time': range(1, 11), 'sub': range(1, 5), 'cond': ['A', 'B', 'C'], 'measure': ['meas1', 'meas2']}
but with possibly higher dimensions. How can I create a pandas.DataFrame with a MultiIndex, just passing the dimensions as indexes, without further manual adjustments (reshaping the ndarray into 2-D shape)?
I can't wrap my head around the reshaping, not even really in 3 dimensions quite yet, so I'm searching for an 'automatic' method if possible.
What would be a function to which to pass the column/row indexes and create a dataframe? Something like:
df=nd2df(myarr,dim2row=[0,1],dim2col=[2,3],rowlab=['time','sub'],collab=['cond','measure'])
and end up with something like:
meas1 meas2
A B C A B C
sub time
1 1
2
3
.
.
2 1
2
...
If it is not possible/feasible to do it automatized, an explanation that is less terse than the Multiindexing manual is appreciated.
I can't even get it right when I don't care about the order of the dimensions, e.g. I would expect this to work:
a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])
pd.DataFrame(a.reshape(2*3*1, 2*2), index=index)
gives:
ValueError: Shape of passed values is (4, 6), indices imply (4, 24)
You're getting the error because you reshaped the ndarray to 6x4 while applying an index intended to capture all dimensions along a single axis. The following setup gets the pet example working:
a = np.arange(24).reshape((3, 2, 2, 2))
iterables = [[1, 2, 3], [1, 2], ['m1', 'm2'], ['A', 'B']]
index = pd.MultiIndex.from_product(iterables, names=['time', 'sub', 'meas', 'cond'])
pd.DataFrame(a.reshape(24, 1), index=index)
Solution
Here's a generic DataFrame creator that should get the job done:
def produce_df(rows, columns, row_names=None, column_names=None):
    """rows is a list of lists that will be used to build a MultiIndex
    columns is a list of lists that will be used to build a MultiIndex"""
    row_index = pd.MultiIndex.from_product(rows, names=row_names)
    col_index = pd.MultiIndex.from_product(columns, names=column_names)
    return pd.DataFrame(index=row_index, columns=col_index)
Demonstration
Without named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']])
        1         2
        3    4    3    4
a  c  NaN  NaN  NaN  NaN
   d  NaN  NaN  NaN  NaN
b  c  NaN  NaN  NaN  NaN
   d  NaN  NaN  NaN  NaN
With named index levels
produce_df([['a', 'b'], ['c', 'd']], [['1', '2'], ['3', '4']],
           row_names=['alpha1', 'alpha2'], column_names=['number1', 'number2'])
number1          1         2
number2          3    4    3    4
alpha1 alpha2
a      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
b      c       NaN  NaN  NaN  NaN
       d       NaN  NaN  NaN  NaN
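To actually load the OP's myarr into such a frame, a plain C-order reshape should line up with the from_product ordering, since both iterate the trailing level fastest; a sketch under that assumption:
import numpy as np

myarr = np.random.rand(10, 4, 3, 2)
out = produce_df(rows=[range(1, 11), range(1, 5)],
                 columns=[['A', 'B', 'C'], ['meas1', 'meas2']],
                 row_names=['time', 'sub'], column_names=['cond', 'measure'])
# rows cover the first two dimensions, columns the last two
out.loc[:, :] = myarr.reshape(10 * 4, 3 * 2)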
From the structure of your data,
names = ['sub', 'time', 'measure', 'cond']  # ind1, ind2, col1, col2
labels = [[1, 2, 3], [1, 2], ['meas1', 'meas2'], list('ABC')]
A straightforward way to your goal:
index = pd.MultiIndex.from_product(labels, names=names)
data = np.arange(index.size)  # or myarr.flatten()
df = pd.DataFrame(data, index=index)
df22 = df.reset_index().pivot_table(values=0, index=names[:2], columns=names[2:])
"""
measure meas1 meas2
cond A B C A B C
sub time
1 1 0 1 2 3 4 5
2 6 7 8 9 10 11
2 1 12 13 14 15 16 17
2 18 19 20 21 22 23
3 1 24 25 26 27 28 29
2 30 31 32 33 34 35
"""
I still don't know how to do it directly, but here is an easy-to-follow, step-by-step way:
# Create the 4-D array
a = np.arange(24).reshape((3, 2, 2, 2))
# Set only one row index
rowiter = [[1, 2, 3]]
row_ind = pd.MultiIndex.from_product(rowiter, names=['time'])
# Put the rest of the dimensions into columns
coliter = [[1, 2], ['m1', 'm2'], ['A', 'B']]
col_ind = pd.MultiIndex.from_product(coliter, names=['sub', 'meas', 'cond'])
ncols = np.prod([len(c) for c in coliter])
b = pd.DataFrame(a.reshape(len(rowiter[0]), ncols), index=row_ind, columns=col_ind)
print(b)
# Reshape columns to rows as pleased:
b = b.stack('sub')
# Switch levels and sort the rows:
c = b.swaplevel(0, 1, axis=0).sort_index(level=0)
To check the correct assignment of dimensions:
print(a[:,0,0,0])
[ 0 8 16]
print(a[0,:,0,0])
[0 4]
print(a[0,0,:,0])
[0 2]
print(b)
meas       m1      m2
cond        A   B   A   B
time sub
1    1      0   1   2   3
     2      4   5   6   7
2    1      8   9  10  11
     2     12  13  14  15
3    1     16  17  18  19
     2     20  21  22  23
print(c)
meas       m1      m2
cond        A   B   A   B
sub time
1   1       0   1   2   3
    2       8   9  10  11
    3      16  17  18  19
2   1       4   5   6   7
    2      12  13  14  15
    3      20  21  22  23

transforming data frame in ipython a little like transpose

Suppose I have a data frame like the following in pandas:
a   1  11
a   3  12
a  20  13
b   2  14
b   4  15
I want to generate a resulting data.frame like this
V1    1    2    3    4   20
a    11  NaN   12  NaN   13
b   NaN   14  NaN   15  NaN
How can I get this transformation?
Thank you.
You can use pivot:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b'],
                   'col2': [1, 3, 20, 2, 4],
                   'col3': [11, 12, 13, 14, 15]})
print(df.pivot(index='col1', columns='col2'))
Output:
      col3
col2     1    2    3    4   20
col1
a       11  NaN   12  NaN   13
b      NaN   14  NaN   15  NaN
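If the extra col3 level in the header is unwanted, passing values should yield single-level columns matching the target layout:
print(df.pivot(index='col1', columns='col2', values='col3'))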