Split rows and assign values in pandas

I have a data frame as follows:
Proxyid  A  B  C  D
123      1  0  0  0
456      1  1  1  1
789      0  0  0  0
This is the idea of the data frame. Now I want to duplicate the rows that contain more than one 1, so that each copy keeps a single 1, and assign values as follows.
Proxyid  A  B  C  D
123      1  0  0  0
456      1  0  0  0
456      0  1  0  0
456      0  0  1  0
456      0  0  0  1
789      0  0  0  0
I would really appreciate any input. Thank you.

One option via pd.get_dummies:
df1 = (
    pd.get_dummies(
        df.set_index('Proxyid')
          .mul(df.columns[1:])
          .replace('', np.nan)
          .stack()
    )
    .reset_index()
    .drop(columns='level_1')
)
result = pd.concat([df1, df[~df.Proxyid.isin(df1.Proxyid)]])
OUTPUT:
   Proxyid  A  B  C  D
0      123  1  0  0  0
1      456  1  0  0  0
2      456  0  1  0  0
3      456  0  0  1  0
4      456  0  0  0  1
2      789  0  0  0  0
If you have extra columns, just add them in set_index and use:
df1 = df.set_index(['Proxyid', 'test'])
df1 = pd.get_dummies(df1.mul(df1.columns).replace('', np.nan).stack()).reset_index()
result = pd.concat([df1, df[~df.Proxyid.isin(df1.Proxyid)]])
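For reference, here is a self-contained sketch of the first variant, with the sample frame rebuilt from the question (dtype=int keeps 0/1 output on newer pandas, where dummies default to bool):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Proxyid': [123, 456, 789],
                   'A': [1, 1, 0], 'B': [0, 1, 0],
                   'C': [0, 1, 0], 'D': [0, 1, 0]})

# multiply every flag by its column name ('A' * 1 == 'A', 'A' * 0 == ''),
# stack the non-empty labels, then dummy-encode the stacked Series
df1 = (
    pd.get_dummies(
        df.set_index('Proxyid')
          .mul(df.columns[1:])
          .replace('', np.nan)
          .stack(),
        dtype=int,
    )
    .reset_index()
    .drop(columns='level_1')
)
# re-attach the all-zero rows that stack() dropped
result = pd.concat([df1, df[~df.Proxyid.isin(df1.Proxyid)]])
print(result)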

Related

Is there a way to loop through and, based on a single column value, mark a value into multiple new columns in Pandas?

The dataframe would look something like this:
start = [0,2,4,5,1]
end = [3,5,5,5,2]
df = pd.DataFrame({'start': start,'end': end})
The result I want looks something like this:
Basically, marking a value from start to finish across multiple columns. So if a row starts on 0 and ends on 3, I want to mark new columns 0 to 3 with a value (1) and the rest with 0.
start = [0,2,4,5,1]
end = [3,5,5,5,2]
diff = [3,3,1,0,1]
col_0 = [1,0,0,0,0]
col_1=[1,0,0,0,1]
col_2 = [1,1,0,0,1]
col_3=[1,1,0,0,0]
col_4=[0,1,1,0,0]
col_5=[0,1,1,1,0]
df = pd.DataFrame({'start': start,'end': end, 'col_0':col_0, 'col_1': col_1, 'col_2': col_2, 'col_3':col_3, 'col_4': col_4, 'col_5': col_5})
start end col_0 col_1 col_2 col_3 col_4 col_5
0 3 1 1 1 1 0 0
2 5 0 0 1 1 1 1
4 5 0 0 0 0 1 1
5 5 0 0 0 0 0 1
1 2 0 1 1 0 0 0
Use dict.fromkeys in a list comprehension for each row of the DataFrame and pass the result to the DataFrame constructor if performance is important:
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('col_').fillna(0).astype(int))
print (df)
start end col_0 col_1 col_2 col_3 col_4 col_5
0 0 3 1 1 1 1 0 0
1 2 5 0 0 1 1 1 1
2 4 5 0 0 0 0 1 1
3 5 5 0 0 0 0 0 1
4 1 2 0 1 1 0 0 0
If it's possible that some range value is missing, like in the changed sample data, add DataFrame.reindex:
#missing column 6
start = [0,2,4,7,1]
end = [3,5,5,8,2]
df = pd.DataFrame({'start': start,'end': end})
L = [dict.fromkeys(range(s, e + 1), 1) for s, e in zip(df['start'], df['end'])]
df1 = (pd.DataFrame(L, index=df.index)
         .reindex(columns=range(df['start'].min(), df['end'].max() + 1), fill_value=0)
         .add_prefix('col_')
         .fillna(0)
         .astype(int))
df = df.join(df1)
print (df)
start end col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8
0 0 3 1 1 1 1 0 0 0 0 0
1 2 5 0 0 1 1 1 1 0 0 0
2 4 5 0 0 0 0 1 1 0 0 0
3 7 8 0 0 0 0 0 0 0 1 1
4 1 2 0 1 1 0 0 0 0 0 0
EDIT: For counting hours use:
start = pd.to_datetime([0,2,4,5,1], format='%H')
end = pd.to_datetime([3,5,5,5,2], format='%H')
df = pd.DataFrame({'start': start,'end': end})
df.loc[[0,1], 'end'] += pd.Timedelta(1, 'day')
#list for hours datetimes
L = [dict.fromkeys(pd.date_range(s, e, freq='H'), 1) for s, e in zip(df['start'], df['end'])]
df1 = pd.DataFrame(L, index=df.index)
#aggregate sum by hours in columns
df1 = df1.groupby(df1.columns.hour, axis=1).sum().astype(int)
print (df1)
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 \
0 2 2 2 2 1 1 1 1 1 1 ... 1 1 1 1 1 1 1
1 1 1 2 2 2 2 1 1 1 1 ... 1 1 1 1 1 1 1
2 0 0 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0
4 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
21 22 23
0 1 1 1
1 1 1 1
2 0 0 0
3 0 0 0
4 0 0 0
[5 rows x 24 columns]
Convert your range from start to stop to a list of indices then explode it. Finally, use indexing to set values to 1:
import numpy as np
range_to_ind = lambda x: range(x['start'], x['end']+1)
(i, j) = df.apply(range_to_ind, axis=1).explode().astype(int).reset_index().values.T
a = np.zeros((df.shape[0], max(df['end'])+1), dtype=int)
a[i, j] = 1
df = df.join(pd.DataFrame(a).add_prefix('col_'))
Output:
>>> df
start end col_0 col_1 col_2 col_3 col_4 col_5
0 0 3 1 1 1 1 0 0
1 2 5 0 0 1 1 1 1
2 4 5 0 0 0 0 1 1
3 5 5 0 0 0 0 0 1
4 1 2 0 1 1 0 0 0
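A broadcasting-based sketch of the same idea, assuming the columns should run from 0 to df['end'].max(): compare every column index against each row's start and end in one vectorized expression.
import numpy as np
import pandas as pd

start = [0, 2, 4, 5, 1]
end = [3, 5, 5, 5, 2]
df = pd.DataFrame({'start': start, 'end': end})

# column j gets a 1 in a row exactly when start <= j <= end
cols = np.arange(df['end'].max() + 1)
marks = ((cols >= df['start'].to_numpy()[:, None])
         & (cols <= df['end'].to_numpy()[:, None])).astype(int)
df = df.join(pd.DataFrame(marks, index=df.index).add_prefix('col_'))
print(df)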

Pandas iloc and conditional sum

This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
you can use the underlying numpy array and numpy.where:
import numpy as np
a = data.values[:, 1:]
data.iloc[:,1:] = np.where(a!=0, a.sum(0), a)
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
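A pandas-only sketch of the same fix, assuming the input above: keep the zero cells and replace the rest with their column sums (the axis=1 argument aligns the sums Series with the columns).
import pandas as pd

data = pd.DataFrame([[0, 1, 0, 1, 1],
                     [1, 0, 1, 0, 1]])

sub = data.iloc[:, 1:]
# keep a cell where it is zero, otherwise take that column's sum
data.iloc[:, 1:] = sub.where(sub.eq(0), sub.sum(), axis=1)
print(data)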

Getting dummy values across all columns

The get_dummies method does not seem to work as expected when used with more than one column.
For e.g. if I have this dataframe...
shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread", "Milk"],
    ["Rice", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df = pd.DataFrame(shopping_list)
If I use the get_dummies method, the items are repeated across columns like this:
pd.get_dummies(df)
0_Apple 0_Rice 1_Bread 1_Milk 1_Rice 2_Bread 2_Fridge 2_Milk 3_Milk
0 1 0 1 0 0 0 1 0 0
1 0 1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0 0 1
3 0 1 0 1 0 0 0 0 0
4 1 0 1 0 0 0 0 1 0
While the expected result is:
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 1 1
3 0 0 0 1 1
4 1 1 0 1 0
Add the parameters prefix and prefix_sep to get_dummies, then aggregate the duplicated column names by max:
df = pd.get_dummies(df, prefix='', prefix_sep='').T.groupby(level=0, sort=False).max().T
print(df)
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 1 0
3 0 1 0 1 0
4 1 0 1 1 0
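An alternative sketch that sidesteps the prefixes entirely: stack the frame into one long Series of items, dummy-encode that Series, and collapse back to one row per shopping list (the columns then come out sorted, matching the expected result above).
import pandas as pd

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread", "Milk"],
    ["Rice", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)

# one dummy row per (list, item) pair; max() merges them per list
dummies = pd.get_dummies(df.stack()).groupby(level=0).max().astype(int)
print(dummies)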

Creating a "choice" variable that indicates what choice was taken, by assigning the column name - Pandas

I want to create a "choice" variable that indicates what choice was taken among some alternatives.
The alternatives in this example are: 123, 456, 789.
If there is no choice, assign the value 0; if there are multiple choices, assign the value 1;
otherwise, assign the choice name (by taking the column name).
Data illustration:
ID Date X 123 456 789
A 07/16/2019 .. 1 0 0
A 07/19/2019 .. 0 0 0
A 07/20/2019 .. 0 1 0
A 07/22/2019 .. 1 0 0
A 07/23/2019 .. 0 1 1
B 07/27/2019 .. 0 0 1
B 07/28/2019 .. 0 0 0
B 07/30/2019 .. 0 0 0
Expected result:
ID Date X 123 456 789 choice
A 07/16/2019 .. 1 0 0 123
A 07/19/2019 .. 0 0 0 0
A 07/20/2019 .. 0 1 0 456
A 07/22/2019 .. 1 0 0 123
A 07/23/2019 .. 0 1 1 1
B 07/27/2019 .. 0 0 1 789
B 07/28/2019 .. 0 0 0 0
B 07/30/2019 .. 0 0 0 0
Use numpy.select with DataFrame.idxmax:
#select last 3 columns
df1 = df.iloc[:, -3:]
#sum 1 values
s = df1.sum(axis=1)
#set values by conditions
df['choice'] = np.select([s == 1, s == 0], [df1.idxmax(axis=1), 0], default=1)
print (df)
ID Date X 123 456 789 choice
0 A 07/16/2019 .. 1 0 0 123
1 A 07/19/2019 .. 0 0 0 0
2 A 07/20/2019 .. 0 1 0 456
3 A 07/22/2019 .. 1 0 0 123
4 A 07/23/2019 .. 0 1 1 1
5 B 07/27/2019 .. 0 0 1 789
6 B 07/28/2019 .. 0 0 0 0
7 B 07/30/2019 .. 0 0 0 0
Here is a way to do it by selection, using a custom function which does the job and pandas apply:
#list with the names of valid alternative columns
alternatives = ['123', '456', '789']
#custom function to do the selection
def pick_choice(row):
    ones = row[alternatives].loc[row[alternatives] == 1]
    if len(ones) == 0:
        return 0
    elif len(ones) > 1:
        return 1
    elif len(ones) == 1:
        return ones.index[0]

df['choice'] = df.apply(pick_choice, axis=1)
Resulting df is:
ID Date X 123 456 789 choice
0 A 07/16/2019 .. 1 0 0 123
1 A 07/19/2019 .. 0 0 0 0
2 A 07/20/2019 .. 0 1 0 456
3 A 07/22/2019 .. 1 0 0 123
4 A 07/23/2019 .. 0 1 1 1
5 B 07/27/2019 .. 0 0 1 789
6 B 07/28/2019 .. 0 0 0 0
7 B 07/30/2019 .. 0 0 0 0
Be careful: the dtype of the 'choice' column is object and not int, because column names are strings (even if they represent integers).
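A vectorized sketch that avoids apply, assuming the alternative columns hold 0/1 integers and string names as above: the dot trick concatenates the names of the columns holding a 1, which for single-choice rows is exactly the chosen column name.
import numpy as np

df1 = df[['123', '456', '789']]
s = df1.sum(axis=1)
# 1 * 'name' == 'name' and 0 * 'name' == '', so the dot product
# concatenates the labels of all columns that contain a 1
df['choice'] = np.select([s.eq(0), s.gt(1), s.eq(1)],
                         [0, 1, df1.dot(df1.columns)])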

How to create dummy variables on Ordinal columns in Python

I am new to Python. I have created dummy columns on a categorical column using pandas get_dummies. How do I create dummy columns on an ordinal column (say column Rating has values 1, 2, 3, ..., 10)?
Consider the dataframe df
df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))
df
Cats Ords
0 a 3
1 b 2
2 c 1
3 d 0
4 c 1
5 b 2
6 a 3
pd.get_dummies works the same on either column
with df.Cats
pd.get_dummies(df.Cats)
a b c d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 0 0 1 0
5 0 1 0 0
6 1 0 0 0
with df.Ords
pd.get_dummies(df.Ords)
0 1 2 3
0 0 0 0 1
1 0 0 1 0
2 0 1 0 0
3 1 0 0 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
with both
pd.get_dummies(df)
Ords Cats_a Cats_b Cats_c Cats_d
0 3 1 0 0 0
1 2 0 1 0 0
2 1 0 0 1 0
3 0 0 0 0 1
4 1 0 0 1 0
5 2 0 1 0 0
6 3 1 0 0 0
Notice that it split out Cats but not Ords.
Let's expand on this by adding another Cats2 column and calling pd.get_dummies:
pd.get_dummies(df.assign(Cats2=df.Cats))
Ords Cats_a Cats_b Cats_c Cats_d Cats2_a Cats2_b Cats2_c Cats2_d
0 3 1 0 0 0 1 0 0 0
1 2 0 1 0 0 0 1 0 0
2 1 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 1
4 1 0 0 1 0 0 0 1 0
5 2 0 1 0 0 0 1 0 0
6 3 1 0 0 0 1 0 0 0
Interesting, it splits both object columns but not the numeric one.
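So to dummy-encode the ordinal column as well, a sketch of two options: name it explicitly in the columns parameter of pd.get_dummies, or cast it to a non-numeric dtype first.
import pandas as pd

df = pd.DataFrame(dict(Cats=list('abcdcba'), Ords=[3, 2, 1, 0, 1, 2, 3]))

# option 1: list the numeric column explicitly
print(pd.get_dummies(df, columns=['Cats', 'Ords']))

# option 2: cast to string (or Categorical) so get_dummies picks it up
print(pd.get_dummies(df.assign(Ords=df.Ords.astype(str))))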