How to expand one row to multiple rows according to its value in Pandas

This is an example DataFrame (shown as an image in the original post).
Before:
import pandas as pd

d = {1: ['2134', 20, 1, 1, 1, 0], 2: ['1010', 5, 1, 0, 0, 0], 3: ['3457', 15, 0, 1, 1, 0]}
columns = ['Code', 'Price', 'Bacon', 'Onion', 'Tomato', 'Cheese']
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = columns
What I want to do is expand a single row into multiple rows, so the DataFrame looks like the image below. The intention is to use some columns (from 'Bacon' to 'Cheese') as categories.
After:
I tried to find the answer, but failed. Thanks.

You can first reshape with set_index and stack, then filter with query, build indicator columns with str.get_dummies from the level_2 column, reindex the columns to add back any category that never has a 1 (filled with 0), and finally reset_index:
df = df.set_index(['Code', 'Price']) \
       .stack() \
       .reset_index(level=2, name='val') \
       .query('val == 1') \
       .level_2.str.get_dummies() \
       .reindex(columns=df.columns[2:], fill_value=0) \
       .reset_index()
print (df)
   Code  Price  Bacon  Onion  Tomato  Cheese
0  2134     20      1      0       0       0
1  2134     20      0      1       0       0
2  2134     20      0      0       1       0
3  1010      5      1      0       0       0
4  3457     15      0      1       0       0
5  3457     15      0      0       1       0
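To see how the chain gets there: after set_index, stack, reset_index and query, only the (Code, Price, topping) rows holding a 1 remain. A sketch of that intermediate with the sample frame:
            level_2  val
Code Price
2134 20       Bacon    1
     20       Onion    1
     20      Tomato    1
1010 5        Bacon    1
3457 15       Onion    1
     15      Tomato    1
str.get_dummies then yields one indicator column per topping that still occurs (Bacon, Onion, Tomato), and reindex restores Cheese, which never has a 1, filled with 0.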

You can use stack and transpose for this kind of reshaping, then format the resulting column names accordingly.
df = df.stack().to_frame().T
df.columns = ['{}_{}'.format(*c) for c in df.columns]
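With the sample frame this collapses everything into a single row whose columns are flattened (index, column) pairs; a sketch of the result:
print(df)
   1_Code  1_Price  1_Bacon  1_Onion  1_Tomato  1_Cheese  2_Code  ...  3_Tomato  3_Cheese
0    2134       20        1        1         1         0    1010  ...         1         0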

Use pd.melt to put all the food in one column and then pd.get_dummies to expand the columns.
df1 = pd.melt(df, id_vars=['Code', 'Price'])
df1 = df1[df1['value'] == 1]
df1 = pd.get_dummies(df1, columns=['variable'], prefix='', prefix_sep='').sort_values(['Code', 'Price'])
df1 = df1.reindex(columns=df.columns, fill_value=0)
Edited after I saw how jezrael used reindex to both add and drop a column.

Related

python pandas sort_values with multiple custom keys

I have a dataframe with 2 columns. I would like to sort column A ascending, and column B ascending by absolute value. How do I do this? I tried df.sort_values(by=['A', 'B'], key=lambda x: abs(x)), but it applies the absolute value to both columns and sorts ascending.
df = pd.DataFrame({'A': [1,2,-3], 'B': [-1, 2, -3]})
output:
   A  B
0  1 -1
1  2  2
2 -3 -3
Expected output:
   A  B
0 -3 -1
1  1  2
2  2 -3
You can't use a different sort key per column: sort_values applies the key to every column listed in by, and a row's values can't be dissociated from each other. One way is to sort each column independently and recreate the dataframe:
>>> (df.agg({'A': lambda x: x.sort_values().values,           # sort A by value
...          'B': lambda x: x.sort_values(key=abs).values})   # sort B by |value|
...    .apply(pd.Series).T)  # expand each sorted array into a row, then flip back
   A  B
0 -3 -1
1  1  2
2  2 -3
Use numpy.sort to sort column A's values (in this example, column B is already in absolute-value order):
df = df.assign(A=np.sort(df['A'].values))
   A  B
0 -3 -1
1  1  2
2  2 -3
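A fuller variant of the same idea (a sketch, assuming the df above) sorts each column independently with its own key and rebuilds the frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, -3], 'B': [-1, 2, -3]})

# Sort A by value; sort B by absolute value via a stable argsort on |B|.
out = pd.DataFrame({
    'A': np.sort(df['A'].to_numpy()),
    'B': df['B'].to_numpy()[np.argsort(np.abs(df['B'].to_numpy()), kind='stable')],
})
print(out)
   A  B
0 -3 -1
1  1  2
2  2 -3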

Transform values from 1 column to multiple columns

I have a table with an 'Acount' column and a comma-separated 'Product' column, and would like to convert the 'Product' column into one indicator column per product. How would you recommend I do this in pandas? Test df below:
import numpy as np
import pandas as pd
test_dict = {'Acount': ['1', '2', '3', '4'], 'Product': [np.nan, 'A','A,B,C', 'C']}
df = pd.DataFrame.from_dict(test_dict)
For a single column you can use Series.str.get_dummies, which lets you specify the character that separates the categories. Set 'Acount' as the index so that it appears in the output:
df.set_index('Acount')['Product'].str.get_dummies(sep=',')
        A  B  C
Acount
1       0  0  0
2       1  0  0
3       1  1  1
4       0  0  1
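If you want 'Acount' back as a regular column afterwards, chain .reset_index():
df.set_index('Acount')['Product'].str.get_dummies(sep=',').reset_index()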
Let's use .str.split, explode and pd.crosstab:
df_count = df.assign(Product=df['Product'].str.split(',')).explode('Product')
pd.crosstab(df_count['Acount'], df_count['Product']).reindex(df['Acount'].unique(), fill_value=0)
Output:
Product  A  B  C
Acount
1        0  0  0
2        1  0  0
3        1  1  1
4        0  0  1
Details
Let's assign 'Product' as lists of elements using .str.split on commas.
Next, use explode to unnest the lists in the 'Product' column.
Now, use pd.crosstab to count the occurrences of each value per 'Acount'.
Lastly, reindex to restore any 'Acount' that crosstab dropped (Acount 1 is NaN-only), as the sketch below shows.
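A sketch of the intermediates, using the test df above:
df2 = df.assign(Product=df['Product'].str.split(','))  # 'A,B,C' -> ['A', 'B', 'C']; NaN stays NaN
df2 = df2.explode('Product')                           # one row per (Acount, product) pair
print(df2)
  Acount Product
0      1     NaN
1      2       A
2      3       A
2      3       B
2      3       C
3      4       C
pd.crosstab ignores the NaN row, so Acount 1 disappears from its output; the final reindex puts it back with all-zero counts.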

Pandas find columns with wildcard names

I have a pandas dataframe with column names like this:
id ColNameOrig_x ColNameOrig_y
There are many such columns; the '_x' and '_y' suffixes came about because two datasets with similar column names were merged.
What I need to do:
df.ColName = df.ColNameOrig_x + df.ColNameOrig_y
I am now manually repeating this line for many columns (close to 50). Is there a wildcard way of doing this?
You can use DataFrame.filter with DataFrame.groupby along axis=1, grouping columns either by a lambda applied to each column name, or with string methods like Series.str.split plus indexing, and then aggregate with sum:
df1 = df.filter(like='_').groupby(lambda x: x.split('_')[0], axis=1).sum()
print (df1)
   ColName1Orig  ColName2Orig
0             3             7
1            11            15
df1 = df.filter(like='_').groupby(df.columns.str.split('_').str[0], axis=1).sum()
print (df1)
   ColName1Orig  ColName2Orig
0             3             7
1            11            15
df1 = df.filter(like='_').groupby(df.columns.str[:12], axis=1).sum()
print (df1)
   ColName1Orig  ColName2Orig
0             3             7
1            11            15
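Note that axis=1 in groupby is deprecated in recent pandas (2.1+). Under that assumption, an equivalent is to transpose, group the index labels, and transpose back; a sketch:
df1 = df.filter(like='_').T.groupby(lambda x: x.split('_')[0]).sum().T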
You can use the subscripting syntax to access column names dynamically:
col_groups = ['ColName1', 'ColName2']
for grp in col_groups:
    df[grp] = df[f'{grp}Orig_x'] + df[f'{grp}Orig_y']
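If you also want to drop the source '_x'/'_y' columns afterwards (an assumption about the desired result), one way:
df = df.drop(columns=df.filter(like='Orig_').columns)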
Or you can aggregate by column group. For example:
df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8]
], columns=['ColName1Orig_x', 'ColName1Orig_y', 'ColName2Orig_x', 'ColName2Orig_y'])
# Here's your opportunity to define the wildcard
col_groups = df.columns.str.extract('(.+)Orig_[x|y]')[0]
df.columns = [col_groups, df.columns]
df.groupby(level=0, axis=1).sum()
Input:
   ColName1Orig_x  ColName1Orig_y  ColName2Orig_x  ColName2Orig_y
0               1               2               3               4
1               5               6               7               8
Output:
   ColName1  ColName2
0         3         7
1        11        15

Creating column from categories in a column in pandas

I have a dataframe in which I want to create new columns based on the levels of data in existing columns. For example,
Cust_ID   MCC  Date  TRANS_AMT  Frequency
      1  1750   Jan       6633          1
      1  1799   Jan       5584          1
      2  3001   Mar        405          2
      2  3174   Oct       1219          1
I want to create columns based on the levels of data I have in columns MCC and Date. For each Cust_ID, I want the TRANS_AMT and Frequency for each combination of MCC and Date.
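For reference, a minimal construction of the sample frame (a sketch assuming the table above):
import pandas as pd

df = pd.DataFrame({'Cust_ID': [1, 1, 2, 2],
                   'MCC': [1750, 1799, 3001, 3174],
                   'Date': ['Jan', 'Jan', 'Mar', 'Oct'],
                   'TRANS_AMT': [6633, 5584, 405, 1219],
                   'Frequency': [1, 1, 2, 1]})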
Below is the required output:
Because the ordering of columns in the final DataFrame matters, convert the Date column to an ordered categorical, then create a MultiIndex with DataFrame.set_index and convert the TRANS_AMT and Frequency columns to an ordered CategoricalIndex too.
Then reshape with DataFrame.unstack and sort by the second level of the column MultiIndex with DataFrame.sort_index.
Last, flatten the column tuples in a list comprehension with f-strings and call DataFrame.reset_index to turn the index back into a column:
cats = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Date'] = pd.Categorical(df['Date'], categories=cats, ordered=True)
df1 = df.set_index(['Cust_ID', 'MCC', 'Date'])
df1.columns = pd.CategoricalIndex(df1.columns,
                                  categories=['TRANS_AMT', 'Frequency'],
                                  ordered=True)
df1 = df1.unstack(level=[1, 2], fill_value=0).sort_index(axis=1, level=1)
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()
print (df1)
   Cust_ID  TRANS_AMT_1750_Jan  Frequency_1750_Jan  TRANS_AMT_1799_Jan  \
0        1                6633                   1                5584
1        2                   0                   0                   0

   Frequency_1799_Jan  TRANS_AMT_3001_Mar  Frequency_3001_Mar  \
0                   1                   0                   0
1                   0                 405                   2

   TRANS_AMT_3174_Oct  Frequency_3174_Oct
0                   0                   0
1                1219                   1
If ordering is not important, skip the conversions to categoricals:
df1 = (df.set_index(['Cust_ID', 'MCC', 'Date'])
         .unstack(level=[1, 2], fill_value=0)
         .sort_index(axis=1, level=1))
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()
print (df1)
   Cust_ID  Frequency_1750_Jan  TRANS_AMT_1750_Jan  Frequency_1799_Jan  \
0        1                   1                6633                   1
1        2                   0                   0                   0

   TRANS_AMT_1799_Jan  Frequency_3001_Mar  TRANS_AMT_3001_Mar  \
0                5584                   0                   0
1                   0                   2                 405

   Frequency_3174_Oct  TRANS_AMT_3174_Oct
0                   0                   0
1                   1                1219

How to add a new row to pandas dataframe with non-unique multi-index

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,2,1,2]], columns=list('xyz'))
where df looks like:
     x   y   z
a 1  0   1   2
  2  3   4   5
b 1  6   7   8
  2  9  10  11
Now I add a new row by:
df.loc['new',:]=[0,0,0]
Then df becomes the same frame with an extra row labeled 'new' holding [0, 0, 0].
Now I want to do the same but with a different df that has non-unique multi-index:
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,1,2,2]], columns=list('xyz'))
which looks like:
     x   y   z
a 1  0   1   2
  1  3   4   5
b 2  6   7   8
  2  9  10  11
and call
df.loc['new',:]=[0,0,0]
The result is "Exception: cannot handle a non-unique multi-index!"
How could I achieve the goal?
Use concat (or append on pandas older than 2.0, where DataFrame.append was removed) with a helper DataFrame:
df1 = pd.DataFrame([[0, 0, 0]],
                   columns=df.columns,
                   index=pd.MultiIndex.from_arrays([['new'], ['']]))
# Either line works; DataFrame.append was removed in pandas 2.0, so prefer concat.
df2 = df.append(df1)
df2 = pd.concat([df, df1])
print (df2)
      x   y   z
a 1   0   1   2
  1   3   4   5
b 2   6   7   8
  2   9  10  11
new   0   0   0