Creating column from categories in a column in pandas

I have a dataframe in which I want to create columns based on the levels of data in some of its columns. For example:
Cust_ID   MCC  Date  TRANS_AMT  Frequency
      1  1750   Jan       6633          1
      1  1799   Jan       5584          1
      2  3001   Mar        405          2
      2  3174   Oct       1219          1
I want to create columns based on the levels of data in columns MCC and Date. For each Cust_ID, I want the TRANS_AMT and Frequency they have at each combined MCC and Date level.
Below is the required output (originally shown as an image): one TRANS_AMT and one Frequency column for every MCC and Date combination, per Cust_ID.

Because the ordering of columns in the final DataFrame matters, convert column Date to an ordered categorical, then create a MultiIndex with DataFrame.set_index and convert the columns TRANS_AMT and Frequency to an ordered CategoricalIndex too.
Then reshape with DataFrame.unstack and sort by the second level of the column MultiIndex with DataFrame.sort_index.
Last, flatten the column names in a list comprehension with f-strings and use DataFrame.reset_index to turn the index back into a column:
import pandas as pd

# sample data from the question
df = pd.DataFrame({'Cust_ID': [1, 1, 2, 2],
                   'MCC': [1750, 1799, 3001, 3174],
                   'Date': ['Jan', 'Jan', 'Mar', 'Oct'],
                   'TRANS_AMT': [6633, 5584, 405, 1219],
                   'Frequency': [1, 1, 2, 1]})

cats = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['Date'] = pd.Categorical(df['Date'], categories=cats, ordered=True)

df1 = df.set_index(['Cust_ID', 'MCC', 'Date'])
df1.columns = pd.CategoricalIndex(df1.columns,
                                  categories=['TRANS_AMT', 'Frequency'],
                                  ordered=True)
df1 = df1.unstack(level=[1, 2], fill_value=0).sort_index(axis=1, level=1)
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()
print (df1)
   Cust_ID  TRANS_AMT_1750_Jan  Frequency_1750_Jan  TRANS_AMT_1799_Jan  \
0        1                6633                   1                5584
1        2                   0                   0                   0

   Frequency_1799_Jan  TRANS_AMT_3001_Mar  Frequency_3001_Mar  \
0                   1                   0                   0
1                   0                 405                   2

   TRANS_AMT_3174_Oct  Frequency_3174_Oct
0                   0                   0
1                1219                   1
If the ordering is not important, remove the conversion to categoricals:
df1 = (df.set_index(['Cust_ID', 'MCC', 'Date'])
         .unstack(level=[1, 2], fill_value=0)
         .sort_index(axis=1, level=1))
df1.columns = [f'{a}_{b}_{c}' for a, b, c in df1.columns]
df1 = df1.reset_index()
print (df1)
   Cust_ID  Frequency_1750_Jan  TRANS_AMT_1750_Jan  Frequency_1799_Jan  \
0        1                   1                6633                   1
1        2                   0                   0                   0

   TRANS_AMT_1799_Jan  Frequency_3001_Mar  TRANS_AMT_3001_Mar  \
0                5584                   0                   0
1                   0                   2                 405

   Frequency_3174_Oct  TRANS_AMT_3174_Oct
0                   0                   0
1                   1                1219
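An alternative sketch, not from the original answer: DataFrame.pivot_table can build the same wide table in one call, assuming Date is kept as plain strings (a categorical Date would typically make all twelve month categories appear as columns, since pivot_table defaults to observed=False); aggfunc='sum' collapses any duplicate combinations.
df2 = df.pivot_table(index='Cust_ID',
                     columns=['MCC', 'Date'],
                     values=['TRANS_AMT', 'Frequency'],
                     aggfunc='sum',
                     fill_value=0)
df2.columns = [f'{a}_{b}_{c}' for a, b, c in df2.columns]
df2 = df2.reset_index()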

Related

Pandas: compare df and add missing rows

I have a list of dataframes which have one column in common ('label'). However, in some of the dataframes some rows are missing.
Example:
df1 = pd.DataFrame([['sample1', 2, 3], ['sample4', 7, 8]],
                   columns=['label', 'B', 'E'], index=[1, 2])
df2 = pd.DataFrame([['sample1', 20, 30], ['sample2', 70, 80], ['sample3', 700, 800]],
                   columns=['label', 'B', 'C'], index=[2, 3, 4])
I would like to add rows so the lengths of the dfs are the same, while preserving the right order. The desired output would be:
     label  B  E
1  sample1  2  3
2        0  0  0
3        0  0  0
4  sample4  7  8

     label    B    C
1  sample1   20   30
2  sample2   70   80
3  sample3  700  800
4        0    0    0
I was looking into pandas three-way joining multiple dataframes on columns, but I don't want to merge my dataframes. And pandas align() function : illustrative example doesn't give the desired output either. I was also thinking about comparing the 'label' column with a list and looping through it to add the missing rows. If somebody could point me in the right direction, that would be great.
You can get the common indices in the desired order, then reindex:
# here the order matters to get the preference;
# for a sorted order use:
# unique = sorted(pd.concat([df1['label'], df2['label']]).unique())
unique = pd.concat([df2['label'], df1['label']]).unique()

out1 = (df1.set_axis(df1['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
        )
out2 = (df2.set_axis(df2['label'])
           .reindex(unique, fill_value=0)
           .reset_index(drop=True)
        )
Output:
# out1
     label  B  E
0  sample1  2  3
1        0  0  0
2        0  0  0
3  sample4  7  8

# out2
     label    B    C
0  sample1   20   30
1  sample2   70   80
2  sample3  700  800
3        0    0    0
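Since the question mentions a whole list of DataFrames, the same pattern generalizes directly; a minimal sketch, assuming every frame has a 'label' column:
dfs = [df1, df2]  # the list of DataFrames from the question
unique = pd.concat([d['label'] for d in dfs]).unique()
outs = [d.set_axis(d['label'])
         .reindex(unique, fill_value=0)
         .reset_index(drop=True)
        for d in dfs]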

Pandas Group Columns by Value of 1 and Sort By Frequency

I have to take this dataframe:
d = {'Apple': [0,0,1,0,1,0], 'Aurora': [0,0,0,0,0,1], 'Barn': [0,1,1,0,0,0]}
df = pd.DataFrame(data=d)
   Apple  Aurora  Barn
0      0       0     0
1      0       0     1
2      1       0     1
3      0       0     0
4      1       0     0
5      0       1     0
And count the frequency of the value 1 in each column, creating a new dataframe that looks like this:
df = pd.DataFrame([['Apple',0.3333], ['Aurora',0.166666], ['Barn', 0.3333]], columns = ['index', 'value'])
    index     value
0   Apple  0.333300
1  Aurora  0.166666
2    Barn  0.333300
I have tried this:
df['freq'] = df.groupby(1)[1].transform('count')
But I get an error: KeyError: 1
So I'm not sure how to count the value 1 across rows and columns, and group by column names and the frequency of 1 in each column.
If I understand correctly, you could do simply this (the groupby(1) above raises KeyError because there is no column named 1, and no grouping is needed: the mean of a 0/1 column is exactly the fraction of ones):
freq = df.mean()
Output:
>>> freq
Apple     0.333333
Aurora    0.166667
Barn      0.333333
dtype: float64
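To get the exact two-column frame from the question, one option (my addition, not part of the original answer) is to reset the index; Series.reset_index names the former unnamed index 'index' by default:
out = df.mean().reset_index(name='value')
print(out)
#     index     value
# 0   Apple  0.333333
# 1  Aurora  0.166667
# 2    Barn  0.333333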

get first row in a group and assign values

I have a pandas dataframe in the below format
id  name  value_1  value_2
 1   def        1        0
 2   abc        0        1
I would need to sort the above dataframe based on id, name, value_1 & value_2. Following that, for every group of [id,name,value_1,value_2], get the first row and set df['result'] = 1. For the other rows in that group, set df['result'] = 0.
I do the sorting and get the first row using the below code:
df = df.sort_values(["id","name","value_1","value_2"], ascending=True)
first_row_per_group = df.groupby(["id","name","value_1","value_2"]).agg('first')
After getting the first row, I set first_row_per_group['result'] = 1. But I am not sure how to set the other (non-first) rows in each group to 0.
Any suggestions would be appreciated.
duplicated would be faster than groupby (duplicated keeps the first occurrence by default, so negating it flags exactly the first row of each group):
df = df.sort_values(['id', 'name', 'value_1', 'value_2'])
df['result'] = (~df.duplicated(['id', 'name', 'value_1', 'value_2'])).astype(int)
Use df.groupby(...).cumcount() to get a counter of rows within each group, which you can then manipulate.
In [51]: df
Out[51]:
     a  b  c
0  def  1  0
1  abc  0  1
2  def  1  0
3  abc  0  1

In [52]: df2 = df.sort_values(['a', 'b', 'c'])

In [53]: df2['result'] = df2.groupby(['a', 'b', 'c']).cumcount()

In [54]: df2['result'] = np.where(df2['result'] == 0, 1, 0)

In [55]: df2
Out[55]:
     a  b  c  result
1  abc  0  1       1
3  abc  0  1       0
0  def  1  0       1
2  def  1  0       0
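The np.where step can also be folded away, since cumcount() == 0 is True exactly on the first row of each group; a one-line variant of the same idea (my phrasing, same result):
df2['result'] = (df2.groupby(['a', 'b', 'c']).cumcount() == 0).astype(int)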

Data standardization of feat having lt/gt values among absolute values

One of the datasets I am dealing with has a few features which contain lt/gt (< / >) values along with absolute values. Please refer to the example below:
>>> df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
>>> df
   foo
0  <10
1   23
2   34
3   22
4  >90
5   42
Note: foo is a % value, i.e. 0 <= foo <= 100.
How are such data transformed to run regression models on?
One thing you could do is, for values <10, impute the midpoint of the range (5). Similarly, for those >90, impute 95.
Then add two extra boolean columns:
df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
dummies = pd.get_dummies(df, columns=['foo'])[['foo_<10', 'foo_>90']]
df = df.replace('<10', 5).replace('>90', 95)
df['foo'] = pd.to_numeric(df['foo'])  # the untouched values are still strings
df = pd.concat([df, dummies], axis=1)
df
This will give you
   foo  foo_<10  foo_>90
0    5        1        0
1   23        0        0
2   34        0        0
3   22        0        0
4   95        0        1
5   42        0        0
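If there are more thresholds than these two, a slightly more general sketch (my own variation, assuming the same midpoint-style imputation and the 0-100 scale) extracts the flags and the numbers with string methods:
import numpy as np
import pandas as pd

df = pd.DataFrame(['<10', '23', '34', '22', '>90', '42'], columns=['foo'])
df['foo_lt'] = df['foo'].str.startswith('<').astype(int)  # censored below
df['foo_gt'] = df['foo'].str.startswith('>').astype(int)  # censored above
num = pd.to_numeric(df['foo'].str.lstrip('<>'))           # drop the comparator
# midpoint imputation: '<10' -> 5, '>90' -> 95 (assumes foo is a 0-100 percentage)
df['foo'] = np.where(df['foo_lt'] == 1, num / 2,
            np.where(df['foo_gt'] == 1, (num + 100) / 2, num))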

How to expand one row to multiple rows according to its value in Pandas

This is an example DataFrame (the question originally showed it as an image):
d = {1: ['2134', 20, 1, 1, 1, 0],
     2: ['1010', 5, 1, 0, 0, 0],
     3: ['3457', 15, 0, 1, 1, 0]}
columns = ['Code', 'Price', 'Bacon', 'Onion', 'Tomato', 'Cheese']
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = columns
What I want to do is expand a single row into multiple rows, so the DataFrame looks like the 'After' image (one row per topping; see the expected output in the first answer below). The intention is to use some columns (from 'Bacon' to 'Cheese') as categories.
I tried to find the answer, but failed. Thanks.
You can first reshape with set_index and stack, then filter with query, build indicator columns from column level_2 with get_dummies, and last reindex the columns to add the missing ones (with no 1s) and reset_index:
df = (df.set_index(['Code', 'Price'])
        .stack()
        .reset_index(level=2, name='val')
        .query('val == 1')
        .level_2.str.get_dummies()
        .reindex(columns=df.columns[2:], fill_value=0)
        .reset_index())
print (df)
   Code  Price  Bacon  Onion  Tomato  Cheese
0  2134     20      1      0       0       0
1  2134     20      0      1       0       0
2  2134     20      0      0       1       0
3  1010      5      1      0       0       0
4  3457     15      0      1       0       0
5  3457     15      0      0       1       0
You can use stack and transpose to do this operation and format accordingly.
df = df.stack().to_frame().T
df.columns = ['{}_{}'.format(*c) for c in df.columns]
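Note that this stack-and-transpose snippet produces a different shape than the question asks for: it flattens the whole frame into a single wide row. A minimal sketch of what it yields, assuming the df built in the question (my illustration):
wide = df.stack().to_frame().T
wide.columns = ['{}_{}'.format(*c) for c in wide.columns]
# one row, with columns like '1_Code', '1_Price', ..., '3_Cheese'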
Use pd.melt to put all the food in one column and then pd.get_dummies to expand the columns.
df1 = pd.melt(df, id_vars=['Code', 'Price'])
df1 = df1[df1['value'] == 1]
df1 = pd.get_dummies(df1, columns=['variable'], prefix='', prefix_sep='').sort_values(['Code', 'Price'])
df1 = df1.reindex(columns=df.columns, fill_value=0)
Edited after I saw how jezrael used reindex to both add and drop a column.