pandas pivot onto values

Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot call will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto the values 0, 1 and 2, even if there is no row with PivotOn = 2 (just produce NaN for it).
I cannot modify the original dataframe, so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it would always produce the 3 columns 0, 1 and 2, with column 2 filled with NaNs.

Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (you can also assume that 2 may or may not be present in the original df, so this generalizes).
Then go as follows -
import numpy as np

expected = {2, 3, 4}
# values we need as columns but which are missing from PivotOn
res = list(expected - set(df.PivotOn.unique()))
if res:
    # add one dummy row per missing value so that pivot creates those columns
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    # the dummy rows produce an all-NaN row under Key=NaN, which dropna(how='all') removes
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
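If adding dummy rows feels heavy, the same fixed set of columns can be had without touching the original df by reindexing the pivoted result; a minimal sketch (not part of the original answer), using the same df and target columns as above:
# pivot as usual, then force the columns to exactly 0..4;
# any value that never occurred in PivotOn becomes an all-NaN column
pivoted = (df.pivot(index='Key', columns='PivotOn', values='Value')
             .reindex(columns=[0, 1, 2, 3, 4]))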

You might try this if all you need is a column 2 filled with NaNs when that value does not exist in your dataframe:
def no_col_2(df):
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    # check the column's values, not its index; add an all-NaN column 2 if it never occurred
    if 2 not in df['PivotOn'].values:
        pivoted[2] = np.nan
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn 0 1 2
Key
1 11 12 NaN
2 21 22 NaN

Related

Python Data Frame - How can I evaluate/use a column being created on the fly

Suppose that I have a dataframe as follows:
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A | 10 | 10 |
| B | 20 | NaN |
| C | 25 | NaN |
| D | 30 | NaN |
+---------+-------+------------+
The above can be created using below code:
data = {'Product':['A', 'B', 'C', 'D'],
'Price':[10, 20, 25, 30],
'Calculated':[10, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
I want to update the column Calculated on the fly. For the 2nd row, Calculated = previous Calculated / previous Price, i.e. Calculated at row 2 is 10/10 = 1.
Now that we have the Calculated value for row 2, row 3's Calculated would be 1/20 = 0.05, and so on and so forth.
Expected Output
+---------+-------+------------+
| Product | Price | Calculated |
+---------+-------+------------+
| A | 10 | 10 |
| B | 20 | 1 |
| C | 25 | 0.05 |
| D | 30 | 0.002 |
+---------+-------+------------+
The above can be achieved using loops, but I don't want to use loops; instead I need a vectorized approach to update the column Calculated. How can I achieve that?
You are looking for cumprod with a shift:
# also `df['Calculated'].iloc[0]` instead of `.ffill()`
df['Calculated'] = df['Calculated'].ffill()/df.Price.cumprod().shift(fill_value=1)
Output:
Product Price Calculated
0 A 10 10.000
1 B 20 1.000
2 C 25 0.050
3 D 30 0.002
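To see why this works, here is a small sketch of the intermediate pieces (run against the original df, before Calculated is overwritten):
# running product of prices:            10, 200, 5000, 150000
# shifted down one row, seeded with 1:   1,  10,  200,   5000
denom = df['Price'].cumprod().shift(fill_value=1)

# the initial Calculated value carried forward: 10, 10, 10, 10
numer = df['Calculated'].ffill()

# the recurrence "previous Calculated / previous Price" unrolls to
# the seed divided by the product of all previous prices
print(numer / denom)   # 10.000, 1.000, 0.050, 0.002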

Replacing NAN in a dataframe with unique key

I have the following dataframe:
df
+-------+------+------+------+--------+--------+
| Index | key1 | key2 | key3 | value1 | Value2 |
+-------+------+------+------+--------+--------+
| 0     | 1    | 3    | 4    | 6      | 4      |
| 1     | 1    | 3    | 4    | Nan    | 3      |
| 2     | 1    | 2    | 3    | 8      | 6      |
| 3     | 1    | 2    | 3    | Nan    | 5      |
| 4     | 5    | 7    | 1    | Nan    | 2      |
+-------+------+------+------+--------+--------+
For rows with the same keys (key1, key2, key3), I want to keep the numeric value1 and drop the rows where value1 is not numeric; if a key group has no numeric value1 at all, remove it entirely. For Value2, I simply want the sum over the group.
Desired df
+-------+------+------+------+--------+--------+
| Index | key1 | key2 | key3 | value1 | value2 |
+-------+------+------+------+--------+--------+
| 0     | 1    | 3    | 4    | 6      | 7      |
| 2     | 1    | 2    | 3    | 8      | 11     |
+-------+------+------+------+--------+--------+
Retaining the correct index is not important.
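For reference, a minimal construction of the example frame above; the assumption here is that value1 holds the literal string 'Nan', which is why the answers below coerce it with pd.to_numeric:
import pandas as pd

df = pd.DataFrame({'key1':   [1, 1, 1, 1, 5],
                   'key2':   [3, 3, 2, 2, 7],
                   'key3':   [4, 4, 3, 3, 1],
                   'value1': [6, 'Nan', 8, 'Nan', 'Nan'],   # assumed string placeholders
                   'Value2': [4, 3, 6, 5, 2]})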
If there is only one non-NaN value1 per group, use GroupBy.agg with 'first' for the first non-NaN value and aggregate the sum at the same time:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
df = (df.groupby(['key1','key2','key3'], as_index=False, sort=False)
        .agg(value1=('value1','first'), Value2=('Value2','sum'))
        .dropna(subset=['value1']))
print (df)
key1 key2 key3 value1 Value2
0 1 3 4 6.0 7
1 1 2 3 8.0 11
If multiple non-NaN values per group are possible, first aggregate the sum, then remove the missing rows with DataFrame.dropna, drop the original Value2 column and add the aggregated sums back with DataFrame.join:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
s = df.groupby(['key1','key2','key3'])['Value2'].sum()
df = (df.dropna(subset=['value1'])
        .drop(columns='Value2')
        .join(s, on=['key1','key2','key3']))
print (df)
   key1  key2  key3  value1  Value2
0     1     3     4     6.0       7
2     1     2     3     8.0      11
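The same result can also be reached with GroupBy.transform, which broadcasts the group sum back onto every row; a small sketch, starting again from the original frame with value1 already coerced to numeric as above:
# overwrite Value2 with its per-group sum, then drop the rows whose value1 is NaN
df2 = (df.assign(Value2=df.groupby(['key1','key2','key3'])['Value2'].transform('sum'))
         .dropna(subset=['value1']))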

Get the difference of two columns with an offset/rolling/shift of 1

I have two columns A and B and would like to create a new_col, which is the difference between the current B and the previous A, where 'previous' means the row before the current row. How can this be achieved (maybe even with a variable offset)?
Target:
df
| A | B | new_col |
|---|----|----------|
| 1 | 2 | nan (or 2)|
| 3 | 4 | 3 |
| 5 | 10 | 7 |
Pseudo code:
new_col[0] = B[0] - 0
new_col[1] = B[1] - A[0]
new_col[2] = B[2] - A[1]
Use Series.shift:
df['new_col'] = df['B'] - df['A'].shift()
A B new_col
0 1 2 NaN
1 3 4 3.0
2 5 10 7.0
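For the variable offset mentioned in the question, shift takes a periods argument; a quick sketch (n is a hypothetical offset, not from the original post):
n = 2                                        # look two rows back instead of one
df['new_col'] = df['B'] - df['A'].shift(n)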

Cumulative count of unique values in pandas

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
It is straightforward to do this as a loop, for example this works:
running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar sounding question here, but looking at the accepted answer (here) the original poster does not want uniqueness across dates cumulatively, as I do.
How would I do this vectorised in pandas?
The idea is to create a list per (user_id, week) group, then use np.cumsum to build cumulative lists within each user, and finally convert the values to sets and take their length:
df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cumulative_module_count'))
print (df1)
user_id week cumulative_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
Jezrael's answer can be tightened slightly by deduplicating module_id within each (user_id, week) group first (Series.unique), so the cumulative concatenation runs over shorter lists:
df1 = (df.groupby(['user_id', 'week'])['module_id']
         .apply(lambda x: list(x.unique()))
         .groupby('user_id')
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .rename('cumulated_module_count')
         .reset_index(drop=False))
print(df1)
user_id week cumulated_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
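If building lists at all feels wasteful, here is a sketch of an alternative that only counts first occurrences; it assumes the rows are sorted by week within each user, as in the example (otherwise sort first):
# flag the first time each (user_id, module_id) pair appears
first = ~df.duplicated(['user_id', 'module_id'])

# sum the newly seen modules per (user_id, week), then take a running total per user
df1 = (df.assign(new=first.astype(int))
         .groupby(['user_id', 'week'])['new'].sum()
         .groupby(level=0).cumsum()
         .reset_index(name='cumulative_module_count'))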

Find which values are in every group in pandas

Is there a way of aggregating or transforming in pandas that would give me the list of values that are present in every group?
For example, taking this data
+---------+-----------+
| user_id | module_id |
+---------+-----------+
| 1 | A |
| 1 | B |
| 1 | C |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | B |
| 3 | C |
| 3 | D |
| 3 | E |
+---------+-----------+
how would I complete this code
df.groupby('user_id')
to give the result B, the only module_id that is in each of the groups?
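For reference, the example frame above can be built like this:
import pandas as pd

df = pd.DataFrame({'user_id':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'module_id': ['A', 'B', 'C', 'A', 'B', 'D', 'B', 'C', 'D', 'E']})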
Use get_dummies to build an indicator DataFrame, take the max per user, and then keep only the all-1 columns - the 1 values are treated like True by DataFrame.all:
cols = (pd.get_dummies(df.set_index('user_id')['module_id'])
          .groupby(level=0).max()   # one indicator row per user
          .loc[:, lambda x: x.all()].columns)
print (cols)
Index(['B'], dtype='object')
Similar solution:
df1 = pd.get_dummies(df.set_index('user_id')['module_id']).groupby(level=0).max()
print (df1)
A B C D E
user_id
1 1 1 1 0 0
2 1 1 0 1 0
3 0 1 1 1 1
cols = df1.columns[df1.all()]
More solutions:
cols = df.groupby(['module_id', 'user_id']).size().unstack().dropna().index
print (cols)
Index(['B'], dtype='object', name='module_id')
cols = df.pivot_table(index='module_id', columns='user_id', aggfunc='size').dropna().index
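One more way, not from the original answers - intersect the per-user sets of module_id directly:
# one set of modules per user, then the intersection of all of them
common = set.intersection(*df.groupby('user_id')['module_id'].apply(set))
print(common)   # {'B'}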