I have the following dataframe:
df
Index key1 | key2 | key3 | value1 | Value2
0 1 | 3 | 4 | 6 | 4
1 1 | 3 | 4 | Nan | 3
2 1 | 2 | 3 | 8 | 6
3 1 | 2 | 3 | Nan | 5
4 5 | 7 | 1 | Nan | 2
For rows that share the same keys (key1, key2, key3), I want to keep the numeric value1, and whenever a group has no numeric value1 I want to remove the row. For Value2, I simply want the sum.
Desired df
Index key1 | key2 | key3 | value1 | Value2
0 1 | 3 | 4 | 6 | 7
2 1 | 2 | 3 | 8 | 11
Retaining the correct index is not important.
If there is only one non-NaN value per group, use GroupBy.agg with GroupBy.first to take the first non-NaN value1, and aggregate Value2 with sum:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
df = (df.groupby(['key1','key2','key3'], as_index=False, sort=False)
        .agg(value1=('value1','first'), Value2=('Value2','sum'))
        .dropna(subset=['value1']))
print (df)
key1 key2 key3 value1 Value2
0 1 3 4 6.0 7
1 1 2 3 8.0 11
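To try this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question, assuming the "Nan" cells are real missing values (np.nan) rather than strings, so the pd.to_numeric step is not needed. The name out is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question; missing value1 cells are np.nan (assumption).
df = pd.DataFrame({
    "key1": [1, 1, 1, 1, 5],
    "key2": [3, 3, 2, 2, 7],
    "key3": [4, 4, 3, 3, 1],
    "value1": [6, np.nan, 8, np.nan, np.nan],
    "Value2": [4, 3, 6, 5, 2],
})

out = (
    df.groupby(["key1", "key2", "key3"], as_index=False, sort=False)
      # 'first' skips NaN, so each group keeps its first numeric value1
      .agg(value1=("value1", "first"), Value2=("Value2", "sum"))
      # groups with no numeric value1 at all are still NaN here and get dropped
      .dropna(subset=["value1"])
)
print(out)
```

The group (5, 7, 1) has no numeric value1, so it disappears after dropna, while its Value2 never contributes to another group's sum.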
If a group can contain multiple non-NaN values, first aggregate Value2 with sum, then remove the rows with missing value1 using DataFrame.dropna, drop the original Value2 column, and add the summed Series back with DataFrame.join:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
s = df.groupby(['key1','key2','key3'])['Value2'].sum()
df = df.dropna(subset=['value1']).drop(columns='Value2').join(s, on=['key1','key2','key3'])
print (df)
Index key1 key2 key3 value1 Value2
0 0 1 3 4 6.0 7
2 2 1 2 3 8.0 11
Stupid question: I have two columns A and B and would like to create a new_col, which is the difference between the current B and the previous A. "Previous" means the row before the current row. How can this be achieved (maybe even with a variable offset)?
Target:
df
| A | B | new_col |
|---|----|----------|
| 1 | 2 | nan (or 2) |
| 3 | 4 | 3 |
| 5 | 10 | 7 |
Pseudo code:
new_col[0] = B[0] - 0
new_col[1] = B[1] - A[0]
new_col[2] = B[2] - A[1]
Use Series.shift:
df['new_col'] = df['B'] - df['A'].shift()
A B new_col
0 1 2 NaN
1 3 4 3.0
2 5 10 7.0
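The question also asks about a variable offset and about getting 2 instead of NaN in the first row. A minimal sketch of both, where the variable name offset is my own and fill_value=0 mirrors the pseudo code new_col[0] = B[0] - 0:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 10]})

offset = 1  # hypothetical name: how many rows back to look in column A

# fill_value=0 treats the missing "previous A" as 0 instead of NaN,
# which also keeps the result as integers.
df["new_col"] = df["B"] - df["A"].shift(offset, fill_value=0)
print(df)
```

Leaving out fill_value gives NaN for the first offset rows, as in the output above.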
Is there a way of aggregating or transforming in pandas that would give me the list of values that are present in every group?
For example, taking this data
+---------+-----------+
| user_id | module_id |
+---------+-----------+
| 1 | A |
| 1 | B |
| 1 | C |
| 2 | A |
| 2 | B |
| 2 | D |
| 3 | B |
| 3 | C |
| 3 | D |
| 3 | E |
+---------+-----------+
how would I complete this code
df.groupby('user_id')
to give the result B, the only module_id that is in each of the groups?
Use get_dummies with max per user_id to build an indicator DataFrame, then keep only the columns that contain nothing but 1s. DataFrame.all treats 1 values like True:
cols = (pd.get_dummies(df.set_index('user_id')['module_id'])
          # groupby(level=0).max() replaces max(level=0), which was removed in pandas 2.0
          .groupby(level=0).max()
          .loc[:, lambda x: x.all()].columns)
print (cols)
Index(['B'], dtype='object')
Similar solution:
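# dtype=int keeps the 0/1 display on newer pandas, where get_dummies defaults to bool
df1 = pd.get_dummies(df.set_index('user_id')['module_id'], dtype=int).groupby(level=0).max()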
print (df1)
A B C D E
user_id
1 1 1 1 0 0
2 1 1 0 1 0
3 0 1 1 1 1
cols = df1.columns[df1.all()]
More solutions:
cols = df.groupby(['module_id', 'user_id']).size().unstack().dropna().index
print (cols)
Index(['B'], dtype='object', name='module_id')
cols = df.pivot_table(index='module_id', columns='user_id', aggfunc='size').dropna().index
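Since the question starts from df.groupby('user_id'), another option is to collect one set of module_ids per user and intersect them. This is a sketch of an alternative, not one of the solutions above. The names per_user and common are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "module_id": ["A", "B", "C", "A", "B", "D", "B", "C", "D", "E"],
})

# One set of module_ids per user_id ...
per_user = df.groupby("user_id")["module_id"].apply(set)

# ... then keep only the values that appear in every set.
common = set.intersection(*per_user)
print(common)
```

This returns a plain Python set rather than an Index, which may be convenient when the result is only needed for a membership check.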
I need help using crosstab on the df below.
a b c
-------------------------
| a | None | c |
| a | b | None |
| None | b | c |
| a | None | None |
| None | None | None |
I want to pull rows where more than one letter is specified (a&b, a&c, b&c), i.e. rows 1-3. I believe the easiest way to do this is through crosstab (I know I'll get a count, but can I also view the rows through this method?). I want to avoid having to write a lengthy 'or' statement to achieve this.
Desired Output:
a b c
-------------------------
| a | None | c |
| a | b | None |
| None | b | c |
You aren't looking for crosstab; just check the number of non-nulls per row using notnull:
df[df.notnull().sum(1).gt(1)]
a b c
0 a NaN c
1 a b NaN
2 NaN b c
Or you can use dropna:
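# thresh is the minimum number of non-null values a row needs to be kept;
# with 3 columns and t = 2 it evaluates to 3 - 2 + 1 = 2
t = 2
df.dropna(thresh=df.shape[1] - t + 1)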
a b c
0 a NaN c
1 a b NaN
2 NaN b c
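A quick way to check that the two approaches agree is to run both on the sample frame. This is a minimal sketch, assuming the None cells are real missing values and not the string 'None'. The name min_filled is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["a", "a", None, "a", None],
    "b": [None, "b", "b", None, None],
    "c": ["c", None, "c", None, None],
})

min_filled = 2  # hypothetical name: keep rows with at least this many letters

# Boolean mask: count non-null cells per row.
by_mask = df[df.notna().sum(axis=1) >= min_filled]

# dropna with thresh: same rule, expressed as a minimum count of non-null cells.
by_thresh = df.dropna(thresh=min_filled)

print(by_mask.equals(by_thresh))
```

Passing the minimum count directly as thresh avoids the df.shape[1] - t + 1 arithmetic when the goal is already phrased as "at least n values present".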
Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto values 0, 1 and 2, even if there might not exist a row that has PivotOn = 2 (just produce nan for it).
I cannot modify the original dataframe, so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it will always produce 3 columns of 0, 1 and 2 and column 2 is filled with nans.
Assume PivotOn has three unique values, 0, 1 and 2:
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
Say you need the result to include columns 2, 3 and 4 (to keep this general, 2 may or may not be present in the original df).
Then proceed as follows:
expected = {2, 3, 4}
# values that must appear as columns but are missing from PivotOn
res = sorted(expected - set(df.PivotOn.unique()))
if res:  # at least one expected column is missing
    # placeholder rows with NaN Key and Value force pivot to create the missing columns
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res}, index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaN when that value does not exist in your dataframe:
def no_col_2(df):
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    # `in` on a Series tests the index labels, so check the values instead
    if 2 not in df['PivotOn'].values:
        pivoted[2] = np.nan
    return pivoted
pivoted = no_col_2(df)
print(pivoted)
PivotOn 0 1 2
Key
1 11 12 NaN
2 21 22 NaN
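Another option that leaves the original dataframe untouched is to pivot as usual and then reindex the columns to the full list you want. This is a sketch of an alternative, not taken from the answers above. The name wanted is hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[1, 11, 0], [1, 12, 1], [2, 21, 0], [2, 22, 1]],
                  columns=["Key", "Value", "PivotOn"])

wanted = [0, 1, 2]  # hypothetical name: the columns that must always be present

# reindex adds any missing column filled with NaN and drops columns not listed.
pivoted = (df.pivot(index="Key", columns="PivotOn", values="Value")
             .reindex(columns=wanted))
print(pivoted)
```

Note that reindex also removes any pivoted column that is not in wanted, so list every value you expect to keep.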