Replacing NaN in a dataframe with unique key - pandas

I have the following dataframe:
df
Index  key1 | key2 | key3 | value1 | Value2
0         1 |    3 |    4 |      6 |      4
1         1 |    3 |    4 |    NaN |      3
2         1 |    2 |    3 |      8 |      6
3         1 |    2 |    3 |    NaN |      5
4         5 |    7 |    1 |    NaN |      2
For rows with the same keys (key1, key2, key3), I want to keep the numeric value1, and whenever a group has no numeric value1 I want to remove its rows. For value2, I simply want the sum per group.
Desired df
Index  key1 | key2 | key3 | value1 | value2
0         1 |    3 |    4 |      6 |      7
2         1 |    2 |    3 |      8 |     11
Retaining the correct index is not important.

If there is only one non-NaN value per group, use GroupBy.agg with 'first' to take the first non-NaN value1 and aggregate Value2 with sum:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
df = (df.groupby(['key1','key2','key3'], as_index=False, sort=False)
        .agg(value1=('value1','first'), Value2=('Value2','sum'))
        .dropna(subset=['value1']))
print (df)
key1 key2 key3 value1 Value2
0 1 3 4 6.0 7
1 1 2 3 8.0 11
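For a self-contained run, the frame from the question can be reconstructed roughly like this (treating the literal string 'Nan' in value1 as an assumption about how the data arrives, which is why the to_numeric coercion above is needed):
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':   [1, 1, 1, 1, 5],
                   'key2':   [3, 3, 2, 2, 7],
                   'key3':   [4, 4, 3, 3, 1],
                   'value1': [6, 'Nan', 8, 'Nan', 'Nan'],
                   'Value2': [4, 3, 6, 5, 2]})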
If there can be multiple non-NaN values per group, first aggregate the sum, then remove rows with missing value1 using DataFrame.dropna, drop the original Value2 column, and add the aggregated Series back with DataFrame.join:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
s = df.groupby(['key1','key2','key3'])['Value2'].sum()
df = df.dropna(subset=['value1']).drop(columns='Value2').join(s, on=['key1','key2','key3'])
print (df)
   key1  key2  key3  value1  Value2
0     1     3     4     6.0       7
2     1     2     3     8.0      11
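A minimal alternative sketch with GroupBy.transform gives the same result without the drop/join pair (same column names as in the question assumed):
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
# broadcast the per-group sum of Value2 back onto every row, then keep only rows with a numeric value1
df['Value2'] = df.groupby(['key1','key2','key3'])['Value2'].transform('sum')
df = df.dropna(subset=['value1'])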

Related

Get the difference of two columns with an offset/rolling/shift of 1

Stupid question: I have two columns A and B and would like to create new_col, which is the difference between the current B and the previous A ('previous' meaning the row before the current row). How can this be achieved (maybe even with a variable offset)?
Target:
df
| A | B  | new_col    |
|---|----|------------|
| 1 | 2  | NaN (or 2) |
| 3 | 4  | 3          |
| 5 | 10 | 7          |
Pseudo code:
new_col[0] = B[0] - 0
new_col[1] = B[1] - A[0]
new_col[2] = B[2] - A[1]
Use Series.shift:
df['new_col'] = df['B'] - df['A'].shift()
A B new_col
0 1 2 NaN
1 3 4 3.0
2 5 10 7.0
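For the variable offset mentioned in the question, shift accepts the number of rows to shift by; a small sketch with an assumed offset n:
n = 2   # hypothetical offset; n=1 reproduces the result above
df['new_col'] = df['B'] - df['A'].shift(n)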

Cumulative count of unique values in pandas

I would like to cumulatively count unique values from a column in a pandas frame by week. For example, imagine that I have data like this:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],'week':[1,1,2,1,2,2],'module_id':['A','B','A','A','B','C']})
+---+---------+------+-----------+
| | user_id | week | module_id |
+---+---------+------+-----------+
| 0 | 1 | 1 | A |
| 1 | 1 | 1 | B |
| 2 | 1 | 2 | A |
| 3 | 2 | 1 | A |
| 4 | 2 | 2 | B |
| 5 | 2 | 2 | C |
+---+---------+------+-----------+
What I want is a running count of the number of unique module_ids up to each week, i.e. something like this:
+---+---------+------+-------------------------+
| | user_id | week | cumulative_module_count |
+---+---------+------+-------------------------+
| 0 | 1 | 1 | 2 |
| 1 | 1 | 2 | 2 |
| 2 | 2 | 1 | 1 |
| 3 | 2 | 2 | 3 |
+---+---------+------+-------------------------+
It is straightforward to do this as a loop, for example this works:
running_tally = {}
result = {}
for index, row in df.iterrows():
    if row['user_id'] not in running_tally:
        running_tally[row['user_id']] = set()
        result[row['user_id']] = {}
    running_tally[row['user_id']].add(row['module_id'])
    result[row['user_id']][row['week']] = len(running_tally[row['user_id']])
print(result)
{1: {1: 2, 2: 2}, 2: {1: 1, 2: 3}}
But my real data frame is enormous and so I would like a vectorised algorithm instead of a loop.
There's a similar-sounding question here, but looking at the accepted answer (here), the original poster does not want uniqueness across dates cumulatively, as I do.
How would I do this vectorised in pandas?
The idea is to create lists per group (grouping by both columns), then use np.cumsum to build cumulative lists, and finally convert each value to a set and take its length:
import numpy as np

df1 = (df.groupby(['user_id','week'])['module_id']
         .apply(list)
         .groupby(level=0)
         .apply(np.cumsum)
         .apply(lambda x: len(set(x)))
         .reset_index(name='cumulative_module_count'))
print (df1)
user_id week cumulative_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
Jezrael's answer can be slightly improved by using pipe instead of apply(list), which should be faster, and then using np.unique instead of the trick with np.cumsum:
df1 = (df.groupby(['user_id', 'week']).pipe(lambda x: x.apply(np.unique))
         .groupby('user_id')
         .apply(np.cumsum)
         .apply(np.sum)
         .apply(lambda x: len(set(x)))
         .rename('cumulated_module_count')
         .reset_index(drop=False))
print(df1)
print(df1)
user_id week cumulated_module_count
0 1 1 2
1 1 2 2
2 2 1 1
3 2 2 3
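If performance matters on a very large frame, here is a hedged alternative sketch that avoids the Python-level apply over lists (the helper names first_seen, counts and weeks are purely illustrative, column names are taken from the question, and it assumes each user's rows are ordered by week):
# keep only the first time each user sees a module
first_seen = df.drop_duplicates(['user_id', 'module_id'])
# count new modules per (user, week) and accumulate within each user
counts = (first_seen.groupby(['user_id', 'week']).size()
                    .groupby(level=0).cumsum())
# align with every (user, week) pair that occurs in the data and forward-fill
weeks = df[['user_id', 'week']].drop_duplicates().set_index(['user_id', 'week']).index
result = (counts.reindex(weeks)
                .groupby(level=0).ffill()
                .astype(int)
                .rename('cumulative_module_count')
                .reset_index())
print(result)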

pandas pivot onto values

Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto values 0, 1 and 2, even if there might not exist a row that has PivotOn = 2 (just produce nan for it).
I cannot modify original dataframe so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it will always produce 3 columns of 0, 1 and 2 and column 2 is filled with nans.
Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
| | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 | 1 | 11 | 0 |
| 1 | 1 | 12 | 1 |
| 2 | 2 | 21 | 0 |
| 3 | 2 | 22 | 2 |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (you can also assume that 2 may or may not be present in the original df, so this generalizes).
Then proceed as follows:
expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))

if res:  # some of the expected PivotOn values are missing from df
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
+---------+------+------+------+-----+-----+
| PivotOn | 0 | 1 | 2 | 3 | 4 |
+---------+------+------+------+-----+-----+
| Key | | | | | |
| 1.0 | 11.0 | 12.0 | NaN | NaN | NaN |
| 2.0 | 21.0 | NaN | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaNs when it does not exist in your dataframe:
def no_col_2(df):
    # pivot first, then add the missing column 2 if the value never occurs
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    if 2 not in df['PivotOn'].values:
        pivoted[2] = np.nan
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn 0 1 2
Key
1 11 12 NaN
2 21 22 NaN
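A simpler variant, assuming the full set of wanted columns is known up front, is to pivot and then reindex the columns; any column not produced by the pivot is filled with NaN:
pivoted = (df.pivot(index='Key', columns='PivotOn', values='Value')
             .reindex(columns=[0, 1, 2]))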

How to keep the last value in Pandas without removing the rows

I am working on a dataset in which I want to attribute the last action of a user to a certain goal. In the process I arrive at the table below.
table
date       | action_id | u_id   | goal
2016-01-08 | CUID22    | 586758 | 'Goal#1'
2017-03-04 | CUID45    | 586758 | 'Goal#1'
2018-09-01 | CUID30    | 586758 | 'Goal#1'
How can I remove/replace the first two u_id and goal values, whilst keeping the rows, to arrive at the table below?
table
date       | action_id | u_id   | goal
2016-01-08 | CUID22    | NaN    | NaN
2017-03-04 | CUID45    | NaN    | NaN
2018-09-01 | CUID30    | 586758 | 'Goal#1'
I believe you need duplicated:
cols = ['u_id','goal']
df.loc[df.duplicated(cols, keep='last'), cols] = np.nan
Or:
cols = ['u_id','goal']
df[cols] = df[cols].mask(df.duplicated(cols, keep='last'))
print (df)
         date action_id      u_id    goal
0  2016-01-08    CUID22       NaN     NaN
1  2017-03-04    CUID45       NaN     NaN
2  2018-09-01    CUID30  586758.0  Goal#1
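For reference, on the original frame (before the NaN assignment) the mask that drives both variants flags every row except the last occurrence of each (u_id, goal) pair:
print(df.duplicated(cols, keep='last'))
0     True
1     True
2    False
dtype: bool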

graphlab find all the columns that has at least one None value

How should one find all the columns in SFrame that has at least one None value in it? One way to do this would be to iterate through every column and check if any value in that column is None or not. Is there a better way to do the job?
To find None values in an SFrame use the SArray method num_missing (doc).
Solution
>>> col_w_none = [col for col in sf.column_names() if sf[col].num_missing()>0]
Example
>>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4]})
>>> print sf
+------+-----+
| bar | foo |
+------+-----+
| 1 | 1 |
| None | 2 |
| 3 | 3 |
| 4 | 4 |
+------+-----+
[4 rows x 2 columns]
>>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
['bar']
Caveats
It isn't optimal, since it won't stop iterating at the first None value.
It won't detect NaN or empty strings.
>>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4], 'baz':[1,2,float('nan'),4], 'qux':['spam', '', 'ham', 'eggs']} )
>>> print sf
+------+-----+-----+------+
| bar | baz | foo | qux |
+------+-----+-----+------+
| 1 | 1.0 | 1 | spam |
| None | 2.0 | 2 | |
| 3 | nan | 3 | ham |
| 4 | 4.0 | 4 | eggs |
+------+-----+-----+------+
[4 rows x 4 columns]
>>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
['bar']
Here is a Pandas solution:
In [50]: df
Out[50]:
keys values
0 1 1.0
1 2 2.0
2 2 3.0
3 3 4.0
4 3 5.0
5 3 NaN
6 3 7.0
In [51]: df.columns.to_series()[df.isnull().any()]
Out[51]:
values values
dtype: object
In [52]: df.columns.to_series()[df.isnull().any()].tolist()
Out[52]: ['values']
Explanation:
In [53]: df.isnull().any()
Out[53]:
keys False
values True
dtype: bool
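A slightly shorter variant of the same check uses the boolean Series to index the columns directly:
In [54]: df.columns[df.isnull().any()].tolist()
Out[54]: ['values']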
You can use isnull:
pd.isnull(df).sum() > 0
Example:
df = pd.DataFrame({'col1':['A', 'A', 'B','B'], 'col2': ['B','B','C','C'], 'col3': ['C','C','A','A'], 'col4': [11,12,13,np.nan], 'col5': [30,10,14,91]})
df
col1 col2 col3 col4 col5
0 A B C 11.0 30
1 A B C 12.0 10
2 B C A 13.0 14
3 B C A NaN 91
pd.isnull(df).sum() > 0
col1 False
col2 False
col3 False
col4 True
col5 False
dtype: bool
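Dropping the > 0 comparison also gives the per-column NaN counts, which can be handy on their own (same example df as above):
pd.isnull(df).sum()
col1    0
col2    0
col3    0
col4    1
col5    0
dtype: int64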