graphlab: find all the columns that have at least one None value - pandas

How should one find all the columns in an SFrame that have at least one None value in them? One way would be to iterate through every column and check whether any value in that column is None. Is there a better way to do the job?

To find None values in an SFrame, use the SArray method num_missing (doc).
Solution
>>> col_w_none = [col for col in sf.column_names() if sf[col].num_missing()>0]
Example
>>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4]})
>>> print sf
+------+-----+
| bar  | foo |
+------+-----+
|  1   |  1  |
| None |  2  |
|  3   |  3  |
|  4   |  4  |
+------+-----+
[4 rows x 2 columns]
>>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
['bar']
Caveats
It isn't optimal, since it won't stop iterating at the first None value.
It won't detect NaN or empty strings.
>>> sf = gl.SFrame({'foo':[1,2,3,4], 'bar':[1,None,3,4], 'baz':[1,2,float('nan'),4], 'qux':['spam', '', 'ham', 'eggs']})
>>> print sf
+------+-----+-----+------+
| bar  | baz | foo | qux  |
+------+-----+-----+------+
|  1   | 1.0 |  1  | spam |
| None | 2.0 |  2  |      |
|  3   | nan |  3  | ham  |
|  4   | 4.0 |  4  | eggs |
+------+-----+-----+------+
[4 rows x 4 columns]
>>> print [col for col in sf.column_names() if sf[col].num_missing()>0]
['bar']
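If you also need to catch NaN and empty strings, one way is to treat them as missing explicitly. A minimal pandas sketch (not part of the original answer; assumes the same data as above):
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3, 4],
                   'bar': [1, None, 3, 4],
                   'baz': [1, 2, float('nan'), 4],
                   'qux': ['spam', '', 'ham', 'eggs']})

# isnull() catches both None and NaN; eq('') catches empty strings.
missing = df.isnull() | df.eq('')
print(df.columns[missing.any()].tolist())  # ['bar', 'baz', 'qux']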

Here is a Pandas solution:
In [50]: df
Out[50]:
   keys  values
0     1     1.0
1     2     2.0
2     2     3.0
3     3     4.0
4     3     5.0
5     3     NaN
6     3     7.0
In [51]: df.columns.to_series()[df.isnull().any()]
Out[51]:
values    values
dtype: object
In [52]: df.columns.to_series()[df.isnull().any()].tolist()
Out[52]: ['values']
Explanation:
In [53]: df.isnull().any()
Out[53]:
keys      False
values     True
dtype: bool
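Since df.isnull().any() is a boolean Series indexed by the column names, you can also index df.columns with it directly and skip the to_series step (a small simplification, not from the original answer):
In [54]: df.columns[df.isnull().any()].tolist()
Out[54]: ['values']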

You can use isnull:
pd.isnull(df).sum() > 0
Example:
df = pd.DataFrame({'col1':['A', 'A', 'B','B'], 'col2': ['B','B','C','C'], 'col3': ['C','C','A','A'], 'col4': [11,12,13,np.nan], 'col5': [30,10,14,91]})
df
  col1 col2 col3  col4  col5
0    A    B    C  11.0    30
1    A    B    C  12.0    10
2    B    C    A  13.0    14
3    B    C    A   NaN    91
pd.isnull(df).sum() > 0
col1    False
col2    False
col3    False
col4     True
col5    False
dtype: bool
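To turn that boolean Series into the list of offending column names (a small addition to the answer above):
df.columns[pd.isnull(df).sum() > 0].tolist()  # ['col4']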

Related

Replacing NaN in a dataframe with unique key

I have the following dataframe:
df
Index  key1 | key2 | key3 | value1 | Value2
0       1   |  3   |  4   |   6    |   4
1       1   |  3   |  4   |  NaN   |   3
2       1   |  2   |  3   |   8    |   6
3       1   |  2   |  3   |  NaN   |   5
4       5   |  7   |  1   |  NaN   |   2
For rows with the same keys (key1, key2, key3), I want to use the numeric value1, and whenever there is no numeric value I want to remove the row. For Value2, I simply want the sum per group.
Desired df
Index  key1 | key2 | key3 | value1 | value2
0       1   |  3   |  4   |   6    |   7
2       1   |  2   |  3   |   8    |  11
Retaining the correct index is not important.
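To make the snippets below reproducible, the question's frame can be rebuilt like this (a sketch; the 'NaN' cells become np.nan):
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1':   [1, 1, 1, 1, 5],
                   'key2':   [3, 3, 2, 2, 7],
                   'key3':   [4, 4, 3, 3, 1],
                   'value1': [6, np.nan, 8, np.nan, np.nan],
                   'Value2': [4, 3, 6, 5, 2]})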
If there is at most one non-NaN value1 per group, use GroupBy.agg with 'first' for the first non-NaN value and 'sum' for the other aggregation:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
df = (df.groupby(['key1','key2','key3'], as_index=False, sort=False)
        .agg(value1=('value1','first'), Value2=('Value2','sum'))
        .dropna(subset=['value1']))
print (df)
   key1  key2  key3  value1  Value2
0     1     3     4     6.0       7
1     1     2     3     8.0      11
If multiple non-NaN values per group are possible, first aggregate the sum, then remove missing rows with DataFrame.dropna, drop the column used for the aggregated sum, and add the new Series back with DataFrame.join:
df['value1'] = pd.to_numeric(df['value1'], errors='coerce')
s = df.groupby(['key1','key2','key3'])['Value2'].sum()
df = (df.dropna(subset=['value1'])
        .drop(columns='Value2')
        .join(s, on=['key1','key2','key3']))
print (df)
   key1  key2  key3  value1  Value2
0     1     3     4     6.0       7
2     1     2     3     8.0      11

Get the difference of two columns with an offset/rolling/shift of 1

Stupid question: I have two columns A and B and would like to create a new_col, which is the difference between the current B and the previous A ('previous' meaning the row before the current row). How can this be achieved (maybe even with a variable offset)?
Target:
df
| A | B  | new_col    |
|---|----|------------|
| 1 | 2  | NaN (or 2) |
| 3 | 4  | 3          |
| 5 | 10 | 7          |
Pseudo code:
new_col[0] = B[0] - 0
new_col[1] = B[1] - A[0]
new_col[2] = B[2] - A[1]
Use Series.shift:
df['new_col'] = df['B'] - df['A'].shift()
   A   B  new_col
0  1   2      NaN
1  3   4      3.0
2  5  10      7.0
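For the variable offset the question mentions, shift takes the number of periods as its first argument, so the offset drops straight in (a small extension of the answer; n is a hypothetical offset):
n = 2  # hypothetical offset: subtract A from two rows back
df['new_col_n'] = df['B'] - df['A'].shift(n)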

Find which values are in every group in pandas

Is there a way of aggregating or transforming in pandas that would give me the list of values that are present in every group?
For example, taking this data
+---------+-----------+
| user_id | module_id |
+---------+-----------+
|    1    |     A     |
|    1    |     B     |
|    1    |     C     |
|    2    |     A     |
|    2    |     B     |
|    2    |     D     |
|    3    |     B     |
|    3    |     C     |
|    3    |     D     |
|    3    |     E     |
+---------+-----------+
how would I complete this code
df.groupby('user_id')
to give the result B, the only module_id that is in each of the groups?
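To reproduce, the table above can be rebuilt as a DataFrame like this (a sketch):
import pandas as pd

df = pd.DataFrame({'user_id':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'module_id': list('ABCABDBCDE')})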
Use get_dummies with max for an indicator DataFrame, then keep only the columns that are all 1s (1 values are treated like True by DataFrame.all):
cols = (pd.get_dummies(df.set_index('user_id')['module_id'])
          .max(level=0)
          .loc[:, lambda x: x.all()].columns)
print (cols)
Index(['B'], dtype='object')
Similar solution:
df1 = pd.get_dummies(df.set_index('user_id')['module_id']).max(level=0)
print (df1)
         A  B  C  D  E
user_id
1        1  1  1  0  0
2        1  1  0  1  0
3        0  1  1  1  1
cols = df1.columns[df1.all()]
More solutions:
cols = df.groupby(['module_id', 'user_id']).size().unstack().dropna().index
print (cols)
Index(['B'], dtype='object', name='module_id')
cols = df.pivot_table(index='module_id', columns='user_id', aggfunc='size').dropna().index
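A plain-Python alternative (a sketch, not among the answers above): collect the set of module_ids per user, then intersect the sets.
from functools import reduce

common = reduce(set.intersection, df.groupby('user_id')['module_id'].agg(set))
print(common)  # {'B'}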

How to use crosstab for multiple columns?

I need help using crosstab on the df below.
   a      b      c
-------------------------
|  a   | None |  c   |
|  a   |  b   | None |
| None |  b   |  c   |
|  a   | None | None |
| None | None | None |
I want to pull rows where more than one letter is specified (a&b, a&c, b&c), i.e. rows 1-3. I believe the easiest way to do this is through crosstab (I know I'll get a count, but can I also view the rows through this method?). I want to avoid having to write a lengthy 'or' statement to achieve this.
Desired Output:
   a      b      c
-------------------------
|  a   | None |  c   |
|  a   |  b   | None |
| None |  b   |  c   |
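Rebuilding the frame for the snippets below (a sketch; the None cells are entered as np.nan so the printed output matches):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', np.nan, 'a', np.nan],
                   'b': [np.nan, 'b', 'b', np.nan, np.nan],
                   'c': ['c', np.nan, 'c', np.nan, np.nan]})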
You aren't looking for crosstab; just check the number of non-nulls using notnull:
df[df.notnull().sum(1).gt(1)]
     a    b    c
0    a  NaN    c
1    a    b  NaN
2  NaN    b    c
Or you can use dropna:
t = 2
df.dropna(thresh=df.shape[1] - t + 1)
     a    b    c
0    a  NaN    c
1    a    b  NaN
2  NaN    b    c

pandas pivot onto values

Given a dataframe
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,1]])
df.columns = ['Key','Value','PivotOn']
pivoted = df.pivot(index='Key',columns='PivotOn',values='Value')
The pivot action will give me columns 0 and 1 from the column 'PivotOn'. But I would like to always pivot onto values 0, 1 and 2, even if there might not exist a row that has PivotOn = 2 (just produce NaN for it).
I cannot modify the original dataframe, so I'd want something like:
pivoted = df.pivot(index='Key',columns=[0,1,2],values='Value')
where it always produces 3 columns (0, 1 and 2) and column 2 is filled with NaNs.
Assume PivotOn has three unique values 0, 1, 2
df=pd.DataFrame([[1,11,0],[1,12,1],[2,21,0],[2,22,2]])
df.columns = ['Key','Value','PivotOn']
df
+---+-----+-------+---------+
|   | Key | Value | PivotOn |
+---+-----+-------+---------+
| 0 |  1  |  11   |    0    |
| 1 |  1  |  12   |    1    |
| 2 |  2  |  21   |    0    |
| 3 |  2  |  22   |    2    |
+---+-----+-------+---------+
And say you need to include columns 2, 3 and 4 (2 may or may not be present in the original df, so this generalizes).
Then:
expected = {2, 3, 4}
res = list(expected - set(df.PivotOn.unique()))
if len(res) > 0:  # some expected values are missing from PivotOn
    new_df = pd.DataFrame({'Key': np.nan, 'Value': np.nan, 'PivotOn': res},
                          index=range(df.shape[0], df.shape[0] + len(res)))
    ndf = pd.concat([df, new_df], sort=False)
    pivoted = ndf.pivot(index='Key', columns='PivotOn', values='Value').dropna(how='all')
else:
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
pivoted
pivoted
+---------+------+------+------+-----+-----+
| PivotOn |  0   |  1   |  2   |  3  |  4  |
+---------+------+------+------+-----+-----+
| Key     |      |      |      |     |     |
| 1.0     | 11.0 | 12.0 | NaN  | NaN | NaN |
| 2.0     | 21.0 | NaN  | 22.0 | NaN | NaN |
+---------+------+------+------+-----+-----+
You might try this if all you need is a column 2 filled with NaNs when it does not exist in your dataframe:
def no_col_2(df):
    pivoted = df.pivot(index='Key', columns='PivotOn', values='Value')
    # Note: `in` on a Series checks the index, so test .values instead
    if 2 not in df['PivotOn'].values:
        pivoted[2] = np.nan
    return pivoted

pivoted = no_col_2(df)
print(pivoted)
PivotOn   0   1    2
Key
1        11  12  NaN
2        21  22  NaN
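A more compact route than either answer above (a sketch, not from the original answers): pivot the original frame as-is, then reindex the columns to the full expected set; absent columns come back filled with NaN.
pivoted = (df.pivot(index='Key', columns='PivotOn', values='Value')
             .reindex(columns=[0, 1, 2]))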