how to create a variable in pandas dataframe based on another variable - pandas

I have a dataframe data with multiple columns. I want to create a column filter in the dataframe and assign it the value 1 if activation_dt is null, else 0.
I have written this code, but it does not give the expected result: every row gets 0, regardless of whether the date is present.
data['filter'] = [0 if x is not None else 1 for x in data['activation_dt']]

I think you need isnull to check for None or NaN, and then convert True to 1 and False to 0 with astype(int):
data = pd.DataFrame({'activation_dt': [None, np.nan, 1]})
print(data)
   activation_dt
0            NaN
1            NaN
2            1.0
data['filter'] = data['activation_dt'].isnull().astype(int)
print(data)
   activation_dt  filter
0            NaN       1
1            NaN       1
2            1.0       0
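As for why the list comprehension in the question returns 0 everywhere: pandas stores missing dates as NaT (and missing numbers as NaN), and neither of those is the Python object None, so `x is not None` is true for every element. A minimal sketch of the difference, with made-up sample dates standing in for the real activation_dt column:

```python
import pandas as pd

# Hypothetical sample data: one missing date in the middle.
data = pd.DataFrame(
    {"activation_dt": pd.to_datetime(["2021-01-05", None, "2021-03-01"])}
)

# The approach from the question: NaT is not None, so every row gets 0.
naive = [0 if x is not None else 1 for x in data["activation_dt"]]

# isna() recognises None, NaN and NaT alike.
data["filter"] = data["activation_dt"].isna().astype(int)

print(naive)
print(data)
```

Here naive stays all zeros, while the filter column flags only the row with the missing date.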

Related

Pandas dataframe, a cumsum calculation including max function

I have a pandas dataframe with a time series problem: there is a column of values called diff, and I need to calculate a value, here called sum, according to the formula below, for each category separately:
sum_n = max(0, diff_n + sum_{n-1} - factor)
factor = 2 (factor is a parameter, set to 2 in this example)
The dataframe looks something like this; the value of sum is set to 0 for hour = 0:
category  hour  diff  sum
a         0     0     0
a         1     4     NaN
a         2     3     NaN
a         3     1     NaN
b         0     0     0
b         1     1     NaN
b         2     -5    NaN
b         3     4     NaN
My expected output is the following:
category  hour  diff  sum
a         0     0     0
a         1     4     2
a         2     3     3
a         3     1     2
b         0     0     0
b         1     1     0
b         2     -5    0
b         3     4     2
Any idea how to solve this? Preferably without iterrows or any for loops since there are a lot of rows.
Would be happy for any help here.
If it weren't for the max function, I could have used something like this:
df['sum'] = df.groupby(['category'])['diff'].cumsum() - factor
But the max function messes things up for me.
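You can apply a row-wise function that carries the running value in a global variable:
sumn = 0

def calc_sum(row):
    global sumn
    if not row['hour']:  # reset the running value when hour == 0
        sumn = 0
    sumn = max(0, row['diff'] + sumn - 2)  # 2 is the factor
    return sumn

# .values drops the index, so df must already be sorted by category
# (and by hour within each category) for the rows to line up.
df['sum'] = df.groupby(['category']).apply(lambda g: g.apply(calc_sum, axis=1)).values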
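Since the question asks for something without explicit loops, here is a hedged sketch of a vectorized alternative. It relies on the identity that a running sum clamped at 0 equals the plain cumulative sum minus its running minimum, with that minimum capped at 0. It assumes the rows are sorted by hour within each category and that the formula is applied from the first row of each group with a starting value of 0; the sample frame is rebuilt from the question.

```python
import pandas as pd

factor = 2  # parameter from the question

df = pd.DataFrame({
    "category": list("aaaabbbb"),
    "hour": [0, 1, 2, 3, 0, 1, 2, 3],
    "diff": [0, 4, 3, 1, 0, 1, -5, 4],
})

# Per-row increment of the unclamped recurrence.
step = df["diff"] - factor

# Plain cumulative sum within each category.
csum = step.groupby(df["category"]).cumsum()

# Running minimum of that cumulative sum, capped at 0 because the
# recurrence starts from 0 and can never go below it.
floor = csum.groupby(df["category"]).cummin().clip(upper=0)

# Clamped running sum: max(0, diff_n + sum_{n-1} - factor) for every row.
df["sum"] = csum - floor

print(df)
```

On this sample the sum column matches the expected output in the question. If the first row of a group could have diff greater than factor, the hour = 0 row would need to be forced to 0 separately.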

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouped column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as noted in the str.join documentation, joining NaN results in an empty (NaN) value:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried converting col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string nan and are therefore considered non-empty.
Not only does str.join now include nan strings, but other calculations in the script that rely on those NaNs are ruined as well.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
df['col_str'].astype(str))
But now str.join returns empty values again :-(
So, I tried fillna('') and even dropna(). Neither gave me the desired results.
You see the vicious cycle here, right?
astype(str) => nan strings in join, and calculations ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Converting col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove missing values by DataFrame.dropna or Series.notna in boolean indexing:
df = pd.DataFrame({'col_a': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'col_str': ['a', 'b', 'c', 'd', np.nan, np.nan, np.nan, np.nan, 'a', 's']})
df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                               .groupby('col_a')['col_str'].unique()
                               .str.join(', ')).rename('labels')))
print (df1)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
.groupby('col_a')['col_str']
.unique().str.join(', ')).rename('labels')))
print (df2)
col_a col_str labels
0 1 a a
1 2 b b, s
2 3 c c
3 4 d d
4 1 NaN a
5 2 NaN b, s
6 3 NaN c
7 4 NaN d
8 1 a a
9 2 s b, s
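A variation on the same idea, offered only as a sketch: instead of filtering the frame first, drop the missing values inside the aggregation function, so the original col_str column keeps its NaNs for any later calculations. The frame is the sample from this answer; the labels column name is reused from it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_a": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    "col_str": ["a", "b", "c", "d", np.nan, np.nan, np.nan, np.nan, "a", "s"],
})

# One joined string per group, ignoring NaN and duplicate values.
joined = df.groupby("col_a")["col_str"].agg(
    lambda s: ", ".join(s.dropna().unique())
)

# Broadcast the per-group string back onto every row of that group.
df["labels"] = df["col_a"].map(joined)

print(df)
```

One difference to be aware of: a group whose values are all NaN gets an empty string here, whereas with the filtering approach that group disappears from the groupby result and maps to NaN.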

selecting nan values in a pandas dataframe using loc [duplicate]

Given this dataframe, how to select only those rows that have "Col2" equal to NaN?
df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)], columns=["Col1", "Col2", "Col3"])
which looks like:
   Col1  Col2  Col3
0     0   1.0   2.0
1     0   NaN   0.0
2     0   0.0   NaN
3     0   1.0   2.0
4     0   1.0   2.0
The result should be this one:
   Col1  Col2  Col3
1     0   NaN   0.0
Try the following:
df[df['Col2'].isnull()]
@qbzenker provided the most idiomatic method, IMO.
Here are a few alternatives:
In [28]: df.query('Col2 != Col2') # Using the fact that: np.nan != np.nan
Out[28]:
Col1 Col2 Col3
1 0 NaN 0.0
In [29]: df[np.isnan(df.Col2)]
Out[29]:
Col1 Col2 Col3
1 0 NaN 0.0
If you want to select rows with at least one NaN value, then you could use isna + any on axis=1:
df[df.isna().any(axis=1)]
If you want to select rows with a certain number of NaN values, then you could use isna + sum on axis=1 + gt. For example, the following will fetch rows with at least 2 NaN values:
df[df.isna().sum(axis=1)>1]
If you want to limit the check to specific columns, you could select them first, then check:
df[df[['Col1', 'Col2']].isna().any(axis=1)]
If you want to select rows with all NaN values, you could use isna + all on axis=1:
df[df.isna().all(axis=1)]
If you want to select rows with no NaN values, you could use notna + all on axis=1:
df[df.notna().all(axis=1)]
This is equivalent to:
df[df['Col1'].notna() & df['Col2'].notna() & df['Col3'].notna()]
which could become tedious if there are many columns. Instead, you could use functools.reduce to chain & operators:
import functools, operator
df[functools.reduce(operator.and_, (df[i].notna() for i in df.columns))]
or numpy.logical_and.reduce:
import numpy as np
df[np.logical_and.reduce([df[i].notna() for i in df.columns])]
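To make the row-wise selections above concrete, here is a small sketch that runs them side by side. The frame is a hypothetical variation of the one in the question, with an extra all-NaN row added so that every filter selects something.

```python
import functools
import operator

import numpy as np
import pandas as pd

# Hypothetical sample: row 4 is entirely NaN.
df = pd.DataFrame(
    [[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [0, 1, 2], [np.nan] * 3],
    columns=["Col1", "Col2", "Col3"],
)

any_nan = df[df.isna().any(axis=1)]        # at least one NaN in the row
two_plus = df[df.isna().sum(axis=1) > 1]   # at least two NaNs in the row
all_nan = df[df.isna().all(axis=1)]        # every value is NaN
no_nan = df[df.notna().all(axis=1)]        # no NaN at all

# The two reduce-based spellings give the same rows as no_nan.
via_functools = df[functools.reduce(operator.and_, (df[c].notna() for c in df.columns))]
via_numpy = df[np.logical_and.reduce([df[c].notna() for c in df.columns])]

print(any_nan.index.tolist(), two_plus.index.tolist(),
      all_nan.index.tolist(), no_nan.index.tolist())
```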
If you want to filter the rows where there is no NaN in some column using query, you can do so with the engine='python' parameter:
df.query('Col2.notna()', engine='python')
or use the fact that NaN != NaN, like @MaxU - stop WAR against UA:
df.query('Col2==Col2')
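A quick sketch to check that the two query spellings select the same rows as plain boolean indexing with notna; the small frame here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one NaN in Col2.
df = pd.DataFrame({"Col1": [0, 0, 0], "Col2": [1.0, np.nan, 0.0]})

by_method = df.query("Col2.notna()", engine="python")  # method call inside query
by_self_eq = df.query("Col2 == Col2")                   # NaN is never equal to itself
by_mask = df[df["Col2"].notna()]                        # plain boolean indexing

print(by_mask)
```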

what is the difference between the total= df.isnull().sum(), percent1= df.count(),percent= df.isnull().count()?

Can anyone explain the difference between total = df.isnull().sum(), percent1 = df.count(), and percent = df.isnull().count()? I expected df.isnull().count() to count only the null values, but it counts all the values. Can anyone help me understand this?
Below is the code. total gives the count of null values only, percent1 gives the count of non-null values only, and percent gives the count of all values, null or not.
total= df.isnull().sum().sort_values(ascending=False)
percent1= df.count()#helps to get all the non null values count
percent= df.isnull().count()
print(total)
print(percent1)
print(percent)
The definition of count according to the doc is:
Count non-NA cells for each column or row.
Using isnull (or isna) turns your dataframe df, whatever types it holds, into a boolean dataframe, with True where df originally had NaN and False otherwise. There are no NaN values left in this boolean dataframe, so count on df.isnull() returns the number of rows of df. With an example:
df = pd.DataFrame({'a':range(4), 'b':[1,np.nan, 3, np.nan]})
print (df)
a b
0 0 1.0
1 1 NaN
2 2 3.0
3 3 NaN
if you use count on this dataframe you get:
print (df.count())
a 4
b 2 # here you get 2 because column b has 2 non-NaN values (the other 2 are NaN)
dtype: int64
but if you use isnull on it you get
print (df.isnull())
a b
0 False False
1 False True #was nan in column b in df
2 False False
3 False True
Here you don't have nan anymore, so the result of count will be the number of rows for both columns
print (df.isnull().count())
a 4
b 4 #no more nan in df.isnull()
dtype: int64
But because True is equal to 1 and False is equal to 0, the sum method adds one for each True in df.isnull(), that is, one for each NaN originally in df:
print (df.isnull().sum())
a 0 # because only False in column a of df.isnull()
b 2 # because you have two True in df.isnull() in column b
dtype: int64
Finally, you can see the relation like this:
(df.count()+df.isnull().sum())==df.isnull().count()
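Putting the three calls together on the example frame, as a small sketch: it also shows the usual reason for computing both isnull().sum() and isnull().count(), namely a per-column missing percentage. The variable names are illustrative, not taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(4), "b": [1, np.nan, 3, np.nan]})

n_non_null = df.count()        # non-NaN cells per column
n_null = df.isnull().sum()     # NaN cells per column (True counts as 1)
n_rows = df.isnull().count()   # all cells per column, since the mask has no NaN

# Share of missing values per column, in percent.
missing_pct = n_null / n_rows * 100

print(missing_pct)
```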

Pandas: how to use convert_objects to replace strings with NaN values

This is related to a previous question I've asked, here: Replace any string in columns with 1
However, since that question has been answered long ago, I've started a new question here. I am essentially trying to use convert_objects to replace string values with 1's in the following dataframe (abbreviated here):
uniq_epoch T_Opp T_Eval
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
...
I am using the following code to do this. I've actually tried using this code on the entire dataframe, and have also applied it to a particular column. The result each time is that there is no error message, but also no change to the data (no values are converted to NaN, and the dtype is still 'O').
df = df.convert_objects(convert_numeric = True)
or
df.T_Eval = df.T_Eval.convert_objects(convert_numeric=True)
Desired final output is as follows:
uniq_epoch T_Opp T_Eval
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
...
Where there may also be a step prior to this, with the 1s as NaN, and fillna(1) is used to insert 1s where strings have been.
I've already searched posts on stackoverflow, and looked at the documentation for convert_objects, but it is unfortunately pretty sparse. I wouldn't have known to even attempt to apply it this way if not for the previous post (linked above).
I'll also mention that there are quite a few strings (codes) in these columns, and that the codes can recombine, so that to do this with a dict and replace(), would take about the same amount of time as if I did this by hand.
Based on the previous post and the various resources I've been able to find, I can't figure out why this isn't working - any help much appreciated, including pointing towards further documentation.
This is on 0.13.1
docs here
and here
Maybe you have an older version; IIRC convert_objects was introduced in 0.11.
In [5]: df = read_csv(StringIO(data),sep='\s+',index_col=0)
In [6]: df
Out[6]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 vv.bo
2 bx 0
3 0 0
3 vo.bp 0
[5 rows x 2 columns]
In [7]: df.convert_objects(convert_numeric=True)
Out[7]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 NaN
2 NaN 0
3 0 0
3 NaN 0
[5 rows x 2 columns]
In [8]: df.convert_objects(convert_numeric=True).dtypes
Out[8]:
T_Opp float64
T_Eval float64
dtype: object
In [9]: df.convert_objects(convert_numeric=True).fillna(1)
Out[9]:
T_Opp T_Eval
uniq_epoch
1 0 0
1 0 1
2 1 0
3 0 0
3 1 0
[5 rows x 2 columns]
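Note that convert_objects has since been deprecated and removed from pandas, so the calls above will not run on current releases. A hedged sketch of the same idea using pd.to_numeric with errors='coerce', which turns unparseable strings into NaN; the frame below is rebuilt by hand from the sample in the question.

```python
import pandas as pd

# Sample from the question, with every value read as a string.
df = pd.DataFrame(
    {
        "T_Opp": ["0", "0", "bx", "0", "vo.bp"],
        "T_Eval": ["0", "vv.bo", "0", "0", "0"],
    },
    index=pd.Index([1, 1, 2, 3, 3], name="uniq_epoch"),
)

# Column by column: numeric strings become numbers, anything else becomes NaN.
converted = df.apply(pd.to_numeric, errors="coerce")

# Replace the NaNs (former string codes) with 1 and go back to integers.
result = converted.fillna(1).astype(int)

print(result)
```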