Group by based on an if statement - pandas

I have a df that contains ids and timestamps.
I want to group by the id and then apply a condition to the timestamps of the two rows in each group.
Something like if timestamp_col1 > timestamp_col1 for the second row then 1 else 2
Basically, group by id and use an if statement to assign 1 if the first row's timestamp is earlier than the second's, and 2 if the second row's timestamp is earlier than the first's.
Updated output below where last two values should be 2

Use to_timedelta to convert the times, then for each group compare the first and last values with gt (>), and finally map the result back onto the rows and use numpy.where to assign the new column:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ID Code': ['a','a','b','b'],
    'Time Created': ['21:25:27','21:12:09','21:12:00','21:12:40']
})
df['Time Created'] = pd.to_timedelta(df['Time Created'])
# True where the first timestamp in the group is later than the last
mask = df.groupby('ID Code')['Time Created'].agg(lambda x: x.iat[0] > x.iat[-1])
print (mask)
ID Code
a True
b False
Name: Time Created, dtype: bool
# 2 where the first timestamp is later, otherwise 1
df['new'] = np.where(df['ID Code'].map(mask), 2, 1)
print (df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
Another solution uses transform to broadcast the per-group result back to every row, here as a boolean mask:
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = (df.groupby('ID Code')['Time Created'].transform(lambda x: x.iat[0] > x.iat[-1]))
print (mask)
0 True
1 True
2 False
3 False
Name: Time Created, dtype: bool
df['new'] = np.where(mask, 2, 1)
print (df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
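As a variant of the two solutions above, the lambda can be avoided by comparing the built-in "first" and "last" group transforms. This is only a sketch under the assumption that each group has no missing timestamps (the "first" and "last" aggregations skip nulls, unlike iat):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID Code": ["a", "a", "b", "b"],
    "Time Created": ["21:25:27", "21:12:09", "21:12:00", "21:12:40"],
})
df["Time Created"] = pd.to_timedelta(df["Time Created"])

grouped = df.groupby("ID Code")["Time Created"]
# Broadcast each group's first and last timestamp to every row, then compare.
first_is_later = grouped.transform("first") > grouped.transform("last")

# 2 where the first timestamp in the group is later than the last, otherwise 1.
df["new"] = np.where(first_is_later, 2, 1)
print(df)
```

The comparison is done once on whole columns instead of calling a Python function per group, which may matter on large frames.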

Related

Data frame: get row and update it

I want to select a row based on a condition and then update it in dataframe.
One solution I found is to update df based on the condition, but then I must repeat the condition each time. Is there a better solution where I get the desired row once and then change it?
df.loc[condition, "top"] = 1
df.loc[condition, "pred_text1"] = 2
df.loc[condition, "pred1_score"] = 3
something like:
row = df.loc[condition]
row["top"] = 1
row["pred_text1"] = 2
row["pred1_score"] = 3
Extract the boolean mask and set it as a variable.
m = condition
df.loc[m, 'top'] = 1
df.loc[m, 'pred_text1'] = 2
df.loc[m, 'pred1_score'] = 3
but the shortest way is:
df.loc[condition, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
Update
Isn't it possible to retrieve the index of the row and then update it by that index?
idx = df[condition].index
df.loc[idx, 'top'] = 1
df.loc[idx, 'pred_text1'] = 2
df.loc[idx, 'pred1_score'] = 3
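Putting the index-based approach and the single-assignment form together, a minimal sketch could look like this. The data and the "name" column are made up for illustration; only the column names top, pred_text1 and pred1_score come from the question:

```python
import pandas as pd

# Hypothetical data; all values are placeholders.
df = pd.DataFrame({
    "name": ["x", "y", "z"],
    "top": [0, 0, 0],
    "pred_text1": [0, 0, 0],
    "pred1_score": [0, 0, 0],
})

condition = df["name"] == "y"   # evaluate the condition once
idx = df[condition].index       # index labels of the matching rows

# Update several columns of the matching rows in one assignment.
df.loc[idx, ["top", "pred_text1", "pred1_score"]] = [1, 2, 3]
print(df)
```

Assigning through df.loc writes to the original frame, whereas modifying a row object obtained from df.loc[condition] would change a copy.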

Python Pandas: How to delete row with certain value of 'Object' datatype?

I have a dataframe name train_data.
This is the datatype of each column.
The columns workclass, occupation, and native-country are of "Object" datatype and some of the rows contain values of "?".
In this example, you can see row index 5 has some values with "?".
I want to delete all rows with any cell that has any "?".
I tried the following code, but it didn't work.
train_data = train_data[~(train_data.values == '?').any(1)]
train_data
Use .loc with a boolean mask to select the rows.
import pandas as pd
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
'B' : [2,4,5,'?',9],
'C' : [0,'?',2,3,4]})
print(df1)
A B C
0 0 2 0
1 1 4 ?
2 2 5 2
3 3 ? 3
4 ? 9 4
print(df1.loc[~df1.eq('?').any(axis=1)])
A B C
0 0 2 0
2 2 5 2
If you only want to check object columns, use DataFrame.select_dtypes:
df1.select_dtypes('object').eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
Edit.
One method for handling leading or trailing white spaces.
df1 = pd.DataFrame({'A' : [0,1,2,3,'?'],
'B' : [2,4,5,' ?',9],
'C' : [0,'? ',2,3,4]})
df1.eq('?').any(axis=1)
0 False
1 False
2 False
3 False
4 True
dtype: bool
df1.replace(r'(\s+\?)|(\?\s+)', '?', regex=True).eq('?').any(axis=1)
0 False
1 True
2 False
3 True
4 True
dtype: bool
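Alternatively, use str.strip() with a lambda. Note that .str.strip() returns NaN for non-string values, so the numbers in these mixed columns are replaced by NaN:
str_cols = df1.select_dtypes('object').columns
df1[str_cols] = df1[str_cols].apply(lambda x : x.str.strip())
df1.eq('?').any(axis=1)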
0 False
1 True
2 False
3 True
4 True
dtype: bool
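If the columns mix numbers and strings as in this toy frame, one way to strip whitespace without losing the numeric values is to strip only the cells that are actually strings. The helper name strip_if_str is made up for this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [0, 1, 2, 3, "?"],
                    "B": [2, 4, 5, " ?", 9],
                    "C": [0, "? ", 2, 3, 4]})

def strip_if_str(value):
    """Strip surrounding whitespace from strings; return other values unchanged."""
    return value.strip() if isinstance(value, str) else value

# Apply the helper element-wise, column by column.
cleaned = df1.apply(lambda col: col.map(strip_if_str))

# Keep only the rows in which no cell equals '?'.
result = cleaned.loc[~cleaned.eq("?").any(axis=1)]
print(result)
```

Here rows 1, 3 and 4 are dropped and the remaining numbers are left intact.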

Fill zeroes with increment of the max value

I have the following dataframe
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 0}, {'id':'d', 'val':0}])
What I want is to replace 0's with +1 of the max value
The result I want is as follows:
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 3}, {'id':'d', 'val':4}])
I tried the following:
for _, r in df.iterrows():
    if r.val == 0:
        r.val = df.val.max()+1
However, is there a one-line way to do the above?
Select only the rows where val is 0 with boolean indexing and DataFrame.loc, then assign a range whose length is the number of True values in the condition, offset by the maximum value plus 1 (because the range starts counting from 0):
import numpy as np
df.loc[df['val'].eq(0), 'val'] = np.arange(df['val'].eq(0).sum()) + df.val.max() + 1
print (df)
id val
0 a 1
1 b 2
2 c 3
3 d 4
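For readers who find the one-liner dense, the same idea can be spelled out step by step. The intermediate names (is_zero, n_zeros, start) are introduced only for this sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": list("abcd"), "val": [1, 2, 0, 0]})

is_zero = df["val"].eq(0)      # boolean mask of the rows to fill
n_zeros = is_zero.sum()        # how many values are needed
start = df["val"].max() + 1    # first replacement value

# Consecutive integers start, start + 1, ... are assigned to the zero rows in order.
df.loc[is_zero, "val"] = np.arange(start, start + n_zeros)
print(df)
```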

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How does this work?
I understand the intuition: the movie dataset is loaded with pandas into md, and we find the rows of 'vote_count' that are not null and convert them to int.
But I don't understand the syntax.
md[md['vote_count'].notnull()] returns a filtered copy of your md dataframe containing only the rows where vote_count is not null. This is boolean indexing. The 'vote_count' column is then selected from that result, converted with astype('int'), and assigned to the variable vote_counts.
# Assume this dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[2,'B'] = np.nan
df['B'].notnull() returns a boolean Series; using it as an indexer keeps only the rows where the value is True.
df['B'].notnull()
0 True
1 True
2 False
3 True
4 True
Name: B, dtype: bool
df[df['B'].notnull()]
A B C
0 -0.516625 -0.596213 -0.035508
1 0.450260 1.123950 -0.317217
3 0.405783 0.497761 -1.759510
4 0.307594 -0.357566 0.279341
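Applied to the original expression, the chain can be broken into its three steps. The md frame below is a made-up stand-in for the movie dataset, and the intermediate names are introduced only for this sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie dataset.
md = pd.DataFrame({"title": ["A", "B", "C"],
                   "vote_count": [10.0, np.nan, 3.0]})

has_votes = md["vote_count"].notnull()   # step 1: boolean mask, one entry per row
filtered = md[has_votes]                 # step 2: keep rows where the mask is True
vote_counts = filtered["vote_count"].astype("int")  # step 3: pick the column, cast to int

print(vote_counts)
```

The filter has to come before the cast because a float column containing NaN cannot be converted to a plain integer dtype.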

Pandas create row number - but not as an index

I want to create a row number series - but not override my date index.
I can do it with a loop but I think there must be an easier way?
_cnt = []
for i in range(len(df)):
    _cnt.append(i)
df['row'] = _cnt
Thanks.
Probably the easiest way:
df['row'] = range(len(df))
>>> df
0 1
0 0.444965 0.993382
1 0.001578 0.174628
2 0.663239 0.072992
3 0.664612 0.291361
4 0.486449 0.528354
>>> df['row'] = range(len(df))
>>> df
0 1 row
0 0.444965 0.993382 0
1 0.001578 0.174628 1
2 0.663239 0.072992 2
3 0.664612 0.291361 3
4 0.486449 0.528354 4
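Since the question mentions a date index, here is a small sketch showing that the new column leaves the index untouched. The dates and values are made up, and np.arange is used as an equivalent of range(len(df)):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a date index.
dates = pd.date_range("2020-01-01", periods=4, freq="D")
df = pd.DataFrame({"value": [0.4, 0.0, 0.6, 0.5]}, index=dates)

# Add a 0-based row number as an ordinary column; the index is not modified.
df["row"] = np.arange(len(df))
print(df)
```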