Fill zeroes with increment of the max value - pandas

I have the following dataframe
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 0}, {'id':'d', 'val':0}])
What I want is to replace the 0's with incrementing values, starting at the max value + 1.
The result I want is as follows:
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 3}, {'id':'d', 'val':4}])
I tried the following:
for _, r in df.iterrows():
    if r.val == 0:
        r.val = df.val.max() + 1
However, is there a one-line way to do the above?

Filter only the 0 rows with boolean indexing and DataFrame.loc, then assign a range offset by the current maximum plus 1. The length of the range is the count of True values in the condition; use numpy.arange, because a plain range cannot be added to a scalar:
df.loc[df['val'].eq(0), 'val'] = np.arange(df['val'].eq(0).sum()) + df.val.max() + 1
print (df)
  id  val
0  a    1
1  b    2
2  c    3
3  d    4
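Put together, a runnable sketch of that one-liner (numpy is needed for arange):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([{'id': 'a', 'val': 1}, {'id': 'b', 'val': 2},
                   {'id': 'c', 'val': 0}, {'id': 'd', 'val': 0}])

mask = df['val'].eq(0)                                  # rows to fill
# each zero gets a distinct value: max + 1, max + 2, ...
df.loc[mask, 'val'] = np.arange(mask.sum()) + df['val'].max() + 1
```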

How to create a new column based on row values in python?

I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like below:
basically, if 25%_coll is 1, then it should create a new column based on month.
Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with index columns month and collection_amount.
Set the appropriate collection_amount values to 0.
Build the new column names in column new_cols.
month collection_amount variable value new_cols
0 4 100 25%_coll 1 month_4_25%_coll
1 5 0 25%_coll 0 month_5_25%_coll
2 6 300 25%_coll 1 month_6_25%_coll
3 4 0 75%_coll 0 month_4_75%_coll
4 5 200 75%_coll 1 month_5_75%_coll
5 6 300 75%_coll 1 month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (then use .reset_index() without drop=True).
Result for the sample dataframe
df = pd.DataFrame({
"collection_amount": [100, 200, 300],
"25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
"month": [4, 5, 6]
})
is
new_cols month_4_25%_coll month_4_75%_coll month_5_25%_coll \
0 100 0 0
1 0 0 0
2 0 0 0
new_cols month_5_75%_coll month_6_25%_coll month_6_75%_coll
0 0 0 0
1 200 0 0
2 0 300 300
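To show the concat option end to end, a minimal sketch (assuming the row order of df and df2 matches, so concatenation by position is safe):

```python
import pandas as pd

df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})

# same steps as above: melt, zero out, build names, pivot
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)

# glue the new columns onto the original rows by position
out = pd.concat([df, df2], axis=1)
```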

Data frame: get row and update it

I want to select a row based on a condition and then update it in dataframe.
One solution I found is to update df based on condition, but I must repeat the condition, what is the better solution so that I get the desired row once and change it?
df.loc[condition, "top"] = 1
df.loc[condition, "pred_text1"] = 2
df.loc[condition, "pred1_score"] = 3
something like:
row = df.loc[condition]
row["top"] = 1
row["pred_text1"] = 2
row["pred1_score"] = 3
Extract the boolean mask and set it as a variable.
m = condition
df.loc[m, 'top'] = 1
df.loc[m, 'pred_text1'] = 2
df.loc[m, 'pred1_score'] = 3
but the shortest way is:
df.loc[condition, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
Update
Isn't it possible to retrieve the index of the row and then update it by that index?
idx = df[condition].index
df.loc[idx, 'top'] = 1
df.loc[idx, 'pred_text1'] = 2
df.loc[idx, 'pred1_score'] = 3
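A minimal runnable sketch of the index-based variant; the 'key' column and the condition here are hypothetical, just to make the example self-contained:

```python
import pandas as pd

df = pd.DataFrame({'top': [0, 0], 'pred_text1': [0, 0],
                   'pred1_score': [0.0, 0.0], 'key': ['x', 'y']})
condition = df['key'].eq('y')       # hypothetical condition

idx = df[condition].index           # .index, not .idx
# one assignment updates all three columns for the matched rows
df.loc[idx, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
```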

Apply Process Function to Groups in a Dataframe

I've got a dataframe, like this:
df_1 = pd.DataFrame({'X' : ['A','A','A','A','B','B','B'],
                     'Y' : [1, 0, 1, 1, 0, 0, 'Nan']})
I would like to group it by X and create a column Z:
df_2 = pd.DataFrame({'X' : ['A','B'],
                     'Z' : [0.5, 0.5]})
But the difficult to describe thing that I would like to do is to apply this function:
def fun(Y, Z):
    if Y == 1:
        Z = Z + 1
    elif Y == 0:
        Z = Z - 1
So the first Y in df_1 is a 1, that is in group A so the Z for group A increases to 1.5. Then the next one is a 0 so it goes back to 0.5, then there are 2 more 1's so it ends up at 2.5.
Which would give me:
X Z
A 2.5
B -1.5
You can modify your first DataFrame and use sum with index alignment.
0 -> -1 (subtract 1 when a zero is found)
NaN -> 0 (unchanged when a NaN is found)
df_1['Y'] = pd.to_numeric(df_1.Y, errors='coerce')
u = df_1.assign(Z=df_1.Y.replace({0: -1, np.nan: 0})).groupby('X')['Z'].sum().to_frame()
df_2 = df_2.set_index('X') + u
Z
X
A 2.5
B -1.5
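Putting the pieces together, a runnable sketch of the whole computation:

```python
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'X': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                     'Y': [1, 0, 1, 1, 0, 0, 'Nan']})
df_2 = pd.DataFrame({'X': ['A', 'B'], 'Z': [0.5, 0.5]})

df_1['Y'] = pd.to_numeric(df_1['Y'], errors='coerce')   # 'Nan' -> NaN
# map 0 -> -1 and NaN -> 0, then sum per group
u = (df_1.assign(Z=df_1['Y'].replace({0: -1, np.nan: 0}))
         .groupby('X')['Z'].sum().to_frame())
df_2 = df_2.set_index('X') + u                          # index alignment on X
```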

Group by based on an if statement

I have a df that contains ids and timestamps.
I was looking to group by the id and then a condition on the timestamp in the two rows.
Something like: compare the timestamp in the first row with the timestamp in the second row within each group.
Basically, group by the id and assign 1 if the first row's timestamp is less than the second row's, and 2 if the second row's timestamp is less than the first's.
Updated output below where last two values should be 2
Use to_timedelta to convert the times, then aggregate per group by comparing the first and last value, and finally map the resulting mask with numpy.where to assign the new column:
df = pd.DataFrame({
    'ID Code': ['a','a','b','b'],
    'Time Created': ['21:25:27','21:12:09','21:12:00','21:12:40']
})
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = df.groupby('ID Code')['Time Created'].agg(lambda x: x.iat[0] < x.iat[-1])
print (mask)
ID Code
a    False
b     True
Name: Time Created, dtype: bool
df['new'] = np.where(df['ID Code'].map(mask), 1, 2)
print (df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
Another solution uses transform to broadcast the aggregated value (here a boolean mask) back to a new column:
df['Time Created'] = pd.to_timedelta(df['Time Created'])
mask = (df.groupby('ID Code')['Time Created'].transform(lambda x: x.iat[0] > x.iat[-1]))
print (mask)
0 True
1 True
2 False
3 False
Name: Time Created, dtype: bool
df['new'] = np.where(mask, 2, 1)
print (df)
ID Code Time Created new
0 a 21:25:27 2
1 a 21:12:09 2
2 b 21:12:00 1
3 b 21:12:40 1
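An alternative sketch that avoids the lambda entirely by broadcasting the per-group first and last values with transform('first') and transform('last'):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID Code': ['a', 'a', 'b', 'b'],
    'Time Created': ['21:25:27', '21:12:09', '21:12:00', '21:12:40']
})
df['Time Created'] = pd.to_timedelta(df['Time Created'])

g = df.groupby('ID Code')['Time Created']
increasing = g.transform('first') < g.transform('last')  # per-row mask
df['new'] = np.where(increasing, 1, 2)
```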

Pandas create row number - but not as an index

I want to create a row number series - but not override my date index.
I can do it with a loop but I think there must be an easier way?
_cnt = []
for i in range(len(df)):
    _cnt.append(i)
df['row'] = _cnt
Thanks.
Probably the easiest way:
df['row'] = range(len(df))
>>> df
0 1
0 0.444965 0.993382
1 0.001578 0.174628
2 0.663239 0.072992
3 0.664612 0.291361
4 0.486449 0.528354
>>> df['row'] = range(len(df))
>>> df
0 1 row
0 0.444965 0.993382 0
1 0.001578 0.174628 1
2 0.663239 0.072992 2
3 0.664612 0.291361 3
4 0.486449 0.528354 4
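For completeness, a sketch showing that the assignment leaves a date index untouched (the dates here are made up for illustration):

```python
import pandas as pd

idx = pd.date_range('2023-01-01', periods=5)        # hypothetical date index
df = pd.DataFrame({'x': range(5)}, index=idx)

df['row'] = range(len(df))                          # index stays as dates
```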