I have the following code that updates the current row based on the status of the previous row:
prev_status = 0
for idx, row in df.iterrows():
    if prev_status in [1, 2] and row[column_a] != 0:
        row[column_b] += row[column_a]
        row[column_c] = 0
        row[column_d] = 0
        row[column_a] = 0
    prev_status = row[status]
    df.loc[idx] = row
However this is very slow when running on 1GB of data. What are ways to optimize this?
Try this:
df['previous_status'] = df['status'].shift(1)
m = df['previous_status'].isin([1, 2]) & df['column_a'].ne(0)
df.loc[m, 'column_b'] += df.loc[m, 'column_a']
df.loc[m, ['column_c', 'column_d', 'column_a']] = 0
Note that Python's in operator does not work element-wise on a Series, so the mask uses isin; computing the mask once also avoids repeating the condition (and makes sure column_a is added to column_b before it is zeroed out).
Look at using shift, e.g.
df["new_column"] = df["column_name"].shift(x)
This creates a column where the values are the values of another column shifted by x number of rows. It makes it much quicker to do vectorwise calculations on a column, rather than applying a function to every row in the DataFrame.
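For instance, a minimal sketch on a toy frame (the column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'status': [1, 2, 0, 1]})
# shift(1) moves each value down one row; the first row becomes NaN
df['previous_status'] = df['status'].shift(1)
print(df)
```

Pandas fills the gap left by the shift with NaN, so the new column becomes float even when the source column was integer.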
Related
df = df.assign[test = np.select[df.trs = 'iw' & df.rp == 'yu'],[1,0],'null']
I want a new column that flags rows where df.trs == 'iw' and df.rp == 'yu', rather than writing the same value to every row.
I tried np.select with a condition array, but I'm not getting the desired output.
You don't need numpy.select, a simple boolean operator is sufficient:
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
If you really want to use numpy, this would require numpy.where:
df['test'] = np.where(df['trs'].eq('iw') & df['rp'].eq('yu'), 1, 0)
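As a quick check, here is a sketch of both approaches on toy data (the values are made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'trs': ['iw', 'iw', 'ab'], 'rp': ['yu', 'xx', 'yu']})
# both flag rows where trs == 'iw' and rp == 'yu' with 1, otherwise 0
df['test'] = (df['trs'].eq('iw') & df['rp'].eq('yu')).astype(int)
df['test_np'] = np.where(df['trs'].eq('iw') & df['rp'].eq('yu'), 1, 0)
```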
I want to select a row based on a condition and then update it in dataframe.
One solution I found is to update df based on the condition, but then I must repeat the condition for every column. What is a better solution, so that I get the desired row once and change it?
df.loc[condition, "top"] = 1
df.loc[condition, "pred_text1"] = 2
df.loc[condition, "pred1_score"] = 3
something like:
row = df.loc[condition]
row["top"] = 1
row["pred_text1"] = 2
row["pred1_score"] = 3
Extract the boolean mask and set it as a variable.
m = condition
df.loc[m, 'top'] = 1
df.loc[m, 'pred_text1'] = 2
df.loc[m, 'pred1_score'] = 3
but the shortest way is:
df.loc[condition, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
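A minimal sketch of the single-statement form, with a made-up condition:

```python
import pandas as pd

df = pd.DataFrame({'top': [0, 0], 'pred_text1': [0, 0], 'pred1_score': [0, 0]})
condition = df.index == 0  # hypothetical condition, for illustration only
# one .loc call assigns all three columns for the matching rows at once
df.loc[condition, ['top', 'pred_text1', 'pred1_score']] = [1, 2, 3]
```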
Update
Isn't it possible to retrieve the index of the row and then update it by that index?
idx = df[condition].index
df.loc[idx, 'top'] = 1
df.loc[idx, 'pred_text1'] = 2
df.loc[idx, 'pred1_score'] = 3
I have a column of indicator values (a); whenever one is true, I want to perform an action on the next n (3 in this example) rows of another column (b). What I have achieves this, but will get very inefficient as n gets large:
Are there other ways to do this? I am trying to avoid loops.
Tricky but possible using an apply:
testing = pd.DataFrame({
    'a': [0, 1, 0, 0, 0],
    'b': [0, 0, 0, 0, 0]
})

def func(value, n):
    if value.a == 0 and value.b != -1:
        value.b = 0
    elif value.a == 1 and value.b == 0:
        value.b = 0
        testing.loc[value.name + 1:value.name + n, 'b'] = -1
    elif value.a == 1 and value.b == -1 and testing.loc[value.name, 'a'] == 1:
        testing.loc[value.name - 1, 'b'] = -1
        testing.loc[value.name + 1:value.name + n, 'b'] = -1
    return value

testing.apply(func, axis=1, args=(3,))
Output:
a b
0 0 0
1 1 0
2 0 -1
3 0 -1
4 0 -1
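A fully vectorized alternative is also possible (a sketch, using the same toy frame): a row falls within the n rows after an indicator exactly when any of the previous n values of 'a' is 1, which a shifted rolling maximum can express without apply:

```python
import pandas as pd

testing = pd.DataFrame({'a': [0, 1, 0, 0, 0], 'b': [0, 0, 0, 0, 0]})
n = 3
# rolling max over the last n values of 'a', shifted by one row, is 1
# exactly on the n rows following each indicator
mask = testing['a'].rolling(n, min_periods=1).max().shift(1).fillna(0).astype(bool)
testing.loc[mask, 'b'] = -1
```

This scales with n only through the rolling window, so it stays fast as n grows.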
I have the following dataframe
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 0}, {'id':'d', 'val':0}])
What I want is to replace 0's with +1 of the max value
The result I want is as follows:
df = pd.DataFrame([{'id':'a', 'val':1}, {'id':'b', 'val':2}, {'id':'c', 'val': 3}, {'id':'d', 'val':4}])
I tried the following:
for _, r in df.iterrows():
    if r.val == 0:
        r.val = df.val.max() + 1
However, is there a one-line way to do the above?
Filter only the 0 rows with boolean indexing and DataFrame.loc, then assign an increasing sequence: count the True values of the condition with sum, and offset by the maximum value plus 1. Use numpy.arange rather than a plain range, since a range cannot be added to a scalar:
df.loc[df['val'].eq(0), 'val'] = np.arange(df['val'].eq(0).sum()) + df.val.max() + 1
print (df)
id val
0 a 1
1 b 2
2 c 3
3 d 4
date_0 = list(pd.date_range('2017-01-01', periods=6, freq='MS'))
date_1 = list(pd.date_range('2017-01-01', periods=8, freq='MS'))
data_0 = [9, 8, 4, 0, 0, 0]
data_1 = [9, 9, 0, 0, 0, 7, 0, 0]
id_0 = [0]*6
id_1 = [1]*8
df = pd.DataFrame({'ids': id_0 + id_1, 'dates': date_0 + date_1, 'data': data_0 + data_1})
For each id (here 0 and 1) I want to know how long is the series of zeros at the end of the time frame.
For the given example, the result is id_0 = 3, id_1 = 2.
So how do I limit the time frame, so that I can run something like this:
df.groupby('ids').agg('count')
First, identify the runs of consecutive 0s with the usual trick: compare each value with its shifted neighbour (ne) and take the cumulative sum, then zero out the run ids for non-zero values.
Then count per group, remove the first level of the MultiIndex, and keep the last value per group with drop_duplicates and keep='last':
s = df['data'].ne(df['data'].shift()).cumsum().mul(~df['data'].astype(bool))
df = (s.groupby([df['ids'], s]).size()
        .reset_index(level=1, drop=True)
        .reset_index(name='val')
        .drop_duplicates('ids', keep='last'))
print (df)
ids val
1 0 3
4 1 2
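An alternative sketch that counts the trailing zeros directly, assuming the same data as above: reverse each group so the trailing run comes first, then take a cumulative product of the zero flags, which stays 1 exactly for the length of that run.

```python
import pandas as pd

df = pd.DataFrame({
    'ids': [0]*6 + [1]*8,
    'data': [9, 8, 4, 0, 0, 0] + [9, 9, 0, 0, 0, 7, 0, 0],
})
# reversed order puts the trailing zeros first in each group; the cumulative
# product of (data == 0) is 1 only while that initial run continues
result = df[::-1].groupby('ids')['data'].apply(
    lambda s: s.eq(0).astype(int).cumprod().sum())
```

Unlike the drop_duplicates approach, this returns 0 (rather than the last interior run) for an id whose series does not end in zeros.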