Pandas: if condition is True, perform a function on the next n rows

I have a column of indicator values (a); when an indicator is true I want to perform an action on the next n (3 in this example) rows of another column (b). My current approach achieves what I am looking for but becomes very inefficient as n gets large.
Are there other ways to do this? I am trying to avoid loops.

Tricky but possible using an apply:
import pandas as pd

testing = pd.DataFrame({
    'a': [0, 1, 0, 0, 0],
    'b': [0, 0, 0, 0, 0]
})

def func(value, n):
    if value.a == 0 and value.b != -1:
        value.b = 0
    elif value.a == 1 and value.b == 0:
        value.b = 0
        # flag the next n rows of b
        testing.loc[value.name + 1:value.name + n, 'b'] = -1
    elif value.a == 1 and value.b == -1 and testing.loc[value.name, 'a'] == 1:
        testing.loc[value.name - 1, 'b'] = -1
        testing.loc[value.name + 1:value.name + n, 'b'] = -1
    return value

testing.apply(func, axis=1, args=(3,))
Output:
a b
0 0 0
1 1 0
2 0 -1
3 0 -1
4 0 -1
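If you want to avoid apply altogether, here is a vectorized sketch (not from the original answer) that reproduces the example output by marking every row that falls within the n rows after a 1 in column a; the consecutive-1s branch of func may need extra handling:
import numpy as np
import pandas as pd

n = 3
testing = pd.DataFrame({'a': [0, 1, 0, 0, 0], 'b': [0, 0, 0, 0, 0]})

# True for any row that lies within the n rows following a row where a == 1
after_one = testing['a'].rolling(n, min_periods=1).max().shift(1).eq(1)
testing['b'] = np.where(after_one, -1, testing['b'])
print(testing)
#    a  b
# 0  0  0
# 1  1  0
# 2  0 -1
# 3  0 -1
# 4  0 -1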

Related

How to create a new column based on row values in python?

I have data like below:
df = pd.DataFrame()
df["collection_amount"] = 100, 200, 300
df["25%_coll"] = 1, 0, 1
df["75%_coll"] = 0, 1, 1
df["month"] = 4, 5, 6
I want to create an output like below:
Basically, if 25%_coll is 1 then it should create a new column based on month.
Please help me, thank you.
This should work; do ask if something doesn't make sense:
for i in range(len(df)):
    if df['25%_coll'][i] == 1:
        df['month_%i_25%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
    if df['75%_coll'][i] == 1:
        df['month_%i_75%%_coll' % df.month[i]] = [df.collection_amount[i] if k == i else 0 for k in range(len(df))]
To build the new columns you could try the following:
df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)
.melt() the dataframe with index columns month and collection_amount.
Set the appropriate collection_amount values to 0.
Build the new column names in column new_cols.
After these steps, the intermediate df2 looks like:
month collection_amount variable value new_cols
0 4 100 25%_coll 1 month_4_25%_coll
1 5 0 25%_coll 0 month_5_25%_coll
2 6 300 25%_coll 1 month_6_25%_coll
3 4 0 75%_coll 0 month_4_75%_coll
4 5 200 75%_coll 1 month_5_75%_coll
5 6 300 75%_coll 1 month_6_75%_coll
Use .pivot_table() on this dataframe to build the new columns.
The rest isn't completely clear: Either use df = pd.concat([df, df2], axis=1), or df.merge(df2, ...) to merge on month (with .reset_index() without drop=True).
Result for the sample dataframe
df = pd.DataFrame({
"collection_amount": [100, 200, 300],
"25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
"month": [4, 5, 6]
})
is
new_cols month_4_25%_coll month_4_75%_coll month_5_25%_coll \
0 100 0 0
1 0 0 0
2 0 0 0
new_cols month_5_75%_coll month_6_25%_coll month_6_75%_coll
0 0 0 0
1 200 0 0
2 0 300 300
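If the goal is to attach these new columns back to the original frame, here is a minimal sketch of the pd.concat option mentioned above (assuming the rows of df2 still line up with df after reset_index(drop=True)):
import pandas as pd

df = pd.DataFrame({
    "collection_amount": [100, 200, 300],
    "25%_coll": [1, 0, 1], "75%_coll": [0, 1, 1],
    "month": [4, 5, 6]
})

df2 = df.melt(id_vars=["month", "collection_amount"])
df2.loc[df2["value"].eq(0), "collection_amount"] = 0
df2["new_cols"] = "month_" + df2["month"].astype("str") + "_" + df2["variable"]
df2 = df2.pivot_table(
    index="month", columns="new_cols", values="collection_amount",
    fill_value=0, aggfunc="sum"
).reset_index(drop=True)

# Both frames now use the default RangeIndex, so concat aligns row by row.
out = pd.concat([df, df2], axis=1)
print(out)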

Pandas index clause across multiple columns in a multi-column header

I have a data frame with multi-column headers.
import pandas as pd
headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns = headers)
f
A B
u v x y
0 1 1 0 1
1 1 0 0 0
2 0 0 1 1
3 1 0 1 0
I want to select the rows in which any of the A columns or any of the B columns are true.
I can do so explicitly.
f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
A B
u v x y
0 1 1 0 1
1 1 0 0 0
3 1 0 1 0
f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
A B
u v x y
0 1 1 0 1
2 0 0 1 1
3 1 0 1 0
I want to write a function select(f, top_level_name) where the indexing clause applies to all the columns under the same top level name such that
select(f, "A") == f[f["A"]["u"].astype(bool) | f["A"]["v"].astype(bool)]
select(f, "B") == f[f["B"]["x"].astype(bool) | f["B"]["y"].astype(bool)]
I want this function to work with arbitrary numbers of sub-columns with arbitrary names.
How do I write select?
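A minimal sketch of one way to write it (not from an original answer), relying on the fact that f[top_level_name] returns every sub-column under that top-level label:
import pandas as pd

def select(f, top_level_name):
    # f[top_level_name] is a sub-frame with all columns under that top level;
    # .any(axis=1) keeps the rows where at least one of them is truthy.
    return f[f[top_level_name].astype(bool).any(axis=1)]

headers = pd.MultiIndex.from_tuples([("A", "u"), ("A", "v"), ("B", "x"), ("B", "y")])
f = pd.DataFrame([[1, 1, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], columns=headers)
print(select(f, "A"))  # rows 0, 1, 3
print(select(f, "B"))  # rows 0, 2, 3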

Pandas: How to efficiently update rows based on previous value?

I have the following code that updates the current row based on the status of the previous row:
prev_status = 0
for idx, row in df.iterrows():
    if prev_status in [1, 2] and row[column_a] != 0:
        row[column_b] += row[column_a]
        row[column_c] = 0
        row[column_d] = 0
        row[column_a] = 0
    prev_status = row[status]
    df.loc[idx] = row
However this is very slow when running on 1GB of data. What are ways to optimize this?
Try this:
df['previous_status'] = df['status'].shift(1)
# previous row's status is 1 or 2 and column_a is non-zero
mask = df['previous_status'].isin([1, 2]) & (df['column_a'] != 0)
df.loc[mask, 'column_b'] += df.loc[mask, 'column_a']
df.loc[mask, 'column_c'] = 0
df.loc[mask, 'column_d'] = 0
df.loc[mask, 'column_a'] = 0
Look at using shift, e.g.
df["new_column"] = df["column_name"].shift(x)
This creates a column whose values are the values of another column shifted down by x rows. It makes it much quicker to do vectorized calculations on whole columns, rather than applying a function to every row of the DataFrame.
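For instance, with a tiny illustrative frame (the column names here are just placeholders):
import pandas as pd

df = pd.DataFrame({"column_name": [10, 20, 30, 40]})
df["new_column"] = df["column_name"].shift(1)  # previous row's value, NaN for the first row
#    column_name  new_column
# 0           10         NaN
# 1           20        10.0
# 2           30        20.0
# 3           40        30.0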

Create tensors where all elements up to a given index are 1s, the rest are 0s

I have a placeholder lengths = tf.placeholder(tf.int32, [10]). Each of the 10 values assigned to this placeholder is <= 25. I now want to create a 2-dimensional tensor, called masks, of shape [10, 25], where each of the 10 vectors of length 25 has the first n elements set to 1 and the rest set to 0, with n being the corresponding value in lengths.
What is the easiest way to do this using TensorFlow's built in methods?
For example:
lengths = [4, 6, 7, ...]
-> masks = [[1, 1, 1, 1, 0, 0, 0, 0, ..., 0],
            [1, 1, 1, 1, 1, 1, 0, 0, ..., 0],
            [1, 1, 1, 1, 1, 1, 1, 0, ..., 0],
            ...]
You can reshape lengths to a (10, 1) tensor, then compare it with another sequence of indices 0, 1, 2, ..., 24, which due to broadcasting will result in True if an index is smaller than the length, otherwise False; then you can cast the boolean result to 1 and 0:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
n_features = 25

masks = tf.cast(tf.range(n_features) < tf.reshape(lengths, (-1, 1)), tf.int8)
with tf.Session() as sess:
    print(sess.run(masks))
#[[1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
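As an aside (not part of the original answer), TensorFlow 1.x also provides tf.sequence_mask, which builds exactly this kind of mask:
import tensorflow as tf

lengths = tf.constant([4, 6, 7])
# Row i gets lengths[i] leading True values, padded with False up to maxlen;
# the cast turns the booleans into 1s and 0s.
masks = tf.cast(tf.sequence_mask(lengths, maxlen=25), tf.int8)

with tf.Session() as sess:
    print(sess.run(masks))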

Pandas dataframe operations

I have the following dataframe,
df = pd.DataFrame({
    'CARD_NO': [0, 1, 2, 2, 1, 111],
    'request_code': [2400, 2200, 2400, 3300, 5500, 6600],
    'merch_id': [1, 2, 1, 3, 3, 5],
    'resp_code': [0, 1, 0, 1, 1, 1]})
Based on this requirement,
inquiries = df[(df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)]
I need to flag the records in df whose CARD_NO matches a CARD_NO for which inquiries is True.
If inquiries returns:
index CARD_NO merch_id request_code resp_code
0 0 1 2400 0
2 2 1 2400 0
Then df should look like so:
index CARD_NO merch_id request_code resp_code flag
0 0 1 2400 0 N
1 1 2 2200 1 N
2 2 1 2400 0 N
3 2 3 3300 1 Y
4 1 3 5500 1 N
5 111 5 6600 1 N
I've tried several merges, but cannot seem to get the result I want.
Any help would be greatly appreciated.
Thank you.
The following should work if I understand your question correctly, which is that you want to set the flag to true only when the CARD_NO is in the filtered group but the row itself is not in the filtered group:
import numpy as np
filter = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
df['flag'] = np.where(~filter & df.CARD_NO.isin(df.loc[filter, 'CARD_NO']), 'Y', 'N')
Or, building the same mask and mapping it to "Y"/"N":
filtered = (df.request_code == 2400) & (df.merch_id == 1) & (df.resp_code == 0)
df["flag"] = (~filtered & df.CARD_NO.isin(df.loc[filtered, "CARD_NO"])).map(lambda x: "Y" if x else "N")