Dataframe change column value on if statement and keeps the new value to next row - dataframe

I wish good health to you and your family.
In my dataframe I have a column 'condition', which holds float values (converted with .astype(float)).
For every row, a calculation is run on the data I put in this dataframe, and if the result is over a specific amount, the value of 'condition' is increased by 1. That part works as it should.
I made another column named 'order', which should change its value when 'condition' has a value of 3. This code shows what I mean:
import pandas as pd
import numpy as np

def graph():
    df = pd.DataFrame(np.random.randint(-3, 4, size=(100, 1)), columns=['condition'])
    df['order'] = 0
    df.loc[(df['condition'] == 3) & (df['order'] == 0), 'order'] = df['order'] + 1
    df.loc[(df['condition'] == -3) & (df['order'] == 1), 'order'] = df['order'] + -1
    df.to_csv('copy_bars.csv')

graph()
As you can see, the value in 'order' changes to 1 on the rows that meet the first condition, but the second statement never changes it back from 1 to 0. The other rows are 0 only because I set the column to 0 at the beginning.
How can I modify the code so that, once 'order' has changed to 1, it keeps this new value on the following rows until the second condition is met?
Row, Condition, Order
0 -1 0
1 3 1
2 -1 0
3 2 0
4 -2 0
5 -3 0
6 0 0
Instead of this, I would like the Order column to hold 1 for rows 1 to 4, so that my second condition can trigger.

If I understood correctly, something like this should do what you want. Because it works row by row and each value depends on the state left by earlier rows, it is not easy to vectorize, though someone else can probably do it. Hope it works for you.
order = []
have_found_plus_3 = False
for i, row in df.iterrows():
    if row['condition'] == 3:
        have_found_plus_3 = True
    elif row['condition'] == -3:
        have_found_plus_3 = False
    if have_found_plus_3:
        order.append(1)
    else:
        order.append(0)
df['order'] = order
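For comparison, here is a minimal sketch of one way the same state could be carried forward without an explicit loop. It assumes the rule is exactly "3 switches order on, -3 switches it off, anything else keeps the previous state"; the helper name `signal` and the sample values (taken from the table in the question) are only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({"condition": [-1, 3, -1, 2, -2, -3, 0]})

# Mark only the rows where the state changes; leave every other row as NaN.
signal = pd.Series(np.nan, index=df.index)
signal.loc[df["condition"] == 3] = 1.0   # switch on
signal.loc[df["condition"] == -3] = 0.0  # switch off

# Forward-fill carries the last switch to the following rows;
# rows before the first switch default to 0.
df["order"] = signal.ffill().fillna(0).astype(int)
print(df)
```

With this input the 'order' column stays at 1 for rows 1 to 4 and returns to 0 from row 5 on, which matches the loop above.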

Related

fast row-wise boolean operations in sparse matrices

I have a ~4.4M dataframe with purchase orders. I'm interested in a column that indicates the presence of certain items in that purchase order. It is structured like this:
df['item_arr'].head()
1 [a1, a2, a5]
2 [b1, b2, c3...
3 [b3]
4 [a2]
There are 4k different items, and in each row there is always at least one. I have generated another 4.4M x 4k dataframe df_te_sub with a sparse structure indicating the same array in terms of booleans, i.e.
c = df_te_sub.columns[[10, 20, 30]]
df_te_sub[c].head()
>>a10 b8 c1
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
The names of the columns are not important, although they are in alphabetical order, for what it's worth.
Given a subset g of items, I am trying to extract the orders (rows) for two different cases:
At least one of the items in g is present in the row
All the items present in the row are in g (the row's items are a subset of g)
For the first rule, the best way I found was:
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
The second rule presented a challenge. I tried different things but they are all slow:
# using set
s = set(g)
df['item_arr'].apply(s.issuperset)
# using the "non selected items"
c = df_te_sub[df_te_sub.columns[~df_te_sub.columns.isin(g)]].sparse.to_coo()
x = np.ones(len(df_te_sub), dtype='bool')
x[c.row] = False
# mix
s = set(g)
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
df['item_arr'].iloc[rows].apply(s.issuperset)
Any ideas to improve performance? I need to do this for several subsets.
The output can be given either in rows (e.g. [0, 2, 3]) or as a boolean mask (e.g. True False True True ....), as both will work to slice the order dataframe.
I feel like you're overthinking this. If you have a boolean array of membership you've already done 90% of the work.
from scipy.sparse import csc_matrix
# Turn your sparse data into a sparse array
arr = csc_matrix(df_te_sub.sparse.to_coo())
# Get the number of items per row
row_len = arr.sum(axis=1).A.flatten()
# Get the column indices for each item and slice your array
arr_col_idx = [df_te_sub.columns.get_loc(g_val) for g_val in g]
# Sum the number of items in g in the slice per row
arr_g = arr[:, arr_col_idx].sum(axis=1).A.flatten()
# Find all the rows with at least one thing in g
arr_one_g = arr_g > 0
# Find all the things in the rows which are subsets of G
# This assumes row_len is always greater than 0, if it isnt add a test for that
arr_subset_g = (row_len - arr_g) == 0
arr_one_g and arr_subset_g are 1d boolean arrays that should index for the things you want.
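To make the counting idea concrete, here is a small self-contained sketch on a toy 4 x 4 membership matrix. The matrix, the item names in the comments and the column positions in `g_cols` are made up for illustration; only the "row total minus total inside g equals zero" test comes from the answer above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy membership matrix: rows are orders, columns are items a, b, c, d.
dense = np.array([
    [1, 1, 0, 0],   # order 0 contains a, b
    [0, 0, 1, 0],   # order 1 contains c
    [1, 0, 0, 1],   # order 2 contains a, d
    [0, 1, 0, 0],   # order 3 contains b
])
arr = csr_matrix(dense)

g_cols = [0, 1]  # hypothetical subset g = {a, b}, given as column positions

# Items per row, and items per row that belong to g.
row_len = np.asarray(arr.sum(axis=1)).ravel()
arr_g = np.asarray(arr[:, g_cols].sum(axis=1)).ravel()

at_least_one = arr_g > 0             # rule 1: some item of g is present
subset_of_g = (row_len - arr_g) == 0  # rule 2: nothing outside g is present

print(at_least_one)
print(subset_of_g)
```

Here orders 0, 2 and 3 satisfy the first rule, while only orders 0 and 3 satisfy the second, because order 2 also contains d.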

Groupby and A)Concate matching strings(and or substring) B)Sum the values

I have df:
row_numbers ID code amount
1 med a 1
2 med a, b 1
3 med b, c 1
4 med c 1
5 med d 10
6 cad a, b 1
7 cad a, b, d 0
8 cad e 2
Pasted the above df:
I want to group by the ID column and A) combine the strings in the code column when a string or substring matches, and B) sum the values of the amount column.
Expected results:
Explanation:
The row_numbers column plays no role in df; I included it only to explain the output.
A) Grouping on the ID column and looking at the code column: the row 1 string, a, matches a substring of row 2. Row 2's substring b matches a substring of row 3. Row 3's substring c matches the string of row 4. Hence rows 1, 2, 3 and 4 are combined. The row 5 string does not match any string or substring, so it forms a separate group. B) Based on this, the values of rows 1, 2, 3 and 4 are added together, and row 5 stays a separate group.
Thanks in advance for your time and thoughts :).
EDIT - 1
Pasting the real time.
Expected output:
Explanation:
I have to group on the id column, concatenate the values of the code column, and sum the values of the units and vol columns. The matching values of the code column (the ones to be concatenated) are color coded. Row 1 is linked with row 5 and row 9. Row 9 is in turn linked with row 3. Hence rows 1, 5, 9 and 3 are combined. Similarly rows 2 and 7, and so on. Row 8 has no link with any of the values within group med (id column) and hence stays a separate row.
Thanks!.
Update: Judging from your latest sample data, this is not simple data munging. There is no straightforward vectorized solution; it is a graph theory problem. You need to find the connected components within each ID group and do the calculation on each connected component.
Consider each string as a node of a graph. If two strings overlap, they are connected nodes. For every node, you need to traverse all paths connected to it and do the calculation on all nodes reached through these paths. This traversal can be done with depth-first search.
However, before running the depth-first search, you need to preprocess the strings into sets so that overlaps can be checked.
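As a rough illustration of that idea, separate from the pandas code, the sketch below finds connected components among a few code strings using plain Python sets and an explicit stack. The function name `connected_components` and the sample strings are only for illustration.

```python
def connected_components(code_sets):
    """Group positions of sets that overlap directly or through a chain."""
    seen = set()
    components = []
    for start in range(len(code_sets)):
        if start in seen:
            continue
        stack = [start]
        component = []
        while stack:
            i = stack.pop()
            if i in seen:
                continue
            seen.add(i)
            component.append(i)
            # Two nodes are connected when their code sets share an element.
            for j, other in enumerate(code_sets):
                if j not in seen and not code_sets[i].isdisjoint(other):
                    stack.append(j)
        components.append(sorted(component))
    return components

codes = ["a", "a, b", "b, c", "c", "d"]
# Preprocess: strip blanks, split on commas, convert to sets.
code_sets = [set(c.replace(" ", "").split(",")) for c in codes]
groups = connected_components(code_sets)
print(groups)
```

The first four strings end up in one component because each overlaps the next, and "d" forms a component of its own.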
Method 1: Recursive
Do the following:
Define a function dfs to recursively run depth-first search
Define a function gfunc to use with groupby apply. This function will traverse elements of each group of ID and return the desired dataframe.
Get rid of any blank spaces in each string, then split the strings and convert them to sets using replace, split and map, and assign the result to a new column new_code in df
Call groupby on ID and apply using function gfunc. Call droplevel and reset_index to get the desired output
Code as follows:
import numpy as np
import pandas as pd

def dfs(node, index, glist, g_checked_rows):
    # values of the current row: [code, amount, volume]
    ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
    g_checked_rows.add(index)
    for j, s in glist:
        # visit every unchecked row whose code set overlaps this one
        if j not in g_checked_rows and not node.isdisjoint(s):
            t_arr = dfs(s, j, glist, g_checked_rows)
            ret_arr[0] += ', ' + t_arr[0]
            ret_arr[1:] += t_arr[1:]
    return ret_arr

def gfunc(x):
    checked_rows = set()
    final = []
    code_list = list(x.new_code.items())
    for i, row in code_list:
        if i not in checked_rows:
            final.append(dfs(row, i, code_list, checked_rows))
    return pd.DataFrame(final, columns=['code','units','vol'])

df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
Out[16]:
ID code units vol
0 med CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18 4 4
1 med CO-16, CO-B20, CO-16 3 3
2 med CO-252, CO-252, CO-45 3 3
3 med OA-258 1 1
4 cad PR-96, PR-96, CO-243 4 4
5 cad PR-87, OA-258, PR-87 3 3
Note: I assume your pandas version is 0.24+. If it is < 0.24, use reset_index and drop in the last step instead of droplevel and reset_index, as follows:
df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', axis=1)
Method 2: Iterative
To make this complete, I implemented a version of gfunc that uses an iterative process instead of recursion. The iterative process requires only one function.
However, the function is more complicated. The logic of the iterative process is as follows:
1. Push the first node to the deque. While the deque is not empty, pop the top node.
2. If a node is not marked as checked, process it and mark it as checked.
3. Find all its neighbors that haven't been marked, in the reverse order of the list of nodes, and push them to the deque.
4. If the deque is not empty, pop a node from the top of the deque and continue from step 2.
Code as follows:
from collections import deque

def gfunc_iter(x):
    checked_rows = set()
    final = []
    q = deque()
    code_list = list(x.new_code.items())
    code_list_rev = code_list[::-1]
    for i, row in code_list:
        if i not in checked_rows:
            q.append((i, row))
            ret_arr = np.array(['', 0, 0], dtype='O')
            while (q):
                n, node = q.pop()
                if n in checked_rows:
                    continue
                ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                if not ret_arr[0]:
                    ret_arr = ret_arr_child.copy()
                else:
                    ret_arr[0] += ', ' + ret_arr_child[0]
                    ret_arr[1:] += ret_arr_child[1:]
                checked_rows.add(n)
                #push to `q` all neighbors in the reversed list of nodes
                for j, s in code_list_rev:
                    if j not in checked_rows and not node.isdisjoint(s):
                        q.append((j, s))
            final.append(ret_arr)
    return pd.DataFrame(final, columns=['code','units','vol'])

df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()
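A different way to get the same grouping, shown here only as a sketch, is to label each row with a component id first and then let groupby do the concatenation and the sum. The labelling below swaps the depth-first search for a small union-find keyed on individual code tokens; `find`, `component_labels` and the `component` column are hypothetical names, and the data is the first example from the question (no volume column).

```python
import pandas as pd

def find(parent, i):
    # Follow parent links to the root, shortening the path on the way.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def component_labels(group):
    """Label each row of one ID group with the root row of its component."""
    parent = {i: i for i in group.index}
    first_row_with = {}  # token -> first row in which it appeared
    for i, codes in group["code"].items():
        for token in codes.replace(" ", "").split(","):
            if token in first_row_with:
                # Same token seen before: merge the two rows' components.
                parent[find(parent, i)] = find(parent, first_row_with[token])
            else:
                first_row_with[token] = i
    return pd.Series({i: find(parent, i) for i in group.index})

df = pd.DataFrame({
    "ID": ["med"] * 5 + ["cad"] * 3,
    "code": ["a", "a, b", "b, c", "c", "d", "a, b", "a, b, d", "e"],
    "amount": [1, 1, 1, 1, 10, 1, 0, 2],
})

labels = pd.concat([component_labels(g) for _, g in df.groupby("ID", sort=False)])
df["component"] = labels  # aligned on the row index

result = (df.groupby(["ID", "component"], sort=False)
            .agg(code=("code", ", ".join), amount=("amount", "sum"))
            .reset_index())
print(result)
```

The labelling touches each token once instead of comparing every pair of rows, which may matter for large groups; the DFS versions above remain the approach given in the answer.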
I believe the three main ideas for doing what you want are:
create an accumulator data structure (a DataFrame in this case)
iterate over pairs of rows, so that each iteration has (currentRow, nextRow)
match the codes of the current row against the next row and against the accumulated rows
It's not totally clear exactly what pattern match you're looking for, so I assumed that if any letter of the currentRow code is in the next one, they should be concatenated.
Using a data.csv (with space separators) as an example:
row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2
import pandas as pd
from itertools import zip_longest

def generate_pairs(group):
    ''' generate pairs (currentRow, nextRow) '''
    group_curriterrows = group.iterrows()
    group_nextiterrows = group.iterrows()
    next(group_nextiterrows)
    zip_list = zip_longest(group_curriterrows, group_nextiterrows)
    return zip_list

def generate_lists_to_check(currRow, nextRow, accumulated_rows):
    ''' generate list if any next letters are in current ones and
    another list if any next letters are in the accumulated codes '''
    currLetters = str(currRow["code"]).split(",")
    nextLetters = str(nextRow["code"]).split(",")
    letter_inNext = [letter in nextLetters for letter in currLetters]
    unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
    letter_inHistory = [any(letter in unq for letter in nextLetters)
                        for unq in unique_acc_codes]
    return letter_inNext, letter_inHistory

def create_newRow(accumulated_rows, nextRow):
    nextRow["row_numbers"] = str(nextRow["row_numbers"])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    accumulated_rows = pd.concat([accumulated_rows, nextRow.to_frame().T], ignore_index=True)
    return accumulated_rows

def update_existingRow(accumulated_rows, match_idx, Row):
    # single .loc[row, column] calls so the update is written to the frame itself
    accumulated_rows.loc[match_idx, "code"] += ","+Row["code"]
    accumulated_rows.loc[match_idx, "amount"] += Row["amount"]
    accumulated_rows.loc[match_idx, "volume"] += Row["volume"]
    accumulated_rows.loc[match_idx, "row_numbers"] += ','+str(Row["row_numbers"])
    return accumulated_rows

if __name__ == "__main__":
    df = pd.read_csv("extended.tsv",sep=" ")
    groups = pd.DataFrame(columns=df.columns)
    for ID, group in df.groupby("ID", sort=False):
        accumulated_rows = pd.DataFrame(columns=df.columns)
        group_firstRow = group.iloc[0]
        accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
        row_numbers = str(group_firstRow.values[0])
        # DataFrame.set_value was removed in pandas 1.0; .at sets a single cell
        accumulated_rows.at[0, 'row_numbers'] = row_numbers
        zip_list = generate_pairs(group)
        for (currRow_idx, currRow), Next in zip_list:
            if not (Next is None):
                (nextRow_idx, nextRow) = Next
                letter_inNext, letter_inHistory = \
                    generate_lists_to_check(currRow, nextRow, accumulated_rows)
                if any(letter_inNext):
                    accumulated_rows = update_existingRow(accumulated_rows, (len(accumulated_rows)-1), nextRow)
                elif any(letter_inHistory):
                    matches = [idx for (idx, bool_val) in enumerate(letter_inHistory) if bool_val == True]
                    first_match_idx = matches[0]
                    accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, nextRow)
                    for match_idx in matches[1:]:
                        accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, accumulated_rows.loc[match_idx])
                        accumulated_rows = accumulated_rows.drop(match_idx)
                elif not any(letter_inNext):
                    accumulated_rows = create_newRow(accumulated_rows, nextRow)
        groups = pd.concat([groups, accumulated_rows])
    groups.reset_index(inplace=True,drop=True)
    print(groups)
OUTPUT for the first example, with rows in their normal order (the lines that use the volume column were removed from the code above, because the first example has no volume column):
row_numbers ID code amount
0 1 med a,a,b,b,c,c 4
1 5 med d 10
2 6 cad a,b,a,b,d 1
3 8 cad e 2
OUTPUT new example:
row_numbers ID code amount volume
0 1,5,9,3 med CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18 4 4
1 2,7 med CO-16,CO-B20,CO-16 3 3
2 4,6 med CO-252,CO-252,CO-45 3 3
3 8 med OA-258 1 1
4 10,13 cad PR-96,PR-96,CO-243 4 4
5 11,12 cad PR-87,OA-258,PR-87 3 3

loop over a dataframe and populate with values

I am trying to loop over a dataframe and fill a new column with values according to a rule:
#formula for trading strategy
df['new_column'] = ""
for index,row in df.iterrows():
    if row.reversal == 1:
        row.new_column = 1
        index += 126
        row.new_column = -1
    else:
        row.new_column = 0
This formula is meant to populate the new column in a way that, when reversal=1, a value of 1 is given, followed by 0s for the next 125 rows, and a -1 in the 126th row. Then it should start again looking at whether the 127th item of the reversal column is 1 (indicating a reversal) or 0, etc. Instead, if reversal !=1, a value of 0 is given.
The problem is that when I take a look at the new column formed, it is still an empty column. There must be an error in the way I input the values in it. I looked at other ways to construct if statements for dataframes (e.g., lambda), but they do not allow me to perform all the operations in this code
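new_column can take the values 0, 1 or -1. Assigning to row inside iterrows() does not change df, because each row is a copy, so write to df with .loc instead.
I suggest you initialise the column to 0 first, so there is no need to set 0 later: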
df['new_column'] = 0
index = 0
while index < df.shape[0]:
    if df['reversal'][index] == 1:
        df.loc[index,'new_column'] = 1 #set 1 to same index
        df.loc[index + 126, 'new_column'] = -1 #set -1 to 126th row
        index = index + 127 #inc index to next loop
    else:
        index = index + 1
Be careful that index + 126 does not go past the last row of the dataframe.
You could modify the test to secure the loop (to avoid an error message):
if df['reversal'][index] == 1 and (index + 126) < df.shape[0]:
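The loop and the bounds check can be tried on a small frame with a shorter holding period. The sketch below wraps the same logic in a function; the name `mark_trades` and the `hold` parameter are made up for illustration (the question uses 126), and the function assumes a default 0..n-1 row order.

```python
import pandas as pd

def mark_trades(reversal, hold=126):
    """Return 1 at each entry, -1 `hold` rows later, and 0 elsewhere."""
    out = [0] * len(reversal)
    i = 0
    while i < len(reversal):
        # Only open a trade if the exit row still exists.
        if reversal[i] == 1 and i + hold < len(reversal):
            out[i] = 1
            out[i + hold] = -1
            i += hold + 1   # resume looking after the exit row
        else:
            i += 1
    return out

df = pd.DataFrame({"reversal": [0, 1, 1, 0, 0, 1, 0, 1, 0]})
df["new_column"] = mark_trades(df["reversal"].tolist(), hold=3)
print(df)
```

With hold=3 the reversal at row 2 is skipped because a trade opened at row 1 is still running, and the reversal at row 7 is ignored because its exit row would fall outside the frame.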

Python SettingWithCopyWarning, but I'm trying to set the value using .ix

I have a pandas dataframe in python and I'm trying to modify a specific value in a particular row. I found a solution to this problem in "Set value for particular cell in pandas DataFrame using index", but it is still generating the SettingWithCopy error.
The name of the data frame is internal_df and it has columns 'price', 'visits', and 'orders'. Specifically, I want to add the number of orders and visits to a lower price point if we don't have a sufficient number of visits (100 in this example). Note that below the variable 'price' is a float, and the data type for 'price' within the internal_df data frame is float, while visits and orders are ints.
if int(internal_df[internal_df['price']==price]['visits']) < 100:
    for index, row in internal_df.iterrows():
        if float(row['price']) > price:
            internal_df.ix[internal_df['price'] == price, 'visits'] = internal_df.ix[internal_df['price'] == price, 'visits'] + row['visits']
            internal_df.ix[internal_df['price'] == price, 'orders'] = internal_df.ix[internal_df['price'] == price, 'orders'] + row['orders']
Here is a sample of the data
price visits sales
0 1399.99 2 0
1 169.99 2 0
2 99.99 1 0
3 99.99 1 0
4 139.99 1 0
5 319.99 1 0
6 198.99 1 0
7 119.99 1 0
8 39.99 1 0
9 259.98 1 0
Does anyone have any suggestions, or should I just ignore the error?
Brad
Note that .ix is deprecated (and was removed in pandas 1.0) because it indexes by position or by label, depending on the data type of the index. Use .loc or .iloc instead.
This SettingWithCopyWarning might originate from a "get" operation several lines of code above what you've provided. A quick fix might be to find where internal_df is first assigned, and to add .copy() to the end of the assignment statement. For example, if you have internal_df = df[df['colname'] <= value], change that to internal_df = df[df['colname'] <= value].copy() and hopefully that resolves the error.
Also, I think you can do what you're trying to do without a for loop, which would be faster and more readable!
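As a sketch of what a loop-free version might look like, under the assumption that the goal is to add the visits and orders of every higher-priced row to the row(s) at the target price: the sample values and the names `target`, `higher` and `extra` are made up for illustration, and the frame is built directly so it is not a slice of another frame.

```python
import pandas as pd

internal_df = pd.DataFrame({
    "price": [1399.99, 169.99, 99.99, 139.99],
    "visits": [2, 2, 1, 1],
    "orders": [0, 1, 0, 2],
})
price = 99.99

target = internal_df["price"] == price   # row(s) at the price of interest
higher = internal_df["price"] > price    # rows at higher price points

if internal_df.loc[target, "visits"].sum() < 100:
    # Sum the higher-priced rows once instead of adding them row by row.
    extra = internal_df.loc[higher, ["visits", "orders"]].sum()
    internal_df.loc[target, "visits"] += extra["visits"]
    internal_df.loc[target, "orders"] += extra["orders"]

print(internal_df)
```

Both the read and the write go through a single .loc call on internal_df itself, which is the access pattern the warning is meant to encourage.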

Calculate diff() between selected rows

I have a dataframe with ordered times (in seconds) and a column that is either 0 or 1:
time bit
index
0 0.24 0
1 0.245 0
2 0.47 1
3 0.471 1
4 0.479 0
5 0.58 1
... ... ...
I want to select those rows where the time difference is, let's say <0.01 s. But only those differences between rows with bit 1 and bit 0. So in the above example I would only select row 3 and 4 (or any one of them). I thought that I would calculate the diff() of the time column. But I need to somehow select on the 0/1 bit.
Coming from the future to answer this one. You can apply a function to the dataframe that finds the indices of the rows that adhere to the condition and returns the row pairs accordingly:
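def filter_(x, threshold = 0.01):
    # labels of rows that are close in time to the previous row and differ from it in bit
    indices = df.index[(df.time.diff() < threshold) & (df.bit.diff().abs() == 1)]
    # add the preceding row of each pair; Index.union is used because newer
    # pandas versions no longer treat `|` between indexes as a set union
    mask = indices.union(indices - 1)
    return x[mask]

print(df.apply(filter_, args = (0.01,)))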
Output:
time bit
3 0.471 1
4 0.479 0
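The same selection can also be written with boolean masks only, which avoids arithmetic on index labels. This is a sketch on the sample data from the question; the names `second`, `first` and `selected` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [0.24, 0.245, 0.47, 0.471, 0.479, 0.58],
    "bit": [0, 0, 1, 1, 0, 1],
})
threshold = 0.01

# True on the second row of each pair: close in time to the previous row
# and with a different bit.
second = (df["time"].diff() < threshold) & (df["bit"].diff().abs() == 1)
# Shift the mask up by one to mark the first row of each pair as well.
first = second.shift(-1, fill_value=False)

selected = df[first | second]
print(selected)
```

Rows 0 and 1 are close in time but share the same bit, so they are not selected; only rows 3 and 4 satisfy both conditions.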