loop over a dataframe and populate with values - pandas

I am trying to loop over a dataframe and fill a new column with values according to a rule:
# formula for trading strategy
df['new_column'] = ""
for index, row in df.iterrows():
    if row.reversal == 1:
        row.new_column = 1
        index += 126
        row.new_column = -1
    else:
        row.new_column = 0
This formula is meant to populate the new column so that, when reversal == 1, a value of 1 is assigned, followed by 0s for the next 125 rows and a -1 in the 126th row. Then it should start again, checking whether the 127th item of the reversal column is 1 (indicating a reversal) or 0, and so on. Otherwise, if reversal != 1, a value of 0 is assigned.
The problem is that when I look at the new column, it is still empty. There must be an error in the way I write the values into it. I looked at other ways to construct if statements for dataframes (e.g., lambda), but they do not let me perform all the operations in this code.

new_column can take the values 0, 1 or -1. Note that iterrows yields a copy of each row, so assigning to row never writes back to the dataframe, which is why your column stays empty.
I suggest you initialise the column with 0, so there is no need to set 0 inside the loop:
df['new_column'] = 0

index = 0
while index < df.shape[0]:
    if df['reversal'][index] == 1:
        df.loc[index, 'new_column'] = 1         # set 1 at the same index
        df.loc[index + 126, 'new_column'] = -1  # set -1 at the 126th row after it
        index = index + 127                     # jump to the row after the window
    else:
        index = index + 1
Be careful that index + 126 does not exceed the number of rows of the dataframe; you can modify the test to secure the loop (and avoid an error message):
if df['reversal'][index] == 1 and (index + 126) < df.shape[0]:
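Putting it together, a minimal runnable sketch with the bounds check folded in (the toy reversal data here is made up for illustration):
import pandas as pd

# toy data: one reversal at row 0, then a long quiet stretch
df = pd.DataFrame({'reversal': [1] + [0] * 200})
df['new_column'] = 0

index = 0
while index < df.shape[0]:
    # only open a window if the -1 slot still fits inside the frame
    if df['reversal'][index] == 1 and (index + 126) < df.shape[0]:
        df.loc[index, 'new_column'] = 1
        df.loc[index + 126, 'new_column'] = -1
        index += 127  # skip past the window just written
    else:
        index += 1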

Related

Dataframe: change column value on an if statement and keep the new value for the next rows

I wish good health to you and your family.
In my dataframe I have a column 'condition' which holds float values (via .astype(float)).
Based on the information I put into this dataframe, it does some math for every row, and if the result is over a specific amount it increases the value of 'condition' by 1. Everything works fine with that, as it should.
I made another column named 'order', which changes its value when 'condition' has a value of 3. Here is the code with which you can see what I mean:
import pandas as pd
import numpy as np

def graph():
    df = pd.DataFrame(np.random.randint(-3, 4, size=(100, 1)), columns=['condition'])
    df['order'] = 0
    df.loc[(df['condition'] == 3) & (df['order'] == 0), 'order'] = df['order'] + 1
    df.loc[(df['condition'] == -3) & (df['order'] == 1), 'order'] = df['order'] + -1
    df.to_csv('copy_bars.csv')

graph()
As you can see, it changes the value in 'order' to 1 when the first condition is met. But it never changes back from 1 to 0 via the second statement; the 0s you see are only there because I initialise the column with 0 at the beginning.
How could I modify the code so that, once the value changes to 1, it keeps that new value until the second condition fires?
Row  Condition  Order
  0         -1      0
  1          3      1
  2         -1      0
  3          2      0
  4         -2      0
  5         -3      0
  6          0      0
Instead of this, I would like the Order column to hold the value 1 for rows 1 to 4, so that my second condition can trigger.
If I understood what you want, this should be something like it. Because it works row by row and depends on two values, it is not easy to vectorize, but probably someone else can do it. Hope it works for you.
order = []
have_found_plus_3 = False  # flag: are we between a 3 and the next -3?
for i, row in df.iterrows():
    if row['condition'] == 3:
        have_found_plus_3 = True
    elif row['condition'] == -3:
        have_found_plus_3 = False
    if have_found_plus_3:
        order.append(1)
    else:
        order.append(0)
df['order'] = order
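For what it's worth, one possible vectorization of the same flag logic uses forward-fill (a sketch, assuming df has the 'condition' column as above):
import numpy as np
import pandas as pd

# mark only the switch points; every other row is left undefined
s = pd.Series(np.nan, index=df.index)
s[df['condition'] == 3] = 1   # flag switches on
s[df['condition'] == -3] = 0  # flag switches off
# carry the last switch forward; rows before the first switch default to 0
df['order'] = s.ffill().fillna(0).astype(int)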

fast row-wise boolean operations in sparse matrices

I have a dataframe of ~4.4M purchase orders. I'm interested in a column that indicates the presence of certain items in each purchase order. It is structured like this:
df['item_arr'].head()
1 [a1, a2, a5]
2 [b1, b2, c3...
3 [b3]
4 [a2]
There are 4k different items, and in each row there is always at least one. I have generated another 4.4M x 4k dataframe df_te_sub with a sparse structure indicating the same array in terms of booleans, i.e.
c = df_te_sub.columns[[10, 20, 30]]
df_te_sub[c].head()
>>  a10  b8  c1
0     0   0   0
1     0   0   0
2     0   0   0
3     0   0   0
4     0   0   0
The names of the columns are not important, although they are in alphabetical order, for what it's worth.
Given a subset g of items, I am trying to extract the orders (rows) for two different cases:
At least one of the items is present in the row
The items present in the row are all a subset of g
For the first rule, the best way I found was:
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
The second rule presented a challenge. I tried different things but they are all slow:
# using set
s = set(g)
df['item_arr'].apply(s.issuperset)
# using the "non selected items"
c = df_te_sub[df_te_sub.columns[~df_te_sub.columns.isin(g)]].sparse.to_coo()
x = np.ones(len(df_te_sub), dtype='bool')
x[c.row] = False
# mix
s = set(g)
c = df_te_sub[g].sparse.to_coo()
rows = pd.unique(c.row)
df['item_arr'].iloc[rows].apply(s.issuperset)
Any ideas to improve performance? I need to do this for several subsets.
The output can be given either in rows (e.g. [0, 2, 3]) or as a boolean mask (e.g. True False True True ....), as both will work to slice the order dataframe.
I feel like you're overthinking this. If you have a boolean array of membership you've already done 90% of the work.
from scipy.sparse import csc_matrix
# Turn your sparse data into a sparse array
arr = csc_matrix(df_te_sub.sparse.to_coo())
# Get the number of items per row
row_len = arr.sum(axis=1).A.flatten()
# Get the column indices for each item and slice your array
arr_col_idx = [df_te_sub.columns.get_loc(g_val) for g_val in g]
# Sum the number of items in g in the slice per row
arr_g = arr[:, arr_col_idx].sum(axis=1).A.flatten()
# Find all the rows with at least one thing in g
arr_one_g = arr_g > 0
# Find all the things in the rows which are subsets of G
# This assumes row_len is always greater than 0; if it isn't, add a test for that
arr_subset_g = (row_len - arr_g) == 0
arr_one_g and arr_subset_g are 1d boolean arrays that should index the rows you want.
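For example, to turn the masks into row positions or to slice the original orders frame (a usage sketch, assuming df is the 4.4M-row orders dataframe from the question):
import numpy as np

rows_any = np.flatnonzero(arr_one_g)  # row positions with at least one item of g
orders_subset = df[arr_subset_g]      # orders whose items all lie inside g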

Creating a loop to check the column entries in pandas

Can anyone help me write a loop for this use case? As I'm new to programming, I don't get how to write this.
What I want is:
A loop should check whether the value of the item_id column in dataframe B matches the question_id column in dataframe questions, and then compare the user_answer entry (dataframe B) to correct_answer (dataframe questions):
if they match, it should return True/Correct or increase a counter by 1;
if they don't match, it should return False/Incorrect or decrease the counter by 1.
You can try:
counter = 0
for key, item_id in B['item_id'].items():  # iteritems() is deprecated in recent pandas
    try:
        if B.loc[key, 'user_answer'] == questions.loc[questions['question_id'] == item_id, 'correct_answer'].values[0]:
            counter += 1
        else:
            pass  # put here whatever you want to do if the answer is wrong
    except Exception:
        pass  # put here whatever you want to do if the question id from DF(B) is not in DF(questions)
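A loop-free alternative would be to merge the two frames on the question id and compare the aligned columns (a sketch, assuming the column names above):
# align each answer in B with its correct answer via the question id
merged = B.merge(questions, left_on='item_id', right_on='question_id', how='left')
merged['is_correct'] = merged['user_answer'] == merged['correct_answer']
counter = int(merged['is_correct'].sum())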

Get column index label based on values

I have the following:
C1 C2 C3
0 0 0 1
1 0 0 1
2 0 0 1
And I would like to get the corresponding column label that holds the 1s, so the result
should be "C3".
I know how to do this by transposing the dataframe and then getting the index values, but this is not ideal for the dataframes I have, and I wonder whether there is a more efficient solution?
I would save the result in a list, because otherwise there could be more than one column with values equal to 1. You can use DataFrame.loc.
If all column values must be 1, you can use:
df.loc[:,df.eq(1).all()].columns.tolist()
Output:
['C3']
If this isn't necessary, then use:
df.loc[:,df.eq(1).any()].columns.tolist()
Or, as suggested by @piRSquared, you can select directly from df.columns:
[*df.columns[df.eq(1).all()]]
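As a quick check, rebuilding the frame from the question reproduces the output (a small self-contained sketch):
import pandas as pd

df = pd.DataFrame({'C1': [0, 0, 0], 'C2': [0, 0, 0], 'C3': [1, 1, 1]})
print(df.loc[:, df.eq(1).all()].columns.tolist())  # ['C3']
print([*df.columns[df.eq(1).all()]])               # ['C3']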

TypeError: 'DataFrame' object is not callable in concatenating different dataframes of certain types

I keep getting the following error.
I read a file that contains time-series data with 3 columns: [meter ID] [daycode (explained later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I will observe meter readings over the whole length of data in this file, but first filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode based on its rule. The first 3 digits are in the range 1 to 365 x 2 = 730, the last 2 digits in the range 1 to 48. These are 30-min interval readings over a 2-year span (but not all meters have the full length).
So I created separate files, one containing the dates and one the times. I will use indexing to convert the digits of daycode into the corresponding date & time that these files contain.
# dcodebook index starts from 0, so subtract 1 from the daycode before matching
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some weird reason, dcodebook had to be indexed with the .iloc function, as I understood it, while hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
#to avoid duplicate index Valueerror, create separate dataframes..
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print datcode_df
print test
What I don't understand:
I tested earlier that columns of different dataframes can be merged using simple addition, as seen above.
I initially assigned the result to the existing column ['daycode'] in the test dataframe, so that previous values would be replaced, and the same error message was returned.
Please advise.
Both DataFrames need to be the same size, so it is necessary that day and hm are unique.
Then use reset_index with drop=True so both have the same indices, and finally remove the () in the join; a DataFrame is indexed with square brackets, and calling it with () is what raises TypeError: 'DataFrame' object is not callable:
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']
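If you then want to write the result back into the existing daycode column of test (whose index is the meter ID rather than a range), assigning the raw values sidesteps index alignment (a sketch, assuming the lengths match):
# bypass alignment between datcode_df's RangeIndex and test's meter index
test['daycode'] = datcode_df.values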