Problem in creating a new column using for loop - pandas

I have to create a new column 'Action' in a dataframe whose values are :
1 if the next day's Close Price is greater than the present day's
-1 if the next day's Close Price is less than the present day's
that is,
Action[i] = 1 if Close Price[i+1]>Close Price[i]
Action[i] = -1 if Close Price[i+1]<Close Price[i]
I have used the following code:
dt = pd.read_csv("C:\Subhro\ML_Internship\HDFC_Test.csv", sep=',',header=0)
df = pd.DataFrame(dt)
for i in df.index:
    if(df['Close Price'][i+1]>df['Close Price'][i]):
        df['Action'][i]=1
    elif(df['Close Price'][i+1]<df['Close Price'][i]):
        df['Action'][i]=-1
print(df)
But I am getting an error :
KeyError: 'Action'
in line:
df['Action'][i]=1
Please help me out

You are getting the KeyError because the dataframe has no column called 'Action'. Either of the following before the loop will resolve the error:
df['Action'] = 0
or
df['Action'] = np.nan
However, you will still get warnings, because df['Action'][i] = 1 is a chained assignment, which pandas flags (typically as a SettingWithCopyWarning).
It is recommended that you assign with .loc instead, e.g.
df.loc[i, "Action"] = 1
Note that with this method, you won't even need to create an empty "Action" column before the loop.
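As a minimal sketch of the .loc approach, the loop below stops one row early so that i + 1 always exists, then repeats the same comparison without a loop using shift(-1). The prices are made up for illustration, the index is assumed to be the default 0..n-1 range, and the "Action_vec" column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration; the original CSV is not available here.
df = pd.DataFrame({"Close Price": [100.0, 102.5, 101.0, 101.0, 105.0]})

# Loop version: skip the last row, which has no "next day" to compare with.
for i in df.index[:-1]:
    if df.loc[i + 1, "Close Price"] > df.loc[i, "Close Price"]:
        df.loc[i, "Action"] = 1
    elif df.loc[i + 1, "Close Price"] < df.loc[i, "Close Price"]:
        df.loc[i, "Action"] = -1

# Vectorized version: shift(-1) lines up each price with the next day's price.
current = df["Close Price"]
following = current.shift(-1)
df["Action_vec"] = np.select(
    [(following > current).to_numpy(), (following < current).to_numpy()],
    [1, -1],
    default=np.nan,  # equal prices and the last row stay empty
)

print(df)
```

Rows where the price does not change, and the last row, are left as NaN in both versions, since the question does not say what they should contain.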

Related

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a problem with computation time.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses as of a given match, taking the previous matches into account.
The problem is that I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame ( data[0:1000] ) I got a result in 13 seconds. This is why I think it's a computation time problem.
I also tried to first use a groupby("clubid") and then run my for loop within every group, but I got lost.
Something else that bothers me: I have at least 2 lines with the exact same date/hour, because each match has at least two identical dates. Because of this I can't use the date as the index.
Could you help me with these issues, please?

As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
"win_bool":[0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
"clubid": [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
"date" : [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
"othercol":["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
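On the timing itself, a rough estimate assuming the nested loop scales quadratically: 320 times more rows means 320² = 102,400 times more work, so 13 seconds grows to about 1.3 million seconds, on the order of 15 days. If a running count per match is wanted rather than one total per club, a grouped cumulative sum is one possible sketch. The data and the "wins_so_far" / "losses_so_far" column names below are made up for illustration, and rows sharing the same date are not handled specially here.

```python
import pandas as pd

# Made-up matches for illustration: one row per match, no duplicate dates.
dat = pd.DataFrame({
    "clubid":   [1, 1, 1, 2, 2, 2],
    "date":     [1, 2, 3, 1, 2, 3],
    "win_bool": [1, 0, 1, 0, 0, 1],
})

# Sort so that "previous matches" are the rows above within each club.
dat = dat.sort_values(["clubid", "date"])

# Running number of wins per club, including the current match.
dat["wins_so_far"] = dat.groupby("clubid")["win_bool"].cumsum()

# Running number of losses per club: a loss is a row where win_bool is 0.
dat["losses_so_far"] = (1 - dat["win_bool"]).groupby(dat["clubid"]).cumsum()

print(dat)
```

This does one pass per column instead of comparing every pair of rows, which is why it should stay practical at 320,000 rows.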

Creating a loop to check the column entries in pandas

Can anyone help me write a loop for this use case? I'm new to programming and don't see how to write it.
What I want is:
A loop should check whether the value in the item_id column of DataFrame B also appears in the question_id column of DataFrame questions; if so, it should compare the user_answer entry (DataFrame B) with correct_answer (DataFrame questions).
If it matches, it should return True/Correct or add 1 to a counter.
If it doesn't match, it should return False/Incorrect or subtract 1 from the counter.

You can try:
counter = 0
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for key, item_id in B['item_id'].items():
    try:
        if B.loc[key, 'user_answer'] == questions.loc[questions['question_id'] == item_id, 'correct_answer'].values[0]:
            counter += 1
        else:
            pass # put here whatever you want to do if the answer is wrong
    except Exception:
        pass # put here whatever you want to do if the question id from DF(B) is not in DF(questions)
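The same check can also be sketched without an explicit loop by merging the two dataframes on the id columns. The sample data below is made up, the "is_correct" column name is hypothetical, and the sketch assumes each question_id appears only once in questions.

```python
import pandas as pd

# Made-up data for illustration; item 9 has no matching question.
B = pd.DataFrame({
    "item_id": [1, 2, 3, 9],
    "user_answer": ["a", "b", "c", "d"],
})
questions = pd.DataFrame({
    "question_id": [1, 2, 3],
    "correct_answer": ["a", "x", "c"],
})

# A left merge keeps every row of B and attaches the matching question, if any.
merged = B.merge(questions, how="left", left_on="item_id", right_on="question_id")

# True where the answer matches; unmatched items compare as False.
merged["is_correct"] = merged["user_answer"] == merged["correct_answer"]

# Count correct answers only among items that exist in questions.
known = merged["question_id"].notna()
counter = int(merged.loc[known, "is_correct"].sum())

print(merged)
print(counter)
```

Keeping the per-row "is_correct" column makes it easy to switch between a True/False result and a counter, which the question lists as alternatives.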

Using to_datetime on several column names

I am working with several CSVs whose first N columns hold general information and whose next M columns (M is large) hold information for a date.
This is the dataframe picture
I need to convert only the names of columns N+1 through N+M-1 to date format.
I tried the following; in this case N+1 = 5, whatever M is, and I assume I can use -1 to leave the last column name unaffected.
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations

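The way you are doing it is not feasible, because an Index cannot be modified in place. Please try it this way:
def convert(x):
    try:
        return pd.to_datetime(x)
    except (ValueError, TypeError):
        # leave names that are not dates unchanged
        return x
ContDiarios.columns = [convert(c) for c in ContDiarios.columns]
Or you can also use the df.rename method to convert them.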
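If only a slice of the column names should be converted, as in the question, one possible sketch is to copy the names into a plain list, replace the slice there, and assign the whole list back. The dataframe, its column names, and the n_info variable below are made up for illustration.

```python
import pandas as pd

# Made-up frame: two information columns, two date columns, one trailing column.
df = pd.DataFrame(
    [[1, "x", 10, 20, 30]],
    columns=["id", "name", "2020-01-01", "2020-01-02", "total"],
)
n_info = 2  # hypothetical number of leading information columns

# An Index is immutable, but a plain list of its labels is not.
cols = list(df.columns)
cols[n_info:-1] = pd.to_datetime(cols[n_info:-1])

# Assigning a complete new list of labels is allowed.
df.columns = cols

print(df.columns)
```

The first n_info names and the last name stay as strings, while the names in between become Timestamp objects.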

pandas - numpy using np.where to calculate and construct new columns

I am trying to create a new column based on selection criteria in another column. This is at the end of a while loop, so the data frame does not have the column until this part of the first iteration. All subsequent iterations will be based on this column's total from the previous iteration and the current totals:
if 'cBeds' in sPhase.columns:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', (sPhase['cBeds'] + (sPhase[infCount] * .08)), sPhase['cBeds'])
else:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', (sPhase[infCount] * .08), sPhase['cBeds'])
However, when I run the code I get KeyError: 'cBeds'
How can I handle updating a column in a conditional when the column doesn't exist on the first iteration?

In the else clause, you reference sPhase['cBeds'] as the third parameter to np.where even though you've already established that the column does not exist.
If you want to avoid this problem, just add the column at the beginning of the loop and give it a default value that you can conditionally change later.
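A minimal sketch of that suggestion, assuming a default of 0.0 is acceptable: with the column created up front, the if/else on its existence is no longer needed. The data, the two-iteration loop, and the value of infCount below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; infCount holds the name of the count column.
sPhase = pd.DataFrame({
    "COUNTYFP": ["1", "2", "1"],
    "infections": [100, 50, 200],
})
infCount = "infections"

# Create the column once, before the loop, with a default value.
sPhase["cBeds"] = 0.0

for _ in range(2):  # stands in for the original while loop
    # The column always exists here, so one np.where call covers every iteration.
    sPhase["cBeds"] = np.where(
        sPhase["COUNTYFP"] == "1",
        sPhase["cBeds"] + sPhase[infCount] * 0.08,
        sPhase["cBeds"],
    )

print(sPhase)
```

Rows that never match the condition keep the default, so the default should be chosen to mean "no beds added yet".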

Length of value issue with unique ids

I am trying to write a simple code and haven't found a simple answer for this. I am trying to assign a unique ID to each person based on when the file was amended and their employee ID. Then add the column of Unique IDs to the file.
excel1 = "Book1.xlsx"
df1 = pd.read_excel(excel1, header = 0)
time = time.strftime('%m%d%Y%H%m', time.gmtime(os.path.getmtime ("Book1.xlsx")))
unique_id=[df1["ID"] + time]
df1["CID"]=unique_id
When I try to run it I keep getting an error of
ValueError: Length of values does not match length of index
Does anyone have an answer for this?
