pandas for index, row in dataframe.iterrows() - pandas

I was not able to figure out why my code didn't work. It seemingly doesn't have any problem to me. Can anyone help point out the issue in my code?
What I tried:
true_avengers['Deaths'] = 0
for index, row in true_avengers.iterrows():
    for i in range(1, 6):
        col = 'Death{}'.format(i)
        if row[col] == 'YES':
            row['Deaths'] += 1
Answer:
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if pd.isnull(death) or death == 'NO':
            continue
        elif death == 'YES':
            num_deaths += 1
    return num_deaths
true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
Much appreciated if you can enlighten me!

You are not using pandas idiomatically: it is usually unnecessary to loop through the rows explicitly. (Your original loop also fails because the row yielded by iterrows() is a copy, so row['Deaths'] += 1 never writes back into the DataFrame.) Here's a clean vectorized solution. First, identify the columns of interest. Their names consist of "Death" followed by a number:
death_columns = true_avengers.columns.str.match(r"Death\d+")
Find out which of them are "YES":
changes = true_avengers.iloc[:, death_columns]=='YES'
Calculate the sum of the occurrences and add them to the last column:
true_avengers['Deaths'] += changes.sum(axis=1)
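The vectorized recipe can be sketched end to end on a toy frame; the Death1..Death5 column layout with 'YES'/'NO'/NaN values is assumed from the question:

```python
import pandas as pd
import numpy as np

# Toy frame with the assumed layout: Death1..Death5 hold 'YES'/'NO'/NaN
true_avengers = pd.DataFrame({
    'Death1': ['YES', 'NO', 'YES'],
    'Death2': ['YES', np.nan, 'NO'],
    'Death3': [np.nan, np.nan, 'YES'],
    'Death4': ['NO', 'NO', np.nan],
    'Death5': [np.nan, np.nan, 'YES'],
})
true_avengers['Deaths'] = 0

# Boolean mask over the Death<number> columns ('Deaths' itself does not match \d+)
death_columns = true_avengers.columns.str.match(r"Death\d+")

# Row-wise count of 'YES' occurrences
changes = true_avengers.iloc[:, death_columns] == 'YES'
true_avengers['Deaths'] += changes.sum(axis=1)
print(true_avengers['Deaths'].tolist())  # [2, 0, 3]
```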


Trigger Point of Moving average crossover

I am trying to define the trigger point where wt1 (moving average 1) crosses over wt2 (moving average 2) and record it in the column 'side'.
So basically, add 1 to side at the moment wt1 crosses above wt2.
This is the current code I am using, but it doesn't seem to be working.
for i in range(len(df)):
    if df.wt1.iloc[i] > df.wt2.iloc[i] and df.wt1.iloc[i-1] < df.wt2.iloc[i-1]:
        df.side.iloc[1]
If I do the following:
long_signals = (df.wt1 > df.wt2)
df.loc[long_signals, 'side'] = 1
it returns the value of 1 the entire time wt1 is above wt2, which is not what I am trying to do.
Expected outcome is when wt1 crosses above wt2 side should be labeled as 1.
Help would be appreciated!
Use shift in your condition:
long_signals = (df.wt1 > df.wt2) & (df.wt1.shift() <= df.wt2.shift())
df.loc[long_signals, 'side'] = 1
df
if you do not like NaNs in 'side', use df.fillna(0) at the end
Your first piece of code also works with the following small modification
for i in range(len(df)):
    if df.wt1.iloc[i] > df.wt2.iloc[i] and df.wt1.iloc[i-1] <= df.wt2.iloc[i-1]:
        df.loc[i, 'side'] = 1
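The shift-based condition can be demonstrated on made-up data; the column names wt1, wt2 and side are taken from the question:

```python
import pandas as pd

# Toy series where wt1 crosses above wt2 exactly once, at index 2
df = pd.DataFrame({
    'wt1': [1.0, 1.5, 2.5, 3.0, 2.0],
    'wt2': [2.0, 2.0, 2.0, 2.0, 2.5],
})

# True only on the bar where wt1 moves from <= wt2 to > wt2
long_signals = (df.wt1 > df.wt2) & (df.wt1.shift() <= df.wt2.shift())
df['side'] = 0
df.loc[long_signals, 'side'] = 1
print(df['side'].tolist())  # [0, 0, 1, 0, 0]
```

Note that on the first row the shifted values are NaN, so the comparison is False and no spurious signal is produced.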

How do I find out where one line crosses another with pandas/matplotlib?

I've got a data frame with a column df['x'] that I have plotted with Matplotlib as shown below. The data frame has a datetime index, and the values fluctuate between 0 and above 0. There are also two purple lines, y=2 and y=5.
I'm interested in finding the dates on the index that the blue line crosses the purple line y=2 when going upward. However with one caveat, if the blue line crosses the purple line going up for a second time before reaching 0 I would not like it to be counted. So I'm looking for a system that returns me 6 dates. Thanks for any help.
https://gyazo.com/40aac726ef3546c22c9191fa2d9bc3e2
I ended up trying a bunch of different things. Initially I really wasn't sure how to approach this problem because I'm fairly new to coding. After doing a bit of research I found a solution that works by applying a function to create a new column, then finding the index of the areas of the column that meet a certain criteria.
#Making a function to apply to df
trade_pos = ''
dip = ''
def apply_function(df):
    global trade_pos
    global dip
    if df.Drawdown == 0 and trade_pos == 'open':
        trade_pos = 'closed'
        dip = 'false'
        return 'Exit'
    elif df.Drawdown == 0:
        trade_pos = 'closed'
        dip = 'false'
        return 0
    elif df.Drawdown >= 2 and df.Drawdown < 5 and trade_pos == 'closed' and dip != 'true':
        trade_pos = 'open'
        return 'Entry'
    elif df.Drawdown >= 5 and trade_pos == 'open':
        trade_pos = 'closed'
        dip = 'true'
        return 'Exit'
    else:
        return 0
#Applying function to df
df['Strategy'] = df.apply(apply_function, axis=1)
#Finding dates of crossover with caveats
entry = df[df['Strategy'] == 'Entry']
entry = list(entry.index)
#We finish with a list of dates that meet the criteria of the original post
I hope that helps anybody else with a similar problem.
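A hedged alternative sketch that avoids the global-variable bookkeeping: iterate once over the series and re-arm the entry detector only after the line has returned to 0. The y=2 threshold and the reset-at-0 caveat come from the question; the data and index here are made up:

```python
import pandas as pd

# Toy stand-in for df['x']; the original used a real datetime index
s = pd.Series([0, 1, 3, 1, 3, 0, 1, 2, 4],
              index=pd.date_range('2021-01-01', periods=9))

entries = []
armed = True          # re-armed only after the series returns to 0
prev = s.iloc[0]
for ts, val in s.items():
    if val == 0:
        armed = True                 # reset: the next upward cross counts again
    elif armed and prev < 2 <= val:
        entries.append(ts)           # upward crossing of y=2
        armed = False                # ignore further crossings until a 0
    prev = val
print(entries)
```

On this toy series the second peak between the two zeros is skipped, so only two dates are returned.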

How can I optimize my for loop in order to be able to run it on a 320000 lines DataFrame table?

I think I have a runtime problem.
I want to run this code on a DataFrame of 320 000 lines, 6 columns:
index_data = data["clubid"].index.tolist()
for i in index_data:
    for j in index_data:
        if data["clubid"][i] == data["clubid"][j]:
            if data["win_bool"][i] == 1:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 1
                ):
                    NW_tot[i] += 1
            else:
                if (data["startdate"][i] >= data["startdate"][j]) & (
                    data["win_bool"][j] == 0
                ):
                    NL_tot[i] += 1
The objective is to determine, for every clubid, the number of wins and the number of losses up to a given match, taking the previous matches into account.
The problem is, I don't get an error, but I never obtain any results either.
When I tried with a smaller DataFrame (data[0:1000]) I got a result in 13 seconds. This is why I think it's a runtime problem.
I also tried to first use a groupby("clubid") and then run my loop inside every group, but I got lost.
Something else that bothers me: I have at least 2 lines with the exact same date/hour, because there are at least two identical dates for one match. Because of this I can't use the date as the index.
Could you help me with these issues, please?
As I pointed out in the comment above, I think you can simply sum the vector of win_bool by group. If the dates are sorted this should be equivalent to your loop, correct?
import pandas as pd
dat = pd.DataFrame({
    "win_bool": [0,0,1,0,1,1,1,0,1,1,1,1,1,1,0],
    "clubid":   [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
    "date":     [1,2,1,2,3,4,5,1,2,1,2,3,4,5,6],
    "othercol": ["a","b","b","b","b","b","b","b","b","b","b","b","b","b","b"]
})
temp = dat[["clubid", "win_bool"]].groupby("clubid")
NW_tot = temp.sum()
NL_tot = temp.count()
NL_tot = NL_tot["win_bool"] - NW_tot["win_bool"]
If you have duplicate dates that inflate the counts, you could first drop duplicates by dates (within groups):
# drop duplicate dates
temp = dat.drop_duplicates(["clubid", "date"])[["clubid", "win_bool"]].groupby("clubid")
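If the goal is a running per-match total rather than one number per club (which is what the original nested loop computes), a sort plus groupby-cumsum sketch may fit. The column names are assumed from the question, and this counts both wins and losses for every row:

```python
import pandas as pd

dat = pd.DataFrame({
    'clubid':   [1, 1, 1, 2, 2, 2],
    'date':     [1, 2, 3, 1, 2, 3],
    'win_bool': [1, 0, 1, 0, 1, 1],
})

# Sort so "previous matches" means earlier rows within each club
dat = dat.sort_values(['clubid', 'date'])
g = dat.groupby('clubid')['win_bool']
dat['NW_tot'] = g.cumsum()                        # wins up to and including this match
dat['NL_tot'] = g.cumcount() + 1 - dat['NW_tot']  # losses up to and including this match
print(dat[['NW_tot', 'NL_tot']].values.tolist())
```

This replaces the O(n²) pairwise comparison with a single sorted pass per group, which should scale to 320,000 rows easily.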

I got problem with the length of dataframe while using pandas

a=0
for i in range(0, len(df)):
    if df['column name'][i][7]!='1' or df['column name'][i][7]='6':
        a=a+1
If I run this piece of code, I get the error "string index out of range". Can someone help me solve this problem?
P.S. df has about 10 million rows
This error occurs if the index is greater than or equal to the length of the string.
You can check that the string has at least 8 characters before indexing position 7:
a=0
for i in range(0, len(df)):
    data = df['column name'][i]
    if len(data) > 7 and (data[7] != '1' or data[7] == '6'):
        a=a+1
You can also do this with a generator expression:
can_count = lambda row: len(row['col']) > 7 and (row['col'][7] != '1' or row['col'][7] == '6')
a = sum(1 for _, row in df.iterrows() if can_count(row))
One thing to note is df['column name'][i][7]='6' should be == not =
I see you are using the assignment operator '=' instead of '==' in your code. I have copied the line below to indicate this. Can you retry and post the error message you get then? Also, please add a comment on what you would like the operation to achieve.
if df['column name'][i][7]!='1' or df['column name'][i][7]='6':
Can you please add an example of your string? Your data is probably too short.
If you use df['column name'][i][7], your string must be at least 8 characters long.
Good luck!
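For roughly 10 million rows, the Python-level loop will be slow. A vectorized sketch with the .str accessor (keeping the answer's condition as written, and guarding on length first so short strings drop out of the mask):

```python
import pandas as pd

# Toy stand-in for the real column; only '12345678' passes both checks
df = pd.DataFrame({'column name': ['short', '12345678', '1234567160', '1234567']})

col = df['column name']
# Length guard first: .str[7] yields NaN for short strings, and the & excludes them
mask = (col.str.len() > 7) & ((col.str[7] != '1') | (col.str[7] == '6'))
a = mask.sum()
print(a)  # 1
```

(As written, "x != '1' or x == '6'" is logically the same as "x != '1'"; the condition is kept verbatim here in case the intended operator was different.)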

modifying a dataframe by adding additional if statement column

Modifying a data frame by adding an additional column with an if statement.
I created five lists: East_Asia, Central_Asia, Central_America, South_America, Europe_East and Europe_West. I wanted to add a conditional column based on an existing column, i.e. if Japan is in East_Asia, then the Japan row of the added column should contain "Asia-East", and so on.
df['native_region'] = df["native_country"].apply(lambda x: "Asia-East" if x in 'Asia_East'
                                                 "Central-Asia" elif x in "Central_Asia"
                                                 "South-America" elif x in "South_America"
                                                 "Europe-West" elif x in "Europe_West"
                                                 "Europe-East" elif x in "Europe_East"
                                                 "United-States" elif x in "United-States"
                                                 else "Outlying-US"
                                                 )
File "", line 2
"Central-Asia" elif x in "Central_Asia"
^
SyntaxError: invalid syntax
I might be wrong, but I think you're taking the problem the wrong way around.
What you seem to be doing there is just to replace '_' by '-', which you can do with the following line:
df['native_region'] = df.native_country.str.replace('_', '-')
And then, in my experience, it's more understandable to work like this:
known_countries = ['Asia-East', 'Central-Asia', 'South-America', ...]
is_known = df['native_country'].isin(known_countries)
df.loc[~is_known, 'native_region'] = 'Outlying-US'
This also works if you group countries, like:
east_asia_countries = ['Japan', 'China', 'Korea']
isin_east_asia = df['native_country'].isin(east_asia_countries)
df.loc[isin_east_asia, 'native_region'] = 'East-Asia'
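Another hedged option is to build one country-to-region dict from the lists and use map; the region lists below are shortened, made-up stand-ins for the ones in the question:

```python
import pandas as pd

# Hypothetical region lists standing in for the question's full lists
region_lists = {
    'Asia-East':   ['Japan', 'China', 'Korea'],
    'Europe-West': ['France', 'Germany'],
}
# Invert to a flat country -> region lookup
country_to_region = {c: region
                     for region, countries in region_lists.items()
                     for c in countries}

df = pd.DataFrame({'native_country': ['Japan', 'France', 'United-States', 'Peru']})
df['native_region'] = df['native_country'].map(country_to_region)
df.loc[df['native_country'] == 'United-States', 'native_region'] = 'United-States'
df['native_region'] = df['native_region'].fillna('Outlying-US')
print(df['native_region'].tolist())
```

With map, adding a new region is just another dict entry, and every country not found in any list falls through to 'Outlying-US' via fillna.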