Using recursion to iterate multiple times through rows in a dataframe - not returning the expected result - pandas

How to loop through a dataframe series multiple times using a recursive function?
I am trying to get a simple case to work and use it in a more complicated function.
I am using a simple dataframe:
df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5]})
I want to iterate through the rows multiple times and sum the values. On each iteration, the starting index increments by 1.
def recursive_sum(df, mysum=0, count=0):
    df = df.iloc[count:]
    if len(df.index) < 2:
        return mysum
    else:
        for i in range(len(df.index)):
            mysum += df.iloc[i, 0]
        count += 1
        return recursive_sum(df, mysum, count)
I think I should get:
#Iteration 1: count = 0, len(df.index) = 5 (not < 2), mysum = 1 + 2 + 3 + 4 + 5 = 15
#Iteration 2: count = 1, len(df.index) = 4 (not < 2), mysum = 15 + 2 + 3 + 4 + 5 = 29
#Iteration 3: count = 2, len(df.index) = 3 (not < 2), mysum = 29 + 3 + 4 + 5 = 41
#Iteration 4: count = 3, len(df.index) = 2 (not < 2), mysum = 41 + 4 + 5 = 50
#Iteration 5: count = 4, len(df.index) = 1 (< 2), return mysum = 50
But the function returns 38.

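Just fixed it. The original re-sliced an already sliced frame on every call, so count was applied cumulatively (the third call saw only [4, 5]). The fix leaves df intact and starts the loop at count instead:
def recursive_sum(df, mysum=0, count=0):
    if (len(df.index) - count) < 2:
        return mysum
    else:
        for i in range(count, len(df.index)):
            mysum += df.iloc[i, 0]
        count += 1
        return recursive_sum(df, mysum, count)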
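As a cross-check on the fixed function, here is a minimal sketch that computes the same total without recursion. It assumes the values live in the first column, as in the question; `suffix_sum_total` is a hypothetical helper name introduced only for this illustration, and the recursive function is repeated so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5]})

def recursive_sum(df, mysum=0, count=0):
    # Stop once fewer than two rows remain past the current offset.
    if len(df.index) - count < 2:
        return mysum
    for i in range(count, len(df.index)):
        mysum += df.iloc[i, 0]
    return recursive_sum(df, mysum, count + 1)

def suffix_sum_total(df):
    # Hypothetical helper: element k of `suffix` is values[k] + ... + values[-1].
    values = df.iloc[:, 0].to_numpy()
    suffix = values[::-1].cumsum()[::-1]
    # The recursion never sums the final single-row slice, so drop it.
    return int(suffix[:-1].sum())

print(recursive_sum(df))     # 50
print(suffix_sum_total(df))  # 50
```

Both calls agree with the hand-computed iterations in the question (15 + 14 + 12 + 9 = 50).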

Related

How do I (sum) the output of a list?

List
L= [23, 91, 0, -11, 4, 23, 49]
Code
for i in L:
    if i > 10:
        num = i * 30
    else:
        num = i * 1
    if num % 2 == 0:
        num += 6
    if i > 50:
        num -= 10
        if i != -11:
            num += 10
    print(num)
Output
696
2736
6
-11
10
696
1476
I'm trying to sum the numbers in the output and then divide the total by 2.
Initialize a running total outside the loop, add the value of num to it just before the print statement, and print total / 2 once after the loop. (Avoid naming the variable sum, which would shadow Python's built-in sum function.)
total = 0
for i in L:
    ...
    total += num
    print(num)
print(total / 2)
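For reference, a self-contained sketch of the whole thing. The indentation of the original loop was lost, so the nesting below is an assumption: it is one arrangement that reproduces the listed output exactly. `transform` is a hypothetical name used only to wrap the loop body.

```python
L = [23, 91, 0, -11, 4, 23, 49]

def transform(i):
    # Assumed nesting: chosen because it reproduces the output in the question.
    if i > 10:
        num = i * 30
    else:
        num = i * 1
    if num % 2 == 0:
        num += 6
    if i > 50:
        num -= 10
        if i != -11:
            num += 10
    return num

outputs = [transform(i) for i in L]
half_total = sum(outputs) / 2

print(outputs)     # [696, 2736, 6, -11, 10, 696, 1476]
print(half_total)  # 2804.5
```

Collecting the values in a list first lets the built-in `sum` do the accumulation, which is why the running variable should not itself be called `sum`.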

Getting count of rows from breakpoints of different column

Consider a dataframe with two columns, A and B. How can I split column A into deciles and then use those decile breakpoints to count the rows of column B that fall into each range?
import pandas as pd
import numpy as np
df=pd.read_excel(r"E:\Sai\Development\UCG\qcut.xlsx")
df['Range']=pd.qcut(df['a'],10)
df_gb=df.groupby('Range',as_index=False).agg({'a':[min,max,np.size]})
df_gb.columns = df_gb.columns.droplevel()
df_gb=df_gb.rename(columns={'':'Range','size':'count_A'})
df['Range_B']=0
df.loc[df['b']<=df_gb['max'][0],'Range_B']=1
df.loc[(df['b']>df_gb['max'][0]) & (df['b']<=df_gb['max'][1]),'Range_B']=2
df.loc[(df['b']>df_gb['max'][1]) & (df['b']<=df_gb['max'][2]),'Range_B']=3
df.loc[(df['b']>df_gb['max'][2]) & (df['b']<=df_gb['max'][3]),'Range_B']=4
df.loc[(df['b']>df_gb['max'][3]) & (df['b']<=df_gb['max'][4]),'Range_B']=5
df.loc[(df['b']>df_gb['max'][4]) & (df['b']<=df_gb['max'][5]),'Range_B']=6
df.loc[(df['b']>df_gb['max'][5]) & (df['b']<=df_gb['max'][6]),'Range_B']=7
df.loc[(df['b']>df_gb['max'][6]) & (df['b']<=df_gb['max'][7]),'Range_B']=8
df.loc[(df['b']>df_gb['max'][7]) & (df['b']<=df_gb['max'][8]),'Range_B']=9
df.loc[df['b']>df_gb['max'][8],'Range_B']=10
df_gb_b=df.groupby('Range_B',as_index=False).agg({'b':np.size})
df_gb_b=df_gb_b.rename(columns={'b':'count_B'})
df_final = pd.concat([df_gb, df_gb_b], axis=1)
df_final=df_final[['Range','count_A','count_B']]
Is there a simpler solution? I intend to do this for many columns.
I hope this helps:
df['Range'] = pd.qcut(df['a'], 10)
df2 = df.groupby(['Range'])['a'].count().reset_index().rename(columns = {'a':'count_A'})
for item in df2['Range'].values:
    df2.loc[df2['Range'] == item, 'count_B'] = df['b'].apply(lambda x: x in item).sum()
df2 = df2.sort_values('Range', ascending = True)
If you also want to count values of b that fall outside the range of a:
min_border = df2['Range'].values[0].left
max_border = df2['Range'].values[-1].right
df2.loc[0, 'count_B'] += df.loc[df['b'] <= min_border, 'b'].count()
df2.iloc[-1, 2] += df.loc[df['b'] > max_border, 'b'].count()
One way, using hand-picked bin edges rather than deciles:
df = pd.DataFrame({'A': np.random.randint(0, 100, 20), 'B': np.random.randint(0, 10, 20)})
bins = [0, 1, 4, 8, 16, 32, 60, 100, 200, 500, 5999]
labels = ["{0} - {1}".format(i, j) for i, j in zip(bins, bins[1:])]
df['group_A'] = pd.cut(df['A'], bins, right=False, labels=labels)
df['group_B'] = pd.cut(df.B, bins, right=False, labels=labels)
df1 = df.groupby(['group_A'])['A'].count().reset_index()
df2 = df.groupby(['group_B'])['B'].count().reset_index()
df_final = pd.merge(df1, df2, left_on =['group_A'], right_on =['group_B']).drop(['group_B'], axis=1).rename(columns={'group_A': 'group'})
print(df_final)
Output
        group  A  B
0       0 - 1  0  1
1       1 - 4  1  3
2       4 - 8  1  9
3      8 - 16  2  7
4     16 - 32  3  0
5     32 - 60  7  0
6    60 - 100  6  0
7   100 - 200  0  0
8   200 - 500  0  0
9  500 - 5999  0  0
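Another possibility, sketched here under assumptions rather than taken from the answers above: `pd.qcut` can return the decile edges directly via `retbins=True`, and those edges can then be passed to `pd.cut` for column b. The random data, the seed, and the choice to widen the outer edges to infinity (so out-of-range b values land in the first or last bin) are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data; the original question reads its frame from an Excel file.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})

# labels=False yields integer bin codes; retbins=True also returns the edges.
codes_a, edges = pd.qcut(df["a"], 10, labels=False, retbins=True)

# Assumption: b values outside a's range are counted in the outermost bins.
edges_b = edges.copy()
edges_b[0], edges_b[-1] = -np.inf, np.inf
codes_b = pd.cut(df["b"], edges_b, labels=False)

n_bins = len(edges) - 1
df_final = pd.DataFrame({
    "Range": [str(iv) for iv in pd.IntervalIndex.from_breaks(edges)],
    "count_A": np.bincount(codes_a.to_numpy(), minlength=n_bins),
    "count_B": np.bincount(codes_b.to_numpy(), minlength=n_bins),
})
print(df_final)
```

Because the edges are computed once and reused, the same pattern can be looped over as many columns as needed without writing one mask per decile.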

Apply a function to two dataframe rows

Given a pandas dataframe like this:
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
   col1  col2
0     1     4
1     2     5
2     3     6
I would like to do something equivalent to this using a function, but without passing the whole dataframe "by value" or as a global variable (it could be huge, and that would give me a memory error):
i = -1
for index, row in df.iterrows():
    if i < 0:
        i = index
        continue
    c1 = df.loc[i, 'col1'] + df.loc[index, 'col1']
    c2 = df.loc[i, 'col2'] + df.loc[index, 'col2']
    df.loc[index, 'col1'] = c1
    df.loc[index, 'col2'] = c2
    i = index
   col1  col2
0     1     4
1     3     9
2     6    15
i.e., I would like to have a function which will give me the previous output:
def my_function(two_rows):
    row1 = two_rows[0]
    row2 = two_rows[1]
    c1 = row1[0] + row2[0]
    c2 = row1[1] + row2[1]
    row2[0] = c1
    row2[1] = c2
    return row2
df.apply(my_function, axis=1)
df
   col1  col2
0     1     4
1     3     9
2     6    15
Is there a way of doing this?
What you've demonstrated is a cumulative sum (cumsum):
df.cumsum()
   col1  col2
0     1     4
1     3     9
2     6    15
To define a function that does this in place with a loop:
Slow, cell by cell:
def f(df):
    n = len(df)
    r = range(1, n)
    for j in df.columns:
        for i in r:
            df[j].values[i] += df[j].values[i - 1]
    return df
f(df)
   col1  col2
0     1     4
1     3     9
2     6    15
Compromise between memory and efficiency:
def f(df):
    for j in df.columns:
        df[j].values[:] = df[j].values.cumsum()
    return df
f(df)
   col1  col2
0     1     4
1     3     9
2     6    15
Note that you don't need to return df. I chose to for convenience.
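As a quick sanity check that the row-pair loop from the question and `cumsum` really agree, here is a small sketch. `pairwise_accumulate` is a hypothetical name, and it works on a copy purely so the two results can be compared side by side; that copy is exactly what you would avoid on a huge frame.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

def pairwise_accumulate(frame):
    # Hypothetical helper: add each row to the (already updated) row before it.
    out = frame.copy()
    prev = None
    for label in out.index:
        if prev is not None:
            out.loc[label] = out.loc[prev] + out.loc[label]
        prev = label
    return out

looped = pairwise_accumulate(df)
vectorized = df.cumsum()

# Compare raw values so the check does not depend on dtype details.
print((looped.to_numpy() == vectorized.to_numpy()).all())
```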

sum numpy array at given indices

I want to add the values of a vector:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='d')
to the values of another vector:
c = np.array([10, 10, 10], dtype='d')
at the positions given by another array (of the same size as a, with values 0 <= b[i] < len(c)):
b = np.array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2], dtype='int32')
This is very simple to write in pseudo code:
for I in range(b.shape[0]):
    J = b[I]
    c[J] += a[I]
Something like this, but vectorized (length of c is some hundreds in real case).
c[0] += np.sum(a[b==0]) # 27 (10 + 1 + 3 + 5 + 8)
c[1] += np.sum(a[b==1]) # 25 (10 + 2 + 6 + 7)
c[2] += np.sum(a[b==2]) # 23 (10 + 0 + 4 + 9)
My initial guess was:
c[b] += a
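but for each repeated index only the last value of a is added.
You can use np.bincount to get ID-based weighted sums and then add them to c, like so (minlength keeps the result the same length as c even if the highest indices never occur in b):
np.bincount(b, a, minlength=len(c)) + c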
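A runnable sketch of the above on the question's data, alongside `np.add.at`, which is an alternative not mentioned in the answer: it performs the unbuffered in-place accumulation that `c[b] += a` was expected to do. The variable names `via_bincount` and `via_add_at` are just illustrative.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='d')
b = np.array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2], dtype='int32')
c = np.array([10, 10, 10], dtype='d')

# Weighted bincount: sums a[i] into slot b[i]; minlength pads to len(c).
via_bincount = c + np.bincount(b, weights=a, minlength=len(c))

# Unbuffered in-place add: repeated indices accumulate instead of overwriting.
via_add_at = c.copy()
np.add.at(via_add_at, b, a)

print(via_bincount)  # [27. 25. 23.]
print(via_add_at)    # [27. 25. 23.]
```

`np.bincount` builds a new array, while `np.add.at` modifies its first argument in place, so the choice mostly depends on whether `c` should be updated or left alone.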

Project Euler #1 - Lasso

I've been working on Project Euler questions as part of learning how to code in Lasso and am wondering if my solution can be improved upon. Here is what I've got below for question #1 in Lasso 8 code, and it returns the correct answer:
var ('total' = 0);
loop(1000-1);
    loop_count % 3 == 0 || loop_count % 5 == 0 ? $total += loop_count;
/loop;
output($total);
My question: is there a better or faster way to code this? Thanks!
Actually, Chris, it looks like my L9 code answer was almost exactly the same. However, to time it I had to wrap it in a loop and run it 1000 times.
Lasso 9 can time in microseconds, whereas prior versions can only time in milliseconds.
Below are 3 ways - the first is yours, then my two versions.
define br => '<br>'
local(start_time = micros)
loop(1000)=>{
    var ('total' = 0);
    loop(1000-1);
        loop_count % 3 == 0 || loop_count % 5 == 0 ? $total += loop_count;
    /loop;
    $total;
}
'Avg (L8 code in 9): '+(micros - #start_time)/1000+' micros'
br
br
local(start_time = micros)
loop(1000)=>{
    local(sum = 0)
    loop(999)=>{ loop_count % 3 == 0 || loop_count % 5 == 0 ? #sum += loop_count }
    #sum
}
'Avg (incremental improvement): '+(micros - #start_time)/1000+' micros'
br
br
local(start_time = micros)
loop(1000)=>{
    local(sum = 0)
    loop(999)=>{ not (loop_count % 3) || not(loop_count % 5) ? #sum += loop_count }
    #sum
}
'Avg using boolean not: '+(micros - #start_time)/1000+' micros'
The output is:
Avg (L8 code in 9): 637 micros
Avg (incremental improvement): 595 micros
Avg using boolean not: 596 micros
Note that I didn't use "output" as it's redundant in many situations in 8 and completely redundant in 9 :)
There is a fun story about how Gauss once summed numbers, which involves a strategy which can help to avoid the loop.
local('p' = 3);
local('q' = 5);
local('n' = 1000);
local('x' = integer);
local('before');
local('after');
#before = micros
loop(1000) => {
    /* In the tradition of Gauss */
    local('n2' = #n - 1)
    local('pq' = #p * #q)
    local('p2' = #n2 / #p)
    local('q2' = #n2 / #q)
    local('pq2' = #n2 / #pq)
    local('p3' = (#p2 + 1) * (#p2 / 2) + (#p2 % 2 ? #p2 / 2 + 1 | 0))
    local('q3' = (#q2 + 1) * (#q2 / 2) + (#q2 % 2 ? #q2 / 2 + 1 | 0))
    local('pq3' = (#pq2 + 1) * (#pq2 / 2) + (#pq2 % 2 ? #pq2 / 2 + 1 | 0))
    #x = #p * #p3 + #q * #q3 - #pq * #pq3
}
#after = micros
'Answer: ' + #x + '<br/>\n'
'Average time: ' + ((#after - #before) / 1000) + '<br/>\n'
/* Different numbers */
#p = 7
#q = 11
#before = micros
loop(1000) => {
    /* In the tradition of Gauss */
    local('n2' = #n - 1)
    local('pq' = #p * #q)
    local('p2' = #n2 / #p)
    local('q2' = #n2 / #q)
    local('pq2' = #n2 / #pq)
    local('p3' = (#p2 + 1) * (#p2 / 2) + (#p2 % 2 ? #p2 / 2 + 1 | 0))
    local('q3' = (#q2 + 1) * (#q2 / 2) + (#q2 % 2 ? #q2 / 2 + 1 | 0))
    local('pq3' = (#pq2 + 1) * (#pq2 / 2) + (#pq2 % 2 ? #pq2 / 2 + 1 | 0))
    #x = #p * #p3 + #q * #q3 - #pq * #pq3
}
#after = micros
'Answer: ' + #x + '<br/>\n'
'Average time: ' + ((#after - #before) / 1000) + '<br/>\n'
The output is:
Answer: 233168<br/>
Average time: 3<br/>
Answer: 110110<br/>
Average time: 2<br/>
Although the first time I ran it, that first average time was 18 instead of 3. Maybe Lasso is doing something smart for subsequent runs, or maybe it was just bad luck.
n = input number
x = (n-1)/3 = count of numbers below n divisible by 3 (integer division).*
sum3 = (3*x*(x+1)) / 2 = sum of those numbers.**
y = (n-1)/5 = count of numbers below n divisible by 5.
sum5 = (5*y*(y+1)) / 2 = sum of those numbers.
half_Ans = sum3 + sum5
But 15, 30, 45, ... are counted twice (in both sum3 and sum5),
so subtract them once so that they count only once.
z = (n-1)/15 = count of numbers below n divisible by 15.
sum15 = (15*z*(z+1)) / 2 = sum of those numbers.
Answer = half_Ans - sum15
* => (n-1)/3 gives the count of numbers divisible by 3.
If n = 100, we need the count of (3, 6, 9, ..., 99).
3 is the 1st, 6 is the 2nd, and so on; 99 is the 33rd,
so the count is the last multiple divided by 3.
The last multiple is the largest one strictly less than the input n.
If n = 99, we must not count 99, because the statement is "find the sum of all the multiples of 3 or 5 below n".
Without subtracting 1, that unwanted last number would also be counted whenever n is divisible by 3.
** => (3*x*(x+1)) / 2 gives the sum of those numbers.
If n = 100, the sum is 3 + 6 + 9 + ... + 99.
Every term is a multiple of 3,
so 3 + 6 + 9 + ... + 99 = 3*(1 + 2 + 3 + ... + 33).
The sum of 1 to m is (m*(m+1)) / 2,
so 3 + 6 + 9 + ... + 99 = (3*33*(33+1)) / 2.
Here m is the number of terms in the sequence (equivalently, the last multiple divided by 3), which is why we first find the count of divisible numbers.
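The derivation above can be sketched in a few lines of Python (used here only for illustration, since the thread itself is about Lasso). `sum_multiples_below`, `tri` and `brute_force` are hypothetical names, and using p*q for the overlap assumes p and q share no common factor, as with 3 and 5 or 7 and 11.

```python
def tri(m):
    # Sum of 1..m, the Gauss formula from the footnote above.
    return m * (m + 1) // 2

def sum_multiples_below(n, p, q):
    # Assumes p and q are coprime, so their common multiples are multiples of p*q.
    last = n - 1
    return (p * tri(last // p)
            + q * tri(last // q)
            - p * q * tri(last // (p * q)))

def brute_force(n, p, q):
    # Direct loop, equivalent in spirit to the Lasso loop in the question.
    return sum(k for k in range(1, n) if k % p == 0 or k % q == 0)

print(sum_multiples_below(1000, 3, 5))   # 233168
print(sum_multiples_below(1000, 7, 11))  # 110110
```

Both values match the answers printed by the Lasso timing run, and the closed form takes the same number of operations however large n gets.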