Pandas Dynamic Index Referencing during Calculation

I have the following data frame
val sum
0 1 0
1 2 0
2 3 0
3 4 0
4 5 0
5 6 0
6 7 0
I would like to calculate, for each row, the sum of the next three rows' values (including the current row). I need to do this for very big files. What is the most efficient way? The expected result is
val sum
0 1 6
1 2 9
2 3 12
3 4 15
4 5 18
5 6 13
6 7 7
In general, how can I dynamically reference other rows (via boolean operations) while making assignments?

pd.rolling_sum(df['val'], window=3).shift(-2)
0 6
1 9
2 12
3 15
4 18
5 NaN
6 NaN
If you want the last values to be "filled in", then you'll need to tack NaNs onto the end of your dataframe.
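Note that pd.rolling_sum was deprecated in favour of the .rolling() method and has since been removed. A minimal sketch of the same forward-looking sum with the modern API, reversing the series so the trailing partial windows are filled in as in the expected output:
import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6, 7]})

# reverse, take a trailing rolling sum (min_periods=1 keeps the partial
# windows at the edge), then reverse back to get a forward-looking sum
df['sum'] = df['val'][::-1].rolling(window=3, min_periods=1).sum()[::-1].astype(int)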

How can I create a column of numbers that ascends after a certain number of rows?

I have a column of scores in descending order. I want to create a difficulty column with a scale of 1-10 that goes up every 37 rows for difficulties 1-7 and then every 36 rows for difficulties 8-10. I have created a small example below where the difficulty goes up in 3-row intervals and the final difficulties '4' and '5' span 2 rows each.
In:
score
0 11
1 10
2 9
3 8
4 8
5 6
6 5
7 4
8 4
9 3
10 2
11 1
12 1
Out:
score difficulty
0 11 1
1 10 1
2 9 1
3 8 2
4 8 2
5 6 2
6 5 3
7 4 3
8 4 3
9 3 4
10 2 4
11 1 5
12 1 5
If I understand your problem correctly, you could do something like:
import pandas as pd
from random import randint

count = (37 * 7) + (36 * 3)
# difficulties 1-7 span 37 rows each, difficulties 8-10 span 36 rows each
difficulty = [i // 37 + 1 for i in range(37 * 7)] + [i // 36 + 8 for i in range(36 * 3)]
df = pd.DataFrame({'score': [randint(0, 10) for i in range(count)]})
df['difficulty'] = difficulty
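Applied to the small example from the question (3-row intervals, with the final two levels spanning 2 rows each), the same pattern reproduces the Out table; a sketch using the 13-row score column shown above:
import pandas as pd

df = pd.DataFrame({'score': [11, 10, 9, 8, 8, 6, 5, 4, 4, 3, 2, 1, 1]})
# difficulties 1-3 span 3 rows each, difficulties 4-5 span 2 rows each
df['difficulty'] = [i // 3 + 1 for i in range(3 * 3)] + [i // 2 + 4 for i in range(2 * 2)]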

Remove Elements from Dataframe Based on Group Appearance Rate

I have a simple dataframe that is basically a list of groups, each with its own list of items (see below). What is the cleanest method of filtering out rows based on how often their value occurs across the groups? For example, I want to remove all rows whose value appears in at least 75% of the groups. In this example table, I would expect all rows with '30' in Col2 to be deleted, because 30 appears in 3 of the 4 groups. Is this a use case for a lambda filter? If so, what would the filter be?
Col1 Col2
0 0 3
1 0 7
2 0 15
3 0 30
4 1 5
5 1 6
6 1 11
7 1 30
8 2 1
9 2 9
10 2 17
11 2 29
12 3 2
13 3 14
14 3 18
15 3 30
Try:
# fraction of Col1 groups in which each Col2 value appears
condition = df.drop_duplicates().groupby('Col2')['Col1'].count() / df['Col1'].nunique() < 0.75
# keep only the Col2 values that appear in fewer than 75% of the groups
condition = condition[condition].index
print(df[df['Col2'].isin(condition)])
Output:
Col1 Col2
0 0 3
1 0 7
2 0 15
4 1 5
5 1 6
6 1 11
8 2 1
9 2 9
10 2 17
11 2 29
12 3 2
13 3 14
14 3 18
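Since the question asks about a lambda filter: the same result can also be written with groupby().filter, keeping only the Col2 values that occur in fewer than 75% of the Col1 groups (a sketch, equivalent to the above):
n_groups = df['Col1'].nunique()
# keep a Col2 group only if it occurs in fewer than 75% of the Col1 groups
print(df.groupby('Col2').filter(lambda g: g['Col1'].nunique() < 0.75 * n_groups))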

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
I want every unique value in column B to occur at least 3 times. So none of the rows with a B value of 5 are duplicated. The row with a B value of 0 is duplicated twice. And the rest have one of their two rows duplicated at random.
Here is an example desired output
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 4 2
11 2 3
12 2 0
13 2 0
14 4 1
Edit:
The row chosen to be duplicated should be selected at random
To pick rows at random, I would use groupby with apply and sample on each group. The x in the lambda is each group of B, so I use repeats - x.shape[0] to find the number of rows that need to be created. Some B groups may already have 3 or more rows, so I use np.clip to force negative values to 0; sampling 0 rows is the same as ignoring that group. Finally, reset_index and concatenate the result back onto df.
import numpy as np
import pandas as pd

repeats = 3
# per B group, sample (repeats - group size) rows with replacement; np.clip floors negatives at 0
df1 = (df.groupby('B')
         .apply(lambda x: x.sample(n=int(np.clip(repeats - x.shape[0], 0, None)),
                                   replace=True))
         .reset_index(drop=True))
df_final = pd.concat([df, df1]).reset_index(drop=True)  # df.append was removed in pandas 2.0
Out[43]:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 2 0
11 2 0
12 5 1
13 4 2
14 2 3
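A quick sanity check that every B value now occurs at least repeats times:
assert (df_final['B'].value_counts() >= repeats).all()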

Counting consecutive occurences in dataframe based on condition

I am trying to find whether 3 or more consecutive occurrences of any number are present in a column, and if so mark the last one with a 1 and the rest with zeros.
df['a'] = df.assign(consecutive=df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size')).query('consecutive > @threshold')
is what I have found here: Identifying consecutive occurrences of a value. However, this gives me the error: ValueError: Wrong number of items passed 6, placement implies 1. I understand that the whole filtered dataframe cannot be assigned to a single column, but what would be the correct approach to get the desired result?
Secondly, if this condition is satisfied, I would like to apply an equation (e.g. 2*b) to several rows neighbouring the 1 (either the previous rows or the rows that follow), like the shift function but repeated over e.g. the 3 previous rows. I'm quite sure this must be possible, but I have not been able to get this whole objective to work. It does not necessarily have to be based on the 1 in column c; that is just a proposal.
A small data excerpt is below for interpretation; columns c and d present the desired result:
a b c d
16215 2 0 0
24848 4 0 0
24849 4 0 8
24850 4 0 8
24851 4 1 8
24852 6 0 0
24853 6 0 0
24854 8 0 0
24855 8 0 0
24856 8 0 16
25208 8 0 16
25932 8 1 16
28448 10 0 0
28449 10 0 0
28450 10 0 0
Create the group key with diff and cumsum, then find the last position of each group whose total count is more than 3, and fill backwards with bfill and a limit:
s = df.b.diff().ne(0).cumsum()           # group key: increments at each new run of b
s1 = s.groupby(s).transform('count')     # length of the run each row belongs to
s2 = s.groupby(s).cumcount()             # position within the run
df['c'] = ((s1 == s2 + 1) & (s1 > 3)).astype(int)  # mark the last row of runs longer than 3
# put 2*b on the marked rows, back-fill it over the 2 preceding rows, 0 elsewhere
df['d'] = (df.c.mask(df.c == 0) * df.b * 2).bfill(limit=2).combine_first(df.c)
df
Out[87]:
a b c d
0 16215 2 0 0.0
1 24848 4 0 0.0
2 24849 4 0 8.0
3 24850 4 0 8.0
4 24851 4 1 8.0
5 24852 6 0 0.0
6 24853 6 0 0.0
7 24854 8 0 0.0
8 24855 8 0 0.0
9 24856 8 0 16.0
10 25208 8 0 16.0
11 25932 8 1 16.0
12 28448 10 0 0.0
13 28449 10 0 0.0
14 28450 10 0 0.0
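For reference, a minimal sketch that rebuilds the excerpt so the snippet above runs standalone:
import pandas as pd

df = pd.DataFrame({
    'a': [16215, 24848, 24849, 24850, 24851, 24852, 24853,
          24854, 24855, 24856, 25208, 25932, 28448, 28449, 28450],
    'b': [2, 4, 4, 4, 4, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10],
})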

Calculate the total value for each group using a Calculated Column in Spotfire

I have a problem with a sum calculation over rows using a calculated column in Spotfire.
For example, the raw data is below. The raw table is ordered by id, and for each type the state sequence is 2, 3, 0.
id type value state
1 1 12 2
2 1 7 3
3 1 10 0
4 2 11 2
5 2 6 3
6 3 9 0
7 3 7 2
8 3 5 3
9 2 9 0
10 1 7 2
11 1 3 3
12 1 2 0
For each cycle of states (2, 3, 0) within a type, I want to sum the values; the result would then be:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
6 3 7 2
7 3 5 3
8 3 9 0 21
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
Note: only the row whose state is 0 gets the sum value. I think the rules are easier to see when we order by type:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
6 3 7 2
7 3 5 3
8 3 9 0 21
Thanks for your time and help!
Here is a solution for you.
1. Insert a Calculated Column RowId() and name it RowId
2. Insert a Calculated Column If(Mod([RowId],3)=0,[RowId] / 3,Ceiling([RowId] / 3)) and name it Groups
3. Insert a Calculated Column Sum([value]) OVER ([Groups]) and name it RunningSum
4. Insert a Calculated Column If([state] = 0,[RunningSum]) and name it OnlyState=0
The only thing to really explain here is step 2. With the data sorted as in your example, the last row of each group has a RowId divisible by 3. We have to do it this way since your type field can have multiple cycles for any given type. RowIds 3, 6, 9, 12, etc. all have a modulus of 0 since they are divisible by 3; this marks the last row in each set, and for it we just use RowId / 3, which gives us groups 1, 2, 3, 4, etc. For the rows that aren't divisible by 3, Ceiling rounds RowId / 3 up to the nearest whole number, which is the group number of the set they belong to.
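For example, RowId 4 gives Mod(4,3) = 1, so Ceiling(4 / 3) = 2, which is group 2; RowId 6 gives Mod(6,3) = 0, so 6 / 3 = 2, also group 2. Rows 4, 5 and 6 therefore share group 2, and row 6 is the last row of its set.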
The last calculated column is the only way I know how to get ONLY the values you care about. If you use the If [state] = 0 logic anywhere else, you negate all other rows.
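For comparison, a minimal pandas sketch of the same grouping arithmetic (an analogue of steps 2-4, not Spotfire syntax; data taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 13),
    'value': [12, 7, 10, 11, 6, 9, 7, 5, 9, 7, 3, 2],
    'state': [2, 3, 0, 2, 3, 0, 2, 3, 0, 2, 3, 0],
})
df['Groups'] = np.ceil(df['id'] / 3).astype(int)                   # step 2: Mod/Ceiling grouping
df['RunningSum'] = df.groupby('Groups')['value'].transform('sum')  # step 3: Sum OVER Groups
df['cycle time'] = df['RunningSum'].where(df['state'] == 0)        # step 4: keep state 0 rows only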