How can I create a column of numbers that ascends after a certain amount of rows? - pandas

I have a column of scores in descending order. I want to create a difficulty column on a scale of 1-10 that goes up every 37 rows for difficulty 1-7 and then every 36 rows for 8-10. I have created a small example below where the difficulty goes up in 3-row intervals and the final difficulties '4' and '5' span 2 rows each.
In:
score
0 11
1 10
2 9
3 8
4 8
5 6
6 5
7 4
8 4
9 3
10 2
11 1
12 1
Out:
score difficulty
0 11 1
1 10 1
2 9 1
3 8 2
4 8 2
5 6 2
6 5 3
7 4 3
8 4 3
9 3 4
10 2 4
11 1 5
12 1 5

If I understand your problem correctly, you could do something like:
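import pandas as pd
from random import randint

# 7 levels of 37 rows followed by 3 levels of 36 rows (367 rows in total)
count = (37 * 7) + (36 * 3)
# Levels 1-7 from integer division by 37, then levels 8-10 from integer division by 36
difficulty = [i // 37 + 1 for i in range(37 * 7)] + [i // 36 + 8 for i in range(36 * 3)]
# Random placeholder scores; replace with your own data
df = pd.DataFrame({'score': [randint(0, 10) for _ in range(count)]})
df['difficulty'] = difficulty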
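As a possible alternative, the same kind of column can be built with numpy.repeat by listing how many rows each level should cover. The sketch below uses the small example from the question (block sizes 3, 3, 3, 2, 2); the names `levels` and `sizes` are only illustrative. For the real data the sizes would presumably be `[37] * 7 + [36] * 3`.

```python
import numpy as np
import pandas as pd

# Scores from the small example in the question
df = pd.DataFrame({"score": [11, 10, 9, 8, 8, 6, 5, 4, 4, 3, 2, 1, 1]})

levels = np.arange(1, 6)   # difficulty levels 1..5
sizes = [3, 3, 3, 2, 2]    # number of rows covered by each level

# The sizes must add up to the number of rows, otherwise the assignment fails
assert sum(sizes) == len(df)

# Repeat each level as many times as its block size
df["difficulty"] = np.repeat(levels, sizes)
print(df)
```

Spelling the block sizes out keeps the uneven split (3 rows versus 2 rows, or 37 versus 36) visible in one place.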


Pandas function to group by cumulative sum and return another column when a certain amount is reached

Here is my problem.
I have a dataframe like this:
ID item amount level
1 1 10 5
1 1 10 10
2 4 15 5
2 9 30 8
2 4 10 10
2 4 10 20
3 4 10 4
3 4 10 6
and I need to know, for each ID, at what level the cumulative sum of each item reaches a fixed amount.
For example, I need to know the first time a given item reaches an amount of 20 or more for a user.
I would like to have something like:
ID item amount level
1 1 10 5
1 1 20 10
2 4 15 5
2 9 30 8
2 4 25 10
2 4 40 20
3 4 10 4
3 4 20 6
and then something like a list or a dictionary in which I can store the results, for example:
d[item_number] = [list_of_levels_per_id_when_20_is_reached]
In this example:
{1: [10], 4: [10,6], 9: [8]}
cumsum
You can perform the cumsum post group with:
df['amount_cumsum'] = df.groupby(['ID', 'item'])['amount'].cumsum()
Output (as separate column for clarity):
ID item amount level amount_cumsum
0 1 1 10 5 10
1 1 1 10 10 20
2 2 4 15 5 15
3 2 9 30 8 30
4 2 4 10 10 25
5 3 4 10 4 10
6 3 4 10 6 20
dictionary
(df[df['amount_cumsum'].ge(20)]
.groupby(['item'])['level'].agg(list)
.to_dict()
)
Output:
{1: [10], 4: [10, 6], 9: [8]}
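Note that with the full sample data from the question, the row with ID 2, item 4, level 20 also has a cumulative sum of at least 20, so the filter above would list level 20 as well. A minimal sketch of one way to keep only the first level at which each (ID, item) pair reaches the threshold, assuming the rows are already ordered by level within each pair; `threshold`, `reached` and `first_hit` are illustrative names:

```python
import pandas as pd

# Full sample data from the question
df = pd.DataFrame({
    "ID":     [1, 1, 2, 2, 2, 2, 3, 3],
    "item":   [1, 1, 4, 9, 4, 4, 4, 4],
    "amount": [10, 10, 15, 30, 10, 10, 10, 10],
    "level":  [5, 10, 5, 8, 10, 20, 4, 6],
})
threshold = 20

# Running total of amount within each (ID, item) pair
df["amount_cumsum"] = df.groupby(["ID", "item"])["amount"].cumsum()

# Rows where the running total has reached the threshold
reached = df[df["amount_cumsum"].ge(threshold)]

# Keep only the first such row per (ID, item) pair
first_hit = reached.drop_duplicates(subset=["ID", "item"], keep="first")

# Collect the levels per item
result = first_hit.groupby("item")["level"].agg(list).to_dict()
print(result)
```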

Remove Elements from Dataframe Based on Group Appearance Rate

I have a simple dataframe that is basically a list of objects with their own list of items (see below). What is the cleanest method of filtering out all rows in the overall dataframe based on their rate of occurrence within each group? For example, I want to remove all rows that appear in groups at least 75% of the time. In this example table, I would expect all rows with '30' in column 2 to be deleted, because it appears in 3 out of the 4 groups. Is this a use case for a lambda filter? If so, what would the filter be?
Col1 Col2
0 3
0 7
0 15
0 30
1 5
1 6
1 11
1 30
2 1
2 9
2 17
2 29
3 2
3 14
3 18
3 30
Try:
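# Share of groups (unique Col1 values) in which each Col2 value appears
condition = df.drop_duplicates().groupby(['Col2'])['Col1'].count() / len(df['Col1'].drop_duplicates()) < 0.75
# Keep only the Col2 values that appear in fewer than 75% of the groups
condition = condition[condition].index
print(df[df['Col2'].isin(condition)])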
Output:
Col1 Col2
0 0 3
1 0 7
2 0 15
4 1 5
5 1 6
6 1 11
8 2 1
9 2 9
10 2 17
11 2 29
12 3 2
13 3 14
14 3 18
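Since the question asks about a lambda filter, here is a hedged sketch of how the same rule could be written with GroupBy.filter. The data is the example table from the question; `n_groups` and `filtered` are illustrative names.

```python
import pandas as pd

# Example table from the question
df = pd.DataFrame({
    "Col1": [0] * 4 + [1] * 4 + [2] * 4 + [3] * 4,
    "Col2": [3, 7, 15, 30, 5, 6, 11, 30, 1, 9, 17, 29, 2, 14, 18, 30],
})

# Total number of groups (unique Col1 values)
n_groups = df["Col1"].nunique()

# Keep a Col2 value only if it appears in fewer than 75% of the groups
filtered = df.groupby("Col2").filter(
    lambda g: g["Col1"].nunique() / n_groups < 0.75
)
print(filtered)
```

The value 30 appears in 3 of the 4 groups (a rate of exactly 0.75), so its rows are dropped.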

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate rows at random until that count is met?

Say that this is what my dataframe looks like
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
I want every unique value in column B to occur at least 3 times. So none of the rows with a B value of 5 are duplicated. The row with a B value of 0 is duplicated twice. And each of the remaining values has one of its two rows duplicated at random.
Here is an example desired output
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 4 2
11 2 3
12 2 0
13 2 0
14 4 1
Edit:
The row chosen to be duplicated should be selected at random
To pick rows at random, I would use groupby with apply and call sample on each group. The x in the lambda is each group of B, so I use repeats - x.shape[0] to find the number of rows that need to be created. Some groups may already have more than 3 rows, so I use np.clip to force negative values to 0. Sampling 0 rows is the same as ignoring the group. Finally, reset the index and append the result back to df.
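import numpy as np
import pandas as pd

repeats = 3
# Sample the missing rows per B group; np.clip turns a negative shortfall into 0
df1 = (df.groupby('B').apply(lambda x: x.sample(n=np.clip(repeats-x.shape[0], 0, np.inf)
                                                .astype(int), replace=True))
       .reset_index(drop=True))
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df_final = pd.concat([df, df1]).reset_index(drop=True)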
Out[43]:
A B
0 1 5
1 4 2
2 3 5
3 3 3
4 3 2
5 2 0
6 4 5
7 2 3
8 4 1
9 5 1
10 2 0
11 2 0
12 5 1
13 4 2
14 2 3
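The same idea can also be sketched without apply, by iterating over the groups and sampling only from those that fall short. This is a self-contained variant, not the answer's exact code; `extras` and the `random_state` seed are assumptions added so the example is repeatable.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": [1, 4, 3, 3, 3, 2, 4, 2, 4, 5],
    "B": [5, 2, 5, 3, 2, 0, 5, 3, 1, 1],
})
repeats = 3

# For every B group with fewer than `repeats` rows, draw the missing
# rows at random (with replacement) from that group
extras = [
    g.sample(n=repeats - len(g), replace=True, random_state=0)
    for _, g in df.groupby("B")
    if len(g) < repeats
]

# Original rows first, then the randomly duplicated rows
df_final = pd.concat([df] + extras, ignore_index=True)
print(df_final)
```

Putting `df` first in the list passed to pd.concat means the call still works when no group needs extra rows.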

Calculate the total value for each group using a Calculated Column in Spotfire

I have a problem with a sum calculation across rows using a calculated column in Spotfire.
For example, the raw data is shown below. The raw table is ordered by id, and for each type the state sequence is 2, 3, 0.
id type value state
1 1 12 2
2 1 7 3
3 1 10 0
4 2 11 2
5 2 6 3
6 3 9 0
7 3 7 2
8 3 5 3
9 2 9 0
10 1 7 2
11 1 3 3
12 1 2 0
For each cycle of states (2, 3, 0) within a type, I want to sum the value, so the result would be:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
6 3 7 2
7 3 5 3
8 3 9 0 21
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
Note: only the rows whose state is 0 will have the sum value. I think the rule is easier to see when we order by type:
id type value state cycle time
1 1 12 2
2 1 7 3
3 1 10 0 29
4 2 11 2
5 2 6 3
9 2 9 0 26
10 2 7 2
11 2 3 3
12 2 2 0 12
6 3 7 2
7 3 5 3
8 3 9 0 21
thanks for your time and help!
Here is a solution for you.
1. Insert a Calculated Column RowId() and name it RowId
2. Insert a Calculated Column If(Mod([RowId],3)=0,[RowId] / 3,Ceiling([RowId] / 3)) and name it Groups
3. Insert a Calculated Column Sum([value]) OVER ([Groups]) and name it RunningSum
4. Insert a Calculated Column If([state] = 0,[RunningSum]) and name it OnlyState=0
The only thing that really needs explaining here is #2. With the data sorted as in your example, the last row of each group, based on RowId, is divisible by 3. We have to do it this way because your type field can have multiple groups for any given type. RowId 3, 6, 9, 12, etc. all have a modulus of 0 since they are divisible by 3, which marks the last row in each set. If it is the last row, we just set Groups to RowId / 3, which gives us groups 1, 2, 3, 4, etc. For the rows that aren't divisible by 3, we divide by 3 and round up to the next whole number, which is the group number of the last row in the set.
The last calculated column is the only way I know how to get ONLY the values you care about. If you use the If [state] = 0 logic anywhere else, you negate all other rows.
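To make the arithmetic of the calculated columns easier to check, here is a rough pandas sketch of the same logic. It is not Spotfire code; it mirrors the four columns using the value and state columns from the type-sorted table in the question, and `OnlyState0` is an illustrative stand-in for the last column.

```python
import math
import pandas as pd

# value and state columns from the type-sorted table in the question
df = pd.DataFrame({
    "value": [12, 7, 10, 11, 6, 9, 7, 3, 2, 7, 5, 9],
    "state": [2, 3, 0, 2, 3, 0, 2, 3, 0, 2, 3, 0],
})

# RowId: 1-based row number
df["RowId"] = range(1, len(df) + 1)

# Groups: every block of 3 consecutive rows gets the same number
df["Groups"] = df["RowId"].apply(lambda r: math.ceil(r / 3))

# RunningSum: total of value within each block, repeated on every row
df["RunningSum"] = df.groupby("Groups")["value"].transform("sum")

# Show the total only on rows where state is 0, leave the rest empty (NaN)
df["OnlyState0"] = df["RunningSum"].where(df["state"] == 0)
print(df)
```

Rounding RowId / 3 up gives the same group number for all three rows of a cycle, which is what the If/Mod/Ceiling expression in step 2 achieves.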

Update Query in SQL with numeric pattern in MS Access

Good Day All,
I need assistance in creating an update query that groups my data.
The data in my table is actually spatial in nature and can be thought of as a matrix that is 10 columns by 5 rows. I have the ObjectID, Row and Column, but I want the column DesiredResult, which is a 2x2 grouping of the rows and columns.
So the R,Cs of 1,1 1,2 2,1 and 2,2 will have a DesiredResult of 1, while 1,3 1,4 2,3 2,4 will have a DesiredResult of 2, and so on (see below for an example).
I was able to create the R and C columns using a combination of Quotient and Mod, so I assume I would do something similar, but I am stuck. How would I go about this query in MS Access?
ObjectID R C DesiredResult
1 1 1 1
2 1 2 1
3 1 3 2
4 1 4 2
5 1 5 3
6 1 6 3
7 1 7 4
8 1 8 4
9 1 9 5
10 1 10 5
11 2 1 1
12 2 2 1
13 2 3 2
14 2 4 2
15 2 5 3
16 2 6 3
17 2 7 4
18 2 8 4
19 2 9 5
20 2 10 5
21 3 1 6
22 3 2 6
23 3 3 7
24 3 4 7
25 3 5 8
26 3 6 8
27 3 7 9
28 3 8 9
29 3 9 10
30 3 10 10
31 4 1 6
32 4 2 6
33 4 3 7
34 4 4 7
35 4 5 8
36 4 6 8
37 4 7 9
38 4 8 9
39 4 9 10
40 4 10 10
41 5 1 11
42 5 2 11
43 5 3 12
44 5 4 12
45 5 5 13
46 5 6 13
47 5 7 14
48 5 8 14
49 5 9 15
50 5 10 15
Something like ... ?
SELECT a.Row, a.Col, Col\2 AS D1, Col Mod 2 AS D2, [D1]+[D2] AS Desired
FROM table AS a
ORDER BY a.Row, a.Col;
Remou had a close approximation but it turns out this gives me what I need. I needed both a row and a column index.
SELECT ObjectID, R, C,
Int(([C]-1)/2) AS ColIndex,
Int(([R]-1)/2) AS RowIndex,
[RowIndex]*5+[ColIndex]+1 AS DesiredResult
FROM Testing
ORDER BY ObjectID
The key in the query is the number 2 in both the column and row index (the grouping size), while the number 5 in DesiredResult is the number of 2-column groups in each row (10 columns / 2).
Thanks !
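For anyone wanting to sanity-check the formula outside Access, here is a small Python sketch of the same arithmetic. The function name `block_id` and its parameters are hypothetical; the defaults follow the 10-column table and 2x2 grouping above.

```python
def block_id(r, c, block=2, n_cols=10):
    """Return the 1-based id of the block containing cell (r, c)."""
    # Number of blocks in each band of rows (5 for 10 columns and block size 2)
    blocks_per_row = -(-n_cols // block)  # ceiling division
    row_index = (r - 1) // block          # same as Int(([R]-1)/2)
    col_index = (c - 1) // block          # same as Int(([C]-1)/2)
    return row_index * blocks_per_row + col_index + 1


# Spot checks against the table in the question
print(block_id(1, 1))   # ObjectID 1
print(block_id(3, 9))   # ObjectID 29
print(block_id(5, 10))  # ObjectID 50
```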