Pandas: I want slice the data and shuffle them to genereate some synthetic data - pandas

Just want to help with data science to generate some synthetic data since we don't have enough labelled data. I want to cut the rows around the random position of the y column around 0s, don't cut 1 sequence.
After cutting, want to shuffle those slices and generate a new DataFrame.
It's better to have some parameters that adjust the maximum, and minimum sequence to cut, the number of cuts, and something like that.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...

You can use:
#number of cuts
N = 3
#create random N index values of index if y=0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
#create groups with check membership and cumulative sum
arr = df.index.isin(idx).cumsum()
#randomize unique integers - groups
u = np.unique(arr)
np.random.shuffle(u)
#change order of groups in DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print (df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1

Related

Reset 'Id' value of appended Dataframe

I have appended multiple dataframes to form single dataframe. Each dataframe had multiple rows assigned with specific ID. After appending, Big dataframe has multiple rows with same Id. Would like assign new id's.
Current Dataframe:
Index name groupid
0 Abc 0
1 cvb 0
2 sdf 0
3 ksh 1
4 kjl 1
5 lmj 2
6 hyb 2
0 khf 0
1 uyt 0
2 tre 1
3 awe 1
4 uys 2
5 asq 2
6 lsx 2
Desired Output:
Index name groupid new_id
0 Abc 0 0
1 cvb 0 0
2 sdf 0 0
3 ksh 1 1
4 kjl 1 1
5 lmj 2 2
6 hyb 2 2
7 khf 0 3
8 uyt 0 3
9 tre 1 4
10 awe 1 4
11 uys 2 5
12 asq 2 5
13 lsx 2 5
You would have to use a slightly modified version of groupby:
df['new_id'] = df.groupby(df['groupid'].ne(df['groupid'].shift()).cumsum(), sort=False)
.ngroup())
Output is:
Index name groupid new_id
0 0 Abc 0 0
1 1 cvb 0 0
2 2 sdf 0 0
3 3 ksh 1 1
4 4 kjl 1 1
5 5 lmj 2 2
6 6 hyb 2 2
7 0 khf 0 3
8 1 uyt 0 3
9 2 tre 1 4
10 3 awe 1 4
11 4 uys 2 5
12 5 asq 2 5
13 6 lsx 2 5
See previous answer for reference.

Dataframe within a Dataframe - to create new column_

For the following dataframe:
import pandas as pd
df=pd.DataFrame({'list_A':[3,3,3,3,3,\
2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4]})
How can 'list_A' be manipulated to give 'list_B'?
Desired output:
list_A
list_B
0
3
1
1
3
1
2
3
1
3
3
0
4
2
1
5
2
1
6
2
0
7
2
0
8
4
1
9
4
1
10
4
1
11
4
1
12
4
0
13
4
0
14
4
0
15
4
0
16
4
0
As you can see, if List_A has the number 3 - then the first 3 values of List_B are '1' and then the value of List_B changes to '0', until List_A changes value again.
GroupBy.cumcount
df['list_B'] = df['list_A'].gt(df.groupby('list_A').cumcount()).astype(int)
print(df)
Output
list_A list_B
0 3 1
1 3 1
2 3 1
3 3 0
4 3 0
5 2 1
6 2 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 0
12 4 1
13 4 1
14 4 1
15 4 1
16 4 0
17 4 0
18 4 0
19 4 0
20 4 0
21 4 0
22 4 0
23 4 0
EDIT
blocks = df['list_A'].ne(df['list_A'].shift()).cumsum()
df['list_B'] = df['list_A'].gt(df.groupby(blocks).cumcount()).astype(int)

Pandas Groupby and divide the dataset into subgroups based on user input and label numbers to each subgroup

Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups with 3 rows in each subgroup from bottom up and label those subgroups as mentioned in level_flag column. I have IDs like A,C and so on. So I want to do this for each group of ID.Thanks in Advance.
Edit :- I want the same thing to be done after grouping it by ID
First we decide the unique numbers nums by dividing the length of your df by n. Then we repeat those numbers n times. Finally we reverse the array and chop it of at the length of df and reverse it one more time.
def create_flags(d, n):
nums = np.ceil(len(d) / n)
level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
return level_flag
df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
To remove the incomplete rows, use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
is this:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2

Pandas: Calculate percentage of column for each class

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide column Sum with GroupBy.transform for return Series with same length as original DataFrame filled by aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64