I need help generating synthetic data, since we don't have enough labelled data. I want to cut the rows at random positions where the y column is 0, without ever cutting inside a sequence of 1s.
After cutting, I want to shuffle those slices and build a new DataFrame.
Ideally there would be parameters to adjust the maximum and minimum length of a slice, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT (the cut falls inside a sequence of 1s):
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np

# number of cuts
N = 3
# pick N random index labels among the rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# build group ids: each chosen row starts a new group (membership check + cumulative sum)
arr = df.index.isin(idx).cumsum()
# shuffle the unique group ids
u = np.unique(arr)
np.random.shuffle(u)
# reorder the groups in the DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print(df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
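The question also asks for something parameterised. A minimal sketch of the same idea wrapped in a function follows; the name cut_and_shuffle and its n_cuts and seed parameters are hypothetical, and it does not implement the minimum/maximum slice length that the question mentions. It cuts immediately before randomly chosen rows where y == 0, so a run of 1s is never split.

```python
import numpy as np
import pandas as pd

def cut_and_shuffle(df, n_cuts, seed=None):
    """Cut df just before n_cuts random rows where y == 0, then shuffle the slices.

    Hypothetical helper: the names and parameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # positions (not labels) of rows with y == 0; a cut lands right before such a row
    zero_pos = np.flatnonzero(df["y"].to_numpy() == 0)
    # a cut at position 0 would only create an empty leading slice
    zero_pos = zero_pos[zero_pos > 0]
    n_cuts = min(n_cuts, len(zero_pos))
    cuts = np.sort(rng.choice(zero_pos, size=n_cuts, replace=False))
    bounds = [0, *cuts, len(df)]
    slices = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    # shuffle the slice order and stitch the pieces back together
    order = rng.permutation(len(slices))
    return pd.concat([slices[i] for i in order], ignore_index=True)

df = pd.DataFrame({
    "ts": range(13),
    "v1": [100, 120, 80, 5, 2, 100, 200, 1234, 12, 40, 200, 300, 0.5],
    "y":  [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})
out = cut_and_shuffle(df, n_cuts=3, seed=0)
print(out)
```

Passing a seed makes the result reproducible; leaving it as None gives a different shuffle on each call.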
Here is my data:
ID Mnth Amt Flg
B 1 10 0
B 2 12 0
B 3 14 0
B 4 41 0
B 5 134 0
B 6 14 0
B 7 134 0
B 8 134 0
B 9 12 0
B 10 41 0
B 11 4 0
B 12 14 0
B 12 14 0
A 1 34 0
A 2 22 0
A 3 56 0
A 4 129 0
A 5 40 0
A 6 20 0
A 7 58 0
A 8 123 0
If I give 3 as input, my output should be:
ID Mnth Amt Flg Level_Flag
B 1 10 0 0
B 2 12 0 1
B 3 14 0 1
B 4 41 0 1
B 5 134 0 2
B 6 14 0 2
B 7 134 0 2
B 8 134 0 3
B 9 12 0 3
B 10 41 0 3
B 11 4 0 4
B 12 14 0 4
B 12 14 0 4
A 1 34 0 0
A 2 22 0 0
A 3 56 0 1
A 4 129 0 1
A 5 40 0 1
A 6 20 0 2
A 7 58 0 2
A 8 123 0 2
So basically I want to divide the data into subgroups of 3 rows each, counting from the bottom up, and label those subgroups as shown in the Level_Flag column. I have other IDs as well, such as A, C and so on, so I want to do this for each ID group. Thanks in advance.
Edit: I want the same thing to be done after grouping by ID.
First we work out how many distinct flag values we need, nums, by dividing the length of the group by n and rounding up. Then we repeat each of those numbers n times. Finally we reverse the array, truncate it to the length of the group, and reverse it once more.
import numpy as np

def create_flags(d, n):
    # number of subgroups, rounded up (np.ceil returns a float, hence the float flags)
    nums = np.ceil(len(d) / n)
    level_flag = np.repeat(np.arange(nums), n)[::-1][:len(d)][::-1]
    return level_flag

df['Level_Flag'] = df.groupby('ID')['ID'].transform(lambda x: create_flags(x, 3))
ID Mnth Amt Flg Level_Flag
0 B 1 10 0 0.0
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
To remove the rows that belong to incomplete subgroups (fewer than 3 rows), use GroupBy.transform:
m = df.groupby(['ID', 'Level_Flag'])['Level_Flag'].transform('count').ge(3)
df = df[m]
ID Mnth Amt Flg Level_Flag
1 B 2 12 0 1.0
2 B 3 14 0 1.0
3 B 4 41 0 1.0
4 B 5 134 0 2.0
5 B 6 14 0 2.0
6 B 7 134 0 2.0
7 B 8 134 0 3.0
8 B 9 12 0 3.0
9 B 10 41 0 3.0
10 B 11 4 0 4.0
11 B 12 14 0 4.0
12 B 12 14 0 4.0
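The flags above come out as floats because np.ceil returns a float. A minimal sketch of an integer-only alternative is shown here, assuming the same column names as the question; it counts each row's distance from the end of its ID group with GroupBy.cumcount(ascending=False) instead of building and reversing an array. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":   ["B"] * 13 + ["A"] * 8,
    "Mnth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 1, 2, 3, 4, 5, 6, 7, 8],
    "Amt":  [10, 12, 14, 41, 134, 14, 134, 134, 12, 41, 4, 14, 14,
             34, 22, 56, 129, 40, 20, 58, 123],
})

n = 3
grouped = df.groupby("ID")
# rows remaining after this one within its ID group (0 for the last row)
from_end = grouped.cumcount(ascending=False)
# number of subgroups per ID: ceiling division written with integer arithmetic
n_groups = -(-grouped["ID"].transform("size") // n)
# bottom-up subgroups of n rows, labelled so the last subgroup gets the highest flag
df["Level_Flag"] = n_groups - 1 - from_end // n
print(df)
```

Because everything stays in integer arithmetic, Level_Flag has an integer dtype and no cast is needed afterwards.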
I have a large list with a shape in excess of (1000000, 200). I would like to count the occurrences of the items in the last column (:, -1). I can do this in pandas with a smaller list:
distribution = mylist.groupby('var1').count()
However, I do not have labels on any of my dimensions, so I am unsure how to reference them.
Edit:
Printout of sample data from pandas:
0 1 2 3 4 ... 204 205 206 207 208
0 1 1 Random 1 4 12 ... 8 -14860 0 -5.0000 43.065233
1 1 1 Random 2 3 2 ... 8 -92993 -1 -1.0000 43.057945
2 1 1 Random 3 13 3 ... 8 -62907 1 -2.0000 43.070335
3 1 1 Random 3 13 3 ... 8 -62907 -1 -2.0000 43.070335
4 1 1 Random 4 4 2 ... 8 -38673 -1 0.0000 43.057945
5 1 1 Book 1 3 9 ... 8 -82339 -1 0.0000 43.059402
... ... ... ... .. .. ... .. ... .. ... ...
11795132 292 1 Random 5 12 2 ... 8 -69229 -1 0.0000 12.839051
11795133 292 1 Book 2 7 10 ... 8 -60664 -1 0.0000 46.823615
11795134 292 1 Random 2 9 4 ... 8 -78754 1 -2.0000 11.762521
11795135 292 1 Random 2 9 4 ... 8 -78754 -1 -2.0000 11.762521
11795136 292 1 Random 1 7 5 ... 8 -76275 -1 0.0000 41.839286
I want a few different counts and summaries, so I plan to do them one at a time with:
mylist = input_list.values
mylist = mylist[:, -1]
mylist.astype(int)
Expected output:
11 2
12 1
41 1
43 6
46 1
iloc lets you reference a column by position, without using labels:
distribution = input_list.groupby(input_list.iloc[:, -1]).count()
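Note that groupby(...).count() returns one count per remaining column for each key. If a single count per value is what is wanted, a minimal sketch with Series.value_counts might look like the following. It assumes the last column should be truncated to integers first, as the question's astype(int) attempt suggests; the small frame below is a stand-in built from the last column of the sample rows in the question.

```python
import pandas as pd

# stand-in for input_list: unlabelled columns, values to count in the last column
input_list = pd.DataFrame({
    0: range(11),
    1: [43.065233, 43.057945, 43.070335, 43.070335, 43.057945, 43.059402,
        12.839051, 46.823615, 11.762521, 11.762521, 41.839286],
})

last_col = input_list.iloc[:, -1]
# astype(int) truncates toward zero; value_counts tallies each distinct value
distribution = last_col.astype(int).value_counts().sort_index()
print(distribution)
```

sort_index() orders the result by value rather than by frequency, which matches the layout of the expected output in the question.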
I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate, for each class, the share of Sum that belongs to each Boolean value (0 or 1). For example, the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide the Sum column by the per-class totals. GroupBy.transform returns a Series with the same length as the original DataFrame, filled with the aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64
This is my first time using AMPL to solve a problem.
Could anyone tell me why this program gives me "invalid subscript x[2,1,1]"?
Thanks!
param K; #number of customers
param T; #number of orders
param J; #number of fabs
param I; #number of items
param A {i in 1..I, t in 1..T}; #quantity of item i requested in order t
param P { t in 1..T}; # price of order t if fully fulfilled
param C {i in 1..I, j in 1..J}; #number of items i that can be produced by fab j per hour
param Cap {j in 1..J}; #capacity hours for fab j
set list {j in 1..J}= {i in 1..I: C[i,j] <>0}; #set of items that can be produced by firm j
set nonlist {j in 1..J}= {i in 1..I: C[i,j] =0}; #set of items that cannot be produced by firm j
var x {i in 1..I, j in 1..J, t in i..T}; #optimal quantity of item i produced by fab j for order t
maximize profit:
#sum {i in 1..I, j in 1..J, t in 1..T} (P[t]*(x[i,j,t] / (sum{ i in 1..I} A[i,t]))); #written like that doesn't work?!
sum{t in 1..T} (P[t]*((sum{i in 1..I, j in 1..J} x[i,j,t])/(sum {i in 1..I} A[i,t])));
subject to limit {i in 1..I, t in 1..T}: sum {j in 1..J} x[i,j,t] <= A[i,t] ; #cannot produce more than ordered for each item i in each order t
subject to capacity {j in 1..J} : sum { t in 1..T, i in list[j]} (x [i,j,t] / C[i,j]) <= Cap [j]; #cannot produce more than maximum capacity for each fab j
subject to realistic {j in 1..J, i in nonlist[j], t in 1..T}: x[i,j,t] =0; # firm j cannot produce item i if C[i,j]=0
subject to nonnegativity {i in 1..I, j in 1..J, t in 1..T}: x[i,j,t] >= 0;
The data file is
param T := 10;
param J:= 8;
param I:= 12;
param A:
1 2 3 4 5 6 7 8 9 10 :=
1 0 1000 0 0 5000 0 0 2000 1500 0
2 0 2000 0 4000 0 1000 1000 2000 0 0
3 0 0 1500 0 0 3500 500 0 3000 0
4 2000 0 0 0 0 1500 0 500 4000 2000
5 3000 0 0 5000 1500 0 0 1000 500 0
6 0 1000 0 0 2500 0 5000 0 1000 0
7 0 0 5000 0 0 0 0 1000 3000 0
8 0 0 4000 0 0 3000 0 0 2000 2000
9 0 0 6000 8000 2500 0 0 0 500 0
10 5000 0 0 0 0 0 0 2000 3000 3000
11 0 3000 0 2000 0 1500 0 3000 500 0
12 0 0 2000 3000 0 0 500 1000 1500 4000 ;
param P :=
1 5500
2 4300
3 9300
4 8600
5 8000
6 6700
7 4700
8 7000
9 9600
10 7200 ;
param Cap :=
1 840
2 750
3 610
4 470
5 560
6 240
7 1250
8 930;
param C:
1 2 3 4 5 6 7 8 :=
1 10 5 0 25 20 40 0 0
2 5 0 20 0 15 0 5 10
3 10 15 30 0 20 40 0 0
4 10 0 5 20 0 50 15 15
5 5 0 0 25 0 50 15 15
6 0 5 10 40 15 0 5 0
7 20 10 0 5 30 0 10 0
8 50 15 10 0 0 30 5 0
9 40 20 30 0 0 0 10 20
10 0 25 15 0 15 45 5 0
11 0 20 0 30 0 20 15 5
12 0 0 30 15 20 0 10 20;
Running it gives the error "invalid subscript x[2,1,1]".
The variable x[2,1,1] doesn't exist because x is indexed over {i in 1..I, j in 1..J, t in i..T}, so when i is 2, t goes from 2 to T. You should either change the declaration of x to something like
var x {i in 1..I, j in 1..J, t in 1..T};
or change the indexing in the declaration of profit and, possibly, constraints to be consistent with the indexing of x.
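To see which subscripts the original declaration actually creates, the index set can be enumerated outside AMPL. The following is only a Python illustration of the set {i in 1..I, j in 1..J, t in i..T} with I, J and T taken from the data file, not AMPL code; the variable names declared and full are illustrative.

```python
# values taken from the data file in the question
I, J, T = 12, 8, 10

# index set as declared: t starts at i, not at 1
declared = {(i, j, t)
            for i in range(1, I + 1)
            for j in range(1, J + 1)
            for t in range(i, T + 1)}

# index set the objective and constraints iterate over: t starts at 1
full = {(i, j, t)
        for i in range(1, I + 1)
        for j in range(1, J + 1)
        for t in range(1, T + 1)}

print((2, 1, 1) in declared)       # is x[2,1,1] part of the declared set?
print(len(declared), len(full))    # size of each index set
print(sorted(full - declared)[0])  # smallest missing subscript in (i, j, t) order
```

With these values the declared set is much smaller than the full one, and items 11 and 12 get no variables at all, because i..T is empty once i exceeds T.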