Cumulative variable calculation which is reset under a given condition, for each ID

I want to create a cumulative variable based on a non-cumulative variable. This variable should be reset when the value of Y equals 1 (but the reset will start from the row below).
I want to do that for each ID in the data frame.
Data illustration:
ID X Non_cum Y
A .. 0 0
A .. 20 0
A .. 40 0
B .. 0 0
B .. 100 0
B .. 200 1
B .. 50 0
Expected result:
ID X Non_cum Y Cum
A .. 0 0 0
A .. 20 0 20
A .. 40 0 60
B .. 0 0 0
B .. 100 0 100
B .. 200 1 300
B .. 50 0 50

You can group by ID and take a cumulative sum of a shifted Y to build reset blocks, then cumsum Non_cum inside each block:
groups = df.groupby(['ID'])
# shift Y down by one row within each ID, so the reset starts on the row below the 1
df['Y_block'] = groups['Y'].shift(fill_value=0)
# cumulative sum of the shifted flags labels the blocks between resets
df['Y_block'] = groups['Y_block'].cumsum()
# the cumulative sum of Non_cum restarts in every (ID, Y_block) group
df['Cum'] = df.groupby(['ID','Y_block'])['Non_cum'].cumsum()
Output (Cum column):
0 0
1 20
2 60
3 0
4 100
5 300
6 50
Name: Cum, dtype: int64
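For completeness, a self-contained sketch that reproduces the output above from the illustrated data (the X column is left out because it is only a placeholder in the illustration):

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Non_cum': [0, 20, 40, 0, 100, 200, 50],
                   'Y': [0, 0, 0, 0, 0, 1, 0]})

# rows after a Y == 1 get a new block label; Cum restarts per (ID, block)
df['Y_block'] = df.groupby('ID')['Y'].shift(fill_value=0)
df['Y_block'] = df.groupby('ID')['Y_block'].cumsum()
df['Cum'] = df.groupby(['ID', 'Y_block'])['Non_cum'].cumsum()
print(df[['ID', 'Non_cum', 'Y', 'Cum']])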

Related

Pandas: I want to slice the data and shuffle the slices to generate some synthetic data

I want to generate some synthetic data for a data science task, since we don't have enough labelled data. I want to cut the rows at random positions where the y column is 0, without cutting through a sequence of 1s.
After cutting, I want to shuffle those slices and build a new DataFrame.
It would be better to have parameters that adjust the maximum and minimum length of a slice, the number of cuts, and so on.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT correct (the cut splits a run of 1s):
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
import numpy as np

# number of cuts
N = 3
# choose N random index values among the rows where y == 0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
# build group labels: membership check plus cumulative sum
arr = df.index.isin(idx).cumsum()
# shuffle the unique group labels
u = np.unique(arr)
np.random.shuffle(u)
# reorder the groups in the DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print(df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1
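The code above always makes a fixed number of cuts and does not limit how short a slice can be. A possible sketch of a parameterised version, assuming the same y column; the function name, the seed argument and the min_len rule are made up for illustration, and a maximum slice length is left out for brevity:

import numpy as np
import pandas as pd

def shuffle_slices(df, n_cuts=3, min_len=2, seed=None):
    """Cut df only at rows where y == 0, keep every slice at least
    min_len rows long, shuffle the slices and return a new DataFrame."""
    rng = np.random.default_rng(seed)
    # candidate cut positions: rows where y == 0, so no run of 1s is split
    candidates = np.flatnonzero(df['y'].to_numpy() == 0)
    cuts = []
    for pos in rng.permutation(candidates):
        far_enough = all(abs(pos - c) >= min_len for c in cuts)
        if far_enough and pos >= min_len and len(df) - pos >= min_len:
            cuts.append(pos)
        if len(cuts) == n_cuts:
            break
    # label the slices, shuffle the labels, then reorder the rows
    groups = np.zeros(len(df), dtype=int)
    for i, c in enumerate(sorted(cuts), start=1):
        groups[c:] = i
    order = rng.permutation(np.unique(groups))
    return df.set_index(groups).loc[order].reset_index(drop=True)

# e.g. new_df = shuffle_slices(df, n_cuts=3, min_len=2, seed=0)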

how to multiply all columns by all columns in SQL?

I have a file with e.g. 50 parameters - that makes 50 columns.
I would like to search for the correlation of some data to the parameters.
But I guess there could be a correlation to 2 or 3 parameters at the same time.
To check this I need new columns containing the result of multiplying each column by each column.
I.e. I have:
ID  A    B    C     ...
-1  0.5  1    0     ...
-2  2    -3   -100  ...
-3  0    0    1     ...
 4  1.5  1.5  1     ...
And I need:
ID  A    B    C     ...  A*A   A*B   A*C   B*B   B*C   C*C   ...
-1  0.5  1    0     ...  0.25  0.5   0     1     0     0
-2  2    -3   -100  ...  4     -6    -200  9     300   10000
-3  0    0    1     ...  0     0     0     0     0     1
 4  1.5  1.5  1     ...  2.25  2.25  1.5   2.25  1.5   1
With 2-3 columns it is easy, but how to deal with 100?
Any ideas?
Regards
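One way to cope with many columns is to generate the SELECT list programmatically rather than typing every product by hand. A minimal sketch in Python (the table name params and the column list are hypothetical; in practice the column names could come from information_schema or the file header):

from itertools import combinations_with_replacement

columns = ['A', 'B', 'C']   # hypothetical column names
table = 'params'            # hypothetical table name

# one product per unordered pair of columns, squares included (A*A, A*B, ...)
products = [f'"{a}" * "{b}" AS "{a}*{b}"'
            for a, b in combinations_with_replacement(columns, 2)]

select_list = ', '.join(['"ID"'] + [f'"{c}"' for c in columns] + products)
print(f'SELECT {select_list} FROM "{table}";')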

Pandas groupby[['col_name', 'values from another dataframe']]

I have two very large pandas dataframes, df and df_new
Sample df:
A B DU DR
100 103 -2 -10
100 110 -8 -9
100 112 0 -4
100 105 2 0
100 111 NAN 12
.
.
.
264 100 NAN -15
.
.
.
Sample df_new:
A TD
100 0
100 1
100 2
.
.
.
103 0
103 1
.
.
.
I wish to get another pandas dataframe with the count of B values whose DU is less than or equal to the TD of df_new for the same value of A in both df and df_new. Similarly, I need the count of B values whose DU is greater than the TD of df_new for the same value of A (this should also include the count of np.nan).
i.e:
my expected dataframe should be something like this:
A TD Count_Less Count_More
100 0 3 2
100 1 3 2
100 2 4 1
.
.
.
103 0 0 5
103 1 1 4
.
.
.
How can I do this in Python?
Please note the data size is huge.
First use DataFrame.merge with a left join, then compare the columns with Series.le for <= and Series.gt for > into new columns with DataFrame.assign, and finally aggregate with sum:
# NaN in DU has to count as "greater than TD", so fill it with a value
# above the largest TD before merging
df1 = df_new.merge(df.assign(DU=df['DU'].fillna(df_new['TD'].max() + 1)),
                   on='A', how='left')
# flag each merged row, then sum the flags per (A, TD)
df2 = (df1.assign(Count_Less=df1['DU'].le(df1['TD']).astype(int),
                  Count_More=df1['DU'].gt(df1['TD']).astype(int))
          .groupby(['A', 'TD'], as_index=False)[['Count_Less', 'Count_More']].sum())
print(df2)
A TD Count_Less Count_More
0 100 0 3 2
1 100 1 3 2
2 100 2 4 1
3 103 0 0 0
4 103 1 0 0
Another solution uses a custom function, but it is slow if df_new is a large DataFrame:
df1 = df.assign(DU=df['DU'].fillna(df_new['TD'].max() + 1))

def f(x):
    # all DU values of df that share this row's value of A
    du = df1.loc[df1['A'].eq(x['A']), 'DU']
    Count_Less = du.le(x['TD']).sum()
    Count_More = du.gt(x['TD']).sum()
    return pd.Series([Count_Less, Count_More], index=['Count_Less', 'Count_More'])

df_new = df_new.join(df_new.apply(f, axis=1))
print(df_new)
A TD Count_Less Count_More
0 100 0 3 2
1 100 1 3 2
2 100 2 4 1
3 103 0 0 0
4 103 1 0 0
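If the merged frame from the first solution is too large to keep in memory, a possible alternative (only a sketch, not part of the answers above) is to sort the DU values once per A and count with np.searchsorted; NaN is again filled so that it lands in Count_More:

import numpy as np
import pandas as pd

# NaN should count as "greater than any TD", so fill it above the maximum TD
filled = df['DU'].fillna(df_new['TD'].max() + 1)
# one sorted array of DU values per value of A
sorted_du = {a: np.sort(s.to_numpy()) for a, s in filled.groupby(df['A'])}

def counts(row):
    du = sorted_du.get(row['A'], np.empty(0))
    # number of DU values <= TD in the sorted array
    less = int(np.searchsorted(du, row['TD'], side='right'))
    return pd.Series({'Count_Less': less, 'Count_More': len(du) - less})

out = df_new.join(df_new.apply(counts, axis=1))
print(out)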

Pandas: Calculate percentage of column for each class

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480
Divide the Sum column by GroupBy.transform, which returns a Series of the same length as the original DataFrame, filled with the aggregated per-class sums:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64
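An equivalent way to write the same division keeps everything inside the transform; a small alternative sketch (the built-in 'sum' used above is generally faster than a lambda):

# same result: divide each Sum by the per-Class total inside the transform
df['%'] = df.groupby('Class')['Sum'].transform(lambda s: s / s.sum())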

group data using pandas, but how do I keep the order of the groups and do math on rows of two of the columns?

df:
Time Name X Y
0 00 AA 0 0
1 30 BB 1 1
2 45 CC 2 2
3 60 GG:AB 3 3
4 90 GG:AC 4 4
5 120 AA 5 3
dataGroup = df.groupby([pd.Grouper(key='Time', freq='30s'), 'Name']).sort_values(by=['Timestamp'], ascending=True)
I have tried doing a diff() on the rows, but it returns NaN or something unexpected.
df.groupby('Name', sort=False)['X'].diff()
How do I keep the groupings and the time sort, and take the diff between a row and its previous row (for both the X and the Y columns)?
Expected output:
For group AA, the expected XDiff is:
XDiff row 1 = (X row 1 - known origin)
XDiff row 2 = (X row 2 - X row 1)
Time Name X Y XDiff YDiff
0 00 AA 0 0 0 0
5 120 AA 5 3 5 3
1 30 BB 1 1 0 0
6 55 BB 2 3 1 2
2 45 CC 2 2 0 0
3 60 GG:AB 3 3 0 0
4 90 GG:AC 4 4 0 0
It would be nice to see the total distance for each group (i.e., AA is 5, BB is 1).
In my example I only have a couple of rows per group, but what if there were 100 of them? The diff would give me the distance between any two adjacent rows, but not the total distance for that group.
Ripping off https://stackoverflow.com/a/20664760/6672746, you can use a lambda function to calculate the difference between rows for X and Y. I also included two lines to set the index (after the groupby) and sort it.
df['x_diff'] = df.groupby(['Name'])['X'].transform(lambda x: x.diff()).fillna(0)
df['y_diff'] = df.groupby(['Name'])['Y'].transform(lambda x: x.diff()).fillna(0)
df.set_index(["Name", "Time"], inplace=True)
df.sort_index(level=["Name", "Time"], inplace=True)
Output:
X Y x_diff y_diff
Name Time
AA 0 0 0 0.0 0.0
120 5 3 5.0 3.0
BB 30 1 1 0.0 0.0
CC 45 2 2 0.0 0.0
GG:AB 60 3 3 0.0 0.0
GG:AC 90 4 4 0.0 0.0
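The follow-up about the total distance per group is not covered by the code above; a minimal sketch, assuming "total distance" means the sum of the per-row diffs computed earlier (for the sample rows this gives 5 for AA):

# sum the consecutive diffs per group; Name is an index level after set_index
total = df.groupby(level='Name')[['x_diff', 'y_diff']].sum()
print(total)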