Pandas: Calculate percentage of column for each class - pandas

I have a dataframe like this:
Class Boolean Sum
0 1 0 10
1 1 1 20
2 2 0 15
3 2 1 25
4 3 0 52
5 3 1 48
I want to calculate percentage of 0/1's for each class, so for example the output could be:
Class Boolean Sum %
0 1 0 10 0.333
1 1 1 20 0.666
2 2 0 15 0.375
3 2 1 25 0.625
4 3 0 52 0.520
5 3 1 48 0.480

Divide column Sum with GroupBy.transform for return Series with same length as original DataFrame filled by aggregated values:
df['%'] = df['Sum'].div(df.groupby('Class')['Sum'].transform('sum'))
print (df)
Class Boolean Sum %
0 1 0 10 0.333333
1 1 1 20 0.666667
2 2 0 15 0.375000
3 2 1 25 0.625000
4 3 0 52 0.520000
5 3 1 48 0.480000
Detail:
print (df.groupby('Class')['Sum'].transform('sum'))
0 30
1 30
2 40
3 40
4 100
5 100
Name: Sum, dtype: int64

Related

Pandas: I want slice the data and shuffle them to genereate some synthetic data

Just want to help with data science to generate some synthetic data since we don't have enough labelled data. I want to cut the rows around the random position of the y column around 0s, don't cut 1 sequence.
After cutting, want to shuffle those slices and generate a new DataFrame.
It's better to have some parameters that adjust the maximum, and minimum sequence to cut, the number of cuts, and something like that.
The raw data
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
Some possible cuts
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
--------------
4 2 0
--------------
5 100 1
6 200 1
7 1234 1
-------------
8 12 0
9 40 0
10 200 1
11 300 1
-------------
12 0.5 0
...
ts v1 y
0 100 1
1 120 1
2 80 1
3 5 0
4 2 0
-------------
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
------------
12 0.5 0
...
This is NOT CORRECT
ts v1 y
0 100 1
1 120 1
------------
2 80 1
3 5 0
4 2 0
5 100 1
6 200 1
7 1234 1
8 12 0
9 40 0
10 200 1
11 300 1
12 0.5 0
...
You can use:
#number of cuts
N = 3
#create random N index values of index if y=0
idx = np.random.choice(df.index[df['y'].eq(0)], N, replace=False)
#create groups with check membership and cumulative sum
arr = df.index.isin(idx).cumsum()
#randomize unique integers - groups
u = np.unique(arr)
np.random.shuffle(u)
#change order of groups in DataFrame
df = df.set_index(arr).loc[u].reset_index(drop=True)
print (df)
ts v1 y
0 9 40.0 0
1 10 200.0 1
2 11 300.0 1
3 12 0.5 0
4 3 5.0 0
5 4 2.0 0
6 5 100.0 1
7 6 200.0 1
8 7 1234.0 1
9 8 12.0 0
10 0 100.0 1
11 1 120.0 1
12 2 80.0 1

pandas aggregate based on continuous same rows

Suppose I have this data frame and I want to aggregate and sum values on column 'a' based on the labels that have the same amount.
a label
0 1 0
1 3 0
2 5 0
3 2 1
4 2 1
5 2 1
6 3 0
7 3 0
8 4 1
The desired result will be:
a label
0 9 0
1 6 1
2 6 0
3 4 1
and not this:
a label
0 15 0
1 10 1
IIUC
s=df.groupby(df.label.diff().ne(0).cumsum()).agg({'a':'sum','label':'first'})
s
Out[280]:
a label
label
1 9 0
2 6 1
3 6 0
4 4 1

Conditional sum after groupby based on value in another column

I have the dataframe as below.
Cycle Type Count Value
1 1 5 0.014
1 1 40 -0.219
1 1 5 0.001
1 1 100 -0.382
1 1 5 0.001
1 1 25 -0.064
2 1 5 0.003
2 1 110 -0.523
2 1 10 0.011
2 1 5 -0.009
2 1 5 0.012
2 1 156 -0.612
3 1 5 0.002
3 1 45 -0.167
3 1 5 0.003
3 1 10 -0.052
3 1 5 0.001
3 1 80 -0.194
I want to sum the 'Count' of all the positive & negative 'Value' AFTER groupby
The answer would something like
1 1 15 (sum of count when Value is positive),
1 1 165 (sum of count when Value is negative),
2 1 20,
2 1 171,
3 1 15,
3 1 135
I think this will work (grouped.set_index('Count').groupby(['Cycle','Type'])['Value']....... but i am unable to figure out how to specify positive & negative values to sum()
If I understood correctly, You can try below code,
df= pd.DataFrame (data)
df_negative=df[df['Value'] < 0]
df_positive=df[df['Value'] > 0]
df_negative = df_negative.groupby(['Cycle','Type']).Count.sum().reset_index()
df_positive = df_positive.groupby(['Cycle','Type']).Count.sum().reset_index()
df_combine = pd.concat([df_positive,df_negative]).sort_values('Cycle')
df_combine

Pandas get order of column value grouped by other column value

I have the following dataframe:
srch_id price
1 30
1 20
1 25
3 15
3 102
3 39
Now I want to create a third column in which I determine the price position grouped by the search id. This is the result I want:
srch_id price price_position
1 30 3
1 20 1
1 25 2
3 15 1
3 102 3
3 39 2
I think I need to use the transform function. However I can't seem to figure out how I should handle the argument I get using .transform():
def k(r):
return min(r)
tmp = train.groupby('srch_id')['price']
train['min'] = tmp.transform(k)
Because r is either a list or an element?
You can use series.rank() with df.groupby():
df['price_position']=df.groupby('srch_id')['price'].rank()
print(df)
srch_id price price_position
0 1 30 3.0
1 1 20 1.0
2 1 25 2.0
3 3 15 1.0
4 3 102 3.0
5 3 39 2.0
is this:
df['price_position'] = df.sort_values('price').groupby('srch_id').price.cumcount() + 1
Out[1907]:
srch_id price price_position
0 1 30 3
1 1 20 1
2 1 25 2
3 3 15 1
4 3 102 3
5 3 39 2

Display Rows only if group of rows' sum is greater then 0

I have a table like the one below. I would like to get this data to SSRS (Grouped by LineID and Product and Column as Hour) to show only those rows where HourCount > 0 for every LineID and Product.
LineID Product Hour HourCount
3 A 0 0
3 A 1 0
3 A 2 0
3 A 3 0
3 A 4 0
3 A 5 0
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
4 B 0 0
4 B 1 0
4 B 2 0
4 B 3 0
4 B 4 0
4 B 5 0
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
Basically I would like this table to look like this before it's in SSRS:
LineID Product Hour HourCount
3 B 0 65
3 B 1 56
3 B 2 45
3 B 3 34
3 B 4 43
3 B 5 45
4 A 0 54
4 A 1 34
4 A 2 45
4 A 3 44
4 A 4 55
4 A 5 44
5 A 0 45
5 A 1 77
5 A 2 66
5 A 3 55
5 A 4 0
5 A 5 0
5 B 0 0
5 B 1 0
5 B 2 45
5 B 3 0
5 B 4 0
5 B 5 0
So display Product for the line only if any of the Hourd have HourCount higher then 0.
Is there any query that could give me these results or I should play with display settings in SSRS?
Something like this should work:
with NonZero as
(
select *
, GroupZeroCount = sum(HourCount) over (partition by LineID, Product)
from HourTable
)
select LineID
, Product
, [Hour]
, HourCount
from NonZero
where GroupZeroCount > 0
SQL Fiddle with demo.
You could certainly so something similar in SSRS, but it's certainly much easier and intuitive to apply at the T-SQL level.
I think you are looking for
SELECT LineID,Product,Hour,Count(Hour) AS HourCount
FROM abc
GROUP BY LineID,Productm,Hour HAVING Count(Hour) > 0