Trying to get subtotals from a pandas dataframe

I'm doing cross-tabulation between two columns in the dataframe. Here's a sample from the columns:
column_1 column_2
A -8
B 95
A -93
D 11
C -62
D -14
A -55
C 66
B 76
D -49
I'm looking for code that returns subtotals for A, B, C and D. For instance, for A the subtotal would be -156 (-8 - 93 - 55 = -156).
I tried to do that with pandas.crosstab() function:
pandas.crosstab(df[column_1], df[column_2], margins=True, margins_name=column_1).Total
Here's a sample of the output:
-271 -263 -241 -223 -221 -212 -207 -201 ... sum_column
A 1 0 1 0 0 1 0 0 ... ##
B 0 0 0 1 0 0 0 0 ... ##
C 0 0 0 0 1 0 0 1 ... ##
D 0 1 0 0 0 0 1 0 ... ##
The sum column contains the row sums of the indicator values (i.e. counts), instead of the subtotals for each of the four letters. I saw once that a boolean table can be used for calculations, but I'm quite sure the desired output can be achieved by changing the pandas.crosstab() call.
I'd be happy to get some ideas and thoughts from you.
Thanks.

If you'd simply like the totals by the individual categories in column_1 (A, B, C, D), a groupby and summation could be helpful! Call groupby on the column with your categories, then call sum on the column you want totalled, like this:
df.groupby('column_1')['column_2'].sum()
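A minimal runnable sketch using the sample values from the question (assuming the columns are literally named column_1 and column_2):

import pandas as pd

df = pd.DataFrame({
    'column_1': ['A', 'B', 'A', 'D', 'C', 'D', 'A', 'C', 'B', 'D'],
    'column_2': [-8, 95, -93, 11, -62, -14, -55, 66, 76, -49],
})

# one subtotal per letter in column_1
print(df.groupby('column_1')['column_2'].sum())
# column_1
# A   -156
# B    171
# C      4
# D    -52
# Name: column_2, dtype: int64

Append .reset_index() if you would rather get a two-column frame than a Series.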

Related

Pandas dataframe, a cumsum calculation including max function

I'm sitting with a pandas dataframe and I have a time series problem where I have some values called diff. I need to calculate a value, here called sum, according to the below formula for each category separately:
sum[n] = max(0, diff[n] + sum[n-1] - factor)
where factor is a parameter, set to 2 in this example.
The dataframe looks something like this and the value of sum is set to 0 for hour = 0:
category  hour  diff  sum
a         0     0     0
a         1     4     NaN
a         2     3     NaN
a         3     1     NaN
b         0     0     0
b         1     1     NaN
b         2     -5    NaN
b         3     4     NaN
My expected output is the following:
category  hour  diff  sum
a         0     0     0
a         1     4     2
a         2     3     3
a         3     1     2
b         0     0     0
b         1     1     0
b         2     -5    0
b         3     4     2
Any idea how to solve this? Preferably without iterrows or any for loops since there are a lot of rows.
Would be happy for any help here.
If it weren't for the max function, I could have used something like this:
df['sum'] = df.groupby(['category'])['diff'].cumsum() - factor
But the max function messes things up for me.
You can use a helper function that keeps a running total (note it relies on a global, so it only works for a single, ordered pass over the frame):
sumn = 0

def calc_sum(row):
    global sumn
    if not row['hour']:  # reset the running total when hour == 0
        sumn = 0
    sumn = max(0, row['diff'] + sumn - 2)  # factor = 2
    return sumn

df['sum'] = df.groupby(['category']).apply(lambda g: g.apply(calc_sum, axis=1)).values
Output:
  category  hour  diff  sum
0        a     0     0    0
1        a     1     4    2
2        a     2     3    3
3        a     3     1    2
4        b     0     0    0
5        b     1     1    0
6        b     2    -5    0
7        b     3     4    2
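As a side note, here is a sketch of a variant that keeps the running total local to each group instead of in a module-level global (still row-wise Python, so not faster, but it cannot leak state between runs; it assumes every group starts at hour 0, as in the example):

def running_sum(diffs, factor=2):
    # carries the recurrence sum[n] = max(0, diff[n] + sum[n-1] - factor)
    out, total = [], 0
    for d in diffs:
        total = max(0, d + total - factor)
        out.append(total)
    return out

df['sum'] = df.groupby('category')['diff'].transform(running_sum)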

How to unpivot a table from boolean form

I have a table like this, where the type (A, B, C) is represented in boolean form:
ID     A  B  C
One    1  0  0
Two    0  0  1
Three  0  1  0
I want the table to look like this:
ID     Type
One    A
Two    C
Three  B
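For reference, a minimal construction of that sample frame, which the answers below operate on (a sketch; plain integer dummy columns assumed):

import pandas as pd

df = pd.DataFrame({
    'ID': ['One', 'Two', 'Three'],
    'A': [1, 0, 0],
    'B': [0, 0, 1],
    'C': [0, 1, 0],
})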
You can melt, then select the rows equal to 1 with loc, using pop to drop the intermediate value column:
out = df.melt('ID', var_name='Type').loc[lambda d: d.pop('value').eq(1)]
output:
ID Type
0 One A
5 Three B
7 Two C
You can do:
import numpy as np

x, y = np.where(df.iloc[:, 1:])
out = pd.DataFrame({'ID': df.loc[x, 'ID'], 'Type': df.columns[1:][y]})
Output:
ID Type
0 One A
1 Two C
2 Three B
You can also use the pd.from_dummies constructor, which was added in pandas 1.5.
Note that this also preserves the original order of your ID column.
df['Type'] = pd.from_dummies(df.loc[:, 'A':'C'])
print(df)
ID A B C Type
0 One 1 0 0 A
1 Two 0 0 1 C
2 Three 0 1 0 B
print(df[['ID', 'Type']])
ID Type
0 One A
1 Two C
2 Three B
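One caveat worth noting (my addition, not from the original answer): pd.from_dummies raises if any row has no 1 at all or more than one 1. The all-zeros case can be handled with the default_category parameter; the frame below is hypothetical, not from the question:

# hypothetical: the second row has no category set
dummies = pd.DataFrame({'A': [1, 0], 'B': [0, 0], 'C': [0, 0]})
pd.from_dummies(dummies, default_category='none')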

Pandas merge conflict rows by counts?

A conflict is when two rows have the same feature but different labels, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows to only one label getting from their counts. If I have more a 1, then a will be labeled as 1. Otherwise, a should be labeled as 0.
I can find these conflicts with df1 = df.groupby('feature', as_index=False).nunique(); df1 = df1[df1['label'] == 2], and their value counts with df2 = df.groupby('feature')['label'].value_counts().reset_index(name='counts').
But how can I find these conflict rows and their counts in one DataFrame (df_conflict = ?), and then merge them by counts (df_merged = merge(df))?
Lets take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you first need to filter for groups with more than one unique value, using DataFrameGroupBy.nunique with GroupBy.transform, before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For the second, get each feature's label by maximum occurrences:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1
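A note on ties (my addition, not from the answer above): value_counts().index[0] picks an arbitrary winner when both labels occur equally often. If ties should deterministically resolve to the smaller label, Series.mode returns its values sorted:

df_merged = (df.groupby('feature')['label']
               .agg(lambda x: x.mode().iloc[0])
               .reset_index())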

rolling sum of a column in pandas dataframe at variable intervals

I have a list of index numbers that represent index locations for a DF. list_index = [2,7,12]
I want to sum a single column in the DF by rolling through each number in list_index and totaling the counts between the index points (restarting the count at 0 after each index point). Here is a mini example.
The desired output is in OUTPUT column, which increments every time there is another 1 from COL 1 and RESTARTS the count at 0 on the location after the number in the list_index.
I was able to get it to work with a loop but there are millions of rows in the DF and it takes a while for the loop to run. It seems like I need a lambda function with a sum but I need to input start and end point in index.
Something like lambda x:x.rolling(start_index, end_index).sum()? Can anyone help me out on this.
You can use a cumulative sum and keep only the information related to the 1 values; a rolling sum with different interval lengths is not directly possible:
a = df['col'].eq(1).cumsum()  # running count of every 1 seen so far
# subtract the running count as of the most recent 0 to restart the tally
df['output'] = a - a.mask(df['col'].eq(1)).ffill().fillna(0).astype(int)
Out:
col output
0 0 0
1 1 1
2 1 2
3 0 0
4 1 1
5 1 2
6 1 3
7 0 0
8 0 0
9 0 0
10 0 0
11 1 1
12 1 2
13 0 0
14 0 0
15 1 1
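If the counter should instead reset exactly at the positions from list_index, as the question literally asks (the answer above resets on every 0 instead), here is a sketch using a per-row segment label; it assumes rows with 0 keep the running count rather than resetting it:

import numpy as np

list_index = [2, 7, 12]
# rows 0..2 form segment 0, rows 3..7 segment 1, rows 8..12 segment 2, ...
segment = np.searchsorted(np.array(list_index) + 1, df.index, side='right')
df['output'] = df.groupby(segment)['col'].cumsum()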

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the index of the maximum value in a dataframe (df). My problem is that I have a df with several columns (more than 10), and one column holds identifiers that repeat. I need to extract, for each identifier, the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
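If the result should be a two-column frame like the one shown in the question rather than a Series, as_index=False keeps id as a regular column:

df.groupby('id', as_index=False)['value'].max()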
If the OP wants these values broadcast back onto the frame, just create a transform and assign.
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
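Since the question mentions more than 10 columns, here is a sketch for pulling back the entire row at each group's maximum (using the modern .loc API; the .ix indexer from the question has long been removed). Note that groupby.idxmax returns the label of the first occurrence of each group's max, so ties keep the first matching row:

df.loc[df.groupby('id')['value'].idxmax()]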