Pandas multiple "group by" and operations on values

I have a dataset:
ID ID2 var1
1 p 10
1 r 5
1 p 9
2 p 7
2 r 6
2 r 7
I need to verify that, for each ID, the difference between (the sum of var1 where ID2 is "p") and (the sum of var1 where ID2 is "r") is greater than 0. In other words, I need to group by ID and apply arithmetic operations between values grouped by ID2.
Thank you for any suggestions

import pandas as pd
from io import StringIO
df = pd.read_fwf(StringIO(
"""ID ID2 var1
1 p 10
1 r 5
1 p 9
2 p 7
2 r 6
2 r 7""")).set_index("ID")
df2 = df.pivot_table(values="var1", index="ID", columns="ID2", aggfunc="sum")
# Example operation -- difference
df2["diff"] = df2["p"] - df2["r"]
df2
Result
ID2 p r diff
ID
1 19 5 14
2 7 13 -6
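To actually certify the condition from the question (p minus r greater than 0), a small follow-up sketch on the pivoted frame:
# flag IDs whose p-sum exceeds their r-sum
df2["ok"] = df2["diff"] > 0
# IDs that fail the check (here: ID 2)
bad_ids = df2.index[~df2["ok"]]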

You can use .groupby followed by .diff() to calculate the difference between the aggregated sums.
df.groupby(['ID', 'ID2']).var1.sum().diff()
Out[72]:
ID ID2
1 p NaN
r -14.0
2 p 2.0
r 6.0
Name: var1, dtype: float64
You can also add an indicator that shows whether the difference was greater than 0, using np.where; before that, we use .reset_index to get our var1 column back.
import numpy as np

groupby = df.groupby(['ID', 'ID2']).var1.sum().diff().reset_index()
groupby['indicator'] = np.where(groupby.var1 > 0, 'yes', 'no')
print(groupby)
ID ID2 var1 indicator
0 1 p NaN no
1 1 r -14.0 no
2 2 p 2.0 yes
3 2 r 6.0 yes

I think you need
df.groupby(['ID','ID2']).sum().groupby(level=[0]).diff()
Out[174]:
var1
ID ID2
1 p NaN
r -14.0
2 p NaN
r 6.0
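Note that .diff() here gives r minus p within each ID; if you want p minus r as stated in the question, a minimal alternative sketch is to unstack the sums:
s = df.groupby(['ID', 'ID2'])['var1'].sum().unstack()
s['diff'] = s['p'] - s['r']   # 14 for ID 1, -6 for ID 2
s['ok'] = s['diff'] > 0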

Your data:
import pandas as pd
df = pd.DataFrame([[1, 'p', 10], [1, 'r', 5], [1, 'p', 9],
                   [2, 'p', 7], [2, 'r', 6], [2, 'r', 7]],
                  columns=['ID', 'ID2', 'var1'])
You can make a cross tabulation:
ct = pd.crosstab(df.ID, [df.ID2, df.var1], margins=True)
>>> ct
ID2 p r All
var1 7 9 10 5 6 7
ID
1 0 1 1 1 0 0 3
2 1 0 0 0 1 1 3
All 1 1 1 1 1 1 6
With no margins:
pd.crosstab(df.ID, [df.ID2,df.var1])
ID2 p r
var1 7 9 10 5 6 7
ID
1 0 1 1 1 0 0
2 1 0 0 0 1 1

Thank you guys very much for all your suggestions! I'm almost there...:)
I was trying all the codes.
I think I wasn't clear when explaining what output I want. For the practical case I'm working on, it would be useful to add an additional variable or two to the original list, like this (below). That lets me make decisions about IDs with negative differences in later steps.
output:
ID ID2 var1 var2(diff) var_control
1 p 10 14 0
1 r 5 14 0
1 p 9 14 0
2 p 7 -6 1
2 r 6 -6 1
2 r 7 -6 1

I think I did it, with all your help. Thank you so much! You are awesome.
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [23, 23, 23, 43, 43],
                   'id2': ["r", "p", "p", "p", "r"],
                   'var1': [4, 6, 7, 1, 3]})
print(df)

df2 = df.pivot_table(values="var1", index="id", columns="id2", aggfunc="sum")
df2['diff'] = df2['p'] - df2['r']
df["var_2"] = df['id'].map(df2["diff"])
df['control'] = np.where(df['var_2'] < 0, 1, 0)
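For reference, the same idea can be written without the intermediate pivot table; a minimal equivalent sketch (same df as above):
s = df.groupby(['id', 'id2'])['var1'].sum().unstack()
df['var_2'] = df['id'].map(s['p'] - s['r'])
df['control'] = (df['var_2'] < 0).astype(int)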

Related

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
The following DataFrame:
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal, for each row, to max(0, A - B + C).
I tried np.maximum(df.A - df.B + df.C, 0) but it doesn't do what I want and gives me the maximum value of the calculated column for every row (= 10 in the example).
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10

How to select all the rows that are within a range of values in a specific column of a pandas dataframe

I have the below sample df, and I'd like to select all the rows that are between a range of values in a specific column:
0 1 2 3 4 5 index
0 -252.44 -393.07 886.72 -2.04 1.58 -2.41 0
1 -260.25 -415.53 881.35 -3.07 0.08 -1.66 1
2 -267.58 -412.60 893.07 -2.98 -1.15 -2.66 2
3 -279.30 -417.97 880.86 -1.15 -0.50 -1.37 3
4 -252.93 -395.51 883.30 -1.30 1.43 4.17 4
I'd like to get the below df (all the rows between index value of 1-3):
0 1 2 3 4 5 index
1 -260.25 -415.53 881.35 -3.07 0.08 -1.66 1
2 -267.58 -412.60 893.07 -2.98 -1.15 -2.66 2
3 -279.30 -417.97 880.86 -1.15 -0.50 -1.37 3
How can I do it?
I tried the below which didn't work:
new_df = df[df['index'] >= 1 & df['index'] <= 3]
Between min and max: use between():
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1,2,3], 'b':[11,12,13]})
>>> df
a b
0 1 11
1 2 12
2 3 13
>>> df[df.a.between(1,2)]
a b
0 1 11
1 2 12
Your attempt new_df = df[df['index'] >= 1 & df['index'] <= 3] is wrong in two places:
it's df.index, not df["index"]
when using multiple filters, use parentheses: df[(df.index >= 1) & (df.index <= 3)]
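For the frame in the question, where "index" is an actual column, the same idea could look like this (a small sketch):
new_df = df[df['index'].between(1, 3)]
# or, slicing on the row labels instead (inclusive on both ends):
new_df = df.loc[1:3]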

Unstack a single column dataframe

I have a dataframe that looks like this:
statistics
0 2013-08
1 4
2 8
3 2013-09
4 7
5 13
6 2013-10
7 2
8 10
And I need it to look like this:
statistics X Y
0 2013-08 4 8
1 2013-09 7 13
2 2013-10 2 10
It would be useful to find a way that doesn't depend on the number of rows, as I want to use it in a loop and the number of original rows might change. However, the output should always have these 3 columns.
What you are doing is not an unstack operation, you are trying to do a reshape.
You can do this by using the reshape method of numpy. The variable n_cols is the number of columns you are looking for.
Here you have an example:
df = pd.DataFrame(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], columns=['col'])
df
col
0 A
1 B
2 C
3 D
4 E
5 F
6 G
7 H
8 I
9 J
10 K
11 L
n_cols = 3
pd.DataFrame(df.values.reshape(int(len(df)/n_cols), n_cols))
0 1 2
0 A B C
1 D E F
2 G H I
3 J K L
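Applied to the question's single statistics column, the same reshape plus column labels gives the requested layout (a sketch assuming the rows always come in groups of three):
out = pd.DataFrame(df['statistics'].values.reshape(-1, 3),
                   columns=['statistics', 'X', 'Y'])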
import pandas as pd

data = pd.read_csv('data6.csv')
x = []
y = []
statistics = []
for i in range(0, len(data)):
    if i % 3 == 0:
        statistics.append(data['statistics'][i])
    elif i % 3 == 1:
        x.append(data['statistics'][i])
    elif i % 3 == 2:
        y.append(data['statistics'][i])
data1 = pd.DataFrame({'statistics': statistics, 'x': x, 'y': y})
data1

Get group counts of level 1 after doing a group by on two columns

I am doing a group by on two columns and need the count of the number of values at level 1 of the resulting index.
I tried the following:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': [1, 2, 0, 4, 3, 4], 'C': [3,3,3,3,4,8]})
>>> print(df)
A B C
0 one 1 3
1 one 2 3
2 two 0 3
3 three 4 3
4 three 3 4
5 one 4 8
>>> aggregator = {'C': {'sC' : 'sum','cC':'count'}}
>>> df.groupby(["A", "B"]).agg(aggregator)
/envs/pandas/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1315: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
C
sC cC
A B
one 1 3 1
2 3 1
4 8 1
three 3 4 1
4 3 1
two 0 3 1
I want an output something like this, where the last column tC gives me the count corresponding to groups one, two and three.
C
sC cC tC
A B
one 1 3 1 3
2 3 1
4 8 1
three 3 4 1 2
4 3 1
two 0 3 1 1
If there is only one column to aggregate, pass a list of tuples:
aggregator = [('sC' , 'sum'),('cC', 'count')]
df = df.groupby(["A", "B"])['C'].agg(aggregator)
For the last column, convert the first level of the MultiIndex to a Series, get the per-group counts with GroupBy.transform('size'), and keep the count only for the first row of each group using numpy.where:
import numpy as np

s = df.index.get_level_values(0).to_series()
df['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size'))
print(df)
sC cC tC
A B
one 1 3 1 3.0
2 3 1 NaN
4 8 1 NaN
three 3 4 1 2.0
4 3 1 NaN
two 0 3 1 1.0
You can also set the duplicated values to an empty string in the tC column, but then any later numeric operation on this column will fail because of the mixed values (numbers and strings):
df['tC'] = np.where(s.duplicated(), '', s.groupby(s).transform('size'))
print(df)
sC cC tC
A B
one 1 3 1 3
2 3 1
4 8 1
three 3 4 1 2
4 3 1
two 0 3 1 1
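An alternative sketch for the same tC column, counting the level-0 group sizes directly on the aggregated frame (using the same numpy import as above):
lvl0 = df.index.get_level_values(0)
counts = df.groupby(level=0).size()
# show each group's size once, NaN on the duplicated rows
df['tC'] = np.where(lvl0.duplicated(), np.nan, counts.reindex(lvl0))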

Separate aggregated data in different rows [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
(10 answers)
Closed 11 months ago.
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1,   2,   3],
    'v' : [ 10,  13,  8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v' : [ 10,  13,  13,  8,   8,   8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column, then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
import numpy as np

df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A', 1, 10,
    'B', 2, 13,
    'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use pandas.reindex() and .repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.
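Putting it together, the full chain might look like:
df2 = df.reindex(df.index.repeat(df.n)).drop('n', axis=1).reset_index(drop=True)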