select top values for each datapoint in dataset [duplicate] - pandas

I have a following sample df:
idd x y
0 1 2 3
1 1 3 4
2 1 5 6
3 2 7 10
4 2 9 8
5 3 11 12
6 3 13 14
7 3 15 16
8 3 17 18
I want to use groupby by "idd" and find min of x and y and store it in a new df along with "idd".
In the above df, I expect to have xmin for idd=1 as 2, ymin for idd=1 as 3; idd = 2, xmin should be 7, ymin should be 8, and so on.
Expecting df:
idd xmin ymin
0 1 2 3
1 2 7 8
2 3 11 12
Code tried:
for group in df.groupby("idd"):
box = [df['x'].max(), df['y'].max()]
but it finds the min of x and y of the whole column and not as per "idd".

Here's a slightly different approach without rename
df = df.groupby('idd').min().add_suffix('min').reset_index()
idd xmin ymin
0 1 2 3
1 2 7 8
2 3 11 12

You can use groupby and then take min for each group.
df.groupby('idd').min().reset_index().rename(columns={'x':'xmin','y':'ymin'})
Out[105]:
idd xmin ymin
0 1 2 3
1 2 7 8
2 3 11 12

Related

Pandas: How to extract data that has been grouped by

Here is an example code to demonstrate my problem:
import numpy as np
import pandas as pd
np.random.seed(10)
df = pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=list('xy'))
df
x y
0 9 4
1 0 1
2 9 0
3 1 8
4 9 0
... ... ...
95 0 4
96 6 4
97 9 8
98 0 7
99 1 7
groups = df.groupby(['x'])
groups.size()
x
0 11
1 12
2 15
3 13
4 14
5 5
6 6
7 9
8 5
9 10
dtype: int64
How can I access the x-values as a column and the aggregated y-values as a second column to plot x versus y?
Two options.
Use reset_index():
groups = df.groupby(['x']).size().reset_index(name='size')
Add as_index=False to groupby:
groups = df.groupby(['x'], as_index=False).size()
Output for both:
>>> groups
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15
IIUC, use as_index=False:
groups = df.groupby(['x'], as_index=False)
out = groups.size()
out.plot(x='x', y='size')
If you only want to plot, you can also keep the x as index:
df.groupby(['x']).size().plot()
output:
x size
0 0 16
1 1 9
2 2 9
3 3 5
4 4 7
5 5 10
6 6 10
7 7 7
8 8 12
9 9 15

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but don't find the right syntax to do it :
The following Dataframe :
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal for each row to max (0 ; A-B+C)
I tried a np.maximum(df.A-df.B+df.C,0) but it doesn't match and give me the maximum value of the calculated column for each row (= 10 in the example).
Finally, I would like to obtain the DF below :
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the maximum function to each row seperately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 0 0 0 0
1 5 4 4 5
2 0 0 0 0
3 9 2 3 10

How to multiply dataframe columns with dataframe column in pandas?

I want to multiply hdataframe columns with dataframe column.
I have two dataframews as shown here:
A dataframe, B dataframe
a b c d e
3 4 4 4 2
3 3 3 3 3
3 3 3 3 4
and I want to make multiplication A and B.
Multiplication result should be like this:
a b c d
6 8 8 8
9 9 9 9
12 12 12 12
I tried just * multiplication but got a wrong result.
Thank you in advance!
Use B.values or B.to_numpy() which will return numpy array and then you can multiply with DataFrame
Ex.:
>>> A
a b c d
0 3 4 4 4
1 3 3 3 3
2 3 3 3 3
>>> B
c
0 2
1 3
2 4
>>> A * B.values
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Just another variation on #Dishin's excellent answer:
U can use pandas mul method to multiply A by B, by setting B as a series and multiplying on the index:
A.mul(B.iloc[:,0],axis='index')
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Use DataFrame.mul with Series by selecting e column:
df = A.mul(B['e'], axis=0)
print (df)
a b c d
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
I think you are looking for the mul function, as seen on this thread here, here is the code.
df = pd.DataFrame([[3, 4, 4, 4],[3, 3, 3, 3],[3, 3, 3, 3]])
val = [2,3,4]
df.mul(val, axis = 0)
Here are the results:
0 1 2 3
0 6 8 8 8
1 9 9 9 9
2 12 12 12 12
Ignore the indices.

pandas: bin data into specific number of bins of specific size

I would like to bin a dataframe by the values in a single column into bins of a specific size and number.
Here is an example df:
df= pd.DataFrame(np.random.randint(0,10000,size=(10000, 4)), columns=list('ABCD'))
Say I want to bin by column D, I will first sort the data:
df.sort('D')
I would now wish to bin so that the first if bin size is 50 and bin number is 100, the first 50 values will go into bin 1, the next into bin 2, and so on and so forth. Any remaining values after the twenty bins should all go into the final bin. Is there anyway of doing this?
EDIT:
Here is a sample input:
x = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=list('ABCD'))
And here is the expected output:
A B C D bin
0 6 8 6 5 3
1 5 4 9 1 1
2 5 1 7 4 3
3 6 3 3 3 2
4 2 5 9 3 2
5 2 5 1 3 2
6 0 1 1 0 1
7 3 9 5 8 3
8 2 4 0 1 1
9 6 4 5 6 3
As an extra aside, is it also possible to bin any equal values in the same bin? So for example, say I have bin 1 which contains values, 0,1,1 and then bin 2 contains 1,1,2. Is there any way of putting those two 1 values in bin 2 into bin 1? This will create very uneven bin sizes but this is not an issue.
It seems you need floor divide np.arange and then assign to new column:
idx = df['D'].sort_values().index
df['b'] = pd.Series(np.arange(len(df)) // 3 + 1, index = idx)
print (df)
A B C D bin b
0 6 8 6 5 3 3
1 5 4 9 1 1 1
2 5 1 7 4 3 3
3 6 3 3 3 2 2
4 2 5 9 3 2 2
5 2 5 1 3 2 2
6 0 1 1 0 1 1
7 3 9 5 8 3 4
8 2 4 0 1 1 1
9 6 4 5 6 3 3
Detail:
print (np.arange(len(df)) // 3 + 1)
[1 1 1 2 2 2 3 3 3 4]
EDIT:
I create another question about problem with last values here:
N = 3
idx = df['D'].sort_values().index
#one possible solution, thanks divakar
def replace_irregular_groupings(a, N):
n = len(a)
m = N*(n//N)
if m!=n:
a[m:] = a[m-1]
return a
idx = df['D'].sort_values().index
arr = replace_irregular_groupings(np.arange(len(df)) // N + 1, N)
df['b'] = pd.Series(arr, index = idx)
print (df)
A B C D bin b
0 6 8 6 5 3 3
1 5 4 9 1 1 1
2 5 1 7 4 3 3
3 6 3 3 3 2 2
4 2 5 9 3 2 2
5 2 5 1 3 2 2
6 0 1 1 0 1 1
7 3 9 5 8 3 3
8 2 4 0 1 1 1
9 6 4 5 6 3 3

Is there an easy way to group columns in a Pandas DataFrame?

I am trying to use Pandas to represent motion-capture data, which has T measurements of the (x, y, z) locations of each of N markers. For example, with T=3 and N=4, the raw CSV data looks like:
T,Ax,Ay,Az,Bx,By,Bz,Cx,Cy,Cz,Dx,Dy,Dz
0,1,2,1,3,2,1,4,2,1,5,2,1
1,8,2,3,3,2,9,9,1,3,4,9,1
2,4,5,7,7,7,1,8,3,6,9,2,3
This is really simple to load into a DataFrame, and I've learned a few tricks that are easy (converting marker data to z-scores, or computing velocities, for example).
One thing I'd like to do, though, is convert the "flat" data shown above into a format that has a hierarchical index on the column (marker), so that there would be N columns at level 0 (one for each marker), and each one of those would have 3 columns at level 1 (one each for x, y, and z).
A B C D
x y z x y z x y z x y z
0 1 2 1 3 2 1 4 2 1 5 2 1
1 8 2 3 3 2 9 9 1 3 4 9 1
2 4 5 7 7 7 1 8 3 6 9 2 3
I know how do this by loading up the flat file and then manipulating the Series objects directly, perhaps by using append or just creating a new DataFrame using a manually-created MultiIndex.
As a Pandas learner, it feels like there must be a way to do this with less effort, but it's hard to discover. Is there an easier way?
You basically just need to manipulate the column names, in your case.
Starting with your original DataFrame (and a tiny index manipulation):
from StringIO import StringIO
import numpy as np
a = pd.read_csv(StringIO('T,Ax,Ay,Az,Bx,By,Bz,Cx,Cy,Cz,Dx,Dy,Dz\n\
0,1,2,1,3,2,1,4,2,1,5,2,1\n\
1,8,2,3,3,2,9,9,1,3,4,9,1\n\
2,4,5,7,7,7,1,8,3,6,9,2,3'))
a.set_index('T', inplace=True)
So that:
>> a
Ax Ay Az Bx By Bz Cx Cy Cz Dx Dy Dz
T
0 1 2 1 3 2 1 4 2 1 5 2 1
1 8 2 3 3 2 9 9 1 3 4 9 1
2 4 5 7 7 7 1 8 3 6 9 2 3
Then simply create a list of tuples for your columns, and use MultiIndex.from_tuples:
a.columns = pd.MultiIndex.from_tuples([(c[0], c[1]) for c in a.columns])
>> a
A B C D
x y z x y z x y z x y z
T
0 1 2 1 3 2 1 4 2 1 5 2 1
1 8 2 3 3 2 9 9 1 3 4 9 1
2 4 5 7 7 7 1 8 3 6 9 2 3