Pandas, groupby operation - pandas

Using groupby() in pandas, I get the following output (A, B and C are the columns of the input table):
C
A B
0 0 6
2 1
6 5
. . .
Output details: [244 rows x 1 columns]. I want all 3 columns instead of one. How can I do that?
The output I want:
A B C
0 0 6
0 2 1
. . .

It appears to be undocumented, but simply: gb.bfill(), see this example:
In [68]:
df=pd.DataFrame({'A':[0,0,0,0,0,0,0,0],
'B':[0,0,0,0,1,1,1,1],
'C':[1,2,3,4,1,2,3,4],})
In [69]:
gb=df.groupby(['A', 'B'])
In [70]:
print(gb.bfill())
A B C
0 0 0 1
1 0 0 2
2 0 0 3
3 0 0 4
4 0 1 1
5 0 1 2
6 0 1 3
7 0 1 4
[8 rows x 3 columns]
But I don't see why you need to do that, don't you end up with the original DataFrame (only maybe rearranged)?
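If the goal is only to turn the group keys A and B back into ordinary columns, a minimal sketch follows. It assumes the grouped output in the question came from a sum aggregation, which the question does not state, and the small input frame is made up for illustration. Both reset_index() and as_index=False are standard pandas features.

```python
import pandas as pd

# Hypothetical input; the question does not show the original table.
df = pd.DataFrame({"A": [0, 0, 0, 0],
                   "B": [0, 0, 2, 2],
                   "C": [1, 5, 1, 0]})

# Assumption: the aggregation is a sum. The result has A and B in the index.
grouped = df.groupby(["A", "B"]).sum()

# Option 1: move the index levels back into regular columns.
flat = grouped.reset_index()

# Option 2: ask groupby not to build the index in the first place.
flat_direct = df.groupby(["A", "B"], as_index=False).sum()

print(flat)
```

Either option yields a frame with the three columns A, B and C and a plain integer index.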

Related

Convert multichannel image into pixelwise pandas dataframe

If you have a multiband image of, say, dimensions 1024 * 1024 * 200 (columns * lines * bands) and want to convert that to a pandas dataframe of the form:
Band Value
1 1 0.14
2 1 1.18
3 1 2.56
.
.
.
209715198 200 1.01
209715199 200 1.15
209715200 200 2.00
So basically all pixels in sequential form, with the band number (or wavelength) and the pixel value as columns.
Is there a clever and efficient way of doing this without a lot of loops, appending to arrays and so on?
Answer
You can do it with numpy. I'll try my best to walk you through it below. First you need the input images in a 3D numpy array. I'm just going to use a randomly generated small one for illustration. This is the full code, with an explanation below.
import numpy as np
import pandas as pd
images = np.random.randint(0,9,(2,5,5))
z, y, x = images.shape ## 2, 5, 5 (200, 1024, 1024 for your example)
arr = np.column_stack((np.repeat(np.arange(z),y*x), images.ravel()))
df = pd.DataFrame(arr, columns = ['Bands', 'Value'])
Explanation
The images output array looks like this (basically 2 images at 5x5 pixels):
[[[5 2 3 6 2]
[6 1 6 3 2]
[8 3 2 2 1]
[5 1 2 6 0]
[3 4 7 0 2]]
[[1 7 0 7 3]
[7 4 5 4 3]
[1 5 4 7 4]
[2 0 2 7 2]
[7 0 1 6 7]]]
The next step is to use np.ravel() to flatten it, which gives you the required Value column:
#images.ravel()
[5 2 3 6 2 6 1 6 3 2 8 3 2 2 1 5 1 2 6 0 3 4 7 0 2 1 7 0 7 3 7 4 5 4 3 1 5
4 7 4 2 0 2 7 2 7 0 1 6 7]
To create the band column, you need to repeat each band index (0 to z-1) x*y times. You can do this with np.repeat() and np.arange(), which gives you a 1D array:
#(np.repeat(np.arange(z),y*x))
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1]
This is the required Band column. To combine them, use np.column_stack() and then turn the result into a dataframe. All of the above steps combined would be:
arr = np.column_stack((np.repeat(np.arange(z),y*x), images.ravel()))
df = pd.DataFrame(arr, columns = ['Bands', 'Value'])
Which will output:
Bands Value
0 0 5
1 0 2
2 0 3
3 0 6
4 0 2
5 0 6
6 0 1
7 0 6
8 0 3
9 0 2
10 0 8
11 0 3
12 0 2
13 0 2
14 0 1
15 0 5
16 0 1
17 0 2
18 0 6
19 0 0
20 0 3
21 0 4
22 0 7
23 0 0
24 0 2
25 1 1
26 1 7
27 1 0
...
As required. I hope this at least gets you moving in the right direction.
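One thing to watch for is that np.column_stack() produces a single array with one dtype, so with float pixel values the band numbers are stored as floats too. The sketch below is a variant under stated assumptions: the cube is laid out as (columns, lines, bands) as described in the question, bands are numbered from 1, and the names cube and rng are illustrative only. Building the DataFrame from a dict keeps an integer Band column next to a float Value column.

```python
import numpy as np
import pandas as pd

# Hypothetical small cube with the band axis last: (columns, lines, bands).
rng = np.random.default_rng(0)
cube = rng.random((5, 5, 2))

# Move the band axis to the front so each band's pixels are contiguous
# after flattening: images[b, i, j] == cube[i, j, b].
images = np.moveaxis(cube, -1, 0)
z, y, x = images.shape

# One column per array, so each keeps its own dtype.
df = pd.DataFrame({
    "Band": np.repeat(np.arange(1, z + 1), y * x),  # 1-based band numbers
    "Value": images.ravel(),
})

print(df.dtypes)
```

If the array already has bands on the first axis, as in the answer above, the np.moveaxis() step can be dropped.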

Fill the row in a data frame with a specific value based on a condition on the specific column

I have a data frame df:
df=
A B C D
1 4 7 2
2 6 -3 9
-2 7 2 4
I want to set all values in a row to 0 if its element in column C is negative, i.e. if df['C']<0, the corresponding row should be filled with the value 0 as shown below:
df=
A B C D
1 4 7 2
0 0 0 0
-2 7 2 4
You can use DataFrame.where or mask:
df.where(df['C'] >= 0, 0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Another option is simple masking via multiplication:
df.mul(df['C'] >= 0, axis=0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
You can also set values directly via loc:
df.loc[df['C'] < 0] = 0
df
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
This has the added benefit of modifying the original DataFrame in place (if you'd rather not return a copy).
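For a side-by-side check, here is a small self-contained sketch using the frame from the question. The variable names keep, zeroed and in_place are illustrative only. It compares the copy returned by where with the in-place assignment through loc.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, -2],
                   "B": [4, 6, 7],
                   "C": [7, -3, 2],
                   "D": [2, 9, 4]})

# Boolean row mask: True where the row should be kept unchanged.
keep = df["C"] >= 0

# where() returns a new frame; rows where the mask is False become 0.
zeroed = df.where(keep, 0)

# loc assignment modifies the frame it is called on, so work on a copy here
# to leave df untouched for comparison.
in_place = df.copy()
in_place.loc[~keep] = 0

print(zeroed)
```

Note that the strict comparison matters: a row with C equal to 0 is kept by `>= 0` and would only be overwritten by a `<= 0` condition.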

Maximum of calculated pandas column and 0

I have a very simple problem (I guess) but can't find the right syntax for it.
Given the following DataFrame:
A B C
0 7 12 2
1 5 4 4
2 4 8 2
3 9 2 3
I need to create a new column D equal, for each row, to max(0, A-B+C).
I tried np.maximum(df.A-df.B+df.C,0) but it doesn't give what I want: I get the maximum value of the calculated column for every row (= 10 in the example).
In the end, I would like to obtain the DataFrame below:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
Any help appreciated
Thanks
Let us try
df['D'] = df.eval('A-B+C').clip(lower=0)
Out[256]:
0 0
1 5
2 0
3 10
dtype: int64
You can use np.where:
s = df["A"]-df["B"]+df["C"]
df["D"] = np.where(s>0, s, 0) #or s.where(s>0, 0)
print (df)
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
To do this in one line you can use apply to apply the maximum function to each row separately.
In [19]: df['D'] = df.apply(lambda s: max(s['A'] - s['B'] + s['C'], 0), axis=1)
In [20]: df
Out[20]:
A B C D
0 7 12 2 0
1 5 4 4 5
2 4 8 2 0
3 9 2 3 10
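As a cross-check on the attempt in the question, the sketch below contrasts np.maximum, which compares element by element, with np.max, which reduces a whole Series to one number. That the question's symptom came from a reducing maximum is only a guess; the code simply shows the two behaviours on the frame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [7, 5, 4, 9],
                   "B": [12, 4, 8, 2],
                   "C": [2, 4, 2, 3]})

s = df["A"] - df["B"] + df["C"]

# Elementwise: each value of s is compared with 0 separately.
df["D"] = np.maximum(s, 0)

# Reduction: a single scalar, the largest value in s.
overall_max = np.max(s)

print(df)
print(overall_max)
```

The elementwise form avoids the Python-level loop that apply with axis=1 performs, which tends to matter on large frames.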

Pandas change each group into a single row

I have a dataframe like the following.
>>> data
target user data
0 A 1 0
1 A 1 0
2 A 1 1
3 A 2 0
4 A 2 1
5 B 1 1
6 B 1 1
7 B 1 0
8 B 2 0
9 B 2 0
10 B 2 1
You can see that each user may contribute multiple claims about a target. I want to store only each user's most frequent data value for each target. For example, for the dataframe shown above, I want a result like the following.
>>> result
target user data
0 A 1 0
1 A 2 0
2 B 1 1
3 B 2 0
How can I do this? Can I do it using groupby? (My real dataframe is not sorted.)
Thanks!
Use groupby with a count transform to create a helper key, then use idxmax:
df['helperkey']=df.groupby(['target','user','data']).data.transform('count')
df.groupby(['target','user']).helperkey.idxmax()
Out[10]:
target user
A 1 0
2 3
B 1 5
2 8
Name: helperkey, dtype: int64
df.loc[df.groupby(['target','user']).helperkey.idxmax()]
Out[11]:
target user data helperkey
0 A 1 0 2
3 A 2 0 1
5 B 1 1 2
8 B 2 0 2
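An alternative sketch, not taken from the answer above, takes the mode of each group directly with Series.mode(). One difference worth noting: mode() returns its values sorted, so a tie is resolved in favour of the smallest value, whereas idxmax picks the first row that reaches the highest count. On the sample data both give the same result.

```python
import pandas as pd

data = pd.DataFrame({
    "target": list("AAAAABBBBBB"),
    "user":   [1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
    "data":   [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1],
})

# For each (target, user) pair, keep the most frequent value of "data".
# mode() may return several values on a tie; iloc[0] takes the smallest.
result = (
    data.groupby(["target", "user"])["data"]
        .agg(lambda s: s.mode().iloc[0])
        .reset_index()
)

print(result)
```

Because groupby sorts the group keys by default, this also works when the input rows are not sorted.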

Select rows if columns meet condition

I have a DataFrame with 75 columns.
How can I select rows based on a condition in a specific array of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(1)]
But let's say I just want to do this on columns 3:45.
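Use ix to slice the columns by ordinal position (note that ix was deprecated in pandas 0.20 and removed in 1.0; on current versions iloc accepts the same positional slices):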
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.ix[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[(df.ix[:, 2:45].values > 1.5).any(1)]
should work
Indexing is 0-based, and the start of the range is included but the end is not. So here the 3rd column is included and the slice runs up to, but does not include, the 46th column.
Another solution with iloc; values can be omitted:
#if need from 3rd to 45th columns
print (df[((df.iloc[:, 2:45]) > 1.5).any(axis=1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print (df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print (df[((df.iloc[:, 2:5]) > 1.5).any(axis=1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
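The sample above can be reproduced without relying on the random seed by typing the values in directly. The sketch below does that and also shows a label-based variant with loc, where, unlike positional slices, the end label is included. The names selected and by_label are illustrative only.

```python
import pandas as pd

# Same values as the sample frame printed above.
df = pd.DataFrame(
    [[1, 0, 0, 1, 1, 0, 0, 1, 0, 1],
     [0, 2, 1, 2, 0, 2, 1, 2, 0, 0],
     [2, 0, 1, 2, 2, 0, 1, 1, 2, 0],
     [2, 1, 1, 1, 1, 2, 1, 1, 0, 0],
     [1, 0, 0, 1, 2, 1, 0, 2, 2, 1]],
    columns=list("abcdefghij"),
)

# Positional slice 2:5 covers columns c, d, e (end position excluded).
selected = df[(df.iloc[:, 2:5] > 1.5).any(axis=1)]

# Label slice 'c':'e' covers the same columns (end label included).
by_label = df[(df.loc[:, "c":"e"] > 1.5).any(axis=1)]

print(selected)
```

Passing axis=1 by keyword keeps the call working on pandas versions where DataFrame.any() no longer accepts the axis positionally.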