I have a DataFrame with 75 columns.
How can I select rows based on a condition in a specific array of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(1)]
But let's say I just want to do this on columns 3:45.
Use ix to slice the columns using ordinal position:
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.ix[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[(df.ix[:, 2:45]).values > 1.5).any(1)]
should work
indexing is 0 based and the open range is included but the closing range is not so here 3rd column is included and we slice up to column 46 but this is not included in the slice
Another solution with iloc, values can be omited:
#if need from 3rd to 45th columns
print (df[((df.iloc[:, 2:45]) > 1.5).any(1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print (df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print (df[((df.iloc[:, 2:5]) > 1.5).any(1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
Related
With reference to Pandas groupby with categories with redundant nan
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
4 A 0
B 0
5 A 0
B 0
8 A 0
B 0
2 1 A 0
B 0
2 A 0
B 0
4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
Expected output
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
2 4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
I tried to use astype("category") for TYPE. However it seems to output every cartesian product of every item in every group.
What you want is a little abnormal, but we can force it there from a pivot table:
out = df.pivot_table(index=['TEAM', 'ID'],
columns=['TYPE'],
values=['VALUE'],
aggfunc='sum',
observed=True, # This is the key when working with categoricals~
# You should known to try this with your groupby from the post you linked...
fill_value=0).stack()
print(out)
Output:
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
2 4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
here is one way to do it, based on the data you shared
reset the index and then do the groupby to choose groups where sum is greater than 0, means either of the category A or B is non-zero. Finally set the index
df.reset_index(inplace=True)
(df[df.groupby(['TEAM','ID'])['VALUE']
.transform(lambda x: x.sum()>0)]
.set_index(['TEAM','ID','TYPE']))
VALUE
TEAM ID TYPE
1 1 A 1
B 1
2 A 1
B 1
2 4 A 1
B 0
5 A 1
B 0
8 A 1
B 0
At the replication of a dataframe using concat with index (see example here), is there a way I can assign a count variable for each iteration in column c (where column c is the count variable)?
Orig df:
a
b
0
1
2
1
2
3
df replicated with pd.concat[df]*5 and with an additional Column c:
a
b
c
0
1
2
1
1
2
3
1
0
1
2
2
1
2
3
2
0
1
2
3
1
2
3
3
0
1
2
4
1
2
3
4
0
1
2
5
1
2
3
5
This is a multi-row dataframe where the count variable would have to be applied to multiple rows.
Thanks for your thoughts!
You could use np.arange and np.repeat:
N = 5
new_df = pd.concat([df] * N)
new_df['c'] = np.repeat(np.arange(N), df.shape[0]) + 1
Output:
>>> new_df
a b c
0 1 2 1
1 2 3 1
0 1 2 2
1 2 3 2
0 1 2 3
1 2 3 3
0 1 2 4
1 2 3 4
0 1 2 5
1 2 3 5
I have a data frame df:
df=
A B C D
1 4 7 2
2 6 -3 9
-2 7 2 4
I am interested in changing the whole row values to 0 if it's element in the column C is negative. i.e. if df['C']<0, its corresponding row should be filled with the value 0 as shown below:
df=
A B C D
1 4 7 2
0 0 0 0
-2 7 2 4
You can use DataFrame.where or mask:
df.where(df['C'] >= 0, 0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Another option is simple masking via multiplication:
df.mul(df['C'] >= 0, axis=0)
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
You can also set values directly via loc as shown in this comment:
df.loc[df['C'] <= 0] = 0
df
A B C D
0 1 4 7 2
1 0 0 0 0
2 -2 7 2 4
Which has the added benefit of modifying the original DataFrame (if you'd rather not return a copy).
I have a DataFrame:-
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 C 1
6 C 2
7 C 3
8 C 4
wan to create new variable named Flag according to last occurrence of B , A in col variable. reference df:-
col count Flag
0 B 1 0
1 B 2 1
2 A 1 0
3 A 2 0
4 A 3 1
5 C 1 0
6 C 2 0
7 C 3 0
8 C 4 1
TIA
Use Series.duplicated with numpy.where:
df['Flag'] = np.where(df['col'].duplicated(keep='last'), 0, 1)
Or Series.view with invert mask by ~:
df['Flag'] = (~df['col'].duplicated(keep='last')).view('i1')
print (df)
col count Flag
0 B 1 0
1 B 2 1
2 A 1 0
3 A 2 0
4 A 3 1
5 C 1 0
6 C 2 0
7 C 3 0
8 C 4 1
I have in pandas by using of groupby() next output (A,B,C are the columns in the input table)
C
A B
0 0 6
2 1
6 5
. . .
Output details: [244 rows x 1 columns] I just want to have all 3 columns instead of one,how is it possible to do?
Output, which I wish:
A B C
0 0 6
0 2 1
. . .
It appears to be undocumented, but simply: gb.bfill(), see this example:
In [68]:
df=pd.DataFrame({'A':[0,0,0,0,0,0,0,0],
'B':[0,0,0,0,1,1,1,1],
'C':[1,2,3,4,1,2,3,4],})
In [69]:
gb=df.groupby(['A', 'B'])
In [70]:
print gb.bfill()
A B C
0 0 0 1
1 0 0 2
2 0 0 3
3 0 0 4
4 0 1 1
5 0 1 2
6 0 1 3
7 0 1 4
[8 rows x 3 columns]
But I don't see why you need to do that, don't you end up with the original DataFrame (only maybe rearranged)?