Python: select N rows per group in a dataframe - pandas

I have a dataframe with 2 columns, and I want to select N rows of column B per unique value in column A:
A B
0 A
0 B
0 I
0 D
1 A
1 F
1 K
1 L
2 R
For each unique number in column A, give me N random rows from column B. If a value in column A has fewer than N rows, return all of its rows. If N == 2, the resulting dataframe would look like this:
A B
0 A
0 D
1 F
1 K
2 R

Use DataFrame.sample per group in GroupBy.apply, testing the length of each group with if-else:
N = 2
df1 = df.groupby('A').apply(lambda x: x.sample(N) if len(x) >= N else x).reset_index(drop=True)
print(df1)
A B
0 0 I
1 0 D
2 1 A
3 1 K
4 2 R
Or:
N = 2
df1 = df.groupby('A', group_keys=False).apply(lambda x: x.sample(N) if len(x) >= N else x)
print(df1)
A B
0 0 A
3 0 D
5 1 F
6 1 K
8 2 R
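Newer pandas (1.1+) also offers GroupBy.sample; a minimal sketch of how it relates to the if-else above (the min() cap is my variant, not part of the original answer):
N = 2
# GroupBy.sample draws N rows per group directly, but it raises an error
# when a group has fewer than N rows:
# df1 = df.groupby('A').sample(n=N)

# Capping N at each group's size keeps the fall-back behaviour without if-else:
df1 = (df.groupby('A', group_keys=False)
         .apply(lambda x: x.sample(min(len(x), N)))
         .reset_index(drop=True))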

Pandas groupby when group criteria repeat

A B C
a b 1
c d 1
e f 2
g h 2
i j 2
K l 1
J K 1
L M 1
I have a dataset that looks something like this. I want to group the rows based on C: the data is sequential, and I want to give a unique id to each consecutive run. How can I achieve this?
The classical trick is to flag non-equality between successive rows (True at each change of value), then take a cumulative sum, which turns those flags into increasing group numbers.
Using shift and ne, then cumsum to form the grouper, and ngroup to get the group ID:
grouper = df['C'].ne(df['C'].shift()).cumsum()
df['group'] = df.groupby(grouper).ngroup()
Or with diff and ne, then cumsum:
grouper = df['C'].diff().ne(0).cumsum()
Output:
A B C group
0 a b 1 0
1 c d 1 0
2 e f 2 1
3 g h 2 1
4 i j 2 1
5 K l 1 2
6 J K 1 2
7 L M 1 2
Intermediates of the logic to construct the grouper:
C non-eq implicit int cumsum
0 1 True 1 1
1 1 False 0 1
2 2 True 1 2
3 2 False 0 2
4 2 False 0 2
5 1 True 1 3
6 1 False 0 3
7 1 False 0 3
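A small sketch that materialises those intermediates as real columns, assuming the df from the question, so the logic can be inspected directly:
tmp = df[['C']].copy()
tmp['non-eq'] = df['C'].ne(df['C'].shift())  # True at every change of value
tmp['int'] = tmp['non-eq'].astype(int)       # the cast cumsum performs implicitly
tmp['cumsum'] = tmp['int'].cumsum()          # the final grouper
print(tmp)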

Merge two matrices (dataframes) into one with interleaved columns

I have two dataframes like these:
df1 a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2 x y z
0 T T F
1 F T T
2 F T F
I want to merge these matrices column by column, interleaving one column of df2 after each column of df1, like this:
df a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What's your idea? How can we merge, append, or concat to get this?
I used this code; it works dynamically:
df = pd.DataFrame()
for i in range(0, 6):
    if i % 2 == 0:
        j = i / 2
        df.loc[:, i] = df1.iloc[:, int(j)]
    else:
        j = (i - 1) / 2
        df.loc[:, i] = df2.iloc[:, int(j)]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
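If you would rather not hard-code the column order, interleaving the two column lists with zip reproduces the loop dynamically; a sketch assuming df1 and df2 have the same number of columns and no overlapping names:
order = [col for pair in zip(df1.columns, df2.columns) for col in pair]
df = pd.concat([df1, df2], axis=1)[order]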

Count number of columns with a specific value in pandas

I am searching for a way to do a COUNTIF over the rows in pandas. An example would be:
df = pd.DataFrame(data={'A': ['x', 'y', 'z'], 'B': ['z', 'y', 'x'], 'C': ['y', 'x', 'z']})
I want to count the number of repetitions on each row and add it to new columns based on specific criteria:
Criteria
C1 = x
C2 = y
C3 = z
In the example above, C3 will be [1, 0, 2], as there is one 'z' in row 0, no 'z' in row 1, and two 'z' in row 2.
The end table would look like:
A B C | C1 C2 C3
x z y | 1 1 1
y y x | 1 2 0
z x z | 1 0 2
How can I do this in Pandas?
Thanks a lot!
Do you mean:
df.join(df.apply(pd.Series.value_counts, axis=1).fillna(0))
Output:
A B C x y z
0 x z y 1.0 1.0 1.0
1 y y x 1.0 2.0 0.0
2 z x z 1.0 0.0 2.0
You can iterate through the values and sum across axis 1 (assigning to a new name so the original df stays usable below):
out = pd.concat([df.eq(val).sum(axis=1) for val in ['x', 'y', 'z']], axis=1)
0 1 2
0 1 1 1
1 1 2 0
2 1 0 2
Then rename the columns accordingly.
For a more general solution, consider np.unique to collect the values and pd.Series.rename to label each count:
pd.concat([df.eq(val).sum(axis=1).rename(val) for val in np.unique(df)], axis=1)
x y z
0 1 1 1
1 1 2 0
2 1 0 2
And with some trivial tweaks, you can have your end table:
map_ = {'x': 'C1', 'y': 'C2', 'z': 'C3'}
df.join(pd.concat([df.eq(i).sum(axis=1).rename(map_[i]) for i in np.unique(df)], axis=1))
A B C C1 C2 C3
0 x z y 1 1 1
1 y y x 1 2 0
2 z x z 1 0 2
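As a further alternative (my addition, not from the answers above), one-hot encoding the stacked values avoids the per-value loop entirely; a sketch assuming the df from the question:
# Encode every cell as a dummy column, then collapse back to one row per original row.
counts = pd.get_dummies(df.stack()).groupby(level=0).sum()
result = df.join(counts.rename(columns={'x': 'C1', 'y': 'C2', 'z': 'C3'}))
print(result)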

If a column value does not have a certain number of occurrences in a dataframe, how to duplicate all rows with that column value?

Say this my dataframe
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
If a particular value in column B does not reach a particular occurrence count, I want to duplicate all rows which have that value in column B.
For the df above, say this occurrence count is 3. If a value in column B occurs fewer than three times, then all rows with that value are duplicated. So rows with column B values 0, 1, and 2 are duplicated, but rows with a column B value of 5 are not.
Desired result:
A B
0 a 5
1 b 2
2 d 5
3 g 3
4 m 2
5 c 0
6 u 5
7 p 3
8 q 1
9 z 1
10 b 2
11 m 2
12 g 3
13 p 3
14 c 0
15 c 0
Here is my approach:
n = 3  # threshold
df2 = (df.assign(columns=df.groupby('B').cumcount())
         .pivot_table(columns='columns',
                      index='B',
                      values='A',
                      aggfunc='first'))
r = max(n, len(df2.columns))
df2 = df2.reindex(columns=range(r))
notNaN_count = df2.count(axis=1)
m_ffill = notNaN_count.mul(2).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).add(1)
new_df = (df2.ffill(axis=1)
             .where(m_ffill, df2)
             .reindex(index=df2.index.repeat(repeats))
             .stack()
             .rename('A')
             .reset_index()
             .loc[:, df.columns])
print(new_df)
Output:
A B
0 c 0
1 c 0
2 c 0
3 q 1
4 z 1
5 q 1
6 z 1
7 b 2
8 m 2
9 b 2
10 m 2
11 g 3
12 p 3
13 g 3
14 p 3
15 a 5
16 d 5
17 u 5
If instead of duplicating we want to multiply by a factor d, we must make the following modifications:
n = 3
d = 2
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower = 1)
EDIT (generalised to any number of value columns):
n = 3  # threshold
d = 2
values = df.columns.difference(['B'])
df2 = (df.assign(columns=df.groupby('B').cumcount())
         .pivot_table(columns='columns',
                      index='B',
                      values=values,
                      aggfunc='first'))
r = max(n, len(df2.columns.get_level_values('columns').unique()))
df2 = df2.reindex(columns=range(r), level='columns')
notNaN_count = df2.count(axis=1).div(len(values))
m_ffill = notNaN_count.mul(d).lt(n)
repeats = notNaN_count.lt(n).mul(~m_ffill).mul(d).clip(lower=1)
new_df = (df2.T
             .groupby(level=0)
             .ffill()
             .T
             .where(m_ffill, df2)
             .reindex(index=df2.index.repeat(repeats))
             .stack()
             .reset_index()
             .loc[:, df.columns])
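If only the literal requirement is needed (duplicate, once, every row whose B value occurs fewer than n times), a much shorter sketch works; note it does not reproduce the fill-up-to-n refinement of the answer above:
n = 3
counts = df['B'].map(df['B'].value_counts())  # how often each row's B value occurs
new_df = pd.concat([df, df[counts < n]], ignore_index=True)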

Select rows if columns meet condition

I have a DataFrame with 75 columns.
How can I select rows based on a condition in a specific array of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(axis=1)]
But let's say I just want to do this on columns 3:45.
Use ix to slice the columns by ordinal position (note: .ix is deprecated and was removed in pandas 1.0; use .iloc in modern code):
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.ix[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[(df.ix[:, 2:45].values > 1.5).any(axis=1)]
should work.
Indexing is 0-based, and the start of the slice is included while the end is not, so here the 3rd column is included and we slice up to, but not including, the 46th column.
Another solution with iloc; values can be omitted:
# if you need the 3rd to 45th columns
print(df[(df.iloc[:, 2:45] > 1.5).any(axis=1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print(df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print(df[(df.iloc[:, 2:5] > 1.5).any(axis=1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
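If the target columns are known by label rather than position, loc gives the same selection; note that label slices are inclusive on both ends. A sketch on the sample df above, equivalent to iloc[:, 2:5]:
print(df[(df.loc[:, 'c':'e'] > 1.5).any(axis=1)])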