How to keep all groups when using pandas groupby + sample with a fraction, even when some groups are very small? - pandas

Let's say I have data like this:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 11, 12],
                   'b': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})
I want to use b to put the data into two groups and sample from each. You can see that group 0 has much more data than group 1. So, if I do:
df1=df.groupby(['b']).apply(lambda x: x.sample(frac=0.1)).reset_index(drop=True)
You will find that group 1 is not sampled at all; it might only be sampled if frac is increased.
So, what should I do to keep all the groups, even the very small ones?

Use sample to shuffle (reorder) the dataframe, then find the minimum row count per group, and then take that many rows from each group with head:
df1 = df.groupby('b').apply(lambda x: x.sample(frac=1)).reset_index(drop=True)
ming = df1.b.value_counts().min()
df1 = df1.groupby('b').head(ming)
df1
Out[287]:
     a  b
0    8  0
12  12  1
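If you would rather keep using sample, a minimal sketch of my own variant is to sample each group by an explicit row count floored at 1, so that even a tiny group is never dropped (frac here plays the role of the original fraction):
frac = 0.1
# sample each group by count instead of fraction; max(1, ...) guarantees that
# even a one-row group such as b == 1 keeps at least one row
df1 = (df.groupby('b', group_keys=False)
         .apply(lambda x: x.sample(n=max(1, int(round(len(x) * frac)))))
         .reset_index(drop=True))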

Related

Select column names in pandas based on multiple prefixes

I have a large dataframe, from which I want to select specific columns that start with several different prefixes. My current solution is shown below:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['flg_1', 'flg_2', 'ab_1', 'ab_2', 'aaa', 'bbb'],
                  data=np.array([1, 2, 3, 4, 5, 6]).reshape(1, -1))
flg_vars = df.filter(regex='^flg_')
ab_vars = df.filter(regex='^ab_')
result = pd.concat([flg_vars, ab_vars], axis=1)
Is there a more efficient way of doing this? I need to filter my original data based on 8 prefixes, which leads to excessive lines of code.
Use | for regex OR:
result = df.filter(regex='^flg_|^ab_')
print (result)
   flg_1  flg_2  ab_1  ab_2
0      1      2     3     4
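If you have many prefixes (the question mentions 8), you can build the regex from a list instead of spelling out every alternative; a small sketch where the prefix list is only a placeholder:
prefixes = ['flg_', 'ab_']  # extend with the other prefixes you need
result = df.filter(regex='^(?:' + '|'.join(prefixes) + ')')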

Creating batches based on city in pandas

I have two different dataframes that I want to fuzzy match against each other to find and remove duplicates. To make the process faster/more accurate I want to fuzzy match only records from both dataframes that are in the same cities. That makes it necessary to create batches based on cities in one dataframe and then run the fuzzy matcher between each batch and the subset of the other dataframe with the same cities. I can't find another post that does this and I am stuck. Here is what I have so far. Thanks!
df = pd.DataFrame({'A':[1,1,2,2,2,2,3,3],'B':['Q','Q','R','R','R','P','L','L'],'origin':['file1','file2','file3','file4','file5','file6','file7','file8']})
cols = ['B']
df1 = df[df.duplicated(subset=cols,keep=False)].copy()
df1 = df1.sort_values(cols)
df1['group'] = 'g' + (df1.groupby(cols).ngroup() + 1).astype(str)
df1['duplicate_count'] = df1.groupby(cols)['origin'].transform('size')
df1_g1 = df1.loc[df1['group'] == 'g1']
print(df1_g1)
which will not factor in anything that isn't duplicated, so if a value only appears once it is skipped, as is the case with 'P' in column B. It also requires me to hard-code the group name each time, which is not ideal. I haven't been able to figure out a for loop or any other method to solve this. Thanks!
You can assign into locals():
variables = locals()
for i, j in df1.groupby('group'):
    variables["df1_{0}".format(i)] = j
df1_g1
Out[314]:
   A  B origin group  duplicate_count
6  3  L  file7    g1                2
7  3  L  file8    g1                2
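Assigning into locals() works in an interactive session, but a plain dict of dataframes is usually easier to loop over when you later run the fuzzy matcher per city batch; a minimal sketch (the name batches is my own):
# keep one sub-dataframe per group in a dict instead of separate variables
batches = {name: g for name, g in df1.groupby('group')}
batches['g1']  # same rows as df1_g1 above
for name, batch in batches.items():
    pass  # run the fuzzy match for this city batch here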

Calculate percentage of rows with a column above a certain value Pandas

I have a sample of a large dataset as below.
I would like to get the percentage of rows with a value above 30, which would give me an output as below.
How would I go about achieving this with pandas? I have gotten to this last point of processing my data and am a bit stuck.
You can compare the values for being greater than 30 and aggregate with mean:
df = (df.B > 30).groupby(df['A']).mean().mul(100).reset_index(name='C')
print (df)
   A     C
0  r  60.0
Or:
df = df.assign(C = df.B > 30).groupby('A')['C'].mean().mul(100).reset_index()
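Since the sample data is only shown as an image, here is a minimal sketch with made-up values (A as the grouping column, B as the numeric one) that reproduces the 60.0 figure:
import pandas as pd

# made-up data: 3 of the 5 rows in group 'r' have B above 30
df = pd.DataFrame({'A': ['r', 'r', 'r', 'r', 'r'],
                   'B': [10, 40, 50, 31, 20]})
df = (df.B > 30).groupby(df['A']).mean().mul(100).reset_index(name='C')
print (df)
   A     C
0  r  60.0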

How to count the number of rows in terms of user id in pandas

I have a dataframe; it consists of many users and their respective actions.
What I need pandas to do is count the number of rows per user_iD:
let's say if user_iD = 1 is repeated 30 or more times, it should remain in the dataframe; otherwise pandas should remove all the user_iD entries that appear fewer than 30 times.
This could solve your problem.
userid_counts = A.user_iD.value_counts()
mask = userid_counts >= 30
filtered_userids = mask[mask].index
A = A[A.user_iD.isin(filtered_userids)]
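An equivalent one-liner, if you prefer to filter on the group sizes directly (a sketch assuming the dataframe is called A, as above):
# keep only rows whose user_iD occurs at least 30 times
A = A[A.groupby('user_iD')['user_iD'].transform('size') >= 30]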

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way that I can use it to take only the second-last observed row, or any other way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A':[1,1,1,2,2,2],'B':[1,2,3,4,5,6]})
to
df1 = pd.DataFrame(data={'A':[1,2],'B':[2,5]})
The idea is to group the data by the duplicated column and then check the length of each group: if the length is greater than or equal to 2, you can slice the second element of the group; if the group has length one, the value is not duplicated, so take index 0, the only element in the group.
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
The first answer I think was on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations, and an 'A' group with one observation, for the sake of completeness.
import pandas as pd
df = pd.DataFrame(data={'A':[1,1,1,2,2,2, 3, 3, 4],'B':[1,2,3,4,5,6, 7, 8, 9]})
def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return
df.groupby('A').apply(user_apply_func)
Out[7]:
     A    B
A
1    1    2
2    2    5
3    3    7
4  NaN  NaN
For your reference, the apply method automatically passes each group's dataframe as the first argument.
Also, as you are always going to be reducing each group of data to a single observation, you could also use the agg method (aggregate). apply is more flexible in terms of the length of the sequences that can be returned, whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)
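For completeness, a sketch of a variant that keeps the original rows (and their index) by slicing each group with iloc, falling back to the only row when a group has a single observation:
# one-row DataFrame per group: second-last row when available, otherwise the last
(df.groupby('A', group_keys=False)
   .apply(lambda g: g.iloc[[-2]] if len(g) >= 2 else g.iloc[[-1]]))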