dataframe groupby or slice based on conditions - pandas

I have a dataframe like this:
df_test = pd.DataFrame({'ID1':['A','A','A','B','B','A','B','B','B','B','A','A','A','A'],
'ID2':[1,2,3,1,1,1,6,7,1,2,2,5,6,1]})
df_test
The result dataframe should look like this. Consecutive rows with the same 'ID1' value form a group: a run of 'A' counts as a group only if 'A' repeats at least 2 times in a row, and a run of 'B' only if 'B' repeats at least 3 times in a row. For each qualifying group, calculate the mean of 'ID2':
df_result = pd.DataFrame({'ID1':['A1','B1','A2'],
'mean_ID2':[2,4,3.5]})
df_result

You can use run-length encoding to work out which rows to keep, based on the number of elements in each consecutive run. In the next step, group by consecutive runs again and take the mean.
import pdrle
r = pdrle.encode(df_test.ID1)
r["chk"] = ((r.vals == "A") & (r.runs >=2)) | ((r.vals == "B") & (r.runs >= 3))
df2 = df_test[pdrle.decode(r.chk, r.runs)]
df2.groupby(pdrle.get_id(df2.ID1)).agg({"ID1": "first", "ID2": "mean"})
# ID1 ID2
# ID1
# 0 A 2.0
# 1 B 4.0
# 2 A 3.5
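If installing pdrle is not an option, the same run-length idea can be sketched with plain pandas. This is only a minimal sketch, not the method used in the answer above: the names min_run, run_id, run_len and keep are made up for illustration, and the A1/B1/A2 labels from the question are built with cumcount.

```python
import pandas as pd

df_test = pd.DataFrame({
    'ID1': ['A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A'],
    'ID2': [1, 2, 3, 1, 1, 1, 6, 7, 1, 2, 2, 5, 6, 1],
})

# Hypothetical lookup: minimum run length required for each ID1 value.
min_run = {'A': 2, 'B': 3}

# A new run starts wherever ID1 differs from the previous row.
run_id = (df_test['ID1'] != df_test['ID1'].shift()).cumsum().rename('run')
# Length of the run each row belongs to.
run_len = df_test.groupby(run_id)['ID1'].transform('size')
# Keep only rows whose run is long enough for its ID1 value.
keep = run_len >= df_test['ID1'].map(min_run)

result = (
    df_test[keep]
    .groupby(run_id[keep])
    .agg(ID1=('ID1', 'first'), mean_ID2=('ID2', 'mean'))
    .reset_index(drop=True)
)

# Number the kept runs per letter: A1, B1, A2, ...
result['ID1'] = result['ID1'] + (result.groupby('ID1').cumcount() + 1).astype(str)
print(result)
```

The thresholds live in one dictionary, so adding a rule for another ID1 value only means adding a key.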

Related

Classifying pandas columns according to range limits

I have a dataframe with several numeric columns and their range goes either from 1 to 5 or 1 to 10
I want to create two lists of these columns names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['IP', 'size']
I want a function that takes a dataframe and performs the classification above only on columns whose numbers fall within those ranges.
I know that if a column's max() is 5 then it belongs to 1to5, and likewise to 1to10 when max() is 10.
What I already did:
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10
I tried the above but it's returning the following error msg:
'>=' not supported between instances of 'float' and 'str'
The dtype of the columns is 'object'; maybe this is the reason. If so, how can I fix the function without casting these columns to float? There are several, sometimes hundreds, of these columns, and if I run:
df['column'].max() I get 10 or 5
What's the best way to create this function?
Use:
string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""
temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]
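import pandas as pd
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        # skip non-numeric (object) columns such as 'alpha'
        if df[col].dtype != 'O':
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10
df = pd.DataFrame(data, columns=cols)
# convert only the numeric columns; 'alpha' holds strings and stays object
# (passing dtype=float to the constructor may fail on it in recent pandas versions)
df[cols[1:]] = df[cols[1:]].astype(float)
test(df)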
Output:
(['track', 'batch'], ['IP', 'size'])
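Since the question says the real columns have dtype object, another option is to coerce them inside the function with pd.to_numeric. The following is a minimal sketch under that assumption: the function name split_by_range and the sample frame with string values are made up for illustration, and any column that cannot be parsed as numbers at all (such as 'type') is dropped before the comparison.

```python
import pandas as pd

# Sample data stored as strings, so every column has dtype object.
df = pd.DataFrame({
    'IP': ['1', '9', '10'],
    'track': ['2', '1', '5'],
    'batch': ['3', '2', '5'],
    'size': ['5', '8', '10'],
    'type': ['A', 'B', 'C'],
})

def split_by_range(df):
    # Values that cannot be parsed become NaN instead of raising.
    numeric = df.apply(pd.to_numeric, errors='coerce')
    # Drop columns that contain no numbers at all (e.g. 'type').
    numeric = numeric.dropna(axis=1, how='all')
    col_max = numeric.max()
    names_1to5 = col_max[col_max <= 5].index.tolist()
    names_1to10 = col_max[col_max > 5].index.tolist()
    return names_1to5, names_1to10

names_1to5, names_1to10 = split_by_range(df)
print(names_1to5, names_1to10)
```

The conversion happens on a temporary copy, so the original dataframe keeps its object columns.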

saving dataframe groupby rows to exactly two lines

I got a dataframe and I want to groupby the rows based on a specific column. Number of rows in each group will be at least 4 and at most 50. I want to save one column from the group into two lines. If the groupsize is even, let us say 2n, then n rows in one line and the remaining n in the second line. If it is odd, n+1 and n or n and n+1 will do.
For example,
import pandas as pd
from io import StringIO
data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data))
I want to groupby id
df.groupby('id',sort=False)
and then get a dataframe like
id name
0 1 A B
1 1 C D
2 2 E F ds
3 2 G dsds
Probably not the most efficient solution, but it works:
import numpy as np
df = df.sort_values('id')
# next 3 lines: for each group find the separation
df['range_idx'] = range(0, df.shape[0])
df['mean_rank_group'] = df.groupby(['id'])['range_idx'].transform(np.mean)
df['separate_column'] = df['range_idx'] < df['mean_rank_group']
# groupby itself with the help of additional column
df.groupby(['id', 'separate_column'], as_index=False)['name'].agg(','.join).drop(
columns='separate_column')
This approach is a bit convoluted, but it does the job:
def func(s: pd.Series):
    # split the group's values at the midpoint and join each half with spaces
    mid = max(s.shape[0] // 2, 1)
    l1 = ' '.join(list(s[:mid]))
    l2 = ' '.join(list(s[mid:]))
    return [l1, l2]
df_new = df.groupby('id').agg(func)
df_new["name1"] = df_new["name"].apply(lambda x: x[0])
df_new["name2"] = df_new["name"].apply(lambda x: x[1])
df = df_new.drop(labels="name", axis=1).stack().reset_index().drop(labels=["level_1"], axis=1).rename(columns={0: "name"}).set_index("id")
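A more direct variant is to label each row as belonging to the first or second half of its group and then group on that label. This is a minimal sketch, not either answer's method: the names pos, size, half and out are made up for illustration, and skipinitialspace=True is added to strip the leading space in " dsds".

```python
import pandas as pd
from io import StringIO

data = """
id,name
1,A
1,B
1,C
1,D
2,E
2,F
2,ds
2,G
2, dsds
"""
df = pd.read_csv(StringIO(data), skipinitialspace=True)

# Position of each row inside its group, and the size of that group.
pos = df.groupby('id').cumcount()
size = df.groupby('id')['name'].transform('size')
# 0 for the first ceil(size / 2) rows of a group, 1 for the rest.
half = (pos >= (size + 1) // 2).astype(int).rename('half')

out = (
    df.groupby(['id', half], sort=False)['name']
    .agg(' '.join)
    .reset_index(level='half', drop=True)
    .reset_index()
)
print(out)
```

For an odd group size this puts the extra row on the first line (n+1 and n), which the question allows.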

matching consecutive pairs in pd.Series

I have a DataFrame which looks like this :-
ID | act
1 A
1 B
1 C
1 D
2 A
2 B
3 A
3 C
I am trying to get the IDs where an activity act1 is followed by another act2, for example, A is followed by B. In that case, I want to get [1,2] as the ids. How do I go about this in a vectorized manner?
Edit :- Expected output : For the sample df defined above, the output should be a list/Series of all the IDs where A is followed immediately by B
IDs
1
2
Here is a simple, vectorised way to do it!
df.loc[(df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1)), 'ID']
Output:
0 1
4 2
Name: ID, dtype: int64
Another way of writing this, possibly clearer:
conditions = (df.act == 'A') & (df.act.shift(-1) == 'B') & (df.ID == df.ID.shift(-1))
df.loc[conditions, 'ID']
Boolean masks make it easy to filter on one or many conditions. The combined mask is then used to select rows from your dataframe.
Here is one approach: group by ID without sorting, since we need to track B immediately following A in the dataframe's current row order. Then concatenate each group's activities with str.cat, check whether 'A,B' is present, take the index of the matching groups, and convert it to a list.
(df
 .groupby('ID', sort=False)
 .act
 .agg(lambda x: x.str.cat(sep=','))
 .str.contains('A,B')
 .loc[lambda x: x == 1]
 .index.tolist()
)
[1, 2]
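Another approach is to use the shift function and filter:
# note: shift() here ignores ID boundaries, so a B that starts one ID
# can pair with an A that ends the previous ID
df['x'] = df.act.shift()
df.loc[lambda x: (x['act']=='B') & (x['x']=='A')].ID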
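To keep the shift from crossing ID boundaries, the shift can be done per group. The following is a minimal sketch assuming the column names ID and act from the question; next_act and ids are names made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 3, 3],
    'act': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'C'],
})

# Next activity within the same ID; NaN on the last row of each ID.
next_act = df.groupby('ID')['act'].shift(-1)

# IDs where an A is immediately followed by a B.
ids = df.loc[(df['act'] == 'A') & (next_act == 'B'), 'ID'].unique().tolist()
print(ids)
```

Because the shift is computed per ID, no separate check on df.ID == df.ID.shift(-1) is needed.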

pandas dataframe compare first and last row from each group

How do I compare the value of the first row in col b and the last row in col b from grouping by col a, without using the groupby function? Because groupby function is very slow for a large dataset.
a = [1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
b = [1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1]
Return two lists: one has the group names from col a where the last value is larger than the first value, etc.
larger_or_equal = [1,3]
smaller = [2]
All numpy
a = np.array([1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3])
b = np.array([1,0,0,0,0,0,7,8,0,0,0,0,0,4,1,0,0,0,0,0,1])
w = np.where(a[1:] != a[:-1])[0] # find the edges
e = np.append(w, len(a) - 1) # define the end pos
s = np.append(0, w + 1) # define start pos
# Filter the end positions with a boolean array, then index the groups with those end positions.
# I could also have used start positions.
a[e[b[e] >= b[s]]]
a[e[b[e] < b[s]]]
[1 3]
[2]
Here is a solution without groupby. The idea is to shift column a to detect group changes:
df[df['a'].shift() != df['a']]
a b
0 1 1
7 2 8
14 3 1
df[df['a'].shift(-1) != df['a']]
a b
6 1 7
13 2 4
20 3 1
We will compare the column b in those two dataframes. We simply need to reset the index for the pandas comparison to work:
first = df[df['a'].shift() != df['a']].reset_index(drop=True)
last = df[df['a'].shift(-1) != df['a']].reset_index(drop=True)
first.loc[last['b'] >= first['b'], 'a'].values
array([1, 3])
Then do the same with < to get the other groups. Or do a set difference.
As I wrote in the comments, groupby(sort=False) might well be faster depending on your dataset.
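The same first-versus-last comparison can also be sketched with drop_duplicates, which avoids both groupby and the index reset. This is a minimal sketch, assuming each value of a occupies one contiguous block as in the sample data; the names first, last and cmp are made up for illustration.

```python
import pandas as pd

a = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
b = [1, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 1]
df = pd.DataFrame({'a': a, 'b': b})

# First and last row of each group, indexed by the group name.
first = df.drop_duplicates('a', keep='first').set_index('a')['b']
last = df.drop_duplicates('a', keep='last').set_index('a')['b']

# The comparison aligns on the group name, so no reset_index is needed.
cmp = last >= first
larger_or_equal = cmp[cmp].index.tolist()
smaller = cmp[~cmp].index.tolist()
print(larger_or_equal, smaller)
```

Whether this is faster than groupby(sort=False) depends on the dataset, so it is worth timing both.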

Pandas dataframe row removal

I am trying to repair a csv file.
Some data rows need to be removed based on a couple conditions.
Say you have the following dataframe:
-A----B-----C
000---0-----0
000---1-----0
001---0-----1
011---1-----0
001---1-----1
If two or more rows have column A in common, I want to keep the row that has column B set to 1.
The resulting dataframe should look like this:
-A----B-----C
000---1-----0
011---1-----0
001---1-----1
I've experimented with merges and drop_duplicates but cannot seem to get the result I need. It is not certain that the row with column B = 1 will be after a row with B = 0. The take_last argument of drop_duplicates seemed attractive but I don't think it applies here.
Any advice will be greatly appreciated. Thank you.
Not straightforward, but this should work:
DF = pd.DataFrame({'A' : [0,0,1,11,1], 'B' : [0,1,0,1,1], 'C' : [0,0,1,0,1]})
# .ix was removed in pandas 1.0; .loc does the same label-based selection here
DF.loc[DF.groupby('A').apply(lambda df: df[df.B == 1].index[0] if len(df) > 1 else df.index[0])]
A B C
1 0 1 0
4 1 1 1
3 11 1 0
Notes:
groupby divides DF into groups of rows with unique A values i.e. groups with A = 0 (2 rows), A=1 (2 rows) and A=11 (1 row)
Apply then calls the function on each group and assimilates the results
In the function (lambda) I'm looking for the index of row with value B == 1 if there is more than one row in the group, else I use the index of the default row
The result of apply is a list of index values that represent rows with B==1 if more than one row in the group else the default row for given A
The index values are then used to select the corresponding rows with the label-based indexer (.ix in older pandas, .loc in current versions)
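An alternative that avoids groupby.apply is to sort so that rows with B == 1 come last and then keep the last row per A. This is a minimal sketch using the same DF as above, not the answer's own method; it assumes B only takes the values 0 and 1, and the name result is made up for illustration.

```python
import pandas as pd

DF = pd.DataFrame({'A': [0, 0, 1, 11, 1],
                   'B': [0, 1, 0, 1, 1],
                   'C': [0, 0, 1, 0, 1]})

result = (
    DF.sort_values('B', kind='stable')   # rows with B == 1 move to the end
      .drop_duplicates('A', keep='last') # keep the last row for each A
      .sort_index()                      # restore the original row order
)
print(result)
```

A value of A that appears only once is kept as-is, whatever its B value, which matches the behaviour described in the notes above.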
I was able to work my way around pandas to get the result I wanted.
It's not pretty, but it gets the job done:
# Assumption (not shown in the original post): grouped = df.groupby('CARD_NO'),
# and df has a default RangeIndex so the group labels can be used with iloc.
# DataFrame.append was removed in pandas 2.0, so rows are collected in a list.
rows = []
for i in grouped.groups:
    if len(grouped.groups[i]) > 1:
        card_no = i
        print(card_no)
        for a in grouped.groups[card_no]:
            status = df.iloc[a]['STATUS']
            print('iloc:' + str(a) + '\t' + 'status:' + str(status))
            if status == 1:
                print('yes')
                rows.append(dict(CARD_NO=card_no, STATUS=status))
            else:
                print('no')
    else:
        # only 1 record found
        # could be a status of 0 or 1
        # add to dataframe
        print('UNIQUE RECORD')
        card_no = i
        print(card_no)
        status = df.iloc[grouped.groups[card_no][0]]['STATUS']
        print(grouped.groups[card_no][0])
        # print(status)
        print('iloc:' + str(grouped.groups[card_no][0]) + '\t' + 'status:' + str(status))
        rows.append(dict(CARD_NO=card_no, STATUS=status))
res = pd.DataFrame(rows, columns=['CARD_NO', 'STATUS'])
print(res)