Compare two columns with different sizes - pandas

Suppose we have the same column in two dataframes, but the dataframes have different sizes. How do we compare the two columns and get the indices of the rows whose values match in both? For example, age is common to df1 and df2, but df1 has 1000 rows and df2 has 200 rows -- how do I get the indices of the rows that have the same age value?

You can use .loc to select rows by the other frame's index labels (in the example below, df1 is the shorter frame):
df1.age < df2.loc[df1.index].age
Example:
df1 = pd.DataFrame({'age':np.random.randint(1,10,10)})
df2 = pd.DataFrame({'age':np.random.randint(1,10,20)})
Output:
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 False
9 False
Name: age, dtype: bool
To get everything in one dataframe:
df1.assign(age_2=df2.loc[df1.index, 'age'], cond=df1.age < df2.loc[df1.index, 'age'])
Output:
age age_2 cond
0 3 5 True
1 3 8 True
2 6 6 False
3 4 7 True
4 4 7 True
5 5 2 False
6 2 2 False
7 3 7 True
8 6 3 False
9 5 4 False
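The same alignment works for equality, which is what the question actually asks for. A minimal sketch, assuming df1 is the shorter frame as in the example above (same_age and matching_idx are just illustrative names):
# boolean mask of rows whose ages agree after aligning df2 on df1's index
same_age = df1['age'].eq(df2.loc[df1.index, 'age'])
# the index labels where both frames have the same age
matching_idx = df1.index[same_age]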

Related

Label the first element in each groupby

I have a data frame that looks like the following
df = pd.DataFrame({'group':[1,1,2,2,2],'time':[1,2,3,4,5],'C':[6,7,8,9,10]})
group time C
0 1 1 6
1 1 2 7
2 2 3 8
3 2 4 9
4 2 5 10
and I'm looking to label the first element (in terms of time) in each group as True, i.e.:
group time C first_in_group
0 1 1 6 True
1 1 2 7 False
2 2 3 8 True
3 2 4 9 False
4 2 5 10 False
I tried several combinations of groupby and first but did not manage to achieve what I wanted.
Is there an elegant way to do it in Pandas?
Use duplicated:
df['first_in_group'] = ~df.group.duplicated()
OUTPUT:
group time C first_in_group
0 1 1 6 True
1 1 2 7 False
2 2 3 8 True
3 2 4 9 False
4 2 5 10 False
NOTE: Do the sorting first (if required).
df = df.sort_values(['group', 'time'])
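If you prefer going through groupby as the question attempted, a sketch using transform on time (this assumes the first element per group is the one with the smallest time, and that the minimum time is unique within each group):
# flag the row whose time equals its group's minimum time
df['first_in_group'] = df['time'].eq(df.groupby('group')['time'].transform('min'))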

Dataframe count of columns matching value in another column in that row

How do I find, for each row, the count of columns whose value matches a specified column, in a dataframe with a large number of rows?
For instance, given the df below:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 4)), columns=list('ABCD'))
df.index.name = 'id'
A B C D
id
0 7 6 6 2
1 6 5 3 5
2 8 8 0 9
3 0 2 8 9
4 4 3 8 5
bc_cols = ['B', 'C']
df['BC_max'] = df[bc_cols].max(axis=1)
A B C D BC_max
id
0 7 6 6 2 6
1 6 5 3 5 5
2 8 8 0 9 8
3 0 2 8 9 8
4 4 3 8 5 8
For each row, we want to get the number of columns whose value matches the max. I was able to get it by doing this:
df["BC_freq"] = df[bc_cols].stack().groupby(by='id').apply(lambda g: g[g==g.max()].count())
A B C D BC_max BC_freq
id
0 7 6 6 2 6 2
1 6 5 3 5 5 1
2 8 8 0 9 8 1
3 0 2 8 9 8 1
4 4 3 8 5 8 1
But this is turning out to be very inefficient and slow. We need to do this on a fairly large dataframe with several hundred thousand rows, so I am looking for an efficient way to do this. Any ideas?
Once you have BC_max why not re-use it:
def get_bc_freq(row):
    if (row.B == row.BC_max) and (row.C == row.BC_max):
        return 2
    elif (row.B == row.BC_max) or (row.C == row.BC_max):
        return 1
    return 0
df['freq'] = df.apply(lambda row: get_bc_freq(row), axis=1)
Or the prettier one-liner:
df['freq'] = df.apply(lambda row: [row.B, row.C].count(row.BC_max), axis=1)
UPDATE - to make the columns you use more dynamic, you could use a list comprehension (not sure how much this helps with performance, but...):
cols_to_use = ['B', 'C']
df['freq'] = df.apply(lambda row: [row[x] for x in cols_to_use].count(row.BC_max), axis=1)
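A vectorized comparison avoids the per-row Python call entirely, which usually matters at several hundred thousand rows. A sketch reusing cols_to_use from above and assuming BC_max has already been computed:
# compare the chosen columns against BC_max row-wise and count the matches per row
df['BC_freq'] = df[cols_to_use].eq(df['BC_max'], axis=0).sum(axis=1)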

Mask minimum values per group in a `pd.DataFrame`

Given a pd.DataFrame containing different time series in different groups, I want to create a mask over all rows that indicates, per group, at which timepoint the minimum of value is reached with respect to type 0:
For example, given the pd.DataFrame:
>>> df
group type time value
0 A 0 0 4
1 A 0 1 5
2 A 1 0 6
3 A 1 1 7
4 B 0 0 11
5 B 0 1 10
6 B 1 0 9
7 B 1 1 8
In group A the minimum for type 0 is reached at timepoint 0. For group B the minimum for type 0 is reached at timepoint 1. Therefore, the resulting column should look like:
is_min
0 True
1 False
2 True
3 False
4 False
5 True
6 False
7 True
I have created a version that seems very cumbersome, first finding the minima locations and then constructing the final column:
def get_minima(df):
    type_mask = df.type == 0
    min_value = df[type_mask].value.min()
    value_mask = df.value == min_value
    return df[type_mask & value_mask].time.max()

min_ts = df.groupby('group').apply(get_minima)
df['is_min'] = df.apply(lambda row: min_ts[row.group] == row.time, axis=1)
IIUC, you can try groupby + apply with min:
df['is_min'] = df.groupby(['group','type'])['value'].apply(lambda x: x == x.min())
The same with transform + min to get the minimum per group, and eq to create the desired mask:
df['is_min'] = df.groupby(['group','type'])['value'].transform('min').eq(df['value'])
Output:
df
group type time value is_min
0 A 0 0 4 True
1 A 0 1 5 False
2 A 1 0 6 True
3 A 1 1 7 False
4 B 0 0 11 False
5 B 0 1 10 True
6 B 1 0 9 False
7 B 1 1 8 True
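Note that the question literally asks to mark, within each group, every row (of any type) at the timepoint where type 0 reaches its minimum; on this example data that coincides with the per-(group, type) minima above. A sketch for that literal reading (min_time is just an illustrative name):
# time of the minimal type-0 value per group
min_time = (df[df['type'].eq(0)]
              .sort_values('value')
              .drop_duplicates('group')
              .set_index('group')['time'])
# flag every row of a group that sits at that timepoint
df['is_min'] = df['time'].eq(df['group'].map(min_time))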
You can remove the rows with an excluding merge: sort the values, subset to only type == 0, and drop_duplicates to get the times per group you need to exclude. Then merge with an indicator to exclude them.
m = (df.sort_values('value').query('type == 0').drop_duplicates('group')
       .drop(columns=['type', 'value']))
# group time
#0 A 0
#5 B 1
df = (df.merge(m, how='outer', indicator=True).query('_merge == "left_only"')
        .drop(columns='_merge'))
group type time value
2 A 0 1 5
3 A 1 1 7
4 B 0 0 11
5 B 1 0 9
If you separately need the mask and don't want to automatically subset the rows with query, map the indicator:
df = df.merge(m, how='outer', indicator='is_min')
df['is_min'] = df['is_min'].map({'left_only': False, 'both': True})
group type time value is_min
0 A 0 0 4 True
1 A 1 0 6 True
2 A 0 1 5 False
3 A 1 1 7 False
4 B 0 0 11 False
5 B 1 0 9 False
6 B 0 1 10 True
7 B 1 1 8 True
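Keep in mind that an outer merge resets the index and sorts by the join keys, which is why the rows above are no longer in their original order. A sketch that preserves the original index and order, assuming df is the frame before the merge (out is just an illustrative name):
# a left merge keeps the left frame's order; the saved index restores the labels
out = (df.reset_index()
         .merge(m, how='left', on=['group', 'time'], indicator='is_min')
         .set_index('index')
         .rename_axis(None))
out['is_min'] = out['is_min'].eq('both')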

Pandas: fetch rows that continuously have a similar value

I have a dataframe like so:
id time status
-- ---- ------
a 1 T
a 2 F
b 1 T
b 2 T
a 3 T
a 4 T
b 3 F
b 4 T
b 5 T
I would like to fetch the ids that continuously have the status 'T' for a certain threshold number of times (say 2 in this case).
Thus the fetched rows would be...
id time status
-- ---- ------
b 1 T
b 2 T
a 3 T
a 4 T
b 4 T
b 5 T
I can think of an iterative solution. What I am looking for is something more pandas/SQL-like. I think an order by id and then time, followed by a group by on id and then status, should work, but I'd like to be sure.
Compare the values to T with Series.eq, build ids for the consecutive runs with Series.shift and Series.cumsum, count each run's size with Series.value_counts and map it back with Series.map to get counts per consecutive group. Then compare against the threshold with Series.ge and finally filter by boolean indexing, chaining both masks with bitwise AND:
N = 2
m1 = df['status'].eq('T')
g = df['status'].ne(df['status'].shift()).cumsum()
m2 = g.map(g.value_counts()).ge(N)
df = df[m1 & m2]
print (df)
id time status
2 b 1 T
3 b 2 T
4 a 3 T
5 a 4 T
7 b 4 T
8 b 5 T
Details:
print (df.assign(m1=m1, g=g, counts=g.map(g.value_counts()), m2=m2))
id time status m1 g counts m2
0 a 1 T True 1 1 False
1 a 2 F False 2 1 False
2 b 1 T True 3 4 True
3 b 2 T True 3 4 True
4 a 3 T True 3 4 True
5 a 4 T True 3 4 True
6 b 3 F False 4 1 False
7 b 4 T True 5 2 True
8 b 5 T True 5 2 True
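Note that g numbers consecutive runs over the frame as given, so a single run of T can span two different ids (rows 2-5 form one run of 4 above). On this data the result is the same, but if ids should be treated separately, a sketch that sorts first and restarts the run counter at id boundaries:
N = 2
df = df.sort_values(['id', 'time'])
m1 = df['status'].eq('T')
# start a new run whenever either the status or the id changes
g = (df['status'].ne(df['status'].shift()) | df['id'].ne(df['id'].shift())).cumsum()
m2 = g.map(g.value_counts()).ge(N)
df = df[m1 & m2]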

Delete duplicates according to values in Pandas

I have a dataframe like this:
id col1
2 T
2 T
4 R
4 T
6 G
6 G
I want to deduplicate this way:
If I have T and T for the same id, I want to keep both lines
If I have G or R and G or R for the same id, I want to keep both lines
If I have T and (G or R) for the same id, I just want to keep the line with T (delete one of the two lines)
I want this result :)
id col1
2 T
2 T
4 T
6 G
6 G
Thank you :)
Use boolean indexing for filtering:
m1 = df['col1'].eq('T')
m2 = m1.groupby(df['id']).transform('sum').ne(1)
df = df[m1 | m2]
print (df)
id col1
0 2 T
1 2 T
3 4 T
4 6 G
5 6 G
Explanation:
Compare col1 for T with eq (==):
m1 = df['col1'].eq('T')
print (m1)
0 True
1 True
2 False
3 True
4 False
5 False
Name: col1, dtype: bool
Count True values per group by transform with sum:
print (m1.groupby(df['id']).transform('sum'))
0 2.0
1 2.0
2 1.0
3 1.0
4 0.0
5 0.0
Name: col1, dtype: float64
Compare for not equal to 1 by ne (!=):
m2 = m1.groupby(df['id']).transform('sum').ne(1)
print (m2)
0 True
1 True
2 False
3 False
4 True
5 True
Name: col1, dtype: bool
And chain together by | for bitwise OR:
print (m1 | m2)
0 True
1 True
2 False
3 True
4 True
5 True
Name: col1, dtype: bool