How to check if a row does not exist in another column? - pandas

import pandas as pd
import numpy as np
from numpy.random import randint
dict_1 = {'Col1':[1,1,1,1,2,4,5,6,7],'Col2':[3,3,3,3,2,4,5,6,7]}
df = pd.DataFrame(dict_1)
filt = df.apply(lambda x: x['Col2'] not in df['Col1'],axis = 1)
print(filt)
That is what I tried. The expected output is:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
The actual result is:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
It only gives False no matter what I do, and I am not sure how to fix that.

IIUC, here's one way:
filt = ~df.Col2.isin(df.Col1.unique())
OUTPUT:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
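A likely reason the original attempt returns only False: the `in` operator on a Series tests the index labels, not the values, and every value of Col2 happens to be a valid label in 0..8. The sketch below illustrates this on the question's data; the variable names `label_hit` and `value_hit` are introduced here for illustration only. It also shows that `isin` accepts the column directly, so the call to `unique()` is optional.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 1, 1, 1, 2, 4, 5, 6, 7],
                   "Col2": [3, 3, 3, 3, 2, 4, 5, 6, 7]})

# `in` on a Series looks at the index labels (0..8 here), not the values
label_hit = 3 in df["Col1"]                 # 3 is an index label
value_hit = bool(3 in df["Col1"].values)    # 3 is not a value of Col1

# isin compares values, so negating it gives "Col2 value not found in Col1"
filt = ~df["Col2"].isin(df["Col1"])
print(label_hit, value_hit)
print(filt)
```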

In general, the df.COLUMN attribute notation has drawbacks: it only works for column names that are valid Python identifiers and that do not clash with existing DataFrame attributes or methods. The bracket notation always works:
~df["Col2"].isin(df["Col1"].unique())
Remember that when using square brackets instead of dot notation, single square brackets return a Series, while double square brackets return a DataFrame.
isinstance(df["Col2"], pd.Series)
OUTPUT:
True
Versus
isinstance(df[["Col2"]], pd.DataFrame)
OUTPUT:
True

Related

How to count the rows between two conditions?

When the start column is true, start counting.
When the end column is true, stop counting.
Input:
import pandas as pd
df=pd.DataFrame()
df['start']=[False,True,False,False,False,True,False,False,False]
df['end']= [False,False,False,True,False,False,False,True,False]
Expected Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
You can use cumsum to compute the groups, groupby.cummax to identify the values after a start (and later mask with where) and groupby.cumcount to increment a counter:
# make groups between start/end
group = (df['start']|df['end']).cumsum()
# identify values after a start and before an end
mask = df['start'].groupby(group).cummax()
# compute a cumcount and mask with the above "mask"
df['expected'] = df.groupby(group).cumcount().add(1).where(mask, 0)
Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
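To see how the three steps fit together, here is a self-contained sketch that keeps each intermediate result in its own variable. The name `counter` is introduced here for illustration, and the cast to int before `cummax` is a defensive assumption in case `cummax` on boolean data behaves differently across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "start": [False, True, False, False, False, True, False, False, False],
    "end":   [False, False, False, True, False, False, False, True, False],
})

# a new group begins at every start row and at every end row
group = (df["start"] | df["end"]).cumsum()

# within a group, every row from a start onwards is flagged
mask = df["start"].astype(int).groupby(group).cummax().astype(bool)

# 1-based position of each row inside its group
counter = df.groupby(group).cumcount() + 1

# keep the counter only where the mask is set, otherwise 0
df["expected"] = counter.where(mask, 0)
print(pd.DataFrame({"group": group, "mask": mask, "counter": counter}))
print(df)
```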

How to AND two boolean Series from a DataFrame?

A boolean Series:
m = df_device_commission[['X']].gt(0).any(axis=1)
print(m)
0 True
1 True
2 True
3 False
4 True
dtype: bool
Another boolean Series:
n = df_device_commission[['Y']].notna().any(axis=1)
print(n)
0 False
1 False
2 False
3 False
4 True
dtype: bool
If I want to compute m AND n, how should I write the code?
You can use numpy.logical_and:
In [377]: import numpy as np
In [378]: np.logical_and(m,n)
Out[378]:
0 False
1 False
2 False
3 False
4 True
dtype: bool
You can use the & operator.
>>> m & n
0 False
1 False
2 False
3 False
4 True
dtype: bool
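Two practical points may be worth adding. The Python keyword `and` does not work element-wise on Series (it raises a ValueError), and `&` binds more tightly than comparisons, so inline conditions need parentheses. A minimal sketch, assuming made-up data for `df_device_commission` chosen to reproduce the `m` and `n` shown in the question:

```python
import numpy as np
import pandas as pd

# hypothetical data, not from the original post
df_device_commission = pd.DataFrame({
    "X": [1, 2, 3, 0, 5],
    "Y": [np.nan, np.nan, np.nan, np.nan, 1.0],
})

m = df_device_commission["X"] > 0
n = df_device_commission["Y"].notna()

# element-wise AND of two boolean Series
both = m & n

# when written inline, wrap each comparison in parentheses
inline = (df_device_commission["X"] > 0) & (df_device_commission["Y"].notna())

# the plain `and` keyword asks for a single truth value and fails
try:
    m and n
    and_raised = False
except ValueError:
    and_raised = True

print(both)
```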

Pandas True False Matching

For this table:
I would like to generate the 'desired_output' column. One way to achieve this may be:
All the True values from col_1 are transferred straight across to desired_output (red arrow)
In desired_output, place a True value above any existing True value (green arrow)
Code I have tried:
df['desired_output']=df.col_1.apply(lambda x: True if x.shift()==True else False)
Thank you
You can use | (bitwise OR) to combine the original column with its values shifted up by one row using Series.shift:
d = {"col1":[False,True,True,True,False,True,False,False,True,False,False,False]}
df = pd.DataFrame(d)
df['new'] = df.col1 | df.col1.shift(-1)
print (df)
col1 new
0 False True
1 True True
2 True True
3 True True
4 False True
5 True True
6 False False
7 False True
8 True True
9 False False
10 False False
11 False False
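As a small variation, `shift(-1)` leaves a missing value in the last row, which can change the dtype of the shifted column. The sketch below passes `fill_value=False` so that no NaN is introduced; the variable name `next_row` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [False, True, True, True, False, True,
                            False, False, True, False, False, False]})

# value of the row below; the last row has no row below, so use False there
next_row = df["col1"].shift(-1, fill_value=False)

# True where the row itself or the row below it is True
df["new"] = df["col1"] | next_row
print(df)
```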
Try this:
df['desired_output'] = df['col_1']
# OR each row with the row below it, so a True is placed above every existing True
df.loc[df.index[:-1], 'desired_output'] = df.col_1[:-1].values | df.col_1[1:].values
print(df)
In case the values are stored as strings ('True' / 'False'; adjust the comparison if yours are all caps, 'TRUE' / 'FALSE'):
Input:
col_1
0 True
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 False
Code:
df['desired'] = df['col_1']
for i, e in enumerate(df['col_1']):
    # copy each 'True' to the row above; row 0 has no row above, so skip it
    if e == 'True' and i > 0:
        df.at[i - 1, 'desired'] = e
df
Output:
col_1 desired
0 True True
1 True True
2 False True
3 True True
4 True True
5 False False
6 False True
7 True True
8 False False

How to get count for the non duplicates in column

This is my code to count the duplicates. How can I negate it, i.e. count the non-duplicates?
df.duplicated(subset='col', keep='last').sum()
len(df['col'])-len(df['col'].drop_duplicates())
I think you need DataFrame.duplicated with keep=False to mark all duplicates, then invert the mask and use sum to count the True values:
df = pd.DataFrame({'col':[1,2,2,3,3,3,4,5,5]})
print (df.duplicated(subset='col', keep=False))
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
dtype: bool
print (~df.duplicated(subset='col', keep=False))
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
print ((~df.duplicated(subset='col', keep=False)).sum())
2
Another solution uses Series.drop_duplicates with keep=False and takes the length of the resulting Series:
print (df['col'].drop_duplicates(keep=False))
0 1
6 4
Name: col, dtype: int64
print (len(df['col'].drop_duplicates(keep=False)))
2
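One more way to get the same count, offered as a sketch rather than a recommendation, is to count how often each value occurs with Series.value_counts and keep those that occur exactly once. The name `n_unique_once` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 2, 3, 3, 3, 4, 5, 5]})

# occurrences per distinct value
counts = df["col"].value_counts()

# number of values that appear exactly once
n_unique_once = int((counts == 1).sum())
print(n_unique_once)
```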

Find the min/max of rows with overlapping column values, create new column to represent the full range of both

I'm using Pandas DataFrames. I'm looking to identify all rows where both columns A and B == True, then mark in column C all adjacent rows on either side of that intersection where only one of A or B is still True. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however this does not take into account the overlap need.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff().ne(0).cumsum())
grp_b = df.loc[(df['B'] == True), 'B'].groupby(df['B'].astype('int').diff().ne(0).cumsum())
grp_c = df.loc[(df['C'] == True), 'C'].groupby(df['C'].astype('int').diff().ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?
Try this:
# Input df: just columns 'A' and 'B' (copy to avoid SettingWithCopyWarning)
df = df[['A','B']].copy()
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' as the row-wise minimum; this assigns True to C where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
we can find all of the records where A and B are both False. Then we use cumsum to keep a running count of those False False records. This gives us a grouping: each False False record starts a new group, which extends until the next False False record increments the count.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe with the newly assigned column C by this grouping created with cumsum. Then take the maximum value of column C from that group. So, if the group has a True True record, assign True to all the records in that group. Lastly, use mask to turn the first False False record back to False.
df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
Finally, assign that series to df['C'], overwriting the temporary C created by assign in the statement.
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
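The same idea can be spelled out with named intermediates, which may be easier to follow than the one-liner. This is a sketch on the question's sample data, not the answer's exact code: the names `both`, `either`, `block`, and `has_overlap` are introduced here, and the cast to int before `transform("max")` is a defensive assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [False, True, True, True, False, False, True, True],
    "B": [False, False, True, True, True, False, False, False],
})

both = df["A"] & df["B"]      # rows where A and B overlap
either = df["A"] | df["B"]    # rows where at least one is True

# every row with neither flag set starts a new block
block = (~either).cumsum()

# a block qualifies if it contains at least one overlapping row
has_overlap = both.astype(int).groupby(block).transform("max").astype(bool)

# mark rows of qualifying blocks, except the False/False row that opens the block
df["C"] = has_overlap & either
print(df)
```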