Determine if columns of a pandas dataframe uniquely identify the rows

I'm looking for a way to determine whether a column or set of columns of a pandas dataframe uniquely identifies the rows of that dataframe. I've seen this called the isid command in Stata.
The best I can think of is to get the unique values of a subset of columns using a set comprehension, and asserting that there are as many values in the set as there are rows in the dataframe:
subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert len(unique_vals) == len(df)
This isn't the most elegant answer in the world, so I'm wondering if there's a built-in function that does this, or perhaps a way to test if a subset of columns are a uniquely-valued index.

You could make an index and check its is_unique attribute:
import pandas as pd
df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))
df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))
print(df1.set_index(['A','B']).index.is_unique)
# False
print(df2.set_index(['A','B']).index.is_unique)
# True
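If you want this as a reusable check, you could wrap it in a small helper echoing Stata's isid (a sketch; the function name is_id is made up here):
def is_id(df, cols):
    # True if the given columns uniquely identify the rows
    return df.set_index(cols).index.is_unique
is_id(df2, ['A', 'B'])
# True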

Maybe use groupby and size:
df.groupby(['x','y']).size()==1
Out[308]:
x y
1 a True
2 b True
3 c True
4 d False
dtype: bool
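To answer the original question with a single boolean, this per-group result can be reduced with all() (a small extension of the answer, assuming the same df as in the answer below):
(df.groupby(['x','y']).size() == 1).all()
# False here, since the (4, d) pair appears twice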

You can check
df[['x', 'y']].apply(tuple, axis=1).duplicated(keep=False).any()
to see if there are any duplicated rows with respect to the values in columns x and y.
Example:
df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})
x y
0 1 a
1 2 b
2 3 c
3 4 d
4 4 d
Applying tuple row-wise gives
0 (1, a)
1 (2, b)
2 (3, c)
3 (4, d)
4 (4, d)
dtype: object
then check which are duplicated()
0 False
1 False
2 False
3 True
4 True
dtype: bool
Notice that converting to tuples might not be necessary, since duplicated also works on the dataframe directly
df.duplicated(keep=False)
0 False
1 False
2 False
3 True
4 True
dtype: bool
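duplicated also takes a subset argument, so the check can be restricted to columns x and y without building tuples (a minimal variant):
df.duplicated(subset=['x', 'y'], keep=False).any()
# True, because rows 3 and 4 share (4, 'd')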

Related

Pandas groupby with isin for consecutive groups

I have a dataframe that looks like the following:
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
My desired output is booleans that represent whether the value in the second column appears in the next consecutive group, where the groups are given by the values in the first column. So, for example, 4 shows up in group 0 and in the next consecutive group, group 1:
output = pd.DataFrame([[False],[False],[True],[False],[True],[True],[np.nan],[np.nan],[np.nan]])
The outputs for group 2 would be NaN because group 3 doesn't exist.
So far I have tried this:
output = arr.groupby([0])[1].isin(arr.groupby([0])[1].shift(periods=-1))
This doesn't work because I can't apply the isin() on a groupby series.
You could create a helper column with lists of shifted group items, then check against that with a function that returns True, False, or NaN:
import pandas as pd
import numpy as np
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
arr = pd.merge(arr, arr.groupby([0]).agg(list).shift(-1).reset_index(), on=[0], how='outer')
def check_columns(row):
    try:
        # True if the row's value appears in the list of next-group values
        if row['1_x'] in row['1_y']:
            return True
        else:
            return False
    except TypeError:
        # the last group has no successor, so row['1_y'] is NaN
        return np.nan
arr.apply(check_columns, axis=1)
Result:
0 False
1 False
2 True
3 False
4 True
5 True
6 NaN
7 NaN
8 NaN
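A variant sketch that avoids the merge and helper column, starting from the original arr (before the merge above): map each group to the set of values in the following group, then test membership per row (next_vals is a name made up here):
next_vals = arr.groupby(0)[1].agg(set).shift(-1)  # next group's values; NaN for the last group
arr.apply(lambda row: row[1] in next_vals[row[0]] if isinstance(next_vals[row[0]], set) else np.nan, axis=1)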

Sign check on pandas dataframe

I have a dataframe (df) like this:
A B C D
0 -0.01961 -0.01412 0.013277 0.013277
1 -0.021173 0.001205 0.01659 0.01659
2 -0.026254 0.009932 0.028451 0.012826
How could I efficiently check if there is ANY column where column values do not have the same sign?
Check with np.sign and nunique
np.sign(df).nunique()!=1
Out[151]:
A False
B True
C False
D False
dtype: bool
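Since the question asks whether ANY such column exists, the per-column mask can be collapsed with any() (a one-line extension):
(np.sign(df).nunique() != 1).any()
# True, because of column B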

change value in row of dataframe on condition

I have a dataframe with certain values and want to change values in one row on a condition: if a value is greater than x, I want to change it to zero. I tried with .loc, but somehow I get a KeyError every time I try. Does .loc work to select rows instead of columns? I used it for columns before, but I can't get it to work for rows.
df = pd.DataFrame({'a': np.random.randn(4), 'b': np.random.randn(4), 'c': np.random.randn(4)})
print(df)
df.loc['Total'] = df.sum()
df.loc[(df['Total'] < x), ['Total']] = 0
I also tried using iloc, but I get another error. I don't think it's a complex problem, but I'm kind of stuck, so help would be much appreciated!
You can assign values with loc: first select the row to replace by its string label, here 'Total', and then compare the values of this row, again selected with loc, which returns a boolean mask:
np.random.seed(2019)
df = pd.DataFrame({'a': np.random.randn(4), 'b': np.random.randn(4), 'c': np.random.randn(4)})
print(df)
a b c
0 -0.217679 -0.361865 -0.235634
1 0.821455 0.685609 0.953490
2 1.481278 0.573761 -1.689625
3 1.331864 0.287728 -0.344943
df.loc['Total'] = df.sum()
x = 1
df.loc['Total', df.loc['Total'] < x] = 0
print (df)
a b c
0 -0.217679 -0.361865 -0.235634
1 0.821455 0.685609 0.953490
2 1.481278 0.573761 -1.689625
3 1.331864 0.287728 -0.344943
Total 3.416918 1.185233 0.000000
Detail:
print (df.loc['Total'] < x)
a False
b False
c True
Name: Total, dtype: bool
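As an aside, the KeyError in the original attempt comes from df['Total'], which looks up a column named 'Total'; no such column exists, since 'Total' is a row label here:
df['Total']      # KeyError: 'Total' is not a column
df.loc['Total']  # selects the row labeled 'Total'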

Pandas Row Select Where String Starts With Any Item In List

I want to select a subset of rows in a pandas dataframe, based on a particular string column, where the value starts with any of the values in a list.
A small version of this:
df = pd.DataFrame({'a': ['aa10', 'aa11', 'bb13', 'cc14']})
valids = ['aa', 'bb']
So I want just those rows where a starts with aa or bb in this case.
You need startswith, which accepts a tuple of prefixes:
df.a.str.startswith(tuple(valids))
Out[191]:
0 True
1 True
2 True
3 False
Name: a, dtype: bool
Then filter the original df with that mask:
df[df.a.str.startswith(tuple(valids))]
Out[192]:
a
0 aa10
1 aa11
2 bb13
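If you prefer a regex over the tuple, str.match anchors the pattern at the start of each string, so an equivalent sketch is:
import re
df[df.a.str.match('|'.join(map(re.escape, valids)))]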

How to remove rows from a dataframe based on their column values existence in another df?

Given two dataframes A and B, which both have columns 'x', 'y', how can I efficiently remove all rows in A whose (x, y) pairs appear in B?
I thought about implementing it using a row iterator on A and then checking per pair whether it exists in B, but I am guessing this is the least efficient way...
I tried using the .isin function as suggested in Filter dataframe rows if value in column is in a set list of values but couldn't make use of it for multiple columns.
Example dataframes:
A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
C should contain [1,4] and [2,4] after the operation.
In pandas master (or in future 0.13) isin will also accept DataFrames, but the problem is that it just looks at the values in each column, and not at an exact row combination of the columns.
Taken from @AndyHayden's comment here (https://github.com/pydata/pandas/issues/4421#issuecomment-23052472), a similar approach with a set:
In [3]: mask = pd.Series(map(set(B.itertuples(index=False)).__contains__, A.itertuples(index=False)))
In [4]: A[~mask]
Out[4]:
x y
1 1 4
3 2 4
Or a more readable version:
set_B = set(B.itertuples(index=False))
mask = [x not in set_B for x in A.itertuples(index=False)]
The possible advantage of this compared to @Acorbe's answer is that it preserves the index of A and does not remove duplicate rows in A (but that depends on what you want, of course).
As I said, isin in 0.13 will accept DataFrames. However, I don't think this will solve the issue, because the index also has to be the same:
In [27]: A.isin(B)
Out[27]:
x y
0 True True
1 False True
2 False False
3 False False
You can solve this by converting it to a dict, but then it does not look at the combination of both columns, just at each column separately:
In [28]: A.isin(B.to_dict(outtype='list'))
Out[28]:
x y
0 True True
1 True True
2 True True
3 False True
For those looking for a single-column solution:
new_df = df1[~df1["column_name"].isin(df2["column_name"])]
The ~ is a logical operator for NOT.
So this creates a new dataframe containing the rows of df1 whose values in df1["column_name"] are not found in df2["column_name"].
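For several columns at once, a merge with indicator=True is another common pattern (a sketch; indicator was added in a later pandas version than the one this answer targets):
C = A.merge(B, on=['x', 'y'], how='left', indicator=True)
C = C[C['_merge'] == 'left_only'].drop(columns='_merge')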
One option would be to generate two sets, say A_set, B_set, whose elements are the rows of the DataFrames. Hence, the fast set difference operation A_set - B_set can be used.
A_set = set(map(tuple, A.values))  # rows must be hashable before building a set
B_set = set(map(tuple, B.values))
C_set = A_set - B_set
C_set
{(1, 4), (2, 4)}
C = pd.DataFrame([c for c in C_set], columns=['x','y'])
x y
0 2 4
1 1 4
This procedure involves some preliminary conversion operations, though.
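If you want to keep A's original index and row order, the same set idea can produce a boolean mask instead (a variant of the above):
B_set = set(map(tuple, B.values))
C = A[[tuple(row) not in B_set for row in A.values]]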