What is the simplest way to check for the occurrence of a character/substring in DataFrame values? - pandas

Consider a pandas DataFrame that has values such as 'a - b'. I would like to check for the occurrence of '-' anywhere across all values of the DataFrame without looping through individual columns. Clearly a check such as the following won't work:
if '-' in df.values
Any suggestions on how to check for this? Thanks.

I'd use stack() + .str.contains() in this case:
In [10]: df
Out[10]:
   a      b      c
0  1  a - b      w
1  2      c      z
2  3      d  2 - 3

In [11]: df.stack().str.contains('-').any()
Out[11]: True

In [12]: df.stack().str.contains('-')
Out[12]:
0  a      NaN
   b     True
   c    False
1  a      NaN
   b    False
   c    False
2  a      NaN
   b    False
   c     True
dtype: object
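Note that the non-string cells (column a above) come back as NaN from .str.contains; passing na=False keeps the result strictly boolean. A minimal, self-contained sketch of the same approach, with the sample frame reconstructed from the output above:
import pandas as pd

# Hypothetical reconstruction of the frame shown in the answer.
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['a - b', 'c', 'd'],
                   'c': ['w', 'z', '2 - 3']})

# na=False treats non-string cells as non-matches instead of NaN.
print(df.stack().str.contains('-', na=False).any())  # True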

You can use replace to swap a regex match with something else, then check for equality:
df.replace('.*-.*', True, regex=True).eq(True)
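A hedged, end-to-end sketch of that idea (the frame is a hypothetical reconstruction of the sample above). Because 1 == True in Python, replacing with True could misreport numeric 1 cells, so this variant swaps in a sentinel string instead:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['a - b', 'c', 'd'],
                   'c': ['w', 'z', '2 - 3']})

# Replace any string containing '-' with a sentinel, then compare against it.
mask = df.replace('.*-.*', '__DASH__', regex=True).eq('__DASH__')
print(mask.any().any())  # True if '-' appears anywhere in the frame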

One way is to flatten the values and use a list comprehension.
df = pd.DataFrame([['val1','a-b', 'val3'],['val4','3', 'val5']],columns=['col1','col2', 'col3'])
print(df)
Output:
   col1 col2  col3
0  val1  a-b  val3
1  val4    3  val5
Now, to search for -:
find_value = [val for val in df.values.flatten() if '-' in val]
print(find_value)
Output:
['a-b']
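If the frame also held non-string values (ints, floats, NaN), the 'in' test would raise a TypeError, so a hedged variant casts each value to str first:
# Same idea, made robust to non-string cells (not needed for the all-string frame above).
find_value = [val for val in df.values.flatten() if '-' in str(val)]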

Using NumPy: np.core.defchararray.find(a,s) returns an array of indices where the substring s appears in a;
if it's not present, -1 is returned.
(np.core.defchararray.find(df.values.astype(str),'-') > -1).any()
returns True if '-' is present anywhere in df.
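For reference, np.char.find is the public alias for the same routine; assuming a df like the samples above, the check reads:
import numpy as np

# df as in the earlier samples; astype(str) makes every cell searchable.
print((np.char.find(df.values.astype(str), '-') > -1).any())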

Related

Is there a way to use .loc on column names instead of the values inside the columns?

I am wondering if there is a way to use .loc to select the columns of one df whose column names match something in another df. I know you can usually use it to check whether the values are == to something, but what about the actual column names themselves?
ex.
df1 = [ 0, 1, 2, 3]
df2.columns = [2,4,6]
Is there a way to only display df2 values where the column name is in df1, without hardcoding it as something like df2.loc[:, ==2]?
IIUC, you can use df2.columns.intersection to keep only the columns of df2 that are also present in df1:
>>> df1
          A         B         D         F
0  0.431332  0.663717  0.922112  0.562524
1  0.467159  0.549023  0.139306  0.168273
>>> df2
          A         B         C         D         E         F
0  0.451493  0.916861  0.257252  0.600656  0.354882  0.109236
1  0.676851  0.585368  0.467432  0.594848  0.962177  0.714365
>>> df2[df2.columns.intersection(df1.columns)]
          A         B         D         F
0  0.451493  0.916861  0.600656  0.109236
1  0.676851  0.585368  0.594848  0.714365
One solution:
df3 = df2[[c for c in df2.columns if c in df1]]
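A minimal, reproducible sketch of the intersection approach, with made-up frames whose integer column names loosely follow the question:
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2]], columns=[2, 4, 8])
df2 = pd.DataFrame([[10, 20, 30]], columns=[2, 4, 6])

common = df2.columns.intersection(df1.columns)  # Index([2, 4])
print(df2[common])  # only the shared columns 2 and 4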

Sign check on pandas dataframe

I have a dataframe (df) like this:
           A         B         C         D
0   -0.01961  -0.01412  0.013277  0.013277
1  -0.021173  0.001205   0.01659   0.01659
2  -0.026254  0.009932  0.028451  0.012826
How could I efficiently check if there is ANY column where column values do not have the same sign?
Check with np.sign and nunique:
np.sign(df).nunique()!=1
Out[151]:
A    False
B     True
C    False
D    False
dtype: bool
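To reduce that per-column result to the single yes/no the question asks for, it can be wrapped in .any(); a small sketch with made-up data (note that a literal 0 would count as a third sign value):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-1.0, -2.0, -3.0],   # all negative -> one sign
                   'B': [-1.0,  2.0,  3.0],   # mixed signs  -> flagged
                   'C': [ 1.0,  2.0,  3.0]})  # all positive -> one sign

mixed = np.sign(df).nunique() != 1
print(mixed.any())  # True, because column B changes sign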

filter on pandas array

I'm writing this kind of code to find out whether a value belongs to the array stored in column a of a DataFrame:
Solution 1
df = pd.DataFrame([{'a':[1,2,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
   a  b
0  1  4
Problem
This gets worse if there are repetitions:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
   a  b
0  1  4
0  1  4
Solution 2
Another solution could go like:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df[df['a'].map(lambda row: 1 in row)]
Problem
That lambda will be slow if the DataFrame is big.
Question
As a first goal, I want all the lines where the value 1 belongs to a:
without pure-Python loops, since they are slow
with high performance
avoiding memory issues
...
So I'm trying to understand what I can do with the arrays inside pandas. Is there documentation on how to use this type efficiently?
IIUC, you are trying to do this (on the exploded frame):
df[df['a'].eq(1).groupby(level=0).transform('any')]
Output:
   a  b
0  1  4
0  2  4
0  3  4
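A runnable sketch of the same filter on the exploded frame, here using the question's second example with the repeated 1:
import pandas as pd

df = pd.DataFrame([{'a': [1, 2, 1, 3], 'b': 4}, {'a': [5, 6], 'b': 7}])
exploded = df.explode('a')

# Keep every exploded row whose original row (index level 0) contains a 1.
mask = exploded['a'].eq(1).groupby(level=0).transform('any')
print(exploded[mask])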
Nothing is wrong. This is normal behavior of pandas.explode().
To check whether a value belongs to the values in a, you may use this (note that a plain 'in' test on a Series checks the index, not the values, hence the .values):
if x in df.a.explode().values
where x is what you test for.
I think you can convert the arrays to scalar columns with the DataFrame constructor and then test for the value with DataFrame.eq and DataFrame.any:
df = df[pd.DataFrame(df['a'].tolist()).eq(1).any(axis=1)]
print (df)
              a  b
0  [1, 2, 1, 3]  4
Details:
print (pd.DataFrame(df['a'].tolist()))
   0  1    2    3
0  1  2  1.0  3.0
1  5  6  NaN  NaN
print (pd.DataFrame(df['a'].tolist()).eq(1))
       0      1      2      3
0   True  False   True  False
1  False  False  False  False
So I'm trying to understand what I can do with the arrays inside pandas. Is there documentation on how to use this type efficiently?
I think working with lists inside pandas cells is not a good idea.

Determine if columns of a pandas dataframe uniquely identify the rows

I'm looking for a way to determine if a column or set of columns of a pandas dataframe uniquely identifies the rows of that dataframe. I've seen this called the isid function in Stata.
The best I can think of is to get the unique values of a subset of columns using a set comprehension, and to assert that there are as many values in the set as there are rows in the dataframe:
subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert(len(unique_vals) == len(df))
This isn't the most elegant answer in the world, so I'm wondering if there's a built-in function that does this, or perhaps a way to test if a subset of columns are a uniquely-valued index.
You could make an index and check its is_unique attribute:
import pandas as pd
df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))
df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))
print(df1.set_index(['A','B']).index.is_unique)
# False
print(df2.set_index(['A','B']).index.is_unique)
# True
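A small helper in the spirit of Stata's isid could be built on the same idea; the name and signature here are only an illustration:
import pandas as pd

def is_id(frame, cols):
    # True if the given columns uniquely identify the rows of the frame.
    return frame.set_index(list(cols)).index.is_unique

df = pd.DataFrame({'A': [1, 1], 'B': [2, 3]})
print(is_id(df, ['A', 'B']))  # True
print(is_id(df, ['A']))       # False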
Maybe use groupby and size:
df.groupby(['x','y']).size()==1
Out[308]:
x  y
1  a     True
2  b     True
3  c     True
4  d    False
dtype: bool
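To turn that per-group view into the single yes/no answer, wrap it in .all() (a sketch on the same hypothetical x/y frame):
uniquely_identified = (df.groupby(['x', 'y']).size() == 1).all()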
You can check
df[['x', 'y']].apply(tuple, axis=1).duplicated(keep=False).any()
to see whether there are any duplicated rows for the combination of values from columns x and y.
Example:
df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})
   x  y
0  1  a
1  2  b
2  3  c
3  4  d
4  4  d
Then building the tuples gives:
0    (1, a)
1    (2, b)
2    (3, c)
3    (4, d)
4    (4, d)
dtype: object
then check which are duplicated()
0    False
1    False
2    False
3     True
4     True
dtype: bool
Notice that building the tuples might not be necessary; duplicated can be called on the frame directly:
df.duplicated(keep=False)
0    False
1    False
2    False
3     True
4     True
dtype: bool
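For a subset of columns, duplicated() also accepts the columns directly, so the tuple step can be skipped entirely (a sketch on the same frame):
# True when no (x, y) combination occurs more than once.
uniquely_identified = not df.duplicated(subset=['x', 'y'], keep=False).any()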

pandas string contains lookup: NaN leads to Value Error

If you would like to filter those rows for which a string is in a column value, it is possible to use something like data.sample_id.str.contains('hph') (answered before: check if string in pandas dataframe column is in list, or Check if string is in a pandas dataframe).
However, my lookup column contains empty cells. Therefore, str.contains() yields NaN values and I get a ValueError upon indexing:
ValueError: cannot index with vector containing NA / NaN values
What works:
# get all runs
mask = [index for index, item in enumerate(data.sample_id.values) if 'zent' in str(item)]
Is there a more elegant and faster method (similar to str.contains()) than this one?
You can set the na parameter of str.contains to False:
print (df.a.str.contains('hph', na=False))
Using EdChum's sample:
df = pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
print (df)
           a
0        hph
1        NaN
2  sadhphsad
3      hello
print (df.a.str.contains('hph', na=False))
0     True
1    False
2     True
3    False
Name: a, dtype: bool
IIUC you can filter those rows out also
data['sample'].dropna().str.contains('hph')
Example:
In [38]:
df = pd.DataFrame({'a':['hph', np.NaN, 'sadhphsad', 'hello']})
df
Out[38]:
           a
0        hph
1        NaN
2  sadhphsad
3      hello
In [39]:
df['a'].dropna().str.contains('hph')
Out[39]:
0     True
2     True
3    False
Name: a, dtype: bool
So by calling dropna first you can then safely use str.contains on the Series as there will be no NaN values
Another way to handle the null values would be to use notnull:
In [43]:
(df['a'].notnull()) & (df['a'].str.contains('hph'))
Out[43]:
0     True
1    False
2     True
3    False
Name: a, dtype: bool
but I think passing na=False would be cleaner (@jezrael's answer).
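Putting it together for the original use case, a hedged sketch with made-up data (the column name and pattern follow the question): filter the rows whose sample_id contains a pattern, treating missing values as non-matches:
import numpy as np
import pandas as pd

data = pd.DataFrame({'sample_id': ['zent01', np.nan, 'hph-7', 'other']})

# na=False turns the NaN results into False, so the mask is safe to index with.
filtered = data[data['sample_id'].str.contains('zent', na=False)]
print(filtered)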