I have a dataframe (df) like this:
A B C D
0 -0.01961 -0.01412 0.013277 0.013277
1 -0.021173 0.001205 0.01659 0.01659
2 -0.026254 0.009932 0.028451 0.012826
How can I efficiently check whether there is ANY column whose values do not all have the same sign?
Check with np.sign and nunique
np.sign(df).nunique()!=1
Out[151]:
A False
B True
C False
D False
dtype: bool
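Put together, a minimal runnable sketch of this check, using the sample frame from the question:

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "A": [-0.01961, -0.021173, -0.026254],
    "B": [-0.01412, 0.001205, 0.009932],
    "C": [0.013277, 0.01659, 0.028451],
    "D": [0.013277, 0.01659, 0.012826],
})

# np.sign maps each value to -1/0/1; nunique counts distinct signs per column
mixed = np.sign(df).nunique() != 1
print(mixed)        # only B mixes negative and positive values

# Answer "is there ANY such column?" with .any()
print(mixed.any())  # True
```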
Related
I want to create a column that indicates whether the string value in column 'A' matches the value in either column 'B' or 'C' of the same row. These can be converted to float or int if that makes it easier.
Data:
A B C OUTPUT
A B C No/False
B B B Yes/True
A A C Yes/True
A C A Yes/True
You can do
df["OUTPUT"] = df.apply(lambda x: x["A"] in (x["B"], x["C"]), axis=1)
Let us try isin
df[['B','C']].isin(df.A).any(axis=1)
0 False
1 True
2 True
3 True
dtype: bool
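A self-contained sketch of the isin approach, assuming the sample data above with the default integer index (isin with a Series aligns on the row index):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "A": ["A", "B", "A", "A"],
    "B": ["B", "B", "A", "C"],
    "C": ["C", "B", "C", "A"],
})

# isin with a Series aligns on the row index, so each row of B/C is
# compared against that row's value of A
df["OUTPUT"] = df[["B", "C"]].isin(df["A"]).any(axis=1)
print(df["OUTPUT"].tolist())  # [False, True, True, True]
```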
I'm looking for a way to determine if a column or set of columns of a pandas dataframe uniquely identifies the rows of that dataframe. I've seen this called the isid function in Stata.
The best I can think of is to get the unique values of a subset of columns using a set comprehension, and asserting that there are as many values in the set as there are rows in the dataframe:
subset = df[["colA", "colC"...]]
unique_vals = {tuple(x) for x in subset.values}
assert(len(unique_vals) == len(df))
This isn't the most elegant answer in the world, so I'm wondering if there's a built-in function that does this, or perhaps a way to test if a subset of columns are a uniquely-valued index.
You could make an index and check its is_unique attribute:
import pandas as pd
df1 = pd.DataFrame([(1,2),(1,2)], columns=list('AB'))
df2 = pd.DataFrame([(1,2),(1,3)], columns=list('AB'))
print(df1.set_index(['A','B']).index.is_unique)
# False
print(df2.set_index(['A','B']).index.is_unique)
# True
Maybe groupby with size
df.groupby(['x','y']).size()==1
Out[308]:
x y
1 a True
2 b True
3 c True
4 d False
dtype: bool
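A runnable sketch of the groupby approach, using a small frame consistent with the output shown above:

```python
import pandas as pd

# A small frame consistent with the output shown above
df = pd.DataFrame({"x": [1, 2, 3, 4, 4], "y": ["a", "b", "c", "d", "d"]})

# Each group's size is 1 exactly when that (x, y) combination is unique
sizes = df.groupby(["x", "y"]).size()
print(sizes == 1)

# The columns uniquely identify rows only if every group has size 1
print((sizes == 1).all())  # False: (4, 'd') appears twice
```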
You can check
df[['x', 'y']].apply(tuple, axis=1).duplicated(keep=False).any()
to see if any rows repeat the same pair of values from columns x and y.
Example:
df = pd.DataFrame({'x':[1,2,3,4,4], 'y': ["a", "b", "c", "d","d"]})
x y
0 1 a
1 2 b
2 3 c
3 4 d
4 4 d
Converting each row to a tuple gives
0 (1, a)
1 (2, b)
2 (3, c)
3 (4, d)
4 (4, d)
dtype: object
then duplicated(keep=False) flags every repeated row
0 False
1 False
2 False
3 True
4 True
dtype: bool
Note that converting to tuples is not actually necessary; duplicated works directly on the frame:
df.duplicated(keep=False)
0 False
1 False
2 False
3 True
4 True
dtype: bool
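If this check is needed repeatedly, it can be wrapped in a small helper in the spirit of Stata's isid (the function name here is hypothetical):

```python
import pandas as pd

def isid(df, cols):
    """True if `cols` uniquely identify the rows of `df`.
    (Hypothetical helper, named after Stata's isid.)"""
    return not df.duplicated(subset=cols).any()

df = pd.DataFrame({"x": [1, 2, 3, 4, 4], "y": ["a", "b", "c", "d", "d"]})
print(isid(df, ["x", "y"]))           # False: (4, 'd') repeats
print(isid(df.iloc[:4], ["x", "y"]))  # True for the first four rows
```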
I want to select a subset of rows in a pandas dataframe, based on a particular string column, where the value starts with any number of values in a list.
A small version of this:
df = pd.DataFrame({'a': ['aa10', 'aa11', 'bb13', 'cc14']})
valids = ['aa', 'bb']
So I want just those rows where a starts with aa or bb in this case.
You need str.startswith, which accepts a tuple of prefixes (not a list)
df.a.str.startswith(tuple(valids))
Out[191]:
0 True
1 True
2 True
3 False
Name: a, dtype: bool
Then filter the original df with the mask
df[df.a.str.startswith(tuple(valids))]
Out[192]:
a
0 aa10
1 aa11
2 bb13
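Putting it together as a runnable sketch; the na=False argument is an optional safeguard that treats missing values as non-matches, in case the column can contain NaN:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"a": ["aa10", "aa11", "bb13", "cc14"]})
valids = ["aa", "bb"]

# str.startswith accepts a tuple of prefixes; na=False makes any
# missing values count as non-matches
mask = df["a"].str.startswith(tuple(valids), na=False)
print(df[mask])
```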
Consider a pandas dataframe that has values such as 'a - b'. I would like to check for the occurrence of '-' anywhere across all values of the dataframe without looping through individual columns. Clearly a check such as the following won't work:
if '-' in df.values
Any suggestions on how to check for this? Thanks.
I'd use stack() + .str.contains() in this case:
In [10]: df
Out[10]:
a b c
0 1 a - b w
1 2 c z
2 3 d 2 - 3
In [11]: df.stack().str.contains('-').any()
Out[11]: True
In [12]: df.stack().str.contains('-')
Out[12]:
0 a NaN
b True
c False
1 a NaN
b False
c False
2 a NaN
b False
c True
dtype: object
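A self-contained sketch of the stack approach; passing na=False makes str.contains report False for the non-string (numeric) cells instead of NaN, so the final .any() is unambiguous:

```python
import pandas as pd

# Frame from the answer above; column 'a' is numeric
df = pd.DataFrame({"a": [1, 2, 3],
                   "b": ["a - b", "c", "d"],
                   "c": ["w", "z", "2 - 3"]})

# stack() flattens the frame into a single Series; na=False makes
# str.contains return False for the non-string (numeric) cells
found = df.stack().str.contains("-", na=False)
print(found.any())  # True
```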
You can use replace to swap any regex match with True, then check for equality:
df.replace('.*-.*', True, regex=True).eq(True)
Note that a numeric cell holding 1 also compares equal to True, so this is safest on all-string frames.
One way is to flatten the values and use a list comprehension.
df = pd.DataFrame([['val1','a-b', 'val3'],['val4','3', 'val5']],columns=['col1','col2', 'col3'])
print(df)
Output:
col1 col2 col3
0 val1 a-b val3
1 val4 3 val5
Now, to search for -:
find_value = [val for val in df.values.flatten() if '-' in str(val)]
print(find_value)
Output:
['a-b']
Using NumPy: np.core.defchararray.find(a, s) returns, for each string in a, the lowest index at which the substring s appears, or -1 if it is absent.
(np.core.defchararray.find(df.values.astype(str),'-') > -1).any()
returns True if '-' is present anywhere in df.
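A runnable sketch; np.char.find is the public alias for np.core.defchararray.find, and astype(str) turns every cell into a string first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": ["a - b", "c", "d"],
                   "c": ["w", "z", "2 - 3"]})

# np.char.find is the public alias for np.core.defchararray.find;
# astype(str) turns every cell into a string first
idx = np.char.find(df.values.astype(str), "-")
print((idx > -1).any())  # True
```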
I have a Pandas Dataframe called names as follows:
name status
A X
B Y
C Z
D X
I want to get the name column (e.g. names['name']), but only with names which do NOT have the status Y or Z.
So the result should be:
name status
A X
D X
How can I do this?
Use isin to generate the boolean mask and negate it using ~:
In [230]:
df[~df['status'].isin(['Y','Z'])]
Out[230]:
name status
0 A X
3 D X
Result of isin:
In [231]:
df['status'].isin(['Y','Z'])
Out[231]:
0 False
1 True
2 True
3 False
Name: status, dtype: bool
You can then just access the 'name' column like so:
In [232]:
df.loc[~df['status'].isin(['Y','Z']),'name']
Out[232]:
0 A
3 D
Name: name, dtype: object
This will also work:
df.loc[(df['status']!='Y') & (df['status']!='Z')]
Or, if you just want the data in the name column to be displayed:
df.loc[(df['status']!='Y') & (df['status']!='Z'), 'name']
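A minimal runnable version using the sample data from the question:

```python
import pandas as pd

# Sample data from the question
names = pd.DataFrame({"name": ["A", "B", "C", "D"],
                      "status": ["X", "Y", "Z", "X"]})

# ~ negates the boolean mask built by isin
keep = ~names["status"].isin(["Y", "Z"])
print(names.loc[keep, "name"].tolist())  # ['A', 'D']
```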