How to check if a Pandas Dataframe column contains a value? [duplicate] - pandas

This question already has answers here:
finding values in pandas series - Python3
(2 answers)
Closed 1 year ago.
I'd like to check if a pandas.DataFrame column contains a specific value. For instance, this toy DataFrame has an "h" in column "two":
import numpy as np
import pandas as pd
df = pd.DataFrame(
np.array(list("abcdefghi")).reshape((3, 3)),
columns=["one", "two", "three"]
)
df
one two three
0 a b c
1 d e f
2 g h i
But surprisingly,
"h" in df["two"]
evaluates to False.
My question is: What's the clearest way to find out if a DataFrame column (or pandas.Series in general) contains a specific value?

df["two"] is a pandas.Series which looks like this:
0 b
1 e
2 h
It turns out, the in operator checks the index, not the values. I.e.
2 in df["two"]
evaluates to True
So one has to explicitly check for the values like this:
"h" in df["two"].values
This evaluates to True.
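For completeness, a small sketch re-using the toy DataFrame above, contrasting the index check with value checks; the isin and boolean-mask variants are common alternatives, not something from the original question:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list("abcdefghi")).reshape((3, 3)),
    columns=["one", "two", "three"],
)

print("h" in df["two"])             # False -- `in` looks at the index labels
print(2 in df["two"])               # True  -- 2 is an index label
print("h" in df["two"].values)      # True  -- checks the underlying values
print(df["two"].isin(["h"]).any())  # True  -- vectorized membership test
print((df["two"] == "h").any())     # True  -- boolean-mask equivalent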

Related

pandas filtering column names [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 4 months ago.
I have this type of dataframe:
A = ["axa","axb","axc","axd","bxa","bxb","bxc","bxd","cxa".......]
My question is: I have more than 350 columns of this kind, and I need a new dataframe containing only the columns whose names include 'c'. How can I do that?
The new dataframe's columns should look like this:
B = A[["axc","bxc","cxa","cxb","cxc","cxd","dxc","exc","fxc".......]]
Use DataFrame.filter to select the column names containing 'c':
df2 = df.filter(like='c')
Or use a list comprehension to filter the column names:
df2 = df[[x for x in df.columns if 'c' in x]]
You can do it easily using a list comprehension:
new_df = df[[col for col in df.columns if 'c' in col]]
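A quick sketch with a hypothetical small frame standing in for the 350+ column one, showing that both approaches select the same columns:
import pandas as pd

# Hypothetical column names; the real frame has more than 350 columns
df = pd.DataFrame(columns=["axa", "axb", "axc", "bxc", "cxa", "dxd"])

df2 = df.filter(like="c")                      # keep names containing "c"
df3 = df[[x for x in df.columns if "c" in x]]  # equivalent list comprehension

print(list(df2.columns))  # ['axc', 'bxc', 'cxa']
print(list(df3.columns))  # ['axc', 'bxc', 'cxa']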

How to print the value of a row that returns false using .isin method in python [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 4 months ago.
I am new to writing code and am currently working on a project to compare two columns of an Excel sheet using Python and return the rows that do not match.
I tried using the .isin function and was able to identify and output the values by comparing the columns; however, I am not sure how to print the actual row that returns the value "False".
For Example:
import pandas as pd
data = ["Darcy Hayward","Barbara Walters","Ruth Fraley","Minerva Ferguson","Tad Sharp","Lesley Fuller","Grayson Dolton","Fiona Ingram","Elise Dolton"]
df = pd.DataFrame(data, columns=['Names'])
df
data1 = ["Darcy Hayward","Barbara Walters","Ruth Fraley","Minerva Ferguson","Tad Sharp","Lesley Fuller","Grayson Dolton","Fiona Ingram"]
df1 = pd.DataFrame(data1, columns=['Names'])
df1
data_compare = df["Names"].isin(df1["Names"])
for data in data_compare:
if data==False:
print(data)
However, I want to know that index 8 returned False, something like the format below.
Could you please advise how I can modify the code to print the index and name that returned False?
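A minimal sketch of the usual approach, using df and df1 as defined above: index df with the negated isin mask, which keeps exactly the rows that would return False.
mask = df["Names"].isin(df1["Names"])   # True where the name also appears in df1
missing = df[~mask]                     # rows of df that are not in df1
print(missing)                          # shows index 8, "Elise Dolton"

# Or print index and name explicitly
for idx, name in missing["Names"].items():
    print(idx, name)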

count the number of strings in a 2-D pandas series [duplicate]

This question already has answers here:
How do I count the values from a pandas column which is a list of strings?
(5 answers)
Closed 11 months ago.
I am trying to count the number of characters in an uneven 2-D pandas series.
df = pd.DataFrame({'A': [['a','b'],['a','c','f'],['a'], ['b','f']]})
I want to count the number of times each character is repeated.
any ideas?
You can use explode() and value_counts().
import pandas as pd
df = pd.DataFrame({ 'A' : [['a','b'],['a','c','f'],['a'], ['b','f']]})
df = df.explode("A")
print(df.value_counts())
Expected output:
A
a 3
b 2
f 2
c 1
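The same counts can also be obtained directly from the column, a minor variant of the answer above:
import pandas as pd

df = pd.DataFrame({'A': [['a','b'],['a','c','f'],['a'], ['b','f']]})

# Explode the lists into one row per element, then count occurrences
print(df["A"].explode().value_counts())
# a    3
# b    2
# f    2
# c    1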

Pandas dataframe select rows where a list-column contains a specific set of elements

This is a follow-up to the following post: Pandas dataframe select rows where a list-column contains any of a list of strings
I want to be able to select rows that contain the exact pair of strings from the selection list (where selection= ['cat', 'dog']).
starting df:
molecule species
0 a [dog]
1 b [horse, pig]
2 c [cat, dog]
3 d [cat, horse, pig]
4 e [chicken, pig]
df I want:
molecule species
2 c [cat, dog]
I tried the following, and it returned only the column labels.
df[pd.DataFrame(df.species.tolist()).isin(selection).all(1)]
One way to do it:
df['joined'] = df.species.str.join(sep=',')
selection = ['cat,dog']
filtered = df.loc[df.joined.isin(selection)]
This won't find cases with a different ordering (e.g. 'dog,cat' or 'horse,cat,pig'), but if that is not an issue then it works fine.
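One possible tweak, not part of the original answer, to make this order-insensitive: sort each list before joining (assuming df is the frame shown above).
# Sort each species list so the joined string is independent of ordering
df["joined"] = df["species"].apply(lambda sp: ",".join(sorted(sp)))
selection = [",".join(sorted(["cat", "dog"]))]   # 'cat,dog'
filtered = df.loc[df["joined"].isin(selection)]
print(filtered)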
This works on the lists directly, regardless of their ordering.
import numpy as np
import pandas as pd
selection = ['cat', 'dog']
mols = pd.DataFrame({'molecule':['a','b','c','d','e'],'species':[['dog'],['horse','pig'],['cat','dog'],['cat','horse','pig'],['chicken','pig']]})
mols.loc[np.where(pd.Series([all(w in selection for w in mols.species.values[k]) for k in mols.index]).map({True:1,False:0}) == 1)[0]]
If you want to find rows that contain at least the elements in the list (and possibly others as well), use:
mols.loc[np.where(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}) == 1)[0]]
This is an interesting application of matrices as selectors. Use the transposed mols to multiply the vector of zeroes and ones that indicates which rows in mols fit your criteria:
mols.to_numpy().T.dot(pd.Series([all(w in mols.species.values[k] for w in selection) for k in mols.index]).map({True:1,False:0}))
Another (more readable) solution would be to assign to mols a column indicating where the condition is True, map it to 0 and 1, and query mols for rows where that column equals 1, as sketched below.
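A sketch of that variant, using the same mols and selection as above; the helper column name flag is made up for illustration:
# Flag rows whose species list contains every element of `selection`,
# map the boolean to 0/1, then query on that helper column
mols = mols.assign(
    flag=mols["species"]
    .apply(lambda sp: all(w in sp for w in selection))
    .map({True: 1, False: 0})
)
print(mols.query("flag == 1").drop(columns="flag"))
#   molecule     species
# 2        c  [cat, dog]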

Remove the duplicated entry from this list [duplicate]

This question already has answers here:
Drop all duplicate rows across multiple columns in Python Pandas
(8 answers)
Closed 3 years ago.
Based on the below example, how can I remove just the last "A" from the list? Using drop_duplicates as I did deletes both entries. The end result should be A, B, C, but now I get B, C.
import pandas as pd
df = pd.DataFrame({'ID': ["A", "B", "C", "A"]})
df.drop_duplicates(keep=False,inplace=True)
print(df)
You just need to set the parameter keep="first".
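A sketch of the fix with the same toy frame; note that keep="first" is also the default for drop_duplicates:
import pandas as pd

df = pd.DataFrame({'ID': ["A", "B", "C", "A"]})

# keep="first" keeps the first occurrence of each value and drops later ones,
# so only the trailing "A" is removed
df.drop_duplicates(keep="first", inplace=True)
print(df)
#   ID
# 0  A
# 1  B
# 2  C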