Pandas, isin, column of lists

I'm trying to make a Boolean flag that reads True if one value or another is within a list. The code below returns False for the first row and I am not sure why. Could someone help me understand why False is getting returned for the first row?
import pandas as pd

lists = {'someList!': [[1, 2, 12, 6, 'ABC'], [1000, 4, 'z', 'a', 'bob']]}
dfLists = pd.DataFrame(lists)
dfLists['contains?'] = dfLists['someList!'].isin([0, 1])

This isn't working because .isin(values) returns whether each element in the Series is contained in values. Here each element is itself a list, and a list is never equal to 0 or 1, so every row comes back False.
You can use {0, 1} as a set and apply the truthiness of its intersection to each list:
>>> s = {0, 1}
>>> dfLists['someList!'].apply(lambda x: bool(s.intersection(x)))
0 True
1 False
This effectively does:
>>> s.intersection([1, 2, 12, 6, 'ABC'])
{1}
>>> s.intersection([1000, 4, 'z', 'a', 'bob'])
set()
The bool of the first result is True, because it is non-empty.
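To get the Boolean flag column that the question asks for, the same expression can simply be assigned back (reusing s = {0, 1} from above):
>>> dfLists['contains?'] = dfLists['someList!'].apply(lambda x: bool(s.intersection(x)))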

Use the DataFrame constructor to flatten your list column, then use isin:
pd.DataFrame(dfLists['someList!'].tolist()).isin([1,2]).any(axis=1)
Out[39]:
0 True
1 False
dtype: bool

You are passing the dataframe a list of lists. It's comparing integers to lists, so it's not finding a match.
Define your columns distinctly, like this, and isin should work.
lists={'someList!':[1,2,12,6,'ABC'], 'someList2':[1000,4,'z','a','bob']}
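For example, a minimal sketch of how isin then behaves once each list is its own column (assuming you want one flag per original list):
import pandas as pd

lists = {'someList!': [1, 2, 12, 6, 'ABC'], 'someList2': [1000, 4, 'z', 'a', 'bob']}
dfLists = pd.DataFrame(lists)

# per-column flag: does the column contain 0 or 1 anywhere?
dfLists.isin([0, 1]).any()
# someList!     True
# someList2    False
# dtype: bool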

Replacing Specific Values in a Pandas Column

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I receive the exact same copy of the previous results.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [1, 0], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with .get, i.e.:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary should only be used if all the possible values of the column are defined in the dictionary; otherwise, values that are missing from the dictionary end up as None.
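If the column may contain other values as well, one workaround (my own sketch, not part of the original answer) is to give .get a default so unmapped values pass through unchanged:
w['female'] = w['female'].apply(lambda v: {'male': 0, 'female': 1}.get(v, v))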
Using Series.map with Series.fillna
If your column contains strings other than just female and male, Series.map will fail in this case, since it returns NaN for the other values.
That's why we have to chain it with fillna.
Example of why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables as are needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column; instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first = True)], axis = 1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
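A minimal sketch of factorize (my own illustration; note that codes are assigned in order of first appearance, so the exact 0/1 assignment may not match the mapping you want):
codes, uniques = pd.factorize(w['female'])
# for ['female', 'male', 'female'] this gives codes [0, 1, 0] and uniques ['female', 'male']
w['female'] = codes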
import numpy as np

w.female = np.where(w.female == 'female', 1, 0)
This is a NumPy solution, useful for replacing values based on a condition: both the if and the else branch are handled by np.where(). The solutions that use df.replace() may not be feasible if the column contains many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession, since neither of them implements an else branch on its own:
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace accepts a dictionary as its argument, in which you can map any old values to whatever new values you want or need.
I think it should be pointed out which type of object you get back from the methods suggested above: a Series or a DataFrame.
When you select a column with w.female or w['female'] you get back a Series, whereas w[['female']] (selecting with a list of columns) gives back a DataFrame.
Both Series and DataFrame have a .replace method, so that works in either case. Methods like .map and the .str accessor, on the other hand, only exist on Series, so check which object you are actually working with before chaining.
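A quick way to check this yourself (a small sketch using the w frame from the earlier examples):
print(type(w['female']))    # <class 'pandas.core.series.Series'>
print(type(w[['female']]))  # <class 'pandas.core.frame.DataFrame'>
# both objects have .replace; .map and the .str accessor exist only on the Series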
To answer the question more generically so it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that help feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd

class Utility:

    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct names in a column. If no dictionary is provided for the exact name changes,
        it will default to <column_name>_count. Ex. female_1, female_2, etc.

        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
            Ex. {1234: "User A"} This would change all occurrences of 1234 to the string "User A" and
            leave the other values as they were. By default, this is an empty dictionary.
        :return: The same column with the replaced values
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the new item to
        replace it. The returned dictionary can then be passed to the rename function above to rename
        all the distinct values in a column.
        Ex. column ["statement"]["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}

        If you would like a value to remain the same, enter the values you would like to keep in except_values.
        Ex. except_values = ["I", "am"]
        column ["statement"]["I", "am", "old"] would return
        {"old": "statement_1"}

        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
                count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 0, "male":1}
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes, you can use the equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64

efficiently setting values on a subset of rows

I am wondering about the best way to change values in a subset of rows in a dataframe.
Let's say I want to double the values in column value in rows where selected is true.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'value': [1, 2, 3, 4], 'selected': [False, False, True, True]})
In [3]: df
Out[3]:
selected value
0 False 1
1 False 2
2 True 3
3 True 4
There are several ways to do this:
# 1. Subsetting with .loc on left and right hand side:
df.loc[df['selected'], 'value'] = df.loc[df['selected'], 'value'] * 2
# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
# 3. Using where()
df['value'] = (df['value'] * 2).where(df['selected'], df['value'])
If I only subset on the left hand side (option 2), would Pandas actually make the calculation for all rows and then discard the result for all but the selected rows?
In terms of evaluation, is there any difference between using loc and where?
Your #2 option is the most standard and recommended way to do this. Your #1 option is fine also, but the extra code is unnecessary because ix/loc/iloc are designed to pass the boolean selection through and do the necessary alignment to make sure it applies only to your desired subset.
# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
If you don't use ix/loc/iloc on the left hand side, problems can arise that we don't want to get into in a simple answer. Hence, using ix/loc/iloc is generally the safest and most recommended way to go. There is nothing wrong with your option #3, but it is the least readable of the three.
One faster and acceptable alternative you should know about is numpy's where() function:
import numpy as np

df['value'] = np.where(df['selected'], df['value'] * 2, df['value'])
The first argument is the selection or mask, the second is the value to assign if True, and the third is the value to assign if False. It's especially useful if you want to also create or change the value when the selection is False.

Pandas text matching like SQL's LIKE?

Is there a way to do something similar to SQL's LIKE syntax on a pandas text DataFrame column, such that it returns a list of indices, or a list of booleans that can be used for indexing the dataframe? For example, I would like to be able to match all rows where the column starts with 'prefix_', similar to WHERE <col> LIKE 'prefix_%' in SQL.
You can use the Series method str.startswith (which does not take a regex):
In [11]: s = pd.Series(['aa', 'ab', 'ca', np.nan])
In [12]: s.str.startswith('a', na=False)
Out[12]:
0 True
1 True
2 False
3 False
dtype: bool
You can also do the same with str.contains (using a regex):
In [13]: s.str.contains('^a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
dtype: bool
So you can do df[col].str.startswith...
See also the SQL comparison section of the docs.
Note: (as pointed out by the OP) by default NaNs will propagate (and hence cause an indexing error if you want to use the result as a boolean mask); the na=False flag says that NaN should map to False.
In [14]: s.str.startswith('a') # can't use as boolean mask
Out[14]:
0 True
1 True
2 False
3 NaN
dtype: object
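To use the result for indexing the DataFrame, as the question asks, the boolean Series can be passed straight to [] or .loc (a sketch with a made-up column name):
In [15]: df = pd.DataFrame({'col': ['prefix_a', 'other', 'prefix_b']})
In [16]: df[df['col'].str.startswith('prefix_', na=False)]
Out[16]:
        col
0  prefix_a
2  prefix_b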
To find all the values from the series that start with a pattern "s":
SQL - WHERE column_name LIKE 's%'
Python - column_name.str.startswith('s')
To find all the values from the series that end with a pattern "s":
SQL - WHERE column_name LIKE '%s'
Python - column_name.str.endswith('s')
To find all the values from the series that contain the pattern "s":
SQL - WHERE column_name LIKE '%s%'
Python - column_name.str.contains('s')
For more options, check : https://pandas.pydata.org/pandas-docs/stable/reference/series.html
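For a wildcard in the middle of the pattern, an anchored regex can stand in for the % placeholders (my own rough equivalent, since LIKE and regex syntax differ in the details):
SQL - WHERE column_name LIKE 'a%z'
Python - column_name.str.contains('^a.*z$')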
For a case-insensitive match you can use:
s.str.contains('a', case=False)

How to remove rows from a dataframe based on their column values existence in another df?

Given two dataframes A and B, which both have columns 'x' and 'y', how can I efficiently remove all rows in A whose (x, y) pairs appear in B?
I thought about implementing it using a row iterator on A and then per pair checking if it exists in B but I am guessing this is the least efficient way...
I tried using the .isin function as suggested in Filter dataframe rows if value in column is in a set list of values but couldn't make use of it for multiple columns.
Example dataframes:
A = pd.DataFrame([[1, 2], [1, 4], [3, 4], [2, 4]], columns=['x', 'y'])
B = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'y'])
C should contain [1,4] and [2,4] after the operation.
In pandas master (or in future 0.13) isin will also accept DataFrames, but the problem is that it just looks at the values in each column, and not at an exact row combination of the columns.
Taken from @AndyHayden's comment here (https://github.com/pydata/pandas/issues/4421#issuecomment-23052472), a similar approach with set:
In [3]: mask = pd.Series(map(set(B.itertuples(index=False)).__contains__, A.itertuples(index=False)))
In [4]: A[~mask]
Out[4]:
x y
1 1 4
3 2 4
Or a more readable version:
set_B = set(B.itertuples(index=False))
mask = [x not in set_B for x in A.itertuples(index=False)]
The possible advantage of this compared to @Acorbe's answer is that this preserves the index of A and does not remove duplicate rows in A (but that depends on what you want of course).
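Putting the readable version together into the full filter (a small sketch; indexing with a boolean list keeps A's original index):
set_B = set(B.itertuples(index=False))
mask = [row not in set_B for row in A.itertuples(index=False)]
C = A[mask]  # rows (1, 4) and (2, 4) remain, with their original index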
As I said, 0.13's isin will also accept DataFrames. However, I don't think this will solve the issue, because the index also has to be the same:
In [27]: A.isin(B)
Out[27]:
x y
0 True True
1 False True
2 False False
3 False False
You can solve this by converting it to a dict, but then it does not look at the combination of both columns, just at each column separately:
In [28]: A.isin(B.to_dict(outtype='list'))
Out[28]:
x y
0 True True
1 True True
2 True True
3 False True
For those looking for a single-column solution:
new_df = df1[~df1["column_name"].isin(df2["column_name"])]
The ~ is a logical operator for NOT.
So this will create a new dataframe containing the rows of df1 whose values in "column_name" are not found in df2["column_name"].
One option would be to generate two sets, say A_set, B_set, whose elements are the rows of the DataFrames. Hence, the fast set difference operation A_set - B_set can be used.
A_set = set(map(tuple, A.values))  # we need a hashable object before generating a set
B_set = set(map(tuple,B.values))
C_set = A_set - B_set
C_set
{(1, 4), (2, 4)}
C = pd.DataFrame([c for c in C_set], columns=['x','y'])
x y
0 2 4
1 1 4
This procedure involves some preliminary conversion operations, though.

Why can I not change the values in a subset of cells from a DataFrame column?

I am indexing a subset of cells from a DataFrame column and attempting to assign a boolean True to said subset:
df['mycolumn'][df['myothercolumn'] == val][idx: idx + 25] = True
However, when I slice df['mycolumn'][df['myothercolumn'] == val][idx: idx + 25], the initial values are still found. In other words, the changes were not applied!
I'm about to rip my hair out. What am I doing wrong?
Try this:
df.loc[df['myothercolumn']==val, some_column_name] = True
some_column_name should be the name of the column you want to add or change.
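If you also need the positional slice from the original attempt, one way (a sketch, assuming idx is a position within the matching rows and the index labels are unique) is to select the matching index labels first and then assign with .loc:
matching_idx = df.index[df['myothercolumn'] == val][idx: idx + 25]
df.loc[matching_idx, 'mycolumn'] = True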