Pandas: rewriting a function without calling apply

I need to create a derived field in a pandas dataframe based on the value in another field of the same dataframe. Like so:
def newfield(row):
    if row.col1 == 'x':
        return 'value is x'
    elif row.col1 == 'y':
        return 'value is y'
Then I call it:
df['newfield'] = df.apply(lambda row: newfield(row), axis=1)
Is there a way to do it without apply? I would also like to make it less verbose. np.where only handles a single condition with two outcomes, but I have more than two cases.

Yes, you can use np.select:
import numpy as np

df['newfield'] = np.select(
    [df['col1'] == 'x', df['col1'] == 'y'],  # conditions
    ['value is x', 'value is y'],            # corresponding values
    default=np.nan)                          # used when no condition matches
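If the logic is a plain one-to-one lookup, a Series.map sketch (assuming the same column and labels as above) is even shorter; values not covered by the mapping come back as NaN:
df['newfield'] = df['col1'].map({'x': 'value is x', 'y': 'value is y'})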

Related

Filling NaNs using an apply lambda function to work with a Dask DataFrame

I am trying to figure out how to fill a column with a label depending on which of several other columns is non-null, falling back to the next column when one is null, like so:
df['NewCol'] = df.apply(
    lambda row: 'Three' if row['VersionThree'] == row['VersionThree']
    else ('Two' if row['VersionTwo'] == row['VersionTwo']
          else ('Test' if row['VS'] == row['VS'] else '')),
    axis=1)
The function works as it should, but I am now trying to figure out how to get it to run when I read my dataset in as a Dask DataFrame.
I tried to vectorize it and see if I could use numpy's where with it, like so:
df['NewCol'] = np.where((df['VersionThree'] == df['VersionThree']), ['Three'],
               np.where((df['VersionTwo'] == df['VersionTwo']), ['Two'],
               np.where((df['VS'] == df['VS']), ['Test'], np.nan)))
But it does not run and crashes. I would like it to go through every row and check those three columns: if one of them is non-null, write the corresponding label to NewCol; if it is null, check the next column in the chain; and if all three are null, place np.nan in that cell.
I am reading the data as a Dask DataFrame.
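One way this could be vectorized for Dask is to apply np.select to each pandas partition via map_partitions; a minimal sketch, assuming the column names from the question and that np.nan is the wanted value for all-null rows (the input file name is hypothetical):
import numpy as np
import pandas as pd
import dask.dataframe as dd

def assign_newcol(pdf: pd.DataFrame) -> pd.Series:
    # runs on each pandas partition: pick the label of the first non-null column
    return pd.Series(
        np.select(
            [pdf['VersionThree'].notna(), pdf['VersionTwo'].notna(), pdf['VS'].notna()],
            ['Three', 'Two', 'Test'],
            default=np.nan),
        index=pdf.index)

ddf = dd.read_csv('data.csv')  # hypothetical input file
ddf['NewCol'] = ddf.map_partitions(assign_newcol, meta=('NewCol', 'object'))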

Select single row with DataFrame.loc[] without index

Assuming I have a DataFrame:
import pandas as pd
df = pd.DataFrame([["house", "red"], ["car", "blue"]], columns=["item", "colour"])
What is the idiomatic way to return a single row, or raise an exception if exactly one row is not found, using DataFrame.loc:
match = df.loc[df["colour"] == "red"]
# I know there is exactly one row in the resulting DataFrame
# TODO: How to make match return that single row as a pd.Series
This would be similar to SQLAlchemy's Query.one()
You can use squeeze() and assert that the output is a Series (or use an explicit check if you don't want to raise an exception):
match = df.loc[df["colour"] == "red"].squeeze()
assert isinstance(match, pd.Series)
Alternative to assert:
if isinstance(match, pd.Series):
    ...  # do something
else:
    ...  # do something else
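If you want the Query.one()-style behaviour of always raising when the match is not exactly one row, a small helper along these lines is a possible sketch (the name one_row is made up for the example):
def one_row(frame: pd.DataFrame) -> pd.Series:
    # raise unless the filter matched exactly one row
    if len(frame) != 1:
        raise ValueError(f"expected exactly one row, got {len(frame)}")
    return frame.iloc[0]

match = one_row(df.loc[df["colour"] == "red"])  # pd.Series for the "red" row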

Why pandas does not want to subset given columns in a list

I'm trying to remove certain values with the code below, however pandas does not let me and instead outputs:
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?','Nan''n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
Try:
import re

pattern = '|'.join(map(re.escape, prohibited_symbols))  # escape '?' so it is not treated as a regex operator
df = df[~df[columns_df].apply(lambda col: col.astype(str).str.contains(pattern)).any(axis=1)]
The regex alternation '|' builds a single pattern that matches any of your prohibited symbols, and rows containing any of them are dropped.
Because what you are trying is not doing what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
The line above will always return False values for everything. You can't compare against a list of prohibited symbols like that: != only does a simple inequality check, and none of your cells will ever be equal to the whole list. That syntax also will not delete those values from your cells.
You'll have to use a for loop and clean every column like this.
import re

for column in columns_df:
    df[column] = df[column].str.replace('|'.join(map(re.escape, prohibited_symbols)), '', regex=True)
You can also specify the values you consider null with the na_values argument when reading the data, and then use dropna from pandas.
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?', 'Nan', 'n.a'])
df = df.dropna()
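A related sketch if re-reading the file is not convenient (it assumes the same columns_df and the intended three-element prohibited_symbols list): mark the prohibited values as NaN with isin/mask, then drop those rows:
prohibited_symbols = ['?', 'Nan', 'n.a']
df[columns_df] = df[columns_df].mask(df[columns_df].isin(prohibited_symbols))  # prohibited values become NaN
df = df.dropna(subset=columns_df)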

How to drop a pandas column based on number of values in it?

Turns out that when trying to drop a column with categorical data (0s and 1s) I cannot get the desired result. I have tried several procedures but they all yield the same result: the dataframe itself with all columns.
df1.drop([i for i in df1 if df1[i].nunique == 2], axis = 1, inplace = True)
That's one way I tried. Another one is as follows:
df1.drop(df.columns[df.apply(lambda col: col.nunique == 2)], axis = 1)
Can anyone help? Thanks
One approach could be to get all the columns which are boolean and drop them as below. This will work if the data type of each column is correctly classified; adjust the .dtypes check as appropriate.
bool_cols = []
for col in df:
    if df[col].dtypes == "bool":
        bool_cols.append(col)
df = df.drop(columns=bool_cols)
Your first attempt is almost right: you just need to add () to df1[i].nunique so that it becomes:
df1.drop([i for i in df1 if df1[i].nunique() == 2], axis=1, inplace=True)
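An equivalent vectorized sketch (under the same assumption that the 0/1 columns are exactly those with two unique values):
df1 = df1.loc[:, df1.nunique() != 2]  # keep only the columns that do not have exactly two unique values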

Pandas null check with apply function

I am using pandas 'apply' function like this:
df['Geog'] = df.apply(lambda row: flagCntry(row, 'country'), axis=1)

def flagCntry(row, colName):
    if (row[colName] == 'US' or row[colName] == 'Canada'):
        return 'North America'
    elif (row[colName] == null):  # **DOES NOT work!!**
        return 'Other'
How do I perform a null check within the function? The syntax above does not work.
You might want to consider using pandas built in functions to perform your check.
df['Geog'] = np.nan
df.loc[df.country.isin(['US','Canada']),'Geog'] = 'North America'
df.loc[df.country.isnull(),'Geog'] = 'Other'
Otherwise you can also map a dictionary:
my_dict = {np.nan:'Other','US':'North America','Canada':'North America'}
df['Geog'] = df.country.map(my_dict)
EDIT:
If you want to use the apply syntax, you can still use the dictionary:
df['Geog'] = df.country.apply(lambda x : my_dict[x])
And if you want to use your custom function, one way to check if an element is null is to check whether it's different from itself:
def flagCntry(row, colName):
    if row[colName] == 'US' or row[colName] == 'Canada':
        return 'North America'
    elif row[colName] != row[colName]:  # NaN is the only value that is not equal to itself
        return 'Other'

df['Geog'] = df.apply(lambda row: flagCntry(row, 'country'), axis=1)
And if you want to match None values but not np.nan you can use row[colName] == None instead of row[colName] != row[colName].
Change your (row[colName] == null) to:
np.isnan(row[colName])
uh... if I understand correctly, null is C/Java syntax. You might be looking for None.
In pandas more generally, pd.isnull should work out for you.
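For instance, a sketch of the original function using pd.isnull, which catches both np.nan and None (column name 'country' as in the question):
import pandas as pd

def flagCntry(row, colName):
    if row[colName] in ('US', 'Canada'):
        return 'North America'
    elif pd.isnull(row[colName]):  # True for np.nan and None
        return 'Other'

df['Geog'] = df.apply(lambda row: flagCntry(row, 'country'), axis=1)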