Create a new column after if-else in dask - pandas

df['new_col'] = np.where(df['col1'] == df['col2'], True, False), where col1 and col2 are both str data types, seems pretty straightforward. What is the more efficient way to create a column in dask after an if-else condition? I tried the recommendation from Create an if-else condition column in dask dataframe, but it is taking forever. It has only processed about 30% after about an hour. I have 13 million rows and 70 columns.

IIUC, if you need to set the column to a boolean:
df['new_col'] = df['col1'] == df['col2']
If you need to set it to other values:
df['new_col'] = 'val for true'
ddf = df.assign(new_col=df.new_col.where(cond=df['col1'] == df['col2'], other='val for false'))
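A minimal sketch of the same idea on a Dask DataFrame (the toy data and the dd.from_pandas setup are assumed here, not taken from the question); the comparison is vectorized and lazy, so no row-wise apply is needed:

import pandas as pd
import dask.dataframe as dd

# Assumed toy data standing in for the 13-million-row frame
pdf = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['a', 'x', 'c']})
ddf = dd.from_pandas(pdf, npartitions=2)

# Boolean column
ddf['new_col'] = ddf['col1'] == ddf['col2']

# Custom labels instead of booleans
ddf['label'] = 'val for true'
ddf['label'] = ddf['label'].where(ddf['col1'] == ddf['col2'], other='val for false')

print(ddf.compute())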

Related

Filling NaNs using apply lambda function to work with DASK dataframe

I am trying to figure out how to fill a column with a label depending on which of several other columns is non-null, falling back to the next column when one is null, like so:
df['NewCol'] = df.apply(lambda row: 'Three' if row['VersionThree'] == row['VersionThree']
                        else ('Two' if row['VersionTwo'] == row['VersionTwo']
                              else ('Test' if row['VS'] == row['VS'] else '')), axis=1)
So the function works as it should, but I am now trying to figure out how to get it to run when I read my dataset in as a Dask DataFrame.
I tried to vectorize it and see if I could use numpy.where with it, like so:
df['NewCol'] = np.where((df['VersionThree'] == df['VersionThree']), ['Three'],
               np.where((df['VersionTwo'] == df['VersionTwo']), ['Two'],
               np.where((df['VS'] == df['VS']), ['Test'], np.nan)))
But it does not run; it crashes. I would like the logic to check those 3 columns for every row: if the first one is non-null, write its label to NewCol; if it is null, check the next column in the chain; and if all of them are null, place np.nan in that cell.
I am trying to use a Dask DataFrame
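No answer is quoted here, but a minimal sketch of one Dask-friendly approach (an assumption, not from the thread) is to build the column with chained mask calls applied in reverse priority order, which stays lazy and avoids apply(axis=1); the column names follow the question and the toy data is assumed:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Assumed toy data: each row has at most one non-null version column
pdf = pd.DataFrame({
    'VersionThree': [1.0, np.nan, np.nan, np.nan],
    'VersionTwo':   [np.nan, 2.0, np.nan, np.nan],
    'VS':           [np.nan, np.nan, 3.0, np.nan],
})
df = dd.from_pandas(pdf, npartitions=2)

# Start with a default, then overwrite in reverse priority order so that
# VersionThree wins over VersionTwo, which wins over VS
df['NewCol'] = ''  # default when all three are null (the question wanted np.nan; '' keeps the dtype simple)
df['NewCol'] = df['NewCol'].mask(df['VS'].notnull(), 'Test')
df['NewCol'] = df['NewCol'].mask(df['VersionTwo'].notnull(), 'Two')
df['NewCol'] = df['NewCol'].mask(df['VersionThree'].notnull(), 'Three')

print(df.compute())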

replacing values in pandas df column with values from another of same row if condition is met

I've got a df that I'm trying to clean. In instances where the value in col1 is 0, I want to replace the 0 with the corresponding value from col2. What's the easiest way to do this?
Edit: I got it, never mind.
One way is to use:
df['col1'] = df['col1'].replace(0, np.nan).fillna(df['col2'])
Another way with mask:
df['col1'] = df['col1'].mask(df['col1'] == 0, other=df['col2'])
Or the opposite with where:
df['col1'] = df['col1'].where(df['col1'] != 0, other=df['col2'])
And with loc:
df.loc[df['col1'].eq(0), 'col1'] = df['col2']
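A quick usage demo with assumed sample data (not from the question), using the mask variant; the other forms give the same result:

import pandas as pd

df = pd.DataFrame({'col1': [5, 0, 7, 0], 'col2': [10, 20, 30, 40]})
df['col1'] = df['col1'].mask(df['col1'] == 0, other=df['col2'])
print(df['col1'].tolist())  # [5, 20, 7, 40]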

How to create a true-for-all index on a pandas dataframe?

I am using pandas and have run into a few occasions where I have a programmatically generated list of conditionals, like so
conditionals = [
    df['someColumn'] == 'someValue',
    df['someOtherCol'] == 'someOtherValue',
    df['someThirdCol'].isin(['foo', 'bar', 'baz']),
]
and I want to select rows where ALL of these conditions are true. I figure I'd do something like this.
bigConditional = IHaveNoIdeaOfWhatToPutHere
for conditional in conditionals:
    bigConditional = bigConditional && conditional
filteredDf = df[bigConditional]
I know that I WANT to use the identity property, whereby bigConditional is initialized to a Series of True for every index in my dataframe, so that if any condition in my conditionals list evaluates to False that row won't be in the filtered dataframe, but initially every row is considered.
I don't know how to do that, or at least not the best, most succinct way that shows it's intentional.
Also, I've run into the inverse scenario, where I only need one of the conditionals to match to include a row in the new dataframe, so I would need bigConditional to be set to False for every index in the dataframe.
What about summing the conditions and checking whether the total equals the number of conditions:
filteredDf = df.loc[sum(conditionals)==len(conditionals)]
Or, even simpler, with np.all:
filteredDf = df.loc[np.all(conditionals, axis=0)]
Otherwise, for your original question, you can create a Series of True values indexed like df, and your for loop will work (using &, pandas' element-wise AND operator, in place of &&, which is not valid Python).
bigConditional = pd.Series(True, index=df.index)
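A short sketch of the full loop under that assumption, using & for the all-conditions case and | with an all-False start for the any-condition case mentioned above (the | variant is added here for illustration and is not part of the original answer; df and conditionals are the objects from the question, with pandas imported as pd):

# ALL conditions must hold: start from the identity for AND (all True)
bigConditional = pd.Series(True, index=df.index)
for conditional in conditionals:
    bigConditional = bigConditional & conditional
filteredDf = df[bigConditional]

# ANY condition may hold: start from the identity for OR (all False)
anyConditional = pd.Series(False, index=df.index)
for conditional in conditionals:
    anyConditional = anyConditional | conditional
filteredAnyDf = df[anyConditional]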
Maybe you can use query and generate your conditions like this:
conditionals = [
    "someColumn == 'someValue'",
    "someOtherCol == 'someOtherValue'",
    "someThirdCol.isin(['foo', 'bar', 'baz'])",
]
qs = ' & '.join(conditionals)
out = df.query(qs)
Or use eval to create boolean values instead of filtering your dataframe:
mask = df.eval(qs)
Demo
Suppose this dataframe:
>>> df
     someColumn    someOtherCol  someThirdCol
0     someValue  someOtherValue           foo
1     someValue  someOtherValue           baz
2     someValue    anotherValue  anotherValue
3  anotherValue    anotherValue  anotherValue
>>> df.query(qs)
  someColumn    someOtherCol someThirdCol
0  someValue  someOtherValue          foo
1  someValue  someOtherValue          baz
>>> df.eval(qs)
0 True
1 True
2 False
3 False
dtype: bool
You can even use f-strings or another template language to pass variables to your condition list.

Pyspark - Joins: duplicate columns

I have 3 dataframes.
Each of them has columns like the ones shown below:
I am using below code to join them:
cond = [df1.col8_S1 == df2.col8_S1, df1.col8_S2 == df2.col8_S2]
df = df1.join(df2,cond,how ='inner').drop('df1.col8_S1','df1.col8_S2')
cond = [df.col8_S1 == df3.col8_S1, df.col8_S2 == df3.col8_S2]
df4 = df.join(df3,cond,how ='inner').drop('df3.col8_S1','df3.col8_S2')
I am writing the dataframe out to a CSV file; however, since they have the same columns from col1 to col7, the write fails due to duplicate columns. How do I drop the duplicate columns without specifying their names?
Just use the column names for the join instead of explicitly using the equality condition; that way the join keys are not duplicated in the result.
cond = ['col8_S1', 'col8_S2']
df = df1.join(df2, cond, how ='inner')
cond = ['col8_S1', 'col8_S2']
df4 = df.join(df3, cond, how ='inner')
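A minimal sketch with assumed toy data (the non-key column names left_val and right_val are made up for illustration) showing that joining on a list of column names keeps a single copy of the join keys:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 2, "a")], ["col8_S1", "col8_S2", "left_val"])
df2 = spark.createDataFrame([(1, 2, "b")], ["col8_S1", "col8_S2", "right_val"])

joined = df1.join(df2, ["col8_S1", "col8_S2"], how="inner")
print(joined.columns)  # ['col8_S1', 'col8_S2', 'left_val', 'right_val'] -- join keys appear once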

Pandas dynamic column creation

I am attempting to dynamically create a new column based on the values of another column.
Say I have the following dataframe
A|B
11|1
22|0
33|1
44|1
55|0
I want to create a new column.
If the value of column B is 1, insert 'Y'; otherwise insert 'N'.
The resulting dataframe should look like so:
A|B|C
11|1|Y
22|0|N
33|1|Y
44|1|Y
55|0|N
I could do this by iterating through the column values,
values = []
for i in dataframe['B'].values:
    if i == 1:
        values.append('Y')  # add Y
    else:
        values.append('N')  # add N
dataframe['C'] = pd.Series(values, index=dataframe.index)
However, I am afraid this will severely reduce performance, especially since my dataset contains 500,000+ rows.
Any help will be greatly appreciated.
Thank you.
Avoid chained indexing by using loc. There are some subtleties with returning a view versus a copy in pandas that are related to numpy.
df['C'] = 'N'
df.loc[df.B == 1, 'C'] = 'Y'
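A quick usage demo on the sample data from the question (the DataFrame constructor line is assumed here):

import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33, 44, 55], 'B': [1, 0, 1, 1, 0]})
df['C'] = 'N'
df.loc[df.B == 1, 'C'] = 'Y'
print(df)
#     A  B  C
# 0  11  1  Y
# 1  22  0  N
# 2  33  1  Y
# 3  44  1  Y
# 4  55  0  N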
Try this:
df['C'] = 'N'
df['C'][df['B']==1] = 'Y'
should be faster.