Nested when in PySpark SQL

I need to apply many when conditions whose inputs come from a list, selected by index. Is there a way to write this more concisely so it produces the same results without hurting runtime efficiency?
Below is the code I am using:
df = df.withColumn('date_match_label', F.when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[0]} matches with {date_cols[3]}")
       .when(F.col(date_cols[0])==F.col(date_cols[4]), f"{date_cols[0]} matches with {date_cols[4]}")
       .when(F.col(date_cols[0])==F.col(date_cols[5]), f"{date_cols[0]} matches with {date_cols[5]}")
       .when(F.col(date_cols[1])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
       .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
       .when(F.col(date_cols[1])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
       .when(F.col(date_cols[2])==F.col(date_cols[3]), f"{date_cols[2]} matches with {date_cols[3]}")
       .when(F.col(date_cols[2])==F.col(date_cols[4]), f"{date_cols[2]} matches with {date_cols[4]}")
       .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[2]} matches with {date_cols[5]}")
       .otherwise('No Match'))
Here date_cols contains six column names. I need to check each of the first three columns against the last three columns and return a comment when there's a match.
The problem with the current approach is that as the size of the list grows, I have to add more and more lines, which makes my code error-prone and ugly. Is there a way to do this where I only need to specify the list indices that should be compared against the other list elements?

Since you want to compare the first half of the list of column names against the second half, you can build the expression dynamically, so there is no need to write more error-prone code each time the list grows.
You can construct the expression from the indices as follows:
from itertools import product
from pyspark.sql.functions import when, col

cols = date_cols                                 # the list of column names to compare
n = len(cols)
idx = list(range(n))
pairs = list(product(idx[:n//2], idx[n//2:]))    # every first-half index paired with every second-half index

start = '''df.withColumn('date_match_label', '''
whens = []
for i, j in pairs:
    # double braces so the generated code is itself an f-string that resolves the column names
    whens.append(f'''when(col(cols[{i}])==col(cols[{j}]), f"{{cols[{i}]}} matches with {{cols[{j}]}}")''')
final_exp = start + '.'.join(whens) + '''.otherwise('No Match'))'''
This dynamically builds the final expression as a string. For example, with 4 columns (comparing the first half with the second half), final_exp would look like the string sketched below.
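A sketch of what final_exp would contain in that 4-column case (wrapped over several lines here for readability; the generated string itself is a single line):
df.withColumn('date_match_label', when(col(cols[0])==col(cols[2]), f"{cols[0]} matches with {cols[2]}")
    .when(col(cols[0])==col(cols[3]), f"{cols[0]} matches with {cols[3]}")
    .when(col(cols[1])==col(cols[2]), f"{cols[1]} matches with {cols[2]}")
    .when(col(cols[1])==col(cols[3]), f"{cols[1]} matches with {cols[3]}")
    .otherwise('No Match'))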
Since final_exp is just a string, you can execute it with the eval function and get the results:
df = eval(final_exp)
df.show(truncate=False)
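If you would rather avoid eval, the same chain of conditions can also be folded into a single Column expression with functools.reduce. This is an alternative sketch, not part of the original answer, and it assumes date_cols holds the six column names:
from functools import reduce
from itertools import product
from pyspark.sql import functions as F

n = len(date_cols)
pairs = list(product(range(n // 2), range(n // 2, n)))

# Fold the pairs in reverse so the first pair ends up as the outermost when,
# giving the same precedence as a hand-written when(...).when(...).otherwise(...)
expr = reduce(
    lambda acc, ij: F.when(
        F.col(date_cols[ij[0]]) == F.col(date_cols[ij[1]]),
        f"{date_cols[ij[0]]} matches with {date_cols[ij[1]]}",
    ).otherwise(acc),
    reversed(pairs),
    F.lit('No Match'),
)
df = df.withColumn('date_match_label', expr)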

Related

Pandas Using Series.str.slice Most Efficiently with row varying parameters

My derived column is a substring of another column but the new string must be extracted at varying positions. In the code below I have done this using a lambda. However, this is slow. Is it possible to achieve the correct result using str.slice or is there another fast method?
import pandas as pd

df = pd.DataFrame({'st_col1': ['aa-b', 'aaa-b']})
df['index_dash'] = df['st_col1'].str.find('-')
# gives the wrong answer at index 1
df['res_wrong'] = df['st_col1'].str.slice(3)
# what I want to do (str.slice does not accept a per-row start):
df['res_cant_do'] = df['st_col1'].str.slice(df['index_dash'])
# slow solution:
# naively invoking built-in python string slicing ... aStr[start:]
# ... accessing two columns from every row in turn
df['slow_sol'] = df.apply(lambda x: x['st_col1'][1 + x['index_dash']:], axis=1)
So can this be sped up ideally using str.slice or via another method?
From what I understand, you want to get the value after the "-" in st_col1 and put it into a single column. For that, just use split:
df['slow_sol'] = df['st_col1'].str.split('-').str[-1]
There is no need to find the index and then slice on it again; this will be more efficient than what you are doing and cuts out a lot of steps.
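A quick check of the split approach on the example frame from the question:
import pandas as pd

df = pd.DataFrame({'st_col1': ['aa-b', 'aaa-b']})
# take the part after the last '-' in each row, using vectorised string ops
df['after_dash'] = df['st_col1'].str.split('-').str[-1]
print(df['after_dash'].tolist())  # ['b', 'b']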

Error when filtering pandas dataframe by column value

I am having a problem with filtering a pandas dataframe. I am trying to filter a dataframe based on column values being equal to a specific list but I am getting a length error.
I tried every possible way of filtering a dataframe but got nowhere. Any help would be appreciated, thanks in advance.
Here is my code:
for ind in df_hourly.index:
    timeslot = df_hourly['date_parsed'][ind][0:4]  # list value to filter by
    filtered_df = df.loc[df['timeslot'] == timeslot]
Error : ValueError: ('Lengths must match to compare', (5696,), (4,))
The question included two screenshots (not reproduced here): the first shows df, the dataframe I want to filter, specifically on its "timeslot" column; the second shows df_hourly, whose "date_parsed" column contains the value I want to filter by. In my code I iterate through every row of df_hourly, assign the first four elements of the list in df_hourly['date_parsed'] to a variable, and then try to filter df by that variable.
When comparing with ==, pandas compares value by value: the first item against the first, the second against the second, and so on. That is why you get this error: pandas expects both sides to have the same shape.
If you want to check whether a value is inside a list, you can use .isin (documentation):
df.loc[df['timeslot'].isin(timeslot)]
Depending on what timeslot is exactly, you might need to use timeslot.values or something similar (it is hard to say without an example of your dataframe).
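A minimal sketch of the difference, with made-up values standing in for the screenshots:
import pandas as pd

df = pd.DataFrame({'timeslot': [1, 2, 3, 4, 2]})
timeslot = [1, 2, 3, 4]   # list-like value, as taken from df_hourly['date_parsed'][ind][0:4]

# df['timeslot'] == timeslot would raise ValueError: lengths (5,) and (4,) do not match
filtered_df = df.loc[df['timeslot'].isin(timeslot)]   # keeps rows whose timeslot appears in the list
print(filtered_df)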

How to assign Pandas.Series.str.extractall() result back to original dataset? (TypeError: incompatible index of inserted column with frame index)

Dataset brief overview
dete_resignations['cease_date'].head() and dete_resignations['cease_date'].value_counts() were shown as screenshots (omitted here); the column holds date strings such as '05/2012', though not every row follows that format.
What I tried
I was trying to extract only the year value (e.g. 05/2012 -> 2012) from dete_resignations['cease_date'] using pandas.Series.str.extractall() and assign the result back to the original dataframe. However, since not all rows contain a string in that format (e.g. 05/2012), an error occurred.
Here is the code I wrote:
pattern = r"(?P<month>[0-1][0-9])/?(?P<year>[0-2][0-9]{3})"
years = dete_resignations['cease_date'].str.extractall(pattern)
dete_resignations['cease_date_'] = years['year']
This raised: TypeError: incompatible index of inserted column with frame index
I thought years shared the same index as dete_resignations['cease_date'], so even though the two indexes are not identical, I expected pandas to automatically align the values and assign them to the right rows. But it didn't.
Can anyone help solve this issue?
Much appreciated if someone can enlighten me!
str.extractall returns one row per match, indexed by a MultiIndex (the original index plus a match level), which is why assigning it back fails. If you only want the years, don't capture the month in the pattern and use extract instead of extractall, which keeps the original index:
# the $ anchors the match at the end of the string
# \d is equivalent to [0-9]
# the pattern captures the trailing group of digits
pattern = r'(?P<year>\d+)$'
years = dete_resignations['cease_date'].str.extract(pattern)
dete_resignations['cease_date_'] = years['year']
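A minimal sketch with made-up values in place of the real cease_date column, showing why extract works where extractall does not:
import pandas as pd

s = pd.Series(['05/2012', '2013', None, 'Not Stated'])

# str.extractall returns one row per match with a (row, match) MultiIndex,
# so it cannot be assigned straight back; str.extract keeps the original index
years = s.str.extract(r'(?P<year>\d+)$')
print(years['year'].tolist())   # ['2012', '2013', nan, nan]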

pandas: concatenate part of the index into a row, subject to a condition

I have a table whose columns are invoices and whose index (rows) mixes kinds of repair with engine powers. The engine-power entries always contain the symbol '/', so they can be picked out by filtering the index.
What I want is a new row containing, for every invoice, a list of the different powers.
For instance, for 'inv123' the new row should contain ['400/HP', '500/kwh'].
So far I have the following code:
from itertools import compress

boolean_filter = DF.index.str.contains('/') & DF['inv123']
indexlist = list(DF.index)
mylist = list(compress(indexlist, boolean_filter))
# or generate it in one line
mylist = list(compress(DF.index, DF.index.str.contains('/') & DF['inv123']))
print(mylist)
Result
['400/HP', '500/kwh']
This is the value I have to add at row='concatenate', column='inv123'.
I encounter a number of problems:
a) I am not able to do this in a pythonic way (without explicit loops).
b) When adding an empty row with:
DF.append(pd.Series(name='concatenate'))
the dtype of the 0s and 1s (integers) changes to float, which makes the code not reusable (the values are no longer usable as booleans).
Any idea how to approach the problem? Even with the code above I would still have to loop over every column.
I came up with this solution:
from itertools import compress

lc = [list(compress(DF.index, DF.index.str.contains('/') & DF.iloc[:, i]))
      for i in range(len(DF.columns))]
It compresses the list of index labels against the boolean mask of every column (DF.iloc[:, i]).
As a result I obtain a list in which every element is a list of the wanted values for one column.
The solution is not at all elegant; it took me a few hours.
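One possible way to attach the result as the desired 'concatenate' row (a sketch with a made-up frame; the real DF holds 0/1 flags per invoice column):
import pandas as pd
from itertools import compress

# toy frame standing in for the one in the question
DF = pd.DataFrame(
    {'inv123': [1, 0, 1, 1], 'inv124': [0, 1, 1, 0]},
    index=['paint', 'oil change', '400/HP', '500/kwh'],
)

mask = DF.index.str.contains('/')
lc = [list(compress(DF.index, mask & DF.iloc[:, i])) for i in range(len(DF.columns))]

# keep the lists in their own one-row frame so DF itself keeps its integer dtypes;
# the concatenated view has object columns because each cell holds a list
concat_row = pd.DataFrame([lc], columns=DF.columns, index=['concatenate'])
result = pd.concat([DF, concat_row])
print(result.loc['concatenate', 'inv123'])   # ['400/HP', '500/kwh']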

Processing pandas data in declarative style

I have a pandas dataframe of vehicle co-ordinates (from multiple vehicles on multiple days). For each vehicle and each day, I do one of two things: apply an algorithm to its data, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions that take a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written in a declarative style, as opposed to imperatively looping through the groups, so that the whole thing looks something like:
df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect, since .apply() and .filter() return new dataframes, and this is exactly my problem: they return all the data in a single dataframe, and I find that I have to apply .groupby(['vehicle_id', 'day']) over and over.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = dfg.do_stuff1()  # perform all needed operations ...
    dfg = do_stuff2(dfg)   # ... one after another
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then map a single groupby/apply to it:
def all_operations(dfg):
    # do stuff: run all the applies and filters on this group
    return result_df

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
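A sketch of the second option, with hypothetical stand-ins for the algorithms and conditions (algorithm1 and condition1 are placeholder names, not from the original post, and df is assumed to have vehicle_id and day columns):
import pandas as pd

def algorithm1(dfg: pd.DataFrame) -> pd.DataFrame:
    # placeholder: e.g. smooth the coordinates of one vehicle/day group
    return dfg

def condition1(dfg: pd.DataFrame) -> bool:
    # placeholder: e.g. keep only groups with enough points
    return len(dfg) >= 10

def all_operations(dfg: pd.DataFrame) -> pd.DataFrame:
    dfg = algorithm1(dfg)
    if not condition1(dfg):
        # mimic .filter by dropping the group entirely
        return pd.DataFrame(columns=dfg.columns)
    return dfg

result = (
    df.groupby(['vehicle_id', 'day'], group_keys=False)
      .apply(all_operations)
)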