pandas: concatenate part of the index into a row depending on a condition

I have a table with invoices as columns and, as rows (the index), kinds of repair as well as engine powers.
The engine power entries always contain the symbol '/', so I can identify them by filtering the index.
What I want is a new row containing, for every invoice, a list of the different powers.
For instance, for 'inv123' the new row should contain ['400/HP', '500/kwh'].
So far I have the following code:
from itertools import compress
boolean_filter = DF.index.str.contains('/') & DF['inv123']
indexlist =list(DF.index)
mylist = list(compress(indexlist, boolean_filter))
# or, as a one-liner:
mylist = list(compress(DF.index,DF.index.str.contains('/') & DF['inv123']))
print(mylist)
Result
['400/HP', '500/kwh']
This is the value I have to add at row='concatenate', column='inv123'.
I encounter a number of problems:
a) I am not able to do that in a Pythonic way (without loops).
b) When adding an empty row with:
DF.append(pd.Series(name='concatenate'))
the dtype of the 0s and 1s (integers) changes to float, which makes the code not reusable (the values are no longer usable as booleans).
Any idea how to approach the problem? Even with the snippet above, I would still have to loop over every column.

I came up with this solution:
from itertools import compress
lc = [list(compress(DF.index, DF.index.str.contains('/') & DF.iloc[:, i])) for i in range(len(DF.columns))]
The idea is to compress the list of index labels with the boolean mask of every column (DF.iloc[:, i]).
As a result I obtain a list in which every element is the list of wanted values for one column.
The solution is not at all elegant. It took me a few hours.
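If a more pandas-native route is wanted, one possible sketch (assuming DF holds 0/1 values with invoices as columns and repair types/powers as the index) reshapes to long form and aggregates, which avoids the explicit column loop:

mask = DF.index.str.contains('/')
long_form = (
    DF.loc[mask]
      .astype(bool)
      .rename_axis(index='power', columns='invoice')
      .stack()                      # Series indexed by (power, invoice)
)
powers = (
    long_form[long_form]            # keep only the truthy entries
      .reset_index()
      .groupby('invoice')['power']
      .agg(list)                    # one list of powers per invoice
)
# powers['inv123'] would then be something like ['400/HP', '500/kwh'];
# invoices with no matching power are simply absent (reindex(DF.columns) would add them back as NaN)

Keeping the result in a separate Series also sidesteps the dtype change that appending an empty row to the integer frame caused.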

Related

Nested when in Pyspark

I need to apply many when conditions that take their inputs from a list by index. Is there a way to write this more concisely that produces the same results without affecting runtime efficiency?
Below is the code I am using:
df = df.withColumn('date_match_label', F.when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[0]} matches with {date_cols[3]}")
    .when(F.col(date_cols[0])==F.col(date_cols[4]), f"{date_cols[0]} matches with {date_cols[4]}")
    .when(F.col(date_cols[0])==F.col(date_cols[5]), f"{date_cols[0]} matches with {date_cols[5]}")
    .when(F.col(date_cols[1])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
    .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
    .when(F.col(date_cols[1])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
    .when(F.col(date_cols[2])==F.col(date_cols[3]), f"{date_cols[2]} matches with {date_cols[3]}")
    .when(F.col(date_cols[2])==F.col(date_cols[4]), f"{date_cols[2]} matches with {date_cols[4]}")
    .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[2]} matches with {date_cols[5]}")
    .otherwise('No Match'))
Here date_cols contains six column names. I need to check each of the first three columns against each of the last three and return a comment if there's a match.
The problem with the current approach is that as the size of the list increases, I have to add more and more lines, which makes my code error-prone and ugly. I was wondering if there's a way to do this where I only need to specify the list indices that need to be compared against the other list's elements.
Considering you want to compare the first half of the list (of column names) to the second half, you can build the code expression dynamically, so there is no need to write more error-prone lines each time the list expands.
You can build the expression dynamically with the help of the indices in the following way:
from itertools import product
from pyspark.sql.functions import when, col

cols = date_cols  # the list of column names from the question
n = len(cols)
req = list(range(0, n))
# all (first half, second half) index pairs
res = list(product(req[:n//2], req[n//2:]))

start = '''df.withColumn('date_match_label','''
whens = []
for i, j in res:
    whens.append(f'''when(col(cols[{i}])==col(cols[{j}]), f"{{cols[{i}]}} matches with {{cols[{j}]}}")''')
final_exp = start + '.'.join(whens) + '''.otherwise('No Match'))'''
With 4 columns (comparing the first half with the second half), the generated expression would look roughly like this, line breaks added for readability:
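df.withColumn('date_match_label',
    when(col(cols[0])==col(cols[2]), f"{cols[0]} matches with {cols[2]}")
    .when(col(cols[0])==col(cols[3]), f"{cols[0]} matches with {cols[3]}")
    .when(col(cols[1])==col(cols[2]), f"{cols[1]} matches with {cols[2]}")
    .when(col(cols[1])==col(cols[3]), f"{cols[1]} matches with {cols[3]}")
    .otherwise('No Match'))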
The above is a string expression, so to execute it you can use the eval function and get the result as shown below:
df = eval(final_exp)
df.show(truncate=False)
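If executing a string through eval is undesirable, a sketch of the same idea built directly on Column objects with functools.reduce (again assuming date_cols holds the column names, first half compared against second half) could look like this:

from functools import reduce
from itertools import product
from pyspark.sql import functions as F

n = len(date_cols)
pairs = list(product(range(n // 2), range(n // 2, n)))

def add_when(acc, pair):
    i, j = pair
    # chain one more .when onto the running Column expression
    return acc.when(F.col(date_cols[i]) == F.col(date_cols[j]),
                    f"{date_cols[i]} matches with {date_cols[j]}")

i0, j0 = pairs[0]
expr = reduce(
    add_when,
    pairs[1:],
    F.when(F.col(date_cols[i0]) == F.col(date_cols[j0]),
           f"{date_cols[i0]} matches with {date_cols[j0]}"),
).otherwise('No Match')

df = df.withColumn('date_match_label', expr)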

Pandas Using Series.str.slice Most Efficiently with row varying parameters

My derived column is a substring of another column, but the new string must be extracted at varying positions. In the code below I have done this using a lambda; however, it is slow. Is it possible to achieve the correct result using str.slice, or is there another fast method?
import pandas as pd

df = pd.DataFrame({'st_col1': ['aa-b', 'aaa-b']})
df['index_dash'] = df['st_col1'].str.find('-')
# gives wrong answer at index 1
df['res_wrong'] = df['st_col1'].str.slice(3)
# what I want to do:
df['res_cant_do'] = df['st_col1'].str.slice(df['index_dash'])
# slow solution
# naively invoking the built-in Python string slicing ... aStr[start:]
# ... accessing two columns from every row in turn
df['slow_sol'] = df.apply(lambda x: x['st_col1'][1 + x['index_dash']:], axis=1)
So can this be sped up ideally using str.slice or via another method?
From what I understand, you want to get the value after the '-' in st_col1 and put it in a single column; for that, just use split:
df['slow_sol'] = df['st_col1'].str.split('-').str[-1]
There is no need to identify the index and then slice on that dash position again. This will surely be more efficient than what you are doing, and it cuts out a lot of steps.
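For reference, on the sample frame from the question this looks like the following (after_dash is just an illustrative column name):

import pandas as pd

df = pd.DataFrame({'st_col1': ['aa-b', 'aaa-b']})
# everything after the dash, in a single vectorized step
df['after_dash'] = df['st_col1'].str.split('-').str[-1]
# both rows yield 'b' here, matching the intended aStr[1 + index_dash:] slice

If a value can contain more than one '-', str.split('-', n=1).str[1] matches the original "everything after the first dash" slice more closely.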

Encoding feature array column from training set and applying to test set later

I have input columns that contain arrays of features. A feature is listed if present and absent if not; the order is not guaranteed. For example:
features = pd.DataFrame({"cat_features":[['cuddly','short haired'],['short haired','bitey'],['short haired','orange','fat']]})
This works:
feature_table = pd.get_dummies(features['cat_features'].explode()).add_prefix("cat_features_").groupby(level=0).sum()
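For reference, on this example the dummy table ends up with one indicator column per observed feature, e.g.:

print(feature_table.columns.tolist())
# something like ['cat_features_bitey', 'cat_features_cuddly', 'cat_features_fat',
#                 'cat_features_orange', 'cat_features_short haired']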
Problem:
It's not trivial to ensure the same output columns on my test set when features are missing.
My real dataset has multiple such array columns, but I can't explode them all at once (ValueError: columns must have matching element counts), which requires looping over each array column.
One option is to make a dtype and save it for later ("skinny" is added as an example of something not in our input set):
from pandas.api.types import CategoricalDtype

cat_feature_type = CategoricalDtype(
    [x.replace("cat_features_", "") for x in feature_table.columns.to_list()] + ["skinny"]
)
pd.get_dummies(
    features["cat_features"].explode().astype(cat_feature_type)
).add_prefix("cat_features_").groupby(level=0).sum()
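Applied to a hypothetical test set, the saved dtype keeps the dummy columns aligned with training even when some features never occur in the test rows:

# hypothetical hold-out rows; 'skinny' exists only in the saved dtype
test_features = pd.DataFrame({"cat_features": [['orange'], ['cuddly', 'skinny']]})
test_table = (
    pd.get_dummies(test_features["cat_features"].explode().astype(cat_feature_type))
      .add_prefix("cat_features_")
      .groupby(level=0)
      .sum()
)
# test_table has one column per category of cat_feature_type,
# including the features absent from these test rows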
Is there a smarter way of doing this?

Pandas str slice in combination with Pandas str index

I have a DataFrame containing a single column with a list of file names. I want to find all rows in the DataFrame whose value has a prefix from a set of known prefixes.
I know I could run a simple for loop, but I want to do it with DataFrame operations to check speeds and run benchmarks; it's also a nice exercise.
What I had in mind is combining str.slice with str.index, but I can't get it to work. This is what I have in mind:
import pandas as pd
file_prefixes = {...}
file_df = pd.DataFrame(list_of_file_names)
file_df.loc[file_df.file.str.slice(start=0, stop=file_df.file.str.index('/') - 1).isin(file_prefixes), :]  # this doesn't work, as str.index returns a Series rather than a scalar
My hope is that this code will return all rows whose value starts with a file prefix from the set above.
In summary, I would like help with 2 things:
Combining slice and index
Thoughts about better ways to achieve this
Thanks
I would use startswith:
file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
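A quick sketch with hypothetical prefixes and file names (passing a tuple to str.startswith requires a reasonably recent pandas):

import pandas as pd

file_prefixes = {"logs/", "images/"}
file_df = pd.DataFrame({"file": ["logs/app.txt", "images/cat.png", "docs/readme.md"]})

matches = file_df.loc[file_df.file.str.startswith(tuple(file_prefixes)), :]
# keeps the 'logs/...' and 'images/...' rows and drops 'docs/readme.md'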

Processing pandas data in declarative style

I have a pandas dataframe of vehicle coordinates (from multiple vehicles on multiple days). For each vehicle and for each day, I do one of two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written out in a declarative style, as opposed to imperatively looping through the groups, with the goal of the whole thing looking something like:
df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect since .apply() and .filter() return new dataframes, and this is exactly my problem: they return all the data back in a single dataframe, and I find that I have to apply .groupby(['vehicle_id', 'day']) continuously.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = dfg.do_stuff1()  # Perform all needed operations
    dfg = do_stuff2(dfg)
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then map a single groupby/apply to it:
def all_operations(dfg):
    # Do stuff
    return result_df

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
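A sketch of the second option, with hypothetical algorithm1/algorithm2 and condition1/condition2 standing in for the real steps, including a guard for groups that get filtered out:

import pandas as pd

def all_operations(dfg):
    # filter step 1: drop the group entirely if it fails the criterion
    if not condition1(dfg):
        return pd.DataFrame(columns=dfg.columns)
    dfg = algorithm1(dfg)            # apply step 1
    # filter step 2
    if not condition2(dfg):
        return pd.DataFrame(columns=dfg.columns)
    return algorithm2(dfg)           # apply step 2

result = df.groupby(['vehicle_id', 'day']).apply(all_operations)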