Pandas - Apply function and generate more than one row with lambda function - pandas

This apply function works, but I don't think it's efficient:
xyz = data.apply(lambda row: pd.Series({"z":getNVC(row)[0],"y":getNVC(row)[1],"x":getNVC(row)[2]}),axis=1)
So I basically want to apply the getNVC function once per row and have it return an np.array with 3 elements. I then map these 3 elements to the new columns x, y and z. However, I think that at the moment I am calling the function 3 times per row?
Ideally I would like to call it just once, save the output in a variable, say output, and unpack the three elements into the columns. The assignment would probably be something like:
pd.Series({"z":output[0],"y":output[1],"x":output[2]})

Looking only at how the dict (the input to Series) is built, the following calls getNVC just once per row and may work:
pd.Series( dict(zip("zyx", getNVC(row))) )
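For completeness, here is a minimal runnable sketch of that suggestion inside the full apply call. The body of getNVC and the columns a and b are hypothetical stand-ins, since the real function is not shown in the question; the only assumption carried over is that it returns a 3-element array.

```python
import numpy as np
import pandas as pd

def getNVC(row):
    # Hypothetical stand-in for the real getNVC: returns a 3-element array.
    return np.array([row["a"] + row["b"], row["a"] - row["b"], row["a"] * row["b"]])

# Made-up input data, only for illustration.
data = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# getNVC appears once in the lambda; zip pairs its three outputs with
# the column names "z", "y", "x" in that order.
xyz = data.apply(lambda row: pd.Series(dict(zip("zyx", getNVC(row)))), axis=1)
print(xyz)
```

The order of the letters in "zyx" decides which element goes to which column, so it has to match the order in which getNVC returns its values.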


Pyspark: Filter DF based on columns, then run every subset DF through a function

I am new to Pyspark and am a bit confused on how to think of the problem.
I have a large dataframe and I would like to filter down every subset of that dataframe based on two columns and run it through the same algorithm.
Here is an example of how I run it (extremely inefficiently) now:
for letter in ['a', 'b', 'c']:
    for number in [1, 2, 3]:
        filtered_DF_1, filtered_DF_2 = filter_func(DF_1, DF_2, letter, number)
        process_function(filtered_DF_1, filtered_DF_2)
Basic filter function:
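from pyspark.sql import functions as F

def filter_func(DF_1, DF_2, letter, number):
    # Keep only the rows of each dataframe that match the given letter and number
    DF_1 = DF_1.filter(
        (F.col("Letter") == letter) &
        (F.col('Number') == number)
    )
    DF_2 = DF_2.filter(
        (F.col("Letter") == letter) &
        (F.col('Number') == number)
    )
    return DF_1, DF_2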
Since this is Pyspark, I would like to parallelize it, since each iteration of the function can run independently.
Do I need to do some sort of mapping to get all my data subsets?
And then do I need to do anything to the process_function to make it available to all nodes as well to run and return an answer?
What is the best way to do this?
EDIT:
The process_function takes the filtered dataset and runs it through about 7 different functions that are already written in about 300 lines of PySpark. The end goal is to return a list of timestamps that are overbooked, based on a bunch of complicated logic.
I think my plan is to build a dictionary of letter --> [number], then explode that list to get every combination and create a dataset from that. Then I would map over it and hopefully be able to create a UDF for my process_function.
I don't think you need to worry much about parallelizing or the execution plan, because Spark's Catalyst optimizer handles that in the background for you. It is also better to avoid UDFs; you can do most of this with built-in functions.
Are you doing a transformation or an aggregation inside your process_function?
Please provide some test data and a suitable example of the expected output. That would help in giving a better answer.
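The "dictionary of letter --> [number], then explode" idea from the question's edit can be sketched in plain Python, without touching Spark at all. The dictionary contents below are assumed from the loop in the question, and the name numbers_by_letter is hypothetical.

```python
# Hypothetical mapping of each letter to the numbers to process for it,
# taken from the values used in the question's nested loop.
numbers_by_letter = {
    "a": [1, 2, 3],
    "b": [1, 2, 3],
    "c": [1, 2, 3],
}

# "Explode" the mapping into one (letter, number) pair per combination.
pairs = [
    (letter, number)
    for letter, numbers in numbers_by_letter.items()
    for number in numbers
]
print(pairs)
```

This only enumerates the subsets. Whether each subset is then handled by a loop, a grouped operation, or something else depends on what process_function does, which the question does not show.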

How to efficiently append a dataframe column with a vector?

Working with Julia 1.1:
The following minimal code works and does what I want:
using DataFrames

function test()
    df = DataFrame(NbAlternative = Int[], NbMonteCarlo = Int[], Similarity = Float64[])
    append!(df.NbAlternative, ones(Int, 5))
    df
end
This appends a vector to one column of df. Note: in my full code, I append a more complicated Vector{Int} than the result of ones.
However, @code_warntype test() returns:
%8 = invoke DataFrames.getindex(%7::DataFrame, :NbAlternative::Symbol)::AbstractArray{T,1} where T
Which, I suppose, means this isn't efficient. I can't work out what this @code_warntype output means. More generally, how can I understand the warnings reported by @code_warntype and fix them? This is a recurring point of confusion for me.
EDIT: following @BogumiłKamiński's answer
Then how would one write the following code?
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        append!(df.NbAlternative, ones(Int, nb_simulations)*na)
        append!(df.NbMonteCarlo, ones(Int, nb_simulations)*mt)
        append!(df.Similarity, compare_smaa(na, nb_criteria, nb_simulations, mt))
    end
end
compare_smaa returns a vector of length nb_simulations.
You should never append to a single column like this, as it will cause many functions from DataFrames.jl to stop working properly. Actually, such code will soon throw an error; see https://github.com/JuliaData/DataFrames.jl/issues/1844, which is trying to patch exactly this hole in the DataFrames.jl design.
What you should do is append a data frame-like object to a DataFrame using the append! function (this guarantees that the result has consistent column lengths), or use push! to add a single row to a DataFrame.
Now, the reason you have type instability is that a DataFrame can hold vectors of any type (technically, columns are held in a Vector{AbstractVector}), so it is not possible to determine at compile time what the type of the vector under a given name will be.
EDIT
What you ask for is a typical scenario that DataFrames.jl supports well, and I do it almost every day (as I do a lot of simulations). As I have indicated, you can use either push! or append!. Use push! to add a single run of a simulation (this is not your case, but I include it because it is also very common):
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        for i in 1:nb_simulations
            # here you have to make sure that compare_smaa returns a scalar
            # if it is passed 1 in nb_simulations
            push!(df, (na, mt, compare_smaa(na, nb_criteria, 1, mt)))
        end
    end
end
And this is how you can use append!:
for na in arr_nb_alternative
    @show na
    for mt in arr_nb_montecarlo
        println("...$mt")
        # here you have to make sure that compare_smaa returns a vector
        append!(df, (NbAlternative=ones(Int, nb_simulations)*na,
                     NbMonteCarlo=ones(Int, nb_simulations)*mt,
                     Similarity=compare_smaa(na, nb_criteria, nb_simulations, mt)))
    end
end
Note that here I append a NamedTuple. As I wrote earlier, you can append a DataFrame or any data frame-like object this way. "Data frame-like object" covers a broad class of things: in general, anything that you can pass to the DataFrame constructor (so, for example, it can also be a Vector of NamedTuples).
Note that append! adds columns to a DataFrame using name matching, so column names must be consistent between the target and the appended object.
This is different from push!, which also allows pushing a row that does not specify column names (in my example above I show that a Tuple can be pushed).

How to print the first n rows of a pandas dataframe by default?

When printing a pandas dataframe, how can I print only the first n rows by default?
I find myself frequently doing df.head(10) to view the column names and the first couple of rows.
I would prefer that when I type 'df', it prints the first n rows by default instead of the whole df, in which case I cannot see the column names.
If I understand you correctly, you may set
pd.options.display.max_rows = 10
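and whenever you evaluate just df in your notebook, at most 10 rows are displayed. Note that for longer dataframes pandas switches to a truncated view that shows the first and last few rows, with the column names still visible.
You can always go back to the default value with
pd.reset_option('display.max_rows')
Check pd.describe_option('display') for more information.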
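If the setting should apply only temporarily, pd.option_context can wrap it so the previous value is restored automatically. The sketch below uses a made-up 100-row dataframe; the variable names are only for illustration.

```python
import pandas as pd

# Made-up dataframe that is longer than the row limit set below.
df = pd.DataFrame({"value": range(100)})

before = pd.get_option("display.max_rows")

# Inside the with-block the limit is 10; it is restored on exit.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    text = repr(df)  # truncated text representation of df

after = pd.get_option("display.max_rows")
print(text)
```

This avoids having to remember to call pd.reset_option afterwards.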
Curry DataFrame.head using functools.partial.
from functools import partial
head10 = partial(pd.DataFrame.head, n=10)
Now you can either call the function passing your DataFrame as an argument,
head10(df)
or pass the function to df.pipe (which internally passes df as an argument to your function),
df.pipe(head10)
to get the first 10 rows.
Another option is to create a new class that extends DataFrame and add your own method (e.g., headXX) which internally calls self.head(n=10) and returns the result.
See the subclassing DataFrame section in the docs.
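A minimal sketch of such a subclass, assuming the _constructor pattern described in the pandas subclassing docs. The class name PreviewFrame and the method name head10 are hypothetical.

```python
import pandas as pd

class PreviewFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass with a fixed-size preview method."""

    @property
    def _constructor(self):
        # Make pandas operations return PreviewFrame instead of DataFrame.
        return PreviewFrame

    def head10(self):
        # Delegate to the regular head with n fixed at 10.
        return self.head(n=10)

pf = PreviewFrame({"value": range(25)})
preview = pf.head10()
print(preview)
```

Note that this does not change what typing the bare variable name displays; it only saves typing the argument.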

Processing pandas data in declarative style

I have a pandas dataframe of vehicle coordinates (from multiple vehicles on multiple days). For each vehicle and for each day, I do one of two things: either apply an algorithm to it, or filter it out of the dataset completely if it doesn't satisfy certain criteria.
To achieve this I use df.groupby(['vehicle_id', 'day']) and then .apply(algorithm) or .filter(condition), where algorithm and condition are functions which take in a dataframe.
I would like the full processing of my dataset (which involves multiple .apply and .filter steps) to be written out in a declarative style, as opposed to imperatively looping through the groups, with the goal that the whole thing looks something like:
df.groupby(['vehicle_id', 'day']).apply(algorithm1).filter(condition1).apply(algorithm2).filter(condition2)
Of course, the above code is incorrect since .apply() and .filter() return new dataframes, and this is exactly my problem. They return all the data back in a single dataframe, and I find that I have to apply .groupby(['vehicle_id', 'day']) again after every step.
Is there a nice way that I can write this out without having to group by the same columns over and over?
Since apply uses a for loop anyway (meaning there are no sophisticated optimizations in the background), I suggest using an actual for loop:
arr = []
for key, dfg in df.groupby(['vehicle_id', 'day']):
    dfg = dfg.do_stuff1()  # Perform all needed operations
    dfg = do_stuff2(dfg)
    arr.append(dfg)
result = pd.concat(arr)
An alternative is to create a function which runs all of the applies and filters sequentially on a given dataframe, and then map a single groupby/apply to it:
def all_operations(dfg):
    # Do stuff
    return result_df
result = df.groupby(['vehicle_id', 'day']).apply(all_operations)
In both options you will have to deal with cases in which an empty dataframe is returned from the filters, if such cases exist.
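As a rough illustration of the loop variant with the empty-result case handled, here is a small runnable sketch. The functions algorithm1 and condition1, the column x, and the sample data are all made up; the real algorithms and criteria are not shown in the question.

```python
import pandas as pd

def algorithm1(dfg):
    # Hypothetical algorithm: add the step size between consecutive x values.
    return dfg.assign(step=dfg["x"].diff().abs())

def condition1(dfg):
    # Hypothetical criterion: keep only groups with at least 3 points.
    return len(dfg) >= 3

def all_operations(dfg):
    # Run every step for one group; return None if the group is filtered out.
    dfg = algorithm1(dfg)
    if not condition1(dfg):
        return None
    return dfg

df = pd.DataFrame({
    "vehicle_id": [1, 1, 1, 2, 2],
    "day": ["mon", "mon", "mon", "mon", "mon"],
    "x": [0.0, 1.0, 3.0, 5.0, 6.0],
})

parts = []
for key, dfg in df.groupby(["vehicle_id", "day"]):
    out = all_operations(dfg)
    if out is not None:
        parts.append(out)

# Fall back to an empty frame if every group was filtered out.
result = pd.concat(parts) if parts else df.iloc[0:0]
print(result)
```

Grouping happens once, and each group passes through the whole chain before the results are concatenated.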

How can I use `apply` with a function that takes multiple inputs

I have a function that has multiple inputs, and would like to use SFrame.apply to create a new column. I can't find a way to pass two arguments into SFrame.apply.
Ideally, it would take the entry in the column as the first argument, and I would pass in a second argument. Intuitively something like...
def f(arg_1, arg_2):
    return arg_1 + arg_2
sf['new_col'] = sf.apply(f, arg_2)
Suppose the first argument of function f is one of the columns,
say argcolumn1 in sf. Then
sf['new_col'] = sf['argcolumn1'].apply(lambda x:f(x,arg_2))
should work.
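The lambda above simply fixes the second argument so that a one-argument callable is left over. The same binding can be sketched with functools.partial in plain Python. The list below is only a stand-in for the column sf['argcolumn1'], and the value of arg_2 is made up.

```python
from functools import partial

def f(arg_1, arg_2):
    return arg_1 + arg_2

arg_2 = 10  # made-up value for the fixed second argument

# Bind arg_2 so the result takes a single argument, like the lambda does.
add_arg_2 = partial(f, arg_2=arg_2)

column = [1, 2, 3]  # stand-in for the values of sf['argcolumn1']
new_col = [add_arg_2(x) for x in column]
print(new_col)
```

Either form gives a callable with one parameter, which is what a column-wise apply expects.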
Try this.
sf['new_col'] = sf.apply(lambda x : f(arg_1, arg_2))
The way I understand your question (and because none of the previous answers is marked as accepted), it seems to me that you are trying to apply a transformation using two different columns of a single SFrame, so:
As specified in the online documentation, the function you pass to the SFrame.apply method will be called for every row in the SFrame.
So you should rewrite your function to receive a single argument representing the current row, as follows:
def f(row):
    return row['column_1'] + row['column_2']
sf['new_col'] = sf.apply(f)