Applying function to unique values for efficiency in pandas

This is a general question about how to apply a function efficiently in pandas. I often encounter situations where I need to apply a function to a pd.Series, and it would be faster to apply the function only to its unique values.
For example, suppose I have a very large dataset. One column is date, and I want to add a column that gives the last date of the quarter for each date. I would do this:
mf['qtr'] = pd.Index(mf['date']) + pd.offsets.QuarterEnd(0)
But for large datasets, this can take a while. So to speed it up, I'll extract the unique values of date, apply the function to those, and then merge the result back into the original data:
dts = mf['date'].drop_duplicates()
eom = pd.Series(pd.Index(dts) + pd.offsets.QuarterEnd(0), index=dts)
eom.name = 'qtr'
mf = pd.merge(mf, eom.reset_index())
This can be much faster than the one-liner above.
So here's my question: Is this really the right way to do things like this, or is there a better approach?
And, would it make sense and be feasible to add a feature to pandas that would take this unique/apply/merge approach automatically? (It wouldn't work for certain functions, such as those that rely on rolling data, so presumably the user would have to explicitly request this behavior.)
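In case it helps frame the question, the same unique/apply idea can also be written without the merge, by mapping the computed values back onto the original column (just a sketch; apply_unique is a name I made up, and it assumes func is vectorized over a Series of unique values):
import pandas as pd

def apply_unique(s, func):
    # compute func only on the unique values, then map the results back
    uniques = s.drop_duplicates()
    mapped = pd.Series(func(uniques).values, index=uniques)
    return s.map(mapped)

mf['qtr'] = apply_unique(mf['date'], lambda d: pd.Index(d) + pd.offsets.QuarterEnd(0))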

I'd personally just group on the date column and then call your function for each group:
mf.groupby('date', as_index=False)['date'].apply(lambda x: x + pd.offsets.QuarterEnd(0))
I think this should work.
EDIT
OK, the above doesn't work, but the following does, although I think it's a bit twisted:
mf.groupby('date', as_index=False)['date'].apply(lambda x: (pd.Index(x) + pd.offsets.QuarterEnd(0))[0])
We create a DatetimeIndex for each date group, add the offset, and then take the single element to return a scalar value, but personally I think this is not great.
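A slightly less twisted variant along the same lines might be transform, which evaluates the function once per group and broadcasts the per-group scalar back to every row (again just a sketch, assuming the quarter-end goal from the question):
mf['qtr'] = mf.groupby('date')['date'].transform(
    lambda x: x.iloc[0] + pd.offsets.QuarterEnd(0)
)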

Related

Custom distance function between every row of two dataframes

I have two dataframes, and I want to calculate the "distance" between every row in one dataframe and every row in the other using a custom distance measure (for example, Euclidean for the first column, taxicab for the second, etc.). Is there a way to do this quickly using broadcasting?
You can simply create a custom function and use it with the apply function, for example:
def custom_func(a, b):
    # for example, add the two inputs
    result = a + b
    return result

df = df.apply(custom_func, args=(1,))  # extra positional arguments go through args
If you want an answer that is tested on some dataset, please add the data. But the idea is that you can create any custom function and pass it to pandas' apply function.
More about the apply function, with examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
You may also want to check the applymap function in pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
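For the broadcasting part of the question, a minimal NumPy sketch might look like this (df1 and df2 are hypothetical example frames, and the per-column metrics are only illustrations):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [0.0, 1.0], 'b': [0.0, 2.0]})
df2 = pd.DataFrame({'a': [1.0, 3.0, 5.0], 'b': [1.0, 1.0, 4.0]})

a1 = df1.to_numpy()[:, None, :]   # shape (n1, 1, n_cols)
a2 = df2.to_numpy()[None, :, :]   # shape (1, n2, n_cols)
diff = a1 - a2                    # broadcasts to (n1, n2, n_cols)

# per-column metrics: squared/rooted difference for column 'a' (Euclidean term),
# absolute difference for column 'b' (taxicab term)
dist = np.sqrt(diff[:, :, 0] ** 2) + np.abs(diff[:, :, 1])
# dist[i, j] is the custom distance between row i of df1 and row j of df2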

no method matching size(::DataFrames.GroupedDataFrame)

It's the first time I post a question, so I will try to give an example, but I might not be totally aware of the best way to do it.
I am using the groupby() function to divide a DataFrame according to a pooled variable. My intent is to create, from the sub-DataFrames, a new one in which the rows split by groupby() become two separate columns. For instance, in DataFrame A I have :meanX and :Treatment; in DataFrame B I want to have :meanX_Treatment1 and :meanX_Treatment2.
Now I found a way to use join() for this purpose, but having many other variables to block, I need to repeat the operation several times, and I need to know how many sub-DataFrames the initial call of groupby() created. The number varies, so I can't simply read it off; I need to store it in a variable. That's why I tried size(::DataFrames.GroupedDataFrame).
Is there a solution?
To get the number of groups in a GroupedDataFrame use the length method. For example:
using DataFrames
df = DataFrame(x=repeat(1:4, inner=2, outer=2), y='a':'p')
grouped = groupby(df, :x)
num_of_groups = length(grouped) # returns 4
# to do something with each group `for g in grouped ... end` is useful
As noted in comments, you might also consider using Query.jl (see documentation at http://www.david-anthoff.com/Query.jl/stable) for data processing along the question's lines.

Warning message while using a window function in SparkSQL dataframe

I am getting the below warning message when I use a window function in SparkSQL. Can anyone please let me know how to fix this issue?
Warning Message:
No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
My Code:
def calcPrevBrdrx(df: DataFrame): DataFrame = {
  val w = Window.orderBy("existing_col1")
  df.withColumn("new_col", lag("existing_col2", 1).over(w))
}
The warning is exactly what it says. In general, when you use a window function you would first partition by some column and only then order. So for example if you had logs for a user you might partition by the user and then order by time which would do the sorting separately for each user.
If you do not have a partition by, then you are sorting the entire data frame. This basically means you have a single partition: all the data from the dataframe would move to that single partition and be sorted.
This would be slow (you are shuffling everything and then sorting everything), and worse, it means that all your data needs to fit in a single partition, which is not scalable.
You should probably take a look at your logic to make sure you really need to sort everything instead of partitioning by something before.
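As a minimal sketch of that recommended pattern (written in PySpark here; user_id, event_time, and value are made-up column names):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
# hypothetical log data: one row per event per user
df = spark.createDataFrame(
    [("u1", 1, 10.0), ("u1", 2, 12.0), ("u2", 1, 7.0)],
    ["user_id", "event_time", "value"],
)

# partition by user, then order by time within each user's partition
w = Window.partitionBy("user_id").orderBy("event_time")
df = df.withColumn("prev_value", F.lag("value", 1).over(w))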
If your logic demands using order by without a partition clause, maybe because you don't have anything else to partition on or it doesn't make sense for the window function used, you can add a dummy value like below:
.withColumn("id", explode(typedLit((1 to 100).toList)))
This will create an id field with values from 1 to 100 for each row in the original dataframe; use that in the partition by clause (partition by id), and it will launch 100 tasks. The total number of rows it creates will be the current rows * 100. Make sure you drop the id field and take a distinct on the result.
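In PySpark, that trick might look roughly like this (a sketch only; it reuses existing_col1 and existing_col2 from the function above and assumes df is the input dataframe):
from pyspark.sql import Window, functions as F

# salt every row with ids 1..100 so the window can be partitioned
salted = df.withColumn("id", F.explode(F.array([F.lit(i) for i in range(1, 101)])))

w = Window.partitionBy("id").orderBy("existing_col1")
result = (salted
          .withColumn("new_col", F.lag("existing_col2", 1).over(w))
          .drop("id")
          .distinct())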

Why does pandas.apply() work differently for Series and DataFrame columns

Apologies if this is a silly question, but I am not quite sure why this behavior is the case, or whether I am misunderstanding it. I was trying to create a function for the 'apply' method, and noticed that if you run apply on a Series, the data is passed to the (u)func as a np.array, whereas if you pass the same Series within a DataFrame of one column, it is passed as a Series.
This affects the way a simpleton like me writes the function (I prefer iloc indexing to integer-based indexing on the array), so I was wondering whether this is on purpose or a historical accident?
Thanks,
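Not a full answer, but a quick way to see what each code path actually hands to the function (the exact behavior may depend on your pandas version):
import pandas as pd

s = pd.Series([1, 2, 3])
df = s.to_frame(name="x")

s.apply(lambda v: print(type(v)))                  # Series.apply: scalar elements
df.apply(lambda col: print(type(col)))             # DataFrame.apply: each column as a Series
df.apply(lambda col: print(type(col)), raw=True)   # raw=True: each column as a np.ndarray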

How to use a look up table in MATLAB

I need to perform an exponential operation of two parameters (one set: t, and the other comes from the arrays) on a set of 2D arrays (a 3D Matrix if you want).
f(t,x) = exp(t-x)
And then I need to add up the results along the 3rd dimension. Because it takes too much time to perform the entire operation using bsxfun, I was thinking of using a look-up table.
I can create the table as a matrix LUT (2-dimensional, due to the two parameters), and then retrieve the values using LUT(par1,par2). But accessing the 3rd dimension in a loop is expensive too.
My question is: is there a way to implement such a mechanism (a look-up table) with predefined values that can then be accessed through the matrix elements (a kind of indexing) without loops? Or, how can I create a look-up table that MATLAB handles automatically to speed up the exponential operation?
EDIT:
I actually used similar methods to create the LUT. Now my problem is how to access it in an efficient way.
Let's say I have a 2-dimensional array M, with values to which I want to apply the function f(t,M(i,j)) for a fixed value t. I can use a loop to go through all the values (i,j) of M, but I want a faster way of doing it, because I have a set of M's and need to apply this procedure to all of them.
My function is a little bit more complex than the example I gave:
pr = mean(exp(-bsxfun(@rdivide, bsxfun(@minus, color_vals, double(I)).^2, m)./2), 3);
That is my actual function; as you can see, it is more complex than the example I presented, but the idea is the same. It takes an average along the third dimension of the set of M's of the exponential of the difference of two arrays.
Hope that helps.
I agree that the question is not very clear, and that showing some code would help. I'll try anyway.
In order for a LUT to make sense at all, the set of values attained by t-x has to be limited, for example to integers.
Assuming that the exponent can be any integer from -1000 to 1000, you could create a LUT like this:
LUT = exp(-1000:1000);
Then you create your indices (assuming t is a 1D array, and x is a 2D array)
indexArray = bsxfun(@minus, reshape(t,[1,1,3]), x) + 1001; % -1000 turns into 1
Finally, you create your result
output = LUT(indexArray);
% sum along third dimension (i.e. sum over all t)
output = sum(output,3);
I am not sure I understand your question, but I think this is the answer.
x = 0:3
y = 0:2
z = 0:6
[X,Y,Z] = meshgrid(x,y,z)
LUT = (X+Y).^Z