I have two dataframes, and I want to calculate the "distance" between every row in one dataframe and every row in the other using a custom distance measure (for example, Euclidean for the first column, taxicab for the second, etc.). Is there a way to do this quickly using broadcasting?
You can simply create a custom function and use it with apply.
For example:
def custom_func(a, b):
    result = a + b  # example logic; replace with your own
    return result

df = df.apply(custom_func, args=(1,))  # extra positional arguments are passed through to the function

If you want an answer tested against some dataset, please add the data. But the idea is that you can create any custom function and pass it to pandas' apply.
More about the apply function, with examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
You may also want to check the applymap function in pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
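For the broadcasting part of the original question (a different metric per column, every row of one frame against every row of the other), a NumPy sketch along these lines may help; the column names and the way the per-column distances are combined are assumptions:
import numpy as np
import pandas as pd

# hypothetical inputs: column 'a' gets a Euclidean-style term, column 'b' a taxicab term
df1 = pd.DataFrame({'a': [0.0, 1.0], 'b': [0.0, 2.0]})
df2 = pd.DataFrame({'a': [1.0, 3.0, 5.0], 'b': [1.0, 1.0, 1.0]})

x = df1.to_numpy()[:, None, :]   # shape (n1, 1, n_cols)
y = df2.to_numpy()[None, :, :]   # shape (1, n2, n_cols)
diff = x - y                     # broadcasts to shape (n1, n2, n_cols)

# per-column metrics, combined here by simple addition (an assumption)
dist = np.sqrt(diff[..., 0] ** 2) + np.abs(diff[..., 1])   # shape (n1, n2)
# dist[i, j] is the distance between row i of df1 and row j of df2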
df.groupby("home_team_name")["home_team_goal_count","away_team_goal_count"].sum()
I want to group the examples in my dataframe by the variable home_team_name and perform a different operation on each attribute: sum for one of them, mean for another, and the last occurrence for a third.
As of now I only know how to perform the same operation on all of them, as in my code example.
You can do:
df.groupby("home_team_name").agg({'home_team_goal_count': 'sum',
                                  'away_team_goal_count': 'mean'})
Refer to the documentation for more examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
To get the last value, you could do:
df.groupby("home_team_name").agg({'home_team_goal_count': 'last',
'away_team_goal_count': 'last'})
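To match the question's sum / mean / last combination in a single call, a sketch (the third column, 'stadium_name', is hypothetical):
df.groupby("home_team_name").agg({'home_team_goal_count': 'sum',
                                  'away_team_goal_count': 'mean',
                                  'stadium_name': 'last'})  # hypothetical third column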
After I've done a groupby, I want to store a value so I can multiply it later.
The dtype of the groupby result is 'object', so I can't operate on it.
I've tried creating a dataframe from the groupby result.
# create the df
# the sum is a very large number, so I clean up its format for readability
total = df.groupby(['columnName_1'])['columnName_2'].sum().astype('int').map('{:,}'.format)
total = total.add_suffix('_global').reset_index()
# as it is a sum, I get only one value, the one I want to multiply by another value later.
I'd like to know now if I can store this value without having to create a dataframe.
Thanks!
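A minimal sketch of one way to keep the result as a plain number (the column names are the question's; the data is hypothetical): do the arithmetic on the raw sum, and apply the thousands-separator format only when displaying.
import pandas as pd

# hypothetical data standing in for the question's frame
df = pd.DataFrame({'columnName_1': ['a', 'a'], 'columnName_2': [1_000_000, 2_500_000]})
totals = df.groupby('columnName_1')['columnName_2'].sum()
value = int(totals.iloc[0])   # a plain Python int, no extra dataframe needed
print(value * 2)              # arithmetic works on the raw number
print(f'{value:,}')           # format only for display: 3,500,000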
It's the first time I post a question, so I will try to give an example, but I might not be totally aware of the best way to do it.
I am using the groupby() function to divide a DataFrame according to a pooled variable. My intent is to create, from the sub-DataFrames, a new one in which the rows split by groupby() become 2 separate columns. For instance, in DataFrame A I have :meanX and :Treatment; in DataFrame B I want to have :meanX_Treatment1 and :meanX_Treatment2.
Now I found a way to use join() for this purpose, but having many other variables to block, I need to repeat the operation several times, and I need to know how many sub-DataFrames the initial call of groupby() created. The result is variable, so I can't simply read it off; I need to store it in a variable. That's why I tried size(::DataFrames.GroupedDataFrame).
Is there a solution?
To get the number of groups in a GroupedDataFrame, use the length method. For example:
using DataFrames
df = DataFrame(x=repeat(1:4,inner=2,outer=2),y='a':'p')
grouped = groupby(df,:x)
num_of_groups = length(grouped) # returns 4
# to do something with each group `for g in grouped ... end` is useful
As noted in comments, you might also consider using Query.jl (see documentation at http://www.david-anthoff.com/Query.jl/stable) for data processing along the question's lines.
This is a general question about how to apply a function efficiently in pandas. I often encounter situations where I need to apply a function to a pd.Series and it would be faster to apply the function only to unique values.
For example, suppose I have a very large dataset. One column is date, and I want to add a column that gives the last date of the quarter for each date. I would do this:
mf['qtr'] = pd.Index(mf['date']) + pd.offsets.QuarterEnd(0)
But for large data sets, this can take a while. So to speed it up, I'll extract the unique values of date, apply the function to those, and then merge it back in to the original data:
dts = mf['date'].drop_duplicates()
eom = pd.Series(pd.Index(dts) + pd.offsets.QuarterEnd(0), index=dts)
eom.name = 'qtr'
mf = pd.merge(mf, eom.reset_index())
This can be much faster than the one-liner above.
So here's my question: Is this really the right way to do things like this, or is there a better approach?
And, would it make sense and be feasible to add a feature to pandas that would take this unique/apply/merge approach automatically? (It wouldn't work for certain functions, such as those that rely on rolling data, so presumably the user would have to explicitly request this behavior.)
I'd personally just group on the date column and then just call your function for each group:
mf.groupby('date',as_index=False)['date'].apply(lambda x: x + pd.offsets.QuarterEnd(0))
I think this should work.
EDIT
OK, the above doesn't work, but the following does. I think it is a bit twisted, though:
mf.groupby('date', as_index=False)['date'].apply(lambda x: (pd.Index(x) + pd.offsets.QuarterEnd(0))[0])
We create a DatetimeIndex for each date, add the offset, and then access the single element to return the value, but personally I think this is not great.
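An alternative sketch of the question's unique-then-merge idea, using a plain dict and map (the frame below is hypothetical): compute the offset once per unique date, then map the results back onto the full column.
import pandas as pd

mf = pd.DataFrame({'date': pd.to_datetime(['2020-01-15', '2020-01-15', '2020-05-03'])})
# compute the quarter end once per unique date ...
qtr_by_date = {d: d + pd.offsets.QuarterEnd(0) for d in mf['date'].drop_duplicates()}
# ... then broadcast the results back to every row
mf['qtr'] = mf['date'].map(qtr_by_date)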
I need to perform an exponential operation of two parameters (one is t, the other comes from the arrays) on a set of 2D arrays (a 3D matrix if you want).
f(t,x) = exp(t-x)
Then I need to add up the results over the 3rd dimension. Because performing the entire operation with bsxfun takes too much time, I was thinking of using a lookup table.
I can create the table as a matrix LUT (2-dimensional, because of the two parameters), and then retrieve the values with LUT(par1,par2). But accessing the 3rd dimension with a loop is expensive too.
My question is: is there a way to implement such a mechanism (a lookup table) with predefined values, and then just access them from the matrix elements (a kind of indexing) without loops? Or, how can I create a lookup table that MATLAB handles automatically to speed up the exponential operation?
EDIT:
I actually used a similar method to create the LUT. My problem now is how to access it in an efficient way.
Let's say I have a 2-dimensional array M, and I want to apply the function f(t,M(i,j)) for a fixed value t. I can use a loop to go through all the values (i,j) of M, but I want a faster way of doing it, because I have a set of M's and then need to apply this procedure to all of them.
My function is a little more complex than the example I gave:
pr = mean(exp(-bsxfun(@rdivide,bsxfun(@minus,color_vals,double(I)).^2,m)./2),3);
That is my actual function; as you can see, it is more complex than the example I presented, but the idea is the same: it averages, over the third dimension of the set of M's, the exponential of the (squared, scaled) difference of two arrays.
Hope that helps.
I agree that the question is not very clear, and that showing some code would help. I'll try anyway.
For a LUT to make sense at all, the set of values attained by t-x has to be limited, for example to integers.
Assuming that the exponent can be any integer from -1000 to 1000, you could create a LUT like this:
LUT = exp(-1000:1000);
Then you create your indices (assuming t is a 1D array and x is a 2D array):
indexArray = bsxfun(@minus, reshape(t,[1,1,numel(t)]), x) + 1001; %# -1000 turns into 1
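%# values of t-x outside [-1000,1000] would index out of bounds; clamp or validate first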
Finally, you create your result
output = LUT(indexArray);
%# sum along third dimension (i.e. sum over all `t`)
output = sum(output,3);
I am not sure I understand your question, but I think this is the answer.
x = 0:3
y = 0:2
z = 0:6
[X,Y,Z] = meshgrid(x,y,z)
LUT = (X+Y).^Z
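%# LUT(i,j,k) = (x(j)+y(i))^z(k); fetch precomputed powers by indexing instead of recomputing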