Is there a way to get max using pandas DataFrame.eval instead of DataFrame.max? - pandas

Is there a way to get the maximum per row using eval?
It would be very sweet to be able to write something like:
X = pd.DataFrame({'A':[1,2,3,4]})
X.eval("""B = 2* A
C = A +B
D = max(A, B)
E = 2 * D
""", inplace = True)
Instead of:
X = pd.DataFrame({'A':[1,2,3,4]})
X.eval("""B = 2* A
C = A +B
""", inplace = True)
X['D'] = X.loc[:,['A', 'B']].max(axis=1)
X.eval('E = 2 * D', inplace=True)
EDIT:
As suggested by @mephisto, something similar to this works beautifully:
def rowmax(A,B):
    return pd.concat([A,B], axis=1).max(axis=1)
X = pd.DataFrame({'A':[0, 1,2,3,4]})
X.eval("""B = A % 2 +1
D = @rowmax(A, B)
""", inplace = True)
I am interested in knowing other alternatives.

You should be able to call a custom or predefined function with @. In your case you want to call df.max(), so try this: X.eval('@df.max()').
Hope this helps

Related

Julia DifferentialEquations.jl all variable output

I have the following example:
using DifferentialEquations
function test1(du,u,p,t)
    a,b,c = p
    d=a^0.1*(t+1)
    e=u[1]/a
    f=u[2]/d
    du[1] = a*u[1]
    du[2] = d*u[2]
    du[3] = b*u[2] - c*u[3]
end
p = (2,0.75,0.8)
u0 = [1.0;1.0;1.0]
tspan = (0.0,3.0)
prob = ODEProblem(test1,u0,tspan,p)
sol = solve(prob,saveat=0.3)
The sol object contains the state outputs, but I also need the other variables ("d", "e", "f"), and I need to get them efficiently.
The closest I can get is:
function test2(du,u,p,t)
    global i
    global Out_values
    global sampletimes
    a,b,c = p
    d=a^0.1*(t+1)
    e=u[1]/a
    f=u[2]/d
    if t in sampletimes
        Out_values[1,i] = d
        Out_values[2,i] = e
        Out_values[3,i] = f
        i=i+1
    end
    du[1] = a*u[1]
    du[2] = d*u[2]
    du[3] = b*u[2] - c*u[3]
end
sampletimes = tspan[1]:0.3:tspan[2]
Out_values = Array{Float64}(undef, 3, 2*11)
i=1
prob = ODEProblem(test2,u0,tspan,p)
sol = solve(prob,saveat=0.3,tstops=sampletimes)
However, this solution is not ideal because:
it duplicates saveat and I get two sets of slightly different outputs (not sure why), and
it can't expand if I decide not to use saveat and I want to output all solutions, i.e. sol = solve(prob).
Any help is appreciated.

R: How to plot the last row of a dataframe?

This must be very easy, but I cannot get a plot of the last/any row of a dataframe.
A = data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
barplot(A[nrow(A),1:3])
I get the error message:
Error in barplot.default(A[nrow(A), 1:3]) :
'height' must be a vector or a matrix
A solution using ggplot would be very welcome!
I loaded the ggplot2 library and the dataset you gave. I used tail() to get only the last row, then melt() from reshape2 to put the data into long format, and then plotted it with ggplot2.
library(ggplot2)
library(reshape2)
A = data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
A_tail <- tail(A, 1)
tailmelt <- melt(A_tail)
ggplot(data = tailmelt, aes( x = factor(variable), y = value, fill = variable ) ) +
geom_bar( stat = 'identity' )

Create Dataframe name from 2 strings or variables pandas

I am extracting selected pages from a PDF file and want to name each dataframe after the page it was extracted from:
file = "abc"
selected_pages = ['10','11'] #can be any combination eg ['6','14','20']
for i in selected_pages():
    df{str(i)} = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True,area = [100,10,740,950],pages= (i), index = False)
    print (df{str(i)} )
The idea, ultimately, as in the example above, is to end up with dataframes named df10 and df11. I have tried "df" + str(i), "df" & str(i) and df{str(i)}, but all of them give the error message: SyntaxError: invalid syntax
Any better way of doing this is most welcome. Thanks.
This is where a dictionary would be a much better option.
Also note the error you have at the start of the loop. selected_pages is a list, so you can't do selected_pages().
file = "abc"
selected_pages = ['10','11'] #can be any combination eg ['6','14','20']
df = {}
for i in selected_pages:
    df[i] = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True, area = [100,10,740,950], pages= (i), index = False)
i = int(i) - 1 # this will bring it to 10
dfB = df[str(i)]
#select row number to drop: 0:4
dfB.drop(dfB.index[0:4],axis =0, inplace = True)
dfB.columns = ['col1','col2','col3','col4','col5']
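The dictionary idea can be sketched without any PDF library. In the sketch below, load_page is a hypothetical stand-in for the read_pdf(...) call and returns dummy rows, so the example runs on its own; the name frames is likewise only illustrative.

```python
def load_page(page):
    # Hypothetical stand-in for read_pdf(..., pages=page); returns dummy rows.
    return [{"page": page, "row": n} for n in range(3)]

selected_pages = ["10", "11"]

# One dictionary entry per page instead of variables named df10, df11, ...
frames = {page: load_page(page) for page in selected_pages}

# Look a result up by its page number, i.e. what would have been "df10".
df10 = frames["10"]
print(len(frames), df10[0])
```

Keying by page number means the loop never has to build variable names from strings, and any combination of pages works without changing the code.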

How to speed up a simple linear algebra optimization problem in Julia?

I implemented the LSDD changepoint detection method described in [1] in Julia, to see if I could make it faster than the existing Python implementation [2], which is based on a grid search that looks for the optimal parameters.
I obtain the desired results, but despite my best efforts my grid search version takes about the same time to compute as the Python one, which is still far too long for real applications.
I also tried using the Optim package, which only makes things worse (2 or 3 times slower).
Here is the grid search that I implemented:
using Random
using LinearAlgebra
function squared_distance(X::Array{Float64,1},C::Array{Float64,1})
sqd = zeros(length(X),length(C))
for i in 1:length(X)
for j in 1:length(C)
sqd[i,j] = X[i]^2 + C[j]^2 - 2*X[i]*C[j]
end
end
return sqd
end
function lsdd(x::Array{Float64,1},y::Array{Float64,1}; folds = 5, sigma_list = nothing , lambda_list = nothing)
lx,ly = length(x), length(y)
b = min(lx+ly,300)
C = shuffle(vcat(x,y))[1:b]
CC_dist2 = squared_distance(C,C)
xC_dist2, yC_dist2 = squared_distance(x,C), squared_distance(y,C)
Tx,Ty = length(x) - div(lx,folds), length(y) - div(ly,folds)
#Define the training and testing data sets
cv_split1, cv_split2 = floor.(collect(1:lx)*folds/lx), floor.(collect(1:ly)*folds/ly)
cv_index1, cv_index2 = shuffle(cv_split1), shuffle(cv_split2)
tr_idx1,tr_idx2 = [findall(x->x!=i,cv_index1) for i in 1:folds], [findall(x->x!=i,cv_index2) for i in 1:folds]
te_idx1,te_idx2 = [findall(x->x==i,cv_index1) for i in 1:folds], [findall(x->x==i,cv_index2) for i in 1:folds]
xTr_dist, yTr_dist = [xC_dist2[i,:] for i in tr_idx1], [yC_dist2[i,:] for i in tr_idx2]
xTe_dist, yTe_dist = [xC_dist2[i,:] for i in te_idx1], [yC_dist2[i,:] for i in te_idx2]
if sigma_list == nothing
sigma_list = [0.25, 0.5, 0.75, 1, 1.2, 1.5, 2, 2.5, 2.2, 3, 5]
end
if lambda_list == nothing
lambda_list = [1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02,
1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00,
1.00000000e+01]
end
#memory preallocation
score_cv = zeros(length(sigma_list),length(lambda_list))
H = zeros(b,b)
hx_tr, hy_tr = [zeros(b,1) for i in 1:folds], [zeros(b,1) for i in 1:folds]
hx_te, hy_te = [zeros(1,b) for i in 1:folds], [zeros(1,b) for i in 1:folds]
#h_tr,h_te = zeros(b,1), zeros(1,b)
theta = zeros(b)
for (sigma_idx,sigma) in enumerate(sigma_list)
#the expression of H is different for higher dimension
#H = sqrt((sigma^2)*pi)*exp.(-CC_dist2/(4*sigma^2))
set_H(H,CC_dist2,sigma,b)
#check if the sum is performed along the right dimension
set_htr(hx_tr,xTr_dist,sigma,Tx), set_htr(hy_tr,yTr_dist,sigma,Ty)
set_hte(hx_te,xTe_dist,sigma,lx-Tx), set_hte(hy_te,yTe_dist,sigma,ly-Ty)
for i in 1:folds
h_tr = hx_tr[i] - hy_tr[i]
h_te = hx_te[i] - hy_te[i]
#set_h(h_tr,hx_tr[i],hy_tr[i],b)
#set_h(h_te,hx_te[i],hy_te[i],b)
for (lambda_idx,lambda) in enumerate(lambda_list)
set_theta(theta,H,lambda,h_tr,b)
score_cv[sigma_idx,lambda_idx] += dot(theta,H*theta) - 2*dot(theta,h_te)
end
end
end
#retrieve the value of the optimal parameters (score_cv is indexed [sigma_idx,lambda_idx])
sigma_chosen = sigma_list[findmin(score_cv)[2][1]]
lambda_chosen = lambda_list[findmin(score_cv)[2][2]]
#calculating the new "optimal" solution
H = sqrt((sigma_chosen^2)*pi)*exp.(-CC_dist2/(4*sigma_chosen^2))
H_lambda = H + lambda_chosen*Matrix{Float64}(I, b, b)
h = (1/lx)*sum(exp.(-xC_dist2/(2*sigma_chosen^2)),dims = 1) - (1/ly)*sum(exp.(-yC_dist2/(2*sigma_chosen^2)),dims = 1)
theta_final = H_lambda\transpose(h)
f = transpose(theta_final).*sum(exp.(-vcat(xC_dist2,yC_dist2)/(2*sigma_chosen^2)),dims = 1)
L2 = 2*dot(theta_final,h) - dot(theta_final,H*theta_final)
return L2
end
function set_H(H::Array{Float64,2},dist::Array{Float64,2},sigma::Float64,b::Int16)
for i in 1:b
for j in 1:b
H[i,j] = sqrt((sigma^2)*pi)*exp(-dist[i,j]/(4*sigma^2))
end
end
end
function set_theta(theta::Array{Float64,1},H::Array{Float64,2},lambda::Float64,h::Array{Float64,2},b::Int64)
Hl = (H + lambda*Matrix{Float64}(I, b, b))
LAPACK.posv!('L', Hl, h)
theta = h
end
function set_htr(h::Array{Float64,1},dists::Array{Float64,2},sigma::Float64,T::Int16)
for (CVidx,dist) in enumerate(dists)
for (idx,value) in enumerate((1/T)*sum(exp.(-dist/(2*sigma^2)),dims = 1))
h[CVidx][idx] = value
end
end
end
function set_hte(h::Array{Float64,1},dists::Array{Float64,2},sigma::Array{Float64,1},T::Int16)
for (CVidx,dist) in enumerate(dists)
for (idx,value) in enumerate((1/T)*sum(exp.(-dist/(2*sigma^2)),dims = 1))
h[CVidx][idx] = value
end
end
end
function set_h(h,h1,h2,b)
for i in 1:b
h[i] = h1[i] - h2[i]
end
end
The set_H, set_h and set_theta functions are there because I read somewhere that modifying preallocated memory in place with a function is faster, but it did not make a great difference.
To test it, I use two random samples as input data:
x,y = rand(500),1.5*rand(500)
lsdd(x,y) #returns a value around 0.3
Now here is the version of the code where I try to use Optim:
function Theta(sigma::Float64,lambda::Float64,x::Array{Float64,1},y::Array{Float64,1},folds::Int8)
lx,ly = length(x), length(y)
b = min(lx+ly,300)
C = shuffle(vcat(x,y))[1:b]
CC_dist2 = squared_distance(C,C)
xC_dist2, yC_dist2 = squared_distance(x,C), squared_distance(y,C)
#the subsets are not mutually exclusive!
Tx,Ty = length(x) - div(lx,folds), length(y) - div(ly,folds)
shuffled_x, shuffled_y = [shuffle(1:lx) for i in 1:folds], [shuffle(1:ly) for i in 1:folds]
cv_index1, cv_index2 = floor.(collect(1:lx)*folds/lx)[shuffle(1:lx)], floor.(collect(1:ly)*folds/ly)[shuffle(1:ly)]
tr_idx1,tr_idx2 = [i[1:Tx] for i in shuffled_x], [i[1:Ty] for i in shuffled_y]
te_idx1,te_idx2 = [i[Tx:end] for i in shuffled_x], [i[Ty:end] for i in shuffled_y]
xTr_dist, yTr_dist = [xC_dist2[i,:] for i in tr_idx1], [yC_dist2[i,:] for i in tr_idx2]
xTe_dist, yTe_dist = [xC_dist2[i,:] for i in te_idx1], [yC_dist2[i,:] for i in te_idx2]
score_cv = 0
Id = Matrix{Float64}(I, b, b)
H = sqrt((sigma^2)*pi)*exp.(-CC_dist2/(4*sigma^2))
hx_tr, hy_tr = [transpose((1/Tx)*sum(exp.(-dist/(2*sigma^2)),dims = 1)) for dist in xTr_dist], [transpose((1/Ty)*sum(exp.(-dist/(2*sigma^2)),dims = 1)) for dist in yTr_dist]
hx_te, hy_te = [(lx-Tx)*sum(exp.(-dist/(2*sigma^2)),dims = 1) for dist in xTe_dist], [(ly-Ty)*sum(exp.(-dist/(2*sigma^2)),dims = 1) for dist in yTe_dist]
for i in 1:folds
h_tr, h_te = hx_tr[i] - hy_tr[i], hx_te[i] - hy_te[i]
#theta = (H + lambda * Id)\h_tr
theta = copy(h_tr)
Hl = (H + lambda*Matrix{Float64}(I, b, b))
LAPACK.posv!('L', Hl, theta)
score_cv += dot(theta,H*theta) - 2*dot(theta,h_te)
end
return score_cv,(CC_dist2,xC_dist2,yC_dist2)
end
function cost(params::Array{Float64,1},x::Array{Float64,1},y::Array{Float64,1},folds::Int8)
s,l = params[1],params[2]
return Theta(s,l,x,y,folds)[1]
end
"""
Performs the optimization
"""
function lsdd3(x::Array{Float64,1},y::Array{Float64,1}; folds = 4)
start = [1,0.1]
b = min(length(x)+length(y),300)
lx,ly = length(x),length(y)
#result = optimize(params -> cost(params,x,y,folds),fill(0.0,2),fill(50.0,2),start, Fminbox(LBFGS(linesearch=LineSearches.BackTracking())); autodiff = :forward)
result = optimize(params -> cost(params,x,y,folds),start, BFGS(),Optim.Options(f_calls_limit = 5, iterations = 5))
#bboptimize(rosenbrock2d; SearchRange = [(-5.0, 5.0), (-2.0, 2.0)])
#result = optimize(cost,[0,0],[Inf,Inf],start, Fminbox(AcceleratedGradientDescent()))
sigma_chosen,lambda_chosen = Optim.minimizer(result)
CC_dist2, xC_dist2, yC_dist2 = Theta(sigma_chosen,lambda_chosen,x,y,folds)[2]
H = sqrt((sigma_chosen^2)*pi)*exp.(-CC_dist2/(4*sigma_chosen^2))
h = (1/lx)*sum(exp.(-xC_dist2/(2*sigma_chosen^2)),dims = 1) - (1/ly)*sum(exp.(-yC_dist2/(2*sigma_chosen^2)),dims = 1)
theta_final = (H + lambda_chosen*Matrix{Float64}(I, b, b))\transpose(h)
f = transpose(theta_final).*sum(exp.(-vcat(xC_dist2,yC_dist2)/(2*sigma_chosen^2)),dims = 1)
L2 = 2*dot(theta_final,h) - dot(theta_final,H*theta_final)
return L2
end
No matter which options I use in the optimizer, I always end up with something too slow. Maybe the grid search is the best option, but I don't know how to make it faster. Does anyone have an idea how I could proceed further?
[1] : http://www.mcduplessis.com/wp-content/uploads/2016/05/Journal-IEICE-2014-CLSDD-1.pdf
[2] : http://www.ms.k.u-tokyo.ac.jp/software.html

How to Plot a function of two variables in Julia with pyplot

I'm trying to plot a function of two variables with pyplot in Julia. The working starting-point is the following (found here at StackOverflow):
function f(z,t)
    return z*t
end
z = linspace(0,5,11)
t = linspace(0,40,4)
for tval in t
    plot(z, f(z, tval))
end
show()
This works for me and gives me exactly what I wanted: a field of lines.
My own functions are as follows:
## needed functions ##
const gamma_0 = 6
const Ksch = 1.2
const Kver = 1.5
function Kvc(vc)
    if vc <= 0
        return 0
    elseif vc < 20
        return (100/vc)^0.1
    elseif vc < 100
        return 2.023/(vc^0.153)
    elseif vc == 100
        return 1
    elseif vc > 100
        return 1.380/(vc^0.07)
    else
        return 0
    end
end
function Kgamma(gamma_t)
    return 1-((gamma_t-gamma_0)/100)
end
function K(gamma_t, vc)
    return Kvc(vc)*Kgamma(gamma_t)*Ksch*Kver
end
I've tried to plot them as follows:
i = linspace(0,45,10)
j = linspace(0,200,10)
for i_val in i
    plot(i,K(i,j))
end
This gives me the following Error:
isless has no method matching isless(::Int64, ::Array{Float64,1})
while loading In[51], in expression starting on line 3
in Kvc at In[17]:2 in anonymous at no file:4
Obviously, my function can't deal with an array.
Next try:
i = linspace(0,200,11)
j = linspace(0,45,11)
for i_val in i
    plot(i_val,map(K,i_val,j))
end
This gives me an empty plot with only the axes.
Can anybody please give me a hint?
EDIT
A simpler example:
using PyPlot
function P(n,M)
    return (M*n^3)/9550
end
M = linspace(1,5,5)
n = linspace(0,3000,3001)
for M_val in M
    plot(n,P(n,M_val))
end
show()
Solution
OK, with your help I found this solution for the shortened example which works for me as intended:
function P(n,M)
    result = Array(Float64, length(n))
    for (idx, val) in enumerate(n)
        result[idx] = (M*val^3)/9550
    end
    return result
end
n = linspace(0,3000,3001)
for M_val = 1:5
    plot(n,P(n,M_val))
end
show()
This gives me what I wanted for this shortened example. The remaining question is: could it be done in a simpler, more elegant way?
I'll try to apply it to the original example and post the result when I succeed.
I don't completely follow all the details of what you're trying to accomplish, but here are examples on how you can modify a couple of your functions so that they accept and return arrays:
function Kvc(vc)
    result = Array(Float64, length(vc))
    for (idx, val) in enumerate(vc)
        if val <= 0
            result[idx] = 0
        elseif val < 20
            result[idx] = (100/val)^0.1
        elseif val < 100
            result[idx] = 2.023/(val^0.153)
        elseif val == 100
            result[idx] = 1
        elseif val > 100
            result[idx] = 1.380/(val^0.07)
        else
            result[idx] = 0
        end
    end
    return result
end
function Kgamma(gamma_t)
    return ones(length(gamma_t))-((gamma_t - gamma_0)/100)
end
Also, for your loop, I think you probably want something like:
for i_val in i
    plot(i_val,K(i_val,j))
end
rather than plot(i, K(i,j)), as that would just plot the same thing over and over.
< is defined for scalars. I think you need to broadcast it for arrays, i.e. use .<. Example:
julia> x = 2
2
julia> x < 3
true
julia> x < [3 4]
ERROR: MethodError: no method matching isless(::Int64, ::Array{Int64,2})
Closest candidates are:
isless(::Real, ::AbstractFloat)
isless(::Real, ::Real)
isless(::Integer, ::Char)
in <(::Int64, ::Array{Int64,2}) at .\operators.jl:54
in eval(::Module, ::Any) at .\boot.jl:234
in macro expansion at .\REPL.jl:92 [inlined]
in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at .\event.jl:46
julia> x .< [3 4]
1x2 BitArray{2}:
true true