New to Python and Pandas, Looking for help aggregating observations - pandas

I am relatively new to using Python and Pandas, and was looking for help with this line of code:
`Football.ydstogo[Football.ydstogo>='11']&[Football.ydstogo<='19']= '10-plus'`
I am working with data from the NFL, and trying to build a model to predict when a team will pass, or when a team will run the ball. One of my variables (ydstogo) measures the distance for the team, with the ball, to get a first down. I am trying to group together the observations after 10 yards for ease of visualization.
When I tried running the code above, the error in my output is "can't assign to operator". I've used this code before to change gender observations to dummy variables, so I'm confused why it is not working here.

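As I understand it, you want to find the elements whose (string)
value is between '11' and '19' and set a new string there.
The error comes from the left-hand side of your assignment: Football.ydstogo[...] & [...] is an expression built with the & operator, and Python cannot assign to the result of an operator. Both conditions have to go inside a single pair of indexing brackets, each wrapped in parentheses.
So you should probably change your code to: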
Football.ydstogo[(Football.ydstogo >= '11') & (Football.ydstogo <= '19')] = '10-plus'
Alternative:
Football.ydstogo[Football.ydstogo.between('11', '19')] = '10-plus'
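One caveat with the comparisons above is that they compare strings character by character, which happens to work for two-digit values but is easy to get wrong. If ydstogo can be treated as a number, a sketch along the following lines may be safer. It assumes a small made-up DataFrame named football, and the column name ydstogo_group is hypothetical. It writes the labels to a new column with Series.where instead of assigning through a chained index, since newer pandas versions may not propagate a chained assignment back to the frame.

```python
import pandas as pd

# Made-up sample data; the real Football frame is not shown in the question.
football = pd.DataFrame({"ydstogo": [3, 10, 11, 15, 19, 25]})

# Numeric, inclusive range check (11 <= value <= 19).
mask = football["ydstogo"].between(11, 19)

# Convert to strings first so the '10-plus' label does not mix dtypes,
# then keep the original value wherever the mask is False.
labels = football["ydstogo"].astype(str)
football["ydstogo_group"] = labels.where(~mask, "10-plus")

print(football["ydstogo_group"].tolist())
```

Keeping the original numeric column alongside the grouped one means the model can still use the raw distance while the plots use the grouped labels.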

RStudio Error: Unused argument ( by = ...) when fitting gam model, and smoothing separately for a factor

I am still a beginner in R. For a project I am trying to fit a GAM on a simple dataset with a time variable and a year. I am doing it in R and I keep getting an error message that claims an argument is unused, even though I specify it in the code.
The dataset includes a categorical variable "Year" with only two levels, 2020 and 2022. I want to investigate whether there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a GAM for this, and have the smoothing applied separately for the two years.
The following is the line of code that I tried to use
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when trying to summon the output using summary(gam1), I get
Error in `[<-`(`*tmp*`, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing that I'm doing wrong, I decided to combine the question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths and as such the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load your data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off loading only {mgcv} if you want factor by smooths.
@Gavin Simpson
I did have both loaded, and I tried just using mgcv as you suggested. However, then I get the following error.
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually trying to use the "gam" function, but rather it attempts to name something gam1. So I would assume I actually need the package of 'gam' before I could do this.
The second line of code also doesn't work. I get the following error
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order I load the two packages in. And if I don't load 'gam', I get the error as described above.

How to define nonlinear vector constraints in Julia

I'm trying to minimize a function that takes a vector as input and is subject to some nonlinear constraints. I'm very new to Julia. I'm trying to implement pseudospectral methods using Ipopt. My issue is that the optimizer I'm using needs the gradients of the cost function and the constraints, and packages like ForwardDiff and ReverseDiff are not managing to differentiate my vector-valued function.
I found that a similar issue has been faced by @acauligi. So far I haven't found any solution.
using LinearAlgebra, DiffEqOperators, ForwardDiff, ApproxFun, FFTW, ToeplitzMatrices
using ModelingToolkit,DifferentialEquations,NLPModels,ADNLPModels,NLPModelsIpopt
using DCISolver,JSOSolvers
# Number of collocation points
N=31 # This number can go up to 200
function Dmatrix(N::Integer)
h=2*pi/N;
ns=range(1,N-1,step=1);
col1(nps)=0.5*((-1)^nps)/sin(nps*h/2);
col=[0,col1.(ns)...];
row=[0,col[end:-1:2]...];
D=Toeplitz(col,row)
end
Dmat=Dmatrix(N);
function dzdt(x,y,t,a)
u=(1-(x^2)/4)-y^2;
dx=-4*y+x*u+a*x;
dy=x+y*u+a*y;
[dx,dy]
end
# initial guess
tfinal=1.1*pi;
tpoints=collect(range(1,N,step=1))*tfinal/N;
xguess=sin.((2*pi/tfinal)*tpoints)*2.0
yguess=-sin.((2*pi/tfinal)*tpoints)*0.5
function dxlist(xs,ys,tf,a)
nstates=2
ts=collect(range(1,N,step=1))*tf/N;
xytsZip=zip(xs,ys,ts);
dxD0=[dzdt(x,y,t,a) for (x,y,t) in xytsZip];
dxD=reduce(hcat, dxD0)';
xlyl=reshape([xs;ys],N,nstates);
dxF=(Dmat*xlyl)*(2.0*pi/tf);
err=dxD-dxF;
[vcat(err'...).-10^(-10);-vcat(err'...).+10^(-10)]
end
function cons(x)
tf=x[end-1];
a=x[end];
xs1=x[1:N];
ys1=x[N+1:2*N];
dxlist(xs1,ys1,tf,a)
end
a0=10^-3;
x0=vcat([xguess;yguess;[tfinal,a0]]);
obj(x)=0.0;
xlower1=push!(-3*ones(2*N),pi);
xlower=push!(xlower1,-10^-3)
xupper1=push!(3*ones(2*N),1.5*pi);
xupper=push!(xupper,10^-3)
consLower=-ones(4*N)*Inf;
consUpper=zeros(4*N)
# println("constraints vector = ",cons(x0))
model=ADNLPModel(obj,x0,xlower,xupper,cons,consLower,consUpper; backend =
ADNLPModels.ReverseDiffAD)
output=ipopt(model)
xstar=output.solution
fstar=output.objective
I got the solution for this same problem in 3 minutes in MATLAB. (The solution to this problem is . The time period of the system is "pi" when a=0.)
I was hoping I could get the same result much faster in Julia. I have also asked on the Julia Discourse, but so far I haven't received any suggestions. Any suggestion on how to fix this issue would be highly appreciated. Thank you all.
I think there were two issues with your code. First, xupper is built by calling push! on xupper itself before it is defined; it should be built from xupper1:
xupper1=push!(3*ones(2*N),1.5*pi);
xupper=push!(xupper1,10^-3)
Second, for some reason the product of the Toeplitz matrix by another matrix gives an error with the automatic differentiation. However, multiplying it by one column at a time works:
function dxlist(xs,ys,tf,a)
nstates=2
ts=collect(range(1,N,step=1))*tf/N;
xytsZip=zip(xs,ys,ts);
dxD0=[dzdt(x,y,t,a) for (x,y,t) in xytsZip];
dxD=reduce(hcat, dxD0)';
xlyl=reshape([xs;ys],N,nstates);
dxF=vcat((Dmat*xlyl[:,1])*(2.0*pi/tf), (Dmat*xlyl[:,2])*(2.0*pi/tf));
err=vcat(dxD...) - dxF;
[err.-10^(-10);-err.+10^(-10)]
end
At the end, Ipopt returns the right results
model=ADNLPModel(obj,x0,xlower,xupper,cons,consLower,consUpper)
output=ipopt(model)
xstar=output.solution
fstar=output.objective
I also noticed that using Percival.jl is faster
using Percival
output=percival(model, max_time = 300.0)
xstar=output.solution
fstar=output.objective
Note that ADNLPModels.jl is receiving some attention and will improve significantly.

splitting columns with str.split() not changing the outcome

I have to use str.split() for an exercise. I have a column called title, and I need to split it into two columns, Name and Season. The following code does not throw an error, but it doesn't seem to be doing anything either when I test it with df.head():
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
Any help as to why?
The code you have in your question is correct, and should be working. The issue could be coming from the execution order of your code though, if you're using Jupyter Notebook or some method that allows for unclear ordering of code execution.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing those lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
print(df.columns)
Alternatively you could add an assertion step in your code to ensure it's working as well:
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"
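To check the split in isolation, it can help to run it on a tiny frame first. The sketch below uses made-up titles, since the real contents of the title column are not shown in the question, and it assumes the separator is a colon followed by a space.

```python
import pandas as pd

# Hypothetical titles; the actual data is not shown in the question.
df = pd.DataFrame({"title": ["Show A: Season 1", "Show B: Season 2", "Movie C"]})

# Split on the first colon only; expand=True returns one column per piece.
df[["Name", "Season"]] = df["title"].str.split(":", n=1, expand=True)

# Remove the space that was left after the colon.
df["Season"] = df["Season"].str.strip()

print(df[["Name", "Season"]])
```

Rows without a colon end up with a missing Season. If no row contains a colon at all, the split produces a single column and the two-column assignment raises an error, which is one more thing worth ruling out on the real data.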

Case Statement Assistance

I have a case statement currently built but need to add an extra layer and I'm not sure of the most efficient and accurate way of doing it. I have these drug codes that are for SYRINGE usage, and another set of codes for PEN usage. I have those layers built in, easy enough. I am trying to add a third layer to determine if the member used both a pen and a syringe, so the member would have a code from both categories during my specified time period. Any ideas? Thanks in advance for the help!
, CASE WHEN NDC.GPI IN (
'2710400300D220',
'2710400300D233',
'2710400300D236',
'2710400400D220',
'2799100225D220',
'2799100235D220') THEN 'PEN'
WHEN NDC.GPI IN (
'27104004002022',
'27104010002005',
'27104020001805',
'27104070001820',
'2730001000E530') THEN 'SYRINGE'
ELSE 'OTHER' END AS DOSAGE_FORM

pseudo randomization in loop PsychoPy

I know other people have asked similar questions in the past, but I am still stuck on how to solve the problem and was hoping someone could offer some help. Using PsychoPy, I would like to present different images, specifically 16 emotional trials, 16 neutral trials and 16 face trials. I would like to pseudo-randomize the loop so that there are never more than 2 consecutive emotional trials. I created the experiment in Builder but compiled a script after reading through previous posts on pseudo-randomization.
I have read the previous posts that suggest creating randomized Excel files and using those, but considering how many trials I have, I think that would be too many files and was hoping for some help with coding. I have tried to implement and tweak some of the code that has been posted for my experiment, but to no avail.
Does anyone have any advice for my situation?
Thank you,
Rae
Here's an approach that will always converge very quickly, given that you have 16 of each type and only reject runs of more than two emotion trials. @brittUWaterloo's suggestion to generate trials offline is very good--this is what I do myself typically. (I like to have a small number of random orders, do them forward for some subjects and backwards for others, and prescreen them to make sure there are no weird or unintended juxtapositions.) But the algorithm below is certainly safe enough to do within an experiment if you prefer.
This first example assumes that you can represent a given trial using a string, such as 'e' for an emotion trial, 'n' neutral, 'f' face. This would work with 'emo', 'neut', 'face' as well, not just single letters, just change eee to emoemoemo in the code:
import random
trials = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
# the unshuffled list starts with a run of 16 'e', so the loop runs at least once
while 'eee' in ''.join(trials):
    random.shuffle(trials)
print(trials)
Here's a more general way of doing it, where the trial codes are not restricted to be strings (although they are strings here for illustration):
import random
def run_of_3(trials, obj):
    # detect if there's a run of at least 3 objects 'obj'
    for i in range(2, len(trials)):
        if trials[i-2: i+1] == [obj] * 3:
            return True
    return False
tr = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while run_of_3(tr, 'e'):
    random.shuffle(tr)
print(tr)
Edit: To create a PsychoPy-style conditions file from the trial list, just write the values into a file like this:
with open('emo_neu_face.csv', 'w') as f:  # text mode, so str values can be written
    f.write('stim\n') # this is a 'header' row
    f.write('\n'.join(tr)) # these are the values
Then you can use that as a conditions file in a Builder loop in the regular way. You could also open this in Excel, and so on.
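The shuffle-until-valid idea above can also be wrapped in a function with a cap on the number of attempts, so a badly chosen constraint fails loudly instead of looping forever. This is only a sketch: the names has_long_run and shuffled_trials, and the max_run, max_attempts and seed parameters, are made up for illustration and are not part of PsychoPy.

```python
import random

def has_long_run(trials, target, max_run=2):
    """Return True if `target` occurs more than `max_run` times in a row."""
    run = 0
    for t in trials:
        run = run + 1 if t == target else 0
        if run > max_run:
            return True
    return False

def shuffled_trials(trials, target, max_run=2, max_attempts=10_000, seed=None):
    """Reshuffle until `target` never exceeds `max_run` in a row, or give up."""
    rng = random.Random(seed)   # a seed makes the order reproducible
    order = list(trials)        # work on a copy, leave the input untouched
    for _ in range(max_attempts):
        rng.shuffle(order)
        if not has_long_run(order, target, max_run):
            return order
    raise RuntimeError("no valid order found; relax the constraint or raise max_attempts")

all_trials = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
tr = shuffled_trials(all_trials, 'e', seed=1)
print(tr)
```

Passing a fixed seed gives the same order on every run, which fits the suggestion of prescreening a small number of orders before using them with participants.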
This is not quite right, but hopefully it will give you some ideas. I think you could occasionally get caught in an infinite cycle in the elif statement if the last three items ended up the same, but you could add some sort of a counter there. In any case this shows a strategy you could adapt. Rather than put this in the experimental code, I would generate the trial sequence separately at the command line, and then save a successful output as a list in the experimental code to show to all participants, so you know things won't crash during an actual run.
import random as r
#making some dummy data
abc = ['f']*10 + ['e']*10 + ['d']*10
def f (l1,l2):
    #just looking at the output to see how it works; can delete
    print("l1 = " + str(l1))
    print(l2)
    if not l2:
        #checks if second list is empty, if so, we are done
        out = list(l1)
    elif (l1[-1] == l1[-2] and l1[-1] == l2[0]):
        #shuffling changes list in place, have to copy it to use it
        r.shuffle(l2)
        t = list(l2)
        f (l1,t)
    else:
        print("i am here")
        l1.append(l2.pop(0))
        f(l1,l2)
    return l1
You would then run it with something like newlist = f(abc[0:2], abc[2:]). The slice abc[2:] keeps the last trial, which abc[2:-1] would silently drop.
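The counter mentioned above could look roughly like the following non-recursive sketch. It builds the order one trial at a time, never allows the same value more than max_run times in a row, and starts over when only the blocked value is left. The function name build_order and its parameters are hypothetical, and the restart rule is an assumption about how one might handle the dead end, not something taken from the code above.

```python
import random

_NO_BLOCK = object()  # sentinel meaning "nothing is blocked right now"

def build_order(trials, max_run=2, max_restarts=1000, seed=None):
    """Draw trials one by one so no value repeats more than max_run times in a row."""
    rng = random.Random(seed)
    for _ in range(max_restarts):          # the counter that prevents an endless loop
        pool = list(trials)
        order = []
        while pool:
            tail = order[-max_run:]
            # A value is blocked if the last max_run picks were all that value.
            full_run = len(tail) == max_run and len(set(tail)) == 1
            blocked = tail[0] if full_run else _NO_BLOCK
            allowed = [i for i, t in enumerate(pool) if t != blocked]
            if not allowed:
                break                      # dead end: restart with a fresh pool
            order.append(pool.pop(rng.choice(allowed)))
        else:
            return order                   # pool emptied without hitting a dead end
    raise RuntimeError("no valid order found within max_restarts")

abc = ['f'] * 10 + ['e'] * 10 + ['d'] * 10
order = build_order(abc, seed=0)
print(order)
```

Unlike the shuffle-and-reject loop, this version constrains every trial type, which matches the check in the elif branch above.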