SPSS: DO REPEAT with different numbers of matched variables

I have a dataset where each case has the following set of variables:
VarA1.1 to VarA25.185 (total of 4625 variables)
VarB.1 to VarB.185 (total of 185 variables)
For each case, VarA1.1, VarA2.1, VarA3.1, etc. are all linked to the same VarB.1.
I want to use a DO REPEAT function to search through each .1 instance using both VarA and VarB.
Example code:
DO REPEAT VarA = VarA1.1 to VarA25.185
/ VarB = VarB.1 to VarB.185.
if (VarA = X) AND ((VarB-Y)<0)
VarC = Z.
END REPEAT.
EXE.
However, it seems that because there are different numbers of variables in the VarA and VarB repeat lists, they don't pair up. I want to associate each VarA#(1-25).1 with VarB.1, each VarA#(1-25).2 with VarB.2, and so on up to VarB.185, so that the correct pairing of variables is used inside the DO REPEAT.
Thanks!

Another way to do this is to use a LOOP on the outside and a DO REPEAT on the inside. So here is some example data, with just three A vectors whose indices go from 1 to 10.
SET SEED 10.
INPUT PROGRAM.
LOOP Id = 1 TO 100.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
*Making random data.
VECTOR A1.(10).
VECTOR A2.(10).
VECTOR A3.(10).
VECTOR B.(10).
NUMERIC X Y.
DO REPEAT a = A1.1 TO Y.
COMPUTE a = RV.BERNOULLI(0.5).
END REPEAT.
EXECUTE.
So here is the part you want to pay attention to. Your DO REPEAT currently tries to run over all of the variable pairs at once. This switches things around: the LOOP goes over the 185 indices (10 in this example), and the DO REPEAT goes over each of your 25 A vectors (3 in this example).
VECTOR A1 = A1.1 TO A1.10.
VECTOR A2 = A2.1 TO A2.10.
VECTOR A3 = A3.1 TO A3.10.
VECTOR B = B.1 TO B.10.
VECTOR C.(10).
LOOP #i = 1 TO 10.
DO REPEAT A = A1 A2 A3.
IF (A(#i) = X) AND (B(#i)-Y<0) C.(#i) = B(#i).
END REPEAT.
END LOOP.
EXECUTE.
In terms of code golf it is probably not going to beat the macro approach, since you have to define all of those VECTOR statements, but I think it is a conceptually clearer way to write the program.

It looks like what you are trying to do is loop over 25 variables but repeat this for 185 variables.
It would be more intuitive to use SPSS macros to achieve this. Stepping through the code below will demonstrate the building blocks for solving your data problem.
DEFINE !MyMacroName ()
SET MPRINT ON.
/* Generate some example data to match desired data format*/.
set seed = 10.
input program.
loop #i = 1 to 50.
compute case = #i.
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.
!do !i =1 !to 25
vector !concat('VarA',!i,'.(185, F1.0).').
do repeat v = !concat('VarA',!i,'.1') to !concat('VarA',!i,'.185').
compute v = TRUNC(RV.UNIFORM(1,6)).
end repeat.
!doend
vector VarB.(185, F1.0).
do repeat v = VarB.1 to VarB.185.
compute v = TRUNC(RV.UNIFORM(1,6)).
end repeat.
execute.
/* Solve actual problem */.
!do !i =1 !to 185
!do !j = 1 !to 25
if (!concat('VarA',!j,'.',!i) = !concat('VarB.',!i)) !concat('VarC', !j)=1.
!doend
!doend
SET MPRINT OFF.
!ENDDEFINE.
/* Run macro */.
!MyMacroName.

Related

Are calculations involving a large matrix using arrays in VBA faster than doing the same calculation manually in Excel?

I am trying to do calculations as part of regression model in Excel.
I need to calculate ((X^T)WX)^(-1)(X^T)WY, where X, W and Y are matrices and ^T and ^(-1) denote the matrix transpose and inverse.
Now, when X, W and Y have small dimensions, I simply run my macro, which calculates these values very quickly.
However, sometimes I am dealing with the case where the dimensions of X, W and Y are 5000 x 5, 5000 x 1 and 5000 x 1 respectively, and then the macro can take a lot longer to run.
I have two questions:
Instead of using my macro, which generates the matrices on Excel sheets and then uses Excel formulas like MMULT and MINVERSE to calculate the output, would it be faster for larger matrices if I used arrays in VBA to do all the calculations? (I am not too sure how arrays work in VBA, so I don't actually know whether it would do anything to Excel, and hence whether it would be any quicker or less computationally intensive.)
If the answer to the above question is no and it would be no quicker, then does anybody have an idea how to speed such calculations up? Or do I need to simply put up with it and wait?
Thanks for your time.
Considering that the algorithm of the code is the same, the speed ranking is the following:
Dll custom library with C#, C++, C, Java or anything similar
VBA
Excel
I have compared a VBA vs C++ function here; in the long run the result is really bad for VBA.
So, the following Fibonacci with recursion in C++:
int __stdcall FibWithRecursion(int & x)
{
    int k = 0;
    int p = 0;
    if (x == 0)
        return 0;
    if (x == 1)
        return 1;
    k = x - 1;
    p = x - 2;
    return FibWithRecursion(k) + FibWithRecursion(p);
}
is exponentially better, when called in Excel, than the same complexity function in VBA:
Public Function FibWithRecursionVBA(ByRef x As Long) As Long
    Dim k As Long: k = 0
    Dim p As Long: p = 0
    If (x = 0) Then FibWithRecursionVBA = 0: Exit Function
    If (x = 1) Then FibWithRecursionVBA = 1: Exit Function
    k = x - 1
    p = x - 2
    FibWithRecursionVBA = FibWithRecursionVBA(k) + FibWithRecursionVBA(p)
End Function
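For reference, the compiled C++ routine is typically reached from Excel through a module-level VBA Declare statement along these lines; the DLL name and path below are placeholders, not something from the original comparison:
'Hypothetical declaration for calling the compiled C++ routine from VBA.
'"FibLib.dll" and its path are assumptions; on 32-bit Office drop the PtrSafe keyword.
Declare PtrSafe Function FibWithRecursion Lib "C:\libs\FibLib.dll" (ByRef x As Long) As Long
Once declared, the function can be called from VBA like any other, or wrapped in a thin Public Function so it can also be used from worksheet cells.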
Better late than never:
I use matrices that are bigger, 3 or 4 dimensions, sized like 16k x 26 x 5.
I run through them to find data, apply one or two formulas or make combos with other matrices.
Number one: after starting the macro, open another application like Notepad; you might get a nice speed increase ☺!
Then, I assume you have already switched off screen updating and turned off automatic calculation.
Lastly: don't put the data in cells, but in arrays.
Just something like:
Dim Matrix1() As String  'put it in the declarations section if you want to use it in other macros as well. Remember you cannot do "blabla = ActiveCell.Value2" etc. on it any more!!
In the Sub's code, use ReDim Matrix1(1 To a_value, 1 To second_value, ... , 1 To last_value)
Matrix1(45,32,63)="what you want to put there"
After running, just drop the slice
Matrix1(1 To a_value, 1 To second_value, 1) onto the 1st sheet,
Matrix1(1 To a_value, 1 To second_value, 2) onto the 2nd sheet, etc.
Switch on screen updating again, etc.
In this way my calculation went from 45 minutes to just one, by avoiding the intermediate screen updates.
Good luck, I hope it is useful for somebody.
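To make that concrete, here is a minimal sketch of the one-read / work-in-memory / one-write pattern; the sheet name, range size, and the doubling step are placeholders, not something from the original answer:
'Minimal sketch of the "work in arrays, not cells" idea.
'Sheet name, range and the example transformation are assumptions.
Sub ProcessInArray()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Data")   'hypothetical sheet name

    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual

    'One read: pull the whole block into a 2-D Variant array.
    Dim v As Variant
    v = ws.Range("A1:E5000").Value2

    'Work on the array in memory (example: double every value).
    Dim r As Long, c As Long
    For r = LBound(v, 1) To UBound(v, 1)
        For c = LBound(v, 2) To UBound(v, 2)
            v(r, c) = v(r, c) * 2
        Next c
    Next r

    'One write: push the whole block back to the sheet.
    ws.Range("G1").Resize(UBound(v, 1), UBound(v, 2)).Value2 = v

    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True
End Sub
The single read and single write are what buy the speed; touching cells one by one inside the loop is what makes the naive version slow.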

Matlab: how do I run the optimization (fmincon) repeatedly?

I am trying to follow the tutorial of using the optimization tool box in MATLAB. Specifically, I have a function
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1)+b
subject to the constraints:
(x(1))^2+x(2)-1=0,
-x(1)*x(2)-10<=0.
and I want to minimize this function for a range of b=[0,20]. (That is, I want to minimize this function for b=0, b=1,b=2 ... and so on).
Below are the steps taken from the MATLAB tutorial webpage (http://www.mathworks.com/help/optim/ug/nonlinear-equality-and-inequality-constraints.html); how should I change the code so that the optimization runs once for each b and saves the optimal values?
Step 1: Write a file objfun.m.
function f = objfun(x)
f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1)+b;
Step 2: Write a file confuneq.m for the nonlinear constraints.
function [c, ceq] = confuneq(x)
% Nonlinear inequality constraints
c = -x(1)*x(2) - 10;
% Nonlinear equality constraints
ceq = x(1)^2 + x(2) - 1;
Step 3: Invoke constrained optimization routine.
x0 = [-1,1]; % Make a starting guess at the solution
options = optimoptions(@fmincon,'Algorithm','sqp');
[x,fval] = fmincon(@objfun,x0,[],[],[],[],[],[],...
@confuneq,options);
After 21 function evaluations, the solution produced is
x, fval
x =
-0.7529 0.4332
fval =
1.5093
Update:
I tried your answer, but I am encountering a problem with your step 2. Basically, I just filled my step 2 into your step 2 (below the comment "optimization just like before").
%initialize list of targets
b = 0:1:20;
%preallocate/initialize result vectors using zeros (increases speed)
opt_x = zeros(length(b));
opt_fval = zeros(length(b));
for idx = 1, length(b)
objfun = @(x)objfun_builder(x,b)
%optimization just like before
x0 = [-1,1]; % Make a starting guess at the solution
options = optimoptions(@fmincon,'Algorithm','sqp');
[x,fval] = fmincon(@objfun,x0,[],[],[],[],[],[],...
@confuneq,options);
%end the stuff I fill in
opt_x(idx) = x
opt_fval(idx) = fval
end
However, the output it gave me is:
Error: "objfun" was previously used as a variable, conflicting
with its use here as the name of a function or command.
See "How MATLAB Recognizes Command Syntax" in the MATLAB
documentation for details.
There are two things you need to change about your code:
Creation of the objective function.
Multiple optimizations using a loop.
1st Step
For more flexibility with regard to b, you need to set up another function that returns a handle to the desired objective function, e.g.
function h = objfun_builder(b)
    % return a handle to an objective function with this value of b baked in
    h = @objfun;
    function f = objfun(x)
        f = exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1) + b;
    end
end
A more elegant and shorter approach is an anonymous function, e.g.
objfun_builder = @(b) @(x)(exp(x(1))*(4*x(1)^2+2*x(2)^2+4*x(1)*x(2)+2*x(2)+1) + b);
After all, this works out to be the same as above. It might be less intuitive for a Matlab beginner, though.
2nd Step
Instead of placing an .m-file objfun.m in your path, you will need to call
objfun = objfun_builder(myB);
(where myB is the value of b you want) to create an objective function in your workspace. In order to loop over the interval b=[0,20], use the following loop
%initialize list of targets
b = 0:1:20;
%preallocate/initialize result vectors using zeros (increases speed)
opt_x = zeros(length(b), 2);
opt_fval = zeros(length(b), 1);
%start optimization over the list of targets (`b`s)
for idx = 1:length(b)
    objfun = objfun_builder(b(idx));
    %optimization just like before
    x0 = [-1,1]; % Make a starting guess at the solution
    options = optimoptions(@fmincon,'Algorithm','sqp');
    [x,fval] = fmincon(objfun,x0,[],[],[],[],[],[],@confuneq,options);
    %store this run's results
    opt_x(idx,:) = x;
    opt_fval(idx) = fval;
end

Counting occurrences of values in spss

I have 50 variables, named w1 to w50, and each holds a value from 1 to 20. I want to create variables showing the number of occurrences of each of these values. This is what I'd like to do, but SPSS seems to have a problem with me using #n in the COUNT command.
COMPUTE #n = 1 .
DO REPEAT x = num1 to num20 .
COMPUTE x = 0 .
COUNT x = w1 to w50 (#n) .
COMPUTE #n = #n + 1 .
END REPEAT .
This is the error message I get:
Error # 4772 in column 24. Text: #n
On the COUNT command, the parenthesized value list is syntactically invalid.
Execution of this command stops.
You cannot supply a variable as the value list in the COUNT command. Fortunately the workaround for your example is quite simple: you can use a stand-in counter on the DO REPEAT:
DO REPEAT x = num1 to num20 /#i = 1 to 20.
COUNT x = w1 to w50 (#i).
END REPEAT.
Full example below.
**********************************************.
*creating fake data.
data list free / ID.
begin data
1
2
end data.
vector w(50,F2.0).
loop #i = 1 to 50.
compute w(#i) = TRUNC(RV.UNIFORM(1,21)).
end loop.
vector num(20,F2.0).
execute.
*making new vector.
DO REPEAT x = num1 to num20 /#i = 1 to 20.
COUNT x = w1 to w50 (#i).
END REPEAT.
EXECUTE.
**********************************************.

Looping and if statements in SPSS

I'm new to SPSS and I'm a bit stuck on a problem. I have about 200 variables and I want to loop through pairs of them looking for variables with correlation coefficients above 0.7. I know that I can use CORRELATIONS to get a matrix of coefficients but it would be huge and difficult to look through. Basically, in pseudocode, what I want to do is:
for (i = W1_1 to W1_200) {
for (j = i to W1_200) {
if CORRELATIONS(i,j)>0.7 {
print i, j, CORRELATIONS(i,j)
}
}
}
I can't for the life of me work out how to do any of this in SPSS. Help!
SPSS has a helper function on the CORRELATIONS command to export the correlation matrix. From there you can manipulate the data to get the correlation pairs that meet your criteria. So first, let's make some fake data to illustrate.
*Making fake data.
set seed 5.
input program.
loop i = 1 to 100.
end case.
end loop.
end file.
end input program.
dataset name test.
compute #base = RV.NORMAL(0,1).
vector X(20).
loop #i = 1 to 20.
compute X(#i) = #base*(#i/20) + RV.NORMAL(0,1).
end loop.
exe.
Now, we can run the CORRELATIONS command and export the table to a new dataset (which I named here Corrs).
DATASET DECLARE Corrs.
CORRELATIONS
/VARIABLES=X1 to X20
/MATRIX=OUT('Corrs').
Unfortunately, SPSS returns the full matrix (plus other info on the sample size). We can select only the rows we are interested in (ones with "CORR" in the ROWTYPE_ column) and then use a DO REPEAT to set the upper half of the matrix to system missing values, keeping the lower half.
DATASET ACTIVATE Corrs.
SELECT IF ROWTYPE_ = "CORR".
*Now only making lower half of matrix.
COMPUTE #iter = 0.
DO REPEAT X = X1 TO X20.
COMPUTE #iter = #iter + 1.
IF #iter > ($casenum-1) X = $SYSMIS.
END REPEAT.
I set them to system missing values because in the next part I reshape the data using VARSTOCASES, which drops missing values by default, so we won't end up with redundant correlation pairs.
VARSTOCASES
/MAKE Corr FROM X1 TO X20
/INDEX X2 (Corr)
/DROP ROWTYPE_.
RENAME VARIABLES (VARNAME_ = X1).
Now you have your correlation pairs list and can just select out the correlations that meet your criteria.
SELECT IF ABS(Corr) >= .5.
The making of the correlation pairs can easily be turned into a MACRO that returns the pair list. Below is that macro, recreating the exact steps used here.
DEFINE !CorrPairs (!POSITIONAL !CMDEND)
DATASET DECLARE Corrs.
CORRELATIONS
/VARIABLES=!1
/MATRIX=OUT('Corrs').
DATASET ACTIVATE Corrs.
SELECT IF ROWTYPE_ = "CORR".
COMPUTE #iter = 0.
DO REPEAT X = !1.
COMPUTE #iter = #iter + 1.
IF #iter > ($casenum-1) X = $SYSMIS.
END REPEAT.
VARSTOCASES
/MAKE Corr FROM !1
/INDEX X2 (Corr)
/DROP ROWTYPE_.
RENAME VARIABLES (VARNAME_ = X1).
!ENDDEFINE.
The macro just takes a list of variables (in the active dataset) for which to grab the correlations, and returns a second dataset named Corrs with the correlation pairs, the variable names being given in the X1 and X2 columns. Once the macro is defined, the steps above can be recreated simply as below.
!CorrPairs X1 to X20.
SELECT IF ABS(Corr) >= .5.
EXECUTE.
My suggestion is to use OMS to extract your correlation values from the output into a datafile. Use a macro to only run the correlations you need:
DATASET DECLARE Correlations.
OMS /SELECT TABLES /IF COMMANDS=['Correlations'] SUBTYPES=['Correlations']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_ OUTFILE='Correlations' VIEWER=YES.
define runCorrs ()
!do !i1=1 !to 200
!do !i2=!i1 !to 200
!if (!i2<>!i1) !then
corr !concat("W_",!i1) with !concat("W_",!i2).
!ifend
!doend !doend
!enddefine.
runCorrs.
OMSEND.
dataset activate Correlations.
select if var2="Pearson Correlation".
VARSTOCASES /make crlVal from W_2 to W_200/index=withvar(crlVal)
/drop TableNumber_ Command_ Subtype_ Label_ Var2.
now you have a nice list of all the correlations to work with:
select if crlVal>0.7.
exe.

How to choose a range for a loop based upon the answers of a previous loop?

I'm sorry the title is so confusingly worded, but it's hard to condense this problem down to a few words.
I'm trying to find the minimum value of a specific equation. At first I'm looping through the equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12. I want to design a loop where the "steps" through the loop get smaller and smaller.
For example, the first time it loops through, it uses a step of 1, and I will get about 30 values. What I need help with is: how do I use the three smallest values I receive from this first loop?
Here's an example of the values I might get from the first loop: (I should note this isn't supposed to be actual code at all. It's just a brief description of what's happening)
loop from x = 1 to 8 with step 1
results:
x = 1 -> y = 30
x = 2 -> y = 28
x = 3 -> y = 25
x = 4 -> y = 21
x = 5 -> y = 18
x = 6 -> y = 22
x = 7 -> y = 27
x = 8 -> y = 33
I want something that can detect the lowest three values and create a loop from them. From these results, the values of x that give the smallest three results for y are x = 4, 5, and 6.
So my "guess" at this point would be x = 5. To get a better "guess" I'd like a loop that now does:
loop from x = 4 to x = 6 with step .5
I could keep this pattern going until I get an absurdly accurate guess for the minimum value of x.
Does anybody know of a way I can do this? I know the values I'm going to get can be modeled by a parabola opening upward, so this format will definitely work. I was thinking that the values could be put into a column. It wouldn't be hard to make something that returns the smallest value for y in that column, and the corresponding x-value.
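For instance, something along these lines would do that column lookup (a rough sketch; it assumes the x values sit in column A and the y values in column B of the active sheet, starting in row 1):
'Rough sketch: smallest y in column B and the x next to it in column A.
Sub SmallestYAndItsX()
    Dim ws As Worksheet
    Set ws = ActiveSheet
    Dim lastRow As Long
    lastRow = ws.Cells(ws.Rows.Count, "B").End(xlUp).Row
    Dim yRange As Range
    Set yRange = ws.Range("B1:B" & lastRow)
    Dim minY As Double, rowOfMin As Long
    minY = Application.WorksheetFunction.Min(yRange)
    rowOfMin = Application.WorksheetFunction.Match(minY, yRange, 0)
    Debug.Print "Smallest y = " & minY & " at x = " & ws.Cells(rowOfMin, "A").Value
End Sub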
If I'm being too vague, just let me know, and I can answer any questions you might have.
Nice question. Here's at least a start on what I think you should do for this:
Sub findMin()
    'indices of the three smallest return values found so far
    Dim lowest As Integer
    Dim middle As Integer
    Dim highest As Integer
    lowest = 999
    middle = 999
    highest = 999
    Dim i As Integer
    i = 1
    Do While i < 9
        If (retVal(i) < retVal(lowest)) Then
            highest = middle
            middle = lowest
            lowest = i
        Else
            If (retVal(i) < retVal(middle)) Then
                highest = middle
                middle = i
            Else
                If (retVal(i) < retVal(highest)) Then
                    highest = i
                End If
            End If
        End If
        i = i + 1
    Loop
End Sub
Function retVal(num As Integer) As Double
    'y = .245x^3 - .67x^2 + 5x + 12 from the question
    retVal = 0.245 * num ^ 3 - 0.67 * num ^ 2 + 5 * num + 12
End Function
What I've done here is set three Integers as your three Min values: lowest, middle, and highest. You loop through the values you're plugging into the formula (here, the retVal function) and comparing the return value of retVal (hence the name) to the values of retVal(lowest), retVal(middle), and retVal(highest), replacing them as necessary. I'm just beginning with VBA so what I've done likely isn't very elegant, but it does at least identify the Integers that result in the lowest values of the function. You may have to play around with the values of lowest, middle, and highest a bit to make it work. I know this isn't EXACTLY what you're looking for, but it's something along the lines of what I think you should do.
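Building on that, the shrinking-step search described in the question could look roughly like the sketch below. The stand-in objective yOf is a parabola with its minimum near x = 5 (to mimic the sample results in the question), and the start range, the halving rule, and the stopping tolerance are all assumptions; swap in the real formula and bounds.
'Rough sketch of the shrinking-step search from the question.
Sub RefineMinimum()
    Dim lo As Double, hi As Double, stp As Double
    Dim x As Double, bestX As Double, bestY As Double

    lo = 1: hi = 8: stp = 1

    Do While stp > 0.0001                    'stop once the step is small enough
        bestX = lo
        bestY = yOf(lo)
        x = lo
        Do While x <= hi + stp / 2           'small tolerance for floating-point drift
            If yOf(x) < bestY Then
                bestY = yOf(x)
                bestX = x
            End If
            x = x + stp
        Loop
        'Narrow the range around the current best guess and halve the step.
        lo = bestX - stp
        hi = bestX + stp
        stp = stp / 2
    Loop

    Debug.Print "Approximate minimum at x = " & bestX & ", y = " & bestY
End Sub

'Stand-in objective with a minimum near x = 5; replace with the real formula.
Function yOf(x As Double) As Double
    yOf = 0.5 * (x - 5) ^ 2 + 18
End Function
Each pass scans the current range, re-centres it on the best x found, and halves the step, which is exactly the 1, then 0.5, then 0.25 progression the question describes.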
There is no trivial way to approach this unless the problem domain is narrowed.
The example polynomial given in fact has no minimum, which is readily determined by observing that y' = .735x^2 - 1.34x + 5 has a negative discriminant (1.34^2 - 4*.735*5 is about -12.9), so y' > 0 everywhere and y is always increasing with respect to x.
Given the wide interpretation of "[an] equation, which for our purposes here can be something like y = .245x^3-.67x^2+5x+12",
many conditions need to be checked, even assuming the domain is limited to polynomials.
The polynomial's order is significant: it determines what conditions need to be checked, how many solutions are possible, and whether any solution is possible at all.
Without taking this complexity into account, an iterative approach could yield an incorrect solution due to underflow error, or an unfortunate choice of iteration steps or bounds.
I'm not trying to be hard on you here; I think your idea is neat. In practice it is just more complicated than you might think.