Minitab: Calculating sample variance in macro - minitab

I have a set of samples in C1 in Minitab. I draw 200 resamples from this data and store them in C2-C201. Now I want to calculate the sample variance of each of these columns and save the results in a separate column. I can get it to calculate the sample variance for each column, but I am having trouble saving the results. This is my current macro:
GMACRO #Starts the global macro
Resample #Names the macro
DO K1=2:201
Sample 16 'Data' CK1;
Replace. #Resampling
Name c202 "Variance"
Statistics CK1; # Calculate S^2
Variance 'Variance'.
ENDDO
ENDMACRO #Ends the macro
This does the job, but it just overwrites the same cell over and over. The optimal thing would be for it to just save it like c202(K1) for each iteration, but I'm not sure how to implement this.

Hmmm. There are many ways I would change that macro.
The cause of your stated problem is that the Variance subcommand to Statistics stores the results in a whole column, overwriting the contents. If you really want to store the 200 separate subsamples in separate columns then this should do it:
GMACRO #Starts the global macro
Resample #Names the macro
Name c202 "Variance" # no need to name this column 200 times!
DO K1=2:201
Sample 16 'Data' CK1;
Replace. #Resampling
Statistics CK1; # Calculate S^2
Variance C203. # into row 1 of temporary column C203
Stack 'Variance' C203 'Variance' # append to end of Variance
ENDDO
erase K1 c203 # clean up
ENDMACRO #Ends the macro
If you want to store the subsamples but are happy to store them in just two columns then this is neater:
GMACRO #Starts the global macro
Resample #Names the macro
Name C2 'Sample number' c3 'Sample data'
set 'Sample number' # populate Sample number
(1:200)16
end
DO K1=1:200
Sample 16 'Data' c4;
Replace. #Resampling
Stack C4 'Sample data' 'Sample data' # append to Sample data
ENDDO
Name c4 "Variance"
Statistics 'Sample data';
By 'Sample number';
Variance 'Variance'.
ENDMACRO #Ends the macro
Of course, 200 x 16 samples with replacement is equivalent to 3200 samples with replacement, so even neater, and much faster, would be:
GMACRO #Starts the global macro
Resample #Names the macro
Name C2 'Sample number' c3 'Sample data'
set 'Sample number' # populate Sample number
(1:200)16
end
Sample 3200 'Data' 'Sample data';
replace.
Name c4 "Variance"
Statistics 'Sample data';
By 'Sample number';
Variance 'Variance'.
ENDMACRO #Ends the macro
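For readers who want to check the logic outside Minitab, here is a minimal Python sketch of the same idea: draw 3200 values with replacement in one go, cut them into 200 groups of 16, and compute the sample variance of each group. The contents of `data` are made-up stand-ins for the 'Data' column, and the variable names are only illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the 16 observations in the 'Data' column.
data = [random.gauss(10, 2) for _ in range(16)]

n_resamples, size = 200, 16

# One draw of 200 * 16 values with replacement ...
draws = random.choices(data, k=n_resamples * size)

# ... split into 200 consecutive groups of 16, one sample variance per group.
variances = [
    statistics.variance(draws[i * size:(i + 1) * size])
    for i in range(n_resamples)
]

print(len(variances), "variances computed")
```

`statistics.variance` uses the n-1 denominator, which should correspond to what the Variance subcommand reports.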

Related

GAMS- manipulating expression within a loop

I have a matrix with i rows and j columns, a specific element of which is called x(i,j), where, say, i are plants and j are markets. In standard GAMS notation:
Sets
i canning plants / seattle, san-diego /
j markets / new-york, chicago, topeka / ;
Now, I also wish to create a loop over time, for 5 periods. Essentially, say I define
Set t time period
/period1
period2
period3
period4
period5
/ ;
Parameters
time(t)
/ period1 1,
period2 2,
period3 3,
period4 4,
period5 5
/ ;
Basically, I want to re-run this loop, which contains a bunch of other commands, but I wish to re-define this matrix from period 2 onwards, to look like this:
x("seattle",j)=x("seattle",j)+s("new-york",j)
x("new-york",j)=0
Essentially, within the loop, I want the matrix x(i,j) to look different after period 2, wherein the column x("seattle",j) is replaced with the erstwhile x("seattle",j)+s("new-york",j) and the column x("new-york",j) is set to 0.
The loop would start like :
loop
(t,
...
Option reslim = 20000 ;
option nlp = conopt3 ;
solve example using NLP maximizing VARIABLE ;
) ;
I am not sure how to keep redefining this matrix within the loop, for each period>2.
Please note: After period 2, the matrix looks the same. The change only happens once (i.e., the matrix elements do not keep updating from the previous period, but just switch once, at the end of period 2, and then stay constant thereafter).
Any help on this is much appreciated!
You can use a $ condition to make this change in the loop for period2 only. The same condition can be put on the second assignment, like this:
x("seattle",j)$sameAs(t,'period2')=x("seattle",j)+s("new-york",j);
x("new-york",j)$sameAs(t,'period2')=0;
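The control flow is perhaps easier to see in a small Python sketch: the update fires only in the iteration where the period is period2, and the changed matrix is then simply carried along unchanged. The numbers in `x` and `s` are invented for illustration, and the solve step is only a comment.

```python
periods = ["period1", "period2", "period3", "period4", "period5"]

# Hypothetical values; each list holds one entry per market j.
x = {"seattle": [1.0, 2.0, 3.0], "new-york": [4.0, 5.0, 6.0]}
s = {"new-york": [10.0, 20.0, 30.0]}

for t in periods:
    if t == "period2":
        # One-time switch, the analogue of the $sameAs(t,'period2') condition.
        x["seattle"] = [a + b for a, b in zip(x["seattle"], s["new-york"])]
        x["new-york"] = [0.0] * len(x["new-york"])
    # ... the solve for period t would go here ...
    print(t, x)
```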

Output to a text file

I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output but also have quite a lot of different ones. I have loaded these different ones into a macro variable separated by blanks so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
%DO N=1 %TO &TABLES_COUNT;
DATA _NULL_;
SET &&TABLE&N;
FILE 'PATH/&&TABLE&N..txt';
PUT a b c d "&vars";
RUN;
%END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is, the quoted string can only be 262 characters long. Some of the datasets I am trying to output have so many variables that this macro variable, which is a quoted string and holds all those variables, will be much longer than that. Is there another way to do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the length of the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (but older versions of SAS use 256 as the default line length). You can actually set that much higher if you want. So you could use 1,000,000 for example. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
retain a b c d ;
SET &&TABLE&N;
FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
put (_all_) (+0) ;
RUN;
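The effect of that last step (common variables first, everything else after them in its existing order) can be sketched in Python as follows. The function name `write_table`, the sample rows, and the use of an in-memory buffer instead of a real file are all assumptions made for the illustration.

```python
import io

COMMON = ["a", "b", "c", "d"]  # shared variables, written first


def write_table(rows, out):
    """Write rows (a list of dicts with identical keys) as space-delimited lines."""
    if not rows:
        return
    # Dataset-specific variables keep their original order after the common ones.
    specific = [name for name in rows[0] if name not in COMMON]
    columns = COMMON + specific
    for row in rows:
        out.write(" ".join(str(row[name]) for name in columns) + "\n")


# Hypothetical dataset whose stored column order does not start with a b c d.
table = [
    {"special1": 10, "a": 1, "b": 2, "c": 3, "d": 4, "special2": 20},
    {"special1": 11, "a": 5, "b": 6, "c": 7, "d": 8, "special2": 21},
]

buffer = io.StringIO()
write_table(table, buffer)
text = buffer.getvalue()
print(text, end="")
```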
I would tackle this by having one PUT statement per variable. Use a trailing @ so that you don't get a new line.
For example:
data test;
a=1;
b=2;
c=3;
output;
output;
run;
data _null_;
set test;
put a @;
put b @;
put c @;
put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a @;
803 put b @;
804 put c @;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. This has a limitation of 32kiB total, though if you're actually going to go over that you can work around it very easily (you can separate out various bits into multiple calls, and even structure the call so that if the callstr hits 32000 long you issue a call execute with it and then continue).
This avoids having to manage a bunch of large macro variables (your &VAR will really be &&VAR&N and will be many large macro variables).
data test;
length vars callstr $32767;
do _n_ = 1 by 1 until (last.memname);
set sashelp.vcolumn;
where memname in ('CLASS','CARS');
by libname memname;
vars = catx(' ',vars,name);
end;
callstr = catx(' ',
'data _null_;',
'set',cats(libname,'.',memname),';',
'file',cats('"c:\temp\',memname,'.txt"'),';',
'put',vars,';',
'run;');
call execute(callstr);
run;

Gnuplot: How to load and display single numeric value from data file

My data file has this content
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
I am making a histogram from the case... lines using the script below - and that works. My question is: how can I load the grand total value 1976 (next to the word 'total') from the data file and either (a) store it into a variable or (b) use it directly in the title of the plot?
This is my gnuplot script:
reset
set term png truecolor
set terminal pngcairo size 1024,768 enhanced font 'Segoe UI,10'
set output "output.png"
set style fill solid 1.00
set style histogram rowstacked
set style data histograms
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
plot for [i=3:7] 'mydata.dat' every ::1 using i:xticlabels(1) with histogram \
notitle, '' every ::1 using 0:2:2 \
with labels \
title "My Title"
For the benefit of others trying to label histograms, in my data file, the column after the case label represents the total of the rest of the values on that row. Those total numbers are displayed at the top of each histogram bar. For example for case1, 522 is the total of (278 + 146 + 65 + 26 + 7).
I want to display the grand total somewhere on my chart, say as the second line of the title or in a label. I can get a variable into sprintf into the title, but I have not figured out syntax to load a "cell" value ("cell" meaning row column intersection) into a variable.
Alternatively, if someone can tell me how to use the sum function to total up 522+120+660 (read from the data file, not as constants!) and store that total in a variable, that would obviate the need to have the grand total in the data file, and that would also make me very happy.
Many thanks.
Let's start with extracting a single cell at (row,col). If it is a single value, you can use the stats command to extract it. The row and col are specified with every and using, as in a plot command. In your case, to extract the total value, use:
# extract the 'total' cell
stats 'mydata.dat' every ::::0 using 2 nooutput
total = int(STATS_min)
To sum up all values in the second column, use:
stats 'mydata.dat' every ::1 using 2 nooutput
total2 = int(STATS_sum)
And finally, to sum up all values in columns 3:7 in all rows (i.e. the same as the previous command, but without using the saved totals), use:
# sum all values from columns 3:7 from all rows
stats 'mydata.dat' every ::1 using (sum[i=3:7] column(i)) nooutput
total3 = int(STATS_sum)
These commands require gnuplot 4.6 to work.
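As a cross-check of what the three stats commands compute, here is a small Python sketch that does the same arithmetic on the data from the question. The data is embedded as a string so that the sketch is self-contained; the variable names mirror the gnuplot ones.

```python
DATA = """\
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
total 1976
case1 522 278 146 65 26 7
case2 120 105 15 0 0 0
case3 660 288 202 106 63 1
"""

# Skip comment lines and split the rest into whitespace-separated fields.
rows = [line.split() for line in DATA.splitlines()
        if line and not line.startswith("#")]

total = int(rows[0][1])                                       # the 'total' cell
total2 = sum(int(r[1]) for r in rows[1:])                     # sum of column 2
total3 = sum(int(v) for r in rows[1:] for v in r[2:7])        # sum of columns 3:7

print(total, total2, total3)
```

Note that the stored total and the two computed sums do not agree for this data file.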
So, your plotting script could look like the following:
reset
set terminal pngcairo size 1024,768 enhanced
set output "output.png"
set style fill solid 1.00
set style histogram rowstacked
set style data histograms
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
# extract the 'total' cell
stats 'mydata.dat' every ::::0 using 2 nooutput
total = int(STATS_min)
plot for [i=3:7] 'mydata.dat' every ::1 using i:xtic(1) notitle, \
'' every ::1 using 0:(s = sum [i=3:7] column(i), s):(sprintf('%d', s)) \
with labels offset 0,1 title sprintf('total %d', total)
which gives the following output:
For Linux and similar systems:
If you don't know the row number where your data is located, but you know it is in the n-th column of a row where the value of the m-th column is x, you can define a function
get_data(m,x,n,filename)=system('awk "\$'.m.'==\"'.x.'\"{print \$'.n.'}" '.filename)
and then use it, for example, as
y = get_data(1,"case2",4,"datafile.txt")
With the data provided by user424855,
print y
should return 15.
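For comparison, the same lookup can be sketched in Python without awk. This version takes a list of lines rather than a file name and returns only the first match, whereas the awk call prints every matching row; both are choices made for the illustration.

```python
def get_data(m, x, n, lines):
    """Return field n of the first line whose field m equals x (1-based), else ''."""
    for line in lines:
        fields = line.split()
        if len(fields) >= max(m, n) and fields[m - 1] == x:
            return fields[n - 1]
    return ""


lines = [
    "total 1976",
    "case1 522 278 146 65 26 7",
    "case2 120 105 15 0 0 0",
    "case3 660 288 202 106 63 1",
]

y = get_data(1, "case2", 4, lines)
print(y)
```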
It's not clear to me where your "grand total" of 1976 comes from. If I calculate 522+120+660 I get 1302 not 1976.
Anyway, here is a solution which works even without stats and sum which were not available in gnuplot 4.4.0.
In the data you don't necessarily need the "grand total" or the sum of each row, because gnuplot can calculate this for you. This is done by (not) plotting the file as a matrix, and at the same time summing up the rows in the string variable S0 and the total sum in variable Total. There will be a warning, "warning: matrix contains missing or undefined values", which you can ignore. The labels are added by plotting '+' ... with labels, extracting the desired values from the S0 string.
Data: SO18583180.dat
So, the reduced input data looks like this:
# data file for use with gnuplot
# Report 001
# Data as of Tuesday 03-Sep-2013
case1 278 146 65 26 7
case2 105 15 0 0 0
case3 288 202 106 63 1
Script: (works for gnuplot>=4.4.0, March 2010 and gnuplot 5.x)
### histogram with sums and total sum
reset
FILE = "SO18583180.dat"
set style histogram rowstacked
set style data histograms
set style fill solid 0.8
set xlabel "Case"
set ylabel "Frequency"
set boxwidth 0.8
set key top left noautotitle
set grid y
set xrange [0:2]
set offsets 0.5,0.5,0,0
Total = 0
S0 = ''
addSums(v) = S0.sprintf(" %g",(M=$2,(N=$1+1)==1?S1=0:0,S1=S1+v))
plot for [i=2:6] FILE u i:xtic(1) notitle, \
'' matrix u (S0=addSums($3),Total=Total+$3,NaN) w p, \
'+' u 0:(real(S2=word(S0,int($0*N+N)))):(S2) every ::::M w labels offset 0,0.7 title sprintf("Total: %g",Total)
### end of script
Result: (created with gnuplot 4.4.0, Windows terminal)

Fortran read file into array - transposed dimensions

I'm trying to read a file into memory in a Fortran program. The file has N rows with two values in each row. This is what I currently do (it compiles and runs, but gives me incorrect output):
program readfromfile
implicit none
integer :: N, i, lines_in_file
real*8, allocatable :: cs(:,:)
N = lines_in_file('datafile.txt') ! a function I wrote, which works correctly
allocate(cs(N,2))
open(15, file='datafile.txt', status='old')
read(15,*) cs
do i=1,N
print *, cs(i,1), cs(i,2)
enddo
end
What I hoped to get was the data loaded into the variable cs, with lines as first index and columns as second, but when the above code runs, it first prints a line with two "left column" values, then a line with two "right column" values, then a line with the next two "left column" values, and so on.
Here's a more visual description of the situation:
In my data file: Desired output: Actual output:
A1 B1 A1 B1 A1 A2
A2 B2 A2 B2 B1 B2
A3 B3 A3 B3 A3 A4
A4 B4 A4 B4 B3 B4
I've tried switching the indices when allocating cs, but with the same results (or a segfault, depending on whether I also switch indices at the print statement). I've also tried reading the values row-by-row, but because of the irregular format of the data file (comma-delimited, not column-aligned) I couldn't get this working at all.
How do I read the data into memory the best way to achieve the results I want?
I do not see any comma in your data file. It should not make any difference with the list-directed input anyway. Just try to read it like you write it.
do i=1,N
read (15,*) cs(i,1), cs(i,2)
enddo
Otherwise, if you read the whole array in one statement, it is filled in column-major order, i.e., cs(1,1), cs(2,1), ..., cs(N,1), cs(1,2), cs(2,2), ... This is the order in which the array is stored in memory.
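The difference between the two read strategies can be illustrated with a short Python sketch that mimics the fill order by hand. It assumes a file with N = 4 rows, so the placeholder values arrive in the order A1 B1 A2 B2 and so on; the exact scrambling you see depends on N.

```python
# Values in the order they appear in the file: A1 B1 on line 1, A2 B2 on line 2, ...
values = ["A1", "B1", "A2", "B2", "A3", "B3", "A4", "B4"]
n_rows, n_cols = 4, 2

# Whole-array read: element (i, j) takes value number j*n_rows + i,
# because the first index varies fastest (column-major order).
whole_array = [[values[j * n_rows + i] for j in range(n_cols)]
               for i in range(n_rows)]

# Row-by-row read: each record fills exactly one row.
row_by_row = [values[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]

for name, table in (("whole array", whole_array), ("row by row", row_by_row)):
    print(name)
    for row in table:
        print(" ", *row)
```

Only the row-by-row variant keeps each A/B pair together on one row.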

Using : operator to index numpy.ndarray of numpy.void (as output by numpy.genfromtxt)

I generate data using numpy.genfromtxt like this:
ConvertToDate = lambda s:datetime.strptime(s,"%d/%m/%Y")
data= numpy.genfromtxt(open("PSECSkew.csv", "rb"),
delimiter=',',
dtype=[('CalibrationDate', datetime),('Expiry', datetime), ('B0', float), ('B1', float), ('B2', float), ('ATMAdjustment', float)],
converters={0: ConvertToDate, 1: ConvertToDate})
I now want to extract the last 4 columns (of each row but in a loop so lets just consider a single row) to separate variables. So I do this:
B0 = data[0][2]
B1 = data[0][3]
B2 = data[0][4]
ATM = data[0][5]
But if I can do this (like I could with a normal 2D ndarray for example) I would prefer it:
B0, B1, B2, ATM = data[0][2:]
But this gives me an 'invalid index' error. Is there a way to do this nicely or should I stick with the 4 line approach?
The output of np.genfromtxt is a structured array, that is, a 1D array where each row has different fields.
If you want to access some fields, just access them by names:
data["B0"], data["B1"], ...
You can also group them:
data[["B0", "B1"]]
which gives you a 'new' structured array with only the fields you wanted (quotes around 'new' because in NumPy 1.16 and later the data is not copied: the result is a view of your initial array; older versions return a copy).
Should you want some specific 'rows', just do:
data[["B0","B1"]][0]
which outputs the first row. Slicing and fancy indexing work too.
So, for your example:
B0, B1, B2, ATM = data[["B0","B1","B2","ATMAdjustment"]][0]
If you want to access only those fields row after row, I would suggest storing the whole array of the fields you want first, then iterating:
filtered_data = data[["B0","B1","B2","ATMAdjustment"]]
for row in filtered_data:
    (B0, B1, B2, ATM) = row
    do_something
or even:
for (B0, B1, B2, ATM) in filtered_data:
    do_something
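Here is a self-contained sketch of the field-selection approach that can be run without the CSV file. The array is built by hand with made-up values, and the date column is stored as a plain string for brevity; both are assumptions for the illustration. It uses .item() to turn each structured row into an ordinary Python tuple before unpacking.

```python
import numpy as np

# Hypothetical data standing in for the contents of PSECSkew.csv.
dtype = [("Expiry", "U10"), ("B0", float), ("B1", float),
         ("B2", float), ("ATMAdjustment", float)]
data = np.array(
    [("01/02/2013", 0.1, 0.2, 0.3, 0.4),
     ("01/03/2013", 0.5, 0.6, 0.7, 0.8)],
    dtype=dtype,
)

fields = ["B0", "B1", "B2", "ATMAdjustment"]
filtered_data = data[fields]  # structured array with only the selected fields

# Unpack the first row; .item() converts the structured scalar to a tuple.
B0, B1, B2, ATM = filtered_data[0].item()

# Or collect every row as a tuple.
rows = [row.item() for row in filtered_data]
print(B0, B1, B2, ATM)
print(rows)
```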