Algorithm for calculating most stable, consecutive values from a database - sql

I have some questions and I'm in need of your input.
Say I have a database table with 2000-3000 rows, where each row has a value and some identifiers. I need to extract ~100 consecutive rows with the most stable values (lowest spread). A few outlier ("jumper") values are okay if they can be excluded.
How would you do this and what algorithm would you use?
I'm currently using SAS Enterprise Guide against my database, which runs on Oracle. I don't really know that much of the generic SAS language, and I'm not sure what other language I could use for this. Some scripting language? I have limited programming knowledge, but this task seems fairly straightforward, correct?
The algorithms I've been thinking of are:
1. Select 100 consecutive rows and calculate the standard deviation. Increment the selection by 1 and calculate the standard deviation again. Loop through the whole table.
Export the rows with the lowest standard deviation.
2. Same as 1, but calculate the variance instead of the standard deviation (basically the same thing). When the whole table has been looped, do it again but exclude the one row that deviates most from the average. Repeat the process until 5 jumpers have been excluded and compare the results.
Pros and cons compared to method 1?
Questions:
Best & easiest method?
Preferred language? Possible in SAS?
Do you have any other method you would recommend?
Thanks in advance
/Niklas

The code below will do what you are asking. It just uses some sample data and only calculates it over 10 observations (rather than 100). I'll leave it to you to adapt as required.
Create some sample data, available to all SAS installations:
data xx;
  set sashelp.stocks;
  where stock = 'IBM';
  obs = _n_;
run;
Having created row numbers, sort the data by row number descending; this makes it easier to calculate the standard deviation:
proc sort data=xx;
  by descending obs;
run;
Use an array to keep the subsequent 10 obs for every row, and calculate the stddev for each row using the array (except for the last rows, which have fewer than 10 subsequent observations; remember we are working backwards through the data).
data calcs;
  set xx;
  array a[10] arr1-arr10;
  retain arr1-arr10 .;
  do tmp=10 to 2 by -1;
    a[tmp] = a[tmp-1];
  end;
  a[1] = close;
  if _n_ ge 10 then do;
    std = std(of arr1-arr10);
  end;
run;
Find which obs (i.e. row) had the lowest standard deviation and save it to a macro variable.
proc sql noprint;
  select obs into :start_row
  from calcs
  having std = min(std)
  ;
quit;
Select the 10 observations from the sample data that were involved in calculating the lowest standard deviation.
proc sql noprint;
  create table final as
  select *
  from xx
  /* &start_row plus the following 9 rows = the 10 rows that produced the lowest stddev */
  where obs between &start_row and %eval(&start_row+9)
  order by obs
  ;
quit;
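If it helps, here is a rough sketch of how the 10-observation window above could be parameterized when adapting it to 100 observations; the macro variable name &window is my own choice and not part of the answer above:
%let window = 100;

data calcs;
  set xx;
  array a[&window] arr1-arr&window;
  retain arr1-arr&window .;
  /* shift the retained values along and store the current close at position 1 */
  do tmp = &window to 2 by -1;
    a[tmp] = a[tmp-1];
  end;
  a[1] = close;
  /* stddev of the current row plus the following &window-1 rows (in original order) */
  if _n_ ge &window then std = std(of arr1-arr&window);
run;
The final PROC SQL step would then select obs between &start_row and %eval(&start_row + &window - 1).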

An addition to Robert's solution, but with part 2 included as well: create a second array, then loop through it and remove the top 5 values. You'll still need the last parts of Robert's solution to extract the row with the minimum standard deviation and then the corresponding attached rows. You didn't specify how you wanted to deal with the variances that have the max values removed, so they are left in the dataset.
data want;
  *set arrays for looping;
  /* used to calculate the std */
  array p{0:9} _temporary_;
  /* used to copy the array over to reduce variables */
  array ps(1:10) _temporary_;
  /* used to store the var with 5 max values removed */
  array s{1:5} var1-var5;
  set sample;
  p{mod(_n_,10)} = open;
  if _n_ ge 10 then std = std(of p{*});
  *remove max values to calculate variance;
  if _n_ ge 10 then do;
    *copy array over to remove values;
    do i=1 to 10;
      ps(i) = p(i-1);
    end;
    do i=1 to 5;
      index = whichn(max(of ps(*)), of ps(*));
      ps(index) = .;
      s(i) = var(of ps(*));
    end;
  end;
run;

Related

How to subtract second row from first, fourth row from third and so forth

I have a SAS dataset as in the attached picture. What I'm trying to accomplish is to create a new calculated field from the Total column where I subtract the second row from the first, the fourth row from the third, and so on.
What I have tried so far is:
DATA WANT2;
SET WANT;
BY APPT_TYPE;
IF FIRST.APPT_TYPE THEN SUPPLY-OPEN; ELSE 'ERROR';
RUN;
This throws an error because the statement is not valid.
I'm not really sure how to go about this.
Here you go. This is the best I can do with the limited information you provided. Next time, please provide sample data and your expected output.
data have;
  input APPT_TYPE $ _NAME_ $ Quantity;
  datalines;
ASON Supply 10
ASON Open 8
ASSN Supply 9
ASSN Open 7
S30 Supply 11
S30 Open 8
;

proc sort data=have;
  by APPT_TYPE descending _NAME_;
run;
data want;
  set have;
  by APPT_TYPE descending _NAME_;
  lag_N_Order = lag1(Quantity);
  N_Order = Quantity;
  Difference = lag_N_Order - N_Order;
  keep APPT_TYPE _NAME_ N_Order lag_N_Order Difference;
  if last.APPT_TYPE & last._NAME_ & Difference > 0;
run;

How to use prxmatch (or an alternative) for the below strings in SAS

I have multiple rows of strings
Eg 1.
Our commission is 25% for next order
Eg2.
20% is applied for previous order
I want to pull these out and create a new column containing 25%, 20%, and so on from the above strings.
How can I do that in SAS? The new column should hold the percentage, e.g. 25%.
ColA                                   ColB
Our commission is 25% for next order   25%
20% is applied for previous order      20%
...and so on
A single percentage value (or none) can be retrieved from a string using a pattern with a grouping expression (<something>), prxmatch, and prxposn to retrieve the characters matching the grouped expression.
Example:
The percentage is presumed to be a whole number followed immediately by a percent sign.
Store the percentage as a fraction (presumed to be in 0 to 1 range) whose value is formatted for display as a percentage.
data have;
  input;
  line = _infile_;
  datalines;
Eg 1.
Our commission is 25% for next order
Eg2.
20% is applied for previous order
;
run;
data want;
  set have;
  /* pattern for finding and capturing a whole number that is followed by a percent sign */
  rx = prxparse('/(\d+)%/');
  if prxmatch(rx,line) then do;
    matched_digits = prxposn(rx, 1, line);
    fraction = input(matched_digits, 12.) / 100;
  end;
  format fraction percent5.;
run;
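If the new column should hold the literal text (e.g. 25%) rather than a numeric fraction, the percent sign can simply be included in the capture group. A minimal sketch (ColB is the column name from the question; the dataset name and the $8 length are my own choices):
data want_text;
  set have;
  length ColB $8;
  /* capture the digits together with the percent sign */
  rx = prxparse('/(\d+%)/');
  if prxmatch(rx, line) then ColB = prxposn(rx, 1, line);
  drop rx;
run;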

SAS: most efficient method to output first non-missing across multiple columns

The data I have are millions of rows and rather sparse, with anywhere between 3 and 10 variables that need to be processed. My end result needs to be one single row containing the first non-missing value for each column. Take the following test data:
** test data **;
data test;
  length ID $5 AID 8 TYPE $5;
  input ID $ AID TYPE $;
  datalines;
A . .
. 123 .
C . XYZ
;
run;
The end result should look like this:
ID AID TYPE
A 123 XYZ
Using macro lists and loops, I can brute-force this result with multiple merge statements where the variable is non-missing and obs=1, but this is not efficient when the data are very large (below I'd loop over these variables rather than write multiple merge statements):
** works but takes too long on big data **;
data one_row;
  merge
    test(keep=ID where=(ID ne "") obs=1)      /* character */
    test(keep=AID where=(AID ne .) obs=1)     /* numeric */
    test(keep=TYPE where=(TYPE ne "") obs=1); /* character */
run;
The coalesce function seems very promising, but I believe I need it in combination with an array and output to build this single-row result. The function also differs by variable type (coalesce for numeric, coalescec for character), whereas that does not matter in proc sql. I get an error using an array since the variables in the array list are not all the same type.
Exactly what is most efficient will largely depend on the characteristics of your data. In particular, whether the first nonmissing value for the last variable is usually relatively "early" in the dataset, or if you usually will have to trawl through the entire dataset to get to it.
I assume your dataset is not indexed (as that would simplify things greatly).
One option is the standard data step. This isn't necessarily fast, but it's probably not too much slower than most other options, given you're going to have to read most or all of the rows no matter what you do. It has the nice advantage that it can stop as soon as every variable has been filled in.
data want;
  if 0 then set test; *defines characteristics;
  set test(rename=(id=_id aid=_aid type=_type)) end=eof;
  id=coalescec(id,_id);
  aid=coalesce(aid,_aid);
  type=coalescec(type,_type);
  if cmiss(of id aid type)=0 then do;
    output;
    stop;
  end;
  else if eof then output;
  drop _:;
run;
You could populate all of that from macro variables built from dictionary.columns, or might even use temporary arrays, though I think that gets too messy.
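For what it's worth, a rough sketch of generating the rename list and the coalesce assignments from dictionary.columns rather than hard-coding them (the macro variable names and the WORK.TEST reference are assumptions for this example):
proc sql noprint;
  select cats(name,'=_',name),
         case when type='char'
              then cats(name,'=coalescec(',name,',_',name,');')
              else cats(name,'=coalesce(',name,',_',name,');')
         end,
         name
    into :renames separated by ' ',
         :assigns separated by ' ',
         :varlist separated by ' '
  from dictionary.columns
  where libname='WORK' and memname='TEST';
quit;

data want;
  if 0 then set test;
  set test(rename=(&renames)) end=eof;
  &assigns
  if cmiss(of &varlist)=0 then do;
    output;
    stop;
  end;
  else if eof then output;
  drop _:;
run;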
Another option is the self-update, except it needs two changes. One, you need something to join on (as opposed to merge, which can run without a by variable). Two, it will give you the last nonmissing value, not the first, so you'd have to reverse-sort the dataset.
But assuming you added x to the first dataset, with any value (doesn't matter, but constant for every row), it is this simple:
data want;
  update test(obs=0) test;
  by x;
run;
So that has the huge advantage of simplicity of code, exchanged for some cost of time (reverse sorting and adding a new variable).
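A sketch of those two prep steps (the constant by variable and the reverse sort) against the small TEST example; the helper variable obsno is my own addition:
data test_prep;
  set test;
  x = 1;        /* constant variable to UPDATE by */
  obsno = _n_;  /* remember the original order so it can be reversed */
run;

proc sort data=test_prep;
  by x descending obsno;  /* reversed, so UPDATE's "last nonmissing" is the original first nonmissing */
run;

data want(drop=x obsno);
  update test_prep(obs=0) test_prep;
  by x;
run;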
If your dataset is very sparse, a transpose might be a good compromise. Doesn't require knowing the variable names as you can process them with arrays.
data test_t;
  set test;
  array numvars _numeric_;
  array charvars _character_;
  do _i = 1 to dim(numvars);
    if not missing(numvars[_i]) then do;
      varname = vname(numvars[_i]);
      numvalue = numvars[_i];
      output;
    end;
  end;
  do _i = 1 to dim(charvars);
    if not missing(charvars[_i]) then do;
      varname = vname(charvars[_i]);
      charvalue = charvars[_i];
      output;
    end;
  end;
  keep numvalue charvalue varname;
run;
proc sort data=test_t;
  by varname;
run;
data want;
  set test_t;
  by varname;
  if first.varname;
run;
Then you proc transpose this to get the desired want (or maybe this works for you as is). It does lose the formats etc. on the values, so take that into account, and your character value length probably needs to be set to something appropriately long and then set back afterwards (you can use an if 0 then set to fix it).
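A sketch of that final transpose step (the dataset names are placeholders, and as noted everything comes back as character here):
data want_long;
  set want;
  length value $1024;  /* pick a length long enough for your character values */
  value = coalescec(charvalue, cats(numvalue));
run;

proc transpose data=want_long out=want_wide(drop=_name_);
  id varname;
  var value;
run;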
A similar hash approach would work roughly the same way; it has the advantage that it would stop much sooner, and doesn't require resorting.
data test_h;
  set test end=eof;
  array numvars _numeric_;
  array charvars _character_;
  length varname $32 numvalue 8 charvalue $1024; *or longest charvalue length;
  if _n_=1 then do;
    declare hash h(ordered:'a');
    h.defineKey('varname');
    h.defineData('varname','numvalue','charvalue');
    h.defineDone();
  end;
  do _i = 1 to dim(numvars);
    if not missing(numvars[_i]) then do;
      varname = vname(numvars[_i]);
      rc = h.find();
      if rc ne 0 then do;
        numvalue = numvars[_i];
        rc = h.add();
      end;
    end;
  end;
  do _i = 1 to dim(charvars);
    if not missing(charvars[_i]) then do;
      varname = vname(charvars[_i]);
      rc = h.find();
      if rc ne 0 then do;
        charvalue = charvars[_i];
        rc = h.add();
      end;
    end;
  end;
  if eof or h.num_items = dim(numvars) + dim(charvars) then do;
    rc = h.output(dataset:'want');
  end;
run;
There are lots of other solutions, just depending on your data which would be most efficient.

Which statistics are calculated faster in SAS proc summary?

I need a theoretical answer.
Imagine that you have a table with 1.5 billion rows (the table is created as column-based with DB2 BLU).
You are using SAS and you will compute some statistics with Proc Summary, such as min/max/mean values, the standard deviation, and the 10th and 90th percentiles, across your peer groups.
For instance, in one case you have 30,000 peer groups and 50,000 values in each peer group (1.5 billion values in total).
In the other case you have 3 million peer groups and 50 values in each peer group.
Would it go faster with fewer peer groups but more values in each peer group, or with more peer groups but fewer values in each peer group?
I could test the first case (30,000 peer groups and 50,000 values per peer group) and it took around 16 minutes, but I can't test the second case.
Can you give an approximate prognosis for the run time in the case where I have 3 million peer groups and 50 values in each peer group?
One more dimension to the question: would it be faster to do these statistics with Proc SQL instead?
Example code is below:
proc summary data=table_blu missing chartype;
  class var1 var2; /* var1 and var2 together form the peer group */
  var values;
  output out=stattable(rename=(_type_=type) drop=_freq_)
    n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95
  ;
run;
So there are a number of things to think about here.
The first point, and quite possibly the largest in terms of performance, is getting the data from DB2 into SAS. (I'm assuming this is not an in-database instance of SAS; correct me if it is.) That's a big table, and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.
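To illustrate that option, here is a rough sketch of pushing the aggregation into DB2 with explicit SQL pass-through; the connection options and schema name are placeholders, and the percentiles are left out because how to compute them depends on your DB2 version:
proc sql;
  connect to db2 (database=mydb);  /* add your connection/authentication options here */
  create table stattable as
  select * from connection to db2 (
    select var1, var2,
           count(*)         as n,
           min("VALUES")    as min,
           max("VALUES")    as max,
           avg("VALUES")    as mean,
           stddev("VALUES") as std
    from myschema.table_blu   /* placeholder schema.table */
    group by var1, var2
  );  /* "VALUES" is quoted because VALUES is a reserved word in DB2 */
  disconnect from db2;
quit;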
So assuming you've downloaded the table to the SAS server:
A table sorted by the CLASS variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group; it can do block reads instead of random IO.
If the table is not sorted, then the larger the number of groups, the more table scans have to occur.
The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.
From there, you get into memory and CPU considerations. PROC SUMMARY is multithreaded and SAS will read N groups at a time. If the group size can fit into the memory allocated to that thread, you won't have an issue. If the group size is too large, then SAS will have to page.
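As a concrete illustration of the sorted path, presorting by the CLASS variables (once the table is on the SAS server) would look roughly like this; keep in mind that sorting 1.5 billion rows has its own cost, so this mainly pays off if the data can arrive already sorted, e.g. from the database:
proc sort data=table_blu;
  by var1 var2;
run;

/* the same summary as in the question, now running against sorted input */
proc summary data=table_blu missing chartype;
  class var1 var2;
  var values;
  output out=stattable(rename=(_type_=type) drop=_freq_)
    n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95;
run;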
I scaled down the problem to a 15M row example:
%let grps=3000;
%let pergrp=5000;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 20.88 seconds
cpu time 31.71 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.44 seconds
cpu time 11.26 seconds
=============================
%let grps=300000;
%let pergrp=50;
UNSORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 19.26 seconds
cpu time 41.35 seconds
SORTED:
NOTE: There were 15000000 observations read from the data set
WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 5.43 seconds
cpu time 10.09 seconds
I ran these a few times and the run times were similar. The sorted times are about equal between the two cases, and way faster.
The more-groups/fewer-per-group case was faster unsorted, but look at the total CPU usage: it is higher. My laptop has an extremely fast SSD, so IO was probably not the limiting factor; the disk was able to keep up with the multi-core CPU's demands. On a system with a slower disk, the total run times could be different.
In the end, it depends too much on how the data is structured and the specifics of your server and DB.
Not a theoretical answer but still relevant IMO...
To speed up your proc summary on large tables, add the / groupinternal option to your class statement. This assumes, of course, that you don't want the variables formatted prior to being grouped.
e.g:
class age / groupinternal;
This tells SAS that it doesn't need to apply a format to the value prior to calculating what class to group the value into. Every value will have a format applied to it even if you have not specified one explicitly. This doesn't make a large difference on small tables, but on large tables it can.
From this simple test, it reduces the time from 60 seconds on my machine to 40 seconds (YMMV):
data test;
  set sashelp.class;
  do i = 1 to 10000000;
    output;
  end;
run;
proc summary data=test noprint nway missing;
  class age / groupinternal;
  var height;
  output out=smry mean=;
run;

Output to a text file

I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output, but also have quite a lot of different ones. I have loaded these dataset-specific variables into a macro variable, separated by blanks, so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
  %DO N=1 %TO &TABLES_COUNT;
    DATA _NULL_;
      SET &&TABLE&N;
      FILE 'PATH/&&TABLE&N..txt';
      PUT a b c d "&vars";
    RUN;
  %END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is that the quoted string can only be 262 characters long, and some of the datasets I am trying to output have so many variables that this macro variable, a quoted string holding all those variable names, will be much longer than that. Is there any other way I can do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (older versions of SAS used 256 as the default). You can set it much higher if you want, for example 1,000,000. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
  retain a b c d ;
  SET &&TABLE&N;
  FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
  put (_all_) (+0) ;
RUN;
I would tackle this by having one put statement per variable. Use the # modifier so that you don't get a new line.
For example:
data test;
  a=1;
  b=2;
  c=3;
  output;
  output;
run;
data _null_;
  set test;
  put a #;
  put b #;
  put c #;
  put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a #;
803 put b #;
804 put c #;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data-driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. It has a limitation of 32 KiB total, though if you're actually going to go over that you can work around it very easily: you can separate out various bits into multiple calls, and even structure it so that if the call string reaches about 32,000 characters you issue a CALL EXECUTE with it and then continue (see the sketch after the example below).
This avoids having to manage a bunch of large macro variables (your &VARS will really be &&VARS&N, so there will be many large macro variables).
data test;
  length vars callstr $32767;
  do _n_ = 1 by 1 until (last.memname);
    set sashelp.vcolumn;
    where memname in ('CLASS','CARS');
    by libname memname;
    vars = catx(' ',vars,name);
  end;
  callstr = catx(' ',
    'data _null_;',
    'set',cats(libname,'.',memname),';',
    'file',cats('"c:\temp\',memname,'.txt"'),';',
    'put',vars,';',
    'run;');
  call execute(callstr);
run;
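And a sketch of the chunking workaround mentioned above, for the case where the generated statement would get too long for a single string: push the generated step through several smaller CALL EXECUTE calls instead of building one big callstr (the 31,000-character threshold is an arbitrary choice, and the output path is the same placeholder as above):
data _null_;
  length piece $32000;
  do _n_ = 1 by 1 until (last.memname);
    set sashelp.vcolumn;
    where memname in ('CLASS','CARS');
    by libname memname;
    if _n_ = 1 then do;
      /* open the generated step in pieces rather than as one long string */
      call execute(catx(' ', 'data _null_; set', cats(libname, '.', memname), ';'));
      call execute(cats('file "c:\temp\', memname, '.txt" lrecl=1000000; put'));
    end;
    /* flush the accumulated variable names whenever the buffer gets long */
    if length(catx(' ', piece, name)) > 31000 then do;
      call execute(strip(piece));
      piece = name;
    end;
    else piece = catx(' ', piece, name);
  end;
  call execute(strip(piece));
  call execute('; run;');
run;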