How do I get the row number of an observation in PROC SQL, similar to what _N_ gives in a data step?
For example
proc sql outobs=5;
select case mod(<something>, 2)
when 0 then "EVEN"
else "ODD"
end
from maps.africa;
quit;
Want:
Row
----------
1 odd
2 even
3 odd
.
.
.
Monotonic() does exist and can be helpful in some cases, but it is not identical to a row number and can be dangerous to use, particularly because SQL is a heavily optimized language that will happily split your query into multiple threads - in which case monotonic() would fail to accomplish what you want. In particular, it can behave differently on different datasets, on different SAS installations, or even simply on different days.
The safe way to do this is to create a view with _n_ copied into a permanent variable.
data africa_v/view=africa_v;
set maps.africa;
rownum=_n_;
run;
proc sql;
select case mod(rownum, 2)
when 0 then "EVEN"
else "ODD"
end
from africa_v;
quit;
This adds nearly no overhead - a few milliseconds - and achieves the same result, but with the safety of knowing you have the right ordering. The two queries (this and shipt's) run in nearly identical times on my machine, well within the margin of error (2.95s vs 2.98s for all records).
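If you also want the displayed order guaranteed (PROC SQL makes no promise about output order without it), you can sort on the copied row number explicitly; a small variation on the query above:
proc sql outobs=5;
select case mod(rownum, 2)
when 0 then "EVEN"
else "ODD"
end
from africa_v
order by rownum;
quit;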
Use the monotonic() function. While I have read in the past that this is an undocumented function (it is true that it does not appear on the SAS website), there is at least one SAS 'proceedings' paper that makes heavy use of it.
For example:
proc sql outobs=5;
select case mod(monotonic(), 2)
when 0 then "EVEN"
else "ODD"
end
from maps.africa;
quit;
will achieve your aim.
Related
I have this code in SAS and I'm trying to write the SQL equivalent. I have no experience in SAS.
data Fulls Fulls_Dupes;
set Fulls;
by name coeff week;
if rid = 0 and ^last.week then output Fulls_Dupes;
else output Fulls;
run;
I tried the following, but it didn't produce the same output:
Select * from Fulls where rid = 0 group by name, coeff, week
Is my SQL query correct?
SQL does not have a concept of observation order, so there is no direct equivalent of the LAST. concept. If you have some variable that is monotonically increasing within the groups defined by distinct values of name, coeff, and week, then you can select the observation that has the maximum value of that variable to find the observation that is the LAST.
So, for example, if you also had a variable named DAY that uniquely identified and ordered the observations in the same way as they currently exist in the FULLS dataset, then you could use the test DAY=MAX(DAY) to find the last observation. In PROC SQL you can use that test directly because SAS automatically remerges the aggregate value back onto all of the detail observations. In other SQL implementations you might need an extra query to get the max, as sketched below.
create table new_FULLS as
select * from FULLS
group by name, coeff, week
having day=max(day) or rid ne 0
;
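For example, in a dialect without the automatic remerge, the extra query might look something like this (a sketch only; same table and column names as above, standard join syntax assumed):
create table new_FULLS as
select f.*
from FULLS f
join ( select name, coeff, week, max(day) as max_day
       from FULLS
       group by name, coeff, week ) m
  on  f.name  = m.name
 and f.coeff = m.coeff
 and f.week  = m.week
where f.day = m.max_day or f.rid <> 0
;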
SQL also does not have any concept of writing two datasets at once. But for this example, since the two generated datasets are distinct and together include all of the original observations, you can generate the second from the first using EXCEPT.
So once you have built the new FULLS, you can get FULLS_DUPES from the new FULLS and the old FULLS.
create table FULLS_DUPES as
select * from FULLS
except
select * from new_FULLS
;
Assume that we have two tables, named Tb1 and Tb2, and we are going to move data from one to the other. Tb1 is the main source of data and Tb2 is the destination. This replacement operation has 3 parts.
In the first part we validate all rows of Tb1 and check whether they are correct. For example, a national security code must have exactly 10 digits, or a real customer must have a valid birth date. According to these validation rules, 28 different validation methods and error codes have been defined. During validation, every spoiled row's description and status are updated to a new state.
Part 2 fixes the rows' problems and part 3 moves them to Tb2.
For instance, this row says that it has 4 different errors:
-- Tb1.desc=6,8,14,16
-- Tb1.sts=0
A correct row of data
-- Tb1.desc=Null
-- Tb1.sts=1
I have been working on the first part recently and have come up with a solution that works fine but is too slow. Unfortunately, it takes exactly 31 minutes to validate 100,000 rows. In the real situation we are going to validate more than 2 million records, so despite all its functionality it is totally useless.
Let's take a look at my package:
procedure Val_primary IS
begin
Open X_CUSTOMER;
Loop
fetch X_CUSTOMER bulk collect into CUSTOMER_RECORD;
EXIT WHEN X_CUSTOMER%notfound;
For i in CUSTOMER_RECORD.first..CUSTOMER_RECORD.last loop
Val_CTYP(CUSTOMER_RECORD(i).XCUSTYP);
Val_BRNCH(CUSTOMER_RECORD(i).XBRNCH);
--Rest of the validations ...
UptDate_Val(CUSTOMER_RECORD(i).Xrownum);
end loop;
CUSTOMER_RECORD.delete;
End loop;
Close X_CUSTOMER;
end Val_primary;
Inside a validation procedure :
procedure Val_CTYP(customer_type IN number)IS
Begin
IF(customer_type<1 or customer_type>3)then
RW_FINAL_STATUS:=0;
FINAL_ERR_DSC:=Concat(FINAL_ERR_DSC,ERR_INVALID_CTYP);
End If;
End Val_CTYP;
Inside the update procedure :
procedure UptDate_Val(rownumb IN number) IS
begin
update tb1 set tb1.xstst=RW_FINAL_STATUS,tb1.xdesc=FINAL_ERR_DSC where tb1.xrownum=rownumb;
RW_FINAL_STATUS:=1;
FINAL_ERR_DSC:=null;
end UptDate_Val;
Is there any way to reduce execution time ?
It must be done in less than 20 minutes for more than 2 million records.
Maybe each validation check could be a case expression within an inline view, and you could concatenate them etc. in the enclosing query, giving you a single SQL statement that could drive an update (see the sketch after the query below). Something along the lines of:
select xxx, yyy, zzz -- whatever columns you need from xc1customer
, errors -- concatenation of all error codes that apply
, case when errors is not null then 0 else 1 end as status
from ( select xxx, yyy, zzz
, trim(ltrim(val_ctyp||' ') || ltrim(val_abc||' ') || ltrim(val_xyz||' ') || etc...) as errors
from ( select c.xxx, c.yyy, c.zzz
, case when customer_type < 1 or customer_type > 3 then err_invalid_ctyp end as val_ctyp
, case ... end as val_abc
, case ... end as val_xyz
from xc1customer c
)
);
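One sketch of the update such a query could drive, using MERGE (assumptions: xrownum uniquely identifies rows of xc1customer, the status/description columns are xstst/xdesc as in the question's update, and '6' stands in for a real error code):
merge into xc1customer t
using (
  select xrownum
       , trim(ltrim(val_ctyp || ' ') /* || ltrim(val_xyz || ' ') || ... */) as errors
  from ( select xrownum
              , case when xcustyp < 1 or xcustyp > 3 then '6' end as val_ctyp
                -- one case expression per validation rule
         from xc1customer )
) v
on (t.xrownum = v.xrownum)
when matched then update
  set t.xdesc = v.errors
    , t.xstst = case when v.errors is not null then 0 else 1 end;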
Sticking with the procedural approach, the slow part seems to be the single-row update. There is no advantage to bulk-collecting all 2 million rows into session memory only to apply 2 million individual updates. The quick fix would be to add a limit clause to the bulk collect (and move the exit to the bottom of the loop, where it should be), have your validation procedures set a value in the array instead of updating the table, and batch the updates into one forall per loop iteration, as sketched below.
You can be a bit freer with passing records and arrays in and out of procedures rather than having everything a global variable, as passing by reference means there is no performance overhead.
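A minimal sketch of that shape (assuming the cursor and customer_record collection from the question; dbms_sql's index-by table types are borrowed just to keep the declarations short):
procedure val_primary is
  l_rownum dbms_sql.number_table;    -- parallel scalar collections: older Oracle versions
  l_status dbms_sql.number_table;    -- cannot reference record fields inside FORALL
  l_desc   dbms_sql.varchar2_table;
begin
  open x_customer;
  loop
    fetch x_customer bulk collect into customer_record limit 1000;
    for i in 1 .. customer_record.count loop
      val_ctyp(customer_record(i).xcustyp);   -- validations still set the status globals
      val_brnch(customer_record(i).xbrnch);
      -- rest of the validations ...
      l_rownum(i) := customer_record(i).xrownum;
      l_status(i) := rw_final_status;
      l_desc(i)   := final_err_dsc;
      rw_final_status := 1;                   -- reset for the next row
      final_err_dsc   := null;
    end loop;
    -- one batched update per chunk instead of one update per row
    forall i in 1 .. l_rownum.count
      update tb1
         set xstst = l_status(i),
             xdesc = l_desc(i)
       where xrownum = l_rownum(i);
    l_rownum.delete; l_status.delete; l_desc.delete;
    exit when x_customer%notfound;            -- at the bottom, after the short last fetch
  end loop;
  close x_customer;
end val_primary;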
There are two potential lines of attack.
Specific implementation. Collections are read into session memory, which is usually quite small compared to the global memory allocation. Reading 100,000 longish rows into session memory is a bad idea and can cause performance issues, so breaking the process into smaller chunks (say 1,000 rows) will most likely improve throughput.
General implementation. What is the point of the three-part process? Updating Tb1 with error flags is an expensive activity. A more efficient approach would be to apply the fixes to the data in the collection and insert that into Tb2. You can write a log record if you need to track what changes are made.
Applying these suggestions, you'd end up with a single procedure which looks a bit like this:
procedure one_and_only is
begin
open x_customer;
<< tab_loop >>
loop
fetch x_customer bulk collect into customer_record
limit 1000;
exit when customer_record.count() = 0;
<< rec_loop >>
for i in customer_record.first..customer_record.last loop
val_and_fix_ctyp(customer_record(i).xcustyp);
val_and_fix_brnch(customer_record(i).xbrnch);
--rest of the validations ...
end loop rec_loop;
-- apply the cleaned data to target table
forall j in 1..customer_record.count()
insert into table_2
values customer_record(j);
end loop tab_loop;
close x_customer;
end one_and_only;
Note that this approach requires the customer_record collection to match the projection of the target table. Also, don't use %notfound to test for the end of the cursor unless you can guarantee the total number of rows read is an exact multiple of the LIMIT number.
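For reference, a minimal sketch of the declarations the procedure above assumes (names invented to match the snippet):
cursor x_customer is
  select * from tb1;   -- the projection must match table_2 for "insert ... values customer_record(j)"
type customer_tab is table of x_customer%rowtype;
customer_record customer_tab;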
I am trying to help a colleague with a SAS script she is working with. I am a programmer, so I understand the logic; however, I don't know the syntax in SAS. Basically this is what she is trying to do.
We have:
Array of Procedure Dates (proc_date[i])
Array of Procedures (proc[i]).
Each record in our data can have up to 20 Procedures and 20 dates.
i=20
Each procedure has an associated code; let's just say there are 100 different codes, where codes 1 to 10 are ProcedureA, 11 to 20 are ProcedureB, etc.
We need to loop through each Procedure and assign it the correct ProcedureCategory if it falls into one of the 100 codes (i.e. it enters one of the if statements). When this is true, we then need to loop through the corresponding Procedure Dates in that row; if they are different dates, we add the 'Weighted Values' together, else we just take the greater of the 2 values.
I hope that helps and I hope this makes sense. I could write this in another language (i.e. C/C++/C#/VB); however, I'm at a loss with SAS, as I'm just not that familiar with the syntax and the logic doesn't seem to be that of other OO languages.
Thanks in advance for any assistance.
Kind Regards.
You don't want to do 100 if statements, one way or the other.
The answer to the core of your question is that you need the do loop outside of the if statement.
data want;
set have;
array proc[20];
array proc_date[20];
do _i = 1 to dim(proc); *this would be 20;
if proc[_i] = 53 then ... ;
else if proc[_i] = 54 then ...;
end;
run;
Now, what you're trying to do with proc_date sounds like you need to do something with proc_date[_i] in that same loop. The loop is just an iterator - the only thing changing is _i, which is used as an array index. It's welcome to be the array index for either array (or to do any other thing). This is where this differs from common OOP practice, since this isn't an array class; you're not using the individual object to iterate it. This is a procedural style (C would do it the same way, even).
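For example, a sketch of using both arrays under the one index (the category ranges and the date logic here are placeholders, since the real rules aren't shown in the question):
data want;
set have;
array proc[20];
array proc_date[20];
length proc_cat $1;
do _i = 1 to dim(proc);
if 1 <= proc[_i] <= 10 then proc_cat = 'A';
else if 11 <= proc[_i] <= 20 then proc_cat = 'B';
/* ...more ranges... */
if not missing(proc[_i]) then
cat_date = proc_date[_i];  /* the same _i picks the matching date */
end;
run;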
However, the if/else bits there would be unwieldy and long. In SAS you have a lot of ways of dealing with that. You might have another array of 100 values proc can take, and then inside that do loop have another do loop iterating over that array (do _j = 1 to 100;) - or the other way around (iterate through the 100, inside that iterate through the 20) if that makes more sense (if you want to at one time have all of the values).
data want;
set have;
array proc[20];
array proc_date[20];
array proc_val[100]; *needs to be populated in the `have` dataset, or in another;
do _i = 1 to dim(proc_val);
do _j = 1 to dim(proc);
if proc[_j] = proc_val[_i] then ...; *this statement executes 100*20 times;
end;
end;
run;
You also could have a user-defined format, which is really just a one-to-one mapping of values (start value -> label value). Map your 100 values to the 10 procedures they correspond to, or whatever. Then all 100 if statements become
proc_value[_i] = put(proc[_i],PROCFMT.);
and then proc_value[_i] stores the procedure (or whatever), which you can then hopefully evaluate more simply.
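A sketch of what such a format might look like (the ranges here are invented; PROCFMT is the name used in the statement above):
proc format;
value procfmt
1-10  = 'ProcedureA'
11-20 = 'ProcedureB'
/* ...remaining ranges... */
other = 'Unknown';
run;
Note that proc_value would need to be a character array, since PUT returns a character value.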
You also may want to look into hash tables; both for a similar concept as the format above, but also for doing the storage. Hash tables are a common idea in programming that perhaps you've already come across, and the way SAS implements them is actually OOP-like. If you're trying to do some sort of summarization based on the procedure values, you could easily do that in a hash table and probably much more efficiently than you could in IF statements.
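For instance, a sketch of counting how often each procedure code occurs with a hash object (dataset and variable names invented for illustration):
data _null_;
set have end=_eof;
array proc[20];
length proc_code total 8;
if _n_ = 1 then do;
declare hash h(ordered:'yes');
h.defineKey('proc_code');
h.defineData('proc_code','total');
h.defineDone();
end;
do _i = 1 to dim(proc);
proc_code = proc[_i];
if not missing(proc_code) then do;
if h.find() ne 0 then total = 0;  /* new key: start the tally at zero */
total = total + 1;
rc = h.replace();
end;
end;
if _eof then rc = h.output(dataset:'proc_counts');  /* one row per code with its count */
run;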
Here are some of the statements mentioned.
*codes 1 to 10 is ProcedureA, 11 to 20 is ProcedureB;
proc format;
value codes 1-10 = 'A'
            11-20 = 'B';
run;
Procedure(i) = put(code,codes.);
another way to recode ranges is with the between syntax
if 1 <= value <= 10 then variable = <new-value>;
hth
Basically, is the code below efficient (if I cannot use # variables in MonetDB), or will this call the subqueries more than once each?
CREATE VIEW sys.share26cuts_2007 (peorglopnr,share26cuts_2007) AS (
SELECT peorglopnr, CASE WHEN share26_2007 < (SELECT QUANTILE(share26_2007,0.25) FROM sys.share26_2007) THEN 1
WHEN share26_2007 < (SELECT QUANTILE(share26_2007,0.5) FROM sys.share26_2007) THEN 2
WHEN share26_2007 < (SELECT QUANTILE(share26_2007,0.75) FROM sys.share26_2007) THEN 3
ELSE 4 END AS share26cuts_2007
FROM sys.share26_2007
);
I would rather not use a user-defined function either, though this came up in other questions.
As e.g. GoatCO commented on the question, this is probably better avoided. The SET command that MonetDB supports can be used with SELECT as in the code below. The remaining question is why all the quantiles are zero when my data surely is not (I also got division-by-zero errors before using NULLIF). I show more of the code now.
CREATE VIEW sys.over26_2007 (personlopnr,peorglopnr,loneink,below26_loneink) AS (
SELECT personlopnr,peorglopnr,loneink, CASE WHEN fodelsear < 1981 THEN 0 ELSE loneink END AS below26_loneink
FROM sys.ds_chocker_lev_lisaindivid_2007
);
SELECT COUNT(*) FROM over26_2007;
CREATE VIEW sys.share26_2007 (peorglopnr,share26_2007) AS (
SELECT peorglopnr, SUM(below26_loneink)/NULLIF(SUM(loneink),0)
FROM sys.over26_2007
GROUP BY peorglopnr
);
SELECT COUNT(*) FROM share26_2007;
DECLARE firstq double;
SET firstq = (SELECT QUANTILE(share26_2007,0.25) FROM sys.share26_2007);
SELECT firstq;
DECLARE secondq double;
SET secondq = (SELECT QUANTILE(share26_2007,0.5) FROM sys.share26_2007);
SELECT secondq;
DECLARE thirdq double;
SET thirdq = (SELECT QUANTILE(share26_2007,0.75) FROM sys.share26_2007);
SELECT thirdq;
CREATE VIEW sys.share26cuts_2007 (peorglopnr,share26cuts_2007) AS (
SELECT peorglopnr, CASE WHEN share26_2007 < firstq THEN 1
WHEN share26_2007 < secondq THEN 2
WHEN share26_2007 < thirdq THEN 3
ELSE 4 END AS share26cuts_2007
FROM sys.share26_2007
);
SELECT COUNT(*) FROM share26cuts_2007;
About inspecting plans, MonetDB supports:
PLAN to see the logical plan
EXPLAIN to see the physical plan in terms of MAL instructions
TRACE same as EXPLAIN, but it actually executes the MAL plan and returns statistics for all instructions
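Each of these is used by simply prefixing the query, for example:
PLAN SELECT COUNT(*) FROM sys.share26_2007;
EXPLAIN SELECT COUNT(*) FROM sys.share26_2007;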
About your question on repeating the subqueries: in principle nothing will be repeated, and you will not need to take care of it explicitly.
That's because the default optimization pipeline includes the commonTerms optimizer. Your SQL will be translated to a sequence of MAL instructions, with duplicate calls. MAL is designed to be simple: many short instruction calls, a bit like assembly, which operate on columns, not rows (hence don't apply the same reasoning you would use for SQL Server when you think of execution efficiency). This makes it easier to run some optimizations on it. The commonTerms optimizer will detect the duplicate calls and reuse all results that it can. This is done per-column. So you should really be able to run your query and be happy.
However, I said in principle. Not all cases will be detected (though most will), plus some limitations have been introduced on purpose. For example, the search-space for detecting duplicates is a window (too small for my taste - I have removed it altogether in my installations) over the whole MAL plan: if the duplicate instruction is too far down the plan it won't be detected. This was done for efficiency. In your case, that single query isn't that big, but if it is part of a longer chain of views, then all these views will compile into a large MAL plan - which might make commonTerms less effective - it really depends on the actual queries.
I have 2 SAS data sets, Lab and Rslt. They both have all of the same variables, but Rslt is supposed to have what is essentially a subset of Lab. For what I'm trying to do, there are 4 important variables: visit, accsnnum, battrnam, and lbtestcd. All are character variables. I want to compare the two files Lab and Rslt to find out where they vary -- specifically, I need to know the count of lbtestcd per unique accsnnum.
But I must control for a few factors. First, I only need to compare observations that have "Lipid Panel" or "Chemistry (6)" in the battrnam variable. The Rslt file only contains these observations, so we don't need to worry about that one. So I subsetted Lab using this code:
data work.lab;
set livingston.ndb_lab_1;
where battrnam contains "Lipid Panel" or battrnam = "Chemistry (6)";
run;
This worked fine. Now, I need to control for the variable visit. I need to get rid of all observations in both Lab and Rslt that have visits that contain "Day 1" or "Screening". I accomplished this using the following code:
data work.lab;
set work.lab;
if visit = "Day 1" or visit = "Screening" then delete;
run;
data work.rslt;
set work.rslt;
if visit = "Day 1" or visit = "Screening" then delete;
run;
Now this is where I get stuck. I need a way to compare the count of lbtestcd by accsnnum between the two separate files Lab and Rslt, and a way to flag the accsnnum where the count of lbtestcd differs between Lab and Rslt. For example, if Lab has an accsnnum A1 with 5 unique lbtestcd values, and Rslt has the same accsnnum A1 with 7 unique lbtestcd values, I need that one brought to my attention.
I can do a proc freq for each file, but these are large data sets and I don't want to have to compare by hand. Perhaps exporting the count of lbtestcd by accsnnum to a variable in a new 3rd dataset for each of the 2 files Lab and Rslt, then creating a variable that is the difference of these two? So that if the difference != 0 then I can get a report of those accsnnum? Advice in SQL will work too, as I can run that through SAS.
Edit
I've used some SQL to get the count of lbtestcd by accsnnum for each data set using the code below, though I still need to figure out how to get these counts into a data set so I can compare them.
proc sql;
select accsnnum, count(lbtestcd)
from work.lab1
group by accsnnum;
quit;
proc sql;
select accsnnum, count(lbtestcd)
from work.rslt1
group by accsnnum;
quit;
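One way to finish that off in PROC SQL might be a full join of the two count queries (a sketch; COUNT(DISTINCT lbtestcd) is used to match the 'unique lbtestcd values' wording above, and the names n_lab, n_rslt, and count_diffs are invented):
proc sql;
create table work.count_diffs as
select coalesce(l.accsnnum, r.accsnnum) as accsnnum,
       l.n_lab, r.n_rslt,
       coalesce(l.n_lab, 0) - coalesce(r.n_rslt, 0) as difference
from (select accsnnum, count(distinct lbtestcd) as n_lab
      from work.lab1
      group by accsnnum) as l
     full join
     (select accsnnum, count(distinct lbtestcd) as n_rslt
      from work.rslt1
      group by accsnnum) as r
     on l.accsnnum = r.accsnnum
where calculated difference ne 0;
quit;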
Thanks for any and all help you can give. This one is really stumping me!
I would do a PROC FREQ on each dataset (or proc whatever-you-like-that-does-counts) and then use PROC COMPARE. For example:
proc freq data=rslt1;
tables accsnnum*lbtestcd/out=rsltcounts;
run;
proc freq data=lab1;
tables accsnnum*lbtestcd/out=labcounts;
run;
proc compare base=labcounts compare=rsltcounts out=compares /* options */;
id accsnnum lbtestcd;
run;
PROC COMPARE has a lot of options; in this case the most helpful would probably be:
outnoequal - only outputs rows that are not identical in the two datasets
outbase and outcomp - output a row for each of the BASE and COMPARE datasets (with OUTNOEQUAL, only when they differ)
outdif - outputs 'difference' rows, i.e. one minus the other; this may or may not be helpful for you
The documentation lists all of the options. You may also need to look at the METHOD options if your data might have numeric precision issues.