I'm trying to organize a dataset in a specific way with a list of variables that changes. The issue I'm having is that I don't always know the actual number of variables I'm going to have in my dataset. I've done this previously with either a PROC SQL statement or a RETAIN statement after the data statement where the list of variables was static.
My data looks like this:
APPNUM DATE REASON1 REASON2 REASON3 REASON4 NAME1 NAME2 NAME3 NAME4
123 1/1/2017 X Y Z A Jon Mary Tom Suzie
I want it to look like this:
APPNUM DATE REASON1 NAME1 REASON2 NAME2 etc
123 1/1/2017 X Jon Y Mary etc
This would be easy with sql or a retain statement. However, I am using loops, etc to pull these variables together, and the number of variables presented is dependent upon my input data. Some days there may be 20 instances of REASON/NAME and others there may be 1 of each.
I tried the below code to pull a list of variable names, then order the APPNUM, DATE, then finally order by the LAST digit of the variable name. I.E. 1,1,2,2,3,3 - but I was unsuccessful. The list was being stored properly - no errors, but when resolving the value of &VARLIST. they are not ordered as expected. Has anyone ever tried and accomplished this?
PROC SQL;
SELECT NAME INTO :VARLIST SEPARATED BY ','
FROM DICTIONARY.COLUMNS
WHERE LIBNAME = 'WORK'
AND MEMNAME = 'SFINAL'
ORDER BY NAME, SUBSTR(NAME,LENGTH(NAME)-1);
QUIT;
The above code would order something like this:
APPNUM, DATE, NAME1...2...3..., REASON1...2...3...
and not:
APPNUM, DATE, NAME1, REASON1, NAME2, REASON2....
Two problems.
First, the terms in your ORDER BY are backwards: sort by the numeric suffix first, then by NAME.
Second, your SUBSTR() call is not correct. SUBSTR(NAME,LENGTH(NAME)-1) returns the last two characters, but the number at the end has an arbitrary length, so you don't know how many characters it will be. Your best bet is to read that number string, convert it to a number, and then order by that.
data test;
array name[20];
array reason[20];
format appnum best. date date9.;
run;
proc sql noprint;
SELECT NAME INTO :VARLIST SEPARATED BY ','
FROM DICTIONARY.COLUMNS
WHERE LIBNAME = 'WORK'
AND MEMNAME = 'TEST'
and (upcase(NAME) like 'NAME%' or upcase(NAME) like 'REASON%')
/* keep only the digits (works for upper- and lowercase names), convert to a number, sort by it */
ORDER BY input(compress(name,,'kd'),best.), NAME ;
quit;
%put &varlist;
proc sql noprint;
create table test2 as
select APPNUM, DATE, &varlist
from test;
quit;
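A variant of the same idea, offered as a sketch rather than a definitive solution: build the list space-separated and let a RETAIN statement placed before SET fix the column order in a data step. It assumes the WORK.TEST table created above; the macro variable varlist2 and the output name test3 are made up for illustration, and NAME DESC is used so that REASONn comes before NAMEn, as in the layout the question asks for.

```sas
proc sql noprint;
  /* space-separated list, ordered by numeric suffix; REASONn before NAMEn */
  select name into :varlist2 separated by ' '
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'TEST'
    and (upcase(name) like 'NAME%' or upcase(name) like 'REASON%')
  order by input(compress(name, , 'kd'), 8.), name desc;
quit;

data test3;
  /* RETAIN before SET is used here only to establish variable order */
  retain appnum date &varlist2;
  set test;
run;
```

Because every retained variable is read from TEST on each iteration, the RETAIN does not carry values across rows in this case; it only determines the column order.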
Related
In the data below, one of the dates for subject 123 is missing. I would like proc sql to return that missing value as the minimum date for subject 123.
data visit;
input subject $1-3 dtc $4-24 ;
cards;
123 2014-01-15T00:00
123
123 2014-01-17T00:00:00
124 2014-01-15T00:00:00
124 2014-01-15T00:00:00
124 2014-01-17T00:00:00
;
run;
proc sql;
create table want. as
select distinct subject, min(dtc) as mindt format = date9.
from have
where subject ne ''
group by subject;
quit;
MIN() will discard missing values from the aggregate computation. Thus, you need to test separately if there are any missing values.
Example:
Use a CASE expression to calculate the MIN you want.
data have;
input subject $1-3 dtc $5-27 ;
cards;
123 2014-01-15T00:00
123 .
123 2014-01-17T00:00:00
124 2014-01-15T00:00:00
124 2014-01-15T00:00:00
124 2014-01-17T00:00:00
;
proc sql ;
create table want as
select
subject
, case when nmiss(dtc) then '' else min(dtc) end as mindtc
, input (calculated mindtc, ? yymmdd10.) as mindt format=date9.
from have
where subject ne ''
group by subject
;
quit;
Here is an alternative solution in SAS:
First, create an index or sort your data by subject and dtc.
proc sort data=have out=have_sorted;
by subject dtc;
run;
Then you can apply a data step with by grouping and use the first.[column] to get the minimum for each subject including missing values:
data minima;
set have_sorted;
by subject dtc;
if first.subject;
run;
I'm new to SAS and I'm having problems with proc sql. I have a vertical dataset with:
ID Code Time
001 1 0:00:00.00
001 1 0:10:00.00
001 2 0:20:00.00
... ... ...
001 9 23:50:00.00
And I'm interested in having a table that summarizes how many N code 1, N code 2 and so on are between 0:00:00.00 and 23:50:00.00 for each ID. So, the output looks something like this:
ID Code N
001 1 28
001 2 17
001 3 5
...
Right now, I have something like this:
proc sql;
select Code,ID
from have
where Time between 0:00:00.000 and 23:50:00.000;quit;
If someone has an easier way and it's not with proc sql that's alright too. Thank you very much!
To filter the data used in an analysis use a WHERE statement (or WHERE clause of an SQL statement). Make sure to use values that match the type of values in your variable.
where time between '00:00:00't and '23:50:00't
To count the number of observations you could use PROC SQL and the COUNT() aggregate function with the GROUP BY clause.
proc sql;
select Code,ID,count(*) as N
from have
where Time between '00:00:00't and '23:50:00't
group by code, id
;
quit;
Or just use regular SAS code to do the counting instead.
proc summary data=have nway;
where Time between '00:00:00't and '23:50:00't;
class code id;
output out=want(rename=(_freq_=N));
run;
If your TIME variable is actually character then trying to restrict the range will be hard if some of your strings have only one digit for the HOUR number. So convert it to a time value (number of seconds since midnight) to do the range testing.
where input(Time,time12.) between '00:00:00't and '23:50:00't;
I am trying to stack 3 columns into one, but I would like to keep a filter column so that I can tell which variable each value came from. I have tried COALESCE and UNION ALL, but I can't work out how to do it, given that I do not have an ID column.
Here the tables:
You can use a data step approach.
I'm not typing out your data so here's a fully worked example that's similar to yours but not exactly the same.
Use VNAME() to get the variable name.
Use an array to get the values.
DATA wide;
input famid faminc96 faminc97 faminc98 ;
CARDS;
1 40000 40500 41000
2 45000 45400 45800
3 75000 76000 77000
;
RUN;
DATA long1a;
SET wide;
*declare an array with the list of variables to transpose;
ARRAY afaminc(96:98) faminc96 - faminc98 ;
DO year = 96 to 98 ;
faminc = afaminc(year);
variable_name = vname(afaminc(year));
OUTPUT;
END;
DROP faminc96 - faminc98 ;
RUN;
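For comparison, a minimal sketch of the same reshaping with PROC TRANSPOSE, assuming the wide dataset above. The names wide_sorted and long1b are made up for illustration. PROC TRANSPOSE writes the source variable name to _NAME_ and the value to COL1, which are renamed here to match the data step version.

```sas
/* BY processing needs the data sorted by famid */
proc sort data=wide out=wide_sorted;
  by famid;
run;

/* one output row per famid and source variable */
proc transpose data=wide_sorted
               out=long1b(rename=(_name_=variable_name col1=faminc));
  by famid;
  var faminc96-faminc98;
run;
```

Unlike the array version, this does not create a year column; if you need one, it would have to be derived from variable_name in a later step.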
Wide to long using data step
https://stats.idre.ucla.edu/sas/modules/reshaping-data-wide-to-long-using-a-data-step/
Arrays:
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
I have a table with an account number and several attributes.
acct | attr1 | attr2 | attr3...
The issue is that there are duplicate account numbers in the list with different attributes. To make matters worse, when there are two account number entries, those entries may have entirely different attributes.
I have a sorting scheme to use to somewhat solve the issue, but after I sort the table, I only need the first occurrence of each account number. I am attempting to do this in sas using Proc SQL.
Any ideas?
I don't think this is straightforward in PROC SQL, because SQL has no inherent row order. It is easy with DATA step logic, however.
After the data is sorted, use first. (pronounced first-dot) logic to pick the first occurrence:
First sort the data, using your desired scheme.
proc sort data=have out=intermediate_table;
by acct <other variables>;
run;
Then just use first.acct:
data want;
set intermediate_table;
by acct <other variables>;
if first.acct then output;
run;
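If you prefer to stay with PROC SORT, a second pass with NODUPKEY should give the same result. This is a sketch under two assumptions: attr1 stands in for whatever sort variables your scheme uses, and PROC SORT is left at its default EQUALS behavior, which preserves the relative order of rows within each acct so that the first row from the first sort is the one kept.

```sas
/* first pass: apply the full sorting scheme */
proc sort data=have out=intermediate_table;
  by acct attr1;   /* attr1 is a placeholder for your own sort variables */
run;

/* second pass: keep only the first row for each acct */
proc sort data=intermediate_table out=want nodupkey;
  by acct;
run;
```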
PROC SORT is the easiest way to do this. You can also use the undocumented monotonic() function to do it in PROC SQL, as shown below:
data have;
input acct attr1 $ attr2 $ attr3 $;
datalines;
100 a b c
100 b d e
100 c e f
101 a b c
102 h i j
102 h k l
;
proc sql;
create table want(drop =rn) as
select * from
(select b.*,monotonic() as rn
from have b)
group by acct
having rn =min(rn);
quit;
Or capture _n_ in a data step (creating a view is a good option, as suggested by @Richard in the comments), followed by GROUP BY, as shown below.
data have_view/view=have_view;
set have;
rn=_n_;
run;
proc sql;
create table want as
select acct, attr1 , attr2 , attr3
from have_view b
group by acct
having rn =min(rn);
quit;
I am having trouble with the syntax when trying to reference a macro variable.
I have a subset of ID numbers and a dataset with a quantitative variable xxx associated by IDnum:
data IDnumlist;
input IDnum;
cards;
123
456
789
;
run;
data info;
input IDnum xxx;
cards;
123 2
123 5
456 3
789 1
789 4
555 9
;
run;
I want to summarize the data in the info dataset, but not for IDnum=555, since that is not in my subset. So my data set would look like this:
IDnum xxx_count xxx_sum
123 2 7
456 1 3
789 2 5
Here is my attempt so far:
proc sql noprint;
select count(*)
into :NObs
from IDnumlist;
select IDnum
into :IDnum1-:IDnum%left(&NObs)
from IDnumlist;
quit;
proc sql;
create table want as
select IDnum,
count(xxx) as xxx_count,
sum(xxx) as xxx_sum
from info
where IDnum in (&IDnum1-IDnum%left(&NObs))
group by 1;
run;
What am I doing wrong?
Why are you using macro variables for this? This is what a join is for, or a subquery, or who knows how many other better ways to do this.
proc sql;
create table want as
select info.idnum, count(xxx) as xxx_count, sum(xxx) as xxx_sum
from info inner join idnumlist
on info.idnum=idnumlist.idnum
group by info.idnum;
quit;
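The subquery form mentioned above could look like the following sketch, which assumes the same info and IDnumlist datasets from the question. It avoids the join and keeps the filter in the WHERE clause.

```sas
proc sql;
  create table want as
  select IDnum,
         count(xxx) as xxx_count,
         sum(xxx)   as xxx_sum
  from info
  /* keep only the IDs that appear in the subset table */
  where IDnum in (select IDnum from IDnumlist)
  group by IDnum;
quit;
```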
The specific problem in your code above is that you can't use 'macro variable lists' the way you can use data step variable lists: &IDnum1-IDnum%left(&NObs) simply resolves to the text 123-IDnum3. You could in theory list the macro variables individually, but it would be better to do the SELECT INTO differently.
proc sql noprint;
select IDnum
into :IDnumlist separated by ','
from IDnumlist;
quit;
Then all of the values are in &idnumlist. and can be used directly with the in operator:
where idnum in (&idnumlist.)
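Put together, the macro variable approach might look like this sketch. It assumes the IDnumlist and info datasets from the question, and that IDnum is numeric, so the values need no quoting inside the IN list.

```sas
proc sql noprint;
  /* collect all IDs into one comma-separated macro variable */
  select IDnum
    into :IDnumlist separated by ','
  from IDnumlist;

  /* &IDnumlist resolves to 123,456,789 for the sample data */
  create table want as
  select IDnum,
         count(xxx) as xxx_count,
         sum(xxx)   as xxx_sum
  from info
  where IDnum in (&IDnumlist)
  group by IDnum;
quit;
```

A single macro variable has a maximum length, so for a long ID list the join or subquery is the safer choice.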