Turning textual answers into dichotomous variables - sql

I've done research using Google Forms and now I need to prepare that data for further analysis. The problem is that I don't really know how to go about it.
I have variables (questionnaire questions), and each question has four possible answers. In my data those answers are just strings, so let's say:
Variable 1 (Here is a question)
Value = Answer (C. The answer)
Now I need to split each of those variables into four separate ones, so that the data looks like this:
Variable 1_1 where Value = 0
Variable 1_2 where Value = 0
Variable 1_3 where Value = 1 -> because, as you saw above, answer C was chosen.
Variable 1_4 where Value = 0
So that is the recoding part: the value is no longer a string but 0 or 1.
I hope this makes sense. Thank you in advance.

If you want to do this one variable at a time, you can use:
IF (variable1="a") variable1_1=1 .
IF (variable1="b") variable1_2=1 .
IF (variable1="c") variable1_3=1 .
IF (variable1="d") variable1_4=1 .
RECODE variable1_1 TO variable1_4 (SYSMIS=0) .
EXE .
If all of your variables have the same response structure and you want to loop through all of them at once, you can use VECTOR to do that.
VECTOR variable = variable1 TO variable100 /* existing variables */ .
VECTOR response1_var (100,F1) /* create new vars, response1_var1 TO response1_var100 */ .
VECTOR response2_var (100,F1) /* create new vars, response2_var1 TO response2_var100 */ .
VECTOR response3_var (100,F1) .
VECTOR response4_var (100,F1) .
LOOP #i = 1 TO 100 .
IF (variable(#i)="a") response1_var(#i)=1 .
IF (variable(#i)="b") response2_var(#i)=1 .
IF (variable(#i)="c") response3_var(#i)=1 .
IF (variable(#i)="d") response4_var(#i)=1 .
END LOOP .
RECODE response1_var1 TO response4_var100 (SYSMIS=0) .
EXE .
Keep in mind that looping this way orders your new variables by "response series" as opposed to questionnaire order. If you want to re-order or rename the new variables, that can be done separately.

There are many ways to do this; here is one.
First, make some fake data to demonstrate on:
data list free/var1 to var4 (4a1).
begin data
"a" "b" "a" "c" "c" "b" "a" "c" "d" "d" "a" "b"
end data.
Now a separate recode command for each possible answer - each command taking care of all the relevant variables which have those possible answers:
recode var1 to var4 ("a"=1)(else=0) into varA1 to varA4.
recode var1 to var4 ("b"=1)(else=0) into varB1 to varB4.
recode var1 to var4 ("c"=1)(else=0) into varC1 to varC4.
recode var1 to var4 ("d"=1)(else=0) into varD1 to varD4.
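To check the expected result outside SPSS, the recoding that both answers perform can be sketched in Python. This is only an illustration, not part of either answer: the function name `to_dummies`, the sample data, and the answer codes "a" to "d" are assumptions.

```python
def to_dummies(answer, options=("a", "b", "c", "d")):
    """Return one 0/1 indicator per option; 1 marks the chosen answer."""
    return [1 if answer == option else 0 for option in options]


# One respondent's answers to three questions (hypothetical data).
answers = {"var1": "c", "var2": "a", "var3": "d"}

# Build var1_1 ... var1_4 style names, kept in questionnaire order.
recoded = {
    f"{name}_{position}": flag
    for name, answer in answers.items()
    for position, flag in enumerate(to_dummies(answer), start=1)
}
print(recoded)
```

Each question yields exactly one 1 and three 0s, which is a quick sanity check to run on the real recoded data as well.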


Output to a text file

I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output but also have quite a lot of different ones. I have loaded these different ones into a macro variable separated by blanks so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
%DO N=1 %TO &TABLES_COUNT;
DATA _NULL_;
SET &&TABLE&N;
FILE 'PATH/&&TABLE&N..txt';
PUT a b c d "&vars";
RUN;
%END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is that the quoted string can only be 262 characters long. Some of the datasets I am trying to output have so many variables that this macro variable, which is a quoted string holding all of them, is much longer than that. Is there another way to do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the length of the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (but older versions of SAS use 256 as the default line length). You can actually set that much higher if you want. So you could use 1,000,000 for example. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
retain a b c d ;
SET &&TABLE&N;
FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
put (_all_) (+0) ;
RUN;
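The idea behind the `retain` trick (common variables first, everything else in its existing order, one unquoted list written per row) can be sketched outside SAS as follows. This is an illustrative Python sketch, not SAS behaviour: the function names, the sample row, and the file name `table1.txt` are all hypothetical.

```python
import os
import tempfile


def ordered_columns(all_columns, common=("a", "b", "c", "d")):
    """Put the common columns first, then the rest in their original order."""
    rest = [name for name in all_columns if name not in common]
    return [name for name in common if name in all_columns] + rest


def write_rows(path, rows, columns):
    """Write one space-separated line per row, however long the line gets."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(" ".join(str(row[name]) for name in columns) + "\n")


# Hypothetical dataset with two table-specific columns mixed in.
rows = [{"special1": 9, "a": 1, "b": 2, "special2": 8, "c": 3, "d": 4}]
columns = ordered_columns(list(rows[0]))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "table1.txt")
    write_rows(path, rows, columns)
    with open(path, encoding="utf-8") as handle:
        written = handle.read()

print(written)
```

Because the column list is derived from the data itself, no per-table list of names has to be maintained.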
I would tackle this by having one put statement per variable. Use the trailing @ modifier so that you don't get a new line.
For example:
data test;
a=1;
b=2;
c=3;
output;
output;
run;
data _null_;
set test;
put a @;
put b @;
put c @;
put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a @;
803 put b @;
804 put c @;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. This has a limitation of 32kiB total, though if you're actually going to go over that you can work around it very easily (you can separate out various bits into multiple calls, and even structure the call so that if the callstr hits 32000 long you issue a call execute with it and then continue).
This avoids having to manage a bunch of large macro variables (your &VAR will really be &&VAR&N and will be many large macro variables).
data test;
length vars callstr $32767;
do _n_ = 1 by 1 until (last.memname);
set sashelp.vcolumn;
where memname in ('CLASS','CARS');
by libname memname;
vars = catx(' ',vars,name);
end;
callstr = catx(' ',
'data _null_;',
'set',cats(libname,'.',memname),';',
'file',cats('"c:\temp\',memname,'.txt"'),';',
'put',vars,';',
'run;');
call execute(callstr);
run;

Remove variables by character pattern in variable name (SAS)

I'd like to drop all variables with a certain character segment in the name. Example below:
var1 var2 var3 o_var1 o_var2 o_var3
1 1 1 3 2 5
7 3 4 . -1 5
I'd like to only keep those without the "o_" in front. I could sort positionally and keep the first x number of variables, but with 100s of variables with this pattern, I wanted to seek an alternative.
Just use the colon wildcard operator.
data want;
set have (drop=o_:); /* drops all variables beginning with o_ */
run;
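The effect of the `o_:` prefix wildcard can be pictured with a small Python sketch. It only mirrors the logic of dropping every name that starts with a given prefix; the function name `drop_prefixed` is hypothetical and the row is the first data row from the question.

```python
def drop_prefixed(row, prefix="o_"):
    """Keep only the entries whose name does not start with the prefix."""
    return {name: value for name, value in row.items() if not name.startswith(prefix)}


# First data row from the question.
have = {"var1": 1, "var2": 1, "var3": 1, "o_var1": 3, "o_var2": 2, "o_var3": 5}
want = drop_prefixed(have)
print(want)
```

Note that the match is on the start of the name only, so a variable such as `no_var1` would be kept.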

Using CONTAINS with variables sql

OK, so I am trying to check one variable against another in SQL.
X = a,b,c,d (X is a string variable with a list of things in it)
Y = b (Y is a string variable that may or may not have a value that appears in X)
I tried this:
Case when Y in (X) then 1 else 0 end as aa
But it doesn't work, since it looks for an exact match between X and Y.
also tried this:
where contains(X,@Y)
but I can't declare Y globally, since it is a variable that changes in each row of the table (X also changes).
A solution in SAS would also be useful.
Thanks
Maybe LIKE will help:
select
*
from
t
where
X like ('%'+Y+'%')
or
select
case when (X like ('%'+Y+'%')) then 1 else 0 end
from
t
In SAS I would use the INDEX function, either in a data step or in proc sql. It returns the position within the string at which it finds the search characters, or zero if there is no match, so testing whether the returned value is greater than zero gives a binary 1/0 result. Apply the COMPRESS function to the variable containing the search characters, because SAS pads character values with blanks and INDEX would otherwise look for those blanks as well.
Data step solution :
aa=index(x,compress(y))>0;
Proc Sql solution :
index(x,compress(y))>0 as aa
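The row-by-row logic of both the LIKE and the INDEX approach is a substring test on a blank-stripped search value. The Python sketch below illustrates that, and contrasts it with a stricter test that treats X as a comma-separated list. Both function names are hypothetical, and returning 0 for a blank Y is a choice made for this sketch, not documented SAS or SQL behaviour.

```python
def contains_flag(x, y):
    """1 if the blank-stripped y occurs anywhere inside x, else 0."""
    needle = y.strip()
    return int(bool(needle) and needle in x)


def in_list_flag(x, y, separator=","):
    """1 only if y equals one whole item of the separated list in x."""
    items = [item.strip() for item in x.split(separator)]
    return int(y.strip() in items)


print(contains_flag("a,b,c,d", "b   "))  # padded value still matches
print(contains_flag("ab,c", "a"))        # substring match inside "ab"
print(in_list_flag("ab,c", "a"))         # no whole item equals "a"
```

The difference matters when list items can contain one another: a plain substring test reports a match for "a" inside "ab", while the list test does not.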

Creating and modifying a global statement in SAS

I would like to do something very simple, but it doesn't work
This is a simple example but I intend to use it for some more complex stuff
the output I want is :
obs. dummy newcount
1 3 1
2 5 2
3 2 3
but the output I get is :
obs. dummy newcount
1 3 1
2 5 1
3 2 1
here is my code
data test;
input dummy;
cards;
3
5
2
;
run;
%let count=1;
data test2;
set test;
newcount = &count.;
%let count = &count. + 1;
run;
The variable count doesn't get incremented. How do I do this?
Thanks for your help !
You're mixing macro variables and datastep variables in a way you cannot. Macro variables used in the data step in most cases have to have their values already defined prior to the data step when used like this; what happens is the data step compiler immediately resolves &count to the number 1, and uses that number 1 in its compilation, not the macro variable's newer values.
Further, the %let is not a data step command but a macro statement - it is also only executed once, not one time per data step pass.
You could use
data test2;
set test;
newcount = input(symget("count"), best12.); /* symget returns text, so convert it to a number */
call symput("count",newcount+1);
put _all_;
run;
and it would work (call symput is how you define a macro variable in a data step, symget is how you retrieve the value of a macro variable that isn't finalized before the data step begins). It is probably not a good idea, however - you shouldn't generally store data values in macro variables and interact repeatedly with them inside a data step. If you post more details about why you're trying to do this (ie, what your actual goal is) I'm sure several of us could offer some suggestions for how to approach the problem.
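The timing issue described in this answer can be imitated in a few lines of Python. This is a loose analogy, not how SAS is implemented: `resolve_macros` is a hypothetical stand-in for the one-off text substitution that happens before the step runs, and the second list shows a counter that is updated while the rows are processed.

```python
def resolve_macros(source, macro_vars):
    """Substitute &name. references once, before any row is processed."""
    for name, value in macro_vars.items():
        source = source.replace(f"&{name}.", str(value))
    return source


rows = [3, 5, 2]
macro_vars = {"count": 1}

# Substitution happens a single time, so every row sees the literal 1.
compiled = resolve_macros("newcount = &count.;", macro_vars)
literal = int(compiled.split("=")[1].strip(" ;"))
static_result = [(dummy, literal) for dummy in rows]

# A counter maintained while rows are read gives the wanted output.
running_result = [(dummy, position) for position, dummy in enumerate(rows, start=1)]

print(compiled)
print(static_result)
print(running_result)
```

The first list reproduces the output the asker got and the second the output the asker wanted.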

Increment an integer

Sometimes ABAP drives me crazy with really simple tasks such as incrementing an integer within a loop...
Here's my try:
METHOD test.
DATA lv_id TYPE integer.
lv_id = 1.
LOOP AT x ASSIGNING <y>.
lv_id = lv_id+1.
ENDLOOP.
ENDMETHOD.
This results in the error message Field type "I" does not permit subfield access.
You already answered the question yourself, but to make things a bit clearer:
variable + 1
is an arithmetic expression - add 1 to the value of the variable.
variable+1
is an offset operation on a character variable. For example, if variable contains ABC, variable+1 is BC.
This can be especially confusing when dealing with NUMCs. For example, with variable = '4711', variable + 1 is evaluated to 4712, whereas variable+1 is '711' (a character sequence).
The error you saw occurred because it's not possible to perform the offset operation on a variable that is not character-like.
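The two readings of the expression can be mimicked in Python to make the difference concrete. This is an analogy only: `abap_offset` is a hypothetical helper that imitates offset access on a character-like value, and the `TypeError` stands in for the ABAP syntax error.

```python
def abap_offset(value, offset):
    """Imitate variable+offset: skip the first `offset` characters."""
    if not isinstance(value, str):
        raise TypeError("offset access needs a character-like value")
    return value[offset:]


numc = "4711"
print(abap_offset("ABC", 1))  # offset access on text
print(int(numc) + 1)          # arithmetic: variable + 1
print(abap_offset(numc, 1))   # offset access: variable+1

try:
    abap_offset(1, 1)         # an integer has no character positions
except TypeError as error:
    print(error)
```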
You mean like:
ADD 1 to lv_id.
By the way, when you loop over an internal table, SY-TABIX has the loop counter.
Uh, I got it.
It's the f****** spaces...
lv_id = lv_id + 1
works...
Simple:
DATA gv_inc TYPE i.
Place this statement in the loop:
gv_inc = gv_inc + 1.
From SAP NetWeaver version 7.54 onward you can also use:
lv_id += 1.
Instead of
lv_id = lv_id + 1.
Happy coding!
If you are going to increment on every loop pass, then you can get the table size directly instead.
DESCRIBE TABLE x LINES DATA(lv_id). "Outside of the loop.