Counting rows in multiple variables using SAS - variables

I have a question in creating a count variable using SAS.
Q R
----
1 a
1 a
1 b
1 b
1 b
2 a
3 a
3 c
4 c
4 c
4 c
I need to create a variable S that counts the rows that has same combination of Q and R. The following will be the output.
Q R S
-------------------
1 a 1
1 a 2
1 b 1
1 b 2
1 b 3*
2 a 1
3 a 1
3 c 1
4 b 1
4 b 2
4 b 3
I tried using following program:
data two;
set one;
S + 1;
by Q R;
if first.Q and first.R then S = 1;
run;
But, this did not run correctly. For example, * will come out as 1 instead of 3. I would appreciate any tips on how to make this counting variable work correctly.

Very close, your if statement is should be first.R (or change the and to OR but that isn't efficient). I usually prefer to have the increment after the set to 1.
data two;
set one;
by Q R;
*Retain S; *implicitly retained by using the +1 notation;
if first.R then S = 1;
else S+1;
run;

Reese's example is certainly sufficient in this case, but given this was a simple question with a mostly uninteresting answer, I'll instead present a very small variant simply from a programming style standpoint.
data two;
set one;
by Q R;
if first.R then s=0;
s+1;
run;
This will likely function exactly the same as Reese's code and the original question's code once the first.Q is removed. However, it has two slight differences.
First off, I like to group if first. variable-resetting code (that isn't otherwise dependent on location) in one place, as early as possible in the code (after by statements, where, "early" subsetting if, array, format, length, and retain). This is useful in an organizational standpoint because this code (usually) is something that roughly parallels what SAS does in between data step iterations, but for BY groups - so it's nice to have it at the start of the data step.
Second, I like initializing to zero. That avoids the need for the else conditional, which makes the code clearer; you're not doing anything different on the first row per BY group other than the re-initialization, so it makes sense to not have the increment be conditional.
These are both helpful in a more complex version of this data step; obviously in this particular step it doesn't matter much, but in a larger data step it can be helpful to organize your code more effectively. Imagine this data step:
data two;
set one;
by Q R;
retain Z;
format Z $12.;
array CS[15];
if first.R then do; *re-init block;
S=0;
Z=' ';
end;
S+1; *increment counter;
do _t = 1 to dim(CS);
Z = cats(Z,ifc(CS[_t]>0,put(CS[_t],8.),''));
end;
keep Q R S Z;
run;
That data step is nicely organized, and has a logical flow, even though almost every step could be moved anywhere else in the data step. Grouping the initializations together makes it more readable, particularly if you always group things that way.

Related

SQL dealing every bit without run query repeatedly

I have a column using bits to record status of every mission. The index of bits represents the number of mission while 1/0 indicates if this mission is successful and all bits are logically isolated although they are put together.
For instance: 1010 is stored in decimal means a user finished the 2nd and 4th mission successfully and the table looks like:
uid status
a 1100
b 1111
c 1001
d 0100
e 0011
Now I need to calculate: for every mission, how many users passed this mission. E.g.: for mission1: it's 0+1+1+0+1 = 5 while for mission2, it's 0+1+0+0+1 = 2.
I can use a formula FLOOR(status%POWER(10,n)/POWER(10,n-1)) to get the bit of every mission of every user, but actually this means I need to run my query by n times and now the status is 64-bit long...
Is there any elegant way to do this in one query? Any help is appreciated....
The obvious approach is to normalise your data:
uid mission status
a 1 0
a 2 0
a 3 1
a 4 1
b 1 1
b 2 1
b 3 1
b 4 1
c 1 1
c 2 0
c 3 0
c 4 1
d 1 0
d 2 0
d 3 1
d 4 0
e 1 1
e 2 1
e 3 0
e 4 0
Alternatively, you can store a bitwise integer (or just do what you're currently doing) and process the data in your application code (e.g. a bit of PHP)...
uid status
a 12
b 15
c 9
d 4
e 3
<?php
$input = 15; // value comes from a query
$missions = array(1,2,3,4); // not really necessary in this particular instance
for( $i=0; $i<4; $i++ ) {
$intbit = pow(2,$i);
if( $input & $intbit ) {
echo $missions[$i] . ' ';
}
}
?>
Outputs '1 2 3 4'
Just convert the value to a string, remove the '0's, and calculate the length. Assuming that the value really is a decimal:
select length(replace(cast(status as char), '0', '')) as num_missions as num_missions
from t;
Here is a db<>fiddle using MySQL. Note that the conversion to a string might look a little different in Hive, but the idea is the same.
If it is stored as an integer, you can use the the bin() function to convert an integer to a string. This is supported in both Hive and MySQL (the original tags on the question).
Bit fiddling in databases is usually a bad idea and suggests a poor data model. Your data should have one row per user and mission. Attempts at optimizing by stuffing things into bits may work sometimes in some programming languages, but rarely in SQL.

Working of Merge in SAS (with IN=)

I have two dataset data1 and data2
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
I am merging them from below code:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
Result: (I was expecting an Inner Join result which is not the case)
1 a 10 x
2 a 20 y
2 a 30 z
2 a 40 w
Result from proc sql inner join.
proc sql;
select data1.id,sn,sales,x from data2 inner join data1 on data1.hh_id;
quit;
Result: (As expected from an inner join)
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 w
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 w
b 3 10 x
b 3 20 y
b 3 30 z
b 3 40 w
I want to know the concept and STEP BY STEP working of merge statement in SAS with In= and proving the above result.
PS: I have read this, and it says
An obvious use for these variables is to control what kind of 'merge'
will occur, using if statements. For example, if
ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS
only include rows that match on the by variables from both input data
sets (like an inner join).
which I guess, (like an Inner Join) is not always the case.
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
SAS data step, however, process merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS datastep doesn't do normally.
What it does:
For each unique BY value
Take the next record from the left side dataset, if one exists with that BY value
Take the next record from the right side dataset, if one exists with that BY value
Output a row
Continue until both datasets are exhausted for that BY value
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4

SAS Checking Whether A Third Variable is Between the Values of two other variables

I have been dealing with this issue that I thought was trivial, but for some reason nothing I have tried has worked so far.
I have a dataset
obs A B C
1 2 6 7
2 3 1 5
3 8 5 9
. . . .
For each observation, I want to compare the values in column A to the values in column B and assign a value 1 to a variable called within. My goal to only select observations where their A value is within their B and C values. I have tried everything, but nothing seem to be working.
Thank you.
Here's how to do it in a data step. Let me know if that works for you.
data new;
set old;
if B < A < C then D = 1;
else delete;
run;

Creating and modifying a global statement in SAS

I would like to do something very simple, but it doesn't work
This is a simple example but I intend to use it for some more complex stuff
the output I want is :
obs. dummy newcount
1 3 1
2 5 2
3 2 3
but the output I get is :
obs. dummy newcount
1 3 1
2 5 1
3 2 1
here is my code
data test;
input dummy;
cards;
3
5
2
;
run;
%let count=1;
data test2;
set test;
newcount = &count.;
%let count = &count. + 1;
run;
The variable count doesn't get incremented. How do I do this?
Thanks for your help !
You're mixing macro variables and datastep variables in a way you cannot. Macro variables used in the data step in most cases have to have their values already defined prior to the data step when used like this; what happens is the data step compiler immediately resolves &count to the number 1, and uses that number 1 in its compilation, not the macro variable's newer values.
Further, the %let is not a data step command but a macro statement - it is also only executed once, not one time per data step pass.
You could use
data test2;
set test;
newcount = symget("count");
call symput("count",newcount+1);
put _all_;
run;
and it would work (call symput is how you define a macro variable in a data step, symget is how you retrieve the value of a macro variable that isn't finalized before the data step begins). It is probably not a good idea, however - you shouldn't generally store data values in macro variables and interact repeatedly with them inside a data step. If you post more details about why you're trying to do this (ie, what your actual goal is) I'm sure several of us could offer some suggestions for how to approach the problem.

Consecutive item comparison with smallest memory footprint

In J (using J503, not J6 or 7), normally when I want to see if the elements of an array are smaller than their predecessor, I use this:
smaller =: }:<:}.
Which results in n-1 items:
smaller 1 2 3 4 5
1 1 1 1
smaller 1 2 4 3
1 1 0
Internally, }: and }. create two arrays (one omits the last item, the other omits the first), finally allowing the <: comparison. The total memory usage will be 2(n-1) for the 2 temp arrays.
memuse =: 7!:2
i =: ,(10000#1)?10000
memuse 'smaller i'
148480
Another approach which should intuitively work better takes more memory:
smaller2 =: 13 : '2<:/\y.'
memuse 'smaller i'
214400
(J6 handles this much better. But I'm stuck with J5).
What would be a leaner alternative for the same operation?
Edit:
In J504, using rotate (1|.) seems to be the best choice so far:
smallerS =: <:1&|.
l =: ?. 10000$1000
;memuse &.> 'smaller l';'smaller2 l';'smallerS l'
148480 214400 82880
I would expect
2<:/\y
to be much better, given that 2 f/\y is special code since J 4.06. In the off chance that <: is the reason for the poor performance, try:
0>:2-/\ y