How to sort negative and positive data in SAS - sql

I have the data below in the variable NUM:
-3 1 0 -1 3 2 -2 5 -5 -6 4 6 -4
I want NUM in the following sort order:
0 -1 1 -2 2 -3 3 -4 4 -5 5 -6 6
How can I sort negative and positive values together? Please help.
data have;
input NUM @@;
cards;
-3 1 0 -1 3 2 -2 5 -5 -6 4 6 -4
;
run;

Sort by abs(num) first, then by num. The second key puts the negative value before the positive one within the same absolute value, as in the requested output.
data have;
input NUM @@;
cards;
-3 1 0 -1 3 2 -2 5 -5 -6 4 6 -4
;
run;
proc sql;
create table want as
select * from have
order by abs(num), num
;
quit;
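The two-key ordering can be sanity-checked outside SAS. Below is a minimal Python sketch, not part of the original answer, that assumes the input containing -1 and sorts on the pair (absolute value, signed value); the names nums and ordered are only illustrative.

```python
# Sort on (absolute value, signed value): within a tie on abs(),
# the negative number compares lower and therefore comes first.
nums = [-3, 1, 0, -1, 3, 2, -2, 5, -5, -6, 4, 6, -4]
ordered = sorted(nums, key=lambda n: (abs(n), n))
print(ordered)
```

This mirrors ORDER BY abs(num), num: the first key groups values by magnitude and the second key breaks the tie.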

Make a new variable with the absolute value and include it in the sort.
data want;
set have;
absolute=abs(num);
run;
proc sort data=want;
by absolute num;
run;

How to create 2 datalines in sas with different length

I want to create a table like that:
a 1 2 3
b 1 2 3 4
a has 3 values, b has 4.
How can I do it in SAS?
When I enter it like that it deletes the 4 at the end.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
4
I am very new to SAS; thanks for your advice.
If you want to use LIST MODE input, like in your example, then each variable needs to have a "word" on the line. Use a period to indicate the missing values.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
. 4
;
Otherwise switch to COLUMN MODE input. The lone 4 on the last data line must then be indented so that it sits in columns 3-4.
data my_data;
input a 1-2 b 3-4 ;
datalines;
1 1
2 2
3 3
  4
;
Or FORMATTED MODE
data my_data;
input a 2. b 2.;
datalines;
1 1
2 2
3 3
  4
;
Note that you can use the period to indicate a missing value even when the variable is character. This is because the normal character informat will convert that single period into a blank value.
data my_data;
input a $ b;
datalines;
1 1
2 2
3 3
. 4
;
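To see why list-style input needs a placeholder while column-style input does not, here is a rough Python analogy. It is only a sketch of the two reading strategies, not of SAS internals; parse_list and parse_columns are hypothetical helper names.

```python
def parse_list(line):
    """List-style read: split on whitespace; a lone '.' stands for missing."""
    return [None if tok == "." else float(tok) for tok in line.split()]


def parse_columns(line, spans=((0, 2), (2, 4))):
    """Column-style read: slice fixed positions; a blank field means missing."""
    padded = line.ljust(spans[-1][1])  # pad short lines so every slice exists
    fields = [padded[start:end].strip() for start, end in spans]
    return [float(f) if f else None for f in fields]


print(parse_list(". 4"))     # two tokens, the first one is missing
print(parse_columns("  4"))  # first field blank, second field holds the 4
print(parse_list("4"))       # a single token: nothing is left for b
```

With whitespace splitting, a bare 4 is always the first value, so the position of the second variable has to be marked explicitly. With fixed columns, the position itself carries that information.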

SAS Lookup on Per Variable Basis

I have two tables in SAS, Table A and Table B. Suppose I want to write a little SAS code to obtain the table "Desired Output." How would I do this?
Table A:
Observation Var1 Var2
1 0 0
2 1 2
3 2 1
4 0 0
Table B:
Var Level Lookup
Var1 0 0.1
Var1 1 0.3
Var1 2 0.5
Var2 0 0.7
Var2 1 0.8
Var2 2 0.9
Desired output:
Observation Var1 Var2 Var1_new Var2_new
1 0 0 0.1 0.7
2 1 2 0.3 0.9
3 2 1 0.5 0.8
4 0 0 0.1 0.7
From my understanding, this may involve SQL in SAS, but I'm not sure. I have no idea how to do this. Pseudo-code might look like this, but I don't know how to actually make it work:
data DATA_OUT.DESIRED_OUTPUT;
set DATA_IN.TABLE_A;
set PP.TABLE_B key=(Var Level);
Var1_new = TABLE_B["Var1" Var1][Lookup];
Var2_new = TABLE_B["Var2" Var2][Lookup];
run;
How would you achieve the desired output in SAS?
Here is a method using a hash object to store your table B.
data A ;
input var1 var2;
cards;
0 0
1 2
2 1
0 0
;
data B;
input Var :$32. Level Lookup;
cards;
Var1 0 0.1
Var1 1 0.3
Var1 2 0.5
Var2 0 0.7
Var2 1 0.8
Var2 2 0.9
;
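data want;
if _n_=1 then do;
if 0 then set b; /* define var, level and lookup in the PDV without reading any rows */
dcl hash h(dataset: 'b'); /* load table B into memory once */
h.definekey('var','level');
h.definedata('lookup');
h.definedone();
end;
set a;
/* find() returns 0 on a match, so an unmatched key leaves the result missing */
if h.find(key:'Var1',key:var1)=0 then lookup1=lookup;
if h.find(key:'Var2',key:var2)=0 then lookup2=lookup;
drop var level lookup;
run;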
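Conceptually, the hash object here is a dictionary keyed on the pair (Var, Level). The Python sketch below shows the same lookup under that reading; it is an illustration of the idea rather than a translation of the SAS step, and the names table_a, table_b and want are chosen only for this example.

```python
# Table B as a dictionary keyed on (variable name, level).
table_b = {
    ("Var1", 0): 0.1, ("Var1", 1): 0.3, ("Var1", 2): 0.5,
    ("Var2", 0): 0.7, ("Var2", 1): 0.8, ("Var2", 2): 0.9,
}

table_a = [
    {"Observation": 1, "Var1": 0, "Var2": 0},
    {"Observation": 2, "Var1": 1, "Var2": 2},
    {"Observation": 3, "Var1": 2, "Var2": 1},
    {"Observation": 4, "Var1": 0, "Var2": 0},
]

want = []
for row in table_a:
    out = dict(row)
    for var in ("Var1", "Var2"):
        # .get returns None when the (name, level) pair is not in table B
        out[var + "_new"] = table_b.get((var, row[var]))
    want.append(out)

for out in want:
    print(out)
```

The per-variable lookup is just a change of the first key component, which is why one lookup table can serve any number of columns.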
There are about a dozen ways to do this, but the best way for what you have there is probably to make a format from the second dataset.
Formats are just relationships between one value and another value, which is exactly what you have here! You use the CNTLIN option on PROC FORMAT to create the relationship from a dataset (your dataset B) and then apply it using PUT. (Then use INPUT to change back to a number - formats only create character values. You can't use INFORMAT here because those only take character values as input. Number to number always takes an extra step.)
You could also use a hash table lookup, or just a pair of data step merges, or keyed set statements... a lot of options, as well as SQL joins. But format here will be the fastest and the easiest to code IMO.
data a;
input Observation Var1 Var2;
datalines;
1 0 0
2 1 2
3 2 1
4 0 0
;;;;
run;
data b;
input Var $ Level Lookup;
datalines;
Var1 0 0.1
Var1 1 0.3
Var1 2 0.5
Var2 0 0.7
Var2 1 0.8
Var2 2 0.9
;;;;
run;
*Here we make a new dataset that has the required names for a format cntlin dataset;
data for_fmt;
set b;
rename var=fmtname
level=start
lookup=label
;
var = cats(var,'F'); *format names cannot end with numbers, so add an F at the end;
run;
proc format cntlin=for_fmt; *read in the format;
quit;
*now use the formats;
data want;
set a;
var1_new = input(put(var1,var1f.),best12.);
var2_new = input(put(var2,var2f.),best12.);
run;

Create new column using for loop Pandas

I have the following data set:
import pandas as pd
d = {'person':[1,1,1,1,1,1],'id':['-8','-5','-4','-3','-3','-2'],'obs':
['A','B','C','D','E','F']}
df_start = pd.DataFrame(data=d)
Need to create an output dataset like:
d_end = {'id':[-8,-8,-5,-8,-5,-4,-5,-4,-3,-3,-5,-4,-3,-3],
'obs':['A','A','B','A','B','C','B','C','D','E','B','C','D','E'],
'id_group':[-8,-5,-5,-4,-4,-4,-3,-3,-3,-3,-2,-2,-2,-2]}
df_end = pd.DataFrame(data=d_end)
I am trying to group the rows using a new column called id_group that is created by comparing id values across rows. A single id will belong to its own id group. An id will belong to another id_group if (id+4) is greater than or equal to an id on another row.
I have not been able to get very far with a for loop and am very open to suggestions.
To do it without looping, perform one large merge and then subset afterwards:
df_start['id'] = df_start['id'].astype(int)
df_end = df_start.assign(dummy=1).merge(df_start[['id']].assign(dummy=1),
on='dummy',
suffixes=['','_gp']).drop(columns='dummy')
df_end = df_end[df_end.id.between(df_end.id_gp - 4, df_end.id_gp)]
Output:
person id obs id_gp
0 1 -8 A -8
1 1 -8 A -5
2 1 -8 A -4
7 1 -5 B -5
8 1 -5 B -4
9 1 -5 B -3
10 1 -5 B -3
11 1 -5 B -2
14 1 -4 C -4
15 1 -4 C -3
16 1 -4 C -3
17 1 -4 C -2
21 1 -3 D -3
22 1 -3 D -3
23 1 -3 D -2
27 1 -3 E -3
28 1 -3 E -3
29 1 -3 E -2
35 1 -2 F -2
With looping, create subsets for all values within 4 of the id. Concatenate them all together.
df_start['id'] = df_start['id'].astype(int)
l = []
for id_gp in df_start.id.unique():
l.append(df_start[df_start.id.between(id_gp-4, id_gp)].assign(id_gp = id_gp))
df_end = pd.concat(l, ignore_index=True)
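The selection rule itself can be checked without pandas. The sketch below is a plain-Python rendering of the looping idea, assuming the ids have already been converted to integers; rows, window and groups are illustrative names. Because it iterates over unique ids, the repeated id -3 produces a single group.

```python
rows = [
    {"person": 1, "id": -8, "obs": "A"},
    {"person": 1, "id": -5, "obs": "B"},
    {"person": 1, "id": -4, "obs": "C"},
    {"person": 1, "id": -3, "obs": "D"},
    {"person": 1, "id": -3, "obs": "E"},
    {"person": 1, "id": -2, "obs": "F"},
]
window = 4  # a row joins group id_gp when id_gp - 4 <= id <= id_gp

# dict.fromkeys keeps the first occurrence of each id, in order
unique_ids = list(dict.fromkeys(r["id"] for r in rows))

groups = []
for id_gp in unique_ids:
    for r in rows:
        if id_gp - window <= r["id"] <= id_gp:
            groups.append({**r, "id_gp": id_gp})

for g in groups:
    print(g["id"], g["obs"], g["id_gp"])
```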

identifying the rows with maximum continuous values

I have two columns in a table. The second column holds 1 or 0 depending on a predefined condition. Can someone help me with logic to identify the longest run of consecutive 1s? For example, in the table below the longest run is between rows 7 and 18. Just the logic to identify this would be enough.
Thanks
Create the intervals.
data intervals ;
set have ;
by B NOTSORTED ;
if first.b then start=A ;
retain start ;
if last.b then do;
end = A ;
duration = end - start + 1 ;
output;
end;
drop A ;
run;
Then find the interval with the maximum duration. Perhaps you want the first occurrence of the maximum duration?
proc sort data=intervals out=want ;
by descending duration start;
run;
data want ;
set want (obs=1);
where B=1;
run;
Something like this:
data have;
input A B;
datalines;
1 0
2 0
3 1
4 1
5 1
6 0
7 0
8 0
9 1
10 0
11 1
12 1
13 1
14 1
15 1
16 1
17 0
18 0
19 0
20 1
21 0
;
proc sort data=have;
by A;
run;
data want;
set have;
if B=1 then count + 1;
if B = 0 then count = 0;
run;
proc sql;
select max(count) as max_value from want;
quit;
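For comparison, the run-length logic can be sketched in a few lines of Python with itertools.groupby, which groups consecutive equal values much like BY B NOTSORTED. This is a sketch using the sample values from the data step in this thread; the names b_values and best are illustrative.

```python
from itertools import groupby

# B values for A = 1..21, copied from the sample data step
b_values = [0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0]
rows = list(enumerate(b_values, start=1))  # (A, B) pairs

best = None  # (duration, start, end) of the longest run of 1s
for flag, run in groupby(rows, key=lambda r: r[1]):
    run = list(run)
    if flag != 1:
        continue
    start, end = run[0][0], run[-1][0]
    duration = end - start + 1
    # strict '>' keeps the earliest run when several share the maximum
    if best is None or duration > best[0]:
        best = (duration, start, end)

print(best)
```

Unlike the running counter, this keeps the start and end rows of the longest run as well as its length.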

SAS hierarchical structure sum

I have a dataset with a hierarchical codelist variable.
The logic of the hierarchy is determined by the LEVEL variable and the prefix structure of the CODE character variable.
There are 6 (code length from 1 to 6) "aggregate" levels and the terminal level (code length of 10 characters).
I need to update the nodes variable, which is the count of terminal nodes under each aggregate. Aggregate levels do not count toward the "higher" aggregates, only terminal nodes do, so the counts add up to the same total at every level: for example, the level 5 counts sum to the same number as the level 6 counts.
I also need to sum up the weight into the "higher" level nodes.
NOTE: I offset the output table's NODES and WEIGHT variable so you can see better what I am talking about (just add up the numbers in each offset and you get the same value).
EDIT1: There can be multiple observations with the same code. A unique observation is a combination of 3 variables: code + var1 + var2.
Input table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 999 999 999
2 2 11 . . 999 999 999
3 3 111 . . 999 999 999
4 4 1111 . . 999 999 999
5 5 11111 . . 999 999 999
6 6 111111 . . 999 999 999
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 999 999 999
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 999 999 999
12 6 111190 . . 999 999 999
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
Desired output table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 5 1.0 98,1
2 2 11 . . 5 1.0 98,1
3 3 111 . . 5 1.0 98,1
4 4 1111 . . 5 1.0 98,1
5 5 11111 . . 3 0.7 98,5
6 6 111111 . . 2 0.2 107,3
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 1 0.5 95,0
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 2 0.3 97,2
12 6 111190 . . 2 0.3 97,2
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is a part of a webservice which has to run "instantly" on request.
Any suggestions on speeding up the code, or any other solutions are welcome.
%macro doit;
data temporary;
set have;
run;
%do i=6 %to 2 %by -1;
%if &i = 6 %then %let x = 10;
%else %let x = (&i+1);
proc sql noprint;
select count(code)
into :cc trimmed
from have
where level = &i;
select code
into :id1 - :id&cc
from have
where level = &i;
quit;
%do j=1 %to &cc.;
%let idd = &&id&j;
proc sql;
update have t1
set nodes = (
select sum(nodes)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%")),
set weight = (
select sum(weight)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%"))
where (t1.level = &i and t1.code like "&idd");
quit;
%end;
%end;
%mend doit;
Current code based on @Quentin's solution:
data have;
input ID level code : $10. nodes weight myIndex;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight myIndex=_myIndex));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight','_myIndex');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes, myIndex);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
myIndex=sum(myIndex,_myIndex*_weight);
end;
_n_ = iter.next();
end;
myIndex=round(myIndex/weight,.1);
end;
output;
run;
Here's an alternative hash approach.
Rather than using a hash object to do a cartesian join, this adds the nodes & weight from each level 10 node to each of the 6 applicable parent nodes as it goes along. This may be marginally faster than Quentin's approach as there are no redundant hash lookups.
It takes a bit longer than Quentin's approach when constructing the hash object, and uses a bit more memory, as each terminal node is added 6 times with different keys and existing entries often have to be updated, but afterwards each parent node only has to look up its own individual stats, rather than looping through all the terminal nodes, which is a substantial saving.
Weighted stats are possible as well, but you have to update both loops, not just the second one.
data want;
if 0 then set have;
dcl hash h();
h.definekey('code');
h.definedata('nodes','weight','myIndex');
h.definedone();
length t_code $10;
do until(eof);
set have(where = (level = 10)) end = eof;
t_nodes = nodes;
t_weight = weight;
t_myindex = weight * myIndex;
do _n_ = 1 to 6;
t_code = substr(code,1,_n_);
if h.find(key:t_code) ne 0 then h.add(key:t_code,data:t_nodes,data:t_weight,data:t_myIndex);
else do;
nodes + t_nodes;
weight + t_weight;
myIndex + t_myIndex;
h.replace(key:t_code,data:nodes,data:weight,data:MyIndex);
end;
end;
end;
do until(eof2);
set have end = eof2;
if level ne 10 then do;
h.find();
myIndex = round(MyIndex / Weight,0.1);
end;
output;
end;
drop t_:;
run;
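The accumulate-into-every-prefix idea can be sketched compactly in Python with a dictionary. This is an illustration of the technique under the assumption that only terminal (10-character) codes carry values; terminals, totals and stats are names made up for this example, and the data are the sample rows from this thread.

```python
from collections import defaultdict

# (code, weight, myIndex) for the terminal nodes in the sample data
terminals = [
    ("1111110000", 0.1, 105.5),
    ("1111119999", 0.1, 109.1),
    ("1111129999", 0.5, 95.0),
    ("1111900000", 0.1, 80.7),
    ("1111901000", 0.2, 105.5),
]

# prefix -> [nodes, weight, sum of weight * myIndex]
totals = defaultdict(lambda: [0, 0.0, 0.0])
for code, weight, index in terminals:
    for n in range(1, 7):  # the six aggregate levels
        t = totals[code[:n]]
        t[0] += 1
        t[1] += weight
        t[2] += weight * index


def stats(prefix):
    """Return (nodes, weight, weighted mean index) for a known aggregate code."""
    nodes, weight, weighted = totals[prefix]
    return nodes, round(weight, 10), round(weighted / weight, 1)


for prefix in ("1", "11111", "111111", "111112", "11119"):
    print(prefix, stats(prefix))
```

Each terminal node is touched six times while building the table, after which every parent needs only a single lookup.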
Below is a brute-force hash approach to doing a similar Cartesian product as in the SQL. Load a hash table of the terminal nodes. Then read through the dataset of nodes, and for each node that is not a terminal node, iterate through the hash table, identifying all of the child terminal nodes.
I think the approach @joop is describing may be more efficient, as this approach doesn't take advantage of the tree structure. So there is a lot of re-calculating. With 5000 records and 3000 terminal nodes, this would do 2000*3000 comparisons. But might not be that slow since the hash table is in memory, so you're not going to have excessive I/O ....
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
end;
_n_ = iter.next();
end;
end;
output;
run;
It seems pretty simple. Just join back with itself and count/sum.
proc sql ;
create table want as
select a.id, a.level, a.code , a.var1, a.var2
, count(b.id) as nodes
, sum(b.weight) as weight
from have a
left join have b
on a.code eqt b.code
and b.level=10
group by 1,2,3,4,5
order by 1
;
quit;
If you don't want to use the EQT operator then you can use the SUBSTR() function instead.
on a.code = substr(b.code,1,a.level)
and b.level=10
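For anyone who wants to try the prefix self-join outside SAS, here is a small sketch using Python's built-in sqlite3 module. SQLite has no EQT operator, so this uses the substr() form of the join condition; the table layout is reduced to the columns the query needs and the data are the sample rows from this thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table have (id integer, level integer, code text, weight real)")
con.executemany(
    "insert into have values (?, ?, ?, ?)",
    [
        (1, 1, "1", None), (2, 2, "11", None), (3, 3, "111", None),
        (4, 4, "1111", None), (5, 5, "11111", None), (6, 6, "111111", None),
        (7, 10, "1111110000", 0.1), (8, 10, "1111119999", 0.1),
        (9, 6, "111112", None), (10, 10, "1111129999", 0.5),
        (11, 5, "11119", None), (12, 6, "111190", None),
        (13, 10, "1111900000", 0.1), (14, 10, "1111901000", 0.2),
    ],
)

# Join every row to the terminal rows whose code starts with its own code.
rows = con.execute(
    """
    select a.id, a.code, count(b.id) as nodes, sum(b.weight) as weight
    from have a
    left join have b
      on b.level = 10
     and substr(b.code, 1, length(a.code)) = a.code
    group by a.id, a.code
    order by a.id
    """
).fetchall()

for row in rows:
    print(row)
```

Terminal rows match themselves, so they keep a count of 1 and their own weight.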
Since you're using SAS, how about using proc summary to do the heavy lifting here? No cartesian joins required!
One advantage of this option over some of the others is that it's a bit easier to generalise if you want to calculate lots of more complex statistics for multiple variables.
data have;
input ID level code : $10. nodes weight myIndex;
format myIndex 5.1;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
run;
data v_have /view = v_have;
set have(where = (level = 10));
array lvl[6] $6;
do i = 1 to 6;
lvl[i]=substr(code,1,i);
end;
drop i;
run;
proc summary data = v_have;
class lvl1-lvl6;
var nodes weight;
var myIndex /weight = weight;
ways 1;
output out = summary(drop = _:) sum(nodes weight)= mean(myIndex)=;
run;
data v_summary /view = v_summary;
set summary;
length code $10;
code = cats(of lvl:);
drop lvl:;
run;
data have;
modify have v_summary;
by code;
replace;
run;
In theory a hash of hashes might also be an appropriate data structure, but that would be extremely complicated for a relatively small benefit. I might have a go anyway just as a learning exercise...
One approach (I think) would be to make the Cartesian product, and find all of the terminal nodes that are a "match" to each of the nodes, then sum the weights.
Something like:
data have;
input ID level code : $10. nodes weight ;
cards;
1 1 1 . .
2 2 11 . .
3 3 111 . .
4 4 1111 . .
5 5 11111 . .
6 6 111111 . .
7 10 1111110000 1 0.1
8 10 1111119999 1 0.1
9 6 111112 . .
10 10 1111129999 1 0.5
11 5 11119 . .
12 6 111190 . .
13 10 1111900000 1 0.1
14 10 1111901000 1 0.2
;
proc sql;
select min(id) as id
, min(level) as level
, a.code
, count(b.weight) as nodes /*count of terminal nodes*/
, sum(b.weight) as weight /*sum of weights of terminal nodes*/
from
have as a
,(select code , weight
from have
where level=10 /*selects terminal nodes*/
) as b
where a.code eqt b.code /*EQT is equivalent to =: */
group by a.code
;
quit;
I'm not sure that is correct, but it gives the desired results for the sample data.
This is the SQL needed to find the parent record for every record. It only uses string functions (position and length), so it should be adaptable to any dialect of SQL, maybe even SAS. (The CTE might need to be rewritten as subqueries or a view.) The idea is to:
add a parent_id field to the dataset
find the record with the longest substring of code
and use its id as the value for our parent_id
(after that) update the records from the sum(nodes),sum(weight) of their direct children (the ones with child.parent_id = this.id )
BTW: I could have used the LEVEL instead of the LENGTH(code) ; the data is a bit redundant in this aspect.
WITH sub AS (
SELECT id, length(code) AS len
, code
FROM tree)
UPDATE tree t
SET parent_id = s.id
FROM sub s
WHERE length(t.code) > s.len AND POSITION (s.code IN t.code) = 1
AND NOT EXISTS (
SELECT *
FROM sub nx
WHERE nx.len > s.len AND nx.len < length(t.code)
AND POSITION (nx.code IN t.code ) = 1
)
;
SELECT * FROM tree
ORDER BY parent_id DESC NULLS LAST
, id
;
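The parent-finding step (the parent is the record whose code is the longest proper prefix of the child's code) can be sketched in Python as follows. This assumes codes are unique, which the question's EDIT1 says is not guaranteed in the real data; codes, find_parent and parent_id are illustrative names.

```python
# id -> code, taken from the sample data in this thread
codes = {
    1: "1", 2: "11", 3: "111", 4: "1111", 5: "11111", 6: "111111",
    7: "1111110000", 8: "1111119999", 9: "111112", 10: "1111129999",
    11: "11119", 12: "111190", 13: "1111900000", 14: "1111901000",
}


def find_parent(node_id):
    """Return the id whose code is the longest proper prefix, or None for a root."""
    code = codes[node_id]
    candidates = [
        (len(other_code), other_id)
        for other_id, other_code in codes.items()
        if len(other_code) < len(code) and code.startswith(other_code)
    ]
    # max() on (length, id) tuples picks the longest matching prefix
    return max(candidates)[1] if candidates else None


parent_id = {node_id: find_parent(node_id) for node_id in codes}
print(parent_id)
```

Note that code 11119 has no level 5 ancestor of its own prefix chain other than 1111, so its parent is the level 4 record, which is why the prefix test is used instead of level arithmetic.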
After finding the parents, the whole table should be updated (repeatedly) from itself
like:
-- PREPARE omg( integer) AS
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT parent_id , SUM(nodes) AS nodes , SUM(weight) AS weight
FROM tree GROUP BY parent_id) s
WHERE s.parent_id = t.id
;
In SAS, this could probably be done by sorting on {0-parent_id, id} and do some retain+summation magic. (my SAS is a bit rusty in this area)
UPDATE: if only the leaf nodes have non-NULL (non-missing) values for {nodes, weight}, the aggregation can be done in one sweep for the entire tree, without first computing the parent_ids:
UPDATE tree t
SET nodes = s.nodes , weight = s.weight
FROM ( SELECT p.id , SUM(c.nodes) AS nodes , SUM(c.weight) AS weight
FROM tree p
JOIN tree c ON c.lev > p.lev AND POSITION (p.code IN c.code ) = 1
GROUP BY p.id
) s
WHERE s.id = t.id
;
An index on {lev,code} will probably speed things up (assuming there is already an index on id).