How to create 2 datalines in SAS with different lengths - input

I want to create a table like this:
a 1 2 3
b 1 2 3 4
a has 3 values and b has 4.
How can I do this in SAS?
When I enter the data as shown below, the 4 at the end is dropped.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
4
I am very new to SAS. Thanks for your advice.

If you want to use LIST MODE input, as in your example, then each variable needs to have a "word" on every line. Use a period to mark a missing value.
data my_data;
input a b;
datalines;
1 1
2 2
3 3
. 4
;
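Otherwise switch to COLUMN MODE input. Each variable is then read from fixed columns, so the last line must be indented to place the 4 in the columns that belong to b.
data my_data;
input a 1-2 b 3-4 ;
datalines;
1 1
2 2
3 3
  4
;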
Or FORMATTED MODE input, which reads the same fixed-width layout:
data my_data;
input a 2. b 2.;
datalines;
1 1
2 2
3 3
  4
;
Note that you can use the period to indicate a missing value even when the variable is character. This is because the normal character informat will convert that single period into a blank value.
data my_data;
input a $ b;
datalines;
1 1
2 2
3 3
. 4
;
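As a quick check of that last point, here is a minimal sketch. The variable name a_is_blank is hypothetical and not part of the original answer. It relies on the MISSING function, which accepts both character and numeric arguments, to flag the row where the period was read as a blank.

```sas
data my_data;
  input a $ b;
  /* a_is_blank is a hypothetical helper: 1 when a is blank, 0 otherwise */
  a_is_blank = missing(a);
datalines;
1 1
2 2
3 3
. 4
;

proc print data=my_data;
run;
```

Only the fourth observation should show a_is_blank=1, since the character informat turned the single period into a blank.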

Choose first occurrence and then choose the next one down

In SAS (DATA step or PROC SQL), I want to choose the first occurrence of TransB for each TransA, ordering by DaysBetweenTrans first and then by Flag. If that TransB has already been chosen, I want the next available one. TransA must be unique as well, i.e. each TransA appears in exactly one row and each TransB appears in exactly one row.
For example, the original table looks like this:
TransA  TransB  DaysBetweenTrans  Flag
A       1       1                 1
A       2       1                 1
B       1       3                 1
B       2       2                 1
B       3       3                 1
C       1       1                 1
C       3       4                 1
but I want only:
TransA  TransB  DaysBetweenTrans  Flag
A       2       1                 1
B       1       3                 1
C       3       4                 1
I tried sorting by TransA and removing duplicate keys, then sorting by TransB and removing duplicate keys, but no luck. The other way I thought of was to use first.TransA and output, join back to the original table, remove any TransA already chosen, and repeat, but there has to be a better way.
You might want to look into the SAS procedures for optimization, because the straightforward approach of taking the next best match for the current case might not find the best overall solution.
Here is an approach that uses a HASH to keep track of which targets have already been assigned.
It is not totally clear to me what your preferences for ordering are, but here is one method. It sounds like you want to find the best match for TRANSB=1, then for TRANSB=2, and so on.
data have;
input TransA $ TransB $ DaysBetweenTrans Flag;
cards;
A 1 1 0
A 2 1 1
B 1 3 1
B 2 2 1
B 3 3 1
C 1 1 1
C 3 4 1
;
proc sort data=have;
by transB daysbetweentrans descending flag transA;
run;
data _null_;
  if _n_=1 then do;
    * hash keyed on transA holds the rows that have been chosen so far;
    declare hash h(ordered:'Y');
    rc=h.definekey('transA');
    rc=h.definedata('transA','transB','daysbetweentrans','flag');
    rc=h.definedone();
  end;
  set have end=eof;
  by transB;
  retain found;
  if first.transB then found=0;
  * h.add() returns 0 only when this transA is not yet in the hash;
  if not found then if not h.add() then found=1;
  if eof then do;
    rc=h.output(dataset:'want');
  end;
run;
Results:
                          Days
       Trans    Trans    Between
Obs      A        B       Trans     Flag
  1      A        2         1         1
  2      B        3         3         1
  3      C        1         1         1
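One way to confirm that a result like this satisfies the uniqueness requirement is to compare the row count with the distinct counts of each key. The sketch below recreates WANT from the printed results so that it runs on its own. In practice the dataset would come from h.output(), and the column aliases n_rows, n_transA and n_transB are hypothetical.

```sas
/* WANT as printed above; normally produced by h.output() */
data want;
  input TransA $ TransB $ DaysBetweenTrans Flag;
cards;
A 2 1 1
B 3 3 1
C 1 1 1
;

proc sql;
  /* all three counts are equal when TransA and TransB are each unique */
  select count(*)               as n_rows,
         count(distinct TransA) as n_transA,
         count(distinct TransB) as n_transB
  from want;
quit;
```

For these three rows every count is 3, so each TransA and each TransB is used exactly once.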

SAS sum observations not in a group, by multiple groups

This post follows this one: SAS sum observations not in a group, by group
Sadly, my minimal example there was a bit too minimal, and I wasn't able to apply the answers to my data.
Here is a complete example. What I have is:
data have;
input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each observation, I want a new variable "sum" holding the sum of all values in the same subgroup (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 $ group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either DATA steps or PROC SQL (I am running this on around 30 million observations, and PROC MEANS and similar procedures seemed slower than those on previous similar computations).
My issue with the solutions provided in the linked post is that they use the total value of the column, and I don't know how to change them to use the total within the subgroup.
Any ideas, please?
A SQL solution will join all data to an aggregating select:
proc sql;
create table want as
select have.group1, have.group2, have.group3, have.value
, aggregate.sum - value as sum
from
have
join
(select group1, group2, sum(value) as sum
from have
group by group1, group2
) aggregate
on
aggregate.group1 = have.group1
& aggregate.group2 = have.group2
;
quit;
SQL can be slower than a hash solution, but SQL code is understood by more people than SAS DATA step code involving hashes (which can be faster than SQL).
data want2;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group1', 'group2');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve group sum.;
sum = sum - value; * subtract from group sum;
output;
end;
stop;
run;
The SAS documentation touches on SUMINC and has some examples.
The expected output in the question does not address this concept: for each row, compute the tier 2 sum that excludes the tier 3 group this row is in.
A hash-based solution would require tracking both the two-level and the three-level sums:
data want2;
if 0 then set have; * prep pdv;
declare hash T2 (suminc:'value'); * hash for two (T)iers;
T2.defineKey('group1', 'group2'); * one hash record per combination of group1, group2;
T2.defineDone();
declare hash T3 (suminc:'value'); * hash for three (T)iers;
T3.defineKey('group1', 'group2', 'group3'); * one hash record per combination of group1, group2, group3;
T3.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
T2.ref(); * adds value to internal sum of hash data record;
T3.ref();
end;
T2_cardinality = T2.num_items;
T3_cardinality = T3.num_items;
put 'NOTE: |T2| = ' T2_cardinality;
put 'NOTE: |T3| = ' T3_cardinality;
do while (not last_have);
set have end=last_have;
T2.sum(sum:t2_sum);
T3.sum(sum:t3_sum);
sum = t2_sum - t3_sum;
output;
end;
stop;
drop t2_: t3_:;
run;
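For readers who prefer SQL, the same two-tier idea (tier 2 sum minus tier 3 sum) could be sketched with two aggregating subqueries. This is an illustration rather than part of the original answer. The output name want3 and the aliases h, t2, t3, t2_sum and t3_sum are hypothetical, and only the first four rows of HAVE are reproduced so that the block runs on its own.

```sas
/* first four rows of HAVE from the question; group2 and group3 are character */
data have;
  input group1 group2 $ group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
;

proc sql;
  create table want3 as
  select h.group1, h.group2, h.group3, h.value,
         t2.t2_sum - t3.t3_sum as sum   /* subgroup total minus own group3 total */
  from have as h
  join (select group1, group2, sum(value) as t2_sum
        from have
        group by group1, group2) as t2
    on t2.group1 = h.group1 and t2.group2 = h.group2
  join (select group1, group2, group3, sum(value) as t3_sum
        from have
        group by group1, group2, group3) as t3
    on t3.group1 = h.group1 and t3.group2 = h.group2 and t3.group3 = h.group3
  ;
quit;
```

With these four rows the (1, A) total is 10, so the X rows get 10 - 6 = 4 and the Y rows get 10 - 4 = 6. Note that this differs from the "total minus own value" figures listed in the question's WANT table.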

How to define input variable when datalines have spaces for a variable

I want to read two different strings, each of which may contain spaces, into the same dataset.
I tried to delimit the values with "" but it didn't work. Ideally I would not have to write the "" at all, only the strings themselves. I searched a lot but did not find anything related.
Could you please help me?
data ecl.dim_produtos;
input id_produt id_departament id_order id_business id_portfolio initials $4. long_name $40. short_name $30.;
datalines;
1 1 10201 4 1 PZC "Puzzle Crédito" "Puzzle Crédito"
2 1 10202 4 1 PZR "Puzzle Reestruturados" "Reestruturados"
3 2 10207 30 1 DBO "Banca Online" "Banca Online"
4 3 10210 60 1 CLB "Colaboradores" "Colaboradores"
5 1 10203 4 1 PZF "Puzzle Formação" "Code Academy"
6 4 10205 5 1 HIP "Hipoteca Inversa" "Hip. Inversa"
7 5 10206 25 1 EMP "DEMP" "DEMP"
8 6 10208 45 1 NCO "NewCo" "NewCo"
9 6 10211 70 1 LDRC "Lendrock" "Lendrock"
10 4 10209 50 1 OTI "Otima Provision" "Otima"
11 6 10001 1 1 LDC "Lendico" "Lendico"
12 6 10007 1 1 MIBL "Market Invoice BL - EUR" "Market Invoice BL"
13 6 10003 1 1 CRS "CreditShelf" "CreditShelf"
14 6 10005 1 1 FUN "Funding Circle" "Funding Circle"
15 6 10002 1 1 RAI "Raize" "Raize"
16 4 10204 5 1 FLX "Flex" "Flex"
17 6 10101 2 1 AUX "Auxmoney" "Auxmoney"
18 6 10009 2 1 UPG "Upgrade - EUR" "Upgrade"
19 6 10104 2 1 PRO "Prodigy Finance" "Prodigy"
20 6 10102 2 1 FEL "Fellow Finance" "Fellow"
21 6 10008 1 1 ASZ "Assetz - EUR" "Assetz"
22 6 10010 2 1 LDB "Lendable - EUR" "Lendable"
23 6 10004 1 1 LIN "Linked Finance" "Linked"
24 6 10103 2 1 LDR "Lendrock" "Lendrock"
25 6 10105 3 1 EDX "Edebex" "Edebex"
26 6 10006 1 1 CAM "Camomille - FC" "Camomille"
27 6 10106 3 1 MIN "Market Invoice - EUR" "Market Invoice"
90 0 99991 102 2 DIV "Dívida Pública - EUR" "Dívida Pública"
91 6 99992 103 2 CRP "Obrigações Corporate - EUR" "Obrigações Corporate"
92 0 99990 101 3 SDA "Disp. Aplicações OIC - EUR" "Disp. Aplicações OIC"
9999 0 999999 999 99 TOT "Total Patrimonial - EUR" "Total Patrimonial"
;
run;
The most reliable approach would be to:
define the variables of the INPUT statement using a LENGTH or ATTRIB statement,
use INFILE options to specify how the data lines are parsed by INPUT,
take the $ out of the INPUT statement.
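Example (leave data lines as-is; only the first two are shown here):
data ecl.dim_produtos;
length
id_produt id_departament id_order id_business id_portfolio 8
initials $4
long_name $40
short_name $30
;
infile cards dsd dlm=" ";
input id_produt id_departament id_order id_business id_portfolio initials long_name short_name;
datalines;
1 1 10201 4 1 PZC "Puzzle Crédito" "Puzzle Crédito"
2 1 10202 4 1 PZR "Puzzle Reestruturados" "Reestruturados"
;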
For the case of wanting data lines without double quotes, you will have to modify the data lines to separate the values with two or more spaces and use the & modifier for those variables in a list-style INPUT statement.
You could also separate the values in the data lines with a tab character and use DLM='09'x, although you might have some trouble seeing and entering tabs in the SAS editor.
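The two-or-more-spaces approach can be sketched as follows, assuming the lengths are declared up front so that no informat is needed in the INPUT statement. The dataset name test is only for illustration. The two data lines are taken from the question with the quotes removed and two spaces placed after each long name.

```sas
data test;
  length initials $4 long_name $40 short_name $30;
  /* & lets a value contain single blanks; two or more blanks end the value */
  input id_produt id_departament id_order id_business id_portfolio
        initials long_name & short_name &;
datalines;
3 2 10207 30 1 DBO Banca Online  Banca Online
6 4 10205 5 1 HIP Hipoteca Inversa  Hip. Inversa
;

proc print data=test;
run;
```

The values themselves must never contain two adjacent blanks, otherwise they would be split at that point.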
First, make sure to use the : modifier if you want to include informat specifications in the INPUT statement, to avoid switching between list and formatted input modes.
If you can ensure that you have at least two spaces between the values (and that the values themselves do NOT have adjacent spaces inside them), you can use the & modifier.
data test;
input id_produt id_departament id_order id_business id_portfolio
initials &:$4. long_name &:$40. short_name &:$30.
;
datalines;
1 1 10201 4 1 PZC  Puzzle Crédito  Puzzle Crédito
2 1 10202 4 1 PZR  Puzzle Reestruturados  Reestruturados
;
Or keep the quotes and make sure there is exactly one space between each value (and don't indent the datalines!) and add the DSD option.
data test;
infile datalines dsd dlm=' ' truncover ;
input id_produt id_departament id_order id_business id_portfolio
initials :$4. long_name :$40. short_name :$30.
;
datalines;
1 1 10201 4 1 PZC "Puzzle Crédito" "Puzzle Crédito"
2 1 10202 4 1 PZR "Puzzle Reestruturados" "Reestruturados"
;
Or use a different delimiter, with or without the DSD option.
data test;
infile datalines dsd dlm='|' truncover ;
input id_produt id_departament id_order id_business id_portfolio
initials :$4. long_name :$40. short_name :$30.
;
datalines;
1|1|10201|4|1|PZC|Puzzle Crédito|Puzzle Crédito
2|1|10202|4|1|PZR|Puzzle Reestruturados|Reestruturados
;

How to do a last observation carrying forward using SAS PROC SQL

I have the data below. I want to write SAS PROC SQL code that carries forward the last non-missing value for each patient (ptno).
data sda;
input ptno visit weight;
format ptno z3. ;
cards;
1 1 122
1 2 123
1 3 .
1 4 .
2 1 156
2 2 .
2 3 70
2 4 .
3 1 60
3 2 .
3 3 112
3 4 .
;
run;
proc sql noprint;
create table new as
select ptno,visit,weight,
case
when weight = . then weight
else .
end as _weight_1
from sda
group by ptno,visit
order by ptno,visit;
quit;
The SQL code above does not work.
The desired output looks like this:
ptno visit weight
1 1 122
1 2 123
1 3 123
1 4 123
2 1 156
2 2 .
2 3 70
2 4 70
3 1 60
3 2 .
3 3 112
3 4 112
Since you effectively have a row number (visit), you can do this in SQL, though it's much slower than the data step.
Here it is, broken out into a separate column for demonstration purposes; in your case you will want to coalesce this into one column.
Basically, you need a subquery that determines the maximum visit number less than the current one that has a non-missing weight, and then join that to the table to get the weight.
proc sql;
select ptno, visit, weight,
(
select weight
from sda A,
(select ptno, max(visit) as visit
from sda D
where D.ptno=S.ptno
and D.visit<S.visit
and D.weight is not null
group by ptno
) V
where A.visit=V.visit and A.ptno=V.ptno
)
from sda S
;
quit;
Although you don't describe it that way, you do not carry forward VISIT 1, right?
I don't know why you would want to do this using SQL. In SAS a data step is much better suited to the task. I like using the "update trick". If you're interested in how this works I will leave it to you to study the UPDATE statement.
data locf;
update sda(obs=0 keep=ptno) sda;
by ptno;
output;
if visit eq 1 then call missing(weight);
run;
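For comparison, a plain last-observation-carried-forward step can be sketched with RETAIN. This is not the update trick above, and unlike it this version also carries the VISIT 1 value forward. The names sda_small, locf_retain and last_weight are hypothetical, and only patient 2 from the question is reproduced so that the block runs on its own.

```sas
/* patient 2 from the question, already sorted by ptno and visit */
data sda_small;
  input ptno visit weight;
cards;
2 1 156
2 2 .
2 3 70
2 4 .
;

data locf_retain;
  set sda_small;
  by ptno;
  retain last_weight;                       /* keeps its value across rows */
  if first.ptno then last_weight = .;       /* never carry across patients */
  if not missing(weight) then last_weight = weight;
  else weight = last_weight;                /* fill the gap with the last value seen */
  drop last_weight;
run;
```

For these four rows the resulting weights are 156, 156, 70, 70, whereas the desired output in the question leaves visit 2 missing.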

How to add a column with a specific string depending on the 4th column with awk

I have a file in which the 4th column has numbers.
If the 4th column is greater than 2, I want to add a 5th column containing the string gain; otherwise, the 5th column should contain the string loss.
Input
1 762097 6706109 6
1 7202143 7792617 3
1 8922949 9815420 1
1 10502346 11074110 3
1 11188922 12267136 1
1 12566829 13910626 3
Desired output:
1 762097 6706109 6 gain
1 7202143 7792617 3 gain
1 8922949 9815420 1 loss
1 10502346 11074110 3 gain
1 11188922 12267136 1 loss
1 12566829 13910626 3 gain
How should I do this with awk?
Use awk like this:
$ awk '{print $0, ($4>2?"gain":"loss")}' file
1 762097 6706109 6 gain
1 7202143 7792617 3 gain
1 8922949 9815420 1 loss
1 10502346 11074110 3 gain
1 11188922 12267136 1 loss
1 12566829 13910626 3 gain
As you can see, it prints the full line ($0) followed by a string. The string is chosen from the value of $4 using the ternary operator.