different variables with the same name in a sas dataset - input

I have a text file where two different columns have the same name, as shown in the following figure.
Let's say for SystBP, I need to change the first SystBP to SystBP_B and the second SystBP to SystBP_E.
Could someone kindly offer me some help on this?

When programming in SAS Base, you cannot always expect SAS to read column names from a text file and interpret them as variable names.
You have to tell SAS which row is the first data row, where the values are located and how they should be interpreted (text, number, date, ...). You do that with an infile and an input statement in a data step.
As you write the code yourself, you have complete control.
data READ_FROM_TXT;
infile "C:\myFolder\myFile.txt" firstobs=3 truncover;
* firstobs=3 makes SAS skip the first 2 lines of the file;
* truncover avoids jumping to the next line when the last variable is missing or too short ;
input
@01 ID 2.
@05 Week 4.
@11 SystBP_B 6.
@19 DiastBP_B 6.
...
@41 SystBP_E 6.
@49 DiastBP_E 6.
...
;
* @11 SystBP_B 6. instructs SAS to interpret positions 11 to 16 as a number
* and assign the value to variable SystBP_B;
run;
As you inserted the data as an image, not as text, using markup, I had to guess the positions, so you will have to correct them.

I would make timing into observations.
data test;
infile cards firstobs=2;
input id :$8. week @;
do time = 'STR','END';
input SystBP DiastBP Pulse Stress @;
output;
end;
cards;
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
;;;;
run;

The INFILE option FIRSTOBS= will let you INPUT the data starting in row 3.
Data file: C:\temp\bp-survey.txt
Start End
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
...
Program
filename survey 'c:\temp\bp-survey.txt';
data want;
infile survey firstobs=3;
input
ID Week
SystBP_start DiastBP_start Pulse_start Stress_start
SystBP_end DiastBP_end Pulse_end Stress_end
;
run;
ods html ;
proc print data=want;
run;
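For anyone who wants to try this without creating the file first, here is a minimal self-contained sketch of the same idea using in-stream data. It assumes the values are separated by blanks and that the two header lines look like the listing above; the only data row is the one shown in the question.

```sas
data want;
  /* firstobs=3 skips the two header lines in the in-stream data */
  infile datalines firstobs=3 truncover;
  input
    ID Week
    SystBP_start DiastBP_start Pulse_start Stress_start
    SystBP_end   DiastBP_end   Pulse_end   Stress_end
  ;
datalines;
Start End
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
;
run;

proc print data=want;
run;
```

Because the variable names come from the INPUT statement and not from the header row, the duplicate column names in the file never reach the dataset.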

Related

SAS delete and group by

Simplified version of the dataset I have is:
DATA HAVE;
INPUT ID match1 $ match2 $ not_relevant;
DATALINES;
1 "ABC" "ABC" 4
1 "XYZ" "XYZ" 29
2 "QQQ" "AAA" 5
2 "ABC" "ABC" 9
3 "EFG" "EFG" 7
3 "DEF" "DEF" 12
3 "LMK" "LMK" 16
3 "LMK" . 29
;RUN;
I am looking to compare match1 and match2, and if match1 does not equal match2 on any row for a given ID, I would like to remove all of the rows with that ID. So for this example dataset I want to remove all of ID 2 (rows 3 and 4), since row 3 does not have a match between match1 and match2. All I can figure out how to do so far is to delete the rows where they don't match, which isn't terribly helpful for this application. I assume it would be easier to make it a new data set with some WHERE conditions, but I am unsure how to begin there. Any ideas / advice?
EDIT:
Apologies, I dumbed down my dataset too much and forgot about an important exception. Note in my new dataset (I only added one row to the end). I do NOT want to delete group 3, since match2 is blank. I only want to delete a group where match2 is not blank and match1 does not equal match2.
Thanks
There's a few ways to do this. One would be to just construct a dataset of IDs that have non-matching rows, then do a merge or a SQL join and remove anything that matched this list.
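As a rough sketch of that first approach (not code from the original answer), the list of offending IDs can be built in a subquery and excluded in one PROC SQL step. It assumes the exception from the question's edit, i.e. a blank match2 does not count as a nonmatch.

```sas
proc sql;
  create table want as
  select *
  from have
  /* keep only IDs that never appear in the list of nonmatching IDs */
  where id not in (
    select id
    from have
    where not missing(match2) and match1 ne match2
  );
quit;
```

This needs no sorting, at the cost of scanning HAVE twice.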
However, my preferred option (partly because of speed, but also it's more straightforward once you understand how it works) is the DoW loop.
data want;
id_nonmatch = 0;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if match1 ne match2 then id_nonmatch = 1; *set the flag to 1 if we find a nonmatch;
end;
do _n_ = 1 by 1 until (last.id);
set have;
by id;
if id_nonmatch = 0 then output;
end;
run;
There are two set statements on the data step, each of which runs through the same dataset separately. If it doesn't make sense, throw a put _all_; inside each of the do loops - that will show you what it's doing. The first loop goes over all of the rows for one ID, checks if any violate the constraint, and if none do, the flag variable (id_nonmatch) stays 0. If one does, it becomes a 1 (and stays that way). Then, when it hits an ID boundary, it stops pulling records from the first set statement, and goes onto the second - re-pulling those same rows. Now, it outputs only when the flag is a zero.
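The loop above flags every row where match1 differs from match2. To honour the exception in the question's edit (a blank match2 should not remove the group), the test can be tightened. A sketch, assuming HAVE is sorted by ID and that dropping the helper flag from the output is wanted:

```sas
data want;
  id_nonmatch = 0;
  /* first pass over the ID group: look for a real nonmatch */
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    /* a blank match2 is not treated as a nonmatch */
    if not missing(match2) and match1 ne match2 then id_nonmatch = 1;
  end;
  /* second pass over the same ID group: output only clean groups */
  do _n_ = 1 by 1 until (last.id);
    set have;
    by id;
    if id_nonmatch = 0 then output;
  end;
  drop id_nonmatch;
run;
```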
This is very efficient because of buffering - unless your id groups are very large, the data step may be able to use buffers to keep the same rows in memory and not have to reread them from disk. (This will depend on your disk and buffers - and seems to help much less on flash than on physical disks [since there is not the additional benefit of the disk head not having to move] - so your mileage may vary here.)
Just to show this difference, here is a log showing that there isn't much additional time needed for the second read - when the record is reasonably sized. This benefit is less when the record is very small - I imagine there is more overhead involved. Note that the second read adds only 1/7 of the time of the first read to the total processing time!
69 data have;
70 call streaminit(7);
71 length strvar $1000;
72 do id = 1 to 100000;
73 do iter = 1 to 50;
74 x = rand('Uniform');
75 output;
76 end;
77 end;
78 run;
NOTE: Variable strvar is uninitialized.
NOTE: The data set WORK.HAVE has 5000000 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 5.20 seconds
cpu time 5.20 seconds
79
80
81 data _null_;
82 do _n_ = 1 by 1 until (last.id);
83 set have;
84 by id;
85 end;
86 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.37 seconds
cpu time 2.37 seconds
87
88
89 data _null_;
90 do _n_ = 1 by 1 until (last.id);
91 set have;
92 by id;
93 end;
94 do _n_ = 1 by 1 until (last.id);
95 set have;
96 by id;
97 end;
98 run;
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: There were 5000000 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 2.74 seconds
cpu time 2.73 seconds
It is easy to do this with an SQL query with a GROUP BY and HAVING clause.
proc sql;
create table want as
select *
from have
group by id
having not max( (match1 ne match2) and not missing(match2))
;
quit;
SAS evaluates boolean expressions as 1/0 for TRUE/FALSE so the MAX() of a series of TRUE/FALSE values will be TRUE if ANY of them are TRUE.
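To see the per-group flag that the HAVING clause tests, it can be selected on its own. This is only an illustrative sketch; the column name any_nonmatch is made up for the example.

```sas
proc sql;
  /* one row per ID: 1 if any row has a non-blank match2 that differs from match1 */
  select id,
         max( (match1 ne match2) and not missing(match2) ) as any_nonmatch
  from have
  group by id;
quit;
```

Groups where any_nonmatch is 1 are the ones that should be dropped from the result.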

SAS - Importing variable length Binary records without delimiters

I have a binary data set with no delimiters and no fixed length records. I know each record contains 22 bytes of data then an unknown number of 23 byte blocks, up to 50 blocks. The problem is that it's only reading 1 line of 32767 bytes for a total of 728 obs. I'm expecting 2.7MM output obs. How can I make this read the input file to the end? I've already tried adding an "OBS=" option and "lrecl=" option to the infile line. Adding the "end=" option had no effect on the result.
DATA INFILE.MYDATA (drop= i);
INFILE "&Path./UGLYDATA" end=eof;
INPUT
MY_KEY s370fPD9.
...
OCCURS s370fPD2.
@
;
ARRAY MyData{50} MyData1-MyData50;
...
ARRAY Filler{50} $ Filler1-Filler50;
DO I = 1 TO min(50,OCCURS);
INPUT
MyData{I} s370fPD4.
...
Filler{I} $ebcdic10.
@@
;
End;
RUN;
Relevant Log:
NOTE: 1 record was read from the infile "UGLYDATA".
The minimum record length was 32767.
The maximum record length was 32767.
One or more lines were truncated.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set INFILE.MYDATA has 728 observations and 356 variables.
NOTE: Compressing data set INFILE.MYDATA decreased size by 47.06 percent.
Compressed is 9 pages; un-compressed would require 17 pages.
NOTE: DATA statement used (Total process time):
real time 2.69 seconds
user cpu time 0.02 seconds
system cpu time 0.11 seconds
memory 1890.40k
OS Memory 10408.00k
Timestamp 12/07/2021 05:17:34 PM
Step Count 1 Switch Count 0
Page Faults 3
Page Reclaims 1028
Page Swaps 0
Voluntary Context Switches 272
Involuntary Context Switches 1226
Block Input Operations 309648
Block Output Operations 2312
Sounds like the file does not consist of lines of text. So try using RECFM=N on your INFILE statement so that SAS will not look for a LINEFEED character (or a CARRIAGE RETURN and LINEFEED combination) to mark the end of each line.
INFILE "&Path./UGLYDATA" recfm=n ;
If you are unsure what the file contains, just run a simple data step to look at the first few hundred bytes and then figure it out. If any of the bytes in a "line" are not printable characters, the LIST statement will include the hex codes for the bytes under the lines when it writes to the SAS log.
data _null_;
INFILE "&Path./UGLYDATA" recfm=f lrecl=100 obs=10 ;
input;
list;
run;
Per @Tom, indeed RECFM=N.
Example:
Create and read back a binary file.
filename foo '%temp%/foo.bin' recfm=n;
data _null_;
file foo;
call streaminit(2021);
filler = repeat('*', 10);
do recnum = 1001 to 1010;
put recnum s370fPD9. @;
put filler $char11. @;
occurs = rand('integer',1,26);
put occurs s370fPD2. @;
do z = 0 to occurs-1;
record = repeat(byte(rank('A')+z), 22);
put record $ebcdic23.;
end;
putlog 'NOTE: ' recnum= occurs=;
end;
stop;
run;
data want;
infile foo;
* read master;
input recnum s370fPD9. filler $char11. occurs s370fPD2.;
* read details;
do index = 1 to occurs;
input content $ebcdic23.;
output;
end;
run;
dm 'vt want';

SAS - Conditional input statement

I would like to use conditional if...then...else to read in the following data set, to read in using one input statement if source =1 and to read in using another input statement if source = 2. Not sure where my error is. This is what I have so far and the associated error. Not sure if the pointers are needed.
DATA results2;
infile datalines missover;
input @10 source 1. @;
if source = 1 then input @1 id @4 name $ @12 score;
else if source = 2 then input @1 id @4 score @12 name $;
DATALINES;
11 john 1 77
11 88 2 james
22 bobby 1 55
22 89 2 opey
;;;;
RUN;
It is correctly reading in the id but the source is not correctly matched to the id and having an issue with the name and score.
Thanks for helping!
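One possible way to approach this, sketched under the assumption that the fields are separated by blanks and that source is always the third field: hold the record with a trailing @, pull the source flag out of _INFILE_, and then pick the matching INPUT statement. With list input the column pointers are not needed.

```sas
data results2;
  infile datalines truncover;
  input @;                                     /* load the record and hold it */
  source = input(scan(_infile_, 3, ' '), 1.);  /* assumed: third field is source */
  if source = 1 then input id name $ source score;
  else if source = 2 then input id score source name $;
datalines;
11 john 1 77
11 88 2 james
22 bobby 1 55
22 89 2 opey
;
run;
```

If the real file is in fixed columns, the same pattern works with @n pointers, provided the column positions match the data exactly.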

How to import specific lines from dat file

I've got a .dat file with numbers that I need imported into a SAS dataset. However, there's plenty of information that I do not need, and I only want specific lines of data (e.g. every 6th line starting from line 1000, until I have 100 observations). I also require a unique identifier based on what is displayed on the first line.
So for example, the .dat file contains this:
DATANOTREQUIRED
DATANOTREQUIRED
DATANOTREQUIRED
UPDATE AAA_1111111_Q_BBBBBB_0_1_#
123.4,
123.5,
124.0,
124.1
DATANOTREQUIRED
DATANOTREQUIRED
DATANOTREQUIRED
UPDATE AAA_1111111__Q_BBBBBB_0_2_#
125.1,
126.0,
127.1,
130.0
What I want the eventual SAS dataset to look like is this
Identifier | Value
X.1. | 124.1
X.2. | 130.0
I'm using the infile in SAS and using input to point to line 1000 but I'm stuck and cannot get the SAS dataset I want. (Updated code based on contributors below)
data work.test;
infile '\\filepath\mydatasource.dat' dsd firstobs=1042 truncover;
input #8 ID :$40.
#4 Value1 :8.;
run;
but what I'm seeing now is that the header lines are appearing fine, but the first observation has a . and instead the first data value is appearing for the 2nd header line.
ID | Value1
UPDATE AAA_1111111_Q_BBBBBB_0_1_# | .
UPDATE AAA_1111111__Q_BBBBBB_0_2_# | 124.1
Here's an example assuming that you have the same number of rows between each header row:
data want;
if _n_ > 2 then stop; /*Stop after we've output 2 rows */
infile cards firstobs=6; /*Skip the first 5 lines in the file*/
input #1 @8 ID :$32.
#5 myvar :8.;
cards;
UPDATE AAA_1111111_Q_BBBBBB_0_1_#
123.4,
123.5,
124.0,
124.1
UPDATE AAA_1111111__Q_BBBBBB_0_2_#
125.1,
126.0,
127.1,
130.0
UPDATE AAA_1111111_Q_BBBBBB_0_3_#
123.4,
123.5,
124.0,
124.1
UPDATE AAA_1111111__Q_BBBBBB_0_4_#
125.1,
126.0,
127.1,
130.0
;
run;
Use the FIRSTOBS= option to skip the beginning of the file.
If there are always 5 records per block you could just read them individually.
data want;
infile rawdata dsd firstobs=1000 truncover;
input id :$40. / value / value / value / value ; /* each / moves to the next line; the last value read is kept */
run;
Or you could do something like this that should allow for a variable number of values per id and just keep the last one.
data want;
infile rawdata dsd firstobs=1000 end=eof;
input @ ;
length id $32 value 8 ;
retain id value;
if _infile_ =: 'UPDATE' then do;
if _n_ > 1 then output;
id = scan(_infile_,-1,' ');
end;
else input value;
if eof and _n_ > 1 then output;
run;

SAS - @ symbol in the INPUT statement

I have the following program, but don't understand what the @ symbol at the end of the INPUT lines does:
data colors;
input @1 Var1 $ @8 Var2 $ @;
input @1 Var3 $ @8 Var4 $ @;
datalines;
RED ORANGE YELLOW GREEN
BLUE INDIGO PURPLE VIOLET
CYAN WHOTE FICSIA BLACK
GRAY BROWN PINK MAGENTA
run;
proc print data=colors;
run;
The output produced without @ at the end of the INPUT lines is different from the output with @.
Can you please clarify what the @ at the end of the INPUT statements on the 2nd and 3rd lines does?
@ at the end of an input statement means: do not advance the line pointer after the semicolon. @@ means: do not advance the line pointer when the data step moves on to its next iteration (i.e. after the run statement) either.
Normally an input statement implicitly advances the line pointer by one after the semicolon. So:
data want;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Will return
1 2
5 6
If you want to read 3 4 into another line, then, you might do something like:
data want;
input a b @;
output;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Which gives
1 2
3 4
5 6
7 8
Similarly you could simply write
data want;
input a b @@;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
to get the same result: @@ holds the line pointer even across data step iterations. (It still advances once it hits the end of the line.)
In summary: I think you probably don't want the trailing @ in this case. The INPUT statements do not seem to fit the data you are reading. With the trailing @, you are reading the same data into Var1 and Var3, and the same data into Var2 and Var4, because the same line is read twice from column 1. Either way, you are not reading in what the data appears to be. You would be better off with:
input Var1 $ Var2 $ @;
input Var3 $ Var4 $;
Or, more simply:
input Var1 $ Var2 $ Var3 $ Var4 $;
Official details from the SAS support site, annotated:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146292.htm
Using Line-Hold Specifiers
Line-hold specifiers keep the pointer on the current input record when a data record is read by more than one INPUT statement (trailing @).
Use a single trailing @ to allow the next INPUT statement to read from the same record.
Normally, each INPUT statement in a DATA step reads a new data record into the input buffer. When you use a trailing @, the following occurs:
The pointer position does not change.
No new record is read into the input buffer.
The next INPUT statement for the same iteration of the DATA step continues to read the same record rather than a new one.
SAS releases a record held by a trailing @ when
a null INPUT statement executes:
input;
an INPUT statement without a trailing @ executes
the next iteration of the DATA step begins.
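A small sketch of the first release rule, using the same two data lines as the examples above (the dataset and variable names are only for illustration): the trailing @ holds the first record, the null INPUT lets go of it, and the last INPUT therefore reads from the following record.

```sas
data demo;
  input a b @;   /* read two values and hold the record */
  input;         /* null INPUT: releases the held record */
  input c d;     /* no record is held any more, so this reads the next record */
datalines;
1 2 3 4
5 6 7 8
;
run;

proc print data=demo;
run;
```

Without the null INPUT, the third statement would keep reading from the held record and pick up 3 and 4.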