SAS - Importing variable length Binary records without delimiters - file-io

I have a binary data set with no delimiters and no fixed length records. I know each record contains 22 bytes of data then an unknown number of 23 byte blocks, up to 50 blocks. The problem is that it's only reading 1 line of 32767 bytes for a total of 728 obs. I'm expecting 2.7MM output obs. How can I make this read the input file to the end? I've already tried adding an "OBS=" option and "lrecl=" option to the infile line. Adding the "end=" option had no effect on the result.
DATA INFILE.MYDATA (drop= i);
INFILE "&Path./UGLYDATA" end=eof;
INPUT
MY_KEY s370fPD9.
...
OCCURS s370fPD2.
@
;
ARRAY MyData{50} MyData1-MyData50;
...
ARRAY Filler{50} $ Filler1-Filler50;
DO I = 1 TO min(50,OCCURS);
INPUT
MyData{I} s370fPD4.
...
Filler{I} $ebcdic10.
@@
;
End;
RUN;
Relevant Log:
NOTE: 1 record was read from the infile "UGLYDATA".
The minimum record length was 32767.
The maximum record length was 32767.
One or more lines were truncated.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set INFILE.MYDATA has 728 observations and 356 variables.
NOTE: Compressing data set INFILE.MYDATA decreased size by 47.06 percent.
Compressed is 9 pages; un-compressed would require 17 pages.
NOTE: DATA statement used (Total process time):
real time 2.69 seconds
user cpu time 0.02 seconds
system cpu time 0.11 seconds
memory 1890.40k
OS Memory 10408.00k
Timestamp 12/07/2021 05:17:34 PM
Step Count 1 Switch Count 0
Page Faults 3
Page Reclaims 1028
Page Swaps 0
Voluntary Context Switches 272
Involuntary Context Switches 1226
Block Input Operations 309648
Block Output Operations 2312

Sounds like the file does not consist of lines of text. So try using RECFM=N on your INFILE statement so that SAS will not look for a LINEFEED character (or a CARRIAGE RETURN and LINEFEED combination) to mark the end of each line.
INFILE "&Path./UGLYDATA" recfm=n ;
If you are unsure what the file contains, just run a simple data step to look at the first few hundred bytes and then figure it out. If any of the bytes in a "line" are not printable characters, the LIST statement will include the hex codes for the bytes under the lines when it writes to the SAS log.
data _null_;
INFILE "&Path./UGLYDATA" recfm=f lrecl=100 obs=10 ;
input;
list;
run;

Per @Tom, indeed RECFM=N.
Example:
Create and read back a binary file.
filename foo '%temp%/foo.bin' recfm=n;
data _null_;
file foo;
call streaminit(2021);
filler = repeat('*', 10);
do recnum = 1001 to 1010;
put recnum s370fPD9. @;
put filler $char11. @;
occurs = rand('integer',1,26);
put occurs s370fPD2. @;
do z = 0 to occurs-1;
record = repeat(byte(rank('A')+z), 22);
put record $ebcdic23.;
end;
putlog 'NOTE: ' recnum= occurs=;
end;
stop;
run;
data want;
infile foo;
* read master;
input recnum s370fPD9. filler $char11. occurs s370fPD2.;
* read details;
do index = 1 to occurs;
input content $ebcdic23.;
output;
end;
run;
dm 'vt want';

Related

Convert character variable to numeric variable in SAS

I'm trying to convert a character variable to a numeric variable, but unfortunately I'm really struggling. Help would be appreciated!
I keep getting the following error: 'Invalid argument to function INPUT at line 3259 column 17'
Syntax:
Data want;
Set have;
Dosis_num = input(Dosis, best12.);
run;
I have also tried multiplying the variable by 1. This doesn't work either.
The variable looks like this:
Dosis
155
201
2.1
0.8
123.80
12.0
3333.4
00.6
Want:
Dosis_num
155.0
201.0
2.1
0.8
123.8
12.0
3333.4
0.6
Thanks a lot!
The code will work with the data you show. So either the values in the character variable are not what you think or you are not using the right variable name for the variable.
The code only asks INPUT() to use the first 12 bytes of the character variable. Normally you don't need to restrict the number of characters the INPUT() function reads: it does not care if the width of the informat is larger than the length of the string being read. So just use 32. as the informat, since 32 is the maximum width that the normal numeric informat can read. Note that BEST is the name of a FORMAT; if you use it as the name of an informat it is just an alias for the normal numeric informat.
If the variable is longer than 12 bytes, then perhaps it contains leading spaces (note that ODS output does not display leading spaces properly). In that case use the LEFT() function to remove them.
Dosis_num = input(left(Dosis), 32.);
The typical thing to do here is to find out what's actually in the character variable. There is likely something in there that is causing the issue.
Try this:
data have;
input @1 Dosis $8.;
datalines;
155
201
2.1
0.8
123.80
12.0
3333.4
00.6
;;;;
run;
data check;
set have;
put dosis hex32.;
run;
What I get is this:
83 data check;
84 set have;
85 put dosis hex32.;
86 run;
3135352020202020
3230312020202020
322E312020202020
302E382020202020
3132332E38302020
31322E3020202020
333333332E342020
30302E3620202020
NOTE: There were 8 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.CHECK has 8 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
All those 2020202020 are spaces, which should be there (all strings are space-padded to full length). Period/Decimal Point is 2E, Digits are 3x where x is the digit (because the ASCII for 0 is 30, not because of any other reason). So for example for the last one, 00.6, 30 means zero, 30 means zero, 2E means period, and 36 means 6.
Check to make sure that you don't have any other characters other than digits (3x) and period (2e) and space (20).
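That check can also be automated instead of reading the hex by eye. The following is only a sketch: the sample data set and the names suspect and bad_pos are made up for illustration, and the allowed list covers just digits, period and space, so legitimate values with a sign or an exponent would be flagged too.

```sas
/* Hypothetical sample: two clean values, one decimal comma, one trailing tab */
data have;
   length Dosis $8;
   Dosis = '155';           output;
   Dosis = '2.1';           output;
   Dosis = '12,0';          output;
   Dosis = '0.8' || '09'x;  output;
run;

/* VERIFY returns the position of the first character that is NOT in the list, or 0 */
data suspect;
   set have;
   bad_pos = verify(Dosis, '0123456789. ');
   if bad_pos > 0;                              /* keep only the suspicious rows */
   put Dosis= bad_pos= +1 Dosis $hex16.;        /* show the raw bytes in the log */
run;
```

Only the rows with an unexpected character should end up in suspect, and bad_pos tells you which byte of the hex dump to look at.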
The other thing to verify is whether the values use . as the decimal separator and not , as many European systems do. As far as I know, a decimal comma needs the COMMAXw.d (or NUMXw.d) informat rather than the plain numeric one. If the values merely contain commas as thousands separators, you can just try the COMMAw. informat (comma12. is sufficient if 12 is plenty, and don't include anything after the period), as anything that 12. can read can also be read by COMMAw..

different variables with the same name in a sas dataset

I have a text file where two different columns are having the same name. As shown in the following figure.
Let's say for SystBP, I need to change the first SystBP to SystBP_B and the second SystBP to SystBP_E.
Could someone kindly offer me some help on this?
When programming in Base SAS, you should not always expect SAS to read column names from a text file and interpret them as variable names.
You have to tell SAS where the first data row is, where the values are written and how they should be interpreted (text, number, date, ...). You do that with an INFILE and an INPUT statement in a data step.
As you write the code yourself, you have complete control.
data READ_FROM_TXT;
infile "C:\myFolder\myFile.txt" firstobs=3 truncover;
* firstobs=3 makes SAS skip the first 2 lines of the file;
* truncover avoids jumping to the next line when the last variable is missing or too short ;
input
@01 ID 2.
@05 Week 4.
@11 SystBP_B 6.
@19 DiastBP_B 6.
...
@41 SystBP_E 6.
@49 DiastBP_E 6.
...
;
* @11 SystBP_B 6. instructs SAS to interpret positions 11 to 16 as a number
* and assign the value to variable SystBP_B;
run;
As you inserted the data as an image, not as text, using markup, I had to guess the positions, so you will have to correct them.
I would make timing into observations.
data test;
infile cards4 firstobs=2;
input id :$8. week @;
do time = 'STR','END';
input SystBP DiastBP Pulse Stress @;
output;
end;
cards;
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
;;;;
run;
The INFILE option FIRSTOBS= will let you INPUT the data starting in row 3.
Data file: C:\temp\bp-survey.txt
Start End
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
...
Program
filename survey 'c:\temp\bp-survey.txt';
data want;
infile survey firstobs=3;
input
ID Week
SystBP_start DiastBP_start Pulse_start Stress_start
SystBP_end DiastBP_end Pulse_end Stress_end
;
run;
ods html ;
proc print data=want;
run;

Check the number of row in SAS dataset

I am running the command below to check the number of rows in a SAS data set, but it reports 60 records although the data set has 247 records.
Is there any other way to do this with a Unix command?
UNIX command:
awk 'END {print NR}' /home/user/check.sas7bdat
You need to write a SAS program to output the number of observations for you. The structure of the sas7bdat file is complicated.
data _null_;
file stdout;
set "&sysparm" nobs=nobs;
put "NOBS:" nobs;
stop;
run;
I named this "test.sas"
This reads in the data set specified in a passed system parameter and outputs to STDOUT the number of observations.
I created a test data set in my home directory like:
libname d "~/";
data d.test;
do i=1 to 1000;
output;
end;
run;
From the command line run
<path to sas>/sas test.sas -sysparm ~/test.sas7bdat
I get NOBS:1000 back.
What about just doing it in a SAS data step? You can fetch the number of rows with the NOBS= option of the SET statement.
/* Test dataset */
data have;
a = 1;output;
a = 2;output;
a = 3;output;
run;
data _null_;
set have NOBS = size;
call symputx("size", size);
run;
%put NOTE: Number of records: &size.;
NOTE: Number of records: 3
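Another option, sketched here rather than taken from the answers above, is to ask the SAS metadata instead of opening the data set in a data step. It assumes the data set is WORK.HAVE; the macro variable name size is just carried over from the example. Note that the nobs column counts physical observations, so rows marked as deleted are still included.

```sas
/* Test dataset */
data have;
   do a = 1 to 3;
      output;
   end;
run;

/* Look the row count up in the dictionary tables; LIBNAME and MEMNAME are stored in upper case */
proc sql noprint;
   select nobs into :size trimmed
   from dictionary.tables
   where libname = 'WORK' and memname = 'HAVE';
quit;

%put NOTE: Number of records: &size.;
```

This avoids reading the rows at all, which may matter for large tables.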

Output to a text file

I need to output lots of different datasets to different text files. The datasets share some common variables that need to be output but also have quite a lot of different ones. I have loaded these different ones into a macro variable separated by blanks so that I can macroize this.
So I created a macro which loops over the datasets and outputs each into a different text file.
For this purpose, I used a put statement inside a data step. The PUT statement looks like this:
PUT (all the common variables shared by all the datasets), (macro variable containing all the dataset-specific variables);
E.g.:
%MACRO OUTPUT();
%DO N=1 %TO &TABLES_COUNT;
DATA _NULL_;
SET &&TABLE&N;
FILE 'PATH/&&TABLE&N..txt';
PUT a b c d "&vars";
RUN;
%END;
%MEND OUTPUT;
Where &vars is the macro variable containing all the variables needed for outputting for a dataset in the current loop.
Which gets resolved, for example, to:
PUT a b c d special1 special2 special5 ... special329;
Now the problem is, the quoted string can only be 262 characters long. And some of my datasets I am trying to output have so many variables to be output that this macro variable which is a quoted string and holds all those variables will be much longer than that. Is there any other way how I can do this?
Do not include quotes around the list of variable names.
put a b c d &vars ;
There should not be any limit to the number of variables you can output, but if the length of the output line gets too long SAS will wrap to a new line. The default line length is currently 32,767 (but older versions of SAS use 256 as the default line length). You can actually set that much higher if you want. So you could use 1,000,000 for example. The upper limit probably depends on your operating system.
FILE "PATH/&&TABLE&N..txt" lrecl=1000000 ;
If you just want to make sure that the common variables appear at the front (that is you are not excluding any of the variables) then perhaps you don't need the list of variables for each table at all.
DATA _NULL_;
retain a b c d ;
SET &&TABLE&N;
FILE "&PATH/&&TABLE&N..txt" lrecl=1000000;
put (_all_) (+0) ;
RUN;
I would tackle this by having one PUT statement per variable. Use the trailing @ modifier so that you don't get a new line.
For example:
data test;
a=1;
b=2;
c=3;
output;
output;
run;
data _null_;
set test;
put a @;
put b @;
put c @;
put;
run;
Outputs this to the log:
800 data _null_;
801 set test;
802 put a @;
803 put b @;
804 put c @;
805 put;
806 run;
1 2 3
1 2 3
NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.07 seconds
cpu time 0.03 seconds
So modify your macro to loop through the two sets of values using this syntax.
Not sure why you're talking about quoted strings: you would not quote the &vars argument.
put a b c d &vars;
not
put a b c d "&vars";
There's a limit there, but it's much higher (64k).
That said, I would do this in a data driven fashion with CALL EXECUTE. This is pretty simple and does it all in one step, assuming you can easily determine which datasets to output from the dictionary tables in a WHERE statement. This has a limitation of 32kiB total, though if you're actually going to go over that you can work around it very easily (you can separate out various bits into multiple calls, and even structure the call so that if the callstr hits 32000 long you issue a call execute with it and then continue).
This avoids having to manage a bunch of large macro variables (your &VAR will really be &&VAR&N and will be many large macro variables).
data test;
length vars callstr $32767;
do _n_ = 1 by 1 until (last.memname);
set sashelp.vcolumn;
where memname in ('CLASS','CARS');
by libname memname;
vars = catx(' ',vars,name);
end;
callstr = catx(' ',
'data _null_;',
'set',cats(libname,'.',memname),';',
'file',cats('"c:\temp\',memname,'.txt"'),';',
'put',vars,';',
'run;');
call execute(callstr);
run;

SAS - @ symbol in the INPUT statement

I have the following program, but don't understand what the @ symbol at the end of the INPUT lines does:
data colors;
input @1 Var1 $ @8 Var2 $ @;
input @1 Var3 $ @8 Var4 $ @;
datalines;
RED ORANGE YELLOW GREEN
BLUE INDIGO PURPLE VIOLET
CYAN WHOTE FICSIA BLACK
GRAY BROWN PINK MAGENTA
run;
proc print data=colors;
run;
The output produced without @ at the end of the INPUT line is different from the output with @.
Can you please clarify what the @ at the end of the 2nd and 3rd lines (the INPUT statements) does?
@ at the end of an input statement means: do not advance the line pointer after the semicolon. @@ means: do not advance the line pointer when the DATA step iteration ends at the run statement either.
Normally an INPUT statement implicitly advances the line pointer by one line when it reaches the semicolon. So:
data want;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Will return
1 2
5 6
If you want to read 3 4 into another observation, you might do something like:
data want;
input a b @;
output;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Which gives
1 2
3 4
5 6
7 8
Similarly you could simply write
data want;
input a b @@;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
To get the same result: @@ holds the line pointer even across the run statement, that is, into the next iteration of the DATA step. (It still advances once it hits the end of the line.)
In summary: I think you probably don't want the trailing @ in this case. The INPUT statements do not seem to fit the data you are reading. With the trailing @ and the @1/@8 column pointers, you are reading the same data into Var1 and Var3, and the same data into Var2 and Var4, because the same line is read twice. Either way, you are not reading in what the data appears to be. You would be better off with:
input Var1 $ Var2 $ @;
input Var3 $ Var4 $;
Or, more simply:
input Var1 $ Var2 $ Var3 $ Var4 $;
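As a complete step, the two-statement variant could look like the sketch below. It reuses the first two data lines from the question and assumes default list input (character variables of length 8, values separated by blanks).

```sas
data colors;
   input Var1 $ Var2 $ @;   /* read two values and hold the current line */
   input Var3 $ Var4 $;     /* continue on the same line, then release it */
   datalines;
RED ORANGE YELLOW GREEN
BLUE INDIGO PURPLE VIOLET
;
run;

proc print data=colors;
run;
```

Each data line should then become one observation with four variables, since the second INPUT picks up where the first one stopped instead of restarting at column 1.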
Official details from the SAS support site, annotated:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146292.htm
Using Line-Hold Specifiers
Line-hold specifiers keep the pointer on the current input record when
a data record is read by more than one INPUT statement (trailing @)
Use a single trailing @ to allow the next INPUT statement to read from the same record.
Normally, each INPUT statement in a DATA step reads a new data record into the input buffer. When you use a trailing @, the following occurs:
The pointer position does not change.
No new record is read into the input buffer.
The next INPUT statement for the same iteration of the DATA step continues to read the same record rather than a new one.
SAS releases a record held by a trailing @ when
a null INPUT statement executes:
input;
an INPUT statement without a trailing @ executes
the next iteration of the DATA step begins.
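A common use of these rules is to peek at a record before deciding how to read it. The sketch below is illustrative only: the data set name, variable names and data lines are made up.

```sas
data males;
   input sex $1. @;                  /* read the first byte and hold the record */
   if sex = 'M' then
      input @3 name $ age;           /* no trailing @, so the record is released here */
   else
      delete;                        /* the held record is released when the next iteration begins */
   datalines;
M Alfred 14
F Alice 13
M Henry 14
;
run;
```

Because DELETE ends the current iteration, the held line is dropped without a second INPUT, and only the rows starting with M are written out.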