How to import specific lines from dat file - file-io

I've got a .dat file with numbers that I need imported into a SAS dataset. However, there's plenty of information that I do not need, and I only want specific lines of data (e.g. every 6th line starting from line 1000, until I have 100 observations). I also require a unique identifier based on what is displayed on the first line.
So for example, the .dat file contains this:
DATANOTREQUIRED
DATANOTREQUIRED
DATANOTREQUIRED
UPDATE AAA_1111111_Q_BBBBBB_0_1_#
123.4,
123.5,
124.0,
124.1
DATANOTREQUIRED
DATANOTREQUIRED
DATANOTREQUIRED
UPDATE AAA_1111111__Q_BBBBBB_0_2_#
125.1,
126.0,
127.1,
130.0
What I want the eventual SAS dataset to look like is this
Identifier | Value
X.1. | 124.1
X.2. | 130.0
I'm using the infile in SAS and using input to point to line 1000 but I'm stuck and cannot get the SAS dataset I want. (Updated code based on contributors below)
data work.test;
infile '\\filepath\mydatasource.dat' dsd firstobs=1042 truncover;
input #8 ID :$40.
#4 Value1 :8.;
run;
but what I'm seeing now is that the header lines are appearing fine, but the first observation has a . and instead the first data value is appearing for the 2nd header line.
ID | Value1
UPDATE AAA_1111111_Q_BBBBBB_0_1_# | .
UPDATE AAA_1111111__Q_BBBBBB_0_2_# | 124.1

Here's an example assuming that you have the same number of rows between each header row:
data want;
if _n_ > 2 then stop; /*Stop after we've output 2 rows */
infile cards firstobs=6; /*Skip the first 5 lines in the file*/
input #1 #8 ID :$32.
#5 myvar :8.;
cards;
UPDATE AAA_1111111_Q_BBBBBB_0_1_#
123.4,
123.5,
124.0,
124.1
UPDATE AAA_1111111__Q_BBBBBB_0_2_#
125.1,
126.0,
127.1,
130.0
UPDATE AAA_1111111_Q_BBBBBB_0_3_#
123.4,
123.5,
124.0,
124.1
UPDATE AAA_1111111__Q_BBBBBB_0_4_#
125.1,
126.0,
127.1,
130.0
;
run;

Use the FIRSTOBS= option to skip the beginning of the file.
If there are always 5 records per block you could just read them individually.
data want;
infile rawdata dsd firstobs=1000 truncover;
input id :$40. (4*value) (/) ;
run;
Or you could do something like this that should allow for a variable number of values per id and just keep the last one.
data want;
infile rawdata dsd firstobs=1000 end=eof;
input # ;
length id $32 value 8 ;
retain id value;
if _infile_ =: 'UPDATE' then do;
if _n_ > 1 then output;
id = scan(_infile_,-1,' ');
end;
else input value;
if eof and _n_ > 1 then output;
run;

Related

different variables with the same name in a sas dataset

I have a text file where two different columns are having the same name. As shown in the following figure.
Let's say for SystBP, I need to change the first SystBP to SystBP_B and the second SystBP to SystBP_E.
Could someone kindly offer me some help on this?
When programming in SAS Base You should sometimes not expect SAS to read column names from a text file and interpret them as variable names.
You have to instruct SAS what the first data row is, where the values are written and how they should be interpreted (text, number, date, ...) You do that with an infile and an input statement in a data step.
As you write the code yourself, you have complete control.
data READ_FROM_TXT;
infile "C:\myFolder\myFile.txt" firstobs=3 truncover;
* firstobs=3 makes SAS skip the first 2 observations;
* truncover avoids jumping to the next line when the last variable is missing or too short ;
input
#01 ID 2.
#05 Week 4.
#11 SystBP_B 6.
#19 DiastBP_B 6.
...
#41 SystBP_E 6.
#49 DiastBP_E 6.
...
;
* #11 SystBP_B 6. instructs SAS to interpret positions 11 to 16 as a number
* and assign the value to variable SystBP_B;
run;
As you inserted the data as an image, not as text, using markup, I had to guess the positions, so you will have to correct them.
I would make timing into observations.
data test;
infile cards4 firstobs=2;
input id :$8. week #;
do time = 'STR','END';
input SystBP DiastBP Pulse Stress #;
output;
end;
cards;
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
;;;;
run;
The INFILE option FIRSTOBS= will let you INPUT the data starting in row 3.
Data file: C:\temp\bp-survey.txt
Start End
ID Week SystBP DiastBP Pulse Stress SystBP DiastBP Pulse Stress
1 1 134 44 66 5.8 134 44 66 5.8
...
Program
filename survey 'c:\temp\bp-survey.txt';
data want;
infile survey firstobs=3;
input
ID Week
SystBP_start DiastBP_start Pulse_start Stress_start
SystBP_end DiastBP_end Pulse_end Stress_end
;
run;
ods html ;
proc print data=want;
run;

How to subtract second row from first, fourth row from third and so forth

I have SAS dataset as in the attached picture.. what I'm trying to accomplish is created new calculated field from Total column where I'm subtracting first row-second row, third row-fourth row and so on..
What i have tried so far is
DATA WANT2;
SET WANT;
BY APPT_TYPE;
IF FIRST.APPT_TYPE THEN SUPPLY-OPEN; ELSE 'ERROR';
RUN;
this throws an eror as statement is not valid..
not really sure how to go about this
My dataset
Here you go. The best I can do with the limited information you provided. Next time please provide sample data and your expected output.
data have;
input APPT_TYPE$ _NAME_$ Quantity;
datalines;
ASON Supply 10
ASON Open 8
ASSN Supply 9
ASSN Open 7
S30 Supply 11
S30 Open 8
;
proc sort data = have;
by APPT_TYPE descending _NAME_ ;
run;
data want;
set have;
by APPT_TYPE descending _NAME_;
lag_N_Order = lag1(Quantity);
N_Order = Quantity;
Difference = lag_N_Order - N_Order;
keep APPT_TYPE _NAME_ N_Order lag_N_Order Difference Type;
if last.APPT_TYPE & last._NAME_ & Difference>0;
run;

SAS - creating macros dynamically

I would like to dynamically create macros to query a transactional data set. I have a table that has a set of parameters (parameter_data) and transaction data (txs). For each row in my parameter data I want to create a macro that can be called to query the data.
Parameter data:
data parameter_data;
input macro_name $ parameter_name $ parameter_value $;
datalines;
A Person_ID 1
B TX_ID 2
;
Transactional Data:
data txns;
input Person_ID $ TX_ID $ TX_Amount $;
datalines;
John Sales 1123
Mary Acctng 34
John Sales 23
Mary Sales 2134
;
Here I try to create a macro that should create macros dynamically according to the parameter data. The 'inner macros' are the macros that are created from the parameter data.
%macro outerMacro;
/*loop through each row in the parameter table to get the detail of the macro we want to create*/
%DO ROW = 1 %To 2;
data _NULL_;
set parameter_data;
if _N_ = ROW then do;
call symputx('parameter_name',parameter_name);
call symputx('parameter_value',parameter_value);
end;
run;
/*define inner macro parameters*/
%let macroName = myMacro; /*set the name of the macro we want to create*/
%let innerMacroStart = macro &macroName.; /*set the macro name to start the macro definition*/
%let innerMacroEnd = mend &macroName;
%&&innerMacroStart.; /*start the inner macro*/
/*body of the macro*/
data output;
set txns;
&&parameter_name = &&parameter_value;
/*so here effectively for the first row in the parameter table we are filtering where person_id = John*/
run;
%&&innerMacroEnd.; /*end the inner macro*/
%mend outerMacro;
%&&outerMacroName.;
It seems that SAS is unable to parse the lines %innerMacroStart. Any help is much appreciated.
Thanks!
If the goal is just to subset data then it might be better to generate macro variables instead of actual macros. Try something like this instead.
data _null_;
set parameter_data ;
call symputx(macro_name,catx(' ','where also'
,parameter_name,'=',quote(trim(parameter_value)),';'));
run;
Then just use the generated where statement(s) when you need them by expanding the macro variable. Like this:
data output ;
set txns;
&a
run;
If you really want to generate a macro definition then you probably want to just use a data step to write the code to a file and then %include the file to compile the macros. That will be much easier to debug than macro logic.
Let's fix your parameter file to better match your test data. Person_ID and TX_ID are character variables in your transaction dataset. You will probably need to add logic or change the parameter file to allow it to handle testing of both numeric and character variables. For now I just made it generate code that assumes that PARAMETER_NAME refers to a character variable so that PARAMETER_VALUE will need to have quotes added to make it a string literal.
data parameter_data;
input macro_name :$32. parameter_name :$32. parameter_value $:200.;
datalines;
A Person_ID John
B TX_ID Acctng
;
data txns;
input Person_ID $ TX_ID $ TX_Amount $;
datalines;
John Sales 1123
Mary Acctng 34
John Sales 23
Mary Sales 2134
;
Now let's run a data step to generate the code for all of your macros. I added logic to use AND if there were multiple "parameters" defined for each macro.
filename code temp;
data _null_;
set parameter_data ;
by macro_name ;
file code ;
if first.macro_name then put
'%macro ' macro_name ';'
/ 'data output;'
/ ' set txns;'
/ ' where ' #
;
else put ' and ' # ;
put parameter_name '=' parameter_value :$quote. # ;
if last.macro_name then put
';'
/ 'run;'
/ '%mend ' macro_name ';'
;
run;
Now just use %include to compile the macros.
%include code / source2 ;
NOTE: %INCLUDE (level 1) file CODE is file C:\...\#LN00048.
432 +%macro A ;
433 +data output;
434 + set txns;
435 + where Person_ID ="John" ;
436 +run;
437 +%mend A ;
438 +%macro B ;
439 +data output;
440 + set txns;
441 + where TX_ID ="Acctng" ;
442 +run;
443 +%mend B ;
NOTE: %INCLUDE (level 1) ending.
Now you can use your macros.
445 options mprint;
446 %a ;
MPRINT(A): data output;
MPRINT(A): set txns;
MPRINT(A): where Person_ID ="John" ;
MPRINT(A): run;
NOTE: There were 2 observations read from the data set WORK.TXNS.
WHERE Person_ID='John';
NOTE: The data set WORK.OUTPUT has 2 observations and 3 variables.
447 %b ;
MPRINT(B): data output;
MPRINT(B): set txns;
MPRINT(B): where TX_ID ="Acctng" ;
MPRINT(B): run;
NOTE: There were 1 observations read from the data set WORK.TXNS.
WHERE TX_ID='Acctng';
NOTE: The data set WORK.OUTPUT has 1 observations and 3 variables.
I have placed a comment before each block of code, but essentially it is:
Parameter set up.
Macro generation.
%include.
Call any desired macro.
I have assumed no more than 999 parameter observations - this is controlled by seq.
You can examine file "inner_macro.sas" to see the macro definitions.
NB. When you try it, make sure to use your own path in place of <your-path> (occurs twice):
/* set up parameters */
data parameters;
infile datalines dlm=',';
input var : $8.
operator : $8.
value : $8.
;
datalines;
name,eq,"John"
age,gt,12
weight,eq,0
;
/* read parameters and generate a macro definition for each obs, written to a file */
data _null_;
file '<your-path>/inner_macro.sas';
set parameters;
seq = put(_n_,z3.);
put '%macro inner_' seq ';';
put ' where ' var operator value ';';
put '%mend inner_' seq ';';
put;
run;
/* %include (submits code in file) all of the macro definitions */
%include '<your-path>/inner_macro.sas';
options mprint;
/* invoke the macro with the required data sets */
data class1;
set sashelp.class;
%inner_001;
run;
data class2;
set sashelp.class;
%inner_002;
run;
data class3;
set sashelp.class;
%inner_003;
run;

SAS - # symbol in the INPUT statement

I have the following program, but don't understand what # symbol in the end of the INPUT lines does:
data colors;
input #1 Var1 $ #8 Var2 $ #;
input #1 Var3 $ #8 Var4 $ #;
datalines;
RED ORANGE YELLOW GREEN
BLUE INDIGO PURPLE VIOLET
CYAN WHOTE FICSIA BLACK
GRAY BROWN PINK MAGENTA
run;
proc print data=colors;
run;
Output produced without # in the end of the INPUT line is different from the ouput with #.
Can you please clarify what does # in the end of the 2nd and 3rd INPUT lines do?
# at the end of an input statement means, do not advance the line pointer after semicolon. ## means, do not advance the line pointer after the run statement either.
Normally an input statement has an implicit advance the line pointer one after semicolon. So:
data want;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Will return
1 2
5 6
If you want to read 3 4 into another line, then, you might do something like:
data want;
input a b #;
output;
input a b;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
Which gives
1 2
3 4
5 6
7 8
Similarly you could simply write
data want;
input a b ##;
datalines;
1 2 3 4
5 6 7 8
run;
proc print data=want;
run;
To get the same result - ## would hold the line pointer even across the run statement. (It still would advance once it hit the end of the line.)
In Summary: I think you probably don't want the trailing # in this case. The Input statements do not seem fitting for the data you are reading. With the trailing #, you are reading the same data into var1 and var3, and the same data into var2 and var4, because it is reading the same line twice. Either way, you are not reading in what the data appears to be. You would be better off with:
input Var1 $ Var2 $ #;
input Var3 $ Var4 $;
Or, more simply:
input Var1 $ Var2 $ Var3 $ Var4 $;
Official details from the SAS support site, annotated:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146292.htm
Using Line-Hold Specifiers
Line-hold specifiers keep the pointer on
the current input record when
a data record is read by more than one INPUT statement (trailing #)
Use a single trailing # to allow the next INPUT statement to read from the same record.
Normally, each INPUT statement in a DATA step reads a new data record
into the input buffer. When you use a trailing #, the following
occurs:
The pointer position does not change.
No new record is read into the input buffer.
The next INPUT statement for the same iteration of the DATA step continues to read the same record rather than a new one.
SAS releases a record held by a trailing # when
a null INPUT statement executes:
input;
an INPUT statement without a trailing # executes
the next iteration of the DATA step begins.

Is there some way to tell SAS that for any obs ####1, ####2, or ####3 (where # = 1-9), I want them formatted #### Spring, #### Fall, and #### Winter?

So I have a 1000 observations for one variable that look like this:
19962
19943
19972
19951
19951
19912
The first four digits vary a bit, but the last digit is always 1, 2, or 3. Is there a way to only format the last digit, while not having to type out each iteration of the first four digits in a value statement?
That is, I want to avoid doing this:
proc format;
value varfmt
19911 = '1991 Spring'
19912 = '1991 Fall'
19913 = '1991 Winter'
19921 = '
19922 = '
[…]
19991 = '1999 Spring'
19992 = '1999 Fall'
19993 = '
;
run;
Instead, is there some way to tell SAS that for any ####1, ####2, or ####3, I want #### Spring, #### Fall, and #### Winter (which would be three lines under the value statement)?
Thanks in advance for any help.
As you are applying the format on the last digit only, so using the all the digits in the proc format is not required. Just extract the last digit and apply the format on it and concatenate it with other first four digits.
Creating the sample dataset
data test;
infile datalines;
input year;
datalines;
19962
19943
19972
19951
19951
19912
;
run;
Creating the formats
proc format;
value $varfmt
1 = 'Spring'
2 = 'Fall'
3 = 'Winter'
;
run;
Here, doing the following things
Extracting the last digit
Applying the format on it, created above
Extracting the first four digits of the number
Concatenating the output of 2 and 3
data final;
set test;
year_new = cat(substr(compress(year),1,4)," ",put(substr(compress(year),5,1),$varfmt.));
run;
You also have the option of creating a format from a dataset, if you do want a format for the whole value. You will have to create all possible rows, but it's not particularly hard.
data forfmt;
fmtname='SEASONF';
length start $5 label $8;
do startyr = 1990 to 2015;
start=cats(startyr,'1');
label=catx(' ',startyr,'Spring');
output;
start=cats(startyr,'2');
label=catx(' ',startyr,'Fall');
output;
start=cats(startyr,'3');
label=catx(' ',startyr,'Winter');
output;
end;
run;
proc format cntlin=forfmt;
quit;