Automating read-ins in SAS to avoid truncation and properly classify numeric variables - file-io

I've run into issues with proc import and large files, so I've been trying to automate the read-in process myself. Basically, I start with a file, read in all variables as character variables with a gratuitous length, run through the data set to determine the maximum length each variable actually takes on, and then alter the read-in to cut the lengths down. Then I try to determine which variables should be numeric/datetime, and convert them. For simplicity, I'm just posting the datetime part.
I have the following dataset:
data test;
    do i=1 to 10;
        j="01JAN2015:3:48:00";
        k="23SEP1999:3:23:00";
        l="22FEB1992:2:22:12";
        m="Hello";
        output;
    end;
    drop i;
run;
I want to run through it and determine that I should convert each variable. What I do is count the number of times the INPUT function succeeds, then decide on a success threshold (in this case, 90%). I'm assuming none of the observations are missing, but I handle that in the general case too. My code looks something like this:
proc contents data=test noprint out=test_c; run;

data test_numobs;
    /* test_c has one row per variable, so NOBS here is the variable count */
    set test_c nobs=temp;
    call symputx('nobs',temp);
run;

data test2;
    set test nobs=lastobs;
    array vars (*) $ _ALL_;
    length statement $1000;
    array tempnum(&nobs.) tempnum1-tempnum&nobs.;
    do i=1 to dim(vars);
        /* ?? suppresses the invalid-data notes for values that are not datetimes */
        if input(vars(i),?? anydtdtm.) ne . then tempnum(i)+1;
    end;
    statement="";
    if _N_=lastobs then do i=1 to dim(vars);
        if tempnum(i)/lastobs >=.9 then
            statement=strip(statement)||" "||strip(vname(vars(i)))||'1=input('||strip(vname(vars(i)))||",anydtdtm.); format "||
                strip(vname(vars(i)))||"1 datetime22.; drop "||strip(vname(vars(i)))||"; rename "||strip(vname(vars(i)))||"1="||strip(vname(vars(i)))||"; ";
        ds="test2";
    end;
    if _N_=lastobs then output;
run;
I only output the last row, which contains the generated statement I want:
j1=input(j,anydtdtm.); format j1 datetime22.; drop j; rename j1=j; k1=input(k,anydtdtm.); format k1 datetime22.; drop k; rename k1=k; l1=input(l,anydtdtm.); format l1 datetime22.; drop l; rename l1=l;
And then send that into a macro to reformat the dataset.
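For example, a minimal sketch of that last step (the macro variable stmt and the dataset test_converted are placeholder names for illustration, not my actual macro):
/* Sketch: capture the generated statement, then resolve it in a fresh step */
/* 'stmt' and 'test_converted' are placeholder names */
data _null_;
    set test2;
    call symputx('stmt',statement);
run;

data test_converted;
    set test;
    &stmt.
run;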
This is a pretty roundabout program. I didn't include a lot of the steps, but I use the same idea to determine the proper variable lengths by generating LENGTH and INPUT statements. My question is: does anyone have a better solution for this type of problem?
Thanks!

Related

Combine two strings for file path in SAS

I have two strings that I want to combine to get the file path to be used in a PROC IMPORT statement in SAS:
%let TypeName = XYZ;
%let InputDirectory = \\Nam1\Nam2\Nam3\Dataset\;
%let FileType = Filing.csv;
%let Filename = &TypeName&FileType;
%put &Filename;
%let CompInputDirect = &InputDirectory&Filename;
PROC IMPORT DATAFILE= %sysfunc(&CompInputDirect)
    OUT= outdata
    DBMS=csv
    REPLACE;
    GETNAMES=YES;
RUN;
I get an error message saying:
ERROR: Function name missing in %SYSFUNC or %QSYSFUNC macro function reference.
How do I put a macro variable containing the full file path in the Proc Import statement? Thanks in advance.
I reckon you meant to use the QUOTE function.
%sysfunc(quote(&CompInputDirect))
Or you can supply your own quotes.
"&CompInputDirect"
Macro symbol resolution &<name> is more formally &<name>. (note the terminating dot). The . is often left off when the resolution occurs where other characters or tokens break up the submit stream.
You want to be careful if you have abstracted the dot (.) filename extension into its own macro variable. You will need double dots: one to terminate the macro variable resolution, and one to separate the extension. A good habit when dealing with filename parts is to use the formal resolution syntax.
Example:
%let folder = \\Nam1\Nam2\Nam3\Dataset\;
%let file = XYZ;
%let ext = csv;
proc import datafile = "&folder.&file..&ext." ...
                                     ^^
(the ^^ marks the double dot: the first dot terminates &file, the second separates the extension)

SAS, variables order in data import

I've searched for my problem in a lot of topics, but found no solution yet.
My SAS code imports data from a .txt file. The problem is that the order of the variables changes from one version to another, so I have to change it back to fit my code, otherwise it crashes. Here's the code importing the data:
data Donnees1;
    %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
    infile "&source\Donnees\&data1" delimiter='09'x MISSOVER DSD
        lrecl=32767 firstobs=2;
    informat Numero $100. ;
    informat NU_CLI $100. ;
    informat Date $100. ;
    informat Code $10. ;
    informat RESEAU $100. ;
    informat TOP_SAN $10. ;
    informat TOP_PRV $10. ;
    format Numero $100. ;
    format NU_CLI $100. ;
    format Date $100. ;
    format Code $10. ;
    format RESEAU $100. ;
    format TOP_SAN $10. ;
    format TOP_PRV $10. ;
    input
        Numero
        NU_CLI
        Date
        Code
        RESEAU
        TOP_SAN
        TOP_PRV;
    if _ERROR_ then call symput('_EFIERR_',1); /* set ERROR detection macro variable */
run;
I am looking for an option so that, if the variables change order in the source file, it doesn't make my code crash.
I've seen solutions to reorder variables with RETAIN, but that's for changing the order of variables already imported, not during the import step.
The code works perfectly with no issues; the only problem arises when the data source changes in terms of variable order.
Thank you for your help.
If the variables are named in your text file, you could use PROC IMPORT's GETNAMES option to get SAS to name your variables automatically. This doesn't give you as much granular control as a data step with INFILE, but it should work as long as your input file isn't too irregular.
You would then also need to change the order of the variables in the INPUT list of your data step.
If the variable names and attributes do not change then you can dynamically generate the INPUT statement by reading the variable names from the header row of the file. Read the header line and generate a macro variable.
data _null_;
    infile "&source\Donnees\&data1" obs=1;
    input;
    call symputx('varnames',translate(_infile_,' ','09'x));
run;
Then read the data lines into a dataset and use the variable list in the INPUT statement. You actually don't want to use the ugly code that PROC IMPORT creates. Do NOT attach $xx FORMATS and INFORMATS to character variables as they add no value and can cause trouble down the line if they get out of sync with the actual length of the variable.
data Donnees1;
    infile "&source\Donnees\&data1" dlm='09'x TRUNCOVER DSD lrecl=32767 firstobs=2;
    length
        Numero  $100
        NU_CLI  $100
        Date    $100
        Code    $10
        RESEAU  $100
        TOP_SAN $10
        TOP_PRV $10
    ;
    input &varnames ;
run;
I have found a solution but haven't tested it yet: create a temporary WORK table where I import all the variables (the order doesn't matter) through PROC IMPORT, then use a data step where I keep only the variables that interest me, this time in the correct order. I'll tell you if it works fine.
Also, Tom's solution seems pretty good; I'll give it a shot.
Thank you for your help.
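A minimal sketch of that last idea (assuming the file's header row works with GETNAMES; work.tmp_import is a placeholder name):
/* Sketch: import everything first; the variable order doesn't matter here */
proc import datafile="&source\Donnees\&data1" out=work.tmp_import
    dbms=tab replace;
    getnames=yes;
run;

/* Then fix the order: RETAIN before SET pins the variable positions */
data Donnees1;
    retain Numero NU_CLI Date Code RESEAU TOP_SAN TOP_PRV;
    set work.tmp_import;
    keep Numero NU_CLI Date Code RESEAU TOP_SAN TOP_PRV;
run;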

SQL Server : Derive Multiple Rows from a text string contained in a database

I have a database that contains logging information. When a user uploads multiple files, they show up as a single text string in a record. I need to update another table with the names of the files that were uploaded.
In the below example, File1.txt and File2.txt are the file names:
PK Description
----------------------------------
1 Path: [Path]:\folder\sub Upload Method: Html5 Browser: IE 10.0 IP: 1.1.1.1 Files: Name: file1.txt Size: 313 KB Status: Completed Name: file2.txt Size: 444 KB Status: Completed Total Size: 758 KB Elapsed Time: 2 seconds Transfer Rate: 286 KB/second
I need to obtain and insert the file name in a new table ignoring the superfluous information so that it would appear like so:
PK Filename
-----------------------------------
1 file1.txt
2 file2.txt
Because different paths may be uploaded to, there is not a set number of characters before the first file name. And although my example shows 2 files, there could be more, so I need to continue parsing file names from the text whether there are 1, 10, or 50 of them. The file names are also not uniform, but all of them are preceded by Name:.
My recommended broad approach
This is a pretty typical use-case for a user-defined table-valued function.
You essentially want to create a function that takes each value of your log Description as the main input parameter - probably also taking additional parameters to govern what the start and end of each interesting substring should be. (In your case, interesting substrings start after Name: and end just before Size:.)
The function extracts each interesting value and adds it to an accumulator table variable, which is then returned as the result of the function.
You can use such a function neatly over presumably-many rows of logging information, using cross apply or outer apply operators (explained around half-way down this page), something like so:
select L.Description
,R.Filename
from dbo.uploadlogs as L
cross apply dbo.my_tv_function(L.Description,'%Name: %','% Size:%') as R;
This assumes the my_tv_function returns a column called Filename containing the split out filenames. (That's up to how you write the function.)
You could hard-code the patterns you want to search for into the function, but then it'd be less useful/transferable to different styles of logging information.
For every Description, this will produce n rows in the result set corresponding to n files uploaded in that Description log.
Having got that, it should be easy to add a new unique key column using row_number().
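For example, a sketch of that keying step (assuming the log table has a PK column, as in the question's sample):
-- Sketch: number the extracted file names to build the new key column
select row_number() over (order by L.PK, R.Filename) as PK
      ,R.Filename
from dbo.uploadlogs as L
cross apply dbo.my_tv_function(L.Description,'%Name: %','% Size:%') as R;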
How to create such a user-defined function?
In a general sense, you're going to want to leverage two standard SQL functions:
Patindex: finds out where a particular pattern in a bigger string first starts.
Substring: slices a, well, a substring from a bigger string.
Combining these functions (or patindex's closely related charindex) is a very common way to get hold of a consistent bit of a string, when you don't know where exactly it'll start (or how long it'll go on for).
But this only gets me the first occurrence of the text I want!
This is where to bring in a while loop. Looping in SQL is both often maligned and often misused; however, it's a useful language construct, and situations like this, within functions, are exactly where looping is both appropriate and effective. To ensure the loop ends, you need to make the long string (the log Description) shorter each time around, by cutting off the part you've already found a filename in and keeping everything beyond it.
There are other possible approaches without a while loop: in a general sense, this problem of "doing the same thing multiple times along a big string" can be solved recursively or iteratively, and a while loop is the iterative approach. In SQL, I prefer this approach.
Putting it all together
I'm not sure if you wanted a complete code solution or just guidance. If you want to figure the actual code out yourself, stop reading about now... :)
Here's a SQL function definition that will do what I described above:
create function dbo.fn_SplitSearch (
     @searchString nvarchar(max)
    ,@startPattern nvarchar(255)
    ,@endPattern nvarchar(255)
)
returns @fileList table (Filename nvarchar(255) not null)
begin
    /***
    This table-valued function will return all instances of text
    starting with @startPattern, and going up to the last character before
    @endPattern starts. This might include leading/trailing spaces depending
    on what you define as the patterns.
    ***/
    declare @foundValue nvarchar(255) = ''
    declare @startLoc int = 0
    declare @endLoc int = 0

    while patindex(@startPattern,@searchString) <> 0
    begin
        set @startLoc = patindex(@startPattern,@searchString)
        set @endLoc = patindex(@endPattern,@searchString)
        set @foundValue = substring(@searchString,@startLoc,@endLoc-@startLoc)
        insert into @fileList values (@foundValue)
        -- Next time round, only look in the remainder of the search string
        -- beyond the end of the first endPattern
        set @searchString = substring(@searchString,@endLoc+len(@endPattern),len(@searchString))
    end
    return
end;
This will actually output results like this:
Filename
---------
Name: file1.txt
Name: file2.txt
including the startPattern text in the output. For me this is a little more generic and it should be easy to trim off the Name: bit outside the function if you want. You could alternatively modify the function to only return the file1.txt part.
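For example, a possible trim in the calling query (a sketch; it assumes the prefix is always exactly 'Name: '):
-- Sketch: strip the 'Name: ' prefix outside the function
select replace(R.Filename,'Name: ','') as Filename
from dbo.uploadlogs as L
cross apply dbo.my_tv_function(L.Description,'%Name: %','% Size:%') as R;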
I would add a regex CLR assembly to my database and then use a regex match to extract the file names.

When using infile in SAS for a fixed-width file, how do you stop input when you encounter a blank line?

Imagine you have a particular fixed-width file with lines of data you are interested in, a few blank lines, and then a bunch of data and descriptions that you are not interested in. How do you read in that file but stop at the blank line?
For example, if you download and unzip the following document:
http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_TXT.zip
And attempt to read in the data in SAS like so
data FF;
    infile 'C:/Data/F-F_Research_Data_Factors.txt' firstobs=5 stopover;
    input date Mkt_RF SMB HML RF;
run;
It reads in "extra" lines near the bottom that are not monthly data but are instead annual data. Is there a way to stop at the blank line?
For a simple file like the example just use a conditional STOP statement. Also note that you can read those YYYYMM values as actual date values instead of treating them as just numbers.
data FF;
    infile 'C:/Data/F-F_Research_Data_Factors.txt' firstobs=5 truncover;
    input date Mkt_RF SMB HML RF;
    informat date yymmn6.;
    format date yymmn6.;
    if date=. then stop;
run;
The following code is untested, but should do what you are looking to achieve.
DATA FF;
    INFILE 'C:/F-F_RESEARCH_DATA_FACTORS.TXT' FIRSTOBS=5 TERMSTR=CRLF;
    /* READ IN ONLY THE DATE VARIABLE AND EVALUATE ITS CONTENTS */
    INPUT DATE @;
    /* IF THERE IS A BLANK LINE THEN STOP READING THE FILE */
    IF DATE = . THEN STOP;
    /* IF THE VALUE IS NOT MISSING THEN READ IN THE REMAINING COLUMNS */
    ELSE INPUT MKT_RF SMB HML RF;
RUN;
I'd suggest that you test each row before you attempt to parse it, using something like the following.
data FF;
    infile 'C:/Data/F-F_Research_Data_Factors.txt' firstobs=5 stopover;
    input @;
    if _infile_='' then stop;
    input @1 date Mkt_RF SMB HML RF;
run;
The input @; statement reads in the entire line but doesn't release it, because of the trailing @. The _infile_ variable is automatically loaded with the entire line by the INPUT statement, and we then test whether the line is blank. The original INPUT statement then needs @1 to reset the read pointer to the first column so it can function normally.

How to write structures?

How can I show the value inside a structure? See the example below:
DATA: BEGIN OF line,
        col1 TYPE i,
        col2 TYPE i,
      END OF line.

DATA: itab LIKE TABLE OF line,
      jtab LIKE TABLE OF line.

DO 3 TIMES.
  line-col1 = sy-index.
  line-col2 = sy-index ** 2.
  APPEND line TO itab.
ENDDO.

MOVE itab TO jtab.

line-col1 = 10. line-col2 = 20.
APPEND line TO itab.

IF itab GT jtab.
  WRITE / 'ITAB GT JTAB'.
ENDIF.

WRITE: itab, jtab.
I ask because I want to know why itab is greater than jtab.
If you want to see the contents of a field purely for debugging purposes, you can also just put a break-point in your code and look at the contents in the debugger.
Just don't leave the break-point in productive code!
break-point.
"or use break yourusername <= this use is safer
EDIT:
You can also just use a session break-point, which does not require you to change the code (and will only be applicable to your user for the duration of the session):
In the system where you are running the program:
Open the Program
Select the line that you would like the program to stop on
Click the session Break-point button
The break-point icon will appear next to the line (you can also just click in the place where the icon appeared, to set/delete the break-point).
I assume that this is just a quick example and you don't want to use (parts of) this in a productive environment - so I ignore the other potential issues there are in your code.
Down to your question, you need to loop over your itab to access its values. You can then access a value like so:
DATA: ls_current_line LIKE line.
" ...
LOOP AT itab INTO ls_current_line.
  WRITE / ls_current_line-col1.
ENDLOOP.
You could use function module REUSE_ALV_GRID_DISPLAY.
For example:
CALL FUNCTION 'REUSE_ALV_GRID_DISPLAY'
  TABLES
    t_outtab = itab.
ITAB is greater than JTAB because it contains more lines; ITAB has 4 lines while JTAB has 3 lines.
When it comes to internal tables, the GT operator first takes a look at the number of lines in the tables. More details on the comparison operators (for internal tables) can be found at http://help.sap.com/saphelp_nw04/helpdata/en/fc/eb3841358411d1829f0000e829fbfe/content.htm. [I see that your example is also taken from this help page.]