Convert SAS read-in file into SQL

I am trying to read a .DAT file into SQL. The agency data provider supplied read-in code in SAS here (https://www.health.ny.gov/statistics/sparcs/docs/ip_v2.sas). I would like to read this data into a secure SQL database and was wondering if anyone could help me translate this SAS code into SQL? Here's the start:
OPTIONS NOCENTER NODATE FORMDLIM=' ' compress=yes pagesize=50;
%let yr=11;
/**** READ IN FILE ******* No Check for HexDec ****/
data IUM;
infile eium truncover lrecl=2500 PAD ignoredoseof /*obs=10000*/ ;
INPUT
@0016 ordr $char3.
@0001 RECDTL $char2500.
;
Further down it specifies the position and length of the columns, but not the data type. Any SAS users out there feeling smart and generous?

What is there to translate? The @xxxx says what column to start in, then you have the variable name and the informat to use to read it. SAS only has two data types, fixed length character strings and floating point numbers. Any informat that starts with a $ will generate a character variable. Others will generate a number. Some obvious informats will generate numbers that can be interpreted as date, time or datetime (timestamp) values, such as date9., time8. or datetime20.. There are also other informats for other ways of representing dates in text, like YYMMDD., MMDDYY. or DDMMYY..
SAS will define the order of the variables in the dataset based on when you first reference them in the code. So ORDR will be defined in the dataset before RECDTL even though the latter appears first in the text file (column 1 versus column 16).
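To make that rule concrete, here is a rough Python sketch that turns INPUT-style entries (start column, name, informat) into a CREATE TABLE statement and slices one record. Only the two fields shown in the question are listed. The helper names (parse_informat, sql_type, create_table, slice_record), the table name ium and the SQL type choices are my own assumptions for illustration, not part of the agency's code.

```python
import re

# (start column, variable name, informat), in the order the INPUT statement lists them.
FIELDS = [
    (16, "ordr", "$char3."),
    (1, "recdtl", "$char2500."),
]

# Assumed subset of informats whose numeric result is meant to be a date.
DATE_INFORMATS = {"date", "yymmdd", "mmddyy", "ddmmyy"}

def parse_informat(informat):
    """Split an informat such as '$char3.' or 'yymmdd8.' into (is_char, name, width)."""
    match = re.fullmatch(r"(\$?)([A-Za-z_]*)(\d*)\.\d*", informat)
    if match is None:
        raise ValueError(f"unrecognised informat: {informat!r}")
    dollar, name, width = match.groups()
    return dollar == "$", name.lower(), int(width) if width else None

def sql_type(informat):
    """Assumed mapping: $ informats -> CHAR, date informats -> DATE, the rest -> DOUBLE PRECISION."""
    is_char, name, width = parse_informat(informat)
    if is_char:
        return f"CHAR({width})" if width else "VARCHAR(255)"
    if name in DATE_INFORMATS:
        return "DATE"
    return "DOUBLE PRECISION"

def create_table(table, fields):
    """Columns are emitted in first-reference order, as SAS would define them."""
    columns = ",\n".join(f"    {name} {sql_type(fmt)}" for _, name, fmt in fields)
    return f"CREATE TABLE {table} (\n{columns}\n);"

def slice_record(line, fields):
    """Cut one fixed-width line; SAS column pointers are 1-based."""
    record = {}
    for start, name, fmt in fields:
        _, _, width = parse_informat(fmt)
        end = None if width is None else start - 1 + width
        record[name] = line[start - 1:end]
    return record

print(create_table("ium", FIELDS))
```

The remaining entries of the INPUT statement would be added to FIELDS in the same way. The type mapping is deliberately coarse and should be tightened once you know what each column really holds.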

Related

SAS VARTYPE Function: run it (or equivalent) against all Variables

I am a DB administrator with 0 SAS experience and I work in government and have been tasked with ingesting SAS output from another team. The other team has limited SAS experience apparently and cannot answer the question "what is the data type of each SAS variable". We have dozens of tables and thousands of variables to import. Is there a way to run the SAS function "VarType" against all columns?
I've not found what I needed on SAS docs, SO search, etc.
I am expecting code that I can hand to the other team which they will run to produce the following (with only hard-coding the "dataset" ; no hard-coded table names/variable names):
TableName    VariableName   DataType                DataLength and/or other attributes as needed
MyTable 1    Column1        char                    25
MyTable 1    Col2           numeric                 scale 10 precision 2
MyTable 2    Col1           (small? big? 32?) int   bytes? or something that tells me max range
...
MyTable102   Column100      date                    yyyy-mm-dd
Update: here's what I used based on the accepted answer. You would change:
library=SASHELP to library=YourLibrary to change the dataset being scraped
out=yourDataset.sasSchemaDump replace yourDataset with the destination dataset where a new table named sasSchemaDump will be created/populated. Rename sasSchemaDump to your desired table name.
proc datasets library=SASHELP  memtype=data;
contents data=_ALL_ (read=green) out=yourDataset.sasSchemaDump;
title 'SAS Schema Dump';
run;

There is a dedicated SAS procedure for this: PROC CONTENTS
proc contents data=sashelp.cars out=want; run;
It will create a SAS table want with all the information needed.
FYI: TYPE 1 is numeric, TYPE 2 is character.
If all tables are in the same library you could do the following to cycle through all the tables within the library
proc contents data=sashelp._all_ out=want; run;

Run PROC CONTENTS on the dataset and you will have the information you need.
SAS has only two data types: fixed length character strings and floating point numbers. The LENGTH is the number of bytes that are stored in the dataset. So for character variables the length determines how many characters the variable can store (assuming you are using a single byte encoding). Floating point numbers require 8 bytes to store, but you can store them with fewer bytes in the dataset if you don't mind the resulting loss of precision. For example, if you know the values are integers you might choose to store only 4 of the bytes.
You can sometimes tell more about a variable if the creator attached a permanent FORMAT to control how the variable is displayed. For example, SAS stores DATE values as the number of days since 1960. So to make those numbers meaningful to humans you need to attach a format such as DATE9. or YYMMDD10. so that the numbers print as strings that a human would see as a date. Similarly there are display formats for time of day values (number of seconds since midnight) and datetime values (number of seconds since 1960). Also, if they attached a format that does not display decimal places, that might mean the values are intended to be integers.
And if they attached a LABEL to the variable that might explain more about the variable than you can learn from the name alone.
They could also attach user defined formats to a variable. Those could be simple code/decode lookups, but they could also be more complex. A common complex one is used for collapsing a range (or multiple values and/or ranges) to a single decode. The definition of a user defined format is stored in a separate file, called a catalog, in particular a format catalog. You can use PROC FORMAT with the FMTLIB or CNTLOUT= option to see the definition of the user defined formats.
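As a rough sketch of how the PROC CONTENTS output could be turned into SQL types on the database side, the Python below reads rows shaped like the OUT= dataset (NAME, TYPE, LENGTH, FORMAT, FORMATL, FORMATD) and suggests a column type. The function name suggest_sql_type, the format sets and the target SQL types are assumptions of mine; treating a decimal-free format as BIGINT is only the hint described above, not a guarantee.

```python
# Format names as they would appear in the FORMAT column (name only, no width).
# These sets are an assumed, incomplete selection.
DATE_FORMATS = {"DATE", "YYMMDD", "MMDDYY", "DDMMYY"}
DATETIME_FORMATS = {"DATETIME"}
TIME_FORMATS = {"TIME"}
INTEGER_HINT_FORMATS = {"", "F", "Z", "COMMA"}

def suggest_sql_type(row):
    """Suggest a SQL type for one PROC CONTENTS row (TYPE 1 = numeric, 2 = character)."""
    if row["TYPE"] == 2:
        return f"VARCHAR({row['LENGTH']})"
    fmt = (row.get("FORMAT") or "").strip().upper()
    if fmt in DATE_FORMATS:
        return "DATE"
    if fmt in DATETIME_FORMATS:
        return "TIMESTAMP"
    if fmt in TIME_FORMATS:
        return "TIME"
    # A width with zero decimals (for example 8. or COMMA12.) hints at integers.
    if fmt in INTEGER_HINT_FORMATS and row.get("FORMATL", 0) > 0 and row.get("FORMATD", 0) == 0:
        return "BIGINT"
    return "DOUBLE PRECISION"

# Hypothetical rows, shaped like an exported PROC CONTENTS dataset.
rows = [
    {"NAME": "Make", "TYPE": 2, "LENGTH": 13, "FORMAT": "", "FORMATL": 0, "FORMATD": 0},
    {"NAME": "Visit", "TYPE": 1, "LENGTH": 8, "FORMAT": "YYMMDD", "FORMATL": 10, "FORMATD": 0},
    {"NAME": "Count", "TYPE": 1, "LENGTH": 8, "FORMAT": "", "FORMATL": 8, "FORMATD": 0},
    {"NAME": "Price", "TYPE": 1, "LENGTH": 8, "FORMAT": "DOLLAR", "FORMATL": 8, "FORMATD": 2},
]
for row in rows:
    print(row["NAME"], suggest_sql_type(row))
```

Variables with user defined formats will fall through to DOUBLE PRECISION here, so those still need a look at the format catalog.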

SAS importing numeric column to scientific notation

I'm importing a sas7bdat file in sas studio using proc import and one of the variables in the dataset is changing to scientific notation, e.g, 1234567891011121 is showing up as 1.2345678E15
I'm fairly new to SAS and not sure what function would help retain this particular column in its original 16 digit format instead of scientific notation. This column is of numeric data type and its length is being displayed as 8. I have been through other similar posts, but could not find a solution to work with.

SAS stores all numbers as 64-bit binary floating point, so using a length of 8 bytes to store the value is the right thing. You cannot use more bytes because it only takes 8 bytes to store all 64 bits. And if you used fewer bytes you would lose precision and could not store all 16 digits.
SAS uses FORMATs to control how values are printed as text. You can use the FORMAT statement to attach a format to a variable.
It looks like you are either using the BEST12. format with that variable, or you are letting SAS use its default way of displaying numbers, which in most cases will be to use the BEST12. format.
If you want the numbers to print with 16 decimal digits then just attach the 16. format to the variable instead.
Or you could use the COMMA21. format instead and the numbers will print with thousand separators so it will be easier for humans to read them.
Example code for attaching a format to variable in a data step.
data want;
set have;
format mynumber 16.;
run;
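The same behaviour can be checked outside SAS. The Python sketch below is only an illustration, not SAS itself: it shows that the 16 digit value from the question fits exactly in an 8 byte double, that a short display width is what produces scientific notation, and that keeping fewer bytes loses digits. The truncate_double helper is a hypothetical stand-in that mimics storing a number with a shorter length by zeroing the low-order bytes.

```python
import struct

value = 1234567891011121  # the 16-digit number from the question

# A 64-bit double has a 53-bit significand, so every integer below 2**53 is exact.
assert value < 2**53
as_double = float(value)

def truncate_double(x, nbytes):
    """Keep the nbytes most significant bytes of the IEEE 754 double and zero the rest."""
    raw = struct.pack(">d", x)  # big-endian, so sign and exponent come first
    kept = raw[:nbytes] + b"\x00" * (8 - nbytes)
    return struct.unpack(">d", kept)[0]

full_width = format(as_double, ".0f")   # every digit, in the spirit of the 16. format
short_width = format(as_double, ".8g")  # too few significant digits, so scientific notation
with_commas = format(value, ",")        # thousands separators, in the spirit of COMMA21.

print(full_width)
print(short_width)
print(with_commas)
print(truncate_double(as_double, 4) == as_double)
```

The stored value never changes in any of these cases; only the way it is displayed does, which is why attaching a different format is enough.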

SSIS convert exponent number to real (DT_R4)

I have a flat CSV file and some fields contain a value like "1.8e-5, 8.139717345049093e-39" (exponent or scientific numbers). I need to store this value in a SQL real data type field (not float). But the maximum exponent supported by real is e-38.
But I need a mechanism to convert this string field to a real number through SSIS. Basically, values of e-39 or smaller should be replaced with 0, and the rest should be stored properly.
I tried setting the data type to DT_R4 in the flat file connection field mapping and that didn't help. I tried casting it to DT_R4 through a derived column and that didn't help either. When I check through Data Viewer the value still has the unsupported exponent, and it fails when I insert it into the SQL table.
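For what it's worth, the rule being asked for (anything too small for real becomes 0, everything else is kept at 4 byte precision) can be sketched outside SSIS. The Python below is only an illustration of that rule, for example as a pre-processing step on the CSV, and is not an SSIS expression. The function name to_real is hypothetical, and the threshold uses the 1.18E-38 lower bound documented for the SQL Server real type.

```python
import struct

# Smallest positive magnitude documented for SQL Server real (assumed cut-off).
REAL_MIN = 1.18e-38

def to_real(text):
    """Parse a number written in exponent notation and flush too-small values to 0."""
    value = float(text)
    if value != 0.0 and abs(value) < REAL_MIN:
        return 0.0
    # Round-trip through a 4-byte float to get the value as real would store it.
    # struct.pack raises OverflowError if the magnitude is too large for 4 bytes.
    return struct.unpack("<f", struct.pack("<f", value))[0]

field = "1.8e-5, 8.139717345049093e-39"
converted = [to_real(part) for part in field.split(",")]
print(converted)
```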

SAS renaming variables during input

self-taught SAS user here.
I often work with datasets that I have little control over and are shared among several different users.
I generally have been reading in files as CSVs using an infile statement + defining the variables with blocks of informat, format, and input statements. During this process, can I go ahead and rename variables--provided that everything is renamed in the correct order--or do they have to match the original dataset and be renamed in a later data step?
For example, the variable name in the dataset is '100% Fully Paid Out.' I know SAS variables can't start with numbers and I'd also like to simplify variable names in general, so could I do something like the following:
infile statement...
informat Paid $3.;
format Paid $3.;
input Paid $;
run;
Or maybe I'm going about this very inefficiently. I've tried doing simple proc imports without this whole informat/format/input business, but I've found that trying to redefine variable types afterwards causes more of a headache for me (all datasets I work with have combinations of text, dollars, percentages, general numbers, dates...). In any case, other tips highly appreciated--thanks!
EDIT
Maybe the question I should ask is this: is there any way of keeping the format of the csv for dollars and percentages (through proc import, which seems to convert these to characters)? I know I can manually change the formats from dollars/percentages to "general" in Excel prior to importing the file, but I'd prefer avoiding additional manual steps and also because I actually do want to keep these as dollars and percentages. Or am I just better off doing the informat/format/input to specify data types for the csv, so that variables are read in exactly how I want them to be read in?
Note: I've been unable to proc import xls or xlsx files, either because I'm on a 64-bit computer and/or I'm missing required drivers (or both). I was never able to do this even on a 32-bit computer either.

CSV files do not contain any metadata about the variable types, as your note about trying to import them into Excel demonstrates. You can use PROC IMPORT to have SAS make an educated guess as to how to read them, but the answer could vary from file to file based on the particular data values that happen to appear.
If you have data in XLS or XLSX files you should be able to read them directly into SAS using a libname with the XLS or XLSX engine. That does not use Excel and so does not have any conflicts between 32-bit and 64-bit installations. In fact you don't even need Excel installed. SAS will do a better job of determining the variable types from Excel files than from CSV files, but since Excel is a free-form spreadsheet you still might not have consistent variable types for the same variable across multiple files. With an Excel spreadsheet you might not even have the same data type consistently in a single column of a single sheet.
You are better off writing your own data step to read the file. That way you can enforce consistency.
What I typically do when given a CSV file is copy the names from the first row and use it to create a LENGTH statement. This will both define the variables and set the order of the variables. You could at this point give the variables new names.
length paid $3 date amount 8 ;
Then for variables that require an INFORMAT to be read properly I add an INFORMAT statement. Normally this is only needed for date/time variables, but it might also be needed if numeric values include commas or percent signs. The DOLLAR. informat is useful if your CSV file has numbers formatted with $ and/or thousands separators.
informat date mmddyy. amount dollar. ;
Then for variables that require a FORMAT to be displayed properly I add a FORMAT statement. Normally this is only needed for date/time variables. It is only required for character variables if you want to attach $CHAR. format in order to preserve leading spaces.
format date yymmdd10. ;
Then the INPUT statement is really easy since you can use a positional variable list. Note that there is no need to include informats or $ in the INPUT statement since the types are already defined by the LENGTH statement.
input paid -- amount ;
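The same idea (names and types fixed by the reader, not guessed from the file) can be sketched in Python for comparison. This is only an analogy to the data step above, not SAS. The converter functions, the COLUMNS list and the sample line are made up for illustration; the new names are assigned purely by position, so the original header is skipped.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal

def read_text(cell):
    return cell

def read_dollar(cell):
    """Plays the role of the DOLLAR. informat: drop $ and thousands separators."""
    return Decimal(cell.replace("$", "").replace(",", ""))

def read_mmddyy(cell):
    """Plays the role of the MMDDYY. informat for values like 07/01/2010."""
    return datetime.strptime(cell, "%m/%d/%Y").date()

# New variable names in file order, each paired with its converter.
COLUMNS = [("paid", read_text), ("date", read_mmddyy), ("amount", read_dollar)]

def read_csv(handle):
    reader = csv.reader(handle)
    next(reader)  # skip the original header; names come from COLUMNS
    rows = []
    for raw in reader:
        row = {}
        for (name, convert), cell in zip(COLUMNS, raw):
            cell = cell.strip()
            row[name] = convert(cell) if cell else None  # empty cell -> missing
        rows.append(row)
    return rows

sample = io.StringIO('100% Fully Paid Out,Date,Amount\nYes,07/01/2010,"$1,234.50"\n')
rows = read_csv(sample)
print(rows)
```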

MySQL : how to load data with fixed-row format into user variables

I'm trying to load a file where all the lines follow the same rules. (assume HEADER is a single line)
HEADER1
HEADER2
.......
But unluckily when I try to use the LOAD DATA INFILE statement I get this error: Error Code: 1409
Can't load value from file with fixed size rows to variable.
This is the code I wrote:
USE test;
DROP TABLE IF EXISTS EXAMPLE_H;
CREATE TABLE EXAMPLE_H(
ID CHAR(20),
SP CHAR(3),
IVA CHAR(11) PRIMARY KEY,
NLP CHAR(6),
DLP DATE,
DUVI DATE,
DELP CHAR(30),
FILLER CHAR(39),
VTLP CHAR(3),
FILL CHAR(49)
);
LOAD DATA INFILE 'BTILSP.TXT'
INTO TABLE test.EXAMPLE_H
FIELDS TERMINATED BY ''
LINES TERMINATED BY '\n'
(ID, SP, IVA, NLP, @var_date_one, @var_date_two, DELP, FILLER, VTLP, FILL)
SET DLP = str_to_date(@var_date_one, '%Y%m%d'),
DUVI = str_to_date(@var_date_two, '%Y%m%d');
I had this idea reading the bottom of this page (comment by Ramam Pullella), and I found the same explained on some websites, but I can't understand why I'm getting this error.
If I don't use the @var_date_one and @var_date_two variables, and so the STR_TO_DATE function, the date isn't rendered as MySQL needs (the date in the file is something like "20100701"), so that field would contain all zeros or a different date than what I'm expecting. If I change DLP and DUVI to CHAR(8), then it works, but I can't use the SQL DATE comparisons and similar tools.
Can you help me please? :)
Thank you very much.
EDIT:
It seems the problem is caused by FIELDS TERMINATED BY '', since that makes the file a "fixed row (undelimited)" one. Apparently values from such a file can't be assigned to variables; I don't know the reason, but that's how it works.
The documentation says:
User variables cannot be used when loading data with fixed-row format because user variables do not have a display width.
Any suggestion?
RE-EDIT:
I've read the comment by Ryan Neve at bottom of that page. He gives a trick to read fixed-row into variables:
LOAD DATA LOCAL INFILE '<file name>' INTO TABLE <table>
(@var1)
SET Date=str_to_date(SUBSTR(@var1,3,10),'%m/%d/%Y'),
Time=SUBSTR(@var1,14,8),
WindVelocity=SUBSTR(@var1,26,5),
WindDirection=SUBSTR(@var1,33,3),
WindCompass=SUBSTR(@var1,38,3),
WindNorth=SUBSTR(@var1,43,6),
WindEast=SUBSTR(@var1,51,6),
WindSamples=SUBSTR(@var1,61,4);
Do you think it's a good way to do it? :)

I'm no expert, but it seems to me that if the fields are terminated by an empty string, then they have to be fixed size instead; there has to be some way to determine the boundaries between fields, and if there is no terminator, then they pretty much have to be fixed size.
I observe that the MySQL 5.5 manual says:
User variables cannot be used when loading data with fixed-row format because user variables do not have a display width.
It also (rather earlier on the page) says:
If the FIELDS TERMINATED BY and FIELDS ENCLOSED BY values are both empty (''), a fixed-row (nondelimited) format is used. With fixed-row format, no delimiters are used between fields (but you can still have a line terminator). Instead, column values are read and written using a field width wide enough to hold all values in the field. For TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, the field widths are 4, 6, 8, 11, and 20, respectively, no matter what the declared display width is.
Since your statement has an empty FIELDS TERMINATED BY and no FIELDS ENCLOSED BY (which defaults to empty), you have a fixed-row format. And hence you cannot do as you want.
Sometimes, it is easier to massage the data outside the DBMS, and fixing the data representation might be one such operation. I do have a program that I call DBLDFMT that I've not used for a few years now, but it can do a variety of operations, such as converting decimal numbers with implicit decimal points (a mainframe trick; the price field might be 0023199, representing the value £231.99). It can deal with date manipulations too (not necessarily using a particularly user-friendly notation, but it is able to deal with the problems I faced getting data from mainframes into a Unix DBMS; not MySQL, which didn't exist when I wrote this code). Contact me if that might be of any interest; see my profile.
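To illustrate the kind of massaging meant here (this is not DBLDFMT, just a small Python sketch), the code below cuts a fixed-width line into fields, rewrites the YYYYMMDD dates as YYYY-MM-DD and joins everything with tabs so that a plain delimited load can take it. The LAYOUT widths are assumed to match the CHAR sizes in the question's CREATE TABLE, with 8 characters for each date; the function names are hypothetical, and implied_decimal shows the implicit decimal point conversion on its own.

```python
from datetime import datetime
from decimal import Decimal

# Assumed record layout: (column name, width in characters), in file order.
LAYOUT = [
    ("ID", 20), ("SP", 3), ("IVA", 11), ("NLP", 6), ("DLP", 8),
    ("DUVI", 8), ("DELP", 30), ("FILLER", 39), ("VTLP", 3), ("FILL", 49),
]
DATE_FIELDS = {"DLP", "DUVI"}

def iso_date(text):
    """Turn '20100701' into '2010-07-01'."""
    return datetime.strptime(text, "%Y%m%d").date().isoformat()

def implied_decimal(text, places=2):
    """Turn a digit string with an implicit decimal point into a Decimal."""
    return Decimal(int(text)).scaleb(-places)

def split_fixed(line):
    """Cut one line by width, trimming trailing padding and converting dates."""
    record, offset = {}, 0
    for name, width in LAYOUT:
        value = line[offset:offset + width].rstrip()
        record[name] = iso_date(value) if name in DATE_FIELDS and value else value
        offset += width
    return record

def to_tsv(line):
    """Return the record as one tab-delimited line (assumes no tabs in the data)."""
    return "\t".join(split_fixed(line.rstrip("\n")).values())

print(implied_decimal("0023199"))
```

A file rewritten this way is no longer fixed-row, so the dates can go straight into DATE columns without user variables.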

In case someone else comes across this.
If you just run
LOAD DATA LOCAL INFILE '<file name>' INTO TABLE <table>
(@var1)
SET ...
without specifying FIELDS TERMINATED BY, and your file contains tabs, MySQL will split on those, because tab is the default field delimiter.
In that case you can just tell MySQL that your field delimiter is something silly, e.g.:
FIELDS TERMINATED BY '############'
This way the whole line gets put in the first "column", i.e. your user variable. You can then use it exactly as shown in your code at the top.
It's worth noting that MySQL treats the delimiter as a string, so you can even have
FIELDS TERMINATED BY 'this_string_does_not_appear_in_my_file'
if you want.
