SAS VARTYPE Function: run it (or equivalent) against all variables

I am a DB administrator with no SAS experience. I work in government and have been tasked with ingesting SAS output from another team. The other team apparently has limited SAS experience and cannot answer the question "what is the data type of each SAS variable?". We have dozens of tables and thousands of variables to import. Is there a way to run the SAS function "VarType" against all columns?
I've not found what I needed on SAS docs, SO search, etc.
I am expecting code that I can hand to the other team, which they will run to produce the following (hard-coding only the "dataset"; no hard-coded table names or variable names):
TableName   | VariableName | DataType              | DataLength and/or other attributes as needed
MyTable 1   | Column1      | char                  | 25
MyTable 1   | Col2         | numeric               | scale 10 precision 2
MyTable 2   | Col1         | (small? big? 32?) int | bytes? or something that tells me max range
...
MyTable102  | Column100    | date                  | yyyy-mm-dd
Update: here's what I used, based on the accepted answer. You would change:
library=SASHELP to library=YourLibrary, to change the library (the collection of tables) being scraped
out=yourDataset.sasSchemaDump: replace yourDataset with the destination library where a new table named sasSchemaDump will be created and populated. Rename sasSchemaDump to your desired table name.
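proc datasets library=SASHELP memtype=data;
  /* (read=green) supplies a read password; drop it if the tables are not password-protected */
  contents data=_ALL_ (read=green) out=yourDataset.sasSchemaDump;
  title 'SAS Schema Dump';
run;
quit; /* PROC DATASETS is interactive and keeps running until QUIT */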

There is a dedicated SAS procedure for this: PROC CONTENTS
proc contents data=sashelp.cars out=want; run;
It will create a SAS table want with all the information needed.
FYI: TYPE 1 is numeric, TYPE 2 is character.
If all tables are in the same library, you could do the following to cycle through every table in that library:
proc contents data=sashelp._all_ out=want; run;
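As a rough sketch of how the OUT= table could be trimmed down to the columns asked for in the question: the variable names below (LIBNAME, MEMNAME, NAME, TYPE, LENGTH, FORMAT, FORMATL, FORMATD, LABEL) are the ones PROC CONTENTS writes to its OUT= dataset, while the dataset name schema and the variable datatype are made up for illustration.

```sas
/* Dump metadata for every table in the library; NOPRINT suppresses the printed report */
proc contents data=sashelp._all_ out=want noprint;
run;

/* Keep only the columns of interest and decode TYPE into a readable label */
data schema;
  set want (keep=libname memname name type length format formatl formatd label);
  length datatype $9;
  if type = 1 then datatype = 'numeric';
  else datatype = 'character';
run;
```

The FORMAT column is what hints at dates or integers, so it is probably worth keeping alongside TYPE and LENGTH.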

Run PROC CONTENTS on the dataset and you will have the information you need.
SAS has only two data types: fixed-length character strings and floating-point numbers. The LENGTH is the number of bytes stored in the dataset. So for character variables the length determines how many characters the variable can hold (assuming you are using a single-byte encoding). Floating-point numbers require 8 bytes to store, but you can store them with fewer bytes in the dataset if you don't mind the resulting loss of precision. For example, if you know the values are integers you might choose to store only 4 of the bytes.
You can sometimes tell more about a variable if the creator attached a permanent FORMAT to control how the variable is displayed. For example, SAS stores DATE values as the number of days since January 1, 1960. So to make those numbers meaningful to humans you need to attach a format such as DATE9. or YYMMDD10. so that the numbers print as strings that a human would recognize as a date. Similarly, there are display formats for time-of-day values (number of seconds since midnight) and datetime values (number of seconds since the start of 1960). Also, if they attached a format that does not display decimal places, that might mean the values are intended to be integers.
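A minimal sketch of the date point, using day zero so the output is easy to check (the variable name d is arbitrary):

```sas
data _null_;
  d = 0;                 /* day zero of the SAS calendar */
  put d=;                /* d=0 */
  put d= date9.;         /* d=01JAN1960 */
  put d= yymmdd10.;      /* d=1960-01-01 */
run;
```

The stored value is the same number in all three cases; only the format used to print it changes.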
And if they attached a LABEL to the variable that might explain more about the variable than you can learn from the name alone.
They could also attach user defined formats to a variable. Those could be simple code/decode lookups, but they could also be more complex. A common complex one is used for collapsing a range (or multiple values and/or ranges) to a single decode. The definition of a user defined format is stored in a separate file, called a catalog, in particular a format catalog. You can use PROC FORMAT with the FMTLIB or CNTLOUT= option to see the definition of the user defined formats.
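For example, assuming the user-defined formats live in the default WORK.FORMATS catalog (the other team may well keep theirs in a different library, and the dataset name format_defs is just a placeholder), something like this would expose the definitions:

```sas
/* Print the definitions of the formats stored in WORK.FORMATS */
proc format library=work fmtlib;
run;

/* Or write the same definitions to a dataset that can be exported */
proc format library=work cntlout=work.format_defs;
run;
```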

Related

SAS importing numeric column to scientific notation

I'm importing a sas7bdat file in SAS Studio using proc import, and one of the variables in the dataset is changing to scientific notation, e.g., 1234567891011121 is showing up as 1.2345678E15
I'm fairly new to SAS and not sure what function would help retain this particular column in its original 16 digit format instead of scientific notation. This column is of numeric data type and its length is being displayed as 8. I have been through other similar posts, but could not find a solution to work with.
SAS stores all numbers as 64-bit binary floating point, so using a length of 8 bytes to store the value is the right thing. You cannot use more bytes because it only takes 8 bytes to store all 64 bits. And if you used fewer bytes you would lose precision and could not store all 16 digits.
SAS uses FORMATs to control how values are printed as text. You can use the FORMAT statement to attach a format to a variable.
It looks like you are either using the BEST12. format with that variable, or you are letting SAS use its default way of displaying numbers, which in most cases will be to use the BEST12. format.
If you want the numbers to print with 16 decimal digits then just attach the 16. format to the variable instead.
Or you could use the COMMA21. format instead and the numbers will print with thousand separators so it will be easier for humans to read them.
Example code for attaching a format to variable in a data step.
data want;
set have;
format mynumber 16.;
run;
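A quick way to see the effect without touching the real data is to print one value with each format in a throwaway step (the variable x is just for illustration):

```sas
data _null_;
  x = 1234567891011121;
  put x= best12.;   /* too wide for 12 characters, so SAS falls back to scientific notation */
  put x= 16.;       /* x=1234567891011121 */
  put x= comma21.;  /* x=1,234,567,891,011,121 */
run;
```

The stored number is identical in every case; only the text written to the log differs.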

Trying to generate a unique ID that isn't an integer nor exactly other values in plaintext

I'm helping a colleague who has been asked to generate a key ID for two different groups of data coming in. I've completed this step but it's not very user friendly so I'm looking for suggestions on how to make it more readable. Each group has its own ID that appears to be a hexadecimal value. The concatenation of them appears to be a unique key in its own right.
In this case, the Household table and the Account table are being brought together and she has been asked to generate a Household-Account value (a household can have many accounts, an account can span households).
Our data is stored on SQL server but we do most of our manipulations using SAS, hence, PROC SQL below.
My initial thought was that the most obvious key is to run the two key fields together and use a delimiter. You'll see that in the top portion of my code. However this makes a very long field so I was asked to shorten it. My second thought, and their initial ask, was to just do an integer field. You can see that with the Monotonic but they felt that since it has warnings about it around the internet they don't trust it. My third thought was to run the existing, concatenated field through some kind of one-way function but when I do that (see MD5 below) I get something that looks like wingdings took over.
/* creating a table of just the "key" columns */
PROC SQL;
CREATE TABLE work.ConcatonatedKey AS
SELECT DISTINCT
CATX("G", HouseholdKey,FinancialKey) as Concatonated
FROM work.OriginalData
;
QUIT;
/* Populate HHFinancialKey */
/* Monotonic documentation */
/* http://support.sas.com/techsup/notes/v8/15/138.html */
PROC SQL;
CREATE TABLE work.ContrivedKeys AS
SELECT
Monotonic() AS HHFinID
, Concatonated
, MD5(Concatonated) As foo
FROM work.ConcatonatedKey
;
QUIT;
So, the real question here is: if you had something that could uniquely ID a row but wanted to make it more user friendly, using SAS, how would you go about it?
The SAS UUIDGEN function can return either human readable character string or a denser binary string. Per docs:
The UUIDGEN function returns a UUID (a unique value) for each cell. The default result is 36 characters long and it looks like:
5ab6fa40-426b-4375-bb22-2d0291f43319.
A binary result is 16 bytes long.
Example:
select
...
uuidgen() as myGroupId length=36
...
MD5 is probably the simplest solution. The MD5 function returns a 16 byte string as a result, but to make it human readable you can just format it using the $hex32. format. It's also very fast and widely supported.
data _null_;
x = put(md5("some_string_here"),$hex32.);
put x;
run;
Result:
BB28824D60AE6706F812CC940CAAAF1B
Just be careful: md5() is sensitive to case differences and to leading/trailing spaces. So you may want or need to uppercase everything and trim spaces before running it through the function, to get consistent results across different platforms.
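Applied to the table from the question, that normalization might look roughly like the following. The table name HashedKeys and the column name HHFinKey are placeholders, and whether uppercasing is appropriate depends on whether the source keys are case-sensitive.

```sas
proc sql;
  create table work.HashedKeys as
  select Concatonated,
         /* strip blanks and uppercase first, then hash and render as 32 hex characters */
         put(md5(upcase(strip(Concatonated))), $hex32.) as HHFinKey length=32
  from work.ConcatonatedKey;
quit;
```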
The risk of collisions is close to zero:
How many random elements before MD5 produces collisions?
I should also note that, knowing the two unhashed keys used to create the hash, you can recreate the hash from the keys, something that isn't possible with the uuidgen solution selected as the answer. Depending on your requirements, this may or may not matter.

Convert SAS read-in file into SQL

I am trying to read a .DAT file into SQL. The agency data provider supplied read-in code in SAS here (https://www.health.ny.gov/statistics/sparcs/docs/ip_v2.sas). I would like to read this data into a secure SQL database and was wondering if anyone could help me translate this SAS code into SQL? Here's the start:
OPTIONS NOCENTER NODATE FORMDLIM=' ' compress=yes pagesize=50;
%let yr=11;
/**** READ IN FILE ******* No Check for HexDec ****/
data IUM;
infile eium truncover lrecl=2500 PAD ignoredoseof /*obs=10000*/ ;
INPUT
@0016 ordr $char3.
@0001 RECDTL $char2500.
;
Further down it is specifying the position and length of the columns, but not the data type. Any SAS users out there feeling smart and generous?
What is there to translate? The @xxxx is saying what column to start in, then you have the variable name and the informat to use to read it. SAS only has two data types: fixed-length character strings and floating-point numbers. Any informat that starts with a $ will generate a character variable. Others will generate a number. Some obvious informats will generate numbers that can be interpreted as dates, times or datetime (timestamp) values, such as date9., time8. or datetime20. There are also informats for other ways of representing dates in text, like YYMMDD., MMDDYY. or DDMMYY.
SAS will define the order of the variables in the dataset based on when you first reference them in the code. So ORDR will be defined in the dataset before RECDTL even though the latter appears first in the text file (column 1 versus column 16).
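To see how the informat decides the type, here is a small self-contained sketch. The variable names, column positions and the sample record are invented for illustration and are not taken from the agency's layout.

```sas
data demo;
  infile datalines truncover;
  input @1 id     $char3.    /* $ informat -> character variable        */
        @4 amount 5.         /* numeric informat -> number              */
        @9 admit  yymmdd8.;  /* date informat -> number holding a date  */
  format admit yymmdd10.;
datalines;
A01  42020110315
;

/* Lists each variable with its type (Char or Num), length and format */
proc contents data=demo;
run;
```

For a SQL target, the $ variables would then map to fixed-width character columns of the informat's width, and the rest to numeric or date columns depending on the informat.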

Float type storing values in format "2.46237846387469E+15"

I have a table ProductAmount with columns
Id [BIGINT]
Amount [FLOAT]
now when I pass value from my form to table it gets stored in format 2.46237846387469E+15 whereas actual value was 2462378463874687. Any ideas why this value is being converted and how to stop this?
It is not being converted. That is what the floating point representation is. What you are seeing is the scientific/exponential format.
I am guessing that you don't want to store the data that way. You can alter the column to use a fixed format representation:
alter table ProductAmount alter column Amount decimal(20, 0);
This assumes that you do not want any decimal places. You can read more about decimal formats in the documentation.
I would strongly discourage you from using float unless:
You have a real floating point number (say an expected value from a statistical calculation).
You have a wide range of values (say, 0.00000001 to 1,000,000,000,000,000).
You only need a fixed number of digits of precision over a wide range of magnitudes.
Floating point numbers are generally not needed for general-purpose and business applications.
The value gets stored in a binary format, because this is what you specified by requesting FLOAT as the data type for the column.
The value that you store in the field is represented exactly, because 64-bit FLOAT uses 52 bits to represent the mantissa*. Even though you see 2.46237846387469E+15 when selecting the value back, it's only the presentation that is slightly off: the actual value stored in the database matches the data that you inserted.
But i want to store 2462378463874687 as a value in my db
You are already doing it. This is the exact value stored in the field. You just cannot see it, because the query tool of SQL Server Management Studio formats it using scientific notation. When you do any computations on the value, or read it back into a double field in your program, you will get back 2462378463874687.
If you would like to see the exact number in your select query in SQL Management Studio, use CONVERT:
CONVERT (VARCHAR(50), float_field, 128) -- See note below
Note 1: 128 is a deprecated format. It will work with SQL Server-2008, which is one of the tags of your question, but in versions of SQL Server 2016 and above you need to use 3 instead.
Note 2: Since the name of the column is Amount, good chances are that you are looking for a different data type. Look into decimal data types, which provide a much better fit for representing monetary amounts.
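* 2462378463874687 is close to the border for exact representation, because it needs 52 bits. Every integer up to 2^53 (9007199254740992) can be stored exactly in a 64-bit float; beyond that, some integers cannot.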

SAS renaming variables during input

self-taught SAS user here.
I often work with datasets that I have little control over and are shared among several different users.
I generally have been reading in files as CSVs using an infile statement + defining the variables with blocks of informat, format, and input statements. During this process, can I go ahead and rename variables--provided that everything is renamed in the correct order--or do they have to match the original dataset and be renamed in a later data step?
For example, the variable name in the dataset is '100% Fully Paid Out.' I know SAS variables can't start with numbers and I'd also like to simplify variable names in general, so could I do something like the following:
infile statement...
informat Paid $3.;
format Paid $3.;
input Paid $;
run;
Or maybe I'm going about this very inefficiently. I've tried doing simple proc imports without this whole informat/format/input business, but I've found that trying to redefine variable types afterwards causes more of a headache for me (all datasets I work with have combinations of text, dollars, percentages, general numbers, dates...). In any case, other tips highly appreciated--thanks!
EDIT
Maybe the question I should ask is this: is there any way of keeping the format of the csv for dollars and percentages (through proc import, which seems to convert these to characters)? I know I can manually change the formats from dollars/percentages to "general" in Excel prior to importing the file, but I'd prefer avoiding additional manual steps and also because I actually do want to keep these as dollars and percentages. Or am I just better off doing the informat/format/input to specify data types for the csv, so that variables are read in exactly how I want them to be read in?
Note: I've been unable to proc import xls or xlsx files, either because I'm on a 64-bit computer and/or I'm missing required drivers (or both). I was never able to do this even on a 32-bit computer either.
CSV files do not contain any metadata about the variable types, as your note about trying to import them into Excel demonstrates. You can use PROC IMPORT to have SAS make an educated guess as to how to read them, but the answer could vary from file to file based on the particular data values that happen to appear.
If you have data in XLS or XLSX files you should be able to read them directly into SAS using a libname with the XLS or XLSX engine. That does not use Excel and so does not have any conflicts between 32-bit and 64-bit installations. In fact you don't even need Excel installed. SAS will do a better job of determining the variable types from Excel files than from CSV files, but since Excel is a free-form spreadsheet you still might not have consistent variable types for the same variable across multiple files. With an Excel spreadsheet you might not even have the same data type consistently in a single column of a single sheet.
You are better off writing your own data step to read the file. That way you can enforce consistency.
What I typically do when given a CSV file is copy the names from the first row and use them to create a LENGTH statement. This will both define the variables and set the order of the variables. You could at this point give the variables new names.
length paid $3 date amount 8 ;
Then for variables that require an INFORMAT to be read properly I add an INFORMAT statement. Normally this is only needed for date/time variables, but it might also be needed if numeric values include commas or percent signs. The DOLLAR. informat is useful if your CSV file has numbers formatted with $ and/or thousands separators.
informat date mmddyy. amount dollar. ;
Then for variables that require a FORMAT to be displayed properly I add a FORMAT statement. Normally this is only needed for date/time variables. It is only required for character variables if you want to attach $CHAR. format in order to preserve leading spaces.
format date yymmdd10. ;
Then the INPUT statement is really easy since you can use a positional variable list. Note that there is no need to include informats or $ in the INPUT statement since the types are already defined by the LENGTH statement.
input paid -- amount ;
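Putting those statements together, a self-contained sketch might look like this. It reads two made-up CSV records from in-stream DATALINES so it can be run as-is; for a real file you would point INFILE at the file path and add FIRSTOBS=2 to skip the header row. The DOLLAR12.2 display format on amount is an extra assumption, added because the question wants dollar values to keep looking like dollars.

```sas
data want;
  infile datalines dsd truncover;      /* DSD: comma-delimited, honours quoted fields */
  length paid $3 date amount 8;        /* defines names, types and variable order     */
  informat date mmddyy. amount dollar.;
  format date yymmdd10. amount dollar12.2;
  input paid -- amount;                /* every variable from paid through amount     */
datalines;
Yes,03/15/2024,"$1,234.50"
No,12/01/2023,$99
;
```

Because the new names are set in the LENGTH statement, the header text in the file (such as '100% Fully Paid Out') never has to be a legal SAS name.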