SAS: renaming variables during input

self-taught SAS user here.
I often work with datasets that I have little control over and are shared among several different users.
I generally have been reading in files as CSVs using an infile statement + defining the variables with blocks of informat, format, and input statements. During this process, can I go ahead and rename variables--provided that everything is renamed in the correct order--or do they have to match the original dataset and be renamed in a later data step?
For example, the variable name in the dataset is '100% Fully Paid Out.' I know SAS variables can't start with numbers and I'd also like to simplify variable names in general, so could I do something like the following:
infile statement...
informat Paid $3.;
format Paid $3.;
input Paid $;
run;
Or maybe I'm going about this very inefficiently. I've tried doing simple proc imports without this whole informat/format/input business, but I've found that trying to redefine variable types afterwards causes more of a headache for me (all datasets I work with have combinations of text, dollars, percentages, general numbers, dates...). In any case, other tips highly appreciated--thanks!
EDIT
Maybe the question I should ask is this: is there any way to keep the dollar and percentage formatting from the CSV (through proc import, which seems to convert these to character)? I know I can manually change the formats from dollars/percentages to "general" in Excel prior to importing the file, but I'd prefer to avoid additional manual steps, and I actually do want to keep these as dollars and percentages. Or am I just better off doing the informat/format/input to specify data types for the csv, so that variables are read in exactly how I want them to be read in?
Note: I've been unable to proc import xls or xlsx files, either because I'm on a 64-bit computer or I'm missing required drivers (or both). I was never able to do this on a 32-bit computer either.

CSV files do not contain any metadata about the variable types. You can use PROC IMPORT to have SAS make an educated guess as to how to read them, but the answer can vary from file to file based on the particular data values that happen to appear; your note about dollars and percentages coming in as character is exactly that guessing at work.
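If you do use PROC IMPORT, it helps to let it scan the whole file before guessing. A minimal sketch (the file path and dataset name are placeholders; GUESSINGROWS=MAX needs SAS 9.4, on older versions use a large number such as 32767):

proc import datafile="C:\data\myfile.csv" out=work.mydata
            dbms=csv replace;
  guessingrows=max;
run;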
If you have data in XLS or XLSX files you should be able to read them directly into SAS using a libname with the XLS or XLSX engine. That does not use Excel and so does not have any conflicts between 32-bit and 64-bit installations; in fact you don't even need Excel installed. SAS will do a better job of determining the variable types from Excel files than from CSV files, but since Excel is a free-form spreadsheet you still might not have consistent variable types for the same variable across multiple files. With an Excel spreadsheet you might not even have the same data type consistently in a single column of a single sheet.
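For example, a minimal sketch (the path and sheet name are assumptions to adapt):

libname xl xlsx "C:\data\myfile.xlsx";

data work.mydata;
  set xl.Sheet1;   /* each sheet appears as a dataset in the XL libref */
run;

libname xl clear;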
You are better off writing your own data step to read the file. That way you can enforce consistency.
What I typically do when given a CSV file is copy the names from the first row and use them to create a LENGTH statement. This both defines the variables and sets their order. At this point you could also give the variables new names.
length paid $3 date amount 8 ;
Then for variables that require an INFORMAT to be read properly I add an INFORMAT statement. Normally this is only needed for date/time variables, but it might also be needed if numeric values include commas or percent signs. The DOLLAR. informat is useful if your CSV file has numbers formatted with $ and/or thousands separators.
informat date mmddyy. amount dollar. ;
Then for variables that require a FORMAT to be displayed properly I add a FORMAT statement. Normally this is only needed for date/time variables. It is only required for character variables if you want to attach the $CHAR. format in order to preserve leading spaces.
format date yymmdd10. ;
Then the INPUT statement is really easy since you can use a positional variable list. Note that there is no need to include informats or $ in the INPUT statement since the types are already defined by the LENGTH statement.
input paid -- amount ;
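Putting those pieces together for your example, a complete step might look like this (the file path, the FIRSTOBS=2 to skip the header row, and the DOLLAR12.2 display format are assumptions to adapt to your file):

data work.mydata;
  infile "C:\data\myfile.csv" dsd truncover firstobs=2;
  length paid $3 date amount 8;            /* defines names, types, and order */
  informat date mmddyy. amount dollar.;    /* how the raw text is read        */
  format date yymmdd10. amount dollar12.2; /* how the values are displayed    */
  input paid -- amount;                    /* positional list: no $ needed    */
run;

Because PAID replaces the original '100% Fully Paid Out' column by position, the rename happens right here in the input step; no later data step is required.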

Related

SAS VARTYPE Function: run it (or equivalent) against all Variables

I am a DB administrator with zero SAS experience. I work in government and have been tasked with ingesting SAS output from another team. That team apparently has limited SAS experience and cannot answer the question "what is the data type of each SAS variable?" We have dozens of tables and thousands of variables to import. Is there a way to run the SAS function VARTYPE against all columns?
I've not found what I needed on SAS docs, SO search, etc.
I am expecting code that I can hand to the other team, which they will run to produce the following (with only the "dataset" hard-coded; no hard-coded table names/variable names):
TableName   VariableName  DataType               DataLength and/or other attributes
MyTable 1   Column1       char                   25
MyTable 1   Col2          numeric                scale 10 precision 2
MyTable 2   Col1          (small? big? 32?) int  bytes? or something that tells me max range
...
MyTable102  Column100     date                   yyyy-mm-dd
Update: here's what I used based on the accepted answer. You would change:
- library=SASHELP to library=YourLibrary to change the library being scraped
- out=yourDataset.sasSchemaDump: replace yourDataset with the destination library where a new table named sasSchemaDump will be created/populated, and rename sasSchemaDump to your desired table name
proc datasets library=SASHELP  memtype=data;
contents data=_ALL_ (read=green) out=yourDataset.sasSchemaDump;
title 'SAS Schema Dump';
run;
quit;
There is a dedicated SAS procedure for this: PROC CONTENTS
proc contents data=sashelp.cars out=want; run;
It will create a SAS table named want with all the information needed.
FYI: in that output, TYPE=1 is numeric and TYPE=2 is character.
If all the tables are in the same library, you can cycle through every table in the library like this:
proc contents data=sashelp._all_ out=want; run;
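To shape that output into the layout you asked for, something like this should work as a sketch (MEMNAME, NAME, TYPE, LENGTH, FORMAT, and VARNUM are standard columns of the CONTENTS OUT= dataset):

proc contents data=sashelp._all_ noprint out=want;
run;

proc sql;
  select memname as TableName,
         name    as VariableName,
         case type when 1 then 'numeric' else 'char' end as DataType,
         length  as DataLength,
         format  as DisplayFormat
    from want
    order by memname, varnum;
quit;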
Run PROC CONTENTS on the dataset and you will have the information you need.
SAS has only two data TYPEs: fixed-length character strings and floating-point numbers. The LENGTH is the number of bytes that are stored in the dataset. So for character variables the length determines how many characters can be stored (assuming you are using a single-byte encoding). Floating-point numbers require 8 bytes, but you can store fewer bytes in the dataset if you don't mind the loss of precision that implies. For example, if you know the values are integers you might choose to store only 4 of the bytes.
You can sometimes tell more about a variable if the creator attached a permanent FORMAT to control how the variable is displayed. For example, SAS stores DATE values as the number of days since 1960, so to make those numbers meaningful to humans you need to attach a format such as DATE9. or YYMMDD10. so that they print as strings a human would read as a date. Similarly, there are display formats for time-of-day values (number of seconds since midnight) and datetime values (number of seconds since 1960). Also, if they attached a format that does not display decimal places, that might mean the values are intended to be integers.
And if they attached a LABEL to the variable that might explain more about the variable than you can learn from the name alone.
They could also attach user-defined formats to a variable. Those could be simple code/decode lookups, but they could also be more complex. A common complex one collapses a range (or multiple values and/or ranges) to a single decode. The definitions of user-defined formats are stored in a separate file, called a catalog, in particular a format catalog. You can use PROC FORMAT with the FMTLIB or CNTLOUT= option to see the definitions of the user-defined formats.
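For example (assuming the catalog is at mylib.formats; adjust the libref and catalog name to wherever their catalog actually lives):

* Print the format definitions in the output window. ;
proc format library=mylib.formats fmtlib;
run;

* Or dump the definitions to a dataset for inspection. ;
proc format library=mylib.formats cntlout=work.fmtdefs;
run;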

Proper formatting of Excel sheets to avoid errors in SQL querying?

1. What do you avoid when creating and filling out an Excel spreadsheet of data for a SQL database (certain formats, characters, character-length issues)?
2. Does it matter how dates are formatted?
3. What VARCHAR or INTEGER errors have you seen?
4. Finally, what SQL or Python queries did you use to address errors you found for questions 1-3?
The easiest way is to import a TXT or CSV export of the Excel data into your database through a database IDE (e.g. Oracle SQL Developer).
→ Depending on the database, different requirements must be observed.
The main focus is on correct formatting with regard to the locale settings (Excel & database):
Excel date format YYYY-M-DD HH24:MM vs. database timestamp YYYY-MM-DD HH24:MM:SS.FFFF
→ That mismatch will not work.
In addition, make sure that Excel does not cut off any numbers:
Excel long-number format 89632150000 (original 896321512345)
→ With its default settings, Excel automatically shortens long numbers.
The length of a text value must not exceed the maximum length specified for the assigned column of that type (VARCHAR).
I think these would be the main points to look out for.
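For the date point, loading is usually more reliable if you convert the exported text explicitly with a format mask instead of relying on implicit conversion. An Oracle-flavoured sketch (the table and column names are invented for illustration):

INSERT INTO target_table (created_at)
SELECT TO_TIMESTAMP(excel_date_text, 'YYYY-MM-DD HH24:MI:SS')
FROM   staging_table;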

Quickly Convert Text To Numbers or Dates Excel VBA

Is there any way to QUICKLY convert numbers/dates stored as text (without knowing exactly which cells are affected) to their correct type using VBA.
I get data in an ugly text-delimited format, and I wrote a macro that basically does text-to-columns on it, but is more robust (regular text-to-columns will not work on my data, and I also don't want to waste time going through the wizard every time...). But since I have to use arrays to process the data efficiently, everything gets stored as a String (and is thus transferred to the worksheet as text).
I don't want to have to cycle through every cell, as this takes a LONG time (these are huge data files - I need to use arrays to process them). Is there a simple command I can apply to the entire range to do this?
Thanks!
This has to do with the data type of the columns: change the column format from General to the correct data type, and text data placed into the column should get converted automatically. As an example, pasting the text 012345 into columns with different data types shows a different displayed value for each type, but the underlying value is retained (except in Number and General columns, which truncate the leading 0).
However if you don't know what field is of what type... you're really out of luck.
There is a way. Just multiply the data in the column by 1: any value stored as text that looks like a number will be converted to an actual number.
Read the following link for more:
http://chandoo.org/wp/2014/09/02/convert-numbers-stored-as-text-tip/
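A minimal VBA sketch of that multiply-by-one trick applied to a whole range at once (the range addresses are placeholders; note that genuinely empty cells may come out as 0):

Sub ConvertTextToNumbers()
    Dim rng As Range, spare As Range
    Set rng = ActiveSheet.Range("A1:D10000") ' adjust to your data range
    Set spare = ActiveSheet.Range("Z1")      ' any unused helper cell

    spare.Value = 1
    spare.Copy
    ' Paste-multiply coerces numeric text to real numbers in one shot.
    rng.PasteSpecial Paste:=xlPasteValues, _
                     Operation:=xlPasteSpecialOperationMultiply
    Application.CutCopyMode = False
    spare.ClearContents
End Sub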

Preserve leading zeros when importing Excel into SQL

My office uses Excel to prepare our data before importing it into a SQL database. However, we have been experiencing the following error.
When the data is imported from one computer it loses all of the leading zeros. However, when it is imported from a different computer it imports perfectly.
An example of the leading zeros: our item numbers are required to be formatted as "001, 002, 003, ... 010, 011, 012, ... 100, 101, 102, etc.".
1) The excel file is stored on a server so there is no difference in the file.
2) If the users swap workstations the result stays with the computer, and doesn't switch with the user.
3) The data is formatted as text. It has been formatted as text both from the Data Tab and from Format Cells.
Is there a setting within excel that is specific to the computer and not the spreadsheet which will affect exporting the data? Or is there a non-excel specific setting which will cause this?
It's best to avoid the 'TEXT' format option. Confusingly, it does not force the contents of a cell to be a text data type, and it wreaks havoc when a formula references a 'TEXT'-formatted cell.
To add to the previous answer (with all of the caveats about if this is a good idea), you can use the TEXT worksheet function
=TEXT(A1,"000")
to guarantee an actual text string with leading zeros if needed.
Depending on the number of leading zeroes you require, you can select your data/column in Excel, go into Excel >> Format >> Custom >> and type however many zeroes you require into the Type field (e.g. 000000000 for a 9-digit number with leading zeroes), and it will automatically add the correct number of leading zeroes to make the numerical string the correct length (e.g. 4000 displays as 000004000).
Note, this only works with numerical data, not text, but depending on the scenario it may be more useful to retain your data in numerical format - the example you gave listed numerical data only, and often retaining the numerical format is a benefit for analysis.
Not sure what the benefit of padding data before inserting it into the database would be...(takes more space, slower searching, etc.). Sounds like you're formatting it for output (?), which might be more efficiently done elsewhere.
But anyway -- here are some ideas for your SELECT (sql) statement:
RIGHT(1000 + [excel field], 3)
or another one would be
REPLICATE('0', 3 - LEN([excel field])) + [excel field]
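A fuller T-SQL sketch of the second idea (the table and column names are invented, and the CAST guards against the field arriving as a number rather than text):

SELECT RIGHT('000' + CAST(item_no AS VARCHAR(10)), 3) AS item_no_padded
FROM   imported_items;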
Something you can do to the Excel field itself (before import) is prefix it with a ' (apostrophe). Notice if you type 0007 into Excel, it will change it to 7, but if you type '0007, it will keep the leading zeros.

Testing a CSV - how far should I go?

I'm generating a CSV which contains several rows and columns.
However, when I'm testing said CSV I feel like I am simply repeating the code that builds the file in the test as I'm checking each and every field is correct.
Question is, is this more sensible than it seems to me, or is there a better way?
A far simpler test is to just import the CSV into a spreadsheet or database and verify that the data lands in the proper fields: no extra columns or extra rows, and data selected from the imported recordset is a perfect INTERSECT with the recordset from which the CSV was generated, etc.
More importantly, I recommend making sure your test data includes common CSV fail scenarios such as:
Field contains a comma (or whatever your separator character)
Field contains multiple commas (You might think it's the same thing, but I've seen one fail where the other succeeded)
Field contains the new-row character(s)
Field contains characters not in the code page of the CSV file
...to make sure your code is handling them properly.
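Since the rest of this page is SAS-flavoured, here is a round-trip sketch of that idea in SAS (paths and dataset names are placeholders; the embedded-newline case is left out because it needs extra handling):

data expected;
  length field $40;
  field = 'plain';                 output;
  field = 'has,comma';             output;
  field = 'has,two,commas';        output;
  field = 'has "embedded" quotes'; output;
run;

proc export data=expected outfile="C:\temp\roundtrip.csv"
            dbms=csv replace;
run;

proc import datafile="C:\temp\roundtrip.csv" out=actual
            dbms=csv replace;
run;

* PROC COMPARE reports any cell-level differences between the two. ;
proc compare base=expected compare=actual;
run;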