I'm looking to read a ".txt" file into SAS so that I can search the contents for specific characters and words to analyse. The text file in question is poorly formatted, so I would ideally like to have one column with each word being a new observation, like:
TEXT
1 Hello
2 World
Currently I'm reading the file into SAS, but there are lots of spaces and multiple words per observation.
data mylib.textimport;
infile "../TEXTTEST.txt" dlm="' ', ',', '.'";
input __text__ $char300. ;
run;
Could anyone help me with how to put every new word into a new observation?
Thanks in advance. :)
If you want to read the file "word by word", just tell SAS which characters you consider to be delimiters and use the FLOWOVER option to read the words. So if you wanted to treat spaces, commas, periods, quotes, tabs, linefeeds and carriage returns as word delimiters, your program could look like this.
data want;
dlm=' ,."''' || '090A0D'x;
infile "../TEXTTEST.txt" dlm=dlm flowover;
length word $300 ;
input word @@ ;
run;
I would TRANSLATE the characters you want out into spaces and then loop over the remaining, outputting each word.
Here's some test data
data have;
format line $200.;
input ;
line = _infile_;
datalines;
This is some, test.text
How,about,this cheesey.cheese?
;
run;
Here is a DATA Step to loop through and output what you are looking for:
data want(keep=word);
format word $200.;
set have;
line = translate(line," ",",."); /*convert , and . to space*/
n = countw(line);
/*Loop through the words and output*/
do i=1 to n;
word = scan(line,i);
output;
end;
run;
TRANSLATE converts the characters in the 3rd argument into the characters in the 2nd. Think of the string as an array: it does this replacement for each value in the array.
As this example shows, you probably want to think about other punctuation.
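One way to cover other punctuation without listing every character, as a minimal sketch: COUNTW and SCAN both accept a modifiers argument, where 'p' adds punctuation marks and 's' adds whitespace characters to the delimiter list. The dataset name want_all is made up for illustration; have and line come from the test data above.

```sas
data want_all(keep=word);
  length word $200;
  set have;
  /* 'p' = punctuation, 's' = space characters; no explicit delimiter list */
  do i = 1 to countw(line, , 'ps');
    word = scan(line, i, , 'ps');
    output;
  end;
run;
```

Apostrophes and hyphens count as punctuation here, so a word such as "don't" would be split in two, which may or may not be what you want.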
I have a sas dataset with the variable called response, which has the following records:
and so on.
These are all the same record; I need to remove the trailing characters wherever they are special, and return the records as
When I use a compress function, it removes the asterisk in between and returns:
TrailerOffer which is not what I want.
Can somebody please help me code this? I need to remove the last characters if these are special.
You can use regular expression character classes to specify the 'special' trailing characters. In this example replacement pattern, any characters that are not letters or numbers will be removed. \s* is needed before $ because SAS character variables will have trailing spaces when their values are passed to the regex engine.
Learn more about regular expression class groupings in the SAS documentation
data have;
length response $20;
input response;
datalines;
Trailer*Offer
Trailer*Offer*
Trailer*Offer???
Trailer*Offer?...
Offer#1
Offer#1?
Offer#1*?
;
data want;
set have;
response = prxchange ('s/[^a-z0-9]+\s*$//i', 1, response);
run;
Using PRXCHANGE:
prx=prxchange("s/^\W*(.*?)\W*$/$1/",-1, response);
will remove leading and trailing special (non-word) characters.
data have;
length response $20.;
response="Trailer*Offer";output;
response="Trailer*Offer*";output;
response="Trailer*Offer???";output;
response="Trailer*Offer?...";output;
run;
data _null_;
set have;
prx=prxchange("s/^\W*(.*?)\W*$/$1/",-1, response);
put prx;
run;
Output written to the log:
Trailer*Offer
Trailer*Offer
Trailer*Offer
Trailer*Offer
I have a set of data of gym membership starting with an ID, then 119 in-time columns and 119 out-time columns. The in-time and out-time columns are in the syntax of ##:##:## and I am trying to input the variables in the simplest way. Rather than writing [ID in1 $ in2 $ inX $ out1 $ out2 $ outX $], is there a way to easily input hundreds of columns in a simple line of code?
Just use variable lists. Let's assume your data file is comma delimited.
data want ;
infile 'myfile.csv' dsd truncover ;
input id (in1-in119 out1-out119) (:time8.) ;
format in1-in119 out1-out119 time8.;
run;
PROC IMPORT can be an alternative solution.
It guesses each column's data type automatically.
The statement looks like the following:
proc import
datafile = "myfile.csv"
out = work.destination_table
dbms = csv replace
;
run;
Is there a way to override the default behavior of character length being set by the first value encountered and instead set all character data for a session to have the same fixed length?
Much of the data I work with daily is of a similar format/structure, such as a .csv or .txt. I find that using an infile statement with list input works well for importing this kind of data.
For instance, suppose I have a text file myData.txt.
myData.txt
string1 string2 num1 string3 num2
hello there 12 this 33
is some 45 sample 2
data for 8 you 12
I would then use code like this to bring it in.
%let dataDirectory = C:\path\to\file;
%let dataFile = myData.txt;
filename myFile "&dataDirectory.\&dataFile.";
data in_data;
infile myFile dsd dlm = '09'x firstobs = 2;
length
string1 $ 50.
string2 $ 50.
num1 8
string3 $ 50.
num2 8
;
input
string1 $
string2 $
num1
string3 $
num2
;
run;
filename myFile clear;
I find that it is important to have the length statement so that none of my data is truncated. Since the data sets are not particularly large, it makes sense to set all the character lengths to some fixed amount which will guarantee no truncation occurs. I find that the default numeric length is sufficient.
The problem with this approach is that any time a variable name needs to be changed etc, I need to make an alteration in both the length and input statements. This gets to be a nuisance, especially when there are 150 variables, and I'm hoping it is unnecessary.
List input seems appropriate to my needs. I could use column input, but then I'd have to fiddle around with defining column widths. I can't think of a way to make that a simple process when handling 150 columns. Being able to globally define all character lengths, as with the default 8 for numeric, would solve my problem. Is this possible? Or, maybe you have a better method for bringing in such data as myData.txt?
You could use a macro variable to store your default length. Then you can change it in one place.
You can use a variable list in your INPUT statement so that you don't need to worry about typing variable names more than once.
%let dataDirectory = C:\path\to\file;
%let dataFile = myData.txt;
%let defLength = $80 ;
data in_data;
infile "&dataDirectory/&dataFile" dsd dlm='09'x firstobs=2 truncover ;
length
string1 &defLength
string2 &defLength
num1 8
string3 &defLength
num2 8
;
input (_all_) (:) ;
run;
You can specify how many rows SAS should use to determine field attributes with the "guessingrows" option using proc import. That way proc import will take care of any number of new variables you may have.
proc import out=importeddata
datafile= "/examplepath/file.txt"
dbms=dlm replace;
delimiter='09'X;
getnames=YES;
guessingrows=5000;
run;
If you keep your LENGTH statement in the proper order, you can use a SAS variable list in the INPUT statement. You don't need the $ sign in the INPUT statement. If some variables need informats, use an INFORMAT statement to associate them.
data in_data;
infile myFile dsd dlm = '09'x firstobs = 2;
length
string1 $ 50.
string2 $ 50.
num1 8
string3 $ 50.
num2 8
;
input (string1--num2)(:);
run;
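As a rough sketch of the INFORMAT point, suppose the file had a sixth column holding dates like 2020-01-31. That column, and the name visit_date, are assumptions for illustration and are not part of myData.txt. The informat is attached once, and the variable list in the INPUT statement still covers every column:

```sas
data in_data;
  infile myFile dsd dlm='09'x firstobs=2 truncover;
  /* the order here defines the order used by the variable list below */
  length string1 string2 $50 num1 8 string3 $50 num2 8 visit_date 8;
  informat visit_date yymmdd10.;   /* hypothetical date column */
  format   visit_date yymmdd10.;
  input (string1--visit_date) (:);
run;
```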
I would like to create a new variable in SAS which takes the value 1 if an observation in the variable "TEXT" contains 8 numbers. The problem is, that TEXT is a character variable. Is it possible to make some kind of a format search in SAS?
I assume by '8 numbers' you actually mean 8 digits. For 8 separate numbers, that would be different.
So something like the code below might help.
The modifier 'kd' meaning KEEP DIGITS in COMPRESS function does the magic here:
data indata;
length TEXT $20;
input TEXT;
datalines;
a
123
12345678
A12345678
;
run;
data outdata;
set indata;
length TEXT_DIGITS $20 _8_DIGIT_INDICATOR 3;
TEXT_DIGITS = compress(TEXT, , 'kd');
if length(TEXT_DIGITS)=8 then _8_DIGIT_INDICATOR = 1;
run;
Adjust the logic as you need - e.g. if no other character in input value is allowed or something else.
Also functions like ANYDIGIT, NOTDIGIT might be useful.
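For instance, if the value must consist of exactly 8 digits and nothing else, a sketch using NOTDIGIT could look like this. The names outdata_strict and EXACTLY_8_DIGITS are made up for illustration; indata and TEXT come from the example above.

```sas
data outdata_strict;
  set indata;
  /* LENGTHN ignores trailing blanks; NOTDIGIT returns 0 when no non-digit is found */
  EXACTLY_8_DIGITS = (lengthn(TEXT) = 8 and notdigit(trim(TEXT)) = 0);
run;
```

With the sample data, only the row 12345678 gets 1; A12345678 gets 0 because of the leading letter, and the indicator is 0 rather than missing for the other rows.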
Hi, I am building a dataset, but the data I am merging is in different formats.
In the Excel sheet I import, the variable is numeric 8, and in the other 2 datasets I'm merging with it is character 20, so I want to change the numeric 8 to char 20.
How can I change the variable acctnum to char 20? (I also want to keep this as its name, as I presume a new variable will be created.)
data WORK.T82APR;
set WORK.T82APR;
rename F1 = acctnum f2 = tariff;
run;
proc contents data=T82APR;
run;
While this thread is already dead, I thought I'd weigh in and answer why the 14-digit conversion ended up in E notation.
Typically, or rather, unless otherwise specified, numeric values in SAS are formatted with BEST12. As such, when a numeric value is longer than 12 characters (including any commas and periods), BEST12 chooses E notation as the best way to format the value.
The INPUT function in that case receives the formatted value put(acctnum, BEST12.). There would have been 2 ways around it.
Either use
input(put(acctnum, 14.), $20.);
Or change the format of the variable using a FORMAT statement (directly in a data step or with PROC DATASETS, as below). This has the added benefit that if you open the table in SAS, you will see the 14 digits and not the scientific-notation value.
proc datasets library=work nolist;
modify dsname;
format acctnum 14.;
run;
quit;
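To see the difference between the two formats, a small sketch like the following can be run. The 14-digit value and the variable names as_best and as_14 are made up for illustration:

```sas
data _null_;
  acctnum = 12345678901234;          /* made-up 14-digit value */
  length as_best as_14 $20;
  as_best = put(acctnum, best12.);   /* too wide for 12 characters: E notation */
  as_14   = put(acctnum, 14.);       /* all 14 digits kept */
  put as_best= as_14=;               /* writes both values to the log */
run;
```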
Vincent
Try this:
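data WORK.T82APR ;
set WORK.T82APR;
length acctnum $20;
/* F1 is numeric, so use a numeric format; STRIP left-aligns the result */
acctnum = strip(put(F1, 20.));
rename f2 = tariff;
drop F1;
run;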
Ok, I didn't pay attention to your own rename statement, so I adjusted my answer to reflect that now.