SAS raw data read--line pointers with inconsistent lines and embedded headers - file-io

The raw data I want to read not only runs across several different lines but also doesn't have the same number of lines per record. To further complicate things it also has headers that appear in the middle of the file after every page that ruin everything.
Time: 1:47pm Item Master Report For 06/06/2013 Report: GMRIMMSB
Item Type: Nonstock
Item Asset Inven Dsp ---Order--- ---Primary---- Substute Contract Hazd Count
Stat Class Class Unt Unit Conv Loc Vendor Manufacturer Nbr Item Nbr Number Flag Cycle
------------------------------------------------------------------------------------------------------------------------------------
ITEM 20049 TEST PNEUMONIA S LATEX ZL22 (30859001)
A 0173 6 PK PK 1 NSL 2431 R30859001
Vendor 1: 2431 FISHER SCIENTIFIC COMPANY 2: 2658 REMEL
3: 536 ABBOTT LABS - DIAGNOSTIC DIVISION 4: 1404 MUREX DIAGNOSTICS INC.
ITEM 20051 ANTIGEN BACTER. WELLCOGEN ZL26 B1901-51
A 0173 6 PK PK 1 NSL 2431 30859602
Vendor 1: 2431 FISHER SCIENTIFIC COMPANY 2: 3804 CARDINAL HEALTH-ALLEGIANCE
3: 2658 REMEL 4: 536 ABBOTT LABS - DIAGNOSTIC DIVISION
5: 1404 MUREX DIAGNOSTICS INC.
ITEM 20053 FILM DUPLICATING 10X12
I 0173 14 BX BX 1 NSX 1335 112010
Vendor 1: 1335 AGFA CORPORATION
ITEM 20055 FILM HTU 10 X 12
I 0173 14 BX BX 1 NSX 1335 094010
Vendor 1: 1335 AGFA CORPORATION
ITEM 20056 FILM HTU 8 X 10
I 0173 14 BX BX 1 NSX 1335 094008
Vendor 1: 1335 AGFA CORPORATION
ITEM 20057 SOL AXSYM FLUIDIES CHECK (09A3401)
A 0173 119 BX BX 1 NSL 536
Vendor 1: 536 ABBOTT LABS - DIAGNOSTIC DIVISION
ITEM 20058 FILM DUPLICATING 8 X 10
I 0173 14 BX BX 1 NSX 1335 112008
Vendor 1: 1335 AGFA CORPORATION
ITEM 20059 FILM HTU 14 X 17
I 0173 14 BX BX 1 NSX 1335 094014
Vendor 1: 1335 AGFA CORPORATION
Item Asset Inven Dsp ---Order--- ---Primary---- Substute Contract Hazd Count
Stat Class Class Unt Unit Conv Loc Vendor Manufacturer Nbr Item Nbr Number Flag Cycle
------------------------------------------------------------------------------------------------------------------------------------
ITEM 20060 FILM HTU 30 X 35
I 0173 14 BX BX 1 NSX 1335 094030
Vendor 1: 1335 AGFA CORPORATION
ITEM 20061 FILM HTU 14 X 14
I 0173 14 BX BX 1 NSX 1335 094001
Vendor 1: 1335 AGFA CORPORATION
Here is the code I have been working with (in SAS Studio):
libname mylib '/folders/myfolders/';
data myfile;
length itm $ 4 itemnum 5 itemdesc $ 40 inac $ 2 assetcl $ 4 invcl 3 dspunit $ 2
ordunit $ 2 convr 4 loc $ 4 vndnum 4 manufnum $ 20 vendinfo $ 80;
infile '/folders/myfolders/ItemstrSM.txt' missover;
input #1 itm $ itemnum itemdesc $ &
#2 inac $ assetcl $ invcl dspunit $ ordunit $ convr loc $ vndnum manufnum
#3 vendinfo & $ ;
run;
proc print data=myfile noobs;
run;

If you do not need to process files like this routinely, you can simply fix the file in a text editor.
Replace each page header with nothing.
Replace '\nITEM' with '###ITEM'.
Replace '\n' with a space.
Replace '###' with '\n'.
Now you have one entry per line in the text file.
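If the file does need to be handled more than once, the same four replacements can be scripted. The following is a minimal Python sketch, not part of the original answer; the function name `flatten_report` and the header patterns are assumptions based on the sample shown in the question and would need adjusting for the real report.

```python
import re

# Hypothetical patterns for the page-header lines seen in the sample report.
HEADER_PATTERNS = [r"^Time:", r"^Item ", r"^Stat ", r"^-{5,}"]

def flatten_report(text, header_patterns=HEADER_PATTERNS):
    """Collapse each ITEM record onto one line, mirroring the editor steps."""
    # Step 1: drop every line that looks like a page header.
    kept = [ln for ln in text.splitlines()
            if not any(re.match(p, ln) for p in header_patterns)]
    body = "\n".join(kept)
    # Step 2: mark the start of each record.
    body = body.replace("\nITEM", "###ITEM")
    # Step 3: join the remaining line breaks with spaces.
    body = body.replace("\n", " ")
    # Step 4: turn the markers back into line breaks.
    return body.replace("###", "\n")
```

The patterns are case-sensitive on purpose, so the header line starting with "Item " is removed while record lines starting with "ITEM " are kept.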

One workaround for the repeated header rows is to put something along the lines of the following before your main input statement:
input @;
if _infile_ = "Header row text" then delete;
The trailing @ loads the whole line into _infile_ without populating any variables and holds it for processing by your main input statement (provided that it isn't a header row).

Assuming the formatting you've shown us continues, and that there are no more than 6 vendors for each item:
data test(drop=itm);
  length itm $4 itemnum 8 itemdesc $132
         inac $1 assetcl invcl 8 dspunit ordunit $2 convr 8 loc $3 vndnum 8 manufnum $132;
  infile "c:\users\c41928\documents\egreadin.txt" missover lrecl=32767 truncover;
  /* read the first token and hold the line with a trailing @ */
  input itm $ @;
  if itm="ITEM" then
  do;
    input itemnum itemdesc $ /
          inac $ assetcl $ invcl dspunit $ ordunit $ convr loc $ vndnum manufnum /
          vendinfo1 $132.;
    /* peek at the next line only when a second vendor was listed */
    if index(vendinfo1,"2:") then
      input itm $ @;
    if itm="3:" then
    do;
      /* @1 moves back to column 1 so the whole held line is read */
      input @1 vendinfo2 $132.;
      input itm $ @;
      if itm = "5:" then
      do;
        input @1 vendinfo3 $132.;
        output;
      end;
      else output;
    end;
    else output;
  end;
  else delete;
run;
If there is an unknown, and possibly unbounded, number of vendors, then an iterative do loop would be more suitable. Otherwise, if there is a known number of vendors higher than 6, more do loops could be added.
Anyway, it's not very pretty but it works.
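To illustrate the looping idea for an unknown number of vendor lines, here is a rough Python sketch rather than SAS. It is not the answerer's method; the function name `parse_items`, the dictionary keys and the vendor regular expression are all assumptions derived from the sample in the question. Any line that is not an ITEM line, the detail line right after it, or a vendor line is treated as a header and skipped.

```python
import re

# Assumed vendor layout: "<seq>: <vendor number> <vendor name>", repeated on a line.
VENDOR_RE = re.compile(r"(\d+):\s+(\d+)\s+(.*?)(?=\s+\d+:\s+\d+\s|$)")

def parse_items(lines):
    """Group report lines into one dict per ITEM, ignoring embedded headers."""
    items = []
    current = None
    expect_detail = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("ITEM "):
            _, itemnum, itemdesc = line.split(None, 2)
            current = {"itemnum": int(itemnum), "itemdesc": itemdesc,
                       "detail": None, "vendors": []}
            items.append(current)
            expect_detail = True
        elif current is not None and expect_detail:
            # The line right after ITEM holds the status/class/unit fields.
            current["detail"] = line.split()
            expect_detail = False
        elif current is not None and (line.startswith("Vendor ")
                                      or re.match(r"\d+:\s", line)):
            text = line[len("Vendor "):] if line.startswith("Vendor ") else line
            for seq, num, name in VENDOR_RE.findall(text):
                current["vendors"].append((int(seq), int(num), name.strip()))
        # Everything else (page headers, rulers) is skipped.
    return items
```

Because vendors are appended to a list, the number of continuation lines per item does not need to be known in advance.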

Related

Stata : Change name of variables with values of another Variables

I have a dataset of variables looking like this:
Screenshot of the Dataset.
I would like, if it is possible, to label the other variables with the name of the country they are related to. For example, ggdy1 is the gross debt/GDP ratio for country 1, here Austria, while ggdy2 is the Gross Debt/GDP ratio for country 2, Belgium.
To avoid the back and forth from the dataset to the results or command windows, is there a way to label the different variables (ggdy, pby,...) automatically with the name of the suitable country?
I have 28 countries in my dataset and work on Stata 15.
I have to say I think this is the wrong question. Your data structure is analogous to this
* Example generated by -dataex-. For more info, type help dataex
clear
input float year str7 country1 float(y1 x1) str7 country2 float(y2 x2)
1990 "Austria" 12 16 "Belgium" 20 24
1991 "Austria" 14 18 "Belgium" 22 26
end
which is both logical and perverse for most Stata purposes. A simple reshape gets you to a structure that is much more useful for most analyses.
. reshape long country y x , i(year) j(which)
(note: j = 1 2)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 2 -> 4
Number of variables 7 -> 5
j variable (2 values) -> which
xij variables:
country1 country2 -> country
y1 y2 -> y
x1 x2 -> x
-----------------------------------------------------------------------------
. l
+----------------------------------+
| year which country y x |
|----------------------------------|
1. | 1990 1 Austria 12 16 |
2. | 1990 2 Belgium 20 24 |
3. | 1991 1 Austria 14 18 |
4. | 1991 2 Belgium 22 26 |
+----------------------------------+
which does no harm, but is not essential.
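For readers who want to see the wide-to-long idea outside Stata, here is a small plain-Python sketch of the same transformation. It is only an illustration under assumptions: the function name `reshape_long` and its parameters are made up for this example, and it handles only numeric suffixes on the given stub names.

```python
def reshape_long(rows, stubs, id_key="year", j_name="which"):
    """Turn one wide row per id into one row per (id, suffix)."""
    long_rows = []
    for row in rows:
        # Collect the numeric suffixes present for any of the stub names.
        suffixes = sorted({
            int(key[len(stub):])
            for key in row
            for stub in stubs
            if key.startswith(stub) and key[len(stub):].isdigit()
        })
        for j in suffixes:
            new_row = {id_key: row[id_key], j_name: j}
            for stub in stubs:
                new_row[stub] = row.get(f"{stub}{j}")
            long_rows.append(new_row)
    return long_rows

wide = [
    {"year": 1990, "country1": "Austria", "y1": 12, "x1": 16,
     "country2": "Belgium", "y2": 20, "x2": 24},
    {"year": 1991, "country1": "Austria", "y1": 14, "x1": 18,
     "country2": "Belgium", "y2": 22, "x2": 26},
]
long_data = reshape_long(wide, stubs=["country", "y", "x"])
```

Each resulting row carries the country name alongside its values, which is why separate per-country variable labels become unnecessary.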
P.S. What you ask for is programmable too, something like
foreach v of var ggdy* {
local suffix = substr("`v'", 5, .)
local where = country`suffix'[1]
label var `v' "ggdy `where'"
label var pby`suffix' "pby `where'"
label var cby`suffix' "cby `where'"
label var fby`suffix' "fby `where'"
}

How to define input variable when datalines have spaces for a variable

I want to have two different strings in the same dataset.
I tried to separate the values with "" but it didn't work. Ideally I don't want to write the "" at all, only the strings inside. I searched a lot but did not find anything related.
Could you guys please help me to get my goal?
data ecl.dim_produtos;
input id_produt id_departament id_order id_business id_portfolio initials $4. long_name $40. short_name $30.;
datalines;
1 1 10201 4 1 PZC "Puzzle Crédito" "Puzzle Crédito"
2 1 10202 4 1 PZR "Puzzle Reestruturados" "Reestruturados"
3 2 10207 30 1 DBO "Banca Online" "Banca Online"
4 3 10210 60 1 CLB "Colaboradores" "Colaboradores"
5 1 10203 4 1 PZF "Puzzle Formação" "Code Academy"
6 4 10205 5 1 HIP "Hipoteca Inversa" "Hip. Inversa"
7 5 10206 25 1 EMP "DEMP" "DEMP"
8 6 10208 45 1 NCO "NewCo" "NewCo"
9 6 10211 70 1 LDRC "Lendrock" "Lendrock"
10 4 10209 50 1 OTI "Otima Provision" "Otima"
11 6 10001 1 1 LDC "Lendico" "Lendico"
12 6 10007 1 1 MIBL "Market Invoice BL - EUR" "Market Invoice BL"
13 6 10003 1 1 CRS "CreditShelf" "CreditShelf"
14 6 10005 1 1 FUN "Funding Circle" "Funding Circle"
15 6 10002 1 1 RAI "Raize" "Raize"
16 4 10204 5 1 FLX "Flex" "Flex"
17 6 10101 2 1 AUX "Auxmoney" "Auxmoney"
18 6 10009 2 1 UPG "Upgrade - EUR" "Upgrade"
19 6 10104 2 1 PRO "Prodigy Finance" "Prodigy"
20 6 10102 2 1 FEL "Fellow Finance" "Fellow"
21 6 10008 1 1 ASZ "Assetz - EUR" "Assetz"
22 6 10010 2 1 LDB "Lendable - EUR" "Lendable"
23 6 10004 1 1 LIN "Linked Finance" "Linked"
24 6 10103 2 1 LDR "Lendrock" "Lendrock"
25 6 10105 3 1 EDX "Edebex" "Edebex"
26 6 10006 1 1 CAM "Camomille - FC" "Camomille"
27 6 10106 3 1 MIN "Market Invoice - EUR" "Market Invoice"
90 0 99991 102 2 DIV "Dívida Pública - EUR" "Dívida Pública"
91 6 99992 103 2 CRP "Obrigações Corporate - EUR" "Obrigações Corporate"
92 0 99990 101 3 SDA "Disp. Aplicações OIC - EUR" "Disp. Aplicações OIC"
9999 0 999999 999 99 TOT "Total Patrimonial - EUR" "Total Patrimonial"
;
run;
The most reliable approach would be to:
define the variables of the INPUT statement using a length or attrib statement.
use INFILE options to specify how the data lines are parsed by INPUT
take the $ out of the INPUT statement
Example (leave data lines as-is):
length
id_produt id_departament id_order id_business id_portfolio 8
initials $4
long_name $40
short_name $30
;
infile cards dsd dlm=" ";
For the case of wanting data lines with double quotes, you will have to modify the data lines to separate the values with two or more spaces and use the & argument for the variables in a list-style INPUT statement.
You could also separate the values in the data lines with a tab character and use DLM='09'x. You might have some trouble seeing and entering tabs using the SAS editor.
First make sure to use the : modifier if you want to include informat specifications in the INPUT statement to avoid switching between list and formatted input modes.
If you can ensure that you have at least two spaces between the values (and that the values themselves do NOT have adjacent spaces inside them) you can use the & modifier.
data test;
input id_produt id_departament id_order id_business id_portfolio
initials &:$4. long_name &:$40. short_name &:$30.
;
datalines;
1 1 10201 4 1 PZC  Puzzle Crédito  Puzzle Crédito
2 1 10202 4 1 PZR  Puzzle Reestruturados  Reestruturados
;
Or keep the quotes and make sure there is exactly one space between each value (and don't indent the datalines!) and add the DSD option.
data test;
infile datalines dsd dlm=' ' truncover ;
input id_produt id_departament id_order id_business id_portfolio
initials :$4. long_name :$40. short_name :$30.
;
datalines;
1 1 10201 4 1 PZC "Puzzle Crédito" "Puzzle Crédito"
2 1 10202 4 1 PZR "Puzzle Reestruturados" "Reestruturados"
;
Or use a different delimiter, with or without the DSD option.
data test;
infile datalines dsd dlm='|' truncover ;
input id_produt id_departament id_order id_business id_portfolio
initials :$4. long_name :$40. short_name :$30.
;
datalines;
1|1|10201|4|1|PZC|Puzzle Crédito|Puzzle Crédito
2|1|10202|4|1|PZR|Puzzle Reestruturados|Reestruturados
;
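The behaviour of the quoted, single-space-delimited variant can be mimicked in Python to check a set of data lines before handing them to SAS. This is only a sketch under assumptions: `parse_datalines` and the field names are chosen to match the question, and Python's csv module stands in for the DSD option, not for SAS itself.

```python
import csv
import io

FIELDS = ["id_produt", "id_departament", "id_order", "id_business",
          "id_portfolio", "initials", "long_name", "short_name"]

def parse_datalines(text):
    """Split space-delimited lines, keeping quoted strings together."""
    reader = csv.reader(io.StringIO(text), delimiter=" ", quotechar='"')
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        if len(values) != len(FIELDS):
            # Extra spaces produce empty fields, much like DSD would.
            raise ValueError(f"unexpected field count: {values}")
        row = dict(zip(FIELDS, values))
        for name in FIELDS[:5]:
            row[name] = int(row[name])  # the first five columns are numeric
        rows.append(row)
    return rows
```

A line with two consecutive spaces raises an error here, which is a quick way to spot the lines that would be misread when exactly one space is required.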

mathematical operations in a text file using awk

I have a text file that looks like the small example below.
The first line of each group is an ID, and under each ID there are lines whose 1st column is a 3-letter string and whose 2nd column is a number. The 2 columns are tab-separated.
ID1
AAA 17
TTA 3
ATA 6
ATC 12
AAG 9
ACA 13
ATG 21
ACC 13
ACG 5
AAT 12
AGA 11
ATT 22
AGC 11
TAA 3
ACT 8
TAC 12
ID2
AAA 10
AAC 7
AAG 4
ACA 3
ACC 1
ATG 6
ACG 1
Below I also have a list of 3-letter strings. For each ID, I want the ratio of the TTA count to the total count of the 3-letter strings under that ID that are also present in the list below.
ATT
ATC
ATA
CTT
AAC
CTA
CTG
TTA
TTG
GTT
GTC
GTA
GTG
the output for this example would look like this:
ID1 0.065
ID2 0
For ID2 the ratio is 0 because there is no TTA, and for ID1 it is 0.065 because 3 divided by 46 is equal to 0.065. For each ID, I only took the 3-letter strings that are common to the above list and the rows below that ID. Again, the 2 columns are tab-separated.
I am quite new to the awk programming language. I wrote the following piece of code, but it does not return what I want. Would you please help me fix it?
3_letter_list= [ATT, ATC, ATA, CTT, AAC, CTA, CTG, TTA, TTG, GTT, GTC, GTA, GTG]
awk -F "\t" '{if($1==3_letter_list), (if $1=="TTA" & ratio=$2/$1)}' filename.txt > out.txt
ID3
AAA 2
AAC 8
ATA 1
ATC 20
AAG 26
ACA 6
ATG 11
ACC 16
ACG 7
AAT 2
ATT 4
AGC 18
TAA 1
TAC 8
ACT 3
AGG 1
TTC 20
TCA 1
TCC 8
TTG 6
TCG 4
AGT 5
TAT 3
GAC 18
GTC 12
TTT 6
TGC 7
GAG 31
TCT 1
GCC 19
GTG 21
TGG 6
GCG 8
CAC 12
GAT 6
CTC 12
GGA 2
CAG 22
GGC 25
CTG 52
CCC 15
GCT 3
GGG 6
CCG 4
CAT 4
CTT 2
CGC 18
GGT 4
CCT 3
CGG 13
Awk solution:
Assuming the list of 3-letter groups is saved in groups_list.txt:
awk 'NR==FNR{ a[$1]; next }
/^ID[0-9]/{
    # print the previous group; guard against an empty sum to avoid division by zero
    if (id) { printf "%s %.4f\n", id, (sum ? tta/sum : 0); id=tta=sum="" }
    id = $1; next
}
$1 == "TTA"{ tta = $2 }
$1 in a{ sum += $2 }
END{ printf "%s %.4f\n", id, (sum ? tta/sum : 0) }' groups_list.txt file.txt
The output:
ID1 0.0698
ID2 0.0000
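The arithmetic behind that output can be checked with a short Python sketch of the same logic. The function name `tta_ratios` is made up for this illustration, and it assumes that a line with a single field is an ID line. For ID1 the list members present are TTA 3, ATA 6, ATC 12 and ATT 22, which sum to 43, so the ratio is 3/43.

```python
def tta_ratios(lines, wanted, target="TTA"):
    """Return {id: target_count / sum of counts for codes in `wanted`}."""
    wanted = set(wanted)
    totals = {}      # id -> [target_count, total_count], in file order
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if len(fields) == 1:
            # Assumption: a single-field line starts a new ID group.
            current = fields[0]
            totals[current] = [0, 0]
        elif current is not None:
            code, count = fields[0], int(fields[1])
            if code == target:
                totals[current][0] = count
            if code in wanted:
                totals[current][1] += count
    # A group with no listed codes gets 0.0 instead of a division error.
    return {k: (t / s if s else 0.0) for k, (t, s) in totals.items()}
```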

How do you mark unique occurrences in a pattern given that value are unique when occurring simultaneously and not when they come separately? [closed]

Suppose my data looks like this
student article.bought
1 A pen
2 B pencil
3 V book
4 A pen
5 A inkbottle
6 B pen
7 B pencil
8 B pencil
9 V book
10 Z marker
11 A inkbottle
12 V book
13 V pen
14 V book
I need unique occurrences of articles probably in a different column like this
student article.bought Occurences
1 A pen 1
2 B pencil 1
3 V book 1
4 A pen 1 # as A is taking a pen again
5 A inkbottle 2 # 'A' changed from pen to ink bottle
6 B pen 2
7 B pencil 3 # though B took pencil before, this is different as he took a pen in between
8 B pencil 3
9 V book 1
10 Z marker 1
11 A inkbottle 2
12 V book 1
13 V pen 2
14 V book 3
In R, we can find changes in a student's selection by finding the difference, diff, of each subsequent value. When we take the cumulative sum, cumsum, of that logical index we get a running count of occurrences.
In the second line we coerce the factor variable article.bought to numeric and run the function from the first line using ave to group the function f by student.
f <- function(x) cumsum(c(F, diff(x) != 0)) + 1
df$Occurences <- with(df, ave(as.numeric(article.bought), student, FUN=f))
df
# student article.bought Occurences
# 1 A pen 1
# 2 B pencil 1
# 3 V book 1
# 4 A pen 1
# 5 A inkbottle 2
# 6 B pen 2
# 7 B pencil 3
# 8 B pencil 3
# 9 V book 1
# 10 Z marker 1
# 11 A inkbottle 2
# 12 V book 1
# 13 V pen 2
# 14 V book 3
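The same per-student running count can be written without diff and cumsum. The following Python sketch is only an illustration of the logic; the function name `occurrence_counts` is an assumption. It remembers the last article seen for each student and increments that student's counter whenever the article changes.

```python
def occurrence_counts(rows):
    """rows: iterable of (student, article) pairs in their original order."""
    last_article = {}   # student -> most recent article
    counter = {}        # student -> current occurrence number
    result = []
    for student, article in rows:
        if student not in counter:
            counter[student] = 1            # first purchase by this student
        elif last_article[student] != article:
            counter[student] += 1           # the student switched articles
        last_article[student] = article
        result.append(counter[student])
    return result

purchases = [
    ("A", "pen"), ("B", "pencil"), ("V", "book"), ("A", "pen"),
    ("A", "inkbottle"), ("B", "pen"), ("B", "pencil"), ("B", "pencil"),
    ("V", "book"), ("Z", "marker"), ("A", "inkbottle"), ("V", "book"),
    ("V", "pen"), ("V", "book"),
]
counts = occurrence_counts(purchases)
```

Because the state is kept per student, the rows do not need to be sorted first, so the original row order is preserved.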
Create an additional column [Original Sort Order] and enumerate from 1 to ...
Sort the table by student / original sort order.
Enter =IF(A2=A1,IF(B2=B1,D1,D1+1),1) in D2 and copy down.
Convert column D to values (copy, paste as ... Values).
Restore the original sort order.
If this is more than a one-off, use the same tactic to create a VBA script.
A shot with SAS:
data try00;
length student article $20;
infile datalines dlm=' ';
input student $ article $;
datalines;
A pen
B pencil
V book
A pen
A inkbottle
B pen
B pencil
B pencil
V book
Z marker
A inkbottle
V book
V pen
V book
;
data try01;
set try00;
pos=_n_;
run;
proc sort data=try01 out=try02; by student pos article; run;
proc sort data=try02 out=stud(keep=student) nodupkey; by student; run;
data shell;
length occurrence 8.;
set try02;
if _n_>0 then delete;
run;
%macro loopstudent();
data _null_; set stud end=eof; if eof then call symput("nstu",_n_); run;
%do i=1 %to &nstu;
data _null_; set stud; if _n_=&i then call symput("stud&i",student); run;
data thisstu;
set try02;
where student="&&stud&i";
dummyart=lag(article);
retain occurrence 0;
if dummyart ne article then occurrence=occurrence+1;
else occurrence=occurrence;
drop dummyart;
run;
proc append base=shell data=thisstu; run;
%end;
proc sort data=shell out=final; by pos; run;
%mend loopstudent; %loopstudent();
The dataset "final" has the result.

SAS table with percentage attached

I am trying to create a matrix with both numeric and percentage result. I was given two tables
id cc
1 2
1 5
1 40
2 55
2 2
2 130
2 177
3 20
3 55
3 40
4 30
4 100
id Description
1 Dell
1 Lenovo
1 HP
2 Sony
2 Dell
2 Acer
2 Other
3 Fujitsu
3 Sony
3 HP
4 Apple
4 Asus
I have already created a table that looks like this, using the code below:
CC CC1 CC2… …CC177
1 264 5 0
2 0 132 6
…
…
177 2 1 692
data RESULT;
set id_CC;
by id;
retain CC1-CC177; /*CC range from 1 to 177*/
array CC_List(177) CC1-CC177;
if first.id then do i=1 to 177;
CC_List(i)=0;
end;
CC_List(CC)=1;
if last.id then output;
run;
ods output sscp=coocs;
ods select sscp;
proc corr data=RESULT sscp;
var CC1-CC177;
run;
/*proc print data=coocs;*/
/*run;*/
/**/
In other words, how many id have cc1 also have cc2..cc177..etc. Now, I am wondering if it's doable to add percentage next to each number. For instance if CC1*CC1=264 (100%) then CC1*CC2= 5/264=1.9%
Another table I am trying to create would have the description of each CC on the matrix. Each CC number stands for one brand: 2=Dell, 177=Other, etc.
If I want to change CC1, CC2, ... to character labels, how do I modify the arrays? Eventually, I would like my table to look like:
Description Dell Lenovo HP Sony Acer Other Fujitsu Sony
Dell 264 (100%)
Lenovo
HP 50 (10%)
Sony
Acer
Other
Fujitsu
Sony
In other words, how many people have dell also have acer, sony, other, etc?
The rename is a question that's been asked on here so I'll leave that one for now.
For the percentages you'll need to create a character variable. To calculate the percent, use the automatic variable _n_, which is the row number and therefore also points at the diagonal element that serves as the denominator for your calculation. Then use a concatenation function such as cats to create the variable in the format N(PP%).
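data want;
  set have;
  array cc(177) cc1-cc177;
  /* $20 leaves room for values such as 264(100.0%); the default length of 8 would truncate */
  array dd(177) $20 dd1-dd177;
  do i=1 to 177;
    /* cc(_n_) is the diagonal element of the current row */
    percent=cc(i)/cc(_n_);
    dd(i)=cats(cc(i), "(", put(percent, percent8.1), ")");
  end;
run;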
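To make the denominator choice concrete, here is a small Python sketch of the same calculation on a toy co-occurrence matrix. It is an illustration only: the function name `with_percentages` and the 3x3 numbers beyond the 264 and 5 quoted in the question are invented, and the label format adds a space before the parenthesis.

```python
def with_percentages(matrix):
    """Label each count with its share of the row's diagonal element."""
    labelled = []
    for r, row in enumerate(matrix):
        denom = row[r]  # diagonal element: ids that have this CC at all
        labelled.append([
            f"{value} ({value / denom:.1%})" if denom else f"{value} (n/a)"
            for value in row
        ])
    return labelled

# Hypothetical counts; only 264 and 5 come from the question.
cooc = [
    [264, 5, 0],
    [5, 132, 6],
    [0, 6, 692],
]
labels = with_percentages(cooc)
```

Here the first row reads "264 (100.0%)", "5 (1.9%)" and "0 (0.0%)", matching the 5/264 = 1.9% example in the question. This only works when the input has one row per CC in the same order as the columns, which is the shape of the SSCP output.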
Following Reeza's answer, I tried:
data RESULT_PRE;
set ID_CC;
by ID;
retain CC1-CC177;
array CC_LIST(177) CC1-CC177;
array DD_LIST(177) $ DD1-DD177;
if first.id then do i=1 to 177;
CC_LIST(i)=0;
end;
CC_LIST(CC)=1;
if last.id then output;
run;
data RESULT;
set RESULT_PRE;
array CC_LIST(177) CC1-CC177;
array DD_LIST(177) $ DD1-DD177;
do i=1 to 177;
percent=CC_LIST(i)/CC_LIST(_n_);
DD_LIST(i)=cats(CC_LIST(i), "(", put(percent, percent8.1), ")");
end;
run;
The log shows "Array subscript out of range at line xx column xx" and "ERROR 68-185: The function CC is unknown, or cannot be accessed."