SAS: Importing a fixed-width file with INFILE and a WHERE clause - file-io

I have two SAS data steps that import a text file called "Provider_Master.txt" using a fixed-width approach. The first data step pulls the entire "Provider_Master.txt" file into SAS as a dataset called "t_Pvdr_Mstr", and the next data step retains only the parts of the file that I need: records that have a "T" in the RecordType field. This approach works well enough but is very inefficient, because the file is over 2 GB. Once I strip away all the lines that don't have a "T" in the RecordType field, the data is less than 400 KB. It would make a lot more sense to import only the lines I need instead of importing the entire text file, but I'm not sure how to constrain the import to only pull down records whose RecordType is "T". Below is an example of the code I'm currently using:
data Grid.t_Pvdr_Mstr ;
infile "C:\SASData\Provider_Master.txt " truncover;
input
Insurer $ 1-4
RecordType $ 5-7
Actions $ 8-8
Pvdr $ 9-18
Type $ 19-19
Name $ 20-58
Bus_Type_Code $ 59-61
Bus_Date $ 62-69
Address1 $ 70-129
Address2 $ 130-189
City $ 190-219
State $ 220-221
Zip $ 222-230
County $ 231-260
Country $ 261-263
Phone $ 264-274;
run;
data Grid.t_Pvdr_Mstr;
set Grid.t_Pvdr_Mstr;
where RecordType = 'T';
run;

Add a subsetting IF statement to prevent the step from outputting unneeded records. That way you do not need a second data step with a WHERE statement.
data Grid.t_Pvdr_Mstr ;
infile "C:\SASData\Provider_Master.txt " truncover;
input
Insurer $ 1-4
RecordType $ 5-7
Actions $ 8-8
Pvdr $ 9-18
Type $ 19-19
Name $ 20-58
Bus_Type_Code $ 59-61
Bus_Date $ 62-69
Address1 $ 70-129
Address2 $ 130-189
City $ 190-219
State $ 220-221
Zip $ 222-230
County $ 231-260
Country $ 261-263
Phone $ 264-274;
if RecordType = 'T';
run;
An IF without a THEN clause effectively says "do not keep this line unless the condition is true." Since OUTPUT is implied at the end of each data step iteration, we only output records where RecordType = 'T'.
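If even parsing every field of every record is too slow, a further refinement is to read only the RecordType field first, hold the line with a trailing @, and parse the remaining fields only for matching records. This is an untested sketch of that approach against your layout:
data Grid.t_Pvdr_Mstr ;
infile "C:\SASData\Provider_Master.txt" truncover;
input @5 RecordType $3. @; /* read only columns 5-7 and hold the record */
if RecordType = 'T'; /* move on to the next record before parsing the rest */
input
Insurer $ 1-4
Actions $ 8-8
Pvdr $ 9-18
Type $ 19-19
Name $ 20-58
Bus_Type_Code $ 59-61
Bus_Date $ 62-69
Address1 $ 70-129
Address2 $ 130-189
City $ 190-219
State $ 220-221
Zip $ 222-230
County $ 231-260
Country $ 261-263
Phone $ 264-274;
run;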

Related

Sed/Awk: how to find and remove two lines if a pattern in the first line is being repeated; bash

I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with ">", followed by a line with a long string of the characters "-AGTCNR". The header has 10 fields separated by "|", whose first field is a unique identifier for each record, e.g. ">KEN096-15"; a record is termed a duplicate if it has the same identifier. Here is what a few simple records look like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTAGGAAATTGATTAGTACCTTTAATATT----CCGAAT---
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTTTCCT----TAAATAAT-----
Now I am trying to delete the repeats, such as the duplicate records of "ACRJP458-10" and "PMANL2431-12".
Using a bash script I have extracted the unique identifiers and stored the repeated ones in a variable "$duplicate_headers". Currently, I am trying to find any repeated instances of their two-line records and delete them as follows:
for i in "$#"
do
unset duplicate_headers
duplicate_headers=`grep ">" $1 | awk 'BEGIN { FS="|"}; {print $1 "\n"; }' | sort | uniq -d`
for header in `echo -e "${duplicate_headers}"`
do
sed -i "/^.*\b${header}\b.*$/,+1 2d" $i
#sed -i "s/^.*\b${header}\b.*$//,+1 2g" $i
#sed -i "/^.*\b${header}\b.*$/{$!N; s/.*//2g; }" $i
done
done
The final result (with thousands of records in mind) will look like:
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
You can do the whole job with one short awk command that keeps only the first occurrence of each record:
$ awk -F'[|]' 'NR%2{f=seen[$1]++} !f' file
>ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
>ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
>ASILO303-17|Dip|gs-Par|sp-Par vid|subsp-NA|co
-------TAAGATTCTGATTACTCCCCCCCTCTCTAACTCTTCTTCTTCTATAGTAGATG
>ASILO326-17|Dip|gs-Goe|sp-Goe par|subsp-NA|c
TAAGATTTTGATTATTACCCCCTTCATTAACCAGGAACAGGATGA---------------
>CLT100-09|Lep|gs-Col|sp-Col elg|subsp-NA|co-Buru
AACATTATATTTGGAATTT-------GATCAGGAATAGTCGGAACTTCTCTGAA------
>PMANL2431-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_
----ATGCCTATTATAATTGGAGGATTTGGAAAACCTTTAATATT----CCGAAT
>STBOD057-09|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
ATCTAATATTGCACATAGAGGAACCTCNGTATTTTTTCTCTCCATCT------TTAG
>TBBUT582-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
-----CCCCCTCATTAACATTACTAAGTTGAAAATGGAGCAGGAACAGGATGA
>TBBUT583-11|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
TAAGATTTTGACTCATTAA----------------AATGGAGCAGGAACAGGATGA
>AFBTB001-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGCTCCATCC-------------TAGAAAGAGGGG---------GGGTGA
>AFBTB003-09|Col|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
TAAGATTTTGACTTCTGC------CATGAGAAAGA-------------AGGGTGA
>AFBTB002-09|Cole|gs-NA|sp-NA|subsp-NA|co-Ethi|site-NA|lat_N
-------TCTTCTGCTCAT-------GGGGCAGGAACAGGG----------TGA
To run it on multiple files at once, use this to remove duplicates across all of the files:
awk -F'[|]' 'FNR%2{f=seen[$1]++} !f' *
or this to only remove duplicates within each file:
awk -F'[|]' 'FNR==1{delete seen} FNR%2{f=seen[$1]++} !f' *
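For reference, here is the one-liner written out with comments; like the one-liner, it assumes every record is exactly two lines (a header followed by its sequence):
awk -F'[|]' '
NR % 2 {            # odd-numbered lines are the ">" header lines
    f = seen[$1]++  # f is 0 the first time this identifier is seen, non-zero afterwards
}
!f                  # print the current line unless its record is a repeat
' file
On even-numbered (sequence) lines f keeps the value set on the preceding header, so each sequence line follows the fate of its header.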

How to compare two strings of a file against the strings of another file using AWK?

I have two huge files, and I need to count how many entries of file 1 exist in file 2.
File 1 contains two ids per line, source and destination, like below:
11111111111111|22222222222222
33333333333333|44444444444444
55555555555555|66666666666666
11111111111111|44444444444444
77777777777777|22222222222222
44444444444444|00000000000000
12121212121212|77777777777777
01010101010101|01230123012301
77777777777777|97697697697697
66666666666666|12121212121212
File 2 contains the valid id list, which will be used to filter file 1:
11111111111111
22222222222222
44444444444444
77777777777777
00000000000000
88888888888888
66666666666666
99999999999999
12121212121212
01010101010101
What I am struggling to achieve is a way to count how many lines of file 1 have both of their ids present in file 2. Only when both numbers on the same line exist in file 2 should the line be counted.
For example:
11111111111111|22222222222222 will be counted, because both ids exist in file 2; so will 77777777777777|22222222222222, for the same reason.
33333333333333|44444444444444 will not be counted, because 33333333333333 does not exist in file 2; the same goes for 55555555555555|66666666666666, whose first id does not exist in file 2.
So for the sample files above the count should be 6, and printing the count is enough; that is better than editing one of the files.
awk -F'|' 'FNR == NR { seen[$0] = 1; next }
seen[$1] && seen[$2] { ++count }
END { print count }' file2 file1
Explanation:
1) FNR == NR (the record number within the current file equals the overall record number) is only true for the first input file, which is file2 (the order of the arguments matters!). So for every line of file2, we record the id in seen.
2) For the remaining lines (file1, given second on the command line), if the |-separated fields (-F'|') number 1 and 2 were both seen in file2, we increment count by one.
3) In the END block, print the count.
Caveat: every unique number in file2 is loaded into memory, but this is also what makes it fast, since file2 does not have to be re-read for every line of file1.
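Run against the sample files above, it prints the expected count:
$ awk -F'|' 'FNR == NR { seen[$0] = 1; next }
seen[$1] && seen[$2] { ++count }
END { print count }' file2 file1
6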
I don't know how to do it in awk, but if you are open to a quick-and-dirty bash script that someone can help make efficient, you could try this:
searcher.sh
-------------
#!/bin/bash
file1="$1"
file2="$2"
# split each line of file1 at the pipe
while IFS='|' read -ra line; do
    # look for the 1st id in file2; if found, look for the 2nd id
    if grep -q "${line[0]}" "$file2"; then
        if grep -q "${line[1]}" "$file2"; then
            # print the line, since both ids were found in file2
            echo "${line[0]}|${line[1]}"
        fi
    fi
done < "$file1"
Usage
------
bash searcher.sh file1 file2
Result using your example
--------------------------
$ time bash searcher.sh file1 file2
11111111111111|22222222222222
11111111111111|44444444444444
77777777777777|22222222222222
44444444444444|00000000000000
12121212121212|77777777777777
66666666666666|12121212121212
real 0m1.453s
user 0m0.423s
sys 0m0.627s
That's really slow on my old PC. Each line of file1 launches up to two grep scans of all of file2, so the runtime grows with the number of lines in file1 times the size of file2.
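A sketch of one way to make it faster while staying in bash (assuming bash 4+ for associative arrays; not benchmarked): load the ids of file2 into memory once instead of re-scanning the file for every line of file1.
#!/bin/bash
file1="$1"
file2="$2"
# read the valid ids into an associative array once
declare -A valid
while IFS= read -r id; do
    valid[$id]=1
done < "$file2"
# then check both fields of each line with O(1) lookups
count=0
while IFS='|' read -r src dst; do
    if [[ -n ${valid[$src]} && -n ${valid[$dst]} ]]; then
        echo "$src|$dst"
        (( count++ ))
    fi
done < "$file1"
echo "count: $count"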

Line count in text files having multiple-line and single-line records

I am using the UTL_FILE utility in Oracle to get data out into CSV files, so I end up with a set of text files.
Case 1:
A sample of the output in the test1.csv file is:
"sno","name"
"1","hari is in singapore
ramesh is in USA"
"2","pong is in chaina
chang is in malaysia
vilet is in uk"
Now I count the number of records in test1.csv with this Linux command:
egrep -c "^\"[0-9]" test1.csv
and I get a record count of:
2 (ACCORDING TO LINUX)
If I instead count the records in the database with select count(*) from test; I get:
COUNT(*)
----------
2 (ACCORDING TO DATABASE)
Case 2:
A sample of the output in the test2.csv file is:
"sno","name","p"
"","",""
"","","ramesh is in USA"
"","",""
Now I count the number of records in test2.csv with the same command:
egrep -c "^\"[0-9]" test2.csv
and I get a record count of:
0 (ACCORDING TO LINUX)
If I instead count the records in the database with select count(*) from test; I get:
COUNT(*)
----------
2 (ACCORDING TO DATABASE)
Can anybody help me count the exact number of records in case 1 and case 2 using a single command?
Thanks in advance.
The columns differ between the two cases. To make it generic, I wrote a Perl script that counts the rows: it builds a regex from the header line and uses it to find complete records. I assume the first line always defines the number of columns.
#!/usr/bin/perl -w
open(FH, $ARGV[0]) or die "Failed to open file";
# Get the columns from the header and use them to construct the regex
my $head = <FH>;
my @col = split(",", $head);  # columns array
my $col_cnt = scalar(@col);   # column count
# Read the rest of the rows into one string
my $rows;
while (<FH>) {
    $rows .= $_;
}
# Create a regex based on the number of columns.
# E.g. for 3 columns the regex should be
# ".*?",".*?",".*?"
# which matches anything between " and "
my $i = 0;
while ($i < $col_cnt) {
    $col[$i++] = "\".*?\"";
}
my $regex = join(",", @col);
# /s to treat the data as a single line
# /g for global matching
my @row_cnt = $rows =~ m/($regex)/sg;
print "Row count:" . scalar(@row_cnt);
Just store it as row_count.pl and run it as ./row_count.pl filename
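For the case 1 sample above, it should report the same count as the database:
$ ./row_count.pl test1.csv
Row count:2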
egrep -c test1.csv doesn't have a search term to match against, so it will try to use test1.csv as the regular expression and search its standard input. I have no idea how you managed to get it to return 2 for your first example.
A usable egrep command that will actually produce the number of records in the files is egrep -c '^"[[:digit:]]*"' test1.csv, assuming your examples are accurate. (The leading ^ matters: without it the "," between the header fields would match " followed by zero digits followed by ", and the header line would be counted too.)
timp@helez:~/tmp$ cat test.txt
"sno","name"
"1","hari is in singapore
ramesh is in USA"
"2","pong is in chaina
chang is in malaysia
vilet is in uk"
timp@helez:~/tmp$ egrep -c '^"[[:digit:]]*"' test.txt
2
timp@helez:~/tmp$ cat test2.txt
"sno","name"
"1","hari is in singapore"
"2","ramesh is in USA"
timp@helez:~/tmp$ egrep -c '^"[[:digit:]]*"' test2.txt
2
Alternatively you might do better to add an extra value to your SELECT statement. Something like SELECT 'recmatch.,.,',sno,name FROM TABLE; instead of SELECT sno,name FROM TABLE; and then grep for recmatch.,., though that's something of a hack.
In your second example your lines do not start with " followed by a number; that's why the count is 0. You can try egrep -c "^\"([0-9]|\")" to also catch empty first-column values. But in fact it might be simpler to count all the lines and subtract 1 for the header row.
e.g.
count=$(( $(wc -l < test.csv) - 1 ))
(The redirect matters: without it wc prints the file name as well, which would break the arithmetic.)

Print three columns with awk, then sets of three columns

I want to create a number of files from a much larger file, dividing by columns. For example, the header of my larger file looks like this:
Name Chr Position SNP1A SNP1B SNP1C SNP2A SNP2B SNP2C SNP3A SNP3B SNP3C
and I want to create these files:
Name Chr Position SNP1A SNP1B SNP1C
Name Chr Position SNP2A SNP2B SNP2C
Name Chr Position SNP3A SNP3B SNP3C
I've been trying to use awk, but I'm a bit of a novice with it, so my command currently reads:
for ((i=1; i<=440; i++)); do awk -f printindivs.awk inputfile; done
Where printindivs.awk is:
{print $1 $2 $3 $((3*$i)+1) $((3*$i)+2) $((3*$i)+3))}
The output I'm getting suggests that my way of trying to get the sets of three is wrong: how can I do this?
Thanks
You can do this easily with just a simple awk script:
$ awk '{for(i=4;i<=NF;i+=3)print $1,$2,$3,$i,$(i+1),$(i+2) > ("out" int(i/3))}' file
The output files will be in out[1..n]:
$ cat out1
Name Chr Position SNP1A SNP1B SNP1C
$ cat out2
Name Chr Position SNP2A SNP2B SNP2C
$ cat out3
Name Chr Position SNP3A SNP3B SNP3C
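One design note: awk keeps every output file open, and with very many SNP groups (the loop in the question goes up to 440) some awk implementations run into their open-file limit. A variant that closes each file after writing avoids this; it is only a sketch, and since >> appends you should remove any old out* files before re-running:
awk '{
    for (i = 4; i <= NF; i += 3) {
        f = "out" int(i/3)
        print $1, $2, $3, $i, $(i+1), $(i+2) >> f
        close(f)   # keep at most one output file open at a time
    }
}' file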