awk: delete line from CSV if character length of second column is less than 12

I have a CSV which looks like this:
42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
I would like to delete rows where the character length of the second column is less than 12.
I think awk can do this:
awk -F , '$2=length>12' file >filout
but this seems wrong.. :(
I want to delete those lines to get:
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

$ awk -F, 'length($2)>=12' input_file
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

As your second field is contained within double quotes, you must use the double quote, rather than the comma, as the separator to determine the length of the second field:
awk -F\" 'length($2)>=12' file
If you just print the length of the second field, you will see what I mean. First using the comma as separator:
awk -F, '{print length($2)}' file
9
25
8
10
20
8
Second, using the double quote as the separator:
awk -F\" '{print length($2)}' file
7
79
6
8
80
6
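If it helps to see each field alongside its length, a quick sketch of the same idea, printing both:
awk -F\" '{print length($2), $2}' file
With the double quote as separator, $2 is the text between the first pair of quotes, so the count excludes the quotes themselves.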

Adding gsub(/[\200-\277]/, "&") properly measures the number of UTF-8 characters in byte mode, assuming well-formed UTF-8 input. Skip this part if you're using gawk in Unicode mode.
echo '42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."' |
To do it the CSV way without a proper parser (or an awk that's even Unicode-aware), measure the full row length, minus the string index position of the first comma, minus 2 more for the quotation marks:
mawk '(_+=++_)^_^_<(-_-- + length($--_) \
- index($_, FS) \
- gsub(/[\200-\277]/, "&"))' FS=','
To do it the double-quotes ("...") way:
gawk '((_+=++_)^_^_-_^_)<length($(--_+_--))' FS='"'
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."

Using Ruby with CSV.parse, on slightly modified data to show correct handling of commas within quotes:
% ruby -r 'csv' -ne 'puts $_ if CSV.parse($_).map { |i| i[1].length >= 12 }[0]' file
2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
23432423,Institut Cochin,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
Data
% cat file
42242,"France."
2343242,"Labor, atoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,Institut Cochin,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."

42242,"France."
2343242,"Laboratoire de Virologie, AP-HP, Hôpital Necker-Enfants malades, Paris, France."
234234,"brazil"
23432423,"colombia"
234234,"Université de Paris, Institut Cochin, INSERM U1016, CNRS UMR8104, Paris, France."
234234,"Paris."
The input does have , inside quoted values, therefore just setting the field separator to , would not work; observe that e.g. the 2nd column of the 2nd row would be "Laboratoire de Virologie. To counter that you might use FPAT, described in the More CSV chapter of the GNU AWK manual, as follows:
awk 'BEGIN{FPAT="([^,]*)|(\"([^\"]|\"\")+\")"}length($2)>12' file.csv
though keep in mind that the " characters are included in quoted fields, so you might need to adjust the value inside the comparison.
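For instance, a sketch that strips the surrounding quotes before comparing, so the threshold applies to the field content itself (assuming fields are either unquoted or plainly quoted):
awk 'BEGIN{FPAT="([^,]*)|(\"([^\"]|\"\")+\")"} {f=$2; gsub(/^"|"$/,"",f)} length(f)>=12' file.csv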

Related

Using AWK for best match replace

I have two files:
operators.txt # includes Country_code and Country_name
49 Germany
43 Austria
32 Belgium
33 France
traffic.txt # MSISDN and VLR_address (includes Country_code prefix)
123456789 491234567
123456788 432569874
123456787 333256987
123456789 431238523
I need to replace the VLR_address in traffic.txt file with Country_name from the first file.
The following awk command does that:
awk 'NR==FNR{a[$1]=$2;next} {print $1,a[$2]}' <(cat operators.txt) <(cat traffic.txt|awk '{print $1,substr($2,1,2)}')
123456789 Germany
123456788 Austria
123456787 France
123456789 Austria
but how to do it in case the operators file is:
49 Germany
43 Austria
32 Belgium
33 France
355 Albania
1246 Barbados
1 USA
when country_code is not a fixed length, and in some cases the best (longest) match should apply, e.g.
124612345 shall be Barbados
122018523 shall be USA
The sample input/output you provided isn't adequate to test with, as it doesn't include the cases you later described as problematic, but if we modify it to include a representation of those later statements:
$ head operators.txt traffic.txt
==> operators.txt <==
49 Germany
43 Austria
32 Belgium
33 France
1 USA
355 Albania
1246 Barbados
==> traffic.txt <==
123456789 491234567
123456788 432569874
123456787 333256987
123456789 431238523
foo 124612345
bar 122018523
then this may be what you want:
$ cat tst.sh
#!/usr/bin/env bash
awk '
NR==FNR {
    keys[++numKeys] = $1
    map[$1] = $2
    next
}
{
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        if ( index($2,key) == 1 ) {
            $2 = map[key]
            break
        }
    }
    print
}
' <(sort -k1,1rn operators.txt) traffic.txt
$ ./tst.sh
123456789 Germany
123456788 Austria
123456787 France
123456789 Austria
foo Barbados
bar USA
You obviously need to try a substring of the correct length.
awk 'NR==FNR { a[$2]=$1; next }
     { for (prefix in a) {
           p = a[prefix]; l = length(p)
           if ($2 ~ "^" p) { $2 = prefix; break }
       }
     } 1' operators.txt traffic.txt
Notice how Awk itself is perfectly capable of reading files without the help of cat. You also nearly never need to pipe one Awk script into another; just refactor to put all the logic in one script.
I swapped the key and the value in the NR==FNR block, but that is more of a stylistic change.
And, as always, the final 1 is a shorthand idiom for printing all lines.
Perhaps as an optimization, pull the prefixes into a regular expression so that you can simply match on them all in one go, instead of looping over them.
awk 'NR==FNR { a[$1]=$2; regex = regex "|" $1; next }
     FNR == 1 { regex = "^(" substr(regex, 2) ")" }   # trim first "|"
     match($2, regex) { $2 = a[substr($2, 1, RLENGTH)] } 1' operators.txt traffic.txt
The use of match() to pull out the length of the matched substring is arguably a complication; I wish Awk would provide this information for a normal regex match without the use of a separate dedicated function.
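One property this approach relies on: POSIX regex alternation is leftmost-longest, so the longer prefix should win even when a shorter one appears first in the alternation. A quick check with GNU awk (a sketch, not from the original answer):
awk 'BEGIN { if (match("124612345", "^(1|1246)")) print RSTART, RLENGTH }'
This prints 1 4, i.e. the match is 1246 rather than 1, which is exactly the best-match behaviour the question asks for.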

How to replace character from middle of a spool file?

I have a table with the below attributes:
NAME    COUNTRY  CONTINENT  ADD1    ADD2  ADD3  ADD4  ADD5  PINCODE
-------------------------------------------------------------------
Adam    USA      NA         NYC     NY                      xxxxxx
Rakesh  INDIA    ASIA       MUMBAI        MH                yyyyyy
Paul    UK       EU         LONDON  ENG                     zzzzzz
From this I have created a spool file, file.txt, in Linux, which holds the below values.
file.txt
Adam|USA|NA|NYC|NY||||xxxxxx
Rakesh|INDIA|ASIA|MUMBAI||MH|||yyyyyy
Paul|UK|EU|LONDON|ENG||||zzzzzz
This spool file will be processed in a loop, one line at a time.
For every line I want to store the required output in one variable, l_addresses.
Thus if we do echo "$l_addresses", it should give the required output for that line.
Required Output
NYC NY "" "" ""
MUMBAI "" MH "" ""
LONDON ENG "" "" ""
Using awk:
$ awk -F\| '{                     # set field separator to |
    for(i=4;i<=8;i++)             # loop over the wanted fields
        printf "%s%s", ($i=="" ? "\"\"" : $i), (i==8 ? ORS : OFS)  # replace nulls and delimiters
}' file
Output:
NYC NY "" "" ""
MUMBAI "" MH "" ""
LONDON ENG "" "" ""
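To capture each line's result in the l_addresses variable from the question, a minimal sketch of the surrounding shell loop (reusing the awk body above):
while IFS= read -r line; do
    l_addresses=$(printf '%s\n' "$line" |
        awk -F\| '{ for(i=4;i<=8;i++) printf "%s%s",($i==""?"\"\"":$i),(i==8?ORS:OFS) }')
    echo "$l_addresses"
done < file.txt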

How can I count the frequency of letters

I have a data like this
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to count how many of each letter there is; for a single sequence I count like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
cat input.txt | grep -v ">" | fold -w1 | sort | uniq -c
6 A
9 C
10 D
1 E
7 F
18 G
5 H
4 I
7 K
21 L
15 N
7 P
6 Q
11 R
16 S
18 T
7 V
8 W
7 Y
However, I want to calculate this for all sequences in a better and more efficient way, especially when the data is huge.
Counting characters in strings can easily be done with awk. To do this, you make use of the function gsub:
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.
sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.
source: Awk Posix Standard
The following two functions perform the counting in this way:
function countCharacters(str) {
    while (str != "") { c = substr(str,1,1); a[toupper(c)] += gsub(c,"",str) }
}
Or, if there tend to be a lot of equal consecutive characters, the following solution might shave off a couple of seconds:
function countCharacters2(str) {
    n = length(str)
    while (str != "") {
        c = substr(str,1,1); gsub(c"+","",str)
        m = length(str); a[toupper(c)] += n-m; n = m
    }
}
Below you find 4 implementations based on the first function. The first two run on a standard awk, the latter two on bioawk, an awk variant optimized for FASTA files:
1. Read the sequence and process it line by line:
awk '!/^>/ { s=$0; while (s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) } }
     END   { for (c in a) print c, a[c] }' file
2. Concatenate all sequences and process the result at the end:
awk '!/^>/ { s = s $0 }
     END   { while (s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
             for (c in a) print c, a[c] }' file
3. Same as 1, but using bioawk:
bioawk -c fastx '{ while ($seq!="") { c=substr($seq,1,1); a[c]+=gsub(c,"",$seq) } }
                 END { for (c in a) print c, a[c] }' file
4. Same as 2, but using bioawk:
bioawk -c fastx '{ s = s $seq }
                 END { while (s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
                       for (c in a) print c, a[c] }' file
Here are some timing results based on this fasta-file
OP : grep,sort,uniq : 47.548 s
EdMorton 1 : awk : 39.992 s
EdMorton 2 : awk,sort,uniq : 53.965 s
kvantour 1 : awk : 18.661 s
kvantour 2 : awk : 9.309 s
kvantour 3 : bioawk : 1.838 s
kvantour 4 : bioawk : 1.838 s
karafka : awk : 38.139 s
stack0114106 1: perl : 22.754 s
stack0114106 2: perl : 13.648 s
stack0114106 3: perl (zdim) : 7.759 s
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is POSIX compatible.
With any awk in any shell on any UNIX box:
$ cat tst.awk
!/^>/ {
    for (i=1; i<=length($0); i++) {
        cnt[substr($0,i,1)]++
    }
}
END {
    for (char in cnt) {
        print char, cnt[char]
    }
}
$ awk -f tst.awk file
A 107
N 67
P 107
C 41
Q 88
D 102
E 132
R 101
F 65
S 168
G 140
T 115
H 52
I 84
V 101
W 27
K 114
Y 30
L 174
M 39
or if you prefer:
$ awk -v ORS= '!/^>/{gsub(/./,"&\n"); print}' file | sort | uniq -c
107 A
41 C
102 D
132 E
65 F
140 G
52 H
84 I
114 K
174 L
39 M
67 N
107 P
88 Q
101 R
168 S
115 T
101 V
27 W
30 Y
Try this Perl solution for better performance.
$ perl -lne '
if( ! /^>/ ) { while(/./g) { $kv{$&}++} }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' learner.txt
A 107
C 41
D 102
E 132
F 65
G 140
H 52
I 84
K 114
L 174
M 39
N 67
P 107
Q 88
R 101
S 168
T 115
V 101
W 27
Y 30
$
One more solution using Perl, optimized for performance.
$ time perl -lne '
if( ! /^>/ ) { for($i=0;$i<length($_);$i++)
{ $x=substr($_,$i,1); $kv{$x}++ } }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa
A 2994088
C 1876822
G 1889305
N 30812232
T 3002884
a 4892104
c 3408967
g 3397589
n 140
t 4953284
real 0m15.922s
user 0m15.750s
sys 0m0.108s
$
Edit with further performance optimizations
All timings reported below are averages over 3-5 runs on a desktop, done at around the same time but swapped around to avoid pronounced caching effects.
Changing the C-style for loop to for my $i (0..length($_)) speeds the second solution from 9.2 seconds to 6.8 seconds.
Then, also removing a scalar ($x) at each operation, with
if (not /^>/) { for $i (0..length($_)) { ++$kv{ substr($_,$i,1) } } }
speeds this up to 5.3 seconds.
Further reducing variable use, by copying $_ and thus freeing up the loop to use $_
if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }
only helps a little, running at 5.2 seconds.
This compares with the awk solution given as kvantour 2 in the nice comparisons in kvantour's answer, at 6.5 seconds (on this system).
Of course none of this can be compared to the optimized bioawk (C-code?) program. For that we'd need to write this in C (which is not very hard using Inline C).
Note that removing a sub call (to substr) for every character by using
if (not /^>/) { ++$kv{$_} for split //; }
results in "only" a 6.4 seconds average, not as good as the above tweaks; this was a surprise.
These times are on a desktop with v5.16. On v5.24, on the same machine, the best-case (substr with no extra variables in the loop) time is 4.8 seconds while the one without the substr (but with split) is 5.8 seconds. It's nice to see that newer versions of Perl perform better, at least in these cases.
For reference and easy timing by others, complete code for the best run
time perl -lne'
if (not /^>/) { $l=$_; ++$kv{ substr($l,$_,1) } for 0..length($l) }
END { for ( sort keys %kv) { print "$_ $kv{$_}" }}
' chrY.fa
Not sure how much faster this would be, but if you try it please post your timings:
$ awk '!/^>/ {n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++}
END {for(k in c) print k,c[k]}' file | sort
A 6
C 9
D 10
E 1
F 7
G 18
H 5
I 4
K 7
L 21
N 15
P 7
Q 6
R 11
S 16
T 18
V 7
W 8
Y 7
This reports counts for the whole file, not line by line. As noted below, not all awks support splitting on the empty string.
Here are the timings of the three approaches:
$ time grep -v ">" filey | fold -w1 | sort | uniq -c >/dev/null
real 0m11.470s
user 0m11.746s
sys 0m0.260s
$ time awk '{n=split($0,a,""); for(i=1;i<=n;i++) c[a[i]]++} END{for(k in c) print k,c[k]}' filey >/dev/null
real 0m7.441s
user 0m7.334s
sys 0m0.060s
$ time awk '{n=length($0); for(i=1;i<=n;i++) c[substr($0,i,1)]++} END{for(k in c) print k,c[k]}' filey >/dev/null
real 0m5.055s
user 0m4.979s
sys 0m0.047s
for the test file
$ wc filey
118098 649539 16828965 filey
It surprised me that substr is faster than split. Perhaps due to array allocation.

Sed replace nth column of multiple tsv files without header

I have multiple tsv files, where I want to add a prefix (e.g. 'X1_') only in the second column (everywhere except in the header) and save the result back to the same file.
Input:
$ ls
file1.tsv file2.tsv file3.tsv
$ head -n 4 file1.tsv
a b c
James England 25
Brian France 41
Maria France 18
Output wanted:
a b c
James X1_England 25
Brian X1_France 41
Maria X1_France 18
I tried this, but the result is not kept in the file, and a simple redirection won't work:
# this works, but doesn't save the changes
i=1
for f in *tsv
do awk '{if (NR!=1) print $2}' $f | sed "s|^|X${i}_|"
i=$((i+1))
done
# adding the '-i' option to sed: this throws an error ("sed: no input files") but would otherwise be perfect
i=1
for f in *tsv
do awk '{if (NR!=1) print $2}' $f | sed -i "s|^|T${i}_|"
i=$((i+1))
done
Some help would be appreciated.
The second column is particularly easy because you simply replace the first occurrence of the separator.
for file in *.tsv; do
sed -i '2,$s/\t/\tX1_/' "$file"
done
If your sed doesn't recognize the symbol \t, use a literal tab (in many shells, you type it with Ctrl-V Tab). On *BSD (and hence macOS) you need -i ''.
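Since your own loop increments a counter per file (X1_, X2_, ...), here is a sketch of the same sed approach with that counter folded in (assuming GNU sed for -i and real tab separators):
i=1
for file in *.tsv; do
    sed -i "2,\$s/\t/\tX${i}_/" "$file"
    i=$((i+1))
done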
GNU AWK solution (the inplace extension requires gawk 4.1 or later):
awk -i inplace 'BEGIN { FS=OFS="\t" } NR!=1 { $2 = "X1_" $2 } 1' file1.tsv
Input:
a b c
James England 25
Brian France 41
Maria France 18
Output:
a b c
James X1_England 25
Brian X1_France 41
Maria X1_France 18
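To handle all files in one invocation and bump the prefix per file, a sketch that keys off FNR==1 (the header line) marking the start of each new file:
awk -i inplace 'BEGIN { FS=OFS="\t" } FNR==1 { c++; print; next } { $2 = "X" c "_" $2; print }' *.tsv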

Awk Scripting printf ignoring my sort command

I am trying to run a script I have set up, but when I go to sort the contents and display the text, the sort command is ignored and the information is just printed unsorted. I tried the command format below using awk, and the sort is ignored, but I am not sure why.
Command I tried:
sort -t, -k4 -k3 | awk -F, '{printf "%-18s %-27s %-15s %s\n", $1, $2, $3, $4 }' c_list.txt
The output I am getting is:
Jim Girv 199 pathway rd Orlando FL
Megan Rios 205 highwind dr Sacremento CA
Tyler Scott 303 cross st Saint James NY
Tim Harding 1150 Washton ave Pasadena CA
The output I need is:
Tim Harding 1150 Washton ave Pasadena CA
Megan Rios 205 highwind dr Sacremento CA
Jim Girv 199 pathway rd Orlando FL
Tyler Scott 303 cross st Saint James NY
It just ignores the sort command but still prints the info I need in the format from the file.
I need it to sort based on the fourth field (the state) first and the third field (the town) next, then display the information.
An example where each field is separated by a comma.
Field 1 Field 2 Field 3 Field 4
Jim Girv, 199 pathway rd, Orlando, FL
The problem is you're doing sort | awk 'script' file instead of sort file | awk 'script', so sort is sorting nothing (and consequently producing no output) while awk is operating on your original file and producing output from that. You should also have noticed that your sort command hangs for lack of input, and mentioned that in your question.
To demonstrate:
$ cat file
c
b
a
$ sort | awk '1' file
c
b
a
$ sort file | awk '1'
a
b
c