Create awk script with 3 separated awk begin commands - awk

I have these three awk commands in separate scripts, but I need to have them all in one awk script, run with the following command.
How should they be merged so that the while loop still works correctly?
Command:
gawk -f sc.awk sh1.csv > sh2.csv
awk script:
#Update id column
BEGIN{FS=OFS=","}FNR==1{print "new_id",$0;next} {print FNR-1,$0}
Second awk script:
#Extract year from date
BEGIN{FS=OFS=","} NR==1{$4="year,date"; print $0} NR!=1{sub(/[0-9]{4}/, "&,&", $4); print $0}
Third awk script:
#Delete surnames
#Delete surnames
BEGIN {
    FS = ","
    OFS = ","
}
{
    while (sub(/ [[:alpha:]]+$/, "", $3)) {}
}
{ print }
Dataset:
id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
3,Tim Elliot,2015-01-02,shot,gun,53,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
4,Lewis Lee Lembke,2015-01-02,shot,gun,47,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
8,Matthew Hoffman,2015-01-04,shot,toy weapon,32,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True
Expected output:
new_id,id,name,year,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
1,3,Tim,2015,2015-01-02,shot,gun,53,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
2,4,Lewis,2015,2015-01-02,shot,gun,47,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
3,8,Matthew,2015,2015-01-04,shot,toy weapon,32,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True

You can write all three scripts as a single awk script. As with any awk script, you simply take the changes you need to make, one at a time, and write a rule to accomplish each. In your case, you can rewrite sc.awk as follows:
BEGIN {                      # begin rule
    FS = OFS = ","
}
FNR == 1 {                   # first line rule
    $3 = "year,date"
    print "new_id", $0
    next
}
{                            # all other records
    year = $3                # save $3 as year
    sub(/-.*$/, "", year)    # remove - to end, leaving year
    sub(/ .*$/, "", $2)      # remove surname(s) after first name
    $3 = year "," $3         # update new $3 field
    print ++n, $0            # output new_id and record
}
Example Use/Output
With your sample input in file you would have:
$ awk -f sc.awk file
new_id,id,name,year,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
1,3,Tim,2015,2015-01-02,shot,gun,53,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
2,4,Lewis,2015,2015-01-02,shot,gun,47,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
3,8,Matthew,2015,2015-01-04,shot,toy weapon,32,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True

If you need to maintain the 3 separate awk scripts and you want to be able to run them 'together', one simple (though not very efficient) method is:
awk -f sh1.awk sh1.csv | awk -f sh2.awk | awk -f sh3.awk > sh2.csv
This generates:
$ cat sh2.csv
new_id,id,name,year,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
1,3,Tim,2015,2015-01-02,shot,gun,53,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
2,4,Lewis,2015,2015-01-02,shot,gun,47,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
3,8,Matthew,2015,2015-01-04,shot,toy weapon,32,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True
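Note that passing the three files to a single invocation with multiple -f options is not a shortcut for this pipeline: awk concatenates all -f files into one program, so every rule runs against the same input record rather than in stages, e.g.:
gawk -f sh1.awk -f sh2.awk -f sh3.awk sh1.csv    # all rules see each record; NOT equivalent to the pipe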
On the other hand, if the intention is to merge the 3 separate scripts into one, you can re-use your current code by figuring out the correct order and adjusting the field references to the original input file, e.g.:
$ cat sh_all.awk
BEGIN { FS=OFS="," }
FNR==1 { $3="year,date"; print "new_id",$0; next}
{
    while (sub(/ [[:alpha:]]+$/,"",$2)) {}
    sub(/[0-9]{4}/, "&,&", $3)
    print FNR-1,$0
}
Taking it for a spin:
awk -f sh_all.awk sh1.csv > sh2.csv
This also generates:
$ cat sh2.csv
new_id,id,name,year,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact
1,3,Tim,2015,2015-01-02,shot,gun,53,M,A,Shelton,WA,True,attack,Not fleeing,False,-123.122,47.247,True
2,4,Lewis,2015,2015-01-02,shot,gun,47,M,W,Aloha,OR,False,attack,Not fleeing,False,-122.892,45.487,True
3,8,Matthew,2015,2015-01-04,shot,toy weapon,32,M,W,San Francisco,CA,True,attack,Not fleeing,False,-122.422,37.763,True

Related

selecting columns in awk discarding corresponding header

How to properly select columns in awk after some processing. My file:
cat foo
A;B;C
9;6;7
8;5;4
1;2;3
I want to add a first column with line numbers and then extract some columns of the result. For the example, let's get the new first (line numbers) and third columns. This way:
awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo
gives me this unexpected output:
linenumber;A;B;C
1 9 7
2 8 4
3 1 3
but the expected output is (note B is now the third column, since we added linenumber as the first):
linenumber;B
1;6
2;5
3;2
To get your expected output, use:
$ awk 'BEGIN {
    FS=OFS=";"
}
{
    print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1)
}' file
Output:
linenumber;C
1;9
2;8
3;1
To add a column with line number and extract first and last columns, use:
$ awk 'BEGIN {
    FS=OFS=";"
}
{
    print (FNR==1?"linenumber":FNR-1),$1,$NF
}' file
Output this time:
linenumber;A;C
1;9;7
2;8;4
3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why do you print 3 values (FNR-1, $1 and $3)? Finally, the reason your output field separators are spaces instead of the expected ; is simply that... you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also using the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}), as you wish (there are differences between these 3 methods, but they don't matter here).
[EDIT]: see a generic solution at the end.
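For illustration, here are the three equivalent ways of setting OFS just mentioned (a quick sketch on a toy input, not part of the original examples):
$ echo 'a;b' | awk -F\; '{print $1,$2}' OFS=\;
a;b
$ echo 'a;b' | awk -F\; -v OFS=\; '{print $1,$2}'
a;b
$ echo 'a;b' | awk -F\; 'BEGIN{OFS=";"} {print $1,$2}'
a;b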
If the field you want to keep is the second of the input file (the B column), try:
$ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo
linenumber;B
1;6
2;5
3;2
or
$ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number:
$ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo
linenumber;B
1;6
2;5
3;2
Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example):
$ awk -F\; -v cols='1;3' '
BEGIN { OFS = ";"; n = split(cols, c); }
{ printf("%s", FNR == 1 ? "linenumber" : FNR - 1);
for(i = 1; i <= n; i++) printf("%s", OFS $(c[i]));
printf("\n");
}' foo
linenumber;A;C
1;9;7
2;8;4
3;1;3

Need to retrieve a value from an HL7 file using awk

In a Linux shell script, I've got the following awk command, used for other purposes and to rename the file.
cat $edifile | awk -F\| '
{ OFS = "|"
print $0
} ' | tr -d "\012" > $newname.hl7
While this is happening, I'd like to grab the 5th field of the MSH segment and save it for later use in the script. Is this possible?
If not, how could I do it earlier or later in the script?
Example of the segment:
MSH|^~\&|business1|business2|/u/tmp/TR0049-GE-1.b64|routing|201811302126||ORU^R01|20181130212105810|D|2.3
What I want to do is retrieve the path and file name in MSH 5 and concatenate it to the end of the new file.
I've tried this to capture the data, but no luck. If fpth is getting set, there is no evidence of it, and I don't have the right syntax for an echo within the awk block.
cat $edifile | awk -F\| '
{ OFS = "|"
{fpth=$(5)}
print $0
} ' | tr -d "\012" > $newname.hl7
Any suggestions?
Thank you!
Try
filename=`awk -F'|' '{print $5}' $edifile | head -1`
You can skip the piping through head if the file is a single line.
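If the file can contain more than one line, a variant that keys on the MSH segment itself and stops at the first match may be safer (a sketch, assuming the same field layout):
filename=$(awk -F'|' '$1 == "MSH" { print $5; exit }' "$edifile")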
First of all, it must be mentioned that the awk line in your first piece of code has zero use:
$ cat $edifile | awk -F\| ' { OFS = "|"; print $0 }' | tr -d "\012" > $newname.hl7
This is totally equivalent to
$ cat $edifile | tr -d "\012" > $newname.hl7
because OFS is only used when $0 is rebuilt, which happens when you modify a field.
Example:
$ echo "a|b|c" | awk -F\| '{OFS="/"; print $0}'
a|b|c
$ echo "a|b|c" | awk -F\| '{OFS="/"; $1=$1; print $0}'
a/b/c
I understand that you have an HL7 file in which you have a single line starting with the string "MSH". From this line you want to store the 5th field; this is achieved in the following way:
fpth=$(awk -v outputfile="${newname}.hl7" '
    BEGIN { FS="|"; ORS="" }
    ($1 == "MSH") { print $5 }
    { print $0 > outputfile }' $edifile)
I have set ORS to the empty string, as it is equivalent to tr -d "\012". The above will work very nicely if you only have a single MSH segment in your file.
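If the file could ever contain more than one MSH segment, a small guard keeps only the first path (a hedged variant of the same script):
fpth=$(awk -v outputfile="${newname}.hl7" '
    BEGIN { FS="|"; ORS="" }
    ($1 == "MSH") && !seen++ { print $5 }    # capture only the first MSH path
    { print $0 > outputfile }' $edifile)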

What is the meaning of $0 = $0 in Awk?

While going through a piece of code, I saw the command below:
grep "r" temp | awk '{FS=","; $0=$0} { print $1,$3}'
The temp file contains lines like:
1. r,1,5
2. r,4,5
3. ...
I could not understand what the statement $0=$0 in the awk command means.
Can anyone explain what it does?
When you do $1=$1 (or any other assignment to a field) it causes record recompilation where $0 is rebuilt with every FS replaced with OFS but it does not change NF (unless there was no $1 previously and then NF would change from 0 to 1) or reevaluate the record in any other way.
When you do $0=$0 it causes field splitting where NF, $1, $2, etc. are repopulated based on the current value of FS but it does not change the FSs to OFSs or modify $0 in any other way.
Look:
$ echo 'a-b-c' |
awk -F'-+' -v OFS='-' '
function p() { printf "%d) %d: $0=%s, $2=%s\n", ++c,NF,$0,$2 }
{ p(); $2=""; p(); $1=$1; p(); $0=$0; p(); $1=$1; p() }
'
1) 3: $0=a-b-c, $2=b
2) 3: $0=a--c, $2=
3) 3: $0=a--c, $2=
4) 2: $0=a--c, $2=c
5) 2: $0=a-c, $2=c
Note in the above that even though setting $2 to null resulted in 2 consecutive -s and the FS of -+ means that 2 -s are a single separator, they are not treated as such until $0=$0 causes the record to be re-split into fields as shown in output step 4.
The code you have:
awk '{FS=","; $0=$0}'
is using $0=$0 as a kludge to work around the fact that it's not setting FS until AFTER the first record has been read and split into fields:
$ printf 'a,b\nc,d\n' | awk '{print NF, $1}'
1 a,b
1 c,d
$ printf 'a,b\nc,d\n' | awk '{FS=","; print NF, $1}'
1 a,b
2 c
$ printf 'a,b\nc,d\n' | awk '{FS=","; $0=$0; print NF, $1}'
2 a
2 c
The correct solution, of course, is instead to simply set FS BEFORE the first record is read:
$ printf 'a,b\nc,d\n' | awk -F, '{print NF, $1}'
2 a
2 c
To be clear - assigning any value to $0 causes field splitting but not record recompilation, while assigning any value to any field ($1, etc.) causes record recompilation but not field splitting:
$ echo 'a-b-c' | awk -F'-+' -v OFS='#' '{$2=$2}1'
a#b#c
$ echo 'a-b-c' | awk -F'-+' -v OFS='#' '{$0=$0}1'
a-b-c
$0 = $0 is most often used to force re-evaluation of the field splitting of a modified record. For example, adding a field changes NF only after $0 = $0; until then, NF keeps the value it had when the line was read.
In this case, it sets the field separator to , on every line and reparses the line with the current FS (see @EdMorton's comment below), where a plain awk -F',' '{ print $1 "," $3 }' is much better coding for the same idea, since it sets the field separator once, at the beginning, for all lines (the per-line form would only matter if the separator had to change during processing, e.g. depending on the content of the previous line).
Example:
echo "foo;bar" | awk '{print NF}{FS=";"; print NF}{$0=$0;print NF}'
1
1
2
Based on @EdMorton's comment and the related post (What is the meaning of $0 = $0 in Awk):
echo "a-b-c" |\
awk ' BEGIN{ FS="-+"; OFS="-"}
function p(Ref) { printf "%12s) NF=%d $0=%s, $2=%s\n", Ref,NF,$0,$2 }
{
p("Org")
$2="-"; p( "S2=-")
$1=$1 ; p( "$1=$1")
$2=$2 ; p( "$2=$2")
$0=$0 ; p( "$0=$0")
$2=$2 ; p( "$2=$2")
$3=$3 ; p( "$3=$3")
$1=$1 ; p( "$1=$1")
} '
Org) NF=3 $0=a-b-c, $2=b
S2=-) NF=3 $0=a---c, $2=-
$1=$1) NF=3 $0=a---c, $2=-
$2=$2) NF=3 $0=a---c, $2=-
$0=$0) NF=2 $0=a---c, $2=c
$2=$2) NF=2 $0=a-c, $2=c
$3=$3) NF=3 $0=a-c-, $2=c
$1=$1) NF=3 $0=a-c-, $2=c
$0=$0 forces awk to re-evaluate the fields. For example:
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{print $2}'
1:2
2|3
EOF
(this prints two empty lines: the new FS only takes effect on the next record, so $2 is empty for both lines)
# Same, but with $0=$0 to force awk to re-evaluate $0
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{$0=$0;print $2}'
1:2
2|3
EOF
2
3
# NF - gives you the total number of fields in a record
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{print NF}'
1:2
2|3
EOF
1
1
# When we force re-evaluation of the fields, we correctly get 2 fields
akshay@db-3325:~$ cat <<EOF | awk '/:/{FS=":"}/\|/{FS="|"}{$0=$0; print NF}'
1:2
2|3
EOF
2
2
$ echo 'a-b-c' | awk -F'-+' -v OFS='#' '{$2=$2}1'
a#b#c
This can be slightly simplified to
mawk 'BEGIN { FS="[-]+"; OFS = "#"; } ($2=$2)'
Rationale: the value of the assignment serves as the pattern, so whenever the assigned value tests true the line is printed with its fields rebuilt using OFS.
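Beware, though, that this silently drops any record whose second field is empty or 0, because the assignment then evaluates false (a quick demonstration):
$ printf 'a-b-c\na-0-c\n' | mawk 'BEGIN { FS="[-]+"; OFS = "#"; } ($2=$2)'
a#b#c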

How to use awk sort by column 3

I have a file (user.csv) like this:
ip,hostname,user,group,encryption,aduser,adattr
I want to print all columns, sorted by user.
I tried awk -F ":" '{print|"$3 sort -n"}' user.csv, but it doesn't work.
How about just using sort?
sort -t, -nk3 user.csv
where
-t, - defines your delimiter as ,.
-n - gives you numerical sort. Added since you used it in your
attempt. If your user field is text only then you don't need it (see the note after this list).
-k3 - defines the field (key). user is the third field.
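A small refinement, since -k3 actually makes the sort key run from field 3 through the end of the line: to sort on the user field alone (and as plain text), bound the key:
sort -t, -k3,3 user.csv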
Use awk to put the user ID in front, then sort, then use sed to remove the duplicated user ID (assuming user IDs do not contain any spaces):
awk -F, '{ print $3, $0 }' user.csv | sort | sed 's/^.* //'
Seeing as the original question was about how to use awk, every single one of the first 7 answers uses sort instead, and this is the top hit on Google, here is how to do it with awk.
Sample net.csv file with headers:
ip,hostname,user,group,encryption,aduser,adattr
192.168.0.1,gw,router,router,-,-,-
192.168.0.2,server,admin,admin,-,-,-
192.168.0.3,ws-03,user,user,-,-,-
192.168.0.4,ws-04,user,user,-,-,-
And sort.awk:
#!/usr/bin/awk -f
# usage: ./sort.awk -v f=FIELD FILE
BEGIN {
    FS=","
}
# each line
{
    a[NR]=$0 ""
    s[NR]=$f ""
}
END {
    isort(s,a,NR)
    for(i=1; i<=NR; i++) print a[i]
}
# insertion sort of A[1..n]
function isort(S, A, n, i, j) {
    for (i=2; i<=n; i++) {
        hs = S[j=i]
        ha = A[j=i]
        while (S[j-1] > hs) {
            j--
            S[j+1] = S[j]
            A[j+1] = A[j]
        }
        S[j] = hs
        A[j] = ha
    }
}
To use it:
awk -f sort.awk f=3 < net.csv # OR
chmod +x sort.awk
./sort.awk f=3 net.csv
You can choose a delimiter; in this case I chose a colon and printed column number one, sorted in alphabetical order:
awk -F\: '{print $1|"sort -u"}' /etc/passwd
awk -F, '{ print $3, $0 }' user.csv | sort -k1,1
and for reverse order
awk -F, '{ print $3, $0 }' user.csv | sort -rk1,1
Try this:
awk '{print $0|"sort -t',' -nk3 "}' user.csv
OR
sort -t',' -nk3 user.csv
awk -F "," '{print $0}' user.csv | sort -nk3 -t ','
This should work
To exclude the first line (header) from sorting, I split it out into two buffers.
df | awk 'BEGIN{header=""; body=""} { if(NR==1){header=$0} else {body = body=="" ? $0 : body "\n" $0} } END{print header; print body|"sort -nk3"}'
With GNU awk:
awk -F ',' '{ a[$3]=$0 } END{ PROCINFO["sorted_in"]="@ind_str_asc"; for(i in a) print a[i] }' file
See 8.1.6 Using Predefined Array Scanning Orders with gawk for more sorting algorithms.
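If the key column is numeric, gawk can scan in numeric index order instead; a hedged variant of the same one-liner (note that, like the original, it keeps only one line per distinct key):
awk -F ',' '{ a[$3]=$0 } END{ PROCINFO["sorted_in"]="@ind_num_asc"; for(i in a) print a[i] }' file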
I'm running Linux (Ubuntu) with mawk:
tmp$ awk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
mawk (and gawk) has an option to redirect the output of print to a command. From man awk, chapter 9 (Input and output):
The output of print and printf can be redirected to a file or command by appending > file, >> file or | command to the end of the print statement. Redirection opens file or command only once, subsequent redirections append to the already open stream.
Below you'll find a simplified example of how | can be used to pass the wanted records to an external program that does the hard work. This also nicely encapsulates everything in a single awk file and reduces the command line clutter:
tmp$ cat input.csv
alpha,num
D,4
B,2
A,1
E,5
F,10
C,3
tmp$ cat sort.awk
# print header line
/^alpha,num/ {
print
}
# all other lines are data lines that should be sorted
!/^alpha,num/ {
print | "sort --field-separator=, --key=2 --numeric-sort"
}
tmp$ awk -f sort.awk input.csv
alpha,num
A,1
B,2
C,3
D,4
E,5
F,10
See man sort for the details of the sort options:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-k, --key=KEYDEF
sort via a key; KEYDEF gives location and type
-n, --numeric-sort
compare according to string numerical value

Is there a way to completely delete fields in awk, so that extra delimiters do not print?

Consider the following command:
$ gawk -F"\t" "BEGIN{OFS=\"\t\"}{$2=$3=\"\"; print $0}" Input.tsv
When I set $2 = $3 = "", the intended effect is to get the same effect as writing:
print $1,$4,$5...$NF
However, what actually happens is that I get two empty fields, with the extra field delimiters still printing.
Is it possible to actually delete $2 and $3?
Note: if this were on Linux in bash, the statement above would correctly be written as follows, but Windows does not handle single quotes well in cmd.exe.
$ gawk -F'\t' 'BEGIN{OFS="\t"}{$2=$3=""; print $0}' Input.tsv
This is an oldie but goodie.
As Jonathan points out, you can't delete fields in the middle, but you can replace their contents with the contents of other fields. And you can make a reusable function to handle the deletion for you.
$ cat test.awk
function rmcol(col, i) {
    for (i=col; i<NF; i++) {
        $i = $(i+1)
    }
    NF--
}
{
    rmcol(3)
}
1
$ printf 'one two three four\ntest red green blue\n' | awk -f test.awk
one two four
test red blue
You can't delete fields in the middle, but you can delete fields at the end, by decrementing NF.
So you can shift all the later fields down to overwrite $2 and $3 then decrement NF by two, which erases the last two fields:
$ echo 1 2 3 4 5 6 7 | awk '{for(i=2; i<NF-1; ++i) $i=$(i+2); NF-=2; print $0}'
1 4 5 6 7
If you are just looking to remove columns, you can use cut:
$ cut -f 1,4- file.txt
To emulate cut:
$ awk -F "\t" '{ for (i=1; i<=NF; i++) if (i != 2 && i != 3) { if (i == NF) printf $i"\n"; else printf $i"\t" } }' file.txt
Similarly:
$ awk -F "\t" '{ delim =""; for (i=1; i<=NF; i++) if (i != 2 && i != 3) { printf delim $i; delim = "\t"; } printf "\n" }' file.txt
HTH
The only way I can think to do it in Awk without using a loop is to use gsub on $0 to combine adjacent FS:
$ echo {1..10} | awk '{$2=$3=""; gsub(FS"+",FS); print}'
1 4 5 6 7 8 9 10
One way could be to remove the fields as you do and squeeze the leftover separators with gsub (note that /\s/ is GNU awk syntax; POSIX awk would use [[:space:]]):
$ awk 'BEGIN { FS = "\t" } { $2 = $3 = ""; gsub( /\s+/, "\t" ); print }' input-file
In addition to the answer by Suicidal Steve, I'd like to suggest one more solution, using sed instead of awk.
It seems more complicated than using cut, as Steve suggested, but sed -i has the advantage of allowing in-place editing.
$ sed -i 's/\(.*,\).*,.*,\(.*\)/\1\2/' FILENAME
To remove fields 2 and 3 from a given input file (assuming a tab field separator), you can remove the fields from $0 using GNU awk's gensub and regenerate it as follows:
awk -F '\t' 'BEGIN{OFS="\t"}
{
    $0=gensub(/[^\t]*\t/,"",3)
    $0=gensub(/[^\t]*\t/,"",2)
    print
}' Input.tsv
The method presented in the answer of ghoti has some problems:
every assignment of $i = $(i+1) forces awk to rebuild the record $0. This implies that if you have 100 fields and you want to delete field 10, you rebuild the record 90 times.
changing the value of NF manually is not POSIX-compliant and leads to undefined behaviour (as is mentioned in the comments).
A somewhat more cumbersome but robust way to delete a set of columns would be:
a single column:
awk -v del=3 '
BEGIN{FS=fs;OFS=ofs}
{ b=""; for(i=1;i<=NF;++i) if(i!=del) b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
multiple columns:
awk -v del=3,5,7 '
BEGIN{FS=fs;OFS=ofs; del="," del ","}
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
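Note that fs and ofs in the BEGIN block above are meant to be supplied by the caller; for example, deleting column 3 from a comma-separated file might look like this (a sketch using the single-column template):
awk -v del=3 -v fs=',' -v ofs=',' '
BEGIN{FS=fs;OFS=ofs}
{ b=""; for(i=1;i<=NF;++i) if(i!=del) b=(b?b OFS:"") $i; $0=b }
{ print }
' file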
Well, if the goal is to remove the extra delimiters, then you can use tr on Linux. Example:
$ echo "1,2,,,5" | tr -s ','
1,2,5
echo one two three four five six | awk '{
    print $0
    is3=$3
    $3=""
    print $0
    print is3
}'
one two three four five six
one two  four five six
three