Is there a way to completely delete fields in awk, so that extra delimiters do not print? - awk
Consider the following command:
$ gawk -F"\t" "BEGIN{OFS=\"\t\"}{$2=$3=\"\"; print $0}" Input.tsv
When I set $2 = $3 = "", the intended effect is to get the same effect as writing:
print $1,$4,$5...$NF
However, what actually happens is that I get two empty fields, with the extra field delimiters still printing.
Is it possible to actually delete $2 and $3?
Note: If this was on Linux in bash, the correct statement above would be the following, but Windows does not handle single quotes well in cmd.exe.
$ gawk -F'\t' 'BEGIN{OFS="\t"}{$2=$3=""; print $0}' Input.tsv
This is an oldie but goodie.
As Jonathan points out, you can't delete fields in the middle, but you can replace their contents with the contents of other fields. And you can make a reusable function to handle the deletion for you.
$ cat test.awk
function rmcol(col, i) {
for (i=col; i<NF; i++) {
$i = $(i+1)
}
NF--
}
{
rmcol(3)
}
1
$ printf 'one two three four\ntest red green blue\n' | awk -f test.awk
one two four
test red blue
You can't delete fields in the middle, but you can delete fields at the end, by decrementing NF.
So you can shift all the later fields down to overwrite $2 and $3 then decrement NF by two, which erases the last two fields:
$ echo 1 2 3 4 5 6 7 | awk '{for(i=2; i<NF-1; ++i) $i=$(i+2); NF-=2; print $0}'
1 4 5 6 7
If you are just looking to remove columns, you can use cut:
$ cut -f 1,4- file.txt
To emulate cut:
$ awk -F "\t" '{ for (i=1; i<=NF; i++) if (i != 2 && i != 3) { if (i == NF) printf $i"\n"; else printf $i"\t" } }' file.txt
Similarly:
$ awk -F "\t" '{ delim =""; for (i=1; i<=NF; i++) if (i != 2 && i != 3) { printf delim $i; delim = "\t"; } printf "\n" }' file.txt
HTH
The only way I can think to do it in Awk without using a loop is to use gsub on $0 to combine adjacent FS:
$ echo {1..10} | awk '{$2=$3=""; gsub(FS"+",FS); print}'
1 4 5 6 7 8 9 10
One way could be to remove fields like you do and remove extra spaces with gsub:
$ awk 'BEGIN { FS = "\t" } { $2 = $3 = ""; gsub( /\s+/, "\t" ); print }' input-file
In the addition of the answer by Suicidal Steve I'd like to suggest one more solution but using sed instead awk.
It seems more complicated than usage of cut as it was suggested by Steve. But it was the better solution because sed -i allows editing in-place.
$ sed -i 's/\(.*,\).*,.*,\(.*\)/\1\2/' FILENAME
To remove fields 2 and 3 from a given input file (assuming a tab field separator), you can remove the fields from $0 using gensub and regenerate it as follows:
awk -F '\t' 'BEGIN{OFS="\t"}\
{$0=gensub(/[^\t]*\t/,"",3);\
$0=gensub(/[^\t]*\t/,"",2);\
print}' Input.tsv
The method presented in the answer of ghoti has some problems:
every assignment of $i = $(i+1) forces awk to rebuild the record $0. This implies that if you have 100 fields and you want to delete field 10, you rebuild the record 90 times.
changing the value of NF manually is not posix compliant and leads to undefined behaviour (as is mentioned in the comments).
A somewhat more cumbersome, but stable robust way to delete a set of columns would be:
a single column:
awk -v del=3 '
BEGIN{FS=fs;OFS=ofs}
{ b=""; for(i=1;i<=NF;++i) if(i!=del) b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
multiple columns:
awk -v del=3,5,7 '
BEGIN{FS=fs;OFS=ofs; del="," del ","}
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
Well, if the goal is to remove the extra delimiters, then you can use tr on Linux. Example:
$ echo "1,2,,,5" | tr -s ','
1,2,5
echo one two three four five six|awk '{
print $0
is3=$3
$3=""
print $0
print is3
}'
one two three four five six
one two four five six
three
Related
assigning a var inside AWK for use outside awk
I am using ksh on AIX. I have a file with multiple comma delimited fields. The value of each field is read into a variable inside the script. The last field in the file may contain multiple | delimited values. I need to test each value and keep the first one that doesn't begin with R, then stop testing the values. sample value of $principal_diagnosis0 R65.20|A41.9|G30.9|F02.80 I've tried: echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){echo $i; primdiag = $i}}}' but I get this message : awk: Field $i is not correct. My goal is to have a variable that I can use outside of the awk statement that gets assigned the first non-R code (in this case it would be A41.9). echo $principal_diagnosis0 | awk -F"|" '{for (i = 1; i<=NF; i++) {if ($i !~ "R"){print $i}}}' gets me the output of : A41.9 G30.9 F02.80 So I know it's reading the values and evaluating properly. But I need to stop after the first match and be able to use that value outside of awk. Thanks!
To answer your specific question: $ principal_diagnosis0='R65.20|A41.9|G30.9|F02.80' $ foo=$(echo "$principal_diagnosis0" | awk -v RS='|' '/^[^R]/{sub(/\n/,""); print; exit}') $ echo "$foo" A41.9 The above will work with any awk, you can do it more briefly with GNU awk if you have it: foo=$(echo "$principal_diagnosis0" | awk -v RS='[|\n]' '/^[^R]/{print; exit}')
you can make FS and OFS do all the hard work : echo "${principal_diagnosis0}" | mawk NF=NF FS='^(R[^|]+[|])+|[|].+$' OFS= A41.9 —————————————————————————————————————————— another slightly different variation of the same concept — overwriting fields but leaving OFS as is : gawk -F'^.*R[^|]+[|]|[|].+$' '$--NF=$--NF' A41.9 this works, because when you break it out : gawk -F'^.*R[^|]+[|]|[|].+$' ' { print NF } $(_=--NF)=$(__=--NF) { print _, __, NF, $0 }' 3 1 2 1 A41.9 you'll notice you start with NF = 3, and the two subsequent decrements make it equivalent to $1 = $2, but since final NF is now reduced to just 1, it would print it out correctly instead of 2 copies of it …… which means you can also make it $0 = $2, as such : gawk -F'^.*R[^|]+[|]|[|].+$' '$-_=$-—NF' A41.9 —————————————————————————————————————————— a 3rd variation, this time using RS instead of FS : mawk NR==2 RS='^.*R[^|]+[|]|[|].+$' A41.9 —————————————————————————————————————————— and if you REALLY don't wanna mess with FS/OFS/RS, use gsub() instead : nawk 'gsub("^.*R[^|]+[|]|[|].+$",_)' A41.9
selecting columns in awk discarding corresponding header
How to properly select columns in awk after some processing. My file here: cat foo A;B;C 9;6;7 8;5;4 1;2;3 I want to add a first column with line numbers and then extract some columns of the result. For the example let's get the new first (line numbers) and third columns. This way: awk -F';' 'FNR==1{print "linenumber;"$0;next} {print FNR-1,$1,$3}' foo gives me this unexpected output: linenumber;A;B;C 1 9 7 2 8 4 3 1 3 but expected is (note B is now the third column as we added linenumber as first): linenumber;B 1;6 2;5 3;2 [fixed and revised]
To get your expected output, use: $ awk 'BEGIN { FS=OFS=";" } { print (FNR==1?"linenumber":FNR-1),$(FNR==1?3:1) }' file Output: linenumber;C 1;9 2;8 3;1 To add a column with line number and extract first and last columns, use: $ awk 'BEGIN { FS=OFS=";" } { print (FNR==1?"linenumber":FNR-1),$1,$NF }' file Output this time: linenumber;A;C 1;9;7 2;8;4 3;1;3
Why do you print $0 (the complete record) in your header? And, if you want only two columns in your output, why to you print 3 (FNR-1, $1 and $3)? Finally, the reason why your output field separators are spaces instead of the expected ; is simply that... you did not specify the output field separator (OFS). You can do this with a command line variable assignment (OFS=\;), as shown in the second and third versions below, but also using the -v option (-v OFS=\;) or in a BEGIN block (BEGIN {OFS=";"}) as you wish (there are differences between these 3 methods but they don't matter here). [EDIT]: see a generic solution at the end. If the field you want to keep is the second of the input file (the B column), try: $ awk -F\; 'FNR==1 {print "linenumber;" $2; next} {print FNR-1 ";" $2}' foo linenumber;B 1;6 2;5 3;2 or $ awk -F\; 'FNR==1 {print "linenumber",$2; next} {print FNR-1,$2}' OFS=\; foo linenumber;B 1;6 2;5 3;2 Note that, as long as you don't want to keep the first field of the input file ($1), you could as well overwrite it with the line number: $ awk -F\; '{$1=FNR==1?"linenumber":FNR-1; print $1,$2}' OFS=\; foo linenumber;B 1;6 2;5 3;2 Finally, here is a more generic solution to which you can pass the list of indexes of the columns of the input file you want to print (1 and 3 in this example): $ awk -F\; -v cols='1;3' ' BEGIN { OFS = ";"; n = split(cols, c); } { printf("%s", FNR == 1 ? "linenumber" : FNR - 1); for(i = 1; i <= n; i++) printf("%s", OFS $(c[i])); printf("\n"); }' foo linenumber;A;C 1;9;7 2;8;4 3;1;3
Substitute patterns using a correspondence file
I try to change in a file some word by others using sed or awk. My initial fileA as this format: >Genus_species|NP_001006347.1|transcript-60_2900.p1:1-843 I have a second fileB with the correspondences like this: NP_001006347.1 GeneA XP_003643123.1 GeneB I am trying to substitute in FileA the name to get this ouput: >Genus_species|GeneA|transcript-60_2900.p1:1-843 I was thinking to use awk or sed, to do something like 's/$patternA/$patternB/' with a while read l but how to indicate which pattern 1 and 2 are in the fileB? I tried also this but not working. sed "$(sed 's/^\([^ ]*\) \(.*\)$/s#\1#\2#g/' fileB)" fileA Awk may be able to do the job more easily? Thanks
It is easier to this in awk: awk -v OFS='|' 'NR == FNR { map[$1] = $2 next } { for (i=1; i<=NF; ++i) $i in map && $i = map[$i] } 1' file2 FS='|' file1 >Genus_species|GeneA|transcript-60_2900.p1:1-843
Written and tested with your shown samples, considering that you have only one entry for NP_digits.digits in your Input_fileA then you could try following too. awk ' FNR==NR{ arr[$1]=$2 next } match($0,/NP_[0-9]+\.[0-9]+/) && ((val=substr($0,RSTART,RLENGTH)) in arr){ $0=substr($0,1,RSTART-1) arr[val] substr($0,RSTART+RLENGTH) } 1 ' Input_fileB Input_fileA
Using awk awk -F [\|" "] 'NR==FNR { arr[$1]=$2;next } NR!=FNR { OFS="|";$2=arr[$2] }1' fileB fileA Set the field delimiter to space or |. Process fileB first (NR==FNR) Create an array called arr with the first space delimited field as the index and the second the value. Then for the second file (NR != FNR), check for an entry for the second field in the arr array and if there is an entry, change the second field for the value in the array and print the lines with short hand 1
You are looking for the join command which can be used like this: join -11 -22 -t'|' <(tr ' ' '|' < fileB | sort -t'|' -k1) <(sort -t'|' -k2 fileA) This performs a join on column 1 of fileB with column 2 of fileA. The tr was used such that fileB also uses | as delimiter because join requires it to be equal on both files. Note that the output columns are not in the order you specified. You can swap by piping the output into awk.
linux csv file concatenate columns into one column
I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through. I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to recursively concatenate column 10 with column 11 per row such that every row has exactly 14 columns. In other words, this: a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p will become: a,b,c,d,e,f,g,h,i,jkl,m,n,o,p I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row. Examples (by request): How many columns are in the row? sed 's/[^,]//g' | wc -c Get the first 10 columns: cut -d, -f1-10 Get the last 4 columns: rev | cut -d, -f1-4 | rev Concatenate columns 10 and 11, showing columns 1-10 after that: awk -F',' ' NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11}'
Awk solution: awk 'BEGIN{ FS=OFS="," } { diff = NF - 14; for (i=1; i <= NF; i++) printf "%s%s", $i, (diff > 1 && i >= 10 && i < (10+diff)? "": (i == NF? ORS : ",")) }' file The output: a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
With GNU awk for the 3rd arg to match() and gensub(): $ cat tst.awk BEGIN{ FS="," } match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) { $0 = a[1] gensub(/,/,"","g",a[3]) a[5] } { print } $ awk -f tst.awk file a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
If perl is okay - can be used just like awk for stream processing $ cat ip.txt a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p 1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4 1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4 $ awk -F, '{print NF}' ip.txt 16 18 22 $ perl -F, -lane '$n = $#F - 4; print join ",", (#F[0..8], join("", #F[9..$n]), #F[$n+1..$#F]) ' ip.txt a,b,c,d,e,f,g,h,i,jkl,m,n,o,p 1,2,3,4,5,6,3,4,2,43432,5,2,3,4 1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4 -F, -lane split on , results saved in #F array $n = $#F - 4 magic number, to ensure output ends with 14 columns. $#F gives the index of last element of array (won't work if input line has less than 14 columns) join helps to stitch array elements together with specified string #F[0..8] array slice with first 9 elements #F[9..$n] and #F[$n+1..$#F] the other slices as needed Borrowing from Ed Morton's regex based solution $ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt a,b,c,d,e,f,g,h,i,jkl,m,n,o,p 1,2,3,4,5,6,3,4,2,43432,5,2,3,4 1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4 $n=$#F-13 magic number ^([^,]*,){9}\K first 9 fields ([^,]*,){$n} fields to change $&=~tr|,||dr use tr to delete the commas e this modifier allows use of Perl code in replacement section this solution also has the added advantage of working even if input field is less than 14
You can try this gnu sed sed -E ' s/,/\n/9g :A s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/ tA s/\n/,/g ' infile
First variant - with awk awk -F, ' { for(i = 1; i <= NF; i++) { OFS = (i > 9 && i < NF - 4) ? "" : "," if(i == NF) OFS = "\n" printf "%s%s", $i, OFS } }' input.txt Second variant - with sed sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt or, more straightforwardly (without loop) and probably faster. sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt Testing Input a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u Output a,b,c,d,e,f,g,h,i,jkl,m,n,o,p a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
Solved a similar problem using csvtool. Source file, copied from one of the other answers: $ cat input.txt a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p 1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4 1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4 Concatenating columns: $ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' - a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,, 1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,, 1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4 anatoly#anatoly-workstation:cbs$ cat input.txt
How to use awk sort by column 3
I have a file (user.csv)like this ip,hostname,user,group,encryption,aduser,adattr want to print all column sort by user, I tried awk -F ":" '{print|"$3 sort -n"}' user.csv , it doesn't work.
How about just sort. sort -t, -nk3 user.csv where -t, - defines your delimiter as ,. -n - gives you numerical sort. Added since you added it in your attempt. If your user field is text only then you dont need it. -k3 - defines the field (key). user is the third field.
Use awk to put the user ID in front. Sort Use sed to remove the duplicate user ID, assuming user IDs do not contain any spaces. awk -F, '{ print $3, $0 }' user.csv | sort | sed 's/^.* //'
Seeing as that the original question was on how to use awk and every single one of the first 7 answers use sort instead, and that this is the top hit on Google, here is how to use awk. Sample net.csv file with headers: ip,hostname,user,group,encryption,aduser,adattr 192.168.0.1,gw,router,router,-,-,- 192.168.0.2,server,admin,admin,-,-,- 192.168.0.3,ws-03,user,user,-,-,- 192.168.0.4,ws-04,user,user,-,-,- And sort.awk: #!/usr/bin/awk -f # usage: ./sort.awk -v f=FIELD FILE BEGIN { FS="," } # each line { a[NR]=$0 "" s[NR]=$f "" } END { isort(s,a,NR); for(i=1; i<=NR; i++) print a[i] } #insertion sort of A[1..n] function isort(S, A, n, i, j) { for( i=2; i<=n; i++) { hs = S[j=i] ha = A[j=i] while (S[j-1] > hs) { j--; S[j+1] = S[j] A[j+1] = A[j] } S[j] = hs A[j] = ha } } To use it: awk sort.awk f=3 < net.csv # OR chmod +x sort.awk ./sort.awk f=3 net.csv
You can choose a delimiter, in this case I chose a colon and printed the column number one, sorting by alphabetical order: awk -F\: '{print $1|"sort -u"}' /etc/passwd
awk -F, '{ print $3, $0 }' user.csv | sort -nk2 and for reverse order awk -F, '{ print $3, $0 }' user.csv | sort -nrk2
try this - awk '{print $0|"sort -t',' -nk3 "}' user.csv OR sort -t',' -nk3 user.csv
awk -F "," '{print $0}' user.csv | sort -nk3 -t ',' This should work
To exclude the first line (header) from sorting, I split it out into two buffers. df | awk 'BEGIN{header=""; $body=""} { if(NR==1){header=$0}else{body=body"\n"$0}} END{print header; print body|"sort -nk3"}'
With GNU awk: awk -F ',' '{ a[$3]=$0 } END{ PROCINFO["sorted_in"]="#ind_str_asc"; for(i in a) print a[i] }' file See 8.1.6 Using Predefined Array Scanning Orders with gawk for more sorting algorithms.
I'm running Linux (Ubuntu) with mawk: tmp$ awk -W version mawk 1.3.4 20200120 Copyright 2008-2019,2020, Thomas E. Dickey Copyright 1991-1996,2014, Michael D. Brennan random-funcs: srandom/random regex-funcs: internal compiled limits: sprintf buffer 8192 maximum-integer 2147483647 mawk (and gawk) has an option to redirect the output of print to a command. From man awk chapter 9. Input and output: The output of print and printf can be redirected to a file or command by appending > file, >> file or | command to the end of the print statement. Redirection opens file or command only once, subsequent redirections append to the already open stream. Below you'll find a simplied example how | can be used to pass the wanted records to an external program that makes the hard work. This also nicely encapsulates everything in a single awk file and reduces the command line clutter: tmp$ cat input.csv alpha,num D,4 B,2 A,1 E,5 F,10 C,3 tmp$ cat sort.awk # print header line /^alpha,num/ { print } # all other lines are data lines that should be sorted !/^alpha,num/ { print | "sort --field-separator=, --key=2 --numeric-sort" } tmp$ awk -f sort.awk input.csv alpha,num A,1 B,2 C,3 D,4 E,5 F,10 See man sort for the details of the sort options: -t, --field-separator=SEP use SEP instead of non-blank to blank transition -k, --key=KEYDEF sort via a key; KEYDEF gives location and type -n, --numeric-sort compare according to string numerical value