awk command to split nth field - awk

I am learning AWK and was trying some exercises on built-in string functions.
Here's my exercise:
I have a file containing the following:
RecordType:83
1,2,3,a|x|y|z,4,5
And my desired output is as below:
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
I wrote an awk command for the above output.
awk -F',' '$1 ~ /RecordType:83/{print $0}
$1 == 1{
split($4,splt,"|")
for(i in splt)
{
if(i==1)
print $1,$2,$3,splt[i],$5,$6
else
print $1,0,0,splt[i],$5,$6
}
}' OFS=, file_name
The above command looks clumsy. Is there any way to shorten it?
Thanks in advance

The shortest possible one-liner I could manage:
awk -F, 'NR>1{n=split($4,a,"|");for(i=0;i++<n;){$4=a[i];print;$2=$3=0}}NR==1' OFS=, file
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
The much more readable script (recommended):
BEGIN {
FS=OFS="," # Comma delimiter
}
NR==1 { # If the first line in file
print $0 # Print the whole line
next # Skip to next line
}
{
n=split($4,a,"|") # Split field four on |
for(i=1;i<=n;i++) # For each sub-field
print $1,(i==1?$2 OFS $3:"0" OFS "0"),a[i],$5,$6 # Print the output
}

Another short one-liner:
awk -F, -v OFS="," 'NR>1{n=split($4,a,"|");i=0;while(++i<=n){$4=a[i];print;$2=$3=0}}NR==1' file
With your example:
kent$ awk -F, -v OFS="," 'NR>1{n=split($4,a,"|");i=0;while(++i<=n){$4=a[i];print;$2=$3=0}}NR==1' file
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
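As a hedged sketch, the same idea can be written so that the loop counter is reset for every record, which matters once the file holds more than one data line. The sample input below (including the second data row `7,8,9,p|q,1,2`) and the array name `part` are made up for illustration.

```shell
# Sketch: split field 4 on "|" and print one row per piece.
# The input rows are hypothetical sample data, not from the question.
result=$(printf 'RecordType:83\n1,2,3,a|x,4,5\n7,8,9,p|q,1,2\n' |
awk 'BEGIN { FS = OFS = "," }
NR == 1 { print; next }          # header line is passed through unchanged
{
  n = split($4, part, "|")       # n is the number of pieces in field 4
  for (i = 1; i <= n; i++) {     # i starts at 1 again for every record
    $4 = part[i]
    print
    $2 = $3 = 0                  # rows after the first get zeros here
  }
}')
printf '%s\n' "$result"
```

Using the return value of split() with a counting loop, rather than `for (i in part)`, keeps the pieces in their original order.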

Related

Multiple conditional output from single input

I have a file test.txt. I want to match several patterns against it, and at the moment I print the matches for each pattern separately with
awk 'substr($1,5,15) ~ /ccc/ { print $0 }' test.txt >test1.txt
awk 'substr($1,5,15) ~ /abb/ { print $0 }' test.txt >test2.txt
awk 'substr($1,5,15) ~ /abc/ { print $0 }' test.txt >test3.txt
Can I run these in one go? That is, after
awk 'substr($1,5,15) ~ /ccc/ { print $0 }' test.txt
can I run, only on the lines that did not match the above pattern,
awk 'substr($1,5,15) ~ /abb/ { print $0 }'
and similarly, on the lines that are still unmatched,
awk 'substr($1,5,15) ~ /abc/ { print $0 }'
Input file test.txt
NNNNNabcabAAAAATCTAATCTGCCAGTT
NNNNNabcccTTTTTCTAGTCACGATAGCC
NNNNNaaabbCTAGTTTGTGTAGTAATTTT
NNNNNaaaabTTTTTTTTTTTTTTTTTTTT
NNNNNabbbbTTTTTTCACTACTGGGTTTC
NNNNNabcaaTTTTTTTTAATGGGTCTCAA
NNNNNabaccTTTTTTTTTCGGGAGGCGGG
NNNNNccaaaTTTTTTTTTTTTTATTTGAG
NNNNNabcccTTTTTTTTTACACACAATTC
NNNNNabcccTAAGACTGGCCCACAGCTGA
NNNNNabcaaTAGAGACGGGGTTTCACCAT
NNNNNabcaaTTTTTGTCGAAGATCTCACC
NNNNNabcabTTGGTAAACAGGCGGGTGTA
NNNNNabcccTACTTTTTTTAGTGATACAC
NNNNNaaabbTTTTTGCAAAAAGTAATTTG
NNNNNabcabTTTTTTTTTCTTTCTGCCTG
NNNNNabcaaTTTTGAGACAGAATCTTGCT
NNNNNaaabbTTTTTTTTTTTTTACTAGTG
NNNNNabcccTAGACAGGGAATACTTTATT
NNNNNabcabGACAGGGAATACTTATATTC
awk 'substr($1,5,15) ~ /ccc/ { print $0 }' test.txt >test1.txt
test1.txt
NNNNNabcccTTTTTCTAGTCACGATAGCC
NNNNNabcccTTTTTTTTTACACACAATTC
NNNNNabcccTAAGACTGGCCCACAGCTGA
NNNNNabcccTACTTTTTTTAGTGATACAC
NNNNNabcccTAGACAGGGAATACTTTATT
awk 'substr($1,5,15) ~ /abb/ { print $0 }' test.txt >test2.txt
test2.txt
NNNNNaaabbCTAGTTTGTGTAGTAATTTT
NNNNNabbbbTTTTTTCACTACTGGGTTTC
NNNNNaaabbTTTTTGCAAAAAGTAATTTG
NNNNNaaabbTTTTTTTTTTTTTACTAGTG
awk 'substr($1,5,15) ~ /abc/ { print $0 }' test.txt >test3.txt
test3.txt
NNNNNabcabAAAAATCTAATCTGCCAGTT
NNNNNabcccTTTTTCTAGTCACGATAGCC
NNNNNabcaaTTTTTTTTAATGGGTCTCAA
NNNNNabcccTTTTTTTTTACACACAATTC
NNNNNabcccTAAGACTGGCCCACAGCTGA
NNNNNabcaaTAGAGACGGGGTTTCACCAT
NNNNNabcaaTTTTTGTCGAAGATCTCACC
NNNNNabcabTTGGTAAACAGGCGGGTGTA
NNNNNabcccTACTTTTTTTAGTGATACAC
NNNNNabcabTTTTTTTTTCTTTCTGCCTG
NNNNNabcaaTTTTGAGACAGAATCTTGCT
NNNNNabcccTAGACAGGGAATACTTTATT
NNNNNabcabGACAGGGAATACTTATATTC
Done this way, the following lines end up in two output files:
NNNNNabcccTAAGACTGGCCCACAGCTGA
NNNNNabcccTACTTTTTTTAGTGATACAC
NNNNNabcccTAGACAGGGAATACTTTATT
NNNNNabcccTTTTTCTAGTCACGATAGCC
NNNNNabcccTTTTTTTTTACACACAATTC
What I am looking for is that once a line has been printed, it is not tested against the remaining patterns. My expected output:
test1.txt
NNNNNabcccTTTTTCTAGTCACGATAGCC
NNNNNabcccTTTTTTTTTACACACAATTC
NNNNNabcccTAAGACTGGCCCACAGCTGA
NNNNNabcccTACTTTTTTTAGTGATACAC
NNNNNabcccTAGACAGGGAATACTTTATT
test2.txt
NNNNNaaabbCTAGTTTGTGTAGTAATTTT
NNNNNabbbbTTTTTTCACTACTGGGTTTC
NNNNNaaabbTTTTTGCAAAAAGTAATTTG
NNNNNaaabbTTTTTTTTTTTTTACTAGTG
test3.txt
NNNNNabcabAAAAATCTAATCTGCCAGTT
NNNNNabcaaTTTTTTTTAATGGGTCTCAA
NNNNNabcaaTAGAGACGGGGTTTCACCAT
NNNNNabcaaTTTTTGTCGAAGATCTCACC
NNNNNabcabTTGGTAAACAGGCGGGTGTA
NNNNNabcabTTTTTTTTTCTTTCTGCCTG
NNNNNabcaaTTTTGAGACAGAATCTTGCT
NNNNNabcabGACAGGGAATACTTATATTC
To do all three in one awk process, try:
awk 'substr($1,5,15) ~ /ccc/ { print>"test1.txt"}
substr($1,5,15) ~ /abb/ { print>"test2.txt"}
substr($1,5,15) ~ /abc/ { print>"test3.txt"}' test.txt
Here, print>"test1.txt" prints to file test1.txt.
Note that > means something different in awk than it means in shell. In awk, like in shell, the first print to a file will overwrite the previous contents of the file. However, unlike shell, subsequent awk print statements using > append to the file.
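A minimal sketch of that behaviour, assuming a throwaway temporary directory and a hypothetical output file name `out.txt`: the file is truncated when awk first opens it, and every later `print >` in the same run appends.

```shell
# Sketch: awk's ">" truncates once per run, then appends.
tmp=$(mktemp -d)
printf 'old contents\n' > "$tmp/out.txt"       # pre-existing data
printf 'one\ntwo\nthree\n' |
  awk -v f="$tmp/out.txt" '{ print > f }'      # three prints, one open
result=$(cat "$tmp/out.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The old contents are gone, but all three input lines survive, which would not be the case if each `print >` truncated the file as a shell redirection does.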
Variation: Printing only to the first matched output file
awk 'substr($1,5,15) ~ /ccc/ { print>"test1.txt"; next}
substr($1,5,15) ~ /abb/ { print>"test2.txt"; next}
substr($1,5,15) ~ /abc/ { print>"test3.txt"}' test.txt
Here, when a match is found, next tells awk to skip the rest of the tests and jump to start over on the next line.
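The effect of next can be checked on three lines taken from the sample input. This is only a sketch: the `dir` variable and the temporary directory are assumptions added so that the example does not write into the current directory.

```shell
# Sketch: each line goes to the first matching file only.
tmp=$(mktemp -d)
printf '%s\n' \
  'NNNNNabcccTTTTTCTAGTCACGATAGCC' \
  'NNNNNaaabbCTAGTTTGTGTAGTAATTTT' \
  'NNNNNabcabAAAAATCTAATCTGCCAGTT' |
awk -v dir="$tmp" '
  { key = substr($1, 5, 15) }
  key ~ /ccc/ { print > (dir "/test1.txt"); next }  # matched: skip later tests
  key ~ /abb/ { print > (dir "/test2.txt"); next }
  key ~ /abc/ { print > (dir "/test3.txt") }
'
first=$(cat "$tmp/test1.txt")
second=$(cat "$tmp/test2.txt")
third=$(cat "$tmp/test3.txt")
printf '%s\n%s\n%s\n' "$first" "$second" "$third"
rm -rf "$tmp"
```

The abccc line contains both ccc and abc, yet it lands only in test1.txt because next stops processing before the abc test is reached.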
awk '
{
str = substr($1,5,15)
out = 0
if (str ~ /ccc/) out=1
else if (str ~ /abb/) out=2
else if (str ~ /abc/) out=3
}
out { print > ("test" out ".txt") }
' test.txt
With GNU awk you could use a switch statement instead of nested ifs.
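This golf presumes no concurrent matches: match() reports the leftmost match, so a line such as NNNNNabccc... is filed under abc rather than ccc.
gawk '{
if (match(substr($1,5,15), /(ccc)|(abb)|(abc)/, A)) { # probably unnecessary substring
for(n=1; n<=3; n++) if (A[n] != "") break # number of the group that matched
print > ("test" n ".txt") # print to variable filename
}
}' test.txt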

How to not remove the header while executing awk

I have a file file like this :
k_1_1
k_1_3
k_1_6
...
I have a file file2 :
0,1,2,3,...
k_1_1,17,16,15,...
k_1_2,17,89,15,...
k_1_3,10,26,45,...
k_1_4,17,16,15,...
k_1_5,10,26,45,...
k_1_6,17,16,15,...
...
I want to print the lines of file2 whose first field matches a line in file. The desired output is:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
I tried
awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2 > result
But the header line is gone in result like this :
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
How can I keep it? Thank you.
Always print the first line, unconditionally.
awk 'BEGIN{FS=OFS=","}
NR==FNR{a[$1];next}
FNR==1 || $1 in a' file file2 > result
Notice also how { print $0 } is not necessary because it's the default action.
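A small self-contained sketch of this pattern, assuming two throwaway files named `keys` and `data` and an array called `want` (all made up for the example):

```shell
# Sketch: keep the header of the second file plus the rows whose key is listed.
tmp=$(mktemp -d)
printf 'k_1_1\nk_1_3\n' > "$tmp/keys"
printf 'id,x,y\nk_1_1,17,16\nk_1_2,17,89\nk_1_3,10,26\n' > "$tmp/data"
result=$(awk -F, '
  NR == FNR { want[$1]; next }   # first file: remember the keys
  FNR == 1 || ($1 in want)       # second file: header, or a wanted key
' "$tmp/keys" "$tmp/data")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The first line of the keys file also has FNR==1, but it never reaches the second rule because next has already moved on to the following line.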
A very ad-hoc solution to your problem could be to compose the output in a command group:
{ head -1 file2; awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2; } > result
Could you please try the following:
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1 && ++count==1{print;next} a[$1]' Input_file Input_file2
OR
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1{print;next} a[$1]' Input_file Input_file2

print unique lines based on field

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the later duplicates.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
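For readers new to the idiom, here is a sketch of roughly what `!seen[$1]++` does when spelled out as an explicit action. The three-line sample input is made up for the example.

```shell
# Sketch: long-hand equivalent of the '!seen[$1]++' pattern.
result=$(printf '10,a\n20,b\n10,c\n' |
awk -F, '{
  if (seen[$1] == 0)   # key not counted yet: first occurrence
    print
  seen[$1]++           # count this occurrence either way
}')
printf '%s\n' "$result"
```

In the short form the post-increment returns the old count, so the negated expression is true only the first time a key appears, and a true pattern with no action prints the line.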
This should also give you what you want, although for (i in a) does not guarantee that the lines come out in input order:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' Input.csv

Edit header file with awk

I have a whitespace-separated file that I need to convert so that:
header=tab separated,
records=" ; " separated (space-semicolon-space)
What I'm doing now is:
cat ${original} | awk 'END {FS=" "} { for(i=1; i<=NR; i++) {if (i==1) { OFS="\t"; print $0; } else { OFS=";" ;print $0; }}}' > ${new}
But it only partly works. First, it produces millions of lines, while the original has about 90000.
Second, the header, which should be modified here:
if (i==1) { OFS="\t"; print $0; }
is not modified at all.
Another option would be sed. I can get the job partially done with it, but again the header remains untouched:
cat ${original} | sed 's/\t/ ;/g' > ${new}
This line should change every separator in the file (written as {$1=$1}1 so that lines whose first field is empty or 0 are still printed):
awk -F'\t' -v OFS=";" '{$1=$1}1' file
This will leave the header untouched:
awk -F'\t' -v OFS=";" 'NR>1{$1=$1}1' file
This will change only the header line:
awk -F'\t' -v OFS=";" 'NR==1{$1=$1}1' file
you could paste some example to let us know why your header was not modified.
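As a hedged sketch of what the question seems to ask for, the output separator can be switched between the header and the remaining records. It assumes the input really is whitespace-separated (awk's default field splitting); the three sample rows are invented.

```shell
# Sketch: tab-separated header, " ; "-separated records.
result=$(printf 'id name score\n1 ann 90\n2 bob 85\n' |
awk '
  NR == 1 { OFS = "\t" }    # header line: join fields with tabs
  NR == 2 { OFS = " ; " }   # from the second line on: space-semicolon-space
  { $1 = $1; print }        # assigning a field rebuilds $0 using OFS
')
printf '%s\n' "$result"
```

Changing OFS alone does not alter $0; the line is only rejoined with the new separator once a field is assigned, which is why the `$1 = $1` step is there.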

using awk to count characters and modify file accordingly

I have a file that looks like this
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`cee
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTG
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghf
I would like to go to every second line and count the number of characters. If the line contains less than e.g. 66 characters then fill it to 66 with 'A' and print to new file. If it contains 66 characters then just print the line as is.
The output file would look like this;
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`ceeAAAAAAAAAAAAA
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
I have a very basic knowledge of awk so from a learning perspective I would like to use awk to solve the problem.
One way:
awk '!(NR%2) && length<66{for(i=length;i<66;i++)$0=$0 "A"}1' file
This should be faster than the accepted approach:
awk 'NR%2==0 { x = sprintf("%-66s", $0); gsub(/ /,"A",x); $0 = x }1' file
Results:
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`ceeAAAAAAAAAAAAA
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is another, perhaps odd-looking, one-liner:
awk 'BEGIN{while(++i<66)t=t"A"}!(NR%2){$0=$0substr(t,length)}1' file
awk 'NR%2 == 0{
printf("%s", $0)
for(i=length($0); i<66; i++)printf("A")
print "";next }
{print}'
awk -v FS= '{printf "%s",$0} !(NR%2){for (i=NF+1;i<=66;i++) printf "A"} {print ""}'
or if you don't like loops:
awk -v FS= '{sfx=(NR%2 ? "" : sprintf("%*s",66-NF,"")); gsub(/ /,"A",sfx); print $0 sfx}'
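For comparison, a plain and portable sketch of the same padding with the width and fill character passed in as variables. The names `width` and `pad`, the width of 8 and the tiny four-line input are assumptions chosen to keep the example short.

```shell
# Sketch: pad every second line to a fixed width with a fill character.
result=$(printf '#id\nACGT\n+id\nhhhh\n' |
awk -v width=8 -v pad=A '
  NR % 2 == 0 {                      # even lines: sequence and quality
    while (length($0) < width)
      $0 = $0 pad                    # append one fill character at a time
  }
  { print }                          # lines already at width are unchanged
')
printf '%s\n' "$result"
```

This avoids both the empty-FS trick, which is not specified by POSIX, and the gsub over spaces, at the cost of a loop per short line.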