using awk to count characters and modify file accordingly - awk

I have a file that looks like this
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`cee
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTG
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghf
I would like to go to every second line and count the number of characters. If the line contains fewer than e.g. 66 characters, then pad it to 66 with 'A' and print it to a new file. If it already contains 66 characters, then just print the line as is.
The output file would look like this;
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`ceeAAAAAAAAAAAAA
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
I have a very basic knowledge of awk so from a learning perspective I would like to use awk to solve the problem.

One way:
awk '!(NR%2) && length<66{for(i=length;i<66;i++)$0=$0 "A"}1' file

This should be faster than the accepted approach:
awk 'NR%2==0 { x = sprintf("%-66s", $0); gsub(/ /,"A",x); $0 = x }1' file
Results:
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`ceeAAAAAAAAAAAAA
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
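
One thing to keep in mind with the sprintf/gsub approach: gsub(/ /,"A",x) runs over the whole padded line, so any space already present in the line would also become an "A". The quality strings shown here contain no spaces, so this is only a precaution. A minimal sketch that builds the padding separately and converts only that; the function name pad_even_lines and the width argument are made up for illustration:

```shell
# pad_even_lines WIDTH [FILE...]
# Pads every second line to WIDTH characters with "A" (hypothetical helper).
pad_even_lines() {
    width=$1; shift
    awk -v w="$width" '
        NR % 2 == 0 && length($0) < w {
            pad = sprintf("%" (w - length($0)) "s", "")  # run of spaces, one per missing character
            gsub(/ /, "A", pad)                          # convert spaces to "A" in the padding only
            $0 = $0 pad
        }
        { print }
    ' "$@"
}
```

It would be called as pad_even_lines 66 file > newfile.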

I would paste another strange (maybe) oneliner:
awk 'BEGIN{while(++i<66)t=t"A"}!(NR%2){$0=$0substr(t,length)}1' file
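
Why this works: the BEGIN block builds t as a string of 65 "A" characters, and substr(t,length) returns everything from position length onward, which is 66 minus length characters for any non-empty line. A parameterised sketch of the same trick, with one change that is my own: t holds the full width and substr starts at length+1, so an empty line is also padded to the full width. The name pad_with_substr is hypothetical:

```shell
# pad_with_substr WIDTH [FILE...]
# Same substr trick, with the width passed in (hypothetical helper).
pad_with_substr() {
    width=$1; shift
    awk -v w="$width" '
        BEGIN { for (i = 1; i <= w; i++) t = t "A" }   # t holds w copies of "A"
        NR % 2 == 0 && length($0) < w {
            $0 = $0 substr(t, length($0) + 1)          # the last (w - length) characters of t
        }
        { print }
    ' "$@"
}
```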

awk 'NR%2 == 0{
printf("%s", $0)
for(i=length($0); i<66; i++)printf("A")
print "";next }
{print}'

awk -v FS= '{printf "%s",$0} !(NR%2){for (i=NF+1;i<=66;i++) printf "A"} {print ""}'
or if you don't like loops:
awk -v FS= '{sfx=(NR%2 ? "" : sprintf("%*s",66-NF,"")); gsub(/ /,"A",sfx); print $0 sfx}'

Related

Merge lines based on first column without delimiter

I need to merge all the lines that have the same value on the first column.
The input file is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)
34600000031|(2|2|0|2|2|20190114180000|20191027185959)
34600000031|(3|3|0|3|3|20190114180000|20191027185959)
34600000031|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)
34600000015|(2|2|100|2|9|20190114180000|20191027185959)
34600000015|(3|3|100|3|10|20190114180000|20191027185959)
34600000015|(4|4|100|4|11|20190114180000|20191027185959)
I was able to partially achieve it using the following:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p,x); s=s $0} END{print s}' INPUT
The output is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)|(2|2|0|2|2|20190114180000|20191027185959)|(3|3|0|3|3|20190114180000|20191027185959)|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)|(2|2|100|2|9|20190114180000|20191027185959)|(3|3|100|3|10|20190114180000|20191027185959)|(4|4|100|4|11|20190114180000|20191027185959)
What I need (and I cannot find how to get) is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)
I could do a sed after the initial awk but I don't believe that this is the proper way to do it.
You need to remove the separator along with the key. Your fixed awk command would look like this:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p "\\|",x); s=s $0} END{print s}'
but it's also good to match beginning of the string:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub("^" p "\\|",x); s=s $0} END{print s}'
I would do it somewhat simpler, which uses more memory (as it stores everything in an array) but doesn't need the file to be sorted:
awk -F'|' '{ k=$1; sub("^" $1 "\\|", ""); a[k] = a[k] $0 } END{ for (i in a) print i "|" a[i] }'
For each line, remember the first field, strip the first field and its trailing | from the line, then append what is left to an array element indexed by that first field. At the end, print each key, the separator, and the accumulated value.
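
Note that for (i in a) visits the keys in an unspecified order, so the output lines may not come out in input order. A sketch that also records the order in which keys are first seen, and uses substr() instead of a regex so the key is never interpreted as a pattern. The names merge_by_key, merged and order are hypothetical:

```shell
# merge_by_key [FILE...]
# Joins the remainder of every line that shares the same first |-separated field.
merge_by_key() {
    awk -F'|' '
        {
            key = $1
            rest = substr($0, length(key) + 2)        # everything after "key|"
            if (!(key in merged)) order[++n] = key    # remember first-seen order
            merged[key] = merged[key] rest
        }
        END {
            for (i = 1; i <= n; i++) print order[i] "|" merged[order[i]]
        }
    ' "$@"
}
```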
$ awk -F'|' '
{
    curr = $1
    sub(/^[^|]+\|/,"")
    printf "%s%s", (curr==prev ? "" : ors curr FS), $0
    ors = ORS
    prev = curr
}
END { print "" }
' file
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)

Find/replace within a line only if line does not contain a certain string (awk)

I'm trying to reproduce an awk command using different syntax. I have a file (test.txt) that looks like this:
>NAME_123_CONSENSUS
GACTATACA
ATACTAGA
>NAME2_48_TEST
ATAGCGA
and I'm hoping to replace all occurrences of "A" with "1" using a different awk syntax. I can solve this using the following line:
awk '!/_/{gsub("A", "1"); 1' test.txt
However, I cannot get the same result using a for loop,
awk '{for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gsub("A","1")}' test.txt
nor using the following input
awk '{ if ($0 ~ "_") print $0; else print gsub("A", "1"); }' test.txt
Both of these last commands give the following output. Why are they giving different output and what am I missing to make both of the last two commands give the desired output?
>NAME_123_CONSENSUS
4
4
5
>NAME2_48_TEST
3
You are using the gsub() function incorrectly here. The sub() and gsub() functions return the number of substitutions made, not the modified string. Pass the string to modify as the last argument, then print it:
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else { gsub("A","1",$0); print } }'
That said, your first command is the most efficient and terse way of writing this. Notice that you were missing a } in the original post. It should have been written as
awk '!/_/{ gsub("A", "1") }1'
Or use gensub(), available in GNU Awk, which returns the modified string so that you can print it directly. See the String Functions section of the GNU Awk manual for more about it.
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gensub(/A/, "1", "g") }'
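
To make the return value visible, here is a small sketch that prints the count gsub() returns next to the line it modified in place. The function name count_and_replace is hypothetical, and header lines are assumed to be the ones containing an underscore, as in the question:

```shell
# count_and_replace [FILE...]
# Prints "<count> <modified line>" for every non-header line.
count_and_replace() {
    awk '
        /_/ { print; next }        # header lines pass through untouched
        {
            n = gsub(/A/, "1")     # n is the number of substitutions; $0 is changed in place
            print n, $0
        }
    ' "$@"
}
```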

How to not remove the header while executing awk

I have a file file like this :
k_1_1
k_1_3
k_1_6
...
I have a file file2 :
0,1,2,3,...
k_1_1,17,16,15,...
k_1_2,17,89,15,...
k_1_3,10,26,45,...
k_1_4,17,16,15,...
k_1_5,10,26,45,...
k_1_6,17,16,15,...
...
I want to print the lines of file2 that match a line in file. The desired output is:
0,1,2,3,...
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
I tried
awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2 > result
But the header line is gone in result like this :
k_1_1,17,16,15,...
k_1_3,10,26,45,...
k_1_6,17,16,15,...
How can I keep it? Thank you.
Always print the first line, unconditionally.
awk 'BEGIN{FS=OFS=","}
NR==FNR{a[$1];next}
FNR==1 || $1 in a' file file2 > result
Notice also how { print $0 } is not necessary because it's the default action.
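
The same idea wrapped in a shell function, as a sketch. The name filter_keep_header and its two positional arguments are assumptions for illustration. One caveat that applies to any NR==FNR script: if the key file is empty, the first block also runs on the data file and nothing is printed.

```shell
# filter_keep_header KEYFILE DATAFILE
# Prints the header of DATAFILE plus every row whose first field appears in KEYFILE.
filter_keep_header() {
    awk -F, '
        NR == FNR { keys[$1]; next }   # first file: collect the keys
        FNR == 1 || ($1 in keys)       # second file: header line, or a key match (default action prints)
    ' "$1" "$2"
}
```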
A very ad-hoc solution to your problem could be to compose the output in a command group:
{ head -1 file2; awk 'BEGIN{FS=OFS=","}NR==FNR{a[$1];next}$1 in a {print $0}' file file2; } > result
Could you please try following.
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1 && ++count==1{print;next} a[$1]' Input_file Input_file2
OR
awk -F, 'FNR==NR{a[$1]=$0;next} FNR==1{print;next} a[$1]' Input_file Input_file2

Edit header file with awk

I have a file of whitespace-separated values that I need to convert into:
header=tab separated,
records=" ; " separated (space-semicolon-space)
what i'm doing now is:
cat ${original} | awk 'END {FS=" "} { for(i=1; i<=NR; i++) {if (i==1) { OFS="\t"; print $0; } else { OFS=";" ;print $0; }}}' > ${new}
But it only partly works. First, it produces millions of lines, while the original file has about 90000.
Second, the header, which should be modified here:
if (i==1) { OFS="\t"; print $0; }
is not modified at all.
Another option would be to use sed. I can get the job partially done that way, but again the header remains untouched:
cat ${original} | sed 's/\t/ ;/g' > ${new}
This line should change every separator in the file:
awk -F'\t' -v OFS=";" '{$1=$1}1' file
This will leave the header untouched:
awk -F'\t' -v OFS=";" 'NR>1{$1=$1}1' file
This will change only the header line:
awk -F'\t' -v OFS=";" 'NR==1{$1=$1}1' file
You could paste an example to let us know why your header was not modified.
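
For what it's worth, two things in the question's command explain the symptoms. The for loop runs from 1 to NR for every record, so each line is printed NR times. And print $0 never picks up a new OFS, because awk only rebuilds $0 with OFS when a field is assigned. A minimal sketch that handles the header and the records in one pass, assuming whitespace-separated input; the function name convert_separators is made up for illustration:

```shell
# convert_separators [FILE...]
# Header line becomes tab-separated, all other lines " ; " separated.
convert_separators() {
    awk '
        NR == 1 { OFS = "\t" }     # header: tab-separated
        NR == 2 { OFS = " ; " }    # records: space-semicolon-space
        { $1 = $1; print }         # assigning a field rebuilds $0 using OFS
    ' "$@"
}
```

It would be called as convert_separators "${original}" > "${new}".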

awk command to split nth field

I am learning AWK and was trying some exercises on built-in string functions.
Here's my exercise:
I have a file containing as below
RecordType:83
1,2,3,a|x|y|z,4,5
And my desired output is as below:
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
I wrote an awk command for the above output.
awk -F',' '$1 ~ /RecordType:83/{print $0}
$1 == 1{
split($4,splt,"|")
for(i in splt)
{
if(i==1)
print $1,$2,$3,splt[i],$5,$6
else
print $1,0,0,splt[i],$5,$6
}
}' OFS=, file_name
The above command looks clumsy. Is there any way to shorten it?
Thanks in advance
The shortest possible one-liner I could manage:
awk -F, 'NR>1{n=split($4,a,"|");for(i=0;i++<n;){$4=a[i];print;$2=$3=0}}NR==1' OFS=, file
RecordType:83    
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
The much more readable script (recommended):
BEGIN {
    FS=OFS=","                  # Comma delimiter
}
NR==1 {                         # If the first line in file
    print $0                    # Print the whole line
    next                        # Skip to next line
}
{
    n=split($4,a,"|")           # Split field four on |
    for(i=1;i<=n;i++)           # For each sub-field
        print $1,(i==1?$2 OFS $3:"0" OFS "0"),a[i],$5,$6    # Print the output
}
another shorter one-liner
awk -F, -v OFS="," 'NR>1{n=split($4,a,"|");i=0;while(++i<=n){$4=a[i];print;$2=$3=0}}NR==1' file
with your example:
kent$ awk -F, -v OFS="," 'NR>1{n=split($4,a,"|");i=0;while(++i<=n){$4=a[i];print;$2=$3=0}}NR==1' file
RecordType:83
1,2,3,a,4,5
1,0,0,x,4,5
1,0,0,y,4,5
1,0,0,z,4,5
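
If the file can hold more than one data record, a slightly longer sketch may be easier to adapt. It assumes that any line with fewer than four comma-separated fields is a header to pass through unchanged; the name split_fourth_field is invented here:

```shell
# split_fourth_field [FILE...]
# Emits one output row per |-separated value in field 4.
split_fourth_field() {
    awk -F, -v OFS=, '
        NF < 4 { print; next }          # header-like lines pass through (assumption)
        {
            n = split($4, parts, "|")   # parts[1..n] hold the sub-values
            for (i = 1; i <= n; i++) {
                $4 = parts[i]
                print
                $2 = $3 = 0             # rows after the first get zeros in fields 2 and 3
            }
        }
    ' "$@"
}
```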