awk - combine lines variables & print as columns - awk

I have a problem that is too difficult for me, but I believe it can be solved very easily in awk.
My data looks like this:
8377885 8384365 8385357 8385877 # 8378246 8384786 8385450 8386102
66999065 66999928 67091529 # 66999090 67000051 67091593
The lines differ, but each has a '#' exactly in the middle. I want to:
1. Pair the elements on either side of the '#', first with first through last with last;
2. Print each pair on its own line, as two columns.
Preferred output would look like this:
8377885 8378246
8384365 8384786
8385357 8385450
8385877 8386102
8390268 8390996
66999065 66999090
66999928 67000051
67091529 67091593
Hope someone could help me with this.

Here you have a one-liner:
awk 'BEGIN { FS = "[# ]+" } { for (i=1;i<=NF/2;i++) { printf "%s %s\n", $i, $(NF/2+i) } }' infile
That yields:
8377885 8378246
8384365 8384786
8385357 8385450
8385877 8386102
66999065 66999090
66999928 67000051
67091529 67091593

The following will work:
awk -F"#" '{n=split($1,a," ");split($2,b," ");for(i=1;i<=n;i++)print a[i],b[i]}' your_file
Tested:
> awk -F"#" '{n=split($1,a," ");split($2,b," ");for(i=1;i<=n;i++)print a[i],b[i]}' temp
8377885 8378246
8384365 8384786
8385357 8385450
8385877 8386102
66999065 66999090
66999928 67000051
67091529 67091593
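Both answers above assume the two halves of every line hold the same number of values. A minimal sketch of a guarded variant follows, assuming that a mismatched line should be skipped with a warning; the function name pair_halves is made up for illustration, and writing to "/dev/stderr" relies on the awk implementation or the OS providing that name.

```shell
# Hypothetical helper: pair the values left and right of "#", one pair per line.
pair_halves() {
  awk -F'#' '{
    n = split($1, a, " ")   # values before the "#"
    m = split($2, b, " ")   # values after the "#"
    if (n != m) {           # assumed policy: warn and skip uneven lines
      printf "line %d: %d vs %d values, skipped\n", NR, n, m > "/dev/stderr"
      next
    }
    for (i = 1; i <= n; i++) print a[i], b[i]
  }' "$@"
}

printf '1 2 # 3 4\n5 # 6 7\n' | pair_halves
```

It can be called as pair_halves infile, or fed from a pipe as shown.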

Related

Using awk to analyze log file to identify blocks and to extract information

I am trying to figure out a way to use awk to analyze my log files from an old application. The log file contains processing information from the application but the structure is a bit messy. But it has a structure like this:
some random text
...
BLOCK-BEGIN bla bla INFO1:VAL1
variable lines of text
INFO2:VAL2
variable lines of text
POSSIBLE-BLOCK-END-PHRASE1
...
some random text
INFO3:not-desired-val5
...
BLOCK-BEGIN bla bla INFO1:VAL3
variable lines of text
INFO2:VAL4
variable lines of text
POSSIBLE-BLOCK-END-PHRASE2
...
What I want to do is to first identify the blocks. In the example above, there are two blocks with the same beginning but different endings. Within each block, I then want to extract a few pieces of information, i.e. INFO1 and INFO2 in the example. The desired output in this case would be:
VAL1,VAL2
VAL3,VAL4
I know some basic awk, so any solutions or hints are highly welcome. Thanks
Update: my first attempt
awk '/BLOCK-BEGIN/{printf substr($4,7)",";for (i = 0 ; i < NF; i++) getline; if($0 ~ '/^INFO2/') print substr($0,7)}'
The output is:
VAL1,VAL2
VAL3,VAL4
But is there a better way to do it? Any suggestions?
$ awk -v OFS=',' '
(split($NF,a,/:/) == 2) && sub(/^INFO/,"",a[1]) {
info[a[1]] = a[2]
if ( a[1] == 2 ) {
print info[1], info[2]
}
}
' file
VAL1,VAL2
VAL3,VAL4
Regarding the code you posted in your question:
printf substr($4,7)"," - never do printf <input data> as it'll fail when your input contains printf formatting characters; always do printf "%s", <input data> instead, so that code should be written printf "%s,", substr($4,7).
getline - there are only a few specific situations where getline is the right approach, and when it is you have to write it securely. This isn't the right situation and it's not written securely. See awk.freeshell.org/AllAboutGetline.
for (i = 0 ; i < NF; i++) - all field numbers, array indices created by functions such as split(), and string character positions in awk start at 1, not 0, so write your code to match so you don't trip over thinking arrays or anything else start at zero: for (i = 1 ; i <= NF; i++).
'foo... $0 ~ '/^INFO2/' ...bar' those inner 's are terminating the awk script body and so exposing what's between them to the shell for interpretation. Never do that. In this case idk why you thought you needed them as your code should just be 'foo... $0 ~ /^INFO2/ ...bar'.
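To illustrate the printf point, here is a small sketch in which the input data itself contains a formatting character. The function name info1_value and the sample value 100%d are made up for the example; the only thing it is meant to show is that the data goes through "%s" and is never used as the format string.

```shell
# Hypothetical helper: print the text after "INFO1:" in field 4 of BLOCK-BEGIN lines.
info1_value() {
  # The data is passed as an argument to "%s", so a "%" inside it is printed literally.
  awk '/BLOCK-BEGIN/ { printf "%s\n", substr($4, 7) }' "$@"
}

printf '%s\n' 'BLOCK-BEGIN bla bla INFO1:100%d' | info1_value
```

With that input the sketch prints 100%d unchanged.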
With your shown samples only, please try following awk code.
awk -F'INFO[0-9]+:' '
/BLOCK-BEGIN/{
if(val2 && val1){
print val1","val2
}
val1=val2=""
val1=$NF
next
}
/^INFO[0-9]+:/{
val2=(val2?val2 ",":"") $NF
}
END{
if(val2 && val1){
print val1","val2
}
}
' Input_file
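If INFO lines can also appear between blocks (like INFO3:not-desired-val5 in the sample), a pattern that only looks at the line prefix would pick them up too, so it may help to track explicitly whether you are inside a block. A minimal sketch, assuming every block ends with a line starting with POSSIBLE-BLOCK-END (taken from the sample; the real end phrases may well differ) and that only INFO1 and INFO2 are wanted; the name extract_blocks is hypothetical.

```shell
# Hypothetical helper: print INFO1,INFO2 for each BLOCK-BEGIN ... end-phrase block.
extract_blocks() {
  awk -v OFS=',' '
    /BLOCK-BEGIN/ {                               # a new block starts here
      inblock = 1; v1 = v2 = ""
      if (split($NF, p, ":") == 2) v1 = p[2]      # INFO1:VAL1 -> VAL1
      next
    }
    inblock && /^INFO2:/ { v2 = substr($0, 7) }   # text after "INFO2:"
    inblock && /^POSSIBLE-BLOCK-END/ {            # assumed end marker
      print v1, v2
      inblock = 0                                 # ignore lines until the next block
    }
  ' "$@"
}
```

Run it as extract_blocks file. Lines seen while inblock is 0 are ignored, which is what keeps stray INFO lines out of the output.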

Find/replace within a line only if line does not contain a certain string (awk)

I'm trying to reproduce an awk command using different syntax. I have a file (test.txt) that looks like this:
>NAME_123_CONSENSUS
GACTATACA
ATACTAGA
>NAME2_48_TEST
ATAGCGA
and I'm hoping to replace all occurrences of "A" with "1" using different awk syntax. I can solve this using the following line:
awk '!/_/{gsub("A", "1"); 1' test.txt
However, I cannot get the same result using a for loop,
awk '{for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gsub("A","1")}' test.txt
nor using the following input
awk '{ if ($0 ~ "_") print $0; else print gsub("A", "1"); }' test.txt
Both of these last commands give the following output. Why are they giving different output and what am I missing to make both of the last two commands give the desired output?
>NAME_123_CONSENSUS
4
4
5
>NAME2_48_TEST
3
You are using the gsub() function incorrectly here. The sub()/gsub() functions return the number of substitutions made, not the modified string. Pass the string to modify as the last argument (it defaults to $0), let gsub() change it in place, and then print it:
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else { gsub("A","1",$0); print } }' test.txt
That said, your first command is the most efficient and terse way of writing this. Notice you were missing a } in the OP. It should have been written as
awk '!/_/{ gsub("A", "1") }1' test.txt
Or use gensub(), available in GNU Awk, which returns the modified string so that you can print it directly. See the String-Functions section of the GNU Awk manual for more.
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gensub(/A/, "1", "g") }' test.txt
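The difference between the return value and the modified record can be seen side by side in a small sketch; the function name count_and_replace is only for illustration.

```shell
# Hypothetical helper: show what gsub() returns next to what it did to $0.
count_and_replace() {
  awk '{
    n = gsub(/A/, "1")   # n is the number of replacements; $0 is edited in place
    print n, $0
  }' "$@"
}

echo 'GACTATACA' | count_and_replace
# prints: 4 G1CT1T1C1
```

The 4 is the same number the question's commands printed for that line, because they printed the return value instead of $0.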

How to split - awk

I was wondering if I can build two lists, one from the text to the left of each '=' and one from the text to the right, and then join the first list with '|' and the second with ','. I have tried but failed because the number of items is not fixed. Even C16 can show up, in which case there will be 16 items in the input.
Can you give me any hint?
Input
C1=34,C2=35,C3="99"
Output
C1|C2|C3#34,35,"99"
You can pass multiple characters as the delimiter when using -F. The command could look like this:
awk -F'[,=]' '{printf "%s|%s|%s#%s,%s,%s\n", $1,$3,$5,$2,$4,$6}' input.txt
I'm using , and = as the delimiter. This makes it simple to access individual values and reassemble them using printf.
If the number of columns is unknown, you need to loop over the columns. First over the odd columns which are the names, then over the even columns which are the values. I suggest to put it into a script:
test.awk
BEGIN {
    FS="[,=]"
}
{
    # names are in the odd-numbered fields
    for(i=1;i<=NF;i+=2){
        if(i>=NF-1){
            fmt="%s"
        }else{
            fmt="%s|"
        }
        printf fmt,$i
    }
    printf "#"
    # values are in the even-numbered fields
    for(i=2;i<=NF;i+=2){
        if(i>=NF-1){
            fmt="%s"
        }else{
            fmt="%s,"
        }
        printf fmt,$i
    }
    printf "\n"
}
Then execute it like this:
awk -f test.awk input.txt
awk -F'[=,]' '
{
for (i=1;i<=NF;i+=2) {
printf "%s%s", $i, (i<(NF-1)?"|":"#")
}
for (i=2;i<=NF;i+=2) {
printf "%s%s", $i, (i<NF?",":ORS)
}
}
' file
C1|C2|C3#34,35,"99"
awk '{sub(/C1=34,C2=35,C3="99"/,"C1|C2|C3#34,35,\"99\"")}1' file
C1|C2|C3#34,35,"99"
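Another way to handle an unknown number of pairs is to build both lists in a single pass and print them at the end of the record. This is only a sketch, under the assumption that neither names nor values contain ',' or '='; the function name split_pairs is made up.

```shell
# Hypothetical helper: turn name=value,name=value,... into names#values.
split_pairs() {
  awk -F'[=,]' '{
    names = vals = ""
    for (i = 1; i < NF; i += 2) {                 # fields come as name, value pairs
      names = names (i > 1 ? "|" : "") $i         # names joined with "|"
      vals  = vals  (i > 1 ? "," : "") $(i + 1)   # values joined with ","
    }
    print names "#" vals
  }' "$@"
}

printf '%s\n' 'C1=34,C2=35,C3="99"' | split_pairs
# prints: C1|C2|C3#34,35,"99"
```

Collecting into strings first means there is one print per record, so the trailing newline cannot be forgotten.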

using awk to count characters and modify file accordingly

I have a file that looks like this
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`cee
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTG
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghf
I would like to go to every second line and count the number of characters. If the line contains less than e.g. 66 characters then fill it to 66 with 'A' and print to new file. If it contains 66 characters then just print the line as is.
The output file would look like this;
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`ceeAAAAAAAAAAAAA
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
I have a very basic knowledge of awk so from a learning perspective I would like to use awk to solve the problem.
One way:
awk '!(NR%2) && length<66{for(i=length;i<66;i++)$0=$0 "A"}1' file
This should be faster than the accepted approach:
awk 'NR%2==0 { x = sprintf("%-66s", $0); gsub(/ /,"A",x); $0 = x }1' file
Results:
#FCD17BKACXX:8:1101:2703:2197#0/1
CAGCTTTACTCGTCATTTCCCCCAAGGGTAAAATGCGTCCGTCCATTAAGTTCACAGTCATCGTCT
+FCD17BKACXX:8:1101:2703:2197#0/1
^`^\eggcghheJ`dffhhhffhe`ecd^a^_ceacecfhf\beZegfhh_fghhgfZbdg]c^a`
#FCD17BKACXX:8:1101:4434:2244#0/1
CTGCGTTCATCGCGTTGTTGGGAGGAATCTCTACCCCAGGTTCTCGCTGTGAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:4434:2244#0/1
eeecgeceeffhhihi_fhhiicdgfghiiihiiihiiihVbcdgfhge`ceeAAAAAAAAAAAAA
#FCD17BKACXX:8:1101:6394:2107#0/1
CAGCAGGACTAGGGCCTGCAGACGTACTGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+FCD17BKACXX:8:1101:6394:2107#0/1
eeeccggeghhiihiihihihhhhcfghfAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Here is another (maybe strange) one-liner:
awk 'BEGIN{while(++i<66)t=t"A"}!(NR%2){$0=$0substr(t,length)}1' file
awk 'NR%2 == 0{
printf("%s", $0)
for(i=length($0); i<66; i++)printf("A")
print "";next }
{print}'
awk -v FS= '{printf "%s",$0} !(NR%2){for (i=NF+1;i<=66;i++) printf "A"} {print ""}'
or if you don't like loops:
awk -v FS= '{sfx=(NR%2 ? "" : sprintf("%*s",66-NF,"")); gsub(/ /,"A",sfx); print $0 sfx}'
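If the target width or the fill character may change, they can be passed in with -v instead of being hard-coded. A rough sketch, assuming the fill is a single character and that it is the even-numbered lines that need padding; the function name pad_even_lines and its two optional arguments are invented for the example.

```shell
# Hypothetical helper: pad_even_lines [width] [fill] reads stdin, writes stdout.
pad_even_lines() {
  awk -v w="${1:-66}" -v pad="${2:-A}" '
    !(NR % 2) { while (length($0) < w) $0 = $0 pad }   # grow even lines up to w
    1                                                  # print every line
  '
}

printf 'header\nACGT\n' | pad_even_lines 8
```

Lines that are already w characters or longer are printed unchanged, as are all odd-numbered lines.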

awk Joining n fields with delimiter

How can I use awk to join various fields, given that I don't know how many of them I have? For example, given the input string
aaa/bbb/ccc/ddd/eee
I use -F'/' as the delimiter, do some manipulation on aaa, bbb, ccc, ddd, eee (altering, removing...) and I want to join them back to print something like
AAA/bbb/ddd/e
Thanks
... given that I don't know how many of them I have?
Ah, but you do know how many you have. Or you will soon, if you keep reading :-)
Before giving you a record to process, awk will set the NF variable to the number of fields in that record, and you can use for loops to process them (comments aren't part of the script, I've just put them there to explain):
$ echo pax/is/a/love/god | awk -F/ '{
    gsub (/god/,"dog",$5);          # pax,is,a,love,dog
    $4 = "";                        # pax,is,a,,dog
    $6 = $5;                        # pax,is,a,,dog,dog
    $5 = "rabid";                   # pax,is,a,,rabid,dog
    printf "%s", $1;                # output "pax"
    for (i = 2; i <= NF; i++) {     # output ".<field>"
        if ($i != "") {             # but only for non-blank fields (skip $4)
            printf ".%s", $i;
        }
    }
    printf "\n";                    # finish line
}'
pax.is.a.rabid.dog
This shows manipulation of the values, as well as insertion and deletion.
The following will show you how to process each field and do some example manipulations on them.
The only caveat of using the output field separator OFS is that "deleted" fields will still have delimiters as shown in the output below; however it makes the code much simpler if you can live with that.
awk '
BEGIN{FS=OFS="/"}
{
for(i=1;i<=NF;i++){
if($i == "aaa")
$i=toupper($i)
else if($i ~ /c/)
$i=""
else if($i ~ /^eee$/)
$i="e"
}
}1' <<<'aaa/bbb/ccc/ddd/eee'
Output
AAA/bbb//ddd/e
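If the doubled delimiter left by an emptied field is not acceptable, one option is to rebuild the line by hand and skip empty fields. A minimal sketch, assuming that an empty field always means "deleted" (which would be wrong for input that legitimately contains empty fields); the edits and the function name edit_and_join are just examples.

```shell
# Hypothetical helper: edit some fields, then join only the non-empty ones.
edit_and_join() {
  awk 'BEGIN { FS = OFS = "/" } {
    $1 = toupper($1); $3 = ""; $5 = "e"      # example edits only
    out = sep = ""
    for (i = 1; i <= NF; i++)
      if ($i != "") {                        # skip fields that were emptied
        out = out sep $i
        sep = OFS                            # separator only between kept fields
      }
    print out
  }' "$@"
}

echo 'aaa/bbb/ccc/ddd/eee' | edit_and_join
# prints: AAA/bbb/ddd/e
```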
This might work for you:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'BEGIN{FS=OFS="/"}{sub(/../,"",$4);NF=4;print}'
aaa/bbb/ccc/d
To delete fields not at the end use a function to shuffle the values:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'func d(n){for(x=n;x<=NF-1;x++){y=x+1;$x=$y}NF--};BEGIN{FS=OFS="/"}{d(2);print}'
aaa/ccc/ddd/eee
Deletes the second field.
awk -F'/' '{
    # I would suggest copying the fields into an array first:
    split("", a)                          # clear the array for each record
    for (i=1;i<=NF;i++) { a[i]=$i }
    # Now manipulate your elements in the array, e.g. delete a[3],
    # then finally join what is left, in field order:
    output = ""; sep = ""
    for (i=1;i<=NF;i++) if (i in a) { output = output sep a[i]; sep = "/" }
    print output
}' INPUTFILE
Doing it this way you can delete elements as well. Note deleting an item can be done like delete array[index].