Print parts of file using awk

I have several conditions for what I want to print: skip any \hello line that falls inside a part I would like to print; print from \k{f} to \l{f} and from \word{g} to \word2{g}; print the row starting with \hello2; and print the part between \b and \bf. There is one problem: the line \bf} ends with a } that should not be printed:
awk '
/\\hello/{
next
}
/\\k\{f\}|\\word\{g\}|\\b/{
found=1
}
found;
/\\l\{f\}|\\word2\{g\}|\\bf/{
found=""
}
/\\hello2/
' file.txt
I would like to add a condition that \bf must be alone on its row. How can I do that?
file.txt:
text
text
\hello2
456
565
\word{g}
s
\hello
\word2{g}
\k{f}
fdsfd
fgs
\l{f}
text
\b
7
\hello
\bf}
text
Output now:
\word{g}
s
\word2{g}
\k{f}
fdsfd
fgs
\l{f}
\b
7
\bf}
The desired output:
\word{g}
s
\word2{g}
\k{f}
fdsfd
fgs
\l{f}
\b
7
\bf

Add a condition that replaces \bf} with \bf before the line is printed:
awk '
/\\hello/{
next
}
/\\k\{f\}|\\word\{g\}|\\b/{
found=1
}
# Fix BF lines
/\\bf}/ { $0 = "\\bf" }
#
found;
/\\l\{f\}|\\word2\{g\}|\\bf/{
found=""
}
/\\hello2/
' file.txt
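If \bf should only count when it is alone on its row, one possible variation is to anchor the marker patterns and compare the \bf} line exactly. This is only a sketch, not the answer's own code: it rebuilds the sample file.txt in a temporary file, and anchoring \b as ^\\b$ is an assumption about what the markers look like in the real data. The \hello2 rule is left out because the /\\hello/ skip already catches that line.

```shell
# Recreate the sample input from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' 'text' 'text' '\hello2' '456' '565' '\word{g}' 's' '\hello' \
  '\word2{g}' '\k{f}' 'fdsfd' 'fgs' '\l{f}' 'text' '\b' '7' '\hello' '\bf}' 'text' > "$tmp"

out=$(awk '
/\\hello/ { next }                              # skip every \hello line (this also skips \hello2)
$0 == "\\bf}" { $0 = "\\bf" }                   # exact-line match: drop the stray brace
/\\k\{f\}|\\word\{g\}|^\\b$/ { found = 1 }      # start markers; \b must be alone on its line
found                                           # print while inside a block
/\\l\{f\}|\\word2\{g\}|^\\bf$/ { found = "" }   # end markers; \bf must be alone on its line
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

With the sample data this prints the desired output, ending in \bf without the brace. A line such as "x \bf y" would no longer close the block, which may or may not be what is wanted.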

Related

Insert pattern from current line in next line

I have this file :
>AX-899-Af-889-[A/G]
GTCCATTCAGGTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTATTTTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAATGACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I need to take the pattern [X/X] from each line starting with > and insert it into the next line at the 10th position, replacing the 10th character:
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
I can extract the pattern :
awk 'match($0, /^>/) {split($0,a,"-"); print; getline; print a[5]}1' file
I can also replace the 10th character with a fixed pattern ("N", for example): sed 's/^\([ATCG].\{8\}\)[ATCG]/\1N/' file
With your shown samples, please try following awk.
awk '
BEGIN{ FS=OFS="-" }
/^>/ {
val=$NF
print
next
}
{
print substr($0,1,9) val substr($0,11)
val=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="-" } ##Starting BEGIN section from here and setting FS and OFS as - here.
/^>/ { ##Checking condition if line starts from > then do following.
val=$NF ##Setting last field($NF) to val here.
print ##printing current line here.
next ##next will skip all further statements from here.
}
{
print substr($0,1,9) val substr($0,11) ##printing substring from 1st to 9 chars of current line.
##Followed by val and rest of values from 11th char to till last of current line.
val="" ##Nullifying val here.
}
' Input_file ##Mentioning Input_file name here.
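A possible variation, in case the headers do not always end in the bracketed pattern: pull the pattern out with match() instead of relying on $NF, and pass the position as a variable. This is a sketch under assumptions, not part of the answer above: pos and val are illustrative names, and a sequence line is left unchanged when its header had no pattern.

```shell
# Two records from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>AX-899-Af-889-[A/G]' 'GTCCATTCAGGTAAAAAAAAAAAACATAACAATTGAAATTGCATGA' \
  '>AX-899-Af-889-[G/T]' 'AAGGTAGAATGACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT' > "$tmp"

out=$(awk -v pos=10 '
/^>/ {
    # Save the bracketed pattern from the header; empty if the header has none.
    val = match($0, /\[[A-Z]\/[A-Z]\]/) ? substr($0, RSTART, RLENGTH) : ""
    print
    next
}
{
    # Replace the character at pos with the saved pattern, if a header supplied one.
    if (val != "")
        $0 = substr($0, 1, pos - 1) val substr($0, pos + 1)
    print
    val = ""
}
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Changing -v pos=10 moves the insertion point without touching the program.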
Another:
$ awk '
BEGIN { FS=OFS="" } # each char is a field of its own (empty FS works in GNU awk and mawk; POSIX leaves it unspecified)
{
if(/^>/) # if record starts with a >
b=substr($0,length-4,5) # get last 5 chars to buffer
else # otherwise
$10=b # replace 10th char with buffer
}1' file # output
Some output:
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
...
Using sed
$ cat sed.script
/^>/{ #If the line starts with >
p #Print it to create a duplicate line
s/[^[]*\([^]]*]\)/\1/ #Using back referencing, extract the pattern at the end
h #Store the pattern in hold space
d #Now stored in hold space, delete the duplicated line.
}
{
G #Append the contents of the hold space to that of the pattern space.
s/\n// #Remove the newline created by previous command
s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/ #Replace 10th character with the content obtained from the hold space
}
$ sed -f sed.script input_file
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT
Or as a one liner
$ sed '/^>/{p;s/[^[]*\([^]]*]\)/\1/;h;d};{G;s/\n//;s/\(.\{9\}\).\([^[]*\)\(.*\)/\1\3\2/}' input_file
Another idea using sed:
sed -E '/^>/{N;s/(.*-)(\[[^][]*])(\n.{9})./\1\2\3\2/}' file
Explanation
/^>/ If the line starts with >
N Append the next line to the pattern space
(.*-) Capture group 1, match till the last occurrence of -
(\[[^][]*]) Capture group 2, match from opening to closing square brackets [...]
(\n.{9}). Capture a newline and 9 characters in group 3 and match the 10th character
\1\2\3\2 The replacement using the backreferences to the capture groups; the newline is carried inside group 3
Output
>AX-899-Af-889-[A/G]
GTCCATTCA[A/G]GTAAAAAAAAAAAACATAACAATTGAAATTGCATGA
>AX-899-Af-889-[A/G]
GCAAACTAT[A/G]TTCATGAATGAACTTCAGTTGATTGTGAGATG
>AX-899-Af-889-[G/T]
AAGGTAGAA[G/T]GACACCATTAAACAGTAGGGAATTGGTCACAGAACTCT

Search pattern and print it on the line before (awk)

I have this file :
>AX-89948491
CACCTTTT[C/T]ATTTCATTCCTAC
>AX-89940152
AGATGAGA[A/G]TAAAGCTTCTGTC
>AX-89922107
ACAGAAAT[G/T]TATAGATATTACT
I need to find the bracketed pattern [A-Z]/[A-Z] (it is always present on every second line) and append it to the line before, like this:
>AX-89948491-[C/T]
CACCTTTT[C/T]ATTTCATTCCTAC
>AX-89940152-[A/G]
AGATGAGA[A/G]TAAAGCTTCTGTC
>AX-89922107-[G/T]
ACAGAAAT[G/T]TATAGATATTACT
I did :
awk 'tmp=/\[[A-Z]\/[A-Z]]/{if (a && a !~ /\[[A-Z]\/[A-Z]]/) print a"-"$tmp; print} {a=$0}' my_file
But that prints the entire line, not just the pattern.
Any help?
Since the pattern is present on every second line, you could print the previous line followed by the matched part of the current line:
awk '
match($0, /\[[A-Z]\/[A-Z]]/) {
m = substr($0, RSTART, RLENGTH)
print prev "-" m ORS $0
}
{prev = $0}
' my_file
Output
>AX-89948491-[C/T]
CACCTTTT[C/T]ATTTCATTCCTAC
>AX-89940152-[A/G]
AGATGAGA[A/G]TAAAGCTTCTGTC
>AX-89922107-[G/T]
ACAGAAAT[G/T]TATAGATATTACT
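The program above never prints a header whose following line has no pattern. If that can happen in the real file, a more defensive sketch could hold each header until the next line has been inspected. The names hdr and have are illustrative, and the second record below (>AX-00000000 with no bracketed pattern) is made up purely to exercise that branch.

```shell
out=$(printf '%s\n' '>AX-89948491' 'CACCTTTT[C/T]ATTTCATTCCTAC' '>AX-00000000' 'ACGTACGT' |
awk '
/^>/ {
    if (have) print hdr                   # previous header had no sequence line after it
    hdr = $0; have = 1                    # hold this header until the next line is seen
    next
}
{
    if (have && match($0, /\[[A-Z]\/[A-Z]\]/))
        print hdr "-" substr($0, RSTART, RLENGTH)
    else if (have)
        print hdr                         # no pattern found: emit the header unchanged
    have = 0
    print
}
END { if (have) print hdr }               # header on the very last line of the file
')

printf '%s\n' "$out"
```

For input that always alternates header and pattern lines, this should give the same result as the shorter program.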
With your shown samples only, please try the following tac + awk + tac solution. tac feeds the lines to awk in reverse order (bottom to top). When a line matches \[[A-Z]/[A-Z]], awk saves the matched text in val and prints the line. A line with no match is a header line that needs the pattern, so it is printed followed by - and val. Passing the result through tac again restores the original line order.
tac Input_file |
awk '
match($0,/\[[A-Z]\/[A-Z]]/){
val=substr($0,RSTART,RLENGTH)
print
next
}
{
print $0"-"val
}
' | tac

How to use Awk to output multiple consecutive lines

Input/File
A:1111
B:21222
C:33rf33
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
J:efee
I need to select the line that contains 2122 (line B, the 2nd line)
and the lines from the one whose value starts with 444dct4 (line D) through the one with efe989ef (line I, the 9th line).
To summarize:
Select line B (contains 2122)
Select lines D (444dct4) through I (efe989ef)
Desired Output
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Could you please try the following, written and tested with the shown samples in GNU awk. It also handles the case where a line whose 2nd column is 21222 falls inside the 444dct4 to efe989ef range: such a line is NOT printed twice.
awk -F':' '
$2=="21222" && !found{
print
next
}
$2=="444dct4"{
found=1
}
found
$2=="efe989ef"{
found=""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':' ' ##Starting awk program from here and setting field separator as colon here.
$2=="21222" && !found{ ##Checking if 2nd field is 21222 and found is NOT set then try following.
print ##Printing the current line here.
next ##next will skip all further statements from here.
}
$2=="444dct4"{ ##Checking condition if 2nd field is 444dct4 then do following.
found=1 ##Setting found to 1 here.
}
found ##Checking condition if found is SET then print that line.
$2=="efe989ef"{ ##Checking condition if 2nd field is efe989ef then do following.
found="" ##Nullifying found here.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F: '
/2122/ { # line that contains 2122
print
next # to avoid duplicate printing if 2122 also in D-I
}
$2~/^444dct4/,$2~/efe989ef/ # starts with 444dct4 till efe989ef
' file
Output:
B:21222
D:444dct4
E:5tdffe
F:4444we
G:j5555
H:46666
I:efe989ef
Edit:
One-liner:
$ awk -F: '/2122/{print; next} $2~/^444dct4/,$2~/efe989ef/' file.txt
awk -v str1="2122" -v str2="444dct4" -v str3="efe989ef" 'BEGIN { flag=0 } $0 ~ str1 { print } $0 ~ str2 { flag=1 } $0 ~ str3 { flag=0;print;next } flag' file
For flexibility, pass the line to find as str1, the start of the range as str2 and the end as str3. The print flag (flag) starts at 0. Any line matching str1 (2122) is printed. When a line matches str2 (444dct4), the flag is set to 1. When a line matches str3 (efe989ef), the flag is reset to 0, the line is printed and awk skips to the next record. For every other line, flag determines whether it gets printed. Note that the tests are regex matches against the whole line ($0), not the second field, and that a line matching str1 inside the range would be printed twice.
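A variation on the same parameterised idea, as a rough sketch: compare field 2 exactly for the range boundaries and use index() for the "contains" test, so that no value is interpreted as a regex and a line inside the range is printed only once. The variable names want, from, to and inrange are illustrative, not taken from the answers above.

```shell
# Sample input from the question.
tmp=$(mktemp)
printf '%s\n' 'A:1111' 'B:21222' 'C:33rf33' 'D:444dct4' 'E:5tdffe' 'F:4444we' \
  'G:j5555' 'H:46666' 'I:efe989ef' 'J:efee' > "$tmp"

out=$(awk -F: -v want="2122" -v from="444dct4" -v to="efe989ef" '
$2 == from                  { inrange = 1 }   # range opens on an exact match of field 2
inrange                     { print }         # inside the range: print exactly once
!inrange && index($2, want) { print }         # outside the range: plain substring test
$2 == to                    { inrange = 0 }   # range closes after its last line is printed
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Whether exact matching on field 2 is appropriate depends on the real data; if the boundaries are only prefixes, index($2, from) == 1 would be the corresponding test.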

Extract sequence from list of data into separate line

sample.txt has tab-separated columns. The third column holds semicolon-separated values, some of which are number ranges that need to be expanded into one row per value.
cat sample.txt
2 2627 588;577
2 2629 566
2 2685 568-564
2 2771 573
2 2773 597
2 2779 533
2 2799 558
2 6919 726;740-742;777
2 7295 761;771-772
Please note that some lines may have an inverted range, such as 568-564.
Using my previous script, I managed to split the values, but failed to expand the ranges (separated by a dash):
#!/bin/sh
awk -F"\t" '{print $1}' $1 >> $2 &&
awk -F"\t" '{print $2}' $1 >> $2 &&
awk -F"\t" '{print $3}' $1 >> $2 &&
sed -i "s/^M//;s/;\r//g" $2
#!/bin/awk -f
BEGIN { FS=";"; recNr=1}
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print a[lineNr] "," b[lineNr] "," $i
}
}
Expected
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Could you please try the following.
awk '
BEGIN{
OFS=","
}
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(array[i]~/-/){
split(array[i],array2,"-")
to=array2[1]>array2[2]?array2[1]:array2[2]
from=array2[1]<array2[2]?array2[1]:array2[2]
while(from<=to){
print $1,$2,from++
}
}
else{
print $1,$2,array[i]
}
from=to=""
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
OFS="," ##Setting OFS as comma here.
}
{
num=split($NF,array,";") ##Splitting last field of line into an array named array with delimiter semi-colon here.
for(i=1;i<=num;i++){ ##Starting a for loop from 1 to till value of num which is actually length of array created in previous step.
if(array[i]~/-/){ ##Checking condition if array value with index i contains a dash then do following.
split(array[i],array2,"-") ##Split value of array with index i to array2 here with delimiter -(dash) here.
to=array2[1]>array2[2]?array2[1]:array2[2] ##Creating to variable which will compare 2 elements of array2 and have maximum value out of them here.
from=array2[1]<array2[2]?array2[1]:array2[2] ##Creating from variable which will compare 2 elements of array2 and will have minimum out of them.
while(from<=to){ ##Running while loop from variable from to till value of variable to here.
print $1,$2,from++ ##Printing 1st, 2nd fields with value of from variable and increasing from value with 1 each time it comes here.
}
}
else{ ##Mention else part of if condition here.
print $1,$2,array[i] ##Printing only 1st, 2nd fields along with value of array with index i here.
}
from=to="" ##Nullifying variables from and to here.
}
}
' Input_file ##Mentioning Input_file name here.
Adding a link that explains the conditional operators ? and : as per James sir's comments:
https://www.gnu.org/software/gawk/manual/html_node/Conditional-Exp.html
For the shown sample, the output will be as follows.
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
$ awk '
BEGIN {
FS="( +|;)" # input field separator is space or ;
OFS="," # output fs is comma
}
{
for(i=3;i<=NF;i++) { # from the 3rd field to the end
n=split($i,t,"-") # split on - if any. below loop from smaller to greater
if(n) # in case of empty fields
for(j=(t[1]<t[n]?t[1]:t[n]); j<=(t[1]<t[n]?t[n]:t[1]);j++)
print $1,$2,j # output
}
}' file
Output
2,2627,588
2,2627,577
2,2629,566
2,2685,564 <─┐
2,2685,565 │
2,2685,566 ├─ wrong order, from smaller to greater
2,2685,567 │
2,2685,568 <─┘
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
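Both answers expand an inverted range such as 568-564 from smaller to greater, which is what the expected output shows. If the written direction should be kept instead, a sketch along these lines might do; it is a variation, not what the question asked for. The names part, r, lo, hi and step are illustrative, and it assumes every semicolon-separated part is non-empty and numeric.

```shell
out=$(printf '2\t2685\t568-564\n2\t7295\t761;771-772\n' |
awk -F'\t' -v OFS=, '
{
    n = split($3, part, ";")
    for (i = 1; i <= n; i++) {
        m = split(part[i], r, "-")        # m is 1 for a single value, 2 for a range
        lo = r[1] + 0; hi = r[m] + 0      # force numeric values
        step = (lo <= hi) ? 1 : -1        # walk in the direction the range was written
        for (v = lo; v != hi + step; v += step)
            print $1, $2, v
    }
}')

printf '%s\n' "$out"
```

Here 568-564 comes out as 568, 567, 566, 565, 564, while 771-772 stays ascending.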

Move values to column based on row value

In the input file, the date (column 1) normally changes every 4 lines, as for dates 061218 and 061418. Date 061318 is different: it contains 8 lines.
When the date does not change after 4 lines, as for 061318, the values of the second part (lines 5-8) need to be appended to the end of lines 1-4 to produce the desired output file.
Input file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654
061318,0,0
061318,114,60
061318,SD/05,F1/R0
061318,2666
061318,0
061318,1
061318,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
Output file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654,2666
061318,0,0,0
061318,114,60,1
061318,SD/05,F1/R0,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
I tried:
awk -F, '{a[$1]=a[$1]?a[$1]","$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Thanks in advance
If your Input_file is the same as the shown sample (which you confirmed in your comments), then could you please try the following.
awk '
BEGIN{
FS=OFS=","
}
prev!=$1 && prev{
for(i=1;i<=count;i++){
print prev,a[prev,i]
}
prev=count=""
}
{
prev=$1
sub(/[^,]*,/,"")
if(count==4){
count=1
}
else{
count++
}
a[prev,count]=a[prev,count]?a[prev,count] OFS $0:$0
}
END{
if(prev){
for(i=1;i<=count;i++){
print prev,a[prev,i]
}
}
}' Input_file
Alternatively, change the a[prev,count] assignment above to a[prev,count]=(a[prev,count]?a[prev,count] OFS:"")$0, in Ed Morton sir's style, which is shorter and also portable to other awks.