AWK - get value between two strings over multiple lines - awk

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!

If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.

another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0

You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222

Related

How do I match a pattern and then copy multiple lines?

I have two files that I am working with. The first file is a master database file that I am having to search through. The second file is a file that I can make that allows me to name the items from the master database that I would like to pull out. I have managed to make an AWK solution that will search the master database and extract the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 is the first line that I am matching up to. The last number in that row is what tells me how many lines there are to copy. So in the first line, 40005X/50005/60005/3/10/9/, I would be matching off of the 40005X, and the 9 in that line tells me that there are 9 lines underneath that I need to copy with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, however I am open to any suggestion. However, I am working on this on a UNIX system. I would like to keep it as a KSH script, since most of the other scripts that I use with this are written in that format, and I am most familiar with it.
Thank you for your help!!
Your existing awk matches correctly the rows from the ids' file, you now need to add a condition to print N lines ahead after reading the last field of the matching row. So we will set a variable p to the number of lines to print plus one (the current one), and decrease per row printing.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with last condition more "awkish" (by Ed Morton) and covering any possible extreme case of a huge file
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
here the print condition is omitted, as it is the default action, and the condition is true again as long as decreasing p is positive.
another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
this takes care if any of the content lines match the ids given. This will only look for another id after the specified number of lines printed.
Could you please try following, written and tested with shown samples in GNU awk. Considering that you want to print lines from line which stars from digits X here. Where Input_file2 is file having only ids and Input_file1 is master file as per OP's question.
awk '
{
sub(/ +$/,"")
}
FNR==NR{
a[$0]
next
}
/^[0-9]+X/{
match($0,/[0-9]+\/$/)
no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
found=count=""
}
{
if(count==no_of_lines_to_print){ count=found="" }
for(i in a){
if(match($0,i)){
found=1
print
next
}
}
}
found{
++count
}
count<=no_of_lines_to_print && count!=""
' Input_file2 Input_file1

Bash code for Selecting few columns from a variable

In a file I have a list of coordinates stored (see figure, to the left).
From there I want to copy the coordinates only (red marked) and put them in another file.
I copy the correct section from the file using COORD=`grep -B${i} '&END COORD' ${cpki_file}. Then I tried to use awk to extract the required numbers from the COORD variable . It does output all the numbers in the file but deletes the spaces between values (figure, to the right).
How to write the red marked section as they are?
N=200
NEndCoord=`grep -B${N} '&END COORD' ${cpki_file}|wc -l`
NCoord=`grep -B${N} '&END COORD' ${cpki_file}| grep -B200 '&COORD' |wc -l`
let i=$NEndCoord-$NCoord
COORD=`grep -B${i} '&END COORD' ${cpki_file}`
echo "$COORD" | awk '{ print $2 $3 $4 }'
echo "$COORD" | awk '{ print $2 $3 $4 }'>tmp.txt
When you start using combinations of grep, sed, awk, cut and alike, you should realize you can do it all in a single awk command. In case of the OP, this would do exactly the same:
awk '/[&]END COORD/{p=0}
p { print $2,$3,$4 }
/[&]COORD/{p=1}' file
This parses the file keeping track of a printing flag p. The flag is set if "&COORD" is found and unset if "&END COORD" is found. Printing is done, only when the flag p is set. Since we don't want to print the line with "&END COORD", we have to reset the flag before we do the check for the printing. The same holds for the line with "&COORD", but there we have to reset it after we do the check for the printing (its a bit a weird reversed logic).
The problem with the above is that it will also process the lines
UNIT angstrom
If you want to have these removed, you might want to do a check on the total columns:
awk '/[&]END COORD/{p=0}
p && (NF==4){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
Of only print the lines which do not contain "UNIT" or are empty:
awk '/[&]END COORD/{p=0}
p && (NF>0) && ($1 != "UNIT"){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
sed one-liner:
sed -n '/^&COORD$/,/^UNIT/{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}' <infile.txt >outfile.txt
Explanation:
Invocation:
sed: stream editor
-n: do not print unless eplicit
Commands in sed:
/^&COORD$/,/^UNIT/: Selects groups of lines after &COORDS and before UNIT.
{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}: Process each selected lines.
s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\): Regex capture space delimited groups except the first.
/\1\t\2\t\3/: Replace with tab delimited values of the captured groups.
p: Explicit printout.

awk to parse field if specific value is in another

In the awk below I am trying to parse $2 using the _ only if $3 is a specific valus (ID). I am reading that parsed value into an array and going to use it as a key in a lookup. The awk does execute but the entire line 2 or line with ID in $3 prints not just the desired. The print statement is only to see what results (for testing only) and will not be part of the script. Thank you :).
awk
awk -F'\t' '$3=="ID"
f="$(echo $2|cut -d_ -f1,1)"
{
print $f
}' file
file tab-delimited
R_Index locus type
17 chr20:31022959 NON
18 chr11:118353210-chr9:20354877_KMT2A-MLLT3.K8M9 ID
desired
$f = chr11:118353210-chr9:20354877
Not completely clear, could you please try following.
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){print array[1]}}' Input_file
Or if you want top change 2nd field's value along with printing all lines then try following once.
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){$2=array[1]}} 1' Input_file

Use AWK to search through fasta file, given a second file containing sequence names

I have a 2 files. One is a fasta file contain multiple fasta sequences, while another file includes the names of candidate sequences I want to search (file Example below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file, and obtain all the fasta sequences for given candidates whose names were saved in another file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of first candidate in name.txt, like this
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help to fix one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 a.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt a.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string which is the name is a key of `n`. getline retrieves the next line
# If it succeeds the condition becomes true and awk will print that line
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC

How to print fields for repeated key column in one line

I'd like to transform a table in such a way that for duplicated
values in column #2 it would have corresponding values from column #1.
I.e. something like that...
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
to
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
AC149475.2_FG003 MZ00013456
As I need it to computations in R I tried to use:
x=aggregate(mz_grmz,by=list(mz_grmz[,2]),FUN=paste(mz_grmz[,1],sep="|"))
but it don't work (wrong function)
Error in match.fun(FUN) :
'paste(mz_grmz[, 1], sep = "|")' is not a function, character or symbol
I also remind myself about unstack() function, but it isn't what I need.
I tried to do it using awk, based on my base knowledge I reworked code given here:
site1
#! /bin/sh
for y do
awk -v FS="\t" '{
for (x=1;x<=NR;x++) {
if (NR>2 && x=x+1) {
print $2"\t"x
}
else {print NR}
}
}' $y > $y.2
done
unfortunately it doesn't work, it's only produce enormous file with field #2 and some numbers.
I suppose it is easy task, but it is above my skills right now.
Could somebody give me a hint? Maybe just function to use in aggregate in R.
Thanks
You could do it in awk like this:
awk '
{
if ($2 in a)
a[$2] = a[$2] "|" $1
else
a[$2] = $1
}
END {
for (i in a)
print i, a[i]
}' INFILE > OUTFILE
to keep the output as same as the text in your question (empty lines etc..):
awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}\
END{for(x in a){print x,a[x];print ""}}' inputFile
test:
kent$ echo "MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003"|awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}END{for(x in a){print x,a[x];print ""}}'
AC149475.2_FG003 MZ00013456
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
This GNU sed solution might work for you:
sed -r '1{h;d};H;${x;s/(\S+)\s+(\S+)/\2\t\1/g;:a;s/(\S+\t)([^\n]*)(\n+)\1([^\n]*)\n*/\1\2|\4\3/;ta;p};d' input_file
Explanation: Use the extended regex option-r to make regex's more readable. Read the whole file into the hold space (HS). Then on end-of-file, switch to the HS and firstly swap and tab separate fields. Then compare the first fields in adjacent lines and if they match, tag the second field from the second record to the first line separated by a |. Repeated until no further adjacent lines have duplicate first fields then print the file out.