Extract text between patterns in new files - awk

I'm trying to analyze a file with the following structure:
AAAAA
123
456
789
AAAAA
555
777
999
777
The idea is to detect the 'AAAAA' pattern and extract the two following lines. After this is done, I would like to append the next 'AAAAA' pattern and the following two lines, so the final file will look something like this:
AAAAA
123
456
AAAAA
555
777
Taking into account that the last one will not end with the 'AAAAA' pattern.
Any idea how this can be done? I've used sed, but I don't know how to select the number of lines to keep after the pattern...
For example with AWK:
awk '/'$AAAAA'/,/'$AAAAA'/' INPUTFILE.txt
But this will only extract all the text between the two AAAAA
Thanks

With sed
sed -n '/AAAAA/{N;N;p}' file.txt

with smart counters
$ awk '/AAAAA/{n=3} n&&n--' file
AAAAA
123
456
AAAAA
555
777
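A minimal sketch of how the counter idiom above works, assuming the sample input from the question. The variable name `after` and the temporary file are illustrative and not part of the original answer.

```shell
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' AAAAA 123 456 789 AAAAA 555 777 999 777 > "$tmp"

# On every match, n is reset to (after + 1).
# "n && n--" is true while n > 0 and decrements n each time it is tested,
# so the matching line plus the next `after` lines are printed.
result=$(awk -v after=2 '/AAAAA/ { n = after + 1 } n && n--' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Changing `after` changes how many lines are kept after each match, which is the part the question asked about.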

The grep command has a flag that prints lines after each match. For example:
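grep -A 2 AAAAA <file>
Unless I misunderstood, this should match your requirements, and is much simpler than awk scripts. Note that GNU grep prints a -- separator line between non-adjacent groups of output; --no-group-separator suppresses it.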

You may try this awk:
awk '$1 == "AAAAA" {n = NR+2} NR <= n' file
AAAAA
123
456
AAAAA
555
777

just cheat
mawk/mawk2/gawk 'BEGIN { FS = OFS = "AAAAA\n"; RS = "^$";
} END { for(x=2; x<= NF; x++) { print $(x) } }'
no one says the fields must be split by spaces, and rows must be new-lines one-by-one. By design of FS, every field after $1 will contain the matches you need, and fitting multiple "rows" of each within $2 etc.
In this example, in $2 you will find 12 bytes, like this :
1 2 3 \n 4 5 6 \n 7 8 9 \n # spaced out for readability

Related

Add text from inside of a file above a line match in another file

I have a file File1.txt that contains the following:
abc
def
ghi
I have another file File2.txt that contains the following:
123
456
789
I would like to add the contents of file 1 above 789 to give:
123
456
abc
def
ghi
789
Ultimately using sed or awk.
I tried a workaround where I used:
file1="$(cat File1.txt)"
sed -i "/^789/i $file1 " File2.txt
However this does not work as intended. Any help would be much appreciated, thank you for your expertise.
Execute cat with GNU sed:
sed '/^789$/e cat file1' file2
Output:
123
456
abc
def
ghi
789
With any awk in any shell on every UNIX box for any value of 789:
$ awk 'NR==FNR{new=new $0 ORS; next} $0=="789"{printf "%s", new} 1' file1 file2
123
456
abc
def
ghi
789
The only caveat is that file1 can't be so massive that it doesn't fit in memory.
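Since the question used sed -i, here is a sketch of the same awk approach that writes the result back to the second file. It assumes a temporary file followed by mv is acceptable as a stand-in for in-place editing; the directory and the File2.new name are illustrative.

```shell
# Recreate the two sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' abc def ghi > "$dir/File1.txt"
printf '%s\n' 123 456 789 > "$dir/File2.txt"

# While reading the first file (NR == FNR), buffer every line.
# While reading the second file, emit the buffer just before the line
# that is exactly "789", then print every line unchanged.
awk '
  NR == FNR { buf = buf $0 ORS; next }
  $0 == "789" { printf "%s", buf }
  { print }
' "$dir/File1.txt" "$dir/File2.txt" > "$dir/File2.new" &&
  mv "$dir/File2.new" "$dir/File2.txt"

result=$(cat "$dir/File2.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```

The mv only runs if awk succeeds, so a failed run leaves File2.txt untouched.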

Query the contents of a file using another file in AWK

I am trying to conditionally filter a file based on values in a second file. File1 contains numbers and File2 contains two columns of numbers. The goal is to keep those rows in file1 whose value falls within the range given by any row of file2.
I have a series of loops which works, but takes >12hrs to run depending on the lengths of both files. This code is noted below. Alternatively, I have tried to use awk, and looked at other questions posted on Stack Overflow, but I cannot figure out how to change the code appropriately.
Loop method:
while IFS= read READ
do
position=$(echo $READ | awk '{print $4}')
while IFS= read BED
do
St=$(echo $BED | awk '{print $2}')
En=$(echo $BED | awk '{print $3}')
if (($position < "$St"))
then
break
else
if (($position >= "$St" && $position <= "$En"));
then
echo "$READ" | awk '{print $0"\t EXON"}' >> outputfile
fi
fi
done < file2
done < file1
Blogs with similar questions:
awk: filter a file with another file
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
Find content of one file from another file in UNIX
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
file1: (tab delimited)
AAA BBB 1500
CCC DDD 2500
EEE FFF 2000
file2: (tab delimited)
GGG 1250 1750
HHH 1950 2300
III 2600 2700
Expected output would retain rows 1 and 3 from file1 (in a new file, file3) because these records fall within the ranges of row 1 columns 2 and 3, and row 2 columns 2 and columns 3 of file2. In the actual files, they're not row restricted i.e. I am not wanting to look at row1 of file1 and compare to row1 of file2, but compare row1 to all rows in file2 to get the hit.
file3 (output)
AAA BBB 1500
EEE FFF 2000
One way:
awk 'NR==FNR{a[i]=$2;b[i++]=$3;next}{for(j=0;j<i;j++){if ($3>=a[j] && $3<=b[j]){print;}}}' i=0 file2 file1
AAA BBB 1500
EEE FFF 2000
Read the file2 contents and store it in arrays a and b. When file1 is read, check for the number to be between the entire a and b arrays and print.
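A sketch of a variant of the array approach above, assuming tab-delimited files and the value of interest in column 3 of file1, as in the sample data. It stops at the first matching range, so a row is printed only once even if several ranges overlap. The array names `lo` and `hi` and the scratch directory are illustrative.

```shell
# Recreate the sample files from the question (tab delimited).
dir=$(mktemp -d)
printf 'AAA\tBBB\t1500\nCCC\tDDD\t2500\nEEE\tFFF\t2000\n' > "$dir/file1"
printf 'GGG\t1250\t1750\nHHH\t1950\t2300\nIII\t2600\t2700\n' > "$dir/file2"

# First file on the command line (file2): remember every range.
# Second file (file1): print the row as soon as one range contains column 3.
awk -F'\t' '
  NR == FNR { n++; lo[n] = $2; hi[n] = $3; next }
  {
    for (j = 1; j <= n; j++)
      if ($3 >= lo[j] && $3 <= hi[j]) { print; break }
  }
' "$dir/file2" "$dir/file1" > "$dir/file3"

result=$(cat "$dir/file3")
printf '%s\n' "$result"
rm -rf "$dir"
```

This is still a linear scan over all ranges for every row, so it only removes the process-spawning overhead of the shell loop; it is not an interval index.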
One more option:
$ awk 'NR==FNR{for(i=$2;i<=$3;i++)a[i];next}($3 in a)' file2 file1
AAA BBB 1500
EEE FFF 2000
File2 is read and the entire range of numbers is expanded and stored in the associative array a. When we read file1, we just need to look up the value in array a.
Another awk. It may or may not make sense depending on the filesizes:
$ awk '
NR==FNR {
a[$3]=$2 # hash file2 records, $3 is key, $2 value
next
}
{
for(i in a) # for each record in file1 go through every element in a
if($3<=i+0 && $3>=a[i]) { # if it falls between (i+0: array keys are strings, so force a numeric comparison)
print # output
break # exit loop once match found
}
}' file2 file1
Output:
AAA BBB 1500
EEE FFF 2000

Sed replace nth column of multiple tsv files without header

Here are multiple tsv files, where I want to add 'XX' characters only in the second column (everywhere except in the header) and save it to this same file.
Input:
$ls
file1.tsv file2.tsv file3.tsv
$head -n 4 file1.tsv
a b c
James England 25
Brian France 41
Maria France 18
Output wanted:
a b c
James X1_England 25
Brian X1_France 41
Maria X1_France 18
I tried this, but the result is not kept in the file, and a simple redirection won't work:
# this works, but doesn't save the changes
i=1
for f in *tsv
do awk '{if (NR!=1) print $2}' $f | sed "s|^|X${i}_|"
i=$((i+1))
done
# adding '-i' option to sed: this throws an error but would be perfect (sed no input files error)
i=1
for f in *tsv
do awk '{if (NR!=1) print $2}' $f | sed -i "s|^|T${i}_|"
i=$((i+1))
done
Some help would be appreciated.
The second column is particularly easy because you simply replace the first occurrence of the separator.
for file in *.tsv; do
sed -i '2,$s/\t/\tX1_/' "$file"
done
If your sed doesn't recognize the symbol \t, use a literal tab (in many shells, you type it with Ctrl-V Tab). On *BSD (and hence macOS) you need -i ''
AWK solution (requires GNU awk 4.1 or later for -i inplace):
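A sketch for the case where each file should get its own numbered prefix (X1_, X2_, ...), as the loop in the question suggests. It assumes any POSIX awk plus a temporary file instead of an in-place flag; the `prefix` variable and the sample files are illustrative.

```shell
# Two small sample files in a scratch directory.
dir=$(mktemp -d)
printf 'a\tb\tc\nJames\tEngland\t25\n' > "$dir/file1.tsv"
printf 'a\tb\tc\nBrian\tFrance\t41\n' > "$dir/file2.tsv"

i=1
for f in "$dir"/*.tsv; do
  # Skip the header (NR > 1), prepend the prefix to column 2,
  # then replace the original file only if awk succeeded.
  awk -v prefix="X${i}_" '
    BEGIN { FS = OFS = "\t" }
    NR > 1 { $2 = prefix $2 }
    { print }
  ' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  i=$((i + 1))
done

result=$(cat "$dir/file1.tsv" "$dir/file2.tsv")
printf '%s\n' "$result"
rm -rf "$dir"
```

The glob is expanded before the loop starts, so the .tmp files created along the way are not picked up as input.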
awk -i inplace 'BEGIN { FS=OFS="\t" } NR!=1 { $2 = "X1_" $2 } 1' file1.tsv
Input:
a b c
James England 25
Brian France 41
Maria France 18
Output:
a b c
James X1_England 25
Brian X1_France 41
Maria X1_France 18

Counting occurrences between pattern matches in data file and generating a report

I have a file structured like this:
MATCH A and B
001
005
101
MATCH A and C
020
400
MATCH B and C
001
156
807
920
I want to generate a report that looks like:
A and B: 3
A and C: 2
B and C: 4
I imagine the tools to use are sed/awk. I know that sed can print lines between pattern matches, but the following ends up printing out the whole file.
sed -n '/^MATCH/,/^MATCH/p' file.txt | wc -l
This returns the number of lines in the whole file. Any tips on where to look at next? It seems that this isn't the most common task and I haven't been able to find many other suggestions.
This awk should do:
awk -v RS= '{print $2,$3,$4":",NF-4}' file
A and B: 3
A and C: 2
B and C: 4
Since records are separated by one blank line and RS is set to the empty string (paragraph mode),
we just have to count the fields (NF) and subtract the four words of the first line.
This may be better:
awk -v RS= -F"\n" '{print $1":",NF-1}' file
MATCH A and B: 3
MATCH A and C: 2
MATCH B and C: 4
Or remove the MATCH word:
awk -v RS= -F"\n" '{sub("MATCH ","",$1);print $1":",NF-1}' file
A and B: 3
A and C: 2
B and C: 4
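The answers above rely on a blank line between blocks. A sketch that does not, assuming every block starts with a line beginning with "MATCH " and every other non-empty line is an entry to count. The helper name `report` and the variables `label` and `count` are illustrative.

```shell
# Sample input from the question, without blank lines between blocks.
tmp=$(mktemp)
printf '%s\n' 'MATCH A and B' 001 005 101 \
              'MATCH A and C' 020 400 \
              'MATCH B and C' 001 156 807 920 > "$tmp"

result=$(awk '
  # Print the previous block, if there was one.
  function report() { if (label != "") print label ": " count }

  # A new header closes the previous block and starts a fresh count.
  /^MATCH / { report(); label = substr($0, 7); count = 0; next }

  # Count every non-empty line that is not a header.
  NF { count++ }

  # The last block has no following header, so flush it here.
  END { report() }
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because empty lines are skipped by the NF test, the same script should also cope with input that does have blank lines between blocks.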

Using awk for a table lookup

I would like to use awk to lookup a value from a text file. The text file has a very simple format:
text \t value
text \t value
text \t value
...
I want to pass the actual text for which the value should be looked up via a shell variable, e.g., $1.
Any ideas how I can do this with awk?
Your help is greatly appreciated.
All the best,
Alberto
You can do this in a pure AWK script without a shell wrapper:
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; ARGV[1]="" }
$1 == key { print $2 }
Call it like this:
./lookup.awk keyval lookupfile
Example:
$ cat lookupfile
aaa 111
bbb 222
ccc 333
ddd 444
zzz 999
mmm 888
$ ./lookup.awk ddd lookupfile
444
$ ./lookup.awk zzz lookupfile
999
This could even be extended to select the desired field using an argument.
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; field = ARGV[2]; ARGV[1]=ARGV[2]="" }
$1 == key { print $field }
Example:
$ cat lookupfile2
aaa 111 abc
bbb 222 def
ccc 333 ghi
ddd 444 jkl
zzz 999 mno
mmm 888 pqr
$ ./lookupf.awk mmm 1 lookupfile2
mmm
$ ./lookupf.awk mmm 2 lookupfile2
888
$ ./lookupf.awk mmm 3 lookupfile2
pqr
Something like this would do the job:
#!/bin/sh
awk -v LOOKUPVAL="$1" '$1 == LOOKUPVAL { print $2 }' < inputFile
Essentially you set the lookup value passed into the shell script in $1 to an awk variable, then you can access that within awk itself. To clarify, the first $1 is the shell script argument passed in on the command line, the second $1 (and subsequent $2) are fields 1 and 2 of the input file.
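A sketch of the same idea wrapped in a shell function, assuming the file is strictly tab-separated as described in the question, so that keys may contain spaces. The function name `lookup` and the sample table are illustrative. One caveat: awk interprets backslash escapes in values passed with -v, so keys containing backslashes would need different handling.

```shell
# lookup KEY FILE: print the value stored next to KEY in a tab-separated FILE.
lookup() {
  # -F'\t' splits on tabs only; exit stops after the first hit.
  awk -F'\t' -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Small sample table; the second key contains a space.
tmp=$(mktemp)
printf 'aaa\t111\nnew york\t222\nzzz\t999\n' > "$tmp"

result=$(lookup 'new york' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Quoting "$1" in the function call matters here: without the quotes, a key with spaces would be split into several arguments before awk ever sees it.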
TEXT=`grep value file | cut -f1`
I think grep might actually be a better fit:
$ echo "key value
ambiguous correct
wrong ambiguous" | grep '^ambiguous ' | awk ' { print $2 } '
The ^ on the pattern is to match to the start of the line and ensure that you don't match a line where the value, rather than the key, was the desired text.