How do I match a pattern and then copy multiple lines? - awk

I have two files that I am working with. The first file is a master database file that I am having to search through. The second file is a file that I can make that allows me to name the items from the master database that I would like to pull out. I have managed to make an AWK solution that will search the master database and extract the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 is the first line that I am matching up to. The last number in that row is what tells me how many lines there are to copy. So in the first line, 40005X/50005/60005/3/10/9/, I would be matching off of the 40005X, and the 9 in that line tells me that there are 9 lines underneath that I need to copy with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, however I am open to any suggestion. However, I am working on this on a UNIX system. I would like to keep it as a KSH script, since most of the other scripts that I use with this are written in that format, and I am most familiar with it.
Thank you for your help!!

Your existing awk matches correctly the rows from the ids' file, you now need to add a condition to print N lines ahead after reading the last field of the matching row. So we will set a variable p to the number of lines to print plus one (the current one), and decrease per row printing.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with last condition more "awkish" (by Ed Morton) and covering any possible extreme case of a huge file
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
here the print condition is omitted, as it is the default action, and the condition is true again as long as decreasing p is positive.

another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
this takes care if any of the content lines match the ids given. This will only look for another id after the specified number of lines printed.

Could you please try following, written and tested with shown samples in GNU awk. Considering that you want to print lines from line which stars from digits X here. Where Input_file2 is file having only ids and Input_file1 is master file as per OP's question.
awk '
{
sub(/ +$/,"")
}
FNR==NR{
a[$0]
next
}
/^[0-9]+X/{
match($0,/[0-9]+\/$/)
no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
found=count=""
}
{
if(count==no_of_lines_to_print){ count=found="" }
for(i in a){
if(match($0,i)){
found=1
print
next
}
}
}
found{
++count
}
count<=no_of_lines_to_print && count!=""
' Input_file2 Input_file1

Related

use awk to split one file into several small files by pattern

I have read this post about using awk to split one file into several files:
and I am interested in one of the solutions provided by Pramod and jaypal singh:
awk '/^>chr/ {OUT=substr($0,2) ".fa"}; {print >> OUT; close(OUT)}' Input_File
Because I still can not add any comment so I ask in here.
If the input is
>chr22
asdgasge
asegaseg
>chr1
aweharhaerh
agse
>chr14
gasegaseg
How come it will result in three files:
chr22.fasta
chr1.fasta
chr14.fasta
As an example, in chr22.fasta:
>chr22
asdgasge
asegaseg
I understand the first part
/^>chr/ {OUT=substr($0,2) ".fa"};
and these commands:
/^>chr/ substr() close() >>
But I don't understand that how awk split the input by the second part:
{print >> OUT; close(OUT)}
Could anyone explain more details about this command? Thanks a lot!
Could you please go through following and let me know if this helps you.
awk ' ##Starting awk program here.
/^>chr/{ ##Checking condition if a line starts from string chr then do following.
OUT=substr($0,2) ".fa" ##Create variable OUT whose value is substring of current line and starts from letter 2nd to till end. concatenating .fa to it too.
}
{
print >> OUT ##Printing current line(s) in file name whose value is variable OUT.
close(OUT) ##using close to close output file whose value if variable OUT value. Basically this is to avoid "TOO MANY FILES OPENED ERROR" error.
}' Input_File ##Mentioning Input_file name here.
You could take reference from man awk page for used functions of awk too as follows.
substr(s, i [, n]) Returns the at most n-character substring of s starting at i. If n is omitted, the rest of s is used.
The part you are asking questions about is a bit uncomfortable:
{ print $0 >> OUT; close(OUT) }
With this part, the awk program does the following for every line it processes:
Open the file OUT
Move the file pointer the the end of the file OUT
append the line $0 followed by ORS to the file OUT
close the file OUT
Why is this uncomfortable? Mainly because of the structure of your files. You should only close the file when you finished writing to it and not every time you write to it. Currently, if you have a fasta record of 100 lines, it will open, close the file 100 times.
A better approach would be:
awk '/^>chr/{close(OUT); OUT=substr($0,2)".fasta" }
{print > OUT }
END {close(OUT)}'
Here we only open the file the first time we write to it and we close it when we don't need it anymore.
note: the END statement is not really needed.

If two columns match change the third

File 1.txt:
13002:1:3:6aw:4:g:Dw:S:5342:dsan
13003:5:3s:6s:4:g:D:S:3456:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
File 2.txt:
13002:6544
13003:5684
I need to replace the old data in column 9 of 1.txt with new data from column 2 of 2.txt if it exists. I think this can be done line by line as both files have the same column 1 field. This is a 3Gb file size. I have been playing about with awk but can't achieve the following.
I was trying the following:
awk 'NR==FNR{a[$1]=$2;} {$9a[b[2]]}' 1.txt 2.txt
Expected result:
13002:1:3:6aw:4:g:Dw:S:6544:dsan
13003:5:3s:6s:4:g:D:S:5684:fdsa
13004:16:t3:6:4hh:g:D:S:5342:inef
You seem to have a couple of odd typos in your attempt. You want to replace $9 with the value from the array if it is defined. Also, you want to make sure Awk uses colon as separator both on input and output.
awk -F : 'BEGIN { OFS=FS }
NR==FNR{a[$1]=$2; next}
$1 in a {$9 = a[$1] } 1' 2.txt 1.txt
Notice how 2.txt is first, so that NR==FNR is true when you are reading this file, but not when you start reading 1.txt. The next in the first block prevents Awk from executing the second condition while you are reading the first file. And the final 1 is a shorthand for an unconditional print which of course will be executed for every line in the second file, regardless of whether you replaced anything.

awk command to print only part of matching lines

awk command to compare lines in file and print only first line if there are some new words in other lines.
For example: file.txt is having
i am going
i am going today
i am going with my friend
output should be
I am going
this will work for the sample input but perhaps will fail for the actual one, unless you have a representative input we wouldn't know...
$ awk 'NR>1 && $0~p {if(!f) print p; f=1; next} {p=$0; f=0}' file
i am going
you may want play with p=$0 to restrict matching number of fields if the line lengths are not in increasing order...

awk to store and reset variable from file

Trying to use awk to lookup the string in file1 (which is always just 1 field) in the same line of file2. That is if row 1 is being used in file1 then only row 1 is used in file2. Since it is possible for the value to be missing this is a check done to ensure it is there. This is just an idea so there probably is a better way, but I just wanted to see. Thank you :).
file1
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
desired output for $filename
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
After a bunch of processes are run using $filename, I need to reset that variable with a new one.
file1 (after rerunning some process)
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2 (after rerunning some process)
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
The oldest folder is R_2017_01_13_14_46_04_user_S5-00580-25-Medexome, created on 2017-01-17+06:53:07.3194950000 and analysis done using v1.4 by cmccabe at 01/18/17 06:59:08 AM
desired output for $filename now is (since this is value is new)
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
awk
filename=$(awk 'NR==1{print $1}' file1 file2)
You want to check if the last line of file2 contains a string given in file1.
For this, you just have to read that last line and then see if it matches with any of the words in file1.
$ awk 'ENDFILE {line=$0} FNR<NR && line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
This uses:
ENDFILE {line=$0}
after reading a file, $0 contains the last line that was read (well, not always, but I assume you have a modern version of awk, since ENDFILE is a GNU awk extension). With this, we store this last line into line, so that we can use it when reading the next file.
FNR<NR && line ~ $1
while reading the file1, check if the given word is present in the stored line. If so, print is automatically triggered.
This uses the FNR<NR trick, where FNR holds the number of the line in the current file, while NR the number of line in general. This way, FNR==NR is only true while reading the first file and FNR<NR from the second on.
If you only need to check the last line of file2 continuously, you could:
$ awk 'NR==FNR{a[$1];next}{for(i in a)if(i ~ $0) print i}' file1 <(tail -f file2)
Explained:
NR==FNR{a[$1];next} read into a array the search terms from file1
file2 is tail -f'd into awk using process substitution, ie. it reads a record from the end of file2, goes thru all search words in a and looks them from the record, printing search word if there is a match

The meaning of "a" in an awk command?

I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using command line, so I'm just trying to figure things out, thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found. This makes storing $0 in a relatively expensive because it is storing a copy of the whole file (possibly twice if there's only one word on each line of the file). It is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
Here a is not a command but an awk array it can very well be arr also:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array, in your code
FNR==NR{ a[$1]=$0;next }
Creates an array called "a" with indexes taken from the first column of the first input file.
All element values are set to the current record.
The next statement forces awk to immediately stop processing the current record and go on to the next record.