Breakdown of one line of code involving awk - awk

I am currently working on a project which involves comparing data from two different files. I had been looking for a command which would compare one line of file1 with each line of file2, and print out a '1' if there is a match and a '0' if not matched. Then, it would repeat this command for the second line of file1 for each line of file1.
I found this bit of code online which seems to work for my project, but I was hoping someone would help to break it down for me and provide more of an explanation.
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
Additionally, I'm new to this so any resources which may guide me to my answer would be really helpful. Thank you.

Here's what this awk is saying:
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
IF the record number of this specific file being processed FNR is the same as the overall record number being processed NR execute {a[$1]; next} block. We can safely assume that if this condition is true that we are processing the first file.
{a[$1]; next} Add the first column as a key in the array named a and then go to the next line without processing anymore of the awk script. Once the first file is fully processed here we will have an array with a key for every distinct value found in the first column of the first file.
{print $1, ($1 in a) ? "1":"0"} Since we must now be on the second file we are printing every line/record we encounter. Here we print the first column, then if that column's value is in the array as a key, then we print 1 otherwise we print 0.
In short this is printing every first column of the second file and stating if that column also exists in the first file printing a 1 or 0.

Repeated here for locality of reference, and in case the question gets edited:
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
You should really read a basic awk primer first. Basically the clause FNR==NR is a common idiom to check if we're reading the first file. NR is the overall record number (the line number), while FNR is the record number in the current file, so you're still processing the first file when these are equal. The action then stores the first column (not the entire line) into an array. So the first thing this program does is read the first column of the first file into the array named a. Then it starts reading the second file, and prints the first column of each line, followed by "1" or "0" depending on if the value in the first column is in the array.

Related

awk choose a line with $1 present in a file and output with a changed field

I've tried to use Awk to do the following:
I have a large txt file with first column the name of a gene and different values, essentially numeric, in each column.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great but now I need to keep all rows (in the same order) but modify these genes to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if it is different (the gene is not in my list), I want to keep the whole line, otherwise I want to keep the whole line but modify the value in one column by a given number.
If you could include something so that the value remains an integer that would be great. I'll also have to replace by 0 if the value becomes negative. But this I know how to do , at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify: (I put comas as field separator here, but there should be tab to separate the fields)
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output: (I remove the 10th of 4th column when gene in first column matches a gene in my separate txt file, otherwise I keep the full line unchanged)
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rearticulated as just something. So just 1 (which is always true) is a common idiom for "print all lines (if you reach this point in the script before hitting next).
Rounding a floating-point number can be done with sprintf("%1.0f", n) which correctly rounds up if the fraction is bigger than 0.5 (int(n) would always round down).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf

Get values from the next row and merge- awk

I have a pipe delimited file like this
OLD|123432
NEW|232322
OLD|1234452
NEW|232324
OLD|656966
NEW|232325
I am trying to create a new file where I am trying to merge rows based on the value in the first column (OLD/NEW). First column in the output file will have the new number and the second column will have the old number.
Output
232322|123432
232324|1234452
232325|656966
I looked at the answer here How to merge every two lines into one from the command line?. I know it is not the exact solution but used as a starting point.
and tried to make it work to solve this but throws syntax error.
awk -F "|" 'NR%2{OFS = "|" printf "%s ",$0;next;}1'
You may use this awk:
awk 'BEGIN {FS=OFS="|"} $1 == "NEW" {print $2, old} $1 == "OLD" {old = $2}' file
232322|123432
232324|1234452
232325|656966
Using $0 will have the value of the whole line. If the field separator is a pipe, the second column is $2 that has the number.
If you want to use the remainder with NR%2, another option could be storing the value of the second column in a variable, for example v
awk 'BEGIN{FS=OFS="|"} NR%2{v=$2;next;}{print $2,v}' file
Output
232322|123432
232324|1234452
232325|656966

How do I match a pattern and then copy multiple lines?

I have two files that I am working with. The first file is a master database file that I am having to search through. The second file is a file that I can make that allows me to name the items from the master database that I would like to pull out. I have managed to make an AWK solution that will search the master database and extract the exact line that matches the second file. However, I cannot figure out how to copy the lines after the match to my new file.
The master database looks something like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40006X/50006/60006/3/10/3/
10047A/20047/30047/0/5/23./XXXX/
10048A/20048/30048/0/5/23./XXXX/
10049A/20049/30049/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
40009X/50009/60009/3/10/3/
10054A/20054/30054/0/5/23./XXXX/
10055A/20055/30055/0/5/23./XXXX/
10056A/20056/30056/0/5/23./XXXX/
40010X/50010/60010/3/10/3/
10057A/20057/30057/0/5/23./XXXX/
10058A/20058/30058/0/5/23./XXXX/
10059A/20059/30059/0/5/23./XXXX/
In my example, the lines that start with 4000 is the first line that I am matching up to. The last number in that row is what tells me how many lines there are to copy. So in the first line, 40005X/50005/60005/3/10/9/, I would be matching off of the 40005X, and the 9 in that line tells me that there are 9 lines underneath that I need to copy with it.
The second file is very simple and looks something like this:
40005X
40007X
40008X
As the script finds each match, I would like to move the information from the first file to a new file for analysis. The end result would look like this:
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
The code that I currently have that will match the first line is this:
#! /bin/ksh
file1=input_file
file2=input_masterdb
file3=output_test
awk -F'/' 'NR==FNR {id[$1]; next} $1 in id' $file1 $file2 > $file3
I have had the most success with AWK, however I am open to any suggestion. However, I am working on this on a UNIX system. I would like to keep it as a KSH script, since most of the other scripts that I use with this are written in that format, and I am most familiar with it.
Thank you for your help!!
Your existing awk matches correctly the rows from the ids' file, you now need to add a condition to print N lines ahead after reading the last field of the matching row. So we will set a variable p to the number of lines to print plus one (the current one), and decrease per row printing.
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p-->0{print}' file1 file2
or the same with last condition more "awkish" (by Ed Morton) and covering any possible extreme case of a huge file
awk -F'/' 'NR==FNR{id[$0]; next} $1 in id{p=$6+1} p&&p--' file1 file2
here the print condition is omitted, as it is the default action, and the condition is true again as long as decreasing p is positive.
another one
$ awk -F/ 'NR==FNR {a[$1]; next}
!n && $1 in a {n=$(NF-1)+1}
n&&n--' file2 file1
40005X/50005/60005/3/10/9/
10038A/20038/30038/0/5/23./XXXX/
10039A/20039/30039/0/5/23./XXXX/
10040A/20040/30040/0/5/23./XXXX/
10041A/20041/30041/0/5/23./XXXX/
10042A/20042/30042/0/5/23./XXXX/
10043A/20043/30043/0/5/23./XXXX/
10044A/20044/30044/0/5/23./XXXX/
10045A/20045/30045/0/5/23./XXXX/
10046A/20046/30046/0/5/23./XXXX/
40007X/50007/60007/3/10/3/
10050A/20050/30050/0/5/23./XXXX/
10051A/20051/30051/0/5/23./XXXX/
10052A/20052/30052/0/5/23./XXXX/
40008X/50008/60008/3/10/1/
10053A/20053/30053/0/5/23./XXXX/
this takes care if any of the content lines match the ids given. This will only look for another id after the specified number of lines printed.
Could you please try following, written and tested with shown samples in GNU awk. Considering that you want to print lines from line which stars from digits X here. Where Input_file2 is file having only ids and Input_file1 is master file as per OP's question.
awk '
{
sub(/ +$/,"")
}
FNR==NR{
a[$0]
next
}
/^[0-9]+X/{
match($0,/[0-9]+\/$/)
no_of_lines_to_print=substr($0,RSTART,RLENGTH-1)
found=count=""
}
{
if(count==no_of_lines_to_print){ count=found="" }
for(i in a){
if(match($0,i)){
found=1
print
next
}
}
}
found{
++count
}
count<=no_of_lines_to_print && count!=""
' Input_file2 Input_file1

awk to store and reset variable from file

Trying to use awk to lookup the string in file1 (which is always just 1 field) in the same line of file2. That is if row 1 is being used in file1 then only row 1 is used in file2. Since it is possible for the value to be missing this is a check done to ensure it is there. This is just an idea so there probably is a better way, but I just wanted to see. Thank you :).
file1
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
desired output for $filename
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
After a bunch of processes are run using $filename, I need to reset that variable with a new one.
file1 (after rerunning some process)
R_2017_01_13_12_11_56_user_S5-00580-24-Medexome
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
file2 (after rerunning some process)
The oldest folder is R_2017_01_13_12_11_56_user_S5-00580-24-Medexome, created on 2017-01-17+11:31:02.5035483130 and analysis done using v1.4 by cmccabe at 01/17/17 12:41:03 PM
The oldest folder is R_2017_01_13_14_46_04_user_S5-00580-25-Medexome, created on 2017-01-17+06:53:07.3194950000 and analysis done using v1.4 by cmccabe at 01/18/17 06:59:08 AM
desired output for $filename now is (since this is value is new)
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
awk
filename=$(awk 'NR==1{print $1}' file1 file2)
You want to check if the last line of file2 contains a string given in file1.
For this, you just have to read that last line and then see if it matches with any of the words in file1.
$ awk 'ENDFILE {line=$0} FNR<NR && line ~ $1' file2 file1
R_2017_01_13_14_46_04_user_S5-00580-25-Medexome
This uses:
ENDFILE {line=$0}
after reading a file, $0 contains the last line that was read (well, not always, but I assume you have a modern version of awk, since ENDFILE is a GNU awk extension). With this, we store this last line into line, so that we can use it when reading the next file.
FNR<NR && line ~ $1
while reading the file1, check if the given word is present in the stored line. If so, print is automatically triggered.
This uses the FNR<NR trick, where FNR holds the number of the line in the current file, while NR the number of line in general. This way, FNR==NR is only true while reading the first file and FNR<NR from the second on.
If you only need to check the last line of file2 continuously, you could:
$ awk 'NR==FNR{a[$1];next}{for(i in a)if(i ~ $0) print i}' file1 <(tail -f file2)
Explained:
NR==FNR{a[$1];next} read into a array the search terms from file1
file2 is tail -f'd into awk using process substitution, ie. it reads a record from the end of file2, goes thru all search words in a and looks them from the record, printing search word if there is a match

The meaning of "a" in an awk command?

I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using command line, so I'm just trying to figure things out, thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found. This makes storing $0 in a relatively expensive because it is storing a copy of the whole file (possibly twice if there's only one word on each line of the file). It is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
Here a is not a command but an awk array it can very well be arr also:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array, in your code
FNR==NR{ a[$1]=$0;next }
Creates an array called "a" with indexes taken from the first column of the first input file.
All element values are set to the current record.
The next statement forces awk to immediately stop processing the current record and go on to the next record.