awk choose a line with $1 present in a file and output with a changed field - awk

I've tried to use Awk to do the following:
I have a large txt file whose first column is the name of a gene and whose remaining columns hold values, essentially numeric.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great, but now I need to keep all rows (in the same order) and modify the lines for these genes, doing something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if the gene is not in my list, I want to keep the whole line unchanged; otherwise I want to keep the whole line but adjust the value in one column by a given amount.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace the value with 0 if it becomes negative, but that I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify: (I put commas as the field separator here, but the fields are actually tab-separated)
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output: (I subtract one tenth of the 4th column from the 3rd column when the gene in the first column matches a gene in my separate txt file; otherwise I keep the full line unchanged)
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903

Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rewritten as just something. So a bare 1 (which is always true) is a common idiom for "print all lines that reach this point in the script without hitting next".
Rounding a floating-point number can be done with sprintf("%1.0f", n), which rounds to the nearest integer (int(n) would simply discard the fraction and always round towards zero).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}      # first file: remember the genes to modify
$1 in arr { x = $3 - ($4/10)               # adjust the third column
    if (x < 0) x = 0                       # clamp negative results to zero
    print $1, $2, sprintf("%1.0f", x), $4  # round to the nearest integer
    next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
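As a quick sanity check of the rounding described above (just an illustration, not part of the original demo): the setx row works out to 200 - 1903/10 = 9.7, which sprintf rounds to 10 while int() would truncate to 9:
awk 'BEGIN { x = 200 - 1903/10; printf "%1.0f vs %d\n", x, int(x) }'
which prints:
10 vs 9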

Related

Get values from the next row and merge - awk

I have a pipe delimited file like this
OLD|123432
NEW|232322
OLD|1234452
NEW|232324
OLD|656966
NEW|232325
I am trying to create a new file by merging rows based on the value in the first column (OLD/NEW). The first column in the output file will have the new number and the second column will have the old number.
Output
232322|123432
232324|1234452
232325|656966
I looked at the answer here: How to merge every two lines into one from the command line?. I know it is not the exact solution, but I used it as a starting point
and tried to make it work for this; however, it throws a syntax error.
awk -F "|" 'NR%2{OFS = "|" printf "%s ",$0;next;}1'
You may use this awk:
awk 'BEGIN {FS=OFS="|"} $1 == "NEW" {print $2, old} $1 == "OLD" {old = $2}' file
232322|123432
232324|1234452
232325|656966
Note that $0 holds the whole line; with the field separator set to a pipe, the number you want is in the second field, $2.
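A quick way to see the difference (just an illustration using one of the sample lines):
printf 'NEW|232322\n' | awk -F '|' '{ print "whole line: " $0; print "second field: " $2 }'
which prints:
whole line: NEW|232322
second field: 232322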
If you want to use the remainder with NR%2, another option could be storing the value of the second column in a variable, for example v
awk 'BEGIN{FS=OFS="|"} NR%2{v=$2;next;}{print $2,v}' file
Output
232322|123432
232324|1234452
232325|656966

How to add zero decimal points to all elements of a matrix in shell

I have a file containing integer and floating-point numbers which has 3 columns and many lines. I want to have all the numbers in floating-point format with 6 decimal places, i.e. the number 2 becomes 2.000000. I already tried this script but it only works for one column. I would appreciate it if you could help me do it for the next two columns too, or suggest any other one-line script which can do the job. Thanks in advance
awk '{printf "%.6f\n",$1}' file
The script you provided works for only one column because you are only printing the first column in the script. If you have only 3 columns and if the number of columns is fixed you just need to add the other two columns to your print statement to make it work.
For example:
awk '{printf "%.6f %.6f %.6f \n",$1,$2,$3}'
If the number of columns is unknown you can use a loop inside awk to print all the columns. NF gives the number of fields in the current record. You can iterate through it and print the results.
awk '{ for (i=1; i<= NF; i++) printf "%.6f ",$i; print "" }' input
You can also use this script to print the three columns as required if your file only contains three columns (NF is 3 in that case).
As pointed out by @kvantour in the comments, an elegant way to write the above one-liner is:
awk '{for(i=1;i<=NF;++i) printf "%.6f" (i==NF?ORS:OFS), $i}' input
Essentially, both one-liners do the same thing but the ORS (Output Record Separator) and OFS (Output Field Separator) variables give the flexibility to change the output easily and quickly.
So what (i==NF?ORS:OFS) does is: if the field is not the last column, OFS is printed (a space by default), and if it is the last column, ORS is printed (a newline by default). The advantage of using these separator variables is that, for example, if you want two newlines instead of a single newline between rows in the results, you can easily set ORS="\n\n".
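For instance, a small variation of the one-liner above (my own illustration, not from the original answer) that leaves a blank line between rows:
awk 'BEGIN{ORS="\n\n"} {for(i=1;i<=NF;++i) printf "%.6f" (i==NF?ORS:OFS), $i}' input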

Breakdown of one line of code involving awk

I am currently working on a project which involves comparing data from two different files. I had been looking for a command which would compare one line of file1 with each line of file2, and print out a '1' if there is a match and a '0' if not. Then, it would repeat this for the second line of file1, again against each line of file2.
I found this bit of code online which seems to work for my project, but I was hoping someone would help to break it down for me and provide more of an explanation.
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
Additionally, I'm new to this so any resources which may guide me to my answer would be really helpful. Thank you.
Here's what this awk is saying:
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
If the record number within the specific file being processed (FNR) is the same as the overall record number across all input (NR), execute the {a[$1]; next} block. We can safely assume that if this condition is true, we are processing the first file.
{a[$1]; next} adds the first column as a key in the array named a and then goes to the next line without processing any more of the awk script. Once the first file is fully processed, we will have an array with a key for every distinct value found in the first column of the first file.
{print $1, ($1 in a) ? "1":"0"} Since we must now be on the second file, we print every line/record we encounter. Here we print the first column; then, if that column's value is in the array as a key, we print 1, otherwise we print 0.
In short, this prints the first column of every line of the second file, stating with a 1 or 0 whether that value also exists in the first column of the first file.
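For example, with two small made-up files (not from the question): if file1.txt contains
apple
banana
and file2.txt contains
apple 1 2
cherry 3 4
banana 5 6
then the command prints
apple 1
cherry 0
banana 1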
Repeated here for locality of reference, and in case the question gets edited:
awk 'FNR==NR{a[$1]; next} {print $1, ($1 in a) ? "1":"0"}' file1.txt file2.txt
You should really read a basic awk primer first. Basically the clause FNR==NR is a common idiom to check if we're reading the first file. NR is the overall record number (the line number), while FNR is the record number in the current file, so you're still processing the first file when these are equal. The action then stores the first column (not the entire line) into an array. So the first thing this program does is read the first column of the first file into the array named a. Then it starts reading the second file, and prints the first column of each line, followed by "1" or "0" depending on if the value in the first column is in the array.
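To see the difference between NR and FNR, a quick check (with hypothetical contents, assuming file1.txt has 2 lines and file2.txt has 3) is:
awk '{ print FILENAME, NR, FNR }' file1.txt file2.txt
which would print something like:
file1.txt 1 1
file1.txt 2 2
file2.txt 3 1
file2.txt 4 2
file2.txt 5 3
NR keeps counting across files while FNR resets for each file, which is why FNR==NR singles out the first file.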

The meaning of "a" in an awk command?

I have an awk command in a script I am trying to make work, and I don't understand the meaning of 'a':
awk 'FNR==NR{ a[$1]=$0;next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*
I'm quite new to using command line, so I'm just trying to figure things out, thanks.
a is an associative array.
a[$1] = $0;
takes the first word $1 on the line as the index in the array, and stores the whole line $0 as the value. It does this for the first file (while the file record number is equal to the overall record number). The next command means it doesn't process the rest of the script while it is processing the first file.
For the rest of the data files, it evaluates:
($2 in a)
and prints the line if the word in $2 is found as a key in a. Storing $0 in a is relatively expensive, because it keeps a copy of the whole first file in memory (possibly twice, if there is only one word on each line of the file). When only a membership test is needed, it is more conventional and sufficient to do a[$1]++ or even a[$1] = 1.
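A minimal sketch of that lighter variant (same output for this use case, since the stored value is never read):
awk 'FNR==NR { a[$1] = 1; next } ($2 in a)' FILELIST.TXT FILEIN.* > FILEOUT.*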
Given FILELIST.TXT
ABC The rest
DEF And more
Given FILEIN.1 containing:
Word ABC and so on
Grow FED won't be shown
This DEF will be shown
The XYZ will be missing
The output will be:
Word ABC and so on
This DEF will be shown
Here a is not a command but an awk array; it could just as well be named arr:
awk 'FNR==NR {arr[$1]=$0;next} ($2 in arr)' FILELIST.TXT FILEIN.* > FILEOUT.*
a is nothing but an array, in your code
FNR==NR{ a[$1]=$0;next }
Creates an array called "a", indexed by the first column of the first input file.
Each element's value is set to the whole current record ($0).
The next statement forces awk to immediately stop processing the current record and go on to the next record.

Using AWK to Process Input from Multiple Files

Many people have been very helpful by posting the following solution for AWK'ing multiple input files at once:
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
This works well, but I was wondering if someone could explain to me why? I find the AWK syntax a little bit tough to get the hang of and was hoping someone wouldn't mind breaking the code snippet down for me.
awk 'FNR==NR{a[$1]=$2 FS $3;next}
Here we handle the 1st input (file2). Say FS is a space: we build up an array a whose index is column 1 and whose value is column2 FS column3. The FNR==NR pattern and next mean this part of the code runs only for file2. You can check man gawk to see what NR and FNR are.
{ print $0, a[$1]}' file2 file1
When NR != FNR it's time to process the 2nd input, file1. Here we print each line of file1, then take column 1 as the index, look up the value in array a and print it. In other words, file1 and file2 are joined on column 1.
For NR and FNR, in short: if the 1st input has 5 lines and the 2nd input has 10 lines, then NR runs 1,2,3...15 while FNR runs 1...5 and then 1...10 again.
That is the trick behind the FNR==NR check.
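As a small illustration with made-up data (not from the question): suppose file2 maps a key to two extra columns and file1 has the key plus one value.
file2:
apple red fruit
carrot orange veg
file1:
apple 10
carrot 30
Then
awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
prints each file1 line with the matching file2 columns appended:
apple 10 red fruit
carrot 30 orange veg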
I found this question/answer on Google, and it appears to refer to a very specific data set found in another question (How to merge two files using AWK?). What follows is the answer I was looking for (and that I think most people would be): simply concatenating every line from two different files using AWK. Though you could probably use UNIX utilities like join or paste, AWK is much more flexible and powerful if your desired output is different: by using if statements, or by altering the OFS (which may be more difficult to do depending on the utility; see below), you can alter the output in a much more expressive way (an important consideration for shell scripters).
For simple line-by-line concatenation:
awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""], $0 }' file1 file2
This emulates the function of a numerically indexed array (AWK only has associative arrays) by using implicit type conversion. It is relatively expressive and easy to understand.
Using two files called test1 and test2 with the following lines:
test1:
line one
line two
line three
test2:
line four
line five
line six
I get this result:
line one line four
line two line five
line three line six
Depending on how you want to join the values between the columns in the output, you can pick the appropriate output field separator. Here's an example with ellipses (...) separating the columns:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 }' test1 test2
Yielding this result:
line one...line four
line two...line five
line three...line six
I hope at least that this inspires you all to take advantage of the power of AWK!
A while ago I stumbled upon a very good solution for handling multiple files at once. The idea is to save the files in memory in AWK arrays using this method:
FILENAME==ARGV[1] { file2array[FNR] = $0 ; next }
FILENAME==ARGV[2] { file1array[FNR] = $0 ; next }
For post-processing of the data, it is better to also save the number of lines, so:
FILENAME==ARGV[1] { file2array[FNR] = $0 ; f2rows = FNR ; next }
FILENAME==ARGV[2] { file1array[FNR] = $0 ; f1rows = FNR ; next }
f2rows and f1rows will hold the position of the last row.
It is more code, but if you want more complex data treatment, I think it's the better approach. Besides, the previous approaches process the inputs sequentially, so you couldn't do calculations that depend on data from both files at the same time; with this approach you can do everything with both files, since both are in memory at once.
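A self-contained sketch of how this could be used (the END-block pasting is my own illustration, not from the answer): load both files into arrays, then combine them once all input has been read.
awk '
FILENAME==ARGV[1] { file2array[FNR] = $0 ; f2rows = FNR ; next }
FILENAME==ARGV[2] { file1array[FNR] = $0 ; f1rows = FNR ; next }
END {
    # both files are now in memory, so we can walk them in any order;
    # here we simply paste them line by line, up to the longer of the two
    n = (f1rows > f2rows) ? f1rows : f2rows
    for (i = 1; i <= n; i++) print file1array[i], file2array[i]
}' file2 file1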