finding differences across a row with awk

I have a table in which most of the values in a given row are the same. What I want to pull out are any rows where at least one of the values is different. I've figured out how to do that with something like this:
awk -F "\t" '{if (($4!=$5)&&($5!=$6)&&($6!=$7)) print $0;}'
The only problem is there are forty-odd columns to compare. Is there a more elegant way to compare multiple columns for differences? BTW, these are non-numerical values, so a fancy math trick won't work.
Thanks, all. I'm a newbie, so I have to admit that I don't understand all of the commands, etc., but I can look them up from here. I'm not sure whose suggestion I'll go with, but I learn more from concrete examples than from textbook explanations, so having these different solutions is a big help with my learning curve.

A fancy math trick might not work, but how about:
$ cat file
one one one one two
two two two two two
three four four five
$ awk '{f=$0;gsub($1,"")}NF{print f}' file
one one one one two
three four four five
First we store the line in its original state with f=$0, then we do a global substitution on everything matching the first field. If all fields are the same, nothing will be left, therefore NF will be 0 and nothing is printed; otherwise we print the original line.
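To see what the gsub leaves behind on each line, here is a small diagnostic of my own (not part of the original answer), run against the same file:
$ awk '{f=$0; gsub($1,""); print NF, "|" $0 "|"}' file
1 |    two|
0 |    |
3 | four four five|
Note that gsub treats $1 as a regular expression, so a first field containing regex metacharacters, or one that occurs as a substring of another field, could behave unexpectedly.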
Your script starts at $4, which suggests you are only interested in changes from this field on, in which case:
$ awk '{f=$0;gsub($4,"")}NF>3{print f}' file

If any two fields differ, then at least one of them must differ from field 1 (if both were equal to field 1, they would be equal to each other). So just loop from 2 to NF (the number of fields), comparing each field against $1:
awk -F "\t" '{ for (i = 2; i <= NF ;i++) if ($i != $1) { print; next; }}'
You can tune this to ignore leading fields (e.g., start at 5 and compare against $4) as needed.
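For instance, a tuned variant along those lines (my sketch, assuming tab-separated input and that only $4 onward matters, as in the original script):
awk -F "\t" '{ for (i = 5; i <= NF; i++) if ($i != $4) { print; next } }'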

You could just use a for loop:
awk -F "\t" '{ for(i=4;i<NF;i++) if ($i != $(i+1)) { print; next } }' file
Adjust accordingly. HTH.

Related

awk - store first occurrence based on cell

I have a file (around 10k entries) with the following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
The file contains duplicates, some without coordinates (n/a), others with mildly contradicting LAT;LONG coordinates.
I only want to store the first unique value of [$1;$2;$3;$4;$5] as output, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to essentially your example (it simply checks whether the unique index built from the first three fields combined has been set yet, and relies on the default print operation to output the first record having each unique combination):
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
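Spelled out, the shortened idiom is equivalent to something like this (my expansion; the post-increment tests the old value of the counter and then bumps it):
awk -F";" '{ if (!seen[$1,$2,$3]) print; seen[$1,$2,$3]++ }' file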
If you say yours works, then it is certainly shorter. Let me know if you have further questions.
I found it once I stopped looking into how to create arrays: I created a new $1 consisting of $1,$2,$3, but the other solution is indeed more elegant. Here is the command I came up with after merging the fields in the file (and setting them as the new $1), a step I then didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv

awk: choose a line with $1 present in a file and output with a changed field

I've tried to use Awk to do the following:
I have a large txt file whose first column is the name of a gene, with different values, essentially numeric, in each of the other columns.
Now I have a file with a list of genes (not all genes, just a subset) that I want to modify.
Initially I just removed lines using something I found in a forum:
awk -F '\t' ' FILENAME=="gene_list" {arr[$1]; next} # create an array without values
!($1 in arr)' gene_list original_file.txt > modified_file.txt
This worked great, but now I need to keep all rows (in the same order) while modifying the lines for these genes, to do something like:
if ($1 in arr) {print $1, $2, $3-($4/10), $4}
else {print $0}
So you see, this time, if the gene is not in my list, I want to keep the whole line; otherwise I want to keep the whole line but modify the value in one column by a given amount.
If you could include something so that the value remains an integer, that would be great. I'll also have to replace it by 0 if the value becomes negative, but this I know how to do, at least in a separate command.
Edit: minimal example:
list of genes in a txt file, one under the other:
ccl5
cxcr4
setx
File to modify (I put commas as the field separator here, but the fields should really be tab-separated):
ccl4,3,18000,50000
ccl5,4,400,5000
cxcr4,5,300,2500
apoe,4,100,90
setx,3,200,1903
Expected output (I subtract a tenth of the 4th column from the 3rd when the gene in the first column matches a gene in my separate txt file; otherwise I keep the full line unchanged):
ccl4,3,18000,50000
ccl5,4,0,5000
cxcr4,5,50,2500
apoe,4,100,90
setx,3,10,1903
Just spell out the arithmetic constraints.
The following is an attempt to articulate it in idiomatic Awk.
if (something) { print } can be rearticulated as just something. So a bare 1 (which is always true) is a common idiom for "print all lines" (if you reach this point in the script before hitting next).
Rounding a floating-point number can be done with sprintf("%1.0f", n), which correctly rounds up if the fraction is bigger than 0.5 (int(n) would always truncate the fraction, rounding down).
awk 'BEGIN { FS=OFS="\t" }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list original_file.txt > modified_file.txt
Demo: https://ideone.com/oDjKhf
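If you wanted to run the same logic on the comma-separated minimal example above (my adaptation; it assumes the gene list is saved as gene_list and the data as file), only the separators change, and it reproduces the expected output:
awk 'BEGIN { FS=OFS="," }
FILENAME=="gene_list" {arr[$1]; next}
$1 in arr { x=sprintf("%1.0f", $3-($4/10));
if (x<0) x=0; print $1, $2, x, $4; next }
1' gene_list file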

How to add zero decimal points to all elements of a matrix in shell

I have a file containing integer and floating-point numbers, which has 3 columns and many lines. I want to have all the numbers in floating-point format with 6 decimal places, i.e. the number 2 becomes 2.000000. I already tried this script, but it only works for one column. I would appreciate it if you could help me do it for the other two columns too, or with any other one-line script which can do the job. Thanks in advance
awk '{printf "%.6f\n",$1}' file
The script you provided works for only one column because you are only printing the first column. If you have only 3 columns and the number of columns is fixed, you just need to add the other two columns to your printf statement to make it work.
For example:
awk '{printf "%.6f %.6f %.6f \n",$1,$2,$3}'
If the number of columns is unknown, you can use a loop inside awk to print all the columns. NF gives the total number of fields in the current record; you can iterate through them and print the results.
awk '{ for (i=1; i<= NF; i++) printf "%.6f ",$i; print "" }' input
You can also use this script to print the three columns as required if your file only contains three columns (NF is 3 in that case).
As pointed out by @kvantour in the comments, an elegant way to write the above one-liner is:
awk '{for(i=1;i<=NF;++i) printf "%.6f" (i==NF?ORS:OFS), $i}' input
Essentially, both one-liners do the same thing, but the ORS (Output Record Separator) and OFS (Output Field Separator) variables give you the flexibility to change the output easily and quickly.
So what (i==NF?ORS:OFS) does is: if the field is not the last column, OFS is printed (OFS is a space by default), and if it is the last column, ORS is printed (the default value of ORS is a newline). The advantage of using these separator variables is that, for example, if you want two newlines instead of a single newline between rows in the results, you can easily set ORS="\n\n".
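For instance, a double-spaced variant along those lines (my sketch, building directly on the one-liner above):
awk 'BEGIN{ORS="\n\n"} {for(i=1;i<=NF;++i) printf "%.6f" (i==NF?ORS:OFS), $i}' input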

awk: calculating sum from values in single field with multiple delimiters

Related to another post I had,
parsing a sql string for integer values with multiple delimiters,
in which I said I could easily accomplish the same with UNIX tools (ahem). I found it a bit messier than expected. I'm looking for an awk solution. Any suggestions on the following?
Here is my original post, paraphrased:
#
I want to use awk to parse data sourced from a flat file that is pipe-delimited. One of the fields is sub-formatted as follows. My end state is to sum the integers within the field, but my question here is to ask for ways to use awk to sum the numeric values in the field. The pattern of the sub-formatting will always be that the desired numbers are preceded by a tilde (~) and followed by an asterisk (*), except for the last one in the field. The number of subfields may vary too (my example has 5, but there could be more or fewer). The 4-char TAG name is of no importance.
So here is a sample:
|GADS~55.0*BILK~0.0*BOBB~81.0*HETT~32.0*IGGR~51.0|
From this example, all I would want for processing is the final number of 219. Again, I can work on the sum part as a further step; just interested in getting the numbers.
#
My solution currently entails two awk statements. The first uses gsub to replace the '~' with a '*' delimiter in my target field, 77:
awk -F'|' 'BEGIN {OFS="|"} { gsub("~", "*", $77) ; print }' file_1 > file_2
My second awk statement is to calculate the numeric sums on the target field, 77, which is the last field, and replace it with the calculated value. It is built on the assumption that there will be no other asterisks (*) anywhere else in the file. I'm okay with that. It is working for most examples, but not others, and my gut tells me this isn't that robust of an answer. Any ideas? The suggestions on my other post for SQL were great, but I couldn't implement them for unrelated silly reasons.
awk -F'*' '{if (NF>=2) {s=0; for (i=1; i<=NF; i++) s=s+$i; print substr($1, 1, length($1)-4) s;} else print}' file_2 > file_3
To get the sum (219) from your example, you can use this:
awk -F'[^0-9.]+' '{for(i=1;i<=NF;i++)s+=$i;print s}' file
or the following for 219.00:
awk -F'[^0-9.]+' '{for(i=1;i<=NF;i++)s+=$i;printf "%.2f\n", s}' file
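One caveat from me: s is never reset in the one-liners above, which is fine for a single-line sample like yours, but for a file with many such lines, where you presumably want one sum per line, a variant with a per-line reset would be needed:
awk -F'[^0-9.]+' '{s=0; for(i=1;i<=NF;i++)s+=$i; print s}' file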

Using AWK to Process Input from Multiple Files

Many people have been very helpful by posting the following solution for AWK'ing multiple input files at once:
$ awk 'FNR==NR{a[$1]=$2 FS $3;next}{ print $0, a[$1]}' file2 file1
This works well, but I was wondering if someone could explain to me why? I find the AWK syntax a little bit tough to get the hang of and was hoping someone wouldn't mind breaking the code snippet down for me.
awk 'FNR==NR{a[$1]=$2 FS $3;next}
Here we handle the 1st input (file2). Say FS is space: we build up an array a whose index is column 1 and whose value is column2 FS column3. The combination of FNR==NR and next means this part of the code works only on file2. You can check man gawk to see what NR and FNR are.
{ print $0, a[$1]}' file2 file1
When NR != FNR it's time to process the 2nd input, file1. Here we print each line of file1, take its column 1 as the index, look up the value in array a, and print it alongside. In other words, file1 and file2 are joined on column 1 of both files.
For NR and FNR, in short: if the 1st input has 5 lines and the 2nd input has 10 lines, then NR runs 1,2,3...15, while FNR runs 1...5 and then 1...10. There you see the trick of the FNR==NR check.
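A quick way to watch this for yourself (my own sketch, reusing the same two files):
$ awk '{ print FILENAME, NR, FNR }' file2 file1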
I found this question/answer on Google, and it appears to refer to a very specific data set found in another question (How to merge two files using AWK?). What follows is the answer I was looking for (and that I think most people would be), i.e., simply concatenating every line from two different files using AWK. Though you could probably use some UNIX utilities like join or paste, AWK is obviously much more flexible and powerful if your desired output is different: by using if statements, or altering the OFS (which may be more difficult to do depending on the utility; see below), you can alter the output in a much more expressive way (an important consideration for shell scripters).
For simple line-by-line concatenation:
awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""], $0 }' file1 file2
This emulates the function of a numerically indexed array (AWK only has associative arrays) by using implicit type conversion. It is relatively expressive and easy to understand.
Using two files called test1 and test2 with the following lines:
test1:
line one
line two
line three
test2:
line four
line five
line six
I get this result:
line one line four
line two line five
line three line six
Depending on how you want to join the values between the columns in the output, you can pick the appropriate output field separator. Here's an example with ellipses (...) separating the columns:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 }' test1 test2
Yielding this result:
line one...line four
line two...line five
line three...line six
I hope at least that this inspires you all to take advantage of the power of AWK!
A while ago I stumbled upon a very good solution for handling multiple files at once. The way is to save the files in memory, in AWK arrays, using this method:
FILENAME==ARGV[1] { file2array[FNR] = $0 ; next }
FILENAME==ARGV[2] { file1array[FNR] = $0 ; next }
For post-processing the data, it is better to also save the number of lines, so:
FILENAME==ARGV[1] { file2array[FNR] = $0 ; f2rows = FNR ; next }
FILENAME==ARGV[2] { file1array[FNR] = $0 ; f1rows = FNR ; next }
f2rows and f1rows will hold the position of the last row.
It takes more code, but if you want more complex data treatment, I think it's the better approach. Besides, the previous approaches treat the inputs sequentially, so if you needed to do some calculations that depend on data from both files simultaneously, you wouldn't be able to; with this approach you can do anything with both files.
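For example, here is a minimal sketch of my own of what that unlocks (assuming the same file2 file1 argument order as above): with both arrays fully loaded, an END block can print the files side by side, or do any cross-file calculation:
awk 'FILENAME==ARGV[1] { file2array[FNR] = $0 ; f2rows = FNR ; next }
     FILENAME==ARGV[2] { file1array[FNR] = $0 ; f1rows = FNR ; next }
     END { max = (f1rows > f2rows) ? f1rows : f2rows
           for (i = 1; i <= max; i++) print file1array[i], file2array[i] }' file2 file1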