Adding columns from two files with awk

I have the following problem:
Let's say I have two files with the following structure:
1 17.650 0.001 0.000E+00
1 17.660 0.002 0.000E+00
1 17.670 0.003 0.000E+00
1 17.680 0.004 0.000E+00
1 17.690 0.001 0.000E+00
1 17.700 0.000 0.000E+00
1 17.710 0.004 0.000E+00
1 17.720 0.089 0.000E+00
1 17.730 0.011 0.000E+00
1 17.740 0.000 0.000E+00
1 17.750 0.032 0.000E+00
1 17.760 0.100 0.000E+00
1 17.770 0.020 0.000E+00
1 17.780 0.002 0.000E+00

2 -20.000 0.001 0.000E+00
2 -19.990 0.002 0.000E+00
2 -19.980 0.003 0.000E+00
2 -19.970 0.004 0.000E+00
2 -19.960 0.001 0.000E+00
2 -19.950 0.000 0.000E+00
2 -19.940 0.004 0.000E+00
2 -19.930 0.089 0.000E+00
2 -19.920 0.011 0.000E+00
2 -19.910 0.000 0.000E+00
2 -19.900 0.032 0.000E+00
2 -19.890 0.100 0.000E+00
2 -19.880 0.020 0.000E+00
2 -19.870 0.002 0.000E+00
The first two columns are identical in both files; only the 3rd and 4th columns differ. The above is a sample of these files. The blank line is essential and can be found throughout the files, separating the data into "blocks". It must exist!
Using the following command:
awk '{a[FNR]=$1; b[FNR]=$2; s[FNR]+=$3} END{for (i=1; i<=FNR; i++) print a[i], b[i], s[i]}' file1 file2 > file-result
I am trying to create a file where columns 1 and 2 are identical to the ones in the original files and the 3rd column is the sum of the 3rd column in file1 and file2.
This command works if there is no blank line. With the blank line I get the following:
1 17.650 0.001
1 17.660 0.002
1 17.670 0.003
1 17.680 0.004
1 17.690 0.001
1 17.700 0.000
1 17.710 0.004
1 17.720 0.089
1 17.730 0.011
1 17.740 0.000
1 17.750 0.032
1 17.760 0.100
1 17.770 0.020
1 17.780 0.002
0
2 -20.000 0.001
2 -19.990 0.002
2 -19.980 0.003
2 -19.970 0.004
2 -19.960 0.001
2 -19.950 0.000
2 -19.940 0.004
2 -19.930 0.089
2 -19.920 0.011
2 -19.910 0.000
2 -19.900 0.032
2 -19.890 0.100
2 -19.880 0.020
2 -19.870 0.002
(please note that in the above I have not written the actual sum in column 3 but you get the idea)
How can I make sure that 0 doesn't appear in the blank line? I can't figure it out.

Note that if the first two columns are really identical you don't need to store them in arrays; store only the 3rd column of the first file.
The solution to your problem is simple: when processing the second file, test whether the line is blank and, if it is, print it as-is; otherwise print the modified line. The next statement makes all this quite easy and clean.
awk 'NR == FNR {s[NR]=$3; next}
/^[[:space:]]*$/ {print; next}
{print $1, $2, s[FNR]+$3}' file1 file2 > file-result
The first block runs only on lines of the first file (NR == FNR is true only then). It stores the 3rd field in array s, indexed by the line number. The next statement moves immediately to the next line and prevents the other two blocks from running on lines of the first file.
The second block thus runs only on lines of the second file, and only if they are blank (^[[:space:]]*$ matches zero or more whitespace characters between the beginning (^) and end ($) of the line). The block prints the blank line as-is, and the next statement again moves immediately to the next line, preventing the last block from running. Note that if your awk supports the \s operator you can replace [[:space:]] with \s. Note also that the test could equally be NF == 0 (NF is the number of fields of the current record).
So the 3rd and last block runs only on non-blank lines of the second file. It simply prints the first two fields and the sum of the third fields of the two files (taken from $3 and the s array).
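As mentioned above, the blank-line test can just as well be written with NF == 0 instead of the regular expression; the behaviour is the same:
awk 'NR == FNR {s[NR]=$3; next}
NF == 0 {print; next}
{print $1, $2, s[FNR]+$3}' file1 file2 > file-result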

Related

Merging multiple dataframes with summarisation of variables

I am trying to use the awk command for the first time and I have to merge multiple files (inputfile_1, inputfile_2, ... ) that all look like this:
X.IID NUM SUM AVG S_SUM
X001 2 0 0.000 0.000
X002 2 1 0.138 0.276
X003 2 1 0.138 0.276
X004 2 0 0.000 0.000
X005 2 1 0.138 0.276
I have to merge those files by X.IID and also for each X.IID sum the values of NUM and AVG into two new variables. I tried to use this awk command:
awk '{iid[FNR]=$1;num_sum[FNR]+=$2;avg_sum[FNR]+=$4} END{for(i=1;i<=FNR;i++)print iid[i],num_sum[i],avg_sum[i]}' inputfile_*.csv > outputfile.csv
For some reason some X.IID values repeat in the output file. Example:
X.IID X0 X0.1
X001 50 -1.297
X001 50 -1.297
X002 50 -0.004
X003 50 -0.105
X003 50 -0.105
X003 50 -0.105
I could just remove the repeated observations as they do not differ from each other, but I would like some reassurance that the values calculated are correct.
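Indexing by FNR assumes that every file lists the same X.IID on the same line number; if the files differ in row order or contain duplicate IDs, the rows no longer line up and values can repeat or mix. A minimal sketch that keys on the X.IID itself instead (assuming whitespace-separated fields as in the sample, skipping each file's header line, and not preserving input order):
awk 'FNR > 1 {num[$1] += $2; avg[$1] += $4}
END {for (id in num) print id, num[id], avg[id]}' inputfile_*.csv > outputfile.csv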

Comparing two columns in two files using awk with duplicates

File 1
A4gnt 0 0 0 0.3343
Aaas 2.79 2.54 1.098 0.1456
Aacs 0.94 0.88 1.063 0.6997
Aadac 0 0 0 0.3343
Aadacl2 0 0 0 0.3343
Aadat 0.01 0 1.723 0.7222
Aadat 0.06 0.03 1.585 0.2233
Aaed1 0.28 0.24 1.14 0.5337
Aaed1 1.24 1.27 0.976 0.9271
Aaed1 15.91 13.54 1.175 0.163
Aagab 1.46 1.14 1.285 0.3751
Aagab 6.12 6.3 0.972 0.6569
Aak1 0.02 0.01 1.716 0.528
Aak1 0.1 0.19 0.561 0.159
Aak1 0.14 0.19 0.756 0.5297
Aak1 0.16 0.18 0.907 0.6726
Aak1 0.21 0 0 0.066
Aak1 0.26 0.27 0.967 0.9657
Aak1 0.54 1.65 0.325 0.001
Aamdc 0.04 0 15.461 0.0875
Aamdc 1.03 1.01 1.019 0.8817
Aamdc 1.27 1.26 1.01 0.9285
Aamdc 7.21 6.94 1.039 0.7611
Aamp 0.06 0.05 1.056 0.9136
Aamp 0.11 0.11 1.044 0.9227
Aamp 0.12 0.13 0.875 0.7584
Aamp 0.22 0.2 1.072 0.7609
File 2
Adar
Ak3
Alox15b
Ampd2
Ampd3
Ankrd17
Apaf1
Aplp1
Arih1
Atg14
Aurkb
Bcl2l14
Bmp2
Brms1l
Cactin
Camta2
Cav1
Ccr5
Chfr
Clock
Cnot1
Crebrf
Crtc3
Csnk2b
Cul3
Cx3cl1
Dnaja3
Dnmt1
Dtl
Ednra
Eef1e1
Esr1
Ezr
Fam162a
Fas
Fbxo30
Fgr
Flcn
Foxp3
Frzb
Fzd6
Gdf3
Hey2
Hnf4
The desired output would be: wherever the first column matches in both files, print out all the columns from the first file (including duplicates).
I've tried
awk 'NR==FNR{a[$1]=$2"\t"$3"\t"$4"\t"$5; next} { if($1 in a) { print $0,a[$1] } }' File2 File1 > output
But for some reason I'm getting just a few hits. Does anyone know why?
Read the second file first and store its 1st-column values as keys of array arr; then read the first file, and if the 1st column of file1 exists in arr (built from file2), print the current row/record from file1.
awk 'FNR==NR{arr[$1];next}$1 in arr' file2 file1
Advantage:
if you use a[$1]=$2"\t"$3"\t"$4"\t"$5; next, any later row with the same key overwrites the value stored earlier,
but if you use arr[$1];next, we store just the unique keys, and $1 in arr handles duplicate records in file1 correctly (they are all printed)
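For readers new to awk, the condensed one-liner is equivalent to the fully spelled-out form below; a bare condition such as $1 in arr implies the default action {print $0}:
awk 'FNR==NR {arr[$1]; next} ($1 in arr) {print $0}' file2 file1 > output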

Searching File2 with 3 columns from File 1 with awk

Does anyone know how to print "Not Found" if there is no match, such that the print output will always contain the same number of lines as File 1?
To be more specific, I have two files with four columns:
File 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
File 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 803 804 0.24
This is what I am using now:
$ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $4}' file1.txt file2.txt
This generates the following output:
0.55
0.09
0.88
However, I want to get this:
0.55
0.09
0.88
Not Found
Could you please help?
Sorry if this is presented in a confusing manner; I have little experience with awk and am confused myself.
As a separate issue, I want to end up with a file that has the data from File 2 added on to File 1, like so:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 Not Found
I was going to generate the file as before (let's call it file2-matches.txt), then use the paste command:
paste -d"\t" file1.txt file2-matches.txt > output.txt
But considering I have to do this matching for over 100 files, is there any more efficient way to do this that you can suggest?
Add an else clause:
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
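For the second part of the question (building the combined file directly, without the separate paste step), one possible sketch is to read file2 first, store its 4th column, and look it up while printing each line of file1; this assumes the ($1,$2,$3) key is unique within file2:
awk 'NR==FNR {v[$1,$2,$3]=$4; next}
{print $0, (($1,$2,$3) in v ? v[$1,$2,$3] : "Not Found")}' file2.txt file1.txt > output.txt
Add -v OFS='\t' if you want the appended column tab-separated, as with paste -d"\t".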

obtain averages of field 2 after grouping by field 1 with awk

I have a file with two fields containing numbers that I have sorted numerically based on field 1. The numbers in field 1 range from 1 to 200000 and the numbers in field 2 between 0 and 1. I want to obtain averages for both field 1 and field 2 in batches (based on rows).
Here is an example with batches of 4 rows. The input:
1 0.12
1 0.34
2 0.45
2 0.40
50 0.60
301 0.12
899 0.13
1003 0.14
1300 0.56
1699 0.43
2100 0.25
2500 0.56
The output would be:
1.5 0.327
563.25 0.247
1899.75 0.45
Here you go:
awk -v n=4 '{s1 += $1; s2 += $2; if (++i % n == 0) { print s1/n, s2/n; s1=s2=0; } }'
Explanation:
Initialize n=4, the size of the batches
Collect the sums: sum of 1st column in s1, the 2nd in s2
Increment counter i by 1 (default initial value is 0, no need to set it)
If i is divisible by n with no remainder, then we print the averages, and reset the sum variables
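Note that the command above silently drops a final batch of fewer than n rows. If you also want an average for that trailing partial batch (an assumption, since the question does not say what should happen there), a small variant with an END block:
awk -v n=4 '{s1 += $1; s2 += $2; c++}
c == n {print s1/c, s2/c; s1 = s2 = c = 0}
END {if (c > 0) print s1/c, s2/c}'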

Search multiple columns for values below a threshold using awk or other bash script

I would like to extract lines of a file where specific columns have a value <0.05.
For example if $2 or $4 or $6 has a value <0.05 then I want to send that line to a new file.
I do not want any lines that have a value >0.05 in any of these columns
cat File_1.txt
S_003 P_003 S_006 P_006 S_008 P_008
74.9 0.006 59.6 0.061 72.2 0.002
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
45.8 0.726 38.5 0.981 43.9 0.800
50.7 0.305 47.8 0.314 46.6 0.615
49.9 0.366 50.4 0.165 48.2 0.392
42.5 0.920 43.7 0.698 40.3 0.970
46.3 0.684 42.9 0.760 47.7 0.438
192.4 0.001 312.8 0.001 274.3 0.001
I tried this using awk, but could only get it to work in a very long-winded way:
awk ' $2<=0.05' file_1.txt > file_2.txt
awk ' $4<=0.05' file_2.txt > file_3.txt
etc., and achieved the desired result:
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
192.4 0.001 312.8 0.001 274.3 0.001
but my file is 198 columns and 57000 lines
I also tried piping the awk commands together, with no luck; it only searches $2:
awk ' $2<=0.05 || $4=<0.05' File_1.txt > File_2.txt
74.9 0.006 59.6 0.051 72.2 0.002
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
192.4 0.001 312.8 0.001 274.3 0.001
I'm pretty new at this and would appreciate any advice on how to achieve this using awk.
Thanks
Sam
Perhaps this is what you're looking for. It checks every even-numbered column and prints a line only when none of those columns exceeds 0.05:
awk 'NF>1 { for(i=2;i<=NF;i+=2) if ($i>0.05) next }1' File_1.txt
Results:
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
192.4 0.001 312.8 0.001 274.3 0.001
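One detail worth knowing: the header line is dropped here only because non-numeric fields such as P_003 are compared as strings and therefore come out greater than 0.05. If you would rather skip the header explicitly (assuming it is always the first line), a small variant:
awk 'NR==1 {next} {for (i=2; i<=NF; i+=2) if ($i > 0.05) next} 1' File_1.txt > File_2.txt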