Search multiple columns for values below a threshold using awk or another bash script

I would like to extract the lines of a file where specific columns all have a value <0.05.
For example, if $2, $4, and $6 each have a value <0.05, then I want to send that line to a new file.
I do not want any line that has a value >0.05 in any of these columns.
cat File_1.txt
S_003 P_003 S_006 P_006 S_008 P_008
74.9 0.006 59.6 0.061 72.2 0.002
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
45.8 0.726 38.5 0.981 43.9 0.800
50.7 0.305 47.8 0.314 46.6 0.615
49.9 0.366 50.4 0.165 48.2 0.392
42.5 0.920 43.7 0.698 40.3 0.970
46.3 0.684 42.9 0.760 47.7 0.438
192.4 0.001 312.8 0.001 274.3 0.001
I tried this using awk, but could only get it to work the long way, filtering one column at a time:
awk ' $2<=0.05' file_1.txt > file_2.txt
awk ' $4<=0.05' file_2.txt > file_3.txt
etc., and achieved the desired result:
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
192.4 0.001 312.8 0.001 274.3 0.001
but my file is 198 columns and 57000 lines
I also tried combining the conditions in a single awk command, with no luck; a line still gets through when only $2 passes the test:
awk '$2<=0.05 || $4<=0.05' File_1.txt > File_2.txt
74.9 0.006 59.6 0.061 72.2 0.002
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
192.4 0.001 312.8 0.001 274.3 0.001
I'm pretty new at this and would appreciate any advice on how to achieve this using awk.
Thanks
Sam

Perhaps this is what you're looking for. It scans every even-numbered column and prints a line only when none of those columns holds a value greater than 0.05:
awk 'NF>1 { for(i=2;i<=NF;i+=2) if ($i>0.05) next }1' File_1.txt
Results:
96.2 0.003 89.4 0.001 106.9 0.000
105.8 0.003 72.6 0.003 86.7 0.002
192.4 0.001 312.8 0.001 274.3 0.001
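For clarity, here is the same one-liner as a commented expansion (functionally identical; the trailing 1 is the usual awk shorthand for "print the current line"):

awk '
NF > 1 {                          # skip lines with fewer than two fields
    for (i = 2; i <= NF; i += 2)  # walk the even-numbered columns
        if ($i > 0.05) next       # any value above 0.05: drop this line
}
1                                 # reached only if no column failed: print the line
' File_1.txt > File_2.txt

Since the loop runs up to NF on each line, this scales to 198 columns and 57000 lines without change. Note that with a typical awk the header line is dropped as well: non-numeric fields such as P_003 are compared as strings and sort above "0.05", which matches the results shown above.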

Related

Merging multiple dataframes with summarisation of variables

I am trying to use the awk command for the first time, and I have to merge multiple files (inputfile_1, inputfile_2, ...) that all look like this:
X.IID NUM SUM AVG S_SUM
X001 2 0 0.000 0.000
X002 2 1 0.138 0.276
X003 2 1 0.138 0.276
X004 2 0 0.000 0.000
X005 2 1 0.138 0.276
I have to merge those files by X.IID and also, for each X.IID, sum the values of NUM and AVG into two new variables. I tried to use this awk command:
awk '{iid[FNR]=$1;num_sum[FNR]+=$2;avg_sum[FNR]+=$4} END{for(i=1;i<=FNR;i++)print iid[i],num_sum[i],avg_sum[i]}' inputfile_*.csv > outputfile.csv
For some reason some X.IID values repeat in the output file. Example:
X.IID X0 X0.1
X001 50 -1.297
X001 50 -1.297
X002 50 -0.004
X003 50 -0.105
X003 50 -0.105
X003 50 -0.105
I could just remove the repeated observations, as they do not differ from each other, but I would like some reassurance that the calculated values are correct.
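A likely cause of the repeats, sketched below under the assumption that the input files do not all list the X.IID values in the same order and number of rows: indexing by FNR pairs rows by position rather than by ID, so the same ID can land in several array slots. Keying the arrays on $1 instead makes the summation independent of row order:

awk 'FNR > 1 {num_sum[$1] += $2; avg_sum[$1] += $4}   # FNR > 1 skips each file's header row
     END {for (id in num_sum) print id, num_sum[id], avg_sum[id]}' inputfile_*.csv > outputfile.csv

The traversal order of for (id in ...) is unspecified in awk, so pipe the result through sort if a stable order matters.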

Adding columns from two files with awk

I have the following problem:
Let's say I have two files with the following structure:
1 17.650 0.001 0.000E+00
1 17.660 0.002 0.000E+00
1 17.670 0.003 0.000E+00
1 17.680 0.004 0.000E+00
1 17.690 0.001 0.000E+00
1 17.700 0.000 0.000E+00
1 17.710 0.004 0.000E+00
1 17.720 0.089 0.000E+00
1 17.730 0.011 0.000E+00
1 17.740 0.000 0.000E+00
1 17.750 0.032 0.000E+00
1 17.760 0.100 0.000E+00
1 17.770 0.020 0.000E+00
1 17.780 0.002 0.000E+00

2 -20.000 0.001 0.000E+00
2 -19.990 0.002 0.000E+00
2 -19.980 0.003 0.000E+00
2 -19.970 0.004 0.000E+00
2 -19.960 0.001 0.000E+00
2 -19.950 0.000 0.000E+00
2 -19.940 0.004 0.000E+00
2 -19.930 0.089 0.000E+00
2 -19.920 0.011 0.000E+00
2 -19.910 0.000 0.000E+00
2 -19.900 0.032 0.000E+00
2 -19.890 0.100 0.000E+00
2 -19.880 0.020 0.000E+00
2 -19.870 0.002 0.000E+00
The first two columns are identical in both files; what differs is the 3rd and 4th columns. The above is a sample of these files. The blank line is essential and appears throughout the files, separating the data into "blocks". It must be preserved!
Using the following command:
awk '{a[FNR]=$1; b[FNR]=$2; s[FNR]+=$3} END{for (i=1; i<=FNR; i++) print a[i], b[i], s[i]}' file1 file2 > file-result
I am trying to create a file where columns 1 and 2 are identical to the ones in the original files and the 3rd column is the sum of the 3rd column in file1 and file2.
This command works if there is no blank line. With the blank line I get the following:
1 17.650 0.001
1 17.660 0.002
1 17.670 0.003
1 17.680 0.004
1 17.690 0.001
1 17.700 0.000
1 17.710 0.004
1 17.720 0.089
1 17.730 0.011
1 17.740 0.000
1 17.750 0.032
1 17.760 0.100
1 17.770 0.020
1 17.780 0.002
0
2 -20.000 0.001
2 -19.990 0.002
2 -19.980 0.003
2 -19.970 0.004
2 -19.960 0.001
2 -19.950 0.000
2 -19.940 0.004
2 -19.930 0.089
2 -19.920 0.011
2 -19.910 0.000
2 -19.900 0.032
2 -19.890 0.100
2 -19.880 0.020
2 -19.870 0.002
(please note that in the above I have not written the actual sum in column 3 but you get the idea)
How can I make sure that 0 doesn't appear in the blank line? I can't figure it out.
Note that if the first two columns really are identical in both files, you don't need to store them in arrays; storing only the 3rd column of the first file is enough.
The solution to your problem is simple: when processing the second file, test whether the line is blank and, if it is, print it unchanged; otherwise print the modified line. The next statement makes all this quite easy and clean.
awk 'NR == FNR {s[NR]=$3; next}
/^[[:space:]]*$/ {print; next}
{print $1, $2, s[FNR]+$3}' file1 file2 > file-result
The first block runs only on lines of the first file (NR == FNR is true only then). It stores the 3rd field in array s, indexed by the line number. The next statement moves immediately to the next line and prevents the two other blocks from running on lines of the first file.
The second block thus runs only on lines of the second file, and only if they are blank (^[[:space:]]*$ matches 0 or more spaces between the beginning (^) and end ($) of the line). The block prints the blank line as-is, and the next statement again moves immediately to the next line, preventing the last block from running. Note that if your awk supports the \s operator you can replace [[:space:]] with \s. Note also that the test could equally be NF == 0 (NF is the number of fields of the current record).
So the 3rd and last block runs only on non-blank lines of the second file. It simply prints the first two fields and the sum of the third fields of the two files (taken from $3 and the s array).
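Putting the NF == 0 variant mentioned above together, a sketch equivalent to the script already shown (useful for awks without [[:space:]] support):

awk 'NR == FNR {s[NR] = $3; next}   # first file: remember the 3rd column by line number
     NF == 0   {print; next}        # blank separator line: pass through untouched
     {print $1, $2, s[FNR] + $3}    # otherwise: first two columns plus the summed 3rd
' file1 file2 > file-result

This works because the blank lines sit at the same positions in both files, so FNR stays aligned between them.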

Comparing two columns in two files using awk with duplicates

File 1
A4gnt 0 0 0 0.3343
Aaas 2.79 2.54 1.098 0.1456
Aacs 0.94 0.88 1.063 0.6997
Aadac 0 0 0 0.3343
Aadacl2 0 0 0 0.3343
Aadat 0.01 0 1.723 0.7222
Aadat 0.06 0.03 1.585 0.2233
Aaed1 0.28 0.24 1.14 0.5337
Aaed1 1.24 1.27 0.976 0.9271
Aaed1 15.91 13.54 1.175 0.163
Aagab 1.46 1.14 1.285 0.3751
Aagab 6.12 6.3 0.972 0.6569
Aak1 0.02 0.01 1.716 0.528
Aak1 0.1 0.19 0.561 0.159
Aak1 0.14 0.19 0.756 0.5297
Aak1 0.16 0.18 0.907 0.6726
Aak1 0.21 0 0 0.066
Aak1 0.26 0.27 0.967 0.9657
Aak1 0.54 1.65 0.325 0.001
Aamdc 0.04 0 15.461 0.0875
Aamdc 1.03 1.01 1.019 0.8817
Aamdc 1.27 1.26 1.01 0.9285
Aamdc 7.21 6.94 1.039 0.7611
Aamp 0.06 0.05 1.056 0.9136
Aamp 0.11 0.11 1.044 0.9227
Aamp 0.12 0.13 0.875 0.7584
Aamp 0.22 0.2 1.072 0.7609
File 2
Adar
Ak3
Alox15b
Ampd2
Ampd3
Ankrd17
Apaf1
Aplp1
Arih1
Atg14
Aurkb
Bcl2l14
Bmp2
Brms1l
Cactin
Camta2
Cav1
Ccr5
Chfr
Clock
Cnot1
Crebrf
Crtc3
Csnk2b
Cul3
Cx3cl1
Dnaja3
Dnmt1
Dtl
Ednra
Eef1e1
Esr1
Ezr
Fam162a
Fas
Fbxo30
Fgr
Flcn
Foxp3
Frzb
Fzd6
Gdf3
Hey2
Hnf4
The desired output would be: wherever the first column matches between the two files, print all the columns of that line from the first file (including duplicates).
I've tried
awk 'NR==FNR{a[$1]=$2"\t"$3"\t"$4"\t"$5; next} { if($1 in a) { print $0,a[$1] } }' File2 File1 > output
But for some reason I'm getting just a few hits. Does anyone know why?
Read the second file first and store its 1st-column values as keys of the array arr; then, while reading the first file, print the current record whenever its 1st column exists in arr:
awk 'FNR==NR{arr[$1];next}$1 in arr' file2 file1
Advantage:
with a[$1]=$2"\t"$3"\t"$4"\t"$5, any later row with the same key overwrites the previously stored value,
but with arr[$1] we store just the unique key, and $1 in arr handles duplicate records correctly even when they exist.
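Spelled out with comments, the same command reads (purely illustrative, identical behaviour):

awk 'FNR == NR {arr[$1]; next}   # file2: record each gene name as an array key
     $1 in arr                   # file1: the bare condition prints the whole record on a match
' file2 file1 > output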

Searching File2 with 3 columns from File 1 with awk

Does anyone know how to print "Not Found" if there is no match, such that the print output will always contain the same number of lines as File 1?
To be more specific, I have two files with four columns:
File 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
File 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 803 804 0.24
This is what I am using now:
$ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $4}' file1.txt file2.txt
This generates the following output:
0.55
0.09
0.88
However, I want to get this:
0.55
0.09
0.88
Not Found
Could you please help?
Sorry if this is presented in a confusing manner; I have little experience with awk and am confused myself.
In a separate issue, I want to end up having a file that has the data from File 2 added on to File1, like so:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 Not Found
I was going to generate the file as before (let's call it file2-matches.txt), then use the paste command:
paste -d"\t" file1.txt file2-matches.txt > output.txt
But considering I have to do this matching for over 100 files, is there any more efficient way to do this that you can suggest?
Add an else clause:
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
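For the second part of the question, the lookup and the join can be done in a single pass, with no intermediate file and no paste (a sketch built on the same array idea; file2.txt is read first so its 4th column can be appended to each line of file1.txt):

awk 'NR == FNR {a[$1,$2,$3] = $4; next}                          # file2: map each key triple to its 4th column
     {print $0, (($1,$2,$3) in a ? a[$1,$2,$3] : "Not Found")}   # file1: append match or placeholder
' file2.txt file1.txt > output.txt

Add -v OFS='\t' if you want the tab-separated layout that paste -d"\t" would have produced. With over 100 files, a shell loop over the file2 variants then avoids running awk and paste once per pair.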

Gnuplot conditional plotting

I need to use gnuplot to plot wind direction values (y) against time (x) in a 2D plot using lines and points. This works fine if successive values are close together. If the values are, e.g., separated by 250 degrees, then I need a condition that checks the previous y value and does not draw a line joining the two points. This situation occurs when the wind direction is in the 280-to-20-degree sector, and the plots get messy, e.g. with a north wind. As the data is time-dependent I cannot use polar plots except at a specific point in time; I need to show the change in direction over time.
Basically the problem is:
plot y against x ; when (y2-y1)>= 180 then break/erase line joining successive points
Can anyone give me an example of how to do this?
A sample from the data file is:
2014-06-16 16:00:00 0.000 990.081 0.001 0.001 0.001 0.001 0.002 0.001 11.868 308 002.54 292 004.46 00
2014-06-16 16:10:00 0.000 990.047 0.001 0.001 0.001 0.001 0.002 0.001 11.870 303 001.57 300 002.48 00
2014-06-16 16:20:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.961 334 001.04 314 002.07 00
2014-06-16 16:30:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.818 005 001.18 020 002.14 00
2014-06-16 16:40:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.725 332 001.14 337 002.26 00
and I want to plot column 12 vs time.
You can insert a filtering condition in the using statement and give the value 1/0 (undefined) when the condition is not fulfilled. Such a point is not connected to its neighbours:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
y1 = y2 = 0
plot 'data.dat' using 1:(y1 = y2, y2 = $12, ($0 == 0 || y2 - y1 < 180) ? $12 : 1/0) with lines,\
'data.dat' using 1:12 with points
With your data sample and gnuplot version 4.6.5 I get the expected plot.
Unfortunately, with this approach you cannot use linespoints, only separate lines and points plots, and the line segment following a 1/0 point isn't drawn either.
A better approach is to use awk to insert an empty line wherever a jump occurs. In a 2D plot, points from different data blocks (separated by a single blank line) aren't connected:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n"); print}'' data.dat' using 1:12 with linespoints
In order to break the joining lines in both jump directions, two conditional tests are needed, and BOTH must include the newline statement printf("\n"):
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n") ; if (NR > 1 && y2 -y1 <= 0) printf("\n"); print}'' /Desktop/plotdata.txt' using 1:12 with linespoints
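If the intent is simply to break the line whenever successive samples differ by 180 degrees or more in either direction, a single absolute-difference test covers both cases (a sketch along the same lines; classic awk has no built-in abs(), so it is written out):

plot '< awk ''{y1 = y2; y2 = $12; d = y2 - y1; if (d < 0) d = -d; if (NR > 1 && d >= 180) printf("\n"); print}'' data.dat' using 1:12 with linespoints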
There is absolutely no need for awk. You can simply "interrupt" the lines by using a variable color for the line. For gnuplot<5.0.0 you can use 0xffffff (white, or whatever your background color is) as the line color, so the line will hardly be visible. For gnuplot>=5.0.0 you can use any transparent color, e.g. 0xff123456, so the line is truly invisible.
Data: SO24425910.dat
2014-06-16 16:00:00 330
2014-06-16 16:10:00 320
2014-06-16 16:20:00 310
2014-06-16 16:30:00 325
2014-06-16 16:40:00 090
2014-06-16 16:50:00 060
2014-06-16 17:00:00 070
2014-06-16 17:10:00 280
2014-06-16 17:20:00 290
2014-06-16 17:30:00 300
Script: (works for gnuplot>=4.4.0, March 2010)
### conditional interruption of line
reset
FILE = "SO24425910.dat"
set key noautotitle
set yrange[0:360]
set ytics 90
set grid x
set grid y
set xdata time
set timefmt "%Y-%m-%d %H:%M"
set format x "%H:%M"
set multiplot layout 2,1
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var, \
'' u 1:3 w p pt 7 lc rgb "red"
unset multiplot
### end of script