Using awk to eliminate records that match on field 1 and fall within a defined value of field 2 - awk

I have a problem that I am trying to use awk to solve. It has application in selecting good-quality single nucleotide polymorphisms (SNPs) for placing on a SNP-chip, where there is a requirement that no SNP is within 60 bp of another SNP. The file looks like this:
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 260
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 350
comp3044_seq1 460
comp3044_seq1 600
................
I want to print only records that are not within ±60 (based on field 2) of one another when they come from the same component (based on field 1). Therefore, it doesn't matter if they are within ±60 when they are from different components. The output for the above example should look like this:
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 460
comp3044_seq1 600

http://ideone.com/h6oEI
{
    if ($1 != last1 || abs($2 - last2) > 60) print
    last1 = $1; last2 = $2
}
function abs(x) {
    return x > 0 ? x : -x
}
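The same logic as a one-liner, assuming the data above is saved in a file named snps.txt (the name is just for illustration):
awk 'function abs(x) { return x > 0 ? x : -x }
     $1 != last1 || abs($2 - last2) > 60
     { last1 = $1; last2 = $2 }' snps.txt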

Related

How to break the line every n-th time when results are written individually into a file?

I have this program, which takes the values from 2 separate files (ex1.idl and ex2.idl), performs a calculation and writes the results into a different file (ex3.txt). It all works, and the output is all of the results on one long line.
I am looking for an easy way to break the line every 10th element in the output, like this:
100 200 500 600 120 180 400 450 900 100
100 200 700 600 620 580 400 450 900 400
200 200 700 800 620 580 400 450 800 300
with open('ex1.idl') as f1, open('ex2_1.idl') as f2:
    with open('ex3.txt', 'w') as f3:
        start_line = 905  # reading from this line forward
        for i in range(start_line - 1):
            next(f1)
            next(f2)
        f1 = list(map(float, f1.read().split()))
        f2 = list(map(float, f2.read().split()))
        for result in map(lambda v: v[0] / v[1], zip(f1, f2)):
            if f3.count() % 10 != 0:  # note: file objects have no count() method, so this check fails
                f3.write(str(result) + ' ')
            else:
                f3.write(str(result) + '\n')
I thank you in advance for a solution (advice).
I figured it out; for future viewers:
for i, result in enumerate(map(lambda v: v[0] / v[1], zip(f1, f2)), start=1):
    if i % 10 != 0:
        f3.write(' ' + '{0:.9f}'.format(result))
    else:
        f3.write(' ' + '{0:.9f}'.format(result) + '\n')
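As an aside, since the rest of this roundup is awk: the same wrap-every-10th idea can be sketched as an awk one-liner, assuming the computed values arrive one per line on stdin (purely an illustration, not part of the original answer):
awk '{ printf "%s%s", $0, (NR % 10 ? " " : "\n") }'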

Replace nth and (n+1)th values in one file with same values from another file

I have two files:
f1.txt:
header 1
header 2
100
100
100
100
100
100
100
100
100
100
100
100
100
f2.txt:
header 1
header 2
10
1234
5678
10
10
2345
6789
10
10
3456
7890
10
10
desired output
f3.txt:
header 1
header 2
100
1234
5678
100
100
2345
6789
100
100
3456
7890
100
100
I want to take the values in f2.txt that occur in lines 4 & 5, then 8 & 9, then 12 & 13 (i.e., they're spaced every 6th row) and put them inside f1.txt, replacing the corresponding rows. How can I do this?
So far, I have only been able to print these values out of f2.txt, as such:
exec<f2.txt
var=$(awk 'NR % 6 == 4')
echo "$var"
This produces:
1234
2345
3456
Then when I change 4 to 5, it gives me the second set of values. So I am trying to learn how to extract the two sets of values and then put them into f1.txt. Any help will be greatly appreciated. Thanks!
Try:
paste f1.txt f2.txt | awk -F'\t' '
    NR < 3 || (NR-2)%4 == 1 || (NR-2)%4 == 0 { print $1; next }
    { print $2 }
'
Your desired output does not indicate groups of 6 lines, but instead groups of 4 lines. Perhaps the 2 header lines are throwing you off.
I'm assuming your input files do not contain tabs.
More concise awk from Ed Morton:
awk -F'\t' '{print (NR-2)%4 < 2 ? $1 : $2}'
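This still reads the pasted files, so the full pipeline, writing f3.txt as in the question, would be:
paste f1.txt f2.txt | awk -F'\t' '{print (NR-2)%4 < 2 ? $1 : $2}' > f3.txt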

Join multiple files in gawk

I have a large number of files (around 500). Each file contains two columns. The first column is the same in every file. I want to join all the files into a single file using gawk.
For example,
File 1
a 123
b 221
c 904
File 2
a 298
b 230
c 102
and so on. I want a final file like as below:
Final file
a 123 298
b 221 230
c 904 102
I have found scripts that can join two files, but I need to join multiple files.
For given sample files:
$ head f*
==> f1 <==
a 123
b 221
c 904
==> f2 <==
a 298
b 230
c 102
==> f3 <==
a 500
b 600
c 700
Method 1:
$ awk '{a[FNR]=((a[FNR])?a[FNR]FS$2:$0)}END{for(i=1;i<=FNR;i++) print a[i]}' f*
a 123 298 500
b 221 230 600
c 904 102 700
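Spread out for readability, Method 1 is doing the following (same logic):
awk '{
    if (a[FNR])                 # later files: append only the value column
        a[FNR] = a[FNR] FS $2
    else                        # first file: keep the whole line (key and value)
        a[FNR] = $0
}
END {
    for (i = 1; i <= FNR; i++)  # FNR here is the line count of the last file read
        print a[i]
}' f*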
Method 2: (Will probably be faster, as you are not loading 500 files into memory)
Using paste and awk together. (Assuming first column is same and present in all files). Doing paste f* will give you the following result:
$ paste f*
a 123 a 298 a 500
b 221 b 230 b 600
c 904 c 102 c 700
Pipe that to awk to remove extra columns.
$ paste f* | awk '{printf "%s ",$1;for(i=2;i<=NF;i+=2) printf "%s%s",$i,(i==NF?RS:FS)}'
a 123 298 500
b 221 230 600
c 904 102 700
You can re-direct the output to another file.
I have encountered this problem very frequently.
I strongly encourage you to check into the getline function in gawk.
getline var < filename
is the command syntax and can be used to solve your problem.
I would suggest utilizing this approach, as it solves the problem much more easily; typically I invest about 5 lines of code on this standard problem:
while ((j = getline x < "filename") > 0) {
    # commands involving x, such as split() and print
}
close("filename")
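For example, a minimal getline-based sketch of the whole join (my own illustration, assuming every file lists the same keys in the same order):
gawk 'BEGIN {
    while ((getline line < ARGV[1]) > 0) {      # read the first file line by line
        out = line
        for (i = 2; i < ARGC; i++) {            # pull the matching line from each remaining file
            if ((getline rec < ARGV[i]) > 0) {
                split(rec, f)
                out = out " " f[2]              # append only its second column
            }
        }
        print out
    }
}' f1 f2 f3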
You could try something like:
$ ls
f1.txt f2.txt f3.txt
$ awk '($0 !~ /#/){a[$1]=a[$1]" "$2} END {for(i in a){print i""a[i]}}' *.txt
a 123 298 500
b 221 230 600
c 904 102 700
awk 'FNR==NR { arr[$1]=$2; next } { print $1, arr[$1], $2 }' file1 file2
based on this (note that this variant joins exactly two files)

Obtaining "consensus" results from two different files using awk

I have file1 as a result of a first operation, it has the following structure
201 12 0.298231 8.8942
206 13 -0.079795 0.6367
101 34 0.86348 0.7456
301 15 0.215355 4.6378
303 16 0.244734 5.9895
and file2 as a result of a different operation and has the same type of structure.
File 2 sample
204 60 -0.246038 6.0535
304 83 -0.246209 6.0619
101 34 -0.456629 6.0826
211 36 -0.247003 6.1011
305 83 -0.247134 6.1075
206 46 -0.247485 6.1249
210 39 -0.248066 6.1537
107 41 -0.248201 6.1603
102 20 -0.248542 6.1773
I would like to select the field 1 and field 2 values whose field 3 value is higher than a threshold (0.8) in file1, and then, for those selected field 1 and 2 pairs, keep only the ones whose field 3 value in file2 is higher than another threshold in absolute value (abs(x) = 0.4).
Note that although files 1 and 2 have the same structure, the field 1 and 2 values are not the same (not the same number of lines, etc.).
Can you do this with awk?
desired output
101 34
If you combine awk with unix commands you can do the following
sort file1.txt > sorted1.txt
sort file2.txt > sorted2.txt
Sorting will allow you to use join on the first field (which I assume is unique). After the join, field 3 of file1 is $3 and field 3 of file2 is $6. Using awk you can write the following:
join sorted1.txt sorted2.txt | awk 'function abs(value) { return (value < 0 ? -value : value) } $3 >= 0.8 && abs($6) >= 0.4 { print $1 "\t" $2 }'
In essence, in the awk you first write a function to deal with absolute values, then you simply ask it to print fields 1 and 2 for the lines that meet the criteria you detailed on $3 and $6 (formerly field 3 of file1 and file2, respectively).
Hope this helps...
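For comparison, a join-free sketch in awk alone, keying on fields 1 and 2 together (my own variant under the same thresholds, not part of the answer above):
awk 'function abs(v) { return v < 0 ? -v : v }
     NR == FNR { if ($3 >= 0.8) keep[$1, $2] = 1; next }
     (($1, $2) in keep) && abs($3) >= 0.4 { print $1, $2 }' file1 file2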

How to Add Column with Percentage

I would like to calculate percentage of value in each line out of all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but when I calculated the total for all lines I didn't know how to preserve the rest of each line unchanged. Thanks a lot for your help!
Here you go, a two-pass awk solution -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output, then just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakout of pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that keeps track of the number of records (by default, lines) read from the current file; it resets at the start of each file. So FNR in our case runs up to 4 on each pass. NR is similar to FNR, but it never resets; it continues to grow, so NR in our case reaches 8 by the end of the second pass.
The pattern NR==FNR is therefore true only for the first 4 records (the first pass), and that's exactly what we want. While reading those 4 records, we accumulate the total in a variable a. Notice that we did not initialize it; in awk we don't have to. However, this would break if the entire column 2 sums to 0, so you can handle that with an if statement in the second action, i.e. do the division only if a > 0 and otherwise report a division by zero.
next is needed because we don't want the second pattern {action} statement to execute during the first pass. next tells awk to stop further actions and move to the next record.
Once the four records are parsed, the second pattern {action} takes over on the second pass, and it is straightforward: it computes the percentage and prints columns 1 and 2 with the percentage next to them.
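A guarded variant along those lines might look like this (a sketch of the division-by-zero check mentioned above):
awk 'NR == FNR { a += $2; next }
     { print $1, $2, (a > 0 ? ($2 / a) * 100 : "total is 0") }' file file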
Note: As #lhf mentioned in the comment, this one-liner will only work as long as you have the data set in a file. It won't work if you pass data through a pipe.
In the comments, there is a discussion about ways to make this awk one-liner take input from a pipe instead of a file. The only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Arrays in awk are associative and are not kept in any particular order, i.e. pulling the values out of an array will not necessarily match the order they went in. So if that is OK, then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
If you want a literal percent sign in the printf output, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is a better way, but I would pass the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
    ## Tab as field separator.
    FS = "\t";
}

## First pass of input file. Get total from second field.
ARGIND == 1 {
    total += $2;
    next;
}

## Second pass of input file. Print each original line and percentage as third field.
{
    printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Running the script on my Linux box:
gawk -f script.awk infile infile
And the result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
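Note that ARGIND is gawk-specific. With a POSIX awk, the same two-pass structure can use the FNR == NR idiom shown in the earlier answers:
awk -F'\t' 'NR == FNR { total += $2; next }
            { printf "%s\t%2.2f\n", $0, $2 * 100 / total }' infile infile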