separate number range from a file using awk - awk

I have a file with 5 columns and I want to separate the rows using a number range on the fifth column as the criterion. Example:
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 4715104 4837854 NM_001042478 0
chr1 4715104 4843851 NM_018836 0
chr1 3728644 3773797 NM_014704 4.61425
chr1 3773830 3801993 NM_004402 4.39674
chr1 3773830 3801993 NM_001282669 0
chr1 6245079 6259679 NM_000983 75.1769
chr1 6304251 6305638 NM_001024598 0
chr1 6307405 6321035 NM_207370 0.273874
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6266188 6281359 NM_207396 0
chr1 6281252 6296044 NM_012405 14.0752
I want to remove the rows with 0 from the list, then sort out the numbers between 0.01 and 0.27, and so on.
I am new to shell programming. Can someone help with awk?
Thanks.

As you are new to shell programming, you may not be aware of grep and sort, which would be simpler for this job.
If you are hell-bent on awk as your tool of choice, please just disregard my answer.
I would do it like this:
grep -v '\s0$' file | sort -k 5,5 -g
chr1 6161846 6240194 NM_015557 0.0149477
chr1 6307405 6321035 NM_207370 0.273874
chr1 2120987 2144159 NM_001282670 0.48106
chr1 2123333 2126214 NM_001256946 2.71647
chr1 3773830 3801993 NM_004402 4.39674
chr1 3728644 3773797 NM_014704 4.61425
chr1 6281252 6296044 NM_012405 14.0752
chr1 6245079 6259679 NM_000983 75.1769
The grep with -v inverts the search and looks for lines not containing a whitespace character followed by a zero at the end of the line. The sort sorts the data on column 5, and does a general numeric sort because of the -g. (Note that \s is a GNU grep extension; [[:space:]] is the portable spelling.)
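Since the question specifically asked for awk, here is a sketch of an equivalent pipeline: awk drops the zero rows and sort orders the rest, with the same -g general numeric sort.

```shell
# Sketch: an awk-only version of the same idea, on a trimmed copy of the sample.
# awk keeps rows whose 5th field is non-zero; sort then orders them numerically.
printf '%s\n' \
  'chr1 4715104 4837854 NM_001042478 0' \
  'chr1 6245079 6259679 NM_000983 75.1769' \
  'chr1 6161846 6240194 NM_015557 0.0149477' |
awk '$5 != 0' | sort -k 5,5 -g
# prints the 0.0149477 line first, then the 75.1769 line
```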

If you are trying to select the rows in which $5 is non-zero and within a certain range, then indeed awk makes sense, and the following may be close to what you're after:
awk -v min=0.01 -v max=0.27 '
$5 == 0 { next }
min <= $5 && $5 <= max { print }' file
Here, the call to awk has been parameterized to suggest how these few lines can be adapted for more general usage.
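One way to make that reuse concrete is to wrap the call in a small shell function (a sketch; the name range_filter and the temp file name are made up here):

```shell
# Hypothetical helper: print rows of a file whose 5th column is non-zero
# and lies in the inclusive range [MIN, MAX].
range_filter() {
  awk -v min="$1" -v max="$2" '
    $5 == 0 { next }
    min <= $5 && $5 <= max { print }' "$3"
}

printf '%s\n' \
  'chr1 6161846 6240194 NM_015557 0.0149477' \
  'chr1 6307405 6321035 NM_207370 0.273874' \
  'chr1 6266188 6281359 NM_207396 0' > data.txt
range_filter 0.01 0.27 data.txt
# only the NM_015557 row is printed: 0 is skipped and 0.273874 > 0.27
```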

Related

Trying to print previous line in awk, but instead appears to print current line twice

I'm trying to use awk to imitate uniq -d on specific fields, printing both the line currently being read and the previous line, using the first solution from here, but it appears to print the same line twice.
Here's a sample of the stuff in the file.
130 chr1 7237 7238 0k9imgkt
135 chr1 7637 7637 b9gko
138 chr1 7908 7908 kob9g
139 chr1 8045 8045 34e5rg 4r
151 chr1 8329 8329 b
151 chr1 8346 8346 345y46htyh
151 chr1 8346 8346 76jtuj
152 chr1 8358 8358 asfge
Here's the line I used. I'm trying to compare rows based on the second, third, and fourth fields; if two or more rows are identical in those fields, print the entirety of those rows. Also, it's safe to assume that the rows are sorted based on fields 1, 2, and 3.
awk '{prev = $0; ++array[$2$3$4]; if(array[$2$3$4] == 2) {print; curr = $0; $0 = prev; print; $0 = curr}}' file
Here's what I want the output to be.
151 chr1 8346 8346 345y46htyh
151 chr1 8346 8346 76jtuj
And here's what the output is.
151 chr1 8346 8346 76jtuj
151 chr1 8346 8346 76jtuj
If I understood your question correctly, please try the following.
awk 'FNR==NR{a[$2$3$4]++;next} a[($2$3$4)]>1' Input_file Input_file
OR
awk '{k=$2 FS $3 FS $4} FNR==NR{a[k]++;next} a[k]>1' Input_file Input_file
Output will be as follows.
151 chr1 8346 8346 345y46htyh
151 chr1 8346 8346 76jtuj
You are printing the same line twice: prev = $0 is assigned at the top of the block, so by the time you restore $0 = prev it already holds the current line, and both print statements output the same record. One of the prints should output the previous record, which means prev has to be assigned at the end of the block (from the record just processed), not at the start.
Perhaps you are looking for something like
awk '++array[$2$3$4] >= 2 {
if(prev)print prev;
print;
prev = ""; next }
{ prev = $0 }' file
If that doesn't do what you want, maybe edit your question to describe in more detail what you hope your current script should do; code which doesn't do what you want isn't really a good way to communicate what you do want.
Here is another awk solution that doesn't read the input file twice and works even if your input is not sorted.
awk '(k = $2 FS $3 FS $4) in a {
print a[k] $0; a[k] = ""; next
} { a[k] = $0 ORS }' file
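As a quick sanity check (a sketch using a trimmed copy of the sample), the one-pass version prints exactly the duplicated pair:

```shell
# When a key ($2 $3 $4) is seen again, print the stored first occurrence
# followed by the current line; otherwise stash the line under its key.
printf '%s\n' \
  '151 chr1 8329 8329 b' \
  '151 chr1 8346 8346 345y46htyh' \
  '151 chr1 8346 8346 76jtuj' \
  '152 chr1 8358 8358 asfge' |
awk '(k = $2 FS $3 FS $4) in a {
  print a[k] $0; a[k] = ""; next
} { a[k] = $0 ORS }'
# prints the two "chr1 8346 8346" lines, nothing else
```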

filtering in a text file using awk

I have a tab-separated text file, like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
in the new file that I will make:
1- I want to select rows based on the 4th column: if the value of the 4th column is at least 10, I will keep the entire row; otherwise it will be filtered out.
2- in the next step the 4th column will be removed.
the result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
to get such results I have tried the following code in awk:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
but I do not know how to implement the filtering step. do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
Here is another, which might work better if you have many more fields:
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file
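As a variant (a sketch), the same filter can be written without the extra cat, setting FS and OFS to a tab so the printed fields stay tab-separated:

```shell
# Build a small tab-separated sample, then keep only rows whose 4th field
# is at least 10 and print the first three fields, still tab-separated.
printf 'chr1\t100499714\t100499715\t1\nchr1\t100502177\t100502178\t10\nchr1\t100502203\t100502204\t45\n' > input.txt
awk 'BEGIN { FS = OFS = "\t" } $4 >= 10 { print $1, $2, $3 }' input.txt
```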

awk to update unknown values in file using range in another

I am trying to modify an awk script kindly provided by @karakfa to update all the unknown values in $6 of file2 when the $4 value in file2 is within the range of $1 of file1. If there is already a value in $6 other than unknown, that line is skipped and the next line is processed. In my awk attempt below the final output is 6 tab-delimited fields. Currently the awk runs but the unknown values are not updated and I cannot seem to solve this. Thank you :)
file1 (space delimited)
chr1:4714792-4852594 AJAP1
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
chr1:15783224-15798586 CELA2A
file2 (tab-delimited)
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
desired output
--- the second and fourth unknown values are updated based on the range that they fall in $1 of file1
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
current output with awk
awk -v OFS='\t' 'NR==FNR{
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . unknown
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . unknown
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
edit:
awk -v OFS='\t' 'NR==FNR{split($1,a,/[:-]/)
rstart[a[1]]=a[2]
rend[a[1]]=a[3]
value[a[1]]=$2
next}
$6~/unknown/ && $2>=rstart[$1] && $3<=rend[$1] {sub(/unknown/,value[$1],$6)}1' hg19.txt input | column -t
possible solution to issue 2:
----- matching $2 values in file1 are combined, with the first line's rstart[a[1]]=a[2] being the start and the last line's rend[a[1]]=a[3] being the end
chr1:4714792-4837854 AJAP1
chr1:9160364-9189229 GPR157
chr1:15783223-15798586 CELA2A
Here is another script; it's inefficient (it does a linear scan instead of a more efficient search) but it works and is simpler.
$ awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
rstart[k,c[k]]=a[2];
rend[k,c[k]]=a[3];
value[k,c[k]]=$2;
next}
$6=="unknown" && ($1 in c) {k=$1;
for(i=1; i<=c[k]; i++)
if($2>=rstart[k,i] && $3<=rend[k,i])
{$6=value[k,i]; break}}1' file1 file2 |
column -t
Since it's possible to have more than one match, this one uses the first found.
chr1 3649533 3649653 chr1:3649533-3649653 . TP73
chr1 4736396 4736516 chr1:4736396-4736516 . AJAP1
chr1 5923314 5923434 chr1:5923314-5923434 . NPHP4
chr1 9161991 9162111 chr1:9161991-9162111 . GPR157
chr1 9162050 9162051 chr1:9162050-9162051 . rs6697376
note that the fourth record also matches based on the rules.
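To sanity-check, the same script can be run on a trimmed copy of the two files (a sketch; the temp file names f1.txt and f2.txt are made up, and column -t is omitted so the output stays tab-separated):

```shell
# file1-style ranges (space-delimited) and file2-style records (tab-delimited).
printf '%s\n' \
  'chr1:4714792-4852594 AJAP1' \
  'chr1:9160364-9189229 GPR157' > f1.txt
printf 'chr1\t4736396\t4736516\tchr1:4736396-4736516\t.\tunknown\nchr1\t9161991\t9162111\tchr1:9161991-9162111\t.\tunknown\n' > f2.txt
# First pass stores every range per chromosome; second pass replaces
# "unknown" with the gene of the first enclosing range.
awk -v OFS='\t' 'NR==FNR{split($1,a,"[:-]"); k=a[1]; c[k]++;
  rstart[k,c[k]]=a[2]; rend[k,c[k]]=a[3]; value[k,c[k]]=$2; next}
$6=="unknown" && ($1 in c) {k=$1;
  for(i=1; i<=c[k]; i++)
    if($2>=rstart[k,i] && $3<=rend[k,i]) {$6=value[k,i]; break}}1' f1.txt f2.txt
# the first row gains AJAP1, the second GPR157
```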

awk to add closing parenthesis if field begins with opening parenthesis

I have an awk that seemed straightforward, but I seem to be having a problem. In the file below, if $5 starts with a ( then a ) is added to the end of that string. However, if $5 does not start with a ( then nothing is done. The output is separated by a tab. The awk is almost right, but I am not sure how to add the condition to only add a ) if the field starts with a (. Thank you :).
file
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1
awk tried
awk -v OFS='\t' '{print $1,$2,$3,$4,""$5")"}' file
current output
chr7 100490775 100491863 chr7:100490775-100491863 ACHE)
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769)
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
desired output (line 1 and 2 nothing is done but line 3 and 4 have a ) added to the end)
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
$ awk -v OFS='\t' '{p = substr($5,1,1)=="(" ? ")" : ""; $5=$5 p}1' mp.txt
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
Check the first character of the 5th field. If it is (, append a ) to the end; otherwise append the empty string.
By appending something (where one of the somethings is "nothing" :) in all cases, we force awk to reconstitute the record with the defined (tab) output separator, which saves us from having to print the individual fields. The trailing 1 acts as an always-true pattern whose default action is simply to print the reconstituted line.
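The same check can be written with a regex match instead of substr (an equivalent sketch; as above, assigning $5 in every case forces the rebuild with the tab OFS):

```shell
# Append ")" only when the 5th field starts with "("; always reassign $5
# so awk reconstitutes every record with the tab output separator.
printf '%s\n' \
  'chr7 100490775 100491863 chr7:100490775-100491863 ACHE' \
  'chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1' |
awk -v OFS='\t' '{ $5 = ($5 ~ /^\(/) ? $5 ")" : $5 } 1'
# ACHE is left alone; (ACKR1 becomes (ACKR1)
```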

incorrect count of unique text in awk

I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3, and PIK3CD are unique and counted)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file
Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
You need the alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+' instead.
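The difference is easy to see on a tiny example: with -F '[- ]', two adjacent spaces produce an empty field between them, while '-| +' treats the whole run as one separator.

```shell
# "a  b" has two spaces between the words.
printf 'a  b\n' | awk -F '[- ]' '{ print NF }'   # prints 3: "a", "", "b"
printf 'a  b\n' | awk -F '-| +' '{ print NF }'   # prints 2: "a", "b"
```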
Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
END {print n}' file
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
END { for(i in count) {print i, count[i]}}' file
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2