incorrect count of unique text in awk - awk

I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3, and PIK3CD are each unique, so they are counted)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file

Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator: every individual space then acts as a separator, so the runs of spaces in the input produce empty fields.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
You need alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'
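The space-as-FS pitfall is easy to reproduce in isolation; a minimal illustration (not using the question's data):

```shell
# With -F ' ', every single space is a separator, so the run of two
# spaces below yields an empty field in between; the default FS
# collapses the whole run into one separator.
printf 'a  b\n' | awk -F ' ' '{print NF}'   # 3
printf 'a  b\n' | awk '{print NF}'          # 2
```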

Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
END {print n}'
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
END { for(i in count) {print i, count[i]}}'
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2
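As a quick sanity check, the split-based one-liner can be run against a few of the sample rows (the file name is assumed):

```shell
# Recreate part of the sample input, then count unique $5 prefixes before "-".
cat > file <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
EOF
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++} END {print n}' file   # 3
```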

Related

Print first column of a file and the subtraction of two columns plus a number, changing the separator

I am trying to print the first column of this file, as well as the subtraction of the fifth and fourth columns plus 1. In addition, I want to change the separator from a space to a tab.
This is the file:
A gene . 200 500 y
H gene . 1000 2000 j
T exon 1 550 650 m
U intron . 300 400 o
My expected output is:
A 301
H 1001
T 101
U 101
I've tried:
awk '{print $1'\t'$5-$4+1}' myFile
But my output is not tab-separated; in fact, the columns are not even separated by spaces.
I also tried:
awk OFS='\t' '{print $1 $5-$4+1}' myFile
But then I get a syntax error
Do you know how can I solve this?
Thanks!
Could you please try following. Written with shown samples.
awk 'BEGIN{OFS="\t"} {print $1,(($5-$4)+1)}' Input_file
Explanation: your output is not tab separated because you haven't used a , (comma) in print; without it the values are concatenated, so it prints them like A301 and so on. Also, in case you want to set OFS at the variable level in awk, you should use awk -v OFS='\t' '{print $1,(($5-$4)+1)}' Input_file, where -v is important to let awk know that you are defining a variable's value (a TAB here). I have also used parentheses around the subtraction and addition to make the expression clearer.
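The difference the comma makes can be seen on a single sample line (illustrative):

```shell
# Without the comma the two values are concatenated; with it, OFS is emitted.
echo 'A gene . 200 500 y' | awk -v OFS='\t' '{print $1 $5-$4+1}'    # A301
echo 'A gene . 200 500 y' | awk -v OFS='\t' '{print $1, $5-$4+1}'   # A<TAB>301
```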

filtering in a text file using awk

I have a tab-separated text file like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
In the new file that I will make:
1- I want to select rows based on the 4th column: if its value is at least 10 I will keep the entire row, otherwise it will be filtered out.
2- In the next step the 4th column will be removed.
the result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
To get such results I have tried the following awk code:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
but I do not know how to implement the filtering step. Do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
Here is another approach, which might work better if you have many more fields:
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file
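Note that \w is a GNU awk extension; a sketch of the same filter-and-drop idea using only portable constructs (assuming the file is tab separated, as stated):

```shell
# Recreate a tab-separated sample, then keep rows whose last field is >= 10
# and strip that field; [^\t]*$ matches the final field in any awk.
printf 'chr1\t100502177\t100502178\t10\nchr1\t100502181\t100502182\t2\n' > input.txt
awk -F'\t' '$NF >= 10 { sub(/\t[^\t]*$/, ""); print }' input.txt
```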

using awk to handle 2 text files

I have a text file like this small example:
chr12 2904300 2904315 peak_8 167 . 8.41241 21.74573 16.71985 65
chr1 3663184 3663341 peak_9 77 . 7.86961 12.16321 7.70843 37
chr1 6284759 6285189 peak_10 220 . 13.85268 27.34231 22.06610 332
chr1 6653468 6653645 peak_11 196 . 13.59296 24.85586 19.68392 117
chr1 8934964 8935095 peak_12 130 . 8.82937 17.84867 13.03453 36
and have another file like the 2nd example:
ENSG00000004478|12|2904119|2904309
ENSG00000002933|7|150498624|150498638
ENSG00000173153|11|64073050|64073208
ENSG00000001626|7|117120017|117120148
ENSG00000003249|16|90085750|90085881
ENSG00000003056|12|9102084|9102551
The first example is tab separated and the 2nd example is | separated. I want to select only the rows from the 1st example where "the average of columns 2 and 3 in the first example is between the 3rd and 4th columns in the 2nd example and also the number of the first column in the 1st example is equal to the 2nd column of the 2nd example".
for example the output from these 2 examples would be:
chr12 2904300 2904315 peak_8 167 . 8.41241 21.74573 16.71985 65
I am trying to do that using awk:
awk 'FNR==NR{a[FNR]=($2+$3)/2;b[FNR]=$0;next} (FNR in a) && ($3<=a[FNR] && $4>=a[FNR]){print b[FNR]}' file1 FS="|" file2
but it does not work and returns nothing. Do you know how I can correct the code?
Solution:
awk 'NR==FNR ? \!((a[NR]=$1)&&(z[NR]=$0)&&(avr[NR]=($3+$2)/2)) : (($4>=avr[FNR] && avr[FNR]>=$3)&&(a[FNR]=="chr"$2)){print z[FNR] }' file1 FS="|" file2
file1 FS=" " by default, file2 FS="|"
Description of the awk ?: parts:
1) NR==FNR checks whether the file we are parsing is the first or the second
NR - The total number of input records seen so far.
FNR - The input record number in the current input file
2) if working on first file \!((a[NR]=$1)&&(z[NR]=$0)&&(avr[NR]=($3+$2)/2))
\! - the negation ! makes the pattern evaluate to false, so nothing is printed for these lines (the backslash merely protects ! from history expansion in some shells)
a[NR]=$1 - array holds the first field of file
z[NR]=$0 - array holds the lines of first file
avr[NR]=($3+$2)/2 - array holds the average requested from the first file
3) check for conditions of second file to print lines:
a) (($4>=avr[FNR] && avr[FNR]>=$3)&&(a[FNR]=="chr"$2)){print z[FNR] }
b)($4>=avr[FNR] && avr[FNR]>=$3) - check average of first file is in between the values of fields 3 & 4 of the second file
c)(a[FNR]=="chr"$2) - check that numbers in field 1(first file) & field 2(second file) are the same
d) if the conditions are true, print the line from the first file (z[FNR]) to the screen
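For comparison, a hedged alternative sketch that does not assume the two files are line-aligned: read the |-separated file first, store every (chromosome, start, end) range, then scan the other file and print rows whose column-2/3 average falls inside a matching range.

```shell
# Assumed file names file1/file2 as in the question; small samples inlined.
cat > file2 <<'EOF'
ENSG00000004478|12|2904119|2904309
ENSG00000002933|7|150498624|150498638
EOF
cat > file1 <<'EOF'
chr12 2904300 2904315 peak_8 167 . 8.41241 21.74573 16.71985 65
chr1 3663184 3663341 peak_9 77 . 7.86961 12.16321 7.70843 37
EOF
awk 'NR == FNR {                  # first file given: the | file
         split($0, f, "|")
         n++; c[n] = f[2]; lo[n] = f[3] + 0; hi[n] = f[4] + 0
         next
     }
     {                            # second file given: whitespace separated
         avg = ($2 + $3) / 2
         for (i = 1; i <= n; i++)
             if ($1 == "chr" c[i] && avg >= lo[i] && avg <= hi[i]) { print; break }
     }' file2 file1
```

This prints only the chr12 row, matching the desired output above.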

awk to add closing parenthesis if field begins with opening parenthesis

I have an awk that seemed straightforward, but I seem to be having a problem. In the file below, if $5 starts with a ( then a ) is added to the end of that string. However, if $5 does not start with a ( then nothing is done. The output is separated by a tab. The awk is almost right, but I am not sure how to add the condition to only append a ) if the field starts with a (. Thank you :).
file
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1
awk tried
awk -v OFS='\t' '{print $1,$2,$3,$4,""$5")"}' file
current output
chr7 100490775 100491863 chr7:100490775-100491863 ACHE)
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769)
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
desired output (line 1 and 2 nothing is done but line 3 and 4 have a ) added to the end)
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
$ awk -v OFS='\t' '{p = substr($5,1,1)=="(" ? ")" : ""; $5=$5 p}1' mp.txt
chr7 100490775 100491863 chr7:100490775-100491863 ACHE
chr7 100488568 100488719 chr7:100488568-100488719 ACHE;DJ051769
chr1 159174749 159174770 chr1:159174749-159174770 (ACKR1)
chr1 159175223 159176240 chr1:159175223-159176240 (ACKR1)
Check the first character of the 5th field. If it is ( append a ) to the end, otherwise append the empty string.
By appending something (where one of the somethings is "nothing" :) in all cases, we force awk to reconstitute the record with the defined (tab) output separator, which saves us from having to print the individual fields. The trailing 1 acts as an always-true pattern whose default action is simply to print the reconstituted line.
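The forced-rebuild behavior can be seen in isolation (illustrative):

```shell
# '1' alone prints $0 untouched, so OFS is never applied; assigning to any
# field forces awk to rebuild the record with OFS between fields.
echo 'a b' | awk -v OFS='\t' '1'            # a b
echo 'a b' | awk -v OFS='\t' '{$1 = $1} 1'  # a<TAB>b
```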

How to awk the filename as a column in the output?

I am trying to grep the contents of several files in a directory and append my grep matches to a single file. In the output I would also like a column containing the filename, so I can tell which file each entry was picked up from. I was trying to use awk for this, but it did not work.
for i in *_2.5kb.txt; do more $i | grep "NM_001080771" | echo `basename $i` | awk -F'[_.]' '{print $1"_"$2}' | head >> prom_genes_2.5kb.txt; done
The file names are like this (I have around 50 files):
48hrs_CT_merged_peaks_2.5kb.txt
48hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_CT_merged_peaks_2.5kb.txt
5D_CT_merged_peaks_2.5kb.txt
5D_TAMO_merged_peaks_2.5kb.txt
Each file contains several lines:
chr1 3663275 3663483 14 2.55788 2.99631 1.40767 NM_001011874 -
chr1 4481687 4488063 264 7.85098 28.25170 26.41094 NM_011441 -
chr1 5008006 5013929 243 8.20677 26.17854 24.37907 NM_021374 -
chr1 5578362 5579949 65 3.48568 7.83501 6.57570 NM_011011 +
chr1 5905702 5908002 148 5.84647 16.53171 14.88463 NM_010342 -
chr1 9288507 9290352 77 4.04459 9.12442 7.77642 NM_027671 -
chr1 9291742 9292528 142 5.74749 16.21792 14.28185 NM_027671 -
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_021511 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_175236 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NR_027664 +
When I get a match for "NM_001080771", I print the entire line to a new file; this operation is done for each file, appending the matches to one output file. I also want to add a column with the filename, as shown above in the desired output, so that I know which file each entry came from.
desired output
chr4 21610972 21618492 193 7.28409 21.01724 19.35525 NM_001080771 - 48hrs_CT
chr4 21605096 21618696 76 4.22442 9.32981 7.68131 NM_001080771 - 48hrs_TAMO
chr4 21604864 21618713 12 1.78194 2.36793 1.25883 NM_001080771 - 72hrs_CT
chr4 21610305 21615717 26 2.90579 4.47333 2.65353 NM_001080771 - 72hrs_TAMO
chr4 21609924 21618600 23 2.63778 4.0642 2.33685 NM_001080771 - 5D_CT
chr4 21609936 21618680 30 5.63778 3.0642 8.33685 NM_001080771 - 5D_TAMO
This is not working. I basically want to append a column with the filename, either as the first or the last column. How can I do that?
Or you can do it all in awk:
awk '/NM_001080771/ {print $0, FILENAME}' *_2.5kb.txt
This variant trims the filename into the desired format:
$ awk '/NM_001080771/{sub(/_merged_peaks_2\.5kb\.txt/,"",FILENAME);
print $0, FILENAME}' *_2.5kb.txt
As long as the number of files is not huge, why not just:
grep NM_001080771 *_2.5kb.txt | awk -F: '{print $2,$1}'
If you have too many files for that to work, here's a script-based approach that uses awk to append the filename:
#!/bin/sh
for i in *_2.5kb.txt; do
    < $i grep "NM_001080771" | \
        awk -v where=`basename $i` '{print $0,where}'
done
./thatscript | head > prom_genes_2.5kb.txt
Here we are using awk's -v VAR=VALUE command line feature to pass in the filename (because we are using stdin we don't have anything useful in awk's built-in FILENAME variable).
You can also use such a loop around #karakfa's elegant awk-only approach:
#!/bin/sh
for i in *_2.5kb.txt; do
    awk '/NM_001080771/ {print $0, FILENAME}' $i
done
And finally, here's a version with the desired filename munging:
#!/bin/sh
for i in *_2.5kb.txt; do
    awk -v TAG=${i%_merged_peaks_2.5kb.txt} '/NM_001080771/ {print $0, TAG}' $i
done
(this uses the shell's variable substitution ${variable%pattern} to trim pattern from the end of variable)
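The suffix-trimming expansion can be checked on one of the sample names (illustrative):

```shell
# ${variable%pattern} removes the shortest match of pattern from the end.
i='48hrs_CT_merged_peaks_2.5kb.txt'
echo "${i%_merged_peaks_2.5kb.txt}"   # 48hrs_CT
```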
Bonus
Guessing you might want to search for other strings in the future, so why don't we pass in the search string like so:
#!/bin/sh
what=${1?Need search string}
for i in *_2.5kb.txt; do
    awk -v TAG=${i%_merged_peaks_2.5kb.txt} "/${what}/"' {print $0, TAG}' $i
done
./thatscript NM_001080771 | head > prom_genes_2.5kb.txt
YET ANOTHER EDIT
Or if you have a pathological need to over-complicate and pedantically quote things, even in 5-line "throwaway" scripts:
#!/bin/bash
shopt -s nullglob
what="${1?Need search string}"
filematch="*_2.5kb.txt"
trimsuffix="_merged_peaks_2.5kb.txt"
for filename in $filematch; do
    awk -v tag="${filename%${trimsuffix}}" \
        -v what="${what}" \
        '$0 ~ what {print $0, tag}' "$filename"
done