How to add the filename as a column in the output? - awk

I am trying to grep the contents of several files in a directory and append my grep matches to a single file. In the output I would also like a column containing the filename, so I can tell which file each entry was picked up from. I was trying to use awk for this, but it did not work.
for i in *_2.5kb.txt; do more $i | grep "NM_001080771" | echo `basename $i` | awk -F'[_.]' '{print $1"_"$2}' | head >> prom_genes_2.5kb.txt; done
The file names are like this; I have around 50 files:
48hrs_CT_merged_peaks_2.5kb.txt
48hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_TAMO_merged_peaks_2.5kb.txt
72hrs_CT_merged_peaks_2.5kb.txt
5D_CT_merged_peaks_2.5kb.txt
5D_TAMO_merged_peaks_2.5kb.txt
Each file contains several lines:
chr1 3663275 3663483 14 2.55788 2.99631 1.40767 NM_001011874 -
chr1 4481687 4488063 264 7.85098 28.25170 26.41094 NM_011441 -
chr1 5008006 5013929 243 8.20677 26.17854 24.37907 NM_021374 -
chr1 5578362 5579949 65 3.48568 7.83501 6.57570 NM_011011 +
chr1 5905702 5908002 148 5.84647 16.53171 14.88463 NM_010342 -
chr1 9288507 9290352 77 4.04459 9.12442 7.77642 NM_027671 -
chr1 9291742 9292528 142 5.74749 16.21792 14.28185 NM_027671 -
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_021511 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NM_175236 +
chr1 9535689 9536176 72 4.45286 8.82567 7.29563 NR_027664 +
When I get a match for "NM_001080771" I print the entire line to a new file; this operation is done for each file, appending the matches to one output file. I also want to add a column with the filename, as shown below in the final output, so that I know which file each entry came from.
desired output
chr4 21610972 21618492 193 7.28409 21.01724 19.35525 NM_001080771 - 48hrs_CT
chr4 21605096 21618696 76 4.22442 9.32981 7.68131 NM_001080771 - 48hrs_TAMO
chr4 21604864 21618713 12 1.78194 2.36793 1.25883 NM_001080771 - 72hrs_CT
chr4 21610305 21615717 26 2.90579 4.47333 2.65353 NM_001080771 - 72hrs_TAMO
chr4 21609924 21618600 23 2.63778 4.0642 2.33685 NM_001080771 - 5D_CT
chr4 21609936 21618680 30 5.63778 3.0642 8.33685 NM_001080771 - 5D_TAMO
This is not working. Basically, I want to append a column with the filename, as either the first or the last column. How do I do that?

Or you can do it all in awk:
awk '/NM_001080771/ {print $0, FILENAME}' *_2.5kb.txt
This trims the filename to the desired format:
$ awk '/NM_001080771/{sub(/_merged_peaks_2.5kb.txt/,"",FILENAME);
print $0, FILENAME}' *_2.5kb.txt

As long as the number of files is not huge, why not just:
grep NM_001080771 *_2.5kb.txt | awk -F: '{print $2,$1}'
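To see what that -F: split is doing, here is a sketch that uses printf to stand in for grep's filename:match output (note this breaks if a filename itself contains a colon):

```shell
# grep prefixes each match with "filename:" when given multiple files;
# -F: splits that prefix off so it can be moved to the end of the line
printf 'a.txt:chr4 21610972 NM_001080771\n' | awk -F: '{print $2, $1}'
# prints: chr4 21610972 NM_001080771 a.txt
```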
If you have too many files for that to work, here's a script-based approach that uses awk to append the filename:
#!/bin/sh
for i in *_2.5kb.txt; do
< $i grep "NM_001080771" | \
awk -v where=`basename $i` '{print $0,where}'
done
./thatscript | head > prom_genes_2.5kb.txt
Here we are using awk's -v VAR=VALUE command line feature to pass in the filename (because we are using stdin we don't have anything useful in awk's built-in FILENAME variable).
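A minimal sketch of the -v mechanism on its own, separate from the script above (the filename here is made up):

```shell
# -v var=value assigns a shell value to an awk variable before the program runs
echo "chr1 100 NM_000001" | awk -v where="somefile.txt" '{print $0, where}'
# prints: chr1 100 NM_000001 somefile.txt
```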
You can also use such a loop around #karakfa's elegant awk-only approach:
#!/bin/sh
for i in *_2.5kb.txt; do
awk '/NM_001080771/ {print $0, FILENAME}' $i
done
And finally, here's a version with the desired filename munging:
#!/bin/sh
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} '/NM_001080771/ {print $0, TAG}' $i
done
(this uses the shell's variable substitution ${variable%pattern} to trim pattern from the end of variable)
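In isolation, the suffix-trimming expansion looks like this (using one of the filenames from the question):

```shell
# ${variable%pattern} removes the shortest match of pattern from the END of $variable
i="48hrs_CT_merged_peaks_2.5kb.txt"
echo "${i%_merged_peaks_2.5kb.txt}"
# prints: 48hrs_CT
```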
Bonus
Guessing you might want to search for other strings in the future, so why don't we pass in the search string like so:
#!/bin/sh
what=${1?Need search string}
for i in *_2.5kb.txt; do
awk -v TAG=${i%_merged_peaks_2.5kb.txt} /${what}/' {print $0, TAG}' $i
done
./thatscript NM_001080771 | head > prom_genes_2.5kb.txt
YET ANOTHER EDIT
Or if you have a pathological need to over-complicate and pedantically quote things, even in 5-line "throwaway" scripts:
#!/bin/bash
shopt -s nullglob
what="${1?Need search string}"
filematch="*_2.5kb.txt"
trimsuffix="_merged_peaks_2.5kb.txt"
for filename in $filematch; do
awk -v tag="${filename%${trimsuffix}}" \
-v what="${what}" \
'$0 ~ what {print $0, tag}' "$filename"
done

Related

filtering in a text file using awk

I have a tab-separated text file, like this small example:
chr1 100499714 100499715 1
chr1 100502177 100502178 10
chr1 100502181 100502182 2
chr1 100502191 100502192 18
chr1 100502203 100502204 45
In the new file that I will make:
1- I want to select rows based on the 4th column: if the value of the 4th column is at least 10, I keep the entire row; otherwise it is filtered out.
2- in the next step the 4th column will be removed.
the result will look like this:
chr1 100502177 100502178
chr1 100502191 100502192
chr1 100502203 100502204
To get such results I have tried the following code in awk:
cat input.txt | awk '{print $1 "\t" $2 "\t" $3}' > out.txt
but I do not know how to implement the filtering step. Do you know how to fix the code?
Just put the condition before output:
cat input.txt | awk '$4 >= 10 {print $1 "\t" $2 "\t" $3}' > out.txt
Here is another, which might work better if you have many more fields:
$ awk '$NF>=10{sub(/\t\w+$/,""); print}' file

How to get formatted ints from awk print

I want the number of lines of my Python files, with relative paths.
I get that like this:
$ find ./ -name "*.py" -exec wc -l {} \;| awk '{print $1, $2}'
29 ./setup.py
28 ./proj_one/setup.py
896 ./proj_one/proj_one/data_ns.py
169 ./proj_one/proj_one/lib.py
310 ./proj_one/proj_one/base.py
0 ./proj_one/proj_one/__init__.py
72 ./proj_one/tests/lib_test.py
How could I get formatted (right-aligned) ints, like this:
        29 ./setup.py
        28 ./proj_one/setup.py
       896 ./proj_one/proj_one/data_ns.py
       169 ./proj_one/proj_one/lib.py
       310 ./proj_one/proj_one/base.py
         0 ./proj_one/proj_one/__init__.py
        72 ./proj_one/tests/lib_test.py
You can use printf with a width format modifier to make a formatted column:
$ find ./ -name "*.py" -exec wc -l {} \;| awk '{printf "%10s %s\n", $1, $2}'
On most platforms, you can use the ' flag to print with comma (thousands) separators (useful if you have BIG files), but the quoting can be challenging on the command line:
$ echo 10000000 | awk '{printf "%'\''d\n", $1}'
10,000,000
How about just piping the output of find to column -t?
The column utility formats its input into multiple columns. From the man page:
-t: Determine the number of columns the input contains and create a table.
za$ find . -name "*rb" -exec wc -l {} \; | column -t
20 ./dirIO.rb
314 ./file_info.rb
53 ./file_santizer.rb
154 ./file_writer.rb
58 ./temp/maps.rb
248 ./usefu_ruby.rb

awk not printing header in output file

The awk below seems to work great, with one issue: the header lines do not print in the output. I have been staring at this for a while with no luck. What am I missing? Thank you :).
awk
awk 'NR==FNR{for (i=1;i<=NF;i++) a[$i];next} FNR==1 || ($7 in a)' /home/panels/file1 test.txt |
awk '{split($2,a,"-"); print a[1] "\t" $0}' |
sort |
cut -f2-> /home/panels/test_filtered.vcf
test.txt (used in the awk to produce the filtered output -- only a small portion of the data, but the tab-delimited format is shown)
Chr Start End Ref Alt
chr1 949608 949608 G A
current output (has no header)
chr1 949608 949608 G A
desired output (has header)
Chr Start End Ref Alt
chr1 949608 949608 G A
It looks like the header is going to sort, and getting mixed in with your data. A simple solution is to do:
... | { read line; echo $line; sort; } |
to prevent the first line from going to sort.
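To see the trick in isolation, here is a toy example (not your actual data): read consumes the header line from the shared pipe before sort sees the rest of the stream.

```shell
# read eats the first line; sort only ever sees the remaining data lines,
# so the header stays on top of the output
printf 'Chr\tStart\nchr2\t500\nchr1\t300\n' | { IFS= read -r line; printf '%s\n' "$line"; sort; }
# prints:
# Chr	Start
# chr1	300
# chr2	500
```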
You can combine your scripts, move the sort into awk, and handle the header this way:
$ awk 'NR==FNR{for(i=1;i<=NF;i++)a[$i]; next}
FNR==1{print "dummy\t" $0; next}
$7 in a{split($2,b,"-"); print b[1] "\t" $0 | "sort" }' file1 file2 |
cut -f2

incorrect count of unique text in awk

I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3, PIK3CD are unique, and are counted)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file
Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
You need the alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'.
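You can see the loss of the default behavior with a one-liner on a toy line containing two spaces between fields. (Here -F'[ ]' is used to force literal-space splitting, since a single plain space given as FS is itself treated as the special default.)

```shell
# default FS: runs of whitespace act as one separator -> 2 fields
printf 'a  b\n' | awk '{print NF}'            # prints 2
# a space inside a bracket expression: each space separates -> empty middle field
printf 'a  b\n' | awk -F'[ ]' '{print NF}'    # prints 3
# alternation with + restores collapsing of runs
printf 'a  b\n' | awk -F'-| +' '{print NF}'   # prints 2
```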
Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
END {print n}'
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
END { for(i in count) {print i, count[i]}}'
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2

take out specific columns from multiple files

I have multiple files that look like the one below. They are tab-separated. For all the files I would like to take out column 1 and the column that starts with XF:Z:. This will give me output 1.
The file names are htseqoutput*.sam.sam, where * varies. I am not sure about the awk function to use, and whether the for-loop is correct.
for f in htseqoutput*.sam.sam
do
awk ????? "$f" > “out${f#htseqoutput}”
done
input example
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 16 chr22 39715068 24 51M * 0 0 GACAATCAGCACACAGTTCCTGTCCGCCCGTCAATAAGTTCATCATCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:18T31G0 YT:Z:UU XF:Z:SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 16 chr19 4724687 40 33M * 0 0 AGGCGAATGTGATAACCGCTACACTAAGGAAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:26C6 YT:Z:UU XF:Z:tRNA
TCGACTCCCGGTGTGGGAACC_0 16 chr13 45492060 23 21M * 0 0 GGTTCCCACACCGGGAGTCGA IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0C20 YT:Z:UU XF:Z:tRNA
output 1:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
Seems like you could just use sed for this:
sed -r 's/^([ACGT0-9_]+).*XF:Z:([[:alnum:]]+).*/\1\t\2/' file
This captures the part at the start of the line and the alphanumeric part following XF:Z: and outputs them, separated by a tab character. One potential advantage of this approach is that it will work independently of the position of the XF:Z: string.
Your loop looks OK (you can use this sed command in place of the awk part), but be careful with your quotes: " should be used, not “ or ”.
Alternatively, if you prefer awk (and assuming that the bit you're interested in is always part of the last field), you can use a custom field separator:
awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}' file
This optionally adds the XF:Z: part to the field separator, so that it is removed from the start of the last field.
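A toy demonstration of the optional group, using a made-up shortened line (and assuming, as above, that the XF:Z: field is last):

```shell
# FS matches whitespace optionally followed by the literal XF:Z: prefix,
# so the prefix is consumed as part of the separator before the last field
printf 'READ_1 16 chr19 XF:Z:tRNA\n' | awk -F'[[:space:]](XF:Z:)?' -v OFS='\t' '{print $1, $NF}'
# prints: READ_1	tRNA
```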
You can try this, if the column with "XF:Z:" is always at the end:
awk 'BEGIN{OFS="\t"}{n=split($NF,a,":"); print $1, a[n]}' file.sam
You get:
AACAGATGATGAACTTATTGACGGGCGGACAGGAACTGTGTGCTGATTGTC_11 SNORD43
GTTTCCTTAGTGTAGCGGTTATCACATTCGCCT_0 tRNA
TCGACTCCCGGTGTGGGAACC_0 tRNA
or, if this column is at a variable position in each file:
awk 'BEGIN{OFS="\t"}
FNR==1{
for(i=1;i<=NF;i++){
if($i ~ /^XF:Z:/) break
}
}
{n=split($i,a,":"); print $1, a[n]}' file.sam