Grep specific columns, count and write output file - awk

I'm trying to summarize my data and count specific items.
These are human sequencing data, and hence the files are very large.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL PRIMARY
1 12867 . C A 5 q40;bldp;blq SS=1;VT=SNP; GT:DP:AD:BQ:MQ:SB:FA:SS:SSC:MQA 1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0 1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
To simplify, the data looks something like this:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11
x x x x x x x SS=1 x 1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0 1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
x x x x x x x SS=2 x 1/0:8:7,1:36,39:0:0.0,0.0:0.125:0:5:14.9,16.0 1/0:2:2,0:33,0:0:0.0,0:0.0:1:5:16.0,0
First, I need to count how many of each different SS value there are in column 8. There are 5 different types of SS, i.e. SS=1 ... SS=5.
This could be done with grep, so I tried:
grep SS=1 file1.vcf | wc -l
grep SS=2 file1.vcf | wc -l
Then I want to count how many "0", "1", and "2" values there are in columns 10 and 11, at the position after the 7th colon (:).
This is the part I'm not sure how to do. I was thinking about using awk, but I'm not sure how to make it look at a specific position within a field (after the 7th colon):
awk -F ':' '$11==1' # this only specifies the column, not a specific position within it
I have 246 files that I want to process in exactly the same way. How can I apply this to all my files and write the counts to a text file? I only know how to do it one by one, and I could probably cat the count files together at the end.
for f in *.vcf; do grep SS=1 "$f" | wc -l > ${f}SS1.txt; done

To count how many different values you have in column 8, you can use the typical approach:
$ awk -F"\t" 'NR>1{a[$8]++} END{for (i in a) print i,a[i]}' file
SS=1 1
SS=2 1
To count how many different values you have in the 8th position of a :-separated string from the 10th and 11th fields, you can use split() to slice the string into blocks, and then use the same approach as above.
$ awk -F"\t" 'NR>1{split($10,a,":"); split($11,b,":"); count10[a[8]]++; count11[b[8]]++} END {for (i in count10) print i, count10[i]; for (i in count11) print i, count11[i]}' a
0 2
1 2
You can put all together to get something like:
$ awk -F"\t" 'NR>1{count8[$8]++; split($10,a,":"); split($11,b,":"); count10[a[8]]++; count11[b[8]]++} END {for (i in count8) print i, count8[i]; for (i in count10) print i, count10[i]; for (i in count11) print i, count11[i]}' file
SS=1 1
SS=2 1
0 2
1 2
If you want to do this for many files, you can either use the loop or, better, work with FILENAME and ENDFILE to flush the stored information after each file. Try it out and let us know if you face any problems.
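For example, a minimal GNU-awk sketch of that idea (ENDFILE and deleting whole arrays are gawk extensions; counts.txt is just an assumed output name):
gawk -F"\t" '
!/^#/ {               # skip the #-prefixed VCF header lines
    count8[$8]++
    split($10,a,":"); split($11,b,":")
    count10[a[8]]++; count11[b[8]]++
}
ENDFILE {             # flush the counts once per input file
    for (i in count8)  print FILENAME, "col8", i, count8[i]
    for (i in count10) print FILENAME, "col10", i, count10[i]
    for (i in count11) print FILENAME, "col11", i, count11[i]
    delete count8; delete count10; delete count11
}' *.vcf > counts.txt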

Related

awk compare two elements in the same line with regular expression

I have very long files in which I have to compare two chromosome numbers present on the same line. I would like to use awk to create a file that keeps only the lines where the chromosome numbers are different.
Here is an example of my file:
CHROM ALT
1 ]1:1234567]T
1 T[1:2345678[
1 A[12:3456789[
2 etc...
In this example, I wish to compare the number of the chromosome (here '1' in the CHROM column) with the number between the first bracket ([ or ]) and the ":" symbol. If these numbers are different, I wish to print the corresponding line.
Here, the result should be like this:
1 A[12:3456789[
Thank you for your help.
$ awk -F'[][]' '$1+0 != $2+0' file
1 A[12:3456789[
2 etc...
This requires GNU awk for the 3-argument match() function:
gawk 'match($2, /[][]([0-9]+):/, a) && $1 != a[1]' file
Thanks again for the different answers.
Here is how my data looks, with several columns:
CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789[
2 ... ... . ...
My question is: how do I modify the previous code when I have several columns?
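With the ALT string now in the fifth column, a minimal adaptation of the GNU awk answer above (a sketch, untested against your full data) is to run the same match() against $5 instead of $2:
gawk 'match($5, /[][]([0-9]+):/, a) && $1 != a[1]' file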

Count entries based on exponential notation values with pure awk

I am trying to count the entries that are below the e-value threshold of 1e-5 in my tab-delimited data file, which looks something like the table below.
col1 col2 col3 eval
entry1 - - 1e-10
entry2 - - -
entry3 - - 0.001
I used this code:
$: awk -F"\t" '{print $4}' table.txt | awk '($1 + 0) < 1e-5' | grep [0-9*] | wc -l
This outputs:
$: 1
While this works, I would like to improve the command into something pure awk. I would love to know how to do this in awk. Also, I would like to know how to print the lines that satisfy the threshold, if that is possible. Thanks for helping!
This is probably the best way:
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){c++}END{print c}' file
This does the following:
($4+0==$4): the first condition checks that $4 is a number.
($4<1E-5): the second condition checks that the value is below the threshold.
&&: only if both conditions are satisfied is the counter c incremented.
At the END, the value of c is printed.
Be aware that the grep in your original command can produce false positives: if $4 read like XXX1XXX (i.e. a string with a digit in it) or XXX*XXX (i.e. a string with an asterisk in it), it would be counted as a match.
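Since the question also asked how to print the qualifying lines, a small variation of the same command (a sketch along the same lines) prints each match before counting:
awk -F"\t" '($4+0==$4) && ($4 < 1E-5){print; c++} END{print c+0, "entries below threshold"}' file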

Printing field by column with variable line field

I have a command that returns output similar to:
*************
* something *
*************
| Header | Title |
Column1|Column2 | Column3 |Column4| Column5 |Column6|Column7| Column8 |
--------------------------------------------------------------------------------
val1 val2 val3 x y i j 1(a) 2 1(a) 2 val4
val5 val6 val7 w x y z i j k 2(b) 2 1(b) 1 val8
..
..
Total lines: xx
I want to print just Column6, for example, but because the output is not delimited by a fixed field separator, awk '{print $x}' won't work for me. I need a way to print output by a defined column (e.g. Column6 or Column8). Maybe printing the Column6 field from the right, which is field $5 from the right? Is there a method to count fields from the right rather than the default of counting from the left?
Any help would be appreciated.
Use NF for this
awk '{print $(NF-5)}'
This will print the 6th column from the right, for example.
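For instance, with eight space-separated fields, $(NF-5) picks the sixth field counting from the right:
$ echo "a b c d e f g h" | awk '{print $(NF-5)}'
c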
I have been solving a similar problem. Assuming Column6 is always there, you can use the following to find its index:
echo "Column1|Column2|Column6|Column8" | sed 's/Column6.*//;s/[^|]\+//g' | wc -c
Then you can simply construct the awk query
X=$(echo ...)
SCRIPT="{ print \$${X}; }"
echo "Column1 |Column2 |Column6 |Column8" | awk "${SCRIPT}"
|Column6
A rewrite in GNU awk:
$ cat program.awk
BEGIN { FS="|" }
$0 ~ c { # process record with header
split($0,a,"|") # split to get header indexes
for(i in a) { # loop all column names
gsub(/^ *| *$/,"",a[i]) # trim space off
if(a[i]==c) ci=i # ci is the one
}
while(i=index($0,FS)) { # build FIELDWIDTHS to separate fields
FIELDWIDTHS = FIELDWIDTHS i " "
$0=substr($0,i+1)
}
}
ci && $ci !~ /^-+$/ && $0=$ci # implicit printing
Run it:
$ awk -v c="Column6" -f program.awk file
1(a) 2
2(b) 2
If you want to edit the output column, the last row in program.awk is the place to do it. For example, if you'd like to lose the parenthesized part of Column6, you could create an action part with {sub(/\(.*\)/,""); print} for it.

How to track lines in large log file that don't appear in the expected order?

I have a large log file which includes lines in the format
id_number message_type
Here is an example of a log file where all lines appear in the expected order:
1 A
2 A
1 B
1 C
2 B
2 C
However, not all lines appear in the expected order in my log file, and I'd like to get a list of all id numbers that don't appear in the expected order. For the following file:
1 A
2 A
1 C
1 B
2 B
2 C
I would like to get output indicating that id number 1 has lines that don't appear in the expected order. How can I do this using grep, sed, and awk?
This works for me:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {print $1}' logfile
When you run this, the ID number from each out-of-order line will be printed. If there are no out-of-order lines, then nothing is printed.
How it works
-v "a=ABC"
This defines the variable a with the list of characters in their expected order.
substr(a, b[$1]++ + 1, 1) != $2 {print $1}
For each ID number, the array b keeps track of where we are. Initially, b is zero for all IDs. With this initial value, that is b[$1]==0, the expression substr(a, b[$1] + 1, 1) returns A which is our first expected output. The condition substr(a, b[$1] + 1, 1) != $2 thus checks if the expected output, from the substr function, differs from the actual output shown in the second field, $2. If it does differ, then the ID value, $1, is printed.
After the substr expression is computed, the trailing ++ in the expression b[$1]++ increments the value of b[$1] by 1 so that the value of b[$1] is ready for the next time that ID $1 is encountered.
Refinement
The above prints an ID number every time an out-of-order line is encountered. If you just want each bad ID printed once, not multiple times, use:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {bad[$1]++} END{for (n in bad) print n}' logfile
I am only on my iPad with no way to test this, but I can give you an idea of how to do it with awk, since no-one else is answering...
Something like this:
awk 'BEGIN{for(i=0;i<10000;i++)expected[i]=ord("A")}
{if(expected[$1]!=ord($2))
print "Out of order at line ", NR, $0;
expected[$1]=ord($2)+1
}' yourFile
You will need to paste in the ord() function from here.
Basically, the concept is to initialise an array called expected[] that keeps track of the next message type expected for each id and then, as each line is read, check it is the next expected value.
Batch only (the last sort is not mandatory):
sort -k1n YourFile | tee file1 | sort -k2 > file2 && comm -23 file1 file2 | sort

In AWK, is it possible to specify "ranges" of fields?

In AWK, is it possible to specify "ranges" of fields?
Example. Given a tab-separated file "foo" with 100 fields per line, I want to print only the fields 32 to 57 for each line, and save the result in a file "bar". What I do now:
awk 'BEGIN{OFS="\t"}{print $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57}' foo > bar
The problem with this is that it is tedious to type and prone to errors.
Is there some syntactic form which allows me to say the same in a more concise and less error prone fashion (like "$32..$57") ?
Besides the awk answer by @Jerry, there are other alternatives:
Using cut (assumes tab delimiter by default):
cut -f32-57 foo >bar
Using perl:
perl -nle '@a=split;print join "\t", @a[31..56]' foo >bar
Mildly revised version:
BEGIN { s = 32; e = 57; }
{ for (i=s; i<=e; i++) printf("%s%s", $(i), i<e ? OFS : "\n"); }
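Saved as, say, range.awk (a hypothetical file name), it could be invoked with the question's tab separator as:
awk -F'\t' -v OFS='\t' -f range.awk foo > bar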
You can do it in awk by using RE intervals. For example, to print fields 3-6 of the records in this file:
$ cat file
1 2 3 4 5 6 7 8 9
a b c d e f g h i
would be:
$ gawk 'BEGIN{f="([^ ]+ )"} {print gensub("("f"{2})("f"{4}).*","\\3","")}' file
3 4 5 6
c d e f
I'm creating an RE segment f to represent every field plus its succeeding field separator (for convenience), then I'm using that in the gensub to delete 2 of those (i.e. the first 2 fields), remember the next 4 for reference later using \3, and then delete what comes after them. For your tab-separated file where you want to print fields 32-57 (i.e. the 26 fields after the first 31) you'd use:
gawk 'BEGIN{f="([^\t]+\t)"} {print gensub("("f"{31})("f"{26}).*","\\3","")}' file
The above uses GNU awk for its gensub() function. With other awks you'd use sub() or match() and substr().
EDIT: Here's how to write a function to do the job:
gawk '
function subflds(s,e, f) {
f="([^" FS "]+" FS ")"
return gensub( "(" f "{" s-1 "})(" f "{" e-s+1 "}).*","\\3","")
}
{ print subflds(3,6) }
' file
3 4 5 6
c d e f
Just set FS as appropriate. Note that this will need a tweak for the default FS if your input file can start with spaces and/or have multiple spaces between fields, and it will only work if your FS is a single character.
I'm late, but this is quick and to the point, so I'll leave it here. In cases like this I normally just remove the fields I don't need with gsub and print. A quick and dirty example: since you know your file is delimited by tabs, you can remove the first 31 fields:
awk '{gsub(/^(\w+\t){31}/,"");print}'
example of removing 4 fields because lazy:
printf "a\tb\tc\td\te\tf\n" | awk '{gsub(/^(\w\t){4}/,"");print}'
Output:
e f
This is shorter to write, easier to remember and uses fewer CPU cycles than horrendous loops.
You can use a combination of loops and printf for that in awk:
#!/bin/bash
start_field=32
end_field=57
awk -v start=$start_field -v end=$end_field 'BEGIN{OFS="\t"}
{for (i=start; i<=end; i++) {
printf "%s", $i;
if (i < end) {
printf "%s", OFS;
} else {
printf "\n";
}
}}' foo > bar
This looks a bit hacky, however:
it properly delimits your output based on the specified OFS, and
it makes sure to print a new line at the end for each input line in the file.
I do not know a way to do field range selection in awk. I know how to drop fields at the end of the input (see below), but not easily at the beginning. Below is the hard way to drop fields at the beginning.
If you know a character c that is not included in your input, you could use the following awk script:
BEGIN { s = 32; e = 57; c = "#"; }
{ NF = e # Drop the fields after e.
$s = c $s # Put a c in front of the s field.
sub(".*"c, "") # Drop the chars before c.
print # Print the edited line.
}
EDIT:
And I just thought of this: you can always find a character that is not in the input: use \n.
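A minimal sketch of that variant (same structure as the script above, with the sentinel swapped for a newline, which cannot occur inside a line read with the default record separator):
BEGIN { s = 32; e = 57; c = "\n" }
{ NF = e # Drop the fields after e.
$s = c $s # Put the newline in front of the s field.
sub(".*"c, "") # Drop the chars before it.
print # Print the edited line.
}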
Unfortunately I don't seem to have access to my account anymore, and I don't have 50 rep to add a comment anyway.
Bob's answer can be simplified a lot using 'seq':
echo $(seq -s ,\$ 5 9| cut -d, -f2-)
$6,$7,$8,$9
The minor disadvantage is you have to specify your first field number as one lower.
So to get fields 3 through 7, I specify 2 as the first argument.
seq -s ,\$ 2 7 sets the field separator for seq to ',$' and yields 2,$3,$4,$5,$6,$7
cut -d, -f2- sets the field delimiter to ',' and basically cuts off everything before the first comma by showing everything from the second field on, thus resulting in $3,$4,$5,$6,$7
When combined with Bob's answer, we get:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(seq -s ,\$ 2 7| cut -d, -f2-)}" awk.txt
3 4 5 6 7
c d e f g
$
I use this simple function, which does not check that the field range exists in the line:
function subby(f,l, s) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}
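A hypothetical call site, pairing the function with a main rule (foo and bar are the question's file names; note the result is space-joined rather than tab-joined):
awk '{ print subby(32,57) }
function subby(f,l, s) {
s = $f
for(i=f+1;i<=l;i++)
s = sprintf("%s %s",s,$i)
return s
}' foo > bar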
(I know the OP requested "in AWK", but ...)
Using bash expansion on the command line to generate the arguments list:
$ cat awk.txt
1 2 3 4 5 6 7 8 9
a b c d e f g h i
$ awk "{print $(c="" ;for i in {3..7}; do c=$c\$$i, ; done ; c=${c%%,} ; echo $c ;)}" awk.txt
3 4 5 6 7
c d e f g
Explanation:
c="" # var to hold args list
for i in {3..7} # the required variable range 3 - 7
do
# replace c's value with concatenation of existing value, literal $, i value and a comma
c=$c\$$i,
done
c=${c%%,} # remove trailing/final comma
echo $c #return the list string
This is placed on a single line using semicolons, inside $() so that it evaluates/expands in place.