I have a file like this:
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool), I want to extract only the rows where SRMAPQ is over 60. This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field with $NF:
awk -F= '$NF > 60' file
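A quick sanity check I would add (not part of the original answer): count how many lines actually end with an SRMAPQ value and compare that with the total line count; if the two numbers match, the $NF trick is safe, otherwise use the generic approach below.
grep -c 'SRMAPQ=[0-9][0-9]*$' file
wc -l < file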
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach:
awk 'match($0, /SRMAPQ=([0-9]+)/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l); if (v+0 > 60) print }' file
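If you are on GNU awk, the captured value can also be pulled out with the three-argument form of match, which stores submatches in an array (a gawk-only alternative sketch, equivalent to the line above):
awk 'match($0, /SRMAPQ=([0-9]+)/, m) && m[1]+0 > 60' file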
I would use GNU AWK in the following way. Let file.txt content be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test that it works when SRMAPQ is not the last entry. Explanation: I set FS to SRMAPQ=, so whatever is before it becomes the first field ($1) and whatever is behind it becomes the second field ($2). In the 2nd line that is 67;SOMETHING=2, with which GNU AWK copes by converting its longest prefix that constitutes a number, in this case 67; the other lines have just numbers. Disclaimer: this solution assumes that all but the last field have a trailing ;. If this does not hold true, please test my solution fully before usage.
(tested in gawk 4.2.1)
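One small tweak worth considering (my addition, not from the answer above): appending +0 forces the numeric conversion described in the explanation, so the comparison stays numeric even when the value is followed by further ;key=value pairs rather than falling back to a string comparison:
awk 'BEGIN{FS="SRMAPQ="} $2+0 > 60' file.txt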
I am trying to delete rows which have an empty field in the 21st column. For some reason this code works on other files (with fewer columns) but not on this particular one. I've tried converting the file to space-separated, comma-separated, and tab-delimited; nothing seems to work.
I've tried these 2 different methods:
awk -F'\t' '$21!=""'
awk -F'\t' '$21{print $0}'
For example, here is a smaller version of my tab-delimited file. I want to remove the rows that have "" in the "Gene" column:
"Gene_ID"	"Sample_1"	"Sample_x"	"Sample_19"	"Gene"
"ENSG00000223972"	12	2	1	"DDX11L1"
"ENSG00000227232"	6	12	45	"WASH7P"
"ENSG00000278267"	0	4	542	"MIR6859-1"
"ENSG00000186092"	4	2	34	"OR4F5"
"ENSG00000239945"	7	67	22	""
"ENSG00000233750"	9	4356	22	"CICP27"
"ENSG00000241599"	55	4	55	""
This should work; your field is not blank, it's empty quotes.
$ awk -F'\t' '$21!="\"\""'
or perhaps easier to read
$ awk -F'\t' -v empty='""' '$21!=empty'
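On the smaller five-column sample above, the Gene column happens to be the last field, so you can sanity-check the same idea with $NF before running it on the full 21-column file (assuming the sample is saved as, say, sample.txt):
$ awk -F'\t' -v empty='""' '$NF != empty' sample.txt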
I have a file with tab-delimited fields (or columns) like this one below:
cat abc_table.txt
a b c
1 11;qqw 213
2 22 222
3 333;rs2 83838
I would like to remove everything after the ";" in the second field only.
I have tried with
awk 'BEGIN{FS=OFS="\t"} NR>=1 && sub (/;[*]/,"",$2){print $0}' abc_table.txt
but it does not seem to work.
I also tried with sed:
sed 's/;.*//g' abc_table.txt
but it also erases the strings in the third field:
a b c
1 11
2 22 222
3 333
The desired output is:
a b c
1 11 213
2 22 222
3 333 83838
If someone could help me, I would be very grateful!
You simply need to correct your regex.
awk '{sub(/;.*/,"",$2)} 1' Input_file
In case your Input_file is TAB delimited, then try:
awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1' Input_file
Problem in OP's regex: the regex ;[*] looks for a ; followed by a literal * in the 2nd field; that's why it's NOT able to substitute everything after the ; in the 2nd field. We simply need ;.*, which means grab everything from the very first occurrence of ; to the end of the 2nd field and substitute it with NULL in the 2nd field.
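To see the difference on a single value (a quick illustration, not from the original answer): ;[*] only matches a ; followed by a literal *, so nothing is removed, while ;.* removes the ; and everything after it:
echo '333;rs2' | awk '{sub(/;[*]/,"")} 1'    # prints 333;rs2 (regex never matches)
echo '333;rs2' | awk '{sub(/;.*/,"")} 1'     # prints 333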
An alternative solution using GNU sed:
sed -E 's/(^[^\t]*\t+[^;]*);[^\t]*/\1/' file
a b c
1 11 213
2 22 222
3 333 83838
This might work for you (GNU sed):
sed 's/[^\t]*/&\n/2;s/;[^\t]*\n//;s/\n//' file
Append a unique marker, e.g. a newline, to the end of field 2.
Remove the first ; and any following non-tab characters up to that newline.
Remove the newline, if any.
N.B. This method can be extended for selective or all fields e.g. same removal but for the first and third fields:
sed 's/[^\t]*/&\n/1;s//&\n/3;s/;[^\t]*\n//g;s/\n//g' file
I have a file as follows:
5 6
7 8
12 15
Using awk, how can I find the distance between the second column of one line and the first column of the next line? In this case, that is the distance between 6 and 7, and between 8 and 12, printed as follows, with the distance for the first line set to zero:
5 6 0
7 8 1
12 15 4
awk '{print $0, (NR>1?$1-p:0); p=$2}' file
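The same logic spelled out with comments (just an expanded rendering of the one-liner above):
awk '{
  # print the current line plus the distance:
  # 0 for the first record, otherwise current $1 minus the previous line's $2
  print $0, (NR > 1 ? $1 - p : 0)
  # remember this line's 2nd column for the next record
  p = $2
}' file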
try:
awk 'NR==1{val=$2;print $0,"0";next} {print $0,$1-val;val=$2}' Input_file
Adding an explanation now as well.
Checking for NR==1 (the first line of Input_file): set a variable named val to the second field of the Input_file, print the current line with "0", then do next (which skips all further statements). For the remaining lines, print the current line along with the value of $1-val, and then assign $2 of the current line to val.
Short awk approach:
awk 'NR==1{ $3=0 }NR>1{ $3=$1-p }{ p=$2 }1' file
The output:
5 6 0
7 8 1
12 15 4
p=$2 - capture the 2nd field value (p holds the previous line's value)
awk newbie here! I am asking for help to solve a simple specific task.
Here is file.txt
1
2
3
5
6
7
8
9
As you can see, a single number (the number 4) is missing. I would like to print the missing number 4 to the console. My idea was to compare the current line number with the entry and, whenever they don't match, print the line number and exit. I tried
cat file.txt | awk '{ if ($NR != $1) {print $NR; exit 1} }'
But it prints only a newline.
I am trying to learn awk via this small exercise, so I am mainly interested in solutions using awk. I also welcome an explanation of why my code does not do what I would expect.
Try this - your $NR means "field number NR", not the line number; since each line has only one field, $NR is an empty field from the second line on, so the comparison succeeds immediately and prints that empty field, which is why you only saw a newline. Use NR itself:
awk '{ if (NR != $1) {print NR; exit 1} }' file.txt
4
Since you have a solution already, here is another approach, comparing with previous values.
awk '$1!=p+1{print p+1} {p=$1}' file
Your positional comparison won't work if you have more than one missing value.
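If several consecutive numbers can be missing, a small loop over each gap prints every absent value (my own extension of the previous-value idea, not part of the original answers):
awk '{ for (i = p + 1; i < $1; i++) print i; p = $1 }' file.txt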
Maybe this will help:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
4
Since I am learning awk as well
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
`awk` explanation: increment a counter `i` for every line read and compare it with the number in `$1`; when they don't match, `i` is the missing number, so print it and resync `i` to `$1`.
cat file
1
2
3
5
6
7
8
9
11
12
13
15
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
10
14
I have a file from which I need to extract segments based on character ranges given in another file. I would like to do it using an awk command.
File one would look like this (a single line):
AATTGTGAAGGTAGATGGCTCGCTCCGCGGCGGGGCGCGCGCGCGCGCGCGGGCTCGCTATATAGAGATATATGCGCGCGGCGCGCGGCGCGCGCGGCGCGCGCGTATATATATAGGCGCGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCC
The second file would look as follows:
5 10
13 20
22 24
and the output would be:
GTGAAG
AGATGGCT
GCT
This one-liner will solve your problem:
awk 'BEGIN{getline sequence < "first_file"} {print substr(sequence, $1, $2 - $1 + 1) }' second_file
Explanation: this script reads the string sequence from the file named first_file (adjust it to the actual file name) using the getline function. Then, for each line of the second file (which contains the ranges to process), it extracts the required substring using the substr function. substr takes three parameters: the string (sequence), the position ($1), and the length ($2 - $1 + 1).
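If you prefer to avoid getline, the same thing can be done by handing the sequence to awk as a shell variable instead (an alternative sketch using the same hypothetical file names; fine for sequences of this size, though very large sequences may hit the shell's argument length limit):
awk -v seq="$(cat first_file)" '{ print substr(seq, $1, $2 - $1 + 1) }' second_file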
Nya gave you the awk solution; here's one based on coreutils.
string
AATTGTGAAGGTAGATGGCTCGCTCCGCGGCGGGGCGCGCGCGCGCGCGCGGGCTCGCTATATAGAGATATATGCGCGCGGCGCGCGGCGCGCGCGGCGCGCGCGTATATATATAGGCGCGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCC
offlen
5 10
13 20
22 24
You can get the output you want with:
while read off len; do cut -c${off}-${len} string; done < offlen
Output:
GTGAAG
AGATGGCT
GCT