I have a file that looks like this
seq1 CT 5 CCCGCTGCTGATGAC
seq2 AG 8 CTGTGTAGATGATGGGTTAGAG
seq3 TG 3 CGTGTGACA
I am trying to replace the nth character of field 4 with the string in field 2, where n is the value specified by field 3. The output would be:
seq1 CT 5 CCCGCTTGCTGATGAC
seq2 AG 8 CTGTGTAAGATGATGGGTTAGAG
seq3 TG 3 CGTGGTGACA
My attempt looks like this:
awk '{a=$3; b=$2; sub(/(substr($4, a, 1))/,b); print $0}'
I guess what is happening is that sub() is treating what's inside /.../ as a literal string rather than getting the string specified by the substr command and the variable b. After searching I can't find the correct way of doing this.
Cheers,
awk '{$4 = substr($4, 1, $3-1) $2 substr($4, $3+1); print}'
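The sub() in the original attempt fails because everything inside /.../ is taken as a literal regular expression, so substr($4, a, 1) is matched as text rather than evaluated. Rebuilding the field from its pieces avoids regex matching entirely: keep everything before position $3, insert $2, then append everything from position $3+1 onward. On the sample input:
$ awk '{$4 = substr($4, 1, $3-1) $2 substr($4, $3+1); print}' file
seq1 CT 5 CCCGCTTGCTGATGAC
seq2 AG 8 CTGTGTAAGATGATGGGTTAGAG
seq3 TG 3 CGTGGTGACA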
I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or other tools), I want to extract only the rows with SRMAPQ over 60.
This means the output is:
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the SRMAPQ value separately and then do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF:
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach:
awk 'match($0, /SRMAPQ=([0-9]+)/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) } v+0 > 60' file
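With GNU awk, the three-argument form of match() (also used in an answer further down) captures the number directly, which shortens this to:
gawk 'match($0, /SRMAPQ=([0-9]+)/, m) && m[1]+0 > 60' file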
I would use GNU AWK in the following way. Let file.txt content be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test that it works when SRMAPQ is not last.
Explanation: I set FS to SRMAPQ=, so what comes before it becomes the first field ($1) and what comes after becomes the second field ($2). In the 2nd line that is 67;SOMETHING=2, which GNU AWK copes with by converting its longest numeric prefix, in this case 67; the other lines have just numbers.
Disclaimer: this solution assumes that all but the last field have a trailing ;. If this does not hold true, please test my solution fully before usage.
(tested in gawk 4.2.1)
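To see the numeric-prefix conversion in isolation (a quick sanity check, separate from the solution):
$ echo '67;SOMETHING=2' | awk '{ print $0 + 0 }'
67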
I have very long files where I have to compare two chromosome numbers present in the same line. I would like to use awk to create a file that keeps only the lines where the chromosome numbers are different.
Here is the example of my file:
CHROM ALT
1 ]1:1234567]T
1 T[1:2345678[
1 A[12:3456789[
2 etc...
In this example, I wish to compare the number of the chromosome (here '1' in the CHROM column) and the number that is between the first bracket ([ or ]) and the ":" symbol. If these numbers are different, I wish to print the corresponding line.
Here, the result should be like this:
1 A[12:3456789[
Thank you for your help.
$ awk -F'[][]' '$1+0 != $2+0' file
1 A[12:3456789[
2 etc...
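Here -F'[][]' is a bracket expression matching either square bracket, so the text between the first pair of brackets becomes $2, and adding 0 reduces each field to its leading number before the comparison:
$ echo '1 A[12:3456789[' | awk -F'[][]' '{ print $1+0, $2+0 }'
1 12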
This requires GNU awk for the 3 argument match() function:
gawk 'match($2, /[][]([0-9]+):/, a) && $1 != a[1]' file
Thanks again for the different answers.
Here is how my data looks, with several columns:
CHROM POS ID REF ALT
1 1000000 123:1 A ]1:1234567]T
1 2000000 456:1 A T[1:2345678[
1 3000000 789:1 T A[12:3456789[
2 ... ... . ...
My question is: how do I modify the previous code, when I have several columns?
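A minimal adaptation, assuming the ALT string stays in the last column as in your sample: match against $NF instead of $2:
gawk 'match($NF, /[][]([0-9]+):/, a) && $1 != a[1]' file
On the sample above this prints only the row containing A[12:3456789[.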
I have a txt file that is tens to hundreds of lines long, and I need to sum a particular field on each line (and output the result) if a preceding field matches.
Here is an example dataset:
Sample4;6a0f64d2;size=1;,Sample4;f1cb4733a;size=6;,Sample3;aa44410feb29210c1156;size=2;
Sample2;5b91bef2329bd87f4c7;size=2;,Sample1;909cd4e2940f328b3;size=2;
The structure is
<sample ID>;<random id>;size=<numeric>;, then the next entry. There could be hundreds of entries in a line (this is just a small example)
Basically, I want to sum the "size" numbers for each entry across a line (entries separated by ','), but only those that match a particular sample identifier (e.g. Sample4).
So, if we want to match just the 'Sample4's, the script would produce this:
awk '{some-code for sample4}' example.txt
7
0
Because the entries with 'Sample4' add up to 7 in line 1, but in line 2 there are no matching Sample4 entries.
This could be done for each sample ID or, ideally, for all sample IDs provided in a list (perhaps in a simple file, 1 line per sample ID), which would then output the sums for each row, with each sample ID having its own column. E.g. for the example file above, the results of the script would be:
Sample1 Sample2 Sample3 Sample4
0 0 2 7
2 2 0 0
Any ideas on how to get started?
Thanks!
another awk
awk -F';' '{for(i=1;i<NF-1;i+=3)          # each entry spans 3 ;-separated fields
              {split($(i+2),e,"=");       # e[2] is the number in size=<n>
               sub(/,/,"",$i);            # strip the leading comma off the sample ID
               header[$i];                # record every sample ID seen
               a[$i,NR]+=e[2]}}           # sum the sizes per (sample ID, line)
            END {for(h in header) printf "%s", h OFS;
                 print "";
                 for(i=1;i<=NR;i++)
                   {for(h in header) printf "%s", a[h,i]+0 OFS;   # 0 if missing
                    print ""}}' file | column -t
Sample1 Sample2 Sample3 Sample4
0 0 2 7
2 2 0 0
PS: the order of the columns is not guaranteed (for (h in header) visits keys in an unspecified order).
Explanation
To simplify parsing I used ; as the delimiter and got rid of the , before the names. Using the fixed structure, I accumulate name = sum of the size values for each line in the multi-dimensional array a, and separately keep track of all names in the header array. Once the lines are consumed, the END block prints the header and, for each line, the value for the corresponding name (or 0 if missing). Pretty-print with column -t.
If I am understanding this correctly, you can do:
$ awk '{split($0,samp,/,/)                  # samp[] holds the ,-separated entries
        for (i=1; i in samp; i++){
            sub(/;$/, "", samp[i])          # drop the trailing ;
            split(samp[i], fields, /;/)     # fields: sample ID, random id, size=<n>
            split(fields[3], ns, /=/)       # ns[2] is the numeric size
            data[fields[1]]+=ns[2]          # sum per sample ID
        }
        printf "For line %s:\n", NR
        for (e in data)
            print e, data[e]
        split("", data)                     # clear the totals before the next line
       }' file
Prints:
For line 1:
Sample3 2
Sample4 7
For line 2:
Sample1 2
Sample2 2
I have a large log file which includes lines in the format
id_number message_type
Here is an example for a log file where all lines appear in the expected order
1 A
2 A
1 B
1 C
2 B
2 C
However, not all lines appear in the expected order in my log file and I'd like to get a list of all id numbers that don't appear in expected order. For the following file
1 A
2 A
1 C
1 B
2 B
2 C
I would like to get an output that indicates id number 1 has lines that don't appear in the expected order. How can I do this using grep, sed or awk?
This works for me:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {print $1}' logfile
When you run this, the ID number from each out-of-order line will be printed. If there are no out-of-order lines, then nothing is printed.
How it works
-v "a=ABC"
This defines the variable a with the list of characters in their expected order.
substr(a, b[$1]++ + 1, 1) != $2 {print $1}
For each ID number, the array b keeps track of where we are. Initially, b is zero for all IDs. With this initial value, that is b[$1]==0, the expression substr(a, b[$1] + 1, 1) returns A which is our first expected output. The condition substr(a, b[$1] + 1, 1) != $2 thus checks if the expected output, from the substr function, differs from the actual output shown in the second field, $2. If it does differ, then the ID value, $1, is printed.
After the substr expression is computed, the trailing ++ in the expression b[$1]++ increments the value of b[$1] by 1 so that the value of b[$1] is ready for the next time that ID $1 is encountered.
Refinement
The above prints an ID number every time an out-of-order line is encountered. If you just want each bad ID printed once, not multiple times, use:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {bad[$1]++} END{for (n in bad) print n}' logfile
I am only on my iPad with no way to test this, but I can give you an idea how to do it with awk since no-one else is answering...
Something like this:
awk 'BEGIN{for(i=0;i<10000;i++)expected[i]=ord("A")}
{if(expected[$1]!=ord($2))
print "Out of order at line ", NR, $0;
expected[$1]=ord($2)+1
}' yourFile
You will need to paste in the ord() function from here.
Basically, the concept is to initialise an array called expected[] that keeps track of the next message type expected for each id and then, as each line is read, check it is the next expected value.
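A self-contained variant that needs no external ord() function, sketched on the same idea but using index() to map each message type to its expected position in the sequence:
awk 'index("ABC", $2) != ++n[$1] { print "Out of order at line", NR": "$0 }' yourFile
Like the commands above, this flags every line that deviates from the A, B, C order for its ID.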
Batch only (last sort is not mandatory)
sort -k1n YourFile | tee file1 | sort -k2 > file2 && comm -23 file1 file2 | sort
I have a file that I need to extract segments from based on a character range given in another file. I would like to do it using an awk command.
File one would look like this (a single line):
AATTGTGAAGGTAGATGGCTCGCTCCGCGGCGGGGCGCGCGCGCGCGCGCGGGCTCGCTATATAGAGATATATGCGCGCGGCGCGCGGCGCGCGCGGCGCGCGCGTATATATATAGGCGCGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCC
The second file would look as follows:
5 10
13 20
22 24
and the output would be:
GTGAAG
AGATGGCT
GCT
This one-liner will solve your problem:
awk 'BEGIN{getline sequence < "first_file"} {print substr(sequence, $1, $2 - $1 + 1) }' second_file
Explanation: This script reads the string sequence from the file named first_file (adjust it to the actual file name) using the getline function. Then for each line of the second file (which contains the ranges to process) it extracts the necessary substring using the substr function. substr accepts three parameters: the string (sequence), the position ($1), and the length ($2 - $1 + 1).
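Run against the two sample files, it prints exactly the expected segments:
$ awk 'BEGIN{getline sequence < "first_file"} {print substr(sequence, $1, $2 - $1 + 1)}' second_file
GTGAAG
AGATGGCT
GCT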
Nya gave you the awk solution, here's one based on coreutils.
string
AATTGTGAAGGTAGATGGCTCGCTCCGCGGCGGGGCGCGCGCGCGCGCGCGGGCTCGCTATATAGAGATATATGCGCGCGGCGCGCGGCGCGCGCGGCGCGCGCGTATATATATAGGCGCGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCC
offlen
5 10
13 20
22 24
You can get the output you want with:
while read off len; do cut -c${off}-${len} string; done < offlen
Output:
GTGAAG
AGATGGCT
GCT