How to trim every nth line by a different value? - awk

I would like to trim the last XY characters of every 4th line. The number of characters to cut off should be the difference between the character counts of line 4 and line 2, of line 8 and line 6, and so on.
For example: line 4 (29 characters) - line 2 (20 characters) = 9. So the last 9 characters of line 4 should be removed.
Input:
#V300059044L3C001R0010004402
AAGTAGATATCATGGAGCCG
+
FFFGFGGFGFGFFGFFGFFGGGGGFFFGG
#V300059044L3C001R0010009240
AAAGGGAGGGAGAATAAT
+
GFFGFEGFGFGEFDFGGEFFGGEDEGEGF
Output:
#V300059044L3C001R0010004402
AAGTAGATATCATGGAGCCG
+
FFFGFGGFGFGFFGFFGFFG
#V300059044L3C001R0010009240
AAAGGGAGGGAGAATAAT
+
GFFGFEGFGFGEFDFGGE

Running
awk 'NR%4==0 {$0=substr($0,1,a)} NR%2==0 {a=length($0)} {print $0}' input.txt
on input.txt
yields
#V300059044L3C001R0010004402
AAGTAGATATCATGGAGCCG
+
FFFGFGGFGFGFFGFFGFFG
#V300059044L3C001R0010009240
AAAGGGAGGGAGAATAAT
+
GFFGFEGFGFGEFDFGGE
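A slightly more explicit variant, as a sketch only. It assumes the input always comes in groups of four lines, as in the example: it remembers the length of line 2 of each group and cuts line 4 to that length. The function name trim_quality is made up for illustration.

```shell
# trim_quality is a hypothetical helper name; it reads the records from stdin.
trim_quality() {
    awk '
        NR % 4 == 2 { n = length($0) }          # 2nd line of a group: remember its length
        NR % 4 == 0 { $0 = substr($0, 1, n) }   # 4th line of a group: keep the first n characters
        { print }
    '
}

result=$(printf '%s\n' \
    '#V300059044L3C001R0010004402' \
    'AAGTAGATATCATGGAGCCG' \
    '+' \
    'FFFGFGGFGFGFFGFFGFFGGGGGFFFGG' | trim_quality)
printf '%s\n' "$result"
```

Testing NR%4==2 instead of NR%2==0 means the stored length is only ever taken from the second line of a group, so the order of the two rules no longer matters.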

Related

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or another tool),
I want to extract only the rows with SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
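Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach. The + 0 forces a numeric comparison, because substr returns a string and awk would otherwise compare it character by character:
awk 'match($0, /SRMAPQ=[0-9]+/) { l = length("SRMAPQ="); if (substr($0, RSTART+l, RLENGTH-l) + 0 > 60) print }' file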
I would use GNU AWK in the following way. Let file.txt content be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2+0>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test whether it works when SRMAPQ is not last. Explanation: I set FS to SRMAPQ=, so what comes before it becomes the first field ($1) and what comes after it becomes the second field ($2). In the 2nd line $2 is 67;SOMETHING=2; adding 0 makes awk convert it to a number using its longest leading numeric prefix, in this case 67, so the comparison is numeric rather than a string comparison. In the other lines $2 is just a number. Disclaimer: this solution assumes that every field except the last has a trailing ;. If this does not hold true, please test my solution fully before use.
(tested in gawk 4.2.1)
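Another way to handle SRMAPQ appearing anywhere in the line, sketched under the assumption that the key=value items are separated by ; and that the first two columns are separated by spaces: split the line into items, find the one starting with SRMAPQ=, and compare its value as a number. The helper name srmapq_over and the variable limit are made up for this example.

```shell
# srmapq_over is a hypothetical helper: print lines whose SRMAPQ value exceeds $1.
srmapq_over() {
    awk -v limit="$1" '
        {
            n = split($0, part, /[ ;]/)              # break the line on spaces and semicolons
            for (i = 1; i <= n; i++)
                if (part[i] ~ /^SRMAPQ=/) {
                    val = substr(part[i], 8) + 0     # text after "SRMAPQ=", forced to a number
                    if (val > limit) print
                    break
                }
        }
    '
}

result=$(printf '%s\n' \
    '1 3 MAPQ=0;CT=3to5;SRMAPQ=60' \
    '2 34 MAPQ=60;CT=3to5;SRMAPQ=100;DT=3to5' \
    '4 56 MAPQ=67;CT=3to5;SRMAPQ=50' \
    '5 7 MAPQ=44;SRMAPQ=61;CT=3to5' | srmapq_over 60)
printf '%s\n' "$result"
```

The sample deliberately includes a three-digit value (100): a string comparison would rank "100" below "60", so this is the case where forcing a number with + 0 matters.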

awk with empty field in columns

Here is my file.dat:
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
Running awk '{print $2}' file.dat gives:
A
2
4
7
U
But I would like to keep the empty fields:
A

4

U
How can I do it?
I must add that:
between columns 1 and 2 the field separator is 3 whitespace characters
between columns 2 and 3, and between columns 3 and 4, the field separator is one whitespace character
So in column 2 there are 2 fields missing (lines 2 and 4) and in column 4
there are also 2 fields missing (lines 3 and 5)
If this isn't all you need:
$ awk -F'[ ]' '{print $4}' file
A

4

U
then edit your question to provide a more truly representative example and clearer requirements.
If the input is fixed-width columns, you can use substr to extract the slice you want. I have assumed that you want a single character at index 5:
awk '{ print(substr($0,5,1)) }' file
Your awk code is missing field separators.
Your example file doesn't clearly show what that field separator is.
From observation your file appears to have 5 columns.
You need to determine what your field separator is first.
This example code expects \t which means <TAB> as the field separator.
awk -F'\t' '{print $3}' OFS='\t' file.dat
This prints the 3rd column of the file. -F'\t' sets the input ('read in') field separator and OFS='\t' sets the output ('read out') field separator.
A
4
U
For GNU awk. It processes the file twice. On the first pass it examines all records to find the character positions that hold only a space in every record, and treats runs of such positions as separators while building up the FIELDWIDTHS variable. On the second pass it uses FIELDWIDTHS for fixed-width processing of the data.
Each a[i] gets the value 0 or 1, and h (header) will be 100010101 with this input, which leads to FIELDWIDTHS="4 2 2 1":
1   A 1 4
2     2 4
3   4 4
3     7 B
1   U 2
|   | | |
100010101 - while(match(h,/10*/))
\  /|/|/|
  4 2 2 1
Script:
$ awk '
NR==FNR {
    for(i=1;i<=length;i++)                              # all record chars
        a[i]=((a[i]!~/^(0|)$/) || substr($0,i,1)!=" ")  # keep track of all space places
    if(--i>m)
        m=i                                             # max record length...
    next
}
BEGINFILE {
    if(NR!=0) {                                         # only do this once
        for(i=1;i<=m;i++)                               # ... used here
            h=h a[i]                                    # h=100010101
        while(match(h,/10*/)) {                         # build FIELDWIDTHS
            FIELDWIDTHS=FIELDWIDTHS " " RLENGTH         # qnd
            h=substr(h,RSTART+RLENGTH)
        }
    }
}
{
    print $2                                            # and output
}' file file
And output:
A

4

U
You need to trim off the space from the fields, though.
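If the column widths are already known, the same fixed-width idea can be sketched in plain awk with substr, including the trimming. This assumes the layout described in the question (three spaces after column 1, one space after columns 2 and 3, hence widths 4 2 2 1). The helper name fixed_col is made up for illustration.

```shell
# fixed_col is a hypothetical helper: print column $1, given space-separated widths in $2.
fixed_col() {
    awk -v col="$1" -v widths="$2" '
        BEGIN {
            n = split(widths, w, " ")
            start = 1
            for (i = 1; i < col; i++) start += w[i]   # first character of the wanted column
            len = w[col]
        }
        {
            field = substr($0, start, len)
            gsub(/^ +| +$/, "", field)                # trim the padding spaces
            print field
        }
    '
}

result=$(printf '%s\n' \
    '1   A 1 4' \
    '2     2 4' \
    '3   4 4' \
    '3     7 B' \
    '1   U 2' | fixed_col 2 '4 2 2 1')
printf '%s\n' "$result"
```

Rows where column 2 is missing print an empty line, so the output keeps one line per input row.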

Subtract two fields of two consecutive rows in awk

I have a file as follows:
5 6
7 8
12 15
Using awk, how can I find the distance between the second column of one line and the first column of the next line? In this case, that is the distance between 6 and 7 and between 8 and 12. The output should look as follows, with the distance on the first line set to zero:
5 6 0
7 8 1
12 15 4
awk '{print $0, (NR>1?$1-p:0); p=$2}' file
try:
awk 'NR==1{val=$2;print $0,"0";next} {print $0,$1-val;val=$2}' Input_file
Explanation: When NR==1 (the first line of Input_file), set the variable val to the second field, print the current line followed by 0, and call next, which skips all further statements for that line. For every later line, print the current line followed by the value of $1-val, then set val to $2 of the current line.
Short awk approach:
awk 'NR==1{ $3=0 }NR>1{ $3=$1-p }{ p=$2 }1' file
The output:
5 6 0
7 8 1
12 15 4
p=$2 - capture the 2nd field value (p - considered as previous line value)
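The same logic spelled out with longer names, as a sketch for readers who find the one-liners dense. The helper name gap_to_previous and the variable names are made up; the behaviour is meant to match the answers above.

```shell
# gap_to_previous is a hypothetical helper name; it reads two-column rows from stdin.
gap_to_previous() {
    awk '
        NR == 1 { gap = 0 }              # no previous row yet
        NR > 1  { gap = $1 - prev_end }  # start of this row minus end of the previous row
        {
            print $0, gap
            prev_end = $2                # remember column 2 for the next row
        }
    '
}

result=$(printf '%s\n' '5 6' '7 8' '12 15' | gap_to_previous)
printf '%s\n' "$result"
```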

how to determine if difference between two values falls within range with awk?

If I have the input:
0
5
7
13
I want to calculate the difference between the values for each subsequent line. I have done so with:
awk 'NR==1{x=$1;next}{print $1-x;x=$1}'
This will generate:
5
2
6
My struggle is that I want to print a + sign beside the output value if the two numbers used to calculate it (from the input file) encompass values from 6-8. So I would get the following output:
5 -
2 +
6 +
The 2 and the 6 will each have a + sign beside them because the pairs of values used to calculate them (5 and 7, and 7 and 13) span values between 6 and 8.
Please let me know if any clarification is necessary.
Thank you.
Same idea as the above awk
$ awk 'NR==1{p=$1;next}
{print $1-p,
((p-6)*(p-8)<=0 || ($1-6)*($1-8)<=0)?"+":"-"; p=$1}' file
5 -
2 +
6 +
P.S. This checks whether at least one of the two values lies in the range 6 to 8. If you want both to lie in it, replace || with &&.
UPDATE: The range check should be based on the span of the two entries, as explained in the comments. This should do:
$ awk 'function max(x,y) {return x>y?x:y};
function min(x,y) {return x>y?y:x};
NR==1{p=$1;next} {print $1-p,
(max($1,p)<6 || min($1,p)>8)?"-":"+"; p=$1}' file
All you need are some additional checks, if I understand you correctly:
awk 'NR==1{x=$1;next}
{sign = (x >= 6 && x <= 8) || ($1 >= 6 && $1 <= 8) ? "+" : "-"
print $1-x" "sign;x=$1}' test
Output:
5 -
2 +
6 +
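The two interpretations discussed in this thread (an endpoint lies in 6-8, versus the span between the two values overlaps 6-8) give the same result on the sample input but can differ elsewhere. A sketch that prints both signs side by side; the helper name compare_checks and the input 0 5 10 13 are made up to show the difference.

```shell
# compare_checks is a hypothetical helper; output columns: difference, endpoint check, span check.
compare_checks() {
    awk -v lo=6 -v hi=8 '
        NR == 1 { prev = $1 + 0; next }
        {
            cur = $1 + 0
            a = (prev < cur) ? prev : cur    # smaller of the two values
            b = (prev < cur) ? cur : prev    # larger of the two values
            endpoint = ((prev >= lo && prev <= hi) || (cur >= lo && cur <= hi)) ? "+" : "-"
            span     = (b < lo || a > hi) ? "-" : "+"
            print cur - prev, endpoint, span
            prev = cur
        }
    '
}

result=$(printf '%s\n' 0 5 10 13 | compare_checks)
printf '%s\n' "$result"
```

For the pair 5 and 10, neither value lies in 6-8, yet the span from 5 to 10 covers it, so only the span check reports +.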

extracting segments from a file with awk

I have a file that I need to extract segments from based on a character range given in another file. I would like to do it using an awk command.
File one would look like this ( a single line):
AATTGTGAAGGTAGATGGCTCGCTCCGCGGCGGGGCGCGCGCGCGCGCGCGGGCTCGCTATATAGAGATATATGCGCGCGGCGCGCGGCGCGCGCGGCGCGCGCGTATATATATAGGCGCGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCC
The second file would look like follows:
5 10
13 20
22 24
and the output would be:
GTGAAG
AGATGGCT
GCT
This one-liner will solve your problem:
awk 'BEGIN{getline sequence < "first_file"} {print substr(sequence, $1, $2 - $1 + 1) }' second_file
Explanation: This script reads the string sequence from the file named first_file (adjust it to the actual file name) using the getline function. Then, for each line of the second file (which contains the ranges to process), it extracts the required substring using the substr function. substr accepts three parameters: the string (sequence), the start position ($1), and the length ($2 - $1 + 1).
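The same extraction can also be sketched without getline, by passing both files to awk and using NR==FNR to recognise the first one. This assumes the first file is non-empty and holds the sequence on a single line. The sequence below is shortened to its first 26 characters and the scratch directory is only there to make the example self-contained.

```shell
# Write shortened sample files into a scratch directory (file names are placeholders).
dir=$(mktemp -d)
printf '%s\n' 'AATTGTGAAGGTAGATGGCTCGCTCC' > "$dir/first_file"
printf '%s\n' '5 10' '13 20' '22 24' > "$dir/second_file"

result=$(awk '
    NR == FNR { sequence = $0; next }                # first file: store the sequence line
    { print substr(sequence, $1, $2 - $1 + 1) }      # second file: start, end -> start, length
' "$dir/first_file" "$dir/second_file")
printf '%s\n' "$result"
rm -r "$dir"
```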
Nya gave you the awk solution; here's one based on coreutils.
File string:
AATTGTGAAGGTAGATGGCTCGCTCCGCGGCGGGGCGCGCGCGCGCGCGCGGGCTCGCTATATAGAGATATATGCGCGCGGCGCGCGGCGCGCGCGGCGCGCGCGTATATATATAGGCGCGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCC
File offlen:
5 10
13 20
22 24
You can get the output you want with:
while read off len; do cut -c${off}-${len} string; done < offlen
Output:
GTGAAG
AGATGGCT
GCT