Awk - Substring comparison

Working native bash code:
while read line
do
a=${line:112:7}
b=${line:123:7}
if [[ $a != "0000000" || $b != "0000000" ]]
then
echo "$line" >> FILE_OT_YHAV
else
echo "$line" >> FILE_OT_NHAV
fi
done <$FILE_IN
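As an aside (not raised in the thread): plain `read` strips leading/trailing whitespace and treats backslashes as escapes, which matters for fixed-width records. A safer variant of the same loop, with `IFS=` and `-r` added, and a generated dummy input so it runs standalone:

```shell
FILE_IN=dummy_fixed.txt
printf '%0130d\n' 0 > "$FILE_IN"      # one 130-character line of zeros
while IFS= read -r line; do           # IFS= keeps blanks, -r keeps backslashes
  a=${line:112:7}
  b=${line:123:7}
  if [[ $a != "0000000" || $b != "0000000" ]]; then
    printf '%s\n' "$line" >> FILE_OT_YHAV
  else
    printf '%s\n' "$line" >> FILE_OT_NHAV
  fi
done < "$FILE_IN"
```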
I have the following file (it's a dummy); the substrings being checked are both in the 4th field, so never mind the exact offsets.
AAAAAAAAAAAAAA XXXXXX BB CCCCCCC 12312312443430000000
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
CCCCCCCCCC C C QWEQWEE DDD AAAAAAA A12312312312312310000
I'm trying to write an awk script that compares two specific substrings: if either one is not 0000000 it outputs the line into file A, and if both of them are 0000000 it outputs the line into file B. This is the code I have so far:
# Before first line.
BEGIN {
    print "Awk Started"
    FILE_OT_YHAV = "FILE_OT_YHAV.test"
    FILE_OT_NHAV = "FILE_OT_NHAV.test"
    FS = ""
}
# For each line of input.
{
    fline = $0
    # print "length = #" length($0) "#"
    print "length = #" length(fline) "#"
    print "##" substr($0,112,7) "##" substr($0,123,7) "##"
    if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
        print $0 > FILE_OT_YHAV;
    else
        print $0 > FILE_OT_NHAV;
}
# After last line.
END {
    print "Awk Ended"
}
The problem is that when I run it:
a) It treats every line as having a different length.
b) Therefore the substrings are taken from different parts of each line (that is why I added the length prints before the if, to check on it).
This is a sample output of the line length awk reads and the different substrings :
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
##0000000##1022016##
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
After the comments about cleaning the file up with sed, I got this output (and yes, now the lines really do have different sizes):
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $
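One likely culprit, judging from the M-D/M-E bytes in that cat -v output (my assumption, not confirmed in the thread): the file contains multi-byte characters, and in a UTF-8 locale awk's length() and substr() count characters, while the 112/123 offsets were presumably derived from byte positions. Forcing the C locale makes awk count bytes instead:

```shell
# "αβγ fixed" written as raw UTF-8 bytes: three two-byte characters + " fixed"
printf '\316\261\316\262\316\263 fixed\n' > sample.txt
awk '{ print "current-locale length:", length($0) }' sample.txt
LC_ALL=C awk '{ print "byte length:", length($0) }' sample.txt   # 12 bytes
```

If the byte length is constant across real lines while the default-locale length varies, this explanation fits, and running the whole script under LC_ALL=C restores the byte-based offsets.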

Related

Move new line character 5 positions downstream in a text (fasta) file

I am trying to transform a text file like this (fasta format):
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
The objective is to displace the newline character 5 positions downstream, except for those lines starting with >:
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would like to use AWK, but I am not sure how to proceed. I am thinking about something similar to this:
awk '{for(i=1;i<=NR;i++){ if($1 ~ /^>/){¿?¿?¿?}}}'
Do you know how I can solve this?
Assumptions:
all data lines are to be expanded to a max of 24 characters
One awk idea:
awk -v width=24 '                               # pass width in as awk variable "width"
function print_sequence() {
    if (sequence)                               # if sequence is not blank
        while (sequence) {                      # while sequence is not blank
            print substr(sequence,1,width)      # print 1st 24 characters
            sequence = substr(sequence,width+1) # remove 1st 24 characters
        }
}
/^>/ { print_sequence()                         # flush previous set of data to stdout
       print                                    # print current input line
       next                                     # process next input line
}
{ sequence = sequence $1 }                      # append data to our "sequence" variable
END { print_sequence() }                        # flush last set of data to stdout
' fasta.in > fasta.out
This generates:
$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would do it the following way. Let file.txt's content be
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
then
awk 'BEGIN{width=24}/>/&&x{print x;x=""}/>/{print;next}{x = x $0}length(x)>=width{print substr(x,1,width);x=substr(x,width+1)}END{print x}' file.txt
gives output
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
Explanation: I set width to 24, which is the desired number of characters. If > is found and there is something stored in x, print it and set x to an empty string; if a line with > is encountered, print it and go to the next line. For every other line, append the current line's content to x; if the length of x is equal to or greater than width, print the first width characters of x and remove them from x. After processing all lines, print x. Disclaimer: this solution assumes no input line is longer than the desired width, since the length(x)>=width block prints at most one chunk per input line; a longer line would leave an unprinted backlog.
(GNU Awk 5.0.1)
Yet another approach you could try, using awk's field and record separators:
awk -v width=24 '
BEGIN {
    FS = "\n"                    # set the Field separator to newline
    RS = ">"                     # set the Record separator to ">"
    ORS = OFS = ""               # set the Output Record and Field separators to empty strings
}
NR > 1 {                         # with ">" as record separator the first record is empty, so skip it
    header = $1                  # with "\n" as field separator, $1 contains the header; save it
    $1 = OFS                     # assigning to $1 rebuilds the record, so $0 becomes the body
                                 # with all newlines removed, since OFS == ""
    gsub(".{" width "}", "&" FS) # append a newline (FS) after every "width" characters
    print RS header FS $0 FS     # print a ">", the header, a newline, the body and a newline
}
' fasta_in > fasta_out
Assuming the line that starts with > is never more than 24 chars long:
$ awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' file | fold -w24
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
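The awk part of this one-liner only joins each sequence into a single long line (printing headers on their own lines); fold then does the actual wrapping. The fold step in isolation, as a reduced sketch:

```shell
printf 'AAAABBBBCCCC\n' | fold -w4   # wraps every 4 columns
```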

Align trailing backslashes vertically

I have a file that looks like this:
abc \
d \
efgh \
i
jklmnop \
q \rst \
uv
wx
y \
z
For each group of consecutive lines that have a backslash at the end, I want to arrange those backslashes in a straight vertical line. So the expected output for above sample is:
abc  \
d    \
efgh \
i
jklmnop \
q \rst  \
uv
wx
y \
z
I managed to align all backslashes with this program:
$ awk '/\\$/ { sub(/\\$/,""); printf "%-20s\\\n",$0; next} 1' file
abc                 \
d                   \
efgh                \
i
jklmnop             \
q \rst              \
uv
wx
y                   \
z
But I have no idea how to proceed from here, so I'm asking for guidance. I tried searching on SO, but top results were all about removing trailing backslashes.
Details about the actual input:
Lines may contain any character including backslash and consist of any number of characters, there is no limit.
There might be multiple blanks and tabs before and after the last backslash.
There is always at least one blank or tab preceding the last backslash.
There is no line that consists of only a backslash and zero or more blanks/tabs around it.
P.S. I'm not looking for a Perl solution.
If you don't have any tabs before the spaces-then-backslash-then-spaces at the end of each line:
$ cat tst.awk
FNR == 1 { prevHasEsc = blockNr = 0 }
hasEsc = match($0,/[[:space:]]*\\[[:space:]]*$/) {
    $0 = substr($0,1,RSTART-1)
    if ( ! prevHasEsc ) {
        ++blockNr
    }
}
{ prevHasEsc = hasEsc }
NR == FNR {
    if ( hasEsc ) {
        lgth[blockNr] = (lgth[blockNr] > length($0) ? lgth[blockNr] : length($0))
    }
    next
}
hasEsc {
    $0 = sprintf("%-*s \\", lgth[blockNr], $0)
}
{ print }
$ awk -f tst.awk file file
abc  \
d    \
efgh \
i
jklmnop \
q       \
r
If you do then I'd suggest running the text through pr -e -t first to convert tabs to the corresponding number of blanks.
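That preprocessing step might look like this (a sketch; `expand` would serve equally well as `pr -e -t`):

```shell
# pr -e expands tabs to spaces at the default stops (every 8 columns);
# -t suppresses pr's page headers, leaving the text otherwise unchanged.
printf 'foo\tbar \\\n' | pr -e -t    # the tab becomes spaces up to column 9
```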
Here is one attempt using gnu-awk with a custom RS that breaks the input at each line that does not end with a backslash:
awk -v RS='[^\n]*[^\\\\[:space:]][[:blank:]]*(\n|$)' '
{
    sub(/\n$/, "", RT)
}
n = split($0, lines, /[[:blank:]]+\\[[:blank:]]*\n/) {
    lines[n] = RT
    mlen = 0
    # determine the max length within the block
    for (i=1; i<=n; i++)
        if (mlen < length(lines[i]))
            mlen = length(lines[i])
    # print each segment padded to the max length, with a backslash at the end
    for (i=1; i<n; i++)
        printf "%-" (mlen+1) "s\\\n", lines[i]
}
RT {
    print RT
}' file
abc  \
d    \
efgh \
i
jklmnop \
q \rst  \
uv
wx
y \
z
Details:
-v RS='[^\n]*[^\\\\[:space:]][[:blank:]]*(\n|$)': sets the record separator to a regex that matches a line that doesn't end with a \. As a result, each record holds a run of contiguous lines that do end with \.
split($0, lines, /[[:blank:]]+\\[[:blank:]]*\n/): splits each record at each ending \ and the following line break.
In the first for loop, we go through the array elements and determine the longest line length, i.e. mlen.
In the second for loop, we print each line segment using mlen+1 as the field width, to place the trailing \.
Finally we print RT, which is the substring matched by RS.

Concatenate columns and add digits with awk

I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will consist of:
column_1 = a concatenation of old_column_1 + old_column_2 + a number of "0"s equal to (old_column_3 - length of old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9"s equal to (old_column_3 - length of old_column_2), when min_length = max_length. When min_length is not equal to max_length, I need to take into account all the possible lengths. So for the line "42";"32";6;8, the lengths are 6, 7 and 8.
Also, I need to delete the quotation marks everywhere.
I tried with paste and cut like that:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but I think it's easier with awk. However, I can't figure out how to do it. Any help is appreciated. Thanks!
Edit: I have added column 4 (max_length) to the input.
You may use this awk:
awk 'function padstr(ch, len,    s) {
    s = sprintf("%*s", len, "")
    gsub(/ /, ch, s)
    return s
}
BEGIN {
    FS = OFS = ";"
}
{
    gsub(/"/, "")
    for (i=0; i<=($4-$3); i++) {
        d = $3 - length($2) + i
        print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
    }
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
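The padstr helper is the load-bearing trick here: sprintf("%*s", len, "") produces len spaces (the * takes the field width from an argument), and gsub then turns each space into the requested character. Checked in isolation:

```shell
awk 'BEGIN {
  len = 5; ch = "9"
  s = sprintf("%*s", len, "")   # five spaces
  gsub(/ /, ch, s)              # each space becomes a "9"
  print s                       # 99999
}'
```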
With awk:
awk '
BEGIN { FS = OFS = ";" }        # set the field separator and output field separator to ";"
{
    $0 = gensub("\"", "", "g")  # drop double quotes
    s = $1$2                    # the range header number
    l = $3 - length($2)         # number of zeros or nines to be appended
    l = 10^l                    # 10 raised to that number
    print s*l, (s+1)*l - 1      # appending n zeros is multiplication by 10^n;
                                # appending n nines gives s*10^n + (10^n - 1)
}' input.txt
Explanation inline as comments.
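The arithmetic can be checked on a single row: appending n zeros to a number s gives s*10^n, and appending n nines gives (s+1)*10^n - 1. A standalone sketch using the "42";"32" row, where n = 6 - length("32") = 4:

```shell
awk 'BEGIN {
  s = 4232; n = 4             # base 4232 and four digits to fill
  lo = s * 10^n               # four zeros appended
  hi = (s + 1) * 10^n - 1     # four nines appended
  print lo ";" hi             # 42320000;42329999
}'
```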

Compare values in two rows of a specific column

I would like to print lines of a file based on a condition involving the previous line. The condition is:
If the key (field 1 and field 2) of the current line and the previous line is identical, and the difference between field 8 of the current line and field 8 of the previous line is bigger than 1, print the current line and append the difference.
Input file:
47329,39785,2,12,10,351912.50,2533105.56,170.93,1
47329,39785,3,6,7,351912.82,2533105.07,170.89,1
47329,39785,2,12,28,351912.53,2533118.81,172.91,1
47329,39785,3,6,20,351913.03,2533117.41,170.93,1
47329,39797,2,12,10,352063.14,2533117.84,170.66,1
47329,39797,3,6,7,352064.11,2533119.32,170.64,1
47329,39797,2,12,28,352062.77,2533104.67,173.63,1
47329,39797,3,6,20,352063.50,2533107.10,170.69,1
Expected output file:
47329,39785,2,12,28,351912.53,2533118.81,172.91,1,1.98
47329,39797,2,12,28,352062.77,2533104.67,173.63,1,2.94
Lines 3 and 4 have an identical key (47329,39785) and the difference of the values in field 8 is 172.91-170.93=1.98, so we print the line holding 172.91 with the difference appended. The same reasoning applies to lines 6 and 7.
My attempt:
awk -F, 'NR%2{ab = $1 FS $2} ab == ob && $8 - O8 > 1; {ob = ab; O8 = $8}'
I've come up with this script, tested on gawk v5.0.0
BEGIN {
    FS = ","
}
{
    if (NR == 1)
    {
        key1 = $1
        key2 = $2
        field = $8
        # when on the first record, there is nothing to compare with
        next
    }
    if ($1 == key1)
    {
        if ($2 == key2)
        {
            if ($8 - field > 1)
            {
                print $0, $8-field
                # uncomment the following line to print the matching line number
                # print "("NR")", $0, $8-field
            }
        }
    }
    # assign for the next iteration
    key1 = $1
    key2 = $2
    field = $8
}
Tested on your input, it finds:
$ awk -f script.awk test.txt
47329,39785,2,12,28,351912.53,2533118.81,172.91,1 2.02
47329,39797,2,12,28,352062.77,2533104.67,173.63,1 2.99
It matches lines 3 and 7.

awk: one row subtracts the next row if their first two columns are the same

Suppose we would like to subtract $17 between rows whose $1 and $2 are the same. Input:
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
The desired output should be (1.7511-(-1.5812)=3.3323 and -1-(-3)=2):
3.3323
2
First attempt:
awk -F, '
last != $1""$2 && last {   # only when the key "TargetID + Cpd_number" differs
    print C                # from the previous one, print the accumulated value
    C = 0                  # and reset the accumulator
}
{                          # this block processes each line of the input file
    C -= $17               # accumulate into C
    line = $0              # keep the current line
    last = $1""$2          # store the key in order to track switching
}
END {                      # this block triggers after the complete file is read,
                           # to print the last value, which cannot be printed
                           # inside the previous block
    print C
}' input
It will give the output:
-0.1699
4
The second attempt:
#!/bin/bash
tail -n+2 test > test2 # remove the title/header
awk -F, '$1 == $1 && $2 == $2 {print $17}' test2 >> test3 # print $17 if the $1 and $2 are the same
awk 'NR==1{s=$1;next}{s-=$1}END{print s}' test3
rm test2 test3
test3 will contain:
1.7511
-1.5812
-1
-3
Output is
7.3323
Could any guru kindly give some comments? Thanks!
You could try the below awk command,
$ awk -F, 'NR==1{next} {var=$1; foo=$2; bar=$17; getline;} $1==var && $2==foo{xxx=bar-$17; print xxx}' file
3.3323
2
awk '
BEGIN { FS = "," }
NR == 1 { next }             # skip the header line
{                            # accumulate totals
    if (($1,$2) in a)        # if the key already exists,
        a[$1,$2] -= $17      # subtract $17 from its value
    else                     # on the first appearance of this key,
        a[$1,$2] = $17       # set the value to $17
}
END {                        # print results
    for (x in a)
        print a[x]
}
' file
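One caveat (my observation, not raised in the thread): for (x in a) visits keys in an unspecified order, so with many key pairs the differences may not print in input order. A variant that remembers first-seen order, runnable as-is against a sample mirroring the question's rows (header trimmed to generic names for brevity):

```shell
# Sample data: same rows as the question, with a shortened header.
cat > pairs.csv <<'EOF'
targetID,cpd_number,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,log_ratio,c18,c19,c20,c21
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
EOF

awk '
BEGIN { FS = "," }
NR == 1 { next }             # skip header
{
    k = $1 SUBSEP $2
    if (k in a)
        a[k] -= $17          # later row(s) of this key: subtract
    else {
        a[k] = $17           # first row of this key
        order[++n] = k       # remember first-seen order
    }
}
END { for (i = 1; i <= n; i++) print a[order[i]] }
' pairs.csv
```

For the question's data this prints 3.3323 and then 2, matching the desired output, and would keep doing so in input order for any number of key groups.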