Printing lines where certain columns do not match, with awk

I have a tab separated file like this:
1 10502 C T
1 10506 C T
1 10567 G A
...
And I'm trying to print out all lines where column 3 != column 4, excluding the cases where column 3 = C and column 4 = T.
I tried
awk '{
if (($3 == $4) || ($3 == C && $4 == T) )
next ;
else
print $0; }'
but I'm not sure what's going wrong...

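Just fix your code: C and T need quotes. Unquoted, awk reads them as uninitialized variables rather than string literals: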
awk '($3 != $4) && !($3=="C" && $4=="T")' file

this one-liner should work for your file:
awk '($3==$4)||($3 =="C"&&$4=="T"){next}1' input
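To see the difference the quotes make, here is a minimal sketch that runs both forms of the condition on the sample rows. The fourth row (A A) is a hypothetical addition, included only so that the $3 == $4 test has something to reject.

```shell
# Sample rows from the question plus one made-up row with equal columns.
sample=$(printf '1\t10502\tC\tT\n1\t10506\tC\tT\n1\t10567\tG\tA\n1\t10600\tA\tA\n')

# Unquoted C and T are empty variables, so only the $3 == $4 test has any effect
# and the C/T rows are NOT excluded.
unquoted=$(printf '%s\n' "$sample" | awk -F'\t' '!(($3 == $4) || ($3 == C && $4 == T))')

# Quoted "C" and "T" are string literals, so the C/T rows are excluded as intended.
quoted=$(printf '%s\n' "$sample" | awk -F'\t' '($3 != $4) && !($3 == "C" && $4 == "T")')

printf 'unquoted:\n%s\nquoted:\n%s\n' "$unquoted" "$quoted"
```

With the unquoted form the two C/T rows still come through; with the quoted form only the G/A row is printed.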

Changing a list of values in Awk

I am trying to change values in the following list:
A 0.702
B 0.868
C 3.467
D 2.152
If the second column is less than 0.5 I would like to change to -2, between 0.5-1 to -1, between 1-1.5 to 1 and if > 1.5 then to 2.
When I try the following:
awk '$2<0.9 || $2>2' | awk '{if ($2 < 0.5) print $1,-2;}{if($2>0.5 || $2<1) print $1,-1;}{if($2>1 || $2<1.5) print $1,1;}{if($2>2) print $1,2;}'
I get the following:
A -1
A 1
B -1
B 1
C 1
C 2
D 1
D 2
I know I am missing something but for the life of me I can't figure out what - any help gratefully received.
If you use several independent if statements and a value satisfies more than one of them, you get more than one line of output.
To print only the output of the first match, you have to keep the if statements that follow from running.
You can use a single awk and define non-overlapping ranges by combining > and < with &&.
Note that with only > and <, a value that sits exactly on a limit (for example 0.5) will not match any range.
awk '{
if($2 < 0.5) print($1, -2)
if($2 > 0.5 && $2<1) print($1,-1)
if($2 > 1 && $2<1.5) print($1, 1)
if($2 > 1.5) print($1 ,2)
}
' file
Output
A -1
B -1
C 2
D 2
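The boundary caveat is easy to check with a value that sits exactly on a limit. The sketch below uses a hypothetical extra row, E 0.5, which is not part of the question's data. The strict comparisons print nothing for it, while an else-if chain puts every value into exactly one bin. The chain treats each lower limit as inclusive, which is an assumption about what "between" should mean here.

```shell
data=$(printf 'A 0.702\nE 0.5\n')

# Independent ifs with strict comparisons: 0.5 falls between the ranges.
strict=$(printf '%s\n' "$data" | awk '{
  if ($2 < 0.5)            print $1, -2
  if ($2 > 0.5 && $2 < 1)  print $1, -1
  if ($2 > 1 && $2 < 1.5)  print $1, 1
  if ($2 > 1.5)            print $1, 2
}')

# else-if chain: the first matching branch wins, so nothing is skipped or repeated.
chained=$(printf '%s\n' "$data" | awk '{
  if      ($2 < 0.5) bin = -2
  else if ($2 < 1)   bin = -1
  else if ($2 < 1.5) bin = 1
  else               bin = 2
  print $1, bin
}')

printf 'strict:\n%s\nchained:\n%s\n' "$strict" "$chained"
```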
Tested with your shown samples only. Here is one more solution that uses ternary operators for the condition checks (for fun :) ).
awk '{print (NF?($2>1.5?($1 OFS 2):($2>1?($1 OFS 1):($2>0.5?($1 OFS "-1"):($1 OFS "-2")))):"")}' Input_file
A more readable form of the above awk code, with the one-liner broken up over multiple lines:
awk '
{
print \
(\
NF\
?\
($2>1.5\
?\
($1 OFS 2)\
:\
($2>1\
?\
($1 OFS 1)\
:\
($2>0.5\
?\
($1 OFS "-1")\
:\
($1 OFS "-2")\
)\
)\
)\
:\
""\
)
}
' Input_file
Explanation: nested ternary operators test the conditions, and print outputs whichever value the matching branch selects.
Another. Replace <s with <=s where needed:
$ awk '{
if($2<0.5) # from low to higher sets the lower limit
$2=-2
else if($2<1) # so only upper limit needs to be tested
$2=-1
else if($2<1.5)
$2=1
else
$2=2
}1' file
Output:
A -1
B -1
C 2
D 2
Probably overkill for your needs but here's a data-driven approach using GNU awk for arrays of arrays and +/-inf:
$ cat tst.awk
BEGIN {
range["-inf"][0.5] = -2
range[0.5][1] = -1
range[1][1.5] = 1
range[1.5]["+inf"] = 2
}
{
val = ""
for ( beg in range ) {
for ( end in range[beg] ) {
if ( (beg+0 < $2) && ($2 <= end+0) ) {
val = range[beg][end]
}
}
}
print $1, val
}
$ awk -f tst.awk file
A -1
B -1
C 2
D 2
I'm assuming above that "between" excludes the start of the range but includes the end of it. You could make it slightly more efficient with:
for ( beg in range ) {
if ( beg+0 < $2 ) {
for ( end in range[beg] ) {
if ( $2 <= end+0 ) {
val = range[beg][end]
}
}
}
}
but I just like having the range comparison all on 1 line and there's only 1 end for every begin so it doesn't make much difference.
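Arrays of arrays are specific to GNU awk. For other awks, roughly the same data-driven idea can be sketched with two parallel arrays, one for the upper limit of each bin and one for the value to print. The names hi, val and n are made up for this illustration, and the last bin is assumed to have no upper limit.

```shell
binned=$(printf 'A 0.702\nB 0.868\nC 3.467\nD 2.152\n' | awk '
BEGIN {
  # Upper limits (inclusive) in ascending order, and one output value per bin.
  n = split("0.5 1 1.5", hi, " ")
  split("-2 -1 1 2", val, " ")     # one more value than limits: the last bin is open-ended
}
{
  bin = n + 1                      # default: above every limit
  for (i = 1; i <= n; i++)
    if ($2 <= hi[i] + 0) { bin = i; break }
  print $1, val[bin]
}')

printf '%s\n' "$binned"
```

Changing the bins then only means editing the two strings in the BEGIN block.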
UPDATE 1 : new equation should cover nearly all scenarios :
1st half equation handles the sign +/-
2nd half handles the magnitude of the binning
mawk '$NF = (-++_)^(+(__=$NF)<_) * ++_^(int(__+_--^-_)!=_--)'
X -1.25 -2
X -1.00 -2
X -0.75 -2
X -0.50 -2
X -0.25 -2
X 0.00 -2
X 0.25 -2
X 0.50 -1
X 0.75 -1
X 1.00 1
X 1.25 1
X 1.50 2
X 1.75 2
X 2.00 2
X 2.25 2
X 2.50 2
==============================
This may not cover every possible scenario, but if you want a one-liner that covers the samples shown:
mawk '$NF = 4 < (_=int(2*$NF)-2)^2 ? 1+(-3)^(_<-_) :_'
A -1
B -1
C 2
D 2

Concatenating array elements into a one string in for loop using awk

I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into a one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-"; otherwise, set field 4 to the part of the field from position 2 to the end. Do the same with field 5. The trailing 1 is shorthand for printing every line, modified or not.
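For the literal question of how to join array elements into one string inside a loop, a minimal sketch that stays close to the original script: reset y to an empty string instead of using delete y (which makes y an array), and append with y = y a[i]. Splitting on an empty separator to get one character per element works in gawk and mawk but is not guaranteed by POSIX, so treat that part as an assumption about your awk.

```shell
result=$(printf '1\t877803\t838425\tGC\tG\n1\t878077\t966631\tC\tCCACGG\n' | awk '
BEGIN { FS = OFS = "\t" }
/^#/ { next }                                   # skip VCF header lines
{
  m = split($4, a, "")                          # one character per array element
  n = split($5, b, "")
  y = ""                                        # reset the scalar for every record
  if (m > n) {
    for (i = n + 1; i <= m; i++) y = y a[i]     # append each extra character
    print $1, $2, $3, y, "-"
  } else if (n > m) {
    for (i = m + 1; i <= n; i++) y = y b[i]
    print $1, $2, $3, "-", y
  }
}')

printf '%s\n' "$result"
```

Like the original script, this prints nothing for rows where both fields have the same length.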

Compare values in two rows for specific column

I would like to print the lines of file based on a condition with respect the previous line. I would like to implement the following condition:
If the key (field 1 and field2) between the current line and the previous line is identical and the difference between field 8 and field 8 of the previous line is bigger than 1, print the current line and append the difference.
Input file:
47329,39785,2,12,10,351912.50,2533105.56,170.93,1
47329,39785,3,6,7,351912.82,2533105.07,170.89,1
47329,39785,2,12,28,351912.53,2533118.81,172.91,1
47329,39785,3,6,20,351913.03,2533117.41,170.93,1
47329,39797,2,12,10,352063.14,2533117.84,170.66,1
47329,39797,3,6,7,352064.11,2533119.32,170.64,1
47329,39797,2,12,28,352062.77,2533104.67,173.63,1
47329,39797,3,6,20,352063.50,2533107.10,170.69,1
Expected output file:
47329,39785,2,12,28,351912.53,2533118.81,172.91,1,1.98
47329,39797,2,12,28,352062.77,2533104.67,173.63,1,2.94
Lines 3 and 4 have an identical key (47329,39785) and the difference of the values in fields 8 is 172.91-170.93=1.98, so we print line 4. An identical reasoning goes for lines 6 and 7
attempt:
awk -F, 'NR%2{ab = $1 FS $2} ab == ob && $8 - O8 > 1; {ob = ab; O8 = $8}'
I've come up with this script, tested on gawk v5.0.0
BEGIN{
FS=","
}
{
if (NR == 1)
{
key1 = $1
key2 = $2
field = $8
# when on first record, there's nothing to compare with
next
}
if ($1 == key1)
{
if ($2 == key2)
{
if ($8 - field > 1)
{
print $0, $8-field
# uncomment following line to print line match number
# print "("NR")",$0, $8-field
}
}
}
# assign for next iteration
key1 = $1
key2 = $2
field = $8
}
tested on your input, found:
$ awk -f script.awk test.txt
47329,39785,2,12,28,351912.53,2533118.81,172.91,1 2.02
47329,39797,2,12,28,352062.77,2533104.67,173.63,1 2.99
Matches line 3 and 7.
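The same comparison against the immediately preceding line can be sketched more compactly by remembering the previous key and the previous field 8 in two variables (prev_key and prev_val are names chosen for this illustration). This version appends the difference with a comma, as in the expected output, and like the script above it yields 2.02 and 2.99 for the sample input.

```shell
input='47329,39785,2,12,10,351912.50,2533105.56,170.93,1
47329,39785,3,6,7,351912.82,2533105.07,170.89,1
47329,39785,2,12,28,351912.53,2533118.81,172.91,1
47329,39785,3,6,20,351913.03,2533117.41,170.93,1
47329,39797,2,12,10,352063.14,2533117.84,170.66,1
47329,39797,3,6,7,352064.11,2533119.32,170.64,1
47329,39797,2,12,28,352062.77,2533104.67,173.63,1
47329,39797,3,6,20,352063.50,2533107.10,170.69,1'

result=$(printf '%s\n' "$input" | awk -F, '
# Same key as the previous line and field 8 grew by more than 1: print with the difference.
($1 FS $2) == prev_key && ($8 - prev_val) > 1 {
  printf "%s,%.2f\n", $0, $8 - prev_val
}
# Remember key and value for the next line.
{ prev_key = $1 FS $2; prev_val = $8 }')

printf '%s\n' "$result"
```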

How to add numbers from files to computation?

I need to get results of this formula - a column of numbers
{x = ($1-T1)/Fi; print (x-int(x))}
from inputs file1
4 4
8 4
7 78
45 2
file2
0.2
3
2
1
From this files should be 4 outputs.
$1 is the first column from file1, T1 is the first line in first column of the file1 (number 4) - it is alway this number, Fi, where i = 1, 2, 3, 4 are numbers from the second file. So I need a cycle for i from 1 to 4 and compute the term one times with F1=0.2, the second output with F2=3, then third output with F3=2 and the last output will be for F4=1. How to express T1 and Fi in this way and how to do a cycle?
awk 'FNR == NR { F[++n] = $1; next } FNR == 1 { T1 = $1 } { for (i = 1; i <= n; ++i) { x = ($1 - T1)/F[i]; print x - int(x) >"output" FNR} }' file2 file1
This gives more than 4 outputs. What is wrong please?
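Because the FNR == NR { ...; next } rule consumes every line of file2, FNR == 1 { T1 = $1 } only fires on the first line of file1, so T1 itself is fine. The output name is the issue: in awks that read "output" FNR as the file name, FNR is the line number in file1, so you get one file per row of file1 rather than one per value in file2.
>"output" FNR is also problematic on its own, because an unparenthesized concatenation after > is parsed differently by different awks; you should enclose the output name expression in parentheses.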
Here's how I'd do it:
awk '
NR==1 {t1=$1}
NR==FNR {f[++n]=$1; next}
{
fn="output"FNR
for(i=1; i<=n; i++) { # counted loop keeps input order; for(i in f) does not
x=(f[i]-t1)/$1
print x-int(x) >fn
}
close(fn)
}
' file1 file2
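A self-contained sketch of the same idea that can be run in a scratch directory. It stores column 1 of file1 in an array (called v here, a name chosen for this illustration), uses a counted loop so the values come out in input order (for (i in f) does not guarantee any order), and writes one output file per line of file2.

```shell
workdir=$(mktemp -d)
printf '4 4\n8 4\n7 78\n45 2\n' > "$workdir/file1"
printf '0.2\n3\n2\n1\n'        > "$workdir/file2"

( cd "$workdir" && awk '
NR == FNR { v[++n] = $1; next }      # first file: remember column 1 in input order
{
  out = "output" FNR                 # one output file per line of file2
  for (i = 1; i <= n; i++) {
    x = (v[i] - v[1]) / $1           # v[1] plays the role of T1
    print x - int(x) > out
  }
  close(out)
}' file1 file2 )

third=$(cat "$workdir/output3")      # results for F3 = 2
printf '%s\n' "$third"
```

Building the name in a variable first sidesteps the question of how > "output" FNR is parsed.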

What does this awk command do?

What does this awk command do?
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0))
printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt
> depth_concoct.txt
I think
NR > 1 means it starts from the second line,
for(x=1;x<=NF;x++) means for every field,
if(x == 1 || (x >= 4 && x % 2 == 0)) means if x equals 1 or (I don't understand the code from this part onwards)
and I know that the input file for awk is depth.txt and the output of awk will be saved to depth_concoct.txt.
What does the code in the middle mean?
$ awk '
NR > 1 { # starting from the second record
for(x=1;x<=NF;x++) # iterate every field
if(x == 1 || (x >= 4 && x % 2 == 0)) # for 1st, 4th and every even-numbered field after 4th
printf "%s", # print the field and after it
$x (x == NF || x == (NF-1) ? "\n":"\t") # a newline if it is the last field to be printed, otherwise a tab
}' depth.txt > depth_concoct.txt
(x == NF || x == (NF-1) ? "\n":"\t") is the conditional (ternary) operator; in this context it is basically a streamlined version of:
if( x == NF || x == (NF-1) ) # if this is the last field to be printed
printf "\n" # finish the record with a newline
else # else
printf "\t" # print a tab after the field
You can rewrite it as below, which should be trivial to read.
$ awk 'NR>1 {printf "%s", $1;
for(x=4;x<=NF;x+=2) printf "\t%s", $x;
print ""}' inputfile > outputfile
The complexity of the code is sometimes just an implementation detail.
It prints the first field and every second field starting from the 4th.
Assuming your file has 8 fields, this is equivalent to
$ awk -v OFS='\t' 'NR>1{print $1,$4,$6,$8}' inputfile > outputfile
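The equivalence can be checked on a tiny made-up input. The two rows below are hypothetical, an 8-column header and one data row, and are not taken from any real depth.txt.

```shell
sample=$(printf 'h1\th2\th3\th4\th5\th6\th7\th8\nc1\t100\t5\t1.5\t0.1\t2.5\t0.2\t3.5\n')

# The command from the question, unchanged apart from reading stdin.
original=$(printf '%s\n' "$sample" | awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}')

# The fixed-column version, valid only when there are exactly 8 fields.
rewritten=$(printf '%s\n' "$sample" | awk -v OFS='\t' 'NR>1{print $1,$4,$6,$8}')

printf 'original:  %s\nrewritten: %s\n' "$original" "$rewritten"
```

Both commands skip the header and print fields 1, 4, 6 and 8 of the data row, separated by tabs.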