What does this awk command do?

What does this awk command do? - awk

What does this awk command do?
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0))
printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt
> depth_concoct.txt
I think
NR > 1 means it starts from second line,
for(x=1;x<=NF;x++) means for every fields,
if(x == 1 || (x >= 4 && x % 2 == 0)) means if x equals 1 or (I don' understand the codes from this part and so on)
and I know that the input file for awk is depth.txt and the output of awk will be saved to depth_concoct.txt.
What does the codes in the middle mean?

$ awk '
NR > 1 { # starting from the second record
for(x=1;x<=NF;x++) # iterate every field
if(x == 1 || (x >= 4 && x % 2 == 0)) # for 1st, 4th and every even-numbered field after 4th
printf "%s", # print the field and after it
$x (x == NF || x == (NF-1) ? "\n":"\t") # a tab or a newline if its the last field
}' depth.txt > depth_concoct.txt
(x == NF || x == (NF-1) ? "\n":"\t") is called conditional operator, in this context it's basically streamlined version of:
if( x == NF || x == (NF-1) ) # if this is the last field to be printed
printf "\n" # finish the record with a newline
else # else
printf "\t"` # print a tab after the field

you can rewrite it as below, which should be trivial to read.
$ awk `NR>1 {printf "%s", $1;
for(x=4;x<=NF;x+=2) printf "\t%s", $x;
print ""}' inputfile > outputfile
the complexity of the code is sometimes just an implementation detail.
prints first and every second field starting from the 4th.
Assume your file has 8 fields, this is equivalent to
$ awk -v OFS='\t' 'NR>1{print $1,$4,$6,$8}' inputfile > outputfile

Related

Concatenating array elements into a one string in for loop using awk

I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into a one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!

You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file

You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-" otherwise, set field 4 to the extract of the field from position 2 to the end of the field. Do the same with field 5. Print lines modified or not with short hand 1.

Compare values in two rows fo specific column

I would like to print the lines of file based on a condition with respect the previous line. I would like to implement the following condition:
If the key (field 1 and field2) between the current line and the previous line is identical and the difference between field 8 and field 8 of the previous line is bigger than 1, print the current line and append the difference.
Input file:
47329,39785,2,12,10,351912.50,2533105.56,170.93,1
47329,39785,3,6,7,351912.82,2533105.07,170.89,1
47329,39785,2,12,28,351912.53,2533118.81,172.91,1
47329,39785,3,6,20,351913.03,2533117.41,170.93,1
47329,39797,2,12,10,352063.14,2533117.84,170.66,1
47329,39797,3,6,7,352064.11,2533119.32,170.64,1
47329,39797,2,12,28,352062.77,2533104.67,173.63,1
47329,39797,3,6,20,352063.50,2533107.10,170.69,1
Expected output file:
47329,39785,2,12,28,351912.53,2533118.81,172.91,1,1.98
47329,39797,2,12,28,352062.77,2533104.67,173.63,1,2.94
Lines 3 and 4 have an identical key (47329,39785) and the difference of the values in fields 8 is 172.91-170.93=1.98, so we print line 4. An identical reasoning goes for lines 6 and 7
attempt:
awk -F, 'NR%2{ab = $1 FS $2} ab == ob && $8 - O8 > 1; {ob = ab; O8 = $8}'

I've come up with this script, tested on gawk v5.0.0
BEGIN{
FS=","
}
{
if (NR == 1)
{
key1 = $1
key2 = $2
field = $8
# when on first record, there's nothing to compare with
next
}
if ($1 == key1)
{
if ($2 == key2)
{
if ($8 - field > 1)
{
print $0, $8-field
# uncomment following line to print line match number
# print "("NR")",$0, $8-field
}
}
}
# assign for next iteration
key1 = $1
key2 = $2
field = $8
}
tested on your input, found:
$ awk -f script.awk test.txt
47329,39785,2,12,28,351912.53,2533118.81,172.91,1 2.02
47329,39797,2,12,28,352062.77,2533104.67,173.63,1 2.99
Matches line 3 and 7.

Use row 1 column ith as output filename awk

I'm a very recent command line user thus I'm requiring some help to split a text file by columns using awk. The difficulty for me is that I want the ith filename to be the text from the 1st row of the ith column.
This is what I had in mind:
awk '{for(i = 2; i <= NF; i++){name= ??FNR == 1 $i?? ;print $1, $i > name}}' myfile.txt
But I don't know how to set the name variable...
Input: myfile.txt
'ID' 'sample_1' 'sample_2' ...
'id_1' 1 2 ...
'id_2' 2 3 ...
Excpected output:
sample_1.txt:
'ID' 'sample_1'
'id_1' 1
'id_2' 2
sample_2.txt:
'ID' 'sample_2'
'id_1' 2
'id_2' 3
Thanks

You should keep column headers in an array.
awk 'NR==1 {
for (i=2; i<=NF; ++i) {
fnames[i] = gensub(/\x27/, "", "g", $i)
print $1, $i > fnames[i] ".txt"
}
next
}
{
for (i=2; i<=NF; ++i)
print $1, "\x27" $i "\x27" > fnames[i] ".txt"
}' myfile.txt
\x27 is single quote in hex-escaped form
gensub(/\x27/, "", "g", $i) removes single quotes from column headers to name output files as you wanted.

You can try this awk :
awk -F'\t' ' # tab as field separator
{
for ( i = 2 ; i <= NF ; i++ ) { # for each record loop from field 2 to last field
if ( NR == 1 ) { # if first record
a[i] = $i # keep each field in array a
gsub ( /^'\''|'\''$/ , "" , a[i] ) # remove quote at start and end in array a
}
print $1 FS $i > a[i]".txt" # print needed field in corresponding file
}
}' myfile.txt

awk one row substracts the next row if their first two colums are the same

If we would like to substract $17 if their $1 & $2 are the same: input
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
The desired output should be (1.7511-(-1.5812)=3.3323); (-1-(-3)=2)
3.3323
2
First attempt:
awk -F, ' last != $1""$2 && last{ # ONLY When last key "TargetID + Cpd_number"
print C # differs from actual , print line + substraction
C=0} # reset acumulators
{ # This block process each line of infile
C -= $17 # C calc
line=$0 # Line will be actual line without activity
last=$1""$2} # Store the key in orther to track switching
END{ # This block triggers after the complete file read
# to print the last average that cannot be trigger during
# the previous block
print C}' input
It will give the output:
-0.1699
4
The second attempt:
#!/bin/bash
tail -n+2 test > test2 # remove the title/header
awk -F, '$1 == $1 && $2 == $2 {print $17}' test2 >> test3 # print $17 if the $1 and $2 are the same
awk 'NR==1{s=$1;next}{s-=$1}END{print s}' test3
rm test2 test3
test3 will be
1.7511
-1.5812
-1
-3
Output is
7.3323
Could any guru kindly give some comments? Thanks!

You could try the below awk command,
$ awk -F, 'NR==1{next} {var=$1; foo=$2; bar=$17; getline;} $1==var && $2==foo{xxx=bar-$17; print xxx}' file
3.3323
2

awk '
BEGIN { FS = "," }
NR == 1 { next } # skip header line
{ # accumulate totals
if ($1 SUBSEP $2 in a) # if key already exists
a[$1,$2] -= $17 # subtract $17 from value
else # if first appearance of this key
a[$1,$2] = $17 # set value to $17
}
END { # print results
for (x in a)
print a[x]
}
' file

printing lines where certain columns do not match, with awk

I have a tab separated file like this:
1 10502 C T
1 10506 C T
1 10567 G A
...
And I'm trying to print out all lines where column 3 != column 4, excluding the cases where column 3 = C and column 4 = T.
I tried
awk '{
if (($3 == $4) || ($3 == C && $4 == T) )
next ;
else
print $0; }'
but I'm not sure what's going wrong...

just fix your codes:
awk '($3 != $4) && !($3=="C" && $4=="T")' file

this one-liner should work for your file:
awk '($3==$4)||($3 =="C"&&$4=="T"){next}1' input

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

What does this awk command do? - awk

Related

Concatenating array elements into a one string in for loop using awk

Compare values in two rows fo specific column

Use row 1 column ith as output filename awk

awk one row substracts the next row if their first two colums are the same

printing lines where certain columns do not match, with awk

Categories

Resources