I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into a one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-" otherwise, set field 4 to the extract of the field from position 2 to the end of the field. Do the same with field 5. Print lines modified or not with short hand 1.
I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will be consisting of:
column_1 = a concatenation of old_column_1 + old_column_2 + a number
of "0" equal to (old_column_3 - length of the old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9" equal
to (old_column_3 - length of the old_column_2) , when min_length = max_length. And when min_length is not equal with max_length , I need to take into account all the possible lengths. So for the line "42";"32";6;8 , all the lengths are: 6,7 and 8.
Also, i need to delete the quotation mark everywhere.
I tried with paste and cut like that:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but i think with awk its easier. However, i can't figure out how to do it. Any help it's apreciated. Thanks!
Edit: Actually, added column 4 in input.
You may use this awk:
awk 'function padstr(ch, len, s) {
s = sprintf("%*s", len, "")
gsub(/ /, ch, s)
return s
}
BEGIN {
FS=OFS=";"
}
{
gsub(/"/, "");
for (i=0; i<=($4-$3); i++) {
d = $3 - length($2) + i
print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
}
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
With awk:
awk '
BEGIN{FS = OFS = ";"} # set field separator and output field separator to be ";"
{
$0 = gensub("\"", "", "g"); # Drop double quotes
s = $1$2; # The range header number
l = $3-length($2); # Number of zeros or 9s to be appended
l = 10^l; # Get 10 raised to that number
print s*l, (s+1)*l-1; # Adding n zeros is multiplication by 10^n
# Adding n nines is multipliaction by 10^n + (10^n - 1)
}' input.txt
Explanation inline as comments.
Print all the files which are present in maximum depth
for example
abc/1/2/3/4/r.txt
abc/1/f1.txt
abc/11/22/44/66/77/f2.txt
abc/11/22/44/66/77/f4.txt
abc/11/22/44/66/77/f5.txt
so this would print
abc/11/22/44/66/77/f2.txt
abc/11/22/44/66/77/f4.txt
abc/11/22/44/66/77/f5.txt
I have written this command
$cat listoffiles.txt | awk -F "/" ' { if ( NF > x ) { x = NF; y = $0 } }END{ print y }'
but this is printing only the first occurrence.
Keep buffering deepest files and discard them whenever the max depth changes. At the end, dump what's in the buffer.
awk -F'/+' 'NF>max{max=NF;delete buf} NF==max{buf[$0]} END{for(f in buf) print f}' file
What does this awk command do?
awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0))
printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' depth.txt
> depth_concoct.txt
I think
NR > 1 means it starts from second line,
for(x=1;x<=NF;x++) means for every fields,
if(x == 1 || (x >= 4 && x % 2 == 0)) means if x equals 1 or (I don' understand the codes from this part and so on)
and I know that the input file for awk is depth.txt and the output of awk will be saved to depth_concoct.txt.
What does the codes in the middle mean?
$ awk '
NR > 1 { # starting from the second record
for(x=1;x<=NF;x++) # iterate every field
if(x == 1 || (x >= 4 && x % 2 == 0)) # for 1st, 4th and every even-numbered field after 4th
printf "%s", # print the field and after it
$x (x == NF || x == (NF-1) ? "\n":"\t") # a tab or a newline if its the last field
}' depth.txt > depth_concoct.txt
(x == NF || x == (NF-1) ? "\n":"\t") is called conditional operator, in this context it's basically streamlined version of:
if( x == NF || x == (NF-1) ) # if this is the last field to be printed
printf "\n" # finish the record with a newline
else # else
printf "\t"` # print a tab after the field
you can rewrite it as below, which should be trivial to read.
$ awk `NR>1 {printf "%s", $1;
for(x=4;x<=NF;x+=2) printf "\t%s", $x;
print ""}' inputfile > outputfile
the complexity of the code is sometimes just an implementation detail.
prints first and every second field starting from the 4th.
Assume your file has 8 fields, this is equivalent to
$ awk -v OFS='\t' 'NR>1{print $1,$4,$6,$8}' inputfile > outputfile
If we would like to substract $17 if their $1 & $2 are the same: input
targetID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
111,CPD-123456,2222,1111,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,1.7511,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-123456,2222,1111,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-1.5812,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,6.1,uM,1183,1265,Ki,,0.16,uM,,38.125,-1,2003-03-03 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
111,CPD-777888,3333,4444,IC50,,9.02053,uM,1183,1265,Ki,,0.16,uM,,56.3783,-3,2003-02-27 00:00:00,2003-02-10 00:00:00,Cell,Enzyme
The desired output should be (1.7511-(-1.5812)=3.3323); (-1-(-3)=2)
3.3323
2
First attempt:
awk -F, ' last != $1""$2 && last{ # ONLY When last key "TargetID + Cpd_number"
print C # differs from actual , print line + substraction
C=0} # reset acumulators
{ # This block process each line of infile
C -= $17 # C calc
line=$0 # Line will be actual line without activity
last=$1""$2} # Store the key in orther to track switching
END{ # This block triggers after the complete file read
# to print the last average that cannot be trigger during
# the previous block
print C}' input
It will give the output:
-0.1699
4
The second attempt:
#!/bin/bash
tail -n+2 test > test2 # remove the title/header
awk -F, '$1 == $1 && $2 == $2 {print $17}' test2 >> test3 # print $17 if the $1 and $2 are the same
awk 'NR==1{s=$1;next}{s-=$1}END{print s}' test3
rm test2 test3
test3 will be
1.7511
-1.5812
-1
-3
Output is
7.3323
Could any guru kindly give some comments? Thanks!
You could try the below awk command,
$ awk -F, 'NR==1{next} {var=$1; foo=$2; bar=$17; getline;} $1==var && $2==foo{xxx=bar-$17; print xxx}' file
3.3323
2
awk '
BEGIN { FS = "," }
NR == 1 { next } # skip header line
{ # accumulate totals
if ($1 SUBSEP $2 in a) # if key already exists
a[$1,$2] -= $17 # subtract $17 from value
else # if first appearance of this key
a[$1,$2] = $17 # set value to $17
}
END { # print results
for (x in a)
print a[x]
}
' file