I have a bunch of variables which look like this:
(DURATION 1.57) + (DURATION 2.07)
(T0 10) (T1 0) (TX 0) + (T0 12) (T1 0) (TX 1)
(TC 1) (IG 0) + (TC 2) (IG 3)
Is it possible to have awk process this such that the result is:
(DURATION 3.64)
(T0 22) (T1 0) (TX 1)
(TC 3) (IG 3)
Or can anyone recommend another unix program I can use to do this?
Here is one way to do it:
awk '{
gsub(/[()+]/, "")
for(nf=1; nf<=NF; nf+=2) {
flds[$nf] += $(nf+1)
}
sep = ""
for(fld in flds) {
printf "%s(%s %g)", sep, fld, flds[fld]
sep = FS
}
print "";
delete flds
}' file
(DURATION 3.64)
(T0 22) (T1 0) (TX 1)
(TC 3) (IG 3)
We remove the special characters ()+ using the gsub() function.
We iterate over the fields in pairs, using each variable name as an array key and adding its value to that key's running total.
We iterate over the array, printing each entry in the desired format.
We print a newline after we are done printing.
We delete the array so that we can reuse it on the next line.
Note: The lines come out in the same order as in the input file, but because the loop uses the in operator, the variables on each line may appear in arbitrary order.
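If the order of the variables within each line matters, one option is to record the first-seen order in a second array. This is only a sketch of that idea, not part of the answer above: the function name sum_fields and the order array are names made up for illustration.

```shell
# Hypothetical helper: sums "(NAME value)" pairs per line, keeping first-seen order.
sum_fields() {
  awk '{
    gsub(/[()+]/, "")                          # strip parentheses and plus signs
    n = 0
    for (i = 1; i < NF; i += 2) {
      if (!($i in sum)) order[++n] = $i        # remember the first-seen order of names
      sum[$i] += $(i + 1)
    }
    line = ""
    for (k = 1; k <= n; k++)
      line = line (k > 1 ? " " : "") sprintf("(%s %g)", order[k], sum[order[k]])
    print line
    split("", sum); split("", order)           # portable way to empty both arrays
  }' "$@"
}

printf '%s\n' '(T0 10) (T1 0) (TX 0) + (T0 12) (T1 0) (TX 1)' | sum_fields
```

It reads from standard input or from files given as arguments, e.g. sum_fields file.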
I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the substring of field 4 starting at position 2 is empty, field 4 is set to "-"; otherwise field 4 is set to that substring. The same is done for field 5. The trailing 1 is shorthand for printing every line, modified or not.
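For the literal question of how to concatenate array elements into one string: awk joins strings by juxtaposition, so a scalar accumulator that starts as "" is enough (the error comes from delete y, which turns y into an array). The following is a sketch of the original loop approach under that change. The function name trim_alleles is made up, the character arrays are built with substr to stay portable, and passing equal-length lines through unchanged is an assumption.

```shell
# Hypothetical helper: keeps only the extra trailing characters of the longer allele.
trim_alleles() {
  awk 'BEGIN { OFS = "\t" }
  /^#/ { next }                                  # skip VCF header lines
  {
    m = length($4); n = length($5)
    for (i = 1; i <= m; i++) a[i] = substr($4, i, 1)
    for (i = 1; i <= n; i++) b[i] = substr($5, i, 1)
    y = ""                                       # scalar accumulator, reset per line
    if (m > n) {
      for (i = n + 1; i <= m; i++) y = y a[i]    # append by juxtaposition
      print $1, $2, $3, y, "-"
    } else if (n > m) {
      for (i = m + 1; i <= n; i++) y = y b[i]
      print $1, $2, $3, "-", y
    } else {
      print $1, $2, $3, $4, $5                   # equal lengths: pass through (assumption)
    }
  }' "$@"
}

printf '%s\n' '1 877803 838425 GC G' '1 878077 966631 C CCACGG' | trim_alleles
```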
Is there a way to get only the ids that are contained in the stream? something like an XKEYS command?
XKEYS "test:stream"
=>
1599031407838-0
1599031407839-0
There is no single Redis command that returns only the IDs.
You can get them with a Lua script, run through the EVAL command.
Using the XRANGE command, you get the ids and the field-value pairs.
> XRANGE streamkey - +
1) 1) "1599077066502-0"
2) 1) "fielda"
2) "valuea"
3) "fieldb"
4) "valueb"
2) 1) "1599077076318-0"
2) 1) "fielda"
...
In a Lua Script you can discard the field-value pairs from the response, leaving just the IDs. This way at least you reduce the size of the response saving on network payload and Client Output Buffers.
This script would get you started:
local resp = redis.call('XRANGE', KEYS[1], ARGV[1], ARGV[2])
for key,value in ipairs(resp) do
resp[key] = value[1]
end
return resp
Use as
EVAL "local resp = redis.call('XRANGE', KEYS[1], ARGV[1], ARGV[2]) for key,value in ipairs(resp) do resp[key] = value[1] end return resp" 1 streamkey - +
with the key, start, and end of your choice as parameters.
You get as response:
EVAL "local resp ... return resp" 1 streamkey - +
1) "1599077066502-0"
2) "1599077076318-0"
3) "1599077085694-0"
4) ...
I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will be consisting of:
column_1 = a concatenation of old_column_1 + old_column_2 + a number
of "0" equal to (old_column_3 - length of the old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9" equal
to (old_column_3 - length of the old_column_2) , when min_length = max_length. And when min_length is not equal with max_length , I need to take into account all the possible lengths. So for the line "42";"32";6;8 , all the lengths are: 6,7 and 8.
Also, I need to delete the quotation marks everywhere.
I tried with paste and cut like that:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but I think it's easier with awk. However, I can't figure out how to do it. Any help is appreciated. Thanks!
Edit: Actually, added column 4 in input.
You may use this awk:
awk 'function padstr(ch, len, s) {
s = sprintf("%*s", len, "")
gsub(/ /, ch, s)
return s
}
BEGIN {
FS=OFS=";"
}
{
gsub(/"/, "");
for (i=0; i<=($4-$3); i++) {
d = $3 - length($2) + i
print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
}
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
With awk:
awk '
BEGIN{FS = OFS = ";"} # set field separator and output field separator to be ";"
{
$0 = gensub("\"", "", "g"); # Drop double quotes
s = $1$2; # The range header number
l = $3-length($2); # Number of zeros or 9s to be appended
l = 10^l; # Get 10 raised to that number
print s*l, (s+1)*l-1; # Adding n zeros is multiplication by 10^n
# Adding n nines is multiplication by 10^n, plus (10^n - 1)
}' input.txt
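Explanation inline as comments. Note that gensub() is specific to gawk, and that this version uses only min_length (column 3), so it prints one range per input line.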
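The arithmetic idea can also be looped over every length from min_length to max_length. This is a sketch rather than a tested drop-in replacement: the function name expand_ranges is made up, gsub stands in for gensub so it runs on any awk, and rows whose third column is not a number (such as a header row) are skipped on the assumption that they carry no data.

```shell
# Hypothetical helper: prints one "low;high" range per allowed length.
expand_ranges() {
  awk 'BEGIN { FS = OFS = ";" }
  {
    gsub(/"/, "")                                # drop double quotes
    if ($3 !~ /^[0-9]+$/) next                   # skip a header row, if any
    prefix = $1 $2
    for (len = $3; len <= $4; len++) {
      scale = 10 ^ (len - length($2))            # appending k digits multiplies by 10^k
      printf "%.0f;%.0f\n", prefix * scale, (prefix + 1) * scale - 1
    }
  }' "$@"
}

printf '%s\n' '"42";"32";6;8' | expand_ranges
```

The "%.0f" format is used instead of "%d" so that values beyond the 32-bit range print in full on awk implementations with a limited "%d".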
I have to calculate the time complexity, or theoretical running time, of an algorithm (given the pseudocode) line by line as T(n). I've given it a try, but a couple of things are confusing me. For example, what is the time complexity of an "if" statement? And how do I deal with nested loops? The code is below, along with my attempt in the comments.
length[A] = n
for i = 0 to length[A] - 1 // n - 1
k = i + 1 // n - 2
for j = 1 + 2 to length[A] // (n - 1)(n - 3)
if A[k] > A[j] // 1(n - 1)(n - 3)
k = j // 1(n - 1)(n - 3)
if k != i + 1 // 1(n - 1)
temp = A[i + 1] // 1(n - 1)
A[i + 1] = A[k] // 1(n - 1)
A[k] = temp // 1(n - 1)
Blender is right, the result is O(n^2): two nested loops that each have an iteration count dependent on n.
A longer explanation:
The if, in this case, does not really matter: for a worst-case bound, you simply choose the execution path that is worse for the overall execution time. Since, in your example, neither execution path (k != i + 1 being true or false) changes how the runtime grows with n, you can disregard it. If there were a third nested loop, also running to n, inside the if, you'd end up with O(n^3).
A line-by-line overview:
for i = 0 to length[A] - 1 // n + 1 [1]
k = i + 1 // n
for j = 1 + 2 to length[A] // (n)(n - 3 + 1) [1]
if A[k] > A[j] // (n)(n - 3)
k = j // (n)(n - 3)*x [2]
if k != i + 1 // n
temp = A[i + 1] // n*y [2]
A[i + 1] = A[k] // n*y
A[k] = temp // n*y
[1] The for loop statement will be executed n+1 times with the following values for i: 0 (true, continue loop), 1 (true, continue loop), ..., length[A] - 1 (true, continue loop), length[A] (false, break loop)
[2] Without knowing the data, you have to guess how often the if's condition is true. This guess can be expressed mathematically by introducing a variable 0 <= x <= 1 (and likewise 0 <= y <= 1 for the second if). This is in line with what I said before: x and y are independent of n and therefore influence the overall runtime complexity only as constant factors; you only need to look at the execution paths.
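Adding up the per-line counts from the overview gives a closed form. The sketch below assumes that every line costs one unit each time it executes (an assumption, not something stated above), and takes x and y to be the fractions of times the first and second if conditions hold:

```latex
\begin{aligned}
T(n) &= (n+1) + n + n(n-2) + n(n-3) + x\,n(n-3) + n + 3y\,n \\
     &= (2+x)\,n^2 + (3y - 3x - 2)\,n + 1, \qquad 0 \le x, y \le 1 \\
     &= O(n^2)
\end{aligned}
```

Whatever values x and y take, they only change the constants in front of n^2 and n, which is why the ifs do not affect the final bound.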
I have two columns:
100011780 100016332
10100685 10105465
101190948 101195542
101286838 101288018
101411746 101413662
101686767 101718138
101949793 101950504
101989424 101993757
102095320 102106147
102133372 102143125
I want to get the middle value of those numbers.
Tried to:
awk '{print $1"\t"$2-$1}' input | awk '{print $1"\t"$2/2}' | awk '{print $1+$2}' > output
But after the division by 2 some numbers are no longer whole, and probably because of that my output looks like this:
100014056
10103075
101193245
101287428
101412704
1.01702e+08
1.0195e+08
1.01992e+08
1.02101e+08
1.02138e+08
Maybe it's possible to detect a non-whole value and add or subtract 0.5 to make it whole?
You certainly don't need to call awk 3 times to get the average of two numbers.
awk '{printf("%d\n", ($1+$2)/2)}' input
Use printf() to control the output.
100014056
10103075
101193245
101287428
101412704
101702452
101950148
101991590
102100733
102138248
You can add, and use, this round function in your AWK file:
function round(x,    ival, aval, fraction) {   # extra parameters act as local variables
ival = int(x);
if (ival == x)
return x;
if (x < 0) {
aval = -x;
ival = int(aval);
fraction = aval - ival;
if (fraction >= .5)
return int(x) - 1;
else
return int(x);
} else {
fraction = x - ival;
if (fraction >= .5)
return ival + 1;
else
return ival;
}
}
For example, the avg value will be:
{print round(($1+$2)/2)}
Not sure what you want when the sum is odd, but you could do it all in one go:
gawk '{printf "%i\n", ($1 + $2) / 2}' input
What you are looking for are format control options to printf.
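The scientific notation in the question comes from print, which formats non-integer numbers with OFMT (default "%.6g"). A "%d" format avoids that but truncates the .5. If rounding half up is wanted instead, a compact sketch is to add 0.5 before truncating. The function name midpoint_rounded is made up, and the trick as written is only valid for non-negative values.

```shell
# Hypothetical helper: midpoint of two columns, rounded half up.
midpoint_rounded() {
  awk '{
    mid = ($1 + $2) / 2
    printf "%d\n", int(mid + 0.5)    # half-up rounding; valid for mid >= 0
  }' "$@"
}

printf '%s\n' '101686767 101718138' | midpoint_rounded
```

For the sample pair this prints 101702453, where the truncating version prints 101702452.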