How do I sum values of a column cumulatively with awk?

I have a sample.csv and want to sum it cumulatively by column, as below:
Input csv:
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 1, 3
01/05/2022, 8,21, 9, 4

Output csv:
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 3, 4
01/05/2022,15,26,12, 8
I've tried
awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' sample.csv
But it returns this instead:
Input csv:
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 1, 3
01/05/2022, 8,21, 9, 4

Output csv:
01/01/2020, 0, 0, 2, 1, 0, 0, 0, 0, 0
18/04/2022, 7, 5, 1, 3, 0, 0, 0, 0, 0
01/05/2022, 8,21, 9, 4, 0, 0, 0, 0, 0
I'm at a loss as to how to resolve this.
Note: I am writing this in a bash script, not at the terminal, and I'm not allowed to use any tools other than awk for this.

I can't duplicate your output. Other than whitespace mangling, this seems to do what you want; the key differences are that FS and OFS are set to a comma (so the commas are kept on output) and the loop starts at field 2 (so the date field is never summed):
awk '{ for (i=2; i<=NF; i+=1) {
           sum[i]+=$i; $(i)=sum[i];
       }; print $0 }' FS=, OFS=, sample.csv
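Run against the sample above, that prints something like this (right numbers, but the column padding is gone because OFS is a bare comma):
01/01/2020,0,0,2,1
18/04/2022,7,5,3,4
01/05/2022,15,26,12,8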
To get the whitespace you want, you could do:
awk '{
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i; $(i)=sum[i];
    }
    printf "%s,%2d,%2d,%2d,%2d\n", $1, $2, $3, $4, $5
}' FS=, sample.csv
If you don't know the number of columns, you could write that final printf in a loop.
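A sketch of that loop-based variant, keeping the two-character padding used above without hard-coding the column count:
awk '{
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i; $(i)=sum[i];
    }
    printf "%s", $1
    for (i=2; i<=NF; i+=1) printf ",%2d", $i
    printf "\n"
}' FS=, sample.csv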

Tested and confirmed working on
gawk 5.1.1,
mawk 1.3.4,
mawk 1.9.9.6, and
macOS nawk
——————————————————————————————————————————————
# gawk profile, created Thu May 19 15:59:38 2022
function fmt(_) {
return +_<=_^(_<_) \
? "" : $_ = sprintf("%5.f",___[_]+=$_)
}
BEGIN { split(sprintf("%0*.f",((__=++_)+(++_*_)\
)^++_,!_),___,"""")
OFS = ", "
FS = "[,][ ]*"
} { _ = NF
while(_!=__) { fmt(_--) } }_'
——————————————————————————————————————————————
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 3, 4
01/05/2022, 15, 26, 12, 8

Related

Text manipulation using AWK

Input file (example_file.txt):
chr20:1000026:T:C, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180
Desired output (using awk):
chr20:1000026:T:C, C, T, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, G, A, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180
I can get the desired output with:
awk '{print $1}' example_file.txt > file1.tmp
awk -F: '{print $4",", $3","}' example_file.txt > file2.tmp
awk '{print $2, $3, $4, $5, $6, $7, $8, $9, $10, $11}' example_file.txt > file3.tmp
paste file1.tmp file2.tmp file3.tmp > output.file
output.file:
chr20:1000026:T:C, C, T, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, G, A, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180
but this method is fragmented and tedious, and the actual input files have far more than 11 columns.
Splitting $1 and prepending its parts to $2:
$ awk '
BEGIN {
    FS=OFS=", "          # proper field delimiters
}
{
    n=split($1,a,/:/)    # get parts of first field
    for(i=3;i<=n;i++)    # from the 3rd part on
        $2=a[i] OFS $2   # prepend to 2nd field
}1' file                 # output
Output:
chr20:1000026:T:C, C, T, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, G, A, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180

RegEx help in an awk script

So I have a log file that contains entries like this:
[STAT] - December 11, 2017 13:16:05.360
.\something.cpp(99): [Text] Code::Open Port 1, baud 9600, parity 0, Stop bits 0, flow control 0
[STAT] - December 11, 2017 13:20:24.637
.\something\more\code.cpp(100): [log]
fooBarBaz[32] = 32, 1, 2, 7, 3, 1092, 5, 196875, 6, 270592, 20, 196870, 8, 289, 30, 196867, 11, 1156, 5, 196875, 28, 278784, 5, 196874, 32, 266496, 30, 6866, 36, 147712, 5, 196874,
[STAT] - December 11, 2017 13:20:40.939
.\something\more\code.cpp(100): [log]
fooBarBaz[8] = 8, 1, 2, 1, 31, 532992, 5, 196875,
[STAT] - December 11, 2017 13:18:16.214
.\something\more\code.cpp(100): [log]
fooBarBaz[12] = 12, 1, 2, 2, 17, 296960, 10, 196872, 51, 1792, 50, 196878,
On the command line, I can do this:
gawk -F', *' '/fooBarBaz\[[^0].*\]/ {for (f=5; f<=NF; f+=4) print $f | "sort -n" }' log
Which produces an output like this:
3
6
8
11
17
28
31
32
36
51
I'd like to have an awk script do the same thing, but my efforts so far haven't worked.
#!/usr/local/bin/gawk -f
BEGIN { print "lines"
FS=", *";
/fooBarBaz\[[^0].*\]/
}
{
{for (f=5; f<=NF; f+=4) print $f}
}
I don't think my regular expression statement is in the right place, because running gawk -f script.awk prints lines not relevant to my data.
What am I doing wrong?
tl;dr: On lines with fooBarBaz and not [0], I want to parse the digits starting at position 5 and then position 4 to the end of the line.
Optimized GNU awk solution:
parse_digits.awk script:
#!/bin/awk -f
BEGIN{
    FS=", *";
    PROCINFO["sorted_in"]="#ind_num_asc";
    print "lines";
}
/fooBarBaz\[[1-9]+[0-9]*\]/{
    for (i=5; i <= NF; i+=4)
        if ($i != "") a[$i]
}
END{
    for (i in a) print i
}
Usage:
awk -f parse_digits.awk inputfile
The output:
lines
3
6
8
11
17
28
31
32
36
51
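Note that PROCINFO["sorted_in"] is gawk-specific. If this ever has to run on a non-GNU awk, one possible fallback (a sketch reusing the pipe-to-sort idea from the original one-liner; unlike the array version it does not de-duplicate values) is:
awk -F', *' '
    BEGIN { print "lines" }
    /fooBarBaz\[[1-9][0-9]*\]/ {
        for (f=5; f<=NF; f+=4)
            if ($f != "") print $f | "sort -n"
    }' log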

JAGS: prior distribution inside a distribution

This is our first model:
# Data:
x1 = as.factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1))
x2 = as.factor(c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1))
x3 = as.factor(c(0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1))
x4 = as.factor(c(0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
n = rep(1055, 16)
y = c(239, 31, 15, 11, 7, 5, 18, 100, 262, 32, 38, 32, 8, 7, 16, 234)
# Model:
mymodel = function(){
  for (i in 1:16){
    y[i] ~ dbin(theta[i], n[i])
    eta[i] <- gamma*x1[i]+beta1*x2[i]+beta2*x3[i]+beta3*x4[i]
    theta[i] <- 1/(1+exp(-eta[i]))
  }
  # Prior
  gamma ~ dnorm(0,0.00001)
  beta1 ~ dnorm(0,0.00001)
  beta2 ~ dnorm(0,0.00001)
  beta3 ~ dnorm(0,0.00001)
}
Now we are asked to add alpha as a Normal with known mean and unknown variance, where the variance itself has a uniform prior.
I do not know how to add alpha to the model, and then specify the new parameter in the priors...
You would just add alpha to your linear predictor and give it a prior like any of the other parameters. However, JAGS parameterizes the Normal distribution with a precision instead of a variance (the precision is just the inverse of the variance). The model would look something like this instead. Also, you can use logit(eta[i]) on the left-hand side instead of applying the inverse logit by hand.
mymodel = function(){
  for (i in 1:16){
    y[i] ~ dbin(eta[i], n[i])
    logit(eta[i]) <- alpha + gamma*x1[i]+beta1*x2[i]+beta2*x3[i]+beta3*x4[i]
  }
  # Prior
  alpha ~ dnorm(0, tau_alpha)
  tau_alpha <- 1 / var_alpha
  var_alpha ~ dunif(0, 10)
  gamma ~ dnorm(0,0.00001)
  beta1 ~ dnorm(0,0.00001)
  beta2 ~ dnorm(0,0.00001)
  beta3 ~ dnorm(0,0.00001)
}

Using awk to perform an operation on the second value, only on lines with a specific string

OK, I have this, which works:
awk '/string/ { print $0 }' file1.csv | awk '$2=$2+(int(rand()*10))' > file2.csv
But I want the rest of my file printed into file2.csv as well; this program only prints the lines that contain my string, which is a start.
I can't simply apply the operation to $2 on every line, because the $2 values on lines that don't contain my string must not be changed.
So I want the original (file1.csv) contents kept intact, with the only difference being an adjusted value at $2 on lines matching my string.
Can someone help me? Thank you.
And here are 4 lines from the original csv:
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30720, String, 0, 76, 100
2, 32620, String, 0, 76, 0
Expected output is the same aside from small variations to $2:
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30725, String, 0, 76, 100
2, 32621, String, 0, 76, 0
$ awk 'BEGIN{FS=OFS=", "} /String/{$2+=int(rand()*10)}1' file
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30722, String, 0, 76, 100
2, 32622, String, 0, 76, 0
You probably need to initialize the random seed in the BEGIN block as well; otherwise you'll always get the same sequence from rand().
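For example, a minimal variation of the one-liner above with the seed initialised:
$ awk 'BEGIN{srand(); FS=OFS=", "} /String/{$2+=int(rand()*10)}1' file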
Something like this, maybe:
$ awk -F", " -v OFS=", " '$3 ~ /String/ { $2=$2+(int(rand()*10))} { print $0 }' file1.csv
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30722, String, 0, 76, 100
2, 32622, String, 0, 76, 0

Getting each field following a regular pattern with awk

Having an input text file as below:
1234, aaa = 34 </T><AT/>X_CONST = 34 </T><AT/>AAA_Z = 3 </T><AT/>Y_CONST = 34 </T><AT/>FOUND_ME_1 = 5 </T><AT/>BBB_X = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 8 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>FOUND_ME_4 = 10 </T><AT/>X_CONST = 34
7844, aaa = 33 </T><AT/>X_CONST = 21 </T><AT/>AAA_Z = 3 </T><AT/>R_CONST = 34 </T><AT/>FOUND_ME_1 = 50 </T><AT/>BBB_X = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 81 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>X_CONST = 55
8888, aaa = 31 </T><AT/>X_CONST = 21 </T><AT/>AAA_Z = 3 </T><AT/>R_CONST = 34 </T><AT/>FOUND_ME_1 = 54 </T><AT/>BBB_Z = 3 </T><AT/>CCC_X = 8 </T><AT/>FOUND_ME_2 = 81 </T><AT/>FOUND_ME_3 = 8 </T><AT/>RRR_Z = 3 </T><AT/>T_CONST = 37 </T><AT/>FOUND_ME_4 = 11 </T><AT/>X_CONST = 55 </T><AT/>FOUND_ME_5 = 8 </T><AT/>TTT_X = 8 </T><AT/>FOUND_ME_6 = 20
I need to extract all the values of the FOUND_ME_[0-9] fields, ideally with awk. I know it would be easier if each field were converted to a separate line, but I'm looking for a solution that works on the file as it is.
My goal is to have an output like the following (values separated by commas)
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
I'm trying the following but no luck:
awk '{for(i=1;i<=NF;i++){ if($i==".*FOUND_ME_[0-9]"){print $($i+2)} } }'
I also have trouble with the regular expression FOUND_ME_[0-9].
This awk script gets you the output you want (although I'm guessing that file might have started out as XML once upon a time...):
$ cat script.awk
BEGIN { FS = "[[:space:]=]+" }
{
    s = ""
    for (i = 1; i <= NF; ++i)
        if ($i ~ /FOUND_ME_[0-9]/)
            s = s sprintf("%s, ", $(++i))
    print substr(s, 1, length(s) - 2)
}
$ awk -f script.awk file
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
It builds a string s from the field after each one matching the pattern. sprintf("%s, ", $(++i)) returns the value of the next field followed by a comma and a space: $(++i) increments the field number i and then returns that field's value. In awk, strings are concatenated simply by writing them next to each other, so the string returned by sprintf is appended to the existing value of s.
I set the field separator FS to one or more space or = characters, so the field you're interested in is the one right after the one matching the pattern. Note that I'm using ~ to match a regex pattern; you cannot use == as you were doing, since that performs a literal string comparison.
The substr call strips the trailing ", " from the string before it is printed.
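As a quick illustration of the difference, on a throwaway input:
$ echo "FOUND_ME_1" | awk '{ print ($1 == "FOUND_ME_[0-9]"), ($1 ~ /FOUND_ME_[0-9]/) }'
0 1
== compares against the literal string "FOUND_ME_[0-9]" and fails, while ~ interprets it as a regular expression and matches.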
A much shorter option, inspired by Kent's use of FPAT on GNU awk (note that this requires gawk >= 4.0):
$ awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" -v OFS=", " '{$1=$1;gsub(/FOUND_ME_[0-9] *= */,"")}1' file
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20
$1=$1 forces awk to rebuild the record from its fields, which under FPAT are only the substrings that matched, so everything else is dropped. gsub performs a global substitution, removing the FOUND_ME_n = prefix we aren't interested in. The 1 at the end is always true, so the default action {print} is performed. Setting the OFS variable makes the output fields comma-separated, as desired.
gawk has FPAT, which we could use for this problem:
awk -v FPAT="FOUND_ME_[0-9] *= *[0-9]+" '
{for(i=1;i<=NF;i++){sub("FOUND_ME_[0-9] *= *","",$i);
printf "%s%s",$i,(NF==i?"\n":", ")}}' file
output:
5, 8, 8, 10
50, 81, 8
54, 81, 8, 11, 8, 20