Formula in gawk - gawk

I have a problem that I'm trying to work out in gawk. This should be simple, but my attempts ended with a divide-by-zero error.
What I am trying to accomplish is as follows:
maxlines = 22 (fixed value)
maxnumber > maxlines (unknown value)
Example:
maxlines=22
maxnumber=60
My output should look like the following:
print lines:
1
2
...
22
print lines:
23
24
...
44
print lines:
45 (remainder of 60 (maxnumber))
46
...
60

It's not clear what you're asking, but I assume you want to loop through input lines and print a new header (page header?) after every 22 lines. Use a simple counter and check for
count % 22 == 1
which is true on the first line of each page, so it tells you when to print the next page heading.
Or you could keep two counters, one for the absolute line number and another for the line number within the current page. When the second counter exceeds 22, reset it to zero and print the next page heading.
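The two-counter idea can be sketched in awk roughly as follows. This is only an illustration under a few assumptions: the shell function name paginate and the variable names line and inpage are made up here, and the loop runs in a BEGIN block over a fixed range instead of reading input lines.

```shell
# paginate MAXLINES MAXNUMBER: print 1..MAXNUMBER with a heading before
# every block of MAXLINES numbers (hypothetical helper name).
paginate() {
    awk -v maxlines="$1" -v maxnumber="$2" 'BEGIN {
        inpage = 0                                   # line number within the current page
        for (line = 1; line <= maxnumber; line++) {  # line: absolute line number
            if (inpage == 0)
                print "print lines:"                 # first line of a new page
            print line
            if (++inpage == maxlines)                # page is full: reset the page counter
                inpage = 0
        }
    }'
}

paginate 22 60
```

With maxlines=22 and maxnumber=60 the heading is printed before 1, 23 and 45.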

Worked out gawk precedence with some help, and this works:
BEGIN {
    maxlines = 22
    maxnumber = 60
    for (i = 1; i <= maxnumber; i++) {
        # (i-1) % maxlines is 0 on the first line of each page
        if (!((i - 1) % maxlines)) {
            print "\nprint lines:"
        }
        print i
    }
}

Related

Awk failing extraction

I have a huge file containing the xyz positions of some atoms from different molecules. The whole file contains ~ 10000 configurations. I have created a script that iterates over the total number of configurations and extracts the coordinates of a specific atomic species that is systematically repeated at a fixed position in each frame of each system. My code works perfectly, except when the atomic position coincides with the last position of the frame being processed; in that case it fails to grab the line and print it to the corresponding file.
Each frame contains 384 atoms. In the xyz format, we have to take into account two extra lines at the beginning of each frame: the number of atoms (in this case 384, line #1) and a blank/comment line (line #2).
The awk file with the list of atom position lines is of the form:
{n = NR%386}
n == 1 {print "24"; next}
n == 2 ||
n == 91 ||
...
n == 378 ||
n == 380 ||
n == 381 ||
n == 386
where n=NR%386 is meant to give the position of the current line within its 386-line frame, so that the same line numbers are matched in every frame, and in
n == 1 {print "24"; next}
the code prints the number of atoms I want to extract for each frame, in this case, 24.
The problem arises with the last value, in the last position of each frame before advancing to the next frame:
n == 386
When using the command
awk -f file.awk filename.xyz >> test.txt
the code will skip reading, extracting, and printing the last coordinate.
The filename.xyz I have to process is something like:
384
i = 3171, time = 3171.000, E = -3298.3005315786
C 6.66359796 19.29831718 16.63773520
C 6.19922671 19.83243350 15.35406226
C 7.73577004 21.24303011 16.94974860
C 7.32315891 21.77975003 15.67093925
N 5.08248005 17.55384984 15.51887635
N 7.75857672 23.00895664 15.43811018
N 8.58649028 22.07495287 17.61330368
N 7.45555304 19.97249138 17.42360101
...
...
...
N 3.62924684 23.22942656 15.38486984
N 4.52670891 22.25077226 17.55981432
N 3.17369677 20.23465407 17.45881199
N 2.28230853 21.30557433 14.86646780
S 1.48394488 18.18032187 17.21253664
S 0.70072709 19.13053602 14.60582837
S 4.67511560 23.53830074 16.57005901
Currently, just trying to extract only position 386
n == 386
produces something like:
1
i = 3171, time = 3171.000, E = -3298.3005315786
1
i = 3172, time = 3172.000, E = -3298.3023115390
1
i = 3173, time = 3173.000, E = -3298.3056102462
1
i = 3174, time = 3174.000, E = -3298.3101590395
which are just the comment lines, so awk is apparently skipping or misinterpreting which line to grab.
I would like to understand why awk is not able to extract the last line properly and how to solve the problem.
This appears to be a math problem. NR%386 will never be 386, because the modulus operator only yields values from 0 to 385 (there is no remainder when you divide 386 by 386, so the last line of each frame gives 0). So your n==386 will never get executed. Try using (NR-1)%386 instead of NR%386 and shift all your conditionals down by one accordingly:
n == 0 {print "24"; next}
etc. If you need n for calculations, add one to it.
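To see the off-by-one concretely, here is a small sketch that uses a made-up frame length of 5 lines instead of 386. The function names last_line_broken and last_line_fixed are invented for this illustration, and seq is assumed to be available to generate numbered input lines.

```shell
# Each "frame" here is 5 lines long (a stand-in for the 386-line frames).
# NR % 5 cycles through 1 2 3 4 0, so a test for n == 5 can never be true.
last_line_broken() {
    awk '{ n = NR % 5 } n == 5'
}

# (NR - 1) % 5 cycles through 0 1 2 3 4, so the last line of a frame is n == 4.
last_line_fixed() {
    awk '{ n = (NR - 1) % 5 } n == 4'
}

seq 1 10 | last_line_broken   # prints nothing
seq 1 10 | last_line_fixed    # prints 5 and 10
```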

Reading fields in previous lines for moving average

Main Question
What is the correct syntax for recursively calling AWK inside of another AWK program, and then saving the output to a (numeric) variable?
I want to call AWK using 2/3 variables:
N -> Can be read from Bash or from container AWK script.
Linenum -> Read from container AWK program
J -> Field that I would like to read
This is my attempt.
Container AWK program:
BEGIN {}
{
...
# Loop in j
...
k=NR
# Call to other instance of AWK
var=(awk -f -v n="$n_steps" linenum=k input-file 'linenum-n {printf "%5.4E", $j}'
...
}
END{}
Background for more general questions:
I have a file for which I would like to calculate a moving average of n (for example 2280) steps.
Ideally, for the first n rows the average is of the values 1 to k,
where k <= n.
For rows k > n the average would be of the last n values.
I will eventually execute the code in many large files, with several columns, and thousands to millions of rows, so I'm interested in streamlining the code as much as possible.
Code Excerpt and Description
The code I'm trying to develop looks something like this:
NR>1
{
# Loop over fields
for (j in columns)
{
# Rows before full moving average is done
if ( $1 <= n )
{
cumsum[j]=cumsum[j]+$j #Cumulative sum
$j=cumsum[j]/$1 # Average
}
#moving average
if ( $1 > n )
{
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}') # Obtain value that will get subtracted from moving average
cumsum[j]=cumsum[j]+$j-last[j] # Cumulative sum adds last step and deleted unwanted value
$j=cumsum[j]/n # Moving average
}
}
}
My input file contains several columns. The first column contains the row number, and the other columns contain values.
For the cumulative sum of the moving average: If I am in row k, I want to add it to the cumulative sum, but also start subtracting the first value that I don't need (k-n).
I don't want to have to create an array of cumulative sums for the last steps, because I feel it could impact performance. I prefer to directly select the values that I want to subtract.
For that I need to call AWK once again (but on a different line). I attempt to do it in this line:
k=NR
last[j]=(awk -f -v n="$n_steps" ln=k input-file 'ln-n {printf "%5.4E", $j}'
I am sure that this code cannot be correct.
Discussion Questions
What is the best way to obtain information about a field in a previous line to the one that AWK is working on? Can it be then saved into a variable?
Is this recursive use of AWK allowed or even recommended?
If not, what could be the most efficient way to update the cumulative sum values so that I get an efficient enough code?
Sample input and Output
Here is a sample of the input (second column) and the desired output (third column). I'm using 3 as the number of averaging steps (n)
N VAL AVG_VAL
1 1 1
2 2 1.5
3 3 2
4 4 3
5 5 4
6 6 5
7 7 6
8 8 7
9 9 8
10 10 9
11 11 10
12 12 11
13 13 12
14 14 13
15 15 14
If you want to do a running average of a single column, you can do it this way:
BEGIN{n=2280; c=7}
{ s += $c - a[NR%n]; a[NR%n] = $c }
{ print $0, s / (NR < n ? NR : n) }
Here we store the last n values in an array a, used as a circular buffer indexed by NR%n, and keep track of the running sum s. Every time we add a new value to the sum, we also subtract the value that drops out of the window (the one stored n rows earlier).
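As a quick check, the same circular-buffer idea can be run on the first rows of the sample from the question, with n=3 and the value in column 2. This is a sketch rather than the exact script above: the function name moving_avg is hypothetical, and n and c are passed with -v instead of being set in BEGIN.

```shell
# moving_avg N COL: append the running average of column COL over the last N rows.
moving_avg() {
    awk -v n="$1" -v c="$2" '
        { s += $c - a[NR % n]; a[NR % n] = $c }   # add the new value, drop the one leaving the window
        { print $0, s / (NR < n ? NR : n) }       # divide by the rows seen so far, capped at n
    '
}

printf '%s\n' '1 1' '2 2' '3 3' '4 4' '5 5' | moving_avg 3 2
# prints:
# 1 1 1
# 2 2 1.5
# 3 3 2
# 4 4 3
# 5 5 4
```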
If you want to do this for a couple of columns, you have to be a bit careful about keeping track of your arrays:
BEGIN{n=2280; c[0]=7; c[1]=8; c[2]=9}
{ for(i in c) { s[i] += $c[i] - a[n*i + NR%n]; a[n*i + NR%n] = $c[i] } }
{ printf "%s", $0
for(i=0;i<length(c);++i) printf "%s%s", OFS, s[i]/(NR < n ? NR : n)
printf "%s", ORS
}
However, you mentioned that you have to add millions of entries. That is where it becomes a bit more tricky. Summing a lot of floating-point values introduces numeric errors, as you lose a little precision with every addition. So in this case, I would suggest implementing Kahan summation.
For a single column you get:
BEGIN{n=2280; c=7}
{ y = $c - a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; a[NR%n] = $c }
{ print $0, s / (NR < n ? NR : n) }
or a bit more expanded as:
BEGIN{n=2280; c=7}
{ y = $c - k; t = s + y; k = (t - s) - y; s = t; }
{ y = -a[NR%n] - k; t = s + y; k = (t - s) - y; s = t; }
{ a[NR%n] = $c }
{ print $0, s / (NR < n ? NR : n) }
For a multi-column problem, it is now straightforward to adjust the above script. All you need to know is that y and t are temporary values and k is the compensation term which needs to be stored in memory.
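One possible multi-column adaptation is sketched below. It is an illustration, not the answerer's script: the helper name kahan_avg, the awk function add, and the convention of passing the column numbers as one space-separated string are all assumptions made here. Each column i gets its own sum s[i], compensation term k[i], and slice a[i, NR%n] of the circular buffer.

```shell
# kahan_avg N "COL COL ...": running average over the last N rows for each
# listed column, using compensated (Kahan) summation. Hypothetical helper.
kahan_avg() {
    awk -v n="$1" -v cols="$2" '
        BEGIN { nc = split(cols, c, " ") }
        # add(i, v): add v to the sum s[i], carrying the compensation term k[i];
        # y and t are temporaries, declared as extra parameters to keep them local
        function add(i, v,    y, t) {
            y = v - k[i]; t = s[i] + y; k[i] = (t - s[i]) - y; s[i] = t
        }
        {
            line = $0
            for (i = 1; i <= nc; i++) {
                add(i, $(c[i]))             # new value enters the window
                add(i, -a[i, NR % n])       # value from n rows ago leaves it
                a[i, NR % n] = $(c[i])
                line = line OFS s[i] / (NR < n ? NR : n)
            }
            print line
        }
    '
}

printf '%s\n' '1 1 10' '2 2 20' '3 3 30' '4 4 40' | kahan_avg 3 '2 3'
```

The memory cost is n stored values per column, which is what makes it possible to subtract the outgoing value without re-reading the file.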

Distance between two lines

I have a set of points for which I need to calculate the distance between lines,
especially for the range 70:80. Is this possible with awk, or with any other method?
sample data
70.9247 24
73.6148 24
70.9231 25
73.6144 25
70.9216 26
73.6141 26
70.9201 27
73.6138 27
70.9187 28
73.6136 28
A few points:
1) The data is sorted on y, so each value of y has 2 points.
2) I want the distance between the x points for every y, i.e. x(new) = x(n+1) - x(n)
expected output:
2.6901 24
2.6913 25
...........
2.6949 28
thanks
What you are after is something like:
awk 'NR%2{t=$1;next}{print $1-t,$2}'
This works as follows:
If the record/line number NR is an odd number, store the value of the first field in t and skip to the next record/line
Otherwise, print the expected output.
A similar way of writing this is:
awk '{if(NR%2){t=$1}else{print $1-t,$2}}'
but this is less awk-ish!
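Run against the first four rows of the sample data, the one-liner behaves as sketched below. The wrapper function name xdist is invented here for convenience. The approach assumes the input really does come in consecutive pairs sharing the same y, as in the sample.

```shell
# xdist: for each pair of consecutive lines, print (second x - first x) and the y value.
xdist() {
    awk 'NR % 2 { t = $1; next } { print $1 - t, $2 }'
}

printf '%s\n' '70.9247 24' '73.6148 24' '70.9231 25' '73.6144 25' | xdist
# prints:
# 2.6901 24
# 2.6913 25
```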

Binding a scalar to a sigilless variable (Perl 6)

Let me start by saying that I understand that what I'm asking about in the title is dubious practice (as explained here), but my lack of understanding concerns the syntax involved.
When I first tried to bind a scalar to a sigilless symbol, I did this:
my \a = $(3);
thinking that $(...) would package the Int 3 in a Scalar (as seemingly suggested in the documentation), which would then be bound to symbol a. This doesn't seem to work though: the Scalar is nowhere to be found (a.VAR.WHAT returns (Int), not (Scalar)).
In the above-referenced post, raiph mentions that the desired binding can be performed using a different syntax:
my \a = $ = 3;
which works. Given the result, I suspect that the statement can be phrased equivalently, though less concisely, as: my \a = (my $ = 3), which I could then understand.
That leaves the question: why does the attempt with $(...) not work, and what does it do instead?
What $(…) does is turn a value into an item.
(A value in a scalar variable ($a) also gets marked as being an item)
say flat (1,2, (3,4) );
# (1 2 3 4)
say flat (1,2, $((3,4)) );
# (1 2 (3 4))
say flat (1,2, item((3,4)) );
# (1 2 (3 4))
Basically it is there to prevent a value from flattening. The reason for its existence is that Perl 6 does not flatten lists as much as most other languages, and sometimes you need a little more control over flattening.
The following only sort-of does what you want it to do
my \a = $ = 3;
A bare $ is an anonymous state variable.
my \a = (state $) = 3;
The problem shows up when you run that same bit of code more than once.
sub foo ( $init ) {
my \a = $ = $init; # my \a = (state $) = $init;
(^10).map: {
sleep 0.1;
++a
}
}
.say for await (start foo(0)), (start foo(42));
# (43 44 45 46 47 48 49 50 51 52)
# (53 54 55 56 57 58 59 60 61 62)
# If foo(42) beat out foo(0) instead it would result in:
# (1 2 3 4 5 6 7 8 9 10)
# (11 12 13 14 15 16 17 18 19 20)
Note that the variable is shared between calls.
The first Promise halts at the sleep call, and then the second sets the state variable before the first runs ++a.
If you use my $ instead, it now works properly.
sub foo ( $init ) {
my \a = my $ = $init;
(^10).map: {
sleep 0.1;
++a
}
}
.say for await (start foo(0)), (start foo(42));
# (1 2 3 4 5 6 7 8 9 10)
# (43 44 45 46 47 48 49 50 51 52)
The thing is that sigilless “variables” aren't really variables (they don't vary), they are more akin to lexically scoped (non)constants.
constant \foo = (1..10).pick; # only pick one value and never change it
say foo;
for ^5 {
my \foo = (1..10).pick; # pick a new one each time through
say foo;
}
Basically the whole point of them is to be as close as possible to referring to the value you assign to it. (Static Single Assignment)
# these work basically the same
-> \a {…}
-> \a is raw {…}
-> $a is raw {…}
# as do these
my \a = $i;
my \a := $i;
my $a := $i;
Note that above I wrote the following:
my \a = (state $) = 3;
Normally in the declaration of a state var, the assignment only happens the first time the code gets run. Bare $ doesn't have a declaration as such, so I had to prevent that behaviour by putting the declaration in parens.
# bare $
for (5 ... 1) {
my \a = $ = $_; # set each time through the loop
say a *= 3; # 15 12 9 6 3
}
# state in parens
for (5 ... 1) {
my \a = (state $) = $_; # set each time through the loop
say a *= 3; # 15 12 9 6 3
}
# normal state declaration
for (5 ... 1) {
my \a = state $ = $_; # set it only on the first time through the loop
say a *= 3; # 15 45 135 405 1215
}
Sigilless variables are not actually variables, they are more of an alias, that is, they are not containers but bind to the values they get on the right hand side.
my \a = $(3);
say a.WHAT; # OUTPUT: «(Int)␤»
say a.VAR.WHAT; # OUTPUT: «(Int)␤»
Here, by doing $(3) you are actually putting in scalar context what is already in scalar context:
my \a = 3; say a.WHAT; say a.VAR.WHAT; # OUTPUT: «(Int)␤(Int)␤»
However, the second form in your question does something different. You're binding to an anonymous variable, which is a container:
my \a = $ = 3;
say a.WHAT; # OUTPUT: «(Int)␤»
say a.VAR.WHAT;# OUTPUT: «(Scalar)␤»
In the first case, a was an alias for 3 (or $(3), which is the same); in the second, a is an alias for $, which is a container, whose value is 3. This last case is equivalent to:
my $anon = 3; say $anon.WHAT; say $anon.VAR.WHAT; # OUTPUT: «(Int)␤(Scalar)␤»
(If you have some suggestion on how to improve the documentation, I'd be happy to follow up on it)

Do .. While Loop/Textfile/Operation Problem

Hi I have a problem with the following code:
int skp = 1;
do{
file.seekp(skp);
file>>s;
cout<<s;
stats[s]++;
skp++;
skp++;
}while(skp <= 10);
The Textfile has the following:
0
1
2
3
0
1
0
1
0
What I want this program to do is start by reading the second number, which it does, then skip one and read the next, skip one and read the next, and so on. What it actually does is read the second number, which is good, then read it again two more times, then read the next number three times, and the next one three times. So the output I receive from the above text file is
1112223330.
Can any one help me please!
Thank you!
That's because your lines are separated by line feeds (actually CR and LF). Also, file >> s will skip leading white space, so you end up with
<CR><LF>1
<LF>1
1
All of which result in s being 1.
The same is repeated for 2, 3 and so on.
Forget your seekp() and simply use
// the first read skips a value, the second read keeps the next one;
// the loop ends as soon as either read fails
while (file >> s && file >> s) {
cout << s;
stats[s]++;
}