Awk: print certain lines in a certain format

I have a text file containing the following information.
Table X1: Circuit Thermal Overload Ratings for xxx
Rated Temperature: 50 °C
ALL RATINGS ARE Winter Spring / Autumn Summer
PER CIRCUIT Amps MVA Amps MVA Amps MVA
Pre-Fault Continuous 485 111 450 103 390 89
Post-Fault Continuous 580 132 540 123 465 106
Table X2: Circuit Thermal Overload Ratings for xxx
Rated Temperature: 65 °C
ALL RATINGS ARE Winter Spring / Autumn Summer
PER CIRCUIT Amps MVA Amps MVA Amps MVA
Pre-Fault Continuous 555 126 520 119 470 108
Post-Fault Continuous 660 150 620 142 560 128
I wish to format the lines depending on the first field of each line.
For lines starting with 'Pre' or 'Post', insert a comma after the leading text (the first two fields) and between every two numbers.
printf("%s %s,%d,%d,%d,%d,%d,%d\n",$1,$2,$3,$4,$5,$6,$7,$8)
For lines starting with 'ALL', insert commas after 'ARE', 'Winter' and 'Autumn'.
For lines starting with 'PER', insert commas after 'CIRCUIT', 'Amps' and 'MVA'.
For other lines, keep the original text...
The expected output should look like this:
ALL RATINGS ARE, Winter, Spring / Autumn, Summer
PER CIRCUIT, Amps, MVA, Amps, MVA, Amps, MVA
Pre-Fault Continuous, 555, 126, 520, 119, 470, 108
Post-Fault Continuous, 660, 150, 620, 142, 560, 128
I have tried the following but it does not produce the results I am looking for. Any help would be greatly appreciated...
/Table/
{
    for (n=0; n<=5; n=n+1) {
        if (n<2) { getline }
        if (n==2)
            { printf("%s%s%s,%s,%s%s%s,%s", $1,$2,$3,$4,$5,$6,$7,$8) }
        if (n==3)
            { printf("%s%s,%s,%s,%s,%s,%s,%s", $1,$2,$3,$4,$5,$6,$7,$8) }
        if (n==4 || n==5)
            { printf("%s %s,%d,%d,%d,%d,%d,%d\n", $1,$2,$3,$4,$5,$6,$7,$8) }
    }
}

Your approach is not going to help; use pattern matching instead. Here is a template you can follow:
$ awk '/^ALL/ {print ...; next}
/^PER/ {print ...; next}
/^Pre/ || /^Post/ {print ...; next}
{print}' file
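For example, filling in the placeholders with printf calls modeled on the format the question already uses for the Pre/Post lines (a sketch, assuming every data line splits into the eight fields shown in the sample):
$ awk '/^ALL/            {printf "%s %s %s, %s, %s %s %s, %s\n", $1,$2,$3,$4,$5,$6,$7,$8; next}
       /^PER/            {printf "%s %s, %s, %s, %s, %s, %s, %s\n", $1,$2,$3,$4,$5,$6,$7,$8; next}
       /^Pre/ || /^Post/ {printf "%s %s, %d, %d, %d, %d, %d, %d\n", $1,$2,$3,$4,$5,$6,$7,$8; next}
                         {print}' file
Lines that match none of the patterns (the Table and Rated Temperature lines) fall through to the final {print} and are kept unchanged.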

Related

Getting col % from a base size

I'm trying to get an output for a multi-response table in col%. I can get a % from the column total but not from a fixed base. How do I do it? For example:
Past week used (Seg A, Seg B, Seg C) =
Olive Oil: 80, 100, 150
Sunflower Oil: 35, 95, 105
Coconut Oil: 109, 209, 15
Segment sizes A=120, B=250, C=165
I need col% by each segment
So Seg A should be calculated as
Olive Oil= 80/120; Sunflower Oil=35/120 & Coconut Oil=109/120
Similarly for Seg B & Seg C.
I'm using tidyr and dplyr to generate my outputs.
Any advice will be much appreciated.

Find average of numbers from a specific line

I have a text file with 2 columns of numbers.
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
I would like to find the average of the 2nd column data from the 6th line. That is ( (7+8+9+10+11+12+13+14)/8 = 10.5 )
I found this post, Scripts for computing the average of a list of numbers in a data file,
and used the following:
awk '{s+=$2}END{print "ave:",s/NR}' fileName
but I get the average of the entire second column.
Any hints?
This one-liner should do:
awk -v s=6 'NR<s{next} {c++; t+=$2} END{printf "%.2f (%d samples)\n", t/c, c}' file
This awk script has three pattern/action pairs. The first skips every line before line s. The second executes on every line from s onwards; it increments a counter and adds column 2 to a running total. The third runs after all the data has been processed and prints the results.
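With the sample data above, this should print: 10.50 (8 samples)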
The script below should do the job:
awk 'NR>=6{avg+=$2}END{printf "Average of field 2 starting from 6th line %.1f\n",avg/(NR-5)}' file
Output
Average of field 2 starting from 6th line 10.5

How to do multi-row calculations using awk on a large file

I have a big file that is sorted on the first word. I need to add a new column to each line with the proportional value: the line's value divided by the total value for its group, where the group is determined by the first column. In the example below, the total of group "a" is 100, so each of its lines gets its proportion of that; the total of group "the" is 1000, so each of its lines gets its proportion of that total.
I need an awk script to do this.
Sample File:
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
Output:
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
You describe this as a "big file." Consequently, this solution tries to save memory: it holds no more than one group in memory at a time. When we are done with that group, we print it out before starting on the next group:
$ awk -v i=0 'NR==1{name=$1} $1==name{a[i]=$0;b[i++]=$3;tot+=$3+0;next} {for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1} END{for (j=0;j<i;j++){print a[j],b[j]/tot}}' file
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
How it works
-v i=0
This initializes the variable i to zero.
NR==1{name=$1}
For the first line, set the variable name to the first field, $1. This is the name of the group.
$1==name {a[i]=$0; b[i++]=$3; tot+=$3+0; next}
If the first field matches name, then save the whole line into array a and save the value of column (field) three into array b. Increment the variable tot by the value of the third field. Then, skip the rest of the commands and jump to the next line.
for (j=0;j<i;j++){print a[j],b[j]/tot} name=$1;a[0]=$0;tot=b[0]=$3;i=1
If we get to this line, then we are at the start of a new group. Print out all the values for the old group and initialize the variables for the start of the next group.
END{for (j=0;j<i;j++){print a[j],b[j]/tot}}
After we get to the last line, print out what we have for the last group.
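For readability, here is the same program written out as a script (functionally identical to the one-liner above, just spread out with comments):
BEGIN { i = 0 }                       # index into the per-group buffers

NR == 1 { name = $1 }                 # remember the first group's name

$1 == name {                          # still inside the current group
    a[i]   = $0                       # buffer the whole line
    b[i++] = $3                       # buffer its value
    tot   += $3 + 0                   # add to the group total
    next
}

{                                     # first line of a new group
    for (j = 0; j < i; j++)           # flush the finished group
        print a[j], b[j] / tot
    name = $1                         # start buffering the new group
    a[0] = $0
    tot  = b[0] = $3
    i    = 1
}

END {                                 # flush the last group
    for (j = 0; j < i; j++)
        print a[j], b[j] / tot
}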
awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' File
Example:
sdlcb#Goofy-Gen:~/AMD$ cat ff
a lot 10
a few 20
a great 20
a little 40
a good 10
the best 250
the dog 750
zisty cool 20
sdlcb#Goofy-Gen:~/AMD$ awk '{a[$1]+=$3; b[i++]=$0; c[j++]=$1; d[k++]=$3} END{for(i=0;i<NR;i++) {print b[i], d[i]/a[c[i]]}}' ff
a lot 10 0.1
a few 20 0.2
a great 20 0.2
a little 40 0.4
a good 10 0.1
the best 250 0.25
the dog 750 0.75
zisty cool 20 1
Logic: update array a[], indexed by the first column, with a running total for each group. Save the complete line into b[] for printing at the end. Similarly, save the first and third column values of each line into c[] and d[]. At the end, loop over all the lines processed and use these arrays to print each line as-is, followed by its proportion of its group total.
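Written out with comments, the same logic looks like this (note that, unlike the previous answer, it keeps the whole file in memory until the END block):
awk '{
    a[$1] += $3       # running total per group, keyed by the first column
    b[i++] = $0       # the whole line, in input order
    c[j++] = $1       # its group name
    d[k++] = $3       # its value
}
END {
    for (i = 0; i < NR; i++)
        print b[i], d[i] / a[c[i]]    # the line, then value / group total
}' File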

count and print the number of occurrences

I have some files as shown below
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
The value of GLL is 654 and the value of ALM is 656. In the same way, the fourth column holds the values of the first column and the fifth column holds the values of the second column. I would like to count the unique occurrences of each number in the fourth and fifth columns.
Desired output
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
If I understand your question right, this script could give you the output:
awk '{d[$4]=$1; d[$5]=$2; p[$4]; l[$5]}
END {
    for (k in p) {
        if (k in l) {
            delete l[k]
            print k, d[k], "2"
        } else
            print k, d[k], "1"
    }
    for (k in l)
        print k, d[k], 1
}' file
With your input data, the output of the above script is:
654 GLL 1
655 SEM 1
656 ALM 2
658 LEG 2
657 LYG 1
660 LEG 1
so it is not 100% the same as your expected output (the order differs), but if you pipe it to sort -n, it will give you exactly the same thing. The sorting part could be done within awk too; I was a bit lazy... :)
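For completeness, here is one way the sorting could be folded into the script itself with GNU awk's array-traversal ordering (gawk only; the out[] buffer is my own addition, not part of the original answer):
awk '{d[$4]=$1; d[$5]=$2; p[$4]; l[$5]}
END {
    for (k in p) {
        if (k in l) {
            delete l[k]
            out[k] = d[k] " 2"
        } else {
            out[k] = d[k] " 1"
        }
    }
    for (k in l)
        out[k] = d[k] " 1"
    PROCINFO["sorted_in"] = "@ind_num_asc"   # gawk-specific: walk the array in numeric key order
    for (k in out)
        print k, out[k]
}' file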
My take:
sort -u file |
awk '
    BEGIN {SUBSEP = OFS}                            # composite keys print as "number name"
    {count[$4,$1]++; count[$5,$2]++}                # count each (number, name) pair once per distinct line
    END {for (key in count) print key, count[key]}
' |
sort -n
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
Sorry it is so long, but it works and has a bonus built in if such a thing occurred! See edit 2 for more info. :-)
awk '
BEGIN {
    SUBSEP = FS;
    before = 0;
    between = 1;
    after = 0;
}
{
    offset = int((NF - after - before - between) / 2) + between;
    for (i = 1 + before; i <= offset + before - between; i++) {
        j = i + offset;
        if (! ((i, $j, $i) in entry))
            entry[i, $j, $i]++;
    }
}
END {
    for (item in entry) {
        split(item, itema);
        entry[itema[2], itema[3]]++;
        delete entry[item];
    }
    for (item in entry)
        print item, entry[item];
}' filename | sort -n
The first part filters the input, only accepting unique occurrences of the pair that should be in the first and second columns of the output. The second part combines the results, adding 1 for each occurrence in a unique column (e.g. LEG,658 appears at least once in both $1,$4 and $2,$5, so it is counted twice), and prints the results, which is passed to the sort utility to sort the output numerically.
It is generalized for N pairs, so if you have something like the following in the future, the script still works, so long as only pairs are added (you can't add another separate field, or the script breaks):
GLL ALM LEG 654-660 654 656 660
I suppose if you wanted, you could add extra fields to the beginning and change the start value of i. Or maybe add at the end and subtract one more from the end value of i for each new field you add (e.g. NF - 2 if you add one more unpaired field at the end). It would require a redesign to accommodate unpaired values in the middle because the data set would be completely different.
Edit
It's only so long because it is flexible (somewhat) and because I'm still an awk newbie. I'd recommend Kent's if you don't like mine (or if it doesn't work; I'm not at a computer with awk installed at the moment).
Edit 2
Updated script. It didn't work before, and it can now handle arbitrary offsets so long as no unpaired fields split the pairs up. Something like the following works:
GLL ALM LYG 654-657 654 656 657
SEM LYG 655-657 655 657
SEM LYG LEG 655-660 655 657 660
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LYG LEG 657-660 657 660
Output:
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 3
658 LEG 2
660 LEG 2
Edit 3
The script now handles arbitrary contiguous unpaired fields. You must configure how many fields you have before the first part of a pair begins (e.g. how many fields before the first GLL, ALM, etc. on the line), how many fields are between the first and second parts of the pairs, and how many fields are after the list of second parts of the pairs. Note that it must be contiguous and consistent, meaning you can't have something like 1 field before the first pair start component for one line and 5 fields before the first pair start component on another line, and you can't have a pair start/end component separated from another of the same (e.g. "GLL xyz ALM 654 656" doesn't work because "xyz" separates "GLL" and "ALM", which are both pair start components).
For anything more than this, actual knowledge about the data set would be required, such as if GLL may have extra information immediately after it, but ALM does not ever have such data.

awk n-gram extraction not correct

I'm currently working on an awk script which extracts all n-grams from an input file.
When run on a file, it prints every n-gram (sorted) with the number of occurrences next to it.
When testing on an input file, the n-grams come out in the correct order, but the occurrence counts are not correct.
For extracting n-grams I have the following code:
$1 = $1
line = tolower($0)
split(line, chars, "")
begin_len = 0
for (i in chars) {
    ngram = ""
    for (ind = 0; ind < n; ind++) {
        ngram = ngram "" chars[i+ind]
    }
    if (begin_len == 0) {
        begin_len = length(ngram)
    }
    if (length(ngram) == begin_len) {
        counter += 1
        freq_tabel[ngram] += 1
    }
}
(sort function not included)
I was wondering if there is something wrong in the code. Or are there some aspects which I have overlooked?
The output I should have is the following:
35383
1580 n
1323 en
1081 e
940 de
839 v
780 er
716 d
713 an
615 t
Instead, I have the following output:
34845
1561 n
1302 en
1067 e
930 de
827 v
772 er
711 d
703 an
609 t
As you can see, the n-grams are correct but the numbers of occurrences are not.
INPUT FILE: http://cl.ly/202j3r0B1342
Not an answer, but it may help you (assuming n=2).
Did you happen to convert the original file (which seems to be UTF-8) to latin-1? I got two sets of figures:
==> sorted.latin1_in_utf8_locale <==
1566 n
1308 en
1072 e
929 de
836 v
==> sorted.utf8_in_utf8_locale <==
1579 n
1320 en
1080 e
940 de
838 v
With latin-1 input the figures are closer to yours; with UTF-8, closer to the expected ones.
However, neither matches. Scratching my head.
BTW, I am not sorting the n-grams in the script but outputting them in a form suitable for piping to sort -rn. But this should not cause a difference, I guess.
for (ngram in freq_tabel)
    printf "%7i %s\n", freq_tabel[ngram], ngram
I'm in your class, so here are a couple of hints:
Copy the exact input file (clone it from GitHub; don't do a raw copy).
Re-read the assignment: you're supposed to get rid of the leading and trailing spaces, and replace all runs of tabs/spaces with one space.
Also, what's the point of the $1 = $1 on top?
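A minimal sketch of that kind of normalization, done before the n-grams are extracted (just one way to do it, not necessarily what the assignment expects):
{
    gsub(/[ \t]+/, " ")    # collapse runs of spaces/tabs into a single space
    sub(/^ /, "")          # drop a leading space, if any
    sub(/ $/, "")          # drop a trailing space, if any
    line = tolower($0)
    # ... n-gram extraction continues as before ...
}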