I have some files as shown below
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
The value of GLL is 654 and the value of ALM is 656. In the same way, the 4th column holds the values for the names in the 1st column, and the 5th column holds the values for the names in the 2nd column. I would like to count the unique occurrences of each number in the fourth and fifth columns.
Desired output
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
If I understand your question right, this script could give you the output:
awk '{d[$4]=$1;d[$5]=$2;p[$4];l[$5]}
END{
for(k in p){
if (k in l){
delete l[k]
print k,d[k],"2"
}else
print k,d[k],"1"
}
for (k in l)
print k, d[k],1
} ' file
With your input data, the above script outputs:
654 GLL 1
655 SEM 1
656 ALM 2
658 LEG 2
657 LYG 1
660 LEG 1
So it is not 100% the same as your expected output (the order differs), but if you pipe it to sort -n, it will give you exactly the same thing. The sorting could be done within awk too. I was a bit lazy... :)
My take:
sort -u file |
awk '
BEGIN {SUBSEP = OFS}
{count[$4,$1]++; count[$5,$2]++}
END {for (key in count) print key, count[key]}
' |
sort -n
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 1
658 LEG 2
660 LEG 1
Sorry it is so long, but it works and has a bonus built in if such a thing occurred! See edit 2 for more info. :-)
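awk '
BEGIN {
    SUBSEP = FS;
    before = 0;     # unpaired fields before the first pair-start field
    between = 1;    # unpaired fields between pair starts and pair ends
    after = 0;      # unpaired fields after the last pair-end field
}
{
    offset = int((NF - after - before - between) / 2) + between;
    for (i = 1 + before; i <= offset + before - between; i++) {
        j = i + offset;
        # Record each (column, number, name) triple only once.
        if (! ((i, $j, $i) in entry))
            entry[i, $j, $i]++;
    }
}
END {
    # Tally into a separate array: adding keys to an array while
    # looping over it with for-in is not safe in every awk.
    for (item in entry) {
        split(item, itema);
        count[itema[2], itema[3]]++;
    }
    for (item in count)
        print item, count[item];
}' filename | sort -n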
The first part filters the input, only accepting unique occurrences of the pair that should be in the first and second columns of the output. The second part combines the results, adding 1 for each occurrence in a unique column (e.g. LEG,658 appears at least once in both $1,$4 and $2,$5, so it is counted twice), and prints the results, which are passed to the sort utility to sort the output numerically.
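The two stages described above can be sketched for the fixed five-column layout from the question. This is only a minimal illustration, not the generalized script: the array names seen and count and the temporary file are my own, and it assumes exactly two name columns followed by the range and two number columns.

```shell
# Recreate the sample input from the question in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
GLL ALM 654-656 654 656
SEM LYG 655-657 655 657
SEM LYG 655-657 655 657
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LEG LEG 658-660 658 660
EOF

counts=$(awk '
{
    # Stage 1: accept each (column, number, name) triple only once.
    # Stage 2: add 1 to the (number, name) tally for each accepted triple.
    if (!((4, $4, $1) in seen)) { seen[4, $4, $1]; count[$4 " " $1]++ }
    if (!((5, $5, $2) in seen)) { seen[5, $5, $2]; count[$5 " " $2]++ }
}
END { for (k in count) print k, count[k] }' "$tmp" | sort -n)

printf '%s\n' "$counts"
rm -f "$tmp"
```

Repeated lines add nothing after their first appearance, while a number seen in both column 4 and column 5 is counted once per column.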
It is generalized for N pairs, so if you have something like the following in the future, the script still works, so long as only pairs are added (you can't add another separate field, or the script breaks):
GLL ALM LEG 654-660 654 656 660
I suppose if you wanted, you could add extra fields to the beginning and change the start value of i. Or maybe add them at the end and subtract one more from the end value of i for each new field you add (e.g. NF - 2 if you add one more unpaired field at the end). It would require a redesign to accommodate unpaired values in the middle because the data set would be completely different.
Edit
It's only so long because it is flexible (somewhat) and because I'm still an awk newbie. I'd recommend Kent's if you don't like mine (or it doesn't work--I'm not using a computer that has awk installed at the moment).
Edit 2
Updated script. It didn't work before, and it can now handle arbitrary offsets so long as no unpaired fields split the pairs up. Something like the following works:
GLL ALM LYG 654-657 654 656 657
SEM LYG 655-657 655 657
SEM LYG LEG 655-660 655 657 660
ALM LEG 656-658 656 658
LEG LEG 658-660 658 660
LYG LEG 657-660 657 660
Output:
654 GLL 1
655 SEM 1
656 ALM 2
657 LYG 3
658 LEG 2
660 LEG 2
Edit 3
The script now handles arbitrary contiguous unpaired fields. You must configure how many fields you have before the first part of a pair begins (e.g. how many fields before the first GLL, ALM, etc. on the line), how many fields are between the first and second parts of the pairs, and how many fields are after the list of second parts of the pairs. Note that it must be contiguous and consistent, meaning you can't have something like 1 field before the first pair start component for one line and 5 fields before the first pair start component on another line, and you can't have a pair start/end component separated from another of the same (e.g. "GLL xyz ALM 654 656" doesn't work because "xyz" separates "GLL" and "ALM", which are both pair start components).
For anything more than this, actual knowledge about the data set would be required, such as if GLL may have extra information immediately after it, but ALM does not ever have such data.
I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display the result in the GUI together with the information in the non-numeric columns. The non-numeric columns hold info such as names, rollno and stream, while the numeric columns contain students' marks for various subjects. It works well when dealing with one dataframe but fails when I combine two or more dataframes: it returns only the averages of the numeric columns and leaves the non-numeric columns undisplayed. Below is one of the versions I've tried so far.
df=pd.concat((df3,df5))
dfs =df.groupby(df.index,level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col, QTableWidgetItem(str(df_array[row, col])))
Working code should return the averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear, so I am guessing at the intent from its content. Your sample dataframes were not valid (unquoted strings and a STREAM list shorter than the other columns), so I have fixed them and added a stream called 'CENTRAL':
import pandas as pd
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
I have assumed you want to combine the two dataframes and find the average per student. Grouping on the non-numeric columns keeps them in the result:
df3 = pd.concat([Df2, Df1])  # DataFrame.append was removed in pandas 2.0
df3.groupby(['STREAM','ADM','NAME'], as_index=False).mean()  # mean, not sum, gives the average
Outcome
I have a text file with 2 columns of numbers.
10 2
20 3
30 4
40 5
50 6
60 7
70 8
80 9
90 10
100 11
110 12
120 13
130 14
I would like to find the average of the 2nd column data from the 6th line. That is ( (7+8+9+10+11+12+13+14)/8 = 10.5 )
I found the post "Scripts for computing the average of a list of numbers in a data file"
and used the following:
awk '{s+=$2}END{print "ave:",s/NR}' fileName
but I get the average of the entire second column.
Any hints?
This one-liner should do:
awk -v s=6 'NR<s{next} {c++; t+=$2} END{printf "%.2f (%d samples)\n", t/c, c}' file
This awk script has three pattern/action pairs. The first is responsible for skipping every line before line s (here, the first 5 lines). The second executes on every remaining line (from line s onwards); it increments a counter and adds column 2 to a running total. The third runs after all data have been processed, and prints your results.
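To check the arithmetic, here is a small self-contained run on the sample data from the question. It is a sketch along the same lines as the one-liner, with one addition of my own: a guard in the END block so that a file with fewer than s lines does not cause a division by zero. The temporary file is just a stand-in for your data file.

```shell
# Recreate the 13-line sample: "10 2", "20 3", ..., "130 14".
tmp=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 13; i++) print i * 10, i + 1 }' > "$tmp"

avg=$(awk -v s=6 '
    NR >= s { c++; t += $2 }        # count and sum column 2 from line s on
    END {
        if (c) printf "%.2f (%d samples)\n", t / c, c
        else   print "no samples"   # fewer than s lines in the input
    }' "$tmp")

echo "$avg"
rm -f "$tmp"
```

With the sample data this sums 7 through 14 (84) over 8 lines, which gives the 10.5 from the question.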
Below script should do the job
awk 'NR>=6{avg+=$2}END{printf "Average of field 2 starting from 6th line %.1f\n",avg/(NR-5)}' file
Output
Average of field 2 starting from 6th line 10.5
I have a text file containing the following information.
Table X1: Circuit Thermal Overload Ratings for xxx
Rated Temperature: 50 ºC
ALL RATINGS ARE Winter Spring / Autumn Summer
PER CIRCUIT Amps MVA Amps MVA Amps MVA
Pre-Fault Continuous 485 111 450 103 390 89
Post-Fault Continuous 580 132 540 123 465 106
Table X2: Circuit Thermal Overload Ratings for xxx
Rated Temperature: 65 ºC
ALL RATINGS ARE Winter Spring / Autumn Summer
PER CIRCUIT Amps MVA Amps MVA Amps MVA
Pre-Fault Continuous 555 126 520 119 470 108
Post-Fault Continuous 660 150 620 142 560 128
I wish to format the lines depending on the first field of each line.
For lines starting with 'Pre' or 'Post', insert commas after the first string and between every two numbers.
printf("%s %s,%d,%d,%d,%d,%d,%d\n",$1,$2,$3,$4,$5,$6,$7,$8)
For lines starting with 'ALL', insert commas after 'ARE', 'Winter' and 'Autumn'.
For lines starting with 'PER', insert commas after 'CIRCUIT', 'Amps' and 'MVA'.
For other lines, keep the original text...
The expected output should look like,
ALL RATINGS ARE, Winter, Spring / Autumn, Summer
PER CIRCUIT, Amps, MVA, Amps, MVA, Amps, MVA
Pre-Fault Continuous, 555, 126, 520, 119, 470, 108
Post-Fault Continuous, 660, 150, 620, 142, 560, 128
I have tried the following but it does not produce the results I am looking for. Any help would be greatly appreciated...
/Table/
{for(n=0; n<=5; n=n+1) {
if(n<2){getline}
if(n==2)
{printf("%s%s%s,%s,%s%s%s,%s",$1,$2,$3,$4,$5,$6,$7,$8)}
if(n==3)
{printf("%s%s,%s,%s,%s,%s,%s,%s",$1,$2,$3,$4,$5,$6,$7,$8)}
if(n==4||n==5)
{printf("%s %s,%d,%d,%d,%d,%d,%d\n",$1,$2,$3,$4,$5,$6,$7,$8) }
}
}
Your getline loop is not going to help here. Use pattern matching on the start of each line instead. Here is a template you can follow:
$ awk '/^ALL/ {print ...; next}
/^PER/ {print ...; next}
/^Pre/ || /^Post/ {print ...; next}
{print}' file
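One possible way to fill in that template is sketched below. It is my own guess at the print statements, not part of the original answer, and it assumes the field layout shown in the question: two leading words on the PER, Pre and Post lines, and the header "ALL RATINGS ARE Winter Spring / Autumn Summer" split into exactly eight whitespace-separated fields.

```shell
# Sample input copied from the question (table X2).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Table X2: Circuit Thermal Overload Ratings for xxx
Rated Temperature: 65 ºC
ALL RATINGS ARE Winter Spring / Autumn Summer
PER CIRCUIT Amps MVA Amps MVA Amps MVA
Pre-Fault Continuous 555 126 520 119 470 108
Post-Fault Continuous 660 150 620 142 560 128
EOF

formatted=$(awk '
/^ALL/ {
    # Commas after "ARE", "Winter" and "Autumn"; "Spring / Autumn" stays together.
    print $1, $2, $3 ", " $4 ", " $5, $6, $7 ", " $8
    next
}
/^(PER|Pre|Post)/ {
    # Keep the first two words together, then comma-separate the rest.
    out = $1 " " $2
    for (i = 3; i <= NF; i++) out = out ", " $i
    print out
    next
}
{ print }   # every other line is passed through unchanged
' "$tmp")

printf '%s\n' "$formatted"
rm -f "$tmp"
```

The PER, Pre and Post lines share one rule because they have the same shape: a two-word label followed by a list of values.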
I am trying to get the value of 'id' in the vmstat result.
However, I found out that the position of 'id' column is different between platforms such as linux/AIX/HP...
## Linux
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 35268 117568 158244 1849104 0 0 3 11321 5 2 9 15 73 3 0
So I think I should find the position of the string 'id' in the header row, then get the value at that position in the next row.
How can I do that with an awk script?
This one-liner does what you want (it reads the vmstat output on standard input):
awk '{for(i=NF;i>0;i--)if($i=="id"){x=i;break}}END{print $x}'
It first finds the index of the id column, then prints the corresponding column of the last line.
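A slightly more defensive variant is sketched below. It is my own variation, not the original one-liner: it stores the value while reading each data row instead of relying on the fields of the last record still being available inside END, which some older awk implementations do not guarantee. The variable names col and val are hypothetical, and the Linux sample from the question stands in for a live vmstat call.

```shell
# Sample vmstat output from the question, used instead of running vmstat.
sample='procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0  35268 117568 158244 1849104    0    0     3 11321    5    2  9 15 73  3  0'

idle=$(printf '%s\n' "$sample" | awk '
    # Until the header row is found, look for the field named "id".
    !col { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
    # After that, remember the value in that column for every data row.
    { val = $col }
    END { print val }')

echo "$idle"
```

On a real system you would replace the printf with vmstat itself, e.g. `vmstat | awk '...'`.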
I'm currently working on an awk script which extracts all n-grams from an input file.
When running my awk script on a file it prints out every n-gram (sorted) with the number of occurrences next to it.
When testing on an input file it prints out the n-grams in the correct order. Only the numbers of occurrences are not correct.
For extracting n-grams I have the following code:
$1=$1
line=tolower($0)
split(line,chars,"")
begin_len=0
for (i in chars){
    ngram=""
    for (ind=0;ind<n;ind++){
        ngram=ngram""chars[i+ind]
    }
    if(begin_len == 0){
        begin_len=length(ngram)
    }
    if(length(ngram) == begin_len){
        counter+=1
        freq_tabel[ngram]+=1
    }
}
(sort function not included)
I was wondering if there is something wrong in the code. Or are there some aspects which I have overlooked?
The output I should have is the following:
35383
1580 n
1323 en
1081 e
940 de
839 v
780 er
716 d
713 an
615 t
Instead, I have the following output:
34845
1561 n
1302 en
1067 e
930 de
827 v
772 er
711 d
703 an
609 t
As you can see, the n-grams are correct but the numbers of occurrences are not.
INPUT FILE: http://cl.ly/202j3r0B1342
Not an answer but may help you (assuming n=2).
Did you happen to convert the original file (that seems UTF-8) to latin-1? I got two sets of figures:
==> sorted.latin1_in_utf8_locale <==
1566 n
1308 en
1072 e
929 de
836 v
==> sorted.utf8_in_utf8_locale <==
1579 n
1320 en
1080 e
940 de
838 v
With latin-1 input the figures are closer to yours; with UTF-8 input they are closer to the expected ones.
However, neither matches. Scratching my head.
BTW, I am not sorting the ngrams in the script but outputting in the form suitable for piping them to sort -rn. But this should not cause difference, I guess.
for (ngram in freq_tabel)
    printf "%7i %s\n", freq_tabel[ngram], ngram
I'm in your class, so here's a couple of hints:
Copy the exact input file (using clone from github, don't do a raw copy)
Re-read the assignment, you're supposed to get rid of the leading and trailing spaces, and replace all multiple tabs/spaces with one space.
Also, what's the point of the $1 = $1 on top?
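For what it's worth, here is a minimal sketch of how the whitespace clean-up and the counting could fit together. It is not the assignment's reference solution. It assumes n is passed in with -v, that the default field separator is in use (so that $1 = $1 rebuilds the line with leading and trailing blanks removed and runs of tabs/spaces squeezed to one space), and that your awk counts characters rather than bytes in your locale. It walks the line with substr and a numeric index rather than for (i in chars), because the order of a for-in loop is unspecified. The names freq and total are my own.

```shell
# Tiny test input with leading/trailing blanks and a tab in the middle.
tmp=$(mktemp)
printf '  ab \t ab  \n' > "$tmp"

ngrams=$(awk -v n=2 '
{
    $1 = $1                     # rebuild $0 with single spaces, ends trimmed
    line = tolower($0)
    # Slide a window of n characters over the line, left to right.
    for (i = 1; i + n - 1 <= length(line); i++) {
        freq[substr(line, i, n)]++
        total++
    }
}
END {
    print total
    for (g in freq) printf "%7d %s\n", freq[g], g
}' "$tmp" | sort -rn)

printf '%s\n' "$ngrams"
rm -f "$tmp"
```

After normalization the test line becomes "ab ab", which contains four bigrams, with "ab" occurring twice.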