AWK: Average of each row from different measurement series
My objective is to calculate the average of the second column over multiple measurement series: the average of the first row across K blocks, the average of the second row across K blocks, and so on. All data is contained in one file and is separated into blocks by blank lines. The file has the following structure:
#
#
33 -0.23
34.5 -0.32
36 -0.4
.
.
.
#
#
33 -0.25
34.5 -0.31
36 -0.38
.
.
.
$ cat avg.awk
BEGIN { FS=" " }
/^#/ { next }
/^\s*$/ { print col1/nr " " col2/nr; col1=col2=nr=0; next }
{ col1 += $1; col2 += $2; nr++ }
END {print col1/nr " " col2/nr }
with input:
$ cat test.txt
#
#
33 -0.23
34.5 -0.32
36 -0.4

#
#
33 -0.25
34.5 -0.31
36 -0.38
gives the following result:
$ awk -f avg.awk test.txt
34.5 -0.316667
34.5 -0.313333
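Note that avg.awk averages the two columns within each block, printing one line per block. For the row-wise averaging described at the top (row i of block 1 averaged with row i of block 2, and so on), a minimal sketch along the same lines could look like this, assuming every block lists the same first-column values in the same order (the file name rowavg.awk is just for illustration):

$ cat rowavg.awk
/^#/ { next }                          # skip comment lines
/^[[:space:]]*$/ { row = 0; next }     # blank line: the next block restarts at row 1
{
    row++
    if (row > maxrow) maxrow = row
    x[row] = $1                        # first column, assumed identical across blocks
    sum[row] += $2                     # accumulate column 2 per row index
    cnt[row]++                         # count how many blocks contributed to this row
}
END {
    for (i = 1; i <= maxrow; i++)
        print x[i] " " sum[i]/cnt[i]
}

On the sample test.txt above this prints one line per row rather than one per block, e.g. 33 with the average of -0.23 and -0.25, which is -0.24.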
Related
How to print something multiple times in awk
I have a file sample.txt that looks like this:

Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT

I have used awk to extract values for the parameters that are relevant for me, like this:

paste <(awk '/Indices/ {print $2}' sample.txt) \
      <(awk '/Period size/ {print $3}' sample.txt) \
      <(awk '/Copynumber/ {print $5}' sample.txt) \
      <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)

Output:

2822--2996  36   4.8  TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623  111  8.1  TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496  39   1.9  AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT

Now I want to add the parameter Sequence to every row. Desired output:

chr18_gl000207_random:2822--2996  36   4.8  TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623  111  8.1  TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496  39   1.9  AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT

I want to do this for many files in a loop, so I need a solution that works with a different number of Repeats as well.
$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }
$1 == "Indices:"  { ind = $2; next }
$1 == "Period"    { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }
isCon { print seq":"ind, per, cpy, $1; isCon=0 }

$ awk -f tst.awk file
chr18_gl000207_random:2822--2996  36   4.8  TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623  111  8.1  TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496  39   1.9  AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
awk equivalents for tidyverse concepts (melt and spread)
I have some text logs that I need to parse and format into CSV. I have a working R script, but it is slow once file sizes increase, and this problem seems like a good candidate for a speed-up using awk (or other command-line tools?) as I understand. I have not done much with awk, and the issue I am having is translating how I think about processing in R to how awk scripting is done.

Example truncated input data (Scrap.log):

; these are comment lines
; *******************************************************************************
; \\C:\Users\Computer\Folder\Folder\Scrap.log
!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12

!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03

Expected output (truncated):

HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...
My general steps in the R script:

add a single header row with new names at the top of the file
spread the top row (starting with !!G) to each row
melt the header column (_START) from wide to long format

Pieces I have working in awk so far include:

how to grab and print the header lines:

awk '/_START/ {header = $0; print header}' Scrap.log

how to write a single row with the new header values:

awk 'BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'

I know each block is separated by a newline and starts with a !!G, so I can write a match on that. I'm unsure whether split-apply-combine thinking translates well to awk:

awk '/!!G/,/\n/ {print}' Scrap.log

Alternatively, I tried setting the RS/FS parameters like:

awk 'BEGIN{RS="\n";FS=" ";}/^!!G/{header=$0;print header}/[0-9]/{print $2}END{}' Scrap.log

I then get stuck on iterating over the rows and fields to do the melt step, as well as combining the capture groups correctly. How do I combine all these pieces to get to the CSV format?
I think the following:

awk '
BEGIN {
    # output the header line
    print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
}
# ignore comment lines
/;/ { next }
/!!G/ {
    valcnt = 1
    # save and shuffle the values
    val[valcnt++] = $2
    val[valcnt++] = $11
    val[valcnt++] = $12
    val[valcnt++] = $13
    val[valcnt++] = $14
    val[valcnt++] = $15
    val[valcnt++] = $3
    val[valcnt++] = $4
    val[valcnt++] = $5
    val[valcnt++] = $6
    val[valcnt++] = $7
    val[valcnt++] = $8
    val[valcnt++] = $9
    val[valcnt++] = $10
    next
}
/_START / {
    # these are headers - save them to head, to be reused later
    for (i = 2; i <= NF; ++i) {
        # fun fact: it is indexed on NF
        head[i] = $i
    }
    next
}
# this function is redundant, but it is just easier for me to think about the code
function output(firstval, header, value,    cur, i) {
    cur = valcnt
    val[cur++] = firstval
    val[cur++] = header
    val[cur++] = value
    # output val as csv
    for (i = 1; i < cur; ++i) {
        printf "%s%s", val[i], i != cur - 1 ? "," : "\n"
    }
}
/[0-9]+/ {
    for (i = 2; i <= NF; ++i) {
        # add these 3 to all the other values and output them
        # ie. add the first column, the header from head, and the value
        output($1, head[i], $i)
    }
}
'

Should output what you want. Tested on repl.
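A small usage note (my own addition, not part of the answer): to actually produce the CSV, pass the log file as input and redirect the output, for example

$ awk -f melt.awk Scrap.log > Scrap.csv

assuming the program body between the quotes above is saved as melt.awk; that file name is illustrative.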
How to replace data in pandas by using values in dict?
I have a series which contains several numbers. I want to replace them with other string-type data by using dictionary values. But I don't know how to do that...

GDP_group['GdpForYearPer$1M'].head(5)
0    46.919625
1    47.515189
2    47.737955
3    54.832578
4    56.338028
5    63.101272

This is the dict that I made to replace the data:

range_GDP = {'$0 ~ $100M': np.arange(0,100),
             '$100M ~ $1B': np.arange(100.0000001,1000),
             '$1B ~ $10B': np.arange(1000.000001, 10000),
             '$10B ~ $100B': np.arange(10000.000001, 100000),
             '$100B ~ $1T': np.arange(100000.000001, 1000000),
             '$1T ~': np.arange(1000000.000001, 20000000)}
You can use pd.cut to segment your data into ranges and apply labels. (A dict of np.arange values cannot work here: np.arange produces discrete sample points, so an arbitrary float like 46.919625 is never a member of one; binning against interval edges is what you want.)

(re)generate dummy data sampled uniformly in log space:

import numpy as np
import pandas as pd

GdpForYearPer1M = pd.Series(10**np.random.randint(0, 8, 100))
"""
0           1
1        1000
2         100
3          10
4         100
        ...
95    1000000
96        100
97     100000
98      10000
99         10
"""

solution:

# generate "cuts" (bins) and associated labels from `range_GDP`.
cut_data = [(np.min(v), k) for k, v in range_GDP.items()]
bins, labels = zip(*cut_data)

# bins required to have one more value than labels
bins = list(bins) + [np.inf]

pd.cut(GdpForYearPer1M, bins=bins, labels=labels)

output:

0       $0 ~ $100M
1      $100M ~ $1B
2       $0 ~ $100M
3       $0 ~ $100M
4       $0 ~ $100M
          ...
95     $100B ~ $1T
96      $0 ~ $100M
97    $10B ~ $100B
98      $1B ~ $10B
99      $0 ~ $100M
Length: 100, dtype: category
Categories (6, object): [$0 ~ $100M < $100M ~ $1B < $1B ~ $10B < $10B ~ $100B < $100B ~ $1T < $1T ~]
rearrange from specific string into respective column
I'm trying to rearrange specific strings into their respective columns. For example, 126N will be sorted into the "Normal" column, and from Value 1 the integer will be concatenated with 126, resulting in:

N=Normal
126 # 1

Here is the input (N=Normal, W=Weak):

Value 1
126N,
Value 3
18N,
Value 4
559N, 562N, 564N,
Value 6
553W, 565A, 553N,
Value 5
490W,
Value 9
564N,

And the output should be:

W=Weak
490 # 5
553 # 6

A=Absolute
565 # 6

N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9

Let me know your thoughts on this. I've tried the script below, but I'm still figuring out how to concatenate the value, and some of the entries are missing:

cat input.txt | sed '/^\s*$/d' | awk 'BEGIN{RS=","};match($0,/N/){print $3"c"$2}' | sed ':a;N;$!ba;s/\n/;/g' | sed 's/W//g;s/N//g;s/S//g'
This should give you what you want using GNU awk. It will work with any number of letters, not just A, N, W:

awk -F, '
!/Value/ {
    for (i=1;i<NF;i++) {
        hd=substr($i,length($i),1)
        arr[hd][++cnt[hd]]=($i+0" # "f)
    }
}
{ split($0,b," "); f=b[2] }
END {
    for (i in arr) {
        print "\n"i"\n---"
        for (j in arr[i]) { print arr[i][j] }
    }
}' file

A
---
565 # 6

N
---
562 # 4
564 # 4
553 # 6
564 # 9
126 # 1
18 # 3
559 # 4

W
---
553 # 6
490 # 5
Another alternative in awk would be:

awk -F',| ' '
$1 == "Value" {value = $2; next}
{ for (i=1; i<=NF; i++) {
      if ($i~"N$")
          N[substr($i, 1, length($i) - 1)] = value
      if ($i~"W$")
          W[substr($i, 1, length($i) - 1)] = value
  }
}
END {
    print "W=Weak"
    for (i in W)
        print i, "#", W[i]
    print "\nN=Normal"
    for (i in N)
        print i, "#", N[i]
}
' file

(note: this relies on knowing the wanted headers are W=Weak and N=Normal. It would take a few additional expressions if the headers are subject to change.)

Output

$ awk -F',| ' '
> $1 == "Value" {value = $2; next}
> { for (i=1; i<=NF; i++) {
>       if ($i~"N$")
>           N[substr($i, 1, length($i) - 1)] = value
>       if ($i~"W$")
>           W[substr($i, 1, length($i) - 1)] = value
>   }
> }
> END {
>     print "W=Weak"
>     for (i in W)
>         print i, "#", W[i]
>     print "\nN=Normal"
>     for (i in N)
>         print i, "#", N[i]
> }
> ' file
W=Weak
490 # 5

N=Normal
18 # 3
126 # 1
559 # 4
562 # 4
564 # 9
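As an illustration of those "additional expressions" (a sketch of my own, not part of the answer above): one way is a lookup table mapping each trailing letter to its full name, collecting entries per letter instead of in fixed per-letter arrays. The name table below is an assumption based on the headers shown in the question:

awk -F',| ' '
BEGIN { name["W"]="Weak"; name["A"]="Absolute"; name["N"]="Normal" }
$1 == "Value" { value = $2; next }
{
    for (i=1; i<=NF; i++)
        if ($i ~ /^[0-9]+[[:alpha:]]$/) {            # e.g. 126N: digits plus one letter
            abbr = substr($i, length($i), 1)
            out[abbr] = out[abbr] substr($i, 1, length($i)-1) " # " value ORS
        }
}
END {
    for (abbr in out)
        print abbr "=" name[abbr] ORS out[abbr]
}
' file

Unlike the version above, this also keeps the A entries and adapts to new letters by extending only the BEGIN table.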
$ cat tst.awk
NR%2 { val = $NF; next }
{
    for (i=1; i<=NF; i++) {
        num = $i+0
        abbr = $i
        gsub(/[^[:alpha:]]/,"",abbr)
        list[abbr] = list[abbr] num " # " val ORS
    }
}
END {
    n = split("Weak Absolute Normal",types)
    for (i=1; i<=n; i++) {
        name = types[i]
        abbr = substr(name,1,1)
        print abbr "=" name ORS list[abbr]
    }
}

$ awk -f tst.awk file
W=Weak
553 # 6
490 # 5

A=Absolute
565 # 6

N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9
AWK/SED/getline - How to simplify/improve this example?
I'm trying to take a 3-column input file and separate it based on a condition in column 3. I think it'll be easier to show you than explain.

Input file: outputfile1.txt

26 NCC 1    # First Start
38 NME 2
44 NSC 1    # Start2
56 NME 2
62 NCC 1    # Start3
...
314 NCC 1   # Start17
326 NME 2
332 NSC 1   # Start18
344 NME 2
349 NME 2   # Final End

(The hashed comments aren't part of the file; I've added them to make things clearer.)

Column 3 is used to determine a new "START" entry. "START"/"END" values come from column 1. "TITLE" should be all values from column 2 between consecutive "STARTs".

Desired output: outputfile2.txt

START=26 ; END=43 ; TITLE=NCC_NME
START=44 ; END=61 ; TITLE=NSC_NME
START=62 ; END=79 ; TITLE=NCC_...
...
START=314 ; END=331 ; TITLE=NCC_NME
START=332 ; END=349 ; TITLE=NSC_NME

Crude script that 'almost' does this, but makes 5 single-column temporary files in the process:

awk '{ print $1 }' outputfile1.txt | sed '$d' > tempfile1.txt
awk '{ print $1-1 }' outputfile1.txt | sed '$d' > tempfile2.txt
sed '$d' outputfile1.txt | awk 'NR{print $3-p}{p=$3}' > tempfile3.txt
awk ' { getline value < "tempfile1.txt" }
      { if (NR==1) print value ; else if( $1 != 1 ) print value }' tempfile3.txt > tempfile4.txt
awk ' { getline value < "tempfile2.txt" }
      { if (NR==1) print value ; else if ( $1 != 1 ) print value }' tempfile3.txt | sed '1d' > tempfile5.txt
awk 'END{print $1}' outputfile1.txt >> tempfile5.txt
awk ' { getline value < "tempfile5.txt" }
      { print "START="$0 " ; END="value }' tempfile4.txt > outputfile2.txt

Contents of temp files:

        | temp1  temp2  temp3
NR=1    |   26     25     1
NR=2    |   38     37     1
NR=3    |   44     43    -1
NR=4    |   56     55     1
NR=5    |   62     61    -1
...     |  ...    ...   ...
NR=33   |  314    313    -1
NR=34   |  326    325     1
NR=35   |  332    331    -1
NR=36   |  344    343     1
----------------------------------
        | temp4  temp5
NR=1    |   26     43
NR=2    |   44     61
NR=3    |   62     79
...     |  ...    ...
NR=17   |  314    331
NR=18   |  332    359

Current output: outputfile2.txt

START=26 ; END=43
START=44 ; END=61
START=62 ; END=79
...
START=314 ; END=331
START=332 ; END=349
Try:

awk '
function print_range() {
    printf "START=%s ; END=%s ; TITLE=%s\n", start, end-1, title
}
{ end=$1 }
# if column 3 is equal to 1, then there is a new start
$3==1 {
    if (title) print_range()
    start=$1
    title=$2
    next
}
# if the label in field 2 is not part of the title then add it
title!~"(^|_)" $2 "(_|$)" {
    title=title"_"$2
}
END {
    end++
    print_range()
}
' file
You can do everything in one go using:

awk '{
    if (NR==1) {
        # if we are on the first record, initialize our variables
        PREVIOUS_ONE=$1
        TITLE=$2
        PREVIOUS_THIRD=$3
    } else {
        if (PREVIOUS_THIRD < $3) {
            # as long as the new third column is larger, we update our variables
            TITLE=TITLE"_"$2
            PREVIOUS_THIRD=$3
        } else if ($3 == 1) {
            # the third column dropped back to 1: a new block starts here,
            # so we print out the data and reinitialize our variables
            print "START="PREVIOUS_ONE" ; END="$1-1" ; TITLE="TITLE
            PREVIOUS_ONE=$1
            TITLE=$2
            PREVIOUS_THIRD=$3
        }
        # otherwise (third column equal, as on the final line) just fall
        # through and let the END block flush the last range
    }
    LAST=$1        # remember the last value of column 1 for the END block
}
END {
    # flush the final range; without this the last START/END/TITLE line
    # would never be printed
    print "START="PREVIOUS_ONE" ; END="LAST" ; TITLE="TITLE
}' outputfile1.txt