AWK: Average of each row from different measurement series

My objective is to calculate the average of the second column across multiple measurement series (the average of the first row of K blocks, the average of the second row of K blocks, etc.). All data is contained in one file and is separated into blocks by blank lines. The file has the following structure:
#
#
33 -0.23
34.5 -0.32
36 -0.4
.
.
.

#
#
33 -0.25
34.5 -0.31
36 -0.38
.
.
.

$ cat avg.awk
BEGIN { FS=" " }                  # default field separator, kept explicit
/^#/ { next }                     # skip comment lines
/^[[:space:]]*$/ { print col1/nr " " col2/nr; col1=col2=nr=0; next }  # blank line ends a block ([[:space:]] is portable where \s is gawk-only)
{ col1 += $1; col2 += $2; nr++ }  # accumulate column sums within the block
END { print col1/nr " " col2/nr } # flush the final block
with input:
$ cat test.txt
#
#
33 -0.23
34.5 -0.32
36 -0.4

#
#
33 -0.25
34.5 -0.31
36 -0.38
gives the result:
$ awk -f avg.awk test.txt
34.5 -0.316667
34.5 -0.313333
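
Note that avg.awk prints per-block column averages. If the goal is literally the average of row i across the K blocks, as the objective states, a minimal sketch along these lines should work (avg_rows.awk is a hypothetical name; it assumes every block lists the same first-column values in the same order):
$ cat avg_rows.awk
/^#/ { next }                      # skip comment lines
/^[[:space:]]*$/ { row = 0; next } # a blank line restarts the row counter
{
    row++                          # position of this line within its block
    if (row > maxrow) maxrow = row
    x[row] = $1                    # first column, identical across blocks
    sum[row] += $2                 # accumulate the second column per row
    cnt[row]++                     # number of blocks contributing to this row
}
END { for (r = 1; r <= maxrow; r++) print x[r], sum[r]/cnt[r] }
On test.txt above this prints 33 -0.24, then 34.5 -0.315, then 36 -0.39.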

How to print something multiple times in awk

I have a file sample.txt that looks like this:
Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I have used awk to extract values for the parameters that are relevant to me, like this:
paste <(awk '/Indices/ {print $2}' sample.txt) <(awk '/Period size/ {print $3}' sample.txt) <(awk '/Copynumber/ {print $5}' sample.txt) <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)
Output:
2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
Now I want to add the parameter Sequence to every row.
Desired output:
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I want to do this for many files in a loop so I need a solution that would work with a different number of Repeats as well.
$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }   # remember the current sequence name
$1 == "Indices:" { ind = $2; next }
$1 == "Period" { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }    # the next line holds the consensus pattern
isCon { print seq":"ind, per, cpy, $1; isCon=0 }
$ awk -f tst.awk file
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
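
One caveat: Repeat 2's consensus pattern spans two lines and tst.awk prints only the first, which matches the desired output above. If the full pattern were ever needed, a hedged variant could accumulate pattern lines until the next header; tst2.awk and its flush() helper are hypothetical names:
$ cat tst2.awk
BEGIN { OFS="\t" }
function flush() { if (con != "") { print seq ":" ind, per, cpy, con; con = "" } }
$1 == "Sequence:" { flush(); seq = $2; next }   # a new sequence also ends the previous repeat
$1 == "Repeat"    { flush(); next }
$1 == "Indices:"  { ind = $2; next }
$1 == "Period"    { per = $3; cpy = $5; next }
$1 == "Consensus" { next }                      # the "Consensus pattern (...):" header line
/^[ACGT]+$/       { con = con $0 }              # pattern lines, possibly several per repeat
END { flush() }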

awk equivalents for tidyverse concepts (melt and spread)

I have some text logs that I need to parse and format into CSV.
I have a working R script, but it gets slow as file sizes increase, and as I understand it this problem is a good candidate for a speed-up using awk (or other command-line tools?).
I have not done much with awk, and the issue I am having is translating how I think about processing in R into how awk scripting is done.
Example truncated input data (Scrap.log):
; these are comment lines
; *******************************************************************************
; \\C:\Users\Computer\Folder\Folder\Scrap.log
!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header1 Header2 Header3 Header4 Header5 Header6 Header7
10 12.23 1.91 6.63 1.68 50.03 0.50 13.97
11 11.32 1.94 6.64 1.94 50.12 0.58 15.10
12 12.96 2.15 6.57 2.12 55.60 0.62 16.24
13 11.43 2.18 6.60 2.36 50.89 0.68 17.39
14 14.91 2.32 6.64 2.59 56.09 0.73 18.41
15 13.16 2.38 6.53 2.85 51.62 0.81 19.30
16 15.02 2.50 6.67 3.05 56.22 0.85 20.12
!!G 99999 % % % % % % % % CURRENT XYZ ABC STATE1 STATE2
_START Header8 Header9 Header10 Header11 Header12 Header13 Header14
10 22.03 24.41 15.01 51.44 44.28 16.57 11.52
11 21.05 24.62 15.62 51.23 45.42 16.47 11.98
12 20.11 24.64 16.38 52.16 46.59 16.54 12.42
13 24.13 24.93 17.23 52.34 47.72 16.51 12.88
14 27.17 24.95 18.06 52.79 48.72 16.45 13.30
15 22.87 25.04 19.27 53.01 49.50 16.47 13.63
16 23.08 25.22 20.12 53.75 50.64 16.55 14.03
Expected output (truncated):
HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header1,12.23
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header2,1.91
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header3,6.63
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header4,1.68
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header5,50.03
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header6,0.5
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,10,Header7,13.97
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header1,11.32
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header2,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header3,6.64
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header4,1.94
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header5,50.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header6,0.58
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,11,Header7,15.1
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header1,12.96
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header2,2.15
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header3,6.57
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header4,2.12
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header5,55.6
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header6,0.62
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,12,Header7,16.24
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header1,11.43
99999,CURRENT,XYZ,ABC,STATE1,STATE2,%,%,%,%,%,%,%,%,13,Header2,2.18
...
My general steps in the R script:
add a single header row with new names at the top of the file
spread the top row (starting with !!G) to each row
melt the header column (_START) from wide to long format
Pieces I have working in awk so far include:
how to grab and print the header lines
awk '/_START/ {header = $0; print header}' Scrap.log
How to write a single row with the new header values
awk ' BEGIN{ ORS=" "; for (counter = 1; counter <= 14; counter++) print "HH",counter;}'
I know each block is separated by a newline and starts with !!G, so I can write a match on that. I am unsure whether a split-apply-combine style of thinking translates well to awk.
awk '/!!G/,/\n/ {print}' Scrap.log
alternatively, I tried setting RS/FS parameters like:
awk ' BEGIN{RS="\n";FS=" ";}/^!!G/{header=$0;print header}/[0-9]/{print $2}END{}' Scrap.log
I then get stuck on iterating over the rows and fields to do the melt step as well as combining the capture groups correctly.
How do I combine all these pieces to get to the CSV format?
I think the following:
awk '
BEGIN{
# output the header line
print "HH1,HH2,HH3,HH4,HH5,HH6,HH7,HH8,HH9,HH10,HH11,HH12,HH13,HH14,START,HeaderName,Value"
}
# ignore comment lines (anchored so only lines starting with ";" are skipped)
/^;/{next}
/!!G/{
valcnt = 1
# save and shuffle the values
val[valcnt++] = $2
val[valcnt++] = $11
val[valcnt++] = $12
val[valcnt++] = $13
val[valcnt++] = $14
val[valcnt++] = $15
val[valcnt++] = $3
val[valcnt++] = $4
val[valcnt++] = $5
val[valcnt++] = $6
val[valcnt++] = $7
val[valcnt++] = $8
val[valcnt++] = $9
val[valcnt++] = $10
next
}
/_START /{
# these are headers - save them to head, to be reused later
for (i = 2; i <= NF; ++i) {
# note: it's indexed by field position, up to NF
head[i] = $i
}
next
}
# this function is not strictly necessary, but it makes the code easier to reason about
function output(firstval, header, value, \
cur, i) {
cur = valcnt
val[cur++] = firstval
val[cur++] = header
val[cur++] = value
# output val as csv
for (i = 1; i < cur; ++i) {
printf "%s%s", val[i], (i != cur - 1 ? "," : "\n")
}
}
/^[0-9]+/{  # data rows begin with the index value
for (i = 2; i <= NF; ++i) {
# add these 3 to the saved values and output them,
# i.e. the first column, the header from head[], and the value
output($1, head[i], $i)
}
}
'
Should output what you want. Tested on repl.
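For running this over many files, the program body can be saved to a script and invoked once per log; melt.awk and Scrap.csv are hypothetical names:
$ awk -f melt.awk Scrap.log > Scrap.csv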

How to replace data in pandas by using values in dict?

I have a series that contains several numbers. I want to replace them with other, string-type data using dictionary values, but I don't know how to do that...
GDP_group['GdpForYearPer$1M'].head(5)
0 46.919625
1 47.515189
2 47.737955
3 54.832578
4 56.338028
This is the dict that I made to replace the data:
range_GDP = {'$0 ~ $100M': np.arange(0, 100), '$100M ~ $1B': np.arange(100.0000001, 1000),
'$1B ~ $10B': np.arange(1000.000001, 10000), '$10B ~ $100B': np.arange(10000.000001, 100000),
'$100B ~ $1T': np.arange(100000.000001, 1000000), '$1T ~': np.arange(1000000.000001, 20000000)}
You can use pd.cut to segment your data into ranges and apply labels.
(re)generate dummy data sampled uniformly in log space:
import numpy as np
import pandas as pd
GdpForYearPer1M = pd.Series(10**np.random.randint(0, 8, 100))
"""
0 1
1 1000
2 100
3 10
4 100
...
95 1000000
96 100
97 100000
98 10000
99 10
"""
solution:
# generate "cuts" (bins) and associated labels from `range_GDP`.
cut_data = [(np.min(v), k) for k, v in range_GDP.items()]
bins, labels = zip(*cut_data)
# bins must have one more value than labels
bins = list(bins) + [np.inf]
pd.cut(GdpForYearPer1M, bins=bins, labels=labels)
output:
0 $0 ~ $100M
1 $100M ~ $1B
2 $0 ~ $100M
3 $0 ~ $100M
4 $0 ~ $100M
...
95 $100B ~ $1T
96 $0 ~ $100M
97 $10B ~ $100B
98 $1B ~ $10B
99 $0 ~ $100M
Length: 100, dtype: category
Categories (6, object): [$0 ~ $100M < $100M ~ $1B < $1B ~ $10B < $10B ~ $100B < $100B ~ $1T < $1T ~]
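To actually replace the data, the result of pd.cut can be assigned back as a new column; GdpRange is a made-up name and GDP_group is the frame from the question:
# 'GdpRange' is a hypothetical column name
GDP_group['GdpRange'] = pd.cut(GDP_group['GdpForYearPer$1M'], bins=bins, labels=labels)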

rearrange from specific string into respective column

I'm trying to rearrange specific strings into their respective columns. For example:
126N (will be sorted into the "Normal" column)
Value 1 (the integer will be concatenated with 126)
Resulting in:
N=Normal
126 # 1
Here is the input
(N=Normal, W=Weak)
Value 1
126N,
Value 3
18N,
Value 4
559N, 562N, 564N,
Value 6
553W, 565A, 553N,
Value 5
490W,
Value 9
564N,
And the output should be
W=Weak
490 # 5
553 # 6
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9
Let me know your thoughts on this.
I've tried this script, but I'm still figuring out how to concatenate the value:
cat input.txt | sed '/^\s*$/d' | awk 'BEGIN{RS=","};match($0,/N/){print $3"c"$2}' | sed ':a;N;$!ba;s/\n/;/g' | sed 's/W//g;s/N//g;s/S//g'
And some of the values are missing.
This should give you what you want using GNU awk. It will work with any number of letters, not just A, N, and W:
awk -F, '
!/Value/ {
    # data line: file each entry under its trailing letter
    for (i=1;i<NF;i++) {
        hd=substr($i,length($i),1);        # trailing letter, e.g. N
        arr[hd][++cnt[hd]]=($i+0" # "f)}   # numeric part plus the current Value number
}
{split($0,b," ");f=b[2];}                  # on "Value n" lines this captures n (runs on every line, harmless elsewhere)
END {
    for (i in arr) { print "\n"i"\n---";
        for (j in arr[i]) {
            print arr[i][j]}}
}' file
A
---
565 # 6
N
---
562 # 4
564 # 4
553 # 6
564 # 9
126 # 1
18 # 3
559 # 4
W
---
553 # 6
490 # 5
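Note that for (i in arr) visits elements in an unspecified order, which is why this output is not in the question's order. A hedged tweak to the END block using gawk's asorti would print the letter groups alphabetically (not quite the question's W/A/N order, but deterministic) and each group's entries in input order:
END {
    n = asorti(cnt, letters)            # sorted letters: A, N, W
    for (k = 1; k <= n; k++) {
        ltr = letters[k]
        print "\n" ltr "\n---"
        for (j = 1; j <= cnt[ltr]; j++) # numeric loop preserves insertion order
            print arr[ltr][j]
    }
}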
Another alternative in awk would be:
awk -F',| ' '
$1 == "Value" {value = $2; next}
{ for (i=1; i<=NF; i++) {
if ($i~"N$")
N[substr($i, 1, length($i) - 1)] = value
if ($i~"W$")
W[substr($i, 1, length($i) - 1)] = value
}
}
END {
print "W=Weak"
for (i in W)
print i, "#", W[i]
print "\nN=Normal"
for (i in N)
print i, "#", N[i]
}
' file
(Note: this relies on knowing the wanted headers are W=Weak and N=Normal. It would take a few additional expressions if the headers are subject to change.)
Output (from running the same script on file):
W=Weak
490 # 5
553 # 6
N=Normal
18 # 3
126 # 1
553 # 6
559 # 4
562 # 4
564 # 9
$ cat tst.awk
NR%2 { val = $NF; next }                 # odd lines: "Value n" - remember n
{
    for (i=1; i<=NF; i++) {
        num = $i+0                       # numeric part of e.g. "553W,"
        abbr = $i
        gsub(/[^[:alpha:]]/,"",abbr)     # strip everything but the letter
        list[abbr] = list[abbr] num " # " val ORS
    }
}
END {
    n = split("Weak Absolute Normal",types)
    for (i=1; i<=n; i++) {
        name = types[i]
        abbr = substr(name,1,1)          # W, A, N
        print abbr "=" name ORS list[abbr]
    }
}
$ awk -f tst.awk file
W=Weak
553 # 6
490 # 5
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9

AWK/SED/getline - How to simplify/improve this example?

I'm trying to take a 3 column input file and separate it based on a condition in column 3. I think it'll be easier to show you than explain:
Input File:
outputfile1.txt
26 NCC 1 # First Start
38 NME 2
44 NSC 1 # Start2
56 NME 2
62 NCC 1 # Start3
...
314 NCC 1 # Start17
326 NME 2
332 NSC 1 # Start18
344 NME 2
349 NME 2 # Final End
(The hashed comments aren't part of the file; I've added them to make things clearer.)
Column 3 is used to determine a new "START" entry
"START/END" values are from Column 1
"TITLE" I would like to be all values from Column 2 between consecutive "STARTS"
Desired Output
outputfile2.txt
START=26 ; END=43 ; TITLE=NCC_NME
START=44 ; END=61 ; TITLE=NSC_NME
START=62 ; END=79 ; TITLE=NCC_...
...
START=314 ; END=331 ; TITLE=NCC_NME
START=332 ; END=349 ; TITLE=NSC_NME
Here is a crude script that 'almost' does this, but it makes five single-column temporary files in the process.
awk '{ print $1 }' outputfile1.txt | sed '$d' > tempfile1.txt
awk '{ print $1-1 }' outputfile1.txt | sed '$d' > tempfile2.txt
sed '$d' outputfile1.txt | awk 'NR{print $3-p}{p=$3}' > tempfile3.txt
awk ' { getline value < "tempfile1.txt" }
{ if (NR==1)
print value ;
else if( $1 != 1 )
print value }' tempfile3.txt > tempfile4.txt
awk ' { getline value < "tempfile2.txt" }
{ if (NR==1)
print value ;
else if ( $1 != 1 )
print value }' tempfile3.txt | sed '1d' > tempfile5.txt
awk 'END{print $1}' outputfile1.txt >> tempfile5.txt
awk ' { getline value < "tempfile5.txt" }
{print "START="$0 " ; END="value}' tempfile4.txt > outputfile2.txt
Contents of temp files
| temp1 temp2 temp3
NR=1 | 26 25 1
NR=2 | 38 37 1
NR=3 | 44 43 -1
NR=4 | 56 55 1
NR=5 | 62 61 -1
... | ... ... ...
NR=33 | 314 313 -1
NR=34 | 326 325 1
NR=35 | 332 331 -1
NR=36 | 344 343 1
----------------------------------
| temp4 temp5
NR=1 | 26 43
NR=2 | 44 61
NR=3 | 62 79
... | ... ...
NR=17 | 314 331
NR=18 | 332 349
Current output
outputfile2.txt
START=26 ; END=43
START=44 ; END=61
START=62 ; END=79
...
START=314 ; END=331
START=332 ; END=349
Try:
awk '
function print_range() {
printf "START=%s ; END=%s ; TITLE=%s\n", start, end-1, title
}
{
end=$1
}
# if column 3 is equal to 1, then there is a new start
$3==1 {
if(title) print_range()
start=$1
title=$2
next
}
# if the label in field 2 is not part of the title then add it
title!~"(^|_)" $2 "(_|$)" {
title=title"_"$2
}
END {
end++
print_range()
}
' file
You can do everything in one go using:
awk '{
    if(NR==1){
        # first record: initialize our variables
        PREVIOUS_ONE=$1
        TITLE=$2
    } else if($3 != 1) {
        # a value other than 1 in column 3 means we are still inside the block;
        # append the label unless it is already part of the title
        if(TITLE !~ "(^|_)" $2 "(_|$)") TITLE=TITLE"_"$2
    } else {
        # column 3 equal to 1 marks a new block:
        # print the finished one and reinitialize our variables
        print "START="PREVIOUS_ONE" ; END="$1-1" ; TITLE="TITLE
        PREVIOUS_ONE=$1
        TITLE=$2
    }
    LAST=$1
}
END {
    # the final block never sees a following start, so flush it here
    print "START="PREVIOUS_ONE" ; END="LAST" ; TITLE="TITLE
}' outputfile1.txt
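With the sample outputfile1.txt this should print the same ranges as the desired output, for example:
START=26 ; END=43 ; TITLE=NCC_NME
...
START=332 ; END=349 ; TITLE=NSC_NME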