Awk: Formatting the order of table data in an awk script

I have the following output file. Please note that this data is dynamic, so there could be more or fewer years and many more categories A, B, C, D...:
2015 2016 2017
EX
FE
B 0.00 -2.00 -1.00
D 0.00 -1.00 0.00
sumFE 0.00 -3.00 -1.00
VE
B 0.00 -3.00 0.00
C -4.00 0.00 0.00
D 0.00 -5.00 0.00
sumVE -4.00 -8.00 0.00
sumE -4.00 -11.00 -1.00
IN
FI
A 8.00 0.00 0.00
C 0.00 0.00 8.00
sumFI 8.00 0.00 8.00
VI
A 0.00 0.00 5.00
B 4.00 0.00 0.00
sumVI 4.00 0.00 5.00
sumI 12.00 0.00 13.00
net 8.00 -11.00 12.00
I am trying to reformat it like this:
2015 2016 2017
IN
VI
A 0.00 0.00 5.00
B 4.00 0.00 0.00
sumVI 4.00 0.00 5.00
FI
A 8.00 0.00 0.00
C 0.00 0.00 8.00
sumFI 8.00 0.00 8.00
sumI 12.00 0.00 13.00
EX
VE
B 0.00 -3.00 0.00
C -4.00 0.00 0.00
D 0.00 -5.00 0.00
sumVE -4.00 -8.00 0.00
FE
B 0.00 -2.00 -1.00
D 0.00 -1.00 0.00
sumFE 0.00 -3.00 -1.00
sumE -4.00 -11.00 -1.00
net 8.00 -11.00 12.00
I have tried the following script as a start:
#!/usr/bin/env bash
awk '
BEGIN{FS="\t"}
3>NR {print "D" $0}
$1 ~ /^I$/,$1 ~ /^sumI$/ {
print
}
$1 ~ /^E$/,$1 ~ /^sumE$/{
print
}
$1 ~ /net/ {print ORS $0}
' "${#:--}"
The script goes a long way toward swapping the E data and the I data; however, the original order is not preserved and the I block is still printed last. Can someone please help with this?
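For comparison, here is a sketch (not from the original post) that stays in plain POSIX awk: buffer each sub-block as it is read, then print the blocks in an explicit order. It hard-codes the group and sub-section names (IN/EX, VI/FI, VE/FE) from the sample, so it is less general than the gawk answer that follows:
#!/usr/bin/env bash
awk '
NR == 1 { print; next }                                # year header
$1 == "net" { net = $0; next }                         # final net line, printed last
$1 == "EX" || $1 == "IN" { grp = $1; next }            # group markers
$1 == "sumE" || $1 == "sumI" { gsum[grp] = $0; next }  # group totals
$1 ~ /^(FE|VE|FI|VI)$/ { sec = $1; next }              # sub-section markers
{ block[sec] = block[sec] $0 ORS }                     # item rows and sumFE/sumVE/... rows
END {
    split("IN VI FI EX VE FE", order)                  # the desired output order
    for (i = 1; i <= 6; i++) {
        name = order[i]
        if (name == "IN" || name == "EX")
            print name
        else
            printf "%s%s", name ORS, block[name]
        if (name == "FI") print gsum["IN"]             # sumI closes the IN group
        if (name == "FE") print gsum["EX"]             # sumE closes the EX group
    }
    print net
}
' "${@:--}"
This reproduces the desired output for the sample above, but it will not adapt if new groups or sub-sections appear.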

It will probably be easier to modify the originating code to use GNU awk's predefined array scanning orders. The key is to switch the scanning order (PROCINFO["sorted_in"]) just before each associated for (index in array) loop; a minimal standalone demonstration follows the generated output below.
Adding four lines of code (see # comments) to what I'm guessing is the originating code:
...
END {
    for (year = minYear; year <= maxYear; year++) {
        printf "%s%s", OFS, year
    }
    print ORS
    PROCINFO["sorted_in"] = "@ind_str_desc"         # sort cat == { I | E } in descending order
    for (cat in ctiys2amounts) {
        printf "%s\n\n", (cat == "I") ? "IN" : "EX" # print { IN | EX }
        delete catSum
        PROCINFO["sorted_in"] = "@ind_str_desc"     # sort type == { VI | FI } or { VE | FE } in descending order
        for (type in ctiys2amounts[cat]) {
            print type
            delete typeSum
            PROCINFO["sorted_in"] = "@ind_str_asc"  # sort item == { A | B | C | D } in ascending order
            for (item in ctiys2amounts[cat][type]) {
                printf "%s", item
                for (year = minYear; year <= maxYear; year++) {
                    amount = ctiys2amounts[cat][type][item][year]
                    printf "%s%0.2f", OFS, amount
                    typeSum[year] += amount
                }
                print ""
            }
...
This generates:
2015 2016 2017
IN
VI
A 0.00 0.00 5.00
B 4.00 0.00 0.00
sumVI 4.00 0.00 5.00
FI
A 8.00 0.00 0.00
C 0.00 0.00 8.00
sumFI 8.00 0.00 8.00
sumI 12.00 0.00 13.00
EX
VE
B 0.00 -3.00 0.00
C -4.00 0.00 0.00
D 0.00 -5.00 0.00
sumVE -4.00 -8.00 0.00
FE
B 0.00 -2.00 -1.00
D 0.00 -1.00 0.00
sumFE 0.00 -3.00 -1.00
sumE -4.00 -11.00 -1.00
net 8.00 -11.00 12.00
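For anyone unfamiliar with PROCINFO["sorted_in"], here is the minimal standalone demonstration promised above (gawk only), showing that the scanning order can be switched between loops over the same array:
gawk 'BEGIN {
    a["I"]; a["E"]                             # two-element test array
    PROCINFO["sorted_in"] = "@ind_str_desc"    # descending index order
    for (k in a) printf "%s ", k               # prints: I E
    print ""
    PROCINFO["sorted_in"] = "@ind_str_asc"     # ascending index order
    for (k in a) printf "%s ", k               # prints: E I
    print ""
}'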

Related

Pandas: Generate Experiment Results in Adjacency-Like Matrix Table

I have a set of experimental results (anonymised subset below) in dataframe format, read from a CSV file ('Input.csv'). I want to output a table built from the columns 'Experimenter', 'Subject', 'F' and 'G' in an adjacency-matrix-like format. Multiple entries should be aggregated by average, for example 'Alpha' and 'Bravo', which appear in reciprocal roles as 'Experimenter' and 'Subject'. In addition, there should be 1.00s along the main diagonal. Finally, the output table should be written to a CSV file ('Output.csv').
Actual Input:
Day,Experimenter,Subject,D,E,F,G
Monday,Alpha,Bravo,4,2,2.68,0.44
Monday,Charlie,Delta,0,2,0.62,2.29
Monday,Echo,Foxtrot,1,2,1.03,3.14
Monday,Golf,Hotel,1,2,0.75,2.53
Tuesday,India,Juliet,2,1,0.71,1.60
Wednesday,Foxtrot,Charlie,2,0,0.48,0.61
Thursday,Delta,Hotel,2,3,2.06,1.93
Thursday,Bravo,Alpha,1,1,0.53,0.41
Friday,Bravo,Delta,1,1,1.65,0.84
Friday,Golf,Alpha,0,0,0.19,1.30
Friday,India,Echo,1,0,1.31,0.58
Expected Output:
Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet
Alpha 1.00 1.39 0.00 0.00 0.00 0.00 1.30 0.00 0.00 0.00
Bravo 0.485 1.00 0.00 1.65 0.00 0.00 0.00 0.00 0.00 0.00
Charlie 0.00 0.00 1.00 0.62 0.00 0.61 0.00 0.00 0.00 0.00
Delta 0.00 0.84 2.29 1.00 0.00 0.00 0.00 2.06 0.00 0.00
Echo 0.00 0.00 0.00 0.00 1.00 1.03 0.00 0.00 0.58 0.00
Foxtrot 0.00 0.00 0.48 0.00 3.14 1.00 0.00 0.00 0.00 0.00
Golf 0.19 0.00 0.00 0.00 0.00 0.00 1.00 0.75 0.00 0.00
Hotel 0.00 0.00 0.00 1.93 0.00 0.00 2.53 1.00 0.00 0.00
India 0.00 0.00 0.00 0.00 1.31 0.00 0.00 0.00 1.00 0.71
Juliet 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.60 1.00
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Day': ['Monday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday', 'Friday', 'Friday'],
'Experimenter': ['Alpha', 'Charlie', 'Echo', 'Golf', 'India', 'Foxtrot', 'Delta', 'Bravo', 'Bravo', 'Golf', 'India'],
'Subject': ['Bravo', 'Delta', 'Foxtrot', 'Hotel', 'Juliet', 'Charlie', 'Hotel', 'Alpha', 'Delta', 'Alpha', 'Echo'],
'D': [4, 0, 1, 1, 2, 2, 2, 1, 1, 0, 1],
'E': [2, 2, 2, 2, 1, 0, 3, 1, 1, 0, 0],
'F': [2.68, 0.62, 1.03, 0.75, 0.71, 0.48, 2.06, 0.53, 1.65, 0.19, 1.31],
'G': [0.44, 2.29, 3.14, 2.53, 1.60, 0.61, 1.93, 0.41, 0.84, 1.30, 0.58]})
adjacency_matrix = pd.crosstab(df['Experimenter'], df['Subject'], values=df['F'], aggfunc=np.mean)
adjacency_matrix = adjacency_matrix.fillna(0)
print('')
print(adjacency_matrix)
Actual Output:
Subject Alpha Bravo Charlie Delta Echo Foxtrot Hotel Juliet
Experimenter
Alpha 0.00 2.68 0.00 0.00 0.00 0.00 0.00 0.00
Bravo 0.53 0.00 0.00 1.65 0.00 0.00 0.00 0.00
Charlie 0.00 0.00 0.00 0.62 0.00 0.00 0.00 0.00
Delta 0.00 0.00 0.00 0.00 0.00 0.00 2.06 0.00
Echo 0.00 0.00 0.00 0.00 0.00 1.03 0.00 0.00
Foxtrot 0.00 0.00 0.48 0.00 0.00 0.00 0.00 0.00
Golf 0.19 0.00 0.00 0.00 0.00 0.00 0.75 0.00
India 0.00 0.00 0.00 0.00 1.31 0.00 0.00 0.71
which is correct as far as it goes, but it only includes column 'F', not both 'F' and 'G' as required.
Please advise?
The following code appears to generate the correct output (not very idiomatic, but functional):
ct_a = pd.crosstab(df['Experimenter'], df['Subject'], values=df['F'], aggfunc=np.mean).fillna(0)
np.fill_diagonal(ct_a.values, 1)  # 1.00s on the main diagonal
print('')
print(ct_a.head(23))
ct_b = pd.crosstab(df['Subject'], df['Experimenter'], values=df['G'], aggfunc=np.mean).fillna(0)
np.fill_diagonal(ct_b.values, 1)
print('')
print(ct_b.head(23))
a_m = (ct_a + ct_b).fillna(0)  # label-aligned sum of the F and G matrices
np.fill_diagonal(a_m.values, 1)
print('')
print(a_m.head(23))
However, I am still struggling to generate the 'Eigenvector Centrality' measure from the generated matrix (a_m) - any help would be very welcome!
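One way to get eigenvector centrality from the adjacency matrix is to hand a_m to networkx. This is only a sketch: it assumes the networkx package is available, and treating a_m as a weighted directed graph is an assumption, not something specified above.
import networkx as nx

a_m.to_csv('Output.csv')  # write the final table to CSV, as originally required

# Build a weighted digraph from the adjacency DataFrame and score each node.
G = nx.from_pandas_adjacency(a_m, create_using=nx.DiGraph)
centrality = nx.eigenvector_centrality_numpy(G, weight='weight')
print(pd.Series(centrality).sort_values(ascending=False))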

Printing specific columns as a percentage

I have a multi-index dataframe and I want to convert two of its columns' values into percentages.
Capacity\nMWh Day-Ahead\nMWh Intraday\nMWh UEVM\nMWh ... Cost Per. MW\n(with Imp.)\n$/MWh Cost Per. MW\n(w/o Imp.)\n$/MWh Intraday\nMape Day-Ahead\nMape
Power Plants Date ...
powerplant1 2020 January 3.6 446.40 492.70 482.50 ... 0.05 0.32 0.04 0.10
2020 February 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
2020 March 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
2020 April 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
I used apply('{:.0%}'.format):
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].apply('{:.0%}'.format)
But I got this error:
TypeError: ('unsupported format string passed to Series.__format__', 'occurred at index Intraday\nMape')
How can I solve that?
The error occurs because apply passes each whole column to the format string as a Series, which str.format cannot handle; you need an elementwise operation. Use DataFrame.applymap:
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].applymap('{:.0%}'.format)
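Note that in pandas 2.1 and later DataFrame.applymap is deprecated and renamed to DataFrame.map, so on a recent version the equivalent is:
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].map('{:.0%}'.format)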

Awk: Removing duplicate lines without sorting after matching conditions

I've got a list of devices from which I need to remove duplicates (keeping only the first occurrence) while preserving order and matching a condition. In this case I'm looking for a specific string and then printing the field with the device name. Here is some example raw data from the sar utility:
10:02:01 AM sdc 0.70 0.00 8.13 11.62 0.00 1.29 0.86 0.06
10:02:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:02:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 1.31 3.73 99.44 78.46 0.02 17.92 0.92 0.12
Average: sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:05:01 AM sdc 2.70 0.00 39.92 14.79 0.02 5.95 0.31 0.08
10:05:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:05:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:01 AM sdc 0.83 0.00 10.00 12.00 0.00 0.78 0.56 0.05
11:04:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:04:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 0.70 2.55 8.62 15.91 0.00 1.31 0.78 0.05
Average: sda 0.12 0.95 0.00 7.99 0.00 0.60 0.60 0.01
Average: sdb 0.22 1.78 0.00 8.31 0.00 0.54 0.52 0.01
The following gives me the list of devices from lines containing the word "Average", but it does not preserve the input order (here it happens to come out sorted):
sar -dp | awk '/Average/ {devices[$2]} END {for (device in devices) {print device}}'
sda
sdb
sdc
The following gives me exactly what I want:
sar -dp | awk '/Average/ {print $2}' | awk '!devices[$0]++'
sdc
sda
sdb
Maybe I'm missing something painfully obvious, but I can't figure out how to do the same with a single awk command, that is, without piping the output of the first awk into the second.
You can do:
sar -dp | awk '/Average/ && !devices[$2]++ {print $2}'
sdc
sda
sdb
The problem is the for (device in devices) part: POSIX leaves the iteration order of for (index in array) unspecified, so the elements come out in what looks like a random (or sorted) order rather than the insertion order.
awk '/Average/ && !devices[$2]++ {print $2}' sar.in
You just need to combine the two tests. The only caveat is that the entire line ($0) seen by your second awk is field two of the original input, so $0 becomes $2.

Deleting entire columns from a text file using the CUT command or an AWK program

I have a text file in the form below. Could someone help me with how to delete columns 2, 3, 4, 5, 6 and 7? I want to keep only columns 1, 8 and 9.
37.55 6.00 24.98 0.00 -2.80 -3.90 26.675 './gold_soln_CB_FragLib_Controls_m1_9.mol2' 'ethyl'
38.45 1.39 27.36 0.00 -0.56 -2.48 22.724 './gold_soln_CB_FragLib_Controls_m2_6.mol2' 'pyridin-2-yl(pyridin-3-yl)methanone'
38.47 0.00 28.44 0.00 -0.64 -2.42 20.387 './gold_soln_CB_FragLib_Controls_m3_3.mol2' 'pyridin-2-yl(pyridin-4-yl)methanone'
42.49 0.07 30.87 0.00 -0.03 -3.24 22.903 './gold_soln_CB_FragLib_Controls_m4_5.mol2' '(3-chlorophenyl)(pyridin-3-yl)methanone'
38.20 1.47 27.53 0.00 -1.13 -3.28 22.858 './gold_soln_CB_FragLib_Controls_m5_2.mol2' 'dipyridin-4-ylmethanone'
41.87 0.57 30.53 0.00 -0.67 -3.16 22.829 './gold_soln_CB_FragLib_Controls_m6_9.mol2' '(3-chlorophenyl)(pyridin-4-yl)methanone'
38.18 1.49 27.09 0.00 -0.56 -1.63 7.782 './gold_soln_CB_FragLib_Controls_m7_1.mol2' '3-hydrazino-6-phenylpyridazine'
39.45 1.50 27.71 0.00 -0.15 -4.17 17.130 './gold_soln_CB_FragLib_Controls_m8_6.mol2' '3-hydrazino-6-phenylpyridazine'
41.54 4.10 27.71 0.00 -0.65 -4.44 9.702 './gold_soln_CB_FragLib_Controls_m9_4.mol2' '3-hydrazino-6-phenylpyridazine'
41.05 1.08 29.30 0.00 -0.31 -2.44 28.590 './gold_soln_CB_FragLib_Controls_m10_3.mol2' '3-hydrazino-6-(4-methylphenyl)pyridazine'
Try:
awk '{print $1"\t"$8"\t"$9}' yourfile.tsv > only189.tsv
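If the columns are separated by single spaces, as in the sample, cut works too; this is a sketch, and the filenames are placeholders:
cut -d' ' -f1,8,9 yourfile.txt > only189.txt
If the columns are padded with runs of spaces, squeeze them first:
tr -s ' ' < yourfile.txt | cut -d' ' -f1,8,9 > only189.txt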

SQL: Calculating Turn-Around-Time with Overlapping Consideration

I have a table (parts) where I store when an item was requested and when it was issued. With this, I can easily compute each item's turn-around time ("TAT"). What I'd like to do is add another column ("Computed") in which any overlapping request-to-issue date ranges are properly accounted for.
RecID Requested Issued TAT Computed
MD0001 11/28/2012 12/04/2012 6.00 0.00
MD0002 11/28/2012 11/28/2012 0.00 0.00
MD0003 11/28/2012 12/04/2012 6.00 0.00
MD0004 11/28/2012 11/28/2012 0.00 0.00
MD0005 11/28/2012 12/10/2012 12.00 0.00
MD0006 11/28/2012 01/21/2013 54.00 54.00
MD0007 11/28/2012 11/28/2012 0.00 0.00
MD0008 11/28/2012 12/04/2012 6.00 0.00
MD0009 01/29/2013 01/30/2013 1.00 1.00
MD0010 01/29/2013 01/30/2013 1.00 0.00
MD0011 02/05/2013 02/06/2013 1.00 1.00
MD0012 02/07/2013 03/04/2013 25.00 25.00
MD0013 03/07/2013 03/14/2013 7.00 7.00
MD0014 03/07/2013 03/08/2013 1.00 0.00
MD0015 03/13/2013 03/25/2013 12.00 11.00
MD0016 03/20/2013 03/21/2013 1.00 0.00
Totals 133.00 99.00 <- waiting for parts TAT summary
In the above, I filled in the "Computed" column manually so that there is an example of what I'm trying to accomplish.
NOTE: Notice how MD0013 affects the computed time for MD0015, because MD0013 was "computed" first. It could equally have been MD0015 computed first, with MD0013 affected accordingly; either way the net result is one day less.
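Computing the "Computed" total is the classic gaps-and-islands problem: merge overlapping request-to-issue ranges, then sum the lengths of the merged ranges (54 + 1 + 1 + 25 + 18 = 99 for the data above). Here is a sketch of that approach, assuming SQL Server-style syntax (window functions and DATEDIFF) and the table/column names from the post; it produces the overlap-corrected total rather than the per-row column:
WITH ordered AS (
    SELECT Requested, Issued,
           MAX(Issued) OVER (ORDER BY Requested, Issued
                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_end
    FROM parts
),
islands AS (
    -- start a new island whenever a row does not overlap anything before it
    SELECT Requested, Issued,
           SUM(CASE WHEN prev_end >= Requested THEN 0 ELSE 1 END)
               OVER (ORDER BY Requested, Issued ROWS UNBOUNDED PRECEDING) AS island_id
    FROM ordered
),
merged AS (
    SELECT MIN(Requested) AS island_start, MAX(Issued) AS island_end
    FROM islands
    GROUP BY island_id
)
SELECT SUM(DATEDIFF(day, island_start, island_end)) AS computed_total  -- 99.00 here
FROM merged;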