I have a set of experimental results (anonymised subset below) read from a CSV file ('Input.csv') into a dataframe. I want to output a table built from the columns 'Experimenter', 'Subject', 'F', and 'G' in an adjacency-matrix-like format. Where a pair appears in reciprocal roles as 'Experimenter' and 'Subject' (for example, 'Alpha' and 'Bravo'), the multiple entries should be aggregated by averaging. In addition, the main diagonal should contain 1.00. Finally, the output table should be written to a CSV file ('Output.csv').
Actual Input:
Day,Experimenter,Subject,D,E,F,G
Monday,Alpha,Bravo,4,2,2.68,0.44
Monday,Charlie,Delta,0,2,0.62,2.29
Monday,Echo,Foxtrot,1,2,1.03,3.14
Monday,Golf,Hotel,1,2,0.75,2.53
Tuesday,India,Juliet,2,1,0.71,1.60
Wednesday,Foxtrot,Charlie,2,0,0.48,0.61
Thursday,Delta,Hotel,2,3,2.06,1.93
Thursday,Bravo,Alpha,1,1,0.53,0.41
Friday,Bravo,Delta,1,1,1.65,0.84
Friday,Golf,Alpha,0,0,0.19,1.30
Friday,India,Echo,1,0,1.31,0.58
Expected Output:
Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet
Alpha 1.00 1.39 0.00 0.00 0.00 0.00 1.30 0.00 0.00 0.00
Bravo 0.485 1.00 0.00 1.65 0.00 0.00 0.00 0.00 0.00 0.00
Charlie 0.00 0.00 1.00 0.62 0.00 0.61 0.00 0.00 0.00 0.00
Delta 0.00 0.84 2.29 1.00 0.00 0.00 0.00 2.06 0.00 0.00
Echo 0.00 0.00 0.00 0.00 1.00 1.03 0.00 0.00 0.58 0.00
Foxtrot 0.00 0.00 0.48 0.00 3.14 1.00 0.00 0.00 0.00 0.00
Golf 0.19 0.00 0.00 0.00 0.00 0.00 1.00 0.75 0.00 0.00
Hotel 0.00 0.00 0.00 1.93 0.00 0.00 2.53 1.00 0.00 0.00
India 0.00 0.00 0.00 0.00 1.31 0.00 0.00 0.00 1.00 0.71
Juliet 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.60 1.00
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Day': ['Monday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday', 'Friday', 'Friday'],
'Experimenter': ['Alpha', 'Charlie', 'Echo', 'Golf', 'India', 'Foxtrot', 'Delta', 'Bravo', 'Bravo', 'Golf', 'India'],
'Subject': ['Bravo', 'Delta', 'Foxtrot', 'Hotel', 'Juliet', 'Charlie', 'Hotel', 'Alpha', 'Delta', 'Alpha', 'Echo'],
'D': [4, 0, 1, 1, 2, 2, 2, 1, 1, 0, 1],
'E': [2, 2, 2, 2, 1, 0, 3, 1, 1, 0, 0],
'F': [2.68, 0.62, 1.03, 0.75, 0.71, 0.48, 2.06, 0.53, 1.65, 0.19, 1.31],
'G': [0.44, 2.29, 3.14, 2.53, 1.60, 0.61, 1.93, 0.41, 0.84, 1.30, 0.58]})
# Pass the aggregation by name; recent pandas versions warn when given np.mean here
adjacency_matrix = pd.crosstab(df['Experimenter'], df['Subject'], values=df['F'], aggfunc='mean')
adjacency_matrix = adjacency_matrix.fillna(0)
print('')
print(adjacency_matrix)
Actual Output:
Subject Alpha Bravo Charlie Delta Echo Foxtrot Hotel Juliet
Experimenter
Alpha 0.00 2.68 0.00 0.00 0.00 0.00 0.00 0.00
Bravo 0.53 0.00 0.00 1.65 0.00 0.00 0.00 0.00
Charlie 0.00 0.00 0.00 0.62 0.00 0.00 0.00 0.00
Delta 0.00 0.00 0.00 0.00 0.00 0.00 2.06 0.00
Echo 0.00 0.00 0.00 0.00 0.00 1.03 0.00 0.00
Foxtrot 0.00 0.00 0.48 0.00 0.00 0.00 0.00 0.00
Golf 0.19 0.00 0.00 0.00 0.00 0.00 0.75 0.00
India 0.00 0.00 0.00 0.00 1.31 0.00 0.00 0.71
This is correct as far as it goes, but it only includes column 'F', not both 'F' and 'G' as required.
Please advise.
The following code appears to generate the correct output (not very idiomatic, but functional):
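# Union of all names, so that both tables share the same square row/column labels
names = sorted(set(df['Experimenter']) | set(df['Subject']))
# F is stored at [Experimenter, Subject]
ct_a = pd.crosstab(df['Experimenter'], df['Subject'], values=df['F'], aggfunc='mean').reindex(index=names, columns=names)
print('')
print(ct_a.head(23))
# G is stored at the mirrored cell [Subject, Experimenter]
ct_b = pd.crosstab(df['Subject'], df['Experimenter'], values=df['G'], aggfunc='mean').reindex(index=names, columns=names)
print('')
print(ct_b.head(23))
# Average F and G where both exist (NaN cells are skipped), then zero-fill the empty cells
a_m = pd.concat([ct_a, ct_b]).groupby(level=0).mean().fillna(0)
# Set the main diagonal by label rather than by position
for name in names:
    a_m.loc[name, name] = 1
print('')
print(a_m.head(23))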
However, I am still struggling to generate the 'Eigenvector Centrality' measure from the generated matrix (a_m) - any help would be very welcome!
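One possible way to get an eigenvector centrality from such a matrix is to take the eigenvector that belongs to the largest eigenvalue and normalise it. The sketch below is only an illustration under stated assumptions: it rebuilds the matrix from the sample data (F at [Experimenter, Subject], G at the mirrored cell, averaged where both land in the same cell), uses the right eigenvector (A x = lambda x), and scales the scores to sum to 1. The names f_part, g_part, long_form, A and centrality are introduced here for illustration. Some libraries use the left eigenvector instead, in which case A.T would be passed to np.linalg.eig.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Experimenter': ['Alpha', 'Charlie', 'Echo', 'Golf', 'India', 'Foxtrot',
                     'Delta', 'Bravo', 'Bravo', 'Golf', 'India'],
    'Subject': ['Bravo', 'Delta', 'Foxtrot', 'Hotel', 'Juliet', 'Charlie',
                'Hotel', 'Alpha', 'Delta', 'Alpha', 'Echo'],
    'F': [2.68, 0.62, 1.03, 0.75, 0.71, 0.48, 2.06, 0.53, 1.65, 0.19, 1.31],
    'G': [0.44, 2.29, 3.14, 2.53, 1.60, 0.61, 1.93, 0.41, 0.84, 1.30, 0.58],
})

# Long format: F goes to cell [Experimenter, Subject],
# G goes to the mirrored cell [Subject, Experimenter].
f_part = df.rename(columns={'Experimenter': 'row', 'Subject': 'col', 'F': 'value'})
g_part = df.rename(columns={'Subject': 'row', 'Experimenter': 'col', 'G': 'value'})
long_form = pd.concat([f_part[['row', 'col', 'value']],
                       g_part[['row', 'col', 'value']]], ignore_index=True)

# Square matrix over all names; values sharing a cell are averaged.
names = sorted(set(long_form['row']) | set(long_form['col']))
a_m = (long_form.pivot_table(index='row', columns='col', values='value', aggfunc='mean')
                .reindex(index=names, columns=names)
                .fillna(0.0))
A = a_m.to_numpy(copy=True)
np.fill_diagonal(A, 1.0)  # 1.00 along the main diagonal

# Eigenvector of the eigenvalue with the largest real part (assumed convention: A x = lambda x).
eigenvalues, eigenvectors = np.linalg.eig(A)
k = int(np.argmax(eigenvalues.real))
v = np.abs(eigenvectors[:, k].real)  # remove the arbitrary sign
centrality = pd.Series(v / v.sum(), index=names).sort_values(ascending=False)
print(centrality.round(3))
```

If the matrix itself is also needed on disk, pd.DataFrame(A, index=names, columns=names).to_csv('Output.csv') would write it out.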
I have a multi-index dataframe and I want to convert the values of two columns into percentages.
Capacity\nMWh Day-Ahead\nMWh Intraday\nMWh UEVM\nMWh ... Cost Per. MW\n(with Imp.)\n$/MWh Cost Per. MW\n(w/o Imp.)\n$/MWh Intraday\nMape Day-Ahead\nMape
Power Plants Date ...
powerplant1 2020 January 3.6 446.40 492.70 482.50 ... 0.05 0.32 0.04 0.10
2020 February 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
2020 March 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
2020 April 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00
I used apply('{:.0%}'.format):
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].apply('{:.0%}'.format)
But I got this error:
TypeError: ('unsupported format string passed to Series.__format__', 'occurred at index Intraday\nMape')
How can I solve that?
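DataFrame.apply passes each whole column (a Series) to the function, and a Series does not support the '.0%' format spec, hence the error. Use DataFrame.applymap, which applies the function to each element instead (in pandas 2.1 and later, applymap is deprecated in favour of DataFrame.map):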
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].applymap('{:.0%}'.format)
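As a small self-contained check of the element-wise idea, the sketch below builds a two-row stand-in for nested_df (the sample values are taken from the question, the rest of the frame is omitted) and formats each column with Series.map, which is used here as an assumption-free alternative that behaves the same across pandas versions. The names cols and result are introduced for illustration.

```python
import pandas as pd

# Two-row stand-in for the multi-index frame in the question.
index = pd.MultiIndex.from_tuples(
    [('powerplant1', '2020 January'), ('powerplant1', '2020 February')],
    names=['Power Plants', 'Date'],
)
nested_df = pd.DataFrame(
    {'Intraday\nMape': [0.04, 0.00], 'Day-Ahead\nMape': [0.10, 0.00]},
    index=index,
)

cols = ['Intraday\nMape', 'Day-Ahead\nMape']
result = nested_df.copy()
for col in cols:
    # Series.map calls the formatter once per value, not once per column.
    result[col] = result[col].map('{:.0%}'.format)

print(result)
```

Note that the formatted columns hold strings, so they are suitable for display or export rather than further arithmetic.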
I've got a list of devices which I need to remove duplicates (keep only the first occurrence) while preserving order and matching a condition. In this case I'm looking for a specific string and then printing the field with the device name. Here is some example raw data from the sar application:
10:02:01 AM sdc 0.70 0.00 8.13 11.62 0.00 1.29 0.86 0.06
10:02:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:02:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 1.31 3.73 99.44 78.46 0.02 17.92 0.92 0.12
Average: sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:05:01 AM sdc 2.70 0.00 39.92 14.79 0.02 5.95 0.31 0.08
10:05:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:05:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:01 AM sdc 0.83 0.00 10.00 12.00 0.00 0.78 0.56 0.05
11:04:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:04:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 0.70 2.55 8.62 15.91 0.00 1.31 0.78 0.05
Average: sda 0.12 0.95 0.00 7.99 0.00 0.60 0.60 0.01
Average: sdb 0.22 1.78 0.00 8.31 0.00 0.54 0.52 0.01
The following gives me the list of devices from lines containing the word "Average", but the output comes back sorted rather than in the original order:
sar -dp | awk '/Average/ {devices[$2]} END {for (device in devices) {print device}}'
sda
sdb
sdc
The following gives me exactly what I want (command from here):
sar -dp | awk '/Average/ {print $2}' | awk '!devices[$0]++'
sdc
sda
sdb
Maybe I'm missing something painfully obvious but I can't figure out how to do the same in one awk command, that is without piping the output of the first awk into the second awk.
You can do:
sar -dp | awk '/Average/ && !devices[$2]++ {print $2}'
sdc
sda
sdb
The problem is the for (device in devices) loop. awk does not guarantee any particular order when iterating over an associative array with for-in; the order is implementation-dependent, so the order in which the devices first appeared is lost.
awk '/Average/ && !devices[$2]++ {print $2}' sar.in
You just need to combine the two tests into one pattern. The only caveat is that in the original pipeline the second awk only sees the device name, so its whole line ($0) is field two of the original input; in the combined command you need to replace $0 with $2.
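For readers less familiar with the !devices[$2]++ idiom, the same logic can be sketched in Python: keep a set of names already seen and emit a name only the first time it appears on a matching line. This is an illustration only; the function name first_seen_devices and the shortened sample text are made up for the example.

```python
sample = """\
10:02:01 AM sdc 0.70 0.00 8.13 11.62 0.00 1.29 0.86 0.06
10:02:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 1.31 3.73 99.44 78.46 0.02 17.92 0.92 0.12
Average: sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 0.70 2.55 8.62 15.91 0.00 1.31 0.78 0.05
Average: sda 0.12 0.95 0.00 7.99 0.00 0.60 0.60 0.01
"""


def first_seen_devices(lines):
    """Return field two of each 'Average' line, first occurrence only, in input order."""
    seen = set()
    ordered = []
    for line in lines:
        fields = line.split()
        # Same two tests as the awk pattern: the line matches, and the device is new.
        if 'Average' in line and len(fields) > 1 and fields[1] not in seen:
            seen.add(fields[1])
            ordered.append(fields[1])
    return ordered


devices = first_seen_devices(sample.splitlines())
print('\n'.join(devices))
```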
From the table below, I want to add two values from the same record and one value from a different record of the same card, for example (extraamt from the first record) + (trnamt from the second record):
5140560000001183 1016.00 0.00 2014-05-23 R 0.00 1017 13
5140560000001183 1016.00 0.00 2014-05-24 N 30.00 1017 0
carno emi recamt lastrecdate status penamt trnamt extraamt
5140560000001183 1016.00 0.00 2014-05-23 R 0.00 1017 13
5140560000001191 880.00 0.00 2014-05-23 R 0.00 880 0
5140560000001142 934.00 0.00 2014-05-23 P 0.00 500 0
5140560000001209 963.00 0.00 2014-05-23 P 0.00 600 0
5140560000001175 1024.00 0.00 2014-05-23 N 0.00 0 0
5140560000001167 1117.00 0.00 2014-05-23 N 0.00 0 0
5140560000001159 834.00 0.00 2014-05-23 N 0.00 0 0
5140560000001183 1016.00 0.00 2014-05-24 N 30.00 1017 0
5140560000001191 880.00 0.00 2014-05-24 N 0.00 880 0
5140560000001142 934.00 0.00 2014-05-24 N 0.00 500 0
5140560000001209 963.00 0.00 2014-05-24 N 0.00 600 0
5140560000001175 1024.00 0.00 2014-05-24 N 0.00 0 0
5140560000001167 1117.00 0.00 2014-05-24 N 0.00 0 0
5140560000001159 834.00 0.00 2014-05-24 N 0.00 0 0
I have used the query below, but it does not give the required result:
Select
Case WHEN ( lastrecdate=( cast (GETDATE() as DATE))and CardNo=CardNo and Status in('N','P') ) then trnammt else 0 end +
Case WHEN ( lastrecdate=( cast (GETDATE() as DATE))and CardNo=CardNo and Status in('N','P')) then pendamt else 0 end +
Case WHEN (lastrecdate= (select MAX(lastrecdate ) from Tbl_Emi WHERE Status ='R' and CardNo=CardNo) ) then extraamt else 0 end as totalamount
from Tbl_Emi where CardNo=CardNo
Search for CrossTab (pivot) queries; you can achieve this task with one.
A CrossTab query is useful for generating reports and working with aggregate values. Excel and MS Access provide a convenient user interface for pivot tables. It is a way to turn rows into columns, and it is most often used to produce matrix-style reports.
Look at this blog.
-- tt holds the earliest and latest lastrecdate per card;
-- t1 is the earliest record (extraamt), t2 is the latest record (trnamt)
select tt.carno, t1.extraamt + t2.trnamt total
from (select t.carno,
             MIN(t.lastrecdate) first,
             MAX(t.lastrecdate) second
      from dbo.[Table] t
      group by t.carno) tt
inner join dbo.[Table] t1
    on t1.carno = tt.carno and t1.lastrecdate = tt.first
inner join dbo.[Table] t2
    on t2.carno = tt.carno and t2.lastrecdate = tt.second
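To try the join above without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module. It is an adaptation, not the original query verbatim: the table name emi and the aliases first_date and last_date are placeholders chosen for the example, only three cards and four columns from the question's data are loaded, and the dates are stored as ISO strings so that MIN and MAX order them correctly.

```python
import sqlite3

rows = [
    # (carno, lastrecdate, trnamt, extraamt), a subset of the table in the question
    ('5140560000001183', '2014-05-23', 1017, 13),
    ('5140560000001191', '2014-05-23', 880, 0),
    ('5140560000001142', '2014-05-23', 500, 0),
    ('5140560000001183', '2014-05-24', 1017, 0),
    ('5140560000001191', '2014-05-24', 880, 0),
    ('5140560000001142', '2014-05-24', 500, 0),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE emi (carno TEXT, lastrecdate TEXT, trnamt INTEGER, extraamt INTEGER)')
conn.executemany('INSERT INTO emi VALUES (?, ?, ?, ?)', rows)

query = """
SELECT tt.carno, t1.extraamt + t2.trnamt AS total
FROM (SELECT carno,
             MIN(lastrecdate) AS first_date,
             MAX(lastrecdate) AS last_date
      FROM emi
      GROUP BY carno) AS tt
JOIN emi AS t1 ON t1.carno = tt.carno AND t1.lastrecdate = tt.first_date
JOIN emi AS t2 ON t2.carno = tt.carno AND t2.lastrecdate = tt.last_date
ORDER BY tt.carno
"""
# extraamt comes from the earliest record, trnamt from the latest one.
totals = dict(conn.execute(query).fetchall())
conn.close()

for carno, total in totals.items():
    print(carno, total)
```

For card 5140560000001183 this adds extraamt 13 from the 2014-05-23 record to trnamt 1017 from the 2014-05-24 record.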