Pandas: Generate Experiment Results in an Adjacency-Like Matrix Table
I have a set of experimental results (anonymised subset below) in a dataframe read from a CSV file ('Input.csv'). I want to output a table built from the columns 'Experimenter', 'Subject', 'F', and 'G' in an adjacency-matrix-like format. Where a pair appears in reciprocal roles as 'Experimenter' and 'Subject' (for example, 'Alpha' and 'Bravo'), the entries should be aggregated by averaging. In addition, there should be '1.00's along the main diagonal. Finally, the resulting table should be written to a CSV file ('Output.csv').
Actual Input:
Day,Experimenter,Subject,D,E,F,G
Monday,Alpha,Bravo,4,2,2.68,0.44
Monday,Charlie,Delta,0,2,0.62,2.29
Monday,Echo,Foxtrot,1,2,1.03,3.14
Monday,Golf,Hotel,1,2,0.75,2.53
Tuesday,India,Juliet,2,1,0.71,1.60
Wednesday,Foxtrot,Charlie,2,0,0.48,0.61
Thursday,Delta,Hotel,2,3,2.06,1.93
Thursday,Bravo,Alpha,1,1,0.53,0.41
Friday,Bravo,Delta,1,1,1.65,0.84
Friday,Golf,Alpha,0,0,0.19,1.30
Friday,India,Echo,1,0,1.31,0.58
Expected Output:
Alpha Bravo Charlie Delta Echo Foxtrot Golf Hotel India Juliet
Alpha 1.00 1.39 0.00 0.00 0.00 0.00 1.30 0.00 0.00 0.00
Bravo 0.485 1.00 0.00 1.65 0.00 0.00 0.00 0.00 0.00 0.00
Charlie 0.00 0.00 1.00 0.62 0.00 0.61 0.00 0.00 0.00 0.00
Delta 0.00 0.84 2.29 1.00 0.00 0.00 0.00 2.06 0.00 0.00
Echo 0.00 0.00 0.00 0.00 1.00 1.03 0.00 0.00 0.58 0.00
Foxtrot 0.00 0.00 0.48 0.00 3.14 1.00 0.00 0.00 0.00 0.00
Golf 0.19 0.00 0.00 0.00 0.00 0.00 1.00 0.75 0.00 0.00
Hotel 0.00 0.00 0.00 1.93 0.00 0.00 2.53 1.00 0.00 0.00
India 0.00 0.00 0.00 0.00 1.31 0.00 0.00 0.00 1.00 0.71
Juliet 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.60 1.00
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Day': ['Monday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday', 'Friday', 'Friday'],
'Experimenter': ['Alpha', 'Charlie', 'Echo', 'Golf', 'India', 'Foxtrot', 'Delta', 'Bravo', 'Bravo', 'Golf', 'India'],
'Subject': ['Bravo', 'Delta', 'Foxtrot', 'Hotel', 'Juliet', 'Charlie', 'Hotel', 'Alpha', 'Delta', 'Alpha', 'Echo'],
'D': [4, 0, 1, 1, 2, 2, 2, 1, 1, 0, 1],
'E': [2, 2, 2, 2, 1, 0, 3, 1, 1, 0, 0],
'F': [2.68, 0.62, 1.03, 0.75, 0.71, 0.48, 2.06, 0.53, 1.65, 0.19, 1.31],
'G': [0.44, 2.29, 3.14, 2.53, 1.60, 0.61, 1.93, 0.41, 0.84, 1.30, 0.58]})
adjacency_matrix = pd.crosstab(df['Experimenter'], df['Subject'],
                               values=df['F'], aggfunc='mean')  # 'mean' avoids the np.mean deprecation warning in newer pandas
adjacency_matrix = adjacency_matrix.fillna(0)
print(adjacency_matrix)
Actual Output:
Subject Alpha Bravo Charlie Delta Echo Foxtrot Hotel Juliet
Experimenter
Alpha 0.00 2.68 0.00 0.00 0.00 0.00 0.00 0.00
Bravo 0.53 0.00 0.00 1.65 0.00 0.00 0.00 0.00
Charlie 0.00 0.00 0.00 0.62 0.00 0.00 0.00 0.00
Delta 0.00 0.00 0.00 0.00 0.00 0.00 2.06 0.00
Echo 0.00 0.00 0.00 0.00 0.00 1.03 0.00 0.00
Foxtrot 0.00 0.00 0.48 0.00 0.00 0.00 0.00 0.00
Golf 0.19 0.00 0.00 0.00 0.00 0.00 0.75 0.00
India 0.00 0.00 0.00 0.00 1.31 0.00 0.00 0.71
which is correct as far as it goes, but it only includes column 'F', not both 'F' and 'G' as required.
Please advise?
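For what it's worth, one possible sketch (assumptions: averaging each pair's 'F' with the reciprocal pair's 'G' is the intended aggregation, and the 'forward'/'reverse'/'value' names are purely illustrative) is to reshape the data to long form so every observation is an (Experimenter, Subject, value) record, with the 'G' records role-swapped, and then build a single crosstab:

```python
import numpy as np
import pandas as pd

# The DataFrame literal from the question (in practice: pd.read_csv('Input.csv'))
df = pd.DataFrame({'Day': ['Monday', 'Monday', 'Monday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday', 'Friday', 'Friday'],
                   'Experimenter': ['Alpha', 'Charlie', 'Echo', 'Golf', 'India', 'Foxtrot', 'Delta', 'Bravo', 'Bravo', 'Golf', 'India'],
                   'Subject': ['Bravo', 'Delta', 'Foxtrot', 'Hotel', 'Juliet', 'Charlie', 'Hotel', 'Alpha', 'Delta', 'Alpha', 'Echo'],
                   'D': [4, 0, 1, 1, 2, 2, 2, 1, 1, 0, 1],
                   'E': [2, 2, 2, 2, 1, 0, 3, 1, 1, 0, 0],
                   'F': [2.68, 0.62, 1.03, 0.75, 0.71, 0.48, 2.06, 0.53, 1.65, 0.19, 1.31],
                   'G': [0.44, 2.29, 3.14, 2.53, 1.60, 0.61, 1.93, 0.41, 0.84, 1.30, 0.58]})

# 'F' is measured in the Experimenter -> Subject direction...
forward = df[['Experimenter', 'Subject', 'F']].rename(columns={'F': 'value'})
# ...and 'G' in the Subject -> Experimenter direction, so swap the roles
reverse = df[['Experimenter', 'Subject', 'G']].rename(
    columns={'Experimenter': 'Subject', 'Subject': 'Experimenter', 'G': 'value'})
both = pd.concat([forward, reverse], ignore_index=True)

# One crosstab over the combined records; reciprocal entries are averaged
a_m = pd.crosstab(both['Experimenter'], both['Subject'],
                  values=both['value'], aggfunc='mean').fillna(0)

# Rows and columns now carry the same sorted name list, so the matrix is
# square and the positional diagonal is the Name x Name diagonal
vals = a_m.to_numpy()
np.fill_diagonal(vals, 1.0)
a_m = pd.DataFrame(vals, index=a_m.index, columns=a_m.columns)

a_m.to_csv('Output.csv')
print(a_m)
```

With this data, for instance, Bravo -> Alpha comes out as (0.53 + 0.44) / 2 = 0.485, matching the expected output above.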
The following code appears to generate the correct output (not very idiomatic, but functional). One pitfall worth noting: ct_a and ct_b have different row and column labels (e.g. 'Golf' never appears as a 'Subject'), so writing 1s onto their positional diagonals before the addition would clobber real values such as Golf->Hotel's 0.75. The diagonal is only meaningful after the addition, which aligns the two frames on the union of labels and therefore produces a square matrix:
ct_a = pd.crosstab(df['Experimenter'], df['Subject'], values=df['F'], aggfunc='mean').fillna(0)
print()
print(ct_a)
ct_b = pd.crosstab(df['Subject'], df['Experimenter'], values=df['G'], aggfunc='mean').fillna(0)
print()
print(ct_b)
# Alignment expands the result to the union of names; cells present in only
# one frame are treated as 0 via fill_value. Note this sums (rather than
# averages) cells where a pair appears in both frames.
a_m = ct_a.add(ct_b, fill_value=0)
np.fill_diagonal(a_m.values, 1)
print()
print(a_m)
a_m.to_csv('Output.csv')
However, I am still struggling to generate the 'Eigenvector Centrality' measure from the generated matrix (a_m) - any help would be very welcome!
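On the eigenvector-centrality question: since a_m is just a weighted adjacency matrix, one sketch using plain NumPy (networkx's eigenvector_centrality functions are an alternative if you build a graph object from the matrix) is to take the eigenvector associated with the largest eigenvalue:

```python
import numpy as np
import pandas as pd

def eigenvector_centrality(a_m: pd.DataFrame) -> pd.Series:
    """Principal eigenvector of a square adjacency DataFrame, normalised to sum to 1."""
    A = a_m.to_numpy(dtype=float)
    eigvals, eigvecs = np.linalg.eig(A)
    # eigenvector belonging to the largest (real part of the) eigenvalue
    v = eigvecs[:, np.argmax(eigvals.real)].real
    v = np.abs(v)  # the principal eigenvector can come out sign-flipped
    return pd.Series(v / v.sum(), index=a_m.index)
```

Then eigenvector_centrality(a_m) gives one score per name. Note that a_m is not symmetric, so left and right eigenvectors differ; this sketch uses the right eigenvector, so check which convention your centrality definition needs.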
Related
Why does my host OS experience high system cpu usage on cores performing networking when using SR-IOV?
I am trying to determine why my KVM host shows high system CPU usage for a specific guest. I have set up a KVM host (Ubuntu 20.04) running a guest VM (Ubuntu 20.04). I configured the guest to use cores (via vcpu / vcpupin / emulatorpin) from the processor in socket 1, memory (via numatune) connected to channels on the processor in socket 1, and the NIC attached (via interface / address source) to the channels on the processor in socket 1, to ensure the VM runs in the most optimal state. Even though I have gone through the process of configuring IOMMU, SR-IOV, and Intel VFs to pass Virtual Function PCI devices for the NICs directly into the VMs, the host still sees high system CPU usage on the cores where the guest has parked its NIC interrupts. I cannot figure out why the host system is involved, since the PCI device is detached from the host and assigned directly to the guest. This causes huge performance issues. Has anyone experienced this? Is there a configuration error on my end?
Please see the configurations and mpstat data below:

<vcpu placement='static' cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,52,54,56,58,60,62,64,66,68,70,72,74,76'>26</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='2'/>
  <vcpupin vcpu='2' cpuset='4'/>
  <vcpupin vcpu='3' cpuset='6'/>
  <vcpupin vcpu='4' cpuset='8'/>
  <vcpupin vcpu='5' cpuset='10'/>
  <vcpupin vcpu='6' cpuset='12'/>
  <vcpupin vcpu='7' cpuset='14'/>
  <vcpupin vcpu='8' cpuset='16'/>
  <vcpupin vcpu='9' cpuset='18'/>
  <vcpupin vcpu='10' cpuset='20'/>
  <vcpupin vcpu='11' cpuset='22'/>
  <vcpupin vcpu='12' cpuset='24'/>
  <vcpupin vcpu='13' cpuset='52'/>
  <vcpupin vcpu='14' cpuset='54'/>
  <vcpupin vcpu='15' cpuset='56'/>
  <vcpupin vcpu='16' cpuset='58'/>
  <vcpupin vcpu='17' cpuset='60'/>
  <vcpupin vcpu='18' cpuset='62'/>
  <vcpupin vcpu='19' cpuset='64'/>
  <vcpupin vcpu='20' cpuset='66'/>
  <vcpupin vcpu='21' cpuset='68'/>
  <vcpupin vcpu='22' cpuset='70'/>
  <vcpupin vcpu='23' cpuset='72'/>
  <vcpupin vcpu='24' cpuset='74'/>
  <vcpupin vcpu='25' cpuset='76'/>
  <emulatorpin cpuset='0,2,4,6,8,10,12,14,16,18,20,22,24,52,54,56,58,60,62,64,66,68,70,72,74,76'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
<cpu mode='host-passthrough' check='none'/>
<interface type='hostdev' managed='yes'>
  <mac address='52:54:08:06:48:01'/>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x19' slot='0x02' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
</interface>
<interface type='hostdev' managed='yes'>
  <mac address='52:54:08:06:48:01'/>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x19' slot='0x0a' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
</interface>
<interface type='hostdev' managed='yes'>
  <mac address='52:54:08:06:48:02'/>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x19' slot='0x02' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
</interface>
<interface type='hostdev' managed='yes'>
  <mac address='52:54:08:06:48:02'/>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x19' slot='0x0a' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
</interface>

22:26:38  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
22:26:43  all  0.00  0.00   3.39  0.00     0.00  0.09   0.00    7.30    0.00    89.22
22:26:43    0  0.00  0.00   7.30  0.00     0.00  0.00   0.00    26.57   0.00    66.13
22:26:43    1  0.00  0.00   0.00  0.00     0.00  8.43   0.00    0.00    0.00    91.57
22:26:43    2  0.00  0.00  10.38  0.00     0.00  0.00   0.00    29.34   0.00    60.28
22:26:43    4  0.00  0.00   8.69  0.00     0.00  0.00   0.00    29.09   0.00    62.22
22:26:43    6  0.00  0.00   5.00  0.00     0.00  0.00   0.00    23.20   0.00    71.80
22:26:43    8  0.20  0.00  39.31  0.00     0.00  0.00   0.00    35.03   0.00    25.46
22:26:43   10  0.00  0.00   6.96  0.00     0.00  0.00   0.00    31.21   0.00    61.83
22:26:43   12  0.00  0.00  44.90  0.00     0.00  0.00   0.00    35.31   0.00    19.80
22:26:43   14  0.00  0.00  13.18  0.00     0.00  0.00   0.00    32.05   0.00    54.77
22:26:43   16  0.20  0.00   5.31  0.00     0.00  0.00   0.00    26.12   0.00    68.37
22:26:43   18  0.00  0.00  15.95  0.00     0.00  0.20   0.00    30.47   0.00    53.37
22:26:43   20  0.00  0.00   7.04  0.00     0.00  0.00   0.00    26.16   0.00    66.80
22:26:43   22  0.00  0.00  15.76  0.00     0.00  0.00   0.00    31.92   0.00    52.32
22:26:43   24  0.00  0.00   7.35  0.00     0.00  0.00   0.00    27.96   0.00    64.69
22:26:43   52  0.00  0.00  10.46  0.00     0.00  0.00   0.00    29.78   0.00    59.76
22:26:43   54  0.00  0.00   7.35  0.00     0.00  0.00   0.00    26.12   0.00    66.53
22:26:43   56  0.00  0.00   6.91  0.00     0.00  0.00   0.00    29.67   0.00    63.41
22:26:43   58  0.00  0.00   7.82  0.00     0.00  0.00   0.00    29.42   0.00    62.76
22:26:43   60  0.00  0.00  37.17  0.00     0.00  0.00   0.00    35.15   0.00    27.68
22:26:43   62  0.00  0.00  14.99  0.00     0.00  0.00   0.00    28.54   0.00    56.47
22:26:43   64  0.00  0.00  37.50  0.00     0.00  0.00   0.00    33.53   0.00    28.97
22:26:43   66  0.00  0.00   7.30  0.00     0.00  0.00   0.00    29.61   0.00    63.08
22:26:43   68  0.00  0.00   8.08  0.00     0.00  0.00   0.00    27.88   0.00    64.04
22:26:43   70  0.00  0.00   7.09  0.00     0.00  0.00   0.00    24.49   0.00    68.42
22:26:43   72  0.20  0.00   6.94  0.00     0.00  0.00   0.00    29.96   0.00    62.90
22:26:43   74  0.00  0.00   7.63  0.00     0.00  0.00   0.00    27.91   0.00    64.46
22:26:43   76  0.00  0.00   8.63  0.00     0.00  0.00   0.00    29.32   0.00    62.05
22:26:43   98  0.00  0.00   0.20  0.00     0.00  0.00   0.00    0.00    0.00    99.80
22:26:43  100  0.00  0.00   0.20  0.00     0.00  0.00   0.00    0.00    0.00    99.80

Does anyone know what is configured incorrectly, or why %sys is so high on cores 8, 12, 60, and 64?
This ended up being an issue with the default kvm.halt_poll_ns value. Setting /sys/module/kvm/parameters/halt_poll_ns to 0 caused this problem to stop happening.
Printing specific columns as a percentage
I have a multi-index dataframe and I want to convert two columns' values into percentages.

              Capacity\nMWh  Day-Ahead\nMWh  Intraday\nMWh  UEVM\nMWh  ...  Cost Per. MW\n(with Imp.)\n$/MWh  Cost Per. MW\n(w/o Imp.)\n$/MWh  Intraday\nMape  Day-Ahead\nMape
Power Plants  Date                                                     ...
powerplant1   2020 January   3.6    446.40   492.70  482.50  ...  0.05  0.32  0.04  0.10
              2020 February  0.0    0.00     0.00    0.00    ...  0.00  0.00  0.00  0.00
              2020 March     0.0    0.00     0.00    0.00    ...  0.00  0.00  0.00  0.00
              2020 April     0.0    0.00     0.00    0.00    ...  0.00  0.00  0.00  0.00

I used apply('{:.0%}'.format):

nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
    nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].apply('{:.0%}'.format)

But I got this error:

TypeError: ('unsupported format string passed to Series.__format__', 'occurred at index Intraday\nMape')

How can I solve that?
Use DataFrame.applymap, which applies the formatter element-wise (DataFrame.apply passes whole Series objects to the function, which is why the format string fails):

nested_df[['Intraday\nMape', 'Day-Ahead\nMape']] = \
    nested_df[['Intraday\nMape', 'Day-Ahead\nMape']].applymap('{:.0%}'.format)
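A side note: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map. A version-agnostic sketch (the sample values below are made up) maps each column as a Series instead:

```python
import pandas as pd

nested_df = pd.DataFrame({'Intraday\nMape': [0.04, 0.10],
                          'Day-Ahead\nMape': [0.10, 0.25]})

cols = ['Intraday\nMape', 'Day-Ahead\nMape']
# Series.map is element-wise, so the format string is applied per value;
# on pandas >= 2.1 the same thing is spelled nested_df[cols].map('{:.0%}'.format)
nested_df[cols] = nested_df[cols].apply(lambda s: s.map('{:.0%}'.format))
print(nested_df)
```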
Awk: Removing duplicate lines without sorting after matching conditions
I've got a list of devices from which I need to remove duplicates (keeping only the first occurrence) while preserving order and matching a condition. In this case I'm looking for a specific string and then printing the field with the device name. Here is some example raw data from the sar application:

10:02:01 AM sdc 0.70 0.00 8.13 11.62 0.00 1.29 0.86 0.06
10:02:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:02:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 1.31 3.73 99.44 78.46 0.02 17.92 0.92 0.12
Average: sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:05:01 AM sdc 2.70 0.00 39.92 14.79 0.02 5.95 0.31 0.08
10:05:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:05:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:06:01 AM sdc 0.83 0.00 10.00 12.00 0.00 0.78 0.56 0.05
11:04:01 AM sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:04:01 AM sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: sdc 0.70 2.55 8.62 15.91 0.00 1.31 0.78 0.05
Average: sda 0.12 0.95 0.00 7.99 0.00 0.60 0.60 0.01
Average: sdb 0.22 1.78 0.00 8.31 0.00 0.54 0.52 0.01

The following gives me the list of devices from lines containing the word "Average", but it sorts the output:

sar -dp | awk '/Average/ {devices[$2]} END {for (device in devices) {print device}}'
sda
sdb
sdc

The following gives me exactly what I want (command from here):

sar -dp | awk '/Average/ {print $2}' | awk '!devices[$0]++'
sdc
sda
sdb

Maybe I'm missing something painfully obvious, but I can't figure out how to do the same in one awk command, that is, without piping the output of the first awk into the second.
You can do:

sar -dp | awk '/Average/ && !devices[$2]++ {print $2}'
sdc
sda
sdb

The problem is the for (device in devices) part: POSIX leaves the iteration order of for (key in array) unspecified, so awk is free to return the keys in any order it likes (gawk, for instance, iterates in internal hash order by default).
awk '/Average/ && !devices[$2]++ {print $2}' sar.in

You just need to combine the two tests. The only caveat: in the piped version the second awk saw the device name as the entire line ($0), whereas in the combined version it is field two of the original input, so $0 becomes $2.
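For comparison, the same keep-first-occurrence filter sketched in Python (reading lines from sar output; the function name is just illustrative), mirroring awk's !seen[$2]++ idiom:

```python
def first_devices(lines):
    """Return device names from 'Average' lines, deduplicated, order preserved."""
    seen = set()
    out = []
    for line in lines:
        fields = line.split()
        # same two tests as the awk version: match the condition, then
        # only emit a device name the first time it is encountered
        if len(fields) > 1 and fields[0].startswith('Average') and fields[1] not in seen:
            seen.add(fields[1])
            out.append(fields[1])
    return out
```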
Why does my SQLite3 query take so long?
Context: I developed a read-only VFS for SQLite using the C API. I need both speed and small file size. I solved the size problem with an LZ4-based VFS. However, I have some speed issues when I query my DB.

Specifications:
- I work on Linux (Ubuntu 12.10)
- DB files are 275MB compressed and about 700MB uncompressed
- I am doing queries on indexed fields.
- I measure the time taken for a given query after dropping caches (echo 3 | sudo tee /proc/sys/vm/drop_caches)

Problem: When I run the query under the time command, I get the following output:

real 0m5.933s
user 0m0.124s
sys  0m0.096s

What is surprising is the difference between user+sys and real. This is why I decided to profile, with gprof, the code I have written as well as its dependencies (sqlite3, lz4). Below are a few lines of the gprof flat profile and call graph. Beyond that, I have no idea what to look at to find a solution, mainly because I do not understand why (and where) all this time is spent. I hope you can help me.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
93.75      0.15     0.15     2948     0.05     0.05  LZ4_decompress_fast
 6.25      0.16     0.01    26068     0.00     0.01  sqlite3VdbeCursorMoveto
 0.00      0.16     0.00    54088     0.00     0.00  sqlite3GetVarint
 0.00      0.16     0.00    28459     0.00     0.00  sqlite3VdbeSerialGet
 0.00      0.16     0.00    23708     0.00     0.00  sqlite3VdbeMemNulTerminate
 0.00      0.16     0.00    23699     0.00     0.00  sqlite3_transfer_bindings
 0.00      0.16     0.00    11910     0.00     0.00  sqlite3DbMallocSize
 0.00      0.16     0.00    11883     0.00     0.00  sqlite3VdbeMemGrow
 0.00      0.16     0.00    11851     0.00     0.00  sqlite3VdbeMemMakeWriteable
 0.00      0.16     0.00     9480     0.00     0.00  fetchPayload
 0.00      0.16     0.00     9478     0.00     0.00  sqlite3VdbeMemRelease
 0.00      0.16     0.00     7170     0.00     0.02  btreeGetPage
 0.00      0.16     0.00     7170     0.00     0.00  pcache1Fetch
 0.00      0.16     0.00     7170     0.00     0.00  releasePage
 0.00      0.16     0.00     7170     0.00     0.02  sqlite3PagerAcquire
 0.00      0.16     0.00     7170     0.00     0.00  sqlite3PagerUnrefNotNull
 0.00      0.16     0.00     7170     0.00     0.00  sqlite3PcacheFetch
 0.00      0.16     0.00     7170     0.00     0.00  sqlite3PcacheRelease
 0.00      0.16     0.00     7169     0.00     0.00  pcache1Unpin
 0.00      0.16     0.00     7169     0.00     0.00  pcacheUnpin
 0.00      0.16     0.00     7168     0.00     0.02  getAndInitPage
 0.00      0.16     0.00     7165     0.00     0.02  moveToChild

Call graph:

granularity: each sample hit covers 2 byte(s) for 6.25% of 0.16 seconds

index  % time   self  children    called       name
                                               <spontaneous>
[1]     100.0   0.00    0.16                   main [1]
                0.00    0.16       1/1            sqlite3_exec <cycle 1> [5]
                0.00    0.00       1/1            openDatabase [18]
                0.00    0.00       2/3            sqlite3_vfs_find [249]
                0.00    0.00       1/1            sqlite3_crodvfs [358]
                0.00    0.00       1/14           sqlite3_vfs_register <cycle 5> [204]
                0.00    0.00       1/1            sqlite3_open_v2 [359]
-----------------------------------------------
[2]      99.9   0.00    0.16   1+198          <cycle 1 as a whole> [2]
                0.00    0.16       2              sqlite3_exec <cycle 1> [5]
                0.00    0.00       2              sqlite3InitOne <cycle 1> [27]
                0.00    0.00     124              sqlite3Parser <cycle 1> [82]
                0.00    0.00     6+4              sqlite3WalkSelect <cycle 1> [186]
                0.00    0.00       9              sqlite3ReadSchema <cycle 1> [149]
                0.00    0.00       7              sqlite3LockAndPrepare <cycle 1> [168]
                0.00    0.00       7              sqlite3Prepare <cycle 1> [171]
                0.00    0.00       7              sqlite3RunParser <cycle 1> [172]
                0.00    0.00       5              sqlite3InitCallback <cycle 1> [196]
                0.00    0.00       5              sqlite3_prepare <cycle 1> [203]
                0.00    0.00       5              sqlite3SelectPrep <cycle 1> [198]
                0.00    0.00       4              sqlite3LocateTable <cycle 1> [213]
                0.00    0.00       3              sqlite3StartTable <cycle 1> [243]
                0.00    0.00       2              selectExpander <cycle 1> [277]
                0.00    0.00       2              sqlite3Select <cycle 1> [299]
                0.00    0.00       2              sqlite3CreateIndex <cycle 1> [282]
                0.00    0.00       2              sqlite3_prepare_v2 <cycle 1> [312]
                0.00    0.00       2              resolveSelectStep <cycle 1> [275]
                0.00    0.00       2              resolveOrderGroupBy <cycle 1> [274]
                0.00    0.00       1              sqlite3Init <cycle 1> [347]
-----------------------------------------------
                0.00    0.16    2374/2374      sqlite3_step [4]
[3]      99.9   0.00    0.16    2374          sqlite3VdbeExec [3]
                0.01    0.15   26068/26068       sqlite3VdbeCursorMoveto [6]
                0.00    0.00      53/54          moveToLeftmost [17]
                0.00    0.00       2/3           sqlite3BtreeBeginTrans [20]
                0.00    0.00       1/2370        sqlite3BtreeMovetoUnpacked [16]
                0.00    0.00    2372/2372        sqlite3BtreeNext [28]
                0.00    0.00       1/2371        moveToRoot [23]
                0.00    0.00   26068/28459       sqlite3VdbeSerialGet [34]
                0.00    0.00   23699/23708       sqlite3VdbeMemNulTerminate [35]
                0.00    0.00   23699/23699       sqlite3_transfer_bindings [36]
                0.00    0.00   11851/11851       sqlite3VdbeMemMakeWriteable [39]
                0.00    0.00    7108/7109        sqlite3BtreeKeySize [49]
                0.00    0.00    4741/9480        fetchPayload [40]
                0.00    0.00    4739/4739        sqlite3VdbeMemFromBtree [55]
                0.00    0.00    4739/9478        sqlite3VdbeMemRelease [41]
                0.00    0.00    2374/2376        sqlite3VdbeMemShallowCopy [65]
                0.00    0.00    2372/2376        sqlite3VdbeCheckFk [64]
                0.00    0.00    2372/2372        sqlite3VdbeCloseStatement [66]
                0.00    0.00    2372/4742        btreeParseCellPtr [54]
                0.00    0.00    2372/4742        btreeParseCell [53]
                0.00    0.00    2370/2391        sqlite3VdbeRecordCompare [63]
                0.00    0.00    2369/2369        sqlite3VdbeIntValue [67]
                0.00    0.00     893/893         sqlite3VdbeRealValue [74]
                0.00    0.00       3/3           sqlite3VdbeFreeCursor [245]
                0.00    0.00       3/3           allocateCursor [226]
                0.00    0.00       3/3           sqlite3BtreeCursor [237]
                0.00    0.00       2/9           sqlite3VdbeHalt [152]
                0.00    0.00       1/36          sqlite3BtreeLeave [98]
                0.00    0.00       1/6           sqlite3BtreeGetMeta [185]
                0.00    0.00       1/1           sqlite3GetVarint32 [345]
Deleting entire columns from a text file using the cut command or an AWK program
I have a text file in the form below. Could someone help me with how to delete columns 2, 3, 4, 5, 6 and 7? I want to keep only columns 1, 8 and 9.

37.55 6.00 24.98 0.00 -2.80 -3.90 26.675 './gold_soln_CB_FragLib_Controls_m1_9.mol2' 'ethyl'
38.45 1.39 27.36 0.00 -0.56 -2.48 22.724 './gold_soln_CB_FragLib_Controls_m2_6.mol2' 'pyridin-2-yl(pyridin-3-yl)methanone'
38.47 0.00 28.44 0.00 -0.64 -2.42 20.387 './gold_soln_CB_FragLib_Controls_m3_3.mol2' 'pyridin-2-yl(pyridin-4-yl)methanone'
42.49 0.07 30.87 0.00 -0.03 -3.24 22.903 './gold_soln_CB_FragLib_Controls_m4_5.mol2' '(3-chlorophenyl)(pyridin-3-yl)methanone'
38.20 1.47 27.53 0.00 -1.13 -3.28 22.858 './gold_soln_CB_FragLib_Controls_m5_2.mol2' 'dipyridin-4-ylmethanone'
41.87 0.57 30.53 0.00 -0.67 -3.16 22.829 './gold_soln_CB_FragLib_Controls_m6_9.mol2' '(3-chlorophenyl)(pyridin-4-yl)methanone'
38.18 1.49 27.09 0.00 -0.56 -1.63  7.782 './gold_soln_CB_FragLib_Controls_m7_1.mol2' '3-hydrazino-6-phenylpyridazine'
39.45 1.50 27.71 0.00 -0.15 -4.17 17.130 './gold_soln_CB_FragLib_Controls_m8_6.mol2' '3-hydrazino-6-phenylpyridazine'
41.54 4.10 27.71 0.00 -0.65 -4.44  9.702 './gold_soln_CB_FragLib_Controls_m9_4.mol2' '3-hydrazino-6-phenylpyridazine'
41.05 1.08 29.30 0.00 -0.31 -2.44 28.590 './gold_soln_CB_FragLib_Controls_m10_3.mol2' '3-hydrazino-6-(4-methylphenyl)pyridazine'
Try:

awk '{print $1 "\t" $8 "\t" $9}' yourfile.tsv > only189.tsv

(awk suits this better than cut here: the input columns are separated by runs of spaces, and cut -d' ' treats every single space as a delimiter, whereas awk splits on any whitespace by default.)