I have a table like this:
Var score
1 1.00
1 1.06
1 1.03
1 0.65
1 0.68
2 1.06
2 1.07
2 0.64
2 1.05
3 0.71
3 0.72
3 1.03
4 0.68
4 1.08
5 0.11
I want to convert this into a matrix like:
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
I tried awk but it keeps running:
awk '{if(NF>max) max=NF} END{while(getline<"file"){for(i=NF+1;i<=max;i++)$i="0";print}}'
It keeps running because you forgot to pass it the file name, so awk takes its input from standard input and waits for you to type something on the keyboard. Use awk '...' file, not just awk '...'. But even with this error fixed, it will not work as you expect.
You don't need to read the file twice: you can build the matrix in a single pass and populate the missing cells in the END block (tested with GNU and BSD awk):
awk 'NR > 1 {
num[$1] += 1
mat[$1, $1 + num[$1]] = mat[$1 + num[$1], $1] = $2
n = num[$1] > n ? num[$1] : n
}
END {
n += 1
mat[0, 0] = ""
for(i = 1; i <= n; i += 1) {
mat[0, i] = mat[i, 0] = i
mat[i, i] = "0.00"
}
for(i = 0; i <= n; i += 1)
for(j = 0; j <= n; j += 1)
printf("%s%s", mat[i, j], j == n ? "\n" : "\t")
}' file
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
I have a file named file1 consisting of 4350 lines and 12 columns, as shown below.
ATOM 1 CE1 LIG H 1 75.206 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 74.984 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 74.926 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 1.886 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 62.517 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 59.442 35.851 2.791 0.00 0.00 HAC1
I am using awk -v d="74.106" '{$7=sprintf("%0.3f", $7+d)} 1' file1 > file2 to add the value d to the 7th column of file1. After this, my file2 does not retain the correct formatting. A section of file2 is shown below.
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
I need my file2 to keep the same formatting as my file1, where only columns 2, 8, and 9 are left-aligned.
I have tried awk -v FIELDWIDTHS="7 6 4 4 4 5 8 8 8 6 6 10" '{print $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12}' to specify the maximum width of each of the 12 columns. This does not change my file2. Moreover, I cannot find a way to make columns 2, 8, and 9 left-aligned as in file1.
How can I achieve these two things?
I appreciate any guidance. Thank you!
Well, with the default FS, awk rebuilds the record with single spaces (the default OFS) as soon as you modify a field, which is why your duplicate spaces disappear.
What you need to do first is to understand your ATOM record format:
COLUMNS    DATA TYPE      CONTENTS
 1 -  6    Record name    "ATOM  "
 7 - 11    Integer        Atom serial number.
13 - 16    Atom           Atom name.
17         Character      Alternate location indicator.
18 - 20    Residue name   Residue name.
22         Character      Chain identifier.
23 - 26    Integer        Residue sequence number.
27         AChar          Code for insertion of residues.
31 - 38    Real(8.3)      Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)      Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)      Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)      Occupancy.
61 - 66    Real(6.2)      Temperature factor (Default = 0.0).
73 - 76    LString(4)     Segment identifier, left-justified.
77 - 78    LString(2)     Element symbol, right-justified.
79 - 80    LString(2)     Charge on the atom.
Then you can use substr for generating a modified record:
awk -v d="74.106" '
/^ATOM / {
xCoord = sprintf( "%8.3f", substr($0,31,8) + d )
$0 = substr($0,1,30) xCoord substr($0,39)
}
1
' file.pdb
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
Using awk's sub() (note that sub() treats its first argument, here $7, as a regular expression, so the dots in the number match any character; that happens to work on this sample but is fragile in general):
$ awk -v d=74.106 '/ATOM/{sub($7,sprintf("%0.3f", $7+d))}1' input_file
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
I have two dataframes, one for the groundtruth trajectories and one for the predicted trajectories, plus a dataframe that matches groundtruth to predicted trajectories at each frame. The groundtruth and predicted tracks are as follows:
df_pred_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId HId
0 0 -1.870000 -0.41 1.51 1.280 1.670 0.39
1 0 -1.730000 -0.36 1.51 1.440 1.660 0.40
2 0 -1.180000 -1.57 2.05 2.220 0.390 0.61
0 1 -1.540000 -1.83 2.05 2.140 0.390 0.61
1 1 -1.370000 -1.70 2.05 2.180 0.390 0.61
2 1 -1.590000 -0.29 1.51 1.610 1.630 0.41
1 2 -1.910000 -1.12 1.04 0.870 1.440 0.30
2 2 -1.810000 -1.09 1.04 1.010 1.440 0.27
0 3 17.190001 -3.15 1.80 2.178 -0.028 3.36
1 3 15.000000 -3.60 1.80 2.170 -0.020 3.38
df_gt_batch =
CENTER_X CENTER_Y LENGTH SPEED ACCELERATION HEADING
FrameId OId
1 0 -1.91 -1.12 1.040 0.87 1.44 0.30
2 0 -1.81 -1.09 1.040 1.01 1.44 0.27
0 1 -1.87 -0.41 1.510 1.28 1.67 0.39
1 1 -1.73 -0.36 1.510 1.44 1.66 0.40
2 1 -1.59 -0.29 1.510 1.61 1.63 0.41
0 2 -1.54 -1.83 2.056 2.14 0.39 0.61
1 2 -1.37 -1.70 2.050 2.18 0.39 0.61
2 2 -1.18 -1.57 2.050 2.22 0.39 0.61
0 3 1.71 -0.31 1.800 2.17 -0.02 3.36
1 3 1.50 -0.36 1.800 2.17 -0.02 3.38
2 3 1.29 -0.41 1.800 2.17 -0.01 3.40
Also, I know their matching at each timestamp:
matched_gt_pred =
FrameId Type OId HId
0 0 MATCH 1.0 0.0
1 0 MATCH 2.0 1.0
4 1 MATCH 1.0 0.0
5 1 MATCH 2.0 1.0
6 1 MATCH 0.0 2.0
9 2 MATCH 0.0 2.0
I would like to go through each row of matched_gt_pred, get the corresponding CENTER_X from df_pred_batch and df_gt_batch, and calculate the error.
For instance, looking at the first row of matched_gt_pred, I know that at FrameId == 0, OId == 1 and HId == 0 are matched. So I should get gt_center_x from df_gt_batch (the row with FrameId == 0 and OId == 1), pred_center_x from df_pred_batch (the row with FrameId == 0 and HId == 0), and compute error = abs(gt_center_x - pred_center_x).
IIUC, I would reshape your df_gt_batch and df_pred_batch and use lookup:
import numpy as np

gt_x = df_gt_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['OId'])
pred_x = df_pred_batch['CENTER_X'].unstack().lookup(matched_gt_pred['FrameId'], matched_gt_pred['HId'])
matched_gt_pred['X Error'] = np.abs(gt_x - pred_x)
Output:
FrameId Type OId HId X Error
0 0 MATCH 1.0 0.0 0.0
1 0 MATCH 2.0 1.0 0.0
4 1 MATCH 1.0 0.0 0.0
5 1 MATCH 2.0 1.0 0.0
6 1 MATCH 0.0 2.0 0.0
9 2 MATCH 0.0 2.0 0.0
Another option is to use reindex with pd.MultiIndex:
matched_gt_pred['X Error'] = np.abs(
    df_pred_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['HId']]))['CENTER_X'].to_numpy() -
    df_gt_batch.reindex(pd.MultiIndex.from_arrays([matched_gt_pred['FrameId'], matched_gt_pred['OId']]))['CENTER_X'].to_numpy())
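Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on current pandas the reindex variant is the one to reach for. A minimal runnable sketch of it, with made-up numbers standing in for the frames above (the values and matches here are illustrative, not the actual data):

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for df_gt_batch / df_pred_batch, indexed by (FrameId, OId/HId)
df_gt_batch = pd.DataFrame(
    {"CENTER_X": [-1.87, -1.91]},
    index=pd.MultiIndex.from_tuples([(0, 1), (1, 2)], names=["FrameId", "OId"]))
df_pred_batch = pd.DataFrame(
    {"CENTER_X": [-1.85, -1.90]},
    index=pd.MultiIndex.from_tuples([(0, 0), (1, 1)], names=["FrameId", "HId"]))
matched_gt_pred = pd.DataFrame({"FrameId": [0, 1], "OId": [1, 2], "HId": [0, 1]})

# Align each match row against both frames via a MultiIndex reindex
gt_x = df_gt_batch["CENTER_X"].reindex(
    pd.MultiIndex.from_arrays([matched_gt_pred["FrameId"], matched_gt_pred["OId"]])).to_numpy()
pred_x = df_pred_batch["CENTER_X"].reindex(
    pd.MultiIndex.from_arrays([matched_gt_pred["FrameId"], matched_gt_pred["HId"]])).to_numpy()
matched_gt_pred["X Error"] = np.abs(gt_x - pred_x)
```

A convenient side effect: reindex returns NaN for any (FrameId, OId/HId) pair missing from a frame, so unmatched rows surface as NaN errors instead of raising.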
I have a large dataframe df as:
Col1 Col2 ATC_Dzr ATC_Last ATC_exp Op_Dzr2 Op_Last2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.4 -0.85 -0.86 0.1 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
I need to dump this to Excel so that ATC and Op end up in merged cells spanning their subcolumns.
I am not sure how to approach this.
You can set the first 2 columns as the index and split the rest with expand=True to create a MultiIndex:
df1 = df.set_index(['Col1','Col2'])
df1.columns = df1.columns.str.split('_',expand=True)
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
Then export df1 into excel.
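A self-contained sketch of the whole round trip (the frame is rebuilt here from the example values; actually writing the file needs an Excel engine such as openpyxl installed, so the write call is left commented):

```python
import pandas as pd

# Rebuild one row of the example frame
df = pd.DataFrame({"Col1": ["1Loc"], "Col2": ["get1"],
                   "ATC_Dzr": [0.26], "ATC_Last": [3.88], "ATC_exp": [3.73],
                   "Op_Dzr2": [0.16], "Op_Last2": [3.15]})

df1 = df.set_index(["Col1", "Col2"])
df1.columns = df1.columns.str.split("_", expand=True)

# With merge_cells=True (the default), to_excel writes the top level of the
# column MultiIndex ("ATC", "Op") as merged header cells:
# df1.to_excel("out.xlsx", merge_cells=True)
```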
As per the comments by @Datanovice, you can also use pd.MultiIndex.from_tuples:
df1 = df.set_index(['Col1','Col2'])
df1.columns = pd.MultiIndex.from_tuples([(col.split('_')[0], col.split('_')[1])
for col in df1.columns])
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
I have a dataframe with 5 columns, all of which contain numerical values. The columns represent time steps. I have a threshold which, once reached within the row, stops the values from changing. So let's say the original values are [0, 1.5, 2, 4, 1] arranged in a row and the threshold is 2; then I want the manipulated row values to be [0, 1.5, 2, 2, 2].
Is there a way to do this without loops?
A bigger example:
>>> threshold = 0.25
>>> input
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.20
143 0.11 0.27 0.12 0.28 0.35
146 0.30 0.20 0.12 0.25 0.20
324 0.06 0.20 0.12 0.15 0.20
>>> output
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Use:
df = df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1).fillna(df)
print (df)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Explanation:
Compare to the threshold with ge (>=):
print (df.ge(threshold))
0 1 2 3 4
130 False False False True False
143 False True False True True
146 True False False True False
324 False False False False False
Create the cumulative sum per row:
print (df.ge(threshold).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 1
143 0 1 1 2 3
146 1 1 1 2 2
324 0 0 0 0 0
Cumulative-sum again, so the first matched value per row can be identified:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 2
143 0 1 2 4 7
146 1 2 3 5 7
324 0 0 0 0 0
Compare to 1 to keep only the first crossing:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1))
0 1 2 3 4
130 False False False True False
143 False True False False False
146 True False False False False
324 False False False False False
Replace all non-matched values with NaN via where:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)))
0 1 2 3 4
130 NaN NaN NaN 0.25 NaN
143 NaN 0.27 NaN NaN NaN
146 0.3 NaN NaN NaN NaN
324 NaN NaN NaN NaN NaN
Forward fill missing values:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1))
0 1 2 3 4
130 NaN NaN NaN 0.25 0.25
143 NaN 0.27 0.27 0.27 0.27
146 0.3 0.30 0.30 0.30 0.30
324 NaN NaN NaN NaN NaN
Finally, restore the original values before the crossing with fillna:
print (df.where(df.ge(threshold).cumsum(1).cumsum(1).eq(1)).ffill(axis=1).fillna(df))
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
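The whole chain can be checked end-to-end with a small self-contained reproduction of the example above:

```python
import pandas as pd

threshold = 0.25
df = pd.DataFrame(
    [[0.10, 0.20, 0.12, 0.25, 0.20],
     [0.11, 0.27, 0.12, 0.28, 0.35],
     [0.30, 0.20, 0.12, 0.25, 0.20],
     [0.06, 0.20, 0.12, 0.15, 0.20]],
    index=[130, 143, 146, 324])

# Keep only the first threshold crossing per row, forward-fill it to the
# right, then restore the untouched values before the crossing
out = (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1))
         .ffill(axis=1)
         .fillna(df))
print(out)
```

Rows that never reach the threshold (like row 324) contain no True in the eq(1) mask, so every cell falls through to fillna(df) and the row is returned unchanged.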
A bit more complicated but I like it.
v = df.values
a = v >= threshold
b = np.where(np.logical_or.accumulate(a, axis=1), np.nan, v)
r = np.arange(len(a))
j = a.argmax(axis=1)
b[r, j] = v[r, j]
pd.DataFrame(b, df.index, df.columns).ffill(axis=1)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
I like this one too:
v = df.values
a = v >= threshold
b = np.logical_or.accumulate(a, axis=1)
r = np.arange(len(df))
g = a.argmax(1)
fill = pd.Series(v[r, g], df.index)
df.mask(b, fill, axis=0)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, plus the two lines after H. How can I do this using awk?
The expected output would be
B 456

H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.
Ignoring the blank line after B in your output (your problem specifications give no indication as to why that blank line is in the output, so I'm assuming it should not be there):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print all lines that start with B and each line that starts with H along with the next two lines.
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile
bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
OR in short
awk '{if($0 ~ /^[BH ]/ ) print;}' t
OR even shorter
awk '/^[BH ]/' t
If H and B aren't the only headers that are sent before tabular data and you intend to omit those blocks of data (you don't specify the requirements fully) you have to use a flip-flop to remember if you're currently in a block you want to keep or not:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt
A single regular expression cannot express "the two lines after H": a bracket expression like [B(H...)] only matches single characters, so the original attempt does not work. A working one-liner (with no need for cat) is:
awk '/^H/{n=3} /^B/ || n-- > 0' filename.txt > output.txt