Awk fixed width columns and left leaning columns

I have a file named file1 consisting of 4350 lines and 12 columns, as shown below.
ATOM 1 CE1 LIG H 1 75.206 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 74.984 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 74.926 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 1.886 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 62.517 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 59.442 35.851 2.791 0.00 0.00 HAC1
I am using awk -v d="74.106" '{$7=sprintf("%0.3f", $7+d)} 1' file1 > file2 to add a value d to the 7th column of file1. After this, my file2 does not retain the correct formatting. A section of file2 is shown below.
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
I need my file2 to keep the same formatting as my file1, where only columns 2, 8, and 9 are left-aligned ("left leaning").
I have tried to use awk -v FIELDWIDTHS="7 6 4 4 4 5 8 8 8 6 6 10" '{print $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12}' to specify the maximum width for each of the 12 columns. This line does not change my file2. Moreover, I cannot find a way to make columns 2, 8, and 9 left-aligned as in file1.
How can I achieve these two things?
I appreciate any guidance. Thank you!

Well, with the default FS, awk collapses the runs of spaces when you modify a field, because it rebuilds $0 with OFS (a single space by default) between fields.
What you need to do first is to understand your ATOM record format:
COLUMNS   DATA TYPE      CONTENTS
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        Atom serial number.
13 - 16   Atom           Atom name.
17        Character      Alternate location indicator.
18 - 20   Residue name   Residue name.
22        Character      Chain identifier.
23 - 26   Integer        Residue sequence number.
27        AChar          Code for insertion of residues.
31 - 38   Real(8.3)      Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      Occupancy.
61 - 66   Real(6.2)      Temperature factor (Default = 0.0).
73 - 76   LString(4)     Segment identifier, left-justified.
77 - 78   LString(2)     Element symbol, right-justified.
79 - 80   LString(2)     Charge on the atom.
Then you can use substr for generating a modified record:
awk -v d="74.106" '
  /^ATOM / {
    xCoord = sprintf( "%8.3f", substr($0,31,8) + d )
    $0 = substr($0,1,30) xCoord substr($0,39)
  }
  1
' file.pdb
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
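If you would rather not count byte offsets, another option is to re-emit every field through printf with an explicit width, so the columns stay aligned after $7 changes. This is only a sketch: the widths and the left-justified ("%-") fields below are guesses from the sample data, not the strict PDB layout, so adjust them to your actual file.

```shell
# Sketch: parse with default whitespace splitting, shift column 7, then
# rebuild the line with explicit printf widths ("%-" left-justifies).
# The widths below are guessed from the sample data, not the PDB spec.
printf 'ATOM 1 CE1 LIG H 1 75.206 62.966 59.151 0.00 0.00 HAB1\n' |
awk -v d=74.106 '{
    $7 = sprintf("%.3f", $7 + d)
    printf "%-6s%-7s%-5s%-4s%-2s%-7s%11s%8s%8s%6s%6s %s\n",
           $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12
}'
```

Unlike the substr approach above, this does not require the input to be strictly fixed-width, but it does rewrite every line's spacing.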

Using awk
$ awk -v d=74.106 '/ATOM/{sub($7,sprintf("%0.3f", $7+d))}1' input_file
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
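One caveat worth noting on this approach: sub($7, replacement) treats the current value of $7 as a regular expression, not as a field position, and replaces its first match anywhere in the line. That is fine for this sample, but it can misfire when the same text also occurs in an earlier field (and the "." in the numbers matches any character). A contrived illustration:

```shell
# $4 is "0.00", but sub() replaces the *first* match of that pattern in
# the line -- which here is field 2, not field 4.
printf 'ATOM 0.00 X 0.00 end\n' | awk '{sub($4, "9.99")} 1'
# -> ATOM 9.99 X 0.00 end
```

Assigning to $7 directly, as in the question, avoids this because it addresses the field by position rather than by pattern.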

Related

Melted distance table into distance matrix

I have a table like this:
Var score
1 1.00
1 1.06
1 1.03
1 0.65
1 0.68
2 1.06
2 1.07
2 0.64
2 1.05
3 0.71
3 0.72
3 1.03
4 0.68
4 1.08
5 0.11
Want to convert this into matrix like:
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
I tried awk but its keep running:
awk '{if(NF>max) max=NF} END{while(getline<"file"){for(i=NF+1;i<=max;i++)$i="0";print}}'
It keeps running because you forgot to pass it the file name. So awk takes its input from the standard input and waits for you to enter something on the keyboard. Use awk '...' file, not just awk '...'. But even with this error fixed it will not work as you expect.
You don't need to read the file twice. You can build your matrix in one single pass and populate the missing cells in the END block (tested with GNU and BSD awk):
awk 'NR > 1 {
    num[$1] += 1
    mat[$1, $1 + num[$1]] = mat[$1 + num[$1], $1] = $2
    n = num[$1] > n ? num[$1] : n
}
END {
    n += 1
    mat[0, 0] = ""
    for(i = 1; i <= n; i += 1) {
        mat[0, i] = mat[i, 0] = i
        mat[i, i] = "0.00"
    }
    for(i = 0; i <= n; i += 1)
        for(j = 0; j <= n; j += 1)
            printf("%s%s", mat[i, j], j == n ? "\n" : "\t")
}' file
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00

Pandas DataFrame - How to apply Lambda Function on multiple columns and create a new column

Pandas DataFrame = df (example) is as follows:
----------------------------------
col_1 col_2 col_3 col_4 ... etc.
----------------------------------
0 34.91 12.45 0.00 256.95
1 0.00 0.00 0.00 0.00
2 2.34 346.78 1.23 0.02
3 0.00 78.95 36.78 2.95
4 0.03 46.21 128.05 30.00
5 0.05 0.10 0.07 0.05
----------------------------------
df = df.assign(col_new_bool = lambda x: True if ((x['col_1'] < 0.0001) and
(x['col_2'] < 0.0002) and
(x['col_3'] < 0.0003) and
(x['col_3'] < 0.0004))
else False)
I want to create a new column (named new_col_bool) in dataframe df.
new_col_bool will contain boolean True if all 4 columns have zeroes.
new_col_bool will contain boolean False if any of the 4 columns is non-zero.
Please help with the correct lambda function?
NOTE:
df has 100+ columns but my new_col_bool is calculated based on only 4 columns.
How do I check a different threshold value for each of those 4 columns?
You don't need a lambda function for something this trivial; use DataFrame.all over axis=1:
df['new_col_bool'] = df.eq(0).all(axis=1)
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
5 0.05 0.10 0.07 0.05 False
To only check certain columns, select these first:
cols = ['col_1', 'col_2', 'col_3', 'col_4']
df['new_col_bool'] = df[cols].eq(0).all(axis=1)
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
5 0.05 0.10 0.07 0.05 False
To check any condition:
cols = ['col_1', 'col_2', 'col_3', 'col_4']
cond = df[cols] > 0.5
# or cond = df[cols] <= -1.3
df['new_col_bool'] = cond.all(axis=1)
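The asker also wanted a different threshold per column. A minimal sketch (the column names and cutoffs are placeholders taken from the question): put the per-column thresholds in a Series indexed by column name, and let DataFrame.lt align them label-wise.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [34.91, 0.00],
    "col_2": [12.45, 0.00],
    "col_3": [0.00, 0.00],
    "col_4": [256.95, 0.00],
})

# One cutoff per column; DataFrame.lt aligns the Series on column labels,
# so each column is compared against its own threshold.
thresholds = pd.Series({"col_1": 0.0001, "col_2": 0.0002,
                        "col_3": 0.0003, "col_4": 0.0004})
df["new_col_bool"] = df[list(thresholds.index)].lt(thresholds).all(axis=1)
print(df["new_col_bool"].tolist())  # [False, True]
```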
I think it would be efficient to transpose the dataframe and take the sum:
df['new_col_bool'] = df.T.sum() == 0
df
Out[1]:
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
Or for specific columns:
df['new_col_bool'] = df.T.iloc[0:4].sum() == 0
df
Out[1]:
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
To do by a threshold, use max:
df['new_col_bool'] = df.T.iloc[0:4].max() < 100
df
Out[1]:
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 True
4 0.03 46.21 128.05 30.00 False
5 0.05 0.10 0.07 0.05 True
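One caveat on the transpose-and-sum idea: summing detects a zero total, not all-zero entries, so a row whose values cancel out (e.g. 1 and -1) gives a false positive. With the non-negative sample data here both approaches agree, but for a strict "all zeroes" test the eq(0).all(axis=1) form is safer:

```python
import pandas as pd

# Caveat: summing detects a zero *total*, not all-zero entries, so values
# that cancel out give a false positive.
df = pd.DataFrame({"a": [0.0, 1.0], "b": [0.0, -1.0]})
print((df.T.sum() == 0).tolist())     # [True, True]  <- row 1 misfires
print(df.eq(0).all(axis=1).tolist())  # [True, False]
```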

Pandas DataFrame Transpose and Matrix Multiplication

I am looking for a way to perform a matrix multiplication on two sets of columns in a dataframe. One set of columns will need to be transposed and then multiplied with the other set. Then I need to take the resulting matrix, do an element-wise product with a scalar matrix, and add up. Below is an example:
Data for testing:
import pandas as pd
import numpy as np
dftest = pd.DataFrame(data=[['A',0.18,0.25,0.36,0.21,0,0.16,0.16,0.64,0.04,0,0],['B',0,0,0.5,0.5,0,0,0,0.25,0.75,0,0]],columns = ['Ticker','f1','f2','f3','f4','f5','p1','p2','p3','p4','p5','multiplier'])
Starting dataframe with data for Tickers. f1 through f5 represent one set of categories and p1 through p5 represent another.
dftest
Out[276]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 0
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 0
For each row, I need to transpose columns p1 through p5 and then multiply them by columns f1 through f5. I think I have found the solution using the line below.
dftest.groupby('Ticker')['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5'].apply(lambda x: x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]))
Out[408]:
f1 f2 f3 f4 f5
Ticker
A p1 0.0288 0.04 0.0576 0.0336 0.0
p2 0.0288 0.04 0.0576 0.0336 0.0
p3 0.1152 0.16 0.2304 0.1344 0.0
p4 0.0072 0.01 0.0144 0.0084 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
B p1 0.0000 0.00 0.0000 0.0000 0.0
p2 0.0000 0.00 0.0000 0.0000 0.0
p3 0.0000 0.00 0.1250 0.1250 0.0
p4 0.0000 0.00 0.3750 0.3750 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
Next I need to do an element-wise product of the above matrix against another 5x5 matrix, m, that is in another DataFrame, and then add up the columns or rows (you get the same result either way). If I extend the above statement as below, I get the result I want.
dftest.groupby('Ticker')['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5'].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns = m.columns, index = m.index).sum().sum())
Out[409]:
Ticker
A 2.7476
B 1.6250
dtype: float64
So far so good, I think; happy to know a better and faster way to do this. The next question is where I am stuck.
How do I take this and update the "multiplier" column on my original dataFrame?
if I try to do the following:
dftest['multiplier']=dftest.groupby('Ticker')['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5'].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns = m.columns, index = m.index).sum().sum())
I get NaNs in the multiplier column.
dftest
Out[407]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 NaN
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 NaN
I suspect it has to do with indexing and whether all the indices after grouping translate back to the original dataframe. Second, do I need a groupby statement for this one? Since it is a row-by-row solution, can't I just do it without grouping, or group by the index? Any suggestions on that?
I need to do this without iterating row by row because the whole code will iterate due to some optimization I have to do. So I need to run this whole process, look at the results and if they are outside some constraints, calculate new f1 through f5 and p1 through p5 and run the whole thing again.
I posted a question on this earlier but it was confusing, so this is a second attempt. Hope it makes sense.
Thanks in advance for all your help.
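The NaN symptom described above has a common cause worth sketching: the groupby apply returns a Series indexed by Ticker, while dftest has a default integer index, so a direct column assignment has no labels to align on and fills NaN. Mapping the Ticker column onto the result re-aligns it. This is only an illustration with a stand-in computation, not the asker's actual matrix math:

```python
import pandas as pd

# Sketch of the index-alignment issue: the groupby result is indexed by
# Ticker, while dftest uses a default RangeIndex, so a direct column
# assignment aligns on nothing and yields NaN. (The computation here is
# a stand-in for the real matrix product.)
dftest = pd.DataFrame({"Ticker": ["A", "B"], "x": [1.0, 2.0]})
per_ticker = dftest.groupby("Ticker")["x"].sum() * 10   # indexed by Ticker

dftest["bad"] = per_ticker                               # NaN: 'A','B' vs 0,1
dftest["multiplier"] = dftest["Ticker"].map(per_ticker)  # aligned correctly
```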

AWK Curly Braces not working

Trying to run an awk command according to some documentation (linky).
However, whenever I add {1} or {2} to the awk command as documentation describes (see link above or example below) my search stops working. Zero results even on gigantic multi-gigabyte files. Any advice?
These work
awk '($3=="+" && $4~/^CG/)' example
awk '($3=="+" && $4~/..CG/)' example
awk '($3=="+" && $4~/.CG/)' example
awk '($3=="+" && $4~/^..CG/)' example
These don't return anything (even on a 3 gigabyte file)
awk '($3=="+" && $4~/.{2}CG/)' example
awk '($3=="+" && $4~/.{1}CG/)' example
awk '($3=="+" && $4~/^.{2}CG/)' example
Full command according to documentation:
awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt
Example dataset
EDIT (A COLUMN DISAPPEARED WHEN I PASTED INTO STACK EXCHANGE, TYPO FIXED)
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 + CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 + TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 + GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 + CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 + TACGC 0.000 56 0 0.000 0.064
chr1 3121631 + CTCGT 0.000 56 0 0.000 0.064
You have removed some columns from the original sample data.
This is the original data in the link you sent:
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
And this is the sample data you posted:
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 GTCGT 0.000 56 0 0.000 0.064
This is a problem for an expression like this:
awk '($3=="+" && $4~/.{2}CG/)' example
Which expects a + symbol in the third column ($3, which does not exist in your data) and two characters followed by CG in the fourth column ($4, whose contents actually sit in position number 3). It won't match any line in your file.
If you modify the expression to refer to the proper column ($3) and forget the + sign since it does not appear in your data, you will get to match lines in your file.
$ awk '($3~/.{2}CG/)' example
chr1 3121589 CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 TACGC 0.000 56 0 0.000 0.064
chr1 3121631 CTCGT 0.000 56 0 0.000 0.064
$
Actually, every data line in the example file has exactly two characters before the CG, so only the header will be skipped.
Problem solved. I used gawk and --posix
gawk --posix '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)'
Works just fine now.

How to extract specific lines from a text file using awk?

I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, plus the two lines after H. How can I do this using awk?
The expected output would be
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.
Ignoring the blank line after B in your output (your problem specifications give no indication as to why that blank line is in the output, so I'm assuming it should not be there):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print all lines that start with B and each line that starts with H along with the next two lines.
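To see the countdown at work, here is the same one-liner run against a small inline copy of the sample: /^H/ arms a 3-line counter (H itself plus the next two lines), t-- > 0 stays true while it runs down, and || short-circuits so matching B lines never consume the counter.

```shell
printf 'A 102\nB 456\nC 678\nH A B C\n1.18 0.20\n3.23 0.06\n' |
awk '/^H/{t=3} /^B/ || t-- > 0'
# -> prints "B 456", "H A B C", and the two lines after H
```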
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile
bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
OR in short
awk '{if($0 ~ /^[BH ]/ ) print;}' t
OR even shorter
awk '/^[BH ]/' t
If H and B aren't the only headers that are sent before tabular data and you intend to omit those blocks of data (you don't specify the requirements fully) you have to use a flip-flop to remember if you're currently in a block you want to keep or not:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt
awk '/^[BH ]/' filename.txt > output.txt
EDIT: Updated for OP's edit