How to extract specific lines from a text file using awk?

I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, plus the two lines after H. How can I do this using awk?
The expected output would be
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.

Ignoring the blank line that appeared after B in your original expected output (the problem statement gives no indication as to why it was there, so I'm assuming it should not be):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print every line that starts with B, and each line that starts with H together with the two lines after it: the /^H/ rule arms a counter (t=3, covering the H line and the next two), and the second rule prints while that counter is still positive.

Alternatively, a pattern-only version: print lines that start with B or H, plus any line that begins with optional blanks followed by a digit (the data rows):
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile

bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Or, in short:
awk '{if($0 ~ /^[BH ]/ ) print;}' t
Or, even shorter:
awk '/^[BH ]/' t

If H and B aren't the only headers that appear before tabular data and you intend to omit those other blocks (you don't specify the requirements fully), you have to use a flip-flop variable to remember whether you're currently in a block you want to keep:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt

awk '/^[BH]/ || /^ /' filename.txt > output.txt

Related

Awk fixed width columns and left leaning columns

I have a file named file1 consisting of 4350 lines and 12 columns, as shown below.
ATOM 1 CE1 LIG H 1 75.206 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 74.984 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 74.926 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 1.886 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 62.517 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 59.442 35.851 2.791 0.00 0.00 HAC1
I am using awk -v d="74.106" '{$7=sprintf("%0.3f", $7+d)} 1' file1 > file2 to add a value d to the 7th column of file1. After this, my file2 does not retain the correct formatting. A section of file2 is shown below.
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
I need my file2 to keep the same formatting as my file1, where only columns 2, 8, and 9 are left-aligned.
I have tried to use awk -v FIELDWIDTHS="7 6 4 4 4 5 8 8 8 6 6 10" '{print $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12}' to specify the maximum width for each of the 12 columns. This does not change my file2. Moreover, I cannot find a way to make columns 2, 8, and 9 left-aligned as in file1.
How can I achieve these two things?
I appreciate any guidance. Thank you!
Well, with the default FS, awk rebuilds the record with single spaces (OFS) as soon as you modify a field, so the original column padding is lost.
What you need to do first is to understand your ATOM record format:
COLUMNS   DATA TYPE     CONTENTS
 1 -  6   Record name   "ATOM "
 7 - 11   Integer       Atom serial number.
13 - 16   Atom          Atom name.
17        Character     Alternate location indicator.
18 - 20   Residue name  Residue name.
22        Character     Chain identifier.
23 - 26   Integer       Residue sequence number.
27        AChar         Code for insertion of residues.
31 - 38   Real(8.3)     Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)     Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)     Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)     Occupancy.
61 - 66   Real(6.2)     Temperature factor (Default = 0.0).
73 - 76   LString(4)    Segment identifier, left-justified.
77 - 78   LString(2)    Element symbol, right-justified.
79 - 80   LString(2)    Charge on the atom.
Then you can use substr for generating a modified record:
awk -v d="74.106" '
  /^ATOM / {
    xCoord = sprintf("%8.3f", substr($0, 31, 8) + d)
    $0 = substr($0, 1, 30) xCoord substr($0, 39)
  }
  1
' file.pdb
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
Using awk
$ awk -v d=74.106 '/ATOM/{sub($7,sprintf("%0.3f", $7+d))}1' input_file
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1

Melted distance table into distance matrix

I have a table like this:
Var score
1 1.00
1 1.06
1 1.03
1 0.65
1 0.68
2 1.06
2 1.07
2 0.64
2 1.05
3 0.71
3 0.72
3 1.03
4 0.68
4 1.08
5 0.11
I want to convert this into a matrix like:
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
I tried awk but it keeps running:
awk '{if(NF>max) max=NF} END{while(getline<"file"){for(i=NF+1;i<=max;i++)$i="0";print}}'
It keeps running because you forgot to pass it the file name. So awk takes its input from the standard input and waits for you to enter something on the keyboard. Use awk '...' file, not just awk '...'. But even with this error fixed it will not work as you expect.
You don't need to read the file twice. You can build your matrix in one single pass and populate the missing cells in the END block (tested with GNU and BSD awk):
awk 'NR > 1 {
  # store the k-th score of Var $1 at (i, i+k) and its mirror (i+k, i)
  num[$1] += 1
  mat[$1, $1 + num[$1]] = mat[$1 + num[$1], $1] = $2
  n = num[$1] > n ? num[$1] : n
}
END {
  n += 1                          # matrix dimension = max count + 1
  mat[0, 0] = ""
  for (i = 1; i <= n; i += 1) {
    mat[0, i] = mat[i, 0] = i     # header row and header column
    mat[i, i] = "0.00"            # diagonal
  }
  for (i = 0; i <= n; i += 1)
    for (j = 0; j <= n; j += 1)
      printf("%s%s", mat[i, j], j == n ? "\n" : "\t")
}' file
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00

Using transform with condition within a dataframe

I have the following df:
import pandas as pd
import numpy as np
import random

i = ['dog', 'cat', 'rabbit', 'elephant'] * 20
df = pd.DataFrame(np.random.randn(len(i), 3), index=i,
                  columns=list('ABC')).rename_axis('animal').reset_index()
df.insert(1, 'type', pd.Series(random.choice(['X', 'Y'])
                               for _ in range(len(df))))
I would like to have the max of column A, if the type of the animal is X ... else the min of column A, in a separate column.
Applying a lambda with groupby gives a multi-indexed result with the following code:
g = df.groupby(['animal', 'type'])
g.apply(lambda g: np.where(g.type == 'X', g.A.max(), g.A.min()))
Is there a way to convert this to a series, that can be added to df as a column... say by using transform?
Is this what you want?
>>> df
animal type A B C
0 cat Y 0.96 -0.02 -0.14
1 cat Y -0.80 0.86 1.75
2 dog X 1.13 -0.49 -1.66
3 dog Y 0.84 -0.07 0.15
4 elephant X 0.13 -0.54 0.73
5 elephant Y 0.14 1.77 0.94
6 rabbit X -0.12 -0.39 0.05
7 rabbit X 0.58 -1.17 0.77
>>> def max_min_A(g):
...     animal, type_ = g.name
...     return np.where(type_ == 'X', g.max(), g.min())
>>> df['new_col'] = df.groupby(['animal', 'type'])['A'].transform(max_min_A)
>>> df
animal type A B C new_col
0 cat Y 0.96 -0.02 -0.14 -0.80
1 cat Y -0.80 0.86 1.75 -0.80
2 dog X 1.13 -0.49 -1.66 1.13
3 dog Y 0.84 -0.07 0.15 0.84
4 elephant X 0.13 -0.54 0.73 0.13
5 elephant Y 0.14 1.77 0.94 0.14
6 rabbit X -0.12 -0.39 0.05 0.58
7 rabbit X 0.58 -1.17 0.77 0.58
@HarryPlotter: Thanks for the name info. It is nice to see that the name of the group propagates as a tuple. If you don't want to use a named function, the following will work:
df.assign(new_col=g.A.transform(lambda x: np.where(x.name[1] == 'X',
                                                   x.max(), x.min())))
# x.name[1] is used to select the second element of the tuple, which is `type`
I'd like to think that, performance-wise, it is better to build temporary columns rather than iterating through the groupby:
grp = df.groupby(['animal', 'type'])['A']
(df
 .assign(
     mi=grp.transform('min'),
     ma=grp.transform('max'),
     new_col=lambda df: np.where(df['type'] == 'X', df['ma'], df['mi']))
 .drop(columns=['mi', 'ma'])
)
animal type A B C new_col
0 cat Y 0.96 -0.02 -0.14 -0.80
1 cat Y -0.80 0.86 1.75 -0.80
2 dog X 1.13 -0.49 -1.66 1.13
3 dog Y 0.84 -0.07 0.15 0.84
4 elephant X 0.13 -0.54 0.73 0.13
5 elephant Y 0.14 1.77 0.94 0.14
6 rabbit X -0.12 -0.39 0.05 0.58
7 rabbit X 0.58 -1.17 0.77 0.58

Display pandas dataframe in excel file with split level column and merged cells

I have a large dataframe df as:
Col1 Col2 ATC_Dzr ATC_Last ATC_exp Op_Dzr2 Op_Last2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.4 -0.85 -0.86 0.1 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
I need to dump this to Excel so that ATC and Op each appear as a merged header cell spanning their sub-columns.
I am not sure how to approach this.
You can set the first 2 columns as the index, then split the remaining column names and expand them into a MultiIndex:
df1 = df.set_index(['Col1','Col2'])
df1.columns = df1.columns.str.split('_',expand=True)
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29
Then export df1 to Excel, as sketched below.
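For the export step itself, DataFrame.to_excel writes a MultiIndex column header as merged cells when merge_cells=True (the default), which should give the merged ATC / Op header row the question asks for. A minimal, self-contained sketch (the file name output.xlsx and the sheet name are placeholders, and an engine such as openpyxl must be installed):
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'Col1': ['1Loc', '2Loc', '3Loc'],
    'Col2': ['get1', 'get2', 'get3'],
    'ATC_Dzr': [0.26, 0.40, -0.59],
    'ATC_Last': [3.88, -0.85, 1.47],
    'ATC_exp': [3.73, -0.86, 2.01],
    'Op_Dzr2': [0.16, 0.10, -0.53],
    'Op_Last2': [3.15, -0.54, 1.29],
})

df1 = df.set_index(['Col1', 'Col2'])
df1.columns = df1.columns.str.split('_', expand=True)

# merge_cells=True (the default) merges each top-level label (ATC, Op)
# across its sub-columns in the written header row.
df1.to_excel('output.xlsx', sheet_name='Sheet1', merge_cells=True)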
As per the comments by @Datanovice, you can also use pd.MultiIndex.from_tuples:
df1 = df.set_index(['Col1','Col2'])
df1.columns = pd.MultiIndex.from_tuples([(col.split('_')[0], col.split('_')[1])
                                         for col in df1.columns])
print(df1)
ATC Op
Dzr Last exp Dzr2 Last2
Col1 Col2
1Loc get1 0.26 3.88 3.73 0.16 3.15
2Loc get2 0.40 -0.85 -0.86 0.10 -0.54
3Loc get3 -0.59 1.47 2.01 -0.53 1.29

cutting off the values at a threshold in pandas dataframe

I have a dataframe with 5 columns, all of which contain numerical values. The columns represent time steps. I have a threshold which, once reached, stops the values from changing. So, say the original values in a row are [0, 1.5, 2, 4, 1] and the threshold is 2; then I want the manipulated row values to be [0, 1.5, 2, 2, 2].
Is there a way to do this without loops?
A bigger example:
>>> threshold = 0.25
>>> input
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.20
143 0.11 0.27 0.12 0.28 0.35
146 0.30 0.20 0.12 0.25 0.20
324 0.06 0.20 0.12 0.15 0.20
>>> output
Out[75]:
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Use:
df = df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1).fillna(df)
print (df)
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
Explanation:
Compare against the threshold with ge (>=):
print (df.ge(threshold))
0 1 2 3 4
130 False False False True False
143 False True False True True
146 True False False True False
324 False False False False False
Create the cumulative sum along each row:
print (df.ge(threshold).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 1
143 0 1 1 2 3
146 1 1 1 2 2
324 0 0 0 0 0
Cumulative-sum again, so only the first match per row equals 1:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1))
0 1 2 3 4
130 0 0 0 1 2
143 0 1 2 4 7
146 1 2 3 5 7
324 0 0 0 0 0
Compare with 1 to keep only that first match:
print (df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1))
0 1 2 3 4
130 False False False True False
143 False True False False False
146 True False False False False
324 False False False False False
Replace all non-matched values with NaN using where:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)))
0 1 2 3 4
130 NaN NaN NaN 0.25 NaN
143 NaN 0.27 NaN NaN NaN
146 0.3 NaN NaN NaN NaN
324 NaN NaN NaN NaN NaN
Forward fill missing values:
print (df.where(df.ge(threshold).cumsum(axis=1).cumsum(axis=1).eq(1)).ffill(axis=1))
0 1 2 3 4
130 NaN NaN NaN 0.25 0.25
143 NaN 0.27 0.27 0.27 0.27
146 0.3 0.30 0.30 0.30 0.30
324 NaN NaN NaN NaN NaN
Finally, restore the untouched original values with fillna:
print (df.where(df.ge(threshold).cumsum(1).cumsum(1).eq(1)).ffill(axis=1).fillna(df))
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
A bit more complicated but I like it.
v = df.values
a = v >= threshold                                            # True where the threshold is reached
b = np.where(np.logical_or.accumulate(a, axis=1), np.nan, v)  # NaN from the first hit onwards
r = np.arange(len(a))
j = a.argmax(axis=1)                                          # column of the first hit per row
b[r, j] = v[r, j]                                             # keep the threshold-crossing value itself
pd.DataFrame(b, df.index, df.columns).ffill(axis=1)           # forward-fill it across the rest of the row
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20
I like this one too:
v = df.values
a = v >= threshold
b = np.logical_or.accumulate(a, axis=1)  # True from the first threshold hit onwards
r = np.arange(len(df))
g = a.argmax(1)                          # column of the first hit per row
fill = pd.Series(v[r, g], df.index)      # the value to freeze each row at
df.mask(b, fill, axis=0)                 # overwrite everything from the hit onwards
0 1 2 3 4
130 0.10 0.20 0.12 0.25 0.25
143 0.11 0.27 0.27 0.27 0.27
146 0.30 0.30 0.30 0.30 0.30
324 0.06 0.20 0.12 0.15 0.20