I'm trying to run an awk command according to some documentation (linky).
However, whenever I add {1} or {2} to the awk command as the documentation describes (see link above or example below), my search stops working: zero results, even on gigantic multi-gigabyte files. Any advice?
These work:
awk '($3=="+" && $4~/^CG/)' example
awk '($3=="+" && $4~/..CG/)' example
awk '($3=="+" && $4~/.CG/)' example
awk '($3=="+" && $4~/^..CG/)' example
These don't return anything (even on a 3-gigabyte file):
awk '($3=="+" && $4~/.{2}CG/)' example
awk '($3=="+" && $4~/.{1}CG/)' example
awk '($3=="+" && $4~/^.{2}CG/)' example
Full command according to the documentation:
awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt
Example dataset
EDIT: a column disappeared when I pasted into Stack Exchange; typo fixed.
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 + CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 + TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 + GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 + CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 + TACGC 0.000 56 0 0.000 0.064
chr1 3121631 + CTCGT 0.000 56 0 0.000 0.064
You have removed some columns from the original sample data.
This is the original data in the link you sent:
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
And this is the sample data you posted:
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 GTCGT 0.000 56 0 0.000 0.064
This is a problem for an expression like this:
awk '($3=="+" && $4~/.{2}CG/)' example
This expects a + symbol in the third column ($3, nonexistent in your data) and two characters followed by CG in the fourth column ($4, which in your posted data actually sits in position 3). It won't match any line in your file.
If you modify the expression to refer to the proper column ($3) and drop the + test, since that symbol does not appear in your data, you will match lines in your file.
$ awk '($3~/.{2}CG/)' example
chr1 3121589 CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 TACGC 0.000 56 0 0.000 0.064
chr1 3121631 CTCGT 0.000 56 0 0.000 0.064
$
In fact, every data line in the example file has two characters before the CG (the context column has the shape ??CG?), so only the header will be skipped.
Problem solved: I used gawk with --posix.
gawk --posix '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)'
It works just fine now. (Older versions of gawk do not support interval expressions such as {1} or {2} by default; --posix, or --re-interval, enables them.)
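To check whether an awk build honors interval expressions before running it on a huge file, a quick test along these lines (reusing one line of the dataset above) should print match; gawk also accepts --re-interval, which enables {n} without the other POSIX restrictions:
echo 'chr1 3121606 + TGCGC 0.000' | gawk --re-interval '($3=="+" && $4~/^.{2}CG/){print "match"}'
match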
Related
I have a file named file1 consisting of 4350 lines and 12 columns, as shown below.
ATOM 1 CE1 LIG H 1 75.206 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 74.984 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 74.926 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 1.886 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 62.517 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 59.442 35.851 2.791 0.00 0.00 HAC1
I am using awk -v d="74.106" '{$7=sprintf("%0.3f", $7+d)} 1' file1 > file2 to add a value d to the 7th column of file1. After this, my file2 does not retain the correct formatting. A section of file2 is shown below.
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
I need my file2 to keep the same formatting as my file1, where only columns 2, 8, and 9 are left-aligned.
I have tried using awk -v FIELDWIDTHS="7 6 4 4 4 5 8 8 8 6 6 10" '{print $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12}' to specify the width of each of the 12 columns, but this does not change my file2. Moreover, I cannot find a way to make columns 2, 8, and 9 left-aligned as in file1.
How can I achieve these two things?
I appreciate any guidance. Thank you!
Well, with the default FS, awk rebuilds the record with single spaces (the default OFS) whenever you modify a field, which is why the duplicate spaces disappear.
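A quick way to see this (a made-up line, purely for illustration): assigning to any field, even to itself, forces awk to rebuild $0 with single spaces:
$ echo 'a    b    c' | awk '{$2=$2} 1'
a b c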
What you need to do first is to understand your ATOM record format:
COLUMNS    DATA TYPE     CONTENTS
 1 -  6    Record name   "ATOM  "
 7 - 11    Integer       Atom serial number.
13 - 16    Atom          Atom name.
17         Character     Alternate location indicator.
18 - 20    Residue name  Residue name.
22         Character     Chain identifier.
23 - 26    Integer       Residue sequence number.
27         AChar         Code for insertion of residues.
31 - 38    Real(8.3)     Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)     Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)     Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)     Occupancy.
61 - 66    Real(6.2)     Temperature factor (Default = 0.0).
73 - 76    LString(4)    Segment identifier, left-justified.
77 - 78    LString(2)    Element symbol, right-justified.
79 - 80    LString(2)    Charge on the atom.
Then you can use substr to generate a modified record, splicing the new X coordinate into the original line so the rest of the spacing is untouched:
awk -v d="74.106" '
/^ATOM / {
xCoord = sprintf( "%8.3f", substr($0,31,8) + d )
$0 = substr($0,1,30) xCoord substr($0,39)
}
1
' file.pdb
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
Using awk, and noting that sub() treats $7 as a regular expression and replaces its first match anywhere in the record, so this relies on the value of the 7th field not occurring earlier in the line:
$ awk -v d=74.106 '/ATOM/{sub($7,sprintf("%0.3f", $7+d))}1' input_file
ATOM 1 CE1 LIG H 1 149.312 62.966 59.151 0.00 0.00 HAB1
ATOM 2 NE2 LIG H 1 149.090 62.236 58.086 0.00 0.00 HAB1
ATOM 3 CD2 LIG H 1 149.032 63.041 57.027 0.00 0.00 HAB1
.
.
.
ATOM 4348 ZN ZN2 H 1 75.992 22.818 51.215 0.00 0.00 HAC1
ATOM 4349 ZN ZN2 H 1 136.623 30.663 5.219 0.00 0.00 HAC1
ATOM 4350 ZN ZN2 H 1 133.548 35.851 2.791 0.00 0.00 HAC1
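To make that caveat concrete (a made-up line, not from the real file): here the value of $7 also occurs earlier in the record, so sub() replaces the wrong occurrence:
$ echo 'ATOM 1 2.818 X 0.00 0.00 2.818' | awk '{sub($7, "99.999")}1'
ATOM 1 99.999 X 0.00 0.00 2.818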
I have file 1 with 5778 lines and 15 columns.
Sample from output_armlympho.txt:
NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
I have another file with 25958247 lines and 16 columns.
Sample from file1:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16
1 chr1_10000044_A_T_b38 ENS 171773 29 30 0.02 0.33 0.144 0.14 chr1 10000044 A T chr1 10060102
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645
I want to merge these files together so that the second and third columns of file 1 (CHROM, POS) exactly match the 15th and 16th columns of file 2 (column15, column16). One problem is that column15 has the format chr[number], e.g. chr1, while file 1 has just 1, so I need a way to match 1 to chr1 (or 7 to chr7) as well as by position. There may also be repeated lines in file 2, i.e. rows with the same values in column15 and column16. The files aren't ordered in the same way.
Expected output (all the columns from both files):
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
Current attempt:
awk 'NR==FNR {Tmp[$3] = $16 FS $4; next} ($16 in Tmp) {print $0 FS Tmp[$16]}' output_armlympho.txt file1 > test
Assumptions:
within the file output_armlympho.txt, the combination of the 2nd and 3rd columns is unique
One awk idea:
awk '
FNR==1 { if (header) print $0,header; else header=$0; next }  # file 1: save header; file 2: print joined header
FNR==NR { lines["chr" $2,$3]=$0; next }                       # index file 1 rows by "chr"CHROM and POS
($15,$16) in lines { print $0, lines[$15,$16] }               # append the matching file 1 row
' output_armlympho.txt file1
This generates:
column1 column2 column3 column4 column5 column6 column7 column8 column9 column10 column11 column12 column13 column14 column15 column16 NUMBER CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE L95 U95 T_STAT P
2 chr7_10000044_A_T_b38 ENS -58627 29 30 0.024 0.26 0.16 0.15 chr7 10000044 A T chr7 18054785 42486 7 18054785 rs1572792:18054785:G:A G A A ADD 1597 -0.0126002 0.0256229 -0.0628202
4 chr1_10000044_A_T_b38 ENS 89708 29 30 0.0 0.03 -0.0 0.038 chr1 10000044 A T chr1 18054638 42485 1 18054638 rs35592535:18054638:GC:G GC G G ADD 1597 0.0138673 0.0269643 -0.0389816 0.0667163 0.514286 0.607124
5 chr1_10000044_A_T_b38 ENS -472482 29 30 0.02 0.16 0.11 0.07 chr1 10000044 A T chr1 18052645 42484 1 18052645 rs260514:18052645:G:A G A G ADD 1597 0.0147047 0.0656528 -0.113972 0.143382 0.223977 0.822804
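If you also need the rows of file1 that have no match in output_armlympho.txt (a left join), a small variation of the same idea should work; this is an untested sketch, with NA as an assumed placeholder for missing matches:
awk '
FNR==1 { if (header) print $0,header; else header=$0; next }
FNR==NR { lines["chr" $2,$3]=$0; next }
{ key = $15 SUBSEP $16; print $0, (key in lines ? lines[key] : "NA") }
' output_armlympho.txt file1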
I have a table like this:
Var score
1 1.00
1 1.06
1 1.03
1 0.65
1 0.68
2 1.06
2 1.07
2 0.64
2 1.05
3 0.71
3 0.72
3 1.03
4 0.68
4 1.08
5 0.11
I want to convert this into a matrix like:
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
I tried awk, but it keeps running:
awk '{if(NF>max) max=NF} END{while(getline<"file"){for(i=NF+1;i<=max;i++)$i="0";print}}'
It keeps running because you forgot to pass it the file name, so awk reads from standard input and waits for you to type something on the keyboard. Use awk '...' file, not just awk '...'. But even with this error fixed, it will not work as you expect.
You don't need to read the file twice. You can build your matrix in a single pass and populate the missing cells in the END block (tested with GNU and BSD awk):
awk 'NR > 1 {                                # skip the header line
    num[$1] += 1                             # count of scores seen so far for this Var
    mat[$1, $1 + num[$1]] = mat[$1 + num[$1], $1] = $2   # fill both symmetric cells
    n = num[$1] > n ? num[$1] : n            # track the matrix dimension
}
END {
    n += 1
    mat[0, 0] = ""                           # empty top-left corner
    for(i = 1; i <= n; i += 1) {
        mat[0, i] = mat[i, 0] = i            # row and column labels
        mat[i, i] = "0.00"                   # zeros on the diagonal
    }
    for(i = 0; i <= n; i += 1)
        for(j = 0; j <= n; j += 1)
            printf("%s%s", mat[i, j], j == n ? "\n" : "\t")
}' file
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
While implementing an edge-preserving filter similar to ImageJ's Kuwahara filter, which assigns each pixel the mean of the surrounding area with the smallest deviation, I'm struggling with performance issues.
Counterintuitively, computing the means and deviations into separate matrices is fast compared to the final re-sorting that compiles the output array. The ImageJ implementation above seems to spend about 70% of total processing time on this step, though.
Given two arrays, means and stds, whose sizes are two kernel sizes p bigger than the output array res in each axis, I want to assign each pixel the mean of the area with the smallest deviation:
#vector to middle of surrounding area (approx.)
p2 = p/2
# x and y components of vector to each quadrant
index2quadrant = np.array([[1, 1, -1, -1],[1, -1, 1, -1]]) * p2
I iterate over all pixels of the output array of shape (asize, asize):
for col in np.arange(asize) + p:
    for row in np.arange(asize) + p:
Inside the loops, I search for the minimum standard deviation among the 4 quadrants around the current coordinate and use the corresponding index to assign the previously computed mean:
        minidx = np.argmin(stds[index2quadrant[0] + col, index2quadrant[1] + row])
        # assign mean
        res[col - p, row - p] = means[index2quadrant[:, minidx][0] + col, index2quadrant[:, minidx][1] + row]
The Python profiler gives the following results for filtering a 1024x1024 array with an 8x8 pixel kernel:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 30.024 30.024 <string>:1(<module>)
1048576 2.459 0.000 4.832 0.000 fromnumeric.py:740(argmin)
1 23.720 23.720 30.024 30.024 kuwahara.py:4(kuwahara)
2 0.000 0.000 0.012 0.006 numeric.py:65(zeros_like)
2 0.000 0.000 0.000 0.000 {math.log}
1048576 2.373 0.000 2.373 0.000 {method 'argmin' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.012 0.006 0.012 0.006 {method 'fill' of 'numpy.ndarray' objects}
8256 0.189 0.000 0.189 0.000 {method 'mean' of 'numpy.ndarray' objects}
16512 0.524 0.000 0.524 0.000 {method 'reshape' of 'numpy.ndarray' objects}
8256 0.730 0.000 0.730 0.000 {method 'std' of 'numpy.ndarray' objects}
1042 0.012 0.000 0.012 0.000 {numpy.core.multiarray.arange}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
2 0.000 0.000 0.000 0.000 {numpy.core.multiarray.empty_like}
2 0.003 0.002 0.003 0.002 {numpy.core.multiarray.zeros}
8 0.002 0.000 0.002 0.000 {zip}
To me this gives little indication of where the time is lost (inside numpy?), since apart from argmin, the reported times seem negligible.
Do you have any suggestions, how to improve performance?
I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, plus the two lines after H. How can I do this using awk?
The expected output would be
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.
Ignoring the blank line after B in your output (your problem specifications give no indication as to why that blank line is in the output, so I'm assuming it should not be there):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print all lines that start with B and each line that starts with H along with the next two lines.
This matches the header lines starting with B or H, plus the numeric data rows (lines beginning with optional blanks and a digit):
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile
bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Or, in short:
awk '{if($0 ~ /^[BH ]/ ) print;}' t
Or, even shorter:
awk '/^[BH ]/' t
If H and B aren't the only headers that appear before tabular data, and you intend to omit those other blocks of data (you don't specify the requirements fully), you have to use a flip-flop to remember whether you're currently in a block you want to keep:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt
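For example, with a hypothetical extra block under a header X that should be dropped (made-up input, just to show the flip-flop in action):
$ printf 'B 456\nX 9\n7.1 2.2\nH A B\n1.1 2.2\n' | awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }'
B 456
H A B
1.1 2.2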
awk '/^(B|H| )/' filename.txt > output.txt