calculating atom distances in PDB files - awk

I have this pdb file and I want to calculate the distance between the atom 7 and 8 ($2) with the atoms 12,14,15,17 and 18. If the distance is lower than 5 angstrons, the value should be printed
ATOM 1 N ASN p 140 38.455 18.232 -3.207 1.00 7.39 N
ATOM 2 CA ASN p 140 37.856 18.151 -4.534 1.00 7.91 C
ATOM 3 C ASN p 140 38.700 18.848 -5.595 1.00 10.75 C
ATOM 4 O ASN p 140 39.797 19.271 -5.313 1.00 9.25 O
ATOM 5 CB ASN p 140 36.435 18.715 -4.446 1.00 7.62 C
ATOM 6 CG ASN p 140 35.556 17.898 -3.501 1.00 6.82 C
ATOM 7 OD1 ASN p 140 35.269 18.315 -2.323 1.00 8.53 O
ATOM 8 ND2 ASN p 140 35.197 16.691 -3.945 1.00 5.41 N
TER 9 ASN 140
HETATM 10 C 08H p 1 29.121 15.727 -1.182 1.00 5.89 C
HETATM 11 C 08H p 1 29.763 16.230 -0.040 1.00 5.86 C
HETATM 12 N 08H p 1 31.023 16.810 -0.046 1.00 6.15 N
HETATM 13 C 08H p 1 31.533 17.872 0.633 1.00 6.24 C
HETATM 14 N 08H p 1 32.815 18.037 0.299 1.00 6.83 N
HETATM 15 N 08H p 1 33.151 17.112 -0.526 1.00 7.37 C
HETATM 16 C 08H p 1 32.058 16.349 -0.758 1.00 7.06 C
HETATM 17 O 08H p 1 31.956 15.215 -1.730 1.00 8.15 O
HETATM 18 N 08H p 1 30.979 15.691 -2.746 1.00 10.31 N
HETATM 19 C 08H p 1 29.651 15.777 -2.509 1.00 6.71 C
HETATM 20 O HOH p 170 34.699 19.032 2.134 1.00 6.42 O
Based on a similar script, I wrote this code
# usage: awk -f test.awk structure.pdb
BEGIN{print "asparagine and ligand in the structure..."; ORS=""}
$1=="ATOM" && $3~"ND2|OD1" && $4=="ASN" || $1=="HETATM" && $12~"N|O" && $4!~"HOH" {
print $2,$3,$4,$6"\n"
atm_x[$2]=$7; atm_y[$2]=$8; atm_z[$2]=$9
}
END{ ORS="\n"
for (key1 in atm_x) { list=list" "key1
for (key2 in atm_x) {
if (index(list, key2) != 0 ) continue
dx=atm_x[key1]-atm_x[key2]
dy=atm_y[key1]-atm_y[key2]
dz=atm_z[key1]-atm_z[key2]
distance=sqrt(dx^2+dy^2+dz^2)
if (distance < 5 && distance != 0 ) {
i++
candidate[i]=key1"-"key2": "distance
}
}
}
print "\nCandidates ..."
for (keys in candidate) {print candidate[keys]}
}
when I run this script I get the following result
asparagine and ligand in the structure...
7 OD1 ASN 140
8 ND2 ASN 140
12 N 08H 1
14 N 08H 1
17 O 08H 1
18 N 08H 1
Candidates ...
7-8: 2.2964
7-14: 3.60198
7-17: 4.57576
8-17: 4.19391
8-18: 4.49768
12-14: 2.19905
12-17: 2.50007
12-18: 2.92303
14-17: 3.58028
14-18: 4.25989
17-18: 1.48774
The problem is that I don't want to print the distances when the atoms have the same residue name ($4). I'm new to awk and was wondering what's the best way to handle this. Any suggestions would be appreciated!!

awk '
($1=="ATOM" && ($3=="ND2" || $3=="OD1") && $4=="ASN") || \
($1=="HETATM" && ($12=="N" || $12 =="O") && $4!="HOH") {
atom[$2] = 1
x[$2] = $7
y[$2] = $8
z[$2] = $9
name[$2] = $4
}
END {
for (a in atom) {
for (b in atom) {
if (a > b && name[a] != name[b]) {
dist = sqrt((x[a]-x[b])^2 + (y[a]-y[b])^2 + (z[a]-z[b])^2)
if (dist < 5)
printf "%s-%s: %.4f\n", a, b, dist
}
}
}
}
' pdbfile
7-17: 4.5758
7-14: 3.6020
8-17: 4.1939
8-18: 4.4977

Related

Counting max and min per row across columns and outputting associated column names

I'm trying to count both max and min (except 0s) per row across columns and outputting associated column names.
I'm trying this:
BEGIN{OFS="\t"}
NR==1{print $1,$2,"ref","max","ref","min";
for(i=3;i<=6;++i)BASES[i]=$(i);
}
NR>1{l=1;basemax=BASES[3];basemin=BASES[3]; max=$3; min=$3;
for(i=4;i<=6;++i){
if($i>max){basemax=BASES[i];max=$i;}
else if($i==max){basemax=basemax","BASES[i];++l}
}
for(i=4;i<=6;++i){
if($i<min && $i !=0){basemmin=BASES[i];mim=$i}
else if($i==min){basemin=basemin","BASES[i];++l}
}
print $1,$2,basemax,max,basemin,min
}
In a input that looks like this
chr pos C T A G
NC_044998.1 3732 22 0 7 0
NC_044998.1 3733 22 0 0 0
NC_044998.1 3734 22 3 3 0
NC_044998.1 3735 22 0 0 3
NC_044998.1 3736 0 7 22 3
NC_044998.1 3737 0 0 0 25
NC_044998.1 3738 22 7 0 0
NC_044998.1 3739 7 3 22 25
NC_044998.1 3740 0 22 22 0
NC_044998.1 3741 22 0 0 0
The desired output is
chr pos ref max ref min
NC_044998.1 3732 C 22 A 7
NC_044998.1 3733 C 22 C 22
NC_044998.1 3734 C 22 T,A 3
NC_044998.1 3735 C 22 G 3
NC_044998.1 3736 A 22 G 3
NC_044998.1 3737 G 25 G 25
NC_044998.1 3738 C 22 C 22
NC_044998.1 3739 G 25 C 7
NC_044998.1 3740 T,A 22 T,A 22
NC_044998.1 3741 C 22 C 22
But it outputs this instead
chr pos ref max ref min
NC_044998.1 3732 C 22 C 22
NC_044998.1 3733 C 22 C 22
NC_044998.1 3734 C 22 C 22
NC_044998.1 3735 C 22 C 22
NC_044998.1 3736 A 22 C 0
NC_044998.1 3737 G 25 C,T,A 0
NC_044998.1 3738 C 22 C 22
NC_044998.1 3739 G 25 C 7
NC_044998.1 3740 T 22 C,A,G 0
NC_044998.1 3741 C 22 C 22
With your shown samples, please try following awk code. Written and tested in GNU awk.
awk -v startField="3" -v endField="6" '
BEGIN{ OFS="\t"; print "chr pos ref max ref min"}
FNR==1{
for(i=startField;i<=endField;i++){
heading[i]=$i
}
next
}
{
min=max2=maxInd2=minInd=max=maxInd=minAllInd=maxAllInd=maxAllInd2=""
for(i=startField;i<=endField;i++){
if($i!=0){
minInd=(min>$i?i:(min==$i?minInd","i:(minInd!=""?minInd:i)))
min=(min>$i?$i:(min!=""?min:$i))
}
maxInd=(max<$i?i:(max==$i?maxInd","i:(maxInd!=""?maxInd:i)))
max=(max<$i?$i:(max!=""?max:$i))
}
for(i=startField+1;i<=endField;i++){
maxInd2=(max2<$i?i:(max2==$i?maxInd2","i:(maxInd2!=""?maxInd2:i)))
max2=(max2<$i?$i:(max2!=""?max2:$i))
}
num1=split(maxInd,arr1,",")
num2=split(minInd,arr2,",")
num3=split(maxInd2,arr3,",")
if(num1>1){
for(k=1;k<=num1;k++){
maxAllInd = (maxAllInd?maxAllInd ",":"") heading[arr1[k]]
}
}
else{
maxAllInd = heading[maxInd]
}
if(num2>1){
for(k=1;k<=num2;k++){
minAllInd = (minAllInd?minAllInd ",":"") heading[arr2[k]]
}
}
else{
minAllInd = heading[minInd]
}
if(num3>1){
for(k=1;k<=num3;k++){
maxAllInd2 = (maxAllInd2?maxAllInd2 ",":"") heading[arr3[k]]
}
}
else{
maxAllInd2 = heading[maxInd2]
}
if(startField>1){
NF=(startField-1)
if(min !=0 ){
print $0,maxAllInd,max,minAllInd,min
}
if(min == 0 && max2 != 0){
print $0,maxAllInd,max,maxAllInd2,max2
}
if(min == 0 && max2 == 0){
print $0,maxAllInd,max,maxAllInd,max
}
}
else{
if(min !=0 ){
print maxAllInd,max,minAllInd,min
}
if(min == 0 && max2 != 0){
print maxAllInd,max,maxAllInd2,max2
}
if(min == 0 && max2 == 0){
print maxAllInd,max,maxAllInd,max
}
}
}
' Input_file
This awk script should work for you:
cat maxmin.awk
NR == 1 {
for (i=b; i<=NF; ++i)
hdr[i] = $i
print $1, $2, "ref", "max", "ref", "min"
next
}
{
for (i=b; i<=NF; ++i) {
max = ($i > max ? $i : max)
min = ($i && (min == "" || $i < min) ? $i : min)
}
for (i=b; i<=NF; ++i) {
if ($i == min)
rmin = (rmin ? rmin "," : "") hdr[i]
if ($i == max)
rmax = (rmax ? rmax "," : "") hdr[i]
}
print $1, $2, rmax, max, rmin, min
max = min = rmax = rmin = ""
}
And use it as:
awk -v b=3 -f maxmin.awk gg | column -t
chr pos ref max ref min
NC_044998.1 3732 C 22 A 7
NC_044998.1 3733 C 22 C 22
NC_044998.1 3734 C 22 T,A 3
NC_044998.1 3735 C 22 G 3
NC_044998.1 3736 A 22 G 3
NC_044998.1 3737 G 25 G 25
NC_044998.1 3738 C 22 T 7
NC_044998.1 3739 G 25 T 3
NC_044998.1 3740 T,A 22 T,A 22
NC_044998.1 3741 C 22 C 22
column -t has been used for tabular output only.
You have some typos in variable names such as basemmin and mim.
If the count of C is 0, the min value has no chance to be updated.
You can combine the two for loops into one.
The variable l is not used.
Then would you please try the following:
awk -v OFS="\t" '
NR==1 {
print $1, $2, "ref", "max", "ref", "min"
for (i = 3; i <= 6; i++) bases[i] = $i
}
NR>1 {
basemax = bases[3]; basemin = bases[3]; max = $3; min = $3
for (i = 4; i <= 6; i++) {
if ($i > max) {basemax = bases[i]; max = $i}
else if ($i == max) {basemax = basemax "," bases[i]}
if ($i < min && $i != 0 || min == 0) {basemin = bases[i]; min = $i}
else if ($i == min) {basemin = basemin "," bases[i]}
}
print $1, $2, basemax, max, basemin, min
}' input_file
Output:
chr pos ref max ref min
NC_044998.1 3732 C 22 A 7
NC_044998.1 3733 C 22 C 22
NC_044998.1 3734 C 22 T,A 3
NC_044998.1 3735 C 22 G 3
NC_044998.1 3736 A 22 G 3
NC_044998.1 3737 G 25 G 25
NC_044998.1 3738 C 22 T 7
NC_044998.1 3739 G 25 T 3
NC_044998.1 3740 T,A 22 T,A 22
NC_044998.1 3741 C 22 C 22
Please note the output slightly differs from your desired output, which may contain typos.

For each unique occurrence in field, transform each unique occurrence in another field in a different column

I have a file
splice_region_variant,intron_variant A1CF 1
3_prime_UTR_variant A1CF 18
intron_variant A1CF 204
downstream_gene_variant A1CF 22
synonymous_variant A1CF 6
missense_variant A1CF 8
5_prime_UTR_variant A2M 1
stop_gained A2M 1
missense_variant A2M 15
splice_region_variant,intron_variant A2M 2
synonymous_variant A2M 2
upstream_gene_variant A2M 22
intron_variant A2M 308
missense_variant A4GNT 1
intron_variant A4GNT 21
5_prime_UTR_variant A4GNT 3
3_prime_UTR_variant A4GNT 7
This file is sorted by $2
for each occurrence of an unique element in $2, I wanna transform in a column each unique occurrence of an element in $1, with corresponding value in $3, or 0 if the record is not there. So that I have:
splice_region_variant,intron_variant 3_prime_UTR_variant intron_variant downstream_gene_variant synonymous_variant missense_variant 5_prime_UTR_variant stop_gained upstream_gene_variant
A1CF 1 18 204 22 6 8 0 0 0
A2M 2 0 308 0 2 15 1 1 22
A4GNT 0 7 21 0 0 22 3 0 0
test file:
a x 2
b,c x 4
dd x 3
e,e,t x 5
a b 1
cc b 2
e,e,t b 1
This is what I'm getting:
a b,c dd e,e,t cc
x 5 2 4 3
b 1 2 1
EDIT: This might be doing it but doesn't output 0s in blank fields
'BEGIN {FS = OFS = "\t"}
NR > 1 {data[$2][$1] = $3; blocks[$1]}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
# header
printf "gene"
for (block in blocks) {
printf "%s%s", OFS, block
}
print ""
# data
for (ts in data) {
printf "%s", ts
for (block in blocks) {
printf "%s%s", OFS, data[ts][block]
}
print ""
}
}' file
modified from https://unix.stackexchange.com/questions/424642/dynamic-transposing-rows-to-columns-using-awk-based-on-row-value
If you want to print 0 if a certain value is absent, you could do something like this:
val = data[ts][block] ? data[ts][block] : 0;
printf "%s%s", OFS, val

SQL query is not working (Error in rsqlite_send_query)

This is what the head of my data frame looks like
> head(d19_1)
SMZ SIZ1_diff SIZ1_base SIZ2_diff SIZ2_base SIZ3_diff SIZ3_base SIZ4_diff SIZ4_base SIZ5_diff SIZ5_base
1 1 -620 4170 -189 1347 -35 2040 82 1437 244 1533
2 2 -219 831 -57 255 -4 392 8 282 14 297
3 3 -426 834 -162 294 -134 379 -81 241 -22 221
4 4 -481 676 -142 216 -114 267 -50 158 -43 166
5 5 -233 1711 -109 584 54 913 71 624 74 707
6 6 -322 1539 -79 512 -50 799 23 532 63 576
Total_og Total_base %_SIZ1 %_SIZ2 %_SIZ3 %_SIZ4 %_SIZ5 Total_og Total_base
1 11980 12648 14.86811 14.03118 1.715686 5.706333 15.916504 11980 12648
2 2156 2415 26.35379 22.35294 1.020408 2.836879 4.713805 2156 2415
3 1367 2314 51.07914 55.10204 35.356201 33.609959 9.954751 1367 2314
4 790 1736 71.15385 65.74074 42.696629 31.645570 25.903614 790 1736
5 5339 5496 13.61777 18.66438 5.914567 11.378205 10.466761 5339 5496
6 4362 4747 20.92268 15.42969 6.257822 4.323308 10.937500 4362 4747
The datatype of the data frame is as below str(d19_1)
> str(d19_1)
'data.frame': 1588 obs. of 20 variables:
$ SMZ : int 1 2 3 4 5 6 7 8 9 10 ...
$ SIZ1_diff : int -620 -219 -426 -481 -233 -322 -176 -112 -34 -103 ...
$ SIZ1_base : int 4170 831 834 676 1711 1539 720 1396 998 1392 ...
$ SIZ2_diff : int -189 -57 -162 -142 -109 -79 -12 72 -36 -33 ...
$ SIZ2_base : int 1347 255 294 216 584 512 196 437 343 479 ...
$ SIZ3_diff : int -35 -4 -134 -114 54 -50 16 4 26 83 ...
$ SIZ3_base : int 2040 392 379 267 913 799 361 804 566 725 ...
$ SIZ4_diff : int 82 8 -81 -50 71 23 36 127 46 75 ...
$ SIZ4_base : int 1437 282 241 158 624 532 242 471 363 509 ...
$ SIZ5_diff : int 244 14 -22 -43 74 63 11 143 79 125 ...
$ SIZ5_base : int 1533 297 221 166 707 576 263 582 429 536 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
$ %_SIZ1 : num 14.9 26.4 51.1 71.2 13.6 ...
$ %_SIZ2 : num 14 22.4 55.1 65.7 18.7 ...
$ %_SIZ3 : num 1.72 1.02 35.36 42.7 5.91 ...
$ %_SIZ4 : num 5.71 2.84 33.61 31.65 11.38 ...
$ %_SIZ5 : num 15.92 4.71 9.95 25.9 10.47 ...
$ Total_og : int 11980 2156 1367 790 5339 4362 2027 4715 3465 4561 ...
$ Total_base: int 12648 2415 2314 1736 5496 4747 2168 4464 3278 4375 ...
When I run the below query, it is returning me the below error and I don't know why. I don't have any column in table
Query
d20_1 <- sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')
Error:
Error in rsqlite_send_query(conn#ptr, statement) :
table d19_1 has no column named <NA>
Your code works correctly for me:
d19_1 <- structure(list(SMZ = 1:6, SIZ1_diff = c(-620L, -219L, -426L,
-481L, -233L, -322L), SIZ1_base = c(4170L, 831L, 834L, 676L,
1711L, 1539L), SIZ2_diff = c(-189L, -57L, -162L, -142L, -109L,
-79L), SIZ2_base = c(1347L, 255L, 294L, 216L, 584L, 512L), SIZ3_diff = c(-35L,
-4L, -134L, -114L, 54L, -50L), SIZ3_base = c(2040L, 392L, 379L,
267L, 913L, 799L), SIZ4_diff = c(82L, 8L, -81L, -50L, 71L, 23L
), SIZ4_base = c(1437L, 282L, 241L, 158L, 624L, 532L), SIZ5_diff = c(244L,
14L, -22L, -43L, 74L, 63L), SIZ5_base = c(1533L, 297L, 221L,
166L, 707L, 576L), Total_og = c(11980L, 2156L, 1367L, 790L, 5339L,
4362L), Total_base = c(12648L, 2415L, 2314L, 1736L, 5496L, 4747L
), X._SIZ1 = c(14.86811, 26.35379, 51.07914, 71.15385, 13.61777,
20.92268), X._SIZ2 = c(14.03118, 22.35294, 55.10204, 65.74074,
18.66438, 15.42969), X._SIZ3 = c(1.715686, 1.020408, 35.356201,
42.696629, 5.914567, 6.257822), X._SIZ4 = c(5.706333, 2.836879,
33.609959, 31.64557, 11.378205, 4.323308), X._SIZ5 = c(15.916504,
4.713805, 9.954751, 25.903614, 10.466761, 10.9375), Total_og.1 = c(11980L,
2156L, 1367L, 790L, 5339L, 4362L), Total_base.1 = c(12648L, 2415L,
2314L, 1736L, 5496L, 4747L)), .Names = c("SMZ", "SIZ1_diff",
"SIZ1_base", "SIZ2_diff", "SIZ2_base", "SIZ3_diff", "SIZ3_base",
"SIZ4_diff", "SIZ4_base", "SIZ5_diff", "SIZ5_base", "Total_og",
"Total_base", "X._SIZ1", "X._SIZ2", "X._SIZ3", "X._SIZ4", "X._SIZ5",
"Total_og.1", "Total_base.1"), row.names = c(NA, -6L), class = "data.frame")
library(sqldf)
sqldf('SELECT *, CASE
WHEN SMZ BETWEEN 1 AND 110 THEN "Baltimore City"
WHEN SMZ BETWEEN 111 AND 217 THEN "Anne Arundel County"
WHEN SMZ BETWEEN 218 AND 405 THEN "Baltimore County"
WHEN SMZ BETWEEN 406 AND 453 THEN "Carroll County"
WHEN SMZ BETWEEN 454 AND 524 THEN "Harford County"
WHEN SMZ BETWEEN 1667 AND 1674 THEN "York County"
ELSE 0
END Jurisdiction
FROM d19_1')

What is the fastest way for adding the vector elements horizontally in odd order?

According to this question I implemented the horizontal addition this time 5 by 5 and 7 by 7. It does the job correctly but it is not fast enough.
Can it be faster than what it is? I tried to use hadd and other instruction but the improvement is restricted. For examlple, when I use _mm256_bsrli_epi128 it is slightly better but it needs some extra permutation that ruins the benefit because of the lanes. So the question is how it should be implemented to gain more performance. The same story is for 9 elements, etc.
This adds 5 elements horizontally and puts the results in places 0, 5, and 10:
//it put the results in places 0, 5, and 10
inline __m256i _mm256_hadd5x5_epi16(__m256i a )
{
__m256i a1, a2, a3, a4;
a1 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 1 * 2);
a2 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 2 * 2);
a3 = _mm256_bsrli_epi128(a2, 2);
a4 = _mm256_bsrli_epi128(a3, 2);
return _mm256_add_epi16(_mm256_add_epi16(_mm256_add_epi16(a1, a2), _mm256_add_epi16(a3, a4)) , a );
}
And this adds 7 elements horizontally and puts the results in places 0 and 7:
inline __m256i _mm256_hadd7x7_epi16(__m256i a )
{
__m256i a1, a2, a3, a4, a5, a6;
a1 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 1 * 2);
a2 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 2 * 2);
a3 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 3 * 2);
a4 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 4 * 2);
a5 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 5 * 2);
a6 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 6 * 2);
return _mm256_add_epi16(_mm256_add_epi16(_mm256_add_epi16(a1, a2), _mm256_add_epi16(a3, a4)) , _mm256_add_epi16(_mm256_add_epi16(a5, a6), a ));
}
Indeed it is possible calculate these sums with less instructions. The idea is to accumulate
the partial sums not only in columns 10, 5 and 0, but also in other columns. This reduces the number of
vpaddw instructions and the number of 'shuffles' compared to your solution.
#include <stdio.h>
#include <x86intrin.h>
/* gcc -O3 -Wall -m64 -march=haswell hor_sum5x5.c */
int print_vec_short(__m256i x);
int print_10_5_0_short(__m256i x);
__m256i _mm256_hadd5x5_epi16(__m256i a );
__m256i _mm256_hadd7x7_epi16(__m256i a );
int main() {
short x[16];
for(int i=0; i<16; i++) x[i] = i+1; /* arbitrary initial values */
__m256i t0 = _mm256_loadu_si256((__m256i*)x);
__m256i t2 = _mm256_permutevar8x32_epi32(t0,_mm256_set_epi32(0,7,6,5,4,3,2,1));
__m256i t02 = _mm256_add_epi16(t0,t2);
__m256i t3 = _mm256_bsrli_epi128(t2,4); /* byte shift right */
__m256i t023 = _mm256_add_epi16(t02,t3);
__m256i t13 = _mm256_srli_epi64(t02,16); /* bit shift right */
__m256i sum = _mm256_add_epi16(t023,t13);
printf("t0 = ");print_vec_short(t0 );
printf("t2 = ");print_vec_short(t2 );
printf("t02 = ");print_vec_short(t02 );
printf("t3 = ");print_vec_short(t3 );
printf("t023= ");print_vec_short(t023);
printf("t13 = ");print_vec_short(t13 );
printf("sum = ");print_vec_short(sum );
printf("\nVector elements of interest: columns 10, 5, 0:\n");
printf("t0 [10, 5, 0] = ");print_10_5_0_short(t0 );
printf("t2 [10, 5, 0] = ");print_10_5_0_short(t2 );
printf("t02 [10, 5, 0] = ");print_10_5_0_short(t02 );
printf("t3 [10, 5, 0] = ");print_10_5_0_short(t3 );
printf("t023[10, 5, 0] = ");print_10_5_0_short(t023);
printf("t13 [10, 5, 0] = ");print_10_5_0_short(t13 );
printf("sum [10, 5, 0] = ");print_10_5_0_short(sum );
printf("\nSum with _mm256_hadd5x5_epi16(t0)\n");
sum = _mm256_hadd5x5_epi16(t0);
printf("sum [10, 5, 0] = ");print_10_5_0_short(sum );
/* now the sum of 7 elements: */
printf("\n\nSum of short ints 13...7 and short ints 6...0:\n");
__m256i t = _mm256_loadu_si256((__m256i*)x);
t0 = _mm256_permutevar8x32_epi32(t0,_mm256_set_epi32(3,6,5,4,3,2,1,0));
t0 = _mm256_and_si256(t0,_mm256_set_epi16(0xFFFF,0,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF, 0,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF,0xFFFF));
__m256i t1 = _mm256_alignr_epi8(t0,t0,2);
__m256i t01 = _mm256_add_epi16(t0,t1);
__m256i t23 = _mm256_alignr_epi8(t01,t01,4);
__m256i t0123 = _mm256_add_epi16(t01,t23);
__m256i t4567 = _mm256_alignr_epi8(t0123,t0123,8);
__m256i sum08 = _mm256_add_epi16(t0123,t4567); /* all elements are summed, but another permutation is needed to get the answer at position 7 */
sum = _mm256_permutevar8x32_epi32(sum08,_mm256_set_epi32(4,4,4,4,4,0,0,0));
printf("t = ");print_vec_short(t );
printf("t0 = ");print_vec_short(t0 );
printf("t1 = ");print_vec_short(t1 );
printf("t01 = ");print_vec_short(t01 );
printf("t23 = ");print_vec_short(t23 );
printf("t0123 = ");print_vec_short(t0123 );
printf("t4567 = ");print_vec_short(t4567 );
printf("sum08 = ");print_vec_short(sum08 );
printf("sum = ");print_vec_short(sum );
printf("\nSum with _mm256_hadd7x7_epi16(t) (the answer is in column 0 and in column 7)\n");
sum = _mm256_hadd7x7_epi16(t);
printf("sum = ");print_vec_short(sum );
return 0;
}
inline __m256i _mm256_hadd5x5_epi16(__m256i a )
{
__m256i a1, a2, a3, a4;
a1 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 1 * 2);
a2 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 2 * 2);
a3 = _mm256_bsrli_epi128(a2, 2);
a4 = _mm256_bsrli_epi128(a3, 2);
return _mm256_add_epi16(_mm256_add_epi16(_mm256_add_epi16(a1, a2), _mm256_add_epi16(a3, a4)) , a );
}
inline __m256i _mm256_hadd7x7_epi16(__m256i a )
{
__m256i a1, a2, a3, a4, a5, a6;
a1 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 1 * 2);
a2 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 2 * 2);
a3 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 3 * 2);
a4 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 4 * 2);
a5 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 5 * 2);
a6 = _mm256_alignr_epi8(_mm256_permute2x128_si256(a, _mm256_setzero_si256(), 0x31), a, 6 * 2);
return _mm256_add_epi16(_mm256_add_epi16(_mm256_add_epi16(a1, a2), _mm256_add_epi16(a3, a4)) , _mm256_add_epi16(_mm256_add_epi16(a5, a6), a ));
}
int print_vec_short(__m256i x){
short int v[16];
_mm256_storeu_si256((__m256i *)v,x);
printf("%4hi %4hi %4hi %4hi | %4hi %4hi %4hi %4hi | %4hi %4hi %4hi %4hi | %4hi %4hi %4hi %4hi \n",
v[15],v[14],v[13],v[12],v[11],v[10],v[9],v[8],v[7],v[6],v[5],v[4],v[3],v[2],v[1],v[0]);
return 0;
}
int print_10_5_0_short(__m256i x){
short int v[16];
_mm256_storeu_si256((__m256i *)v,x);
printf("%4hi %4hi %4hi \n",v[10],v[5],v[0]);
return 0;
}
The output is:
$ ./a.out
t0 = 16 15 14 13 | 12 11 10 9 | 8 7 6 5 | 4 3 2 1
t2 = 2 1 16 15 | 14 13 12 11 | 10 9 8 7 | 6 5 4 3
t02 = 18 16 30 28 | 26 24 22 20 | 18 16 14 12 | 10 8 6 4
t3 = 0 0 2 1 | 16 15 14 13 | 0 0 10 9 | 8 7 6 5
t023= 18 16 32 29 | 42 39 36 33 | 18 16 24 21 | 18 15 12 9
t13 = 0 18 16 30 | 0 26 24 22 | 0 18 16 14 | 0 10 8 6
sum = 18 34 48 59 | 42 65 60 55 | 18 34 40 35 | 18 25 20 15
Vector elements of interest: columns 10, 5, 0:
t0 [10, 5, 0] = 11 6 1
t2 [10, 5, 0] = 13 8 3
t02 [10, 5, 0] = 24 14 4
t3 [10, 5, 0] = 15 10 5
t023[10, 5, 0] = 39 24 9
t13 [10, 5, 0] = 26 16 6
sum [10, 5, 0] = 65 40 15
Sum with _mm256_hadd5x5_epi16(t0)
sum [10, 5, 0] = 65 40 15
Sum of short ints 13...7 and short ints 6...0:
t = 16 15 14 13 | 12 11 10 9 | 8 7 6 5 | 4 3 2 1
t0 = 8 0 14 13 | 12 11 10 9 | 0 7 6 5 | 4 3 2 1
t1 = 9 8 0 14 | 13 12 11 10 | 1 0 7 6 | 5 4 3 2
t01 = 17 8 14 27 | 25 23 21 19 | 1 7 13 11 | 9 7 5 3
t23 = 21 19 17 8 | 14 27 25 23 | 5 3 1 7 | 13 11 9 7
t0123 = 38 27 31 35 | 39 50 46 42 | 6 10 14 18 | 22 18 14 10
t4567 = 39 50 46 42 | 38 27 31 35 | 22 18 14 10 | 6 10 14 18
sum08 = 77 77 77 77 | 77 77 77 77 | 28 28 28 28 | 28 28 28 28
sum = 77 77 77 77 | 77 77 77 77 | 77 77 28 28 | 28 28 28 28
Sum with _mm256_hadd7x7_epi16(t) (the answer is in column 0 and in column 7)
sum = 16 31 45 58 | 70 81 91 84 | 77 70 63 56 | 49 42 35 28

AWK - Select lines to print according to score

I have a tab-separated file containing a series of lemmas with associated scores.
The file contains 5 columns, the first column is the lemma and the third is the one that contains the score. What I need to do is print the line as it is, when lemma is not repeated and print the line with the highest score when lemma is repeated.
IN
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 05 14 89 malo
díj 06a 2 101 malo
díj 06b 2 101 malo
díj 07 90 13 bueno
díj 08a 2 101 malo
díj 08b 2 101 malo
egér 06a 66 5 bueno
fonal 05 12 1 bueno
fonal 07 52 4 bueno
Desired output
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 07 90 13 bueno
egér 06a 66 5 bueno
fonal 07 52 4 malo
What I have done. But it only works when the lemma is repeated once.
BEGIN {
OFS=FS="\t";
flag="";
}
{
id=$1;
if (id != flag)
{
if (line != "")
{
sub("^;","",line);
z=split(line,A,";");
if ((A[3] > A[8]) && (A[8] != ""))
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
else if ((A[8] > A[3]) && (A[8] != ""))
{
print A[6]"\t"A[7]"\t"A[8]"\t"A[9]"\t"A[10]
}
else
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
}
delete line;
flag=id;
}
line[$1]=line[$1]";"$2";"$3";"$4";"$5;
}
END {
line=line ";"$1";"$2";"$3";"$4";"$5
sub("^;","",line);
z=split(line,A,";");
if ((A[3] > A[8]) && (A[8] != ""))
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5];
}
else if ((A[8] > A[3]) && (A[8] != ""))
{
print A[6]"\t"A[7]"\t"A[8]"\t"A[9]"\t"A[10]
}
else
{
print A[1]"\t"A[2]"\t"A[3]"\t"A[4]"\t"A[5]
}
}
This one doesn't require the file to be sorted by lemma, but, it keeps all the lines to be printed in memory (one for each lemma) so may not be appropriate for a file with millions of different lemmas.
It also does not respect the order of the original file.
Finally, it assumes that all scores are non-negative!
$ cat lemma.awk
BEGIN { FS = OFS = "\t" }
NR == 1 { print }
NR > 1 {
if ($3 > score[$1]) {
score[$1] = $3
line[$1] = $0
}
}
END { for (lemma in line) print line[lemma] }
$ awk -f lemma.awk lemma.txt
Lemma --- Score --- ---
cserép 06a 55 6 bueno
díj 07 90 13 bueno
fonal 07 52 4 bueno
darázs 05 38 1 bueno
egér 06a 66 5 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
Tested with gnu awk:
prevLemma != $1 {
if( prevLemma ) {
print line;
}
prevLemma = $1;
prevScore = $3;
line = $0;
}
prevLemma == $1 { if( prevScore < $3 ) {
prevScore = $3;
line = $0;
}
}
END { print line;}
assumption is: the file is sorted by lemma
when the lemma changes (or at the very beginning when the var is empty) the lemma, score and line are saved
when the lemma changes (or in the END), the line for the previous lemma is printed
when the current line belongs to the same lemma and has a higher score the values are saved again
$ cat tst.awk
$1 != prev { printf "%s", maxLine; maxLine=""; max=$3; prev=$1 }
$3 >= max { max=$3; maxLine=$0 ORS }
END { printf "%s", maxLine }
$ awk -f tst.awk file
Lemma --- Score --- ---
cserép 06a 55 6 bueno
darázs 05 38 1 bueno
dél 06a 34 1 bueno
dér 06a 29 1 bueno
díj 07 90 13 bueno
egér 06a 66 5 bueno
fonal 07 52 4 bueno
Use a script:
if ($1 != $5) print $0
else
{
score($NR) = $3
print $0
}
Actually , this might be better done with perl.