selecting highest value in a table - awk

I have a file with 3 tab-separated columns. I want to select the highest value in the 3rd column among rows that share the same name in the 1st column, and print it as a 4th column.
Input file:
A l 10
A l 2
A m 6
B l 12
B m 13
B n 7
C l 9
C l 8
C n 19
Output file:
A l 10 10
A l 2 10
A m 6 10
B l 12 13
B m 13 13
B n 7 13
C l 9 19
C l 8 19
C n 19 19
Could you please suggest an awk or sed command? Thanks.

You can use this awk command:
awk 'FNR==NR {arr[$1]=arr[$1]>$3?arr[$1]:$3;next} {print $0,arr[$1]}' OFS="\t" file{,}
A l 10 10
A l 2 10
A m 6 10
B l 12 13
B m 13 13
B n 7 13
C l 9 19
C l 8 19
C n 19 19
This makes two passes over the file: the first to find the highest value per group, the second to print it.
The file{,} brace expansion duplicates the filename; you can also write file file instead.
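For comparison, here is a minimal Python sketch of the same two-pass idea, assuming the same tab-separated input file named file:
best = {}
with open('file') as f:                      # pass 1: highest value per group
    for line in f:
        name, _, value = line.split('\t')
        best[name] = max(best.get(name, float('-inf')), int(value))
with open('file') as f:                      # pass 2: append the group maximum
    for line in f:
        print(line.rstrip('\n') + '\t' + str(best[line.split('\t')[0]]))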

Related

pandas: get top n including the duplicates of a sorted column

I have some data like the table below, which is sorted by the score column and then by the cat column:
score cat
18 B
18 A
17 A
16 B
16 A
15 B
14 B
13 A
12 A
10 B
9 B
I want to get the top 5 scores including the duplicates, and also add the rank, i.e.:
rank score cat
1 18 B
1 18 A
2 17 A
3 16 B
3 16 A
4 15 B
5 14 B
How can I get this using pandas?
Since the data frame is already ordered, try factorize:
df['rnk'] = df.score.factorize()[0]+1
out = df[df['rnk'] <= 5]
out
score cat rnk
0 18 B 1
1 18 A 1
2 17 A 2
3 16 B 3
4 16 A 3
5 15 B 4
6 14 B 5
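If the frame were not already sorted, a dense rank should give the same result; here is a small sketch using pandas' built-in ranking (same df as above):
df['rnk'] = df['score'].rank(method='dense', ascending=False).astype(int)
out = df[df['rnk'] <= 5]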

Merging multiple files with null values using AWK

Sorry, I am posting it again as I messed up in my earlier post:
I am interested in joining multiple files (e.g., file1, file2, file3, ...) on matching values in column 1 to get the desired output below. Would appreciate any help:
file1:
A 2 3 4
B 3 7 8
C 4 6 9
file2:
A 7 6 3
C 2 4 7
D 1 6 4
file3:
A 3 2 7
B 4 7 3
E 3 6 8
Output:
A 2 3 4 7 6 3 3 2 7
B 3 7 8 n n n 4 7 3
C 4 6 9 2 4 7 n n n
D n n n 1 6 4 n n n
E n n n n n n 3 6 8
Here is one in awk, tested with GNU awk, mawk, original-awk (i.e. awk 20121220) and Busybox awk:
$ awk '
# return a string of c "n " placeholders
function nn(c, b,i) {
    if(c)
        for(i=1;i<=c;i++)
            b=b "n "
    return b
}
FNR==1{nf+=(NF-1)}                # running total of data columns seen so far
{
    for(i=2;i<=NF;i++)            # collect this record's data fields
        b[$1]=b[$1] $i OFS
    # pad with "n" for any earlier files that lacked this key, then append
    a[$1]=a[$1] (n[$1]<(nf-NF+1)?nn(nf-NF+1-n[$1]):"") b[$1]
    n[$1]=nf+0                    # this key is now filled through column nf
    delete b[$1]
}
END{
    for(i in a)                   # pad keys missing from the trailing file(s)
        print i,a[i] (n[i]<(nf)?nn(nf-n[i]):"")
}' file1 file2 file3
Output:
A 2 3 4 7 6 3 3 2 7
B 3 7 8 n n n 4 7 3
C 4 6 9 2 4 7 n n n
D n n n 1 6 4 n n n
E n n n n n n 3 6 8
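Here is a rough Python sketch of the same pad-as-you-join idea, assuming whitespace-separated files whose first column is the key and whose remaining column count is fixed within each file (filenames mirror the question):
rows, widths = {}, []
for path in ['file1', 'file2', 'file3']:
    width = 0
    for line in open(path):
        key, *fields = line.split()   # key plus this file's data columns
        width = len(fields)
        rows.setdefault(key, {})[path] = fields
    widths.append((path, width))
for key in sorted(rows):
    out = [key]
    for path, width in widths:        # pad files that lack this key with "n"
        out += rows[key].get(path, ['n'] * width)
    print(' '.join(out))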

Join information from table A with the information from another table B multiple times

This may be a simple question, but I would like to learn whether it can be done in one query.
Table A: contains gene information
gene start end
1 a 5 0
2 b 6 1
3 c 7 2
4 d 8 3
5 e 9 4
6 f 10 5
7 g 11 6
8 h 12 7
9 i 13 8
10 j 14 9
Table B: contains calculated gene information.
gene1 gene2 cor
1 d j -0.7600805
2 c i 0.4274278
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The result table I'm trying to get is:
gene1 gene2 cor start1 end1 start2 end2
1 d j -0.7600805 8 3 14 9
2 c i 0.4274278 7 2 13 8
3 e g -0.9249361
4 a f 0.8567928
5 b h -0.3018518
6 d j -0.3723553 etc.
7 c i 0.1617981
8 e g 0.8575933
9 a f 0.8409788
10 b h 0.1506035
The method I can think of is to join table A onto table B twice, first by gene1 and then by gene2, which would require an intermediate table. Is there a simpler way to achieve this in one step?
Yes, two joins will do it.
You simply need to do this:
SELECT b.Gene1
,b.Gene2
,b.cor
,a1.Start AS Start1
,a1.End AS End1
,a2.Start AS Start2
,a2.End AS End2
FROM TableB b
INNER JOIN TableA a1
ON a1.Gene = b.Gene1
INNER JOIN TableA a2
ON a2.Gene = b.Gene2
Depending on your DBMS, you may need to tweak the syntax a bit.
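The same double join can be sketched in pandas as well; tableA and tableB below are just stand-ins for the question's tables:
import pandas as pd
tableA = pd.DataFrame({'gene': list('abcdefghij'),
                       'start': range(5, 15), 'end': range(0, 10)})
tableB = pd.DataFrame({'gene1': ['d', 'c'], 'gene2': ['j', 'i'],
                       'cor': [-0.7600805, 0.4274278]})
# rename A's columns per role so each merge contributes start1/end1 or start2/end2
out = (tableB
       .merge(tableA.rename(columns={'gene': 'gene1', 'start': 'start1', 'end': 'end1'}), on='gene1')
       .merge(tableA.rename(columns={'gene': 'gene2', 'start': 'start2', 'end': 'end2'}), on='gene2'))
print(out)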

Extracting Sequential Pattern

Can anyone help me write a script to extract sequential lines?
I was able to find and get a script working to create all the permutations of the given inputs, but that's not what I need.
awk 'function perm(p,s, i) {
    for(i=1;i<=n;i++)
        if(p==1)
            printf "%s%s\n",s,A[i]
        else
            perm(p-1,s A[i]", ")
}
{
    A[++n]=$1
}
END{
    perm(n)
}' infile
Unfortunately, I don't understand the script well enough to make a modification (not for lack of trying).
I need to extract sequential line/word patterns of length 2 to 5.
An illustration of what I need is as follows:
inputfile.txt:
A
B
C
D
E
F
G
outputfile.txt:
A B
B C
C D
D E
E F
F G
A B C
B C D
C D E
D E F
E F G
A B C D
B C D E
C D E F
D E F G
A B C D E
B C D E F
C D E F G
Here's a Python answer.
General algorithm:
Load all letters into a list.
For n = 2..5, where n is the size of the "window", slide that window over the list and print those n characters.
Python is nice for this because of list slicing.
with open('input.txt') as f_in, open('output.txt', 'w') as f_out:
    chars = f_in.read().splitlines()
    for n in range(2, 6):
        for start_window in range(len(chars) - n + 1):
            f_out.write(' '.join(chars[start_window:start_window + n]))
            f_out.write('\n')
awk to the rescue!
$ awk 'BEGIN{n=1}
       FNR==1{n++}          # each pass over the file grows the window size by one
       {a[c++]=$0; c=c%n}   # ring buffer holding the last n lines
       FNR>n-1{for(i=c;i<c+n-1;i++) printf "%s ",a[i%n];
               print}' file{,,,}
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
This makes multiple passes over the input file; the number of passes is the number of commas plus one. seq 9 was used as the input file here.
Another in awk:
{ a[NR]=$0 }                        # slurp all lines into an array
END {
    o[0]=ORS
    for(i=2;i<=5;i++)               # window sizes 2..5
        for(j=1;j<=length(a);j++) {
            # separator is ORS after the i:th item of a window, OFS otherwise
            printf "%s%s", a[j], (++k==i?o[k=0]:OFS)
            if(!k&&j!=length(a)) j-=(i-1)   # step back to slide the window by one
        }
}
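A compact Python alternative for the same sliding windows, assuming the whole file fits in memory and is named input.txt as in the first answer:
with open('input.txt') as f:
    lines = f.read().split()
for n in range(2, 6):                 # window sizes 2..5
    # zip n staggered copies of the list to slide a window of size n
    for window in zip(*(lines[i:] for i in range(n))):
        print(' '.join(window))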

Dedup using HiveQL

I have a Hive table with fields 'a' (int), 'b' (string), 'c' (bigint), 'd' (bigint) and 'e' (string).
I have data like:
a b c d e
---------------
1 a 10 18 i
2 b 11 19 j
3 c 12 20 k
4 d 13 21 l
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
The table is sorted on key 'b'.
Now we want output like below:
a b c d e
---------------
1 e 14 22 m
4 f 15 23 n
2 g 16 24 o
3 h 17 25 p
which is deduplicated on key 'a' but keeps the last (latest) 'b'.
Is this possible using a Hive query (HiveQL)?
If column b is unique, try the following HQL:
select *
from (
    select max(b) as max_b
    from table
    group by a
) table1
join table on table1.max_b = table.b
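For intuition, here is the same keep-the-last dedup sketched in pandas; df stands in for the Hive table, with b assumed to sort in insertion order:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 1, 4, 2, 3],
                   'b': list('abcdefgh'),
                   'c': range(10, 18), 'd': range(18, 26),
                   'e': list('ijklmnop')})
# sort by b, then keep the last row per key a
out = df.sort_values('b').drop_duplicates('a', keep='last')
print(out)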