My workflow produces files containing simple tables with a two-line header (see the end of the post). I can sort a single table by number using:
(head -n 2 && tail -n +3 | sort -n -r) < file > ordered.txt
That works fine, but I don't know how to split the file so that I can sort every table and print them all into ONE file. My approach is:
awk '/^TARGET/ {(head -n 2 && tail -n +3 | sort -n -r) >> ordered.txt}' output.txt
However, this causes an error message, since a shell pipeline cannot appear inside an awk action. I want to avoid any intermediate output files. What is missing in my awk command?
The input files look like this:
TARGET 1
Sample1 Sample2 Sample3 Pattern
3 3 3 z..........................Z........................................z.........Z...z
147 171 49 Z..........................Z........................................Z.........Z...Z
27 28 13 z..........................Z........................................z.........z...z
75 64 32 Z..........................Z........................................Z.........z...Z
TARGET 2
Sample1 Sample2 Sample3 Pattern
2 0 1 z..........................z........................................z.........Z...Z
21 21 7 z..........................Z........................................Z.........Z...Z
1 0 0 ...........................Z........................................Z.............Z
4 8 6 Z..........................Z........................................z.........Z...z
2 0 1 Z..........................Z........................................Z.........Z....
1 0 0 z..........................Z........................................Z.............Z
1 0 0 z...................................................................Z.........Z...Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
1 0 0 z..........................Z........................................z.............z
1 3 0 z..........................z........................................Z.........Z...Z
1 1 0 Z..........................Z........................................Z.............z
1 0 0 Z..........................Z........................................Z.............Z
0 1 2 ...........................Z........................................Z.........Z...Z
0 0 1 z..........................z........................................z..............
My output should look like this - no line may be dropped:
TARGET 1
Sample1 Sample2 Sample3 Pattern
147 171 49 Z..........................Z........................................Z.........Z...Z
75 64 32 Z..........................Z........................................Z.........z...Z
27 28 13 z..........................Z........................................z.........z...z
3 3 3 z..........................Z........................................z.........Z...z
TARGET 2
Sample1 Sample2 Sample3 Pattern
21 21 7 z..........................Z........................................Z.........Z...Z
4 8 6 Z..........................Z........................................z.........Z...z
2 0 1 z..........................z........................................z.........Z...Z
2 0 1 z..........................z........................................z.........Z...Z
1 0 0 ...........................Z........................................Z.............Z
1 0 0 ...........................Z........................................Z.............Z
1 0 0 ...........................Z........................................Z.............Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
0 1 2 ...........................Z........................................Z.........Z...Z
0 0 1 z..........................z........................................z..............
This requires GNU awk for its array-traversal sorting. As written it sorts ascending; use "@val_num_desc" instead to get the descending order shown in the desired output:
gawk '
BEGIN {PROCINFO["sorted_in"] = "@val_num_asc"}
function output_table() {
    for (key in table) print table[key]
    delete table
    i = 0
}
/TARGET/ {output_table(); print; getline; print; next}
/^$/ {output_table(); print; next}
{table[++i] = $0}
END {output_table()}
' file
outputs
TARGET 1
Sample1 Sample2 Sample3 Pattern
3 3 3 z..........................Z........................................z.........Z...z
27 28 13 z..........................Z........................................z.........z...z
75 64 32 Z..........................Z........................................Z.........z...Z
147 171 49 Z..........................Z........................................Z.........Z...Z
TARGET 2
Sample1 Sample2 Sample3 Pattern
1 0 0 ...........................Z........................................Z.............Z
1 0 0 z...................................................................Z.........Z...Z
1 0 0 z..........................Z........................................Z.............Z
2 0 1 Z..........................Z........................................Z.........Z....
2 0 1 z..........................z........................................z.........Z...Z
4 8 6 Z..........................Z........................................z.........Z...z
21 21 7 z..........................Z........................................Z.........Z...Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
0 0 1 z..........................z........................................z..............
0 1 2 ...........................Z........................................Z.........Z...Z
1 0 0 Z..........................Z........................................Z.............Z
1 0 0 z..........................Z........................................z.............z
1 1 0 Z..........................Z........................................Z.............z
1 3 0 z..........................z........................................Z.........Z...Z
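If gawk is not available, the same split-and-sort can be sketched in Python. This is my own illustration, not part of the answers above; it assumes each TARGET line is followed by exactly one header line, as in the sample, and sorts descending as in the desired output:

```python
def sort_tables(lines):
    """Per-table descending numeric sort on column 1, keeping each
    TARGET line and the header line that follows it in place."""
    out, block = [], []
    header_next = False

    def flush():
        # Stable sort, so rows with equal keys keep their input order
        # and no record is dropped.
        block.sort(key=lambda row: int(row.split()[0]), reverse=True)
        out.extend(block)
        block.clear()

    for line in lines:
        if line.startswith("TARGET"):
            flush()                 # emit the previous table's rows first
            out.append(line)
            header_next = True
        elif header_next:
            out.append(line)        # the "Sample1 ..." header line
            header_next = False
        else:
            block.append(line)      # data row, buffered until table ends
    flush()
    return out
```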
This is a bit of a mess, but assuming you don't want to lose records when you sort, this should work (requires gawk for asort):
awk 'function sortit(){
    x=asort(a)
    for(i=1;i<=x;i++)print b[a[i]" "d[a[i]]++]
    delete a; delete b; delete c; delete d
}
/^[0-9]/{a[$0]=$1;b[$1" "c[$1]++]=$0}
/TARGET/{sortit();print;getline;print;next}
!NF{sortit();print}
END{sortit()}' file
I have a database with around 120,000 entries, and I need to do prefix comparisons (where ... like 'test%') for an autocomplete function. The database won't change.
I have a column called "relevance", and I want my search results ordered by relevance DESC. I noticed that as soon as I add "ORDER BY relevance DESC" to my queries, the execution time increases by about 100% - since my queries already take around 100 ms on average, this causes significant lag.
Does it make sense to re-order the whole database by relevance once so I can remove the ORDER BY? Can I be certain that when searching through the table with SQL it will always go through the database in the order that I added the rows?
This is what my query looks like right now:
select *
from hao2_dict
where definitions like 'ba%'
or searchable_pinyin like 'ba%'
ORDER BY relevance DESC
LIMIT 100
UPDATE: For context, here are some time measurements:
With an index on (relevance DESC), the search term 'b%' takes 50 ms, which is faster than without the index. But the search term 'banana%' takes over 1700 ms, which is far slower than without the index. These are the results from EXPLAIN:
b%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 b% 0
29 String8 0 5 0 b% 0
30 Goto 0 1 0 0
banana%:
0 Init 0 27 0 0
1 Noop 1 11 0 0
2 Integer 100 1 0 0
3 OpenRead 0 5 0 9 0
4 OpenRead 2 4223 0 k(2,-,) 0
5 Rewind 2 26 2 0 0
6 DeferredSeek 2 0 0 0
7 Column 0 6 4 0
8 Function 1 3 2 like(2) 0
9 If 2 13 0 0
10 Column 0 4 6 0
11 Function 1 5 2 like(2) 0
12 IfNot 2 25 1 0
13 IdxRowid 2 7 0 0
14 Column 0 1 8 0
15 Column 0 2 9 0
16 Column 0 3 10 0
17 Column 0 4 11 0
18 Column 0 5 12 0
19 Column 0 6 13 0
20 Column 0 7 14 0
21 Column 2 0 15 0
22 RealAffinity 15 0 0 0
23 ResultRow 7 9 0 0
24 DecrJumpZero 1 26 0 0
25 Next 2 6 0 1
26 Halt 0 0 0 0
27 Transaction 0 0 10 0 1
28 String8 0 3 0 banana% 0
29 String8 0 5 0 banana% 0
30 Goto 0 1 0 0
Can I be certain, that when searching through the table with SQL it will always go through the database in the order that I added the rows?
No. SQL results have no inherent order. They might come out in the order you inserted them, but there is no guarantee.
Instead, put an index on the column. Indexes keep their values in order.
However, this only deals with the sorting. The query above still has to search the whole table for rows with matching definitions and searchable_pinyin values. In general, SQL will use only one index per table at a time; trying to use two is usually inefficient. So you need one multi-column index to let this query avoid scanning the whole table while returning results in sorted order. Make sure relevance comes first: the index columns must be in the same order as your ORDER BY.
An index on (relevance, definitions, searchable_pinyin) will make that query use only the index for searching and sorting. Adding (relevance, searchable_pinyin) as well will handle searching by definitions, searchable_pinyin, or both.
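A minimal sqlite3 sketch of that suggestion (the table and column names come from the question, but the data and index name are made up; whether SQLite actually picks the index for a given query should be checked with EXPLAIN QUERY PLAN):

```python
import sqlite3

# In-memory copy of the question's table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hao2_dict
    (definitions TEXT, searchable_pinyin TEXT, relevance REAL)""")
conn.executemany("INSERT INTO hao2_dict VALUES (?, ?, ?)",
                 [("banana", "ba1na4", 0.9),
                  ("bar", "ba3", 0.5),
                  ("cat", "mao1", 0.7)])

# Multi-column index with relevance first, matching the ORDER BY.
conn.execute("""CREATE INDEX idx_rel_def_pin ON hao2_dict
    (relevance DESC, definitions, searchable_pinyin)""")

rows = conn.execute("""SELECT definitions, relevance FROM hao2_dict
    WHERE definitions LIKE 'ba%' OR searchable_pinyin LIKE 'ba%'
    ORDER BY relevance DESC LIMIT 100""").fetchall()
# rows come back ordered by relevance, highest first
```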
I have appended multiple dataframes to form a single dataframe. Each dataframe had multiple rows assigned to a specific ID, so after appending, the big dataframe has multiple rows with the same ID. I would like to assign new IDs.
Current Dataframe:
Index name groupid
0 Abc 0
1 cvb 0
2 sdf 0
3 ksh 1
4 kjl 1
5 lmj 2
6 hyb 2
0 khf 0
1 uyt 0
2 tre 1
3 awe 1
4 uys 2
5 asq 2
6 lsx 2
Desired Output:
Index name groupid new_id
0 Abc 0 0
1 cvb 0 0
2 sdf 0 0
3 ksh 1 1
4 kjl 1 1
5 lmj 2 2
6 hyb 2 2
7 khf 0 3
8 uyt 0 3
9 tre 1 4
10 awe 1 4
11 uys 2 5
12 asq 2 5
13 lsx 2 5
You would have to use a slightly modified version of groupby:
df['new_id'] = df.groupby(df['groupid'].ne(df['groupid'].shift()).cumsum(), sort=False).ngroup()
Output is:
Index name groupid new_id
0 0 Abc 0 0
1 1 cvb 0 0
2 2 sdf 0 0
3 3 ksh 1 1
4 4 kjl 1 1
5 5 lmj 2 2
6 6 hyb 2 2
7 0 khf 0 3
8 1 uyt 0 3
9 2 tre 1 4
10 3 awe 1 4
11 4 uys 2 5
12 5 asq 2 5
13 6 lsx 2 5
See previous answer for reference.
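A self-contained version of that snippet, using the question's sample data:

```python
import pandas as pd

# The question's data after the append; the duplicate 0..6 index is
# irrelevant here, so a default index is used.
df = pd.DataFrame({
    "name": ["Abc", "cvb", "sdf", "ksh", "kjl", "lmj", "hyb",
             "khf", "uyt", "tre", "awe", "uys", "asq", "lsx"],
    "groupid": [0, 0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2, 2],
})

# A new block starts wherever groupid differs from the previous row;
# cumsum() numbers those blocks, and ngroup() relabels them 0, 1, 2, ...
blocks = df["groupid"].ne(df["groupid"].shift()).cumsum()
df["new_id"] = df.groupby(blocks, sort=False).ngroup()
```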
The following code:
import pandas as pd
df_original = pd.DataFrame({
    'race_num': [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'race_position': [2, 3, 0, 1, 0, 0, 2, 3, 0],
    'percentage_place': [77, 55, 88, 50, 34, 56, 99, 12, 75]
})
Gives an output of:
race_num race_position percentage_place
1 2 77
1 3 55
1 0 88
2 1 50
2 0 34
2 0 56
2 2 99
3 3 12
3 0 75
I need to manipulate this dataframe to keep race_num grouped but sort percentage_place in descending order - and race_position is to stay aligned with its original percentage_place.
Desired out is:
race_num race_position percentage_place
1 0 88
1 2 77
1 3 55
2 2 99
2 0 56
2 1 50
2 0 34
3 0 75
3 3 12
My attempt is:
df_new = df_1.groupby(['race_num','race_position'])['percentage_place'].nlargest().reset_index()
Thank you in advance.
Look into sort_values
In [137]: df_original.sort_values(['race_num', 'percentage_place'], ascending=[True, False])
Out[137]:
race_num race_position percentage_place
2 1 0 88
0 1 2 77
1 1 3 55
6 2 2 99
5 2 0 56
3 2 1 50
4 2 0 34
8 3 0 75
7 3 3 12
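A reproducible version of that call, with a reset_index chained on in case a clean 0..n index is wanted (the reset_index is my addition, not part of the answer above):

```python
import pandas as pd

df_original = pd.DataFrame({
    "race_num": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "race_position": [2, 3, 0, 1, 0, 0, 2, 3, 0],
    "percentage_place": [77, 55, 88, 50, 34, 56, 99, 12, 75],
})

# Sort race_num ascending and percentage_place descending; whole rows
# move together, so race_position stays aligned with its percentage_place.
df_new = (df_original
          .sort_values(["race_num", "percentage_place"],
                       ascending=[True, False])
          .reset_index(drop=True))
```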
For example, the original data file
file.org :
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Insert three data points (0) at the top of column 2, shifting the existing values down.
The output file should look like this
file.out :
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Please help.
The following awk will do the trick:
awk -v n=3 '{a[NR]=$2; $2=a[NR-n]+0}1' file
$ awk -v n=3 '{x=$2; $2=a[NR%n]+0; a[NR%n]=x} 1' file
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
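The idea behind these one-liners, sketched in Python for illustration (the function name and signature are my own): n zeros are pushed in at the top of the column, the existing values shift down, and the last n values fall off the end.

```python
def insert_zeros(rows, n=3, col=1):
    """Shift the values in column `col` down by n rows, filling the
    top with zeros; the last n values are dropped."""
    # Prepend n zeros to the column, then give row i the i-th value
    # of the shifted column.
    shifted = [0] * n + [row[col] for row in rows]
    return [row[:col] + [shifted[i]] + row[col + 1:]
            for i, row in enumerate(rows)]
```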
If you want to try Perl,
$ cat file.orig
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
$ perl -lane ' BEGIN { push(@t,0,0,0) } push(@t,$F[1]); $F[1]=shift @t; print join(" ",@F) ' file.orig
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
$
EDIT: Since the OP has edited the question, adding a solution as per the new question.
awk -v count=3 '++val<=count{a[val]=$2;$2=0} val>count{if(++re_count<=count){$2=a[re_count]}} 1' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Could you please try the following.
awk -v count=5 '
BEGIN{
OFS="\t"
}
$2{
val=(val?val ORS OFS:OFS)$2
$2=0
occ++
$1=$1
}
1
END{
while(++occ<=count){
print OFS 0
}
print val
}' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
0
0
2
7
12
I have 2 files that I need to merge based on column 3 (Pos), then find the matched positions and create the desired output below using awk. I would like the output to have 5 columns; the 5th column marks the positions common to both files with a rank number.
File1.txt
SNP-ID Chr Pos
rs62637813 1 52058
rs150021059 1 52238
rs4477212 1 52356
kgp15717912 1 53424
rs140052487 1 54353
rs9701779 1 56537
kgp7727307 1 56962
kgp15297216 1 72391
rs3094315 1 75256
rs3131972 1 75272
kgp6703048 1 75406
kgp22792200 1 75665
kgp15557302 1 75769
File2.txt:
SNP-ID Chr Pos Chip1
rs58108140 1 10583 1
rs189107123 1 10611 2
rs180734498 1 13302 3
rs144762171 1 13327 4
rs201747181 1 13957 5
rs151276478 1 13980 6
rs140337953 1 30923 7
rs199681827 1 46402 8
rs200430748 1 47190 9
rs187298206 1 51476 10
rs116400033 1 51479 11
rs190452223 1 51914 12
rs181754315 1 51935 13
rs185832753 1 51954 14
rs62637813 1 52058 15
rs190291950 1 52144 16
rs201374420 1 52185 17
rs150021059 1 52238 18
rs199502715 1 53234 19
rs140052487 1 54353 20
Desired output:
SNP-ID Chr Pos Chip1 Chip2
rs58108140 1 10583 1 0
rs189107123 1 10611 2 0
rs180734498 1 13302 3 0
rs144762171 1 13327 4 0
rs201747181 1 13957 5 0
rs151276478 1 13980 6 0
rs140337953 1 30923 7 0
rs199681827 1 46402 8 0
rs200430748 1 47190 9 0
rs187298206 1 51476 10 0
rs116400033 1 51479 11 0
rs190452223 1 51914 12 0
rs181754315 1 51935 13 0
rs185832753 1 51954 14 0
rs62637813 1 52058 15 1
rs190291950 1 52144 16 0
rs201374420 1 52185 17 0
rs150021059 1 52238 18 2
rs199502715 1 53234 19 0
rs140052487 1 54353 20 3
I don't quite understand what you mean by "rank", but this produces your desired output:
awk '
NR==FNR {pos[$3]=1; next}
FNR==1 {print $0, "Chip2"; next}
{print $0, ($3 in pos ? ++rank : 0)}
' File1.txt File2.txt | column -t
SNP-ID Chr Pos Chip1 Chip2
rs58108140 1 10583 1 0
rs189107123 1 10611 2 0
rs180734498 1 13302 3 0
rs144762171 1 13327 4 0
rs201747181 1 13957 5 0
rs151276478 1 13980 6 0
rs140337953 1 30923 7 0
rs199681827 1 46402 8 0
rs200430748 1 47190 9 0
rs187298206 1 51476 10 0
rs116400033 1 51479 11 0
rs190452223 1 51914 12 0
rs181754315 1 51935 13 0
rs185832753 1 51954 14 0
rs62637813 1 52058 15 1
rs190291950 1 52144 16 0
rs201374420 1 52185 17 0
rs150021059 1 52238 18 2
rs199502715 1 53234 19 0
rs140052487 1 54353 20 3
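For comparison, the same rank-marking logic sketched in pandas (my own sketch, not part of the answer above; the data is abbreviated from the question):

```python
import pandas as pd

# Abbreviated versions of File1.txt and File2.txt.
f1 = pd.DataFrame({"SNP-ID": ["rs62637813", "rs150021059"],
                   "Chr": [1, 1], "Pos": [52058, 52238]})
f2 = pd.DataFrame({"SNP-ID": ["rs58108140", "rs62637813", "rs150021059"],
                   "Chr": [1, 1, 1], "Pos": [10583, 52058, 52238],
                   "Chip1": [1, 2, 3]})

# Rows of File2 whose Pos appears in File1 get ranks 1, 2, ... in file
# order; all other rows get 0 (same logic as the awk ++rank above).
hit = f2["Pos"].isin(f1["Pos"])
f2["Chip2"] = hit.cumsum().where(hit, 0)
```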