How can I match two files on different fields using awk?

Hi, I have a big data set and I want to match two files on $5 of file 1 against $1 or $3 of file 2, printing the lines of file 1 that have a match in file 2. In addition, I want to append $5 and $6 of file 2 to each matched line of file 1.
file 1
7 81 1 47 32070
7 83 1 67 29446
7 92 1 84 28234
file 2
32070 0 0 19360101 HF 8 0 M C
28234 0 0 19350101 HF 8 0 M C
124332 0 0 19340101 HF 8 0 M C
29446 0 0 19340101 HF 8 0 M C
I would like to print like this
7 81 1 47 32070 HF 8
7 83 1 67 29446 HF 8
7 92 1 84 28234 HF 8

This awk one-liner should do the job:
awk 'NR==FNR{a[$1]=$5 FS $6;next}$0=$0 FS a[$NF]' f2 f1
A test on your example input files:
kent$ awk 'NR==FNR{a[$1]=$5 FS $6;next}$0=$0 FS a[$NF]' f2 f1
7 81 1 47 32070 HF 8
7 83 1 67 29446 HF 8
7 92 1 84 28234 HF 8
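The one-liner builds its lookup from $1 of file 2 only, while the question also mentions $3. As a rough sketch of the same two-pass idea in Python, with both columns indexed (the function name `join_files` and the "first row seen wins" rule are assumptions made for illustration, not part of the answer above):

```python
def join_files(file1_lines, file2_lines):
    """Append fields 5 and 6 of the matching file-2 row to each file-1 row."""
    lookup = {}
    for line in file2_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or short lines
        extra = fields[4:6]  # $5 and $6 in awk terms
        # Assumption: index by both $1 and $3; the first row seen for a key wins.
        for key in (fields[0], fields[2]):
            lookup.setdefault(key, extra)

    joined = []
    for line in file1_lines:
        fields = line.split()
        # Keep only file-1 rows whose last field ($5 here) has a match.
        if fields and fields[-1] in lookup:
            joined.append(" ".join(fields + lookup[fields[-1]]))
    return joined


file1 = ["7 81 1 47 32070", "7 83 1 67 29446", "7 92 1 84 28234"]
file2 = [
    "32070 0 0 19360101 HF 8 0 M C",
    "28234 0 0 19350101 HF 8 0 M C",
    "124332 0 0 19340101 HF 8 0 M C",
    "29446 0 0 19340101 HF 8 0 M C",
]
print("\n".join(join_files(file1, file2)))
```

Unlike the one-liner, this sketch drops file-1 rows that have no match, which is what the question asks for.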

iteration calculation based on another dataframe

How can I do an iterative calculation to produce df2, shown below as the desired output?
Any reference links would be welcome. Many thanks for helping.
df1
a b c
0 1 0 5
1 9 9 2
2 2 2 8
3 6 3 0
4 6 1 7
df2 :
a b c
0 1 0 5 >> values from df1
1 19 18 9 >> values from (df1.iloc[1] * 2) + (df2.iloc[0] * 1)
2 23 22 25 >> values from (df1.iloc[2] * 2) + (df2.iloc[1] * 1)
3 35 28 25 >> values from (df1.iloc[3] * 2) + (df2.iloc[2] * 1)
4 47 30 39 >> values from (df1.iloc[4] * 2) + (df2.iloc[3] * 1)
IIUC, you can try:
df2 = df1.mul(2).cumsum().sub(df1.iloc[0])
Output:
a b c
0 1 0 5
1 19 18 9
2 23 22 25
3 35 28 25
4 47 30 39
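Why the cumsum works: unrolling df2[n] = 2*df1[n] + df2[n-1] with df2[0] = df1[0] gives df2[n] = 2*(df1[0] + ... + df1[n]) - df1[0]. A small plain-Python check of that identity on column a (the helper names `by_recurrence` and `by_cumsum` are made up for illustration):

```python
from itertools import accumulate


def by_recurrence(values):
    """out[0] = values[0]; out[n] = 2 * values[n] + out[n - 1]."""
    out = [values[0]]
    for x in values[1:]:
        out.append(2 * x + out[-1])
    return out


def by_cumsum(values):
    """Closed form: twice the running sum, minus the first value."""
    return [2 * total - values[0] for total in accumulate(values)]


col_a = [1, 9, 2, 6, 6]  # column a of df1
print(by_recurrence(col_a))  # [1, 19, 23, 35, 47]
print(by_cumsum(col_a))      # [1, 19, 23, 35, 47]
```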
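More complex operation
If you want x[n] = x[n]*2 + x[n-1]*3, where x[n-1] is the previously computed output, you need to iterate:
def process(s):
    # s is one column of df1; the first value is kept unchanged
    out = [s[0]]
    for x in s[1:]:
        # current input times 2, plus previous output times 3
        out.append(x*2+out[-1]*3)
    return out
df1.apply(process)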
Output:
a b c
0 1 0 5
1 21 18 19
2 67 58 73
3 213 180 219
4 651 542 671

Group Data according to set of rows

The first column holds values from 5.0 to 12.0 (in increments of 0.5 per row), which I call a set; next to each value there are 5 columns of data.
The goal: for each set found (5.0 to 12.0), take the second column from every set and put them side by side in a single block, then do the same for each remaining column until the end of the file.
So the output will be blocks of 4 columns (1 key column + 3 columns from the 3 sets).
Input file
5.0 0 0 0 128 128
5.5 3 69 18 128 128
6.0 2 79 12 63 15
6.5 -1 75 11 28 24
7.0 0 73 12 32 27
7.5 -1 73 11 35 21

5.0 0 0 0 128 128
5.5 3 69 18 128 128
6.0 1 79 12 61 18
6.5 -1 75 11 28 25
7.0 0 73 11 31 28
7.5 -2 73 9 32 19

5.0 0 0 0 128 128
5.5 3 69 18 128 128
6.0 1 79 12 51 15
6.5 -2 74 12 26 22
7.0 -1 72 11 25 21
7.5 -1 71 11 28 12
Desired Output
5.0 0 0 0
5.5 3 3 3
6.0 2 1 1
6.5 -1 -1 -2
7.0 0 0 -1
7.5 -1 -2 -1
5.0 0 0 0
5.5 69 69 69
6.0 79 79 79
6.5 75 75 74
7.0 73 73 72
7.5 73 73 71
5.0 0 0 0
5.5 18 18 18
6.0 12 12 12
6.5 11 11 12
7.0 12 11 11
7.5 11 9 11
5.0 128 128 128
5.5 128 128 128
6.0 63 61 51
6.5 28 28 26
7.0 32 31 25
7.5 35 32 28
5.0 128 128 128
5.5 128 128 128
6.0 15 18 15
6.5 24 25 22
7.0 27 28 21
7.5 21 19 12
I have a solution in Perl, but I would like to do the same with awk.
Here is the code I have:
#!/usr/bin/perl
use strict;
use warnings;
my @id;
my %data;
my $set = 0;
while(<>) {
    unless (/^\d/) { ++$set; next }
    my ($id, @fields) = (split)[0..5];
    push @id, $id unless $set;
    for my $f (0..$#{fields}) {
        push @{$data{$id}[$f]}, $fields[$f];
    }
}
for my $t (0..4) {
    for my $i (@id) {
        my $p = shift @{$data{$i}};
        print join (",", $i, @{$p})."\n";
    }
    print "\n";
}
It would be great to get the output in CSV format.
Thanks in advance.
Using a multi-dimensional array in GNU awk:
awk 'FNR == 1 || !NF {++n} n==1{key[++k]=$1} {for (i=2; i<=NF; ++i) map[i-1][$1][n] = $i} END {for (j=1; j<=length(map); ++j) {for (i=1; i<=k; ++i) {printf "%s", key[i]; for (p=1; p<=n; ++p) printf "%s%s", OFS, map[j][key[i]][p]; print ""} print ""}}' OFS='\t' file
5.0 0 0 0
5.5 3 3 3
6.0 2 1 1
6.5 -1 -1 -2
7.0 0 0 -1
7.5 -1 -2 -1
5.0 0 0 0
5.5 69 69 69
6.0 79 79 79
6.5 75 75 74
7.0 73 73 72
7.5 73 73 71
5.0 0 0 0
5.5 18 18 18
6.0 12 12 12
6.5 11 11 12
7.0 12 11 11
7.5 11 9 11
5.0 128 128 128
5.5 128 128 128
6.0 63 61 51
6.5 28 28 26
7.0 32 31 25
7.5 35 32 28
5.0 128 128 128
5.5 128 128 128
6.0 15 18 15
6.5 24 25 22
7.0 27 28 21
7.5 21 19 12
A more readable form:
awk -v OFS='\t' '
FNR == 1 || !NF {++n}
n == 1 {key[++k] = $1}
{
    for (i=2; i<=NF; ++i)
        map[i-1][$1][n] = $i
}
END {
    for (j=1; j<=length(map); ++j) {
        for (i=1; i<=k; ++i) {
            printf "%s", key[i]
            for (p=1; p<=n; ++p)
                printf "%s%s", OFS, map[j][key[i]][p]
            print ""
        }
        print ""
    }
}' file
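For the CSV output the question asks for, the same regrouping can be sketched in Python. This is only an illustration: the function name `regroup` is hypothetical, and it assumes every set lists the same keys in column 1, so that the n-th occurrence of a key belongs to the n-th set.

```python
from collections import defaultdict


def regroup(lines):
    """Return one CSV block per value column; each row is key,set1,set2,..."""
    keys = []  # keys in first-seen order
    columns = defaultdict(lambda: defaultdict(list))  # columns[col][key] -> values
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank separator lines carry no data
        key, values = fields[0], fields[1:]
        if key not in keys:
            keys.append(key)
        for col, value in enumerate(values):
            columns[col][key].append(value)

    blocks = []
    for col in sorted(columns):
        rows = [",".join([key] + columns[col][key]) for key in keys]
        blocks.append("\n".join(rows))
    return blocks


sample = """5.0 0 0 0 128 128
5.5 3 69 18 128 128

5.0 0 0 0 128 128
5.5 3 69 18 128 128
""".splitlines()
print("\n\n".join(regroup(sample)))
```

Because the set number is derived from how often a key has been seen, the sketch does not depend on blank lines between sets.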

applying defined function over different ranges of rows in pandas

I have a df like this with tons of rows :
BB AA FF
2 5 0
3 7 A
6 5 A
9 6 A
8 3 0
And a function like this:
def test(a,b):
    # a = array from column AA
    # b = array from column BB
    return (a*b)+a
I would like to apply the function test to the rows where column FF is != 0, using the slices (arrays) of columns BB and AA for those rows, and to generate the following output in a new column ZZ:
BB AA FF ZZ
2 5 0 0
3 7 A 28
6 5 A 35
9 6 A 60
8 3 0 0
I was thinking of something like:
df['zz']= df.apply(lambda x: test(df.AA,df.BB) for the range of values among zero)
But my issue is that I am not sure how to specify the arrays selected by column FF so that the function is applied to them.
You can use DataFrame.apply + mask:
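def test(x):
    # x is one row holding the columns AA and BB; use labels, not positions
    return (x['AA']*x['BB'])+x['AA']
# compute ZZ for every row, then set it to 0 where FF is the string '0'
df['ZZ']=df[['AA','BB']].apply(test,axis=1).mask(df['FF'].eq('0'),0)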
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0
or you can use a lambda function:
df['ZZ']=df.apply(lambda x: x[['BB','AA']].prod()+ x['AA'] if x['FF'] != '0' else 0,axis=1)
print(df)
BB AA FF ZZ
0 2 5 0 0
1 3 7 A 28
2 6 5 A 35
3 9 6 A 60
4 8 3 0 0
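Since `test` is plain arithmetic, the row-wise `apply` can probably be avoided altogether. A sketch of a vectorized variant, assuming `FF` holds strings such as '0' and 'A' as in the sample (if it holds the integer 0, compare against 0 instead):

```python
import pandas as pd

df = pd.DataFrame({
    "BB": [2, 3, 6, 9, 8],
    "AA": [5, 7, 5, 6, 3],
    "FF": ["0", "A", "A", "A", "0"],
})

# Compute (AA * BB) + AA for every row at once, then keep the result only
# where FF is not '0'; all other rows get 0.
zz = df["AA"] * df["BB"] + df["AA"]
df["ZZ"] = zz.where(df["FF"].ne("0"), 0)
print(df)
```

On large frames this usually runs much faster than `apply(..., axis=1)`, because the arithmetic is done on whole columns.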

Pad column with n zeros and trim excess values

For example, the original data file
file.org :
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Insert three data points (0) in column 2,
The output file should look like this
file.out :
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Please help.
The following awk will do the trick:
awk -v n=3 '{a[NR]=$2; $2=a[NR-n]+0}1' file
A variant that stores only the last n values instead of the whole column:
$ awk -v n=3 '{x=$2; $2=a[NR%n]+0; a[NR%n]=x} 1' file
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
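The second command works like a ring buffer: slot NR%n holds the value seen n rows earlier. A sketch of the same idea in Python using a queue (the function name `shift_column` and its parameters are illustrative assumptions):

```python
from collections import deque


def shift_column(lines, col=2, n=3, fill="0"):
    """Shift column `col` (1-based) down by n rows, padding the top with `fill`."""
    pending = deque([fill] * n)  # values waiting to be written n rows later
    out = []
    for line in lines:
        fields = line.split()
        pending.append(fields[col - 1])      # remember this row's value
        fields[col - 1] = pending.popleft()  # write the value from n rows back
        out.append(" ".join(fields))
    return out


rows = [
    "1 2 3 4 5",
    "6 7 8 9 0",
    "11 12 13 14 15",
    "16 17 18 19 20",
    "21 22 23 24 25",
]
print("\n".join(shift_column(rows)))
```

The last n values of the column stay in `pending` and are never written, which matches the "trim excess values" part of the question.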
If you want to try Perl,
$ cat file.orig
1 2 3 4 5
6 7 8 9 0
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
$ perl -lane ' BEGIN { push(@t,0,0,0) } push(@t,$F[1]);$F[1]=shift @t; print join(" ",@F) ' file.orig
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
$
EDIT: Since the OP has edited the question, adding a solution for the new question.
awk -v count=3 '++val<=count{a[val]=$2;$2=0} val>count{if(++re_count<=count){$2=a[re_count]}} 1' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
16 2 18 19 20
21 7 23 24 25
Could you please try the following.
awk -v count=5 '
BEGIN{
    OFS="\t"
}
$2{
    val=(val?val ORS OFS:OFS)$2
    $2=0
    occ++
    $1=$1
}
1
END{
    while(++occ<=count){
        print OFS 0
    }
    print val
}' Input_file
Output will be as follows.
1 0 3 4 5
6 0 8 9 0
11 0 13 14 15
0
0
2
7
12

Ordering several tables in the same file using awk

In my workflow, files containing simple tables with a two-line header (see end of post) are created. I want to order these tables by number using:
(head -n 2 && tail -n +3 | sort -n -r) > ordered.txt
That works fine, but I don't know how to split the file so that I can order every table and print it in ONE file. My approach is:
awk '/^TARGET/ {(head -n 2 && tail -n +3 | sort -n -r) >> ordered.txt}' output.txt
However, this causes an error message. I want to avoid any intermediate output files. What is missing in my awk command?
The input files look like this:
TARGET 1
Sample1 Sample2 Sample3 Pattern
3 3 3 z..........................Z........................................z.........Z...z
147 171 49 Z..........................Z........................................Z.........Z...Z
27 28 13 z..........................Z........................................z.........z...z
75 64 32 Z..........................Z........................................Z.........z...Z
TARGET 2
Sample1 Sample2 Sample3 Pattern
2 0 1 z..........................z........................................z.........Z...Z
21 21 7 z..........................Z........................................Z.........Z...Z
1 0 0 ...........................Z........................................Z.............Z
4 8 6 Z..........................Z........................................z.........Z...z
2 0 1 Z..........................Z........................................Z.........Z....
1 0 0 z..........................Z........................................Z.............Z
1 0 0 z...................................................................Z.........Z...Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
1 0 0 z..........................Z........................................z.............z
1 3 0 z..........................z........................................Z.........Z...Z
1 1 0 Z..........................Z........................................Z.............z
1 0 0 Z..........................Z........................................Z.............Z
0 1 2 ...........................Z........................................Z.........Z...Z
0 0 1 z..........................z........................................z..............
My output should look like this, without dropping any line:
TARGET 1
Sample1 Sample2 Sample3 Pattern
147 171 49 Z..........................Z........................................Z.........Z...Z
75 64 32 Z..........................Z........................................Z.........z...Z
27 28 13 z..........................Z........................................z.........z...z
3 3 3 z..........................Z........................................z.........Z...z
TARGET 2
Sample1 Sample2 Sample3 Pattern
21 21 7 z..........................Z........................................Z.........Z...Z
4 8 6 Z..........................Z........................................z.........Z...z
2 0 1 z..........................z........................................z.........Z...Z
2 0 1 z..........................z........................................z.........Z...Z
1 0 0 ...........................Z........................................Z.............Z
1 0 0 ...........................Z........................................Z.............Z
1 0 0 ...........................Z........................................Z.............Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
1 0 0 z..........................Z........................................z.............z
0 1 2 ...........................Z........................................Z.........Z...Z
0 0 1 z..........................z........................................z..............
This requires GNU awk for the sorted array traversal:
gawk '
    BEGIN {PROCINFO["sorted_in"] = "@val_num_asc"}
    function output_table() {
        for (key in table) print table[key]
        delete table
        i=0
    }
    /TARGET/ {print; getline; print; next}
    /^$/ {output_table(); print; next}
    {table[++i] = $0}
    END {output_table()}
' file
outputs
TARGET 1
Sample1 Sample2 Sample3 Pattern
3 3 3 z..........................Z........................................z.........Z...z
27 28 13 z..........................Z........................................z.........z...z
75 64 32 Z..........................Z........................................Z.........z...Z
147 171 49 Z..........................Z........................................Z.........Z...Z
TARGET 2
Sample1 Sample2 Sample3 Pattern
1 0 0 ...........................Z........................................Z.............Z
1 0 0 z...................................................................Z.........Z...Z
1 0 0 z..........................Z........................................Z.............Z
2 0 1 Z..........................Z........................................Z.........Z....
2 0 1 z..........................z........................................z.........Z...Z
4 8 6 Z..........................Z........................................z.........Z...z
21 21 7 z..........................Z........................................Z.........Z...Z
TARGET 3
Sample1 Sample2 Sample3 Pattern
0 0 1 z..........................z........................................z..............
0 1 2 ...........................Z........................................Z.........Z...Z
1 0 0 Z..........................Z........................................Z.............Z
1 0 0 z..........................Z........................................z.............z
1 1 0 Z..........................Z........................................Z.............z
1 3 0 z..........................z........................................Z.........Z...Z
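This is a bit of a mess, but assuming you don't want to lose records when you sort, this should work (it needs GNU awk for asort):
awk 'function sortit(){
    # a holds the first-column values; asort puts them in ascending order
    x=asort(a)
    # d counts how many records with this value were already printed
    for(i=1;i<=x;i++)print b[a[i]" "d[a[i]]++]
    delete a;delete b;delete c;delete d
}
# index a by NR so that identical lines are not collapsed into one entry
/^[0-9]/{a[NR]=$1;b[$1" "c[$1]++]=$0}
/TARGET/{print;getline;print}
!NF{sortit();print}
END{sortit()}' file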
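Both answers sort in ascending order, whereas the `sort -n -r` in the question sorts descending. A sketch in Python that sorts each table by its first column, largest first, and keeps every row. It assumes each table starts with a line beginning with TARGET followed by exactly one header line; the function name `sort_tables` is hypothetical.

```python
def sort_tables(lines):
    """Sort the data rows of each TARGET table by first column, descending."""
    out, rows = [], []

    def flush():
        # sorted() is stable, so rows with equal counts keep their input order.
        out.extend(sorted(rows, key=lambda r: int(r.split()[0]), reverse=True))
        rows.clear()

    header_left = 0
    for line in lines:
        if line.startswith("TARGET"):
            flush()
            out.append(line)
            header_left = 1  # the next line is the column header
        elif header_left:
            out.append(line)
            header_left -= 1
        elif line.strip():
            rows.append(line)
        else:
            flush()
            out.append(line)  # keep blank separator lines in place
    flush()
    return out


sample = [
    "TARGET 1",
    "Sample1 Sample2 Pattern",
    "3 3 z",
    "147 171 Z",
    "27 28 z",
    "",
    "TARGET 2",
    "Sample1 Sample2 Pattern",
    "1 0 a",
    "2 0 b",
    "1 0 c",
]
print("\n".join(sort_tables(sample)))
```

Tables are detected by the TARGET line itself, so the sketch also works when there is no blank line between tables.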