How to merge two files based on the first three columns using awk

I wanted to merge two files into a single one line by line using the first three columns as a key. Example:
file1.txt
a b c 1 4 7
x y z 2 5 8
p q r 3 6 9
file2.txt
p q r 11
a b c 12
x y z 13
My desired output for the above two files is:
a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11
The number of columns in each file is not fixed; it can vary from line to line. Also, I have more than 27K lines in each file.
The files are not ordered. The only thing guaranteed is that the first three fields are the same in both files.

You could also use join; it requires sorted input and a single join field, so the first 3 fields have to be merged. The example below sorts each file and lets sed merge and later separate the fields:
join <(sort file1.txt | sed 's/ /-/; s/ /-/') \
<(sort file2.txt | sed 's/ /-/; s/ /-/') |
sed 's/-/ /; s/-/ /'
Output:
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13

Join on the first three fields where the number of fields are variable (four or more):
{
    # collect the fourth field through the last
    for (i = 4; i <= NF; i++)
        f = f $i " "
    # append the collected fields under the 3-field key
    arr[$1 OFS $2 OFS $3] = arr[$1 OFS $2 OFS $3] f
    # reset the field string for the next line
    f = ""
}
END {
    for (key in arr)
        print key, arr[key]
}
Run like:
$ awk -f script.awk file1 file2
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
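Note that for (key in arr) visits keys in an unspecified order, so the merged lines above may come out in any order. If the output should instead follow the order in which the keys first appear in file1, a small variation of the same script could be used; the seen/order bookkeeping here is my addition, a sketch rather than part of the answer above:
{
    key = $1 OFS $2 OFS $3
    # remember the order in which each key is first seen (i.e. file1 order)
    if (!seen[key]++)
        order[++nkeys] = key
    for (i = 4; i <= NF; i++)
        arr[key] = arr[key] $i " "
}
END {
    for (k = 1; k <= nkeys; k++)
        print order[k], arr[order[k]]
}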

Try this:
awk 'NR==FNR{a[$1$2$3]=$4;next}$1$2$3 in a{print $0, a[$1$2$3]}' file2 file1

If the column values have varying lengths (so plain concatenation like $1$2$3 could produce colliding keys), you could try something like this using SUBSEP:
awk 'NR==FNR{A[$1,$2,$3]=$4; next}($1,$2,$3) in A{print $0, A[$1,$2,$3]}' file2 file1
For a varying number of columns in file1 and sorted output, try:
awk '{$1=$1; i=$1 FS $2 FS $3 FS; sub(i,x)} NR==FNR{A[i]=$0; next}i in A{print i $0, A[i]}' file2 file1 | sort
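Both variants above drop any file1 line whose key has no match in file2. The question states that the first three fields always match, but if that guarantee is ever violated and you still want to keep the unmatched file1 lines (a left join of sorts), a sketch:
awk 'NR==FNR{A[$1,$2,$3]=$4; next}
     {print $0 (($1,$2,$3) in A ? OFS A[$1,$2,$3] : "")}' file2 file1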

Related

How do I print starting from a certain row of output with awk? [duplicate]

I have millions of records in my file. What I need to do is print columns 1396 to 1400 for a specific number of rows, and if possible get the result into Excel or Notepad.
I tried this command:
awk '{print $1396,$1397,$1398,$1399,$1400}' file_name
But this runs for every row.
You need a condition to specify which rows to apply the action to:
awk '<<condition goes here>> {print $1396,$1397,$1398,$1399,$1400}' file_name
For example, to do this only for rows 50 to 100:
awk 'NR >= 50 && NR <= 100 {print $1396,$1397,$1398,$1399,$1400}' file_name
(Depending on what you want to do, you can also have much more complicated selection patterns than this.)
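For instance, you could combine the row range with a test on a field; the value "ERROR" in column 2 here is purely a hypothetical selection criterion:
awk 'NR >= 50 && NR <= 100 && $2 == "ERROR" {print $1396,$1397,$1398,$1399,$1400}' file_name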
Here's a simpler example for testing:
awk 'NR >= 3 && NR <= 5 {print $2, $3}'
If I run this on an input file containing
1 2 3 4
2 3 4 5
3 a b 6
4 c d 7
5 e f 8
6 7 8 9
I get the output
a b
c d
e f

Using matching entries only, print a line from file A if its column value is between two other column values in file B

I have a tab delim file1
A 1
A 20
B 17
B 33
C 10
C 20
E 7
and another tab delim file2
A 1 5
A 6 20
B 1 10
B 30 60
C 10 20
E 1 6
I need to print the lines in file1 for which col1 of file1 equals col1 of file2 and the value in col2 of file1 falls within the range given by cols 2 and 3 of file2.
The output would look like
A 1
A 20
B 33
C 10
C 20
I'm trying
awk 'FNR==NR{a[$1]=$2;next}; ($1) in a{if($2=(a[$1] >= $2 && a[$1] <=$3) {print}}1' file1 file2
But it's not working.
To store multiple ranges, you really want to use arrays of arrays or lists. awk doesn't support them directly but they can be emulated. In this case arrays of arrays seem likely to be more efficient.
awk '
# store each range from file2
FNR==NR {
    n = ++q[$1]
    min[$1 FS n] = $2
    max[$1 FS n] = $3
    next
}
# process file1
n = q[$1] {    # if no q entry, line cannot be in range
    for (i = 1; i <= n; i++)
        if (min[$1 FS i] <= $2 && $2 <= max[$1 FS i]) {
            print
            next
        }
}
' file2 file1
Each min/max range needs to be stored separately. By maintaining a counter (q[$1]) of occurrences of each different value of col1 ($1), we ensure creation of a distinct new array element [$1 FS n].
Subsequently, when checking the ranges, we know that any particular value of col1 occurs precisely q[$1] times.
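As a side note, GNU awk 4.0 and later does support true arrays of arrays, so the same idea can be written without composing keys with FS. A sketch under that assumption (gawk-only, not portable to every awk):
gawk '
# store each range from file2 in a subarray per col1 value
FNR==NR { n = ++q[$1]; min[$1][n] = $2; max[$1][n] = $3; next }
# test each file1 line against every stored range for its col1 value
$1 in q {
    for (i = 1; i <= q[$1]; i++)
        if (min[$1][i] <= $2 && $2 <= max[$1][i]) { print; next }
}
' file2 file1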
My desperate solution to this was to manipulate my file2 into this form using some basic awk operations:
A 1 gene_starts
A 5 gene_ends
A 6 gene_starts
A 20 gene_ends
B 1 gene_starts
B 10 gene_ends
B 30 gene_starts
B 60 gene_ends
C 10 gene_starts
C 20 gene_ends
E 1 gene_starts
E 6 gene_ends
So that I could sort the two files 1 and 2 together, and grab the lines between gene_starts and gene_ends:
sort -V -k1,2 file1 file2 | awk '/gene_starts/,/gene_ends/' | awk '!length($3)'
But this allowed me to get only
A 20
B 33
C 20
because sorting puts the file1 lines that match the gene_starts values before those gene_starts lines, so the range pattern misses them.
So, to get A 1 and C 10, I had to filter gene_ends out of my file2 and use
awk 'NR==FNR{c[$1,$2];next} (($1,$2) in c)' file2 file1
to get the gene_starts lines that were also in file1 and add them to my output.

AWK command to extract distinct values from a column

From a tab-delimited file, I'm trying to extract all rows that have a particular value in column 4 and save them as a CSV. However, I would like to do this for every distinct value in column 4, saving each set of rows as its own CSV, in one go.
I was able to extract one value using this command:
awk -F $'\t' '$4 == "\"C333\"" {print}' dataFile > C333.csv
Let's consider this test file:
$ cat in.csv
a b c d
aa bb cc d
1 2 3 4
12 23 34 4
A B C d
Now, let's write each row to a tab-separated output file that is named after the fourth column:
$ awk -F'\t' '{f=$4".csv"; print>>f; close(f)}' OFS='\t' in.csv
$ cat d.csv
a b c d
aa bb cc d
A B C d
$ cat 4.csv
1 2 3 4
12 23 34 4
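If the output files really need to be comma separated rather than tab separated, one option (assuming the field values themselves contain no commas) is to set OFS to a comma and force the record to be rebuilt with $1=$1:
awk -F'\t' -v OFS=',' '{f=$4".csv"; $1=$1; print>>f; close(f)}' in.csv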

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to add another awk step to delete lines that joined with themselves, i.e., A 1 A 1; B 2 B 2, etc.
The second issue is that it prints both directions of the join, so
A 1 C 1 is printed along with C 1 A 1 on another line.
Both lines express the same relation and I do NOT want to include both. I want to see just one or the other, i.e., "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution, using join and sort together with awk:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
  awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Create the cross product (the {,} brace expansion passes the sorted file to join twice), then eliminate the diagonal and one of each symmetric pair.
Here is an awk-only solution:
awk 'NR>1{ar[$2]=(ar[$2]$1);}\
END{ for(key in ar){\
for(i=1; i<length(ar[key]); i++) {\
for(j=i+1; j<length(ar[key])+1; j++) {\
print substr(ar[key],i,1), key, substr(ar[key],j,1), key;\
}\
}\
}}' infile
Each number in the second column of the input serves as the key of an awk array. The value of the corresponding array element is the sequence of first-column letters seen with that number (e.g., with the sample input, array[2]=ABC).
Then we build all two-letter combinations from each sequence (e.g., "ABC" gives "AB", "AC", and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for this number.
The order of the output depends on the order of the input (no sorting of the letters!). That is, if C 1 appeared before A 1 in the input, then array[1]="CA" and the first output line for that key would be C 1 A 1.
The first line of the input (the header) is skipped because of NR>1.
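Note also that the substr() approach assumes the identifiers in the first column are single characters. If they can be longer, a sketch of the same idea that stores each identifier as a separate array element rather than one concatenated string:
awk 'NR>1 { n = ++cnt[$2]; item[$2 FS n] = $1 }
END {
    for (key in cnt)
        for (i = 1; i < cnt[key]; i++)
            for (j = i + 1; j <= cnt[key]; j++)
                print item[key FS i], key, item[key FS j], key
}' infile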
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel)printf " "; else if(lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1;}END{if(lastsel)printf"\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 1 C 1 but never B 1 C 1)

print a line from every 5 elements of a column

I am looking for a way to select a column (e.g., the eighth column) of a data file and write the first five numbers of that column in a row, the next five numbers in a second row, and so on.
I have been testing with awk and printf without success.
The awk way to do this is to choose between OFS and ORS as the separator printed after each value, based on the modulus of the record number:
$ seq 1 20 | awk '{printf "%s", $1 (NR % 5 ? OFS : ORS)}'
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
Change $1 to $8 for the eighth column, for example, and NR % 5 to NR % 10 for rows of 10 instead of 5. The seq command just generates a single column of numbers from 1 to 20 for demonstration.
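Applied to the eighth column of a hypothetical input file data.txt, that would be:
awk '{printf "%s", $8 (NR % 5 ? OFS : ORS)}' data.txt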
I also find using xargs useful for this kind of thing:
$ seq 1 20 | awk '{print $1}' | xargs -n5
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
The awk isn't necessary for the example, since seq produces only a single column; for your question, change $1 to $8 to select only the eighth column of your input. With this approach you could also swap out awk for cut, as shown below.
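For example, if the input is tab separated (data.tsv is a hypothetical file name), the cut equivalent would be:
cut -f8 data.tsv | xargs -n5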
This will also produce the requested format:
seq 1 20 | awk '{printf("%s ", $1); if (NR % 5 == 0) printf("\n")}'
where $1 indicates the column number, which can be changed when passing a file to the awk command.