Processing 2 files with different field separators using awk

Let's say I have 2 files:
$ cat file1
A:10
B:5
C:12
$ cat file2
100 A
50 B
42 C
I'd like to have something like:
A 10 100
B 5 50
C 12 42
I tried this :
awk 'BEGIN{FS=":"}NR==FNR{a[$1]=$2;next}{FS=" ";print $2,a[$2],$1}' file1 file2
Which outputs this:
100 A
B 5 50
C 12 42
I guess the problem comes from the field separator, which is set too late for the second file. How can I set a different field separator for each file (and not a single one for all of them)?
Thanks
Edit: a more general case
With file2 and file3 like this:
$ cat file3
A:10 foo
B:5 bar
C:12 baz
How to get :
A 10 foo 100
B 5 bar 50
C 12 baz 42

Just set FS between the files on the command line; awk treats a var=value argument as an assignment that takes effect when that point in the argument list is reached:
awk '...' FS=":" file1 FS=" " file2
i.e.:
$ awk 'NR==FNR{a[$1]=$2;next}{print $2,a[$2],$1}' FS=":" file1 FS=" " file2
A 10 100
B 5 50
C 12 42
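The same trick handles the more general case from the edit: with FS=":", everything after the colon in file3 lands in $2 (each line has exactly one colon), so the extra foo/bar/baz values carry through unchanged:
$ awk 'NR==FNR{a[$1]=$2;next}{print $2,a[$2],$1}' FS=":" file3 FS=" " file2
A 10 foo 100
B 5 bar 50
C 12 baz 42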

You need to get awk to re-split $0 after you change FS.
You can do that with $0=$0 (for example).
So {FS=" ";$0=$0;...} in your final block will do what you want.
Though re-splitting only the first time you change FS (rather than on every line) will likely perform slightly better for large files.
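For example, a minimal fix to the question's command along these lines (re-splitting on every file2 line, which is harmless for small inputs):
awk 'BEGIN{FS=":"} NR==FNR{a[$1]=$2;next} {FS=" "; $0=$0; print $2,a[$2],$1}' file1 file2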

You can try something like:
$ cat f1
A:10
B:5
C:12
$ cat f2
100 A
50 B
42 C
$ awk 'NR==FNR{split($0,tmp,/:/);a[tmp[1]]=tmp[2];next}$2 in a{print $2,a[$2],$1}' f1 f2
A 10 100
B 5 50
C 12 42
Or set multiple field separators:
$ awk -F"[: ]" 'NR==FNR{a[$1]=$2;next}$2 in a{print $2,a[$2],$1}' f1 f2
A 10 100
B 5 50
C 12 42

Related

Using matching entries only, print a file A line if its column value is between two other columns' values in file B

I have a tab delim file1
A 1
A 20
B 17
B 33
C 10
C 20
E 7
and another tab delim file2
A 1 5
A 6 20
B 1 10
B 30 60
C 10 20
E 1 6
I need to print the lines in file1 for which col1 of file1 = col1 of file2 and the value in col2 of file1 falls within the range given by cols 2 and 3 of file2.
The output would look like
A 1
A 20
B 33
C 10
C 20
I'm trying
awk 'FNR==NR{a[$1]=$2;next}; ($1) in a{if($2=(a[$1] >= $2 && a[$1] <=$3) {print}}1' file1 file2
But it's not working.
To store multiple ranges, you really want to use arrays of arrays or lists. awk doesn't support them directly but they can be emulated. In this case arrays of arrays seem likely to be more efficient.
awk '
# store each range from file2
FNR==NR {
    n = ++q[$1]
    min[$1 FS n] = $2
    max[$1 FS n] = $3
    next
}
# process file1
n = q[$1] { # if there is no q entry, the line cannot be in range
    for (i=1; i<=n; i++)
        if ( min[$1 FS i]<=$2 && $2<=max[$1 FS i] ) {
            print
            next
        }
}
' file2 file1
Each min/max range needs to be stored separately. By maintaining a counter (q[$1]) of occurrences of each different value of col1 ($1), we ensure creation of a distinct new array element [$1 FS n].
Subsequently, when checking the ranges, we know that any particular value of col1 occurs precisely q[$1] times.
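For what it's worth, GNU awk 4.0+ does support true arrays of arrays, so the emulation is unnecessary there. A sketch of the same logic, assuming gawk is available:
awk '
# store each range from file2 in a true 2-D array
FNR==NR { n = ++q[$1]; min[$1][n] = $2; max[$1][n] = $3; next }
# print a file1 line if any stored range for its label contains $2
$1 in q {
    for (i = 1; i <= q[$1]; i++)
        if (min[$1][i] <= $2 && $2 <= max[$1][i]) { print; next }
}
' file2 file1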
My desperate solution to this was to manipulate my file2 into this, using some basic awk operations:
A 1 gene_starts
A 5 gene_ends
A 6 gene_starts
A 20 gene_ends
B 1 gene_starts
B 10 gene_ends
B 30 gene_starts
B 60 gene_ends
C 10 gene_starts
C 20 gene_ends
E 1 gene_starts
E 6 gene_ends
So that I could sort the two files 1 and 2 together, and grab lines between gene_starts and gene_ends:
sort -V -k1,2 file1 file2 | awk '/gene_starts/,/gene_ends/' | awk '!length($3)'
But this allowed me to get only
A 20
B 33
C 20
because file1 lines that coincide with a range start sort before the corresponding gene_starts marker, leaving them outside the range.
So, to get A 1 and C 10, I had to filter gene_ends out of my file2 and use
awk 'NR==FNR{c[$1,$2];next} (($1,$2) in c)' file2 file1
to get the gene_starts lines that were also in file1, and add them to my output.

If two columns from different files are equal, replace the third column with awk

I am looking for a way to replace a column in a file, if two ID columns match.
I have file A.txt
c a b ID
1 0.01 5 1
2 0.1 6 2
3 2 3
and file B.txt
ID a b
1 10 15
2 20 16
3 30 12
4 40 14
The output I'm looking for is
file A.txt
ID a b
1 0.01 5
2 0.1 6
3 30 2
With awk I can find which IDs from both files match:
awk 'NR==FNR{a[$1];next}$1 in a' B.txt A.txt
But how do I add the replacement? Thank you for any suggestions.
awk solution:
awk 'NR==FNR{ if(NR>1) a[$1]=$2; next }
FNR>1 && $1 in a && NF<3{ f=$2; $2=a[$1]; $3=f }1' B.txt A.txt | column -t
if(NR>1) a[$1]=$2; - capturing column values from file B.txt, except the header line (NR>1)
FNR>1 && $1 in a && NF<3 - if IDs match and some line from A.txt has less than 3 fields
The output:
ID a b
1 0.01 5
2 0.1 6
3 30 2
Adapted to your new data format
awk '
# Load the file B reference (skipping its header line)
FNR==NR && NR > 1 {ColB[$1]=$2; next}
# treat file A
{
    # set the missing field if the ID is known in file B (and not on the 1st line)
    if ( NF < 4 && ( $NF in ColB ) && FNR > 1 ) $0 = $NF FS ColB[$NF] FS $2
    # print the result (in any case)
    print
}
# the order of the files is mandatory
' B.txt A.txt
Self-documented. This assumes that, as in your sample, it is only the second field that is missing.

Concatenate files based on unique titles in their first column

I have many files that are of two column format with a label in the first column and a number in the second column. The number is positive (never zero):
AGS 3
KET 45
WEGWET 12
FEW 56
Within each file, the labels are not repeated.
I would like to concatenate these many files into one file with many+1 columns, such that the first column contains the unique set of all labels across all files, and each remaining column contains the number for each label from one file. If a label does not exist in a certain file (and hence there is no number for it), I would like it to default to zero. For instance, if the second file contains this:
AGS 5
KET 14
KJV 2
FEW 3
then the final output would look like:
AGS 3 5
KET 45 14
WEGWET 12 0
KJV 0 2
FEW 56 3
I am new to Linux, and have been playing around with sed and awk, but realize this probably requires multiple steps...
*Edit note: I had to change it from just 2 files to many files. Even though my example only shows 2 files, I would like to do this in case of >2 files as well. Thank you...
Here is one way using awk:
awk '
NR==FNR {a[$1]=$0; next}
{
    print (($1 in a) ? a[$1] FS $2 : $1 FS "0" FS $2)
    delete a[$1]
}
END {
    for (x in a) print a[x], "0"
}' file1 file2 | column -t
AGS 3 5
KET 45 14
KJV 0 2
FEW 56 3
WEGWET 12 0
You read file1 into an array indexed by column 1 and assign the entire line as its value.
For file2, check if column 1 is present in the array. If it is, print the value from file1 along with the value from file2. If it is not present, print 0 as the value for file1.
Delete each array element as we go along, so only what was unique to file1 remains.
In the END block, print what was unique to file1, with 0 for file2.
Pipe the output to column -t for pretty formatting.
Assuming that your data are in files named file1 and file2:
$ awk 'FNR==NR {a[$1]=$2; b[$1]=0; next} {a[$1]+=0; b[$1]=$2} END{for (x in b) {printf "%-15s%3s%3s\n",x,a[x],b[x]}}' file1 file2
KJV 0 2
WEGWET 12 0
KET 45 14
AGS 3 5
FEW 56 3
To understand the above, we have to understand an awk trick.
In awk, NR is the number of records (lines) that have been processed and FNR is the number of records that we have processed in the current file. Consequently, the condition FNR==NR is true only when we are processing in the first file. In this case, the associative array a gets all the values from the first file and associative array b gets placeholder, i.e. zero, values. When we process the second file, its values go in array b and we make sure that array a at least has a placeholder value of zero. When we are done with the second file, the data is printed.
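A quick way to see the distinction is to print both counters (here with the sample file1 and file2, four lines each):
$ awk '{print FILENAME, NR, FNR}' file1 file2
file1 1 1
file1 2 2
file1 3 3
file1 4 4
file2 5 1
file2 6 2
file2 7 3
file2 8 4
NR keeps counting across files while FNR restarts, so NR==FNR holds only for lines of the first file.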
More than two files using GNU Awk
I created a file3:
$ cat file3
AGS 3
KET 45
WEGWET 12
FEW 56
AGS 17
ABC 100
The awk program extended to work with any number of files is:
$ awk 'FNR==1 {n+=1} {a[$1][n]=$2} END{for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) {printf "%5s",a[x][i]};print ""}}' file1 file2 file3
KJV 2
ABC 100
WEGWET 12 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56
This code creates a file counter. We know that we are in a new file every time FNR is 1, and a counter, n, is incremented. For every line we encounter, we put the data in a 2-D array: the first dimension of a is the label and the second is the number of the file that we encountered it in. In the end, we just loop over all the labels and all the files, from 1 to n, and print the data. Note that labels missing from a file print as empty strings here; the non-GNU version below coerces them to zero with +0.
More than two files without GNU Awk
Without requiring GNU awk, we can solve the problem using simulated two-dimensional arrays:
$ awk 'FNR==1 {n+=1} {b[$1]=1; a[$1,":",n]=$2} END{for (x in b) {printf "%-15s",x; for (i=1;i<=n;i++) {q=a[x,":",i]+0; printf "%5s",q};print ""}}' file1 file2 file3
KJV 0 2 0
ABC 0 0 100
WEGWET 12 0 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56

Rearrange columns using the awk or cut command

I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column; if you have 1000 columns, it does it with the 1000th column.
EDIT
if the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give any input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could print the fields in a loop, as sketched below.
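For instance, a sketch of that loop approach, building the output string instead of reassigning fields (same result as EDIT3):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS='\t' '{
    s = $1 OFS $2 OFS $10 OFS $11              # the fields we want up front
    for (i = 3; i <= NF; i++)                  # append the rest in order,
        if (i != 10 && i != 11) s = s OFS $i   # skipping 10 and 11
    print s
}'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20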
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL

# die quietly on SIGPIPE, like standard filters (e.g. when piped to head)
signal(SIGPIPE, SIG_DFL)

# example usage: cat file | mycut 3 2 1  (column indices are 0-based)
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"
for line in sys.stdin:
    parts = line.rstrip("\n").split(delimiter)  # strip the newline so the last field is clean
    print("\t".join([parts[col] for col in columns]))
I'm thinking about adding the other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would deserve its own page.
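For example, saved as an executable mycut somewhere on your PATH (keeping in mind that this version's column indices are 0-based):
$ printf '1\t2\t3\n' | mycut 2 0
3 1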
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
    unset n;
    n="{ print ";
    while [ "$1" ]; do
        n="$n\$$1\" \" ";
        shift;
    done;
    n="$n }";
    awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh

How to merge two files based on the first three columns using awk

I wanted to merge two files into a single one line by line using the first three columns as a key. Example:
file1.txt
a b c 1 4 7
x y z 2 5 8
p q r 3 6 9
file2.txt
p q r 11
a b c 12
x y z 13
My desired output for the above two files is:
a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11
The number of columns in each file is not fixed; it can vary from line to line. Also, I have more than 27K lines in each file.
They are not ordered. The only thing is that the first three fields are the same in both files.
You could also use join; it requires sorted input and that the first 3 fields be merged into one. The example below sorts each file and lets sed merge and later separate the fields:
join <(sort file1.txt | sed 's/ /-/; s/ /-/') \
<(sort file2.txt | sed 's/ /-/; s/ /-/') |
sed 's/-/ /; s/-/ /'
Output:
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
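To see what the sed step feeds to join, run it on a single line: each s/ /-/ replaces the first remaining space, fusing the first three fields into one join key:
$ echo 'a b c 1 4 7' | sed 's/ /-/; s/ /-/'
a-b-c 1 4 7
The trailing sed in the pipeline then reverses the substitution on the joined output.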
Join on the first three fields where the number of fields is variable (four or more):
{
    # collect everything from the fourth field to the last
    for (i=4; i<=NF; i++)
        f = f $i " "
    # concatenate the collected fields under the 3-column key
    arr[$1 OFS $2 OFS $3] = arr[$1 OFS $2 OFS $3] f
    # reset the field string
    f = ""
}
END {
    for (key in arr)
        print key, arr[key]
}
Run like:
$ awk -f script.awk file1 file2
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
try this:
awk 'NR==FNR{a[$1$2$3]=$4;next}$1$2$3 in a{print $0, a[$1$2$3]}' file2 file1
If the columns have varying lengths, you could try something like this using SUBSEP, which avoids the false matches plain concatenation can produce (e.g. fields a bc d and ab c d both yield the key abcd):
awk 'NR==FNR{A[$1,$2,$3]=$4; next}($1,$2,$3) in A{print $0, A[$1,$2,$3]}' file2 file1
For varying columns in file1 and sorted output, try:
awk '{$1=$1; i=$1 FS $2 FS $3 FS; sub(i,x)} NR==FNR{A[i]=$0; next}i in A{print i $0, A[i]}' file2 file1 | sort
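Unpacking that last one-liner as a commented sketch of the very same code (x is deliberately unset, i.e. the empty string, and the key must not contain regex metacharacters since sub() treats it as a pattern):
awk '
{
    $1 = $1                       # rebuild $0 so whitespace runs collapse to single OFS
    i = $1 FS $2 FS $3 FS         # the 3-column key, trailing separator included
    sub(i, x)                     # strip the key from $0, leaving the remainder
}
NR==FNR { A[i] = $0; next }       # file2: remember each remainder under its key
i in A  { print i $0, A[i] }      # file1: on a key match, print key and both remainders
' file2 file1 | sort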