Unix replacing values with 2 conditions from 2 files - awk

I am having an issue that I almost solved thanks to this post. Using a dataset in the same format:
File 1
32074_32077 1 0.008348 834830 G A
32082_32085 1 0.008349 834928 A G
32085_32088 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
File 2
rs3094315 1 0.020130 752566 G A
rs12124819 1 0.020242 834928 A G
rs28765502 2 0.022137 834928 T C
rs7419119 3 0.022518 846808 T G
I would like to change the 1st column of file1 only if both $2 and $4 match a line in file2. If they don't match, I would like to keep the line as it is.
Expected output:
32074_32077 1 0.008348 834830 G A
rs12124819 1 0.008349 834928 A G
rs28765502 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
Using the answer from the linked post, I cannot get the expected output. I tried this:
awk 'FNR==NR{a[$4]=$1; b[$2]=$1; next} ($4 in a && $2 in b){$1=a[$4]} 1' file1 file2
It doesn't work as expected because the condition $2 in b is always true. I understand why, but I don't know how to work around it.
Thank you.

You may use this awk:
awk 'FNR==NR {a[$2,$4]=$1; next} ($2,$4) in a {$1 = a[$2,$4]} 1' file2 file1 |
column -t
32074_32077 1 0.008348 834830 G A
rs12124819 1 0.008349 834928 A G
rs28765502 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
The array a uses the composite key ($2,$4).
column -t is used only to display the output in aligned columns.
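The ($2,$4) syntax is awk's composite-key notation: the values are joined with the built-in SUBSEP separator before being used as a single array index. A minimal self-contained sketch (the key values are taken from the sample data above):

```shell
awk 'BEGIN {
    a["1", "834928"] = "rs12124819"       # stored under "1" SUBSEP "834928"
    if (("1", "834928") in a)             # composite-key membership test
        print "found:", a["1", "834928"]
    if (("1" SUBSEP "834928") in a)       # equivalent explicit form
        print "same key via SUBSEP"
}'
```

Both tests succeed because the two spellings produce the same internal key string.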

Related

Replacing with condition on two files awk

Using these example files:
File1:
rs12124819 1 0.020242 776546 A G
rs28765502 1 0.022137 832918 T C
rs7419119 1 0.022518 842013 T G
rs950122 1 0.022720 846864 G C
File2:
1_752566 1 0 752566 G A
1_776546 1 0 776546 A G
1_832918 1 0 832918 T C
1_842013 1 0 842013 T G
I am trying to change the 1st column of file2 with the corresponding 1st column of file1 if their 4th column are equal.
Expected output:
rs12124819 1 0 752566 G A
rs28765502 1 0 776546 A G
rs7419119 1 0 832918 T C
rs950122 1 0 842013 T G
I tried to create 2 arrays but couldn't find the correct way to use them:
awk 'FNR==NR{a[$4],b[$1];next} ($4) in a{$1=b[FNR]}1' file1 file2 > out.txt
Thanks a lot!
With your shown samples, could you please try the following. Written and tested in GNU awk.
awk 'FNR==NR{a[$4]=$1;next} ($4 in a){$1=a[$4]} 1' file1 file2
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1 is being read.
a[$4]=$1 ##Creating array a whose index is $4 and value is $1.
next ##next will skip all further statements from here.
}
($4 in a){ ##Checking condition if 4th field is present in a then do following.
$1=a[$4] ##Setting value of 1st field of file2 as array a value with index of 4th column
}
1 ##1 will print edited/non-edited line.
' file1 file2 ##mentioning Input_file names here.
You may try this awk:
awk 'FNR==NR {map[FNR] = $1; next} {$1 = map[FNR]} 1' file1 file2 | column -t
rs12124819 1 0 752566 G A
rs28765502 1 0 776546 A G
rs7419119 1 0 832918 T C
rs950122 1 0 842013 T G
Another alternative (if the files are sorted on the join key, as in the sample data):
$ join -j4 -o1.1,2.2,2.3,2.4,2.5,2.6 file1 file2 | column -t
rs12124819 1 0 776546 A G
rs28765502 1 0 832918 T C
rs7419119 1 0 842013 T G
Note that your input files have only 3 matching records.

AWK command to extract distinct values from a column

From a tab-delimited file, I'm trying to extract all rows based on a unique value from column 4 and then save them as a CSV. However, I would like to extract every distinct value in column 4 and save each group as its own CSV in one go.
I was able to extract one value using this command:
awk -F $'\t' '$4 == "\"C333\"" {print}' dataFile > C333.csv
Let's consider this test file:
$ cat in.csv
a b c d
aa bb cc d
1 2 3 4
12 23 34 4
A B C d
Now, let's write each row to a tab-separated output file that is named after the fourth column:
$ awk -F'\t' '{f=$4".csv"; print>>f; close(f)}' OFS='\t' in.csv
$ cat d.csv
a b c d
aa bb cc d
A B C d
$ cat 4.csv
1 2 3 4
12 23 34 4
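Since the question asks for CSV output, here is a hedged variant of the same idea: set OFS="," and force awk to rebuild each record with $1=$1, so the tab-separated input is written out comma-separated (filenames are still taken from column 4; the sample file name in.tsv is my own):

```shell
# recreate a small tab-separated sample so the sketch is self-contained
printf 'a\tb\tc\td\n1\t2\t3\t4\n' > in.tsv
# split rows by column 4 and write comma-separated output files
awk 'BEGIN{FS="\t"; OFS=","} {f=$4".csv"; $1=$1; print >> f; close(f)}' in.tsv
cat d.csv   # a,b,c,d
```

The assignment $1=$1 is what triggers the FS-to-OFS rewrite; without it the lines would be copied through with their original tabs.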

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to do another awk step to delete lines which did a join with themselves, i.e., A 1 A 1, B 2 B 2, etc.
The second issue is that it prints both directions of the join:
A 1 C 1 is printed along with C 1 A 1 on another line.
Both lines display the same relation and I would NOT like to include both. I want to see just one or the other, i.e., "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution, using join and sort together with awk:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
    awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Create the cross product, then eliminate the diagonal and one of each symmetrical pair.
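To see the cross product that join produces before the awk filter is applied, here is a self-contained sketch using a temporary sorted copy instead of the bash-only process substitution (the file name pairs is my own sample):

```shell
cat > pairs <<'EOF'
A 1
B 1
EOF
sort -k2 -k1,1 pairs > pairs.sorted
# joining a sorted file with itself on field 2 yields the cross product
join -j 2 pairs.sorted pairs.sorted
```

This prints the four lines 1 A A, 1 A B, 1 B A, 1 B B; the awk filter then drops the diagonal (1 A A, 1 B B) and one of each mirrored pair.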
Here is an awk-only solution:
awk 'NR>1 { ar[$2] = ar[$2] $1 }
     END {
         for (key in ar)
             for (i = 1; i < length(ar[key]); i++)
                 for (j = i + 1; j <= length(ar[key]); j++)
                     print substr(ar[key], i, 1), key, substr(ar[key], j, 1), key
     }' infile
Each number in the second column of the input serves as a key of an awk-array. The value of the corresponding array-element is a sequence of first-column letters (e.g., array[1]=ABC).
Then we build all two-letter combinations for each sequence (e.g., "ABC" gives "AB", "AC", and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for that number.
The order of output depends on the order of input (no sorting of letters!). That is, if the second input line were C 1, then array[1]="CAB" and the first output line would be C 1 A 1.
The first line of input (the header) is ignored due to NR>1.
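The substr() approach above also assumes single-character identifiers, since it peels the stored string apart one character at a time. A hedged sketch that stores each identifier in its own array slot instead, so multi-character names work too (the sample data here is my own):

```shell
# self-contained sample with multi-character identifiers
cat > infile <<'EOF'
Identifier Relation
AA 1
BB 1
CC 2
EOF
awk 'NR>1 { cnt[$2]++; ar[$2, cnt[$2]] = $1 }   # one slot per identifier
     END {
         for (k in cnt)                         # for each relation key
             for (i = 1; i < cnt[k]; i++)       # all unordered pairs
                 for (j = i + 1; j <= cnt[k]; j++)
                     print ar[k, i], k, ar[k, j], k
     }' infile
```

This prints AA 1 BB 1; key 2 has only one member, so it produces nothing, matching the original solution's behavior.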
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel)printf " "; else if(lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1;}END{if(lastsel)printf"\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 1 C 1 but never B 1 C 1)
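The first limitation is easy to demonstrate with a small self-contained sample (my own data): a relation that occurs only once is silently dropped, because the second awk only prints when it sees a second line for the current key.

```shell
cat > input <<'EOF'
A 1
C 1
D 4
EOF
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1; item=$2; next} {printf "%s %s %s %s\n", item, rel, $2, $1}'
```

Only A 1 C 1 is printed; D 4 never appears in the output.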

awk: delete first and last entry of comma-separated field

I have a 4 column data that looks something like the following:
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
I am trying to delete the first two numbers and the last two numbers in the fourth column of every line, so the output looks like the following:
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Can someone help me with this? I would appreciate it.
You can use this:
$ awk '{n=split($4, a, ","); for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":","); print $1, $2, $3, t; t=""}' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Explanation
n=split($4, a, ",") slices the 4th field into pieces, based on comma as the delimiter. As split() returns the number of pieces, we store it in n to work with it later on.
for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":",") rebuilds the trimmed 4th field in t, looping from the 3rd slice through the (n-2)th and re-inserting a comma after every slice except the last.
print $1, $2, $3, t; t="" prints the new output and blanks the variable t.
This will work for your posted sample input:
$ awk '{gsub(/^([^,]+,){2}|(,[^,]+){2}$/,"",$NF)}1' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
If you have cases where there are fewer than 4 commas in your 4th field, then update your question to show how those should be handled.
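Until the question is updated, here is a hedged guess at one way to handle short lists: only trim when the field has more than four entries, and leave shorter lists untouched (the sample data is my own):

```shell
# self-contained sample: one list long enough to trim, one too short
cat > file <<'EOF'
a 1 g 1,2,3,4,5
b 2 g 1,2,3
EOF
awk '{n = split($4, a, ",")
      if (n > 4) {                  # enough entries to drop 2 from each end
          s = a[3]
          for (i = 4; i <= n-2; i++) s = s "," a[i]
          $4 = s
      }
      print}' file
```

This prints a 1 g 3 and leaves b 2 g 1,2,3 unchanged.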
This uses bash array manipulation. It may be a little ... gnarly:
while read -a fields; do                      # read the fields for each line
    IFS=, read -a values <<< "${fields[3]}"   # split the last field on comma
    new=("${values[@]:2:${#values[@]}-4}")    # drop the first 2 and last 2 values
    fields[3]=$(IFS=,; echo "${new[*]}")      # join the new list on comma
    printf "%s\t" "${fields[@]}"; echo        # print the new line
done <<END
done <<END
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
END
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6

How to merge two files based on the first three columns using awk

I wanted to merge two files into a single one line by line using the first three columns as a key. Example:
file1.txt
a b c 1 4 7
x y z 2 5 8
p q r 3 6 9
file2.txt
p q r 11
a b c 12
x y z 13
My desired output for the above two files is:
a b c 1 4 7 12
x y z 2 5 8 13
p q r 3 6 9 11
The number of columns in each file is not fixed; it can vary from line to line. Also, I have more than 27K lines in each file.
They are not ordered. The only thing guaranteed is that the first three fields are the same in both files.
You could also use join; it requires sorted input and a single join field, so the first 3 fields must be merged into one. The example below sorts each file and lets sed merge and then re-separate the fields:
join <(sort file1.txt | sed 's/ /-/; s/ /-/') \
<(sort file2.txt | sed 's/ /-/; s/ /-/') |
sed 's/-/ /; s/-/ /'
Output:
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
Join on the first three fields where the number of fields is variable (four or more):
{
    # collect the fourth field through the last field
    for (i = 4; i <= NF; i++)
        f = f $i " "
    # append the collected fields under the 3-field key
    arr[$1 OFS $2 OFS $3] = arr[$1 OFS $2 OFS $3] f
    # reset the field string for the next line
    f = ""
}
END {
    for (key in arr)
        print key, arr[key]
}
Run like:
$ awk -f script.awk file1 file2
a b c 1 4 7 12
p q r 3 6 9 11
x y z 2 5 8 13
try this:
awk 'NR==FNR{a[$1$2$3]=$4;next}$1$2$3 in a{print $0, a[$1$2$3]}' file2 file1
If the column values have varying lengths, plain concatenation like $1$2$3 can collide (e.g. "ab" "c" produces the same key as "a" "bc"), so you could try something like this using SUBSEP:
awk 'NR==FNR{A[$1,$2,$3]=$4; next}($1,$2,$3) in A{print $0, A[$1,$2,$3]}' file2 file1
For varying columns in file1 and sorted output, try:
awk '{$1=$1; i=$1 FS $2 FS $3 FS; sub(i,x)} NR==FNR{A[i]=$0; next}i in A{print i $0, A[i]}' file2 file1 | sort
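For comparison, here is a hedged sketch that keeps file1's original order and tolerates a varying number of trailing columns in both files: strip the three key fields from each file2 line while reading it, then append the remainder to the matching file1 line (the sample data is my own):

```shell
# self-contained sample with varying column counts
cat > file1 <<'EOF'
a b c 1 4 7
x y z 2 5 8
EOF
cat > file2 <<'EOF'
x y z 13
a b c 12 99
EOF
awk 'NR==FNR { k = $1 FS $2 FS $3              # build the 3-field key
               sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")  # drop the key from the line
               a[k] = $0; next }
     { k = $1 FS $2 FS $3
       if (k in a) print $0, a[k]; else print }' file2 file1
```

This prints a b c 1 4 7 12 99 and x y z 2 5 8 13, in file1's order, and passes through any file1 line with no match.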