Merge two or three rows with a condition in awk

I have a question. I would like to merge two or three rows, when a condition holds, into one row with specific printing.
INPUT: a tab-delimited file with 6 columns
LOL h/h 2 a b c
LOLA h/h 3 b b b
SERP w/w 4 c c c
DARD s/s 5 d d d
GIT w/w 6 a b c
GIT h/h 6 a a b
GIT d/d 6 a b b
LOL h/h 7 a a a
Output: there are 2 conditions: if the $1 values are the same and the $3 values are the same, merge the rows together with the specific printing shown:
LOL h/h 2 a b c
LOLA h/h 3 b b b
SERP w/w 4 c c c
DARD s/s 5 d d d
GIT w/w 6 a b c h/h 6 a a b d/d 6 a b b
LOL h/h 7 a a a
I have this code:
awk -F'\t' -v OFS="\t" 'NF>1{a[$1] = a[$1]"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6};END{for(i in a){print i""a[i]}}'
But it merges on the 1st column only, and I am not sure this is a good way to do it.

In awk:
$ awk '($1 FS $3) in a{k=$1 FS $3; $1=""; a[k]=a[k] $0;next} {a[$1 FS $3]=$0} END {for(i in a) print a[i]}' file
SERP w/w 4 c c c
LOL h/h 2 a b c
LOLA h/h 3 b b b
DARD s/s 5 d d d
LOL h/h 7 a a a
GIT w/w 6 a b c h/h 6 a a b d/d 6 a b b
Explained:
($1 FS $3) in a {       # if the key has already been seen in array a
    k = $1 FS $3
    $1 = ""             # remove $1
    a[k] = a[k] $0      # append to the existing record
    next
}
{ a[$1 FS $3] = $0 }    # if the key has not been seen, store the record
END {
    for (i in a)        # for all stored keys
        print a[i]      # print
}
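Note that for (i in a) visits keys in an unspecified order, which is why the merged output above is shuffled relative to the input. If input order matters, here is a minimal variant of the same approach that remembers first-seen order:
awk '
{ k = $1 FS $3 }
k in a { $1 = ""; a[k] = a[k] $0; next }    # append to an existing record
{ a[k] = $0; order[++n] = k }               # remember first-seen order
END { for (i = 1; i <= n; i++) print a[order[i]] }
' file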

Here is an answer for gawk v4, which supports multi-dimensional arrays. Once the columns from the first file are stored in a multi-dimensional array, they are easy to compare against the second file's columns. My solution shows an example printf, which you can modify as per your needs.
#!/bin/gawk -f
NR==FNR {            # for the first file
    a[$1][0] = $2;   # store the columns in a
    a[$1][1] = $3;   # multi-dimensional
    a[$1][2] = $4;   # array
    a[$1][3] = $5;
    a[$1][4] = $6;
    next;
}
$1 in a && $3 == a[$1][1] {
    printf("%s\t%s\n", $2, a[$1][0])
}
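For instance, to print the whole matching line from the second file followed by all of the stored columns, the printf could be extended along these lines (a sketch only, same two-file setup):
$1 in a && $3 == a[$1][1] {
    printf("%s", $0)             # the matching line from the second file
    for (i = 0; i <= 4; i++)
        printf("\t%s", a[$1][i]) # the stored columns from the first file
    printf("\n")
}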

Here is an answer using gawk v3, where multi-dimensional arrays are not available:
#!/bin/gawk -f
NR==FNR {
    a[$1];          # register the key
    b[$1] = $2;
    c[$1] = $3;
    d[$1] = $4;
    e[$1] = $5;
    f[$1] = $6;
    next;
}
$1 in a && $3 == c[$1] {
    print $0
}
One-liner
gawk 'NR==FNR {a[$1]; b[$1] = $2; c[$1] = $3; d[$1] = $4; e[$1] = $5; f[$1] = $6; next; } $1 in a && $3 == c[$1] { print $0 }' /tmp/file1 /tmp/file2

Related

replacing associative array indexes with their value using awk or sed

I would like to replace the column values of ref using the key-value pairs from id.
cat id:
[1] a 8-23
[2] g 8-21
[3] d 8-13
cat ref:
a 1 2
b 3 4
c 5 3
d 1 2
e 3 1
f 1 2
g 2 3
desired output
8-23 1 2
b 3 4
c 5 3
8-13 1 2
e 3 1
f 1 2
8-21 2 3
I assume it would be best done using awk.
cat replace.awk
BEGIN { OFS="\t" }
NR==FNR {
a[$2]=$3; next
}
$1 in !{!a[#]} {
print $0
}
Not sure what I need to change?
$1 in !{!a[#]} is not awk syntax. You just need $1 in a:
BEGIN { OFS = "\t" }
NR==FNR {
    a[$2] = $3
    next
}
{
    $1 = ($1 in a) ? a[$1] : $1
    print
}
To force OFS to take effect, this version always assigns to $1.
print with no argument uses $0.
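Assuming the script is saved as replace.awk (the asker's filename), a sample invocation, with the key file first, would be:
awk -f replace.awk id ref
The replaced lines come out tab-separated because assigning to $1 rebuilds the record with OFS.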

Split a file after a certain number of unique entries

Given a tab-delimited file:
A 12380
A 123801
A 1209
A 2035
A 4930
A 2903
B 2085
B 203801
B 240083
B 12308
B 12399
C 120303
C 1238058
C 235
D 55674
D 99683
D 2391095
D 12958
D 23804
D 5769
E 479903
E 28075
E 2310
E 6784
F 4789
F 23458
F 8976
G 9007
H 1203
H 12909
I want to split this after a certain number of unique entries have been seen in a specific column. As an example, splitting the above file after every 3 unique entries in the first column produces 3 files:
File 1:
A 12380
A 123801
A 1209
A 2035
A 4930
A 2903
B 2085
B 203801
B 240083
B 12308
B 12399
C 120303
C 1238058
C 235
File 2:
D 55674
D 99683
D 2391095
D 12958
D 23804
D 5769
E 479903
E 28075
E 2310
E 6784
F 4789
F 23458
F 8976
File 3:
G 9007
H 1203
H 12909
I have this so far:
awk -F"\t" 'BEGIN { count=0; filename=1 }; x[$1]++==0 {count++}; count==3 { count=1; filename++}; {print >> filename".txt"; close(filename".txt");}' file
However when running this on the terminal, I get the error:
awk: syntax error at source line 1
context is
BEGIN { count=0; filename=1 }; x[$1]++==0 {count++}; count==4 { count=1; filename++}; {print >> >>> filename".txt" <<<
awk: illegal statement at source line 1
Why?
EDIT: Removing the ".txt" fixes this - however it is super slow. Any help?
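The error is the redirection target itself: some awks (notably the one-true-awk shipped on BSD/macOS) reject an unparenthesized concatenation such as filename".txt" after >>, which is why dropping the ".txt" makes it parse. The slowness comes from close()ing and reopening the output file for every single record. A minimal corrected sketch of the same approach (parenthesized target, close only when rolling over to a new file):
awk -F"\t" '
!x[$1]++ { count++ }                                      # first time this key is seen
count==4 { count=1; close(filename ".txt"); filename++ }  # the 4th unique key starts a new file
{ print > (filename ".txt") }                             # note the parentheses around the target
' filename=1 file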
Could you please try the following (tested with the given samples)?
awk -v count=1 '
prev!=$1 && prev{
    count++
    delete a[prev]
}
count==4 && !a[$1]++{
    count=1
    print ""
}
{
    prev=$1
}
1
' Input_file
Explanation:
awk -v count=1 '          ##Start the awk program; variable count has the value 1.
prev!=$1 && prev{         ##If prev is not equal to the current $1 and prev is not null, do the following.
    count++               ##Increment count by 1.
    delete a[prev]        ##Delete the element of array a whose index is prev.
}
count==4 && !a[$1]++{     ##If count==4 and array a has no previous occurrence of $1, do the following.
    count=1               ##Reset count to 1.
    print ""              ##Print an empty line.
}
{
    prev=$1               ##Set prev to $1 of the current line.
}
1                         ##Print the current line (the default action).
' Input_file              ##Name the input file.
EDIT: To write the output into numbered files, try the following.
awk -v count=1 -v file_count=1 '
BEGIN{
    file=file_count".txt"
}
prev!=$1 && prev{
    count++
    delete a[prev]
}
count==4 && !a[$1]++{
    count=1
    close(file)
    file_count++
    file=file_count".txt"
}
{
    prev=$1
}
{
    print $0 > (file)
}
' Input_file
$ awk '$1!=(p""){p=$1;u++}          # new unique value in column 1 (p"" forces a string comparison)
       u>3{close(n++".txt");u=1}    # on the 4th unique value, close the old file and bump n
       {print >(n".txt")}' n=1 file # every line goes to the current file
$ cat 1.txt
A 12380
A 123801
A 1209
A 2035
A 4930
A 2903
B 2085
B 203801
B 240083
B 12308
B 12399
C 120303
C 1238058
C 235
$ cat 2.txt
D 55674
D 99683
D 2391095
D 12958
D 23804
D 5769
E 479903
E 28075
E 2310
E 6784
F 4789
F 23458
F 8976
$ cat 3.txt
G 9007
H 1203
H 12909

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file 2 in which the second field is within 100 units of the second field in file 1 (if field 1 matches):
Desired output (note: the last field is the third field from the matching line in file1):
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4} #there are 4 lines in file2
do
chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
start=$(echo $pos | awk '{print $1-100}') #start and end variables for 100 unit range
end=$(echo $pos | awk '{print $1+100}')
awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working. I believe something is wrong with my start and end variables, because when I echo $start I get 414, which doesn't make sense to me, and I get 614 when I echo $end.
I understand this question might be difficult to understand so please ask me if any clarification is necessary.
Thank you.
The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure to store the data in file 1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR {f1[$1][$2] = $3; next}
$1 in f1 {
for (val in f1[$1])
if (val-100 <= $2 && $2 <= val+100)
print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and stuff 2 pieces of information into the key:
awk '
NR==FNR {f1[$1,$2] = $3; next}
{
for (key in f1) {
split(key, a, SUBSEP)
if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
print $0, f1[key]
}
}
' file1 file2
That works with mawk and nawk (and gawk).
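If scanning every stored key for each line of file2 is a concern, a portable middle ground is to collect all of file1's positions per key into one string and split it on demand (a sketch, using the same sample files):
awk '
NR==FNR { pos[$1] = pos[$1] " " $2; rs[$1,$2] = $3; next }
$1 in pos {
    n = split(pos[$1], p, " ")                # only the positions stored for this key
    for (i = 1; i <= n; i++)
        if (p[i]-100 <= $2 && $2 <= p[i]+100)
            print $0, rs[$1,p[i]]
}
' file1 file2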
#!/usr/bin/env python3
import pandas as pd
from io import StringIO

file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""
file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""
df1 = pd.read_table(StringIO(file1), sep=" ", header=None)
df1.columns = ["a", "b", "c"]
df2 = pd.read_table(StringIO(file2), sep=" ", header=None)
df2.columns = ["a", "b", "c"]
df = pd.merge(df2, df1, on="a", how="outer")
# query is intuitive
r = df.query("b_y - 100 < b_x < b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is the right tool for this kind of tabular data manipulation.

add a new column to the file based on another file

I have two files, file1 and file2, as shown below. file1 has two columns and file2 has one column. I want to add a second column to file2 based on file1. How can I do this with awk?
file1
2WPN B
2WUS A
2X83 A
2XFG A
2XQR C
file2
2WPN_1
2WPN_2
2WPN_3
2WUS
2X83
2XFG_1
2XFG_2
2XQR
Desired Output
2WPN_1 B
2WPN_2 B
2WPN_3 B
2WUS A
2X83 A
2XFG_1 A
2XFG_2 A
2XQR C
Your help would be appreciated.
awk -v OFS='\t' 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); print $1, a[t] }' file1 file2
Or
awk 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); printf "%s\t%s\n", $1, a[t] }' file1 file2
Output:
2WPN_1 B
2WPN_2 B
2WPN_3 B
2WUS A
2X83 A
2XFG_1 A
2XFG_2 A
2XQR C
You may pipe the output to column -t to keep it uniformly aligned with spaces rather than tabs.
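For example:
awk -v OFS='\t' 'FNR == NR { a[$1] = $2; next } { t = $1; sub(/_.*$/, "", t); print $1, a[t] }' file1 file2 | column -t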

using awk or sed extract first character of each column and store it in a separate file

I have a file like below
AT AT AG AG
GC GC GG GC
I want to extract the first and the last character of every column and store them in two different files:
File1:
A A A A
G G G G
File2:
T T G G
C C G C
My input file is very large. Is there a way to do it in awk or sed?
With GNU awk for gensub():
gawk '{
    print gensub(/.( |$)/,"\\1","g") > "file1"
    print gensub(/(^| )./,"\\1","g") > "file2"
}' file
You can do similar in any awk with gsub() and a couple of variables.
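For example, here is a portable sketch that uses a loop with substr() and two accumulator variables instead of gensub():
awk '{
    h = t = ""
    for (i = 1; i <= NF; i++) {
        h = h (i > 1 ? " " : "") substr($i, 1, 1)            # first character of each column
        t = t (i > 1 ? " " : "") substr($i, length($i), 1)   # last character of each column
    }
    print h > "file1"
    print t > "file2"
}' file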
you can try this:
write it in test.awk:
#!/usr/bin/awk -f
BEGIN {
    outfile_head = "file1"
    outfile_tail = "file2"
}
{
    for (i = 1; i <= NF; i++) {
        printf "%s ", substr($i, 1, 1) >> outfile_head            # first character (substr indexes from 1)
        printf "%s ", substr($i, length($i), 1) >> outfile_tail   # last character
    }
    printf "\n" >> outfile_head                                   # terminate each output line
    printf "\n" >> outfile_tail
}
then make it executable and run it:
chmod +x test.awk
./test.awk file
It's easy to do in two passes:
sed 's/\([^ ]\)[^ ]/\1/g' file > file1
sed 's/[^ ]\([^ ]\)/\1/g' file > file2
Doing it in one pass is a challenge...
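It can be done in one pass with the hold space, though (a sketch; note that each w must be the last command on its line, since the filename runs to the end of the line):
sed -n 'h
s/\([^ ]\)[^ ]/\1/g
w file1
x
s/[^ ]\([^ ]\)/\1/g
w file2' file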
Edit 1: Modified for your multiple line edit.
You could write a perl script and pass in the file names if you plan to edit it and share it. This loops through the file only once and does not require storing the file in memory.
File "seq.pl":
#!/usr/bin/perl
open(F1, ">>", $ARGV[1]);
open(F2, ">>", $ARGV[2]);
open(DATA, "<", $ARGV[0]);
while ($line = <DATA>) {
    $line =~ s/(\r|\n)+//g;             # strip line endings
    @pairs = split(/\s/, $line);        # split the line into columns
    for $pair (@pairs) {
        @bases = split(//, $pair);      # split each column into characters
        print F1 $bases[0] . " ";       # first character
        print F2 $bases[-1] . " ";      # last character
    }
    print F1 "\n";
    print F2 "\n";
}
close(F1);
close(F2);
close(DATA);
Execute it like so:
perl seq.pl full.seq f1.seq f2.seq
File "full.seq":
AT AT AG AG
GC GC GG GC
AT AT GC GC
File "f1.seq":
A A A A
G G G G
A A G G
File "f2.seq":
T T G G
C C G C
T T C C