Merge two files based on a common field and print similarities and differences - awk

I have two files I would like to merge into a third, but I need to see both where they share a common field and where they differ. Since there are minor differences in the other fields, I cannot use a diff tool, and I thought this could be done with awk.
File 1:
aWonderfulMachine 1 mlqsjflk
AnotherWonderfulMachine 2 mlksjf
YetAnother WonderfulMachine 3 sdg
TrashWeWon'tBuy 4 jhfgjh
MoreTrash 5 qsfqf
MiscelleneousStuff 6 qfsdf
MoreMiscelleneousStuff 7 qsfwsf
File 2:
aWonderfulMachine 22 dfhdhg
aWonderfulMachine 23 dfhh
aWonderfulMachine 24 qdgfqf
AnotherWonderfulMachine 25 qsfsq
AnotherWonderfulMachine 26 qfwdsf
MoreDifferentStuff 27 qsfsdf
StrangeStuffBought 28 qsfsdf
Desired output:
aWonderfulMachine 1 mlqsjflk aWonderfulMachine 22 dfhdhg
aWonderfulMachine 23 dfhh
aWonderfulMachine 24 qdgfqf
AnotherWonderfulMachine 2 mlksjf AnotherWonderfulMachine 25 qsfsq
AnotherWonderfulMachine 26 qfwdsf
File1
YetAnother WonderfulMachine 3 sdg
TrashWeWon'tBuy 4 jhfgjh
MoreTrash 5 qsfqf
MiscelleneousStuff 6 qfsdf
MoreMiscelleneousStuff 7 qsfwsf
File2
MoreDifferentStuff 27 qsfsdf
StrangeStuffBought 28 qsfsdf
I have tried a few awk scripts here and there, but they either match on two fields only and I don't know how to modify the output, or they delete duplicates based on two fields only, etc. (I am new to this, and awk syntax is tough.)
Thank you very much in advance for your help.

You can come very close using these three commands:
join <(sort file1) <(sort file2)
join -v 1 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
This assumes a shell, such as Bash, that supports process substitution (<()). If you're using a shell that doesn't, the files would need to be pre-sorted.
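For example, with pre-sorted temporary files (a minimal sketch; the .sorted file names are placeholders):
sort file1 > file1.sorted
sort file2 > file2.sorted
join file1.sorted file2.sorted        # lines whose first field appears in both
join -v 1 file1.sorted file2.sorted   # lines only in file1
join -v 2 file1.sorted file2.sorted   # lines only in file2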
To do this in AWK:
#!/usr/bin/awk -f
BEGIN { FS = "\t"; flag = 1; file1 = ARGV[1]; file2 = ARGV[2] }   # fields are assumed tab-separated
FNR == NR { lines1[$1] = $0; count1[$1]++; next }   # process the first file
{   # process the second file and do output
    lines2[$1] = $0
    count2[$1]++
    if ($1 != prev) flag = 1
    if (count1[$1]) {
        if (flag) printf "%s ", lines1[$1]
        else printf "\t\t\t\t\t"
        flag = 0
        printf "\t%s\n", $0
    }
    prev = $1
}
END {   # output lines that are unique to one file or the other
    print "File 1: " file1
    for (i in lines1) if (!(i in lines2)) print lines1[i]
    print "File 2: " file2
    for (i in lines2) if (!(i in lines1)) print lines2[i]
}
To run it:
$ ./script.awk file1 file2
The lines won't be output in the same order that they appear in the input files. The second input file (file2) needs to be sorted, since the script assumes that lines with the same key are adjacent. You will probably want to adjust the tabs or other spacing in the script; I haven't done much in that regard.
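If file2 isn't already sorted, one way (the .sorted file name is a placeholder) is:
sort file2 > file2.sorted
./script.awk file1 file2.sorted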

One way to do it (reading both files explicitly with getline, then doing all the work in the END block):
BEGIN {
    FS = "\t"
    readfile(ARGV[1], s1)
    readfile(ARGV[2], s2)
    ARGV[1] = ARGV[2] = "/dev/null"   # stop awk from reading the files again
}
END {
    for (k in s1) {
        if (s2[k]) printpair(k, s1, s2)
    }
    print "file1:"
    for (k in s1) {
        if (!s2[k]) print s1[k]
    }
    print "file2:"
    for (k in s2) {
        if (!s1[k]) print s2[k]
    }
}
# slurp fname into sary, concatenating lines that share a key
function readfile(fname, sary) {
    while ((getline < fname) > 0) {
        key = $1
        if (sary[key]) {
            sary[key] = sary[key] "\n" $0
        } else {
            sary[key] = $0
        }
    }
    close(fname)
}
# print the matching groups side by side, padding the shorter side with blanks
function printpair(key, s1, s2) {
    n1 = split(s1[key], l1, "\n")
    n2 = split(s2[key], l2, "\n")
    for (i = 1; i <= max(n1, n2); i++) {
        if (i == 1) {
            b = l1[1]
            gsub(".", " ", b)   # blank padding as wide as the first file1 line
        }
        if (i <= n1) { f1 = l1[i] } else { f1 = b }
        if (i <= n2) { f2 = l2[i] } else { f2 = b }
        printf("%s\t%s\n", f1, f2)
    }
}
function max(x, y) { z = x; if (y > x) z = y; return z }
Not particularly elegant, but it handles many-to-many cases.
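To run it, save the script to a file (merge2.awk is just a placeholder name) and pass both inputs on the command line:
awk -f merge2.awk file1 file2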

I've been sorting this out using Google Sheets, but it takes a long time, so I'd like to handle it with awk.
input.txt
Column 1
2
2
2
4
4

Column 2
562
564
119
215
12

Range
13455,13457
13161
11409
13285,13277-13269
11409
I've tried this script, hoping it would rearrange the values:
awk '/Column 1/' RS= input.txt
(as suggested in How can I set the grep after context to be "until the next blank line"?)
But it seems to only grab the one matching block.
The values should be matched up line by line across the blocks.
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409
In other words, when the Range field contains a comma, the values from Column 1 and Column 2 are repeated for each comma-separated part, e.g.:
Range :
13455,13457
Result :
562Value2#13455
562Value2#13457
I don't know what sorting has to do with it, but it seems like this is what you're looking for. It stores each of the first 2 blocks of lines in arrays, then, while processing the 3rd block, prints the stored values with each comma-separated field:
$ cat tst.awk
BEGIN { FS=","; recNr=1; print "Result:" }
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
    for (i=1; i<=NF; i++) {
        print b[lineNr] "Value" a[lineNr] "#" $i
    }
}
$ awk -f tst.awk input.txt
Result:
562Value2#13455
562Value2#13457
564Value2#13161
119Value2#11409
215Value4#13285
215Value4#13277-13269
12Value4#11409

Transpose till "n"th column into row

I would like to transpose column to row by every 3rd column.
Input.txt
Name
Age
Place
aa
22
xx
bb
33
yy
cc
44
zz
....
....
Desired Output
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz
I have tried the command below, but it's incomplete:
awk '
{
    for (c = 1; c <= NR; c++) { a[c] = $c }
}
END {
    for (r = 1; r <= NR; r++) {
        for (t = 1; t <= 3; t++) {
            printf("%s ", a[c])
        }
        print ","
    }
}' Input.txt
Looking for your suggestions...
There are many good tools for this.
This awk!
$ awk 'ORS=NR%3?",":RS' file
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz
It sets the output record separator (ORS) to , whenever the line number is not a multiple of 3, and to a newline (RS) when it is. This way, it joins every group of 3 lines.
More info in Idiomatic awk.
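If the assignment-to-ORS trick feels too terse, an explicit printf form (which should behave the same way) is:
$ awk '{ printf "%s%s", $0, (NR % 3 ? "," : "\n") }' file
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz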
xargs
$ xargs -n3 <file
Name Age Place
aa 22 xx
bb 33 yy
cc 44 zz
This gets the input in blocks of X items, defined by -n X. Then you can replace spaces with commas with tr or sed.
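For example (a minimal sketch; this is safe here because the fields themselves contain no spaces):
$ xargs -n3 <file | tr ' ' ','
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz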
paste
$ paste -d"," - - - <file
Name,Age,Place
aa,22,xx
bb,33,yy
cc,44,zz
This joins every 3 input lines, one per - argument, using , as the delimiter.
Regarding transpose itself, I wrote a snippet a while ago in Using bash to sort data horizontally:
transpose () {
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j<NR?OFS:ORS)
}
}'
}
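For example (reading from stdin):
$ printf '1 2 3\n4 5 6\n' | transpose
1 4
2 5
3 6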

Combine multiple text files using awk

I have some minutely stats saved in text files and named as 1min.txt, 2min.txt etc.
1min.txt
F1,21
F2,32
F3,22
2min.txt
F2,12
F4,32
I would like to combine these files in the following format:
combined.txt
Field 1min 2min
F1 21 0
F2 32 12
F3 22 0
F4 0 32
Some fields may not exist in some files and 0 will be entered for those fields.
I've tried to do it using awk but couldn't find an easy way. Can someone please help?
Thanks
Using awk:
awk -F, '
!seen[FILENAME]++ {
    fname[++numFile] = FILENAME
}
{
    flds[$1]++
    map[FILENAME,$1] = $2
}
END {
    printf "%-10s", "FIELD"
    for (cnt=1; cnt<=numFile; cnt++) {
        file = fname[cnt]
        sub(/\.txt$/, "", file)   # strip the extension for the header
        printf "%-10s", file
    }
    print ""
    for (fld in flds) {
        printf "%-10s", fld
        for (cnt=1; cnt<=numFile; cnt++) {
            printf "%-10s", map[fname[cnt],fld]+0
        }
        print ""
    }
}' 1min.txt 2min.txt
Output:
FIELD 1min 2min
F1 21 0
F2 32 12
F3 22 0
F4 0 32
Once you have reviewed the output, you can redirect it to another file. You can pass as many files at the end as you want, and if there are too many to list you can even use a shell glob, e.g. *.txt.
Note: the output order of the fields isn't guaranteed, since for (fld in flds) visits keys in an arbitrary order and the fields are not always present in all files.
Here is a pure fun perl japh that will do the same:
perl -F, -lane'
$f{$ARGV}++; $h{$F[0]}
{$ARGV}= $F[ 1 ]
}{print join"\t",
"FIELD", map{s/.[tx]+
//x ;$_}sort{$a
<=>$b} keys%f;print
join"\n", map{$f
=$_; join
"\t", $f,map
{$h{$f
}{$_}
//=0}
sort{$a
<=>$b}
keys%f
}sort
keys%h;
' *.txt
Output:
FIELD 1min 2min
F1 21 0
F2 32 12
F3 22 0
F4 0 32
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
{ keys[$1]; val[$1,NR==FNR] = $2 }
END {
    print "Field", "1min", "2min"
    for (key in keys) {
        print key, val[key,1]+0, val[key,0]+0
    }
}
$ awk -f tst.awk 1min.txt 2min.txt
Field 1min 2min
F1 21 0
F2 32 12
F3 22 0
F4 0 32
If you care about the output order, tell us what order you're looking for: the order the keys were seen across both files, alphabetical, or something else. If it's the order they were seen, then that'd be:
$ cat tst.awk
BEGIN { FS=","; OFS="\t" }
!seen[$1]++ { keys[++numKeys] = $1 }
{ val[$1,NR==FNR] = $2 }
END {
    print "Field", "1min", "2min"
    for (k=1; k<=numKeys; k++) {
        key = keys[k]
        print key, val[key,1]+0, val[key,0]+0
    }
}
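Running it the same way should print the keys in the order they were first seen, which for this input happens to match the earlier output:
$ awk -f tst.awk 1min.txt 2min.txt
Field 1min 2min
F1 21 0
F2 32 12
F3 22 0
F4 0 32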
Using join:
join -t , input1 input2 -j 1 -o "0 1.2 2.2" -e 0 -a1 -a2 | column -t -s,
Gives:
F1 21 0
F2 32 12
F3 22 0
F4 0 32
To add a header:
join -t , input1 input2 -j 1 -o "0 1.2 2.2" -e 0 -a1 -a2 | \
sed '1iField,1min,2min' | column -t -s,
And the result looks like:
Field 1min 2min
F1 21 0
F2 32 12
F3 22 0
F4 0 32
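Note that join expects both inputs to be sorted on the join field. The sample files already are; if yours aren't, you could sort them first (assuming a shell with process substitution):
join -t , <(sort input1) <(sort input2) -j 1 -o "0 1.2 2.2" -e 0 -a1 -a2 | column -t -s,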
Awk allows you to explicitly read from files, so you can just put all the logic in a BEGIN section if you want. Here is an example:
awk -F, '
BEGIN {
    while ((getline < "1min.txt") > 0) {
        field[$1] = 1
        a1[$1] = $2
    }
    while ((getline < "2min.txt") > 0) {
        field[$1] = 1
        a2[$1] = $2
    }
    print "Field\t1min\t2min"
    for (x in field) {
        print x "\t" (a1[x]+0) "\t" (a2[x]+0)
    }
}
'
I have written some Python code to solve your problem.
fh_1 = open("1min.txt", "r")
fh_2 = open("2min.txt", "r")
fh_3 = open("combine.txt", "w")
min_c_1 = {}
min_c_2 = {}
lines_of_text = ["Field 1min 2min\n"]
for l1 in fh_1.readlines():
    data = l1.split(',')
    min_c_1[data[0]] = data[1].rstrip()
for l1 in fh_2.readlines():
    data = l1.split(',')
    min_c_2[data[0]] = data[1].rstrip()
for key in min_c_1.keys():
    if key in min_c_2.keys():
        msg = str(key) + " " + str(min_c_1[key]) + " " + str(min_c_2[key]) + "\n"
        lines_of_text.append(msg)
        del min_c_2[key]
    else:
        msg = str(key) + " " + str(min_c_1[key]) + " 0" + "\n"
        lines_of_text.append(msg)
for key in min_c_2.keys():
    msg = str(key) + " 0" + " " + str(min_c_2[key]) + "\n"
    lines_of_text.append(msg)
fh_3.writelines(lines_of_text)
fh_1.close()
fh_2.close()
fh_3.close()
Please let me know if it does not help.

find common elements in >2 files

I have three files as shown below
file1.txt
"aba" 0 0
"aba" 0 0 1
"abc" 0 1
"abd" 1 1
"xxx" 0 0
file2.txt
"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1
file3.txt
"xyx" 0 0
"aba" 0 0
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1
I want to find the similar elements in all three files based on the first two columns. To find similar elements in two files, I have used something like
awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt
But how can we find similar elements across all the files when there are more than 2 input files?
Can anyone help?
With the current awk solution, the output ignores the rows with duplicate key columns and gives the output as
"xxx" 0 0
If we assume the output comes from file1.txt, the expected output is:
"aba" 0 0
"aba" 0 0 1
"xxx" 0 0
i.e. it should include the rows with duplicate key columns as well.
Try the following solution, generalized for N files. It saves the keys of the first file in an array with value 1, and increments that value for each hit from subsequent files. At the end, it compares each key's count with the number of files processed and prints only those that match.
awk '
FNR == NR { arr[$1,$2] = 1; next }
{ if ( arr[$1,$2] ) { arr[$1,$2]++ } }
END {
    for ( key in arr ) {
        if ( arr[key] != ARGC - 1 ) { continue }
        split( key, key_arr, SUBSEP )
        printf "%s %s\n", key_arr[1], key_arr[2]
    }
}
' file{1..3}
It yields:
"xxx" 0
"aba" 0
EDIT to add a version that prints the whole line (see comments). I've added another array with the same key where I save the line, and I also use it in the printf call. I've left the old code commented out.
awk '
##FNR == NR { arr[$1,$2] = 1; next }
FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
{ if ( arr[$1,$2] ) { arr[$1,$2]++ } }
END {
    for ( key in arr ) {
        if ( arr[key] != ARGC - 1 ) { continue }
        ##split( key, key_arr, SUBSEP )
        ##printf "%s %s\n", key_arr[1], key_arr[2]
        printf "%s\n", line[ key ]
    }
}
' file{1..3}
NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, changing line[$1,$2] = $0 to line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. At printing time I split on that separator (the SUBSEP variable) and print each entry.
awk '
FNR == NR {
    arr[$1,$2] = 1
    line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
    next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
    num_files = ARGC - 1
    for ( key in arr ) {
        if ( arr[key] < num_files ) { continue }
        split( line[ key ], line_arr, SUBSEP )
        for ( i = 1; i <= length( line_arr ); i++ ) {
            printf "%s\n", line_arr[ i ]
        }
    }
}
' file{1..3}
With new data edited in question, it yields:
"xxx" 0 0
"aba" 0 0
"aba" 0 0 1
This Python script will list the common lines among all files:
import sys
from functools import reduce

i, l = 0, []
for files in sys.argv[1:]:
    l.append(set())
    for line in open(files):
        l[i].add(" ".join(line.split()[0:2]))
    i += 1
commonFields = reduce(lambda s1, s2: s1 & s2, l)
for files in sys.argv[1:]:
    print("Common lines in", files)
    for line in open(files):
        for fields in commonFields:
            if fields in line:
                print(line, end="")
                break
Usage: python script.py file1 file2 file3 ...
For three files, all you need is:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file2.txt file3.txt
The FNR==NR block returns true only for the first file in the arguments list. The next statement in this block forces a skip over the remainder of the code. Therefore, ($1,$2) in a is evaluated for all files in the arguments list except the first. (Note that this prints lines from file2.txt and file3.txt whose key appears in file1.txt; it doesn't require the key to appear in every file.) To process more files in this way, all you need to do is list them.
If you need more powerful globbing on the command line, use extglob. You can turn it on with shopt -s extglob, and turn it off with shopt -u extglob. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt !(file1.txt)
If you have hard to find files, use find. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt $(find /path/to/files -type f -name "*[23].txt")
I assume you're looking for a glob range for 'N' files. For example:
awk 'FNR==NR { a[$1,$2]; next} ($1,$2) in a' file1.txt file{2,3}.txt

Awk merge the results of processing two files into a single file

I use awk to extract and calculate information from two different files, and I want to merge the results into a single file in columns (for example, the output of the first file in columns 1 and 2 and the output of the second in columns 3 and 4).
The input files contain:
file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
To parse the first file I do this:
awk '
{
s = NF
center = $1
}
{
printf "%s\t %d\n", center, s
}
' file1
To parse the second file I do this:
awk '
/^>/ {
    if (count != "")
        printf "%s\t %d\n", seq_id, count
    count = 0
    seq_id = $0
    next
}
NF {
    long = length($0)
    count = count + long
}
END {
    if (count != "")
        printf "%s\t %d\n", seq_id, count
}
' file2
My provisional solution is to create a temporary file in the first step and overwrite it in the second. Is there a more "elegant" way to get this output?
I am not fully clear on the requirement; if you can update the question, maybe we can improve the answer. However, what I have gathered is that you would like to summarize the output from both files. I have assumed that the content in both files is in sequential order. If that is not the case, we will have to add additional checks while printing the summary.
Content of script.awk (re-using most of your existing code):
NR==FNR {
    s[NR] = NF
    center[NR] = $1
    next
}
/^>/ {
    seq_id[++y] = $0
    ++i
    next
}
NF {
    long[i] += length($0)
}
END {
    # length() on an array requires gawk
    for (x=1; x<=length(s); x++) {
        printf "%s\t %d\t %d\n", center[x], s[x], long[x]
    }
}
Test:
$ cat file1
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 SRR513804.16872HWI ST695_116193610:4:1101:7150:72196 SRR513804.2106179HWI-
ST695_116193610:4:2206:10596:165949 SRR513804.1710546HWI-ST695_116193610:4:2107:13906:128004 SRR513804.544253
$ cat file2
>SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120
TTTTGTTTTTTCTATATTTGAAAAAGAAATATGAAAACTTCATTTATATTTTCCACAAAG
AATGATTCAGCATCCTTCAAAGAAATTCAATATGTATAAAACGGTAATTCTAAATTTTAT
ACATATTGAATTTCTTTGAAGGATGCTGAATCATTCTTTGTGGAAAATATAAATGAAGTT
TTCATATTTCTTTTTCAAAT
$ awk -f script.awk file1 file2
SRR513804.1218581HWI-ST695_116193610:4:1307:17513:49120 4 200
ST695_116193610:4:2206:10596:165949 3 0