Replace characters with awk

I have the following file:
61 12451
61 13451
61 14451
61 15415
12 48469
12 78456
12 47845
32 45778
32 48745
32 47845
32 52448
32 87451
The output I want is the following. For example, every 61 in the first column is replaced by 1 because it is the first distinct value, and it is repeated 4 times; the second column then runs from 2 to 5, since these are pairwise comparisons and the 1-to-1 pair is skipped, so the second column starts from 2. The same applies to the rest:
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
3 6
3 7
3 8
Any suggestion on how to achieve this with AWK? Thanks!

It could be written as one awk command like this:
awk '{a[NR]=$1;b[NR]=$2;c[NR]=$1;d[NR]=$2} END {for(i=1; i<=NR; i++){if(i==1){c[i]=1;d[i]=2}else if(a[i]==a[i-1]){c[i]=c[i-1];d[i]=1+d[i-1]}else{c[i]=1+c[i-1];d[i]=c[i]+1}print c[i],d[i]}}' pairwise.txt > output.txt
Here a and b are arrays holding the first and second columns of the file. The new values are stored in arrays c and d as the first and second output columns and are printed to the output file.
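Spread out with comments, the same logic looks like this (functionally identical; the unused array b and the initial values of c and d, which are always overwritten, are dropped):
awk '
{ a[NR] = $1 }                             # buffer column 1 of every row
END {
    for (i = 1; i <= NR; i++) {
        if (i == 1)              { c[i] = 1;          d[i] = 2 }            # first row starts at 1 2
        else if (a[i] == a[i-1]) { c[i] = c[i-1];     d[i] = d[i-1] + 1 }   # same group: step column 2
        else                     { c[i] = c[i-1] + 1; d[i] = c[i] + 1 }     # new group: step column 1
        print c[i], d[i]
    }
}' pairwise.txt > output.txt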

Not sure if this one-liner helps:
awk '$1!=p{++i;j=i+1}{print i,j++;p=$1}' file
At least it gives the desired output.
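For the record, the same one-liner with its logic spelled out in comments:
awk '
$1 != p { ++i; j = i + 1 }    # column 1 changed: start a new group i and reset the counter j
{ print i, j++; p = $1 }      # print the renumbered pair, advance j, remember the group key
' file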

Related

How to loop awk command over row values

I would like to use awk to search for a particular word in the first column of a table and print the value in the 6th column. I understand how to do this searching one word at a time using something along the lines of:
awk '$1 == "<insert-word>" { print $6 }' file.txt
But I was wondering if it is possible to loop this over a list of words in a row?
For example, if I had a table like file1.txt below:
cat file1.txt
dna1 dna4 dna5
dna3 dna6 dna2
dna7 dna8 dna9
Could I loop over each value in row 1 and search for that word in column 1 of file2.txt below, each time printing the value of column 6? Then do this for rows 2, 3 and so on...
cat file2
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
So, for example, looping the command over row 1 of file1 would return the numbers 4, 0 and 3.
Looping the command over row 2 would return the numbers 6, 8 and 1.
And finally, looping over row 3 would return the numbers 9, 2 and 3.
An example output might be
4 0 3
6 8 1
9 2 3
What I would really like to do is sum the total of the numbers returned for each row. I just wasn't sure if this would be possible...
An example output of this would be
7
15
14
But I am not worried if this step isn't possible using awk, as I could just do it separately.
Hope this makes sense
Cheers
Ollie
Yes, you can give awk multiple input files. For your example:
awk 'NR==FNR{a[$1]=a[$2]=a[$3]=1;next}a[$1]{print $6}' file1 file2
I didn't test the above one-liner, but it should work. At least you get the idea.
If you don't know how many columns are in your file1 and, as you said, want to loop over them:
awk 'NR==FNR{for(x=1;x<=NF;x++)a[$x]=1;next}a[$1]{print $6}' file1 file2
Update
Edit for the new requirement (per-row sums):
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)s+=a[$i];print s;s=0}' f2 f1
The output of the above one-liner (taking f1 and f2 as your example file1 and file2):
7
15
14
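For readability, here is the same command spread out with comments (taking file2 and file1 as the input names):
awk '
NR == FNR { a[$1] = $6; next }    # first file (file2): map each name to its 6th column
{
    s = 0
    for (i = 1; i <= NF; i++)     # second file (file1): look up every name in the row
        s += a[$i]                # missing names contribute 0
    print s                       # print the per-row total
}' file2 file1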

Match two files with duplicate ids in awk or sed

I have two files. File 1 has 3000 rows (1500 IDs) and File 2 has 1400 rows (700 IDs). File 1 contains all the IDs present in File 2. I have to match the ID column of File 1 and File 2 while maintaining the order of the IDs. If an ID from File 2 is present in File 1, then compare column 2 and print match or mismatch. The catch is that there are duplicate IDs and I need to keep them all. Looking for an awk or sed solution. Thanks!
File1
ID A
1 13
1 14
2 13
2 13
3 13
3 12
4 13
4 14
5 14
5 14
File 2
ID A
2 13
2 13
3 13
3 3
5 14
5 15
Desired output
ID A
2 13 Match
2 13 Match
3 13 Match
3 3 mismatch
5 14 Match
5 15 mismatch
You may use awk to achieve that:
awk '
NR==FNR { if (a[$1]=="") a[$1]=$2; next }
/[0-9]/ {
    if (a[$1]==$2) {
        print $0, "match"
    } else {
        print $0, "mismatch"
    }
}' File1 File2
Output:
2 13 match
2 13 match
3 13 match
3 3 mismatch
5 14 match
5 15 mismatch
Brief explanation:
NR==FNR{...}: in File1, save the id/value pair into array a if the id has not been seen previously
/[0-9]/: only process lines containing a digit, which skips the header line
if(a[$1]==$2): if the id and value match in File2, treat the record as a match, and a mismatch otherwise.
The easiest method would be to traverse the rows in File 2 and, for each row, find the matching ID in File 1. As you do not provide a programming language, here is the solution in pseudocode:
for all rows in file2
    for all rows in file1
        if current_row_file1.id = current_row_file2.id
        then
            if current_row_file1.value_column2 = current_row_file2.value_column2
            then
                print current_row_file2.id + current_row_file2.value_column2 + "Match"
            else
                print current_row_file2.id + current_row_file2.value_column2 + "Mismatch"
The code above takes some time, as you loop through all records in File 1 for every row in File 2. If the IDs in File 1 are ordered, you can use an algorithm like binary search to speed up the processing. See https://en.wikipedia.org/wiki/Binary_search_algorithm for an explanation.
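A rough awk rendering of the pseudocode above, as a sketch (assuming field 1 is the ID and field 2 the value, as in the sample files, and skipping the header lines). Note that, like the pseudocode, it compares each File 2 row against every File 1 row with the same ID, so IDs duplicated in File 1 yield multiple output lines per File 2 row:
awk '
FNR == 1 { next }                                      # skip the header line of both files
NR == FNR { id[NR] = $1; val[NR] = $2; n = NR; next }  # buffer all of File 1
{
    for (i = 1; i <= n; i++)                           # scan File 1 for matching IDs
        if (id[i] == $1)
            print $1, $2, (val[i] == $2 ? "Match" : "Mismatch")
}' File1 File2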

Filter rows with duplicates or triplicates++ by matching key and screening columns

I'm getting stuck with duplicate/triplicate filtering complexity. A solution in awk is preferred, but sort -u or uniq etc. would also work.
I want to filter rows that have either unique or exactly duplicated/triplicated etc. values in the first three columns. The whole line should be printed, including the fourth column, which is not compared against anything. Consider this tab-separated table:
Edit: $2 and $3 values don't have to be compared within one row. As recommended, I changed $3 values to 2xx.
name value1 value2 anyval
a 1 21 first
b 2 22 second
b 2 22 third
c 3 23 fourth
c 3 28 fifth
d 4 24 sixth
d 4 24 seventh
e 4 25 eighth
e 4 25 ninth
f 7 27 tenth
f 7 27 eleventh
f 7 27 twelveth
f 7 27 thirteenth
g 11 210 fourteenth
g 10 210 fifteenth
Line 1 is unique and should be printed.
Lines 2 + 3 contain exact duplicate values, one of them should be printed.
Lines 4 + 5 contain different value in col 3 and should be kicked out.
Lines 6 + 7 are duplicates, but they should be kicked out because lines 8 + 9 contain the same value in col 2.
Same for lines 8 + 9.
One of the lines 10 to 13 should be printed.
Desired output:
a 1 21 first
b 2 22 second
f 7 27 tenth
... or any other of the b and f lines.
What I've tried so far, without success:
awk '!seen[$1]++ && !seen[$2]'
prints the first line for each distinct value in col 1
a 1 21 first
b 2 22 second
c 3 23 fourth
d 4 24 sixth
e 4 25 eighth
f 7 27 tenth
awk '!seen[$1]++ && !seen[$2]++'
prints
a 1 21 first
b 2 22 second
c 3 23 fourth
d 4 24 sixth
f 7 27 tenth
Consequently, awk should print the desired result if:
awk '!seen[$1]++ && !seen[$2]++ && !seen[$3]++'
But the output is empty.
A different try: print dups in col 1, then repeat the same procedure for col 2 and col 3. This doesn't work because there are duplicates in col 2:
awk -F'\t' '{print $1}' file.txt |sort|uniq -d|grep -F -f - file.txt
This first prints the duplicates in col 1, without "a", which I could cat back in later:
b 2 22 second
b 2 22 third
c 3 23 fourth
c 3 28 fifth
d 4 24 sixth
d 4 24 seventh
e 4 25 eighth
e 4 25 ninth
f 7 27 tenth
f 7 27 eleventh
f 7 27 twelveth
f 7 27 thirteenth
But again, I'm getting stuck with repetitive values (e.g. 4) spanning multiple columns.
I think the solution could be to define col1 singlets and multiplets and screen for repetitive values in all other columns, but that's causing massive stack overflow in my brain.
I'm not 100% clear on the requirements, but you can filter the records in stages...
$ awk '!a[$1,$2,$3]++{print $0,$2}' file |
uniq -uf4 |
cut -d' ' -f1-4
a 1 1 first
b 2 2 second
f 7 7 tenth
The first awk filters out duplicate entries based on the first three fields and appends the second field as an extra key column for the next process; uniq, skipping the first four fields (-f4), compares only that appended key and removes all copies of duplicated keys (-u); cut gets rid of the extra key field.
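The same pipeline with per-stage comments (note that uniq only compares adjacent lines, which works here because equal keys arrive grouped):
awk '!a[$1,$2,$3]++ { print $0, $2 }' file |  # de-duplicate on the first three fields, append $2 as a key
    uniq -u -f4 |                             # keep only lines whose appended key is not repeated
    cut -d' ' -f1-4                           # drop the helper key column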
UPDATE
For filtering on both unique $2 and $3 fields, we have to revert to awk:
$ awk '!a[$1,$2,$3]++ {f2[$2]++; f3[$3]++; line[$2,$3]=$0}
END {for(i in f2)
for(j in f3)
if((i,j) in line && f2[i]*f3[j]==1) print line[i,j]}' file |
sort
a 1 21 first
b 2 22 second
f 7 27 tenth

AWK: Comparing two different columns in two files

I have these two files
File1:
9 8 6 8 5 2
2 1 7 0 6 1
3 2 3 4 4 6
File2: (which has over 4 million lines)
MN 1 0
JK 2 0
AL 3 90
CA 4 83
MK 5 54
HI 6 490
I want to compare field 6 of file1 with field 2 of file2. If they match, then append field 3 of file2 to the end of the file1 line.
I've looked at other solutions but I can't get it to work correctly.
Desired output:
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
My attempt:
awk 'NR==FNR{a[$2]=$2;next}a[$6]{print $0,a[$6]}' file2 file1
The program just hangs after that.
To print all lines in file1, with the matched value appended when available:
$ awk 'FNR==NR{a[$2]=$3;next;} {print $0,a[$6];}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
To print only the lines that have a match:
$ awk 'NR==FNR{a[$2]=$3;next} $6 in a {print $0,a[$6]}' file2 file1
9 8 6 8 5 2 0
2 1 7 0 6 1 0
3 2 3 4 4 6 490
Note that I replaced a[$2]=$2 with a[$2]=$3 and changed the test a[$6] (which is false if the value is zero) to $6 in a.
Your own attempt basically has two bugs, as seen in John1024's answer:
You use field 2 as both key and value in a, where you should be storing field 3 as the value (since you want to keep it for later), i.e., it should be a[$2] = $3.
The test a[$6] is false when the value in a is zero, even if it exists. The correct test is $6 in a.
Hence:
awk 'NR==FNR { a[$2]=$3; next } $6 in a {print $0, a[$6] }' file2 file1
However, there might be better approaches, but it is not clear from your specifications. For instance, you say that file2 has over 4 million lines, but it is unknown if there are also that many unique values for field 2. If yes, then a will also have that many entries in memory. And, you don't specify how long file1 is, or if its order must be preserved for output, or if every line (even without matches in file2) should be output.
If it is the case that file1 has many fewer lines than file2 has unique values for field 2, and only matching lines need to be output, and order does not need to be preserved, then you might wish to read file1 first…
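One possible shape of that idea, as a sketch: buffer the small file1, stream the large file2 while keeping only the values that are actually needed, and replay file1 in its original order at the end. This assumes only matching lines should be printed and that file1 fits comfortably in memory:
awk '
NR == FNR {                        # file1 (small): remember each line and its field-6 key
    line[NR] = $0; key[NR] = $6; want[$6]; n = NR; next
}
$2 in want { val[$2] = $3 }        # file2 (huge): store field 3 only for wanted keys
END {                              # replay file1 in order, appending the matched value
    for (i = 1; i <= n; i++)
        if (key[i] in val)
            print line[i], val[key[i]]
}' file1 file2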

AWK: print columns of a matrix using first column as reference

I want to read the first column of a matrix, and then print columns of this matrix using this first column as a reference. An example:
mat.txt
2 10 6 12 3
4 11 1 22 6
5 15 3 18 9
Using the first column as a reference, I would like to get columns 2, 4 and 5, and also put the value of the first column at the beginning:
2 10 12 3
4 11 22 6
5 15 18 9
I tried this, but it doesn't work well:
awk 'FNR==NR{c++;cols[c]=$1;end}
{for(i=1;i<=c;i++) printf("%s%s",$(cols[i]+1),i<c ? OFS : "\n")}' mat.txt mat.txt
This may do:
awk 'FNR==NR {a[NR]=$1;next} {printf "%s ",a[FNR];for (i in a) printf "%s ",$(a[i]);print ""}' mat.txt{,}
2 10 12 3
4 11 22 6
5 15 18 9
The {,} is a shell brace expansion that makes the file name appear twice, so the file is read two times.
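One caveat: for (i in a) does not guarantee any particular traversal order in awk, so the columns may come out shuffled depending on the implementation. A safer sketch of the same idea uses an indexed loop:
awk '
FNR == NR { a[FNR] = $1; n = FNR; next }  # pass 1: collect the reference column
{
    printf "%s", a[FNR]                   # start with the value from column 1
    for (i = 1; i <= n; i++)              # indexed loop keeps the column order
        printf " %s", $(a[i])
    print ""
}' mat.txt mat.txt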