Partial id match and merge multiple to one - awk

I have two files, File1 and File2. File1 has 6000 rows and File2 has 3000 rows. I want to match the ids and merge the files based on the matches, which is simple, but the ids in File1 and File2 only match partially. Have a look at the files: for every id (row) in File2 there should be two matching ids (rows) in File1. Also, not all the ids in File2 are present in File1. I tried awk but didn't get the desired output.
File1
1_A01_A
1_A01_B
2_B03_A
2_B03_B
1_A02_A
1_A02_B
2_B04_A
2_B04_B
1_A03_A
1_A03_B
2_B05_A
2_B05_B
1_A04_A
1_A04_B
2_B06_A
2_B06_B
1_A06_A
1_A06_B
2_B07_A
2_B07_B
1_A07_A
1_A07_B
2_B08_A
2_B08_B
9_F10_A
9_F10_B
12_D08_A
12_D08_B
5505744243493_F09.CEL_A_A
5505744243493_F09.CEL_B_B
File2
1_A01 14
2_B03 13
1_A02 4
2_B04 14
1_A03 11
2_B05 8
1_A04 18
2_B06 15
1_A06 10
2_B07 4
1_A07 8
2_B08 22
1_A08 5
2_B09 15
1_A09 20
2_B10 17

awk -F" " 'FNR==NR{a[$1]=$2;next}{for(i in a){if($1~i){print $1" "a[i];next}}}' file1.txt file2.txt
FNR==NR will be true while awk reads file1.txt and false when it reads file2.txt. The part of the code starting from for(i in a) is executed for file2.txt: $1~i performs a partial (regex) match, like SQL's LIKE, and for each match the merged line is printed.
Note: by mistake I have swapped the file names here; my file1.txt contains the content of file2.txt from the problem statement and vice versa.
Output
1_A01_A 14
1_A01_B 14
2_B03_A 13
2_B03_B 13
1_A02_A 4
1_A02_B 4
2_B04_A 14
2_B04_B 14
1_A03_A 11
1_A03_B 11
2_B05_A 8
2_B05_B 8
1_A04_A 18
1_A04_B 18
2_B06_A 15
2_B06_B 15
1_A06_A 10
1_A06_B 10
2_B07_A 4
2_B07_B 4
1_A07_A 8
1_A07_B 8
2_B08_A 22
2_B08_B 22
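As an aside, since every File1 id appears to be a File2 id plus an _A/_B suffix (an assumption based on the samples above), the inner loop, which costs 6000 x 3000 regex tests, can be avoided by stripping the suffix and doing a direct array lookup. This also stops ids being treated as regexes (the dot in 5505744243493_F09.CEL would otherwise match any character). A sketch, reading the problem statement's File2 (the id-to-value mapping) first:
awk 'FNR==NR{a[$1]=$2;next} {id=$1; sub(/_[^_]*$/,"",id); if(id in a) print $1" "a[id]}' file2 file1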

This might work for you (GNU sed):
sed -r 's|^(\S+)(\s+\S+)$|s/^\1.*/\&\2/p|' file2 | sed -nf - file1
This creates a sed script from file2 and then runs it against the data in file1.
N.B. The order of either file is unimportant and file1 is processed only once.
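For the sample file2 above, the first sed generates one substitution command per id, e.g.:
$ sed -r 's|^(\S+)(\s+\S+)$|s/^\1.*/\&\2/p|' file2 | head -2
s/^1_A01.*/& 14/p
s/^2_B03.*/& 13/p
The second sed (-nf -) then runs that script against file1, printing each file1 row that begins with an id, with that id's value appended.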

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk, checking that the row number is greater than 3 and that it is even with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using GNU sed, where the address 4~2 selects every 2nd line starting from line 4:
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You do not need cat whilst using GNU sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
Consider whether you need that substitution at all: your sample input does not contain a pipe character, so it does nothing. You can drop that part if the lines holding the values you want to print do not contain that character.
The output is empty because NR%2==4 never holds: the remainder of a division by x is always smaller than x (in the particular case of %2, only two values are possible, 0 and 1).
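A quick way to convince yourself, since NR%2 can only ever be 0 or 1:
$ awk 'BEGIN{for(n=1;n<=4;n++) print n, n%2}'
1 1
2 0
3 1
4 0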
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing with the -n option and reduce backslashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. \2 holds the last occurrence of that back reference, which in conjunction with the {3} repetition means it contains the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file
Here the n command fetches the next line without printing it (because of -n), so the substitution runs only on every other line from line 4 onwards.

Comparing two files based on 1st column, printing the unique part of one file

I have two files looking like this:
file1:
RYR2 29 70 0.376583106063 4.77084855376
MUC16 51 94 0.481067457376 3.9233164551
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
file2:
A26B2
RYR2
MUC16
ACTL9
I need to compare them based on first column and print only those lines of first file that are not in second, so the output should be:
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
I tried with grep:
grep -vFxf file2 file1
with awk:
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file2 file1
comm:
comm -23 <(sort file1) <(sort file2)
nothing works
You can use
grep -vFf file2 file1
Also, grep -vf file2 file1 will work too, but in case the file2 strings contain * or [ that should be read in as literal characters, you might get into trouble, since they would need to be escaped. -F makes grep treat those strings as fixed strings.
NOTES
-v: Invert match.
-f file: Take regexes from a file.
-F: Interpret the pattern as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
So, it reads the strings from file2 and applies them to file1, and once it finds a match, that line is not output due to the inverted search. This is enough because only the first column contains alphanumerics; the rest contain numeric data only.
Why your command did not work
The -x (short for --line-regexp) option means "Select only those matches that exactly match the whole line". Since every file1 line carries extra columns after the gene name, nothing matches exactly, and with -v inverting the result the whole of file1 is printed.
Also, see more about grep options in grep documentation.
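Incidentally, the awk attempt also works once both files are keyed on the first column only; the original compared whole lines of file1 (which carry the numeric columns) against bare gene names, so nothing was ever excluded:
awk 'NR==FNR {exclude[$1]; next} !($1 in exclude)' file2 file1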

summarizing the contents of a text file to an other one using awk

I have a big text file with 2 tab-separated fields. As you can see in the small example, every 2 lines have a number in common. I want to summarize my text file this way: look for the lines that have the number in common and sum up the second column of those lines.
small example:
ENST00000054666.6 2
ENST00000054666.6_2 15
ENST00000054668.5 4
ENST00000054668.5_2 10
ENST00000054950.3 0
ENST00000054950.3_2 4
expected output:
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
As you can see, the difference is in both columns: in the 1st column there is only one copy of each common id, without the "_2" suffix, and in the 2nd column the value is the sum of both lines (which share the common number in the input file).
I tried this code but it does not return what I want:
awk -F '\t' '{ col2 = $2, $2=col2; print }' OFS='\t' input.txt > output.txt
Do you know how to fix it?
Solution 1: The following awk may help you:
awk '{sub(/_.*/,"",$1)} {a[$1]+=$NF} END{for(i in a){print i,a[i]}}' Input_file
Solution 2: In case your Input_file is sorted by the 1st field, the following may help:
awk '{sub(/_.*/,"",$1)} prev!=$1 && prev{print prev,val;val=""} {val+=$NF;prev=$1} END{if(val){print prev,val}}' Input_file
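Run on the sample above (already sorted), the second solution preserves the input order:
$ awk '{sub(/_.*/,"",$1)} prev!=$1 && prev{print prev,val;val=""} {val+=$NF;prev=$1} END{if(val){print prev,val}}' input.txt
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4
One small caveat: the END guard if(val) drops a final group whose sum is 0; testing if(prev) instead avoids that edge case.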
Append > output.txt at the end of the above commands in case you need the output in an output file too.
If order is not a concern, the below may also help:
awk -v FS="\t|_" '{count[$1]+=$NF}
END{for(i in count){printf "%s\t%s%s",i,count[i],ORS;}}' file
ENST00000054668.5 14
ENST00000054950.3 4
ENST00000054666.6 17
Edit:
If the order of the output does matter, the below approach using a flag helps:
$ awk -v FS="\t|_" '{count[$1]+=$NF;++i;
if(i==2){printf "%s\t%s%s",$1,count[$1],ORS;i=0}}' file
ENST00000054666.6 17
ENST00000054668.5 14
ENST00000054950.3 4

awk: how to match two files and get the output merging these files

I have the following problem: I have two files with 10 columns each and a different number of rows. I want to compare column 7 and, if there is a match, export the complete row of both files into a new one. I'm trying
awk 'NR==FNR{a[$7]=$0;next}!a[$7]' file1 file2 > output
but that only gives me the lines of file2 that have no match in file1. I don't mind if I only get col 10 of file1 when there is a match. Any suggestion?
Thanks!
Assuming both files have at least 7 columns, you forgot to tell awk to print both values. Your current solution checks whether the 7th value from file2 is NOT in the array of file1 values, so it will only print those lines of file2 into the new file. Simply test that the value is in the array and then place a print inside some curly braces:
awk 'NR==FNR{a[$7]=$0;next}$7 in a{print a[$7],$0}' file1 file2 > output
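A minimal sketch with two hypothetical 10-column files f1 and f2, where only the 7th column (the KEY* values) drives the join:
$ cat f1
a b c d e f KEY1 x y z
a b c d e f KEY2 x y z
$ cat f2
p q r s t u KEY2 v w x
$ awk 'NR==FNR{a[$7]=$0;next} $7 in a{print a[$7],$0}' f1 f2
a b c d e f KEY2 x y z p q r s t u KEY2 v w x
Note that a[$7]=$0 keeps only the last file1 row per key, so if column 7 can repeat within file1 the rows would need to be accumulated instead.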

AWK Retrieve text after a certain pattern where the 1st and 2nd columns match the values in the 1st and 2nd columns in an input file

My input file (file1) looks like this:
part position col3 col4 info
part1 34 1 1 NAME=Mark;AGE=23;HEIGHT=189
part2 55 1 1 NAME=Alice;AGE=43;HEIGHT=167
part2 19 1 1 NAME=Emily;AGE=16;HEIGHT=164
part3 23 1 1 NAME=Owen;AGE=55;HEIGHT=181
part3 99 1 1 NAME=Rachel;AGE=76;HEIGHT=162
I need to retrieve the text after "NAME=" in the info column, but only if the values in the first two columns match another file (file2).
part position
part2 55
part3 23
Then only the 2nd and 4th rows will be considered, and the text after "NAME=" in those rows is put into the output file:
Alice
Owen
I don't need to preserve the order of the original rows, so the following output is equally valid:
Owen
Alice
My (not very good) attempt:
awk -F, 'FNR==NR {a[$1]=$5; next}; $1 in a {print a[$1]}' file1 file2
Something like,
awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}'
Example
$ awk -F"[ =;]" 'FNR==NR{found[$1" "$2]=$6; next} $1" "$2 in found{print found[$1" "$2]}' file1 file2
Alice
Owen
What it does?
-F"[ =;]" -F sets the field separators. Here we set it to space or = or ;. This makes it easier to get the name from the first file without using a split function.
found[$1" "$2]=$6 This block is run only for file1, here we save the names $6 in the associative array found indexed by part position
$1" "$2 in found{print found[$1" "$2]} This is executed for the second file. Checks if the part position is found in the array, if yes print the name from the array
Using GNU awk, the below would do the same:
awk 'NR>1 && NR==FNR{found[$1","$2];next}\
$1","$2 in found{print gensub(/^NAME=([^;]*).*/,"\\1","1",$NF);}' file2 file1
Output
Alice
Owen