Obtain patterns from a file, compare to a column of another file, and replace with column of third file, using awk

I have three files, f1.txt, f2.txt and f3.txt, whose columns have different lengths, as shown below. I am trying to match the patterns in file 2 against file 1 and, where a match is found, replace that line of file 1 with the corresponding entry from file 3. File 2 and file 3 hold the same values, except that the values in file 3 have leading zeros.
File 1:
8841
841
526
548
547
88
98
File 2:
841
526
548
547
File 3:
00841
0526
000548
00547
The desired output, written to File 1 or to another file, is:
8841
00841
0526
000548
00547
88
98
I tried a one-line command from a previous post, but it only matches the files; it does not replace the matched values with those from the third file. I am new to shell scripting, so please suggest a one-line command or script that achieves this. I am open to using sed or any other shell tool.
awk 'BEGIN{i=0}
FNR==NR { a[i++]=$1; next }
{ for(j=0;j<i;j++)
if(index($0,a[j]))
print $0
}' file2 file1

file2 is of no use. Just use file1 and file3:
$ awk 'NR==FNR{a[$0+0]=$0; next} {print ($0 in a ? a[$0] : $0)}' file3 file1
8841
00841
0526
000548
00547
88
98

Using your file1 and file3 you can do something like:
$ cat file1
8841
841
526
548
547
88
98
$ cat file3
00841
0526
000548
00547
$ awk 'NR==FNR{x=$1;gsub(/^0+/,"",$1);a[$1]=x;next}($1 in a){print a[$1];next}1' file3 file1
8841
00841
0526
000548
00547
88
98
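If the replacement should happen only for values that actually appear in file2, as the question describes, file2 can be read first as a filter. The following is a minimal sketch rather than part of the answers above: the scratch directory and the array names want and padded are illustrative assumptions, and it assumes every line is a plain number.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 8841 841 526 548 547 88 98 > "$dir/file1"
printf '%s\n' 841 526 548 547 > "$dir/file2"
printf '%s\n' 00841 0526 000548 00547 > "$dir/file3"

out=$(awk '
    # First file (file2): remember which values may be replaced.
    FILENAME == ARGV[1] { want[$1 + 0]; next }
    # Second file (file3): keep the zero-padded form, keyed by numeric value,
    # but only for values that file2 listed.
    FILENAME == ARGV[2] { if (($1 + 0) in want) padded[$1 + 0] = $1; next }
    # Third file (file1): print the padded form when one exists.
    { print ((($1 + 0) in padded) ? padded[$1 + 0] : $1) }
' "$dir/file2" "$dir/file3" "$dir/file1")

printf '%s\n' "$out"
rm -rf "$dir"
```

With the sample data this prints the same seven lines as the desired output, because every value in file2 has a padded counterpart in file3.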

You can avoid file3 and use printf in awk to format the output with leading zeros. Note that this assumes each file holds its values on a single line, as in the output shown below.
Using awk
awk 'NR==FNR{a[$1 FS $2 FS $3 FS $4];next} {if ($2 FS $3 FS $4 FS $5 in a) printf "%s %05d %04d %06d %05d %s %s\n",$1,$2,$3,$4,$5,$6,$7}' file2 file1
8841 00841 0526 000548 00547 88 98

Related

grep file matching specific column

I want to keep only the lines in results.txt that matched the IDs in uniq.txt based on matches in column 3 of results.txt. Usually I would use grep -f uniq.txt results.txt, but this does not specify column 3.
uniq.txt
9606
234831
131
31313
results.txt
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
With your shown samples, please try following code.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' uniq.txt results.txt
Explanation:
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when uniq.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
($3 in arr) ##If 3rd field is present in arr then print line from results.txt here.
' uniq.txt results.txt ##Mentioning Input_file names here.
2nd solution: In case your field number is not set in results.txt and you want to search values in whole line then try following.
awk 'FNR==NR{arr[$0];next} {for(key in arr){if(index($0,key)){print;next}}}' uniq.txt results.txt
You can use grep in combination with sed to turn the IDs into column-anchored patterns and achieve what you're looking for
grep -Ef <(sed -e 's/^/^(\\S+\\s+){2}/;s/$/(\\s|$)/' uniq.txt) results.txt
If you want to match the nth column, replace 2 in the above command with n-1. The trailing (\s|$) keeps an ID such as 9606 from also matching a longer value such as 96061.
outputs
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
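For reference, the awk approach can take the column number as a variable, so the same command works for any column. This is a small sketch under the assumption that the IDs must equal the whole field exactly; the variable name col and the scratch directory are made up for the example.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 9606 234831 131 31313 > "$dir/uniq.txt"
cat > "$dir/results.txt" <<'EOF'
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
EOF

# col selects the field to compare; a line is printed only when that
# field is exactly one of the IDs read from the first file.
out=$(awk -v col=3 '
    FNR == NR { ids[$1]; next }
    ($col in ids)
' "$dir/uniq.txt" "$dir/results.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

Because the comparison is an exact array lookup rather than a regex, the header line and the 443143 row are dropped without any anchoring tricks.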

combining and processing 2 tab separated files in awk and make a new one

I have two tab-separated files with two columns each: column 1 is a number and column 2 is an ID, as in these two examples:
example file1:
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
example file2:
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
I want to find the rows common to the two files based on column 2 (the ID column) and make a new tab-separated file with 4 columns: column 1 is the common ID, column 2 is the number from file1, column 3 is the number from file2, and column 4 is the log2 of the ratio of columns 2 and 3, i.e. log2(column2/column3). For example, for the ID "TPT1": column 1 is TPT1, column 2 is 188, column 3 is 2217, and column 4 is log2(188/2217), which is equal to -3.561494.
Here is the expected output:
TPT1 188 2217 -3.561494
ATP5C1 420 223 0.9133394
I am trying to do that in AWK using the following code:
awk 'NR==FNR { n[$2]=$0;next } ($2 in n) { print n[$2 '\t' $1] '\t' $1 '\t' log(n[$1]/$1)}' file1.txt file2.txt > result.txt
This code does not return what I expect. Do you know how to fix it?
$ awk -v OFS="\t" 'NR==FNR {n[$2]=$1;next} ($2 in n) {print $2, n[$2], $1, log(n[$2]/$1)/log(2)}' file1 file2
TPT1 188 2217 -3.5598
ATP5C1 420 223 0.913346
I'd use join to actually merge the files instead of awk:
$ join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt) |
awk -v OFS="\t" '{ print $1, $2, $3, log($2/$3)/log(2) }'
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598
The join program, well, joins two files on a common value. It does require the files to be sorted based on the join column, but your examples aren't, hence the inline sorting of the data files. Its output is then piped to awk to compute the log2 of the numbers of each line and produce tab-delimited results.
An alternative using perl, which gives you more default precision if you care about that (and don't want to mess with awk's OFMT variable):
$ join -j2 <(sort -k2 a.txt) <(sort -k2 b.txt) |
perl -lane 'print join("\t", @F, log($F[1]/$F[2])/log(2))'
ATP5C1 420 223 0.913345617745818
TPT1 188 2217 -3.55980420318967
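If more digits are wanted while staying in awk, printf with an explicit format is one option. The sketch below is only an illustration: the choice of six decimal places and the scratch directory are assumptions, not something the question asked for.

```shell
#!/bin/sh
# Recreate the two tab-separated sample files in a scratch directory.
dir=$(mktemp -d)
printf '%s\t%s\n' 188 TPT1 133 ACTR2 420 ATP5C1 942 DNAJA1 > "$dir/file1.txt"
printf '%s\t%s\n' 91 PSMD7 2217 TPT1 223 ATP5C1 156 TCP1 > "$dir/file2.txt"

out=$(awk -F'\t' '
    # file1: remember the number for each ID.
    NR == FNR { n[$2] = $1; next }
    # file2: for IDs seen in file1, print ID, file1 number, file2 number
    # and log2(file1/file2) with a fixed number of decimals.
    ($2 in n) { printf "%s\t%d\t%d\t%.6f\n", $2, n[$2], $1, log(n[$2] / $1) / log(2) }
' "$dir/file1.txt" "$dir/file2.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

Rows come out in the order of file2, so TPT1 is printed before ATP5C1; pipe the result through sort if a different order is needed.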
awk + sort approach
awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
with the given inputs
$ cat ellyx.txt
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
$ cat ellyy.txt
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
$ awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598
$

Awk to update file based on match and condition in another

The awk below produces the tab-delimited file1, with the difference $3-$2 calculated for each line and printed in $6. Before the awk is executed, only 5 fields exist.
What I am having trouble with is updating each $2 value in file2 with the $7 value of file1 if the $1 value of file2 matches the $5 of file1 and $6 in file1 is not intron. If the value of $5 is intron, then the value of $7 in file1 is zero. For example, line 1 in file1 is intron, so it is equivalent to zero, or skipped (those lines are not needed in the calculation).
It is possible that a $1 value in file2 does not exist in file1, in which case the value of $2 in file2 is zero. Line 3 in file2 is an example: it is set to zero because it does not exist in file1. Thank you :).
Awk w/ output
awk '
FNR==NR{ # process same line
b[$4]=$3-$2;
next # process next line
}
{
a[$5]+=($3-$2)
}
{
split($1, b, " "); print b[0], a[b[0]]
}' OFS="\t" file1 file2
Output
-2135
-2135
-2222
-2351
-2351
-2414
File1 tab-delimited
chr5 86667863 86667879 RASA1 intron 16
chr5 86669977 86669995 RASA1 splicing 18
chr5 86670703 86670805 RASA1 exon 102
chr5 86679453 86679547 RASA1 intron 94
chr5 86679571 86679673 RASA1 exon 102
chr19 15088950 15088961 NOTCH2 intron 50
chr19 15288950 15288961 NOTCH3 intron 11
chr19 15308240 15308275 NOTCH3 exon 35
File2 space delimited
RASA1 2135
NOTCH2 0
GIMAP8 87
NOTCH3 129
FOXF2 0
PRB3 63
Desired output after file2 is updated
RASA1 222 `(102+102+18)`
NOTCH2 0
GIMAP8 0
NOTCH3 35 `(35)`
FOXF2 0
PRB3 0
Maybe adding a | after the first awk with:
awk 'FNR==NR { a[$1]=$7; next } { if(a[$5]){$1=a[$5] }; print }'
To update file2
Could you please try the following? It prints the output in the same order as the lines of Input_file2.
awk '
FNR==NR{
if(!b[$1]++){
c[++count]=$1
}
a[$1]
next
}
($4 in a) && $5!="intron"{
a[$4]+=$NF
}
END{
for(i=1;i<=count;i++){
print c[i],a[c[i]]?a[c[i]]:0
}
}' Input_file2 Input_file1
Your sample Input_file1 is not actually tab-delimited, despite your description. If the real file is, change Input_file2 Input_file1 to Input_file2 FS="\t" Input_file1. To get tab-delimited output, set OFS="\t" next to FS="\t" too; alternatively, piping the output through column -t aligns the columns, though it pads them with spaces.
Output will be as follows.
RASA1 222
NOTCH2 0
GIMAP8 0
NOTCH3 35
FOXF2 0
PRB3 0
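To make the per-file separator concrete, here is a sketch that builds a genuinely tab-delimited file1 and a space-delimited file2, then sets FS between the two file names on the command line. It is a variation for illustration, not the answer's exact code: the array names order and sum and the scratch directory are assumptions, and it sums the precomputed 6th column.

```shell
#!/bin/sh
dir=$(mktemp -d)
# file1: tab-delimited, 6 fields (the last one is $3-$2 as given in the question).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr5  86667863 86667879 RASA1  intron   16 \
    chr5  86669977 86669995 RASA1  splicing 18 \
    chr5  86670703 86670805 RASA1  exon     102 \
    chr5  86679453 86679547 RASA1  intron   94 \
    chr5  86679571 86679673 RASA1  exon     102 \
    chr19 15088950 15088961 NOTCH2 intron   50 \
    chr19 15288950 15288961 NOTCH3 intron   11 \
    chr19 15308240 15308275 NOTCH3 exon     35 > "$dir/file1"
# file2: space-delimited gene list with the values to be replaced.
printf '%s %s\n' RASA1 2135 NOTCH2 0 GIMAP8 87 NOTCH3 129 FOXF2 0 PRB3 63 > "$dir/file2"

out=$(awk '
    # file2 (default whitespace FS): remember the genes in input order.
    FNR == NR { order[++n] = $1; next }
    # file1 (FS is a tab by now): add up the lengths of non-intron lines.
    $5 != "intron" { sum[$4] += $6 }
    # Genes without any non-intron line get 0.
    END { for (i = 1; i <= n; i++) print order[i], sum[order[i]] + 0 }
' OFS='\t' "$dir/file2" FS='\t' "$dir/file1")

printf '%s\n' "$out"
rm -rf "$dir"
```

An assignment placed between file names takes effect only when awk reaches it, which is why file2 is still split on whitespace while file1 is split on tabs.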
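If I understood it correctly, this should do what you expect. The intron test sits inside the first block so that intron lines of file1 are skipped instead of falling through to the second block and being printed:
$ awk 'FNR==NR {if ($5!="intron") a[$4]+=$3-$2; next}
{$2=($1 in a)?a[$1]:0}1' file1 file2 > file2.updated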

Modify tab delimited txt file

I want to modify a tab-delimited txt file using Linux commands (sed, awk, or any other method).
This is an example of the tab-delimited txt file, which I want to modify to serve as input for an R boxplot:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it modified so that the 4th column of the odd rows becomes the 1st column, and the 2nd column of the even rows (whose 1st column is blank) becomes the 2nd column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: Since the OP has changed the Input_file sample a bit, adding code for it too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval; you do not need it with a recent version.
The following single awk command may also help.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation:
awk '
/Ex[0-9]+_inner/{ ##Checking condition here if a line contains string Ex then digits _inner if yes then do following actions.
val=$NF; ##Creating variable named val whose value is $NF(last field of current line).
getline; ##using getline which is out of the box keyword of awk to take the cursor to the next line from current line.
sub(/^ +/,"",$1); ##Using sub utility of awk to substitute initial space of first field with NULL.
print val,$1 ##Printing variable named val and first field value here.
}
' Input_file ##Mentioning the Input_file name here.
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file
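Since the result is meant as input for R, a tab-delimited variant may be handy. The sketch below remembers the label from each chr line and pairs it with the first value on the following line. It assumes the blank first column of the data lines is a leading tab (which awk's default field splitting skips); the shortened sample, the variable name label and the scratch directory are illustrative only.

```shell
#!/bin/sh
dir=$(mktemp -d)
# A shortened version of the sample input; data lines start with a tab.
{
    printf 'chr8\t38277027\t38277127\tEx8_inner\n'
    printf '\t25425\t8\t100\t0.0800000\n'
    printf 'chr9\t38277027\t38277127\tEx9_inner\n'
    printf '\t25427\t9\t100\t0.0900000\n'
    printf 'all\t687\t1\t1000\t0.0010000\n'
} > "$dir/input.txt"

out=$(awk -v OFS='\t' '
    /^all/ { next }                              # drop the summary rows
    /^chr/ { label = $NF; next }                 # odd rows: keep the last column
    label != "" { print label, $1; label = "" }  # even rows: pair it with the value
' "$dir/input.txt")

printf '%s\n' "$out"
rm -rf "$dir"
```

Resetting label after each pair means a stray data line without a preceding chr line is ignored rather than printed with a stale label.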

extract values from a text file with awk

I would like to extract column 1 from the text files based on the values of column 2. I need to print column 1 only if column 2 is greater than 20. I also need to print the name of the file with the output. How can I do this with awk?
file1.txt
alias 23
samson 10
george 24
file2.txt
andrew 12
susan 16
david 25
desired output
file1
alias
george
file2
david
awk '{ if($2 > 20) { print FILENAME " " $1 } }' <files>
This might work for you:
awk '$2>20{print $1}' file1 file2
if you want file names and prettier printing:
awk 'FNR==1{print FILENAME} $2>20{print " ",$1}' file1 file2
awk '$2>20{if(file!=FILENAME){print FILENAME;file=FILENAME}print}' file1 file2
see below:
> awk '$2>20{if(file!=FILENAME){print FILENAME;file=FILENAME}print}' file1 file2
file1
alias 23
george 24
file2
david 25
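To get exactly the layout in the question (file name without its extension, then only column 1), the header trick above can be combined with sub(). This is a sketch under the assumption that the files end in .txt and are given without a directory part; the variable names prev and name are made up for the example.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s %s\n' alias 23 samson 10 george 24 > "$dir/file1.txt"
printf '%s %s\n' andrew 12 susan 16 david 25 > "$dir/file2.txt"

# Run awk inside the directory so FILENAME carries no path prefix.
out=$(cd "$dir" && awk '
    $2 > 20 {
        # Print the file name (minus .txt) once, before its first match.
        if (FILENAME != prev) {
            name = FILENAME
            sub(/\.txt$/, "", name)
            print name
            prev = FILENAME
        }
        print $1
    }
' file1.txt file2.txt)

printf '%s\n' "$out"
rm -rf "$dir"
```

A file with no value above 20 produces no header at all, which differs from the FNR==1 variant above that always prints the file name.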