combining and processing 2 tab-separated files in awk to make a new one - awk

I have 2 tab-separated files with 2 columns: column 1 is a number and column 2 is an ID, like these 2 examples:
example file1:
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
example file2:
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
I want to find the common rows of the 2 files based on column 2 (the ID column) and make a new tab-separated file with 4 columns: column 1 is the common ID, column 2 is the number from file1, column 3 is the number from file2, and column 4 is the log2 of the ratio of columns 2 and 3, i.e. log2(column2/column3). For example, for the ID "TPT1": column 1 is TPT1, column 2 is 188, column 3 is 2217, and column 4 is log2(188/2217), which equals -3.559804.
Here is the expected output:
TPT1 188 2217 -3.559804
ATP5C1 420 223 0.9133456
I am trying to do that in AWK using the following code:
awk 'NR==FNR { n[$2]=$0;next } ($2 in n) { print n[$2 '\t' $1] '\t' $1 '\t' log(n[$1]/$1)}' file1.txt file2.txt > result.txt
This code does not return what I expect. Do you know how to fix it?

Store file1's number keyed by ID while reading the first file, then look up each ID of file2 in that array. Note that awk's log() is the natural logarithm, so divide by log(2) to get log2:
$ awk -v OFS="\t" 'NR==FNR {n[$2]=$1;next} ($2 in n) {print $2, n[$2], $1, log(n[$2]/$1)/log(2)}' file1 file2
TPT1 188 2217 -3.5598
ATP5C1 420 223 0.913346

I'd use join to actually merge the files instead of awk:
$ join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt) |
awk -v OFS="\t" '{ print $1, $2, $3, log($2/$3)/log(2) }'
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598
The join program, well, joins two files on a common value. It does require the files to be sorted on the join column, but your examples aren't, hence the inline sorting of the data files. Its output is then piped to awk to compute the log2 of the ratio of the numbers on each line and produce tab-delimited results.
An alternative using perl, which gives you more default precision if you care about that (and don't want to mess with awk's OFMT variable):
$ join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt) |
perl -lane 'print join("\t", @F, log($F[1]/$F[2])/log(2))'
ATP5C1 420 223 0.913345617745818
TPT1 188 2217 -3.55980420318967
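If you'd rather stay with awk, you can raise its printed precision the same way by setting OFMT, the format print uses for numbers (%.6g by default); a minimal sketch, with %.15g as an arbitrary choice:
$ join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt) |
awk -v OFS="\t" -v OFMT="%.15g" '{ print $1, $2, $3, log($2/$3)/log(2) }'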

awk + sort approach
awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
with the given inputs
$ cat ellyx.txt
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
$ cat ellyy.txt
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
$ awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598
$
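For reference, here is the same pipeline with the pairing logic spelled out in comments (functionally identical):
awk '{ print $0, FILENAME }' ellyx.txt ellyy.txt |   # tag each line with its source file
sort -k2 -k3 |   # sort by ID, then file name, so common IDs land on adjacent lines with file1 first
awk '{
  c = $2                                  # current ID
  if (c == p)                             # same ID as previous line, so present in both files
    print c, a, $1, log(a/$1)/log(2)
  p = c; a = $1                           # remember this ID and its number for the next line
}'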

Related

grep file matching specific column

I want to keep only the lines in results.txt that match the IDs in uniq.txt, based on matches in column 3 of results.txt. Usually I would use grep -f uniq.txt results.txt, but this does not restrict the match to column 3.
uniq.txt
9606
234831
131
31313
results.txt
readID seqID taxID score 2ndBestScore hitLength queryLength numMatches
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014973.1 443143 121 121 26 151 3
With your shown samples, please try the following code.
awk 'FNR==NR{arr[$0];next} ($3 in arr)' uniq.txt results.txt
Explanation:
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when uniq.txt is being read.
arr[$0] ##Creating array arr with the current line as index.
next ##next will skip all further statements from here.
}
($3 in arr) ##If 3rd field is present in arr then print line from results.txt here.
' uniq.txt results.txt ##Mentioning Input_file names here.
2nd solution: In case the field number is not fixed in results.txt and you want to search for the values anywhere in the line, then try the following.
awk 'FNR==NR{arr[$0];next} {for(key in arr){if(index($0,key)){print;next}}}' uniq.txt results.txt
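Note that index() does substring matching, so a short ID such as 131 would also hit a line whose third field is, say, 1313. If whole-field matches anywhere on the line are wanted instead, a field-loop variant like this sketch is safer:
awk 'FNR==NR{arr[$0];next} {for (i=1; i<=NF; i++) if ($i in arr) {print; next}}' uniq.txt results.txt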
You can use grep in combination with sed to turn the input IDs into anchored patterns and achieve what you're looking for:
grep -Ef <(sed -e 's/^/^(\\S+\\s+){2}/;s/$/(\\s|$)/' uniq.txt) results.txt
If you want to match the nth column, replace the 2 in the above command with n-1. This
outputs
A00260:70:HJM2YDSXX:4:1111:15519:16720 NC_000011.10 9606 169 0 28 151 1
A00260:70:HJM2YDSXX:3:1536:9805:14841 NW_021160017.1 9606 81 0 24 151 1
A00260:70:HJM2YDSXX:3:1366:27181:24330 NC_014803.1 234831 121 121 26 151 3
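For reference, the sed stage just turns each ID into an anchored pattern; it emits regexes like:
$ sed -e 's/^/^(\\S+\\s+){2}/;s/$/(\\s|$)/' uniq.txt | head -2
^(\S+\s+){2}9606(\s|$)
^(\S+\s+){2}234831(\s|$)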

Modify tab delimited txt file

I want to modify a tab-delimited txt file using Linux commands (sed/awk or any other method).
This is an example of a tab-delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it modified so that the 4th column of the odd rows becomes the 1st column, and the 2nd column of the even rows (their 1st column is blank) becomes the 2nd column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: Since the OP has changed the Input_file sample a bit, adding code for that too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval to it; you need not add it if you have a recent version of awk.
A single awk like the following may help with the same too.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation: Adding an explanation for the above code here too.
awk '
/Ex[0-9]+_inner/{ ##Checking if a line contains the string Ex followed by digits and _inner; if yes, then do the following actions.
val=$NF; ##Creating a variable named val whose value is $NF (the last field of the current line).
getline; ##Using getline, a built-in awk statement, to move the cursor to the next line.
sub(/^ +/,"",$1); ##Using sub to strip the leading spaces from the first field.
print val,$1 ##Printing the variable val and the first field value here.
}
' Input_file ##Mentioning the Input_file name here.
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file
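Since the records come in strict odd/even pairs once the all rows are dropped, paste can also fold each pair onto one line before picking out the two wanted fields; a sketch that assumes the pairing always holds:
$ sed '/^all/d' file | paste - - | awk '{ print $4, $5 }'
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584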

Splitting a column vertically using AWK

If I have +2, I want this to be + 2 as separate columns. I am doing this for a large column so I cannot do it manually.
Edit #1
cat maser_neg_test.txt | awk '{print NR, $0}' |
awk '{print $1, $2, ((15 * $3) + ((1/4) * $4) + ((1/240) * $5)), (($6) + ($7/60) + ($8/3600)), $9}' |
awk '{printf "%s %-15s %-10s %-10s %-6s\n", $1, $2, $3, $4, $5}' > maser_neg_test2.txt
is my code, which transforms
RXSJ00001+0523 00 00 11.78 +05 23 17.4 11992 2016-02-12 51.3 3 10.9 10631 13365
KUG2358+330 00 00 58.10 +33 20 38.0 12921 2012-11-17 36.5 8 4.0 11461 14395
0001233+4733537 00 01 23.30 +47 33 53.7 5237 2010-11-02 39.5 10 3.6 3848 6639 3.5 6358 9196
NGC-7805 00 01 26.76 +31 26 01.4 4850 2006-01-05 43.8 5 6.0 3464 6248 5.6 5968 8799
into
1 RXSJ00001+0523 0.04908 5.38817 11992
2 KUG2358+330 0.24208 33.3439 12921
3 0001233+4733537 0.34708 47.5649 5237
4 NGC-7805 0.36150 31.4337 4850
but my research advisor noted that in my conversion of
dec:
1*(hr) = degree_1
(1/60) * (min) = degree_2
(1/3600) * (sec) = degree_3
degree_1 + degree_2 + degree_3 = dec (degrees)
which applies to data such as +05 23 17.4 read as hr min sec. Just adding the parts when the sign is negative does not combine them correctly, so I'm trying to pull out the sign before doing my calculations and then re-apply it.
Edit 2
Here is an example of some of the negative cases; sorry, this is my first post and I wasn't really sure how to format it at first.
NGC-23 00 09 53.42 +25 55 25.5 4565 2005-12-18 44.2 30 2.5 3182 5961 2.3 5681 8506
UM207 00 10 06.63 -00 26 09.4 9648 2010-01-10 25.2 10 2.1 8218 11091 2.1 10802 13723
MARK937 00 10 09.99 -04 42 38.0 8846 2016-02-04 42.5 10 4.4 7512 10192
Mrk937 00 10 10.01 -04 42 37.9 8851 2003-11-01 60.4 24 4.1 7428 10286
NGC-26 00 10 25.86 +25 49 54.6 4589 2005-12-14 41.2 5 5.7 3205 5985 5.1 5705 8531
I think you are overcomplicating things a lot by using multiple layers of awk (and an unnecessary cat), and by thinking of how to "split columns vertically" rather than just solving the problem, which seems to be that for a negative sign you should subtract, rather than add, the minutes and seconds.
So, use intermediate variables and check for the sign ($5 ~ /^-/):
awk '{ deg = $6/60 + $7/3600; deg = ($5 ~ /^-/) ? $5 - deg : $5 + deg;
printf "%s %-15s %-10s %-10s %-6s\n",
NR, $1, ((15 * $2) + (1/4 * $3) + (1/240 * $4)), deg, $8
}' maser_neg_test.txt
(edit: As pointed out by the OP, the original test $5 < 0 would fail when that field was -0.)
Try something like this:
echo '+2' | awk -v FS="" '{print $1" "$2}'
Result:
+ 2
If you have a text file (test.txt) with information such as
+2
-3
+4
+5
and you need output like so:
+ 2
- 3
+ 4
+ 5
Try this:
awk -v FS="" '{print $1" "$2}' test.txt
As two commenters have mentioned, it would be good for you to add some example data and the output that you desire. The answer above is just one of the many ways you can format your data.
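Note that an empty FS (splitting the line into individual characters) is GNU awk behavior and is not guaranteed by POSIX; a portable sketch would use substr instead:
awk '{print substr($0,1,1), substr($0,2)}' test.txt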
EDIT
In your particular example, you could just use sed instead of cat'ing the file like so:
sed 's_+__g' test.txt | awk '{print NR, $0}' | awk '{print $1, $2, 15*$3 + $4/4 + $5/240, $6 + $7/60 + $8/3600, $9}'
sed will replace + in your file with nothing and then send the output to awk. If you have - also, you can perhaps remove them by either using sed creatively or double-sed'ing like so:
sed 's_+__g' test.txt | sed 's_-__g' | awk '{print NR, $0}' | awk '{print $1, $2, 15*$3 + $4/4 + $5/240, $6 + $7/60 + $8/3600, $9}'
In the scenario above, you may end up removing + and - characters that are probably wanted in the first column (the object names, such as RXSJ00001+0523 and NGC-7805, contain them).
You can split the field carrying the sign into an array (this uses GNU awk's three-argument match()). You can keep the first array element as the sign and the second array element as the value:
$ awk '{match($6,/([+-])(.*)/,m);print "m[1]=",m[1]," m[2]=",m[2];print m[1] m[2]+$7/60+$8/3600}' <<<"1 RXSJ00001+0523 00 00 11.78 -05 23 17.4"
#Output
m[1]= - m[2]= 05
-5.38817
Thus you can make all the calculations using m[2] instead of $6.
If you need to print the sign, you just need to print m[1] before m[2].
PS: By omitting the comma in print and using a space you force concatenation (see my example above).
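Putting this together with the earlier conversion, a sketch over the raw file (again assuming GNU awk for the three-argument match()):
awk '{
  match($5, /([+-]?)(.*)/, m)      # m[1] = sign (may be empty), m[2] = degree magnitude
  deg = m[2] + $6/60 + $7/3600     # combine the parts as positive magnitudes
  if (m[1] == "-") deg = -deg      # re-apply the sign afterwards
  printf "%s %-15s %-10s %-10s %-6s\n", NR, $1, 15*$2 + $3/4 + $4/240, deg, $8
}' maser_neg_test.txt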

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other, when two rows match. I suspect I'm close, but I suppose it's better to check.
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st of file1 and print the 1st column of file2 and all of file1, I tried awk 'NR==FNR {a[$3];next}$1 in a{print$0}' file2 file1
but that only prints matches in file1. I tried adding x=$1 in the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next}$1 in a{print x $0}' file2 file1, but that saves only one value of $1 and outputs that value on every line. I also tried adding $1 into a[$3], which is obviously wrong, thus giving zero output.
Ideally I'd like to get this output:
blue 101 0.123
ted 500 0.88
which is the 1st column of file2, followed by the row of file1 whose 1st column matches the 3rd column of file2.
You had it almost exactly in your second attempt. Just instead of assigning the value of $1 to a scalar you can stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88

Obtain patterns from a file, compare to a column of another file, and replace with column of third file, using awk

I have three files in total, f1.txt, f2.txt and f3.txt, of different sizes, as given below. I am trying to match the patterns of file 2 against file 1 and, if a match is found, replace the file 1 content with the file 3 content for that particular match. In fact, file 2 and file 3 are the same, but file 3 has leading zeros.
File 1:
8841
841
526
548
547
88
98
File 2:
841
526
548
547
file 3:
00841
0526
000548
00547
The desired output (in file 1, or maybe another file) is:
8841
00841
0526
000548
00547
88
98
I am trying to use a one-line command from a previous post, but that one only matches the files and does not replace with the values from the third file when a match is found. I am new to shell scripting, so please give me a one-line command or script which will achieve this. I am open to using sed or any other shell tool.
awk 'BEGIN{i=0}
FNR==NR { a[i++]=$1; next }
{ for(j=0;j<i;j++)
if(index($0,a[j]))
print $0
}' file2 file1
file2 is of no use. Just use file1 and file3. Adding 0 ($0+0) strips the leading zeros from each file3 value, so the plain number can serve as the lookup key:
$ awk 'NR==FNR{a[$0+0]=$0; next} {print ($0 in a ? a[$0] : $0)}' file3 file1
8841
00841
0526
000548
00547
88
98
Using your file1 and file3 you can do something like:
$ cat file1
8841
841
526
548
547
88
98
$ cat file3
00841
0526
000548
00547
$ awk 'NR==FNR{x=$1;gsub(/^0+/,"",$1);a[$1]=x;next}($1 in a){print a[$1];next}1' file3 file1
8841
00841
0526
000548
00547
88
98
You can avoid file3 and use printf in awk to format the output with leading zeros. Note that the following treats each file as a single space-separated row:
awk 'NR==FNR{a[$1 FS $2 FS $3 FS $4];next} {if ($2 FS $3 FS $4 FS $5 in a) printf "%s %05d %04d %06d %05d %s %s",$1,$2,$3,$4,$5,$6,$7}' file2 file1
8841 00841 0526 000548 00547 88 98
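If the files are one value per line as shown, the same printf idea needs a fixed pad width; a sketch assuming 5-digit padding is acceptable (note the desired output above uses varying widths, which only file3 can supply):
awk 'NR==FNR{a[$1];next} ($1 in a){printf "%05d\n", $1; next} 1' file2 file1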