extracting data from a column based on another column - awk

I have some files as shown below. I would like to extract the values of $5 based on $1.
file1
sam 60.2 143 40.4 19.8
mathew 107.9 144 35.6 72.3
baby 48.1 145 17.8 30.3
rehna 47.2 146 21.2 26.0
sam 69.9 147 .0 69.9
file2
baby 58.9 503 47.5 11.4
daisy 20.8 504 20.4 .4
arch 61.1 505 12.3 48.8
sam 106.6 506 101.6 5.0
rehna 73.5 507 35.9 37.6
sam 92.0 508 61.1 30.9
I used the following code to extract $5.
awk '$1 == "rehna" { print $5 }' *
awk '$1 == "sam" { print $5 }' *
I would like to get the output as shown below
rehna sam
26.0 19.8
37.6 69.9
5.0
30.9
How do I achieve this? Your suggestions would be appreciated!

The simplest is probably to paste the results together:
#!/bin/bash
function myawk {
  awk -v name="$1" 'BEGIN { print name } $1 == name { print $5 }' file1 file2
}
paste <(myawk rehna) <(myawk sam)
Running this produces the results you requested (with TAB as the separator character). See paste documentation for other options.

Update: peak's answer has since wrapped this approach in a function, in the spirit of DRY. If you want more background information, read on.
Assuming Bash, Ksh, or Zsh as the shell:
printf '%s\t%s\n' 'rehna' 'sam'
paste \
<(awk '$1 == "rehna" { print $5 }' *) \
<(awk '$1 == "sam" { print $5 }' *)
The above produces tab-separated output.
paste is a POSIX utility that outputs corresponding lines from its input files, by default separated with tabs; e.g., paste fileA fileB yields:
<line 1 from fileA>\t<line 1 from fileB>
<line 2 from fileA>\t<line 2 from fileB>
...
If any input file runs out of lines, it supplies empty lines.
In the case at hand, the respective outputs from the awk commands are used as input files, using process substitution (<(...)).
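paste's default TAB separator can be changed with its -d option; for example, to get space-separated columns instead (a sketch against the question's sample data, recreated here):

```shell
# Recreate the question's sample files, then paste the two awk
# result streams with a space as the delimiter instead of TAB.
cat > file1 <<'EOF'
sam 60.2 143 40.4 19.8
mathew 107.9 144 35.6 72.3
baby 48.1 145 17.8 30.3
rehna 47.2 146 21.2 26.0
sam 69.9 147 .0 69.9
EOF
cat > file2 <<'EOF'
baby 58.9 503 47.5 11.4
daisy 20.8 504 20.4 .4
arch 61.1 505 12.3 48.8
sam 106.6 506 101.6 5.0
rehna 73.5 507 35.9 37.6
sam 92.0 508 61.1 30.9
EOF
paste -d ' ' \
  <(awk '$1 == "rehna" { print $5 }' file1 file2) \
  <(awk '$1 == "sam" { print $5 }' file1 file2)
# 26.0 19.8
# 37.6 69.9
#  5.0
#  30.9
```

Note that where the shorter column runs out, paste still emits the delimiter, so the lone sam values are preceded by a space.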

Related

How to save different file from one file using value in specific column using bash

I want to save each line into a file chosen by the value in column $1; lines with a different value in $1 should be saved into a different new file.
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
115 15.7 110.8 2.06 0.72
118 11.9 111.01 -1.04 0.98
What I want is:
file1=p004112.txt
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
file2=p004113.txt
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
file3=p004115.txt
115 15.7 110.8 2.06 0.72
file4=p004118.txt
118 11.9 111.01 -1.04 0.98
The files that have to be split like this are named p004.txt, p005.txt, and so on.
I have tried this:
for i in `ls p????.txt|sed "s/.txt//g"`;do awk '{file=${i}$1".txt" print >> file}' ${i}.txt;done
but it doesn't work :( Does anyone have a solution to this problem?
Thank you
With your shown samples, please try the following. Your samples appear to be sorted on the 1st column already, but I have still used sort to sort on the 1st column; if your whole file is already sorted on the 1st column, remove the sort command and put Input_file at the end of the awk program.
sort -k1 Input_file | awk 'prev!=$1{close(outputFile);outputFile=("p004"$1".txt")} {print > (outputFile);prev=$1}'
OR a non-oneliner form of above solution:
sort -k1 Input_file |
awk '
prev != $1 {
  close(outputFile)
  outputFile = ("p004" $1 ".txt")
}
{
  print > (outputFile)
  prev = $1
}
'
Explanation: first, sort Input_file on the 1st column and send the output to the awk command. Then, in the awk program: each time the 1st column changes (is not equal to the previous line's value), close the current output file (to avoid a "too many open files" error) and set outputFile to p004 followed by the 1st column value and .txt, as the OP requires. Finally, print each line into the current output file and save the 1st column value in prev.
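The commands above hardcode the p004 prefix. Since the OP loops over several files (p004.txt, p005.txt, ...), a possible variation derives the prefix from each input file's own name; a sketch, with the glob pattern assumed from the question:

```shell
# Sample input named as in the question.
cat > p004.txt <<'EOF'
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
115 15.7 110.8 2.06 0.72
118 11.9 111.01 -1.04 0.98
EOF
# For each p???.txt file, split by column 1 into <prefix><col1>.txt,
# e.g. p004.txt yields p004112.txt, p004113.txt, and so on.
for f in p???.txt; do
  prefix="${f%.txt}"
  sort -k1,1 "$f" | awk -v prefix="$prefix" '
    prev != $1 { close(out); out = prefix $1 ".txt" }
    { print > out; prev = $1 }
  '
done
```

The glob p???.txt matches only the short input names, so the longer output names are not re-split on a second run.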

combining and processing 2 tab separated files in awk and making a new one

I have 2 tab-separated files with 2 columns: column 1 is a number and column 2 is an ID, like these 2 examples:
example file1:
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
example file2:
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
I want to find the rows common to the 2 files based on column 2 (the ID column) and make a new tab-separated file with 4 columns: column 1 is the (common) ID, column 2 is the number from file1, column 3 is the number from file2, and column 4 is the log2 of the ratio of columns 2 and 3, i.e. log2(column2/column3). For example, for the ID "TPT1": column 1 is TPT1, column 2 is 188, column 3 is 2217, and column 4 is log2(188/2217), which is approximately -3.5598.
here is a the expected output:
expected output:
TPT1 188 2217 -3.5598
ATP5C1 420 223 0.913346
I am trying to do that in AWK using the following code:
awk 'NR==FNR { n[$2]=$0;next } ($2 in n) { print n[$2 '\t' $1] '\t' $1 '\t' log(n[$1]/$1)}' file1.txt file2.txt > result.txt
This code does not return what I expect. Do you know how to fix it?
$ awk -v OFS="\t" 'NR==FNR {n[$2]=$1;next} ($2 in n) {print $2, n[$2], $1, log(n[$2]/$1)/log(2)}' file1 file2
TPT1 188 2217 -3.5598
ATP5C1 420 223 0.913346
I'd use join to actually merge the files instead of awk:
$ join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt) |
awk -v OFS="\t" '{ print $1, $2, $3, log($2/$3)/log(2) }'
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598
The join program, well, joins two files on a common value. It does require the files to be sorted based on the join column, but your examples aren't, hence the inline sorting of the data files. Its output is then piped to awk to compute the log2 of the numbers of each line and produce tab-delimited results.
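To make the intermediate step concrete, here is what the join stage alone emits for the sample data (files recreated below): the join key comes first, followed by the remaining field of each file.

```shell
# Recreate the sample files (names as used in the question's code).
cat > file1.txt <<'EOF'
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
EOF
cat > file2.txt <<'EOF'
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
EOF
# Join on each file's 2nd field; only common IDs survive.
join -j2 <(sort -k2 file1.txt) <(sort -k2 file2.txt)
# ATP5C1 420 223
# TPT1 188 2217
```

The following awk stage only has to read fields $2 and $3 of these lines to compute the log2 ratio.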
Alternative using perl which gives you more default precision if you care about that (And don't want to mess with awk's CONVFMT variable):
$ join -j2 <(sort -k2 a.txt) <(sort -k2 b.txt) |
perl -lane 'print join("\t", @F, log($F[1]/$F[2])/log(2))'
ATP5C1 420 223 0.913345617745818
TPT1 188 2217 -3.55980420318967
awk + sort approach
awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
with the given inputs
$ cat ellyx.txt
188 TPT1
133 ACTR2
420 ATP5C1
942 DNAJA1
$ cat ellyy.txt
91 PSMD7
2217 TPT1
223 ATP5C1
156 TCP1
$ awk ' { print $0,FILENAME }' ellyx.txt ellyy.txt | sort -k2 -k3 | awk ' {c=$2;if(c==p) { print c,a,$1,log(a/$1)/log(2) }p=c;a=$1 } '
ATP5C1 420 223 0.913346
TPT1 188 2217 -3.5598
$

Splitting a column vertically using AWK

If I have +2, I want this to be + 2, as separate columns. I am doing this for a large column, so I cannot do it manually.
Edit #1
cat maser_neg_test.txt | awk '{print NR, $0}' |
awk '{print $1, $2, ((15 * $3) + ((1/4) * $4) + ((1/240) * $5)), (($6) + ($7/60) + ($8/3600)), $9}' |
awk '{printf "%s %-15s %-10s %-10s %-6s\n", $1, $2, $3, $4, $5}' > maser_neg_test2.txt
is my code, which transforms
RXSJ00001+0523 00 00 11.78 +05 23 17.4 11992 2016-02-12 51.3 3 10.9 10631 13365
KUG2358+330 00 00 58.10 +33 20 38.0 12921 2012-11-17 36.5 8 4.0 11461 14395
0001233+4733537 00 01 23.30 +47 33 53.7 5237 2010-11-02 39.5 10 3.6 3848 6639 3.5 6358 9196
NGC-7805 00 01 26.76 +31 26 01.4 4850 2006-01-05 43.8 5 6.0 3464 6248 5.6 5968 8799
into
1 RXSJ00001+0523 0.04908 5.38817 11992
2 KUG2358+330 0.24208 33.3439 12921
3 0001233+4733537 0.34708 47.5649 5237
4 NGC-7805 0.36150 31.4337 4850
but my research advisor noted that in my conversion of
dec:
1*(hr) = degree_1
(1/60) * (min) = degree_2
(1/3600) * (sec) = degree_3
degree_1 + degree_2 + degree_3 = dec (degrees)
which treats the data +05 23 17.4 as hr min sec, just adding the terms when the sign is negative does not combine them correctly. So I'm trying to pull out the sign before doing my calculations and then re-apply it.
Edit 2
Here is an example of some of the negative cases; sorry, this is my first post and I wasn't really sure how to format it at first.
NGC-23 00 09 53.42 +25 55 25.5 4565 2005-12-18 44.2 30 2.5 3182 5961 2.3 5681 8506
UM207 00 10 06.63 -00 26 09.4 9648 2010-01-10 25.2 10 2.1 8218 11091 2.1 10802 13723
MARK937 00 10 09.99 -04 42 38.0 8846 2016-02-04 42.5 10 4.4 7512 10192
Mrk937 00 10 10.01 -04 42 37.9 8851 2003-11-01 60.4 24 4.1 7428 10286
NGC-26 00 10 25.86 +25 49 54.6 4589 2005-12-14 41.2 5 5.7 3205 5985 5.1 5705 8531
I think you are overcomplicating things a lot by using multiple layers of awk (and unnecessary cat), and thinking of how to "split columns vertically" rather than just solving the problem, which seems to be that for a negative sign you should subtract, rather than add, the minutes and seconds.
So, use intermediate variables and check for the sign ($5 ~ /^-/):
awk '{ deg = $6/60 + $7/3600; deg = ($5 ~ /^-/) ? $5 - deg : $5 + deg;
printf "%s %-15s %-10s %-10s %-6s\n",
NR, $1, ((15 * $2) + (1/4 * $3) + (1/240 * $4)), deg, $8
}' maser_neg_test.txt
(edit: As pointed out by the OP, the original test $5 < 0 would fail when that field was -0.)
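As a quick sanity check, feeding one of the negative rows from Edit 2 through the command above (numeric columns are rounded by awk's default CONVFMT, %.6g):

```shell
# One negative-declination row from the question's Edit 2.
echo 'MARK937 00 10 09.99 -04 42 38.0 8846 2016-02-04 42.5 10 4.4 7512 10192' |
awk '{ deg = $6/60 + $7/3600; deg = ($5 ~ /^-/) ? $5 - deg : $5 + deg
       printf "%s %-15s %-10s %-10s %-6s\n",
              NR, $1, ((15 * $2) + (1/4 * $3) + (1/240 * $4)), deg, $8 }'
```

The declination comes out as -4.71056, whereas naively adding the minutes and seconds would have given -3.28944.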
Try something like this:
echo '+2' | awk -v FS="" '{print $1" "$2}'
Result:
+ 2
If you have a text file (test.txt) with information such as
+2
-3
+4
+5
and you need output like so:
+ 2
- 3
+ 4
+ 5
Try this:
awk -v FS="" '{print $1" "$2}' test.txt
As two commenters have mentioned, it would be good for you to add some example data and the output that you desire. The answer above is just one of the many ways you can format your data.
EDIT
In your particular example, you could just use sed instead of cat'ing the file like so:
sed 's_+__g' test.txt | awk '{print NR, $0}' | awk '{print $1, $2, 15*$3 + $4/4 + $5/240, $6 + $7/60 + $8/3600, $9}'
sed will replace + in your file with nothing and then send the output to awk. If you have - also, you can perhaps remove them by either using sed creatively or double-sed'ing like so:
sed 's_+__g' test.txt | sed 's_-__g' | awk '{print NR, $0}' | awk '{print $1, $2, 15*$3 + $4/4 + $5/240, $6 + $7/60 + $8/3600, $9}'
In the scenario above, you may end up removing + and - characters that are actually wanted in the first column (names like RXSJ00001+0523 and NGC-7805 contain them).
You can split the field with the sign into an array. You can keep the first array element as the sign and the second array element as the value:
$ awk '{match($6,/([+-])(.*)/,m);print "m[1]=",m[1]," m[2]=",m[2];print m[1] m[2]+$7/60+$8/3600}' <<<"1 RXSJ00001+0523 00 00 11.78 -05 23 17.4"
#Output
m[1]= - m[2]= 05
-5.38817
Thus you can make all the calculations using m[2] instead of $6.
If you need to print the sign, you just need to print m[1] before m[2].
PS: By omitting the comma in print and using a space you force concatenation (see my example above).
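Note that match() with a third (array) argument is a GNU awk extension. If portability matters, a substr()-based sketch does the same split, assuming the sign character is always present in $6:

```shell
echo '1 RXSJ00001+0523 00 00 11.78 -05 23 17.4' |
awk '{ sign = substr($6, 1, 1)               # "+" or "-"
       deg  = substr($6, 2) + $7/60 + $8/3600
       if (sign == "-") deg = -deg
       print deg }'
# -5.38817
```

As with the match() version, the calculation is done on the unsigned value and the sign is re-applied afterwards.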

Comparing two lists and printing select columns from each list

I want to compare two lists and print some columns from one, and some from the other if two match. I suspect I'm close, but I suppose it's better to check.
1st file: Data.txt
101 0.123
145 0.119
242 0.4
500 0.88
2nd File: Map.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
So, if I want to compare the 3rd column of file2 against the 1st column of file1 and print the 1st column of file2 plus all of file1, I tried awk 'NR==FNR {a[$3];next} $1 in a {print $0}' file2 file1
but that only prints the matching lines of file1. I tried adding x=$1 in the awk, i.e. awk 'NR==FNR {x=$1;a[$3];next} $1 in a {print x $0}' file2 file1, but that saves only one value of $1 and outputs that value on every line. I also tried adding $1 into a[$3], which is obviously wrong and thus gives zero output.
Ideally I'd like to get this output:
blue 101 0.123
ted 500 0.88
that is, the 1st column of file2 (where its 3rd column matches the 1st column of file1), followed by the rest of file1.
You had it almost exactly in your second attempt. Just instead of assigning the value of $1 to a scalar you can stash it in the array for later use.
awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
$ cat file1.txt
101 0.123
145 0.119
242 0.4
500 0.88
$ cat file2.txt
red 1 99
blue 3 101
rob 3 240
ted 7 500
$ awk 'NR==FNR {a[$3]=$1; next} $1 in a {print a[$1], $0}' file2.txt file1.txt
blue 101 0.123
ted 500 0.88
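For comparison, join can produce the same result without an awk lookup array; a sketch (the -o list picks Map's 1st column followed by both columns of Data, and join requires input sorted on the key):

```shell
# Recreate the sample files from the question.
cat > Data.txt <<'EOF'
101 0.123
145 0.119
242 0.4
500 0.88
EOF
cat > Map.txt <<'EOF'
red 1 99
blue 3 101
rob 3 240
ted 7 500
EOF
# Join Map's 3rd field against Data's 1st field.
join -1 3 -2 1 -o 1.1,2.1,2.2 <(sort -k3,3 Map.txt) <(sort -k1,1 Data.txt)
# blue 101 0.123
# ted 500 0.88
```

Unlike the awk solution, the output order here follows the sorted join key rather than the original file order.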

awk and sprintf to zero fill

Using awk and sprintf, how can I zero-fill both before and after a decimal point?
input
11
12.2
9.6
output
110
122
096
I can get either using these, but not both
sprintf("%.1f", $1)
output
110
122
96
sprintf("%03d", $1)
output
011
012
096
x = sprintf("%06.3f", 1.23)
Output:
$ awk 'BEGIN{x = sprintf("%06.3f", 1.23); print x}'
01.230
$
I really can't tell from your question but maybe one of these does whatever it is you want:
$ cat file
11
12.2
9.6
$ awk '{ x=sprintf("%03d",$0*10); print x }' file
110
122
096
$ awk '{ x=sprintf("%04.1f",$0); print x }' file
11.0
12.2
09.6
Obviously you could just use printf with no intermediate variable but you asked for sprintf().
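Another route, if the goal really is the question's 110/122/096 output: combine the two formats by padding with %04.1f first and then deleting the decimal point (a sketch):

```shell
printf '11\n12.2\n9.6\n' |
awk '{ s = sprintf("%04.1f", $0)   # 11 -> "11.0", 9.6 -> "09.6"
       sub(/\./, "", s)            # drop the decimal point
       print s }'
# 110
# 122
# 096
```

The width of 4 accounts for the dot, so after sub() removes it every value is exactly 3 digits.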