Searching File2 with 3 columns from File 1 with awk

Does anyone know how to print "Not Found" if there is no match, such that the print output will always contain the same number of lines as File 1?
To be more specific, I have two files with four columns:
File 1:
1 800 800 0.51
2 801 801 0.01
3 802 802 0.01
4 803 803 0.23
File 2:
1 800 800 0.55
2 801 801 0.09
3 802 802 0.88
4 803 804 0.24
This is what I am using now:
$ awk 'NR==FNR{a[$1,$2,$3];next}($1,$2,$3) in a{print $4}' file1.txt file2.txt
This generates the following output:
0.55
0.09
0.88
However, I want to get this:
0.55
0.09
0.88
Not Found
Could you please help?
Sorry if this is presented in a confusing manner; I have little experience with awk and am confused myself.
As a separate issue, I want to end up with a file that has the data from File 2 appended to File 1, like so:
1 800 800 0.51 0.55
2 801 801 0.01 0.09
3 802 802 0.01 0.88
4 803 803 0.23 Not Found
I was going to generate the file as before (let's call it file2-matches.txt), then use the paste command:
paste -d"\t" file1.txt file2-matches.txt > output.txt
But considering that I have to do this matching for over 100 files, is there a more efficient way you can suggest?

Add an else clause:
$ awk 'NR==FNR{a[$1,$2,$3];next} {if (($1,$2,$3) in a) {print $4} else {print "not found"}}' f1 f2
0.55
0.09
0.88
not found
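
For the second part, you can skip paste entirely and append File 2's value (or "Not Found") to each File 1 line in a single pass. A minimal sketch, using the same file names as above (note file2.txt is read first here, so its $4 values can be looked up while reading file1.txt):
$ awk 'NR==FNR{a[$1,$2,$3]=$4; next} {print $0 "\t" ((($1,$2,$3) in a) ? a[$1,$2,$3] : "Not Found")}' file2.txt file1.txt
For the 100-file case, you could wrap this in a shell loop over your file pairs, which avoids creating the intermediate file2-matches.txt files altogether.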

Related

how to format a large txt file to bed

I am trying to format CpG methylation calls from the R package "methylKit" into simple BED format. Since it is a large file, I cannot do it in Excel. I also tried SeqMonk, but it does not allow me to export the data in the format I want. Linux awk/sed might be a good option, but I am new to those as well. Basically, I need to trim the "chr" column, add a "stop" column, convert "F" to "+" and "R" to "-", and keep freqC with 2 decimal places. Can you please help?
From:
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chr1.183 chr1 183 R 4 0.00 100.00
chr1.192 chr1 192 R 6 0.00 100.00
chr1.340 chr1 340 R 5 40.00 60.00
chr1.10007 chr1 10007 F 13 53.85 46.15
chr1.10317 chr1 10317 F 8 0.00 100.00
chr1.10346 chr1 10346 F 9 88.89 11.11
chr1.10349 chr1 10349 F 9 88.89 11.11
To:
chr start stop freqc Coverage strand
1 67678 67679 0 3 -
1 67701 67702 0 3 -
1 67724 67725 0 3 -
1 67746 67747 0 3 -
1 67768 67769 0.333333 3 -
1 159446 159447 0 3 +
1 162652 162653 0 3 +
1 167767 167768 0.666667 3 +
1 167789 167790 0.666667 3 +
1 167797 167798 0 3 +
This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:
$ cat foo.awk
BEGIN { OFS="\t" }
NR > 1 {
    if ($4 == "F") {
        strand = "+"
    } else {
        strand = "-"
    }
    gsub("chr", "", $2)    # strip the "chr" prefix in place; gsub returns a count, so print $2, not its return value
    print $2, $3-1, $3, $1, $6, strand, $5
}
That would then be run with:
awk -f foo.awk input.txt > output.bed
The additional column 7 is the coverage, since genome viewers will only display a single score column:
1 338 339 chr1.339 0.00 + 7
1 182 183 chr1.183 0.00 - 4
1 191 192 chr1.192 0.00 - 6
1 339 340 chr1.340 40.00 - 5
1 10006 10007 chr1.10007 53.85 + 13
1 10316 10317 chr1.10317 0.00 + 8
1 10345 10346 chr1.10346 88.89 + 9
1 10348 10349 chr1.10349 88.89 + 9
You can tweak that further as needed.
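If a downstream tool expects the BED file to be position-sorted, a typical follow-up (assuming standard GNU sort; output.sorted.bed is a made-up name) would be:
sort -k1,1 -k2,2n output.bed > output.sorted.bed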
It is not entirely clear exactly what you want, since your "From" data does not correspond to what you show as your "To" results; but if what you are showing is the general format change, and from the "From" data you want to:
discard field 1,
retrieve the "chr" value from the end of field 2,
if the 4th field is "F" make it "+" else if it is "R" make it "-" otherwise leave it unchanged,
use the 3rd field as "start" and 3rd + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
print 6th field as "freqc",
output 5th field as "Coverage", and finally
output modified 4th field as "strand"
If that is your goal, then with your from data in the file named from, you can do something like the following:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    match($2, /[[:digit:]]+$/, arr)
    if ($4 == "F")
        $4 = "+"
    else if ($4 == "R")
        $4 = "-"
    print arr[0], $3, $3 + 1, $6, $5, $4
}
' from
Explanation: the BEGIN rule runs before awk starts processing records (lines) from the file. Here it simply sets the output field separator (OFS) to tab and prints the heading.
The condition (pattern) FNR > 1 on the second rule processes the from file from the 2nd record (line) on, skipping the heading row. FNR is the record number within the current file (as opposed to NR, which counts records across all files).
match($2, /[[:digit:]]+$/, arr) captures the trailing digits of the second field into the first element of arr (i.e. arr[0]); it also sets the RSTART and RLENGTH built-in variables (which character the first digit starts on and how many digits there are), though those are not needed here.
The if and else if statements make the "F" to "+" and "R" to "-" change. Finally, the print statement prints the modified values and unchanged fields in the order specified above.
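Note that the three-argument form of match is a GNU awk (gawk) extension. If you need to stay POSIX-portable, here is a sketch of the same transformation that strips the leading "chr" with sub() instead, assuming every chromosome name starts with "chr":
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    sub(/^chr/, "", $2)          # keep only what follows the "chr" prefix
    if ($4 == "F") $4 = "+"
    else if ($4 == "R") $4 = "-"
    print $2, $3, $3 + 1, $6, $5, $4
}
' from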
Example Output
Running the above on your original "From" data will produce:
chr start stop freqc Coverage strand
1 339 340 0.00 7 +
1 183 184 0.00 4 -
1 192 193 0.00 6 -
1 340 341 40.00 5 -
1 10007 10008 53.85 13 +
1 10317 10318 0.00 8 +
1 10346 10347 88.89 9 +
1 10349 10350 88.89 9 +
Let me know if this is close to what you explained in your question, and if not, drop a comment below.
The GNU Awk User's Guide is a great gawk/awk reference.

How to save different file from one file using value in specific column using bash

I want to split lines into separate files based on the value in column $1: lines that share the same value in $1 go into one file, and lines with a different value go into another new file.
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
115 15.7 110.8 2.06 0.72
118 11.9 111.01 -1.04 0.98
What I want is:
file1=p004112.txt
112 14.7 114.98 -0.92 -0.12
112 14.8 114.02 -0.78 0.76
112 14.1 114.99 -0.98 -0.11
file2=p004113.txt
113 12.5 111.77 1.87 -1.88
113 12.6 111.89 -0.98 -1.65
file3=p004115.txt
115 15.7 110.8 2.06 0.72
file4=p004118.txt
118 11.9 111.01 -1.04 0.98
The input files that have to be split like this are named p004.txt, p005.txt, and so on.
I have tried this:
for i in `ls p????.txt|sed "s/.txt//g"`;do awk '{file=${i}$1".txt" print >> file}' ${i}.txt;done
but it doesn't work :( Does anyone have a solution to this problem?
Thank you
With your shown samples, please try the following. Your sample's 1st column is already sorted, but I have still used sort to sort the file on the 1st column; in case your whole file is already sorted on the 1st column, remove the sort command from the following and pass Input_file at the end of the awk program instead.
sort -k1 Input_file | awk 'prev!=$1{close(outputFile);outputFile=("p004"$1".txt")} {print > (outputFile);prev=$1}'
OR a non-oneliner form of above solution:
sort -k1 Input_file |
awk '
prev!=$1{
close(outputFile)
outputFile=("p004"$1".txt")
}
{
print > (outputFile)
prev=$1
}
'
Explanation: first sort the Input_file on the 1st column and send its output as input to the awk command. In the awk program, each time the 1st column changes (is not equal to the previous line's value), close the current output file (to avoid a "too many opened files" error) and set outputFile to p004 with the 1st column value and .txt appended, as the OP requested. Then print each line into the output file, and set prev to the 1st column value of each line.
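The answer above hardcodes the p004 prefix. Since the OP has several input files (p004.txt, p005.txt, ...), here is a sketch that derives the prefix from each filename instead, assuming the names all match p????.txt:
for f in p????.txt; do
    prefix=${f%.txt}                  # e.g. p004.txt -> p004
    sort -k1,1 "$f" | awk -v prefix="$prefix" '
        prev != $1 { close(out); out = prefix $1 ".txt" }
        { print > out; prev = $1 }
    '
done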

Comparing two columns in two files using awk with duplicates

File 1
A4gnt 0 0 0 0.3343
Aaas 2.79 2.54 1.098 0.1456
Aacs 0.94 0.88 1.063 0.6997
Aadac 0 0 0 0.3343
Aadacl2 0 0 0 0.3343
Aadat 0.01 0 1.723 0.7222
Aadat 0.06 0.03 1.585 0.2233
Aaed1 0.28 0.24 1.14 0.5337
Aaed1 1.24 1.27 0.976 0.9271
Aaed1 15.91 13.54 1.175 0.163
Aagab 1.46 1.14 1.285 0.3751
Aagab 6.12 6.3 0.972 0.6569
Aak1 0.02 0.01 1.716 0.528
Aak1 0.1 0.19 0.561 0.159
Aak1 0.14 0.19 0.756 0.5297
Aak1 0.16 0.18 0.907 0.6726
Aak1 0.21 0 0 0.066
Aak1 0.26 0.27 0.967 0.9657
Aak1 0.54 1.65 0.325 0.001
Aamdc 0.04 0 15.461 0.0875
Aamdc 1.03 1.01 1.019 0.8817
Aamdc 1.27 1.26 1.01 0.9285
Aamdc 7.21 6.94 1.039 0.7611
Aamp 0.06 0.05 1.056 0.9136
Aamp 0.11 0.11 1.044 0.9227
Aamp 0.12 0.13 0.875 0.7584
Aamp 0.22 0.2 1.072 0.7609
File 2
Adar
Ak3
Alox15b
Ampd2
Ampd3
Ankrd17
Apaf1
Aplp1
Arih1
Atg14
Aurkb
Bcl2l14
Bmp2
Brms1l
Cactin
Camta2
Cav1
Ccr5
Chfr
Clock
Cnot1
Crebrf
Crtc3
Csnk2b
Cul3
Cx3cl1
Dnaja3
Dnmt1
Dtl
Ednra
Eef1e1
Esr1
Ezr
Fam162a
Fas
Fbxo30
Fgr
Flcn
Foxp3
Frzb
Fzd6
Gdf3
Hey2
Hnf4
The desired output: wherever the first columns of both files match, print all the columns of the first file (including duplicates).
I've tried
awk 'NR==FNR{a[$1]=$2"\t"$3"\t"$4"\t"$5; next} { if($1 in a) { print $0,a[$1] } }' File2 File1 > output
But for some reason I'm getting just a few hits. Does anyone know why?
Read the second file first, storing its 1st-column values as keys in array arr; then read the first file, and if the 1st column of file1 exists in arr (which was built from file2), print the current row/record from file1.
awk 'FNR==NR{arr[$1];next}$1 in arr' file2 file1
Advantage:
if you use a[$1]=$2"\t"$3"\t"$4"\t"$5; next, any later row with the same key overwrites the value stored for the earlier one,
but if you use arr[$1]; next, we store just the unique keys, and $1 in arr takes care of duplicate records even if they exist.
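As a hypothetical extension (matched.txt and unmatched.txt are made-up names), the same pass can also collect the file1 rows whose key is not in file2, which makes it easy to see why you are getting few hits:
awk 'FNR==NR{arr[$1];next} {print > (($1 in arr) ? "matched.txt" : "unmatched.txt")}' file2 file1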

obtain averages of field 2 after grouping by field 1 with awk

I have a file with two fields containing numbers that I have sorted numerically based on field 1. The numbers in field 1 range from 1 to 200000 and the numbers in field 2 between 0 and 1. I want to obtain averages for both field 1 and field 2 in batches (based on rows).
Here is example input when specifying batches of 4 rows:
1 0.12
1 0.34
2 0.45
2 0.40
50 0.60
301 0.12
899 0.13
1003 0.14
1300 0.56
1699 0.43
2100 0.25
2500 0.56
The output would be:
1.5 0.327
563.25 0.247
1899.75 0.45
Here you go:
awk -v n=4 '{s1 += $1; s2 += $2; if (++i % n == 0) { print s1/n, s2/n; s1=s2=0; } }'
Explanation:
Initialize n=4, the size of the batches
Collect the sums: sum of 1st column in s1, the 2nd in s2
Increment counter i by 1 (default initial value is 0, no need to set it)
If i is divisible by n with no remainder, then we print the averages, and reset the sum variables
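One caveat with the one-liner above: if the total row count is not a multiple of n, the trailing rows are silently dropped. Here is a sketch that also averages the final partial batch in an END block (input.txt is an assumed filename):
awk -v n=4 '
{
    s1 += $1; s2 += $2
    if (++i % n == 0) { print s1/n, s2/n; s1 = s2 = 0 }
}
END {
    r = i % n
    if (r) print s1/r, s2/r      # average whatever rows are left over
}
' input.txt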

restrict pattern to specified strings

I have a set of strings. Lets say, (list.txt) they are:
1abc_A
2pqr_X
4ghi_Z
I also have a text file (test.txt), which looks like this:
1abc_A 2pqr_X 0.55 0.87
2pqr_X 3def_Y 0.21 0.24
4ghi_Z 1abc_A 0.98 0.75
2pqr_X 4ghi_Z 0.99 0.76
2pqr_X 2pqr_X 1.00 1.00
I need to get only those lines from test.txt where the strings in columns 1 and 2 both belong to the strings included in list.txt.
In this case, my output would be as follows:
1abc_A 2pqr_X 0.55 0.87
4ghi_Z 1abc_A 0.98 0.75
2pqr_X 4ghi_Z 0.99 0.76
2pqr_X 2pqr_X 1.00 1.00
i.e. all the lines in test.txt EXCEPT the 2nd line, since its column 2 value, 3def_Y, is not among the strings specified in list.txt.
How can I do this in awk?
Please note that test.txt is a large text file of almost 7 GB.
What is the fastest way to go about this problem?
Please help.
awk 'NR==FNR{a[$0];next} ($1 in a) && ($2 in a)' list.txt test.txt
This stores the contents of list.txt as indices of an array, then, line by line of test.txt, checks that its 1st and 2nd fields are both indices of that array. It will work for any size of test.txt, as it doesn't store any of test.txt in memory.
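Since raw speed matters at 7 GB, one hedged suggestion: mawk, where installed, is often noticeably faster than gawk for simple field-membership tests like this, and the same program runs unchanged (filtered.txt is just a placeholder output name):
mawk 'NR==FNR{a[$0];next} ($1 in a) && ($2 in a)' list.txt test.txt > filtered.txt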