Extract info from a column based on a range within another file - awk
I've tried looking around, but what I found was either looking up within the same file or combining columns with exact matches, whereas I don't have exact matches, and right now combining those two approaches is above my skill level. Basically I need to add an extra column containing the gene name, based on chromosome position, grabbing the gene name from the gene's range in another file. I know awk is my best bet, possibly with FNR==NR.
File1 looks like this, where $1 is chromosome, $2 is position, the rest of the columns are sample coverage across that position:
chr1H 49525 47 41 60 74 93 34 117
chr1H 49526 48 41 62 74 94 34 118
chr1H 53978 48 40 61 73 94 33 117
chr1H 53979 48 40 62 72 94 33 116
File2 looks like this, where $1 is the chromosome, $2 is the start of the gene, $3 is the end of the gene, and $4 is the gene name:
chr1H 49525 49772 gene1
chr1H 50194 50649 gene2
chr1H 53978 54323 gene3
chr1H 76743 77373 gene4
Either overwriting or making a new file, to end up with a file that looks like this:
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Right now my code looks like this, but I'm not sure how to specify the files (I've written file1 and file2 inline so you can see my thinking). The chromosomes should match in both files, the position in the coverage file should fall between the start and end positions in the second file, and then the entire line from file1 plus the gene name from file2 should be printed:
awk '{ if (file1$1 == file2$1 && file1$2 >= file2$2 && file1$2 <= file2$3) print file1$0, file2$4 }' file1 file2 > file3
Thanks for any help!
You can do this fairly simply in awk by reading the chromosome and range values from file2 into arrays indexed by gene name. That gives you a range per gene name to compare against the 2nd field in file1. You can do:
awk '
NR == FNR { # reading file2
c[$4] = $1 # store c[] (chromosome) indexed by name
b[$4] = $2 # store b[] (begin) indexed by name
e[$4] = $3 # store e[] (end) indexed by name
next # get next record
}
{ # for all file1 records
for(i in b) { # loop over ranges by gene name
if ($1 == c[i] && $2 >= b[i] && $2 <= e[i]) { # same chromosome, in range b[] to e[]
printf "%s %s\n", $0, i # output with gene name at end
next # get next record
}
}
}
' file2 file1
Example Use/Output
With the values shown in file1 and file2 you would have:
$ awk '
> NR == FNR { # reading file2
> c[$4] = $1 # store c[] (chromosome) indexed by name
> b[$4] = $2 # store b[] (begin) indexed by name
> e[$4] = $3 # store e[] (end) indexed by name
> next # get next record
> }
> { # for all file1 records
> for(i in b) { # loop over ranges by gene name
> if ($1 == c[i] && $2 >= b[i] && $2 <= e[i]) { # same chromosome, in range b[] to e[]
> printf "%s %s\n", $0, i # output with gene name at end
> next # get next record
> }
> }
> }
> ' file2 file1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
You can try this solution, although I would suggest using e.g. bedtools for tasks like this. You can encounter situations where intuition fails, and it's good to have stress-tested tools when that happens.
$ awk 'NR==FNR{ chr[NR]=$1; st[NR]=$2; en[NR]=$3; gene[NR]=$4; x=NR }
NR!=FNR{ for(i=1;i<=x;i++){ if($1==chr[i] && $2>=st[i] && $2<=en[i]){
print $0,gene[i]; break } } }' f2 f1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Btw, this doesn't handle multiple matches. If you want to allow them, remove the break statement (and, if you want all matching gene names on one line, change the print to a printf).
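As a sketch of that no-break variant (using made-up file names and hypothetical overlapping gene ranges, not the question's data), each matching gene produces its own output line:

```shell
# hypothetical sample with two overlapping gene ranges
printf 'chr1H 10 20 geneA\nchr1H 15 25 geneB\n' > /tmp/genes.txt
printf 'chr1H 16 99\n' > /tmp/cov.txt

# same lookup as above, but without break: every matching gene prints a line
awk 'NR==FNR{ chr[NR]=$1; st[NR]=$2; en[NR]=$3; gene[NR]=$4; x=NR; next }
     { for(i=1;i<=x;i++)
         if($1==chr[i] && $2>=st[i] && $2<=en[i]) print $0, gene[i]
     }' /tmp/genes.txt /tmp/cov.txt
# prints:
# chr1H 16 99 geneA
# chr1H 16 99 geneB
```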
You can use join to get halfway there:
sort -k 2,2 file-1 > f1-sorted
sort -k 2,2 file-2 > f2-sorted
join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.4 f1-sorted f2-sorted
#gives
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
join joins files on a column, but the input must be sorted on that column
-1 2 -2 2 means join on the second column of file 1 and file 2
-o specifies output format for the columns (1.1 is file 1, column 1, 2.4 is file 2, column 4)
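A tiny self-contained demo of that join invocation, using made-up three-column stand-ins for the sorted files (fewer coverage columns than the question, so the -o list is shorter):

```shell
# stand-ins for f1-sorted and f2-sorted, already sorted on field 2
printf 'chr1H 49525 47\nchr1H 53978 48\n' > /tmp/f1-sorted
printf 'chr1H 49525 49772 gene1\nchr1H 53978 54323 gene3\n' > /tmp/f2-sorted

# join on field 2 of both files; emit file1's columns plus file2's gene name
join -1 2 -2 2 -o 1.1,1.2,1.3,2.4 /tmp/f1-sorted /tmp/f2-sorted
# prints:
# chr1H 49525 47 gene1
# chr1H 53978 48 gene3
```

Note this only pairs exact position matches, which is why it only gets halfway there.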
But you have the unusual requirement of matching the next immediate position number to the previous gene name. For that I would use this awk:
awk '
FNR==NR {name[$2]=$4}
FNR!=NR && name[$2] {n = name[$2]}
FNR!=NR && name[$2-1] {n = name[$2-1]}
FNR!=NR {print $0,n}' file-2 file-1
This gives your expected output exactly. You can also use FNR!=NR && n {print $0,n} to print a record only if there's a match for the position number (column 2) in both files.
Related
Awk script displaying incorrect output
I'm facing an issue in awk script - I need to generate a report containing the lowest, highest and average score for each assignment in the data file. The name of the assignment is located in column 3. Input data is:
Student,Catehory,Assignment,Score,Possible
Chelsey,Homework,H01,90,100 Chelsey,Homework,H02,89,100 Chelsey,Homework,H03,77,100 Chelsey,Homework,H04,80,100 Chelsey,Homework,H05,82,100 Chelsey,Homework,H06,84,100 Chelsey,Homework,H07,86,100 Chelsey,Lab,L01,91,100 Chelsey,Lab,L02,100,100 Chelsey,Lab,L03,100,100 Chelsey,Lab,L04,100,100 Chelsey,Lab,L05,96,100 Chelsey,Lab,L06,80,100 Chelsey,Lab,L07,81,100 Chelsey,Quiz,Q01,100,100 Chelsey,Quiz,Q02,100,100 Chelsey,Quiz,Q03,98,100 Chelsey,Quiz,Q04,93,100 Chelsey,Quiz,Q05,99,100 Chelsey,Quiz,Q06,88,100 Chelsey,Quiz,Q07,100,100 Chelsey,Final,FINAL,82,100 Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100 Sam,Homework,H02,82,100 Sam,Homework,H03,95,100 Sam,Homework,H04,46,100 Sam,Homework,H05,82,100 Sam,Homework,H06,97,100 Sam,Homework,H07,52,100 Sam,Lab,L01,41,100 Sam,Lab,L02,85,100 Sam,Lab,L03,99,100 Sam,Lab,L04,99,100 Sam,Lab,L05,0,100 Sam,Lab,L06,0,100 Sam,Lab,L07,0,100 Sam,Quiz,Q01,91,100 Sam,Quiz,Q02,85,100 Sam,Quiz,Q03,33,100 Sam,Quiz,Q04,64,100 Sam,Quiz,Q05,54,100 Sam,Quiz,Q06,95,100 Sam,Quiz,Q07,68,100 Sam,Final,FINAL,58,100 Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100 Andrew,Homework,H02,47,100 Andrew,Homework,H03,85,100 Andrew,Homework,H04,65,100 Andrew,Homework,H05,54,100 Andrew,Homework,H06,58,100 Andrew,Homework,H07,52,100 Andrew,Lab,L01,87,100 Andrew,Lab,L02,45,100 Andrew,Lab,L03,92,100 Andrew,Lab,L04,48,100 Andrew,Lab,L05,42,100 Andrew,Lab,L06,99,100 Andrew,Lab,L07,86,100 Andrew,Quiz,Q01,25,100 Andrew,Quiz,Q02,84,100 Andrew,Quiz,Q03,59,100 Andrew,Quiz,Q04,93,100 Andrew,Quiz,Q05,85,100 Andrew,Quiz,Q06,94,100 Andrew,Quiz,Q07,58,100 Andrew,Final,FINAL,99,100 Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100 Ava,Homework,H02,95,100 Ava,Homework,H03,84,100 Ava,Homework,H04,74,100 Ava,Homework,H05,95,100 Ava,Homework,H06,84,100 Ava,Homework,H07,55,100 Ava,Lab,L01,66,100 Ava,Lab,L02,77,100 Ava,Lab,L03,88,100 Ava,Lab,L04,99,100 Ava,Lab,L05,55,100 Ava,Lab,L06,66,100 Ava,Lab,L07,77,100 Ava,Quiz,Q01,88,100 Ava,Quiz,Q02,99,100 Ava,Quiz,Q03,44,100 Ava,Quiz,Q04,55,100 Ava,Quiz,Q05,66,100 Ava,Quiz,Q06,77,100 Ava,Quiz,Q07,88,100 Ava,Final,FINAL,99,100 Ava,Survey,WS,5,5
Shane,Homework,H01,50,100 Shane,Homework,H02,60,100 Shane,Homework,H03,70,100 Shane,Homework,H04,60,100 Shane,Homework,H05,70,100 Shane,Homework,H06,80,100 Shane,Homework,H07,90,100 Shane,Lab,L01,90,100 Shane,Lab,L02,0,100 Shane,Lab,L03,100,100 Shane,Lab,L04,50,100 Shane,Lab,L05,40,100 Shane,Lab,L06,60,100 Shane,Lab,L07,80,100 Shane,Quiz,Q01,70,100 Shane,Quiz,Q02,90,100 Shane,Quiz,Q03,100,100 Shane,Quiz,Q04,100,100 Shane,Quiz,Q05,80,100 Shane,Quiz,Q06,80,100 Shane,Quiz,Q07,80,100 Shane,Final,FINAL,90,100 Shane,Survey,WS,5,5
awk script:
BEGIN { FS=" *\\, *" }
FNR>1 {
    min[$3] = (!($3 in min) || min[$3] > $4) ? $4 : min[$3]
    max[$3] = (max[$3] > $4) ? max[$3] : $4
    cnt[$3]++
    sum[$3] += $4
}
END {
    print "Name\tLow\tHigh\tAverage"
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
Expected sample output:
Name Low High Average
Q06 77 95 86.80
L05 40 96 46.60
WS 5 5 5
Q07 58 100 78.80
L06 60 99 61
L07 77 86 64.80
When I run the script, I get a "Low" of 0 for all assignments, which is not correct. Where am I going wrong? Please guide.
You can certainly do this with awk, but since you tagged this scripting as well, I'm assuming other tools are an option. For this sort of gathering of statistics on groups present in the data, GNU datamash often reduces the job to a simple one-liner. For example:
$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4 < input.csv) | tr , '\t'
Name Low High Average
FINAL 58 99 85.6
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
H04 46 80 65
H05 54 95 76.6
H06 58 97 80.6
H07 52 90 67
L01 41 91 75
L02 0 100 61.4
L03 88 100 95.8
L04 48 100 79.2
L05 0 96 46.6
L06 0 99 61
L07 0 86 64.8
Q01 25 100 74.8
Q02 84 100 91.6
Q03 33 100 66.8
Q04 55 100 81
Q05 54 99 76.8
Q06 77 95 86.8
Q07 58 100 78.8
WS 5 5 5
This says that for each group with the same value in the 3rd column (-g3, plus -s to sort the input, a requirement of the tool) of simple CSV input (-t,) with a header (--header-in), display the minimum, maximum, and mean of the 4th column. It's all given a new header and piped to tr to turn the commas into tabs.
Your code works as-is with GNU awk. However, running it with the -t option to warn about non-portable constructs gives:
awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'
And running the script with a different implementation of awk (mawk in my case) does give 0's for the Low column. So, some tweaks to the script:
BEGIN { FS="," }
FNR>1 {
    min[$3] = (cnt[$3] == 0 || min[$3] > $4) ? $4 : min[$3]
    max[$3] = (max[$3] > $4) ? max[$3] : $4
    cnt[$3]++
    sum[$3] += $4
}
END {
    print "Name\tLow\tHigh\tAverage"
    PROCINFO["sorted_in"] = "#ind_str_asc" # gawk-ism for pretty output; ignored on other awks
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
and it works as expected on that other awk too. The changes:
Using a simple comma as the field separator instead of a regex.
Changing the min conditional to set the current value the first time an assignment is seen, by checking whether cnt[$3] is equal to 0 (which it will be the first time, because that value is incremented on a later line), or whether the current min is greater than this value.
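A minimal sketch of that cnt-based guard in isolation, on a made-up one-group sample: the first record initializes min[], and later records only lower it, on any awk implementation.

```shell
# first record (cnt==0) seeds the minimum; 3 then replaces 5; 9 is ignored
printf 'a,5\na,3\na,9\n' |
awk -F, '{ min[$1] = (cnt[$1]==0 || min[$1] > $2) ? $2 : min[$1]; cnt[$1]++ }
         END { print min["a"] }'
# prints: 3
```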
Another similar approach:
$ awk -F, 'NR==1 {print "name","low","high","average"; next}
           {k=$3; sum[k]+=$4; count[k]++}
           !(k in min) {min[k]=max[k]=$4}
           min[k]>$4 {min[k]=$4}
           max[k]<$4 {max[k]=$4}
           END {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file | column -t
name Low high average
Q06 77 95 86.8
L05 0 96 46.6
WS 5 5 5
Q07 58 100 78.8
L06 0 99 61
L07 0 86 64.8
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
join the contents of files into a new file
I have some text files as shown below. I would like to join the contents of these files into one.
file A
>AXC
145
146
147
>SDF
1
8
67
>FGH
file B
>AXC
>SDF
12
65
>FGH
123
156
190
Desired output (new file):
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190
Your help would be appreciated!
awk '
/^>/ {
    key = $0
    if (!seen[key]++) keys[++numKeys] = key
    next
}
{ vals[key] = vals[key] ORS $0 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        print key vals[key]
    }
}
' fileA fileB
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190
If you really want the leading white space added to the ">SDF" values from fileA, tell us why that's the case for that one but not ">AXC" so we can code an appropriate solution.
A bit shorter than Ed's answer awk '/^>/{a=$0;next}{x[a]=x[a]$0"\n"}END{for(i in x)printf"%s\n%s",i,x[i]}' Blocks will be printed in an unspecified order.
RS=">" separates records on the > character. OFS="\n" puts each number on its own line. a[i]=a[i] $0 appends fields into the array, indexed by the first field. rt=RT adds the > character back to the index (RT is a gawk extension).
$ awk 'BEGIN{ RS=">"; OFS="\n" } {i=rt $1; $1=""; a[i]=a[i] $0; rt=RT; next} END { for (i in a) {print i a[i] }}' d6 d5
>SDF
12
65
1
8
67
>FGH
123
156
190
>AXC
145
146
147
compare a text file with another files
I have a file named file.txt as shown below:
12 2
15 7
134 8
154 12
155 16
167 6
175 45
45 65
812 54
I have another five files named A.txt, B.txt, C.txt, D.txt, E.txt. The contents of these files are shown below.
A.txt
45
134
B.txt
15
812
155
C.txt
12
154
D.txt
175
E.txt
167
I need to check which file contains each value of the first column of file.txt, and print that file's name as a third column. Output:
12 2 C
15 7 B
134 8 A
154 12 C
155 16 B
167 6 E
175 45 D
45 65 A
812 54 B
This should work:
One-liner:
awk 'FILENAME != "file.txt"{ a[$1]=FILENAME; next } $1 in a { $3=a[$1]; sub(/\..*/,"",$3) }1' {A..E}.txt file.txt
Formatted with comments:
awk '
# Check if the filename is not that of the main file
FILENAME != "file.txt" {
    # Create a hash: store column 1 values of the lookup files as keys, with the filename as value
    a[$1] = FILENAME
    # Skip the rest of the actions
    next
}
# Check if the first column of the main file is a key in the hash
$1 in a {
    # If the key exists, assign its value (which is a filename) as column 3 of the main file
    $3 = a[$1]
    # Using the sub function, strip the file-name extension as desired in your output
    sub(/\..*/, "", $3)
# 1 is a non-zero value forcing awk to print. {A..E}.txt is a brace expansion of your files.
}1' {A..E}.txt file.txt
Note: The main file needs to be passed at the end.
Test:
[jaypal:~/Temp] awk 'FILENAME != "file.txt"{ a[$1]=FILENAME; next } $1 in a { $3=a[$1]; sub(/\..*/,"",$3) ; printf "%-5s%-5s%-5s\n",$1,$2,$3}' {A..E}.txt file.txt
12   2    C
15   7    B
134  8    A
154  12   C
155  16   B
167  6    E
175  45   D
45   65   A
812  54   B
#! /usr/bin/awk -f
FILENAME == "file.txt" {
    a[FNR] = $0
    c = FNR
}
FILENAME != "file.txt" {
    split(FILENAME, name, ".")
    k[$1] = name[1]
}
END {
    for (line = 1; line <= c; line++) {
        split(a[line], seg, FS)
        print a[line], k[seg[1]]
    }
}
# $ awk -f script.awk *.txt
This solution does not preserve the order:
join <(sort file.txt) \
     <(awk '
         FNR==1 {filename = substr(FILENAME, 1, length(FILENAME)-4)}
         {print $1, filename}
       ' [ABCDE].txt | sort) | column -t
12   2   C
134  8   A
15   7   B
154  12  C
155  16  B
167  6   E
175  45  D
45   65  A
812  54  B
Count and sum column list
Not 100% sure how to do this. What I have does not add up:
awk -F, '{array[$1]+=$2} END { for (i in array) {print i array[i] }}' gaaa
Here is an example of gaaa:
acic 4
acgic 56
acpdc 183
acic 1677
acpicvp
acsis 23
hidr 4
hidr 1133
aggr 24
Desired result would be:
acic 1681
acgic 56
acpdc 183
acpicvp
acsis 23
hidr 1137
aggr 24
You have set the field separator to a comma, but there is no comma in your data. You want:
$ awk '{array[$1]+=$2}END{for (i in array) print i,array[i]}' gaaa
acsis 23
aggr 24
acgic 56
acpdc 183
hidr 1137
acpicvp 0
acic 1681
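A quick way to see what the stray -F, was doing (one made-up line): with no comma in the input, the whole line is field 1 and $2 is empty, so the sums never accumulate.

```shell
# NF is 1 and $2 is empty when the comma separator never matches
printf 'acic 4\n' | awk -F, '{ print NF, "[" $2 "]" }'
# prints: 1 []
```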
awk + Need to print everything (all rest fields) except $1 and $2
I have the following file and I need to print everything except $1 and $2 with awk.
File:
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
...
The desirable output:
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
...
Well, given your data, cut should be sufficient: cut -d' ' -f3- infile
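For example, on a made-up line shaped like the question's input:

```shell
# -d' ' sets a space delimiter; -f3- selects fields 3 through the end
printf 'INFORMATION DATA 12 33 55\n' | cut -d' ' -f3-
# prints: 12 33 55
```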
Although it adds extra whitespace at the beginning of each line compared to yael's expected output, here is a shorter and simpler awk-based solution than the previously suggested ones: awk '{$1=$2=""; print}' or even: awk '{$1=$2=""}1'
$ cat t
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
$ awk '{for (i = 3; i <= NF; i++) printf $i " "; print ""}' t
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
danben's answer leaves whitespace at the end of each resulting line, so the correct way to do it would be: awk '{for (i=3; i<NF; i++) printf $i " "; print $NF}' filename
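A small check of the difference on a made-up line (command substitution keeps a trailing space, so it shows up in the bracketed output):

```shell
line='A B 1 2'
with_trailing=$(printf '%s\n' "$line" | awk '{for (i=3; i<=NF; i++) printf $i " "; print ""}')
without=$(printf '%s\n' "$line" | awk '{for (i=3; i<NF; i++) printf $i " "; print $NF}')
printf '[%s] [%s]\n' "$with_trailing" "$without"
# prints: [1 2 ] [1 2]
```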
If the first two words don't change, probably the simplest thing would be: awk -F 'INFORMATION DATA ' '{print $2}' t
Here's another awk solution, that's more flexible than the cut one and is shorter than the other awk ones. Assuming your separators are single spaces (modify the regex as necessary if they are not): awk --posix '{sub(/([^ ]* ){2}/, ""); print}'
If Perl is an option:
perl -lane 'splice @F,0,2; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print it
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode - split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,2 cleanly removes columns 0 and 1 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
Variation for csv input files:
perl -F, -lane 'splice @F,0,2; print join " ",@F' file
This uses the -F field separator option with a comma.