Extract info from a column based on a range within another file - awk
I've tried looking around, but what I found was either looking up within the same file or combining columns with exact matches, whereas I don't have exact matches, and right now combining those two approaches is above my skill level. Basically I need to add an extra column containing the gene name, based on chromosome position, grabbing the gene name from the gene's range in another file. I know awk is my best bet, possibly with FNR==NR.
File1 looks like this, where $1 is chromosome, $2 is position, the rest of the columns are sample coverage across that position:
chr1H 49525 47 41 60 74 93 34 117
chr1H 49526 48 41 62 74 94 34 118
chr1H 53978 48 40 61 73 94 33 117
chr1H 53979 48 40 62 72 94 33 116
File2 looks like this, where $1 is the chromosome, $2 is the start of the gene, $3 is the end of the gene, and $4 is the gene name:
chr1H 49525 49772 gene1
chr1H 50194 50649 gene2
chr1H 53978 54323 gene3
chr1H 76743 77373 gene4
Either overwriting or making a new file, to end up with a file that looks like this:
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Right now my code looks like this, but I'm not sure how to specify the files (I've written file1 and file2 inline so you can see my thinking). The chromosomes should match in both files, the position in the coverage file should fall between the start and end positions in the second file, and then the entire line from file1 plus the gene name from file2 should be printed:
awk '{ if (file1$1 == file2$1 && file1$2 >= file2$2 && file1$2 <= file2$3) print file1$0, file2$4 }' file1 file2 > file3
Thanks for any help!
You can do this fairly simply in awk by reading the chromosome and range values from file2 into arrays indexed by gene name. That gives you a range per gene name to compare against the 2nd field in file1. You can do:
awk '
NR == FNR { # reading file2
c[$4] = $1 # store c[] (chromosome) indexed by name
b[$4] = $2 # store b[] (begin) indexed by name
e[$4] = $3 # store e[] (end) indexed by name
next # get next record
}
{ # for all file1 records
for(i in b) { # loop over ranges by gene name
if ($1 == c[i] && $2 >= b[i] && $2 <= e[i]) { # same chromosome, in range b[] to e[]
printf "%s %s\n", $0, i # output with gene name at end
next # get next record
}
}
}
' file2 file1
Example Use/Output
With the values shown in file1 and file2 you would have:
$ awk '
> NR == FNR { # reading file2
> c[$4] = $1 # store c[] (chromosome) indexed by name
> b[$4] = $2 # store b[] (begin) indexed by name
> e[$4] = $3 # store e[] (end) indexed by name
> next # get next record
> }
> { # for all file1 records
> for(i in b) { # loop over ranges by gene name
> if ($1 == c[i] && $2 >= b[i] && $2 <= e[i]) { # same chromosome, in range b[] to e[]
> printf "%s %s\n", $0, i # output with gene name at end
> next # get next record
> }
> }
> }
> ' file2 file1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
You can try this solution, although I would suggest using e.g. bedtools for tasks like this. You can encounter situations where intuition fails, and it's good to have stress-tested tools when that happens.
$ awk 'NR==FNR{ chr[NR]=$1; st[NR]=$2; en[NR]=$3; gene[NR]=$4; x=NR }
NR!=FNR{ for(i=1;i<=x;i++){ if($1==chr[i] && $2>=st[i] && $2<=en[i]){
print $0,gene[i]; break } } }' f2 f1
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 49526 48 41 62 74 94 34 118 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
chr1H 53979 48 40 62 72 94 33 116 gene3
Btw, this doesn't handle multiple matches. If you want to allow them, remove the break statement (and, if you want all matching gene names on one line, change the print to a printf).
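As a sketch of that no-break variant (using made-up file names and hypothetical overlapping gene ranges, not the question's data), each matching gene produces its own output line:

```shell
# hypothetical sample with two overlapping gene ranges
printf 'chr1H 10 20 geneA\nchr1H 15 25 geneB\n' > /tmp/genes.txt
printf 'chr1H 16 99\n' > /tmp/cov.txt

# same lookup as above, but without break: every matching gene prints a line
awk 'NR==FNR{ chr[NR]=$1; st[NR]=$2; en[NR]=$3; gene[NR]=$4; x=NR; next }
     { for(i=1;i<=x;i++)
         if($1==chr[i] && $2>=st[i] && $2<=en[i]) print $0, gene[i]
     }' /tmp/genes.txt /tmp/cov.txt
# prints:
# chr1H 16 99 geneA
# chr1H 16 99 geneB
```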
You can use join to get halfway there:
sort -k 2,2 file-1 > f1-sorted
sort -k 2,2 file-2 > f2-sorted
join -1 2 -2 2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.4 f1-sorted f2-sorted
#gives
chr1H 49525 47 41 60 74 93 34 117 gene1
chr1H 53978 48 40 61 73 94 33 117 gene3
join joins files on a column, but the input must be sorted on that column
-1 2 -2 2 means join on the second column of file 1 and file 2
-o specifies output format for the columns (1.1 is file 1, column 1, 2.4 is file 2, column 4)
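A tiny self-contained demo of that join invocation, using made-up three-column stand-ins for the sorted files (fewer coverage columns than the question, so the -o list is shorter):

```shell
# stand-ins for f1-sorted and f2-sorted, already sorted on field 2
printf 'chr1H 49525 47\nchr1H 53978 48\n' > /tmp/f1-sorted
printf 'chr1H 49525 49772 gene1\nchr1H 53978 54323 gene3\n' > /tmp/f2-sorted

# join on field 2 of both files; emit file1's columns plus file2's gene name
join -1 2 -2 2 -o 1.1,1.2,1.3,2.4 /tmp/f1-sorted /tmp/f2-sorted
# prints:
# chr1H 49525 47 gene1
# chr1H 53978 48 gene3
```

Note this only pairs exact position matches, which is why it only gets halfway there.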
But you have the unusual requirement of matching the next immediate position number to the previous gene name. For that I would use this awk:
awk '
FNR==NR {name[$2]=$4}
FNR!=NR && name[$2] {n = name[$2]}
FNR!=NR && name[$2-1] {n = name[$2-1]}
FNR!=NR {print $0,n}' file-2 file-1
This gives your expected output exactly. You can also use FNR!=NR && n {print $0,n} to print a record only if there's a match for the position number (column 2) in both files.
Related
Awk script displaying incorrect output
I'm facing an issue in awk script - I need to generate a report containing the lowest, highest and average score for each assignment in the data file. The name of the assignment is located in column 3. Input data is:
Student,Catehory,Assignment,Score,Possible
Chelsey,Homework,H01,90,100 Chelsey,Homework,H02,89,100 Chelsey,Homework,H03,77,100 Chelsey,Homework,H04,80,100 Chelsey,Homework,H05,82,100 Chelsey,Homework,H06,84,100 Chelsey,Homework,H07,86,100 Chelsey,Lab,L01,91,100 Chelsey,Lab,L02,100,100 Chelsey,Lab,L03,100,100 Chelsey,Lab,L04,100,100 Chelsey,Lab,L05,96,100 Chelsey,Lab,L06,80,100 Chelsey,Lab,L07,81,100 Chelsey,Quiz,Q01,100,100 Chelsey,Quiz,Q02,100,100 Chelsey,Quiz,Q03,98,100 Chelsey,Quiz,Q04,93,100 Chelsey,Quiz,Q05,99,100 Chelsey,Quiz,Q06,88,100 Chelsey,Quiz,Q07,100,100 Chelsey,Final,FINAL,82,100 Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100 Sam,Homework,H02,82,100 Sam,Homework,H03,95,100 Sam,Homework,H04,46,100 Sam,Homework,H05,82,100 Sam,Homework,H06,97,100 Sam,Homework,H07,52,100 Sam,Lab,L01,41,100 Sam,Lab,L02,85,100 Sam,Lab,L03,99,100 Sam,Lab,L04,99,100 Sam,Lab,L05,0,100 Sam,Lab,L06,0,100 Sam,Lab,L07,0,100 Sam,Quiz,Q01,91,100 Sam,Quiz,Q02,85,100 Sam,Quiz,Q03,33,100 Sam,Quiz,Q04,64,100 Sam,Quiz,Q05,54,100 Sam,Quiz,Q06,95,100 Sam,Quiz,Q07,68,100 Sam,Final,FINAL,58,100 Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100 Andrew,Homework,H02,47,100 Andrew,Homework,H03,85,100 Andrew,Homework,H04,65,100 Andrew,Homework,H05,54,100 Andrew,Homework,H06,58,100 Andrew,Homework,H07,52,100 Andrew,Lab,L01,87,100 Andrew,Lab,L02,45,100 Andrew,Lab,L03,92,100 Andrew,Lab,L04,48,100 Andrew,Lab,L05,42,100 Andrew,Lab,L06,99,100 Andrew,Lab,L07,86,100 Andrew,Quiz,Q01,25,100 Andrew,Quiz,Q02,84,100 Andrew,Quiz,Q03,59,100 Andrew,Quiz,Q04,93,100 Andrew,Quiz,Q05,85,100 Andrew,Quiz,Q06,94,100 Andrew,Quiz,Q07,58,100 Andrew,Final,FINAL,99,100 Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100 Ava,Homework,H02,95,100 Ava,Homework,H03,84,100 Ava,Homework,H04,74,100 Ava,Homework,H05,95,100 Ava,Homework,H06,84,100 Ava,Homework,H07,55,100 Ava,Lab,L01,66,100 Ava,Lab,L02,77,100 Ava,Lab,L03,88,100 Ava,Lab,L04,99,100 Ava,Lab,L05,55,100 Ava,Lab,L06,66,100 Ava,Lab,L07,77,100 Ava,Quiz,Q01,88,100 Ava,Quiz,Q02,99,100 Ava,Quiz,Q03,44,100 Ava,Quiz,Q04,55,100 Ava,Quiz,Q05,66,100 Ava,Quiz,Q06,77,100 Ava,Quiz,Q07,88,100 Ava,Final,FINAL,99,100 Ava,Survey,WS,5,5
Shane,Homework,H01,50,100 Shane,Homework,H02,60,100 Shane,Homework,H03,70,100 Shane,Homework,H04,60,100 Shane,Homework,H05,70,100 Shane,Homework,H06,80,100 Shane,Homework,H07,90,100 Shane,Lab,L01,90,100 Shane,Lab,L02,0,100 Shane,Lab,L03,100,100 Shane,Lab,L04,50,100 Shane,Lab,L05,40,100 Shane,Lab,L06,60,100 Shane,Lab,L07,80,100 Shane,Quiz,Q01,70,100 Shane,Quiz,Q02,90,100 Shane,Quiz,Q03,100,100 Shane,Quiz,Q04,100,100 Shane,Quiz,Q05,80,100 Shane,Quiz,Q06,80,100 Shane,Quiz,Q07,80,100 Shane,Final,FINAL,90,100 Shane,Survey,WS,5,5
awk script:
BEGIN { FS=" *\\, *" }
FNR>1 {
    min[$3] = (!($3 in min) || min[$3] > $4) ? $4 : min[$3]
    max[$3] = (max[$3] > $4) ? max[$3] : $4
    cnt[$3]++
    sum[$3] += $4
}
END {
    print "Name\tLow\tHigh\tAverage"
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
Expected sample output:
Name Low High Average
Q06 77 95 86.80
L05 40 96 46.60
WS 5 5 5
Q07 58 100 78.80
L06 60 99 61
L07 77 86 64.80
When I run the script, I get a "Low" of 0 for all assignments, which is not correct. Where am I going wrong? Please guide.
You can certainly do this with awk, but since you tagged this scripting as well, I'm assuming other tools are an option. For this sort of gathering of statistics on groups present in the data, GNU datamash often reduces the job to a simple one-liner. For example:
$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4 < input.csv) | tr , '\t'
Name Low High Average
FINAL 58 99 85.6
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
H04 46 80 65
H05 54 95 76.6
H06 58 97 80.6
H07 52 90 67
L01 41 91 75
L02 0 100 61.4
L03 88 100 95.8
L04 48 100 79.2
L05 0 96 46.6
L06 0 99 61
L07 0 86 64.8
Q01 25 100 74.8
Q02 84 100 91.6
Q03 33 100 66.8
Q04 55 100 81
Q05 54 99 76.8
Q06 77 95 86.8
Q07 58 100 78.8
WS 5 5 5
This says that for each group with the same value in the 3rd column (-g3, plus -s to sort the input, a requirement of the tool) of simple CSV input (-t,) with a header (--header-in), display the minimum, maximum, and mean of the 4th column. It's all given a new header and piped to tr to turn the commas into tabs.
Your code works as-is with GNU awk. However, running it with the -t option to warn about non-portable constructs gives:
awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'
And running the script with a different implementation of awk (mawk in my case) does give 0's for the Low column. So, some tweaks to the script:
BEGIN { FS="," }
FNR>1 {
    min[$3] = (cnt[$3] == 0 || min[$3] > $4) ? $4 : min[$3]
    max[$3] = (max[$3] > $4) ? max[$3] : $4
    cnt[$3]++
    sum[$3] += $4
}
END {
    print "Name\tLow\tHigh\tAverage"
    PROCINFO["sorted_in"] = "#ind_str_asc" # gawk-ism for pretty output; ignored on other awks
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}
and it works as expected on that other awk too. The changes:
Using a simple comma as the field separator instead of a regex.
Changing the min conditional to set the current value the first time an assignment is seen, by checking whether cnt[$3] is equal to 0 (which it will be the first time, because that value is incremented on a later line), or whether the current min is greater than this value.
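A minimal sketch of that cnt-based guard in isolation, on a made-up one-group sample: the first record initializes min[], and later records only lower it, on any awk implementation.

```shell
# first record (cnt==0) seeds the minimum; 3 then replaces 5; 9 is ignored
printf 'a,5\na,3\na,9\n' |
awk -F, '{ min[$1] = (cnt[$1]==0 || min[$1] > $2) ? $2 : min[$1]; cnt[$1]++ }
         END { print min["a"] }'
# prints: 3
```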
Another similar approach:
$ awk -F, 'NR==1 {print "name","low","high","average"; next}
           {k=$3; sum[k]+=$4; count[k]++}
           !(k in min) {min[k]=max[k]=$4}
           min[k]>$4 {min[k]=$4}
           max[k]<$4 {max[k]=$4}
           END {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file | column -t
name Low high average
Q06 77 95 86.8
L05 0 96 46.6
WS 5 5 5
Q07 58 100 78.8
L06 0 99 61
L07 0 86 64.8
H01 19 90 47.8
H02 47 95 74.6
H03 70 95 82.2
join the contents of files into a new file
I have some text files as shown below. I would like to join the contents of these files into one.
file A
>AXC
145
146
147
>SDF
1
8
67
>FGH
file B
>AXC
>SDF
12
65
>FGH
123
156
190
Desired output (new file):
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190
Your help would be appreciated!
awk '
/^>/ {
    key = $0
    if (!seen[key]++) keys[++numKeys] = key
    next
}
{ vals[key] = vals[key] ORS $0 }
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        print key vals[key]
    }
}
' fileA fileB
>AXC
145
146
147
>SDF
1
8
67
12
65
>FGH
123
156
190
If you really want the leading white space added to the ">SDF" values from fileA, tell us why that's the case for that one but not ">AXC" so we can code an appropriate solution.
A bit shorter than Ed's answer awk '/^>/{a=$0;next}{x[a]=x[a]$0"\n"}END{for(i in x)printf"%s\n%s",i,x[i]}' Blocks will be printed in an unspecified order.
RS=">" separates records on the > character. OFS="\n" puts each number on its own line. a[i]=a[i] $0 appends fields into the array, indexed by the first field. rt=RT adds the > character back to the index (RT is a gawk extension).
$ awk 'BEGIN{ RS=">"; OFS="\n" } {i=rt $1; $1=""; a[i]=a[i] $0; rt=RT; next} END { for (i in a) {print i a[i] }}' d6 d5
>SDF
12
65
1
8
67
>FGH
123
156
190
>AXC
145
146
147
compare a text file with another files
I have a file named file.txt as shown below:
12 2
15 7
134 8
154 12
155 16
167 6
175 45
45 65
812 54
I have another five files named A.txt, B.txt, C.txt, D.txt, E.txt. The contents of these files are shown below.
A.txt
45
134
B.txt
15
812
155
C.txt
12
154
D.txt
175
E.txt
167
I need to check which file contains each value of the first column of file.txt, and print that file's name as a third column. Output:
12 2 C
15 7 B
134 8 A
154 12 C
155 16 B
167 6 E
175 45 D
45 65 A
812 54 B
This should work:
One-liner:
awk 'FILENAME != "file.txt"{ a[$1]=FILENAME; next } $1 in a { $3=a[$1]; sub(/\..*/,"",$3) }1' {A..E}.txt file.txt
Formatted with comments:
awk '
# Check if the filename is not that of the main file
FILENAME != "file.txt" {
    # Create a hash: store column 1 values of the lookup files as keys, with the filename as value
    a[$1] = FILENAME
    # Skip the rest of the actions
    next
}
# Check if the first column of the main file is a key in the hash
$1 in a {
    # If the key exists, assign its value (which is a filename) as column 3 of the main file
    $3 = a[$1]
    # Using the sub function, strip the file-name extension as desired in your output
    sub(/\..*/, "", $3)
# 1 is a non-zero value forcing awk to print. {A..E}.txt is a brace expansion of your files.
}1' {A..E}.txt file.txt
Note: The main file needs to be passed at the end.
Test:
[jaypal:~/Temp] awk 'FILENAME != "file.txt"{ a[$1]=FILENAME; next } $1 in a { $3=a[$1]; sub(/\..*/,"",$3) ; printf "%-5s%-5s%-5s\n",$1,$2,$3}' {A..E}.txt file.txt
12   2    C
15   7    B
134  8    A
154  12   C
155  16   B
167  6    E
175  45   D
45   65   A
812  54   B
#! /usr/bin/awk -f
FILENAME == "file.txt" {
    a[FNR] = $0
    c = FNR
}
FILENAME != "file.txt" {
    split(FILENAME, name, ".")
    k[$1] = name[1]
}
END {
    for (line = 1; line <= c; line++) {
        split(a[line], seg, FS)
        print a[line], k[seg[1]]
    }
}
# $ awk -f script.awk *.txt
This solution does not preserve the order:
join <(sort file.txt) \
     <(awk '
         FNR==1 {filename = substr(FILENAME, 1, length(FILENAME)-4)}
         {print $1, filename}
       ' [ABCDE].txt | sort) | column -t
12   2   C
134  8   A
15   7   B
154  12  C
155  16  B
167  6   E
175  45  D
45   65  A
812  54  B
Count and sum column list
Not 100% sure how to do this. What I have does not add up:
awk -F, '{array[$1]+=$2} END { for (i in array) {print i array[i] }}' gaaa
Here is an example of gaaa:
acic 4
acgic 56
acpdc 183
acic 1677
acpicvp
acsis 23
hidr 4
hidr 1133
aggr 24
Desired result would be:
acic 1681
acgic 56
acpdc 183
acpicvp
acsis 23
hidr 1137
aggr 24
You have set the field separator to a comma, but there is no comma in your data. You want:
$ awk '{array[$1]+=$2}END{for (i in array) print i,array[i]}' gaaa
acsis 23
aggr 24
acgic 56
acpdc 183
hidr 1137
acpicvp 0
acic 1681
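A quick way to see what the stray -F, was doing (one made-up line): with no comma in the input, the whole line is field 1 and $2 is empty, so the sums never accumulate.

```shell
# NF is 1 and $2 is empty when the comma separator never matches
printf 'acic 4\n' | awk -F, '{ print NF, "[" $2 "]" }'
# prints: 1 []
```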
awk + Need to print everything (all rest fields) except $1 and $2
I have the following file and I need to print everything except $1 and $2 with awk.
File:
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
...
The desirable output:
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
...
Well, given your data, cut should be sufficient: cut -d' ' -f3- infile
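For example, on a made-up line shaped like the question's input:

```shell
# -d' ' sets a space delimiter; -f3- selects fields 3 through the end
printf 'INFORMATION DATA 12 33 55\n' | cut -d' ' -f3-
# prints: 12 33 55
```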
Although it adds extra whitespace at the beginning of each line compared to yael's expected output, here is a shorter and simpler awk-based solution than the previously suggested ones: awk '{$1=$2=""; print}' or even: awk '{$1=$2=""}1'
$ cat t
INFORMATION DATA 12 33 55 33 66 43
INFORMATION DATA 45 76 44 66 77 33
INFORMATION DATA 77 83 56 77 88 22
$ awk '{for (i = 3; i <= NF; i++) printf $i " "; print ""}' t
12 33 55 33 66 43
45 76 44 66 77 33
77 83 56 77 88 22
danben's answer leaves whitespace at the end of each resulting line, so the correct way to do it would be: awk '{for (i=3; i<NF; i++) printf $i " "; print $NF}' filename
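A small check of the difference on a made-up line (command substitution keeps a trailing space, so it shows up in the bracketed output):

```shell
line='A B 1 2'
with_trailing=$(printf '%s\n' "$line" | awk '{for (i=3; i<=NF; i++) printf $i " "; print ""}')
without=$(printf '%s\n' "$line" | awk '{for (i=3; i<NF; i++) printf $i " "; print $NF}')
printf '[%s] [%s]\n' "$with_trailing" "$without"
# prints: [1 2 ] [1 2]
```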
If the first two words don't change, probably the simplest thing would be: awk -F 'INFORMATION DATA ' '{print $2}' t
Here's another awk solution, that's more flexible than the cut one and is shorter than the other awk ones. Assuming your separators are single spaces (modify the regex as necessary if they are not): awk --posix '{sub(/([^ ]* ){2}/, ""); print}'
If Perl is an option:
perl -lane 'splice @F,0,2; print join " ",@F' file
These command-line options are used:
-n loop around every line of the input file, do not automatically print it
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode - split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
splice @F,0,2 cleanly removes columns 0 and 1 from the @F array
join " ",@F joins the elements of the @F array, using a space in-between each element
Variation for csv input files:
perl -F, -lane 'splice @F,0,2; print join " ",@F' file
This uses the -F field separator option with a comma.